dnn: accelerate gelu via vectorized erf #25147
Conversation
@fengyuentau, excellent! Is there such a dramatic difference really? BTW, I played a bit with an 'erf' approximation, since GELU can be computed exactly via erf: GELU(x) = 0.5*x*(1 + erf(x/sqrt(2))).
Note that extracting the sign and working on |x| is needed, since the approximation holds for non-negative arguments. Please, try it out.
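For reference, that exact erf-based formulation can be sketched in scalar C++ like this (a minimal illustration, not the PR's code; `gelu_ref` is a hypothetical name):

```cpp
#include <cmath>

// Exact GELU via the error function: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
// std::erf is the C++11 standard-library error function.
static inline float gelu_ref(float x)
{
    const float reciprocal_sqrt2 = 0.70710678f; // 1 / sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * reciprocal_sqrt2));
}
```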
I will test on other platforms as well to see whether they can reach the same improvement. I did not notice [...]
This is another implementation that follows yet another Abramowitz & Stegun approximation and also matches the vectorized version from PyTorch (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_float.h#L187-L218):
I guess it should be slower than the previous version that I suggested, but 1) it matches PyTorch and 2) it is more accurate, especially around 0.
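For context, the Abramowitz & Stegun 7.1.26 formula behind that PyTorch kernel looks roughly like this in scalar form (a sketch with the handbook coefficients, not the vectorized PyTorch code; `erf_approx` is an illustrative name):

```cpp
#include <cmath>

// Abramowitz & Stegun 7.1.26: for x >= 0,
//   erf(x) ~= 1 - (a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5) * exp(-x*x),
//   t = 1 / (1 + p*x), max absolute error ~1.5e-7.
// erf is odd, so the sign is extracted first and restored at the end.
static inline float erf_approx(float x)
{
    const float p  = 0.3275911f;
    const float a1 = 0.254829592f, a2 = -0.284496736f, a3 = 1.421413741f,
                a4 = -1.453152027f, a5 = 1.061405429f;
    float sign = x < 0.f ? -1.f : 1.f;
    x = std::fabs(x);
    float t = 1.f / (1.f + p * x);
    float poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t; // Horner
    return sign * (1.f - poly * std::exp(-x * x));
}
```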
@vpisarev With the accuracy issue resolved (hopefully), the updated performance results are [...]
Note that the method from the wiki does not need `exp`.
@fengyuentau, thank you for the detailed experiments! What about accuracy (absolute and relative)? How far are those implementations from [...]?
All the dnn accuracy tests are now green. I don't know how to measure absolute and relative accuracy, though.
OK, I found that the PyTorch and PaddlePaddle versions are very close to each other. PaddlePaddle is slightly more accurate, but the difference is small. They are both noticeably more accurate than the fastest 'exp-less' formula, especially around 0, but when you compute GELU, i.e. multiply by 'x*0.5' in the end, the drop in accuracy is not as noticeable. May I suggest dropping the PaddlePaddle version to keep the source more compact? The 'wiki' version might be preserved, just in case, but I would use the 'PyTorch' approximation as the default option.
Also, please rewrite the implementation using scalable universal intrinsics. That way we could get even better performance if we move those kernels to separate, dynamically dispatched source files and compile them with AVX2/AVX512/RVV.
Thank you for the test. I think I can compare the accuracy of the implementations next time. Let's keep the Paddle and wiki versions in the git history and keep only the PyTorch one in the code.
I am not sure whether I am doing it the right way by only changing all [...]:
```cpp
v_float32 half = vx_setall_f32(0.5f),
          one = vx_setall_f32(1.0f),
          reciprocal_sqrt2 = vx_setall_f32(M_SQRT1_2);
for (; i <= len - vlanes * 4; i += vlanes * 4) {
```
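A possible body for that loop, sketched with the universal-intrinsics API plus the `v_erf` intrinsic later added in #25872 (shown for two vectors, while the hunk above unrolls by four; `srcptr`/`dstptr` are illustrative names, not necessarily the PR's):

```cpp
// Hypothetical unrolled step: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
v_float32 x0 = vx_load(srcptr + i);
v_float32 x1 = vx_load(srcptr + i + vlanes);
v_store(dstptr + i,          v_mul(v_mul(half, x0), v_add(one, v_erf(v_mul(x0, reciprocal_sqrt2)))));
v_store(dstptr + i + vlanes, v_mul(v_mul(half, x1), v_add(one, v_erf(v_mul(x1, reciprocal_sqrt2)))));
```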
It's quite a heavy function inside. 1) Does it make sense to unroll the loop by vlanes*4? 2) With such aggressive unrolling there is a risk that we will have a very long tail, which will slow down performance significantly; I suggest making the unrolling less aggressive. 3) It's an element-wise operation, and in NCHW the product of H and W is often quite an odd number, far from a power of two. You should process (cn1-cn0)*planeSize as a single 1D array, so that you will have just one tail, not many tails.
I tested with unrolling by 2 but did not observe a significant difference.
> You should process (cn1-cn0)*planeSize as a single 1D array, so that you will have just one tail, not many tails
That is not completely correct. The loops and the parallelism look like the following:
```
for b in batch:
    for c in channel:
        for i in h * w:   # this innermost loop is parallelized across threads
            # ...
```
So for each thread the workload should be b * c * stripeSize (planeSize is the step to the next segment), which is (cn1 - cn0) * len in the terms used in the code. This parallelism is used across all activations in the file.
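A rough sketch of that striping scheme (hypothetical names and stripe size; not the actual layer code):

```cpp
#include <algorithm>
#include <opencv2/core.hpp>

// Each H*W plane is cut into stripes; a thread runs the kernel over its
// stripes of every plane in [cn0, cn1), so there is one short vector tail
// per stripe rather than per row.
static void forAllPlanes(const float* srcBase, float* dstBase,
                         int cn0, int cn1, int planeSize,
                         void (*kernel)(const float*, float*, int))
{
    const int stripeSize = 1024; // tuning choice, illustrative
    const int nstripes = (planeSize + stripeSize - 1) / stripeSize;
    cv::parallel_for_(cv::Range(0, nstripes), [&](const cv::Range& r) {
        for (int s = r.start; s < r.end; ++s) {
            const int start = s * stripeSize;
            const int len = std::min(stripeSize, planeSize - start);
            for (int plane = cn0; plane < cn1; ++plane) {
                const size_t ofs = (size_t)plane * planeSize + start;
                kernel(srcBase + ofs, dstBase + ofs, len); // vector loop + tail
            }
        }
    });
}
```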
Force-pushed from 9c67f3a to 50ccf83.
Discussed and decided to put `v_erf` into core.
Force-pushed from 6cd8ef5 to 25ba16a.
The optimization totally makes sense!
Core i5-2500K (AVX, no AVX2): [results table not captured]
core: add v_erf #25872

This patch adds v_erf, which is needed by #25147.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable. Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
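A minimal usage sketch of the new intrinsic (assuming the universal-intrinsics API from that PR; `erf_block` is a hypothetical wrapper):

```cpp
#include <opencv2/core/hal/intrin.hpp>

// Apply erf element-wise over one register's worth of floats,
// using the v_erf intrinsic introduced in #25872.
static void erf_block(const float* src, float* dst)
{
    cv::v_float32 x = cv::vx_load(src);
    cv::v_store(dst, cv::v_erf(x));
}
```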
@fengyuentau please rebase and switch to the new v_erf. A manual port to 5.x is required too.
Force-pushed from 25ba16a to 7c5df99.
@vpisarev please take a look again.
👍
@fengyuentau, let's delay integration of this PR a bit while @WanliZhong adds a vectorized v_erf(). What do you think?
It has been done in #25872.
cool!
Depends on #25872.
Merge with opencv/opencv_extra#1189.
Part of the acceleration of ViTs inference with dnn.
Perf

Tested with test case Layer_Elementwise.elementwise/0 on the platforms below. Data in milliseconds. [Per-platform result tables not captured.]

- Khadas VIM4 (A311D2 SoC)
- Intel i7-12700K
- Apple M1
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.