dnn: accelerate gelu via vectorized erf #25147
Conversation
@fengyuentau, excellent! Is there such a dramatic difference really? BTW, I played a bit with an 'erf' approximation, since GELU can be computed exactly via erf: GELU(x) = 0.5*x*(1 + erf(x/sqrt(2))).
Note that extracting the sign and working on |x| is needed, since the approximation holds for non-negative arguments. Please, try it out.
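For reference, that exact erf-based formulation can be sketched in scalar C++ like this (a minimal illustration, not the PR's code; `gelu_ref` is a hypothetical name):

```cpp
#include <cmath>

// Exact GELU via the error function: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
// std::erf is the C++11 standard-library error function.
static inline float gelu_ref(float x)
{
    const float reciprocal_sqrt2 = 0.70710678f; // 1 / sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * reciprocal_sqrt2));
}
```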
I will test on other platforms as well to see whether they can reach the same improvement. I did not notice [...]
This is another implementation that follows yet another Abramowitz & Stegun approximation and also matches the vectorized version from PyTorch (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_float.h#L187-L218):
I guess it should be slower than the previous version that I suggested, but 1) it matches PyTorch and 2) it is more accurate, especially around 0.
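For context, the Abramowitz & Stegun 7.1.26 formula behind that PyTorch kernel looks roughly like this in scalar form (a sketch with the handbook coefficients, not the vectorized PyTorch code; `erf_approx` is an illustrative name):

```cpp
#include <cmath>

// Abramowitz & Stegun 7.1.26: for x >= 0,
//   erf(x) ~= 1 - (a1*t + a2*t^2 + a3*t^3 + a4*t^4 + a5*t^5) * exp(-x*x),
//   t = 1 / (1 + p*x), max absolute error ~1.5e-7.
// erf is odd, so the sign is extracted first and restored at the end.
static inline float erf_approx(float x)
{
    const float p  = 0.3275911f;
    const float a1 = 0.254829592f, a2 = -0.284496736f, a3 = 1.421413741f,
                a4 = -1.453152027f, a5 = 1.061405429f;
    float sign = x < 0.f ? -1.f : 1.f;
    x = std::fabs(x);
    float t = 1.f / (1.f + p * x);
    float poly = ((((a5 * t + a4) * t + a3) * t + a2) * t + a1) * t; // Horner
    return sign * (1.f - poly * std::exp(-x * x));
}
```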
@vpisarev With the accuracy issue resolved (hopefully), the updated performance results are [...]
Note that the method from the wiki does not need `exp`.
@fengyuentau, thank you for the detailed experiments! What about accuracy (absolute and relative)? How far are those implementations from [...]?
All the dnn accuracy tests are now green. I don't know how to measure absolute and relative accuracy, though.
OK, I found that the PyTorch and PaddlePaddle versions are very close to each other. PaddlePaddle is slightly more accurate, but the difference is small. They are both noticeably more accurate than the fastest 'exp-less' formula, especially around 0, but when you compute GELU, i.e. multiply by 'x*0.5' in the end, the drop in accuracy is not as noticeable. May I suggest dropping the PaddlePaddle version to keep the source more compact? The 'wiki' version might be preserved, just in case, but I would use the 'PyTorch' approximation as the default option.
Also, please rewrite the implementation using scalable universal intrinsics. That way we could get even better performance if we move those kernels to separate, dynamically dispatched source files and compile them with AVX2/AVX512/RVV.
Thank you for the test. I think I can compare the accuracy of the implementations next time. Let's keep the Paddle and wiki versions in the git history and keep only the PyTorch one in the code.
I am not sure whether I am doing it the right way by only changing all [...]:
```cpp
v_float32 half = vx_setall_f32(0.5f),
          one = vx_setall_f32(1.0f),
          reciprocal_sqrt2 = vx_setall_f32(M_SQRT1_2);
for (; i <= len - vlanes * 4; i += vlanes * 4) {
```
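A possible body for that loop, sketched with the universal-intrinsics API plus the `v_erf` intrinsic later added in #25872 (shown for two vectors, while the hunk above unrolls by four; `srcptr`/`dstptr` are illustrative names, not necessarily the PR's):

```cpp
// Hypothetical unrolled step: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2))).
v_float32 x0 = vx_load(srcptr + i);
v_float32 x1 = vx_load(srcptr + i + vlanes);
v_store(dstptr + i,          v_mul(v_mul(half, x0), v_add(one, v_erf(v_mul(x0, reciprocal_sqrt2)))));
v_store(dstptr + i + vlanes, v_mul(v_mul(half, x1), v_add(one, v_erf(v_mul(x1, reciprocal_sqrt2)))));
```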
It's quite a heavy function inside. 1) Does it make sense to unroll the loop by vlanes*4? 2) With such aggressive unrolling there is a risk that we will have a very long tail, which will slow down performance significantly; I suggest making the unrolling less aggressive. 3) It's an element-wise operation, and in NCHW the product of H and W is often quite an odd number, far from a power of two. You should process (cn1-cn0)*planeSize as a single 1D array, so that you will have just one tail, not many tails.
I tested with unrolling by 2 but did not observe a significant difference.
> You should process (cn1-cn0)*planeSize as a single 1D array, so that you will have just one tail, not many tails
That is not completely correct. The loops and the parallelism look like the following:
```
for b in batch:
    for c in channel:
        for i in h * w:   # this innermost loop is parallelized across threads
            # ...
```
So for each thread the workload should be b * c * stripeSize (planeSize is the step to the next segment), which is (cn1 - cn0) * len in the terms used in the code. This parallelism is used across all activations in the file.
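A rough sketch of that striping scheme (hypothetical names and stripe size; not the actual layer code):

```cpp
#include <algorithm>
#include <opencv2/core.hpp>

// Each H*W plane is cut into stripes; a thread runs the kernel over its
// stripes of every plane in [cn0, cn1), so there is one short vector tail
// per stripe rather than per row.
static void forAllPlanes(const float* srcBase, float* dstBase,
                         int cn0, int cn1, int planeSize,
                         void (*kernel)(const float*, float*, int))
{
    const int stripeSize = 1024; // tuning choice, illustrative
    const int nstripes = (planeSize + stripeSize - 1) / stripeSize;
    cv::parallel_for_(cv::Range(0, nstripes), [&](const cv::Range& r) {
        for (int s = r.start; s < r.end; ++s) {
            const int start = s * stripeSize;
            const int len = std::min(stripeSize, planeSize - start);
            for (int plane = cn0; plane < cn1; ++plane) {
                const size_t ofs = (size_t)plane * planeSize + start;
                kernel(srcBase + ofs, dstBase + ofs, len); // vector loop + tail
            }
        }
    });
}
```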
Force-pushed from 9c67f3a to 50ccf83.
Discussed and decided to put `v_erf` into core.
Force-pushed from 6cd8ef5 to 25ba16a.
The optimization totally makes sense!
Core i5-2500K (AVX, no AVX2): [results table not captured]
core: add v_erf #25872

This patch adds v_erf, which is needed by #25147.

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [x] The PR is proposed to the proper branch
- [x] There is a reference to the original bug report and related work
- [x] There is accuracy test, performance test and test data in opencv_extra repository, if applicable. Patch to opencv_extra has the same branch name.
- [x] The feature is well documented and sample code can be built with the project CMake
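A minimal usage sketch of the new intrinsic (assuming the universal-intrinsics API from that PR; `erf_block` is a hypothetical wrapper):

```cpp
#include <opencv2/core/hal/intrin.hpp>

// Apply erf element-wise over one register's worth of floats,
// using the v_erf intrinsic introduced in #25872.
static void erf_block(const float* src, float* dst)
{
    cv::v_float32 x = cv::vx_load(src);
    cv::v_store(dst, cv::v_erf(x));
}
```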
@fengyuentau please rebase and switch to the new v_erf. A manual port to 5.x is required too.
Force-pushed from 25ba16a to 7c5df99.
@vpisarev please take a look again.
👍
@fengyuentau, let's delay integration of this PR a bit while @WanliZhong adds a vectorized v_erf(). What do you think?
It has been done in #25872.
cool!
Depends on #25872.
Merge with opencv/opencv_extra#1189.
Part of the acceleration of ViTs inference with dnn.
Perf

Tested with test case Layer_Elementwise.elementwise/0 on the platforms below. Data in milliseconds. [Per-platform result tables not captured.]

- Khadas VIM4 (A311D2 SoC)
- Intel i7-12700K
- Apple M1
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.