Optimization of DNN using native RISC-V vector intrinsics. #20287
Conversation
The solution looks like a translated version of the AVX code. I propose:
- Do not compute tails with scalar code as is done for Intel and ARM; instead, set the vector length on the last loop iteration with `vl = vsetvl_e32m2(tail_size);`. It makes the code simpler and should be faster.
- There is no alternative to vectorization for now, unlike SSE/AVX/AVX2/AVX512, so I do not think that we need `checkHardwareSupport` for vectorized code.

@vpisarev Please take a look and provide your feedback.
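The vsetvl-based tail handling suggested above can be sketched in plain C. The `setvl` helper and `add_arrays` function below are hypothetical stand-ins for illustration: `setvl` models what the RVV `vsetvl_e32m2` intrinsic returns, and the inner scalar loop models a single vector operation that real RVV code would express with vector loads/stores.

```c
#include <stddef.h>

/* Hypothetical stand-in for vsetvl_e32m2: returns how many elements
 * the hardware will process this iteration (capped at max_lanes). */
static size_t setvl(size_t remaining, size_t max_lanes)
{
    return remaining < max_lanes ? remaining : max_lanes;
}

/* Strip-mined loop: the last iteration simply runs with a shorter vl,
 * so no scalar tail code and no separate mask path are needed. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    const size_t max_lanes = 4;   /* e.g. 128-bit registers, fp32 lanes */
    for (size_t i = 0; i < n; )
    {
        size_t vl = setvl(n - i, max_lanes);  /* vl = vsetvl_e32m2(...) */
        for (size_t k = 0; k < vl; ++k)       /* one "vector" operation */
            dst[i + k] = a[i + k] + b[i + k];
        i += vl;
    }
}
```

With `n = 5`, the loop runs once with `vl = 4` and once with `vl = 1`; the tail never touches scalar fallback code.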
@hanliutong, thanks for the patch. Please try it on the DNN tests under QEMU. Also, I do not see a port of the most important kernel, convolution. It looks like we need to accelerate a bit; so far the proposed patch is a little small for 1+ month of work, and there is a lot of work left to do. As @asmorkalov said, the code looks like an almost direct translation of the SSE2 code. What I'd like to see:
```cpp
bool tail = false;
if (j + FASCONV_BASE_VECSZ > blockSize)
{
    if (j == 0) {
```
Why do you need this branch? In case `blockSize` is too small, the tail path should be taken without any loop iteration.
I would like to explain why there is a branch on `j == 0`.

When `j + 4 > blockSize`, it usually means we have reached a tail; then we use a tail mask and let `j = blockSize - 4` (compute the last 4 elements and store with the mask).
Example:
- Assume `blockSize = 5` and the output is 5×1; ignore the output channel.
- The mask will be `[0, 0, 0, 1]`.
- The first loop iteration computes the first 4 elements of the output matrix, so the output is `[√, √, √, √, TBD]`. Now `j = 4`.
- The second (and last) loop iteration computes the last element with the help of the mask. Actually, we compute the last 4 elements but store only the last one because of the mask. In detail, at the beginning of this iteration `j = 4`, and since `j + 4 > blockSize` we set `j = blockSize - 4`, i.e. `j = 1`; then we compute the last 4 elements and store only the last one.
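The masked-tail scheme can be sketched in plain C. The function name `copy_scaled` and the scale-by-2 kernel are illustrative placeholders (the real code computes a convolution with RVV masked stores); the index backup and mask construction follow the walk-through above.

```c
#define VECSZ 4  /* FASCONV_BASE_VECSZ: 4 fp32 lanes in a 128-bit register */

/* Sketch (plain C, not RVV) of the masked-tail scheme: when fewer than
 * VECSZ elements remain, back j up to blockSize - VECSZ, recompute the
 * last VECSZ outputs, and store only where mask[k] != 0.
 * Note: requires blockSize >= VECSZ, hence the j == 0 branch in the PR. */
void copy_scaled(float *dst, const float *src, int blockSize)
{
    for (int j = 0; j < blockSize; )
    {
        int mask[VECSZ] = {1, 1, 1, 1};
        if (j + VECSZ > blockSize)            /* tail reached             */
        {
            int tail = blockSize - j;         /* valid elements remaining */
            for (int k = 0; k < VECSZ - tail; ++k)
                mask[k] = 0;                  /* e.g. [0,0,0,1] if tail=1 */
            j = blockSize - VECSZ;            /* negative if blockSize<4! */
        }
        for (int k = 0; k < VECSZ; ++k)       /* one "vector" iteration   */
            if (mask[k])
                dst[j + k] = src[j + k] * 2.0f;
        j += VECSZ;
    }
}
```

Tracing `blockSize = 5` reproduces the example: iteration one writes elements 0..3, iteration two backs up to `j = 1` and writes only element 4 through the mask.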
However, there is another situation, when `blockSize` is smaller than 4. In this case we should not use the mask but use `vl`. I will explain with an example.

Example:
- Assume `blockSize = 1` and the output is 1×1, again ignoring the output channel.
- The mask would also be `[0, 0, 0, 1]`, which can NOT be used.
- Instead, we let `vl = 1` and load-compute-store directly with the help of `vl`.
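The `vl` path for small blocks can be sketched the same way. As before, the function name and the scale-by-2 kernel are illustrative placeholders; the point is that setting the vector length to `blockSize` avoids both the mask and the negative `j` that `j = blockSize - 4` would produce.

```c
/* Sketch (plain C, not RVV) of the vl path for blockSize < 4: set the
 * "vector length" to blockSize and do a single load-compute-store of
 * exactly that many elements, as vl = vsetvl_e32m2(blockSize) would. */
void copy_scaled_small(float *dst, const float *src, int blockSize)
{
    int vl = blockSize;            /* models vl = vsetvl_e32m2(blockSize) */
    for (int k = 0; k < vl; ++k)   /* one vector op with vl active lanes  */
        dst[k] = src[k] * 2.0f;
}
```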
Good job! 👍
Optimization of DNN using native RISC-V vector intrinsics.

* Use RVV to optimize fastGEMM (FP32) in DNN.
* Use RVV to optimize fastGEMM1T in DNN.
* Use RVV to optimize fastConv in DNN.
* Use RVV to optimize fastDepthwiseConv in DNN.
* Vectorize tails using vl.
* Use "vl" instead of scalar to handle small block in fastConv.
* Fix memory access out of bound in "fastGEMM1T".
* Remove setvl.
* Remove useless initialization.
* Use loop unrolling to handle tail part instead of switch.
PR for the GSoC'21 project Optimize OpenCV DNN for RISC-V.

This PR adds the functions implemented with RVV intrinsics (v0.10) in layers_common.simd.hpp; it covers the 4 functions listed below.

In this PR, we assume that vlen = 128, which means that if the RVV vector registers are 256 bits or longer, the current implementation will use only a part of them. We will make the implementation adjustable to different vector sizes in other PR(s).
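As a side note on the vlen = 128 assumption: in RVV, the maximum vector length is VLMAX = VLEN/SEW × LMUL, so with element width e32 and LMUL = m2 (as in the `vsetvl_e32m2` calls discussed in this review) a 128-bit machine processes 8 fp32 elements per register group, while a 256-bit machine could process 16. The helper below is a hypothetical illustration of that arithmetic, not part of the PR.

```c
/* VLMAX = VLEN / SEW * LMUL per the RISC-V vector specification.
 * With VLEN = 128 hard-coded (this PR's assumption), e32/m2 yields 8
 * fp32 lanes; wider hardware would leave part of each register unused. */
int max_vl(int vlen_bits, int elem_bits, int lmul)
{
    return vlen_bits / elem_bits * lmul;
}
```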
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.