Optimization of DNN using native RISC-V vector intrinsics. #20287
Conversation
The solution looks like a translated version of the AVX code. I propose:
- Do not compute tails with scalar code as is done for Intel and ARM; instead, set the vector length on the last loop iteration with `vl = vsetvl_e32m2(tail_size);`. It makes the code simpler and should be faster.
- There is no alternative to vectorization for now, unlike SSE/AVX/AVX2/AVX512, so I do not think that we need `checkHardwareSupport` for vectorized code.

@vpisarev Please take a look and provide your feedback.
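The vsetvl-based tail handling suggested above can be sketched in plain C. The `setvl` helper and `add_arrays` function below are hypothetical stand-ins for illustration: `setvl` models what the RVV `vsetvl_e32m2` intrinsic returns, and the inner scalar loop models a single vector operation that real RVV code would express with vector loads/stores.

```c
#include <stddef.h>

/* Hypothetical stand-in for vsetvl_e32m2: returns how many elements
 * the hardware will process this iteration (capped at max_lanes). */
static size_t setvl(size_t remaining, size_t max_lanes)
{
    return remaining < max_lanes ? remaining : max_lanes;
}

/* Strip-mined loop: the last iteration simply runs with a shorter vl,
 * so no scalar tail code and no separate mask path are needed. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    const size_t max_lanes = 4;   /* e.g. 128-bit registers, fp32 lanes */
    for (size_t i = 0; i < n; )
    {
        size_t vl = setvl(n - i, max_lanes);  /* vl = vsetvl_e32m2(...) */
        for (size_t k = 0; k < vl; ++k)       /* one "vector" operation */
            dst[i + k] = a[i + k] + b[i + k];
        i += vl;
    }
}
```

With `n = 5`, the loop runs once with `vl = 4` and once with `vl = 1`; the tail never touches scalar fallback code.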
@hanliutong, thanks for the patch. Please try it on the DNN tests under QEMU. Also, I do not see a port of the most important kernel, convolution. It looks like we need to accelerate a bit; so far the proposed patch is a little small for 1+ month of work, and there is a lot of work left to do. As @asmorkalov said, the code looks like an almost direct translation of the SSE2 code. What I'd like to see:
```cpp
bool tail = false;
if (j + FASCONV_BASE_VECSZ > blockSize)
{
    if (j == 0) {
```
Why do you need this branch? In case `blockSize` is too small, the tail path should be taken without any loop iteration.
I would like to explain why there is a branch on `j == 0`.

When `j + 4 > blockSize`, it usually means we have reached a tail; then we use a tail mask and let `j = blockSize - 4` (compute the last 4 elements and store with the mask).
Example:
- Assume `blockSize = 5` and the output is 5×1; ignore the output channel.
- The mask will be `[0, 0, 0, 1]`.
- The first loop iteration computes the first 4 elements of the output matrix, so the output is `[√, √, √, √, TBD]`. Now `j = 4`.
- The second (and last) loop iteration computes the last element with the help of the mask. Actually, we compute the last 4 elements but store only the last one because of the mask. In detail, at the beginning of this iteration `j = 4`, and since `j + 4 > blockSize` we set `j = blockSize - 4`, i.e. `j = 1`; then we compute the last 4 elements and store only the last one.
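The masked-tail scheme can be sketched in plain C. The function name `copy_scaled` and the scale-by-2 kernel are illustrative placeholders (the real code computes a convolution with RVV masked stores); the index backup and mask construction follow the walk-through above.

```c
#define VECSZ 4  /* FASCONV_BASE_VECSZ: 4 fp32 lanes in a 128-bit register */

/* Sketch (plain C, not RVV) of the masked-tail scheme: when fewer than
 * VECSZ elements remain, back j up to blockSize - VECSZ, recompute the
 * last VECSZ outputs, and store only where mask[k] != 0.
 * Note: requires blockSize >= VECSZ, hence the j == 0 branch in the PR. */
void copy_scaled(float *dst, const float *src, int blockSize)
{
    for (int j = 0; j < blockSize; )
    {
        int mask[VECSZ] = {1, 1, 1, 1};
        if (j + VECSZ > blockSize)            /* tail reached             */
        {
            int tail = blockSize - j;         /* valid elements remaining */
            for (int k = 0; k < VECSZ - tail; ++k)
                mask[k] = 0;                  /* e.g. [0,0,0,1] if tail=1 */
            j = blockSize - VECSZ;            /* negative if blockSize<4! */
        }
        for (int k = 0; k < VECSZ; ++k)       /* one "vector" iteration   */
            if (mask[k])
                dst[j + k] = src[j + k] * 2.0f;
        j += VECSZ;
    }
}
```

Tracing `blockSize = 5` reproduces the example: iteration one writes elements 0..3, iteration two backs up to `j = 1` and writes only element 4 through the mask.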
However, there is another situation, when `blockSize` is smaller than 4. In this case we should not use the mask but use `vl`. I will explain with an example.

Example:
- Assume `blockSize = 1` and the output is 1×1, again ignoring the output channel.
- The mask would also be `[0, 0, 0, 1]`, which can NOT be used.
- Instead, we let `vl = 1` and load-compute-store directly with the help of `vl`.
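The `vl` path for small blocks can be sketched the same way. As before, the function name and the scale-by-2 kernel are illustrative placeholders; the point is that setting the vector length to `blockSize` avoids both the mask and the negative `j` that `j = blockSize - 4` would produce.

```c
/* Sketch (plain C, not RVV) of the vl path for blockSize < 4: set the
 * "vector length" to blockSize and do a single load-compute-store of
 * exactly that many elements, as vl = vsetvl_e32m2(blockSize) would. */
void copy_scaled_small(float *dst, const float *src, int blockSize)
{
    int vl = blockSize;            /* models vl = vsetvl_e32m2(blockSize) */
    for (int k = 0; k < vl; ++k)   /* one vector op with vl active lanes  */
        dst[k] = src[k] * 2.0f;
}
```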
Good job! 👍
Optimization of DNN using native RISC-V vector intrinsics.

* Use RVV to optimize fastGEMM (FP32) in DNN.
* Use RVV to optimize fastGEMM1T in DNN.
* Use RVV to optimize fastConv in DNN.
* Use RVV to optimize fastDepthwiseConv in DNN.
* Vectorize tails using vl.
* Use "vl" instead of scalar to handle small block in fastConv.
* Fix memory access out of bound in "fastGEMM1T".
* Remove setvl.
* Remove useless initialization.
* Use loop unrolling to handle tail part instead of switch.
PR for the GSoC'21 project Optimize OpenCV DNN for RISC-V.

This PR adds the functions implemented with RVV intrinsics (v0.10) in layers_common.simd.hpp; it covers the 4 functions listed below.

In this PR, we assume that vlen = 128, which means that if the RVV vector registers are 256 bits or longer, the current implementation will use only a part of them. We will make the implementation adjustable to different vector sizes in other PR(s).
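As a side note on the vlen = 128 assumption: in RVV, the maximum vector length is VLMAX = VLEN/SEW × LMUL, so with element width e32 and LMUL = m2 (as in the `vsetvl_e32m2` calls discussed in this review) a 128-bit machine processes 8 fp32 elements per register group, while a 256-bit machine could process 16. The helper below is a hypothetical illustration of that arithmetic, not part of the PR.

```c
/* VLMAX = VLEN / SEW * LMUL per the RISC-V vector specification.
 * With VLEN = 128 hard-coded (this PR's assumption), e32/m2 yields 8
 * fp32 lanes; wider hardware would leave part of each register unused. */
int max_vl(int vlen_bits, int elem_bits, int lmul)
{
    return vlen_bits / elem_bits * lmul;
}
```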
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.