Use LMUL=2 in the RISC-V Vector (RVV) backend of Universal Intrinsic. #26318

hanliutong · 2024-10-16T13:55:39Z

This patch changes the RVV backend of Universal Intrinsic to use LMUL=2 instead of LMUL=1.

Each Universal Intrinsic type now corresponds to two RVV vector registers, and each intrinsic function operates on two vector registers. Since algorithms written with Universal Intrinsic usually do not use the maximum number of registers, this helps the RVV backend utilize more register resources without modifying the algorithm implementation.
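As a rough illustration of what the register grouping means for lane counts, the sketch below computes how many elements one vector holds for a given VLEN, element width (SEW) and LMUL, and how many register groups are available. The helper names `lanes_per_vector` and `register_groups` are invented for this example and are not OpenCV API; the relation VLMAX = VLEN × LMUL / SEW and the 32 architectural vector registers are taken from the RVV 1.0 specification.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helpers for illustration only; not part of OpenCV.

// Elements held by one vector register group: VLMAX = VLEN * LMUL / SEW.
constexpr std::size_t lanes_per_vector(std::size_t vlen_bits,
                                       std::size_t sew_bits,
                                       std::size_t lmul)
{
    return vlen_bits * lmul / sew_bits;
}

// RVV has 32 architectural vector registers, so grouping them
// LMUL at a time leaves 32 / LMUL usable register groups.
constexpr std::size_t register_groups(std::size_t lmul)
{
    return 32 / lmul;
}

// Example values, assuming VLEN=128 and 32-bit elements:
// 4 lanes per vector at LMUL=1, 8 lanes at LMUL=2.
static_assert(lanes_per_vector(128, 32, 1) == 4, "LMUL=1 lane count");
static_assert(lanes_per_vector(128, 32, 2) == 8, "LMUL=2 lane count");

// Doubling LMUL halves the number of register groups: 32 -> 16.
static_assert(register_groups(1) == 32, "LMUL=1 group count");
static_assert(register_groups(2) == 16, "LMUL=2 group count");
```

The trade-off is visible in the two functions: each vector carries twice as many lanes, while the number of independent vector values that fit in registers drops from 32 to 16.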

The patch generally improves performance.

We compiled OpenCV with Clang-19.1.1 and GCC-14.2.0 and ran it on CanMV-k230 and Banana-Pi F3, which gives four compiler/device scenarios. opencv_perf_core contains 3363 cases, of which:

901 (26.8%) cases improved by more than 5% in all four scenarios; the average speedup of these cases (compared to scalar) increased from 3.35x to 4.35x.
75 (2.2%) cases lost more than 5% in all four scenarios, indicating that they run better with LMUL=1 than with LMUL=2. The affected tests are Mat_Transform, hasNonZero, KMeans, meanStdDev, merge and norm2. Of these, Mat_Transform degrades only in a few cases (8UC3), and the execution time of hasNonZero is so short that the loss can be ignored. For KMeans, meanStdDev, merge and norm2, we should be able to use the HAL to optimize/restore their performance. (In fact, we have already done this for merge in "Add the HAL implementation for the merge function on RISC-V Vector", #26216.)
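The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this description; the function names are made up for the example. It also derives the relative gain of LMUL=2 over LMUL=1 on the improved subset, which follows from the two scalar-relative speedups but is not stated explicitly above.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only.

// Share of test cases as a percentage of the total.
constexpr double share_percent(int cases, int total)
{
    return 100.0 * cases / total;
}

// Ratio between two speedups that are measured against the same baseline.
constexpr double relative_gain(double speedup_before, double speedup_after)
{
    return speedup_after / speedup_before;
}

// 901 / 3363 and 75 / 3363 round to the quoted 26.8% and 2.2%.
constexpr double improved_share = share_percent(901, 3363);
constexpr double regressed_share = share_percent(75, 3363);

// 4.35x / 3.35x: roughly a 1.3x gain of LMUL=2 over LMUL=1 on that subset.
constexpr double lmul2_over_lmul1 = relative_gain(3.35, 4.35);
```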

Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

I agree to contribute to the project under Apache 2 License.
To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
The PR is proposed to the proper branch
There is a reference to the original bug report and related work
There is accuracy test, performance test and test data in opencv_extra repository, if applicable
Patch to opencv_extra has the same branch name.
The feature is well documented and sample code can be built with the project CMake

hanliutong · 2024-10-16T13:59:51Z

This PR is marked as a draft because its implementation needs further discussion.

Possible alternatives & Discussion

After discussion with @mshabunin, we believe it might be better to set LMUL in a more general way. For example, we could switch LMUL between m1/m2/m4 with a compile flag, provided we refactor the current intrinsics to reduce explicit type/suffix usage.
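One possible shape for such a switch is sketched below: a single macro holds the LMUL value and the RVV type name is assembled from it by token pasting. All `SKETCH_*` macros and the function `int32_vector_type_name` are hypothetical and do not exist in OpenCV; the result is stringified so the idea can be checked without an RVV toolchain. This is not what the patch implements.

```cpp
#include <cassert>
#include <string>

// Hypothetical build-time switch; a build could override it with -DSKETCH_RVV_LMUL=4.
#ifndef SKETCH_RVV_LMUL
#define SKETCH_RVV_LMUL 2
#endif

// Two-level macros so SKETCH_RVV_LMUL is expanded before pasting or stringifying.
#define SKETCH_PASTE4_(a, b, c, d) a##b##c##d
#define SKETCH_PASTE4(a, b, c, d) SKETCH_PASTE4_(a, b, c, d)
#define SKETCH_STR_(x) #x
#define SKETCH_STR(x) SKETCH_STR_(x)

// Builds an RVV type name such as vint32m2_t from an element prefix such as vint32.
#define SKETCH_VTYPE(prefix) SKETCH_PASTE4(prefix, m, SKETCH_RVV_LMUL, _t)

// The same name as a string literal, usable on any host compiler.
#define SKETCH_VTYPE_NAME(prefix) SKETCH_STR(SKETCH_VTYPE(prefix))

// Returns the type name selected by the current SKETCH_RVV_LMUL value.
inline std::string int32_vector_type_name()
{
    return SKETCH_VTYPE_NAME(vint32);
}
```

With the default value of 2, `int32_vector_type_name()` yields `vint32m2_t`. Intrinsic function suffixes would need the same treatment, which is the refactoring mentioned above.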

However, this patch only implements LMUL=2, because:

Experiments show that only a few cases perform better with LMUL=1, so a switch does not look necessary, at least not to m1.
In the current RVV backend, some functions (v_lut, v_pack, v_reduce_sum4, v_dotprod_expand) use four times the base LMUL registers (m1->m4; m2->m8). If we chose LMUL=4, we would need to modify the function implementations (not just types/suffixes), since LMUL=16 does not exist.
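The constraint in the second point can be written down as a small check. The sketch below assumes only what is stated above (some functions need four times the base LMUL) plus the RVV 1.0 limit of LMUL=8 as the largest register grouping; the names `kMaxLmul`, `widened_lmul` and `widened_lmul_is_legal` are invented for the example.

```cpp
#include <cassert>

// Largest register grouping defined by RVV 1.0.
constexpr int kMaxLmul = 8;

// Hypothetical helper: grouping needed by functions that use 4x the base LMUL,
// as described for v_lut, v_pack, v_reduce_sum4 and v_dotprod_expand.
constexpr int widened_lmul(int base_lmul)
{
    return base_lmul * 4;
}

// True if the widened grouping still exists in the ISA.
constexpr bool widened_lmul_is_legal(int base_lmul)
{
    return widened_lmul(base_lmul) <= kMaxLmul;
}

static_assert(widened_lmul_is_legal(1), "m1 -> m4 is available");
static_assert(widened_lmul_is_legal(2), "m2 -> m8 is available");
static_assert(!widened_lmul_is_legal(4), "m4 -> m16 does not exist");
```

A base of LMUL=4 fails the check, which is why those functions would need a different implementation rather than a type/suffix change.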

hanliutong · 2024-10-17T09:22:15Z

Update: trying to fix the CI failure by disabling the RVV backend for transform_32f.

PR:4.x / Linux-RISC-V-Clang / BuildAndTest (pull_request) Failing after 74m
[ PASSED ] 12094 tests.
[ FAILED ] 2 tests, listed below:
[ FAILED ] Core_Transform.accuracy
[ FAILED ] Core_TransformLarge.accuracy

mshabunin

Looks good to me. I think we can merge it.

We'll probably need to port this PR to 5.x separately (not in a regular merge), because it would require more changes in the FP16 part on that branch.

asmorkalov · 2024-10-23T05:11:34Z

@mshabunin I'm working on the 4.x->5.x merge right now. I'll merge the PR in a couple of days.

Fix issues in RISC-V Vector (RVV) Universal Intrinsic #27006

This PR aims to make `opencv_test_core` pass on RVV, via the following two parts:

1. Fix a bug in Universal Intrinsic when VLEN >= 512:
   - `max_nlanes` should be multiplied by 2, because we use LMUL=2 in RVV Universal Intrinsic since #26318.
   - Related tests are also expanded to match longer registers.
   - Relax the precision threshold of `v_erf` to make the tests pass.
2. Temporary fix #26936:
   - Disable 3 Universal Intrinsic code blocks on GCC.
   - This is just a temporary fix until we figure out if it's our issue or GCC/something else's.

This patch is tested under the following conditions:

- Compiler: GCC 14.2, Clang 19.1.7
- Device: Muse-Pi (VLEN=256), QEMU (VLEN=512, 1024)

Use LMUL=2 in the RVV backend of Universal Intrinsic.

d205619

asmorkalov requested a review from mshabunin October 16, 2024 15:37

asmorkalov added the optimization and platform: riscv labels Oct 16, 2024

asmorkalov added this to the 4.11.0 milestone Oct 16, 2024

Disable RVV backend for transform_32f.

efe327d

mshabunin approved these changes Oct 22, 2024

hanliutong marked this pull request as ready for review October 22, 2024 14:35

asmorkalov assigned mshabunin Oct 23, 2024

asmorkalov merged commit 35571be into opencv:4.x Oct 24, 2024
29 of 30 checks passed

hanliutong mentioned this pull request Nov 1, 2024

Use LMUL=2 in the RISC-V Vector (RVV) FP16 part. (5.x) #26396

Merged

6 tasks

asmorkalov mentioned this pull request Nov 2, 2024

5.x merge 4.x #26404

Merged

hanliutong mentioned this pull request Mar 3, 2025

Fix issues in RISC-V Vector (RVV) Universal Intrinsic #27006

Merged

6 tasks
