Use LMUL=2 in the RISC-V Vector (RVV) backend of Universal Intrinsic. #26318
This PR is marked as a draft because its implementation needs further discussion.
Possible alternatives & Discussion
After discussion with @mshabunin, we believe that it might be better to set However, this patch only implements
Update: trying to fix the CI failure by disabling the RVV backend for
Looks good to me. I think we can merge it.
We'll probably need to port this PR to 5.x separately (not in a regular merge), because it would require more changes in the FP16 part on that branch.
@mshabunin I'm working on the 4.x->5.x merge right now. I'll merge the PR in a couple of days.
Use LMUL=2 in the RISC-V Vector (RVV) backend of Universal Intrinsic. opencv#26318

The modification of this patch involves the RVV backend of Universal Intrinsic, replacing `LMUL=1` with `LMUL=2`.

Now each Universal Intrinsic type actually corresponds to two RVV vector registers, and each Intrinsic function also operates on two vector registers. Considering that algorithms written using Universal Intrinsic usually do not use the maximum number of registers, this can help the RVV backend utilize more register resources without modifying the algorithm implementation.

This patch is generally beneficial for performance. We compiled OpenCV with `Clang-19.1.1` and `GCC-14.2.0`, and ran it on `CanMV-k230` and `Banana-Pi F3`, which gives four scenarios (combinations of compilers and devices). In `opencv_perf_core`, there are 3363 cases, of which:

- 901 (26.8%) cases achieved more than `5%` performance improvement in all four scenarios, and the average speedup of these test cases (compared to scalar) increased from `3.35x` to `4.35x`.
- 75 (2.2%) cases had more than `5%` performance loss in all four scenarios, indicating that these cases are better with `LMUL=1` instead of `LMUL=2`. This involves `Mat_Transform`, `hasNonZero`, `KMeans`, `meanStdDev`, `merge` and `norm2`. Among them, `Mat_Transform` only has performance degradation in a few cases (`8UC3`), and the actual execution time of `hasNonZero` is so short that it can be ignored. For `KMeans`, `meanStdDev`, `merge` and `norm2`, we should be able to use the HAL to optimize/restore their performance. (In fact, we have already done this for `merge` in opencv#26216.)

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
Fix issues in RISC-V Vector (RVV) Universal Intrinsic #27006

This PR aims to make `opencv_test_core` pass on RVV, via the following two parts:

1. Fix a bug in Universal Intrinsic when VLEN >= 512:
   - `max_nlanes` should be multiplied by 2, because we use LMUL=2 in RVV Universal Intrinsic since #26318.
   - Related tests are also expanded to match longer registers.
   - Relax the precision threshold of `v_erf` to make the tests pass.
2. Temporary fix for #26936:
   - Disable 3 Universal Intrinsic code blocks on GCC.
   - This is just a temporary fix until we figure out if it's our issue or GCC/something else's.

This patch is tested under the following conditions:

- Compiler: GCC 14.2, Clang 19.1.7
- Device: Muse-Pi (VLEN=256), QEMU (VLEN=512, 1024)

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

- [x] I agree to contribute to the project under Apache 2 License.
- [x] To the best of my knowledge, the proposed patch is not based on a code under GPL or another license that is incompatible with OpenCV
- [ ] The PR is proposed to the proper branch
- [ ] There is a reference to the original bug report and related work
- [ ] There is accuracy test, performance test and test data in opencv_extra repository, if applicable
      Patch to opencv_extra has the same branch name.
- [ ] The feature is well documented and sample code can be built with the project CMake
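The `max_nlanes` point in #27006 can be illustrated with a small sketch. This is not OpenCV code: `kMaxVlenBits`, `kLmul` and the three functions are hypothetical names, and the 1024-bit cap is an assumed value chosen only for the example. The idea is that a compile-time lane bound that ignores LMUL can be smaller than the lane count a wide machine produces at run time.

```cpp
#include <cstddef>

// Hypothetical upper bound on VLEN (in bits) that a build chooses to support.
constexpr std::size_t kMaxVlenBits = 1024;
// Register grouping factor used by the backend after the LMUL=2 change.
constexpr std::size_t kLmul = 2;

// Lanes produced at run time on hardware with the given VLEN,
// for elements that are sew_bits wide.
constexpr std::size_t runtime_lanes(std::size_t vlen_bits, std::size_t sew_bits) {
    return vlen_bits * kLmul / sew_bits;
}

// A compile-time bound that ignores LMUL (assumed shape of the old bound).
constexpr std::size_t max_lanes_without_lmul(std::size_t sew_bits) {
    return kMaxVlenBits / sew_bits;
}

// A compile-time bound that accounts for LMUL, i.e. multiplied by 2 here.
constexpr std::size_t max_lanes_with_lmul(std::size_t sew_bits) {
    return kMaxVlenBits * kLmul / sew_bits;
}
```

With the assumed 1024-bit cap and 32-bit lanes, `runtime_lanes(1024, 32)` is 64, which exceeds `max_lanes_without_lmul(32)` (32) but fits `max_lanes_with_lmul(32)` (64). A buffer sized by the smaller bound would be too short for a full vector.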
The modification of this patch involves the RVV backend of Universal Intrinsic, replacing `LMUL=1` with `LMUL=2`.

Now each Universal Intrinsic type actually corresponds to two RVV vector registers, and each Intrinsic function also operates two vector registers. Considering that algorithms written using Universal Intrinsic usually do not use the maximum number of registers, this can help the RVV backend utilize more register resources without modifying the algorithm implementation
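To make the register arithmetic concrete, here is a minimal sketch. It is not taken from the patch; `lanes_per_group` and `register_groups` are hypothetical helper names. It only relies on two facts from the RVV specification: a register group at a given LMUL spans LMUL registers of VLEN bits each, and the register file has 32 vector registers.

```cpp
#include <cstddef>

// Number of elements held by one register group:
// VLEN bits per register, times LMUL registers per group,
// divided by the element width (SEW) in bits.
constexpr std::size_t lanes_per_group(std::size_t vlen_bits,
                                      std::size_t lmul,
                                      std::size_t sew_bits) {
    return vlen_bits * lmul / sew_bits;
}

// RVV has 32 vector registers, so grouping them by LMUL
// leaves 32 / LMUL independently addressable groups.
constexpr std::size_t register_groups(std::size_t lmul) {
    return 32 / lmul;
}
```

For example, with VLEN=256 and 32-bit elements, `lanes_per_group(256, 1, 32)` is 8 and `lanes_per_group(256, 2, 32)` is 16, while `register_groups` drops from 32 to 16. Each Universal Intrinsic value becomes twice as wide, at the cost of having half as many values live in registers at once.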
This patch is generally beneficial in performance. We compiled OpenCV with `Clang-19.1.1` and `GCC-14.2.0`, ran it on `CanMV-k230` and `Banana-Pi F3`. Then we have four scenarios on combinations of compilers and devices. In `opencv_perf_core`, there are 3363 cases, of which:

- 901 (26.8%) cases achieved more than `5%` performance improvement in all four scenarios, and the average speedup of these test cases (compared to scalar) increased from `3.35x` to `4.35x`
- 75 (2.2%) cases had more than `5%` performance loss in all four scenarios, indicating that these cases are better with `LMUL=1` instead of `LMUL=2`. This involves `Mat_Transform`, `hasNonZero`, `KMeans`, `meanStdDev`, `merge` and `norm2`. Among them, `Mat_Transform` only has performance degradation in a few cases (`8UC3`), and the actual execution time of `hasNonZero` is so short that it can be ignored. For `KMeans`, `meanStdDev`, `merge` and `norm2`, we should be able to use the HAL to optimize/restore their performance. (In fact, we have already done this for `merge` in #26216, "Add the HAL implementation for the merge function on RISC-V Vector.")

### Pull Request Readiness Checklist

See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request

Patch to opencv_extra has the same branch name.
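The "in all four scenarios" rule used in the performance summary can be sketched as follows. This is an illustration only: the `ScenarioTime` record, the `Verdict` enum and `classify` are hypothetical, and using the ratio `lmul1_ms / lmul2_ms` as the measure of change is an assumption, since the PR text does not say how the 5% figure was computed. The sketch needs C++14 or later.

```cpp
#include <cstddef>

// Hypothetical record: run time of one perf case in one
// compiler/device scenario, measured with both backends.
struct ScenarioTime {
    double lmul1_ms;  // time with the LMUL=1 backend
    double lmul2_ms;  // time with the LMUL=2 backend
};

enum class Verdict { Improved, Regressed, Mixed };

// A case counts as Improved (or Regressed) only if every one of the
// four scenarios is faster (or slower) by more than the threshold.
// Assumed metric: ratio = lmul1_ms / lmul2_ms, so ratio > 1 means LMUL=2 is faster.
constexpr Verdict classify(const ScenarioTime (&runs)[4], double threshold = 0.05) {
    bool all_faster = true;
    bool all_slower = true;
    for (std::size_t i = 0; i < 4; ++i) {
        const double ratio = runs[i].lmul1_ms / runs[i].lmul2_ms;
        all_faster = all_faster && (ratio > 1.0 + threshold);
        all_slower = all_slower && (ratio < 1.0 - threshold);
    }
    return all_faster ? Verdict::Improved
         : all_slower ? Verdict::Regressed
                      : Verdict::Mixed;
}
```

Under this rule a case that speeds up on three scenarios but stays flat on the fourth is `Mixed`, which is why the improved and regressed groups together cover only part of the 3363 cases.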