Inspired by #24786. This PR keeps the fusion of NaryEltwise and Concat while addressing the data-missing problem by supporting broadcasting when a.rank() != b.rank().
It's not true. There are some minor differences in the results between the CPU and CUDA backends, which is OK I think, but the differences are much bigger for the CUDA_FP16 target. I guess we lose some accuracy in Sigmoid and similar ops. This needs an in-depth investigation.
This is because the failed tests have inputs of shape [1] (1-d Mat). The CUDA backend has asserts checking rank >= 2, so these tests cannot run on the CUDA backend without bypassing those asserts.
They worked previously only because they were not actually testing the CUDA backend: if two inputs have different ranks, the layer falls back to the CPU implementation, so these cases test nothing CUDA-specific. See the fallback below (Line 804-805):
```cpp
auto input_0_shape = inputs[0].dynamicCast<CUDABackendWrapper>()->getShape();
for (int i = 1; i < inputs.size(); i++)
{
    auto input_i_shape = inputs[i].dynamicCast<CUDABackendWrapper>()->getShape();
    if (input_0_shape.size() != input_i_shape.size())
        return Ptr<BackendNode>();
    // check if the shape can be supported by `eltwise_ops.cu`, or return the default BackendNode
    for (int j = 0; j < input_0_shape.size(); j++)
        if (input_0_shape[j] != input_i_shape[j] &&
            input_0_shape[j] != 1 && input_i_shape[j] != 1)
            return Ptr<BackendNode>();
}
```
With that being said, I propose disabling these tests specifically for the CUDA backend. @asmorkalov What do you think? @WanliZhong Please join this discussion as well.
Or we could keep falling back to CPU when the number of dimensions is 1.
That does not work, because the 1-d Mat is actually produced inside the CUDA backend's broadcasting implementation. Let me find another solution.
Resolves #23977
Resolves #24606
Resolves #24635
Resolves #24721
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.