dnn: add layer normalization for vision transformers #23047
Conversation
@rogday Could you review this pull request if possible?
Thank you for the contribution! LGTM 👍
use CV_Assert & CV_CheckType in place of CV_Assert_N; use forward_fallback for OCL_FP16
Left some optimization and minor comments.
```cpp
std::vector<Mat> inputs, outputs;
inputs_arr.getMatVector(inputs);
outputs_arr.getMatVector(outputs);
const int nstripes = getNumThreads();
```
> `const int nstripes = getNumThreads();`
This doesn't look like a reliable design.
This scheme assumes that all threads have the same speed and are never interrupted.
That is not true for OSes with preemptive scheduling (all widely used OSes for the last 25+ years): some cores may be handling background tasks or interrupts.
It is also not true at all for CPUs with a big+little design (modern ARM, and Intel CPUs with P+E cores).
`nstripes` should be based on the subtask's "grain size" (subtask time >>> scheduling overhead) instead of the number of available threads.
Some information is available here: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html
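A minimal sketch of what grain-size-based splitting could look like with `cv::parallel_for_` (the `elemsPerStripe` value and the `NormalizeRows` body are hypothetical, for illustration only):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/utility.hpp>

// Hypothetical loop body standing in for the layer's Invoker.
class NormalizeRows : public cv::ParallelLoopBody
{
public:
    void operator()(const cv::Range& range) const CV_OVERRIDE
    {
        for (int i = range.start; i < range.end; i++)
        {
            // ... normalize row i ...
        }
    }
};

void run(int totalRows, int rowLength)
{
    // Derive nstripes from the amount of work, not from getNumThreads():
    // target stripes of roughly 1024 elements (an assumed grain size), so
    // subtask time dominates scheduling overhead and faster cores simply
    // pick up more stripes than slower ones.
    const double elemsPerStripe = 1024.0;
    double nstripes = (double)totalRows * rowLength / elemsPerStripe;
    cv::parallel_for_(cv::Range(0, totalRows), NormalizeRows(), nstripes);
}
```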
> It is also not true at all for CPUs with a big+little design (modern ARM, and Intel CPUs with P+E cores).
And this is why I happened to fix the OpenMP issue on macOS. I was benchmarking vision transformers on ONNX Runtime, MNN and OpenCV DNN the other day, and found that both ONNX Runtime and MNN run with 4 threads on my Apple M1 by default, but OpenCV DNN uses all 8 threads instead. Setting numThreads is not available with GCD, and when I tried OpenMP things went wrong with build issues.
I think we should somehow improve the multi-threading functionality of OpenCV so that it detects big+little designs and returns the numThreads of the big cores if possible. `cv::Range` also needs to support a step in order to work with a "grain size"...
> I think we should somehow improve the multi-threading functionality of OpenCV so that it detects big+little designs and returns the numThreads of the big cores if possible
That doesn't really look like an improvement.
E.g., in the case of ARM-based phones we already have configurations with 2 big + 6 little cores.
Again, we should not assume or rely on threads having equal performance, or on subtasks having equal complexity.
I am not sure what exactly you are asking for here. Basically, every other layer's `**Invoker` uses the same strategy. If we want to take care of big+little core CPUs, that would be another pull request, I think.
use const ref / ref in parameters of invoker::run; extract inner const if from nested loop; use size_t in place of ull
Thank you for the update!
use pointer parameter with null check; move normSize division & mean_square division outside of loop; use std::max to ensure positive value before std::sqrt
Refactored references and parallel_for usage.
```cpp
@@ -2394,6 +2394,36 @@ TEST_P(Test_ONNX_layers, Tile)
    testONNXModels("tile", pb);
}

TEST_P(Test_ONNX_layers, LayerNorm)
{
    testONNXModels("test_layer_normalization_2d_axis0", pb, 0, 0, false, true, 3);
```
There are many error messages during the model import:

```
[ RUN ] Test_ONNX_layers.LayerNorm/0, where GetParam() = OCV/CPU
[ INFO:0@132.156] global onnx_importer.cpp:831 populateNet DNN/ONNX: loading ONNX v8 model produced by 'backend-test'. Number of nodes = 1, initializers = 0, inputs = 3, outputs = 3
[ INFO:0@132.156] global onnx_importer.cpp:725 parseOperatorSet DNN/ONNX: ONNX opset version = 17
[ INFO:0@132.156] global onnx_importer.cpp:997 handleNode DNN/ONNX: processing node with 3 inputs and 3 outputs: [LayerNormalization]:(onnx_node_output_0!Y) from domain='ai.onnx'
[ERROR:0@132.156] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'Mean'. Does model imported properly?
[ERROR:0@132.156] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'InvStdDev'. Does model imported properly?
```

We should not have them.
The reason why they exist:
- These ONNX models are taken from the ONNX conformance tests. I think it is better not to modify them.
- I took a look at the opencv-onnx functionality and did not find a way to remove them completely in our ONNX importer:

```cpp
// Remove additional outputs (Mean, InvStdDev)
if (node_proto.output_size() > 1)
{
    auto outputName = node_proto.output(0);
    opencv_onnx::NodeProto node_proto_ = node_proto;
    node_proto_.clear_output();
    node_proto_.add_output(outputName);
    addLayer(layerParams, node_proto_);
}
```

@rogday Do you happen to know how to remove optional (node & graph) outputs in the ONNX importer?
I removed the optional outputs from those ONNX models in the end. It turned out that modifying the outputs of an ONNX graph proto is not straightforward.
```cpp
CV_CheckTypeEQ(src.type(), dst.type(), "");
CV_Assert(scale.isContinuous());

CV_CheckGE(epsilon, 0.0f, "");
```
Added this extra check
```cpp
double nstripes = ((size_t)p.total * p.normSize) * (1 / 1024.0);
parallel_for_(Range(0, p.total), p, nstripes);
```
Thanks for the change! I learned a lot!
So if I understand correctly, you make the grain size small enough that both big and small cores can be used, and the big cores naturally take more jobs. What about the "magic number" 1024? Was it taken from the "bathtub curve" in this link?
The parallel_for() strategy should rely on the subtask size and the scheduling overhead. 1024 is an empiric number here which specifies the size of a subtask.
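As a worked example, assuming the 1x50x768 benchmark shape from the PR description and normalization over the last axis, so `p.total = 50` rows and `p.normSize = 768` elements per row: `nstripes = 50 * 768 / 1024 ≈ 37.5`. The work is offered to the pool as roughly 37 stripes of about 1024 elements each, so a fast core can grab many stripes while a slow core grabs only a few, without assuming equal thread speeds.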
Okay, thanks again, I benefited a lot from this. Another question: why multiply by `p.normSize`? Is that another empirical choice? I tried without it and it is about twice as slow.
By the way, I found that all the other `**Invoker`s use the same strategy, assuming all threads have the same performance. I think they need to be upgraded as well.
@alalek what is the status of this pull request? Could we make a step forward?
dnn: add layer normalization for vision transformers

* add layer norm onnx parser, impl and tests
* add onnx graph simplifier for layer norm expanded
* handle the case when constants are of type Initializer
* add test case for layer norm expanded with initializers
* use CV_Assert & CV_CheckType in place of CV_Assert_N; use forward_fallback for OCL_FP16
* use const ref / ref in parameters of invoker::run; extract inner const if from nested loop; use size_t in place of ull
* template hasBias
* remove trailing whitespace
* use pointer parameter with null check; move normSize division & mean_square division outside of loop; use std::max to ensure positive value before std::sqrt
* refactor implementation, optimize parallel_for
* disable layer norm expanded
* remove the removal of layer norm optional outputs
Merge with opencv/opencv_extra#1032
Benchmark: tested with size 1x50x768 on Apple M1.
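For reference, the layer added by this PR implements ONNX `LayerNormalization`, which over the normalized axes computes

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

where $\gamma$ is the scale input and $\beta$ is the optional bias (hence the `hasBias` template mentioned in the commit list above).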
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.