dnn: add layer normalization for vision transformers #23047
Conversation
@rogday Could you review this pull request if possible?
Thank you for the contribution! LGTM 👍
use CV_Assert & CV_CheckType in place of CV_Assert_N; use forward_fallback for OCL_FP16
Left some optimization and minor comments.
```cpp
std::vector<Mat> inputs, outputs;
inputs_arr.getMatVector(inputs);
outputs_arr.getMatVector(outputs);
const int nstripes = getNumThreads();
```
> `const int nstripes = getNumThreads();`
This doesn't look like a reliable design.
This scheme assumes that all threads have the same speed and are never interrupted.
That is not true for OSes with preemptive scheduling (all widely used OSes for the last 25+ years): some cores may be handling background tasks or interrupts.
It is also not true at all for CPUs with a big+little design (modern ARM, and Intel CPUs with P+E cores).
`nstripes` should be based on the subtask's "grain size" (subtask time >>> scheduling overhead) instead of the number of available threads.
Some information is available here: https://oneapi-src.github.io/oneTBB/main/tbb_userguide/Controlling_Chunking_os.html
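A minimal sketch of what grain-size-based splitting could look like with `cv::parallel_for_` (the `elemsPerStripe` value and the `NormalizeRows` body are hypothetical, for illustration only):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/core/utility.hpp>

// Hypothetical loop body standing in for the layer's Invoker.
class NormalizeRows : public cv::ParallelLoopBody
{
public:
    void operator()(const cv::Range& range) const CV_OVERRIDE
    {
        for (int i = range.start; i < range.end; i++)
        {
            // ... normalize row i ...
        }
    }
};

void run(int totalRows, int rowLength)
{
    // Derive nstripes from the amount of work, not from getNumThreads():
    // target stripes of roughly 1024 elements (an assumed grain size), so
    // subtask time dominates scheduling overhead and faster cores simply
    // pick up more stripes than slower ones.
    const double elemsPerStripe = 1024.0;
    double nstripes = (double)totalRows * rowLength / elemsPerStripe;
    cv::parallel_for_(cv::Range(0, totalRows), NormalizeRows(), nstripes);
}
```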
> It is also not true at all for CPUs with a big+little design (modern ARM, and Intel CPUs with P+E cores).
And this is why I happened to fix the OpenMP issue on macOS. I was benchmarking vision transformers on ONNX Runtime, MNN and OpenCV DNN the other day, and found that both ONNX Runtime and MNN run with 4 threads on my Apple M1 by default, but OpenCV DNN uses all 8 threads instead. Setting numThreads is not available with GCD, and when I tried OpenMP things went wrong with build issues.
I think we should somehow improve the multi-threading functionality of OpenCV so that it detects big+little designs and returns the numThreads of the big cores if possible. `cv::Range` also needs to support a step in order to work with a "grain size"...
> I think we should somehow improve the multi-threading functionality of OpenCV so that it detects big+little designs and returns the numThreads of the big cores if possible
That doesn't really look like an improvement.
E.g., in the case of ARM-based phones we already have configurations with 2 big + 6 little cores.
Again, we should not assume or rely on threads having equal performance, or on subtasks having equal complexity.
I am not sure what exactly you are asking for here. Basically, every other layer's `**Invoker` uses the same strategy. If we want to take care of big+little core CPUs, that would be another pull request, I think.
use const ref / ref in parameters of invoker::run; extract inner const if from nested loop; use size_t in place of ull
Thank you for the update!
use pointer parameter with null check; move normSize division & mean_square division outside of loop; use std::max to ensure positive value before std::sqrt
Refactored references and parallel_for usage.
```cpp
@@ -2394,6 +2394,36 @@ TEST_P(Test_ONNX_layers, Tile)
    testONNXModels("tile", pb);
}

TEST_P(Test_ONNX_layers, LayerNorm)
{
    testONNXModels("test_layer_normalization_2d_axis0", pb, 0, 0, false, true, 3);
```
There are many error messages during the model import:

```
[ RUN ] Test_ONNX_layers.LayerNorm/0, where GetParam() = OCV/CPU
[ INFO:0@132.156] global onnx_importer.cpp:831 populateNet DNN/ONNX: loading ONNX v8 model produced by 'backend-test'. Number of nodes = 1, initializers = 0, inputs = 3, outputs = 3
[ INFO:0@132.156] global onnx_importer.cpp:725 parseOperatorSet DNN/ONNX: ONNX opset version = 17
[ INFO:0@132.156] global onnx_importer.cpp:997 handleNode DNN/ONNX: processing node with 3 inputs and 3 outputs: [LayerNormalization]:(onnx_node_output_0!Y) from domain='ai.onnx'
[ERROR:0@132.156] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'Mean'. Does model imported properly?
[ERROR:0@132.156] global onnx_importer.cpp:924 populateNet DNN/ONNX: can't find layer for output name: 'InvStdDev'. Does model imported properly?
```

We should not have them.
The reason why they exist:
- These ONNX models are taken from the ONNX conformance tests. I think it is better not to modify them.
- I took a look at the opencv-onnx functionality and did not find a way to remove them completely in our ONNX importer:

```cpp
// Remove additional outputs (Mean, InvStdDev)
if (node_proto.output_size() > 1)
{
    auto outputName = node_proto.output(0);
    opencv_onnx::NodeProto node_proto_ = node_proto;
    node_proto_.clear_output();
    node_proto_.add_output(outputName);
    addLayer(layerParams, node_proto_);
}
```

@rogday Do you happen to know how to remove optional (node & graph) outputs in the ONNX importer?
I removed the optional outputs from those ONNX models in the end. It turned out that modifying the outputs of an ONNX graph proto is not straightforward.
```cpp
CV_CheckTypeEQ(src.type(), dst.type(), "");
CV_Assert(scale.isContinuous());

CV_CheckGE(epsilon, 0.0f, "");
```
Added this extra check
```cpp
double nstripes = ((size_t)p.total * p.normSize) * (1 / 1024.0);
parallel_for_(Range(0, p.total), p, nstripes);
```
Thanks for the change! I learned a lot!
So if I understand correctly, you make the grain size small enough that both big and small cores can be used, and the big cores naturally take more jobs. What about the "magic number" 1024? Was it taken from the "bathtub curve" in this link?
The parallel_for() strategy should rely on the subtask size and the scheduling overhead. 1024 is an empiric number here which specifies the size of a subtask.
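As a worked example, assuming the 1x50x768 benchmark shape from the PR description and normalization over the last axis, so `p.total = 50` rows and `p.normSize = 768` elements per row: `nstripes = 50 * 768 / 1024 ≈ 37.5`. The work is offered to the pool as roughly 37 stripes of about 1024 elements each, so a fast core can grab many stripes while a slow core grabs only a few, without assuming equal thread speeds.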
Okay, thanks again, I benefited a lot from this. Another question: why multiply by `p.normSize`? Is that another empirical choice? I tried without it and it is about twice as slow.
By the way, I found that all the other `**Invoker`s use the same strategy, assuming all threads have the same performance. I think they need to be upgraded as well.
@alalek what is the status of this pull request? Could we make a step forward?
dnn: add layer normalization for vision transformers

* add layer norm onnx parser, impl and tests
* add onnx graph simplifier for layer norm expanded
* handle the case when constants are of type Initializer
* add test case for layer norm expanded with initializers
* use CV_Assert & CV_CheckType in place of CV_Assert_N; use forward_fallback for OCL_FP16
* use const ref / ref in parameters of invoker::run; extract inner const if from nested loop; use size_t in place of ull
* template hasBias
* remove trailing whitespace
* use pointer parameter with null check; move normSize division & mean_square division outside of loop; use std::max to ensure positive value before std::sqrt
* refactor implementation, optimize parallel_for
* disable layer norm expanded
* remove the removal of layer norm optional outputs
Merge with opencv/opencv_extra#1032
Benchmark: tested with size 1x50x768 on Apple M1.
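For reference, the layer added by this PR implements ONNX `LayerNormalization`, which over the normalized axes computes

$$y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta$$

where $\gamma$ is the scale input and $\beta$ is the optional bias (hence the `hasBias` template mentioned in the commit list above).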
Pull Request Readiness Checklist
See details at https://github.com/opencv/opencv/wiki/How_to_contribute#making-a-good-pull-request
Patch to opencv_extra has the same branch name.