CARVIEW |
Navigation Menu
-
-
Notifications
You must be signed in to change notification settings - Fork 56.2k
GAPI Fluid: SIMD for SubC kernel. #21158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
5abee9b
to
86f21d4
Compare
86f21d4
to
d26e6da
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM: added minor comments
float* sc = scratch.OutLine<float>(); | ||
|
||
for (int i = 0; i < scratch.length(); ++i) | ||
sc[i] = static_cast<float>(_scalar[i % chan]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naive: till scratch
size is known at compile phase - is it possible to make unrolling loop for that determined size?
In case of compiler didn't make unroll it automatically...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not entirely clear yet. Could you explain in more detail?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure:
Scratch buffer size depends from MACRO definition here https://github.com/opencv/opencv/pull/21158/files#diff-fe0ac6b5a07c1bec59c3756100eb186c9117a82dd1d951ef6997b8423a89943dR1505
we have constant buffer size here for CV_SIMD turn on/off case... So it is possible to introduce MACRO branch here by CV_SIMD turn on/off and make manual loop unrolling in a way
#ifdef CV_SIMD
cs[0] = _scalar[0];
cs[1] = _scalar[1];
...<16 + 2 times>
#else
But from the other hand, C++ compiler can do this work of manual loop unrolling in more optimized way... But I do not known much details how is it applicable here: would C++ compiler make unrolling or wouldn't in this case - if scratch.length()
has runtime (but not compile time) value.
So my point here: We can try to simply replace scratch.length()
from dynamic parameter to constant parameter from MACRO where scratch buffer is determined and believe that C++ compiler would make loop unrolling or straight way vectorization here
P.S. this comment may be (or may not be ) applicable if this code is placed on critical performance path ONLY
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me there are no sense to do this for current case, because this loop work just once when function is called for the first line of the handling picture. And the compiler usually does it well itself.
@@ -81,6 +81,9 @@ MUL_SIMD(float, float) | |||
|
|||
#undef MUL_SIMD | |||
|
|||
struct add_op {}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simple find on a page
doesn't return to me anything about using this structure: where is it used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Removed.
{ | ||
#if CV_SIMD | ||
// 512 bits / 32 bits = 16 elements of float32 can contain a AVX 512 SIMD vector. | ||
constexpr int maxNlanes = 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it the same initialization which used for AddC? if yes - then maybe we need single common function to reuse in both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied.
CV_ALWAYS_INLINE | ||
typename std::enable_if<(std::is_same<DST, ushort>::value || | ||
std::is_same<DST, short>::value), void>::type | ||
addc_simd_common_impl(const SRC* inx, DST* outx, const v_float32& sc, const int nlanes) | ||
arithmOpScalar_simd_common_impl(op_tag t, const SRC* inx, DST* outx, const v_float32& sc, const int nlanes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
Now code for sum + sub is unified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you think to rename op
for something more less domain specific -> operation
or op/operation_dispatcher
? too much short-named variables here and functions name stays hidden beside them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applied.
@alalek CI checks passed, PR was approved. Could you please merge it? |
* GAPI Fluid: SIMD for SubC kernel. * Applied comments
SIMD for GAPI Fluid SubC kernel.
Performance report:
FluidSubCSIMD.xlsx