GAPI: SIMD optimization for AbsDiffC kernel #19233
Conversation
@alalek please review.
Need to reduce usage of native intrinsics.
The amount of code should be reduced too; no need to start with optimizations of the one-time initialization part (we just can't measure these benefits through perf tests).
What about code dispatching between SSE4.2 / AVX2 / AVX512 in a single binary? // cc @dmatveev
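For reference, a minimal sketch of how such dispatching is usually wired up in OpenCV; the file and function names below are hypothetical, not taken from this PR:

// absdiffc.dispatch.cpp (hypothetical file): compiled once, picks the best
// ISA variant of the kernel at run time. The kernel body lives in a
// *.simd.hpp written with universal intrinsics and is recompiled by CMake
// for the baseline and for each optional ISA (SSE4.2 / AVX2 / AVX512).
#include "absdiffc.simd.hpp"                // generic body (universal intrinsics)
#include "absdiffc.simd_declarations.hpp"   // generated per-ISA declarations

namespace cv { namespace gapi { namespace fluid {

void run_absdiffc(const uchar in[], uchar out[], const float scalar[], int length)
{
    CV_CPU_DISPATCH(run_absdiffc, (in, out, scalar, length),
                    CV_CPU_DISPATCH_MODES_ALL);
}

}}} // namespace cv::gapi::fluid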
return v_float32x16(_mm512_setr_ps(*scalar, *(scalar + 1), *scalar, *(scalar + 1),
                                   *scalar, *(scalar + 1), *scalar, *(scalar + 1),
                                   *scalar, *(scalar + 1), *scalar, *(scalar + 1),
                                   *scalar, *(scalar + 1), *scalar, *(scalar + 1)));
The v_float32x16 ctor must be used instead of native intrinsics.
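A possible rewrite of the quoted snippet along these lines (a sketch, assuming the AVX-512 build where v_float32x16 exposes its 16-float constructor):

// Same broadcast pattern as above, but through the universal-intrinsics ctor
// rather than the native _mm512_setr_ps intrinsic.
return v_float32x16(scalar[0], scalar[1], scalar[0], scalar[1],
                    scalar[0], scalar[1], scalar[0], scalar[1],
                    scalar[0], scalar[1], scalar[0], scalar[1],
                    scalar[0], scalar[1], scalar[0], scalar[1]);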
CV_ALWAYS_INLINE int absdiffc_simd_c1c2c4(const T in[], T out[],
                                          const v_float32& s, const int length)
{
    constexpr int nlanes = static_cast<int>(v_uint16::nlanes);
The function is templated on typename T, yet the body hardcodes v_uint16::nlanes. Code should be consistent.
Don't use assumptions in generic implementation (especially silently).
The point is that this function handles the cases when the data is of unsigned short type and when it is of signed short type. In both cases nlanes is one and the same: nlanes = vector length in bits / number of bits in the type. For this case, 128 (SSE4.2) / 16 = 8. So for both types, U16 and S16, nlanes = 8 for SSE4.2, and there is no particular need to separate these two cases.
So, we are expecting T=ushort or T=short here; in this case, maybe it would be better to explicitly check that by asserts, something like:
bool isShort = std::is_same<T, ushort>::value || std::is_same<T, short>::value;
GAPI_Assert(isShort == true);
This also should be applied to absdiffc_simd_c3_impl, I think. There is the same issue.
Don't use assumptions in generic implementation (especially silently).
Changed.
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x)),
          a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4)),
          a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2)),
          a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));
Avoid declarations of multiple vars at once:
- the debugger is not able to show the right statement if this code goes out of buffer range
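Applied to the quoted snippet, the suggestion would look roughly like this (same loads, one declaration per statement):

// Each load/convert on its own statement, so a crash or out-of-range read
// can be attributed to the exact line in a debugger.
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x));
v_float32 a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4));
v_float32 a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2));
v_float32 a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));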
Done.
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x)),
          a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4)),
          a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2)),
          a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));
There are four vx_load_expand_q calls here. Reduce pressure on the CPU's LOAD units: the fetched memory is equal to vx_load(in + x).
Load v_uint8 first and then repack in registers.
I didn't quite understand your proposal. Could you please clarify your idea?
Replace the 4 load instructions with a single one.
But I need to initialize 4 vectors for further work with them. How can I load four vectors with one vx_load call?
@terfendail Could you please comment or clarify Alexander's proposal? How will Alexander's approach affect performance?
I think Alexander means something like
v_uint16 ld0, ld1;
v_expand(vx_load(in+x), ld0, ld1);
v_float32 a1 = v_cvt_f32(v_expand_low(ld0));
v_float32 a2 = v_cvt_f32(v_expand_high(ld0));
v_float32 a3 = v_cvt_f32(v_expand_low(ld1));
v_float32 a4 = v_cvt_f32(v_expand_high(ld1));
@terfendail Ok. Thank you so much for clarification!
I think Alexander means something like
v_uint16 ld0, ld1;
v_expand(vx_load(in + x), ld0, ld1);
v_float32 a1 = v_cvt_f32(v_expand_low(ld0));
v_float32 a2 = v_cvt_f32(v_expand_high(ld0));
v_float32 a3 = v_cvt_f32(v_expand_low(ld1));
v_float32 a4 = v_cvt_f32(v_expand_high(ld1));
@alalek I applied your proposal for 8U and gathered a performance report for AVX512 vectors. I observed an average performance degradation of 12.6%. For 8UC3 test cases the performance degradation is up to 33.3%. So I wouldn't like to apply this proposal to my snippet. Please take a look at the comparative performance report. // cc @dmatveev
You can see the applied proposal in the "Performance experiment" commit.
@alalek If I understand correctly, dispatching between SSE4.2 / AVX2 / AVX512 in a single binary will be possible only if I move my new universal intrinsics v_cvt_f32() and v_set_scalar() to the intrin_sse.hpp, intrin_avx.hpp, intrin_avx512.hpp files, which is highly undesirable for you as a reviewer.
@terfendail Could you please comment on our proposals?
float init[6] = { *scalar, *(scalar + 1), *(scalar + 2), *scalar,
                  *(scalar + 1), *(scalar + 2) };

v_float32 s1 = v_set_scalar<3>(scalar);
I think it would be better to extend the init array to v_float32::nlanes + 2 and then just load
s1 = vx_load(init + 0)
For 2 and 4 channels you could use the same approach or try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32()), whatever shows better performance.
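For illustration, a sketch of the suggested 3-channel initialization (array size and offsets follow the comment above; this is not the exact code that ended up in the PR):

// Fill nlanes + 2 floats with the repeating per-channel pattern once,
// then take the three shifted loads from the same array.
constexpr int size = static_cast<int>(v_float32::nlanes) + 2;
float init[size];
for (int i = 0; i < size; ++i)
    init[i] = scalar[i % 3];

v_float32 s1 = vx_load(init);      // pattern starting at channel 0
v_float32 s2 = vx_load(init + 1);  // pattern starting at channel 1
v_float32 s3 = vx_load(init + 2);  // pattern starting at channel 2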
Agreed about simplifying/minimizing the initialization code (no real performance impact).
I think it would be better to extend the init array to v_float32::nlanes + 2 and then just load s1 = vx_load(init + 0)

Thanks for the advice. I'll try.

For 2 and 4 channels you could use the same approach or try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32()), whatever shows better performance

It is not such a good idea. vx_lut_pairs() calls the (prefix)_i32gather_epi64() intrinsic, which has a latency of about 25. For comparison, vx_load() has a latency of 7.
And vx_lut_quads() has a total latency of about 33, while the latency of vx_load() equals 7.
{ | ||
for (; x <= length - nlanes; x += nlanes) | ||
{ | ||
v_float32 a1 = v_cvt_f32(vx_load_expand(in + x)), |
You could use v_cvt_f32(v_reinterpret_as_s32(vx_load_expand(in + x))) and avoid defining v_cvt_f32 for uint32.
Done.
@alalek please review.
float init[size];
for (int i = 0; i < size; ++i)
{
    init[i] = *(scalar + i % chan);
No need to obfuscate code:
-*(scalar + i % chan)
+scalar[i % chan]
Outdated.
                                          T out[], int width)
{
    constexpr int chan = 4;
    constexpr int size = static_cast<int>(v_float32::nlanes) + 2;
Why the + 2?
As I've already written in the post above, loading of each next coefficient vector occurs with an offset:
v_float32 s1 = vx_load(init);
#if CV_SIMD_WIDTH == 32
v_float32 s2 = vx_load(init + 2);
v_float32 s3 = vx_load(init + 1);
#else
v_float32 s2 = vx_load(init + 1);
v_float32 s3 = vx_load(init + 2);
#endif
The maximal offset is 2.
Also, @terfendail has already written about it here.
The size of a vector equals nlanes. If loading starts at the second element of the init array, then it'll finish at the nlanes + 2 element of init.
There is no such code in this function.
Ok. It's a typo.
Outdated.
float init[size];
for (int i = 0; i < size; ++i)
{
    init[i] = *(scalar + i % chan);
@dmatveev AFAIK, the Fluid backend performs per-row processing, so it makes sense to implement support for initializer code of such constants.
Ok. A scratch buffer was applied.
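A rough sketch of that shape, following the usual G-API Fluid pattern (the kernel name, flag, and bodies here are illustrative, not the exact code of this PR):

// Fluid kernel with a scratch buffer: the replicated scalar coefficients are
// prepared once in initScratch() and only read in run(), so the per-row hot
// loop does not pay for the initialization.
GAPI_FLUID_KERNEL(GFluidAbsDiffC, cv::gapi::core::GAbsDiffC, true)
{
    static const int Window = 1;

    static void initScratch(const GMatDesc&, const cv::Scalar& scalar, Buffer& scratch)
    {
        // allocate the scratch line and fill it with the repeated scalar pattern
    }

    static void resetScratch(Buffer& /*scratch*/)
    {
        // nothing to reset: the coefficients do not change between frames
    }

    static void run(const View& src, const cv::Scalar&, Buffer& dst, Buffer& scratch)
    {
        // read the precomputed coefficients from scratch and process one row
    }
};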
                                          const v_float32& s1, const v_float32& s2,
                                          const v_float32& s3, const int length)
{
    CV_StaticAssert((std::is_same<T, ushort>::value) ||
Is there CV_StaticAssert() support in standalone mode? IE?
Changed to static_assert()
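For reference, the plain-C++ form could look like this (a sketch; the message text is illustrative):

// static_assert is a C++11 language feature, so it needs no OpenCV support
// and also works in the G-API standalone build; std::is_same requires <type_traits>.
static_assert(std::is_same<T, ushort>::value || std::is_same<T, short>::value,
              "This templated overload expects ushort or short data only");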
{ | ||
for (int i = 0; i < num_vectors; ++i) | ||
{ | ||
vectors[i] = v_load_f32(in + x + i * nlanes / 4); |
No need to perform hand-made register spilling. Compilers are smart enough and can do that for you if necessary (moreover, AVX512 has up to 32 vector registers).
This data is:
- loaded once
- used once
Move data loading to the corresponding places.
@alalek Could you please clarify what you mean by hand-made register spilling? If you mean v_load_f32(), it isn't hand-made register spilling; it is just an overloaded function for ease of writing templates.
If you mean the for loop that initializes 12 vectors: are you sure that you want to see 12 load lines instead of one?
I don't quite understand the essence of your request. Please clarify.
- Data is loaded from the memory.
- On the same line the data is stored back to the memory.
- The data is re-loaded later once again for processing.
Do you see the redundant steps here?
P.S. No need to load all 12 SIMD vectors at once. Load data on demand.
P.S. No need to load all 12 SIMD vectors at once. Load data on demand.
It's necessary because of the specifics of the algorithm.
No need to perform hand-made registers spilling. Compilers are smart enough and can do that for you if necessary (moreover AVX512 has up to 32 vector registers)
This data is:
- loaded once
- used once
Move data loading to corresponding places.
Reworked.
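For illustration, the load-on-demand shape means each register is loaded right where it is consumed, instead of being staged through a vectors[] array first (a sketch using the PR's v_load_f32() helper; offsets and the elided computation are illustrative):

for (; x <= length - nlanes; x += nlanes)
{
    // load exactly where the data is used; no intermediate array round-trip
    v_float32 a1 = v_load_f32(in + x);
    v_float32 a2 = v_load_f32(in + x + nlanes / 4);
    v_float32 a3 = v_load_f32(in + x + nlanes / 2);
    v_float32 a4 = v_load_f32(in + x + 3 * nlanes / 4);

    // ... compute |a - s| for each vector and store the results ...
}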
static void initScratch(const GMatDesc& in, const cv::Scalar& _scalar, Buffer& scratch)
{
Great 👍
@alalek All comments were applied. Please check.
@alalek CI builds finished successfully. There are no unapplied comments.
GAPI: SIMD optimization for AbsDiffC kernel
* SIMD optimization for AbsDiffC kernel
* Applied comments
* Applying comments and refactoring: Remove new univ intrinsics.
* Performance experiment
* Applied comments. Step 2
* Applied comments. Step 3
SIMD optimization for AbsDiffC kernel via universal intrinsics.
@rgarnov, @OrestChura please take a look. Full performance report from the latest revision:
AbsDiffC_full_perf_report.xlsx