GAPI: SIMD optimization for AbsDiffC kernel #19233
Conversation
@alalek please review.
Need to reduce usage of native intrinsics.
The amount of code should be reduced too; no need to start with optimizations of the one-time initialization part (we just can't measure these benefits through perf tests).
What about code dispatching between SSE4.2 / AVX2 / AVX512 in a single binary? // cc @dmatveev
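For reference, a minimal sketch of how such dispatching is usually wired up in OpenCV; the file and function names below are hypothetical, not taken from this PR:

// absdiffc.dispatch.cpp (hypothetical file): compiled once, picks the best
// ISA variant of the kernel at run time. The kernel body lives in a
// *.simd.hpp written with universal intrinsics and is recompiled by CMake
// for the baseline and for each optional ISA (SSE4.2 / AVX2 / AVX512).
#include "absdiffc.simd.hpp"                // generic body (universal intrinsics)
#include "absdiffc.simd_declarations.hpp"   // generated per-ISA declarations

namespace cv { namespace gapi { namespace fluid {

void run_absdiffc(const uchar in[], uchar out[], const float scalar[], int length)
{
    CV_CPU_DISPATCH(run_absdiffc, (in, out, scalar, length),
                    CV_CPU_DISPATCH_MODES_ALL);
}

}}} // namespace cv::gapi::fluid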
return v_float32x16(_mm512_setr_ps(*scalar, *(scalar + 1), *scalar, *(scalar + 1),
                                   *scalar, *(scalar + 1), *scalar, *(scalar + 1),
                                   *scalar, *(scalar + 1), *scalar, *(scalar + 1),
                                   *scalar, *(scalar + 1), *scalar, *(scalar + 1)));
The v_float32x16 ctor must be used instead of native intrinsics.
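A possible rewrite of the quoted snippet along these lines (a sketch, assuming the AVX-512 build where v_float32x16 exposes its 16-float constructor):

// Same broadcast pattern as above, but through the universal-intrinsics ctor
// rather than the native _mm512_setr_ps intrinsic.
return v_float32x16(scalar[0], scalar[1], scalar[0], scalar[1],
                    scalar[0], scalar[1], scalar[0], scalar[1],
                    scalar[0], scalar[1], scalar[0], scalar[1],
                    scalar[0], scalar[1], scalar[0], scalar[1]);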
CV_ALWAYS_INLINE int absdiffc_simd_c1c2c4(const T in[], T out[],
                                          const v_float32& s, const int length)
{
    constexpr int nlanes = static_cast<int>(v_uint16::nlanes);
The function is templated on typename T, yet the body hardcodes v_uint16::nlanes. Code should be consistent.
Don't use assumptions in generic implementation (especially silently).
The point is that this function handles the cases when the data is of unsigned short type and when it is of signed short type. In both cases nlanes is one and the same: nlanes = vector length in bits / number of bits in the type. For this case, 128 (SSE4.2) / 16 = 8. So for both types, U16 and S16, nlanes = 8 for SSE4.2, and there is no particular need to separate these two cases.
So, we are expecting T=ushort or T=short here; in this case, maybe it would be better to explicitly check that by asserts, something like:
bool isShort = std::is_same<T, ushort>::value || std::is_same<T, short>::value;
GAPI_Assert(isShort == true);
This also should be applied to absdiffc_simd_c3_impl, I think. There is the same issue.
Don't use assumptions in generic implementation (especially silently).
Changed.
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x)),
          a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4)),
          a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2)),
          a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));
Avoid declarations of multiple vars at once:
- the debugger is not able to show the right statement if this code goes out of buffer range
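Applied to the quoted snippet, the suggestion would look roughly like this (same loads, one declaration per statement):

// Each load/convert on its own statement, so a crash or out-of-range read
// can be attributed to the exact line in a debugger.
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x));
v_float32 a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4));
v_float32 a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2));
v_float32 a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));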
Done.
v_float32 a1 = v_cvt_f32(vx_load_expand_q(in + x)),
          a2 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 4)),
          a3 = v_cvt_f32(vx_load_expand_q(in + x + nlanes / 2)),
          a4 = v_cvt_f32(vx_load_expand_q(in + x + 3 * nlanes / 4));
There are four vx_load_expand_q calls here. Reduce pressure on the CPU's LOAD units: the fetched memory is equal to vx_load(in + x).
Load v_uint8 first and then repack in registers.
I didn't quite understand your proposal. Could you please clarify your idea?
Replace the 4 load instructions with a single one.
But I need to initialize 4 vectors for further work with them. How can I load four vectors with one vx_load call?
@terfendail Could you please comment or clarify Alexander's proposal? How will Alexander's approach affect performance?
I think Alexander means something like
v_uint16 ld0, ld1;
v_expand(vx_load(in+x), ld0, ld1);
v_float32 a1 = v_cvt_f32(v_expand_low(ld0));
v_float32 a2 = v_cvt_f32(v_expand_high(ld0));
v_float32 a3 = v_cvt_f32(v_expand_low(ld1));
v_float32 a4 = v_cvt_f32(v_expand_high(ld1));
@terfendail Ok. Thank you so much for clarification!
I think Alexander means something like
v_uint16 ld0, ld1;
v_expand(vx_load(in + x), ld0, ld1);
v_float32 a1 = v_cvt_f32(v_expand_low(ld0));
v_float32 a2 = v_cvt_f32(v_expand_high(ld0));
v_float32 a3 = v_cvt_f32(v_expand_low(ld1));
v_float32 a4 = v_cvt_f32(v_expand_high(ld1));
@alalek I applied your proposal for 8U and gathered a performance report for AVX512 vectors. I observed an average performance degradation of 12.6%. For 8UC3 test cases the performance degradation is up to 33.3%. So I wouldn't like to apply this proposal to my snippet. Please take a look at the comparative performance report. // cc @dmatveev
You can see the applied proposal in the "Performance experiment" commit.
@alalek If I understand correctly, dispatching between SSE4.2 / AVX2 / AVX512 in a single binary will be possible only if I move my new universal intrinsics v_cvt_f32() and v_set_scalar() to the intrin_sse.hpp, intrin_avx.hpp, intrin_avx512.hpp files, which is highly undesirable for you as a reviewer.
@terfendail Could you please comment on our proposals?
float init[6] = { *scalar, *(scalar + 1), *(scalar + 2), *scalar,
                  *(scalar + 1), *(scalar + 2) };

v_float32 s1 = v_set_scalar<3>(scalar);
I think it would be better to extend the init array to v_float32::nlanes + 2 and then just load
s1 = vx_load(init + 0)
For 2 and 4 channels you could use the same approach or try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32()), whatever shows better performance.
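For illustration, a sketch of the suggested 3-channel initialization (array size and offsets follow the comment above; this is not the exact code that ended up in the PR):

// Fill nlanes + 2 floats with the repeating per-channel pattern once,
// then take the three shifted loads from the same array.
constexpr int size = static_cast<int>(v_float32::nlanes) + 2;
float init[size];
for (int i = 0; i < size; ++i)
    init[i] = scalar[i % 3];

v_float32 s1 = vx_load(init);      // pattern starting at channel 0
v_float32 s2 = vx_load(init + 1);  // pattern starting at channel 1
v_float32 s3 = vx_load(init + 2);  // pattern starting at channel 2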
Agreed about simplifying/minimizing the initialization code (no real performance impact).
I think it would be better to extend the init array to v_float32::nlanes + 2 and then just load s1 = vx_load(init + 0)

Thanks for the advice. I'll try.

For 2 and 4 channels you could use the same approach or try vx_lut_pairs/vx_lut_quads(scalar, vx_setzero_s32()), whatever shows better performance

It is not such a good idea. vx_lut_pairs() calls the (prefix)_i32gather_epi64() intrinsic, which has a latency of about 25. For comparison, vx_load() has a latency of 7.
And vx_lut_quads() has a total latency of about 33, while the latency of vx_load() equals 7.
{ | ||
for (; x <= length - nlanes; x += nlanes) | ||
{ | ||
v_float32 a1 = v_cvt_f32(vx_load_expand(in + x)), |
You could use v_cvt_f32(v_reinterpret_as_s32(vx_load_expand(in + x))) and avoid defining v_cvt_f32 for uint32.
Done.
@alalek please review.
float init[size];
for (int i = 0; i < size; ++i)
{
    init[i] = *(scalar + i % chan);
No need to obfuscate code:
-*(scalar + i % chan)
+scalar[i % chan]
Outdated.
                                          T out[], int width)
{
    constexpr int chan = 4;
    constexpr int size = static_cast<int>(v_float32::nlanes) + 2;
Why the + 2?
As I've already written in the post above, loading of each next coefficient vector occurs with an offset:
v_float32 s1 = vx_load(init);
#if CV_SIMD_WIDTH == 32
v_float32 s2 = vx_load(init + 2);
v_float32 s3 = vx_load(init + 1);
#else
v_float32 s2 = vx_load(init + 1);
v_float32 s3 = vx_load(init + 2);
#endif
The maximal offset is 2.
Also, @terfendail has already written about it here.
The size of a vector equals nlanes. If loading starts at the second element of the init array, then it'll finish at the nlanes + 2 element of init.
There is no such code in this function.
Ok. It's a typo.
Outdated.
float init[size];
for (int i = 0; i < size; ++i)
{
    init[i] = *(scalar + i % chan);
@dmatveev AFAIK, the Fluid backend performs per-row processing, so it makes sense to implement support for initializer code of such constants.
Ok. A scratch buffer was applied.
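A rough sketch of that shape, following the usual G-API Fluid pattern (the kernel name, flag, and bodies here are illustrative, not the exact code of this PR):

// Fluid kernel with a scratch buffer: the replicated scalar coefficients are
// prepared once in initScratch() and only read in run(), so the per-row hot
// loop does not pay for the initialization.
GAPI_FLUID_KERNEL(GFluidAbsDiffC, cv::gapi::core::GAbsDiffC, true)
{
    static const int Window = 1;

    static void initScratch(const GMatDesc&, const cv::Scalar& scalar, Buffer& scratch)
    {
        // allocate the scratch line and fill it with the repeated scalar pattern
    }

    static void resetScratch(Buffer& /*scratch*/)
    {
        // nothing to reset: the coefficients do not change between frames
    }

    static void run(const View& src, const cv::Scalar&, Buffer& dst, Buffer& scratch)
    {
        // read the precomputed coefficients from scratch and process one row
    }
};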
                                          const v_float32& s1, const v_float32& s2,
                                          const v_float32& s3, const int length)
{
    CV_StaticAssert((std::is_same<T, ushort>::value) ||
Is there CV_StaticAssert() support in standalone mode? IE?
Changed to static_assert()
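For reference, the plain-C++ form could look like this (a sketch; the message text is illustrative):

// static_assert is a C++11 language feature, so it needs no OpenCV support
// and also works in the G-API standalone build; std::is_same requires <type_traits>.
static_assert(std::is_same<T, ushort>::value || std::is_same<T, short>::value,
              "This templated overload expects ushort or short data only");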
{ | ||
for (int i = 0; i < num_vectors; ++i) | ||
{ | ||
vectors[i] = v_load_f32(in + x + i * nlanes / 4); |
No need to perform hand-made register spilling. Compilers are smart enough and can do that for you if necessary (moreover, AVX512 has up to 32 vector registers).
This data is:
- loaded once
- used once
Move data loading to the corresponding places.
@alalek Could you please clarify what you mean by hand-made register spilling? If you mean v_load_f32(), it isn't hand-made register spilling; it is just an overloaded function for ease of writing templates.
If you mean the for loop that initializes 12 vectors: are you sure that you want to see 12 load lines instead of one?
I don't quite understand the essence of your request. Please clarify.
- Data is loaded from the memory.
- On the same line the data is stored back to the memory.
- The data is re-loaded later once again for processing.
Do you see the redundant steps here?
P.S. No need to load all 12 SIMD vectors at once. Load data on demand.
P.S. No need to load all 12 SIMD vectors at once. Load data on demand.
It's necessary because of the specifics of the algorithm.
No need to perform hand-made registers spilling. Compilers are smart enough and can do that for you if necessary (moreover AVX512 has up to 32 vector registers)
This data is:
- loaded once
- used once
Move data loading to corresponding places.
Reworked.
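For illustration, the load-on-demand shape means each register is loaded right where it is consumed, instead of being staged through a vectors[] array first (a sketch using the PR's v_load_f32() helper; offsets and the elided computation are illustrative):

for (; x <= length - nlanes; x += nlanes)
{
    // load exactly where the data is used; no intermediate array round-trip
    v_float32 a1 = v_load_f32(in + x);
    v_float32 a2 = v_load_f32(in + x + nlanes / 4);
    v_float32 a3 = v_load_f32(in + x + nlanes / 2);
    v_float32 a4 = v_load_f32(in + x + 3 * nlanes / 4);

    // ... compute |a - s| for each vector and store the results ...
}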
static void initScratch(const GMatDesc& in, const cv::Scalar& _scalar, Buffer& scratch)
{
Great 👍
@alalek All comments were applied. Please check.
@alalek CI builds finished successfully. There are no unapplied comments.
GAPI: SIMD optimization for AbsDiffC kernel
* SIMD optimization for AbsDiffC kernel
* Applied comments
* Applying comments and refactoring: Remove new univ intrinsics.
* Performance experiment
* Applied comments. Step 2
* Applied comments. Step 3
SIMD optimization for AbsDiffC kernel via universal intrinsics.
@rgarnov, @OrestChura please take a look. Full performance report from the latest revision:
AbsDiffC_full_perf_report.xlsx