GAPI: SIMD optimization for AddWeighted kernel. #18466
Conversation
Force-pushed from d1bdfcd to c14bca7.
Force-pushed from 4187207 to a1eec7b.
@terfendail, please take a look at the SIMD optimization of the And, Or, Xor kernels.
@anna-khakimova @anton-potapov Friendly reminder.
@anna-khakimova could you rebase the PR and fix the conflicts?
Hello @asmorkalov |
Force-pushed from 26b9ac6 to 1dc3ab2.
I can also suggest the same changes for the code of the addw_simd() function, but that is a more complicated question; let's discuss it later.
Reworked.
@alalek, @terfendail, @rgarnov, @OrestChura please check.
```cpp
v_float32 a = vx_setall_f32(_alpha);
v_float32 b = vx_setall_f32(_beta);
v_float32 g = vx_setall_f32(_gamma);

x = addw_simd(in1, in2, out, a, b, g, length);
```
Why?
I don't know of any case where passing SIMD vectors through parameters (into a non-inline function) can improve performance.
I didn't try to improve performance with the "Refactoring.Step2" commit. I just tried to remove the excess intermediate function, which has the same name addw_simd() but looks like this:
```cpp
template<typename SRC, typename DST>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], DST out[],
                               float _alpha, float _beta, float _gamma, int length)
{
    // Cases when dst type is float are successfully vectorized with compiler.
    if (std::is_same<DST, float>::value)
        return 0;

    v_float32 alpha = vx_setall_f32(_alpha);
    v_float32 beta  = vx_setall_f32(_beta);
    v_float32 gamma = vx_setall_f32(_gamma);

    if (std::is_same<SRC, ushort>::value && std::is_same<DST, ushort>::value)
    {
        return addw_short2short(reinterpret_cast<const ushort*>(in1), reinterpret_cast<const ushort*>(in2),
                                reinterpret_cast<ushort*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, short>::value && std::is_same<DST, short>::value)
    {
        return addw_short2short(reinterpret_cast<const short*>(in1), reinterpret_cast<const short*>(in2),
                                reinterpret_cast<short*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, short>::value && std::is_same<DST, uchar>::value)
    {
        return addw_short2uchar(reinterpret_cast<const short*>(in1), reinterpret_cast<const short*>(in2),
                                reinterpret_cast<uchar*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, ushort>::value && std::is_same<DST, uchar>::value)
    {
        return addw_short2uchar(reinterpret_cast<const ushort*>(in1), reinterpret_cast<const ushort*>(in2),
                                reinterpret_cast<uchar*>(out), alpha, beta, gamma, length);
    }
    else if (std::is_same<SRC, uchar>::value && std::is_same<DST, uchar>::value)
    {
        constexpr int nlanes = v_uint8::nlanes;
        if (length < nlanes)
            return 0;

        int x = 0;
        for (;;)
        {
            for (; x <= length - nlanes; x += nlanes)
            {
                v_float32 a1 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x]));
                v_float32 a2 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x + nlanes / 4]));
                v_float32 a3 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x + nlanes / 2]));
                v_float32 a4 = v_load_f32(reinterpret_cast<const uchar*>(&in1[x + 3 * nlanes / 4]));

                v_float32 b1 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x]));
                v_float32 b2 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x + nlanes / 4]));
                v_float32 b3 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x + nlanes / 2]));
                v_float32 b4 = v_load_f32(reinterpret_cast<const uchar*>(&in2[x + 3 * nlanes / 4]));

                v_int32 sum1 = v_round(v_fma(a1, alpha, v_fma(b1, beta, gamma))),
                        sum2 = v_round(v_fma(a2, alpha, v_fma(b2, beta, gamma))),
                        sum3 = v_round(v_fma(a3, alpha, v_fma(b3, beta, gamma))),
                        sum4 = v_round(v_fma(a4, alpha, v_fma(b4, beta, gamma)));

                vx_store(reinterpret_cast<uchar*>(&out[x]), v_pack_u(v_pack(sum1, sum2), v_pack(sum3, sum4)));
            }

            if (x < length)
            {
                x = length - nlanes;
                continue;  // process one more time (unaligned tail)
            }
            break;
        }
        return x;
    }
    return 0;
}
```
The functions addw_short2uchar() and addw_short2short() were renamed to addw_simd() and became templated, with 2 template specializations. Please don't rely on GitHub's diff; it often shows the difference incorrectly.
Since the intermediate function shown above was removed, the initialization of the alpha, beta, gamma vectors had to be moved to the run_addweighted() function, one level higher in the call stack.
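For illustration, here is a rough sketch of the generic caller after "Refactoring.Step2" (names follow this thread; the actual gfluidcore.cpp code may differ in details): the v_float32 constants are built in run_addweighted() and handed to addw_simd(), which returns how many elements it processed (0 when it cannot vectorize the given type combination).

```cpp
// Inside run_addweighted() -- sketch only.
int x = 0;
#if CV_SSE2  // x86-only guard used in this PR; the NEON path is disabled (see discussion below)
    v_float32 a = vx_setall_f32(_alpha);
    v_float32 b = vx_setall_f32(_beta);
    v_float32 g = vx_setall_f32(_gamma);
    x = addw_simd(in1, in2, out, a, b, g, length);  // number of elements processed by SIMD
#endif
for (; x < length; ++x)
    // scalar tail / fallback for the elements the SIMD path did not cover
    out[x] = cv::saturate_cast<DST>(in1[x] * _alpha + in2[x] * _beta + _gamma);
```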
So now addw_simd() looks like this:
```cpp
template<typename SRC, typename DST>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], DST out[],
                               const v_float32& alpha, const v_float32& beta,
                               const v_float32& gamma, int length)
{
    GAPI_Assert((std::is_same<DST, ushort>::value) ||
                (std::is_same<DST, short>::value));

    constexpr int nlanes = (std::is_same<DST, ushort>::value) ? static_cast<int>(v_uint16::nlanes) :
                                                                static_cast<int>(v_int16::nlanes);
    if (length < nlanes)
        return 0;

    int x = 0;
    for (;;)
    {
        for (; x <= length - nlanes; x += nlanes)
        {
            v_float32 a1 = v_load_f32(&in1[x]);
            v_float32 a2 = v_load_f32(&in1[x + nlanes / 2]);
            v_float32 b1 = v_load_f32(&in2[x]);
            v_float32 b2 = v_load_f32(&in2[x + nlanes / 2]);

            addw_short_store(&out[x], v_round(v_fma(a1, alpha, v_fma(b1, beta, gamma))),
                                      v_round(v_fma(a2, alpha, v_fma(b2, beta, gamma))));
        }

        if (x < length)
        {
            x = length - nlanes;
            continue;  // process one more time (unaligned tail)
        }
        break;
    }
    return x;
}

template<typename SRC>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], uchar out[],
                               const v_float32& alpha, const v_float32& beta,
                               const v_float32& gamma, int length)
{
    constexpr int nlanes = v_uint8::nlanes;
    if (length < nlanes)
        return 0;

    int x = 0;
    for (;;)
    {
        for (; x <= length - nlanes; x += nlanes)
        {
            v_float32 a1 = v_load_f32(&in1[x]);
            v_float32 a2 = v_load_f32(&in1[x + nlanes / 4]);
            v_float32 a3 = v_load_f32(&in1[x + nlanes / 2]);
            v_float32 a4 = v_load_f32(&in1[x + 3 * nlanes / 4]);

            v_float32 b1 = v_load_f32(&in2[x]);
            v_float32 b2 = v_load_f32(&in2[x + nlanes / 4]);
            v_float32 b3 = v_load_f32(&in2[x + nlanes / 2]);
            v_float32 b4 = v_load_f32(&in2[x + 3 * nlanes / 4]);

            v_int32 sum1 = v_round(v_fma(a1, alpha, v_fma(b1, beta, gamma))),
                    sum2 = v_round(v_fma(a2, alpha, v_fma(b2, beta, gamma))),
                    sum3 = v_round(v_fma(a3, alpha, v_fma(b3, beta, gamma))),
                    sum4 = v_round(v_fma(a4, alpha, v_fma(b4, beta, gamma)));

            vx_store(&out[x], v_pack_u(v_pack(sum1, sum2), v_pack(sum3, sum4)));
        }

        if (x < length)
        {
            x = length - nlanes;
            continue;  // process one more time (unaligned tail)
        }
        break;
    }
    return x;
}

template<typename SRC>
CV_ALWAYS_INLINE int addw_simd(const SRC*, const SRC*, float*,
                               const v_float32&, const v_float32&,
                               const v_float32&, int length)
{
    // Cases when dst type is float are successfully vectorized with compiler.
    return 0;
}
```
But if you prefer, I can move

```cpp
v_float32 a = vx_setall_f32(_alpha);
v_float32 b = vx_setall_f32(_beta);
v_float32 g = vx_setall_f32(_gamma);
```

into each of the addw_simd() template specializations. But that would cause code duplication.
I think it would be better and easier to add a static assert.
@alalek could you please comment?
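For example, a minimal sketch of what that could look like in the first specialization, replacing the runtime GAPI_Assert (the message text here is made up for illustration):

```cpp
// Compile-time check instead of the runtime GAPI_Assert: an unsupported DST
// type is rejected at compilation rather than at run time.
static_assert(std::is_same<DST, ushort>::value ||
              std::is_same<DST, short>::value,
              "addw_simd(): this overload supports only ushort and short outputs");
```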
OK, this may work here.
Applied.
> if you prefer, I can move

This should be moved into the SIMD functions.
Think about runtime dispatching (not in the scope of this PR), where we don't even know the SIMD size and register types of the underlying functions.
Which SIMD value would you want to pass there?
Consider following these rules:
- no SIMD variables in dispatching code (generic C++ part)
- no SIMD arguments in SIMD functions which are called from generic C++ code.
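A minimal sketch of what these rules imply for this kernel (illustrative only; the final code in the PR may differ in details):

```cpp
// SIMD function: called from generic C++ code with plain scalar arguments;
// the vector constants are created here, inside the SIMD-specific part.
template<typename SRC, typename DST>
CV_ALWAYS_INLINE int addw_simd(const SRC in1[], const SRC in2[], DST out[],
                               float _alpha, float _beta, float _gamma, int length)
{
    v_float32 alpha = vx_setall_f32(_alpha);
    v_float32 beta  = vx_setall_f32(_beta);
    v_float32 gamma = vx_setall_f32(_gamma);
    // ... vectorized loop as in the specializations above ...
    return 0;  // sketch only: the real code returns the number of processed elements
}

// Generic dispatching code: no SIMD variables, no SIMD arguments, scalars only.
// x = addw_simd(in1, in2, out, _alpha, _beta, _gamma, length);
```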
Applied.
Force-pushed from 6fad040 to cca8f7e.
Force-pushed from cca8f7e to 1b4529b.
Force-pushed from e2cebd1 to bb30bac.
@alalek I've already applied all the comments. It seems to me that Vitaly is on vacation. Could you please check, dismiss the change request from Vitaly, and merge?
@rgarnov @dmatveev In this patch the SIMD code targeted for NEON is explicitly disabled. This is a very unusual case, especially for such simple loops.
We need to properly re-measure performance with NEON intrinsics enabled on the target platform.
On a public TX-1 (no access to the target platform) I see a 1.5-3.5x speedup.
Anna's report shows a 12% degradation.
We need a third, independent measurement through the OpenCV perf test scripts to confirm the performance observations.
```cpp
@@ -97,6 +97,123 @@ static inline DST divr(SRC1 x, SRC2 y, float scale=1)
// Fluid kernels: addWeighted
//
//---------------------------
#if CV_SSE2 | CV_AVX2 | CV_AVX512_SKX
```
Regarding #if CV_SSE2 | CV_AVX2 | CV_AVX512_SKX: why is just #if CV_SSE2 not enough?
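If I read OpenCV's CPU dispatch headers correctly, builds whose baseline includes AVX2 or AVX-512 also define CV_SSE2 to 1, so the single guard already covers all three cases; a sketch of the simplified guard, under that assumption:

```cpp
// Assumption: CV_SSE2 is also set to 1 for AVX2 / AVX-512 baselines
// (see cv_cpu_dispatch.h), so one guard covers every x86 SIMD level here.
#if CV_SSE2
// ... universal-intrinsics implementation of addw_simd() ...
#endif
```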
Applied.
@dbudniko please take a look.
@alalek As you can see, Dmitry's result is the same as ours and Orest's. No speedup is observed.
Force-pushed from bb30bac to b819847.
@alalek please take a look.
I haven't collected performance statistics for this update yet; however, the change looks fine to me.
GAPI: SIMD optimization for AddWeighted kernel.
* Add, sub, absdiff kernels optimization
* AddW kernel
* And, or kernels
* AddWeighted refactoring and SIMD opt for AbsDiffC kernel
* Remove simd opt of AbsDiffC kernel
* Refactoring
* Applied comments
* Refactoring.Step2
* Applied comments.Step2
SIMD optimization for AddWeighted kernel via universal intrinsics.
Perf report for AddWeighted:
AddWeighted_avx512_avx2_sse42_perf_report.xlsx
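For reference, the per-element semantics the vectorized paths have to reproduce (a sketch based on the usual cv::addWeighted definition, not the literal Fluid kernel code):

```cpp
#include <opencv2/core/saturate.hpp>

// out[i] = saturate_cast<DST>(in1[i]*alpha + in2[i]*beta + gamma),
// with rounding to nearest for integer DST types.
template<typename DST, typename SRC>
static void addweighted_reference(const SRC in1[], const SRC in2[], DST out[],
                                  int length, float alpha, float beta, float gamma)
{
    for (int i = 0; i < length; ++i)
        out[i] = cv::saturate_cast<DST>(in1[i] * alpha + in2[i] * beta + gamma);
}
```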