CARVIEW |
Navigation Menu
-
-
Notifications
You must be signed in to change notification settings - Fork 56.2k
Added MSA implementations for mips platforms. #15422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I have added MSA implementations for MIPS platforms. The changes include:
Performance and functionality testsuites have been run locally for verification. According to the opencv_perf tests results, the MSA optimized version is 39% faster than the No-MSA version. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job 👍
General notes:
- please split PR: SIMD backend + build scripts should be reviewed and merged first. Other parts please move into separate PRs. Please check related comment below about RAW MSA code.
- as an "optimization" activity please re-target this PR onto 3.4 branch
3rdparty/libpng/CMakeLists.txt
Outdated
set(PNG_MIPS_MSA "on" CACHE STRING "Enable MIPS_MSA optimizations: | ||
off: disable the optimizations") | ||
set_property(CACHE PNG_MIPS_MSA PROPERTY STRINGS | ||
${PNG_MIPS_MSA_POSSIBLE_VALUES}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-...
+set(PNG_MIPS_MSA ON CACHE BOOL ...)
-if(${PNG_MIPS_MSA} STREQUAL "on")
+if(PNG_MIPS_MSA)
instead of strings manipulation.
Also it is better to add this pre-check:
if(";${CPU_BASELINE_FINAL};" MATCHES ";MSA;")
@@ -97,6 +98,8 @@ ocv_optimization_process_obsolete_option(ENABLE_FMA3 FMA3 ON) | |||
ocv_optimization_process_obsolete_option(ENABLE_VFPV3 VFPV3 OFF) | |||
ocv_optimization_process_obsolete_option(ENABLE_NEON NEON OFF) | |||
|
|||
ocv_optimization_process_obsolete_option(ENABLE_MSA MSA OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't introduce obsolete options.
CPU_BASELINE=MSA
should be used instead (it is defaulted to this values below on the line 350)
@@ -256,6 +256,8 @@ namespace cv { namespace debug_build_guard { } using namespace debug_build_guard | |||
#define CV_CPU_AVX512_CEL 261 | |||
#define CV_CPU_AVX512_ICL 262 | |||
|
|||
#define CV_CPU_MSA 300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use value "150"
// of this distribution and at https://opencv.org/license.html. | ||
|
||
#ifndef OPENCV_CORE_HAL_MSA_MACROS_H | ||
#define OPENCV_CORE_HAL_MSA_MACROS_H |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please move this file into core/hal directory.
platforms/linux/mips.toolchain.cmake
Outdated
|
||
if(USE_MSA) | ||
message(WARNING "You use obsolete variable USE_MSA to enable MSA instruction set. Use -DENABLE_MSA=ON instead." ) | ||
set(ENABLE_MSA TRUE) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove obsolete options.
} | ||
|
||
inline v_uint32x4 v_interleave_pairs(const v_uint32x4& vec) { return v_reinterpret_as_u32(v_interleave_pairs(v_reinterpret_as_s32(vec))); } | ||
inline v_float32x4 v_interleave_pairs(const v_float32x4& vec) { return v_reinterpret_as_f32(v_interleave_pa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove this call. It is gone (handled by compile time checks).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub code reference is buggy again - correct line is 1748 about hasSIMD128().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove static inline bool hasSIMD128()
- it is not used anymore in OpenCV
(wrong code is linked due GitHub bug, but file is correct)
} | ||
|
||
inline v_uint32x4 v_interleave_pairs(const v_uint32x4& vec) { return v_reinterpret_as_u32(v_interleave_pairs(v_reinterpret_as_s32(vec))); } | ||
inline v_float32x4 v_interleave_pairs(const v_float32x4& vec) { return v_reinterpret_as_f32(v_interleave_pa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo
#define SLDI_B2_0(RTYPE, in0, in1, out0, out1, slide_val) \ | ||
{ \ | ||
out0 = (RTYPE) __msa_sldi_b((v16i8) __msa_fill_b(0), (v16i8) in0, slide_val); \ | ||
out1 = (RTYPE) __msa_sldi_b((v16i8) __msa_fill_b(0), (v16i8) in1, slide_val); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use source from libpng 1.6.37
Add separate patch file if required (to avoid fixes lost after 3rdparty upgrade in the future)
vs11 = v_fma(w1, r1, vs11); | ||
vs12 = v_fma(w1, r2, vs12); | ||
vs13 = v_fma(w1, r3, vs13); | ||
#else | ||
vs00 += w0*r0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps this can be replaced to v_fma()
unconditionally for all platforms.
modules/imgproc/src/demosaicing.cpp
Outdated
{ | ||
v8u16 r0 = msa_ld1q_u16((const ushort*)bayer); | ||
v8u16 r1 = msa_ld1q_u16((const ushort*)(bayer + bayer_step)); | ||
v8u16 r2 = msa_ld1q_u16((const ushort*)(bayer + bayer_step * 2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Main principle of "universal intrinsics" (SIMD) approach is "code once, run anywhere".
You need to write code once (on some platform), debug it with some backend (there is even C++ backend for SIMD for reference) and then it should run on other backends with reasonable speed.
So we are strictly against huge code blocks using native intrinsic if this code can be replaced by universal SIMD code without significant lost of speed.
Universal SIMD code is regularly tested via x86/ARM backends, native RAW MIPS has very low testing coverage.
See discussion here: #9763 (comment)
Could you please move these changes into a separate PR?
Lets merge first addition of SIMD backend and related build scripts (so we can add cross-platforms builds for that into our infrastructure (but no tests)).
Hi Alexander
Thanks a lot for your quick response and detailed comments.
I will adjust and optimize the source code as you have commented, then submit separated PRs.
Best Regards,
Fei Wu
|
@alalek Since I need to separate the PR into two, should I close this pull request first?
|
Lets keep basic part of MSA integration in this PR (at least to track conversation above). You can "force push" commits in your branch to update PR ("mipsopen-fwu:msa-dev"). |
@alalek |
Yes. You can resolve conflicts during rebase. Before that you may enable diff3 conflict style for more information about changes: https://stackoverflow.com/questions/27417656/should-diff3-be-default-conflictstyle-on-git To abort rebase process use |
I rebased PR onto 3.4 branch - fetch changes from your fork. Old saved commits are here: https://github.com/alalek/opencv/commits/pr15422_0903 |
@alalek Thanks for rebasing! So what I need to do for the next step is to separate SIMD backend & build scripts from my previous commits, then commit and 'force push' to my branch (msa-dev) to update PR, right? |
…build scripts for MIPS platforms are added. Signed-off-by: Fei Wu <fwu@wavecomp.com>
31292d0
to
7cd5726
Compare
@alalek I have separated MSA backend and build scripts from the old commit in this PR which has been rebased to 3.4 branch. |
I have added builder with cross-compilation from Ubuntu 18:04 x64:
Update:
|
@alalek Can you build the project for mips platform successfully now? |
Signed-off-by: Fei Wu <fwu@wavecomp.com>
@alalek
In file included from /tmp/iJM73d75Px/dump1.h:55:0:
--These warnings are introduced by the original filter_msa_intrinsics.c file from libpng-1.6.37. We didn't make any changes on it. If needed, I can make a fix. cc1plus: warning: 'void* __builtin_memcpy(void*, const void*, long unsigned int)' forming offset [25, 32] is out of the bounds [0, 24] of object 'scharr' with type 'const int [6]' [-Warray-bounds] --This warning is not introduced by my commit. |
Thank you for recommended build instructions! Currently on CI we prefer using cross-compilation mode through builtin Ubuntu compiler kits. We believe it is enough for basic validation builds (no tests). It is easy enough to integrate and upgrade dependencies from our side.
|
…g warnings for libpng. Signed-off-by: Fei Wu <fwu@wavecomp.com>
@alalek I checked the failed build log for 'Android pack' : https://pullrequest.opencv.org/buildbot/builders/precommit_pack_android/builds/11090/steps/build%20sdk/logs/stdio This error Error ‘ opcode not supported on this processor: mips32 (mips32) `pause' ’ is not introduced by my commit. I think maybe this error can be fixed by adding __mips_isa_rev check in modules/core/src/parallel_impl.cpp: |
… is less than 2. Signed-off-by: Fei Wu <fwu@wavecomp.com>
@alalek I have committed the fix for compiling error in 'Android pack' build |
Hi, Alexander
The PR #15422 is kept in ‘waiting for builds’ state currently. How can I make it go ahead?
I’ll submit another PR for the other parts of MSA optimizations after 15422 is merged.
Thanks.
Best Regards,
Fei Wu
发件人: Alexander Alekhin <notifications@github.com>
发送时间: 2019年9月6日 12:59
收件人: opencv/opencv <opencv@noreply.github.com>
抄送: Fei Wu <fwu@wavecomp.com>; Author <author@noreply.github.com>
主题: [EXTERNAL]Re: [opencv/opencv] Added MSA implementations for mips platforms. (#15422)
Thank you for recommended build instructions!
I believe it make sense to add them into PR description and/or add comment in r6 toolchain file with SDK link + this PR link.
Currently on CI we prefer using cross-compilation mode through builtin Ubuntu compiler kits. We believe it is enough for basic validation builds (no tests). It is easy enough to integrate and upgrade dependencies from our side.
1. It is Linux x86-64 build. No -mmsa option is expected here.
Each header is included directly into ABI checker. Need to add guard in hal/msa_macros.h:
#ifdef __mips_msa
...
#endif
1. In 3rdparty/libpng/CMakeLists.txt near add_definitions(-DPNG_MIPS_MSA_OPT=2) add this line:
ocv_warnings_disable(CMAKE_C_FLAGS -Wshadow)
1. OK, I will try to check this with GCC-8 on other archs and find out how to workaround that.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<#15422?email_source=notifications&email_token=AM4VXTJFXXMCNA6HTJRRTHDQIHPPRA5CNFSM4ISK27G2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD6BXEDA#issuecomment-528708108>, or mute the thread<https://github.com/notifications/unsubscribe-auth/AM4VXTJEXMOKMS37GT66CGTQIHPPRANCNFSM4ISK27GQ>.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added 'patch' files for '3rdparty' code.
Please take a look on comments below.
ocv_update(CPU_MSA_TEST_FILE "${OpenCV_SOURCE_DIR}/cmake/checks/cpu_msa.cpp") | ||
ocv_update(CPU_KNOWN_OPTIMIZATIONS "MSA") | ||
ocv_update(CPU_MSA_FLAGS_ON "") | ||
ocv_update(CPU_FP16_IMPLIES "MSA") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FP16 is not enabled in current builds.
What did you mean here?
@@ -319,6 +321,8 @@ enum CpuFeatures { | |||
CPU_AVX512_CEL = 261, //!< Cascade Lake with AVX-512F/CD/BW/DQ/VL/IFMA/VBMI/VNNI | |||
CPU_AVX512_ICL = 262, //!< Ice Lake with AVX-512F/CD/BW/DQ/VL/IFMA/VBMI/VNNI/VBMI2/BITALG/VPOPCNTDQ | |||
|
|||
CPU_MSA = 300, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
150
} | ||
|
||
inline v_uint32x4 v_interleave_pairs(const v_uint32x4& vec) { return v_reinterpret_as_u32(v_interleave_pairs(v_reinterpret_as_s32(vec))); } | ||
inline v_float32x4 v_interleave_pairs(const v_float32x4& vec) { return v_reinterpret_as_f32(v_interleave_pa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove static inline bool hasSIMD128()
- it is not used anymore in OpenCV
(wrong code is linked due GitHub bug, but file is correct)
…ptimizations.cmake. 2. Use CV_CPU_COMPILE_MSA instead of __mips_msa for MSA feature check in cv_cpu_dispatch.h. 3. Removed hasSIMD128() in intrin_msa.hpp. 4. Define CPU_MSA as 150. Signed-off-by: Fei Wu <fwu@wavecomp.com>
bb24134
to
f75fcb6
Compare
|
OPENCV_HAL_IMPL_MSA_EXPAND(v_uint32x4, v_uint64x2, uint, u32, s32, v4u32, v4i32) | ||
OPENCV_HAL_IMPL_MSA_EXPAND(v_int32x4, v_int64x2, int, s32, s32, v4i32, v4i32) | ||
|
||
#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this version of implementation is disabled? Is it slow, inaccurate or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The '#if 0' version is slower according to the test results. The reason is that MSA doesn't actually provide data type expanding instructions for vectors. The 'movl' related conversion are done by C.
So it would take more steps to use 'movl' interface than doing converion directly.
modules/core/src/arithm.simd.hpp
Outdated
@@ -393,7 +393,7 @@ static void bin_loop(const T1* src1, size_t step1, const T1* src2, size_t step2, | |||
#if CV_SIMD | |||
typedef bin_loader<OP, T1, Tvec> ldr; | |||
enum {wide_step = Tvec::nlanes}; | |||
#if !CV_NEON && CV_SIMD_WIDTH == 16 | |||
#if !CV_NEON && !CV_MSA && CV_SIMD_WIDTH == 16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it necessary? Do manual loop unrolling affect performance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case data width is less than wide_step * 2, the performance would be better with this change. Otherwise, the performance shouldn't be better. Considering 'width' should be larger in most cases, this change would not be necessary. I will remove it.
modules/core/src/matmul.simd.hpp
Outdated
@@ -2390,6 +2390,45 @@ double dotProd_8u(const uchar* src1, const uchar* src2, int len) | |||
i += blockSize; | |||
} | |||
} | |||
#elif CV_MSA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do performance of this implementation differs essentially from generic universal intrinsics version? I suppose generic UI version should work quite fast due to availability of native intrinsic for v_dotrpod. In case generic version provide similar performance I prefer to avoid platform specific code wherever possible.
BTW This code is anyway disabled since CV_SIMD is available and is checked prior to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove CV_MSA related code block dotProd_8u(), it's not necessary.
@terfendail
|
2. Removed unnecessary CV_MSA related code block in dotProd_8u(). Signed-off-by: Fei Wu <fwu@wavecomp.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me 👍
@@ -2476,6 +2476,45 @@ double dotProd_8s(const schar* src1, const schar* src2, int len) | |||
i += blockSize; | |||
} | |||
} | |||
#elif CV_MSA |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#if CV_SIMD
blocks this code execution here too (and compilation). The same note is about #elif CV_NEON
code block (dead code too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mipsopen-fwu Have you compared performance of this MSA specific implementation against generic UI implementation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@terfendail
This CV_MSA related code block was added by mistake when I merged msa implementations from 4.0.1 branch to 4.1.1 and 3.4 branch. On our pervious 4.0.1 development branch, this MSA specific implementation doesn't exist. It's correct to remove them from the pull request.
Even though, in order to make clear the performance difference, I did MatType_Length_dot test on mips32r5 HW by setting testparam=(8UC1, 1024) and samples=15. From the result, we can see that the performance of generic UI implementation(using CV_SIMD) is better against CV_MSA specific implementation which was added by mistake in this code block.
The following is the test result:
- The total running time for CV_MSA implementation is:
[----------] 18 tests from MatType_Length_dot (2177 ms total) - The total running time for CV_SIMD implementations is:
[----------] 18 tests from MatType_Length_dot (618 ms total)
elseif(MIPS) | ||
ocv_update(CPU_MSA_TEST_FILE "${OpenCV_SOURCE_DIR}/cmake/checks/cpu_msa.cpp") | ||
ocv_update(CPU_KNOWN_OPTIMIZATIONS "MSA") | ||
ocv_update(CPU_MSA_FLAGS_ON "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it empty? It should contain "-mmsa" or something similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option seems not be used elsewhere, but it should be better to be defined as '-mmsa' for future use. I will modify it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This option is used, but it not obvious:
- there is a list of
CPU_KNOWN_OPTIMIZATIONS
which containMSA
- cmake first tries to compile file
CPU_${OPT}_TEST_FILE
with optionCPU_${OPT}_FLAGS_ON
, whereOPT=MSA
- if compilation succeeded and
MSA
is requested by user (viaCPU_BASELINE
), then these options will be added to compiler
Is the -mmsa
option the only one required by implemented intrinsics? What about -fhard-float
or others? We should move these options from toolchain file to CompilerOptimization.cmake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. -mmsa option is enough. When '-mmsa' is included, '-mfp64' and '-mhard-float' will be enabled by default. '-mmsa' has the same effect as '-mmsa -mfp64 -mhard-float'.
inline bool v_check_any(const v_float32x4& a) | ||
{ return v_check_any(v_reinterpret_as_u32(a)); } | ||
|
||
#if CV_SIMD128_64F |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v_check_all(const v_int64x2&)
shouldn't be guarded by CV_SIMD128_64F as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will remove CV_SIMD128_64F guarding form intrin_msa.hpp, only the define is kept.
2. Removed CV_SIMD128_64F guardings in intrin_msa.hpp. Signed-off-by: Fei Wu <fwu@wavecomp.com>
Signed-off-by: Fei Wu <fwu@wavecomp.com>
Signed-off-by: Fei Wu fwu@wavecomp.com
Materials: