You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The following things prevented the original algorithm from vectorization:
Loop-carried dependency, the previous input is used as one of operands.
This seems expected that the compiler doesn't transform such code to eliminate this automagically, too much of a transformation.
This was addressed by transforming the code to read the input array twice per iteration instead of carrying the values through the loop.
Odd iterator pattern where the compiler cannot understand the iteration.
This seemed to me a strange limitation, so it was reported as DevCom-10742868.
This was addressed by using integer index.
๐ Correctness concern
The standard defines exact steps for this algorithm. The optimization alters the steps.
In particular the standard wants the subtracted value to be saved from the previous iteration, rather than being read again.
The two below sections explain what precautions are made to make the change unobservable, so I hope the change is correct.
โ Checks for eligibility
The following checks were added:
No Aliasing (see below)
Iterators can be pointers
Source iterator is not volatile (read order is altered)
Trivially copyable (we skip copying where the standard asks for it)
There's no need in check for integral types or so, since the compiler makes the final decision anyway, and it may be able optimize even something that wouldn't pass a strict check.
โ ๏ธ No Aliasing
Apparently there's no rule that the source and the destination ranges may not overlap.
We should handle aliasing.
Unlike the #4431 precedent, we can't yield to the compiler here. The compiler is able to insert overlaps check that prevents vectorization and go to the scalar fallback in case of checks failure, but:
We apply transformation that would change the meaning of the program in case of overlapping range, and the meaning would be changed no matter if vectorization happens
The checks that compiler inserts may be too loose, it may allow like equal source and destination pointer, as these are thc checks if the transformed algorithm would not change the meaning
So we do our own checks.
Then we tell the compiler with __restrict that we already checked, and it should not bother. This is done in a separate function, because the __restrict is not aliased within scope, so saying __restrict within the original algorithm would apparently be a lie.
The extra check by the compiler, if not prevented would slightly add run time and dead code size.
๐พ Compiler warnings
We have a great feature called integral promotion. Smaller types are converted to integers, and there is a warning about converting them back. Local pragma suppresses them in benchmark, but not in the test.
@StephanTLavavej used a function object with static_cast to avoid warnings in the test.
โฑ๏ธ Benchmark results
Benchmark
main
this
this + AVX2
bm<uint8_t>/2255
745 ns
563 ns
562 ns
bm<uint16_t>/2255
799 ns
83.3 ns
75.1 ns
bm<uint32_t>/2255
731 ns
154 ns
141 ns
bm<uint64_t>/2255
805 ns
293 ns
272 ns
bm/2255
751 ns
154 ns
123 ns
bm/2255
753 ns
304 ns
233 ns
๐ฅ Results interpretation
Overall, we're good ๐ธ
8-bit case failed to vectorize for no reason, reported DevCom-10745948
Still 8-bit case is noticeably better. I didn't analyze that, but looks like this a consistent thing, not codegen gremlins. I think it is a side effect of eliminating loop-carried dependency, so the processor can parallelize and overlap iterations
AVX2 is only slightly faster. I did not analyze, but think that memory wall is being hit here ๐งฑ
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
๐ The approach
The following things prevented the original algorithm from vectorization:
๐ Correctness concern
The standard defines exact steps for this algorithm. The optimization alters the steps.
In particular the standard wants the subtracted value to be saved from the previous iteration, rather than being read again.
The two below sections explain what precautions are made to make the change unobservable, so I hope the change is correct.
โ Checks for eligibility
The following checks were added:
There's no need in check for integral types or so, since the compiler makes the final decision anyway, and it may be able optimize even something that wouldn't pass a strict check.
Apparently there's no rule that the source and the destination ranges may not overlap.
We should handle aliasing.
Unlike the #4431 precedent, we can't yield to the compiler here. The compiler is able to insert overlaps check that prevents vectorization and go to the scalar fallback in case of checks failure, but:
So we do our own checks.
Then we tell the compiler with
__restrict
that we already checked, and it should not bother. This is done in a separate function, because the__restrict
is not aliased within scope, so saying__restrict
within the original algorithm would apparently be a lie.The extra check by the compiler, if not prevented would slightly add run time and dead code size.
๐พ Compiler warnings
We have a great feature called integral promotion. Smaller types are converted to integers, and there is a warning about converting them back. Local pragma suppresses them in benchmark, but not in the test.
@StephanTLavavej used a function object with
static_cast
to avoid warnings in the test.โฑ๏ธ Benchmark results
๐ฅ Results interpretation