⚙️ The optimization

remove_copy and unique_copy differ from their non-_copy counterparts in that they have no room they are allowed to overwrite. This means we can't directly store results from vector registers.
The previous attempt #5062 tried to use masked stores to bypass that limitation. Unfortunately, masked stores don't perform well on some CPUs. Also, the minimum granularity of an AVX2 masked store is 32 bits, so it would not work for smaller elements.
This time, temporary storage comes to the rescue. The algorithms already use some additional memory (the tables), so why not use a bit more. I arbitrarily picked 512 bytes, which should not be too much. Each time the temporary buffer is full, it is copied to the destination with memcpy, which should be fast enough for this buffer size.
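The scheme above can be sketched in scalar code. This is a hypothetical illustration, not the PR's actual implementation (which compacts kept elements from vector registers into the staging buffer); the function name and structure are my own, only the 512-byte buffer size mirrors the PR:

```cpp
#include <cstddef>
#include <cstring>

// Sketch: remove_copy through a small on-stack staging buffer that is
// flushed to the destination in bulk with memcpy.
template <class T>
T* buffered_remove_copy(const T* first, const T* last, T* dest, const T& value) {
    constexpr std::size_t buf_bytes = 512; // arbitrary size, as in the PR
    constexpr std::size_t buf_elems = buf_bytes / sizeof(T);
    T tmp[buf_elems];
    std::size_t n = 0;
    for (; first != last; ++first) {
        if (!(*first == value)) {
            tmp[n++] = *first; // the real code fills tmp from vector registers
            if (n == buf_elems) { // buffer full: bulk-copy to the destination
                std::memcpy(dest, tmp, buf_bytes);
                dest += buf_elems;
                n = 0;
            }
        }
    }
    std::memcpy(dest, tmp, n * sizeof(T)); // flush the partial tail
    return dest + n;
}
```

The point of the staging buffer is that writes to `tmp` never touch memory the caller didn't hand over, so the vector code is free to overshoot inside it; only full, known-good prefixes ever reach `dest`.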
🚫 No find before remove_copy
In #4987, it was explained that doing find before remove is good for both correctness and performance. Originally it was in vectorization code, but during the review @StephanTLavavej observed that it is done in the headers already (#4987 (comment)).
For remove_copy it is not necessary for correctness, and may be harmful for performance. find would additionally require a copy, making a double pass over the input, which can make performance worse for large input in memory-bound situations.
We could add special handling of the range before the first match in the vectorization code, but that is another story; it would not be harmful, but I'm not doing it in the current PR. Maybe later.
So, since we have not called find, we have not checked whether the value can even compare equal to the iterator's value type, which is why we need the _Could_compare_equal_to_value_type check here.
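To illustrate the idea behind that check (the name _Could_compare_equal_to_value_type is from the PR, but this simplified version is my own and covers only integral types): if the value does not survive a round trip through the element type, no element can ever compare equal to it, and the caller can fall back to a plain bulk copy.

```cpp
#include <cstdint>

// Sketch: can any value of Elem compare equal to `val`?
// Note: narrowing conversions to signed types are modular only since C++20
// (implementation-defined before that), so this is a simplified model.
template <class Elem, class Val>
constexpr bool could_compare_equal(Val val) {
    // The round trip changes the value exactly when it is out of Elem's range.
    return static_cast<Val>(static_cast<Elem>(val)) == val;
}
```

With such a check, vectorized remove_copy over, say, uint8_t elements and a value of 300 can skip all comparisons and degenerate into a straight copy, since no byte can ever equal 300.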
✅ Test coverage
Shared with the non-_copy counterparts to save total test run time and some lines of code, at the expense of otherwise unnecessary coupling.
We check both the modified and unmodified parts of the destination, to make sure the unmodified part indeed wasn't modified.
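A minimal illustration of that checking pattern (not the actual test suite code; the sentinel value and helper name are mine): fill the destination with a sentinel, run the algorithm, then verify both the written prefix and that everything past the returned iterator still holds the sentinel.

```cpp
#include <algorithm>
#include <vector>

// Sketch: verify remove_copy writes the expected prefix and leaves the
// rest of the destination untouched.
inline bool destination_suffix_untouched() {
    const std::vector<int> src{1, 2, 3, 2, 4};
    std::vector<int> dst(src.size(), -999); // sentinel fill
    const auto out_end = std::remove_copy(src.begin(), src.end(), dst.begin(), 2);
    const std::vector<int> expected{1, 3, 4};
    const bool prefix_ok =
        std::equal(dst.begin(), out_end, expected.begin(), expected.end());
    // remove_copy must not write past the returned iterator
    const bool suffix_ok =
        std::all_of(out_end, dst.end(), [](int x) { return x == -999; });
    return prefix_ok && suffix_ok;
}
```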
⏱️ Benchmark results
| Benchmark | Before | After | Speedup |
| --- | --- | --- | --- |
| rc&lt;alg_type::std_fn, std::uint8_t&gt; | 908 ns | 349 ns | 2.60 |
| rc&lt;alg_type::std_fn, std::uint16_t&gt; | 1850 ns | 462 ns | 4.00 |
| rc&lt;alg_type::std_fn, std::uint32_t&gt; | 901 ns | 532 ns | 1.69 |
| rc&lt;alg_type::std_fn, std::uint64_t&gt; | 1876 ns | 1018 ns | 1.84 |
| rc&lt;alg_type::rng, std::uint8_t&gt; | 1344 ns | 349 ns | 3.85 |
| rc&lt;alg_type::rng, std::uint16_t&gt; | 2094 ns | 465 ns | 4.50 |
| rc&lt;alg_type::rng, std::uint32_t&gt; | 884 ns | 460 ns | 1.92 |
| rc&lt;alg_type::rng, std::uint64_t&gt; | 1884 ns | 1079 ns | 1.75 |
| uc&lt;alg_type::std_fn, std::uint8_t&gt; | 3329 ns | 263 ns | 12.66 🤡 |
| uc&lt;alg_type::std_fn, std::uint16_t&gt; | 1145 ns | 342 ns | 3.35 |
| uc&lt;alg_type::std_fn, std::uint32_t&gt; | 1144 ns | 388 ns | 2.95 |
| uc&lt;alg_type::std_fn, std::uint64_t&gt; | 1128 ns | 754 ns | 1.50 |
| uc&lt;alg_type::rng, std::uint8_t&gt; | 1111 ns | 252 ns | 4.41 |
| uc&lt;alg_type::rng, std::uint16_t&gt; | 1328 ns | 331 ns | 4.01 |
| uc&lt;alg_type::rng, std::uint32_t&gt; | 1313 ns | 386 ns | 3.40 |
| uc&lt;alg_type::rng, std::uint64_t&gt; | 1146 ns | 758 ns | 1.51 |
🥇 Results interpretation
Good improvement!
Not as good as for the non-_copy counterparts though, as memcpy takes some noticeable time.
The usual codegen gremlins that cause result variation are observed for the non-vectorized tight loops. I've marked the most notorious one with a clown emoji; I can't explain that anomaly.
Thanks! 😻 I pushed fixes for significant bugs in the "can compare equal to value type" codepaths, and expanded the test coverage (auditing all existing codepaths, and verifying that the new tests caught the bugs). Please double-check.