Instead of calling a separately compiled implementation in the import library, reimplement this specific case in headers in a way that the compiler is able to vectorize.
Do the memcpy swap in portions of a nice length, followed by a tail, so that the compiler can use registers as intermediate storage, and vector registers when possible. Because the length is known at compile time, this unrolls perfectly.
I didn't investigate much which length is nice; I think 64 is a good choice:
Not too much stack is consumed in debug mode, where there will be an actual stack allocation.
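A minimal sketch of the chunked memcpy swap described above, to make the idea concrete. This is not the code from this PR: the names `swap_bytes_chunked` and `swap_trivial_arrays` are hypothetical, and the sketch assumes trivially copyable elements and non-overlapping arrays.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not the PR's implementation): swaps two non-overlapping
// byte ranges whose size is known at compile time, 64 bytes at a time.
template <std::size_t Size>
void swap_bytes_chunked(void* left, void* right) noexcept {
    constexpr std::size_t chunk = 64; // the "nice length" suggested above
    auto* l = static_cast<unsigned char*>(left);
    auto* r = static_cast<unsigned char*>(right);

    std::size_t offset = 0;
    // Full chunks: every memcpy has a constant size, so an optimizing compiler
    // may keep the temporary in (vector) registers instead of on the stack.
    for (; offset + chunk <= Size; offset += chunk) {
        unsigned char tmp[chunk];
        std::memcpy(tmp, l + offset, chunk);
        std::memcpy(l + offset, r + offset, chunk);
        std::memcpy(r + offset, tmp, chunk);
    }

    // Tail: fewer than `chunk` bytes remain; the size is still a constant.
    constexpr std::size_t tail = Size % chunk;
    if constexpr (tail != 0) {
        unsigned char tmp[tail];
        std::memcpy(tmp, l + offset, tail);
        std::memcpy(l + offset, r + offset, tail);
        std::memcpy(r + offset, tmp, tail);
    }
}

// Hypothetical entry point for arrays of trivially copyable elements.
template <class T, std::size_t N>
void swap_trivial_arrays(T (&a)[N], T (&b)[N]) noexcept {
    static_assert(std::is_trivially_copyable_v<T>, "byte-wise swap needs trivially copyable T");
    swap_bytes_chunked<sizeof(T) * N>(a, b);
}
```

In a debug build the 64-byte temporary is a real stack array, which is why a modest chunk size keeps stack consumption small.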
A big difference in the "After" column between std_swap<9000, uint8_t> and std_swap_ranges<uint8_t>/9000 made me curious. It is big for me too.
Looks like the main reason is over-alignment of the vector's storage. Adding alignas(64) to the stack array makes a significant improvement there.
I don't think the benchmark should be modified to have that though.
If we know that alignment has a significant effect on the results, shouldn't we control it instead of letting it vary and skew the benchmark? Note that "control" doesn't mean "pick an unrealistic value that skews the results" but either pick a value that is representative or average over a set of values that are representative.
> but either pick a value that is representative or average over a set of values that are representative
The x64 stack is 16-byte aligned on function entry, but when there are multiple variables on the stack, the location of each is constrained by its own alignment and by the alignment of the variable next to it.
malloc also has 16-byte alignment.
So 8 is maybe a good skew, to imitate being next to a pointer.
Or 16, to imitate the top or only variable on the stack, or a malloc allocation.
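One way to control the skew rather than leaving it to chance could look like the sketch below. The `skewed_buffer` type is hypothetical and is not part of the benchmark in this PR; it assumes a 64-byte aligned base with the data starting a fixed number of bytes past it.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical benchmark helper: the storage starts on a 64-byte boundary and
// the usable data begins exactly `Skew` bytes after it, so the address of
// data() modulo 64 is always `Skew`.
template <std::size_t Size, std::size_t Skew>
struct skewed_buffer {
    static_assert(Size > 0, "buffer must not be empty");
    static_assert(Skew < 64, "skew is an offset within one 64-byte block");

    alignas(64) unsigned char storage[Skew + Size];

    unsigned char* data() noexcept {
        return storage + Skew;
    }
};
```

With this, `skewed_buffer<9000, 8>` would imitate an array placed next to a pointer, and `skewed_buffer<9000, 16>` an array that is the only stack variable or a malloc allocation.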
Resolves #2683
The Approach
Instead of calling a separately compiled implementation in the import library, reimplement this specific case in headers in a way that the compiler is able to vectorize.
Do the memcpy swap in portions of a nice length, followed by a tail, so that the compiler can use registers as intermediate storage, and vector registers when possible. Because the length is known at compile time, this unrolls perfectly.
I didn't investigate much which length is nice; I think 64 is a good choice, as it does not consume too much stack in debug mode, where there will be an actual stack allocation.
Self swap check
[utility.swap]/7 says:
[alg.swap]/2 says:
Therefore it is UB to swap overlapping arrays.
I originally added an assert for that and removed the check. @frederick-vs-ja has since reported LWG-4165, and the check is preserved.
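For illustration, a preserved self swap check could be sketched as follows. The function name `swap_arrays_checked` is hypothetical and the element-wise loop stands in for the real implementation. The check matters for a memcpy-based swap because memcpy requires its source and destination not to overlap.

```cpp
#include <cstddef>
#include <utility>

// Hypothetical sketch: swapping an array with itself returns early, so the
// swapping code below never sees identical source and destination.
template <class T, std::size_t N>
void swap_arrays_checked(T (&left)[N], T (&right)[N]) {
    if (&left == &right) {
        return; // self swap: nothing to do
    }

    for (std::size_t i = 0; i != N; ++i) {
        using std::swap; // allow ADL to find a user-provided swap
        swap(left[i], right[i]);
    }
}
```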
Benchmark results
Results interpretation