You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Another vectorziation, apparently one of the last relatively low hanging fruits 🍎.
Uses offset load like in remove, unique or adjacent_find, and comparisons from minmax family.
⚖️ Less and greater
Sorting in both ascending and descending order makes equal sense, although the standard picks less as the default predicate. We support both less and greater here. This is done for the first time. It wasn't done for minmax family in the past, because for minmax it is very squirrelly to reverse the predicate 🐿️.
To reduce the number of functions, less and greater distinguished by a bool parameter, which turns to a couple of offsets for the loads, making difference which of the loads are actually with a negative offset.
The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.
Fun observation: one could attempt to pass less_equal or greater_equal too, and expect the algorithm to be checking for strictly asceding/desceding values. But it is UB according to the sorted algorithms requirements, and there's _DEBUG_LT_PRED to alert about that 😈. The vectorization will not affect this case anyhow, as it looks for the specific predicates.
🐞 Floating bugs
We support floating point types here too.
I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.
One caveat I see is signaling values🚨. As this is early return algorithm, for some data it may be expected not to signal when handling elements one-by-one, and it would start signaling when working with vectors. But we already do not enable floating vector algorithms for /fp:except, so it should be fine:
#else // ^^^ use vector algorithms and fast math / not use vector algorithms or not use fast math vvv
#define _USE_STD_VECTOR_FLOATING_ALGORITHMS 0
#endif // ^^^ not use vector algorithms or not use fast math ^^^
If anything goes wrong, the escape hatch is still there to help.
➕ Unsigned values
Like in minmax_element, we need the *_gt intrinsics that only exist for signed values, so we have to do sign correction.
Unlike in minmax_element, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.
Can be changed if the benchmark results do not show the signed/unsigned difference persuasive enough. Like we have ~33.5 for int8_t and ~39.5 for iint8_t, and in runtime correction applying I expect ~39.5 for both, but with less machine code generated, and with simpler dispatch in the header.
Interesting thing about floating types in your "Before" results. In mine, they were consistently slower, but I suspected "code alignment gremlins", rather than inherently slower instructions. The instructions have a bit longer encoding, making stepping into that a bit easier. Your result seem to confirm that, having one of floating results consistent with integer results.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
🗺️ Overview
Another vectorziation, apparently one of the last relatively low hanging fruits 🍎.
Uses offset load like in
remove
,unique
oradjacent_find
, and comparisons fromminmax
family.⚖️ Less and greater
Sorting in both ascending and descending order makes equal sense, although the standard picks
less
as the default predicate. We support bothless
andgreater
here. This is done for the first time. It wasn't done forminmax
family in the past, because forminmax
it is very squirrelly to reverse the predicate 🐿️.To reduce the number of functions,
less
andgreater
distinguished by abool
parameter, which turns to a couple of offsets for the loads, making difference which of the loads are actually with a negative offset.The runtime cost of applying extra offset on each iteration is nonzero, but still I expect it to be small.
Fun observation: one could attempt to pass
less_equal
orgreater_equal
too, and expect the algorithm to be checking for strictly asceding/desceding values. But it is UB according to the sorted algorithms requirements, and there's_DEBUG_LT_PRED
to alert about that 😈. The vectorization will not affect this case anyhow, as it looks for the specific predicates.🐞 Floating bugs
We support floating point types here too.
I don't expect anything bad this time. This algorithm is position-based, and we only use the intrinsics which do correct math.
One caveat I see is signaling values🚨. As this is early return algorithm, for some data it may be expected not to signal when handling elements one-by-one, and it would start signaling when working with vectors. But we already do not enable floating vector algorithms for
/fp:except
, so it should be fine:STL/stl/inc/xutility
Lines 55 to 59 in 5762e6b
If anything goes wrong, the escape hatch is still there to help.
➕ Unsigned values
Like in
minmax_element
, we need the*_gt
intrinsics that only exist for signed values, so we have to do sign correction.Unlike in
minmax_element
, we apply the correction at compile-time, having different signed and unsigned functions. The whole algorithm is faster, so the correction overhead is more noticeable.Can be changed if the benchmark results do not show the signed/unsigned difference persuasive enough. Like we have ~33.5 for
int8_t
and ~39.5 foriint8_t
, and in runtime correction applying I expect ~39.5 for both, but with less machine code generated, and with simpler dispatch in the header.⏱️ Benchmark results