`strstr` is included in the benchmark as a reference; it is not affected by the optimization.
It may be impossible to reach `strstr` performance, as it uses `pcmpistri` (and reads beyond the last element, since `pcmpistri` is not very useful otherwise). We can try `pcmpestri` for the 8-bit and 16-bit cases, but it still may not be as efficient as `strstr`. I'd prefer to try this additional optimization in a follow-up PR, though.
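For a sense of what such a `pcmpestri`-based path could look like, here is a minimal sketch (`candidate_in_block` is a hypothetical helper, not code from this PR); unlike `pcmpistri`, `pcmpestri` takes explicit lengths instead of relying on NUL terminators, though the 16-byte loads themselves still touch memory past those lengths:

```cpp
#include <nmmintrin.h> // SSE4.2: _mm_cmpestri

// Returns the offset within one 16-byte haystack block where the needle
// (at most 16 bytes) may begin, or 16 if there is no candidate.
// Note: the unaligned loads read a full 16 bytes regardless of the given
// lengths, so a real implementation must keep those reads in valid memory.
int candidate_in_block(const char* block, int block_len,
                       const char* needle, int needle_len) {
    const __m128i hay = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    const __m128i nee = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle));
    return _mm_cmpestri(nee, needle_len, hay, block_len,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT);
}
```

A real loop would advance block by block and verify partial matches at block boundaries, which is exactly where the read-past-the-end concern comes in.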
The difference in the "before" results between `ranges::search` and `search(..., default_searcher)` is due to their different implementations: `search(..., default_searcher)` doesn't try to use `memcmp` on every iteration, it compares elements directly, and apparently this is faster.
It is curious that `search(..., default_searcher)` is faster for the 64-bit type even before vectorization.
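For concreteness, here is a hedged sketch of the two per-candidate verification strategies (`match_memcmp` and `match_direct` are hypothetical names, not the library's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// What the memcmp path effectively does at each candidate position:
bool match_memcmp(const std::uint64_t* hay, const std::uint64_t* nee, std::size_t n) {
    return std::memcmp(hay, nee, n * sizeof(std::uint64_t)) == 0;
}

// What default_searcher effectively does: a direct element-wise loop
// that can fail on the first mismatching element with no call overhead.
bool match_direct(const std::uint64_t* hay, const std::uint64_t* nee, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i) {
        if (hay[i] != nee[i]) {
            return false;
        }
    }
    return true;
}
```

For wide elements the direct loop lets the compiler compare one element and exit early, while the `memcmp` route pays a library-call overhead on every candidate position.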
I'd like someone else to confirm these results, and a maintainer decision on what to do with this.
A possible way to handle this is to remove the 32-bit and 64-bit optimization/vectorization attempts entirely, so that the `ranges::search` case (along with the usual `std::search`) behaves the same as `search(..., default_searcher)`.
And if we are to keep vectorization only for 8-bit and 16-bit elements, we might drop the current implementation and not review/commit it in the first place, if SSE4.2 `pcmpestri` looks like it would be faster.
I pushed changes to address the issues that I found, but I still need to think about whether we should be vectorizing this at all. I'm leaning towards ripping out the existing memcmp. Stay tuned...
…pred`.
`_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`.
`_Equal_rev_pred` is called by ranges `search`/`find_end`.
This doesn't affect `equal` etc.
Ok, after looking at the benchmarks, I've taken the radical step of reverting both the vectorization changes and the existing memcmp "optimizations" that were massively pessimizing your reasonable benchmark example. My measurements:
| Benchmark | main | Vector | Plain |
| --- | --- | --- | --- |
| `c_strstr` | 144 ns | 151 ns | 145 ns |
| `classic_search<std::uint8_t>` | 8935 ns | 1726 ns | 1754 ns |
| `classic_search<std::uint16_t>` | 9732 ns | 3017 ns | 1739 ns |
| `classic_search<std::uint32_t>` | 9029 ns | 4829 ns | 1912 ns |
| `classic_search<std::uint64_t>` | 8970 ns | 8471 ns | 2527 ns |
| `ranges_search<std::uint8_t>` | 8916 ns | 1723 ns | 1784 ns |
| `ranges_search<std::uint16_t>` | 8932 ns | 3012 ns | 1785 ns |
| `ranges_search<std::uint32_t>` | 8303 ns | 4840 ns | 2460 ns |
| `ranges_search<std::uint64_t>` | 9061 ns | 8577 ns | 1860 ns |
| `search_default_searcher<std::uint8_t>` | 1858 ns | 1720 ns | 1883 ns |
| `search_default_searcher<std::uint16_t>` | 2525 ns | 3016 ns | 1861 ns |
| `search_default_searcher<std::uint32_t>` | 1863 ns | 4812 ns | 1869 ns |
| `search_default_searcher<std::uint64_t>` | 1873 ns | 8440 ns | 2592 ns |
This compares main (with only the benchmark added), this vectorization PR before my revert, and finally "Plain", which lacks both vectorization and `memcmp`.
There's some noise here (e.g. `search_default_searcher` is unchanged between main and Plain, but there were random 2500 ns spikes), but the pattern is quite clear: `memcmp` is almost a 5x penalty for all element sizes, and vectorization helps smaller elements, but plain code delivers great performance across the board.
Note that dropping the `memcmp` paths from `_Equal_rev_pred_unchecked` and `_Equal_rev_pred` has very limited impact. `_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`, and `_Equal_rev_pred` is called by ranges `search`/`find_end`. (This makes sense, since `find_end` is the "opposite" of `search`.)
The `equal` etc. algorithms aren't affected by this. We should probably evaluate their use of `memcmp`, but I expect the behavior to be quite different (and I hope that `memcmp` is beneficial). It's `search`/`find_end` that are unusual because they want to repeatedly compare a needle, which has very different performance characteristics.
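As a hedged illustration of why (`naive_search` is a hypothetical name, and this is the textbook formulation rather than the STL's actual code): each candidate position restarts the needle comparison, and on typical data that comparison fails at the very first element, so a `memcmp` call per candidate pays its fixed overhead thousands of times while comparing almost nothing:

```cpp
// Textbook search: scan candidate positions, verify the needle at each one.
template <class FwdIt1, class FwdIt2>
FwdIt1 naive_search(FwdIt1 first, FwdIt1 last, FwdIt2 n_first, FwdIt2 n_last) {
    for (;; ++first) {
        FwdIt1 candidate = first;
        for (FwdIt2 n_it = n_first;; ++n_it, ++candidate) {
            if (n_it == n_last) {
                return first; // the whole needle matched here
            }
            if (candidate == last) {
                return last; // haystack exhausted, no match
            }
            if (!(*candidate == *n_it)) {
                break; // typical case: the very first element mismatches
            }
        }
    }
}
```

A single `equal` over two whole ranges, by contrast, makes one comparison pass whose `memcmp` overhead is amortized over the entire length.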
In addition to keeping the benchmark from this vectorization attempt, I've also kept the correctness test (even though search is no longer vectorized), since it's quick to run, and we might figure out how to vectorize this beneficially in the future.
> The `equal` etc. algorithms aren't affected by this. We should probably evaluate their use of `memcmp`, but I expect the behavior to be quite different (and I hope that `memcmp` is beneficial)
Agreed. `mismatch`/`lexicographical_compare` did benefit from a similar optimization.
Resolves #2453
These benchmark results are no longer relevant, as the PR's intention has changed.
Before:
After: