You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
AVX2 based search for all four element widths. Modeled on SSE4.2 8-bit and 16-bit approach.
Without pcmpestr* instruction, as it is not available for AVX vector width, traditional find-like approach is used instead.
Notable differences:
pcmpestr* compares all fit characters, pcmpeq* compares only the first, so need to confirm the rest of characters even if the needle fits vector
as a result, for search (forward search), the variable step is no longer better than the fixed step. See below
pcmpeq* returns byte bitmask instead of element bitmask, so the mask is masked to make it contain only initial byte positions, also, some element size multiplications are not needed
for 32-bit and 64-bit elements, AVX2 mask can be used to load partial value, instead of temporary buffer
^= (1 << _Pos) is used instead of _bittestandreset. Apparently, this is faster, and should have been used for SSE either
πΆ Fixed vs variable step
For SSE4.2, there are pcmpestri instruction that returns first or last match position, and pcmpestrm instruction that returns bitmask of matches. pcmpestri is faster, also pcmpestri is not affected by DevCom-10689455, unlike pcmpestrm.
Both of the instructions check all characters they can, not just the initial character of a potential match. With smaller needle they can end up making a full match, if the match start is closer to the beginning of the haystack part vector.
For search (the forward one), this makes it appealing to use only pcmpestri for matching, and if the match does not fit, make such step that pcmpestri would have matched, so it would be full match next time. If match is not confirmed, we have to do such step that we skip the unconfirmed match, but attempt to match immediately past it. This results in variable step, but generally faster progress if there are not too many apparent matches.
For find_end (the backward one) this does not work well, as the matches we need to prioritize are closer to the end, which would result in more matches we need to confirm with additional comparison, and still smaller step, so pcmpestri is clearly worse than pcmpestrm with fixed step.
For AVX2, we only match the first character, so there are no cases, where variable step is better.
π Perf bug fix in SSE4.2 path
There was a confusion of elements/bytes in the SSE4.2 path of find_end remaining part check. The intention was to mask out the positions that were already checked. It worked correctly for 8-bit elements, but masked out too few bits for 16-bit element. Whereas the impact is very small, I'm fixing this here in a separate commit, to avoid inconsistencies and make the comparison of these branches easier
I'd be curious about benchmark results for this new search/find_end vectorization compared to main on other CPUs. AMDs is of most interest, but Intels of significantly higher or significantly lower grade are also of certain interest. Mine is 12th Gen Intel(R) Core(TM) i5-1235U.
Only one of algorithm type is of interest, so but for all element sizes, also strstr is still an insteresting baseline, so running out\bench\benchmark-search.exe --benchmark_filter="strstr|classic_" is a good way to filter in only what is needed.
I've modified the benchmarks before the actual changes, so commit 1fb4bfe is a convenient point for benchmarking the "before" state.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
π Overview
AVX2 based search for all four element widths. Modeled on SSE4.2 8-bit and 16-bit approach.
Without
pcmpestr*
instruction, as it is not available for AVX vector width, traditionalfind
-like approach is used instead.Notable differences:
pcmpestr*
compares all fit characters,pcmpeq*
compares only the first, so need to confirm the rest of characters even if the needle fits vectorsearch
(forward search), the variable step is no longer better than the fixed step. See belowpcmpeq*
returns byte bitmask instead of element bitmask, so the mask is masked to make it contain only initial byte positions, also, some element size multiplications are not needed^= (1 << _Pos)
is used instead of_bittestandreset
. Apparently, this is faster, and should have been used for SSE eitherπΆ Fixed vs variable step
For SSE4.2, there are
pcmpestri
instruction that returns first or last match position, andpcmpestrm
instruction that returns bitmask of matches.pcmpestri
is faster, alsopcmpestri
is not affected by DevCom-10689455, unlikepcmpestrm
.Both of the instructions check all characters they can, not just the initial character of a potential match. With smaller needle they can end up making a full match, if the match start is closer to the beginning of the haystack part vector.
For
search
(the forward one), this makes it appealing to use onlypcmpestri
for matching, and if the match does not fit, make such step thatpcmpestri
would have matched, so it would be full match next time. If match is not confirmed, we have to do such step that we skip the unconfirmed match, but attempt to match immediately past it. This results in variable step, but generally faster progress if there are not too many apparent matches.For
find_end
(the backward one) this does not work well, as the matches we need to prioritize are closer to the end, which would result in more matches we need to confirm with additional comparison, and still smaller step, sopcmpestri
is clearly worse thanpcmpestrm
with fixed step.For AVX2, we only match the first character, so there are no cases, where variable step is better.
π Perf bug fix in SSE4.2 path
There was a confusion of elements/bytes in the SSE4.2 path of
find_end
remaining part check. The intention was to mask out the positions that were already checked. It worked correctly for 8-bit elements, but masked out too few bits for 16-bit element. Whereas the impact is very small, I'm fixing this here in a separate commit, to avoid inconsistencies and make the comparison of these branches easierβ±οΈ Benchmark results
STL/benchmarks/src/search.cpp
Lines 42 to 50 in 396eaf8
classic_search<std::uint8_t>/0
classic_search<std::uint8_t>/1
classic_search<std::uint8_t>/2
classic_search<std::uint8_t>/3
classic_search<std::uint8_t>/4
classic_search<std::uint8_t>/5
classic_search<std::uint16_t>/0
classic_search<std::uint16_t>/1
classic_search<std::uint16_t>/2
classic_search<std::uint16_t>/3
classic_search<std::uint16_t>/4
classic_search<std::uint16_t>/5
classic_search<std::uint32_t>/0
classic_search<std::uint32_t>/1
classic_search<std::uint32_t>/2
classic_search<std::uint32_t>/3
classic_search<std::uint32_t>/4
classic_search<std::uint32_t>/5
classic_search<std::uint64_t>/0
classic_search<std::uint64_t>/1
classic_search<std::uint64_t>/2
classic_search<std::uint64_t>/3
classic_search<std::uint64_t>/4
classic_search<std::uint64_t>/5
classic_find_end<std::uint8_t>/0
classic_find_end<std::uint8_t>/1
classic_find_end<std::uint8_t>/2
classic_find_end<std::uint8_t>/3
classic_find_end<std::uint8_t>/4
classic_find_end<std::uint8_t>/5
classic_find_end<std::uint16_t>/0
classic_find_end<std::uint16_t>/1
classic_find_end<std::uint16_t>/2
classic_find_end<std::uint16_t>/3
classic_find_end<std::uint16_t>/4
classic_find_end<std::uint16_t>/5
classic_find_end<std::uint32_t>/0
classic_find_end<std::uint32_t>/1
classic_find_end<std::uint32_t>/2
classic_find_end<std::uint32_t>/3
classic_find_end<std::uint32_t>/4
classic_find_end<std::uint32_t>/5
classic_find_end<std::uint64_t>/0
classic_find_end<std::uint64_t>/1
classic_find_end<std::uint64_t>/2
classic_find_end<std::uint64_t>/3
classic_find_end<std::uint64_t>/4
classic_find_end<std::uint64_t>/5
Other results include unreachable perf of `strstr` and the same cases reached via other codepaths
c_strstr/0
c_strstr/1
c_strstr/2
c_strstr/3
c_strstr/4
c_strstr/5
ranges_search<std::uint8_t>/0
ranges_search<std::uint8_t>/1
ranges_search<std::uint8_t>/2
ranges_search<std::uint8_t>/3
ranges_search<std::uint8_t>/4
ranges_search<std::uint8_t>/5
ranges_search<std::uint16_t>/0
ranges_search<std::uint16_t>/1
ranges_search<std::uint16_t>/2
ranges_search<std::uint16_t>/3
ranges_search<std::uint16_t>/4
ranges_search<std::uint16_t>/5
ranges_search<std::uint32_t>/0
ranges_search<std::uint32_t>/1
ranges_search<std::uint32_t>/2
ranges_search<std::uint32_t>/3
ranges_search<std::uint32_t>/4
ranges_search<std::uint32_t>/5
ranges_search<std::uint64_t>/0
ranges_search<std::uint64_t>/1
ranges_search<std::uint64_t>/2
ranges_search<std::uint64_t>/3
ranges_search<std::uint64_t>/4
ranges_search<std::uint64_t>/5
search_default_searcher<std::uint8_t>/0
search_default_searcher<std::uint8_t>/1
search_default_searcher<std::uint8_t>/2
search_default_searcher<std::uint8_t>/3
search_default_searcher<std::uint8_t>/4
search_default_searcher<std::uint8_t>/5
search_default_searcher<std::uint16_t>/0
search_default_searcher<std::uint16_t>/1
search_default_searcher<std::uint16_t>/2
search_default_searcher<std::uint16_t>/3
search_default_searcher<std::uint16_t>/4
search_default_searcher<std::uint16_t>/5
search_default_searcher<std::uint32_t>/0
search_default_searcher<std::uint32_t>/1
search_default_searcher<std::uint32_t>/2
search_default_searcher<std::uint32_t>/3
search_default_searcher<std::uint32_t>/4
search_default_searcher<std::uint32_t>/5
search_default_searcher<std::uint64_t>/0
search_default_searcher<std::uint64_t>/1
search_default_searcher<std::uint64_t>/2
search_default_searcher<std::uint64_t>/3
search_default_searcher<std::uint64_t>/4
search_default_searcher<std::uint64_t>/5
member_find<not_highly_aligned_string>/0
member_find<not_highly_aligned_string>/1
member_find<not_highly_aligned_string>/2
member_find<not_highly_aligned_string>/3
member_find<not_highly_aligned_string>/4
member_find<not_highly_aligned_string>/5
member_find<not_highly_aligned_wstring>/0
member_find<not_highly_aligned_wstring>/1
member_find<not_highly_aligned_wstring>/2
member_find<not_highly_aligned_wstring>/3
member_find<not_highly_aligned_wstring>/4
member_find<not_highly_aligned_wstring>/5
ranges_find_end<std::uint8_t>/0
ranges_find_end<std::uint8_t>/1
ranges_find_end<std::uint8_t>/2
ranges_find_end<std::uint8_t>/3
ranges_find_end<std::uint8_t>/4
ranges_find_end<std::uint8_t>/5
ranges_find_end<std::uint16_t>/0
ranges_find_end<std::uint16_t>/1
ranges_find_end<std::uint16_t>/2
ranges_find_end<std::uint16_t>/3
ranges_find_end<std::uint16_t>/4
ranges_find_end<std::uint16_t>/5
ranges_find_end<std::uint32_t>/0
ranges_find_end<std::uint32_t>/1
ranges_find_end<std::uint32_t>/2
ranges_find_end<std::uint32_t>/3
ranges_find_end<std::uint32_t>/4
ranges_find_end<std::uint32_t>/5
ranges_find_end<std::uint64_t>/0
ranges_find_end<std::uint64_t>/1
ranges_find_end<std::uint64_t>/2
ranges_find_end<std::uint64_t>/3
ranges_find_end<std::uint64_t>/4
ranges_find_end<std::uint64_t>/5
member_rfind<not_highly_aligned_string>/0
member_rfind<not_highly_aligned_string>/1
member_rfind<not_highly_aligned_string>/2
member_rfind<not_highly_aligned_string>/3
member_rfind<not_highly_aligned_string>/4
member_rfind<not_highly_aligned_string>/5
member_rfind<not_highly_aligned_wstring>/0
member_rfind<not_highly_aligned_wstring>/1
member_rfind<not_highly_aligned_wstring>/2
member_rfind<not_highly_aligned_wstring>/3
member_rfind<not_highly_aligned_wstring>/4
member_rfind<not_highly_aligned_wstring>/5