This takes a different approach to both the search and the inner comparison (SSE4.2 instead of AVX2). This time the results are better.
For now, only 1-byte and 2-byte elements are handled. A slightly modified version of the same approach could be used for 4-byte and 8-byte elements, but that needs testing to see whether there would still be a performance gain.
In the benchmark results, 0 is the small needle and 1 is the large needle.
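As a rough illustration of what an SSE4.2-based search can look like, the sketch below uses the `_mm_cmpestri` string-compare intrinsic in "equal ordered" mode to locate candidate positions for a short byte needle. This is not the code from this PR: the function name `find_short_needle`, the 16-byte needle limit, and the `__SSE4_2__` guard (a GCC/Clang macro) are assumptions made for the example, and it falls back to `std::search` when SSE4.2 is not enabled at compile time.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE4_2__)
#include <nmmintrin.h> // SSE4.2 intrinsics (_mm_cmpestri)
#endif

// Hypothetical helper, not the PR's implementation.
// Returns the offset of the first match, or hay_len when there is none.
std::size_t find_short_needle(const std::uint8_t* hay, std::size_t hay_len,
                              const std::uint8_t* needle, std::size_t needle_len) {
    if (needle_len == 0) return 0;
    if (needle_len > hay_len) return hay_len;
#if defined(__SSE4_2__)
    if (needle_len <= 16) {
        std::uint8_t nbuf[16] = {};
        std::memcpy(nbuf, needle, needle_len);
        const __m128i n = _mm_loadu_si128(reinterpret_cast<const __m128i*>(nbuf));
        std::size_t pos = 0;
        while (pos + needle_len <= hay_len) {
            const std::size_t avail = std::min<std::size_t>(16, hay_len - pos);
            __m128i h;
            if (avail == 16) {
                h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(hay + pos));
            } else {
                // Copy the tail so the 16-byte load never reads past the haystack.
                std::uint8_t tail[16] = {};
                std::memcpy(tail, hay + pos, avail);
                h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(tail));
            }
            // Index of the first full or partial (cut off by the block end) match,
            // or 16 when this block contains no candidate at all.
            const int idx = _mm_cmpestri(n, static_cast<int>(needle_len),
                                         h, static_cast<int>(avail),
                                         _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED);
            if (idx == 16) {
                pos += 16;
                continue;
            }
            const std::size_t cand = pos + static_cast<std::size_t>(idx);
            if (cand + needle_len > hay_len) return hay_len; // needle no longer fits
            // A partial match must be confirmed against the full needle.
            if (std::memcmp(hay + cand, needle, needle_len) == 0) return cand;
            pos = cand + 1;
        }
        return hay_len;
    }
#endif
    // Portable fallback: longer needles, or SSE4.2 not enabled at compile time.
    const std::uint8_t* it = std::search(hay, hay + hay_len, needle, needle + needle_len);
    return static_cast<std::size_t>(it - hay);
}
```

The tail copy keeps the sketch simple and safe; a tuned implementation would likely avoid it on the hot path and would also need a strategy for needles longer than one vector.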
…pred`.
`_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`.
`_Equal_rev_pred` is called by ranges `search`/`find_end`.
This doesn't affect `equal` etc.
Benchmark results on my 5950X, split into separate tables for 1 and 2 bytes versus 4 and 8 bytes:
| Benchmark | main | PR | Speedup (Old/New) |
|---|---|---|---|
| `c_strstr/0` | 142 ns | 143 ns | 0.99 |
| `c_strstr/1` | 157 ns | 162 ns | 0.97 |
| `classic_search<std::uint8_t>/0` | 1976 ns | 160 ns | 12.35 |
| `classic_search<std::uint8_t>/1` | 2153 ns | 175 ns | 12.30 |
| `classic_search<std::uint16_t>/0` | 1432 ns | 310 ns | 4.62 |
| `classic_search<std::uint16_t>/1` | 1557 ns | 344 ns | 4.53 |
| `ranges_search<std::uint8_t>/0` | 1561 ns | 160 ns | 9.76 |
| `ranges_search<std::uint8_t>/1` | 1689 ns | 176 ns | 9.60 |
| `ranges_search<std::uint16_t>/0` | 1594 ns | 311 ns | 5.13 |
| `ranges_search<std::uint16_t>/1` | 1747 ns | 345 ns | 5.06 |
| `search_default_searcher<std::uint8_t>/0` | 1660 ns | 160 ns | 10.38 |
| `search_default_searcher<std::uint8_t>/1` | 1796 ns | 174 ns | 10.32 |
| `search_default_searcher<std::uint16_t>/0` | 2222 ns | 309 ns | 7.19 |
| `search_default_searcher<std::uint16_t>/1` | 2421 ns | 345 ns | 7.02 |

| Benchmark | main | PR | Speedup (Old/New) |
|---|---|---|---|
| `classic_search<std::uint32_t>/0` | 1970 ns | 1979 ns | 1.00 |
| `classic_search<std::uint32_t>/1` | 2151 ns | 2148 ns | 1.00 |
| `classic_search<std::uint64_t>/0` | 1423 ns | 1387 ns | 1.03 |
| `classic_search<std::uint64_t>/1` | 1566 ns | 1527 ns | 1.03 |
| `ranges_search<std::uint32_t>/0` | 1591 ns | 1611 ns | 0.99 |
| `ranges_search<std::uint32_t>/1` | 1729 ns | 1760 ns | 0.98 |
| `ranges_search<std::uint64_t>/0` | 1605 ns | 1543 ns | 1.04 |
| `ranges_search<std::uint64_t>/1` | 1761 ns | 1691 ns | 1.04 |
| `search_default_searcher<std::uint32_t>/0` | 2234 ns | 1609 ns | 1.39 |
| `search_default_searcher<std::uint32_t>/1` | 2408 ns | 1752 ns | 1.37 |
| `search_default_searcher<std::uint64_t>/0` | 1620 ns | 2193 ns | 0.74 |
| `search_default_searcher<std::uint64_t>/1` | 1761 ns | 2366 ns | 0.74 |
Aside from c_strstr which is of course unchanged, I'm also seeing across-the-board massive improvements for 1 and 2 bytes, so this is great.
I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
> I am mildly confused as to why performance for search_default_searcher seems to vary for 4 bytes (better) and 8 bytes (worse) for this PR, when it shouldn't have been altered at all - the if constexpr should be completely vanishing. Codegen gremlins? I don't think it should block merging though.
I guess the biggest codegen gremlin is exact loop alignment. The compiler only aligns functions to a 16-byte boundary, whereas apparently a 32-byte or 64-byte boundary is what matters. You may try /QIntel-jcc-erratum (yes, even though you run on AMD!) for both main and the changed code: build the whole import lib and the benchmark executable with it, and see if this variation disappears.
I've seen this happen even when changing unrelated functions. That's why it isn't worth hunting for: eventually we will add or change even more unrelated functions, and the alignment will change again.
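For readers unfamiliar with the "`if constexpr` should be completely vanishing" point above, here is a minimal sketch of size-based dispatch. The names `Path`, `chosen_path`, and `search_sketch` are hypothetical and are not the STL's internals; the "vectorized" branch merely stands in for a call to a vectorized routine. The idea is that for 4-byte and 8-byte elements only the scalar branch is instantiated, so in principle nothing about their code should change.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Hypothetical tag so the example can report which branch is compiled in.
enum class Path { vectorized, scalar };

template <class T>
constexpr Path chosen_path() {
    if constexpr (sizeof(T) <= 2) {
        return Path::vectorized; // 1-byte and 2-byte elements
    } else {
        return Path::scalar;     // 4-byte and 8-byte elements
    }
}

template <class T>
const T* search_sketch(const T* first, const T* last,
                       const T* nfirst, const T* nlast) {
    if constexpr (sizeof(T) <= 2) {
        // Stand-in for a vectorized routine; only instantiated for small elements.
        return std::search(first, last, nfirst, nlast);
    } else {
        // The discarded branch above is never instantiated for these types,
        // so this path is the same scalar loop as before.
        return std::search(first, last, nfirst, nlast);
    }
}
```

Since the source-level logic for the larger element types is identical either way, timing differences such as the ones in the second table would have to come from elsewhere, for example from code placement.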