`strstr` is included in the benchmark as a reference; it is not affected by the optimization.
It may be impossible to reach `strstr` performance, as it uses `pcmpistri` (and reads beyond the last element, since `pcmpistri` is not very useful otherwise). We can try `pcmpestri` for the 8-bit and 16-bit cases, but it still may not be as efficient as `strstr`. I'd prefer to try this additional optimization in a follow-up PR, though.
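For a sense of what such a `pcmpestri`-based path could look like, here is a minimal sketch (`candidate_in_block` is a hypothetical helper, not code from this PR); unlike `pcmpistri`, `pcmpestri` takes explicit lengths instead of relying on NUL terminators, though the 16-byte loads themselves still touch memory past those lengths:

```cpp
#include <nmmintrin.h> // SSE4.2: _mm_cmpestri

// Returns the offset within one 16-byte haystack block where the needle
// (at most 16 bytes) may begin, or 16 if there is no candidate.
// Note: the unaligned loads read a full 16 bytes regardless of the given
// lengths, so a real implementation must keep those reads in valid memory.
int candidate_in_block(const char* block, int block_len,
                       const char* needle, int needle_len) {
    const __m128i hay = _mm_loadu_si128(reinterpret_cast<const __m128i*>(block));
    const __m128i nee = _mm_loadu_si128(reinterpret_cast<const __m128i*>(needle));
    return _mm_cmpestri(nee, needle_len, hay, block_len,
        _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT);
}
```

A real loop would advance block by block and verify partial matches at block boundaries, which is exactly where the read-past-the-end concern comes in.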
The difference in the "before" results between `ranges::search` and `search(..., default_searcher)` is due to their different implementations: `search(..., default_searcher)` doesn't try to use `memcmp` on every iteration, it compares elements directly, and apparently this is faster.
It is curious that `search(..., default_searcher)` is faster for the 64-bit type even before vectorization.
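For concreteness, here is a hedged sketch of the two per-candidate verification strategies (`match_memcmp` and `match_direct` are hypothetical names, not the library's code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// What the memcmp path effectively does at each candidate position:
bool match_memcmp(const std::uint64_t* hay, const std::uint64_t* nee, std::size_t n) {
    return std::memcmp(hay, nee, n * sizeof(std::uint64_t)) == 0;
}

// What default_searcher effectively does: a direct element-wise loop
// that can fail on the first mismatching element with no call overhead.
bool match_direct(const std::uint64_t* hay, const std::uint64_t* nee, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i) {
        if (hay[i] != nee[i]) {
            return false;
        }
    }
    return true;
}
```

For wide elements the direct loop lets the compiler compare one element and exit early, while the `memcmp` route pays a library-call overhead on every candidate position.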
I'd like someone else to confirm these results, and a maintainer decision on what to do with this.
A possible way to handle this is to remove the 32-bit and 64-bit optimization/vectorization attempts entirely, so that the `ranges::search` case (along with the usual `std::search`) behaves the same as `search(..., default_searcher)`.
And if we are to keep vectorization only for 8-bit and 16-bit elements, we might drop the current implementation and not review/commit it in the first place, if SSE4.2 `pcmpestri` looks like it would be faster.
I pushed changes to address the issues that I found, but I still need to think about whether we should be vectorizing this at all. I'm leaning towards ripping out the existing memcmp. Stay tuned...
…pred`.
`_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`.
`_Equal_rev_pred` is called by ranges `search`/`find_end`.
This doesn't affect `equal` etc.
Ok, after looking at the benchmarks, I've taken the radical step of reverting both the vectorization changes and the existing memcmp "optimizations" that were massively pessimizing your reasonable benchmark example. My measurements:
| Benchmark | main | Vector | Plain |
| --- | --- | --- | --- |
| `c_strstr` | 144 ns | 151 ns | 145 ns |
| `classic_search<std::uint8_t>` | 8935 ns | 1726 ns | 1754 ns |
| `classic_search<std::uint16_t>` | 9732 ns | 3017 ns | 1739 ns |
| `classic_search<std::uint32_t>` | 9029 ns | 4829 ns | 1912 ns |
| `classic_search<std::uint64_t>` | 8970 ns | 8471 ns | 2527 ns |
| `ranges_search<std::uint8_t>` | 8916 ns | 1723 ns | 1784 ns |
| `ranges_search<std::uint16_t>` | 8932 ns | 3012 ns | 1785 ns |
| `ranges_search<std::uint32_t>` | 8303 ns | 4840 ns | 2460 ns |
| `ranges_search<std::uint64_t>` | 9061 ns | 8577 ns | 1860 ns |
| `search_default_searcher<std::uint8_t>` | 1858 ns | 1720 ns | 1883 ns |
| `search_default_searcher<std::uint16_t>` | 2525 ns | 3016 ns | 1861 ns |
| `search_default_searcher<std::uint32_t>` | 1863 ns | 4812 ns | 1869 ns |
| `search_default_searcher<std::uint64_t>` | 1873 ns | 8440 ns | 2592 ns |
This compares main (with only the benchmark added), this vectorization PR before my revert, and finally "Plain", which lacks both vectorization and `memcmp`.
There's some noise here (e.g. `search_default_searcher` is unchanged between main and Plain, but there were random 2500 ns spikes), but the pattern is quite clear: `memcmp` is almost a 5x penalty for all element sizes, and vectorization helps smaller elements, but plain code delivers great performance across the board.
Note that dropping the `memcmp` paths from `_Equal_rev_pred_unchecked` and `_Equal_rev_pred` has very limited impact. `_Equal_rev_pred_unchecked` is called by classic/parallel `search`/`find_end`, and `_Equal_rev_pred` is called by ranges `search`/`find_end`. (This makes sense, since `find_end` is the "opposite" of `search`.)
The `equal` etc. algorithms aren't affected by this. We should probably evaluate their use of `memcmp`, but I expect the behavior to be quite different (and I hope that `memcmp` is beneficial). It's `search`/`find_end` that are unusual because they want to repeatedly compare a needle, which has very different performance characteristics.
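As a hedged illustration of why (`naive_search` is a hypothetical name, and this is the textbook formulation rather than the STL's actual code): each candidate position restarts the needle comparison, and on typical data that comparison fails at the very first element, so a `memcmp` call per candidate pays its fixed overhead thousands of times while comparing almost nothing:

```cpp
// Textbook search: scan candidate positions, verify the needle at each one.
template <class FwdIt1, class FwdIt2>
FwdIt1 naive_search(FwdIt1 first, FwdIt1 last, FwdIt2 n_first, FwdIt2 n_last) {
    for (;; ++first) {
        FwdIt1 candidate = first;
        for (FwdIt2 n_it = n_first;; ++n_it, ++candidate) {
            if (n_it == n_last) {
                return first; // the whole needle matched here
            }
            if (candidate == last) {
                return last; // haystack exhausted, no match
            }
            if (!(*candidate == *n_it)) {
                break; // typical case: the very first element mismatches
            }
        }
    }
}
```

A single `equal` over two whole ranges, by contrast, makes one comparison pass whose `memcmp` overhead is amortized over the entire length.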
In addition to keeping the benchmark from this vectorization attempt, I've also kept the correctness test (even though search is no longer vectorized), since it's quick to run, and we might figure out how to vectorize this beneficially in the future.
> The `equal` etc. algorithms aren't affected by this. We should probably evaluate their use of `memcmp`, but I expect the behavior to be quite different (and I hope that `memcmp` is beneficial)
Agreed. `mismatch`/`lexicographical_compare` did benefit from a similar optimization.
Resolves #2453
These benchmark results are no longer relevant, as the PR's intention has changed.
Before:
After: