Call CRT `wmemcmp`/`wmemchr` when possible in `char_traits` for better performance #4873

mcfi · 2024-08-02T15:35:26Z

The MSVC backend simply lowers __builtin_wmemcmp and __builtin_wmemchr to wchar-by-wchar loops, which may perform worse than the vectorized CRT wmemcmp and wmemchr. Therefore, for non-constexpr cases in string view, we call the CRT functions instead of the builtins.
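For illustration only, the dispatch described above could be sketched roughly as follows. This is not the code in `<__msvc_string_view.hpp>`; `compare_wide` and `compare_wide_demo` are hypothetical names, and the sketch assumes a compiler that provides `__builtin_is_constant_evaluated()` (C++20 code could use `std::is_constant_evaluated()` instead).

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper, not the STL's actual implementation.
// During constant evaluation it uses a plain loop; at run time it defers to the CRT.
constexpr int compare_wide(const wchar_t* a, const wchar_t* b, std::size_t n) noexcept {
    if (__builtin_is_constant_evaluated()) { // assumption: compiler builtin is available
        for (std::size_t i = 0; i < n; ++i) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return 0;
    }
    // Run-time path: the CRT routine, which may be vectorized.
    return std::wmemcmp(a, b, n);
}

// The constant-evaluated path is usable in static_assert.
static_assert(compare_wide(L"abc", L"abd", 3) < 0, "constexpr path orders abc before abd");

// Hypothetical usage: the run-time path goes through wmemcmp.
inline void compare_wide_demo() {
    const wchar_t x[] = L"hello";
    const wchar_t y[] = L"hello";
    assert(compare_wide(x, y, 5) == 0);
}
```

Only the sign of the result is meaningful on either path, since `wmemcmp` guarantees no particular magnitude.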

AlexGuteniev · 2024-08-02T15:45:26Z

I'd like to see a benchmark and its results. I expect that maintainers would be interested in that too.
#4387 is an example for a similar benchmark added to this repo.

I also recall my past observation that wmemchr is slow and not vectorized. See DevCom-1614562. If that is still the case, I suggest reusing the std::find vectorization implementation instead of wmemchr.

stl/inc/__msvc_string_view.hpp

StephanTLavavej · 2024-08-02T22:31:46Z

@mcfi's internal OS-PR-11111227 "[Perf] Optimize wmemcmp/wmemchr in UCRT" has enhanced <wchar.h> with vectorized implementations for x64 and ARM64, which have been benchmarked there as up to 10x faster. Although it'll be some unspecified amount of time before we're able to pick up updated UCRT headers, I'm fine with calling the UCRT now - this is our historical practice and only in highly unusual cases (e.g. #4654) do we avoid the UCRT.

Since this is deferring to our usual dependency, I also don't think it's absolutely necessary to add benchmarks to the STL repo.

Also, we're already calling wmemcmp and wmemchr in C++14 mode (expand context below the diffs), so I have absolutely no concerns here.

stl/inc/__msvc_string_view.hpp

StephanTLavavej · 2024-08-02T23:39:05Z

Updated title - the header is <__msvc_string_view.hpp> but this applies to char_traits and thus std::wstring etc.

AlexGuteniev · 2024-08-03T10:42:06Z

Should anything be done with this then?

STL/stl/src/vector_algorithms.cpp

Lines 3273 to 3277 in a357ff1

    
    // TRANSITION, ABI: preserved for binary compatibility
    const void* __stdcall __std_find_trivial_unsized_2(const void* const _First, const uint16_t _Val) noexcept {
        // TRANSITION, DevCom-1614562: not trying wmemchr
        return __std_find_trivial_unsized_impl(_First, _Val);
    }

STL/stl/inc/xutility

Lines 6005 to 6017 in a357ff1

    
    #else // ^^^ _USE_STD_VECTOR_ALGORITHMS / !_USE_STD_VECTOR_ALGORITHMS vvv
                if constexpr (sizeof(_Iter_value_t<_InIt>) == 1) {
                    const auto _First_ptr = _STD _To_address(_First);
                    const auto _Result    = static_cast<remove_reference_t<_Iter_ref_t<_InIt>>*>(
                        _CSTD memchr(_First_ptr, static_cast<unsigned char>(_Val), static_cast<size_t>(_Last - _First)));
                    if constexpr (is_pointer_v<_InIt>) {
                        return _Result ? _Result : _Last;
                    } else {
                        return _Result ? _First + (_Result - _First_ptr) : _Last;
                    }
                }
                // TRANSITION, DevCom-1614562: not trying wmemchr
    #endif // ^^^ !_USE_STD_VECTOR_ALGORITHMS ^^^

STL/stl/inc/xutility

Lines 6058 to 6064 in a357ff1

    
    _NODISCARD constexpr _It _Find_unchecked(_It _First, const _Se _Last, const _Ty& _Val, _Pj _Proj = {}) {
        // TRANSITION, DevCom-1614562: not trying wmemchr
        // Only single-byte elements are suitable for unsized optimization
        constexpr bool _Single_byte_elements = sizeof(_Iter_value_t<_It>) == 1;
        constexpr bool _Is_sized             = sized_sentinel_for<_Se, _It>;
        if constexpr (_Vector_alg_in_find_is_safe<_It, _Ty>

StephanTLavavej · 2024-08-06T18:38:37Z

Good catch @AlexGuteniev! I think we can handle those in a followup PR (except that changing the ABI-retained function is pointless). I've recorded a note.

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

AlexGuteniev · 2024-08-07T05:41:51Z

> except that changing the ABI-retained function is pointless

Don't think so. The function used to be the optimization that we reverted to make ASAN happy.
Using wmemchr brings the optimization back, mitigating the ASAN happiness impact. (Not fully eliminating though, as there are 32 and 64 bit versions too)
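As a rough sketch of the idea, a sized search over 2-byte-or-wider `wchar_t` elements could defer to `wmemchr` in the same way the `memchr` branch quoted above handles single-byte elements. The name `find_wide` is hypothetical and this is not the code that later landed in the STL.

```cpp
#include <cstddef>
#include <cwchar>

// Hypothetical helper mirroring the memchr-based branch, but for wchar_t ranges.
// Returns a pointer to the first element equal to val, or last if there is none.
inline const wchar_t* find_wide(const wchar_t* first, const wchar_t* last, wchar_t val) noexcept {
    const wchar_t* const result = std::wmemchr(first, val, static_cast<std::size_t>(last - first));
    return result ? result : last;
}
```

The sketch needs a known range length, so it covers only the sized case; an unsized search such as `__std_find_trivial_unsized_2` has no length to pass to `wmemchr`.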

StephanTLavavej · 2024-08-08T07:05:45Z

Thanks for this (future) performance improvement! 🚀 📆 ⏳

Update __msvc_string_view.hpp

c08decd

mcfi requested a review from a team as a code owner August 2, 2024 15:35

Update __msvc_string_view.hpp

b334119

frederick-vs-ja reviewed Aug 2, 2024

Update __msvc_string_view.hpp

394eea5

StephanTLavavej added the performance Must go faster label Aug 2, 2024

StephanTLavavej added 3 commits August 2, 2024 15:52

Activate optimization for exactly C++17.

d9c1790

Activate optimization for char16_t, unify wmemcmp/wmemchr calls.

ba5b17d

Also make length() fall through to wcslen().

82d2eff

StephanTLavavej reviewed Aug 2, 2024

StephanTLavavej approved these changes Aug 2, 2024

StephanTLavavej changed the title ~~Call CRT wmemcmp/wmemchr when possible in string view for better performance~~ Call CRT wmemcmp/wmemchr when possible in char_traits for better performance Aug 2, 2024

StephanTLavavej mentioned this pull request Aug 2, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Aug 6, 2024

StephanTLavavej merged commit b5285d1 into microsoft:main Aug 8, 2024
39 checks passed

AlexGuteniev mentioned this pull request Aug 15, 2024

Use wmemchr in optimizations #4894

Merged

AlexGuteniev mentioned this pull request Nov 19, 2024

Vectorize basic_string::find #5101

Merged
