Auto-vectorize arrays `swap` #4991

AlexGuteniev · 2024-09-29T10:05:14Z

Resolves #2683

📜 The Approach

Instead of calling a separately compiled implementation in the import library, reimplement this specific case in the headers in a way that the compiler is able to vectorize.

Do the swap via `memcpy` in portions of a nice length, followed by the tail, so that the compiler can use registers as intermediate storage (vector registers when possible). Because the length is known at compile time, this unrolls perfectly.

I didn't investigate much which length is nice, but I think 64 is a good choice:

- Not too much stack is consumed in debug mode, where there will be an actual stack allocation
- Enough to use AVX-512 if available
- Not too much when only SSE2 is available
- I hope it is good for Arm64 too 🤷
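A minimal sketch of the portion-then-tail idea described above. It is not the code from this PR: the name `chunked_array_swap` and its exact structure are assumptions made for illustration, and it assumes the two arrays are distinct objects.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Hypothetical helper (not the STL's actual code): swaps two distinct arrays
// of trivially copyable elements through a small fixed-size buffer.
template <class T, std::size_t N>
void chunked_array_swap(T (&left)[N], T (&right)[N]) noexcept {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memcpy-based swap needs trivially copyable elements");

    constexpr std::size_t portion  = 64;             // the "nice length" from the description
    constexpr std::size_t total    = sizeof(T) * N;  // known at compile time
    constexpr std::size_t tail     = total % portion;
    constexpr std::size_t full_end = total - tail;

    unsigned char* const lhs = reinterpret_cast<unsigned char*>(left);
    unsigned char* const rhs = reinterpret_cast<unsigned char*>(right);
    assert(lhs != rhs); // this sketch assumes two distinct arrays

    unsigned char buffer[portion]; // small enough to live in registers when optimized

    // Full 64-byte portions: constant trip count, so the compiler can unroll.
    for (std::size_t offset = 0; offset != full_end; offset += portion) {
        std::memcpy(buffer, lhs + offset, portion);
        std::memcpy(lhs + offset, rhs + offset, portion);
        std::memcpy(rhs + offset, buffer, portion);
    }

    // Tail: whatever is left after the last full portion.
    if (tail != 0) {
        std::memcpy(buffer, lhs + full_end, tail);
        std::memcpy(lhs + full_end, rhs + full_end, tail);
        std::memcpy(rhs + full_end, buffer, tail);
    }
}
```

Because every `memcpy` length here is a compile-time constant, an optimizer is free to turn each copy into a few register loads and stores. Whether it actually does so depends on the compiler and flags.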

⚖️ Self swap check

[utility.swap]/7 says:

> Effects: As if by `swap_ranges(a, a + N, b)`.

[alg.swap]/2 says:

> Preconditions: The two ranges `[first1, last1)` and `[first2, last2)` do not overlap.

Therefore it is UB to swap overlapping arrays.

~~Added an assert for that and removed the check.~~

@frederick-vs-ja reported LWG-4165, so the check is preserved.
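For illustration only, here is what a preserved self-swap check can look like in a `memcpy`-based swap. The function name `guarded_swap` is hypothetical and this is not the PR's code. One reason such a guard matters for byte-copy implementations in general is that `memcpy` itself requires its source and destination not to overlap.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical illustration: a memcpy-based swap of two int arrays that keeps
// a self-swap check instead of treating self-swap as a precondition violation.
template <std::size_t N>
void guarded_swap(int (&left)[N], int (&right)[N]) noexcept {
    if (&left == &right) {
        return; // same array: nothing to do, and memcpy must not see overlapping ranges
    }

    int tmp[N];
    std::memcpy(tmp, left, sizeof(tmp));
    std::memcpy(left, right, sizeof(tmp));
    std::memcpy(right, tmp, sizeof(tmp));
}
```

With the guard in place, `guarded_swap(a, a)` leaves `a` unchanged.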

⏱️ Benchmark results

| Benchmark | main | this |
|---|---|---|
| `std_swap<1, uint8_t>` | 1.22 ns | 1.20 ns |
| `std_swap<5, uint8_t>` | 2.83 ns | 1.44 ns |
| `std_swap<15, uint8_t>` | 10.7 ns | 1.92 ns |
| `std_swap<26, uint8_t>` | 14.9 ns | 1.93 ns |
| `std_swap<38, uint8_t>` | 25.6 ns | 2.41 ns |
| `std_swap<60, uint8_t>` | 32.4 ns | 2.64 ns |
| `std_swap<125, uint8_t>` | 62.0 ns | 4.31 ns |
| `std_swap<800, uint8_t>` | 399 ns | 22.0 ns |
| `std_swap<3000, uint8_t>` | 1490 ns | 90.4 ns |
| `std_swap<9000, uint8_t>` | 4365 ns | 203 ns |
| `std_swap_ranges<uint8_t>/1` | 1.97 ns | 1.91 ns |
| `std_swap_ranges<uint8_t>/5` | 4.03 ns | 3.58 ns |
| `std_swap_ranges<uint8_t>/15` | 5.26 ns | 5.01 ns |
| `std_swap_ranges<uint8_t>/26` | 3.55 ns | 3.12 ns |
| `std_swap_ranges<uint8_t>/38` | 5.34 ns | 4.53 ns |
| `std_swap_ranges<uint8_t>/60` | 5.61 ns | 5.05 ns |
| `std_swap_ranges<uint8_t>/125` | 7.01 ns | 6.50 ns |
| `std_swap_ranges<uint8_t>/800` | 19.1 ns | 16.9 ns |
| `std_swap_ranges<uint8_t>/3000` | 81.8 ns | 79.4 ns |
| `std_swap_ranges<uint8_t>/9000` | 139 ns | 138 ns |

🥇 Results interpretation

- There is an improvement even for arrays of 5 elements!
- It sometimes wins and sometimes loses against the manually vectorized implementation:
  - Auto-vectorization of medium-sized arrays benefits from knowing the number of elements in advance and from full unrolling
  - Manual vectorization of large arrays benefits from CPU dispatch and a small loop body

philnik777 · 2024-09-29T12:37:58Z

Can't you just throw in a __restrict? Works just fine with Clang/libc++: https://godbolt.org/z/Ez1hM11jG

AlexGuteniev · 2024-09-29T13:20:35Z

> Can't you just throw in a __restrict? Works just fine with Clang/libc++:

No, unfortunately. `#pragma loop(ivdep)` doesn't help either.
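For readers unfamiliar with the two hints mentioned in this exchange, a rough sketch of how they would be applied to a plain element-wise swap loop. This is not the code behind the Godbolt link, and `restrict_swap` is a hypothetical name. Per the reply above, neither hint was enough to get the loop vectorized by MSVC.

```cpp
#include <cstddef>

// Hypothetical element-wise swap loop. __restrict is a compiler extension
// (MSVC, GCC, and Clang accept this spelling) promising that the two
// pointers do not alias.
inline void restrict_swap(unsigned char* __restrict left,
                          unsigned char* __restrict right,
                          std::size_t count) noexcept {
#ifdef _MSC_VER
#pragma loop(ivdep) // MSVC-only hint: ignore assumed vector dependencies in the next loop
#endif
    for (std::size_t i = 0; i != count; ++i) {
        const unsigned char tmp = left[i];
        left[i]                 = right[i];
        right[i]                = tmp;
    }
}
```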

AlexGuteniev · 2024-10-18T06:17:32Z

I want to note that ranges::swap does not have any self-swap check, unlike std::swap.

StephanTLavavej · 2024-10-22T22:34:02Z

Thanks! 🐈 I pushed fairly minor changes.

Results on my 5950X:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `std_swap<1, uint8_t>` | 2.98 ns | 3.63 ns | 0.82 |
| `std_swap<5, uint8_t>` | 4.72 ns | 3.84 ns | 1.23 |
| `std_swap<15, uint8_t>` | 8.33 ns | 3.87 ns | 2.15 |
| `std_swap<26, uint8_t>` | 24.0 ns | 3.65 ns | 6.58 |
| `std_swap<38, uint8_t>` | 38.7 ns | 3.86 ns | 10.03 |
| `std_swap<60, uint8_t>` | 62.5 ns | 3.87 ns | 16.15 |
| `std_swap<125, uint8_t>` | 129 ns | 4.75 ns | 27.16 |
| `std_swap<800, uint8_t>` | 816 ns | 21.4 ns | 38.13 |
| `std_swap<3000, uint8_t>` | 3019 ns | 80.6 ns | 37.46 |
| `std_swap<9000, uint8_t>` | 9262 ns | 241 ns | 38.43 |
| `std_swap_ranges<uint8_t>/1` | 4.70 ns | 4.48 ns | 1.05 |
| `std_swap_ranges<uint8_t>/5` | 5.60 ns | 6.22 ns | 0.90 |
| `std_swap_ranges<uint8_t>/15` | 7.72 ns | 7.75 ns | 1.00 |
| `std_swap_ranges<uint8_t>/26` | 4.96 ns | 4.97 ns | 1.00 |
| `std_swap_ranges<uint8_t>/38` | 6.71 ns | 6.69 ns | 1.00 |
| `std_swap_ranges<uint8_t>/60` | 6.30 ns | 6.31 ns | 1.00 |
| `std_swap_ranges<uint8_t>/125` | 8.02 ns | 7.97 ns | 1.01 |
| `std_swap_ranges<uint8_t>/800` | 22.5 ns | 23.8 ns | 0.95 |
| `std_swap_ranges<uint8_t>/3000` | 86.2 ns | 89.9 ns | 0.96 |
| `std_swap_ranges<uint8_t>/9000` | 124 ns | 123 ns | 1.01 |

AlexGuteniev · 2024-10-23T05:52:37Z

> Results on my 5950X:

The big difference in the "After" column between `std_swap<9000, uint8_t>` and `std_swap_ranges<uint8_t>/9000` made me curious. It is big for me too.

Looks like the main reason is vector over-alignment. Adding `alignas(64)` to the stack array makes a significant improvement there.

I don't think the benchmark should be modified to have that, though.

CaseyCarter · 2024-10-23T17:58:22Z

> > Results on my 5950X:
>
> The big difference in the "After" column between `std_swap<9000, uint8_t>` and `std_swap_ranges<uint8_t>/9000` made me curious. It is big for me too.
>
> Looks like the main reason is vector over-alignment. Adding `alignas(64)` to the stack array makes a significant improvement there.
>
> I don't think the benchmark should be modified to have that, though.

If we know that alignment has a significant effect on the results, shouldn't we control it instead of letting it vary and skew the benchmark? Note that "control" doesn't mean "pick an unrealistic value that skews the results" but either pick a value that is representative or average over a set of values that are representative.

AlexGuteniev · 2024-10-23T18:19:45Z

> but either pick a value that is representative or average over a set of values that are representative

The x64 stack is 16-byte aligned on function entry, but when there are multiple variables on the stack, the location of each is constrained by its own alignment and by the alignment of the variable next to it.

`malloc` also provides 16-byte alignment.

So maybe 8 is a good skew, to imitate being next to a pointer. Or 16, to imitate the top or only variable on the stack, or a `malloc` allocation.
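One way to control the skew discussed here, sketched under assumptions: `skewed_buffer` is a hypothetical helper, not something from the STL benchmarks. It over-aligns the storage to 64 bytes and then starts the usable array a fixed number of bytes past that boundary, so the data address is always `Skew` modulo 64.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical benchmark helper: storage aligned to 64 bytes, with the usable
// array starting `Skew` bytes past that boundary.
template <std::size_t Size, std::size_t Skew>
struct skewed_buffer {
    static_assert(Skew < 64, "skew is an offset within one 64-byte block");

    alignas(64) unsigned char storage[Size + Skew];

    unsigned char* data() noexcept {
        return storage + Skew; // address is Skew modulo 64 by construction
    }
};
```

For example, `skewed_buffer<9000, 8>` imitates an array that sits right after a pointer, and `skewed_buffer<9000, 16>` imitates a 16-byte aligned stack variable or `malloc` allocation.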

StephanTLavavej · 2024-10-23T19:09:39Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-24T15:44:38Z

🔀 🚀 😻

Vectorize arrays swap

6dd355f

AlexGuteniev requested a review from a team as a code owner September 29, 2024 10:05

The Very Core

e699043

CaseyCarter added the performance Must go faster label Sep 29, 2024

StephanTLavavej self-assigned this Sep 29, 2024

AlexGuteniev and others added 3 commits September 30, 2024 18:35

There's another swap!

a7b11af

Restore this test

3cda1fb

Co-authored-by: Stephan T. Lavavej <stl@nuwen.net>

Imagine ADL could strike here too

77ccd7f

frederick-vs-ja reviewed Oct 18, 2024


AlexGuteniev added 3 commits October 18, 2024 08:25

64 in one place

aa6c28e

Allow self swap

90ec86a

Merge remote-tracking branch 'upstream/main' into try-swapping-again

476cf18

frederick-vs-ja reviewed Oct 18, 2024


AlexGuteniev commented Oct 21, 2024


StephanTLavavej changed the title ~~Vectorize arrays swap~~ Auto-vectorize arrays swap Oct 22, 2024

StephanTLavavej added 6 commits October 22, 2024 14:04

Merge branch 'main' into try-swapping-again

0d319c7

Test _HAS_CXX20 instead of __cpp_lib_concepts, and <ranges> inc…

bc455f2

…ludes `<concepts>`.

Add const to _Stop.

616824f

Drop comment typo and simplify LWG-4165 citation.

55539aa

Use _Is_constant_evaluated.

27797e6

Refine the comment.

4a097fe

StephanTLavavej reviewed Oct 22, 2024

StephanTLavavej approved these changes Oct 22, 2024

StephanTLavavej removed their assignment Oct 22, 2024

StephanTLavavej mentioned this pull request Oct 22, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 23, 2024

AlexGuteniev mentioned this pull request Oct 24, 2024

Provide an intentional (mis)alignment that corresponds to typical usage in benchmarks for plain arrays #5035

Closed

StephanTLavavej merged commit 7b199b2 into microsoft:main Oct 24, 2024
39 checks passed

AlexGuteniev deleted the try-swapping-again branch October 24, 2024 15:45

AlexGuteniev mentioned this pull request Jan 4, 2025

Add XOR swap and XCHG assembly optimization for integral types #5215

Closed

This was referenced May 2, 2025

<xfacet>: Consider remove <yvals.h> inclusion #5447

Closed

<concepts>: std::swap for arrays uses memcpy #5481

Closed
