Help the compiler vectorize `adjacent_difference` #4958

AlexGuteniev · 2024-09-14T10:33:39Z

📜 The approach

The following things prevented the original algorithm from being vectorized:

- Loop-carried dependency: the previous input is used as one of the operands.
  - It seems expected that the compiler doesn't eliminate this automagically; that would be too much of a transformation.
  - This was addressed by reading the input array twice per iteration instead of carrying the value through the loop.
- Odd iterator pattern where the compiler cannot understand the iteration.
  - This seemed to me a strange limitation, so it was reported as DevCom-10742868.
  - This was addressed by using an integer index.
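
The two loop shapes can be sketched roughly as follows. This is a simplified illustration for raw pointers only, not the code in `stl/inc/numeric`; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Carried form: `prev` lives across iterations, so iteration i depends on i - 1.
template <class T>
void adjacent_difference_carried(const T* first, const T* last, T* dest) {
    if (first == last) {
        return;
    }
    T prev = *first;
    *dest  = prev;
    while (++first != last) {
        T cur   = *first;
        *++dest = static_cast<T>(cur - prev);
        prev    = cur; // loop-carried dependency
    }
}

// Indexed form: each iteration reads in[i] and in[i - 1] itself,
// so nothing is carried from one iteration to the next.
template <class T>
void adjacent_difference_indexed(const T* in, std::size_t n, T* out) {
    assert(n == 0 || (in != nullptr && out != nullptr));
    if (n == 0) {
        return;
    }
    out[0] = in[0];
    for (std::size_t i = 1; i < n; ++i) {
        out[i] = static_cast<T>(in[i] - in[i - 1]);
    }
}
```

For non-overlapping input and output both forms produce the same result; whether the second one is actually vectorized remains the compiler's decision.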

🛑 Correctness concern

The standard defines the exact steps of this algorithm, and the optimization alters them.
In particular, the standard wants the subtracted value to be saved from the previous iteration rather than read again.
The two sections below explain the precautions taken to make the change unobservable, so I hope the change is correct.

✅ Checks for eligibility

The following checks were added:

- No aliasing (see below)
- Iterators can be pointers
- Source iterator is not volatile (the read order is altered)
- Trivially copyable (we skip copying where the standard asks for it)

There's no need to check for integral types or the like, since the compiler makes the final decision anyway, and it may be able to optimize even something that wouldn't pass a strict check.
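
The compile-time part of these checks could look something like the sketch below. The trait name `can_use_indexed_path` is hypothetical and the conditions are simplified to raw pointers; the STL uses its own internal machinery, and the aliasing check happens at run time, so it is not part of this trait.

```cpp
#include <type_traits>

// Hypothetical trait, not the STL's internal one: both iterators are raw
// pointers, the source is not volatile, and the element type is trivially
// copyable. Note there is deliberately no "is integral" condition.
template <class InIt, class OutIt>
constexpr bool can_use_indexed_path =
    std::is_pointer<InIt>::value && std::is_pointer<OutIt>::value
    && !std::is_volatile<typename std::remove_pointer<InIt>::type>::value
    && std::is_trivially_copyable<typename std::remove_pointer<InIt>::type>::value;
```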

⚠️ No Aliasing

Apparently there's no rule that the source and the destination ranges may not overlap.
We should handle aliasing.

Unlike the #4431 precedent, we can't defer to the compiler here. The compiler can insert an overlap check that skips vectorization and falls back to the scalar loop when the check fails, but:

- We apply a transformation that changes the meaning of the program when the ranges overlap, and the meaning changes whether or not vectorization happens.
- The checks that the compiler inserts may be too loose. For example, they may allow equal source and destination pointers, because they only verify that vectorizing the transformed algorithm would not change its meaning.
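
A small illustration of the first point, with hypothetical helper names: when the algorithm runs in place, the form that saves the previous input and the form that re-reads it give different answers, with no vectorization involved.

```cpp
#include <cassert>
#include <cstddef>

// Saves the previous input in a local, as the standard's steps do.
template <class T>
void diff_saved(const T* in, std::size_t n, T* out) {
    assert(n > 0);
    T prev = in[0];
    out[0] = prev;
    for (std::size_t i = 1; i < n; ++i) {
        T cur  = in[i];
        out[i] = static_cast<T>(cur - prev);
        prev   = cur;
    }
}

// Re-reads in[i - 1], which may already have been overwritten through `out`.
template <class T>
void diff_reread(const T* in, std::size_t n, T* out) {
    assert(n > 0);
    out[0] = in[0];
    for (std::size_t i = 1; i < n; ++i) {
        out[i] = static_cast<T>(in[i] - in[i - 1]);
    }
}

// In place on {1, 4, 9, 16}:
//   diff_saved  gives {1, 3, 5, 7}
//   diff_reread gives {1, 3, 6, 10}, because it subtracts the differences
//   it has just written instead of the original inputs.
```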

So we do our own checks.

Then we tell the compiler with `__restrict` that we have already checked, so it should not bother. This is done in a separate function, because `__restrict` promises no aliasing for the whole scope of the pointer, so using `__restrict` within the original algorithm would apparently be a lie.

If not prevented, the extra check by the compiler would slightly increase run time and add dead code.
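
A minimal sketch of this arrangement, assuming raw pointers and a compiler that accepts the `__restrict` extension (MSVC, GCC, and Clang do). All names here are hypothetical, and the overlap test is a simplification rather than the check used in `stl/inc/numeric`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical helper: true if the two half-open ranges share any address.
inline bool ranges_overlap(const void* a_first, const void* a_last,
                           const void* b_first, const void* b_last) {
    const std::less<const void*> before{}; // total order, even across arrays
    return before(a_first, b_last) && before(b_first, a_last);
}

// Separate function: both pointers are __restrict for their whole scope,
// which is only truthful because the caller has already ruled out overlap.
template <class T>
void diff_no_alias(const T* __restrict in, T* __restrict out, std::size_t n) {
    out[0] = in[0];
    for (std::size_t i = 1; i < n; ++i) {
        out[i] = static_cast<T>(in[i] - in[i - 1]);
    }
}

template <class T>
void adjacent_difference_checked(const T* first, const T* last, T* dest) {
    assert(first <= last);
    const auto n = static_cast<std::size_t>(last - first);
    if (n == 0) {
        return;
    }
    if (!ranges_overlap(first, last, dest, dest + n)) {
        diff_no_alias(first, dest, n); // transformed loop, no aliasing possible
        return;
    }
    // Overlap: keep the form that saves the previous input.
    T prev = *first;
    *dest  = prev;
    for (std::size_t i = 1; i < n; ++i) {
        T cur   = first[i];
        dest[i] = static_cast<T>(cur - prev);
        prev    = cur;
    }
}
```

Taking `const void*` in the overlap helper means it also works when the source and destination element types differ.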

😾 Compiler warnings

We have a great feature called integral promotion. Types smaller than `int` are promoted to `int` before arithmetic, and there is a warning about converting the result back. A local pragma suppresses these warnings in the benchmark, but not in the test.

@StephanTLavavej used a function object with static_cast to avoid warnings in the test.
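
For illustration, such a function object might look like the sketch below. The name `narrowing_minus` is hypothetical and this is not necessarily the one used in the test.

```cpp
#include <cstdint>
#include <type_traits>

// Subtracting two uint8_t values yields int because of integral promotion.
static_assert(std::is_same<decltype(std::uint8_t{} - std::uint8_t{}), int>::value,
    "operands are promoted to int before the subtraction");

// Hypothetical function object: the explicit static_cast makes the
// conversion back to T intentional, so no truncation warning is emitted.
struct narrowing_minus {
    template <class T>
    constexpr T operator()(const T& left, const T& right) const {
        return static_cast<T>(left - right);
    }
};
```

It would be passed as the binary operation, for example `std::adjacent_difference(first, last, dest, narrowing_minus{})`.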

⏱️ Benchmark results

| Benchmark | main | this | this + AVX2 |
|---|---|---|---|
| `bm<uint8_t>/2255` | 745 ns | 563 ns | 562 ns |
| `bm<uint16_t>/2255` | 799 ns | 83.3 ns | 75.1 ns |
| `bm<uint32_t>/2255` | 731 ns | 154 ns | 141 ns |
| `bm<uint64_t>/2255` | 805 ns | 293 ns | 272 ns |
| `bm<float>/2255` | 751 ns | 154 ns | 123 ns |
| `bm<double>/2255` | 753 ns | 304 ns | 233 ns |

🥇 Results interpretation

- Overall, we're good 😸
- The 8-bit case failed to vectorize for no reason; reported as DevCom-10745948.
- Still, the 8-bit case is noticeably better. I didn't analyze that, but it looks like a consistent effect, not codegen gremlins. I think it is a side effect of eliminating the loop-carried dependency, so the processor can parallelize and overlap iterations.
- AVX2 is only slightly faster. I did not analyze this, but I think the memory wall is being hit here 🧱

CaseyCarter · 2024-09-15T15:20:40Z

> 8-bit case failed to vectorize for no reason (didn't look up if it is known compiler issue, or to be reported)

Interestingly, it vectorizes if we use `-` directly instead of indirecting through `std::minus`, or if the output is a pointer to `int`. Something to do with narrowing the result of the promoted operation, maybe?

StephanTLavavej · 2024-10-26T14:51:29Z

Thanks! 😻 I pushed minor nitpicks and a significant fix for C++14/17. Speedups look good on my 5950X:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| `bm<uint8_t>/2255` | 968 ns | 967 ns | 1.00 |
| `bm<uint16_t>/2255` | 917 ns | 97.2 ns | 9.43 |
| `bm<uint32_t>/2255` | 648 ns | 158 ns | 4.10 |
| `bm<uint64_t>/2255` | 689 ns | 331 ns | 2.08 |
| `bm<float>/2255` | 646 ns | 158 ns | 4.09 |
| `bm<double>/2255` | 652 ns | 332 ns | 1.96 |

StephanTLavavej · 2024-10-29T19:43:34Z

I'm mirroring this to the MSVC-internal repo - please notify me if any further changes are pushed.

StephanTLavavej · 2024-10-29T22:52:27Z

I had to push an additional commit to fix the overlap check for heterogeneous types.

StephanTLavavej · 2024-10-30T14:25:09Z

Thanks for helping the compiler, said the author of the presentation, Don't Help The Compiler 😹 😻 🎉

AlexGuteniev added 3 commits September 14, 2024 09:50

benchmark

85a67ec

test coverage

b0bab69

the optimization

eb0cf5a

AlexGuteniev requested a review from a team as a code owner September 14, 2024 10:33

AlexGuteniev changed the title ~~Help the compiler vectorize adjacent_differentce~~ Help the compiler vectorize adjacent_difference Sep 14, 2024

You shall pass!

eed039f

frederick-vs-ja reviewed Sep 14, 2024

AlexGuteniev added 3 commits September 14, 2024 14:27

constexpr

bc92d54

ADL

fd51080

rvalue

3db07e7

frederick-vs-ja reviewed Sep 14, 2024

AlexGuteniev added 2 commits September 14, 2024 16:22

Review comments

53a60d4

types

51595a3

StephanTLavavej added the performance Must go faster label Sep 15, 2024

StephanTLavavej self-assigned this Sep 15, 2024

AlexGuteniev added 2 commits September 19, 2024 18:28

Pointer math

99e6058

Merge remote-tracking branch 'upstream/main' into adjacent

869a0cc

StephanTLavavej requested changes Oct 24, 2024

StephanTLavavej removed their assignment Oct 24, 2024

AlexGuteniev added 8 commits October 24, 2024 20:22

std already

3cdbc88

const

18f64ca

oops loops

c68b402

Merge remote-tracking branch 'upstream/main' into adjacent

da4ef2b

U

ba9a098

typos

3ddaf9e

includes

944672b

non-random filler

0042fe7

AlexGuteniev requested a review from StephanTLavavej October 24, 2024 17:39

AlexGuteniev commented Oct 24, 2024

StephanTLavavej self-assigned this Oct 24, 2024

AlexGuteniev and others added 7 commits October 25, 2024 06:36

stray

d06ea71

Merge branch 'main' into adjacent

df487ce

Test 8-bit and 16-bit, avoid truncation warnings.

f132d46

static_assert: The documentation that rewards poor reading comprehension with DOOM!

cfb92a4

Consistently order output_expected before output_actual.

c2a130f

Minor comment grammar improvements.

7a110a8

Fix perf bug: Inspect unwrapped iterators with `_Iterators_are_contiguous`.

d86bf29

StephanTLavavej reviewed Oct 26, 2024

StephanTLavavej approved these changes Oct 26, 2024

StephanTLavavej removed their assignment Oct 26, 2024

StephanTLavavej mentioned this pull request Oct 26, 2024

Maintainer priorities #4700

Open

StephanTLavavej self-assigned this Oct 29, 2024

Fix and test adjacent_difference with heterogeneous types.

9b85627

StephanTLavavej approved these changes Oct 29, 2024

AlexGuteniev commented Oct 30, 2024

StephanTLavavej merged commit 1990083 into microsoft:main Oct 30, 2024
39 checks passed

AlexGuteniev deleted the adjacent branch October 30, 2024 14:47

This was referenced Oct 30, 2024

Fix internal Perl script for Standard Library Header Units test coverage #5056

Merged

Guard __restrict usage for CUDA #5061

Merged

frederick-vs-ja mentioned this pull request Nov 12, 2024

Use __restrict__ for CUDA #5079

Merged

AlexGuteniev mentioned this pull request Nov 16, 2024

Vectorize unique #5092

Merged
