You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Use AVX2 mask to read tails for minmax/minmax element, then use the same masks to populate the tail with previous data, and to exclude tail indices for _element algorithm.
β±οΈ Benchmark results
8 and 16 bit Both_val cases are expected to improve in a significant way too, but it is currently hidden by #4913
StephanTLavavej
changed the title
Use AVX/AVX2 masks in minmax_element and minmax vectoization
Use AVX/AVX2 masks in minmax_element and minmax vectorization
Aug 27, 2024
Final (voluminous) results on my 5950X, relative to main containing #4913. As always, "before" has the updated benchmark, but with git restore --source=main stl to revert the product code.
Click to expand table:
Benchmark
Before
After
Speedup
bm<uint8_t, Op::Min>/8021
129 ns
122 ns
1.06
bm<uint8_t, Op::Min>/63
31.2 ns
12.5 ns
2.50
bm<uint8_t, Op::Max>/8021
130 ns
124 ns
1.05
bm<uint8_t, Op::Max>/63
31.2 ns
12.8 ns
2.44
bm<uint8_t, Op::Both>/8021
161 ns
150 ns
1.07
bm<uint8_t, Op::Both>/63
41.3 ns
20.5 ns
2.01
bm<uint8_t, Op::Min_val>/8021
69.7 ns
63.4 ns
1.10
bm<uint8_t, Op::Min_val>/63
19.3 ns
5.81 ns
3.32
bm<uint8_t, Op::Max_val>/8021
70.7 ns
63.6 ns
1.11
bm<uint8_t, Op::Max_val>/63
19.0 ns
6.61 ns
2.87
bm<uint8_t, Op::Both_val>/8021
76.7 ns
66.1 ns
1.16
bm<uint8_t, Op::Both_val>/63
26.6 ns
6.82 ns
3.90
bm<uint16_t, Op::Min>/8021
227 ns
226 ns
1.00
bm<uint16_t, Op::Min>/31
16.5 ns
10.5 ns
1.57
bm<uint16_t, Op::Max>/8021
224 ns
229 ns
0.98
bm<uint16_t, Op::Max>/31
15.8 ns
10.5 ns
1.50
bm<uint16_t, Op::Both>/8021
250 ns
267 ns
0.94
bm<uint16_t, Op::Both>/31
25.6 ns
18.8 ns
1.36
bm<uint16_t, Op::Min_val>/8021
116 ns
116 ns
1.00
bm<uint16_t, Op::Min_val>/31
7.77 ns
5.33 ns
1.46
bm<uint16_t, Op::Max_val>/8021
116 ns
114 ns
1.02
bm<uint16_t, Op::Max_val>/31
8.34 ns
5.54 ns
1.51
bm<uint16_t, Op::Both_val>/8021
117 ns
117 ns
1.00
bm<uint16_t, Op::Both_val>/31
9.27 ns
5.76 ns
1.61
bm<uint32_t, Op::Min>/8021
446 ns
432 ns
1.03
bm<uint32_t, Op::Min>/15
10.0 ns
8.80 ns
1.14
bm<uint32_t, Op::Max>/8021
446 ns
437 ns
1.02
bm<uint32_t, Op::Max>/15
10.1 ns
9.05 ns
1.12
bm<uint32_t, Op::Both>/8021
506 ns
511 ns
0.99
bm<uint32_t, Op::Both>/15
18.8 ns
17.1 ns
1.10
bm<uint32_t, Op::Min_val>/8021
223 ns
223 ns
1.00
bm<uint32_t, Op::Min_val>/15
6.39 ns
5.54 ns
1.15
bm<uint32_t, Op::Max_val>/8021
223 ns
223 ns
1.00
bm<uint32_t, Op::Max_val>/15
6.33 ns
5.12 ns
1.24
bm<uint32_t, Op::Both_val>/8021
439 ns
223 ns
1.97
bm<uint32_t, Op::Both_val>/15
6.64 ns
5.96 ns
1.11
bm<uint64_t, Op::Min>/8021
1142 ns
881 ns
1.30
bm<uint64_t, Op::Min>/7
10.0 ns
11.1 ns
0.90
bm<uint64_t, Op::Max>/8021
903 ns
879 ns
1.03
bm<uint64_t, Op::Max>/7
11.1 ns
10.6 ns
1.05
bm<uint64_t, Op::Both>/8021
1249 ns
1272 ns
0.98
bm<uint64_t, Op::Both>/7
20.0 ns
20.2 ns
0.99
bm<uint64_t, Op::Min_val>/8021
862 ns
857 ns
1.01
bm<uint64_t, Op::Min_val>/7
6.24 ns
6.00 ns
1.04
bm<uint64_t, Op::Max_val>/8021
864 ns
852 ns
1.01
bm<uint64_t, Op::Max_val>/7
6.05 ns
5.78 ns
1.05
bm<uint64_t, Op::Both_val>/8021
884 ns
873 ns
1.01
bm<uint64_t, Op::Both_val>/7
9.07 ns
8.78 ns
1.03
bm<int8_t, Op::Min>/8021
129 ns
123 ns
1.05
bm<int8_t, Op::Min>/63
30.5 ns
12.6 ns
2.42
bm<int8_t, Op::Max>/8021
129 ns
123 ns
1.05
bm<int8_t, Op::Max>/63
30.3 ns
12.6 ns
2.40
bm<int8_t, Op::Both>/8021
158 ns
148 ns
1.07
bm<int8_t, Op::Both>/63
41.3 ns
20.4 ns
2.02
bm<int8_t, Op::Min_val>/8021
71.7 ns
64.2 ns
1.12
bm<int8_t, Op::Min_val>/63
18.9 ns
6.61 ns
2.86
bm<int8_t, Op::Max_val>/8021
71.5 ns
64.1 ns
1.12
bm<int8_t, Op::Max_val>/63
19.1 ns
5.77 ns
3.31
bm<int8_t, Op::Both_val>/8021
77.5 ns
65.5 ns
1.18
bm<int8_t, Op::Both_val>/63
25.9 ns
7.10 ns
3.65
bm<int16_t, Op::Min>/8021
230 ns
230 ns
1.00
bm<int16_t, Op::Min>/31
15.6 ns
11.5 ns
1.36
bm<int16_t, Op::Max>/8021
225 ns
228 ns
0.99
bm<int16_t, Op::Max>/31
16.8 ns
11.3 ns
1.49
bm<int16_t, Op::Both>/8021
249 ns
268 ns
0.93
bm<int16_t, Op::Both>/31
25.9 ns
18.6 ns
1.39
bm<int16_t, Op::Min_val>/8021
116 ns
115 ns
1.01
bm<int16_t, Op::Min_val>/31
10.0 ns
5.14 ns
1.95
bm<int16_t, Op::Max_val>/8021
117 ns
220 ns
0.53
bm<int16_t, Op::Max_val>/31
8.62 ns
5.34 ns
1.61
bm<int16_t, Op::Both_val>/8021
119 ns
118 ns
1.01
bm<int16_t, Op::Both_val>/31
13.2 ns
5.99 ns
2.20
bm<int32_t, Op::Min>/8021
440 ns
438 ns
1.00
bm<int32_t, Op::Min>/15
9.93 ns
10.4 ns
0.95
bm<int32_t, Op::Max>/8021
441 ns
438 ns
1.01
bm<int32_t, Op::Max>/15
9.95 ns
10.4 ns
0.96
bm<int32_t, Op::Both>/8021
507 ns
514 ns
0.99
bm<int32_t, Op::Both>/15
18.4 ns
16.7 ns
1.10
bm<int32_t, Op::Min_val>/8021
224 ns
222 ns
1.01
bm<int32_t, Op::Min_val>/15
6.45 ns
5.53 ns
1.17
bm<int32_t, Op::Max_val>/8021
225 ns
222 ns
1.01
bm<int32_t, Op::Max_val>/15
7.92 ns
5.98 ns
1.32
bm<int32_t, Op::Both_val>/8021
437 ns
434 ns
1.01
bm<int32_t, Op::Both_val>/15
8.14 ns
5.54 ns
1.47
bm<int64_t, Op::Min>/8021
1135 ns
1096 ns
1.04
bm<int64_t, Op::Min>/7
13.2 ns
13.8 ns
0.96
bm<int64_t, Op::Max>/8021
1094 ns
1114 ns
0.98
bm<int64_t, Op::Max>/7
13.6 ns
14.1 ns
0.96
bm<int64_t, Op::Both>/8021
1218 ns
1272 ns
0.96
bm<int64_t, Op::Both>/7
19.6 ns
19.8 ns
0.99
bm<int64_t, Op::Min_val>/8021
851 ns
845 ns
1.01
bm<int64_t, Op::Min_val>/7
5.57 ns
5.32 ns
1.05
bm<int64_t, Op::Max_val>/8021
860 ns
846 ns
1.02
bm<int64_t, Op::Max_val>/7
5.59 ns
5.33 ns
1.05
bm<int64_t, Op::Both_val>/8021
873 ns
868 ns
1.01
bm<int64_t, Op::Both_val>/7
7.75 ns
7.91 ns
0.98
bm<float, Op::Min>/8021
443 ns
436 ns
1.02
bm<float, Op::Min>/15
9.72 ns
8.36 ns
1.16
bm<float, Op::Max>/8021
443 ns
436 ns
1.02
bm<float, Op::Max>/15
9.43 ns
8.33 ns
1.13
bm<float, Op::Both>/8021
521 ns
518 ns
1.01
bm<float, Op::Both>/15
16.6 ns
17.5 ns
0.95
bm<float, Op::Min_val>/8021
442 ns
437 ns
1.01
bm<float, Op::Min_val>/15
9.67 ns
8.35 ns
1.16
bm<float, Op::Max_val>/8021
439 ns
436 ns
1.01
bm<float, Op::Max_val>/15
9.23 ns
8.41 ns
1.10
bm<float, Op::Both_val>/8021
514 ns
524 ns
0.98
bm<float, Op::Both_val>/15
15.5 ns
16.4 ns
0.95
bm<double, Op::Min>/8021
874 ns
660 ns
1.32
bm<double, Op::Min>/7
8.43 ns
8.25 ns
1.02
bm<double, Op::Max>/8021
653 ns
654 ns
1.00
bm<double, Op::Max>/7
7.95 ns
8.20 ns
0.97
bm<double, Op::Both>/8021
1027 ns
1023 ns
1.00
bm<double, Op::Both>/7
16.4 ns
16.6 ns
0.99
bm<double, Op::Min_val>/8021
871 ns
654 ns
1.33
bm<double, Op::Min_val>/7
8.40 ns
8.55 ns
0.98
bm<double, Op::Max_val>/8021
653 ns
654 ns
1.00
bm<double, Op::Max_val>/7
8.03 ns
8.45 ns
0.95
bm<double, Op::Both_val>/8021
1024 ns
1016 ns
1.01
bm<double, Op::Both_val>/7
15.4 ns
15.7 ns
0.98
The only significant regression is for bm<int16_t, Op::Max_val>/8021, which was 117 ns before and 220 ns after, 0.53x speedup. I don't think this should block the PR, but I wanted to note it.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
π§ Overview
Use AVX2 mask to read tails for minmax/minmax element, then use the same masks to populate the tail with previous data, and to exclude tail indices for
_element
algorithm.β±οΈ Benchmark results
8 and 16 bit
Both_val
cases are expected to improve in a significant way too, but it is currently hidden by #4913