## ⚙️ The optimization

As I mentioned in #5346, both `std::search_n` and `ranges::search_n` step by n elements and avoid going back on good input (input with few potential matches), so for large n values vectorization wouldn't be an improvement.

Still, for small n, where the vector register holds more than n elements and the vector step is therefore bigger, it is possible to vectorize in a way that is faster even for an input with few matches. The more matches the input has, the bigger the advantage of such vectorization, as it does not need to go back.
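As a scalar illustration of the stepping strategy above (a sketch, not the actual STL code): probe every n-th element; a mismatch at the probe rules out any run of n covering it, so the probe advances by n without looking back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: scalar search_n that probes every n-th element.
std::ptrdiff_t skip_search_n(const std::vector<uint8_t>& v, std::ptrdiff_t n, uint8_t x) {
    const std::ptrdiff_t size = static_cast<std::ptrdiff_t>(v.size());
    std::ptrdiff_t i = n - 1; // probe where the last element of the first window would be
    while (i < size) {
        if (v[i] != x) { // no run of n can contain a mismatching element
            i += n;
            continue;
        }
        std::ptrdiff_t lo = i; // extend the match backward, at most n - 1 elements
        while (lo > i - n + 1 && v[lo - 1] == x) {
            --lo;
        }
        std::ptrdiff_t hi = i; // then forward, until the run is long enough
        while (hi - lo + 1 < n && hi + 1 < size && v[hi + 1] == x) {
            ++hi;
        }
        if (hi - lo + 1 >= n) {
            return lo; // start of the first run of n matches
        }
        i = hi + 1 + n; // v[hi + 1] mismatches, so the next run starts at hi + 2 at the earliest
    }
    return -1; // not found
}
```

On good input the inner scans almost never run, so the cost is roughly one comparison per n elements; on match-dense input the backward/forward scans are where the time goes, which is what the vectorization below attacks.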
The approach is to compare elements, get a bit mask, and look for a contiguous run of ones of the proper length. @Alcaro suggested:

> you can do things like
>
> ```
> match &= match>>1
> match &= match>>2
> match &= match>>3
> ```
>
> to find 7 consecutive 1s, but that probably does something ruinous to instruction parallelism and may cost more than it saves
This turns out to be efficient enough with AVX2 for n values up to half the AVX register size in elements. Although the cost of the ruined instruction parallelism does indeed seem high, I cannot find anything faster.
The shift values are computed from n. To save one variable (a general-purpose register), we rely on n=1 being handled separately and assume that at least one shift happens.
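The shift-and-AND reduction can be sketched in scalar form as follows (a model of the technique, not the actual STL code; the real implementation precomputes the shift sequence from n):

```cpp
#include <cassert>
#include <cstdint>

// Reduce a comparison bitmask so that a set bit survives only where a run of
// at least n consecutive ones begins. Each step ANDs the mask with itself
// shifted right by at most half the remaining length, so the certified run
// length grows until it reaches n. n >= 2 is assumed here, matching the PR's
// separate handling of n == 1.
uint64_t mask_for_runs(uint64_t match, unsigned n) {
    assert(n >= 2);
    while (n > 1) {
        const unsigned shift = n / 2; // shift <= remaining/2 keeps the runs contiguous
        match &= match >> shift;      // bit i now certifies a longer run starting at i
        n -= shift;
    }
    return match; // bit i set <=> the original bits i .. i+n-1 were all set
}
```

For n = 7 this performs shifts of 3, 2, 1 — the same total work as the 1, 2, 3 sequence quoted above, just ordered from the largest shift down.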
To deal with matches on a vector register boundary, the bitmask is concatenated with the previous one. An AVX bitmask is 32 bits for a 32-byte AVX value; doubled, it is 64 bits, which still fits an x64 register perfectly. The alternative to concatenation would be handling the boundary case with `lzcnt`/`tzcnt`, but that turned out to be no faster.
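A scalar model of this boundary handling (hypothetical helper; in the real code each 32-bit mask would come from an AVX2 compare plus movemask): the current block's mask is glued above the previous block's mask, so a run of n ≤ 32 matches that straddles a block boundary stays contiguous in the 64-bit window.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Each entry of block_masks is one 32-element block's match mask. Returns the
// element index of the first run of n set bits, or -1.
std::ptrdiff_t first_run_from_masks(const std::vector<uint32_t>& block_masks, unsigned n) {
    uint64_t window = 0;
    for (std::size_t b = 0; b < block_masks.size(); ++b) {
        // current mask in the high 32 bits, previous block's mask in the low 32
        window = (static_cast<uint64_t>(block_masks[b]) << 32) | (window >> 32);
        uint64_t runs = window;
        for (unsigned m = n; m > 1;) { // shift reduction: surviving bits mark run starts
            const unsigned s = m / 2;
            runs &= runs >> s;
            m -= s;
        }
        if (runs != 0) {
            int bit = 0; // index of the lowest surviving bit
            while (((runs >> bit) & 1) == 0) {
                ++bit;
            }
            // the low 32 bits of the window belong to the previous block
            return static_cast<std::ptrdiff_t>(b) * 32 + bit - 32;
        }
    }
    return -1; // no run of n found
}
```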
The fallback is used for tails and for too-large n values. For tails it uses `lzcnt` with an inverted carry value, to transition smoothly from a potential partial match in the vector part to the scalar part. The fallback recreates the `ranges::search_n` from `<algorithm>`, with slight variation.
## 🔥 Down-level architectures support
An SSE4.2 version is implementable, both in the sense of backporting the current approach to SSE and of using `pcmpestri`. I'd expect either to be advantageous for n values up to half the SSE register size. It just doesn't feel worth trying.
The x86 version works the same way as x64. However, unlike many other vectorized algorithms, this one relies heavily on general-purpose 64-bit integer ops. To mitigate the impact, `__ull_rshift` is used instead of a plain shift. Using this intrinsic doesn't affect 64-bit code, but makes 32-bit codegen better (at the expense of not handling huge shifts, which we don't need anyway). The shift values are of type `int` to match the intrinsic's parameter type.
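A hedged sketch of this helper choice (the wrapper name is mine, not from the PR): on MSVC, `__ull_rshift` from `<intrin.h>` takes an `int` count and promises it is small, which lets 32-bit codegen avoid the general out-of-range handling of a 64-bit shift; elsewhere, and on 64-bit targets, a plain shift is equivalent for counts below 64.

```cpp
#include <cassert>
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h> // __ull_rshift is an MSVC intrinsic
#endif

// 64-bit right shift; `shift` must be in [0, 63].
inline uint64_t shift_right(uint64_t mask, int shift) {
#ifdef _MSC_VER
    return __ull_rshift(mask, shift); // better 32-bit MSVC codegen, no huge-shift handling
#else
    return mask >> shift;             // equivalent for in-range counts
#endif
}
```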
Still, the efficiency on x86 is questionable (see benchmark results below). Apart from each shift taking multiple instructions, this is apparently due to the deficit of general-purpose registers. The compiler isn't being helpful here either; some register spills look superfluous.
For 32-bit and 64-bit elements, it is possible to use a floating-point bit mask instead of an integer bit mask, as in #4987/#5092. This would save bit width. But apart from the mysterious "bypass delay" (a potential penalty for mixing integer and float instructions), it would also make the bit magic more complicated and more dependent on element width, and it still wouldn't reduce the bit width for 8-bit and 16-bit elements, so it doesn't seem worth doing.
We could just skip x86. But we have no precedent of providing vectorization for x64 while omitting it for x86, so I didn't want to introduce one.
## 1️⃣ Special n=1 case
We need to handle this case as plain `find` vectorization: `find` vectorization is more efficient than this one, plus the assumption that at least one shift happens saves a variable/register.
The question is where we should handle this:
- Only in separately compiled code
- Only in headers
- Both in headers and in separately compiled code
The latter two are indistinguishable in practice, so the real question is whether we should keep the #5346 optimization (dispatching to vectorized `find` for `search_n` when n=1) in the headers. By keeping the n=1 case in headers we get:
- Some non-vectorization optimization for non-vector element types (I believe it is noticeable, though not a multiple-times speedup)
- Some auto-vectorization from Clang, and probably MSVC in the future (Clang recognizes the `find` pattern)
- `memchr` for the corresponding type when vectorization mode is disabled
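The n=1 equivalence that makes this dispatch valid: `search_n` with count 1 returns the first element equal to the value, which is exactly `find`.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// search_n(first, last, 1, x) and find(first, last, x) return the same
// iterator, so the headers can route the n == 1 case to the (vectorized) find.
bool n1_matches_find(const std::vector<int>& v, int x) {
    return std::search_n(v.begin(), v.end(), 1, x) == std::find(v.begin(), v.end(), x);
}
```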
## ✅ Test coverage
To cover the variety of possibilities, the randomized test should try different input lengths, different n values, and different actual match lengths (including too-long matches, too-short matches, and different gaps between matches). This requires a long run time, so it deserves a dedicated test.
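The shape of such a randomized test can be sketched as follows (names and distribution bounds are mine, not the PR's): build inputs from runs of varying length, some shorter and some longer than n, separated by single mismatches, and check `std::search_n` against a naive reference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Naive reference: index of the first run of n elements equal to x, or -1.
std::ptrdiff_t naive_search_n(const std::vector<unsigned char>& v, std::size_t n, unsigned char x) {
    std::size_t run = 0;
    for (std::size_t i = 0; i < v.size(); ++i) {
        run = (v[i] == x) ? run + 1 : 0;
        if (run == n) {
            return static_cast<std::ptrdiff_t>(i - n + 1);
        }
    }
    return -1;
}

bool run_random_trials(unsigned seed, int trials) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> len_dist(0, 200), n_dist(1, 20), run_dist(0, 25);
    for (int t = 0; t < trials; ++t) {
        std::vector<unsigned char> v;
        const std::size_t len = len_dist(gen);
        while (v.size() < len) { // runs of 'x' (some >= n, some < n) split by single '-'
            const std::size_t run = run_dist(gen);
            for (std::size_t i = 0; i < run && v.size() < len; ++i) {
                v.push_back('x');
            }
            if (v.size() < len) {
                v.push_back('-');
            }
        }
        const std::size_t n = n_dist(gen);
        const auto it = std::search_n(v.begin(), v.end(), static_cast<std::ptrdiff_t>(n), 'x');
        const std::ptrdiff_t got = (it == v.end()) ? -1 : (it - v.begin());
        if (got != naive_search_n(v, n, 'x')) {
            return false;
        }
    }
    return true;
}
```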
The test coverage is useful beyond vectorization: it also compensates for the missing non-vectorization coverage requested in #933.
This PR still doesn't fully address #933 as requested because:
- It does not cover the forward-only iterator branches
- It does not have features like nice error-case printing or accepting a seed parameter

I'm not sure how necessary these features are, though. If they are required, further work to complete #933 would certainly need a different PR.
## 📊 Benchmarks
In addition to the `TwoZones` case inherited from #5346, the benchmark has `DenseSmallSequences`. These two are close to the normal case and the worst case, respectively.

`TwoZones` (Zones in the table below) has half of the range filled with the mismatch character and half with the match character, so the search should quickly proceed to the match part and then confirm the first match, which succeeds.

`DenseSmallSequences` (Dense in the table below) has too-short matches of random length from 0 to n-1, each interrupted by a single mismatch character.

The vectorization improvement is bigger for `DenseSmallSequences`, but we should probably care about `TwoZones` somewhat more. If the worst case is a priority, we can double the n threshold for the vectorization.
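The two input shapes can be modeled like this (function names and fill characters are mine; the benchmark's actual generators may differ):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// TwoZones: a mismatch half followed by a match half.
std::vector<unsigned char> two_zones(std::size_t size) {
    std::vector<unsigned char> v(size, '-');
    std::fill(v.begin() + static_cast<std::ptrdiff_t>(size / 2), v.end(), 'x');
    return v;
}

// DenseSmallSequences: runs of length 0 .. n-1, each cut off by a single
// mismatch, so no run is ever long enough to match.
std::vector<unsigned char> dense_small_sequences(std::size_t size, std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> run_dist(0, n - 1); // always too short
    std::vector<unsigned char> v;
    while (v.size() < size) {
        const std::size_t run = run_dist(gen);
        for (std::size_t i = 0; i < run && v.size() < size; ++i) {
            v.push_back('x');
        }
        if (v.size() < size) {
            v.push_back('-'); // the single interrupting mismatch
        }
    }
    return v;
}
```

On `two_zones` input, `search_n` skips quickly through the mismatch half and succeeds at the midpoint; on `dense_small_sequences` every candidate run fails just short of n, which is what stresses the back-tracking of the scalar algorithm.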
## ⏱️ Benchmark results
Click to expand:
| Benchmark | V. alg? | x64 Before | x64 After | x64 🏎️ | x86 Before | x86 After | x86 🏎️ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| u8/Std/Zones/3000/40 | no | 45.4 ns | 46.7 ns | 0.97 | 42.4 ns | 66.8 ns | 0.63 |
| u8/Std/Zones/3000/18 | no | 63.0 ns | 61.8 ns | 1.02 | 62.1 ns | 83.2 ns | 0.75 |
| u8/Std/Zones/3000/16 | yes | 66.0 ns | 69.3 ns | 0.95 | 77.8 ns | 128 ns | 0.61 |
| u8/Std/Zones/3000/14 | yes | 68.7 ns | 69.0 ns | 1.00 | 71.7 ns | 129 ns | 0.56 |
| u8/Std/Zones/3000/10 | yes | 85.8 ns | 72.4 ns | 1.19 | 93.9 ns | 130 ns | 0.72 |
| u8/Std/Zones/3000/8 | yes | 103 ns | 69.4 ns | 1.48 | 113 ns | 128 ns | 0.88 |
| u8/Std/Zones/3000/5 | yes | 157 ns | 74.2 ns | 2.12 | 171 ns | 128 ns | 1.34 |
| u8/Std/Zones/3000/4 | yes | 189 ns | 72.4 ns | 2.61 | 210 ns | 125 ns | 1.68 |
| u8/Std/Zones/3000/3 | yes | 250 ns | 71.6 ns | 3.49 | 272 ns | 132 ns | 2.06 |
| u8/Std/Zones/3000/2 | yes | 368 ns | 72.5 ns | 5.08 | 402 ns | 130 ns | 3.09 |
| u8/Std/Zones/3000/1 | find | 18.0 ns | 18.2 ns | 0.99 | 18.3 ns | 21.8 ns | 0.84 |
| u8/Rng/Zones/3000/40 | no | 47.7 ns | 45.8 ns | 1.04 | 52.4 ns | 66.9 ns | 0.78 |
| u8/Rng/Zones/3000/18 | no | 78.2 ns | 60.8 ns | 1.29 | 79.7 ns | 83.7 ns | 0.95 |
| u8/Rng/Zones/3000/16 | yes | 84.9 ns | 71.1 ns | 1.19 | 85.6 ns | 129 ns | 0.66 |
| u8/Rng/Zones/3000/14 | yes | 90.3 ns | 71.4 ns | 1.26 | 93.7 ns | 128 ns | 0.73 |
| u8/Rng/Zones/3000/10 | yes | 118 ns | 72.3 ns | 1.63 | 118 ns | 128 ns | 0.92 |
| u8/Rng/Zones/3000/8 | yes | 141 ns | 71.7 ns | 1.97 | 144 ns | 128 ns | 1.13 |
| u8/Rng/Zones/3000/5 | yes | 215 ns | 75.5 ns | 2.85 | 212 ns | 125 ns | 1.70 |
| u8/Rng/Zones/3000/4 | yes | 303 ns | 72.9 ns | 4.16 | 265 ns | 129 ns | 2.05 |
| u8/Rng/Zones/3000/3 | yes | 346 ns | 73.8 ns | 4.69 | 344 ns | 130 ns | 2.65 |
| u8/Rng/Zones/3000/2 | yes | 509 ns | 74.8 ns | 6.80 | 506 ns | 129 ns | 3.92 |
| u8/Rng/Zones/3000/1 | find | 18.2 ns | 18.4 ns | 0.99 | 18.5 ns | 18.7 ns | 0.99 |
| u8/Std/Dense/3000/40 | no | 818 ns | 381 ns | 2.15 | 823 ns | 654 ns | 1.26 |
| u8/Std/Dense/3000/18 | no | 1006 ns | 501 ns | 2.01 | 1036 ns | 774 ns | 1.34 |
| u8/Std/Dense/3000/16 | yes | 985 ns | 135 ns | 7.30 | 1022 ns | 236 ns | 4.33 |
| u8/Std/Dense/3000/14 | yes | 987 ns | 136 ns | 7.26 | 1004 ns | 244 ns | 4.11 |
| u8/Std/Dense/3000/10 | yes | 1071 ns | 144 ns | 7.44 | 1094 ns | 245 ns | 4.47 |
| u8/Std/Dense/3000/8 | yes | 1140 ns | 138 ns | 8.26 | 1239 ns | 246 ns | 5.04 |
| u8/Std/Dense/3000/5 | yes | 1301 ns | 147 ns | 8.85 | 1356 ns | 279 ns | 4.86 |
| u8/Std/Dense/3000/4 | yes | 1303 ns | 147 ns | 8.86 | 1418 ns | 243 ns | 5.84 |
| u8/Std/Dense/3000/3 | yes | 1300 ns | 147 ns | 8.84 | 1460 ns | 248 ns | 5.89 |
| u8/Std/Dense/3000/2 | yes | 1191 ns | 149 ns | 7.99 | 1363 ns | 244 ns | 5.59 |
| u8/Std/Dense/3000/1 | find | 49.4 ns | 47.0 ns | 1.05 | 48.3 ns | 49.6 ns | 0.97 |
| u8/Rng/Dense/3000/40 | no | 830 ns | 382 ns | 2.17 | 584 ns | 653 ns | 0.89 |
| u8/Rng/Dense/3000/18 | no | 813 ns | 506 ns | 1.61 | 622 ns | 768 ns | 0.81 |
| u8/Rng/Dense/3000/16 | yes | 853 ns | 143 ns | 5.97 | 660 ns | 237 ns | 2.78 |
| u8/Rng/Dense/3000/14 | yes | 843 ns | 137 ns | 6.15 | 665 ns | 241 ns | 2.76 |
| u8/Rng/Dense/3000/10 | yes | 875 ns | 138 ns | 6.34 | 707 ns | 243 ns | 2.91 |
| u8/Rng/Dense/3000/8 | yes | 936 ns | 139 ns | 6.73 | 771 ns | 240 ns | 3.21 |
| u8/Rng/Dense/3000/5 | yes | 1057 ns | 148 ns | 7.14 | 858 ns | 240 ns | 3.58 |
| u8/Rng/Dense/3000/4 | yes | 1155 ns | 148 ns | 7.80 | 876 ns | 248 ns | 3.53 |
| u8/Rng/Dense/3000/3 | yes | 1240 ns | 147 ns | 8.44 | 889 ns | 252 ns | 3.53 |
| u8/Rng/Dense/3000/2 | yes | 1096 ns | 149 ns | 7.36 | 1074 ns | 251 ns | 4.28 |
| u8/Rng/Dense/3000/1 | find | 51.6 ns | 49.4 ns | 1.04 | 48.9 ns | 48.6 ns | 1.01 |
| u16/Std/Zones/3000/40 | no | 41.2 ns | 50.2 ns | 0.82 | 46.1 ns | 55.3 ns | 0.83 |
| u16/Std/Zones/3000/18 | no | 66.3 ns | 69.2 ns | 0.96 | 68.1 ns | 76.4 ns | 0.89 |
| u16/Std/Zones/3000/16 | no | 71.0 ns | 75.8 ns | 0.94 | 75.3 ns | 83.0 ns | 0.91 |
| u16/Std/Zones/3000/14 | no | 77.3 ns | 83.5 ns | 0.93 | 79.9 ns | 92.0 ns | 0.87 |
| u16/Std/Zones/3000/10 | no | 97.1 ns | 105 ns | 0.92 | 103 ns | 116 ns | 0.89 |
| u16/Std/Zones/3000/8 | yes | 117 ns | 107 ns | 1.09 | 126 ns | 175 ns | 0.72 |
| u16/Std/Zones/3000/5 | yes | 166 ns | 107 ns | 1.55 | 195 ns | 174 ns | 1.12 |
| u16/Std/Zones/3000/4 | yes | 194 ns | 107 ns | 1.81 | 231 ns | 173 ns | 1.34 |
| u16/Std/Zones/3000/3 | yes | 270 ns | 117 ns | 2.31 | 309 ns | 177 ns | 1.75 |
| u16/Std/Zones/3000/2 | yes | 385 ns | 118 ns | 3.26 | 438 ns | 172 ns | 2.55 |
| u16/Std/Zones/3000/1 | find | 48.2 ns | 48.9 ns | 0.99 | 37.5 ns | 52.1 ns | 0.72 |
| u16/Rng/Zones/3000/40 | no | 49.1 ns | 49.7 ns | 0.99 | 50.1 ns | 55.0 ns | 0.91 |
| u16/Rng/Zones/3000/18 | no | 85.7 ns | 70.1 ns | 1.22 | 107 ns | 76.6 ns | 1.40 |
| u16/Rng/Zones/3000/16 | no | 95.8 ns | 81.1 ns | 1.18 | 117 ns | 83.9 ns | 1.39 |
| u16/Rng/Zones/3000/14 | no | 108 ns | 84.1 ns | 1.28 | 128 ns | 91.5 ns | 1.40 |
| u16/Rng/Zones/3000/10 | no | 156 ns | 103 ns | 1.51 | 168 ns | 115 ns | 1.46 |
| u16/Rng/Zones/3000/8 | yes | 185 ns | 108 ns | 1.71 | 202 ns | 172 ns | 1.17 |
| u16/Rng/Zones/3000/5 | yes | 304 ns | 108 ns | 2.81 | 313 ns | 171 ns | 1.83 |
| u16/Rng/Zones/3000/4 | yes | 377 ns | 106 ns | 3.56 | 394 ns | 174 ns | 2.26 |
| u16/Rng/Zones/3000/3 | yes | 500 ns | 118 ns | 4.24 | 518 ns | 172 ns | 3.01 |
| u16/Rng/Zones/3000/2 | yes | 734 ns | 118 ns | 6.22 | 747 ns | 173 ns | 4.32 |
| u16/Rng/Zones/3000/1 | find | 47.3 ns | 48.9 ns | 0.97 | 37.8 ns | 51.1 ns | 0.74 |
| u16/Std/Dense/3000/40 | no | 827 ns | 385 ns | 2.15 | 854 ns | 422 ns | 2.02 |
| u16/Std/Dense/3000/18 | no | 964 ns | 499 ns | 1.93 | 994 ns | 554 ns | 1.79 |
| u16/Std/Dense/3000/16 | no | 1019 ns | 528 ns | 1.93 | 1010 ns | 568 ns | 1.78 |
| u16/Std/Dense/3000/14 | no | 1043 ns | 577 ns | 1.81 | 1064 ns | 585 ns | 1.82 |
| u16/Std/Dense/3000/10 | no | 1202 ns | 695 ns | 1.73 | 1186 ns | 708 ns | 1.68 |
| u16/Std/Dense/3000/8 | yes | 1308 ns | 211 ns | 6.20 | 1268 ns | 337 ns | 3.76 |
| u16/Std/Dense/3000/5 | yes | 1514 ns | 210 ns | 7.21 | 1490 ns | 340 ns | 4.38 |
| u16/Std/Dense/3000/4 | yes | 1494 ns | 211 ns | 7.08 | 1458 ns | 355 ns | 4.11 |
| u16/Std/Dense/3000/3 | yes | 1438 ns | 232 ns | 6.20 | 1398 ns | 365 ns | 3.83 |
| u16/Std/Dense/3000/2 | yes | 1136 ns | 232 ns | 4.90 | 1271 ns | 346 ns | 3.67 |
| u16/Std/Dense/3000/1 | find | 74.3 ns | 76.0 ns | 0.98 | 87.2 ns | 89.2 ns | 0.98 |
| u16/Rng/Dense/3000/40 | no | 423 ns | 388 ns | 1.09 | 526 ns | 524 ns | 1.00 |
| u16/Rng/Dense/3000/18 | no | 549 ns | 506 ns | 1.08 | 643 ns | 702 ns | 0.92 |
| u16/Rng/Dense/3000/16 | no | 576 ns | 528 ns | 1.09 | 702 ns | 619 ns | 1.13 |
| u16/Rng/Dense/3000/14 | no | 599 ns | 576 ns | 1.04 | 701 ns | 634 ns | 1.11 |
| u16/Rng/Dense/3000/10 | no | 699 ns | 702 ns | 1.00 | 779 ns | 764 ns | 1.02 |
| u16/Rng/Dense/3000/8 | yes | 769 ns | 216 ns | 3.56 | 894 ns | 347 ns | 2.58 |
| u16/Rng/Dense/3000/5 | yes | 874 ns | 211 ns | 4.14 | 1002 ns | 341 ns | 2.94 |
| u16/Rng/Dense/3000/4 | yes | 1023 ns | 210 ns | 4.87 | 1110 ns | 339 ns | 3.27 |
| u16/Rng/Dense/3000/3 | yes | 1320 ns | 232 ns | 5.69 | 1260 ns | 344 ns | 3.66 |
| u16/Rng/Dense/3000/2 | yes | 1823 ns | 233 ns | 7.82 | 1769 ns | 344 ns | 5.14 |
| u16/Rng/Dense/3000/1 | find | 75.7 ns | 73.7 ns | 1.03 | 83.0 ns | 90.3 ns | 0.92 |
| u32/Std/Zones/3000/40 | no | 44.4 ns | 43.7 ns | 1.02 | 45.3 ns | 58.1 ns | 0.78 |
| u32/Std/Zones/3000/18 | no | 61.9 ns | 64.2 ns | 0.96 | 71.3 ns | 79.4 ns | 0.90 |
| u32/Std/Zones/3000/16 | no | 64.6 ns | 69.5 ns | 0.93 | 74.8 ns | 84.0 ns | 0.89 |
| u32/Std/Zones/3000/14 | no | 72.6 ns | 76.1 ns | 0.95 | 81.1 ns | 103 ns | 0.79 |
| u32/Std/Zones/3000/10 | no | 90.9 ns | 96.3 ns | 0.94 | 103 ns | 129 ns | 0.80 |
| u32/Std/Zones/3000/8 | no | 113 ns | 116 ns | 0.97 | 126 ns | 153 ns | 0.82 |
| u32/Std/Zones/3000/5 | no | 167 ns | 176 ns | 0.95 | 186 ns | 237 ns | 0.78 |
| u32/Std/Zones/3000/4 | yes | 196 ns | 162 ns | 1.21 | 228 ns | 230 ns | 0.99 |
| u32/Std/Zones/3000/3 | yes | 262 ns | 162 ns | 1.62 | 302 ns | 232 ns | 1.30 |
| u32/Std/Zones/3000/2 | yes | 393 ns | 163 ns | 2.41 | 440 ns | 229 ns | 1.92 |
| u32/Std/Zones/3000/1 | find | 80.1 ns | 80.3 ns | 1.00 | 70.5 ns | 75.8 ns | 0.93 |
| u32/Rng/Zones/3000/40 | no | 49.2 ns | 42.4 ns | 1.16 | 52.4 ns | 53.4 ns | 0.98 |
| u32/Rng/Zones/3000/18 | no | 101 ns | 59.0 ns | 1.71 | 100 ns | 77.2 ns | 1.30 |
| u32/Rng/Zones/3000/16 | no | 110 ns | 68.7 ns | 1.60 | 110 ns | 82.0 ns | 1.34 |
| u32/Rng/Zones/3000/14 | no | 125 ns | 75.9 ns | 1.65 | 122 ns | 101 ns | 1.21 |
| u32/Rng/Zones/3000/10 | no | 159 ns | 95.9 ns | 1.66 | 162 ns | 127 ns | 1.28 |
| u32/Rng/Zones/3000/8 | no | 194 ns | 118 ns | 1.64 | 198 ns | 154 ns | 1.29 |
| u32/Rng/Zones/3000/5 | no | 302 ns | 175 ns | 1.73 | 313 ns | 232 ns | 1.35 |
| u32/Rng/Zones/3000/4 | yes | 374 ns | 163 ns | 2.29 | 381 ns | 231 ns | 1.65 |
| u32/Rng/Zones/3000/3 | yes | 494 ns | 163 ns | 3.03 | 511 ns | 232 ns | 2.20 |
| u32/Rng/Zones/3000/2 | yes | 732 ns | 162 ns | 4.52 | 756 ns | 233 ns | 3.24 |
| u32/Rng/Zones/3000/1 | find | 80.4 ns | 80.3 ns | 1.00 | 70.6 ns | 74.3 ns | 0.95 |
| u32/Std/Dense/3000/40 | no | 921 ns | 360 ns | 2.56 | 821 ns | 534 ns | 1.54 |
| u32/Std/Dense/3000/18 | no | 1171 ns | 455 ns | 2.57 | 995 ns | 593 ns | 1.68 |
| u32/Std/Dense/3000/16 | no | 1187 ns | 475 ns | 2.50 | 978 ns | 624 ns | 1.57 |
| u32/Std/Dense/3000/14 | no | 1212 ns | 509 ns | 2.38 | 1000 ns | 659 ns | 1.52 |
| u32/Std/Dense/3000/10 | no | 1337 ns | 605 ns | 2.21 | 1059 ns | 865 ns | 1.22 |
| u32/Std/Dense/3000/8 | no | 1463 ns | 689 ns | 2.12 | 1119 ns | 952 ns | 1.18 |
| u32/Std/Dense/3000/5 | no | 1547 ns | 849 ns | 1.82 | 1268 ns | 1131 ns | 1.12 |
| u32/Std/Dense/3000/4 | yes | 1460 ns | 334 ns | 4.37 | 1332 ns | 470 ns | 2.83 |
| u32/Std/Dense/3000/3 | yes | 1442 ns | 333 ns | 4.33 | 1406 ns | 475 ns | 2.96 |
| u32/Std/Dense/3000/2 | yes | 1149 ns | 333 ns | 3.45 | 1211 ns | 477 ns | 2.54 |
| u32/Std/Dense/3000/1 | find | 163 ns | 158 ns | 1.03 | 151 ns | 153 ns | 0.99 |
| u32/Rng/Dense/3000/40 | no | 496 ns | 357 ns | 1.39 | 553 ns | 537 ns | 1.03 |
| u32/Rng/Dense/3000/18 | no | 600 ns | 458 ns | 1.31 | 672 ns | 585 ns | 1.15 |
| u32/Rng/Dense/3000/16 | no | 638 ns | 473 ns | 1.35 | 665 ns | 623 ns | 1.07 |
| u32/Rng/Dense/3000/14 | no | 665 ns | 511 ns | 1.30 | 687 ns | 664 ns | 1.03 |
| u32/Rng/Dense/3000/10 | no | 777 ns | 613 ns | 1.27 | 784 ns | 856 ns | 0.92 |
| u32/Rng/Dense/3000/8 | no | 857 ns | 688 ns | 1.25 | 873 ns | 961 ns | 0.91 |
| u32/Rng/Dense/3000/5 | no | 991 ns | 852 ns | 1.16 | 1013 ns | 1127 ns | 0.90 |
| u32/Rng/Dense/3000/4 | yes | 1105 ns | 343 ns | 3.22 | 1050 ns | 470 ns | 2.23 |
| u32/Rng/Dense/3000/3 | yes | 1258 ns | 337 ns | 3.73 | 1275 ns | 474 ns | 2.69 |
| u32/Rng/Dense/3000/2 | yes | 1751 ns | 337 ns | 5.20 | 1863 ns | 470 ns | 3.96 |
| u32/Rng/Dense/3000/1 | find | 160 ns | 159 ns | 1.01 | 157 ns | 152 ns | 1.03 |
| u64/Std/Zones/3000/40 | no | 40.9 ns | 50.5 ns | 0.81 | 48.3 ns | 54.5 ns | 0.89 |
| u64/Std/Zones/3000/18 | no | 58.2 ns | 74.7 ns | 0.78 | 68.8 ns | 76.3 ns | 0.90 |
| u64/Std/Zones/3000/16 | no | 63.3 ns | 82.8 ns | 0.76 | 73.9 ns | 83.7 ns | 0.88 |
| u64/Std/Zones/3000/14 | no | 68.0 ns | 90.9 ns | 0.75 | 79.9 ns | 92.9 ns | 0.86 |
| u64/Std/Zones/3000/10 | no | 87.0 ns | 114 ns | 0.76 | 102 ns | 118 ns | 0.86 |
| u64/Std/Zones/3000/8 | no | 106 ns | 140 ns | 0.76 | 124 ns | 143 ns | 0.87 |
| u64/Std/Zones/3000/5 | no | 168 ns | 209 ns | 0.80 | 187 ns | 223 ns | 0.84 |
| u64/Std/Zones/3000/4 | no | 192 ns | 259 ns | 0.74 | 230 ns | 292 ns | 0.79 |
| u64/Std/Zones/3000/3 | no | 266 ns | 332 ns | 0.80 | 295 ns | 354 ns | 0.83 |
| u64/Std/Zones/3000/2 | yes | 372 ns | 248 ns | 1.50 | 434 ns | 395 ns | 1.10 |
| u64/Std/Zones/3000/1 | find | 152 ns | 151 ns | 1.01 | 157 ns | 158 ns | 0.99 |
| u64/Rng/Zones/3000/40 | no | 66.2 ns | 49.5 ns | 1.34 | 59.0 ns | 55.9 ns | 1.06 |
| u64/Rng/Zones/3000/18 | no | 105 ns | 73.8 ns | 1.42 | 101 ns | 74.9 ns | 1.35 |
| u64/Rng/Zones/3000/16 | no | 117 ns | 81.2 ns | 1.44 | 111 ns | 83.4 ns | 1.33 |
| u64/Rng/Zones/3000/14 | no | 130 ns | 90.4 ns | 1.44 | 126 ns | 91.9 ns | 1.37 |
| u64/Rng/Zones/3000/10 | no | 171 ns | 112 ns | 1.53 | 161 ns | 118 ns | 1.36 |
| u64/Rng/Zones/3000/8 | no | 209 ns | 137 ns | 1.53 | 201 ns | 141 ns | 1.43 |
| u64/Rng/Zones/3000/5 | no | 325 ns | 204 ns | 1.59 | 312 ns | 211 ns | 1.48 |
| u64/Rng/Zones/3000/4 | no | 402 ns | 251 ns | 1.60 | 381 ns | 262 ns | 1.45 |
| u64/Rng/Zones/3000/3 | no | 531 ns | 333 ns | 1.59 | 501 ns | 346 ns | 1.45 |
| u64/Rng/Zones/3000/2 | yes | 796 ns | 242 ns | 3.29 | 746 ns | 357 ns | 2.09 |
| u64/Rng/Zones/3000/1 | find | 149 ns | 150 ns | 0.99 | 152 ns | 150 ns | 1.01 |
| u64/Std/Dense/3000/40 | no | 936 ns | 384 ns | 2.44 | 1172 ns | 578 ns | 2.03 |
| u64/Std/Dense/3000/18 | no | 1122 ns | 545 ns | 2.06 | 1244 ns | 601 ns | 2.07 |
| u64/Std/Dense/3000/16 | no | 1177 ns | 545 ns | 2.16 | 1239 ns | 622 ns | 1.99 |
| u64/Std/Dense/3000/14 | no | 1208 ns | 597 ns | 2.02 | 1225 ns | 632 ns | 1.94 |
| u64/Std/Dense/3000/10 | no | 1320 ns | 766 ns | 1.72 | 1293 ns | 666 ns | 1.94 |
| u64/Std/Dense/3000/8 | no | 1426 ns | 910 ns | 1.57 | 1395 ns | 735 ns | 1.90 |
| u64/Std/Dense/3000/5 | no | 1582 ns | 1075 ns | 1.47 | 1606 ns | 966 ns | 1.66 |
| u64/Std/Dense/3000/4 | no | 1488 ns | 1219 ns | 1.22 | 1673 ns | 1095 ns | 1.53 |
| u64/Std/Dense/3000/3 | no | 1506 ns | 1296 ns | 1.16 | 1702 ns | 1126 ns | 1.51 |
| u64/Std/Dense/3000/2 | yes | 1464 ns | 470 ns | 3.11 | 1457 ns | 689 ns | 2.11 |
| u64/Std/Dense/3000/1 | find | 285 ns | 303 ns | 0.94 | 288 ns | 291 ns | 0.99 |
| u64/Rng/Dense/3000/40 | no | 576 ns | 384 ns | 1.50 | 483 ns | 578 ns | 0.84 |
| u64/Rng/Dense/3000/18 | no | 876 ns | 546 ns | 1.60 | 521 ns | 590 ns | 0.88 |
| u64/Rng/Dense/3000/16 | no | 866 ns | 540 ns | 1.60 | 546 ns | 624 ns | 0.88 |
| u64/Rng/Dense/3000/14 | no | 883 ns | 593 ns | 1.49 | 559 ns | 625 ns | 0.89 |
| u64/Rng/Dense/3000/10 | no | 944 ns | 773 ns | 1.22 | 632 ns | 674 ns | 0.94 |
| u64/Rng/Dense/3000/8 | no | 1052 ns | 914 ns | 1.15 | 743 ns | 731 ns | 1.02 |
| u64/Rng/Dense/3000/5 | no | 1071 ns | 1079 ns | 0.99 | 970 ns | 950 ns | 1.02 |
| u64/Rng/Dense/3000/4 | no | 1136 ns | 1224 ns | 0.93 | 1123 ns | 1096 ns | 1.02 |
| u64/Rng/Dense/3000/3 | no | 1303 ns | 1282 ns | 1.02 | 1317 ns | 1130 ns | 1.17 |
| u64/Rng/Dense/3000/2 | yes | 1761 ns | 474 ns | 3.72 | 1801 ns | 691 ns | 2.61 |
| u64/Rng/Dense/3000/1 | find | 286 ns | 302 ns | 0.95 | 290 ns | 296 ns | 0.98 |
## 💥 Results interpretation
For x64, for the vectorized n values there is a definite improvement for Zones; for Dense the improvement is even greater.

The non-vectorized cases vary a lot. The fallback happens to be faster than the header implementation often, but not always. Of the header implementations, surprisingly, the ranges one is slower in the Zones case.

The x86 results are not very good, but not too bad either.

The table contains a lot of rows, but I don't see a reasonable way to reduce it without losing important information.
---

Thanks! I think we should keep handling n=1 in the headers. Having a separate check in the separately compiled code is fine. If you want the check only in the headers, then a comment in the separately compiled code noting that we assume the check has already been done would be a good idea.