Substring search is one of the most common operations in text processing, and one of the slowest.
StringZilla was designed to supersede LibC and implement those core operations in a CPU-friendly manner, using branchless operations, SWAR, and SIMD assembly instructions.
Notably, Rust has a memchr crate that provides similar functionality, and it's used in many popular libraries.
This repository provides basic benchmarking scripts for comparing the throughput of stringzilla and memchr.
For forward and reverse order search, over ASCII and UTF8 input data, the following numbers can be expected.
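|             | ASCII ⏩    | ASCII ⏪    | UTF8 ⏩     | UTF8 ⏪     |
| ----------- | ----------- | ----------- | ----------- | ----------- |
| Intel:      |             |             |             |             |
| memchr      | 5.89 GB/s   | 1.08 GB/s   | 8.73 GB/s   | 3.35 GB/s   |
| stringzilla | 8.37 GB/s   | 8.21 GB/s   | 11.21 GB/s  | 11.20 GB/s  |
| Arm:        |             |             |             |             |
| memchr      | 6.38 GB/s   | 1.12 GB/s   | 13.20 GB/s  | 3.56 GB/s   |
| stringzilla | 6.56 GB/s   | 5.56 GB/s   | 9.41 GB/s   | 8.17 GB/s   |
| Average     | 1.2x faster | 6.2x faster | -           | 2.8x faster |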
For Intel the benchmark was run on AWS r7iz instances with Sapphire Rapids cores.
For Arm the benchmark was run on AWS r7g instances with Graviton 3 cores.
The ⏩ signifies forward search, and ⏪ signifies reverse order search.
At the time of writing, the latest versions of memchr and stringzilla were used: 2.7.1 and 3.3.0, respectively.
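The "Average" row can be roughly reproduced from the per-platform numbers. The sketch below assumes the row is the arithmetic mean of the stringzilla-to-memchr throughput ratios on Intel and Arm; that averaging rule is my assumption, not something stated above, and `speedup` and `mean_speedup` are hypothetical helper names. With the rounded figures from the table it lands close to the published values, though the last digit can differ, presumably because the original used unrounded measurements.

```rust
/// Ratio of stringzilla throughput to memchr throughput for one platform.
fn speedup(memchr: f64, stringzilla: f64) -> f64 {
    stringzilla / memchr
}

/// Arithmetic mean of per-platform speedups.
/// Assumption: this is how the "Average" row was derived.
fn mean_speedup(pairs: &[(f64, f64)]) -> f64 {
    let total: f64 = pairs.iter().map(|&(m, s)| speedup(m, s)).sum();
    total / pairs.len() as f64
}

fn main() {
    // (memchr, stringzilla) in GB/s for Intel and Arm, copied from the table.
    let columns = [
        ("ASCII forward", [(5.89, 8.37), (6.38, 6.56)]),
        ("ASCII reverse", [(1.08, 8.21), (1.12, 5.56)]),
        ("UTF8 forward", [(8.73, 11.21), (13.20, 9.41)]),
        ("UTF8 reverse", [(3.35, 11.20), (3.56, 8.17)]),
    ];
    for (name, pairs) in columns.iter() {
        println!("{name}: {:.1}x", mean_speedup(pairs));
    }
}
```

Under this assumption, the UTF8 forward column averages out to roughly parity (faster on Intel, slower on Arm), which is consistent with the dash in that cell.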
Replicating the Results
Before running benchmarks, you can test your Rust environment by running:
As part of the benchmark, the input "haystack" file is whitespace-tokenized into an array of strings.
In every benchmark iteration, a new "needle" is taken from that array of tokens.
All occurrences of that token in the haystack are counted, and the throughput is calculated.
This generally produces very stable and predictable measurements.
The benchmark also includes a warm-up, to ensure that the CPU caches are filled and the results are not affected by cold start or SIMD-related frequency scaling.
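The procedure described above can be sketched with the standard library alone. This is not the repository's benchmark code: `str::match_indices` and `str::rmatch_indices` stand in for the memchr and stringzilla search routines, the helper names `count_forward`, `count_reverse`, and `bench` are hypothetical, and the warm-up length of 16 searches is an arbitrary choice. Matches are counted without overlaps here, which may differ from how the real benchmark counts them.

```rust
use std::hint::black_box;
use std::time::Instant;

/// Counts non-overlapping occurrences of `needle`, scanning left to right.
fn count_forward(haystack: &str, needle: &str) -> usize {
    haystack.match_indices(needle).count()
}

/// Counts non-overlapping occurrences of `needle`, scanning right to left.
fn count_reverse(haystack: &str, needle: &str) -> usize {
    haystack.rmatch_indices(needle).count()
}

/// Tokenizes the haystack on whitespace, then runs `iterations` searches,
/// taking a new needle from the token array on every iteration.
/// Returns (total matches, throughput in GB/s).
fn bench(haystack: &str, iterations: usize, search: fn(&str, &str) -> usize) -> (usize, f64) {
    let tokens: Vec<&str> = haystack.split_whitespace().collect();
    if tokens.is_empty() {
        return (0, 0.0);
    }

    // Warm-up: a few untimed searches to fill caches before measuring.
    for &needle in tokens.iter().take(iterations.min(16)) {
        black_box(search(haystack, needle));
    }

    let start = Instant::now();
    let mut matches = 0;
    for i in 0..iterations {
        let needle = tokens[i % tokens.len()];
        matches += search(black_box(haystack), needle);
    }
    let seconds = start.elapsed().as_secs_f64();

    // Every iteration scans the whole haystack once.
    let bytes = (haystack.len() * iterations) as f64;
    (matches, bytes / seconds / 1e9)
}

fn main() {
    let haystack = "the quick brown fox jumps over the lazy dog and the cat";
    let (fwd, fwd_gbs) = bench(haystack, 1_000, count_forward);
    let (rev, rev_gbs) = bench(haystack, 1_000, count_reverse);
    println!("forward: {fwd} matches, {fwd_gbs:.2} GB/s");
    println!("reverse: {rev} matches, {rev_gbs:.2} GB/s");
}
```

On a haystack this small the timings are dominated by noise; the sketch only shows the shape of the measurement, not numbers comparable to the table.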
ASCII Corpus
For benchmarks on ASCII data I've used the English Leipzig Corpora Collection.
It's 124 MB in size, 1'000'000 lines long, and contains 8'388'608 tokens of mean length 5.
UTF8 Corpus
For richer mixed UTF8 data, I've used the XL Sum dataset for multilingual extractive summarization.
It's 4.7 GB in size (1.7 GB compressed), 1'004'598 lines long, and contains 268'435'456 tokens of mean length 8.
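If you want to check the line, token, and mean-length figures for a corpus you downloaded yourself, a small helper along these lines should be enough. It is a sketch under assumptions: `corpus_stats` is a hypothetical name, tokens are split on Unicode whitespace, and the mean length is measured in bytes rather than characters, which may not match how the figures above were computed.

```rust
/// Returns (line count, token count, mean token length in bytes).
/// Assumption: tokens are whitespace-separated and length is in bytes.
fn corpus_stats(text: &str) -> (usize, usize, f64) {
    let lines = text.lines().count();
    let mut tokens = 0usize;
    let mut token_bytes = 0usize;
    for token in text.split_whitespace() {
        tokens += 1;
        token_bytes += token.len();
    }
    let mean = if tokens == 0 {
        0.0
    } else {
        token_bytes as f64 / tokens as f64
    };
    (lines, tokens, mean)
}

fn main() {
    // A tiny stand-in for a real corpus file read with std::fs::read_to_string.
    let text = "ab cd\nefgh i\n";
    let (lines, tokens, mean) = corpus_stats(text);
    println!("{lines} lines, {tokens} tokens, mean length {mean:.2}");
}
```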
To download, unpack, and run the benchmarks, execute the following bash script in your terminal: