CARVIEW |
Navigation Menu
-
Notifications
You must be signed in to change notification settings - Fork 429
Description
Here are some benchmarks of Argon2d 1.3 t=1 p=1 as implemented in phc-winner-argon2 as of today for the password hashing use case (not KDF) on 2x E5-2650v4 (24 cores, 48 threads) with 8x DDR4-2400 (a server kindly provided by Packet) running CentOS 7, built with that distro's default gcc, using AVX2 (I verified with objdump -d
), in cumulative hashes per second after warm-up (CPUs at max turbo for all cores in use):
32 MiB, 48 threads, as-is: 520
32 MiB, 48 threads, malloc()/free() out of loop: 650
32 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 1070
Per /proc/meminfo
, the memory was allocated as AnonHugePages
, so I don't expect further speedup from explicit use of huge pages on this Linux kernel (but there might be on CentOS 6, which per my testing wouldn't allocate huge pages transparently). 24 threads were consistently slightly slower across all of these, e.g. 980 hashes/s for the last one.
Here are some more, for 2 MiB (high request rate capacity, targeting L3 cache):
2 MiB, 48 threads, as-is: 15.5k
2 MiB, 48 threads, malloc()/free() out of loop: 15.7k
2 MiB, 48 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 27.4k
2 MiB, 24 threads, as-is: 20.7k
2 MiB, 24 threads, malloc()/free() out of loop: 20.5k
2 MiB, 24 threads, malloc()/free() out of loop and secure_wipe_memory() disabled: 23.3k
This shows that memory (de)allocation and cleansing has a combined performance cost of over 2x at least on this platform for 32 MiB, but apparently glibc is being smart at 2 MiB allocations so only the cleansing costs us at that size.
The inclusion of such overhead in default builds and its potential extent should be documented prominently (perhaps including right in the argon2
program output, not only in some documentation file), and a way for applications to override not only the memory (de)allocation functions, but also clear_internal_memory()
should be provided.
BTW, while the code mostly uses the clear_internal_memory()
wrapper function, in two places it calls secure_wipe_memory()
directly - intentional (why?) or a bug?
As a sanity check, current yescrypt t=0 p=1 (built for plain SSE2, as that still works best even on Broadwell-EP with yescrypt's default 128-bit S-box lookups, which I like for them being so numerous) computes 780 hashes/s for 32 MiB, 48 threads at default settings, 815 at pwxform rounds reduced from 6 to 3. Since it has another 1/3 of a pass over the memory (on top of Argon2's t=1), this would translate to 780*4/3 = 1040
to 815*4/3 = 1087
hashes/s for just the memory filling phase at same memory access speed, which is similar to the 1070 we see for Argon2d with the artificial slowdowns removed. That's at the same 1 KiB block size that Argon2 uses. Higher memory bandwidth is usable on this system at 2 KiB (925*4/3 = 1233
) and 4 KiB (1005*4/3 = 1340
, 1005*32*2*4/3*2^20/10^9 = 90 GB/s
), but these are not an option for Argon2 (block size not configurable).