Discussion:
[PHC] Argon2 CPU/GPU benchmarks
Solar Designer
2015-08-19 02:09:42 UTC
Permalink
Hi,

Agnieszka Bielec produced OpenCL implementations of Argon2d and 2i, and
ran benchmarks at the same 1.5 MiB level that we had used for Lyra2 vs.
yescrypt testing.

IIUC, these are for Argon2 1.0, before BlaMka and the indexing function
enhancement.

Argon2i t=3 m=1536
i7-4770K - 2480
GeForce GTX 960M - 1861
Radeon HD 7970 GE (*) - 1288
GeForce GTX TITAN (**) - 2805

Argon2d t=1 m=1536
i7-4770K - 7808
GeForce GTX 960M - 4227
Radeon HD 7970 GE (*) - 2742
GeForce GTX TITAN (**) - 6083

(*) We actually use one GPU in HD 7990 at 1.0 GHz, which is equivalent
to HD 7970 GE.
(**) With slight overclocking by the GPU card vendor.

Raw detail:

http://www.openwall.com/lists/john-dev/2015/08/17/62

I am especially concerned about the 960M (a mobile GPU with 65W TDP)
performing surprisingly well, at 75% of CPU speed for 2i and 54% for 2d.
This means that a larger desktop/gaming/server Maxwell GPU will
trivially outperform the CPU. Per these tables comparing Maxwell GPUs:

https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_900M_.289xxM.29_Series
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_900_Series

GTX Titan X is more than 4 times larger than the 960M. We need to add
it to the mix and see.

We also see the older Kepler architecture GTX TITAN outperform i7-4770K
slightly for 2i (2805/2480 = 1.13) and reach a CPU-like speed for 2d
(6083/7808 = 0.78). This is much worse than what we saw for Lyra2 and
yescrypt, but isn't as impressive as 960M's result.

The speeds on AMD GCN are worse than I had expected. Perhaps there's
still much room for optimization here.

Argon2i vs. Lyra2:

2480/1861 / (3792/629) = 0.22
2480/1288 / (3792/2844) = 1.44
2480/2805 / (3792/1638) = 0.38

Argon2d vs. Lyra2:

7808/4227 / (3792/629) = 0.31
7808/2742 / (3792/2844) = 2.14
7808/6083 / (3792/1638) = 0.55

Argon2 behaves a lot worse than Lyra2 on both NVIDIAs, but better on AMD
GCN (it's unclear why; probably a current implementation issue).

Argon2i vs. yescrypt:

2480/1861 / (4736/419) = 0.12
2480/1288 / (4736/914) = 0.37
2480/2805 / (4736/1050) = 0.20

Argon2d vs. yescrypt:

7808/4227 / (4736/419) = 0.16
7808/2742 / (4736/914) = 0.55
7808/6083 / (4736/1050) = 0.28

Argon2 behaves a lot worse than yescrypt.
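For reference, here is how those normalized ratios are computed, as a short Python sketch (the helper name is mine; the figures are the hashes/s numbers from the tables above):

```python
# Each ratio above is (Argon2 CPU speed / Argon2 GPU speed) divided by the
# same CPU/GPU ratio for the reference scheme (Lyra2 or yescrypt).
# Values below 1.0 mean the GPU gains more ground on Argon2 than it does
# on the reference scheme, i.e. Argon2 resists this GPU less well.

def normalized_ratio(argon_cpu, argon_gpu, ref_cpu, ref_gpu):
    return (argon_cpu / argon_gpu) / (ref_cpu / ref_gpu)

# Argon2i vs. Lyra2 on the GTX 960M:
print(round(normalized_ratio(2480, 1861, 3792, 629), 2))   # 0.22

# Argon2d vs. yescrypt on the GTX 960M:
print(round(normalized_ratio(7808, 4227, 4736, 419), 2))   # 0.16
```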

The gap on AMD GCN should grow once that code is properly optimized.

Potential results for GTX Titan X:

2480/(1861*3072/640*1000/1096) / (4736/419) = 0.027
7808/(4227*3072/640*1000/1096) / (4736/419) = 0.037

or:

4736/419 / (2480/(1861*3072/640*1000/1096)) = 37.1
4736/419 / (7808/(4227*3072/640*1000/1096)) = 26.8

Thus, Argon2i or 2d might perform 37 or 27 times worse than yescrypt,
but that's just extrapolation based on the tables at Wikipedia. We need
to get a Titan X and see for ourselves.

On the other hand, the final Argon2 should behave better, especially if
a MAXFORM chain is added.

Alexander
Dmitry Khovratovich
2015-08-19 10:13:06 UTC
Permalink
Hi Alexander,

thank you for the benchmarks! We are still working to produce new code
(enhanced + Maxform) that can be used for future testing. Please feel free
to ask for specific code changes that might favor GPU portability.

I have several questions:

1) Would you attribute these results to the existing Argon2 parallelism in
the compression function (8 x parallel Blake2)? Do you already exploit this
feature? If yes, then we already have a more sequential pattern in mind,
that would be great to test with or without Maxform.

2) How do you get these extrapolation numbers for Titan X? What are these
numbers in the denominator?

Best regards,
Dmitry
Post by Solar Designer
Hi,
Agnieszka Bielec produced OpenCL implementations of Argon2d and 2i, and
ran benchmarks at the same 1.5 MiB level that we had used for Lyra2 vs.
yescrypt testing.
IIUC, these are for Argon2 1.0, before BlaMka and the indexing function
enhancement.
Argon2i t=3 m=1536
i7-4770K - 2480
GeForce GTX 960M - 1861
Radeon HD 7970 GE (*) - 1288
GeForce GTX TITAN (**) - 2805
Argon2d t=1 m=1536
i7-4770K - 7808
GeForce GTX 960M - 4227
Radeon HD 7970 GE (*) - 2742
GeForce GTX TITAN (**) - 6083
(*) We actually use one GPU in HD 7990 at 1.0 GHz, which is equivalent
to HD 7970 GE.
(**) With slight overclocking by the GPU card vendor.
http://www.openwall.com/lists/john-dev/2015/08/17/62
I am especially concerned about the 960M (a mobile GPU with 65W TDP)
performing surprisingly well, at 75% of CPU speed for 2i and 54% for 2d.
This means that a larger desktop/gaming/server Maxwell GPU will
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_900M_.289xxM.29_Series
https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_900_Series
GTX Titan X is more than 4 times larger than the 960M. We need to add
it to the mix and see.
We also see the older Kepler architecture GTX TITAN outperform i7-4770K
slightly for 2i (2805/2480 = 1.13) and reach a CPU-like speed for 2d
(6083/7808 = 0.78). This is much worse than what we saw for Lyra2 and
yescrypt, but isn't as impressive as 960M's result.
The speeds on AMD GCN are worse than I had expected. Perhaps there's
still much room for optimization here.
2480/1861 / (3792/629) = 0.22
2480/1288 / (3792/2844) = 1.44
2480/2805 / (3792/1638) = 0.38
7808/4227 / (3792/629) = 0.31
7808/2742 / (3792/2844) = 2.14
7808/6083 / (3792/1638) = 0.55
Argon2 behaves a lot worse than Lyra2 on both NVIDIAs, but better on AMD
GCN (it's unclear why; probably a current implementation issue).
2480/1861 / (4736/419) = 0.12
2480/1288 / (4736/914) = 0.37
2480/2805 / (4736/1050) = 0.20
7808/4227 / (4736/419) = 0.16
7808/2742 / (4736/914) = 0.55
7808/6083 / (4736/1050) = 0.28
Argon2 behaves a lot worse than yescrypt.
The gap on AMD GCN should grow once that code is properly optimized.
2480/(1861*3072/640*1000/1096) / (4736/419) = 0.027
7808/(4227*3072/640*1000/1096) / (4736/419) = 0.037
4736/419 / (2480/(1861*3072/640*1000/1096)) = 37.1
4736/419 / (7808/(4227*3072/640*1000/1096)) = 26.8
Thus, Argon2i or 2d might perform 37 or 27 times worse than yescrypt,
but that's just extrapolation based on the tables at Wikipedia. We need
to get a Titan X and see for ourselves.
On the other hand, the final Argon2 should behave better, especially if
a MAXFORM chain is added.
Alexander
--
Best regards,
Dmitry Khovratovich
Solar Designer
2015-08-19 16:06:44 UTC
Permalink
Hi Dmitry,
Post by Dmitry Khovratovich
thank you for the benchmarks! We are still working to produce new code
(enhanced + Maxform) that can be used for future testing. Please feel free
to ask for specific code changes that might favor GPU portability.
Thanks.

What do you mean by GPU portability here? Simplifying OpenCL
implementations for testing, or for actual defensive use of GPUs?

So far, our intent has been mostly to discourage attack use of GPUs, so
we compare the different schemes with this in mind. It is also possible
to efficiently use GPUs defensively e.g. with Litecoin-like parameters to
scrypt (and thus to yescrypt in scrypt compatibility mode) and high p,
but I think Argon2's different thread-level parallelism model
discourages reaching that level of efficiency in defensive use of GPUs.
Post by Dmitry Khovratovich
1) Would you attribute these results to the existing Argon2 parallelism in
the compression function (8 x parallel Blake2)? Do you already exploit this
feature? If yes, then we already have a more sequential pattern in mind,
that would be great to test with or without Maxform.
We don't yet exploit this (except possibly to a very limited extent that
an OpenCL compiler and the hardware might), so I wouldn't attribute the
current results to it. I've been thinking of communicating suggestions
on how to try exploiting this to Agnieszka today. So we'll likely try.
If successful, this should let us pack more concurrent instances of
Argon2, and should provide much speedup over the results so far (as
we're not yet bumping into memory bandwidth, by far).

I attribute the faster attacks on Argon2 than on Lyra2 on the NVIDIA GPUs
so far primarily to Argon2 having a smaller internal state (Lyra2 was
benchmarked with 24 KiB blocks). yescrypt also has more internal state
due to the pwxform S-boxes, plus the pwxform operations themselves slow
GPUs down.
Post by Dmitry Khovratovich
2) How do you get these extrapolation numbers for Titan X? What are these
numbers in the denominator?
3072 and 640 are the total "shader" or "CUDA core" counts (32-bit SIMD
vector elements) for the two GPUs (Titan X vs. 960M). Since it's the
same architecture, we could also compare SMM counts: 24 vs. 5, leading
to the same ratio.

1000 and 1096 are the base clock rates in MHz for the two GPUs (actual
clock rates should be slightly higher for both).

Combined, these result in Titan X being 3072/640*1000/1096 = 4.38 times
faster. In case memory bandwidth ever becomes the limiting factor (as
we optimize the code more), it's similar too: 336/80 = 4.2 times faster.
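To make the arithmetic explicit, here is the scaling estimate as a small Python sketch (variable names are mine; the core counts, clocks, and bandwidth figures are the ones quoted above):

```python
# Titan X vs. 960M: same Maxwell architecture, so scale compute throughput
# by the ratio of CUDA core counts times the ratio of base clocks.
cores_titan_x, cores_960m = 3072, 640
clock_titan_x, clock_960m = 1000, 1096  # base clocks, MHz

compute_scale = (cores_titan_x / cores_960m) * (clock_titan_x / clock_960m)
print(round(compute_scale, 2))  # 4.38

# If memory bandwidth ever becomes the bottleneck instead (GB/s):
bandwidth_scale = 336 / 80
print(round(bandwidth_scale, 1))  # 4.2
```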
Post by Dmitry Khovratovich
Post by Solar Designer
2480/(1861*3072/640*1000/1096) / (4736/419) = 0.027
7808/(4227*3072/640*1000/1096) / (4736/419) = 0.037
4736/419 / (2480/(1861*3072/640*1000/1096)) = 37.1
4736/419 / (7808/(4227*3072/640*1000/1096)) = 26.8
Alexander
Solar Designer
2015-08-19 17:04:38 UTC
Permalink
Post by Solar Designer
Post by Dmitry Khovratovich
1) Would you attribute these results to the existing Argon2 parallelism in
the compression function (8 x parallel Blake2)? Do you already exploit this
feature? If yes, then we already have a more sequential pattern in mind,
that would be great to test with or without Maxform.
We don't yet exploit this (except possibly to a very limited extent that
an OpenCL compiler and the hardware might), so I wouldn't attribute the
current results to it. I've been thinking of communicating suggestions
on how to try exploiting this to Agnieszka today. So we'll likely try.
Here are my suggestions on trying to exploit this parallelism in OpenCL:

http://www.openwall.com/lists/john-dev/2015/08/19/20

We're also considering prefetching:

http://www.openwall.com/lists/john-dev/2015/08/19/22

and optimizing the modulo operation:

http://www.openwall.com/lists/john-dev/2015/08/19/10

We'd appreciate other suggestions on these and other potential
optimizations. For example, I guess Samuel or Thomas might have more
suggestions on specific ways to optimize the modulo operation.

Alexander
Dmitry Khovratovich
2015-08-19 17:27:30 UTC
Permalink
Post by Solar Designer
What do you mean by GPU portability here? Simplifying OpenCL
implementations for testing, or for actual defensive use of GPUs?
Simplifying implementations for testing, in order to realize all the
possible advantages of GPU cracking earlier than the attacker does.
Post by Solar Designer
We don't yet exploit this (except possibly to a very limited extent that
an OpenCL compiler and the hardware might), so I wouldn't attribute the
current results to it. I've been thinking of communicating suggestions
on how to try exploiting this to Agnieszka today. So we'll likely try.
If successful, this should let us pack more concurrent instances of
Argon2, and should provide much speedup over the results so far (as
we're not yet bumping into memory bandwidth, by far).
What is the parallelism parameter BTW? p=1 for all schemes?
Post by Solar Designer
Combined, these result in Titan X being 3072/640*1000/1096 = 4.38 times
faster. In case memory bandwidth ever becomes the limiting factor (as
we optimize the code more), it's similar too: 336/80 = 4.2 times faster.
I understand, but why do you get 37x advantage of yescrypt from there?
Don't these properties speed up yescrypt as well?
Post by Solar Designer
Post by Solar Designer
2480/(1861*3072/640*1000/1096) / (4736/419) = 0.027
7808/(4227*3072/640*1000/1096) / (4736/419) = 0.037
4736/419 / (2480/(1861*3072/640*1000/1096)) = 37.1
4736/419 / (7808/(4227*3072/640*1000/1096)) = 26.8
Alexander
--
Best regards,
Dmitry Khovratovich
Solar Designer
2015-08-19 18:35:34 UTC
Permalink
Post by Dmitry Khovratovich
Post by Solar Designer
What do you mean by GPU portability here? Simplifying OpenCL
implementations for testing, or for actual defensive use of GPUs?
Simplifying implementations for testing, in order to realize all the
possible advantages of GPU cracking earlier than the attacker does.
Makes sense. We don't have such suggestions currently (that wouldn't
have significant drawbacks).
Post by Dmitry Khovratovich
What is the parallelism parameter BTW? p=1 for all schemes?
Yes. Agnieszka also ran some tests with 5, but I am not considering
those yet.
Post by Dmitry Khovratovich
Post by Solar Designer
Combined, these result in Titan X being 3072/640*1000/1096 = 4.38 times
faster. In case memory bandwidth ever becomes the limiting factor (as
we optimize the code more), it's similar too: 336/80 = 4.2 times faster.
I understand, but why do you get 37x advantage of yescrypt from there?
Don't these properties speed up yescrypt as well?
Oh. You're absolutely right. It's totally flawed logic on my part.
Please disregard those 37x and 27x figures.

The expectation is that original Argon2 will run on Titan X maybe 2 or 3
times faster than on i7-4770K at these settings:

1861/2480*3072/640*1000/1096 = 3.29
4227/7808*3072/640*1000/1096 = 2.37

whereas yescrypt will run maybe 2 or 3 times slower on Titan X than on
the CPU:

419/4736*3072/640*1000/1096 = 0.39

That's obviously the same difference we see between Argon2 and yescrypt
on 960M vs. the CPU.
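The corrected projections above can be sketched in Python like this (the helper name is mine; the speeds are the measured 960M and CPU figures from earlier in the thread):

```python
# Scale each scheme's measured 960M speed by the ~4.38x Titan X factor,
# then compare the projection against the measured i7-4770K speed.
titan_x_scale = 3072 / 640 * 1000 / 1096  # ~4.38

def projected_titan_x_vs_cpu(speed_960m, speed_cpu):
    return speed_960m * titan_x_scale / speed_cpu

print(round(projected_titan_x_vs_cpu(1861, 2480), 2))  # Argon2i: 3.29
print(round(projected_titan_x_vs_cpu(4227, 7808), 2))  # Argon2d: 2.37
print(round(projected_titan_x_vs_cpu(419, 4736), 2))   # yescrypt: 0.39
```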

Alexander
Solar Designer
2015-10-15 13:03:14 UTC
Permalink
Post by Solar Designer
Agnieszka Bielec produced OpenCL implementations of Argon2d and 2i, and
ran benchmarks at the same 1.5 MiB level that we had used for Lyra2 vs.
yescrypt testing.
IIUC, these are for Argon2 1.0, before BlaMka and the indexing function
enhancement.
Argon2i t=3 m=1536
i7-4770K - 2480
GeForce GTX 960M - 1861
Radeon HD 7970 GE (*) - 1288
GeForce GTX TITAN (**) - 2805
Argon2d t=1 m=1536
i7-4770K - 7808
GeForce GTX 960M - 4227
Radeon HD 7970 GE (*) - 2742
GeForce GTX TITAN (**) - 6083
(*) We actually use one GPU in HD 7990 at 1.0 GHz, which is equivalent
to HD 7970 GE.
(**) With slight overclocking by the GPU card vendor.
http://www.openwall.com/lists/john-dev/2015/08/17/62
Jeremi Gosney's company, Sagitta HPC, has kindly sponsored the addition
of a Titan X to our HPC Village machine, which I finally got around to
announcing here:

http://www.openwall.com/lists/announce/2015/10/14/1

although we've been playing with the Titan X for a while now (a very
fast and well-behaving card and driver), and Agnieszka ran Argon2
benchmarks on it.

The results for Argon2 so far are moderately disappointing from attack
perspective: although Agnieszka got speeds higher than those quoted
above, she has also got comparably higher speeds out of the old Kepler
architecture Titan.

Here are the updated figures:

Argon2i t=3 m=1536
i7-4770K - 2480
GeForce GTX 960M - 2007
Radeon HD 7970 GE (*) - 1542
GeForce GTX TITAN (**) - 4292
GeForce GTX Titan X - 6301

Argon2d t=1 m=1536
i7-4770K - 7808
GeForce GTX 960M - 4881
Radeon HD 7970 GE (*) - 4266
GeForce GTX TITAN (**) - 11715
GeForce GTX Titan X - 9600
Post by Solar Designer
I am especially concerned about the 960M (a mobile GPU with 65W TDP)
performing surprisingly well, at 75% of CPU speed for 2i and 54% for 2d.
This means that a larger desktop/gaming/server Maxwell GPU will
trivially outperform the CPU.
... and it does, but not by such a large margin. Also, the older Kepler
GPU outperforms the CPU now, for both 2i and 2d.

For 2i, the best result is for Titan X: 6301/2480 = 2.54 times faster
than the CPU.

For 2d, the best result is for the old TITAN: 11715/7808 = 1.5 times
faster than the CPU.
Post by Solar Designer
GTX Titan X is more than 4 times larger than the 960M. We need to add
it to the mix and see.
Added, but the previously expected scaling is not seen: it's 4+ times
larger, but only 3 times faster at 2i, and 2 times faster at 2d.
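Spelling out the shortfall (a quick Python sketch using the figures above; the predicted factor is the core-count-times-clock estimate from earlier in the thread):

```python
# Predicted Titan X over 960M scaling vs. what the updated tables show.
predicted = 3072 / 640 * 1000 / 1096  # ~4.38x

observed_2i = 6301 / 2007  # Argon2i: Titan X / 960M
observed_2d = 9600 / 4881  # Argon2d: Titan X / 960M
print(round(predicted, 2), round(observed_2i, 2), round(observed_2d, 2))
# 4.38 3.14 1.97
```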

To me, this suggests the code is still badly unoptimized - in fact, we
know there's heavy register spilling going on, and the kernel is huge.
It is possible that we incur too many global memory accesses, and favor
the mobile GPU's relatively narrower memory bus.

Here's the raw detail:

http://www.openwall.com/lists/john-dev/2015/09/05/12
http://www.openwall.com/lists/john-dev/2015/09/06/26

This is still for Argon2 1.0. We've got to update to the latest version.

Alexander
Krisztián Pintér
2015-10-15 13:13:09 UTC
Permalink
Post by Solar Designer
For 2i, the best result is for Titan X: 6301/2480 = 2.54 times faster
than the CPU.
For 2d, the best result is for the old TITAN: 11715/7808 = 1.5 times
faster than the CPU.
so far, i'm not convinced that data dependent access is worth the
increased timing risk. although, argon uses a randomish access pattern
in i mode too, so maybe it leaves space for significant optimization
not done yet? do you plan to do some clever pre-reading?
Solar Designer
2015-10-15 13:57:08 UTC
Permalink
Post by Krisztián Pintér
Post by Solar Designer
For 2i, the best result is for Titan X: 6301/2480 = 2.54 times faster
than the CPU.
For 2d, the best result is for the old TITAN: 11715/7808 = 1.5 times
faster than the CPU.
so far, i'm not convinced that data dependent access is worth the
increased timing risk.
Yes, from these two results it's not convincing. I expect the
difference to be far greater when a MAXFORM chain is added (it should
then be close to the difference between Argon2i and yescrypt, which for
the GPU implementations so far is 5x to 10x), and that's only possible
with data dependent access (since MAXFORM itself uses data dependent
S-box lookups). Maybe that's a reason to exclude the data dependent yet
MAXFORM-lacking version. Data dependent accesses provide most advantage
when they are rapid and their parallelism within one instance is low.

(A data independent replacement for MAXFORM is possible, even if less
effective - but we haven't even discussed that yet. So it's non-PHC.)
Post by Krisztián Pintér
although, argon uses a randomish access pattern
in i mode too, so maybe it leaves space for significant optimization
not done yet?
We're already taking advantage of coalescing, for all block lookups in
2i (since it's the same order across concurrent instances), and for the
initial writes in 2d. Also, 1 KB blocks are pretty large, and it's
sequential access within each block. Like I said before, something
like MAXFORM is needed to have a GPU-unfriendly random access pattern.
Post by Krisztián Pintér
do you plan to do some clever pre-reading?
There's not a lot of cache or local memory on GPUs to prefetch to, given
how many concurrent instances need to be run. That said, I think there
is in fact room for some prefetching, possibly just of portions of a
block as computation on the previous block is being finished. (IIRC,
with Argon2 specifically, this may also be possible for 2d, starting
after 9 out of 16 BLAKE2b's. Not the case for (ye)scrypt.) We're not
taking advantage of this yet, and we have no immediate plans to do so,
in part because I think there are still bigger opportunities for
optimization:

I mentioned the register spills and the code size issue. We need to
make our memory accesses explicit (and see if we can optimize them)
rather than just let the compiler spill. Also, for the original Argon2,
parallel computation of several BLAKE2b's may be implemented, even if
non-trivial to do under the SIMT model, requiring use of local memory to
pass the results much like we've seen for the BSTY mining yescrypt
implementation discussed in here recently.

Alexander
