179f6280 | 17-Jun-2019 |
Conrad Meyer <cem@FreeBSD.org> |
random(4): Fortuna: allow increased concurrency
Add experimental feature to increase concurrency in Fortuna. As this diverges slightly from canonical Fortuna, and due to the security sensitivity of
random(4): Fortuna: allow increased concurrency
Add experimental feature to increase concurrency in Fortuna. As this diverges slightly from canonical Fortuna, and due to the security sensitivity of random(4), it is off by default. To enable it, set the tunable kern.random.fortuna.concurrent_read="1". The rest of this commit message describes the behavior when enabled.
Readers continue to update shared Fortuna state under global mutex, as they do in the status quo implementation of the algorithm, but shift the actual PRF generation out from under the global lock. This massively reduces the CPU time readers spend holding the global lock, allowing for increased concurrency on SMP systems and less bullying of the harvestq kthread.
It is somewhat of a deviation from FS&K. I think the primary difference is that the specific sequence of AES keys will differ if READ_RANDOM_UIO is accessed concurrently (as the 2nd thread to take the mutex will no longer receive a key derived from rekeying the first thread). However, I believe the goals of rekeying AES are maintained: trivially, we continue to rekey every 1MB for the statistical property; and each consumer gets a forward-secret, independent AES key for their PRF.
Since Chacha doesn't need to rekey for sequences of any length, this change makes no difference to the sequence of Chacha keys and PRF generated when Chacha is used in place of AES.
On a GENERIC 4-thread VM (so, INVARIANTS/WITNESS, numbers not necessarily representative), 3x concurrent AES performance jumped from ~55 MiB/s per thread to ~197 MB/s per thread. Concurrent Chacha20 at 3 threads went from roughly ~113 MB/s per thread to ~430 MB/s per thread.
Prior to this change, the system was extremely unresponsive with 3-4 concurrent random readers; each thread had high variance in latency and throughput, depending on who got lucky and won the lock. "rand_harvestq" thread CPU use was high (double digits), seemingly due to spinning on the global lock.
After the change, concurrent random readers and the system in general are much more responsive, and rand_harvestq CPU use dropped to basically zero.
Tests are added to the devrandom suite to ensure the uint128_add64 primitive utilized by unlocked read functions to specification.
Reviewed by: markm Approved by: secteam(delphij) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D20313
show more ...
|
d0d71d81 | 17-Jun-2019 |
Conrad Meyer <cem@FreeBSD.org> |
random(4): Generalize algorithm-independent APIs
At a basic level, remove assumptions about the underlying algorithm (such as output block size and reseeding requirements) from the algorithm-indepen
random(4): Generalize algorithm-independent APIs
At a basic level, remove assumptions about the underlying algorithm (such as output block size and reseeding requirements) from the algorithm-independent logic in randomdev.c. Chacha20 does not have many of the restrictions that AES-ICM does as a PRF (Pseudo-Random Function), because it has a cipher block size of 512 bits. The motivation is that by generalizing the API, Chacha is not penalized by the limitations of AES.
In READ_RANDOM_UIO, first attempt to NOWAIT allocate a large enough buffer for the entire user request, or the maximal input we'll accept between signal checking, whichever is smaller. The idea is that the implementation of any randomdev algorithm is then free to divide up large requests in whatever fashion it sees fit.
As part of this, two responsibilities from the "algorithm-generic" randomdev code are pushed down into the Fortuna ra_read implementation (and any other future or out-of-tree ra_read implementations):
1. If an algorithm needs to rekey every N bytes, it is responsible for handling that in ra_read(). (I.e., Fortuna's 1MB rekey interval for AES block generation.)
2. If an algorithm uses a block cipher that doesn't tolerate partial-block requests (again, e.g., AES), it is also responsible for handling that in ra_read().
Several APIs are changed from u_int buffer length to the more canonical size_t. Several APIs are changed from taking a blockcount to a bytecount, to permit PRFs like Chacha20 to directly generate quantities of output that are not multiples of RANDOM_BLOCKSIZE (AES block size).
The Fortuna algorithm is changed to NOT rekey every 1MiB when in Chacha20 mode (kern.random.use_chacha20_cipher="1"). This is explicitly supported by the math in FS&K §9.4 (Ferguson, Schneier, and Kohno; "Cryptography Engineering"), as well as by their conclusion: "If we had a block cipher with a 256-bit [or greater] block size, then the collisions would not have been an issue at all."
For now, continue to break up reads into PAGE_SIZE chunks, as they were before. So, no functional change, mostly.
Reviewed by: markm Approved by: secteam(delphij) Differential Revision: https://reviews.freebsd.org/D20312
show more ...
|