[PATCH v8 1/9] stdlib: Add arc4random, arc4random_buf, and arc4random_uniform (BZ #4417)

Tue Jul 12 17:15:32 GMT 2022

* Adhemerval Zanella Netto:

>>> +/* arc4random keeps two counters: 'have' is the current valid bytes not yet
>>> +   consumed in 'buf' while 'count' is the maximum number of bytes until a
>>> +   reseed.
>>> +
>>> +   Both the initial seed and reseed try to obtain entropy from the kernel
>>> +   and abort the process if none could be obtained.
>>> +
>>> +   The state 'buf' improves the usage of the cipher calls, allowing to call
>>> +   optimized implementations (if the architecture provides it) and optimize
>>> +   arc4random calls (since only multiple calls it will encrypt the next
>>> +   block).  */

>> I don't understand the “since only multiple calls it will encrypt the
>> next block” part.
>
> I changed to 'and minimize function call overhead'.  Using the generic
> implementation, a buffer of 8 chacha20 blocks shows about 2x more
> throughput on aarch64.
>
> A buffer with 4x the chacha20 block size shows slightly less performance,
> so one option might be to make the buffer size arch-specific (since AVX2,
> and potentially AVX512, require a large block size for the arch-specific
> implementations).

Ah, that makes sense.  I think the quoted part is just a bit garbled and
needs some polishing.
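
To make the buffering idea concrete, here is a minimal sketch of a
generator that fills several blocks at once and serves small requests
from the buffer.  It is not the code from the patch: the names
(sketch_state, sketch_refill, sketch_random_buf) and the refills counter
are hypothetical, and the refill routine is a plain counter standing in
for the cipher, so the output is not random in any sense.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A chacha20 block is 64 bytes; 8 blocks is the size discussed above.  */
enum
{
  SKETCH_BLOCK_SIZE = 64,
  SKETCH_BLOCKS = 8,
  SKETCH_BUF_SIZE = SKETCH_BLOCK_SIZE * SKETCH_BLOCKS
};

struct sketch_state
{
  uint8_t buf[SKETCH_BUF_SIZE];
  size_t have;           /* Valid bytes not yet consumed (tail of buf).  */
  uint64_t counter;      /* Hypothetical stand-in for the cipher state.  */
  unsigned int refills;  /* Hypothetical: counts the expensive calls.  */
};

/* Placeholder for the multi-block cipher call.  NOT a cipher.  */
static void
sketch_refill (struct sketch_state *s)
{
  for (size_t i = 0; i < SKETCH_BUF_SIZE; i++)
    s->buf[i] = (uint8_t) (s->counter++ * 167 + 13);
  s->have = SKETCH_BUF_SIZE;
  s->refills++;
}

/* Copy LEN bytes to OUT, refilling only when the buffer is empty.  */
static void
sketch_random_buf (struct sketch_state *s, void *out, size_t len)
{
  uint8_t *p = out;
  while (len > 0)
    {
      if (s->have == 0)
        sketch_refill (s);
      assert (s->have <= SKETCH_BUF_SIZE);
      size_t n = len < s->have ? len : s->have;
      memcpy (p, s->buf + SKETCH_BUF_SIZE - s->have, n);
      s->have -= n;
      p += n;
      len -= n;
    }
}
```

With a zero-initialized state, 128 requests of 4 bytes each trigger a
single refill, which is the function call overhead the reworded comment
refers to.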

>>> +/* Reinit the thread context by reseeding the cipher state with kernel
>>> +   entropy.  */
>>> +static void
>>> +arc4random_check_stir (struct arc4random_state_t *state, size_t len)
>> Could you add a comment describing the len parameter?
>
> I changed to:
>
> /* Check if the thread context STATE should be reseed with kernel entropy
>    depending on the requested LEN bytes.  If there is less than requested,
>    the state is either initialized or reseed, otherwise the internal
>    counter subtracts the requested length.  */

“reseeded”
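
For reference, the behaviour that comment describes could look roughly
like the sketch below.  This is an illustration under assumptions, not
the patch: sketch_ctx, sketch_reseed and SKETCH_RESEED_BYTES are
hypothetical names, the budget of 1024 bytes is arbitrary, and the
reseed routine only resets the counter where real code would read
kernel entropy and abort on failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reseed budget, chosen small for illustration only.  */
enum { SKETCH_RESEED_BYTES = 1024 };

struct sketch_ctx
{
  size_t count;          /* Bytes that may still be produced before a reseed.  */
  unsigned int reseeds;  /* Hypothetical: counts (re)initializations.  */
};

/* Stand-in for keying the cipher from kernel entropy.  */
static void
sketch_reseed (struct sketch_ctx *c)
{
  c->count = SKETCH_RESEED_BYTES;
  c->reseeds++;
}

/* If fewer than LEN bytes remain in the budget, initialize or reseed;
   otherwise subtract LEN from the budget.  A zeroed context has an
   empty budget, so the first nonzero request initializes it.  */
static void
sketch_check_stir (struct sketch_ctx *c, size_t len)
{
  if (c->count < len)
    sketch_reseed (c);
  else
    c->count -= len;
  assert (c->count <= SKETCH_RESEED_BYTES);
}
```

Note that, read literally, the comment does not charge the triggering
request against the fresh budget; the sketch follows that reading.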

>> Why not simply call __arc4random_buf?  If you want to retain the
>> optimization, turn the implementation of __arc4random_buf into an inline
>> function and call it here and from __arc4random_buf.
>
> I actually tried that in some iteration and I recall that it yielded
> somewhat worse throughput.  I just tested again and it holds true: on
> an aarch64 system the current approach with the generic implementation
> yields 290 MB/s while calling arc4random_buf shows 172 MB/s.
>
> I am trying to decompose the function to eliminate the need for the
> loop (which I think the compiler can't optimize away for arc4random),
> but I don't think it would be simpler than open-coding the logic
> in both functions.

Hmm, this isn't great, but I see why you are doing it.
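
For the record, the shape I had in mind is roughly the following.  It
is only a sketch: the demo_* names are hypothetical, the byte source is
a counter rather than a cipher, and the always_inline attribute is a
GCC/Clang extension.  Whether the compiler removes the loop for the
fixed-size caller is exactly the open question, so the numbers above
still decide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy byte source (a counter).  NOT random.  */
struct demo_src
{
  uint8_t next;
};

/* Shared body, forced inline so each caller gets its own copy and the
   compiler can specialize it for a constant LEN.  */
static inline __attribute__ ((always_inline)) void
demo_fill (struct demo_src *src, void *out, size_t len)
{
  uint8_t *p = out;
  assert (out != NULL || len == 0);
  for (size_t i = 0; i < len; i++)
    p[i] = src->next++;
}

/* Variable-length entry point.  */
static void
demo_random_buf (struct demo_src *src, void *out, size_t len)
{
  demo_fill (src, out, len);
}

/* Fixed-length entry point: LEN is the constant sizeof (uint32_t).  */
static uint32_t
demo_random (struct demo_src *src)
{
  uint32_t r;
  demo_fill (src, &r, sizeof r);
  return r;
}
```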

Thanks,
Florian