C++2a synchronisation inefficient in GCC 11

Sat Feb 27 01:13:46 GMT 2021

We are SPECIFICALLY NOT committing to an ABI for wait/notify with GCC11. 
These features are experimental and I have no intention of treating this 
as a blocker.

On 2021-02-25 14:50, Thiago Macieira via Libstdc++ wrote:

> Hello all
> 
> I was investigating the implementation of std::atomic<T>::wait to use 
> inside
> my own code (Qt's QMutex and QSemaphore), when I realised that it has
> huge inefficiencies. Since everything in this implementation is inline 
> and,
> once released, it will tie our hands until the next ABI break 
> (libstdc++.so.
> 7). And no, inline namespaces and ABI tags are no different than ABI 
> breaks,
> they only make the ABI break more or less silent in the process.
> 
> Here's a summary of the findings:
> 
> 1) everything is inline
> 2) futex code is still behind a lot of code calling into 
> std::_Hash_bytes
> 3) other int-sized (incl. enums) atomics don't use futex
> 4) std::latch and std::counting_semaphore defaults preclude from using
> futex on Linux
> 5) std::barrier implementation also uses a type that futex(2) can't 
> handle
> 
> Here's what the code currently looks like: 
> https://gcc.godbolt.org/z/MeP8rr
> 
> Given how far we are in the release cycle, a simple solution is to 
> simply
> disable the entire solution by #if 0'ing out bits/atomic_wait.h. If
> __cpp_lib_atomic_wait isn't defined, the rest of the functionality 
> cascades
> disabled too. Then we can spend however long is necessary to get a good
> implementation for GCC 12.
> 
> I have patches for all but the first issue, but they are a result of 
> only 3
> hours of hacking to prove the concept. I will post them, but I do not 
> expect
> them to get merged as-is.
> 
> Details:
> 
> 1) everything is inline
> 
> I think this is unnecessary optimisation, especially since it expands 
> to a LOT
> of code (see link above). A cursory research into libc++ and MS STL 
> shows that
> the actual wait and notification are out-of-line. I recommend we do the 
> same
> for libstdc++: if we're going to syscall anyway, the extra function 
> call isn't
> going to be the performance bottleneck.
> 
> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050360.html>, 
> Jonathan
> wrote:
> 
>> Eventually we'll want to move the rest of this function (which doesn't
>> depend on the template argument) into the compiled library, but it's
>> better to be header-only for now.
> 
> But that has never happened.
> 
> Having out-of-line implementations would allow the implementation for 
> generic-
> sized atomic wait/notify to fine-tune as time and experience 
> progresses. The
> std::__detail::__waiter implementation looks clever, but it's unclear 
> if it is
> the best for all workloads. Moreover, the implementation is limited to 
> 16
> waiters.
> 
> Moreover, it would allow us to have support in the future for a 64-bit 
> futex.
> Linus did shoot down any need for it 13 years ago, but the world has 
> changed
> since and it's not inconceivable to have a pointer-sized futex on every
> platform, which on 64-bit would be 64-bit. Windows and Darwin do have 
> support
> for 64-bit compare-and-wait functionality.
> 
> And it would allow us to have futex-equivalent support in other OSes, 
> added at
> our leisure. The current implementation does not support Windows's
> WaitOnAddress and could never support it if released in the current 
> state.
> It's also possible other OSes will eventually add equivalent 
> functionality
> because of the C++ standard. For example, FreeBSD already has futex() 
> support
> inside its kernel in order to run Linux binaries, but it is not exposed 
> to the
> FreeBSD-personality syscall ABI.
> 
> Inlining certain functionality at a later date is possible. De-inlining 
> isn't.
> 
> 2) futex code for int is still behind a lot of boiler plate
> 
> It comes from the creation/use of std::__detail::__waiter in
> std::__atomic_wait and __atomic_notify. That does allow one to detect 
> whether
> there is contention in the first place and avoid the system call, but I 
> think
> that's premature optimisation.
> 
> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050362.html>, 
> Thomas
> says:
> 
>> __atomic_notify checks to see if that count is non-zero
>> before invoking the platform/semaphore notify because it is cheaper
>> to do the atomic load than it is to make the syscall() when there are 
>> no
>> waiters.
> 
> That's true, but it ignores the fact that users of 
> std::atomic<T>::wait() are
> likely to track contention by themselves on the value of the atomic (I 
> know I
> do). libc++, MS STL and the original atomic_wait[1 [1]] implementation 
> operate on
> direct system calls and don't try to track contention. Because of that,
> libstdc++ adding an extra layer of tracking is double book-keeping and
> violates "don't pay for what you don't need".
> 
> [1] https://github.com/ogiroux/atomic_wait
> 
> 3) other int-sized atomics don't use futex
> 
> The kernel doesn't care what we store in those 32 bits, only that they 
> are
> comparable (FUTEX_OP does provide less-than and greater-than operations 
> that
> are signed, but we don't use that and the kernel API can be later 
> extended to
> unsigned compares). Therefore, I would recommend we extend futex 
> support to
> all 32-bit signed scalar types. This would add:
> * pointers and long on 32-bit architectures
> * unsigned
> * untyped enums and typed enums on int & unsigned int
> * float
> 
> 4) std::latch and std::counting_semaphore defaults too high
> 
> The standard requires a ptrdiff_t in the API, but it does allow us to 
> set hard
> limits and/or defaults for best performance. Right now, std::latch is
> implemented using ptrdiff_t internally and std::counting_semaphore's 
> default
> limit uses PTRDIFF_MAX:
> 
> template<ptrdiff_t __least_max_value =
> __gnu_cxx::__int_traits<ptrdiff_t>::__max>
> class counting_semaphore
> 
> As a consequence, those two classes will use 64-bit atomics on 64-bit
> architectures, which means they will not use futex. I recommend 
> changing
> std::latch's internal to int and lowering the default maximum in
> std::counting_semaphore to INT_MAX.
> 
> 5) std::barrier uses a type that futex can't handle
> 
> The barrier phase is the actual type that gets waited / notified on, 
> but it is
> an enum class on unsigned char:
> 
> enum class __barrier_phase_t : unsigned char { };
> 
> This is a harder problem to solve, but I really do recommend making it 
> 32-bit
> in width. Whether each barrier phase holds only 256 values or the full 
> 32
> bits, I don't know (I've implemented with wrapping on 256).
> 
> What is of consequence is that the __state_t type could increase in 
> size
> ("could" because we can simply reduce the number of tickets in each 
> state
> entry). I haven't done enough research on this to know whether it is 
> the right
> way to go.

Links:
------
[1] https://github.com/ogiroux/atomic_wait