C++2a synchronisation inefficient in GCC 11
Thomas Rodgers
rodgert@appliantology.com
Sat Feb 27 01:13:46 GMT 2021
We are SPECIFICALLY NOT committing to an ABI for wait/notify with GCC11.
These features are experimental and I have no intention of treating this
as a blocker.
On 2021-02-25 14:50, Thiago Macieira via Libstdc++ wrote:
> Hello all
>
> I was investigating the implementation of std::atomic<T>::wait to use
> inside
> my own code (Qt's QMutex and QSemaphore), when I realised that it has
> huge inefficiencies. Since everything in this implementation is inline
> and,
> once released, it will tie our hands until the next ABI break
> (libstdc++.so.
> 7). And no, inline namespaces and ABI tags are no different than ABI
> breaks,
> they only make the ABI break more or less silent in the process.
>
> Here's a summary of the findings:
>
> 1) everything is inline
> 2) futex code is still behind a lot of code calling into
> std::_Hash_bytes
> 3) other int-sized (incl. enums) atomics don't use futex
> 4) std::latch and std::counting_semaphore defaults preclude from using
> futex on Linux
> 5) std::barrier implementation also uses a type that futex(2) can't
> handle
>
> Here's what the code currently looks like:
> https://gcc.godbolt.org/z/MeP8rr
>
> Given how far we are in the release cycle, a simple solution is to
> simply
> disable the entire solution by #if 0'ing out bits/atomic_wait.h. If
> __cpp_lib_atomic_wait isn't defined, the rest of the functionality
> cascades
> disabled too. Then we can spend however long is necessary to get a good
> implementation for GCC 12.
>
> I have patches for all but the first issue, but they are a result of
> only 3
> hours of hacking to prove the concept. I will post them, but I do not
> expect
> them to get merged as-is.
>
> Details:
>
> 1) everything is inline
>
> I think this is unnecessary optimisation, especially since it expands
> to a LOT
> of code (see link above). A cursory research into libc++ and MS STL
> shows that
> the actual wait and notification are out-of-line. I recommend we do the
> same
> for libstdc++: if we're going to syscall anyway, the extra function
> call isn't
> going to be the performance bottleneck.
>
> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050360.html>,
> Jonathan
> wrote:
>
>> Eventually we'll want to move the rest of this function (which doesn't
>> depend on the template argument) into the compiled library, but it's
>> better to be header-only for now.
>
> But that has never happened.
>
> Having out-of-line implementations would allow the implementation for
> generic-
> sized atomic wait/notify to fine-tune as time and experience
> progresses. The
> std::__detail::__waiter implementation looks clever, but it's unclear
> if it is
> the best for all workloads. Moreover, the implementation is limited to
> 16
> waiters.
>
> Moreover, it would allow us to have support in the future for a 64-bit
> futex.
> Linus did shoot down any need for it 13 years ago, but the world has
> changed
> since and it's not inconceivable to have a pointer-sized futex on every
> platform, which on 64-bit would be 64-bit. Windows and Darwin do have
> support
> for 64-bit compare-and-wait functionality.
>
> And it would allow us to have futex-equivalent support in other OSes,
> added at
> our leisure. The current implementation does not support Windows's
> WaitOnAddress and could never support it if released in the current
> state.
> It's also possible other OSes will eventually add equivalent
> functionality
> because of the C++ standard. For example, FreeBSD already has futex()
> support
> inside its kernel in order to run Linux binaries, but it is not exposed
> to the
> FreeBSD-personality syscall ABI.
>
> Inlining certain functionality at a later date is possible. De-inlining
> isn't.
>
> 2) futex code for int is still behind a lot of boiler plate
>
> It comes from the creation/use of std::__detail::__waiter in
> std::__atomic_wait and __atomic_notify. That does allow one to detect
> whether
> there is contention in the first place and avoid the system call, but I
> think
> that's premature optimisation.
>
> In <https://gcc.gnu.org/pipermail/libstdc++/2020-May/050362.html>,
> Thomas
> says:
>
>> __atomic_notify checks to see if that count is non-zero
>> before invoking the platform/semaphore notify because it is cheaper
>> to do the atomic load than it is to make the syscall() when there are
>> no
>> waiters.
>
> That's true, but it ignores the fact that users of
> std::atomic<T>::wait() are
> likely to track contention by themselves on the value of the atomic (I
> know I
> do). libc++, MS STL and the original atomic_wait[1 [1]] implementation
> operate on
> direct system calls and don't try to track contention. Because of that,
> libstdc++ adding an extra layer of tracking is double book-keeping and
> violates "don't pay for what you don't need".
>
> [1] https://github.com/ogiroux/atomic_wait
>
> 3) other int-sized atomics don't use futex
>
> The kernel doesn't care what we store in those 32 bits, only that they
> are
> comparable (FUTEX_OP does provide less-than and greater-than operations
> that
> are signed, but we don't use that and the kernel API can be later
> extended to
> unsigned compares). Therefore, I would recommend we extend futex
> support to
> all 32-bit signed scalar types. This would add:
> * pointers and long on 32-bit architectures
> * unsigned
> * untyped enums and typed enums on int & unsigned int
> * float
>
> 4) std::latch and std::counting_semaphore defaults too high
>
> The standard requires a ptrdiff_t in the API, but it does allow us to
> set hard
> limits and/or defaults for best performance. Right now, std::latch is
> implemented using ptrdiff_t internally and std::counting_semaphore's
> default
> limit uses PTRDIFF_MAX:
>
> template<ptrdiff_t __least_max_value =
> __gnu_cxx::__int_traits<ptrdiff_t>::__max>
> class counting_semaphore
>
> As a consequence, those two classes will use 64-bit atomics on 64-bit
> architectures, which means they will not use futex. I recommend
> changing
> std::latch's internal to int and lowering the default maximum in
> std::counting_semaphore to INT_MAX.
>
> 5) std::barrier uses a type that futex can't handle
>
> The barrier phase is the actual type that gets waited / notified on,
> but it is
> an enum class on unsigned char:
>
> enum class __barrier_phase_t : unsigned char { };
>
> This is a harder problem to solve, but I really do recommend making it
> 32-bit
> in width. Whether each barrier phase holds only 256 values or the full
> 32
> bits, I don't know (I've implemented with wrapping on 256).
>
> What is of consequence is that the __state_t type could increase in
> size
> ("could" because we can simply reduce the number of tickets in each
> state
> entry). I haven't done enough research on this to know whether it is
> the right
> way to go.
Links:
------
[1] https://github.com/ogiroux/atomic_wait
More information about the Libstdc++
mailing list