This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization
- From: Will Newton <will dot newton at linaro dot org>
- To: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
- Cc: Torvald Riegel <triegel at redhat dot com>, "GNU C. Library" <libc-alpha at sourceware dot org>
- Date: Tue, 6 May 2014 09:36:06 +0100
- Subject: Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization
- Authentication-results: sourceware.org; auth=none
- References: <5361013D dot 9020005 at linux dot vnet dot ibm dot com> <1399045370 dot 32485 dot 6639 dot camel at triegel dot csb> <5363D878 dot 9080803 at linux dot vnet dot ibm dot com>
On 2 May 2014 18:40, Adhemerval Zanella <azanella@linux.vnet.ibm.com> wrote:
> On 02-05-2014 12:42, Torvald Riegel wrote:
>> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>>> This patch adds a single-thread optimization for locks used within
>>> libc.so. For each lock operations it checks it the process has already
>>> spawned one thread and if not use non-atomic operations. Other libraries
>>> (libpthread.so for instance) are unaffected by this change.
>>>
>>> This is a respin on my first patch to add such optimization, but now the
>>> code is focused only on lowlevellock.h, the atomic.h is untouched.
>>>
>>> Tested on powerpc32 and powerpc64.
>>>
>>> --
>>>
>>> * nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>>> (__lll_is_single_thread): New macro to check if process has spawned
>>> any threads.
>>> (__lll_robust_trylock): Add optimization to avoid atomic operations in
>>> single thread case.
>>> (lll_lock): Likewise.
>>> (lll_robust_lock): Likewise.
>>> (lll_cond_lock): Likewise.
>>> (lll_robust_cond_lock): Likewise.
>>> (lll_timedlock): Likewise.
>>> (lll_robust_timedlock): Likewise.
>>> (lll_unlock): Likewise.
>>> (lll_robust_unlock): Likewise.
>>>
>>> ---
>>>
>>> diff --git a/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h b/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>>> index ab92c3f..38529a4 100644
>>> --- a/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>>> +++ b/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>>> @@ -76,6 +76,16 @@
>>> # endif
>>> #endif
>>>
>>> +/* For internal libc.so lock calls in single-thread process we use normal
>>> + load/stores. */
>>> +#if !defined NOT_IN_libc || defined UP
>>> +# define __lll_is_single_thread \
>>> + __glibc_likely (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
>> I disagree that single-threaded execution should be the likely case.
>> There is a large body of existing single-threaded code. But what's the
>> microbenchmarks or data that we base this decision on?
>
> Fair enough, in fact I think changing to add no branch prediction and let either
> the compiler or the chip to prediction would be better.
>
>>
>>> +#else
>>> +# define __lll_is_single_thread (0)
>>> +#endif
>>> +
>>> +
>>> #define lll_futex_wait(futexp, val, private) \
>>> lll_futex_timed_wait (futexp, val, NULL, private)
>>>
>>> @@ -205,7 +215,9 @@
>>> /* Set *futex to ID if it is 0, atomically. Returns the old value */
>>> #define __lll_robust_trylock(futex, id) \
>>> ({ int __val; \
>>> - __asm __volatile ("1: lwarx %0,0,%2" MUTEX_HINT_ACQ "\n" \
>>> + if (!__lll_is_single_thread) \
>>> + __asm __volatile ( \
>>> + "1: lwarx %0,0,%2" MUTEX_HINT_ACQ "\n" \
>>> " cmpwi 0,%0,0\n" \
>>> " bne 2f\n" \
>>> " stwcx. %3,0,%2\n" \
>>> @@ -214,6 +226,12 @@
>>> : "=&r" (__val), "=m" (*futex) \
>>> : "r" (futex), "r" (id), "m" (*futex) \
>>> : "cr0", "memory"); \
>>> + else \
>>> + { \
>>> + __val = *futex; \
>>> + if (__val == 0) \
>>> + *futex = id; \
>>> + } \
>>> __val; \
>>> })
>> Conceptually, you can safely use trylock in signal handlers, because
>> it's not a blocking operation. And this is used for normal and robust
>> locks. I haven't checked all the trylock code and all uses to see
>> whether it is indeed AS-Safe, but I'm pretty sure this change is wrong.
>> At the very least, it doesn't document that trylock is now AS-Unsafe.
>
> If there is the case I concede to you, however I couldn't find any proper
> documentation (either POSIX or from our manual) saying trylocks should be
> AS-Safe. And answering your previous question, I don't know if anyone is
> tracking the remaining documentation about pthread.
>
>>
>> Regarding the robust locks case: What is using it outside of nptl? And
>> if so, what's the point of using an explicitly robust lock if there's no
>> concurrency? Who's going to clean up afterwards?
>
> Indeed I see no use outside NPTL, I will drop it.
>
>>
>>> @@ -237,8 +255,16 @@ extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
>>> #define lll_lock(lock, private) \
>>> (void) ({ \
>>> int *__futex = &(lock); \
>>> - if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 1, 0),\
>>> - 0) != 0) \
>>> + int __tmp; \
>>> + if (!__lll_is_single_thread) \
>>> + __tmp = atomic_compare_and_exchange_val_acq (__futex, 1, 0); \
>>> + else \
>>> + { \
>>> + __tmp = *__futex; \
>>> + if (__tmp == 0) \
>>> + *__futex = 1; \
>>> + } \
>>> + if (__builtin_expect (__tmp, 0) != 0) \
>>> { \
>>> if (__builtin_constant_p (private) && (private) == LLL_PRIVATE) \
>>> __lll_lock_wait_private (__futex); \
>> This is obviously wrong for anything in nptl because you don't check
>> private. You defined lll_is_single_thread so that it's always 0 in
>> nptl, so this works -- but it's not increasing the clarity of the code.
>>
>> You also assume that nobody outside of nptl will call lll_lock with
>> private not set to LLL_PRIVATE. This seems error-prone. Or can you
>> show why that's fine?
>>
>> Also, you add this code to all low-level locks independently of whether
>> they are actually used outside of nptl or not. Have you looked at
>> libc_lock.h? I'm not familiar with how non-nptl code uses the locks,
>> but it seems it's calling nptl code, so the change wouldn't have any
>> effect.
>
> Indeed libc-lock primitives seems the best way to implement such change, thanks
> for the point, I'll rebase the changes base on these.
>
>>
>> Finally, for a change as this whose effect on performance is
>> non-obvious, I think we really need performance measurements. We need
>> to have microbenchmarks so that we can track the change of performance.
>>
> Although micro-benchmarks can provided (and they will prob be in that loop)
> 'this is not a real case'), for PowerPC avoid LL/SC instructions where they
> are not required is always a gain.
I recently posted a malloc micro-benchmark that should benefit from
these changes.
--
Will Newton
Toolchain Working Group, Linaro