This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization

From: Adhemerval Zanella <azanella at linux dot vnet dot ibm dot com>
To: Torvald Riegel <triegel at redhat dot com>
Cc: "GNU C. Library" <libc-alpha at sourceware dot org>
Date: Fri, 02 May 2014 14:40:08 -0300
Subject: Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization
Authentication-results: sourceware.org; auth=none
References: <5361013D dot 9020005 at linux dot vnet dot ibm dot com> <1399045370 dot 32485 dot 6639 dot camel at triegel dot csb>

On 02-05-2014 12:42, Torvald Riegel wrote:
> On Wed, 2014-04-30 at 10:57 -0300, Adhemerval Zanella wrote:
>> This patch adds a single-thread optimization for locks used within
>> libc.so.  For each lock operations it checks it the process has already
>> spawned one thread and if not use non-atomic operations.  Other libraries
>> (libpthread.so for instance) are unaffected by this change.
>>
>> This is a respin on my first patch to add such optimization, but now the
>> code is focused only on lowlevellock.h, the atomic.h is untouched.
>>
>> Tested on powerpc32 and powerpc64.
>>
>> --
>>
>>         * nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>>         (__lll_is_single_thread): New macro to check if process has spawned
>>         any threads.
>>         (__lll_robust_trylock): Add optimization to avoid atomic operations in
>>         single thread case.
>>         (lll_lock): Likewise.
>>         (lll_robust_lock): Likewise.
>>         (lll_cond_lock): Likewise.
>>         (lll_robust_cond_lock): Likewise.
>>         (lll_timedlock): Likewise.
>>         (lll_robust_timedlock): Likewise.
>>         (lll_unlock): Likewise.
>>         (lll_robust_unlock): Likewise.
>>
>> ---
>>
>> diff --git a/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h b/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>> index ab92c3f..38529a4 100644
>> --- a/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>> +++ b/nptl/sysdeps/unix/sysv/linux/powerpc/lowlevellock.h
>> @@ -76,6 +76,16 @@
>>  # endif
>>  #endif
>>  
>> +/* For internal libc.so lock calls in single-thread process we use normal
>> +   load/stores.  */
>> +#if !defined NOT_IN_libc || defined UP
>> +# define __lll_is_single_thread                                                      \
>> +  __glibc_likely (THREAD_GETMEM (THREAD_SELF, header.multiple_threads) == 0)
> I disagree that single-threaded execution should be the likely case.
> There is a large body of existing single-threaded code.  But what's the
> microbenchmarks or data that we base this decision on?

Fair enough, in fact I think changing to add no branch prediction and let either
the compiler or the chip to prediction would be better.

>
>> +#else
>> +# define __lll_is_single_thread (0)
>> +#endif
>> +
>> +
>>  #define lll_futex_wait(futexp, val, private) \
>>    lll_futex_timed_wait (futexp, val, NULL, private)
>>  
>> @@ -205,7 +215,9 @@
>>  /* Set *futex to ID if it is 0, atomically.  Returns the old value */
>>  #define __lll_robust_trylock(futex, id) \
>>    ({ int __val;                                                                      \
>> -     __asm __volatile ("1:     lwarx   %0,0,%2" MUTEX_HINT_ACQ "\n"          \
>> +     if (!__lll_is_single_thread)                                            \
>> +       __asm __volatile (                                                    \
>> +                      "1:      lwarx   %0,0,%2" MUTEX_HINT_ACQ "\n"          \
>>                        "        cmpwi   0,%0,0\n"                             \
>>                        "        bne     2f\n"                                 \
>>                        "        stwcx.  %3,0,%2\n"                            \
>> @@ -214,6 +226,12 @@
>>                        : "=&r" (__val), "=m" (*futex)                         \
>>                        : "r" (futex), "r" (id), "m" (*futex)                  \
>>                        : "cr0", "memory");                                    \
>> +     else                                                                    \
>> +       {                                                                     \
>> +        __val = *futex;                                                      \
>> +        if (__val == 0)                                                      \
>> +          *futex = id;                                                       \
>> +       }                                                                     \
>>       __val;                                                                  \
>>    })
> Conceptually, you can safely use trylock in signal handlers, because
> it's not a blocking operation.  And this is used for normal and robust
> locks.  I haven't checked all the trylock code and all uses to see
> whether it is indeed AS-Safe, but I'm pretty sure this change is wrong.
> At the very least, it doesn't document that trylock is now AS-Unsafe.

If there is the case I concede to you, however I couldn't find any proper
documentation (either POSIX or from our manual) saying trylocks should be
AS-Safe.  And answering your previous question, I don't know if anyone is
tracking the remaining documentation about pthread.

>
> Regarding the robust locks case: What is using it outside of nptl?  And
> if so, what's the point of using an explicitly robust lock if there's no
> concurrency?  Who's going to clean up afterwards?

Indeed I see no use outside NPTL, I will drop it.

>
>> @@ -237,8 +255,16 @@ extern int __lll_robust_lock_wait (int *futex, int private) attribute_hidden;
>>  #define lll_lock(lock, private) \
>>    (void) ({                                                                  \
>>      int *__futex = &(lock);                                                  \
>> -    if (__builtin_expect (atomic_compare_and_exchange_val_acq (__futex, 1, 0),\
>> -                         0) != 0)                                            \
>> +    int __tmp;                                                               \
>> +    if (!__lll_is_single_thread)                                             \
>> +      __tmp = atomic_compare_and_exchange_val_acq (__futex, 1, 0);           \
>> +    else                                                                     \
>> +      {                                                                              \
>> +       __tmp = *__futex;                                                     \
>> +       if (__tmp == 0)                                                       \
>> +         *__futex = 1;                                                       \
>> +      }                                                                              \
>> +    if (__builtin_expect (__tmp, 0) != 0)                                    \
>>        {                                                                              \
>>         if (__builtin_constant_p (private) && (private) == LLL_PRIVATE)       \
>>           __lll_lock_wait_private (__futex);                                  \
> This is obviously wrong for anything in nptl because you don't check
> private.  You defined lll_is_single_thread so that it's always 0 in
> nptl, so this works -- but it's not increasing the clarity of the code.
>
> You also assume that nobody outside of nptl will call lll_lock with
> private not set to LLL_PRIVATE.  This seems error-prone.  Or can you
> show why that's fine?
>
> Also, you add this code to all low-level locks independently of whether
> they are actually used outside of nptl or not.  Have you looked at
> libc_lock.h?  I'm not familiar with how non-nptl code uses the locks,
> but it seems it's calling nptl code, so the change wouldn't have any
> effect.

Indeed libc-lock primitives seems the best way to implement such change, thanks
for the point, I'll rebase the changes base on these.

>
> Finally, for a change as this whose effect on performance is
> non-obvious, I think we really need performance measurements.  We need
> to have microbenchmarks so that we can track the change of performance.
>
Although micro-benchmarks can provided (and they will prob be in that loop)
'this is not a real case'), for PowerPC avoid LL/SC instructions where they
are not required is always a gain.

Follow-Ups:
- Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization
  - From: Torvald Riegel
- Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization
  - From: Will Newton

References:
- Re: [PATCH 2/2] PowerPC: libc single-thread lock optimization
  - From: Torvald Riegel

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]