This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- From: pinskia at gmail dot com
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, andi <andi at firstfloor dot org>
- Date: Fri, 11 Oct 2013 22:40:07 -0700
- Subject: Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- References: <1381523328 dot 18547 dot 3422 dot camel at triegel dot csb>
> On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Assuming the pthread_once unification I sent recently is applied, we
> still have custom x86_64 and i386 variants of pthread_once. The
> algorithm they use is the same as the unified variant, so we would be
> able to remove the custom variants if this doesn't affect performance.
>
> The common case when pthread_once is executed is that the initialization
> has already been performed; thus, this is the fast path that we can
> focus on. (I haven't looked specifically at the generated code for the
> slow path, but the algorithm is the same and I assume that the overhead
> of the synchronizing instructions and futex syscalls determines the
> performance of it, not any differences between compiler-generated code
> and the custom code.)
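
As a point of reference, a minimal usage sketch (not part of the original mail; the names are made up) shows why the already-initialized case dominates: only the very first call runs the init routine, and every later call merely has to observe the "done" flag and return.

  #include <pthread.h>

  static pthread_once_t once = PTHREAD_ONCE_INIT;

  static void
  init_once (void)
  {
    /* One-time initialization.  */
  }

  void
  use_resource (void)
  {
    /* Fast path on every call except the first.  */
    pthread_once (&once, init_once);
    /* ... use the initialized state ... */
  }
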
>
> The fast path of the custom assembler version:
> testl $2, (%rdi)
> jz 1f
> xorl %eax, %eax
> retq
>
> The fast path of the generic pthread_once C code, as it is after the
> pthread_once unification patch:
> 20: 48 89 5c 24 e8 mov %rbx,-0x18(%rsp)
> 25: 48 89 6c 24 f0 mov %rbp,-0x10(%rsp)
> 2a: 48 89 fb mov %rdi,%rbx
> 2d: 4c 89 64 24 f8 mov %r12,-0x8(%rsp)
> 32: 48 89 f5 mov %rsi,%rbp
> 35: 48 83 ec 38 sub $0x38,%rsp
> 39: 41 b8 ca 00 00 00 mov $0xca,%r8d
> 3f: 8b 13 mov (%rbx),%edx
> 41: f6 c2 02 test $0x2,%dl
> 44: 74 16 je 5c <__pthread_once+0x3c>
> 46: 31 c0 xor %eax,%eax
> 48: 48 8b 5c 24 20 mov 0x20(%rsp),%rbx
> 4d: 48 8b 6c 24 28 mov 0x28(%rsp),%rbp
> 52: 4c 8b 64 24 30 mov 0x30(%rsp),%r12
> 57: 48 83 c4 38 add $0x38,%rsp
> 5b: c3 retq
Seems like this is a good case where shrink wrapping should have helped. What version of GCC did you try this with, and if it was 4.8 or later, can you file a bug for this missed optimization?
Thanks,
Andrew
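
For context, shrink wrapping is the GCC optimization that emits the prologue (callee-saved register spills and stack adjustment) only on the paths that actually need it. A rough sketch of the shape in question (not glibc code; the names are made up):

  extern int slow_path (int *state);

  int
  f (int *state)
  {
    /* Fast path: no callee-saved registers are needed, so ideally no
       register saves or stack adjustment should execute here.  */
    if (*state & 2)
      return 0;
    /* Only this branch needs the frame that the generated code above
       sets up unconditionally.  */
    return slow_path (state);
  }

With shrink wrapping applied, the saves for the slow branch would be sunk past the early return, which is essentially what the manual split further down achieves by hand.
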
>
> The only difference is more stack save/restore. However, a quick run of
> benchtests/pthread_once (see the patch I sent for review) on my laptop
> doesn't show any noticeable difference between the two (averages of 8
> runs of the microbenchmark differ by 0.2%).
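
A hypothetical sketch of such a fast-path microbenchmark (not the actual benchtests/pthread_once harness; names are made up): the once control is initialized before the timed loop, so every measured call exercises only the fast path being compared.

  #include <pthread.h>

  static pthread_once_t once = PTHREAD_ONCE_INIT;

  static void
  do_init (void)
  {
  }

  void
  bench_fast_path (long iters)
  {
    long i;
    /* Pay for the one-time initialization outside the measured loop.  */
    pthread_once (&once, do_init);
    for (i = 0; i < iters; i++)
      /* Every call here hits the already-initialized fast path.  */
      pthread_once (&once, do_init);
  }
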
>
> When splitting out the slow path like this:
>
> static int
> __attribute__((noinline))
> __pthread_once_slow (once_control, init_routine)
> /* ... */
>
> int
> __pthread_once (once_control, init_routine)
>      pthread_once_t *once_control;
>      void (*init_routine) (void);
> {
>   int val;
>   val = *once_control;
>   atomic_read_barrier();
>   if (__builtin_expect ((val & __PTHREAD_ONCE_DONE) != 0, 1))
>     return 0;
>   else
>     return __pthread_once_slow(once_control, init_routine);
> }
>
> we get this for the C variants fast path:
>
> 00000000000000e0 <__pthread_once>:
> e0: 8b 07 mov (%rdi),%eax
> e2: a8 02 test $0x2,%al
> e4: 74 03 je e9 <__pthread_once+0x9>
> e6: 31 c0 xor %eax,%eax
> e8: c3 retq
> e9: 31 c0 xor %eax,%eax
> eb: e9 30 ff ff ff jmpq 20 <__pthread_once_slow>
>
> This is very close to the fast path of the custom assembler code.
>
> I haven't looked further at i386, but the custom code is pretty similar
> to the x86_64 variant.
>
>
> What do you all prefer?
> 1) Keep the x86-specific assembler versions?
> 2) Remove the x86-specific assembler versions and split out the slow
> path?
> 3) Just remove the x86-specific assembler versions?
>