This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.
Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- From: pinskia at gmail dot com
- To: Torvald Riegel <triegel at redhat dot com>
- Cc: GLIBC Devel <libc-alpha at sourceware dot org>, andi <andi at firstfloor dot org>
- Date: Fri, 11 Oct 2013 22:40:07 -0700
- Subject: Re: [RFC] pthread_once: Use unified variant instead of custom x86_64/i386
- References: <1381523328 dot 18547 dot 3422 dot camel at triegel dot csb>
> On Oct 11, 2013, at 1:28 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Assuming the pthread_once unification I sent recently is applied, we
> still have custom x86_64 and i386 variants of pthread_once. The
> algorithm they use is the same as the unified variant, so we would be
> able to remove the custom variants if this doesn't affect performance.
>
> The common case when pthread_once is executed is that the initialization
> has already been performed; thus, this is the fast path that we can
> focus on. (I haven't looked specifically at the generated code for the
> slow path, but the algorithm is the same and I assume that the overhead
> of the synchronizing instructions and futex syscalls determines the
> performance of it, not any differences between compiler-generated code
> and the custom code.)
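
As a point of reference, a minimal usage sketch (not part of the original mail; the names are made up) shows why the already-initialized case dominates: only the very first call runs the init routine, and every later call merely has to observe the "done" flag and return.

  #include <pthread.h>

  static pthread_once_t once = PTHREAD_ONCE_INIT;

  static void
  init_once (void)
  {
    /* One-time initialization.  */
  }

  void
  use_resource (void)
  {
    /* Fast path on every call except the first.  */
    pthread_once (&once, init_once);
    /* ... use the initialized state ... */
  }
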
>
> The fast path of the custom assembler version:
> testl $2, (%rdi)
> jz 1f
> xorl %eax, %eax
> retq
>
> The fast path of the generic pthread_once C code, as it is after the
> pthread_once unification patch:
> 20: 48 89 5c 24 e8 mov %rbx,-0x18(%rsp)
> 25: 48 89 6c 24 f0 mov %rbp,-0x10(%rsp)
> 2a: 48 89 fb mov %rdi,%rbx
> 2d: 4c 89 64 24 f8 mov %r12,-0x8(%rsp)
> 32: 48 89 f5 mov %rsi,%rbp
> 35: 48 83 ec 38 sub $0x38,%rsp
> 39: 41 b8 ca 00 00 00 mov $0xca,%r8d
> 3f: 8b 13 mov (%rbx),%edx
> 41: f6 c2 02 test $0x2,%dl
> 44: 74 16 je 5c <__pthread_once+0x3c>
> 46: 31 c0 xor %eax,%eax
> 48: 48 8b 5c 24 20 mov 0x20(%rsp),%rbx
> 4d: 48 8b 6c 24 28 mov 0x28(%rsp),%rbp
> 52: 4c 8b 64 24 30 mov 0x30(%rsp),%r12
> 57: 48 83 c4 38 add $0x38,%rsp
> 5b: c3 retq
Seems like this is a good case where shrink wrapping should have helped. What version of GCC did you try this with, and if it was 4.8 or later, can you file a bug for this missed optimization?
Thanks,
Andrew
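
For context, shrink wrapping is the GCC optimization that emits the prologue (callee-saved register spills and stack adjustment) only on the paths that actually need it. A rough sketch of the shape in question (not glibc code; the names are made up):

  extern int slow_path (int *state);

  int
  f (int *state)
  {
    /* Fast path: no callee-saved registers are needed, so ideally no
       register saves or stack adjustment should execute here.  */
    if (*state & 2)
      return 0;
    /* Only this branch needs the frame that the generated code above
       sets up unconditionally.  */
    return slow_path (state);
  }

With shrink wrapping applied, the saves for the slow branch would be sunk past the early return, which is essentially what the manual split further down achieves by hand.
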
>
> The only difference is more stack save/restore. However, a quick run of
> benchtests/pthread_once (see the patch I sent for review) on my laptop
> doesn't show any noticeable difference between the two (averages of 8
> runs of the microbenchmark differ by 0.2%).
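
A hypothetical sketch of such a fast-path microbenchmark (not the actual benchtests/pthread_once harness; names are made up): the once control is initialized before the timed loop, so every measured call exercises only the fast path being compared.

  #include <pthread.h>

  static pthread_once_t once = PTHREAD_ONCE_INIT;

  static void
  do_init (void)
  {
  }

  void
  bench_fast_path (long iters)
  {
    long i;
    /* Pay for the one-time initialization outside the measured loop.  */
    pthread_once (&once, do_init);
    for (i = 0; i < iters; i++)
      /* Every call here hits the already-initialized fast path.  */
      pthread_once (&once, do_init);
  }
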
>
> When splitting out the slow path like this:
>
> static int
> __attribute__((noinline))
> __pthread_once_slow (once_control, init_routine)
> /* ... */
>
> int
> __pthread_once (once_control, init_routine)
>      pthread_once_t *once_control;
>      void (*init_routine) (void);
> {
>   int val;
>   val = *once_control;
>   atomic_read_barrier();
>   if (__builtin_expect ((val & __PTHREAD_ONCE_DONE) != 0, 1))
>     return 0;
>   else
>     return __pthread_once_slow(once_control, init_routine);
> }
>
> we get this for the C variants fast path:
>
> 00000000000000e0 <__pthread_once>:
> e0: 8b 07 mov (%rdi),%eax
> e2: a8 02 test $0x2,%al
> e4: 74 03 je e9 <__pthread_once+0x9>
> e6: 31 c0 xor %eax,%eax
> e8: c3 retq
> e9: 31 c0 xor %eax,%eax
> eb: e9 30 ff ff ff jmpq 20 <__pthread_once_slow>
>
> This is very close to the fast path of the custom assembler code.
>
> I haven't looked further at i386, but the custom code is pretty similar
> to the x86_64 variant.
>
>
> What do you all prefer?
> 1) Keep the x86-specific assembler versions?
> 2) Remove the x86-specific assembler versions and split out the slow
> path?
> 3) Just remove the x86-specific assembler versions?
>