This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Unify pthread_once (bug 15215)


On Mon, 2014-04-07 at 13:46 +0100, Will Newton wrote:
> On 7 April 2014 13:37, Torvald Riegel <triegel@redhat.com> wrote:
> > On Fri, 2014-03-28 at 19:29 -0400, Carlos O'Donell wrote:
> >> David, Marcus, Joseph, Mike, Andreas, Steve, Chris,
> >>
> >> We would like to unify all C-based pthread_once implementations
> >> per the plan in bug 15215 for glibc 2.20.
> >>
> >> Your machines are on the list of C-based pthread_once implementations.
> >>
> >> See this for the initial discussions on the unified pthread_once:
> >> https://sourceware.org/ml/libc-alpha/2013-05/msg00210.html
> >>
> >> The goal is to provide a single and correct C implementation of
> >> pthread_once. Architectures can then build on that if they need more
> >> efficient implementations, but I don't encourage that and I'd rather
> >> see deep discussions on how to make one unified solution where
> >> possible.
> >>
> >> I've also just reviewed Torvald's new pthread_once microbenchmark,
> >> which you can use to compare your previous C implementation with the
> >> new standard C implementation (it measures pthread_once latency). The
> >> primary use of this test is to help provide objective evidence for or
> >> against the i386 and x86_64 assembly implementations.
> >>
> >> We are not presently converting any of the machines with custom
> >> implementations, but that will be a next step after testing with the
> >> help of the maintainers for sh, i386, x86_64, powerpc, s390 and alpha.
> >>
> >> If we don't hear any objections we will go forward with this change
> >> in one week and unify ia64, hppa, mips, tile, sparc, m68k, arm
> >> and aarch64 on a single pthread_once implementation based on sparc's C
> >> implementation.
> >
> > So far, I've seen an okay for tile, and a question about ARM.  Will, are
> > you okay with the change for ARM?
> 
> From a correctness and maintainability standpoint it looks good. I
> have concerns about the performance but I will leave that call to the
> respective ARM and AArch64 maintainers.
> 
> In your original post you speculated that it may be possible to
> improve performance on ARM:
> 
> "I'm currently also using the existing atomic_{read/write}_barrier
> functions instead of not-yet-existing load_acq or store_rel functions.
> I'm not sure whether the latter can have somewhat more efficient
> implementations on Power and ARM; if so, and if you're concerned about
> the overhead, we can add load_acq and store_rel to atomic.h and start
> using it"
> 
> It would be interesting to know how much work that would be and what
> the performance improvements might be like.
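
For concreteness, generic fallbacks for such functions might look
roughly like the sketch below.  The names load_acq/store_rel and the
fallback to the existing barriers are just how I'd sketch it, not
current atomic.h API, and (as I say below) I'm not certain the existing
barriers are strictly equivalent to C11 acquire/release:

/* Hypothetical generic fallbacks for atomic.h; an architecture with
   cheaper native acquire/release operations would override these.  */
#define atomic_load_acq(mem)                \
  ({ __typeof (*(mem)) __v = *(mem);        \
     atomic_read_barrier ();                \
     __v; })

#define atomic_store_rel(mem, value)        \
  do                                        \
    {                                       \
      atomic_write_barrier ();              \
      *(mem) = (value);                     \
    }                                       \
  while (0)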

I had a quick look at the arm and aarch64 barrier definitions, and they
only define a full barrier, not separate read/write barriers.  I
believe that is part of the performance problem, since a full barrier
should be significantly more costly than an acquire barrier.
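
To illustrate, on aarch64 the difference would be roughly the
following; atomic_acquire_barrier is a made-up name, and the exact
instruction the current full barrier expands to may differ:

/* Full barrier: orders all prior loads and stores against all later
   ones.  */
#define atomic_full_barrier() \
  __asm__ __volatile__ ("dmb ish" ::: "memory")

/* Hypothetical acquire barrier: only orders prior loads against later
   loads and stores, which should be cheaper.  */
#define atomic_acquire_barrier() \
  __asm__ __volatile__ ("dmb ishld" ::: "memory")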

I guess the read/write barriers as used in glibc are semantically
equivalent to acquire/release as in C11, but I'm not quite sure, given
that some architectures use stronger barriers for read/write than for
acquire/release.  Cleaning that up would require reviewing plenty of
code, but one could also start incrementally, leaving the existing
barrier definitions unchanged and reviewing their uses one by one.  In
the long term, I think we would benefit from using C11 atomics
throughout glibc; in some cases existing custom assembly might be
faster (e.g., that has been one comment regarding, IIRC, the powerpc
low-level locks), but maybe we can achieve the same with custom memory
orders for atomics or something similar.
In any case, cleaning this up is not specific to pthread_once.
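
As a rough sketch of that direction, a pthread_once-style fast path
with C11 atomics could look like this; the once-control state encoding
here is simplified and hypothetical:

#include <stdatomic.h>

/* Return nonzero if the init routine has already run.  The acquire
   load synchronizes with the release store performed by the thread
   that ran the init routine, so that thread's side effects are
   visible.  */
static int
once_already_done (atomic_int *once_control)
{
  return atomic_load_explicit (once_control, memory_order_acquire) == 1;
}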

Second, the suggested mappings from C11 acquire/release to arm
(http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html) show different
instruction sequences for acquire loads and acquire barriers, but I
don't know whether these would result in a measurable performance
difference.
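
For reference, the two ARMv7 sequences that page gives for an acquire
load look roughly like this, written as inline asm purely for
illustration:

/* Variant 1: plain load followed by a full barrier.  */
static inline int
load_acquire_dmb (int *p)
{
  int v;
  __asm__ __volatile__ ("ldr %0, [%1]\n\t"
                        "dmb ish"
                        : "=r" (v) : "r" (p) : "memory");
  return v;
}

/* Variant 2: load, always-taken conditional branch to create a
   control dependency, then an isb.  */
static inline int
load_acquire_ctrlisb (int *p)
{
  int v;
  __asm__ __volatile__ ("ldr %0, [%1]\n\t"
                        "teq %0, %0\n\t"
                        "beq 1f\n"
                        "1:  isb"
                        : "=&r" (v) : "r" (p) : "memory", "cc");
  return v;
}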

I'd appreciate input from architecture maintainers, especially from
those maintaining archs with weaker memory models such as arm.

