This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.

Re: [RFC] improving spinning in adaptive mutexes

From: Torvald Riegel <triegel at redhat dot com>
To: Chris Metcalf <cmetcalf at tilera dot com>
Cc: GLIBC Devel <libc-alpha at sourceware dot org>, "Kleen, Andi" <andi dot kleen at intel dot com>
Date: Fri, 01 Mar 2013 15:57:52 +0100
Subject: Re: [RFC] improving spinning in adaptive mutexes
References: <1362067023.581.14702.camel@triegel.csb> <512FBE5E.6050205@tilera.com>

On Thu, 2013-02-28 at 15:30 -0500, Chris Metcalf wrote:
> On 2/28/2013 10:57 AM, Torvald Riegel wrote:
> > === Issues in the current implementation
> >
> > --- Inefficient spinning
> >
> > 1) No spinning using loads but with just trylock()
> > - trylock() doesn't first check whether the lock is free using a load,
> > which can be beneficial in the case of uncontended locks (based on
> > related measurements I did a few years ago; I don't have current numbers)
> > - If the atomic instruction in trylock() puts the lock's cache line into
> > exclusive state right away, we cause cache misses with the spinning.
> 
> On the current Tilera microarchitectures, test-and-test-and-set turns
> out to be a pessimization.  The problem is that the initial "test" is
> a load, which is sent over the on-chip memory network and results in
> returning a copy of the cache line from the home cache.  Then, when
> some core succeeds at taking the lock, the modification to the cache
> line at the home requires the home to invalidate all the copied cache
> lines, which involves a bunch of memory network traffic when multiple
> cores are trying to acquire the lock, as well as delay at the cache
> line's home as it waits for the invalidates to all be acknowledged by
> the other cores' caches.  As you scale up to dozens of cores, this
> effect starts to dominate.
> 
> By contrast, the atomic operations work directly at the home cache of
> the cache line, so no cache line copying and invalidating is required
> as cores try to acquire the lock.  Instead you just have each core
> sending short atomic operation requests over the memory network and
> getting back short atomic result responses,

That's interesting, thanks for pointing it out.

Is anyone aware of any other architectures where atomic
read-modify-write ops get executed "remotely"?

Likewise, are there any architectures that first pull in lines in shared
mode on read-modify-write ops?  Otherwise, I'd assume they eagerly
acquire them in exclusive mode (i.e., what I assumed in my original post).

Andi, I suppose the latter (eager exclusive mode) is the case for Intel?
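
To make the two variants under discussion concrete, here is a minimal
sketch using C11 atomics.  It is not glibc's code: the spin_t type, the
0/1 lock-word encoding, and all function names are assumptions made
only for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held.  Not glibc's mutex layout. */
typedef struct { atomic_int word; } spin_t;

/* Plain test-and-set: every attempt is an atomic read-modify-write,
   whether or not the lock is currently free. */
static bool tas_trylock(spin_t *l)
{
    return atomic_exchange_explicit(&l->word, 1, memory_order_acquire) == 0;
}

/* Test-and-test-and-set: check with a plain load first and only issue
   the atomic read-modify-write if the lock looked free. */
static bool ttas_trylock(spin_t *l)
{
    if (atomic_load_explicit(&l->word, memory_order_relaxed) != 0)
        return false;               /* looked held: skip the atomic op */
    return tas_trylock(l);
}

static void spin_unlock(spin_t *l)
{
    atomic_store_explicit(&l->word, 0, memory_order_release);
}
```

Which of the two is cheaper under contention depends on how the
hardware handles the initial load versus the atomic op, which is
exactly the point raised above for Tilera.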

>  so along with appropriate backoff, things scale up much better.

Do you have any suggestions for what would be appropriate back-off on
Tilera archs, and why?  I don't have access to the hardware, so I
can't run experiments to make an informed guess.
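
For concreteness, the kind of loop in question might look like the
sketch below: spinning with the atomic op only (no prior load) and
exponential back-off between attempts.  The constants BACKOFF_MIN and
BACKOFF_MAX, the doubling policy, and all function names are
placeholders assumed for illustration, not values measured on any
hardware.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BACKOFF_MIN 4u      /* placeholder, not a tuned value */
#define BACKOFF_MAX 1024u   /* placeholder, not a tuned value */

/* Double the delay, capped at BACKOFF_MAX. */
static unsigned next_backoff(unsigned cur)
{
    unsigned n = cur * 2;
    return n > BACKOFF_MAX ? BACKOFF_MAX : n;
}

/* Crude busy-wait; real code would use an arch-specific relax hint. */
static void pause_for(unsigned iters)
{
    for (volatile unsigned i = 0; i < iters; i++)
        ;
}

/* Try to take the lock (0 = free, 1 = held) with atomic exchange only,
   backing off after each failed attempt.  Returns false after max_tries
   failures so that the caller can fall back to blocking. */
static bool spin_acquire_backoff(atomic_int *word, unsigned max_tries)
{
    unsigned delay = BACKOFF_MIN;
    for (unsigned t = 0; t < max_tries; t++) {
        if (atomic_exchange_explicit(word, 1, memory_order_acquire) == 0)
            return true;
        pause_for(delay);
        delay = next_backoff(delay);
    }
    return false;
}
```

The open question is what the initial delay, growth factor, and cap
should be on a given architecture.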

Torvald

Follow-Ups:
- Re: [RFC] improving spinning in adaptive mutexes
  - From: Chris Metcalf