This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [PATCH 02/10] Add the low level infrastructure for pthreadslock elision with TSX


On Sat, 2013-01-12 at 00:06 +0100, Andi Kleen wrote:
> On Fri, Jan 11, 2013 at 10:23:14PM +0100, Torvald Riegel wrote:
> > This can be good or bad depending on how you see it.  It is good for
> > people that want to tune.  But it also indicates that HLE _needs_ to be
> 
> This implementation does not use HLE and this has nothing to do with
> HLE, as the TSX ISA interface.

The HLE acronym has been used for the concept of hardware-assisted or
hardware-enabled lock elision (or just hardware lock elision) before the
Haswell feature appeared (ie, for concepts such as described in Ravi
Rajwar's 2001 paper).  That's what I referred to by "HLE".  If you want
to propose another generic term for the concept, please make
suggestions.

Conceptually, the patch is about hardware lock elision implemented on
Intel HW features.  But HTMs or similar on other architectures could
also be used to implement the same glibc feature.

> > tuned to be worthwhile, which worries me.
> 
> Modern computers are complex and need to be tuned in general.

And that's why we build abstractions such as pthreads mutexes to make it
feasible to use complex computers.  Come on ...

> > 
> > I know that that HLE is the first LE HW that we can experiment with (at
> > least, I'm not aware of any other mainstream CPU having something
> > similar).  And the fact that we can't yet talk about performance is
> > another obstacle.  But we should try really hard to avoid requiring
> > users to tune our locks, including whether to use HLE or not.
> 
> The goal is to have reasonable defaults.

The defaults are unlikely to be just a bunch of numbers, but instead
automatic runtime tuning mechanisms that we are confident work well in
practice.

> But until we understand
> what the good reasonable defaults are we need tuning mechanism,
> so that people can find good defaults.

So what you are saying is that it can't be expected to work well out of
the box at the current state (eg, with the current automatic adaptation
mechamism)?  If so, this suggests that it should be an opt-in feature.

I honestly would much rather like to see it being in a state so that it
can be enabled by default, because it would be a shame if we can't
exploit the HW by default.  So, just to make it clear, I'm not against
HLE (the concept, as well as TSX-based implementations of it) at all, I
just want us to try hard to make it easy for our users to benefit from
it.  Because if it's not easy to use and is opt-in, you will get less
people using it.

> I don't want to explain
> everyone who wants to play with this how to rebuild glibc,
> so a runtime mechanism is needed.

Automatic tuning is also a runtime mechanism.

> Also I actually disagree that exposing tunables is a bad thing. 

That's not what I said.  What I said is that we should "avoid
*requiring* users to tune our locks".  We can of course argue at which
point something becomes a requirement instead of some minor optional
thing, but what you said makes me worry that it is in fact a
requirement.

> The current glibc strategy of hardcoding dubious magic numbers from 10+
> years ago (like it is done for the adaptive lock) is a poor one, and contrary
> to any good continuous performance improvement strategy.

Agreed, and that's not what I suggested.

BTW, this problem is not inherently caused by magic numbers, but by the
fact that we don't have the processes in place to verify that any such
numbers are indeed the best we can do.
Just delegating the choice of magic numbers to users doesn't solve the
problem.

> Of course most users will not tune, but the few that do can give you
> priceless feedback and they need to have appropiate mechanisms.
> 
> > Also, if we want to expose tuning parameters, my gut feeling is that at
> > least long-term, we'd need something different / better for how to set
> > those parameters.
> 
> I considered a configuration file, but it has additional overhead.
> Right now based on my own experience the environment variables work
> fine.
> 
>   If we'd expose all tunables in pthreads code to the
> > same extent and in the same way (e.g., max spin count for adaptive
> > mutexes), it would be a mess.
> 
> It would be a great improvement over the current "generally poor"
> state.

The only improvement is that this allows others to try to improve the
situation without having to change glibc code.  But this does *not*
solve the problem, so I wouldn't call this a great improvement.

> > So, I don't want to blame your patch for this, but IMHO this is
> > something the project should investigate and provide guidance on.
> > 
> > > Lock elision can be enabled/disabled using environment variables.
> > > It can be also enabled or disabled using new lock types for
> > > mutex and rwlocks.
> > 
> > I think we already have too many mutex types.  It's good to provide
> 
> elision is not really adding any new types, it's just adding two
> additional flags (elision and no elision)
> 
> The initializers do not support flags well, that is why there are
> multiple combinations. But that's how pthreads was defined.

So overall, it still adds new kinds of mutexes, whether we want to call
them types or rather types with additional flags, or whatever.  The
difference between with and without elision is similar to normal vs.
adaptive mutexes, and the latter are a separate kind as well.  Also,
until you remove the semantic differences when elision is enabled, these
are really new types...

> > choice, but if we'd really optimize our locks, we'd end up with _a lot_
> > more types.  Should we really have one lock type for each of cross-NUMA,
> > one with more fairness and one without, one that spins and one that
> > doesn't (yes, that's a current case), and so on?  And do we then also
> > build combinations of all of them and with or without HLE?  And what if,
> > for example, Intel comes up with another HLE implementation that is more
> > powerful?  Will we then have HLE2 initializers?  Or HLE_vendorX?
> 
> The source interface was designed to be portable, so anyone doing lock elision
> on any platform can use it. There is nothing vendor specific in it
> (the only vendor specific parameters are exposed in the tunable
> interface)

And via the initializers.  So if the initializers carry tuning flags
that are likely to be hardware-specific, it's not portable anymore in
practice.

Note that your flags just say use elision or don't use it, so they are
tied to how elision performs.  Could we instead have tuning hints that
are more portable, such as the likelihood with which a critical section
has a data conflict with another one?  Or the rough number of memory
locations accessed in the critical section?  Such properties would be
properties about the program that the lock elision implementation cannot
figure out easily; they have a better chance at being useful for
different lock elision implementation, I think.

Also, whether lock elision is useful rather depends on the critical
sections running concurrently than on the lock itself, so perhaps hints
should be tied to critical sections, not the lock?  Otherwise users will
find it hard to tune locks that are used to protect critical sections
that are not all the same (eg, short queries vs. longer-running
updates).

> If there are multiple elision types I would expect those to be handled
> through the separate tuning interface (but I don't see this)
> 
> Right now you need two more initializers for each lock type that
> supports elision. There are only two lock types right now that do,
> so you have four more. Not an unreasonable number. The only
> other lock type I would expect to elide in the future is recursive,
> so there may be two more.

So this makes six.  And it assumes that all implementations of lock
elision will need the same tuning (which I wouldn't count on) --
otherwise, we have at least three more for every other implementation,
or even more if we want to express something like "use elision on
implementations X and Y".

> > I think we should sit down and decide which locks we really need, and
> > what they should offer.  And then see how HLE fits into it.
> 
> timed, adaptive, maybe recursive at some point.

My statement was about the (user-visible) lock types that we need, not
primarily which types should support elision.

> > 
> > For example, can we subsume HLE under adaptive mutexes, or at least
> 
> Lock elision is orthogonal to how you do the fallback lock.
> It's just another attribute, like PI.  I don't think it makes
> sense to mix the two.

In the implementation it's orthogonal if we don't consider back-off,
which none of our locks do so far.  But for the user, we might be able
to group things under adaptive mutexes because both don't necessarily
block right away when the lock is acquired, might try to do the right
thing either by reducing contention or with speculative execution, etc.
Of course their are not the same, but the question is whether they are
close enough from the user POV to be one meaningful abstract kind of
lock.
The point of this would be to provide a better abstraction to users.

> > Note that C++11 also requires the lock to be owned before an unlock()
> > call.  Same for C11.
> 
> Yes any program hitting this is non conformant.
> 
> > 
> > >   There are ways around this with some tradeoffs (more code in hot paths)
> > 
> > Can you summarize the workarounds and the associated costs?
> 
> For unlock it's adding _xtest() in the unlock path.
> For the others it's usually adding _xabort(), which leads nested
> locking to not elide.

What does "others" refer to?

> I plan to do any of those changes if there are programs affected by it.
> However I first want to understand if there are.

We won't be able to show that programs are not affected until Haswell is
available and widely used for testing.  So unless elision is opt-in, not
crashing programs could be nicer to nonconforming programs.

> 
> > >   This will also happen on systems without RTM with the patchkit.
> > >   I'm still undecided on what approach to take here; have to wait for testing reports.
> > 
> > What kind of tests did you run so far?
> 
> A lot large real programs, various benchmark suites etc.

Can you be more specific?

> > 
> > > - pthread_mutex_destroy of a lock mutex will not return EBUSY but 0.
> > 
> > That should be fine, as it's a may-fail requirement.  You could add code
> > to write to the lock value itself too to ensure that there will be no
> > other thread holding the mutex (and thus get a correct return value).
> > The current pthread_mutex_destroy code already writes to the mutex kind,
> > so this should have negligible overhead.
> 
> It would be possible to abort yes, if it's a problem.

If there's no reason to not abort concurrent txns by a forced write, we
should just do that.

> > I see that you are using RTM instead of the HLE facilities (ie, XACQUIRE
> > and XRELEASE) that would give you the expected trylock behavior, which
> > is interesting.
> > I assume you do this to be more flexible regarding when to fall back to
> > nonspeculative execution?
> > 
> > Have you tried to use HLE within an RTM transaction?  That would give
> 
> HLE inside RTM aborts in current TSX implementations.

Okay... The only other solution I can think of right now to ensure
correct behavior for trylock would be to use thread-local storage to
keep track of which locks a thread has acquired.  But that would result
in too much overhead I guess.

So, that means no elision beyond calls to trylock, unless we can find
another solution.

> > > - Same applies to the rwlocks. Some of the return values changes
> > >   (for example there is no EDEADLK for an elided lock, unless it aborts.
> > >    However when elided it will also never deadlock of course)
> > 
> > The EDEADLK case is just a may-fail, so this is fine; as you say, there
> > is no deadlock in this case . Are there any other changes in semantics
> > besides wrong return values of trywrlock and tryrdlock?
> 
> Elided locks always behave like reader locks, so you can see
> write locks behaving like read locks to the same write locks.
> (e.g. write trylock inside write trylock will succeed now)
> This is the same as with the mutexes.

But that's just the wrong return values of the trylock functions, which
need to be fixed.

> > I'd like us to discuss the semantics and high-level design in more
> > detail first.  Thus, I have no comments yet on the specifics of your
> > patch, but will review once we have consensus on the design.
> 
> On contrary I would prefer if people would discuss concrete code instead
> of various strawmans. In my experience high level theoretical discussions of
> transactional memory often go off into the weeds, it's much better 
> to have concrete examples related to code.

The discussion about the interface and the tuning is very much a
concrete topic, for example.  As is the discussion about trylock
semantics.  Do you really think that these points are straw men?


Torvald


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]