Re: Lock elision test results


On Thu, Jul 04, 2013 at 01:26:02PM +0200, Torvald Riegel wrote:
> On Thu, 2013-07-04 at 11:29 +0200, Dominik Vogt wrote:
> > The same number of iterations per second, no matter what value
> > <n> is.
> 
> So no matter how much time thread 2 has before it's signaled to stop by
> thread 1, it always gets the same amount of work done?  That doesn't
> seem right.  Eventually, it should get more work done.

Not the same amount of work but the same amount of work *per
second*.

> > So, while thread 1 does one iteration of 
> > 
> >   waste a minimal amount of cpu
...
> > 
> > thread 2 does
> > 
> >   lock m1
> >   increment c3
> >   unlock m1
> >  
> > It looks like one iteration of thread 2 takes about six times more
> > cpu than an iteration of thread 1 (without elision)
> 
> But it doesn't look like it should because it's not doing more work; do
> you know the reason for this?

Well, thread 1 just increments a counter in a loop while thread 2
additionally calls pthread_mutex_lock() and pthread_mutex_unlock().
That could certainly explain the factor.
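
Roughly, the two loops look like this (a minimal sketch with made-up
names; the real test also takes care of the timing and of signalling
thread 2 to stop):

  #include <pthread.h>
  #include <stdio.h>
  #include <unistd.h>

  static pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
  static volatile unsigned long c1, c3;
  static volatile int stop;

  /* Thread 1: waste a minimal amount of cpu per iteration.  */
  static void *
  thread1_fn (void *arg)
  {
    while (!stop)
      c1++;
    return NULL;
  }

  /* Thread 2: one short critical section per iteration.  */
  static void *
  thread2_fn (void *arg)
  {
    while (!stop)
      {
        pthread_mutex_lock (&m1);
        c3++;
        pthread_mutex_unlock (&m1);
      }
    return NULL;
  }

  int
  main (void)
  {
    pthread_t t1, t2;

    pthread_create (&t1, NULL, thread1_fn, NULL);
    pthread_create (&t2, NULL, thread2_fn, NULL);
    sleep (5);                    /* <n> seconds of measurement */
    stop = 1;
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    printf ("c1=%lu c3=%lu\n", (unsigned long) c1, (unsigned long) c3);
    return 0;
  }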

> > and with
> > elision it takes about 14 times more cpu.
> 
> It might be useful to have a microbenchmark that tests at which length
> of critical sections the overhead of the transactional execution is
> amortized (e.g., if we assume that the lock is contended, or is not, or
> with a certain probability).

That always depends on the abort ratio of the transaction and thus
on what other threads are doing.
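
For the uncontended case, such a test could look roughly like the
sketch below (names and the timing method are made up; the same
binary would be run once against an elision-enabled libc and once
without):

  #include <pthread.h>
  #include <stdio.h>
  #include <time.h>

  static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
  static volatile unsigned long sink;

  /* One critical section of roughly 'len' units of work.  */
  static void
  critical_section (unsigned long len)
  {
    pthread_mutex_lock (&m);
    for (unsigned long i = 0; i < len; i++)
      sink++;
    pthread_mutex_unlock (&m);
  }

  int
  main (void)
  {
    for (unsigned long len = 1; len <= 4096; len *= 2)
      {
        struct timespec t0, t1;
        unsigned long iters = 0;
        double secs;

        clock_gettime (CLOCK_MONOTONIC, &t0);
        do
          {
            for (int i = 0; i < 10000; i++)
              critical_section (len);
            iters += 10000;
            clock_gettime (CLOCK_MONOTONIC, &t1);
            secs = (t1.tv_sec - t0.tv_sec)
                   + (t1.tv_nsec - t0.tv_nsec) / 1e9;
          }
        while (secs < 1.0);
        printf ("len=%lu iters/s=%.0f\n", len, iters / secs);
      }
    return 0;
  }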

> We need to model performance in some way
> to be able to find robust tuning parameters, and find out which kind of
> tuning and tuning input we actually need.  Right now we're just looking
> at the aborts; it seems that for z at least, the critical section length
> should also be considered.

Make a suggestion for another test and I'll happily hack and run
it.

> > > At which point we should have the option to do the same as
> > > in case (3);  thus, the difference is surprising to me.  Do you have any
> > > explanations or other guesses?
> > 
> > Hm, with the default tuning values (three attempts with elision
> > and three with locks), *if* thread 1 starts using locks, thread 2
> > would
> > 
> >   try to elide m1 <------------------\
> >     begin a transaction              |
> >     *futex != 0 ==> forced abort     |
> >   acquire lock on m1                 |
> >     increment counter                |
> >   release lock on m1                 |
> >   acquire lock on m1                 |
> >     increment counter                |
> >   release lock on m1                 |
> >   acquire lock on m1                 |
> >     increment counter                |
> >   release lock on m1  ---------------/
> > 
> > I.e. for three successful locks you get one aborted transaction.
> > This slows down thread 2 considerably.  Actually I'm surprised
> > that thread 2 does not lose more.
> > 
> > What I do not understand is why thread 1 starts aborting
> > transactions at all.  After all there is no conflict in the
> > write sets of both threads.  The only aborts should occur because
> > of interrupts.  If, once the lock is used, unfortunate timing
> > conditions force the code to not switch back to elision (because
> > one of the threads always uses the lock and forces the other one
> > to lock too), that would explain the observed behaviour.  But
> > frankly that looks too unlikely to me, unless I'm missing some
> > important facts.
> 
> Yes, maybe it's some kind of convoying issue.  Perhaps add some
> performance counters to your glibc locally to see which kind of aborts
> you get?  If using TLS and doing the observations in the
> non-transactional path, it shouldn't interfere with the experiment too much.

My local gcc patch will soon be in a usable state so that I can
get some decent profiling information, but as the instrumentation is
somewhat invasive (at least outside transactions), there's no
guarantee that it can pin down what's happening.
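
For reference, the adaptive behaviour I described above corresponds
roughly to logic like the following.  All names and the structure are
invented for illustration; this is not the actual glibc elision code:

  struct elided_mutex
  {
    int futex;          /* 0 = free, nonzero = locked */
    int skip_elision;   /* > 0: use the real lock for a while */
  };

  extern int try_begin_transaction (void);   /* hypothetical: 1 = in txn */
  extern void abort_transaction (void);      /* hypothetical */
  extern void real_lock (struct elided_mutex *m);

  void
  adaptive_lock (struct elided_mutex *m)
  {
    if (m->skip_elision > 0)
      {
        /* A recent abort: use the real lock for a few acquisitions
           ("three with locks" in the default tuning).  */
        m->skip_elision--;
        real_lock (m);
        return;
      }

    /* "Three attempts with elision" in the default tuning.  */
    for (int attempt = 0; attempt < 3; attempt++)
      if (try_begin_transaction ())
        {
          if (m->futex == 0)
            return;                   /* elided; the lock stays free */
          /* Lock is held: the forced abort in the trace above.  */
          abort_transaction ();
          break;                      /* no point retrying while it is held */
        }

    /* Elision did not work out; fall back and skip it for a while.  */
    m->skip_elision = 3;
    real_lock (m);
  }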

As a side note, using TLS in the lock elision patch (for thread
debugging) considerably slows down the pthread_mutex_... functions
as fetching the TLS pointer is slow.
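
The per-thread counters Torvald suggests could be as simple as the
sketch below; the names and the exact increment points are
assumptions, and since they live in the non-transactional fallback
path they only pay the TLS cost outside transactions:

  #include <stdio.h>

  /* Per-thread statistics, updated only in the non-transactional
     fallback path so that they do not perturb running transactions.
     The names and the increment points are made up.  */
  static __thread unsigned long txn_aborts;     /* transaction started, then aborted */
  static __thread unsigned long lock_fallbacks; /* fell back to the real lock */

  /* In the fallback path of the elided lock one would do roughly:
       txn_aborts++;       after an abort has been detected
       lock_fallbacks++;   just before taking the real lock       */

  void
  dump_elision_stats (const char *tag)
  {
    fprintf (stderr, "%s: aborts=%lu fallbacks=%lu\n",
             tag, txn_aborts, lock_fallbacks);
  }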

> > The explanation for that is probably the machine architecture.
> > Our system partition has eight cpus.  Six (or five?) cpus are on
> > the same chip.  So, if some thread is executed by a cpu on a
> > different chip, memory latency is much higher.  This effect could
> > explain the constant factor I see.
> 
> Yes.  Can you try with pinning the threads to CPUs?

Not on the shared machine, no, but I can get a slot on a dedicated
testing machine where this can be controlled.  That will take some
time and effort to prepare, though, and before I get the slot I need
to know exactly what I want to test.
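
For the pinning itself, something along these lines should do on
Linux (pthread_setaffinity_np is a GNU extension; the cpu numbers
would just be picked to land on the same or on different chips):

  #define _GNU_SOURCE
  #include <pthread.h>
  #include <sched.h>

  /* Pin the calling thread to a single cpu.  */
  static int
  pin_self_to_cpu (int cpu)
  {
    cpu_set_t set;

    CPU_ZERO (&set);
    CPU_SET (cpu, &set);
    return pthread_setaffinity_np (pthread_self (), sizeof set, &set);
  }

Running the test once with both threads on the same chip and once
with them on different chips should show whether the memory latency
effect really accounts for the constant factor.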

Ciao

Dominik ^_^  ^_^

-- 

Dominik Vogt
IBM Germany

