Cygwin multithreading performance

Mark Geisert mark@maxrnd.com
Wed Dec 9 06:10:00 GMT 2015


Corinna Vinschen wrote:
> On Dec  8 16:34, Corinna Vinschen wrote:
>> On Dec  8 02:51, Mark Geisert wrote:
>>>    caller 0x018014CC77, count      1, L, /oss/src/winsup/cygwin/thread.cc:475
>>>    caller 0x018014CD00, count      1, U, /oss/src/winsup/cygwin/thread.cc:496
>>>    caller 0x018014CDAF, count    432, L, /oss/src/winsup/cygwin/thread.cc:971
>>>    caller 0x018014CDE6, count    432, U, /oss/src/winsup/cygwin/thread.cc:982
>>>    caller 0x018014D07E, count      1, L, /oss/src/winsup/cygwin/thread.cc:1946
>>>    caller 0x018014D090, count      1, U, /oss/src/winsup/cygwin/thread.cc:1951
>>>    caller 0x018014D7E6, count      1, L, /oss/src/winsup/cygwin/thread.cc:525
>>>    caller 0x018014D7FF, count      1, U, /oss/src/winsup/cygwin/thread.cc:533
>>>    caller 0x018014EDD7, count      1, U, /oss/src/winsup/cygwin/thread.cc:2400
>>>    caller 0x018014EE97, count      1, L, /oss/src/winsup/cygwin/thread.cc:2389
>>
>> This is interesting.  I'm not sure if anything in the rest of the
>> output shows how much is wasted on the above two calls, though.
>>
>> thread.cc:971 and thread.cc:982 are pthread_setcancelstate, and it's
>> called pretty often as part of stdio functions.  Every stdio function
>> which has to lock the FILE structure also calls pthread_setcancelstate
>> to disable and reenable cancellation before and after locking.  That's
>> almost any stdio function.
>>
>> This may be one of the problems which lower performance, but there's no
>> easy or quick way around that, AFAICS.
>>
>> There's also the fact that, even for tools using __fsetlocking to disable
>> stdio locking, pthread_setcancelstate will still be called unconditionally.
>> The question here is, if that's wrong and pthread_setcancelstate should be
>> skipped if the application sets FSETLOCKING_BYCALLER.
>
> For a start, I simply removed the mutex lock/unlock in calls to
> pthread_setcancelstate and pthread_setcanceltype.  These locks are
> completely unnecessary.  These functions are only called for the current
> thread anyway.
>
> I'm just creating a developer snapshot which I'll upload to
> https://cygwin.com/snapshots/ in half an hour at the latest.  Please
> have a look if your testcase behaves better now.
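
For anyone following along: __fsetlocking is the knob being discussed above.
Here is a minimal sketch, with a made-up helper name, of an application taking
over stdio locking for a stream; the open question is whether the
pthread_setcancelstate calls should then be skipped as well:

    #include <stdio.h>
    #include <stdio_ext.h>   /* __fsetlocking, FSETLOCKING_* */

    /* Tell stdio that the caller now owns locking for f.  stdio then skips
       its internal FILE lock on every operation, though (per the above)
       pthread_setcancelstate is still called unconditionally. */
    void take_over_locking(FILE *f)
    {
        __fsetlocking(f, FSETLOCKING_BYCALLER);
        /* from here on, either use f from a single thread or wrap groups
           of stdio calls in flockfile(f)/funlockfile(f) yourself */
    }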

I'll give the snapshot a try, thanks, but it's hard for me to see :) how 
eliminating a few hundred lock/unlock pairs, 75% of which complete in under 
1 usec anyway, will help much.  I may have unduly influenced anybody following 
this thread by showing detailed stats only for the last two locks, which I 
picked at random.

What I'm more focused on is the set of locks that git uses heavily and that 
are heavily contended by the four git threads in the OP's testcase.  Here are 
the caller stats for those locks...

lock latency buckets (usec):                     <=1   <=10  <=100 <=1000  >1000
mtx  4908/06002A00F0 lks 155066, ulks 155066,  66359   4395  78895   4742    675
   caller 0x0100454FAD, count 154792, L, /usr/src/git/builtin/pack-objects.c:1758
   caller 0x0100454FFC, count 154777, U, /usr/src/git/builtin/pack-objects.c:1769
   caller 0x010045557F, count     15, U, /usr/src/git/builtin/pack-objects.c:1760
   caller 0x010045588D, count     15, L, /usr/src/git/builtin/pack-objects.c:1944
   caller 0x01004558AC, count     15, U, /usr/src/git/builtin/pack-objects.c:1947
   caller 0x01004598AC, count     15, L, /usr/src/git/builtin/pack-objects.c:2039
   caller 0x01004599B9, count     11, U, /usr/src/git/builtin/pack-objects.c:2078
   caller 0x0100459A7B, count      4, U, /usr/src/git/builtin/pack-objects.c:2078
   caller 0x018014DB4F, count    244, U, /oss/src/winsup/cygwin/thread.cc:1254
   caller 0x018014DB93, count     15, L, /oss/src/winsup/cygwin/thread.cc:1286
   caller 0x018014DBD3, count    229, L, /oss/src/winsup/cygwin/thread.cc:1286

mtx  4908/0600FA6860 lks 154745, ulks 154745,  56092   3217  64883  25435   5118
   caller 0x0100455659, count   1011, L, /usr/src/git/builtin/pack-objects.c:1640
   caller 0x0100455687, count   1011, U, /usr/src/git/builtin/pack-objects.c:1642
   caller 0x01004556E0, count 153734, L, /usr/src/git/builtin/pack-objects.c:1628
   caller 0x010045570E, count 153734, U, /usr/src/git/builtin/pack-objects.c:1630

mtx  4908/0603498600 lks 316825, ulks 316825, 146254   4986 155345   9686    554
   caller 0x0100455269, count 196769, L, /usr/src/git/builtin/pack-objects.c:1695
   caller 0x01004552C4, count  15148, U, /usr/src/git/builtin/pack-objects.c:1705
   caller 0x0100455478, count 181621, U, /usr/src/git/builtin/pack-objects.c:1702
   caller 0x010045554C, count 120056, L, /usr/src/git/builtin/pack-objects.c:1834
   caller 0x010045556E, count 120056, U, /usr/src/git/builtin/pack-objects.c:1837

If you look at the ">1000" column, you see that these 3 locks had 
675+5118+554 = 6347 lock acquisitions that each waited more than 1000 usec, 
i.e. at least 6347 ms = 6.347 sec spent waiting on this bucket alone, best 
case.  The next column to the left (a faster bucket, for latencies of 101 usec 
up to 1000 usec) has 4742+25435+9686 = 39863 acquisitions; at a best case of 
101 usec each, that's another 4.026 sec spent waiting for these locks.

I probably need to modify my stats-gathering program to display the total 
time spent waiting, to make it clearer how things add up for each mutex across 
the whole testcase.  Again, these stats do not measure how long locks are 
held, but rather how long it takes just to acquire them.
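
To be concrete about what is being timed, here is a stripped-down sketch of
the idea (a hypothetical wrapper, for illustration only; the stats above
additionally bucket the latency and record the caller's address):

    #include <pthread.h>
    #include <stdint.h>
    #include <time.h>

    /* Illustration only: the latency measured is the time spent inside
       pthread_mutex_lock() itself (waiting to acquire), not the time the
       mutex is subsequently held. */
    static uint64_t now_us(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000 + (uint64_t)ts.tv_nsec / 1000;
    }

    int timed_mutex_lock(pthread_mutex_t *m, uint64_t *total_wait_us)
    {
        uint64_t t0 = now_us();
        int res = pthread_mutex_lock(m);
        *total_wait_us += now_us() - t0;   /* accumulate per-mutex wait */
        return res;
    }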

It may be, when all is said and done, that this is simply the price one pays 
for POSIX-compliant pthread_mutexes.  It just kind of bugs me that the testcase 
works fine under MinGW using MS CriticalSection ops, yet we don't seem to have 
anything similarly light-weight in our pthread_mutex implementation.  Or I'm 
missing it :).
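
For comparison, this is the kind of primitive a MinGW build gets to use; a
sketch of Win32 CRITICAL_SECTION usage (uncontended enter/leave stays entirely
in user space, with an optional spin before any kernel wait):

    #include <windows.h>

    static CRITICAL_SECTION cs;

    void critsec_demo(void)
    {
        /* spin up to 4000 times on contention before sleeping in the kernel */
        InitializeCriticalSectionAndSpinCount(&cs, 4000);

        EnterCriticalSection(&cs);    /* uncontended case: no syscall */
        /* ... work on shared state ... */
        LeaveCriticalSection(&cs);

        DeleteCriticalSection(&cs);
    }
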
Thanks for reading,

..mark


