Instability with signals and threads

Corinna Vinschen corinna-cygwin@cygwin.com
Fri Nov 21 10:28:00 GMT 2014


On Nov 20 21:22, Mikulas Patocka wrote:
> > Never mind that.  I can fix your testcase by calling _my_tls.remove with 
> > INFINITE as parameter in both places.  If I drop one of them, your 
> > testcase will invariably fail at one point.  With both INFINITE params 
> > in place, your testcase is now running half an hour without problems.
> 
> For me, this change doesn't fix the testcase, it just reduces the 
> probability that it hangs.
> 
> With this change, the testcase still locks up, but with a different 
> stacktrace:
> thread1:
>         Sleep
>         _yield
>         pthread::create
>         sigdelayed ??
>         _cygwin_exit_return ??
>         _cygtls::call2
> 
> thread2:
>         SetEvent
>         muto::release
>         init_cygheap::find_tls
>         _cygtls::init_thread
> 
> thread3:
>         WriteFile
>         sig_send
>         timer_thread
>         cygthread::callfunc
>         cygthread::stub
>         _cygtls::call2
> 
> thread4:
>         VirtualFree 
>         thread_wrapper
> 
> thread5:
>         only ntdll stuff

Do you use a DLL built with optimization by any chance?  I wouldn't
take the backtraces too seriously in that case.  For debugging it helps
a lot to use a Cygwin DLL built without -O2.  Btw., are you testing
on 32 or 64 bit?  I'm testing on 64 bit.

I can't reproduce your backtrace, but I can reproduce another one, which
is related to thread_exit.  At one point after a couple thousand runs
through your testcase I have a variable number of threads hanging in
thread_exit, and a timer thread which is unable to send its signal.  The
other threads all hang in thread_exit, waiting for a muto which is taken
by a thread which doesn't exist anymore.  That's a very serious downside
of the muto implementation not being able to recognize being abandoned.
I wonder if that shouldn't be using a real OS mutex.
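
To illustrate what "recognizing being abandoned" buys you: a Windows
mutex object reports WAIT_ABANDONED to the next waiter when its owner
exits while still holding it.  The sketch below shows the same idea with
the POSIX analogue, a robust pthread mutex, only because that is easy to
run outside of Windows.  It is not how muto works, and the names
owner_that_dies and lock_after_owner_died are made up for this example.

```cpp
#include <pthread.h>
#include <cerrno>

// Hypothetical stand-in for the contended lock; this is not Cygwin's muto.
static pthread_mutex_t demo_lock;

// Plays the thread that takes the lock and then vanishes.
static void *owner_that_dies(void *)
{
    pthread_mutex_lock(&demo_lock);   // take the lock ...
    return nullptr;                   // ... and exit without releasing it
}

// Returns the result of locking after the owner thread has exited.
int lock_after_owner_died()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    // A robust mutex lets the next locker learn that the owner died.
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&demo_lock, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_t t;
    pthread_create(&t, nullptr, owner_that_dies, nullptr);
    pthread_join(t, nullptr);         // the owner is gone at this point

    // A plain lock would block forever here; a robust one reports the
    // abandonment and hands the lock to the caller.
    int rc = pthread_mutex_lock(&demo_lock);
    if (rc == EOWNERDEAD)
        pthread_mutex_consistent(&demo_lock);  // mark the state as repaired
    if (rc == 0 || rc == EOWNERDEAD)
        pthread_mutex_unlock(&demo_lock);
    pthread_mutex_destroy(&demo_lock);
    return rc;
}
```

The waiter gets EOWNERDEAD instead of hanging, so it can clean up and
carry on.  Whether the protected data is still consistent at that point
is a separate question the caller has to answer.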

As a sidenote, the snapshot apparently doesn't work well in other
scenarios either.  Yaakov reported hangs in KDE :(

> > Thinking about it, the fact that _cygtls::remove allows a 
> > non-INFINITE wait is rather strange, isn't it?  Calling remove_tls with 
> > a 0 wait allows the function to return silently, without actually 
> > having removed the thread from the list.  This is bound to go downhill 
> > at one point and looks like a kludge to me to circumvent some potential 
> > hang in another situation...
> 
> Looking at CVS history, the "wait" argument was added to cygtls.cc version 
> 1.2 with a comment: "Add a 'wait' argument to control how long we wait for 
  ^^^
  Wow.  So that's really old, more than 10 years.
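
The hazard of a non-INFINITE wait in a removal function can be sketched
in a few lines.  This is a simplified model, not the Cygwin code:
thread_list, its remove method and entries_left_after_contended_remove
are hypothetical names, and a std::timed_mutex stands in for the lock
guarding the thread list.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the thread list; not Cygwin's real type.
struct thread_list {
    std::timed_mutex lock;
    std::vector<int> ids;

    // Models the questionable pattern: if the lock cannot be taken
    // within `wait`, return silently and leave the entry in place.
    void remove(int id, std::chrono::milliseconds wait) {
        std::unique_lock<std::timed_mutex> guard(lock, wait);
        if (!guard.owns_lock())
            return;                   // caller never learns removal failed
        ids.erase(std::remove(ids.begin(), ids.end(), id), ids.end());
    }
};

// Counts the entries left after a zero-wait removal that runs while
// another thread holds the list lock.
std::size_t entries_left_after_contended_remove() {
    thread_list tl;
    tl.ids = {1, 2, 3};

    tl.lock.lock();                   // this thread plays the lock holder
    std::thread exiting([&tl] {
        tl.remove(2, std::chrono::milliseconds(0));
    });
    exiting.join();                   // the "exiting" thread is gone now
    tl.lock.unlock();

    return tl.ids.size();             // the stale entry for id 2 is still counted
}
```

With the lock contended, the zero-wait call gives up and all three
entries survive, including the one for the thread that has already gone
away.  Anything that walks the list later will trip over it.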

> > Other than that, there's certainly some room for improvement.  Calling 
> > threadlist[idx]->remove from the find_tls exception handler looks 
> > extremely hairy to me.  I wonder if that should be called at all at this 
> > point, or if there should rather be some "simplified" removal 
> > operation which doesn't require the _cygtls pointer.  If the thread 
> > doesn't exist anymore, neither does its _cygtls area.
> 
> I suggest removing that exception handler altogether. This thing can't ever 
> work reliably - it could reduce the probability of crashes but not eliminate 
> them. Even if we handled the page fault correctly - what happens if some 
> other thread allocates a different object at the location that previously 
> belonged to the tls? Then find_tls thinks that this different object is 
> the tls and corrupts it.

My point exactly.  AFAICS, the problem is that the cygtls area of a
thread is on the thread's own stack.  While this looks neat at first
glance, and works fine in most scenarios, the problem is that it gets
destroyed by the OS as soon as the thread exits.  So there's a chance
that another thread using the cygtls area of this thread (the signal
thread for instance) may end up with pointers into nirvana or, as you
point out, space taken for completely different tasks.

In the short term it's impossible to fix this thoroughly I guess,
because this requires a very careful overhaul of the cygtls handling.
What we need is a cygtls area which is created at thread start, but
which can be locked in memory as long as it's required by any thread.
Some synchronization is required.
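
One possible shape for such an area, purely as a sketch: allocate the
per-thread block on the heap at thread start and hand out counted
references, so the block stays valid for as long as any thread still
holds one.  Everything here (tls_block, tls_registry,
block_outlives_its_thread) is a hypothetical name for illustration, and
std::shared_ptr is just one way to get the "locked in memory as long as
required" property.  It is not a proposal for the actual cygtls layout.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-thread block standing in for the _cygtls area.
struct tls_block {
    bool thread_alive = true;
    int  pending_signals = 0;
};

// Hypothetical registry; the mutex is the "some synchronization" part.
class tls_registry {
    std::mutex lock_;
    std::vector<std::shared_ptr<tls_block>> blocks_;
public:
    std::shared_ptr<tls_block> add() {
        auto b = std::make_shared<tls_block>();
        std::lock_guard<std::mutex> g(lock_);
        blocks_.push_back(b);
        return b;
    }
    void remove(const std::shared_ptr<tls_block> &b) {
        std::lock_guard<std::mutex> g(lock_);
        blocks_.erase(std::remove(blocks_.begin(), blocks_.end(), b),
                      blocks_.end());
    }
    std::shared_ptr<tls_block> first() {
        std::lock_guard<std::mutex> g(lock_);
        if (blocks_.empty())
            return nullptr;
        return blocks_.front();
    }
    std::size_t size() {
        std::lock_guard<std::mutex> g(lock_);
        return blocks_.size();
    }
};

// True if a block pinned by another thread is still usable after its
// owning thread has unregistered and exited.
bool block_outlives_its_thread() {
    tls_registry reg;
    std::promise<void> registered, may_exit;
    std::future<void> registered_f = registered.get_future();
    std::future<void> may_exit_f = may_exit.get_future();

    std::thread worker([&] {
        auto mine = reg.add();        // created at thread start
        registered.set_value();
        may_exit_f.wait();            // wait until the block is pinned
        mine->thread_alive = false;
        reg.remove(mine);             // unregister, then exit
    });

    registered_f.wait();
    auto pinned = reg.first();        // e.g. what a signal thread would hold
    may_exit.set_value();
    worker.join();                    // the owning thread is gone

    pinned->pending_signals++;        // still valid memory, no dangling pointer
    return reg.size() == 0 && !pinned->thread_alive
           && pinned->pending_signals == 1;
}
```

The thread holding the reference can see from thread_alive that the
owner is gone, instead of reading a stack the OS has already torn down.
The cost is a heap allocation and reference counting per thread.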


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Maintainer                 cygwin AT cygwin DOT com
Red Hat