This is the mail archive of the
mailing list for the Cygwin project.
RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
- From: "Ernie Coskrey" <Ernie dot Coskrey at steeleye dot com>
- To: <cygwin at cygwin dot com>
- Date: Thu, 9 Aug 2007 11:43:31 -0400
- Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> -----Original Message-----
> From: firstname.lastname@example.org
> [mailto:email@example.com] On Behalf Of Ernie Coskrey
> Sent: Wednesday, August 08, 2007 2:11 PM
> To: firstname.lastname@example.org
> Subject: RE: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> > -----Original Message-----
> > From: email@example.com
> > [mailto:firstname.lastname@example.org] On Behalf Of Ernie Coskrey
> > Sent: Tuesday, July 31, 2007 3:40 PM
> > To: email@example.com
> > Subject: cygwin 1.5.20-1, spinning pdksh, 100% CPU
> > I've run into a problem with cygwin 1.5.20-1 and pdksh 5.2.14. We've
> > got a pdksh.exe process that is spinning, using all the CPU.
> > This scenario is very hard to reproduce, but has happened on our test
> > systems occasionally. It occurred recently, and I currently have gdb
> > attached to the process and have the symbols loaded. I see that pdksh
> > is continually calling "sigsuspend()", which is immediately returning
> > from cancelable_wait due to the fact that the signal_arrived event is
> > set. I also see that pdksh is waiting for a subprocess to exit
> > and has a handle to the PID of that process - however the process has
> > long since terminated.
> > It appears that something went wrong during delivery of SIGCHLD.
> > I've got two questions related to this:
> > - have there been changes between 1.5.20-1 and 1.5.24-2, or the latest
> > snapshot, that might have fixed this issue? We've done some limited
> > testing with 1.5.24-2 and haven't seen this happen yet, but as I said,
> > it only happens rarely.
> > - is there anything I can look at in gdb to help identify what the
> > issue is?
> > Any suggestions would be appreciated!
> > ---------
> > Ernie Coskrey
> I've discovered an interesting piece of information that I
> think is related to this. I'm hoping this might ring a bell
> with someone on the list.
> Looking at _main_tls->stack, when I've set a breakpoint in
> handle_sigsuspend just after the cancelable_wait() call, I
> see the following entries:
> 0x6109186f 0x4132ac
> 0x6109186f is "sigdelayed()", which is the routine that
> should have been called to deliver the signal and reset the
> signal_arrived event.
> 0x4132ac is j_waitj (in pdksh).
> So, somehow, when this problem occurs, "sigdelayed" gets
> pushed onto the stack *before* j_waitj does. So, _sigbe
> never calls sigdelayed.
> I don't think there's ever a case where sigdelayed should be
> at _main_tls->stack. Whatever put it there is, I believe,
> the cause of this problem.
> Ernie Coskrey
Well, I think that I may have found the cause of this issue, and I
believe that the problem exists in 1.5.24-2. Please take a look at what
I think is the solution, and let me know if I'm mistaken.
I believe that the problem is in _sigbe, at the very end of the
assembler code. _sigbe decrements the lock *before* it decrements
incyg. This leaves a very small window where another thread - possibly
the sig thread that's doing setup_handler() - can acquire the lock, see
that incyg is still set to 1, and act accordingly. In setup_handler,
this will cause the thread to go into _cygtls::interrupt_setup, which
pushes sigdelayed onto the tls stack. But since we're not really in
Cygwin code when this happens, sigdelayed() never gets executed and you
end up spinning as we're seeing.
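To make the window easier to see, here is a toy model of the last two steps of the return path and of the check made by the signal thread. It is only a sketch of the description above, not the real _sigbe assembler or setup_handler code: the step names, the run_epilogue and signal_thread_attempt helpers, and the "fixed" ordering are all assumptions introduced for illustration, and lock and incyg are reduced to 0/1 flags.

```python
# Toy model of the return path described above; NOT the real Cygwin code.
# All names here (run_epilogue, signal_thread_attempt, the step names)
# are hypothetical and exist only for this illustration.

def signal_thread_attempt(state):
    """Model of the signal thread's check, per the description above."""
    if state["lock"]:
        return "lock busy"            # cannot touch the tls stack right now
    if state["incyg"]:
        # The thread still looks like it is inside Cygwin, so the signal
        # is deferred by pushing sigdelayed onto the tls stack.
        state["tls_stack"].append("sigdelayed")
        return "deferred"
    return "interrupted directly"     # the thread is back in user code


def run_epilogue(order, attempt_after):
    """Run the two final steps in `order` and let the signal thread try
    once, right after step index `attempt_after`.
    Returns (signal thread outcome, what is left on the tls stack)."""
    # Assumed starting point: the return address has already been popped
    # and the lock is still held.
    state = {"lock": 1, "incyg": 1, "tls_stack": []}
    outcome = None
    for i, step in enumerate(order):
        if step == "unlock":
            state["lock"] = 0
        elif step == "clear_incyg":
            state["incyg"] = 0
        if i == attempt_after:
            outcome = signal_thread_attempt(state)
    return outcome, state["tls_stack"]


BUGGY = ("unlock", "clear_incyg")   # order described in this message
FIXED = ("clear_incyg", "unlock")   # assumed reordering, for comparison

if __name__ == "__main__":
    for name, order in (("buggy", BUGGY), ("fixed", FIXED)):
        for k, step in enumerate(order):
            print(name, "attempt after", step, "->", run_epilogue(order, k))
```

In this model the only interleaving that leaves sigdelayed stranded on the stack is the buggy order with the signal thread running between the two steps. With the two steps swapped, the signal thread either finds the lock busy or sees incyg already cleared.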
I'll post a patch to cygwin-patches.