Problem with zombie processes
Mark Geisert
mark@maxrnd.com
Mon Feb 20 22:54:00 GMT 2017
Erik Bray wrote:
> On Mon, Feb 20, 2017 at 11:54 AM, Mark Geisert wrote:
>>> So my guess was that Cygwin might try to hold on to a handle to a
>>> child process at least until it's been explicitly wait()ed. But that
>>> does not seem to be the case after all.
>>
>>
>> You might have missed a subtlety in what I said above. The Python
>> interpreter itself is calling wait4() to reap your child process. Cygwin
>> has told Python one of its children has died. You won't get the chance to
>> wait() for it yourself. Cygwin *does* have a handle to the process, but it
>> gets closed as part of Python calling wait4().
>
> To be clear, wait4() is not called from Python until the script
> explicitly calls p.wait().
> In other words, when I run this step by step (e.g. in gdb) I don't see a
> wait4() call until the point where the script explicitly calls wait(). I
> don't see any reason Python would do this behind the scenes.
You're right. I missed the wait in your script and ASSumed too much about the
Python interpreter :-(
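For anyone following along, here is a minimal Python sketch of the sequence under discussion: kill a child, read its state from /proc/<pid>/stat, and only then wait() for it. The helper names proc_state and state_before_wait are made up for illustration and are not taken from Erik's script. On Linux the state read before the wait should normally be Z; on Cygwin, per this thread, the read may fail instead, which the sketch reports as None.

```python
import subprocess
import sys
import time


def proc_state(pid):
    """Return the state character from /proc/<pid>/stat, or None if unreadable."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        # Any open/read failure lands here, e.g. ENOENT once the pid is gone,
        # or the EINVAL reported in this thread on Cygwin.
        return None
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' before picking the state field.
    _head, sep, tail = data.rpartition(")")
    fields = tail.split()
    if not sep or not fields:
        return None
    return fields[0]


def state_before_wait(timeout=5.0):
    """Kill a child, then report its state before and after wait()."""
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    child.kill()
    # The kill is asynchronous: poll until the child stops looking alive.
    deadline = time.monotonic() + timeout
    before = proc_state(child.pid)
    while before in ("R", "S", "D") and time.monotonic() < deadline:
        time.sleep(0.01)
        before = proc_state(child.pid)
    child.wait()  # reaps the child; its /proc entry should then disappear
    after = proc_state(child.pid)
    return before, after


if __name__ == "__main__":
    print(state_before_wait())
```

Treating every OSError as "unreadable" is a deliberate simplification; a caller who needs to tell a zombie from a vanished pid would have to inspect errno.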
>>> Anyways, I think it would be nicer if /proc returned at least partial
>>> information on zombie processes, rather than an error. I have a patch
>>> to this effect for /proc/<pid>/stat, and will add a few more as well.
>>> To me /proc/<pid>/stat was the most important because that's the
>>> easiest way to check the process's state in the first place! Now I
>>> also have to catch EINVAL as well and assume that means a zombie
>>> process.
>>
>>
>> The file /proc/<pid>/stat is there until Cygwin finishes cleanup of the
>> child due to Python having wait()ed for it. When you run your test script,
>> pay attention to the process state character in those cases where you
>> successfully read the stat file. It's often S (sleeping) or R
>> (running) but I also see Z (zombie) sometimes. Your script is in a race
>> with Cygwin, and you cannot guarantee you'll see a killed process's state
>> before Cygwin cleans it up.
>>
>> One way around this *might* be to install a SIGCHLD handler in your Python
>> script. If that's possible, that should tell you when your child exits.
>
> Perhaps the Python script is a red herring. I just wrote it to
> demonstrate the problem. The difference depending on where I send
> stdout is strange, but you're likely right that it just comes down to
> subtle timing differences. Here's a C program that demonstrates the
> same issue more reliably. Interestingly, it works when I run it in
> strace (probably just because of the strace overhead) but not when I
> run it normally.
>
> My point in all this is I'm confused why Cygwin would give up its
> handles to the Windows process before wait() has been called.
>
> (In fact, it's pretty confusing to have fopen returning EINVAL which
> according to [1] it should only be doing if the mode string were
> invalid.)
>
> Thanks,
> Erik
>
> [1] http://pubs.opengroup.org/onlinepubs/9699919799/functions/fopen.html
O.K., you may be on to something amiss in the Cygwin DLL. Thanks for the STC
(simple test case) in C; that'll help somebody looking further at this. I'm
out of ideas. It might be possible to reduce strace overhead somewhat by
selecting a smaller set of trace options than the default.
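The SIGCHLD idea quoted earlier in this message could look roughly like this in Python. This is only a sketch under assumptions: the function name run_child_and_wait_for_signal is hypothetical, the handler does nothing but record that the signal arrived, and the child is still reaped with an ordinary wait(). Whether it behaves the same way on Cygwin is untested here.

```python
import signal
import subprocess
import sys
import time


def run_child_and_wait_for_signal(timeout=5.0):
    """Start a short-lived child and report whether SIGCHLD announced its exit.

    Must be called from the main thread, because signal.signal() only
    works there.
    """
    arrivals = []

    def on_sigchld(signum, frame):
        # Only record the event; reaping is left to wait() below.
        arrivals.append(time.monotonic())

    previous = signal.signal(signal.SIGCHLD, on_sigchld)
    try:
        child = subprocess.Popen([sys.executable, "-c", "pass"])
        deadline = time.monotonic() + timeout
        while not arrivals and time.monotonic() < deadline:
            time.sleep(0.01)
        notified = bool(arrivals)
        returncode = child.wait()  # the child is reaped here, not in the handler
        return notified, returncode
    finally:
        # Put back whatever handler was installed before.
        if previous is None:
            previous = signal.SIG_DFL
        signal.signal(signal.SIGCHLD, previous)


if __name__ == "__main__":
    print(run_child_and_wait_for_signal())
```

Between the signal arriving and the wait() call the child has exited but has not been reaped, so that window is where a look at /proc/<pid>/stat would be attempted.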
..mark