Intermittent failures retrieving process exit codes

Tom Honermann thonermann@coverity.com
Thu Nov 14 04:02:00 GMT 2013


On 12/21/2012 01:30 AM, Tom Honermann wrote:
> I spent most of the week debugging this issue.  This appears to be a
> defect in Windows.  I can reproduce the issue without Cygwin.  I can't
> rule out other third party kernel mode software possibly contributing to
> the issue.  A simple change to Cygwin works around the problem for me.
>
> I don't know which Windows releases are affected by this.  I've only
> reproduced the problem (outside of Cygwin) with Wow64 processes running
> on 64-bit Windows 7.  I haven't yet tried elsewhere.
>
> The problem appears to be a race condition involving concurrent calls to
> TerminateProcess() and ExitThread().  The example code below minimally
> mimics the threads created and exit process/thread calls that are
> performed when running Cygwin's false.exe.  The primary thread exits the
> process via TerminateProcess(), as pinfo::exit() does in
> winsup/cygwin/pinfo.cc.  The secondary thread exits itself via
> ExitThread(), as Cygwin's signal processing thread function, wait_sig(),
> does in winsup/cygwin/sigproc.cc.
>
> When the race condition results in the undesirable outcome, the exit
> code for the process is set to the exit code for the secondary thread's
> call to ExitThread().  I can only speculate at this point, but my guess
> is that the TerminateProcess() code disassociates the calling thread
> from the process before other threads are stopped such that
> ExitThread(), concurrently running in another thread, may determine that
> the calling thread is the last thread of the process and overwrite the
> process exit code.
>
> The issue also reproduces if ExitProcess() is called in place of
> TerminateProcess().  The test case below only uses TerminateProcess()
> because that is what Cygwin does.
>
> Source code to reproduce the issue follows.  Again, Cygwin is not
> required to reproduce the problem.  For my own testing, I compiled the
> code using Microsoft's Visual Studio 2010 x86 compiler with the command
> 'cl /Fetest-exit-code.exe test-exit-code.cpp'
>
> test-exit-code.cpp:
>
> #include <windows.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> DWORD WINAPI SecondaryThread(
>      LPVOID lpParameter)
> {
>      Sleep(1);
>      ExitThread(2);
> }
>
> int main() {
>      HANDLE hSecondaryThread = CreateThread(
>          NULL,                               // lpThreadAttributes
>          0,                                  // dwStackSize
>          SecondaryThread,                    // lpStartAddress
>          (LPVOID)0,                          // lpParameter
>          0,                                  // dwCreationFlags
>          NULL);                              // lpThreadId
>      if (!hSecondaryThread) {
>          fprintf(stderr, "CreateThread failed.  GLE=%lu\n",
>              (unsigned long)GetLastError());
>          exit(127);
>      }
>
>      Sleep(1);
>
>      if (!TerminateProcess(GetCurrentProcess(), 1)) {
>          fprintf(stderr, "TerminateProcess failed.  GLE=%lu\n",
>              (unsigned long)GetLastError());
>          exit(127);
>      }
>
>      return 0;
> }
>
>
> To run the test, a simple .bat file is used:
>
> test.bat:
>
> @echo off
> setlocal
>
> :loop
> echo test...
> test-exit-code.exe
> if %ERRORLEVEL% NEQ 1 (
>      echo test-exit-code.exe returned %ERRORLEVEL%
>      exit /B 1
> )
> goto loop
>
>
> test.bat should run indefinitely; it stops only when an exit code other
> than 1 is observed.  The amount of time it takes to fail on my machine
> (64-bit Windows 7 running in a VMware Workstation 8 VM under Kubuntu
> 12.04 on a Lenovo T420 Intel i7-2640M 2 processor laptop) varies
> considerably.  I had one run fail in fewer than 10 iterations, but most
> of the time it has taken upwards of 5 minutes to get a failure.
>
> The workaround I implemented within Cygwin was simple and sloppy.  I
> added a call to Sleep(1000) immediately before the call to ExitThread()
> in wait_sig() in winsup/cygwin/sigproc.cc.  Since this thread (probably)
> doesn't exit until the process is exiting anyway, the call to Sleep()
> does not adversely affect shutdown.  The thread just gets terminated
> while in the call to Sleep() instead of exiting before the process is
> terminated or getting terminated while still in the call to
> ExitThread().  A better solution might be to avoid the thread exiting at
> all (so long as it can't get terminated while holding critical
> resources), or to have the process exiting thread wait on it.  Neither
> of these is ideal.  Orderly shutdown of multi-threaded processes is
> really hard to do correctly on Windows.
>
> Since the exit code for the signal processing thread is not used, the
> wait_sig() thread (and any other thread that could exit concurrently
> with another thread) could exit with a special status value such as
> STATUS_THREAD_IS_TERMINATING (0xC000004BL).  Any process exit code
> matching that value would then be a likely indicator that this issue
> was encountered.
>
> As is, when this race condition results in the undesirable outcome,
> the exit status of the process is 0, because the signal processing
> thread exits with a status of 0.  This explains why false.exe works so
> well to reproduce the issue: its expected exit status is nonzero.  With
> true.exe a failure would be indistinguishable from success, so it would
> be impossible to produce a negative test with it.
>
> Tom.

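To make the guessed interleaving from the quoted message concrete, here is a toy model of it.  This only illustrates the speculation above and is not a description of Windows internals.  ProcessModel, terminate_begin(), terminate_finish() and exit_thread() are hypothetical names invented for this sketch, and splitting TerminateProcess() into two steps is an assumption.

```cpp
#include <cstdint>

// Toy model only: these types and steps are hypothetical and do not
// describe actual Windows kernel behavior.
struct ProcessModel {
    int live_threads;         // threads still associated with the process
    std::uint32_t exit_code;  // process exit code as currently recorded
};

// Assumed first half of TerminateProcess(): record the exit code and
// detach the calling thread while the other threads keep running.
constexpr ProcessModel terminate_begin(ProcessModel p, std::uint32_t code) {
    return ProcessModel{p.live_threads - 1, code};
}

// Assumed second half of TerminateProcess(): stop all remaining threads.
constexpr ProcessModel terminate_finish(ProcessModel p) {
    return ProcessModel{0, p.exit_code};
}

// Assumed ExitThread() behavior: a thread that believes it is the last
// one stores its own exit code as the process exit code.
constexpr ProcessModel exit_thread(ProcessModel p, std::uint32_t code) {
    return (p.live_threads == 1)
        ? ProcessModel{0, code}
        : ProcessModel{p.live_threads - 1, p.exit_code};
}

// Two threads, as in test-exit-code.cpp.
constexpr ProcessModel kStart{2, 0};

// ExitThread(2) completes before TerminateProcess(..., 1) starts.
constexpr std::uint32_t kThreadExitsFirst =
    terminate_finish(terminate_begin(exit_thread(kStart, 2), 1)).exit_code;

// ExitThread(2) runs in the window between the two assumed halves.
constexpr std::uint32_t kThreadExitsInWindow =
    terminate_finish(exit_thread(terminate_begin(kStart, 1), 2)).exit_code;

static_assert(kThreadExitsFirst == 1, "requested exit code survives");
static_assert(kThreadExitsInWindow == 2, "thread exit code leaks out");
```

In this model the process exit code is wrong only when the secondary thread exits inside the window, which would be consistent with the failure being intermittent.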
Time passes...

I worked with some former colleagues to report this issue to Microsoft.
Windows 8.1 and Windows Server 2012 R2 contain a fix that addresses
the test case above.  A hotfix has been made available for Windows 7 SP1
and Windows Server 2008 R2.  Should anyone desire a hotfix for other
versions of Windows, it will be necessary to open a case with Microsoft
to request it.

http://support.microsoft.com/kb/2875501
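
For systems without the fix, the diagnostic idea from the quoted message could be sketched roughly as below.  It assumes helper threads have been changed to exit with STATUS_THREAD_IS_TERMINATING, which is only a suggestion in the message above and not something Cygwin is known to do.  ExitDiagnosis and classify_exit_code() are hypothetical names; the constant is the value quoted above.

```cpp
#include <cstdint>

// Value quoted above for STATUS_THREAD_IS_TERMINATING.
constexpr std::uint32_t kThreadIsTerminating = 0xC000004BUL;

// Hypothetical result type for this sketch.
enum class ExitDiagnosis { AsExpected, LikelyExitRace, OtherFailure };

// Classify an observed process exit code against the expected one.
// A code equal to the sentinel suggests that a helper thread's exit
// status replaced the process exit code.
constexpr ExitDiagnosis classify_exit_code(std::uint32_t observed,
                                           std::uint32_t expected) {
    return observed == expected
               ? ExitDiagnosis::AsExpected
               : (observed == kThreadIsTerminating
                      ? ExitDiagnosis::LikelyExitRace
                      : ExitDiagnosis::OtherFailure);
}
```

A parent process would pass the value it obtained from GetExitCodeProcess() as the observed code.  A match is only a likely indicator, since a process could exit with that value for other reasons.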

Tom.



