killpg(pgid, 0) fails if the process is in the middle of spawnve()

Corinna Vinschen corinna-cygwin@cygwin.com
Wed May 18 14:54:03 GMT 2022


On May 18 20:19, Jun T wrote:
> It seems killpg(2) on Cygwin has a problem as described below.
> Can this be (easily) fixed?
> 
> [1] The problem
> 
> killpg(pgid, 0) (or kill_pgrp(pgid, si_signo=0), in signal.cc)
> fails (returns -1) even when there is a process in the process
> group pgid, if the process is in the middle of spawnve().
> 
> [2] A problem of zsh on Cygwin that is caused by [1]
> [...]
> zsh% ls --color | less
> zsh: done                    ls --color |
> zsh: suspended (tty output)  less
> 
> How frequently you get this may depend on your hardware,
> but if it happens you will find it quite annoying.
> 
> [3] How does [1] cause [2]?
> 
> According to the strace output, what is happening is as follows:
> 
> The main zsh (zsh0) fork() two subshells, zsh1 for ls and
> zsh2 for less.
> zsh1 becomes a process group leader (pid=pgid=101, for example),
> gets tty (becomes foreground), and calls execve(ls).
> zsh2 becomes a member of the process group pgid=101, and
> calls execve(less).
> 
> When ls exits, zsh0 gets SIGCHLD, and in the signal handler
> it calls killpg(101, 0) to see if there are any process
> remaining in the process group 101. At this point zsh2/less
> is still in the process group 101, so killpg(101, 0) should
> succeed.
> 
> But when problem [2] happens, zsh2 has already called execve(less)
> or spawnve(_P_OVERLAY,less), but spawnve() has not finished yet.
> 
> There are two Windows processes (zsh2 and less), but it _seems_
> neither of them is included in the list of win-pids created by
>     winpids pids ((DWORD) PID_MAP_RW);
> at line 358 of signal.cc. So kill_pgrp() fails, and zsh0 thinks
> that there is no foreground process remaining, and regains tty.
> [...]

I'm not 100% sure, but I think the core of the problem is spawn.cc.
Look into the mode == _P_OVERLAY (aka "exec'ing") starting at line 857.

At line 870, the handle to the own winpid object is closed in the
exec'ing process. However, the exec'ing process only creates the winpid
symlink for the exec'ed process if it's a non-Cygwin process. For a
Cygwin process, the assumption is that the exec'ed process creates its
own symlink (in pinfo::thisproc() in pinfo.cc). If the exec'ing process
calls NtClose on it's own winpid symlink, but the exec'ed process didn't
progress enough into initialization, there's a slim chance that neither
the exec'ing process, nor the exec'ed process has a winpid symlink
attached.

As for how to fix this... as a first rough idea I'd think that we
should always create the winpid symlink in spawn.cc, even for
exec'ed Cygwin processes.  We also need to make sure that we dup
the handle into the new process, and stop creating the winpid symlink
in exec'ed processes. Here's a Q&D patch:

diff --git a/winsup/cygwin/pinfo.cc b/winsup/cygwin/pinfo.cc
index 4a8daefd924e..f501c470f34c 100644
--- a/winsup/cygwin/pinfo.cc
+++ b/winsup/cygwin/pinfo.cc
@@ -55,10 +55,11 @@ void
 pinfo::thisproc (HANDLE h)
 {
   procinfo = NULL;
+  bool execed = !!h;
 
   DWORD flags = PID_IN_USE | PID_ACTIVE;
   /* Forked process or process started from non-Cygwin parent needs a pid. */
-  if (!h)
+  if (!execed)
     {
       cygheap->pid = create_cygwin_pid ();
       flags |= PID_NEW;
@@ -72,7 +73,8 @@ pinfo::thisproc (HANDLE h)
   procinfo->dwProcessId = myself_initial.dwProcessId;
   procinfo->sendsig = myself_initial.sendsig;
   wcscpy (procinfo->progname, myself_initial.progname);
-  create_winpid_symlink ();
+  if (!execed)
+    create_winpid_symlink ();
   procinfo->exec_sendsig = NULL;
   procinfo->exec_dwProcessId = 0;
   debug_printf ("myself dwProcessId %u", procinfo->dwProcessId);
diff --git a/winsup/cygwin/spawn.cc b/winsup/cygwin/spawn.cc
index 98b588698c5c..905d949fc5c4 100644
--- a/winsup/cygwin/spawn.cc
+++ b/winsup/cygwin/spawn.cc
@@ -860,13 +860,14 @@ child_info_spawn::worker (const char *prog_arg, const char *const *argv,
 	  strace.execing = 1;
 	  myself.hProcess = hExeced = pi.hProcess;
 	  HANDLE old_winpid_hdl = myself.shared_winpid_handle ();
-	  if (!real_path.iscygexec ())
-	    {
-	      /* If the child process is not a Cygwin process, we have to
-		 create a new winpid symlink on behalf of the child process
-		 not being able to do this by itself. */
-	      myself.create_winpid_symlink ();
-	    }
+	  /* We have to create a new winpid symlink on behalf of the child
+	     process. For Cygwin processes we also have to create a reference
+	     in the child. */
+	  myself.create_winpid_symlink ();
+	  if (real_path.iscygexec ())
+	    DuplicateHandle (GetCurrentProcess (),
+			     myself.shared_winpid_handle (),
+			     pi.hProcess, NULL, 0, 0, DUPLICATE_SAME_ACCESS);
 	  NtClose (old_winpid_hdl);
 	  real_path.get_wide_win32_path (myself->progname); // FIXME: race?
 	  sigproc_printf ("new process name %W", myself->progname);

However, this *might* have side effects, like having two processes with
the same pid in ps output or something like that.  The tricky thing here
is that creating the new winpid symlink and closing the old winpid
symlink should be atomic from the POV of pid collecting processes.

For a start, can you try the above patch?  I probably won't have the
time to come up with an atomic solution in the next 2 weeks, but in
general, the problem should be clear now.

If somebody else comes up with a neat atomic solution during my absence,
I'd be happy, too.


Corinna


More information about the Cygwin mailing list