This is the mail archive of the cygwin-patches mailing list for the Cygwin project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: Improvements to fork handling (1/5)


On 22/05/2011 4:29 PM, Christopher Faylor wrote:
On Sun, May 22, 2011 at 09:04:40AM -0400, Ryan Johnson wrote:
On 21/05/2011 9:41 PM, Christopher Faylor wrote:
On Wed, May 11, 2011 at 02:31:23PM -0400, Ryan Johnson wrote:
Hi all,

This is the first of a series of patches, sent in separate emails as
requested.

The first patch allows a child which failed due to address space
clobbers to report cleanly back to the parent. As a result, DLL_LINK
which land wrong, DLL_LOAD whose space gets clobbered, and failure to
replicate the cygheap, generate retries and dispense with the terminal
spam. Handling of unexpected errors should not have changed. Further,
the patch fixes several sources of access violations and crashes,
including:
- accessing invalid state after failing to notice that a
statically-linked dll loaded at the wrong location
- accessing invalid state while running dtors on a failed forkee. I
follow cgf's approach of simply not running any dtors, based on the
observation that dlls in the parent (gcc_s!) can store state about other
dlls and crash trying to access that state in the child, even if they
appeared to map properly in both processes.
- attempting to generate a stack trace when somebody in the call chain
used alloca(). This one is only sidestepped here, because we eliminate
the access violations and api_fatal calls which would have triggered the
problematic stack traces. I have a separate patch which allows offending
functions to disable stack traces, if folks are interested, but it was
kind of noisy so I left it out for now (cygwin uses alloca pretty
liberally!).

Ryan
diff --git a/child_info.h b/child_info.h
--- a/child_info.h
+++ b/child_info.h
@@ -92,6 +92,18 @@ public:
    void alloc_stack_hard_way (volatile char *);
};

+/* Several well-known problems can prevent us from patching up a
+   forkee; when such errors arise the child should exit cleanly (with
+   a failure code for the parent) rather than dumping stack.  */
+#define fork_api_fatal(fmt, args...)					\
+  do									\
+    {									\
+      sigproc_printf (fmt,## args);					\
+      fork_info->handle_failure (-1);					\
+    }									\
+  while(0)
+
+
class fhandler_base;

class cygheap_exec_info
diff --git a/dll_init.cc b/dll_init.cc
--- a/dll_init.cc
+++ b/dll_init.cc
@@ -19,6 +19,7 @@ details. */
#include "dtable.h"
#include "cygheap.h"
#include "pinfo.h"
+#include "child_info.h"
#include "cygtls.h"
#include "exception.h"
#include<wchar.h>
@@ -131,10 +132,16 @@ dll_list::alloc (HINSTANCE h, per_proces
      {
        if (!in_forkee)
	d->count++;	/* Yes.  Bump the usage count. */
+      else if (d->handle != h)
+	fork_api_fatal ("Location of %W changed from %p (parent) to %p (child)",
+			d->name, d->handle, h);
You seem to be guranteeing a failure in a condition which could conceivably work
ok for simple applications, i.e., if a dll loads in a different location that
is not necessarily going to cause a problem.
By fork semantics the condition *is* a failure. If we try to relax the
requirement we risk Bad Things happening, usually in hard-to-diagnose
ways. The example I have right off is libgcc_s storing pointers to other
dlls and seg faulting when it tries to access pointers which were valid
in the parent but not in the child. I prefer a fail-fast solution over
cross-fingers-and-hope-it-doesn't-happen-to-me.
When you add a failure case like this you are assuming that you
understand all of the parameters and that it will make things better.  I
am not convinced that this won't cause previously working cases to fail.

It is not inconceivable that a DLL could be relocated into another
location and continue to work in a forked process.  Yes, I know this
doesn't match the way fork is supposed to work but I'm not as concerned
about that as I am about Cygwin mailing list complaints about new
failures.
WJM?

The reason I'm objecting to this is because I've considered, from time
to time, adding a similar test but have always avoided it because I
couldn't convince myself that it would help more than it would hurt.  If
we are going to add tests, I'd prefer that the testing be done in
frok:parent when the child_copy happens for static and dynamic dlls,
maybe by adding a dll function which first checks that the data/bss can
be copied to the same location as the parent.
Maybe the new way you mentioned you're working on obviates the above, but:

1. We won't necessarily get as far as frok::parent. Both the above check and the access violations it avoids usually happen before cygwin1.dll initializes.

2. Because it runs before child_copy, the above check triggers a retry on failure and the overall fork can still succeed.

3. Any additional failures due to the above check would be just as intermittent as what happens now. It's not like some app would suddenly and systematically refuse to fork with this patch in place. At worst fork failures might become somewhat more frequent. Caveat: sometimes the parent's address space layout is especially hostile and fork does seem to fail systematically. However, this happens to me either way and can be fixed by restarting the offending process.

4. My experiments suggest that most access violations during fork arise precisely because dlls move, and moved dlls usually cause access violations. Those already generate noise on the list (emacs, anyone?), and the above fix would make that noise either disappear (successful retry) or merge with the usual gripes about "fork failed: resource temporarily unavailable" where it belongs.

5. Not performing the above check can cause forked programs to mysteriously fail at some later point after the fork appears to succeed. Given that we have at least one known example (cyggcc_s) I would argue that forks gone bad contribute to user frustration but do not generate mailing list traffic because it looks like an intermittent and non-reproducible problem with the user's code. WJM?

Ryan


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]