This is the mail archive of the gdb@sourceware.org mailing list for the GDB project.



Re: Partial cores using Linux "pipe" core_pattern


On Mon, 2009-05-18 at 15:49 +0200, Andreas Schwab wrote:
> Apparently the ELF core dumper cannot handle short writes (see
> dump_write in fs/binfmt_elf.c).  You should probably use a read buffer
> of at least a page, which is the most the kernel tries to write at
> once.

Sorry for the delay; I lost my repro case and it took me a while to find
a new one.  And now when I dump cores over NFS, the bonding driver is
causing a kernel panic, so there's that *sigh*.  I reconfigured my
interfaces to use a single non-bonded interface so I can sidestep that
issue and concentrate on this one... I'll worry about the panic tomorrow.

I still need to do more investigation, but I now have a clearer picture
of when I see these short cores vs. "good" cores.  My system has a
single process, and when a request for work comes in it forks (but does
not exec) a number of helper copies of itself (typically 8).
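
For illustration, the worker setup is shaped roughly like the sketch
below; the worker count and do_work() are placeholders, not my real code:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void do_work(void)
{
    /* ... the same code the parent runs ... */
}

int main(void)
{
    /* The parent forks N copies of itself and never execs, so the
     * children share the parent's address space copy-on-write. */
    for (int i = 0; i < 8; i++) {
        pid_t pid = fork();
        if (pid == 0) {
            do_work();           /* child: same code, no exec */
            _exit(0);
        }
    }
    while (wait(NULL) > 0)       /* parent waits for all workers */
        ;
    return 0;
}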

In my test, all copies run the same code, and so all will segv at
around the same time (I just added code to do an invalid pointer access
at different areas of the program when certain test files exist).
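
The trigger is nothing fancy; roughly this, with a made-up test-file
path:

#include <stddef.h>
#include <unistd.h>

/* Each worker calls this at a few different points in the program;
 * if the trigger file exists, do an invalid pointer access. */
static void maybe_crash(const char *trigger_file)
{
    if (access(trigger_file, F_OK) == 0) {
        volatile int *p = NULL;
        *p = 42;                 /* deliberate invalid write -> SIGSEGV */
    }
}

int main(void)
{
    maybe_crash("/tmp/crash-in-recoverable-area");   /* made-up path */
    return 0;
}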

Some areas of the code consider a segv or similar to be unrecoverable.
In those situations I have a signal handler that stops the other
processes in the process group and dumps a single core; those other
processes do NOT dump core, and the whole thing exits.  The cores I get in
this situation are fine.
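
Very roughly, that handler is shaped like the sketch below (simplified:
here I just have the siblings exit on SIGTERM rather than stopping them
first, and the real code also has to worry about logging and
async-signal safety):

#include <signal.h>
#include <unistd.h>

static void fatal_handler(int sig)
{
    /* Ignore SIGTERM in this process so the group-wide kill() below
     * only takes out the sibling workers, which exit without dumping. */
    signal(SIGTERM, SIG_IGN);
    kill(0, SIGTERM);

    /* Restore the default action and re-raise, so the kernel writes
     * exactly one core -- this process's. */
    signal(sig, SIG_DFL);
    raise(sig);
}

int main(void)
{
    signal(SIGSEGV, fatal_handler);
    volatile int *p = 0;
    *p = 1;                      /* fault to exercise the handler */
    return 0;
}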

Other areas of the code consider a segv or similar to be recoverable.
In this case, each worker is left to dump core (or not) on its own, and
the system overall stays up.  When I force a segv in these areas, I get
the short cores.  Note that I am serializing my core dumping program
(the one the cores are piped to) via an flock() file on the local disk,
and this serialization (based on messages to syslog) does seem to be
working.  What I see are 6-8 core dump messages from the kernel; then my
core saver runs on the first one and dumps about 50M of the 1G process
space (about 188 reads of 256K buffers, plus some change).  Then that
exits, the second one starts and dumps only a 64K core (1 read), the
next also dumps 64K, and so on.
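
For reference, a stripped-down sketch of the saver's approach (not the
real program; the paths and output naming here are made up): take an
exclusive flock() on a file on local disk to serialize the dumps, then
drain stdin with a buffer well over a page, retrying on EINTR rather
than treating a short read as end-of-file:

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

#define BUF_SIZE (256 * 1024)    /* well over a page */

int main(void)
{
    static char buf[BUF_SIZE];

    /* Serialize concurrent dumps: every instance of the saver blocks
     * here until the previous one has finished and released the lock. */
    int lockfd = open("/var/run/core-saver.lock", O_RDWR | O_CREAT, 0600);
    if (lockfd < 0) {
        perror("open lock file");
        return 1;
    }
    while (flock(lockfd, LOCK_EX) < 0) {
        if (errno != EINTR) {
            perror("flock");
            return 1;
        }
    }

    /* Fixed output name for this sketch only. */
    int out = open("/var/tmp/core.saved", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (out < 0) {
        perror("open output");
        return 1;
    }

    for (;;) {
        ssize_t n = read(STDIN_FILENO, buf, sizeof buf);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* retry interrupted reads, don't bail out */
            perror("read");
            return 1;
        }
        if (n == 0)
            break;               /* real EOF: the kernel closed its end of the pipe */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out, buf + off, n - off);
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                perror("write");
                return 1;
            }
            off += w;            /* handle short writes on the output side */
        }
    }

    close(out);
    flock(lockfd, LOCK_UN);
    close(lockfd);
    return 0;
}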

It _feels_ to me like there's some kind of COW or similar mismanagement
of the VM for these forked processes such that they interfere and we
can't get a full and complete core dump when all of them are dumping at
the same time.

I'm going to do more investigation, but maybe this rings some bells
with someone.

