This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: Looking for recommendation for using SystemTap
- From: fche at redhat dot com (Frank Ch. Eigler)
- To: Tony Reix <tony dot reix at bull dot net>
- Cc: "systemtap at sourceware dot org" <systemtap at sourceware dot org>
- Date: 29 Sep 2006 13:00:21 -0400
- Subject: Re: Looking for recommendation for using SystemTap
- References: <1159534951.28410.54.camel@frecb000687.frec.bull.fr>
Tony Reix <tony.reix@bull.net> writes:
> [...]
> The analysis of the Oopses clearly shows that "someone" writes strings
> (like "ata" or "ejbo") randomly in memory and destroys links in
> structures, like vmlist used by get_vmalloc_info in fs/proc/mmu.c or
> ulp->proc_list used by loop_undo in ipc/sem.c .
> [...]
> Do you think SystemTap can help me finding the culprit ?
> [...]
Perhaps. Does the memory corruption occur in predictable places?
Imagine a probe that runs periodically (via a frequently triggered
timer, or a breakpoint at a code point under suspicion). That probe
could look through selected places that tend to get corrupted, and
check for something suspicious. For example:
#! stap -g
probe kernel.function("after_your_function") { if (checkstuff ()) log ("bug") }
function checkstuff () /* .... */
What checkstuff() does depends on how a program may be able to assess
corruption. If it's ASCII strings showing up within known regions of
otherwise valid memory, something like this naive search could do it.
(Such a function could be encapsulated into the systemtap tapset
library.)
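function checkstuff () %{
  /* placeholder bounds of the region to scan: [begin, end) */
  char *begin = (char *) 0xdeadbeef;
  char *end = (char *) 0xdeadf00d;
  int found = 0;
  char *p;
  /* p+3 <= end keeps p[0], p[1] and p[2] inside the region */
  for (p = begin; p+3 <= end; p++)
    if (p[0] == 'a' && p[1] == 't' && p[2] == 'a') { found = 1; break; }
  THIS->__retval = found;
%}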
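As a rough illustration, the naive search above can be generalized to
arbitrary suspect strings. The following is only a user-space C sketch,
not tapset code: the names find_pattern and region_looks_corrupted are
made up for this example, and it assumes every byte in [begin, end) is
safe to read, which is not a given inside a kernel probe.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the first occurrence of pat[0..pat_len) inside the
 * half-open range [begin, end), or NULL if it does not occur.
 * Hypothetical helper; it dereferences every byte in the range.
 */
static const char *find_pattern(const char *begin, const char *end,
                                const char *pat, size_t pat_len)
{
    const char *p;

    /* Reject empty patterns, bad bounds, and regions shorter than pat. */
    if (pat_len == 0 || begin == NULL || end == NULL || end < begin ||
        (size_t)(end - begin) < pat_len)
        return NULL;

    /* The last valid start position is end - pat_len. */
    for (p = begin; p <= end - pat_len; p++) {
        if (memcmp(p, pat, pat_len) == 0)
            return p;
    }
    return NULL;
}

/* Check one region against several suspect strings; 1 on any hit. */
static int region_looks_corrupted(const char *begin, const char *end,
                                  const char *const *pats, size_t npats)
{
    size_t i;

    for (i = 0; i < npats; i++) {
        if (find_pattern(begin, end, pats[i], strlen(pats[i])) != NULL)
            return 1;
    }
    return 0;
}
```

Called with the strings from the report ("ata", "ejbo") and the bounds of
a region that should never contain them, region_looks_corrupted()
returns 1 as soon as either string shows up.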
Later, we will have hardware-assisted watchpoint probes that hit when
a designated area of memory is read and/or written. That could narrow
the culprits down even further. This might look something like:
probe kernel.watch.from(0xdeadbeef).to(0xdeadf00d).string("ata")
{ log ("bug") }
Anyway, this all depends on being able to characterize the corruption
well enough that a routine could be written to safely check for it.
If you don't have even that much information, very drastic measures
may be necessary (such as running the kernel under a simulator or
debugger).
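To make "safely check" a little more concrete: if the damage shows up as
broken links, one possible check is to walk the list but refuse to
follow any pointer that falls outside a region known to hold valid
nodes, and to give up after a bounded number of steps. This is only a
user-space C sketch under those assumptions; struct node,
link_is_plausible and check_list are hypothetical names, not kernel or
systemtap interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type standing in for a kernel structure with a link. */
struct node {
    struct node *next;
    int payload;
};

/* Is p aligned, and does a whole node at p fit inside [lo, hi)? */
static int link_is_plausible(const struct node *p,
                             const void *lo, const void *hi)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t l = (uintptr_t)lo;
    uintptr_t h = (uintptr_t)hi;

    if (a < l || a >= h || h - a < sizeof(struct node))
        return 0;
    return (a % _Alignof(struct node)) == 0;
}

/*
 * Walk at most max_nodes links starting at head.
 * Returns 0 if the list ends cleanly with NULL,
 *         1 if a link points outside [lo, hi) or is misaligned,
 *         2 if more than max_nodes nodes are reachable (possible cycle).
 * Each pointer is validated before it is dereferenced.
 */
static int check_list(const struct node *head,
                      const void *lo, const void *hi, size_t max_nodes)
{
    const struct node *p = head;
    size_t n;

    for (n = 0; p != NULL; n++) {
        if (n >= max_nodes)
            return 2;
        if (!link_is_plausible(p, lo, hi))
            return 1;
        p = p->next;
    }
    return 0;
}
```

A link overwritten with a string such as "ata" would almost certainly
fail the range test, so check_list() would report 1 without ever
dereferencing the bogus pointer.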
- FChE