This is the mail archive of the
systemtap@sources.redhat.com
mailing list for the systemtap project.
RE: architecture paper draft
- From: "Chen, Brad" <brad dot chen at intel dot com>
- To: "Sridharan, K" <k dot sridharan at intel dot com>, "William Cohen" <wcohen at redhat dot com>, "Barton Miller" <bart at cs dot wisc dot edu>, "Michael Brim" <mjbrim at cs dot wisc dot edu>
- Cc: "Frank Ch. Eigler" <fche at redhat dot com>, "Stephen C. Tweedie" <sct at redhat dot com>, <systemtap at sources dot redhat dot com>
- Date: Fri, 11 Feb 2005 11:16:11 -0800
- Subject: RE: architecture paper draft
Maybe Bart and Michael can explain more about how they
collected these numbers. I don't believe there is a paper
to cite.
As for figuring out the length of x86 instructions, Sri
is right: we do it all the time around here! Also, I was
thinking we could maybe finesse this problem, assuming:
- For every probe point of interest we should be able to
safely decode the instructions, since the kernel and the
compiler are open source. Problems of this sort commonly
arise from jumps into the middle of instructions or from
self-modifying code. I don't think this will be common
for systemtap probe points.
- We can enumerate and test each systemtap probe point in
the kernel.
Overall the transparency of the compiler and the kernel
code ought to make this problem manageable.
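To illustrate why jumps into the middle of instructions cause trouble, here is a toy sketch in Python of a length decoder. It is an illustration only, not anything from kprobes or kerninst: it assumes 32-bit mode, covers just a handful of one-byte opcodes, and the names insn_length and boundaries are made up for this example. A real decoder also has to handle prefixes, ModRM/SIB bytes, displacements, and multi-byte opcodes.

```python
# Toy x86 (32-bit mode) length decoder covering only a few opcodes.
# Hypothetical helper names; not a complete or production decoder.

def insn_length(code: bytes, offset: int) -> int:
    """Return the length of the instruction starting at offset."""
    op = code[offset]
    if op in (0x90, 0xC3, 0xCC):      # nop, ret, int3
        return 1
    if 0x50 <= op <= 0x5F:            # push r32 / pop r32
        return 1
    if op == 0xEB:                    # jmp rel8
        return 2
    if op in (0xE8, 0xE9):            # call rel32, jmp rel32
        return 5
    if 0xB8 <= op <= 0xBF:            # mov r32, imm32
        return 5
    raise ValueError(f"opcode {op:#04x} not covered by this sketch")


def boundaries(code: bytes, start: int = 0) -> list:
    """Offsets of instruction starts when decoding linearly from start."""
    offsets = []
    pos = start
    while pos < len(code):
        offsets.append(pos)
        pos += insn_length(code, pos)
    return offsets


# mov eax, imm32 whose immediate bytes look like nop; nop; ret; int3,
# followed by a real ret.
code = bytes([0xB8, 0x90, 0x90, 0xC3, 0xCC, 0xC3])
print(boundaries(code, 0))  # decoding from the true entry point
print(boundaries(code, 1))  # decoding from inside the mov
```

Decoding from offset 0 sees two instructions, while decoding from offset 1 sees five different ones. If all entry points into a probe site are known, as they should be for compiler-generated kernel code, only the first decoding matters.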
Brad
-----Original Message-----
From: Sridharan, K
Sent: Friday, February 11, 2005 8:28 AM
To: William Cohen; Chen, Brad
Cc: Frank Ch. Eigler; Stephen C. Tweedie; systemtap@sources.redhat.com
Subject: RE: architecture paper draft
There are technologies available that figure out lengths of the X86
instructions very well and many of these are part of commercial products
available in the market. If that is the only stumbling block to avoid
excessive overhead, then we can see how to provide that technology.
Sri (K. Sridharan)
-----Original Message-----
From: systemtap-owner@sources.redhat.com
[mailto:systemtap-owner@sources.redhat.com] On Behalf Of William Cohen
Sent: Friday, February 11, 2005 8:24 AM
To: Chen, Brad
Cc: Frank Ch. Eigler; Stephen C. Tweedie; systemtap@sources.redhat.com
Subject: Re: architecture paper draft
Chen, Brad wrote:
> Frank Ch. Eigler wrote:
>
>>In addition, this method may require that the kprobes handler not be
>>started from an interrupt context wrapped around the "int 3" trap (x86).
>>Changing this might require extensive changes to kprobes, to perhaps
>>insert "simple" diversionary branches into the executable image instead
>>of traps. Intel folks prefer this sort of approach for performance
>>reasons, but we may have come across an even better reason for it.
>
>
> Thank you for noting my earlier question about interrupt overhead.
> I said I would do a little homework on interrupt overhead; here it is:
> Cycle delay by CPU     Branch   Trap
> 1.6 GHz Pentium 4         149   1408
> AMD Athlon 1800            38    361
> 1.6 GHz Pentium M          84    541
>
> These numbers are from the kerninst team at the University of
> Wisconsin, and I did not verify them myself. In general it looks like
> a trap is 7-10x more expensive than a branch. It appears to me that
> kprobes requires three traps, so that would make the overall impact
> 20-30x more expensive.
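The 7-10x and 20-30x estimates can be checked against the quoted cycle counts. A minimal sketch in Python, assuming three traps per probe as stated above; the dictionary and function names are made up for illustration:

```python
# Cycle counts quoted above: CPU -> (branch cycles, trap cycles).
cycles = {
    "1.6 GHz Pentium 4": (149, 1408),
    "AMD Athlon 1800": (38, 361),
    "1.6 GHz Pentium M": (84, 541),
}
TRAPS_PER_PROBE = 3  # assumption taken from the "three traps" estimate


def trap_to_branch_ratios(table, traps_per_probe):
    """Map CPU -> (one-trap ratio, ratio with traps_per_probe traps)."""
    return {
        cpu: (trap / branch, traps_per_probe * trap / branch)
        for cpu, (branch, trap) in table.items()
    }


ratios = trap_to_branch_ratios(cycles, TRAPS_PER_PROBE)
for cpu, (one, several) in ratios.items():
    print(f"{cpu:20s} {one:5.1f}x {several:5.1f}x")
```

With these inputs a single trap costs roughly 6.4x to 9.5x a branch, and three traps roughly 19x to 29x, which is in line with the rounded figures.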
Do you have a pointer to the paper containing this information?
149 cycles for branch overhead sounds rather high for a processor, even
if it has to flush pipelines. Does this include the code for saving and
restoring the registers? Why are 2-3 traps required? For kernel
instrumentation only one is required when a probe is executed.
It looks like the Pentium 4 does much worse at the traps than the other
processors. The Pentium 4 example uses the processor that has the
highest overhead. Here is the table redone, assuming 1% of clock cycles
are used by the overhead of the mechanism and one trap per probe:

                          branch        trap
                          samples/sec   samples/sec
pentium m (1.6ghz)        190e3         44e3
athlon 1800 (1.53ghz)     183e3         28e3
pentium 4 (1.6ghz)        107e3         11e3
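The arithmetic behind a table like this is a cycle budget divided by the per-probe cost. A minimal sketch in Python, assuming the clock rates listed in the table and the cycle counts quoted earlier in the thread; the function name and its parameters are hypothetical:

```python
def samples_per_second(clock_hz, cycles_per_hit, budget=0.01, hits_per_probe=1):
    """Probe firings per second that fit within `budget` of all cycles."""
    return budget * clock_hz / (cycles_per_hit * hits_per_probe)


# name: (clock in Hz, branch cycles, trap cycles)
cpus = {
    "pentium m (1.6ghz)": (1.6e9, 84, 541),
    "athlon 1800 (1.53ghz)": (1.53e9, 38, 361),
    "pentium 4 (1.6ghz)": (1.6e9, 149, 1408),
}

for name, (hz, branch, trap) in cpus.items():
    by_branch = samples_per_second(hz, branch)
    by_trap = samples_per_second(hz, trap)
    print(f"{name:24s} {by_branch:10.0f} {by_trap:10.0f}")
```

For the Pentium 4 this gives about 107e3 and 11e3, as in the table. The other rows need not match, depending on which clock rate and cycle counts are paired. Setting hits_per_probe to 2 or 3 models a mechanism that takes several traps per probe.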
>
> For example, assume a 1.6GHz Pentium 4:
>   Branch overhead: 149 cycles
>   Overhead for one trap: about 1400 cycles
>   Kprobes requires 2-3 traps
>   1% overhead => 16M cycles
>   trap-based instrumentation: 5000 probes per second
>   branch-based instrumentation: 94000 probes per second
>
> For many tools, most time will be spent in analysis code and this
> issue is irrelevant. However, if you happen to be a performance
> guy, and you're trying to do something even moderately aggressive
> in terms of higher frequency or very low overhead, this might start
> to matter. If this also helps to simplify some of the interrupt
> management issues, that's great.
>
> I note in passing that the SPARC implementation of DTrace is
> reported to use branches, and their x86 implementation uses
> traps.
Figuring out the length of an x86 instruction is a non-trivial task.
Using the int3 on x86 avoids that pain.
-Will
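The difference between the two patching strategies comes down to patch size: int3 is a single byte (0xCC), so it always fits inside the first instruction at the probe point, while a jmp rel32 is five bytes and may run over several instructions whose boundaries must be known. A rough sketch in Python, using a hypothetical probe site and a made-up helper name:

```python
INT3_LEN = 1       # 0xCC
JMP_REL32_LEN = 5  # 0xE9 followed by a 32-bit displacement


def displaced(insn_lengths, patch_len):
    """Return (instruction count, total bytes) that must be relocated so
    that a patch of patch_len bytes ends on an instruction boundary."""
    total = 0
    for count, length in enumerate(insn_lengths, start=1):
        total += length
        if total >= patch_len:
            return count, total
    raise ValueError("not enough instruction bytes for the patch")


# Hypothetical probe site: push ebp (1 byte), mov ebp, esp (2 bytes),
# sub esp, imm8 (3 bytes).
site = [1, 2, 3]
print(displaced(site, INT3_LEN))       # only the first instruction is touched
print(displaced(site, JMP_REL32_LEN))  # three instructions, six bytes
```

The instruction lengths are given as input here. Obtaining them for arbitrary code is exactly the decoding problem under discussion.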