This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]
Other format:	[Raw text]

Re: Unified tracing buffer

From: Linus Torvalds <torvalds at linux-foundation dot org>
To: Mathieu Desnoyers <compudj at krystal dot dyndns dot org>
Cc: Roland Dreier <rdreier at cisco dot com>, Masami Hiramatsu <mhiramat at redhat dot com>, Martin Bligh <mbligh at google dot com>, Linux Kernel Mailing List <linux-kernel at vger dot kernel dot org>, Thomas Gleixner <tglx at linutronix dot de>, Steven Rostedt <rostedt at goodmis dot org>, darren at dvhart dot com, "Frank Ch. Eigler" <fche at redhat dot com>, systemtap-ml <systemtap at sources dot redhat dot com>
Date: Mon, 22 Sep 2008 21:05:20 -0700 (PDT)
Subject: Re: Unified tracing buffer
References: <33307c790809191433w246c0283l55a57c196664ce77@mail.gmail.com> <48D7F5E8.3000705@redhat.com> <33307c790809221313s3532d851g7239c212bc72fe71@mail.gmail.com> <48D81B5F.2030702@redhat.com> <33307c790809221616h5e7410f5gc37c262d83722111@mail.gmail.com> <48D832B6.3010409@redhat.com> <alpine.LFD.1.10.0809221718100.3265@nehalem.linux-foundation.org> <adaod2f649o.fsf@cisco.com> <alpine.LFD.1.10.0809222021230.3265@nehalem.linux-foundation.org> <20080923033635.GK24937@Krystal>

On Mon, 22 Sep 2008, Mathieu Desnoyers wrote:
> 
> Unless I am missing something, in the case we use an atomic operation
> which implies memory barriers (cmpxchg and atomic_add_return does), one
> can be sure that all memory operations done before the barrier are
> completed at the barrier and that all memory ops following the barrier
> will happen after.

Sure (if you have a barrier - not all architectures will imply that for an 
incrment).

But that still doesn't mean a thing.

You have two events (a) and (b), and you put trace-points on each. In your 
trace, you see (a) before (b) by comparing the numbers. But what does that 
mean? 

The actual event that you traced is not the trace-point - the trace-point 
is more like a fancy "printk". And the fact that one showed up before 
another in the trace buffer, doesn't mean that the events _around_ the 
trace happened in the same order.

You can use the barriers to make a partial ordering, and if you have a 
separate tracepoint for entry into a region and exit, you can perhaps show 
that they were totally disjoint. Or maybe they were partially overlapping, 
and you'll never know exactly how they overlapped.

Example:

	trace(..);
	do_X();

being executed on two different CPU's. In the trace, CPU#1 was before 
CPU#2. Does that mean that "do_X()" happened first on CPU#1?

No.

The only way to show that would be to put a lock around the whole trace 
_and_ operation X, ie

	spin_lock(..);
	trace(..);
	do_X();
	spin_unlock(..);

and now, if CPU#1 shows up in the trace first, then you know that do_X() 
really did happen first on CPU#1. Otherwise you basically know *nothing*, 
and the ordering of the trace events was totally and utterly meaningless.

See? Trace events themselves may be ordered, but the point of the trace 
event is never to know the ordering of the trace itself - it's to know the 
ordering of the code we're interested in tracing. The ordering of the 
trace events themselves is irrelevant and not useful.

And I'd rather see people _understand_ that, than if they think the 
ordering is somehow something they can trust.

Btw, if you _do_ have locking, then you can also know that the "do_X()" 
operations will be essentially as far apart in some theoretical notion of 
"time" (let's imagine that we do have global time, even if we don't) as 
the cost of the trace operation and do_X() itself.

So if we _do_ have locking (and thus a valid ordering that actually can 
matter), then the TSC doesn't even have to be synchronized on a cycle 
basis across CPU's - it just needs to be close enough that you can tell 
which one happened first (and with ordering, that's a valid thing to do). 

So you don't even need "perfect" synchronization, you just need something 
reasonably close, and you'll be able to see ordering from TSC counts 
without having that horrible bouncing cross-CPU thing that will impact 
performance a lot.

Quite frankly, I suspect that anybody who wants to have a global counter 
might as well almost just have a global ring-buffer. The trace events 
aren't going to be CPU-local anyway if you need to always update a shared 
cacheline - and you might as well make the shared cacheline be the ring 
buffer head with a spinlock in it.

That may not be _quite_ true, but it's probably close enough.

		Linus

References:
- Re: Unified tracing buffer
  - From: Masami Hiramatsu
- Re: Unified tracing buffer
  - From: Martin Bligh
- Re: Unified tracing buffer
  - From: Masami Hiramatsu
- Re: Unified tracing buffer
  - From: Martin Bligh
- Re: Unified tracing buffer
  - From: Masami Hiramatsu
- Re: Unified tracing buffer
  - From: Linus Torvalds
- Re: Unified tracing buffer
  - From: Roland Dreier
- Re: Unified tracing buffer
  - From: Linus Torvalds
- Re: Unified tracing buffer
  - From: Mathieu Desnoyers

Index Nav:	[Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav:	[Date Prev] [Date Next]	[Thread Prev] [Thread Next]