This is the mail archive of the
systemtap@sourceware.org
mailing list for the systemtap project.
Re: [PATCH v2] Scheduler Tapset based on kernel tracepoints
- From: Josh Stone <jistone at redhat dot com>
- To: Kiran <kiran at linux dot vnet dot ibm dot com>
- Cc: systemtap at sources dot redhat dot com
- Date: Fri, 25 Sep 2009 15:52:39 -0700
- Subject: Re: [PATCH v2] Scheduler Tapset based on kernel tracepoints
- References: <1253715217.5193.115.camel@kiran-laptop>
Hi,
On 09/23/2009 07:13 AM, Kiran wrote:
> +probe scheduler.ctxswitch.tp = kernel.trace("sched_switch")
These .tp and .kp variants are an internal detail -- please put them
under __scheduler so they are in a more private namespace from what's
exposed to the user.
> +/**
> + * probe scheduler.migrate_task - Traces the migration of the tasks across cpus by the scheduler.
> + * @pid: pid of the task being migrated.
> + * @priority: priority of the task being migrated.
> + * @original_cpu: the original cpu
> + * @destination_cpu: the destination cpu
> + */
There is already a "scheduler.migrate" defined, so please merge with
that instead of creating a new one.
> +/**
> + * probe scheduler.process_exit - Fires when a process exits
> + * @pid: pid of the process exiting
> + * @priority: priority of the process exiting
> + */
This overlaps with the less well-known "kprocess.exit". I think in this
case we can make the kprocess one point to yours.
> +/**
> + * probe scheduler.process_fork - Probes the tracepoint for forking a process
> + * @parent_pid: PID of the parent process
> + * @child_pid: PID of the child process
> + */
The "kprocess.create" is also similar to this.
> +/**
> + * probe scheduler.signal_send - Probes the tracepoint for sending a signal
> + * @pid: pid of the process sending signal
> + * @signal_number: signal number
> + */
Here we have the whole signal.stp tapset, which includes a "signal.send"
probe point.
It's not easy to decide how to resolve the overlaps in different
tapsets, but please consider it. At a minimum, they should leverage
each other so they're all using the best available probe points.
> diff -Naur systemtap-orig/testsuite/systemtap.examples/profiling/sched_switch.stp systemtap/testsuite/systemtap.examples/profiling/sched_switch.stp
> --- systemtap-orig/testsuite/systemtap.examples/profiling/sched_switch.stp 1969-12-31 19:00:00.000000000 -0500
> +++ systemtap/testsuite/systemtap.examples/profiling/sched_switch.stp 2009-09-22 02:29:16.000000000 -0400
> [...]
> +probe scheduler.wakeup
> +{
> + pids[task_pid]++
> + processes[task_pid] = $p;
> + prev[task_pid] = task_current()
> +
> +}
If this is just so you can print the "+" line of who wokeup whom, why
not print that right away? Then you don't need those arrays at all.
Saving task_struct pointers is messy business.
> +probe scheduler.ctxswitch
> +{
> + tid = next_tid
> + tid1 = prev_tid
> + state = prev_state
> + state1 = next_state
Why copy these values? The prev/next variable names already reflect
more meaning.
> +
> + %( $# == 2 %?
> +
> + if(@1 == "pid")
> + if (tid != $2 && tid1 != $2)
> + next
> + if(@1 == "name")
> + if (task_execname(task_current()) != @2 && task_execname($next) != @2)
> + next
Note that task_execname(task_current()) is the same as simply
execname(). But anyway, your tapset provides prev_task_name and
next_task_name, so just use those.
> +
> + foreach (name in pids-) {
> + if ((@1 == "pid" && (name == $2 || task_pid(prev[name]) == $2)) ||
> + (@1 == "name" && (task_execname(prev[name]) == @2 || task_execname(processes[name]) == @2)))
> + printf("%s\t\t%d\t%d\t%d\t%d:%d:%s + %d:%d:%s %s\n",
> + task_execname(prev[name]), task_pid(prev[name]), task_cpu(processes[name]), gettimeofday_ns(),
> + task_pid(prev[name]), task_prio(prev[name]), state_calc(task_state(prev[name])),
> + task_pid(processes[name]), task_prio(processes[name]), state_calc(task_state(processes[name])),
> + task_execname(processes[name]))
> + } %)
Hmm, "name" is not a name, which is confusing. But if you're looking
for the task which woke this one up, wouldn't that be prev[next_pid]?
(That should really be indexed by the tid, actually.) Or, as I said
above, it may be easier to print this right in the wakeup probe.
> +
> + old_cpu = task_cpu_old[tid]
> + printf("%s\t\t%d\t%d\t%d\t%d:%d:%s ==> %d:%d:%s %s\n",task_execname(task_current()),tid1,
> + old_cpu,gettimeofday_ns(),tid1,task_prio(task_current()),state_calc(state),next_pid,
> + next_prio,state_calc(next_state),next_task_name )
> + task_cpu_old[next_tid] = cpu()
> +}
I think task_current() is probably always the same as prev_task, but
it's probably best to use the latter. It would also be helpful if you
spaced this out so we can better see what's being printed. I can see
two "tid1"s in there, which is a bit weird.
Thanks,
Josh