This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



Re: what does 'probe process(PID_OR_NAME).clone' mean?


> > Another note on report_clone: this callback is not the place to do much
> > with the child except to attach it. [...]
> Is this also true for any other events?  Currently we're using
> UTRACE_EVENT({CLONE, DEATH, EXEC, SYSCALL_ENTRY, SYSCALL_EXIT}), but
> this list could expand in the future.

There aren't hard and fast rules.  There are particular caveats in each
context.  CLONE is of course unique in having the state of the
not-quite-started child to consider.  The main part of that caveat has to
do with when you want to do things to the child, not in whether you're
holding up the parent for its own sake.  For all other events, there is
only the thread itself experiencing the event to consider.

EXEC is at a place where it's keeping some kernel resources live (binprm
and binfmt), is keeping the exec'd file from being written, and a few other
things like that.  Once you return from the callback, those things are
released.  Since you're not ever supposed to block for long in a callback,
all that should mean is to make sure you copy out anything you wanted from
those data structures.
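
The copy-out pattern can be sketched as a toy Python model; this is not
kernel code.  `Binprm`, `on_exec`, and `run_exec` are hypothetical names
made up for illustration and are not part of the utrace interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Binprm:
    """Toy stand-in for exec-time kernel state (hypothetical, not the real struct)."""
    filename: Optional[str]

    def release(self) -> None:
        # Models the resources being released once the callback has returned.
        self.filename = None


saved = []  # values the callback copied out while they were still live


def on_exec(bprm: Binprm) -> None:
    # Copy the value itself; a kept reference to bprm would go stale.
    saved.append(str(bprm.filename))


def run_exec(filename: str) -> Binprm:
    bprm = Binprm(filename)
    on_exec(bprm)    # the callback runs while the resources are live
    bprm.release()   # afterwards they are gone
    return bprm


stale = run_exec("/bin/ls")
```

After the call, `saved` still holds the copied name while `stale.filename`
has been cleared, which is the reason to copy rather than keep a pointer.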

DEATH is in a strange scheduling context (exit_state already set) where
you're not expected to do much, and the last bit of thread state cleanup
has not been done yet.

REAP is either in the DEATH context or is called from a wholly other
process (the parent/ptracer).

EXIT is special in that staying quiescent/stopped here is not preventing
going back to user mode, it's preventing teardown work and getting to DEATH.

The other events (SYSCALL_*, SIGNAL_*, JCTL) all by definition take place
at the safe boundary places between kernel and user.  These have the same
conditions as a thread that is fully stopped/quiescent.

I expect any future additions all to be in this last category.
It is the norm, being the most thoroughly unrestricted case.

> At this point, I'm liking your "grouping" model (although I have a few
> quibbles later on).  Note that currently the grouping model doesn't
> really exist - each probe has its own utrace engine, even probes on the
> same pid/exe.

The discussion so far is only on the abstraction to describe what we're
doing.  I'm leaving related implementation details for later.  The focus
now is on getting the "middle ground" abstraction to fully make sense and
fit as the basis from which to describe what we mean in concrete low-level
terms by the user features.

> I'd think we'd want to hide the low-level stuff from users and not
> expose them at the script level, but I could be talked out of it.

I'm also not really talking about systemtap feature details here.  
The low-level script constructs are just a strawman for discussing the
mapping to the middle layer.  It's easier to refer to approximate
systemtap syntax and handwaving than just contextless pseudocode and
handwaving.  (If I were talking about systemtap features, I would
indeed talk you out of it.  But that is not the discussion I'm trying
to have right now.)

> Here's my quibble.
> 
> I like the process(PID) behavior you outline above, but I'm not sure I
> like the difference in behavior between it and the process.execname
> behavior.
> 
> Here's a concrete example to see if I'm reading you correctly.  Assume
> pid 123 points to /bin/bash and I'm doing syscall tracing.  If I'm
> tracing by pid, I'm not going to get syscall events between the fork and
> the exec for the child.  If I'm tracing by exename, I am going to get
> the syscall events between the fork and exec.
> 
> But, I certainly like the idea of tracing by 'pid' - and by 'pid' we
> meant a tgid, not a tid.  So, a multi-threaded 'pid' tracing would work
> as a user *meant*, but not exactly as he *said*.

Whether it's a difference really depends on what you meant by sameness,
or put another way, what you meant to say the user said or meant depends
on what you meant the given saying to mean or thought the meaning said.
(Yes, I'm throwing in a bonus Tylenol with that one.)

As per the caveat I mentioned, by PID we really meant "process".  That
is, the PID we said at the time of the start of the session meant the
particular process identity then identified by that PID.  

So, the question is what were we saying by a process(PID).foo probe?

I presumed in my description the intended semantics was in the style of
a predicate.  That is, "When a foo happens, was it in process(PID)?"
This is the same behavior as process.execname, which asks, "When a foo
happens, was it in a process with execname 'bar'?"

When a process exec's "baz", it no longer meets the "execname is 'bar'"
predicate, so those probes don't apply.  When a process forks a child,
the child does not meet the "process identity is PID" predicate, so
those probes don't apply.  It's the same behavior.
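
The predicate reading can be made concrete with a small Python model.
This is only a sketch: `Task`, `in_process`, `has_execname`, `fork`, and
`exec_` are hypothetical names, and `ident` stands for the process
identity that the PID referred to when the session started.

```python
from dataclasses import dataclass


@dataclass
class Task:
    ident: int      # process identity, fixed for the life of the process
    execname: str   # changes when the process execs


def in_process(ident):
    """Predicate for a process(PID).foo probe."""
    return lambda task: task.ident == ident


def has_execname(name):
    """Predicate for a process.execname probe."""
    return lambda task: task.execname == name


def fork(parent, child_ident):
    # The child is a new process identity but inherits the execname.
    return Task(child_ident, parent.execname)


def exec_(task, name):
    task.execname = name


bash = Task(123, "bash")
by_pid = in_process(123)
by_name = has_execname("bash")

child = fork(bash, 124)
between_fork_and_exec = (by_pid(child), by_name(child))   # (False, True)
exec_(child, "ls")
after_exec = (by_pid(child), by_name(child))               # (False, False)
```

Both probes apply the same kind of test at event time; the results differ
for the child only because the two predicates look at different
properties.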

The other thing you're describing is a "this process and its children"
predicate, which is not what I ever thought you meant with process(PID).
Still, I'm not quibbling about what the systemtap syntax should mean.  I
go into this detail just so we'll clarify exactly what we're describing
with our casual references.  In fact, I guess what you really think
makes sense is a "this process and its children until they exec"
predicate, which to me is a fairly nonobvious definition of process(PID)
to have been implying without describing it.

All three of those things are fairly easy to represent in the tracing
group model, which was the true point of the example.  The process and
its children style is obviously just a child-joins-group rule in the
clone/fork event.  

For your fancy idea where a process doesn't mean a process or its
children but means those which haven't exec'd, the representation
depends on which of two ways you mean to construe that.  If you mean
that process(PID) ceases to match the original PID process's own events
once it execs, then it's a simple matter of an exec rule for the group.
If you instead mean a more complex behavior where process(PID) always
means the original PID but means its children only until they exec, then
it involves a second group for the children, since their rule for exec
is different from the parent's.
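
The two readings can be compared in a toy Python model of tracing
groups.  Everything here is an assumption made for illustration: the
`Group` class, the `join` and `leave` rules, and the use of plain
integers as tasks are not taken from any real implementation.

```python
class Group:
    """Toy tracing group: a member set plus optional clone/exec rules."""

    def __init__(self):
        self.members = set()
        self.on_clone = None  # called as on_clone(child) when a member clones
        self.on_exec = None   # called as on_exec(task) when a member execs


def join(group):
    return lambda task: group.members.add(task)


def leave(group):
    return lambda task: group.members.discard(task)


def clone(groups, parent, child):
    for g in groups:
        if parent in g.members and g.on_clone:
            g.on_clone(child)


def exec_(groups, task):
    for g in groups:
        if task in g.members and g.on_exec:
            g.on_exec(task)


def traced(groups):
    return set().union(*(g.members for g in groups))


def scenario(groups):
    """123 forks 124, then 124 execs, then 123 execs."""
    steps = []
    clone(groups, 123, 124)
    steps.append(traced(groups))
    exec_(groups, 124)
    steps.append(traced(groups))
    exec_(groups, 123)
    steps.append(traced(groups))
    return steps


# Reading 1: one group; children join, and anything that execs leaves.
one = Group()
one.members.add(123)
one.on_clone = join(one)
one.on_exec = leave(one)
reading1 = scenario([one])

# Reading 2: the original process has no exec rule; children go into a
# second group whose exec rule drops them.
parent, kids = Group(), Group()
parent.members.add(123)
parent.on_clone = join(kids)
kids.on_clone = join(kids)
kids.on_exec = leave(kids)
reading2 = scenario([parent, kids])
```

In this model the two readings differ only at the last step: under the
first, 123 drops out when it execs; under the second, it stays traced.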

The former of those two seems most sensible to me, since it corresponds
to having the same or copied user memory/state so that user-level
program state (variables, pointers) is of a piece for everything in that
tracing group.  Anyway, I'd still prefer to separate the choices for
systemtap features from the fundamental discussion of this new middle
layer idea.  My main focus at the moment is to iron out the tracing
group model so we're confident it is a sound basis to facilitate
implementing whatever choices of user-visible features we end up with.
I'm less concerned with exactly what you mean the systemtap language to
mean and more with understanding all the options for what it means
precisely enough to ensure my model is a good platform to express them.

> I'm lost on the difference between 'process.fork.any' and
> 'process.fork.fork'.  Does 'process.fork.any' include 'process.vfork'?

Yes.  The difference is just exactly what it says there: the tests on
clone_flags.  It's not a suggestion for systemtap features, it's an
illustration of the distinctions that might make sense from the
application programmer's perspective, and how those would map to
filters on the underlying event.
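
The exact tests were given in an earlier message that is not quoted
here, so the filters below are only one plausible reading, written as a
Python sketch.  The flag values are the Linux ones from <linux/sched.h>,
and SIGCHLD is 17 on x86; the function names `fork_any`, `fork_fork`,
and `fork_vfork` are hypothetical.

```python
CLONE_VM = 0x00000100      # parent and child share the address space
CLONE_VFORK = 0x00004000   # parent sleeps until the child execs or exits
CLONE_THREAD = 0x00010000  # child joins the parent's thread group
SIGCHLD = 17               # exit signal in the low byte (x86 value)


def fork_any(clone_flags):
    """Assumed meaning: any clone that creates a new process, not a thread."""
    return not (clone_flags & CLONE_THREAD)


def fork_vfork(clone_flags):
    """Assumed meaning: a new process created with vfork semantics."""
    return fork_any(clone_flags) and bool(clone_flags & CLONE_VFORK)


def fork_fork(clone_flags):
    """Assumed meaning: a plain fork with a copied address space."""
    return fork_any(clone_flags) and not (clone_flags & (CLONE_VM | CLONE_VFORK))


plain_fork = SIGCHLD
vfork = CLONE_VFORK | CLONE_VM | SIGCHLD
new_thread = CLONE_VM | CLONE_THREAD
```

Under these assumptions `fork_any` accepts both `plain_fork` and
`vfork`, while `fork_fork` accepts only `plain_fork` and none of the
three accepts `new_thread`.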

> On more question.  Frank and I bounced a few ideas on irc the other day,
> and we wondered if there was a good way on UTRACE_EVENT(DEATH) to tell
> the difference between a "thread" death and "process" death?

At the moment, there are two easy methods that are racy in different
ways.  But, the kernel has the answer on hand right there and it would
be trivial to pass it down.  In the coming version of the interface,
I'll give the report_death callback an argument to tell you.  For the
moment, I'd say test atomic_read(&task->signal->live) == 0 and don't
worry about the race.
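
The idea behind that test can be sketched in Python as a toy model.  It
assumes that signal->live counts the threads of the process that have
not yet begun exiting and that the count is decremented before the
death report is made.  `Signal`, `Thread`, `exit_thread`, and
`report_death` are hypothetical stand-ins, not kernel interfaces, and
the model is single-threaded, so it does not show the race.

```python
class Signal:
    """Toy stand-in for the shared per-process signal state."""

    def __init__(self):
        self.live = 0


class Thread:
    def __init__(self, signal):
        self.signal = signal
        signal.live += 1  # one more live thread in this process


def report_death(thread):
    # Mirrors the suggested test: live == 0 means the whole process died.
    return "process" if thread.signal.live == 0 else "thread"


def exit_thread(thread):
    thread.signal.live -= 1  # assumed to happen before the death report
    return report_death(thread)


sig = Signal()
t1, t2, t3 = Thread(sig), Thread(sig), Thread(sig)
deaths = [exit_thread(t) for t in (t1, t2, t3)]
```

Only the last thread to exit sees the counter at zero, so only its death
is reported as the death of the process.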


Thanks,
Roland

