This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



RE: Controlling probe overhead


David Smith wrote:
> I took your patch and updated it a small bit.  Sorry for not
> responding sooner.  See comments at the end.

Thanks -- I'm glad this idea didn't get lost.

[...]
> Making exceptions for begin/end probes is going to be a bit difficult,
> since the common_probe_entryfn_prologue() and
> common_probe_entryfn_epilogue() functions don't really have any
> context of where they are called from.

It's simple to add that context -- we can add a parameter to those
functions so that the default does output the throttling checks, and
then modify the call site in be_derived_probe_group::emit_module_decls()
to turn them off.

This could be applied in the future to any probe handlers for which we
allow long runtimes.  For instance, if we have probes that are allowed
to sleep while accessing user data, then they probably don't need these
checks.

> Besides sharing code with STP_TIMING, I also added a command-line
> switch to turn this new functionality off (but your idea of tunable
> thresholds is probably better).

An option to disable it is a good idea.  As for tuning the threshold, we
could make a new -D option like MAXOVERHEAD.  If this is a percentage,
then internally we can just define this:

    #define STP_ACCOUNTING_THRESHOLD (MAXOVERHEAD*STP_ACCOUNTING_INTERVAL/100)

This way we expose some control without exposing the implementation of
the threshold and interval.
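
To make the arithmetic concrete, here is a minimal sketch of how a
percentage-style MAXOVERHEAD could be turned into the internal threshold.
The default values and the helper overhead_threshold() are assumptions
for illustration only; the real defaults and time units would come from
the runtime.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed defaults for illustration; a real build would get these
 * from -D options or from the runtime headers. */
#ifndef STP_ACCOUNTING_INTERVAL
#define STP_ACCOUNTING_INTERVAL 1000000000LL  /* hypothetical: clock units per interval */
#endif
#ifndef MAXOVERHEAD
#define MAXOVERHEAD 50                        /* hypothetical: percent of the interval */
#endif

/* Multiply before dividing so integer truncation cannot zero the result. */
#define STP_ACCOUNTING_THRESHOLD \
    ((int64_t)(MAXOVERHEAD) * (STP_ACCOUNTING_INTERVAL) / 100)

/* The same computation as a function (hypothetical helper), handy for
 * checking the arithmetic with other values. */
static int64_t overhead_threshold(int64_t max_overhead_pct, int64_t interval)
{
    assert(max_overhead_pct >= 0 && max_overhead_pct <= 100);
    return max_overhead_pct * interval / 100;
}
```

With these assumed defaults, 50% of a 1000000000-unit interval gives a
threshold of 500000000 units.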

> I do have a question.  I'm afraid I don't quite understand the logic
> behind (and the differences between) STP_ACCOUNTING_THRESHOLD and
> STP_ACCOUNTING_INTERVAL.  Why isn't STP_ACCOUNTING_THRESHOLD enough?
> Could you explain that a bit?

If you defined the threshold as a relative value, like a percentage,
then you could get by with that alone.  Absolute numbers don't work,
because the cumulative probe time for a script that runs at 5% overhead
for 100 seconds is the same as for a script that runs at 100% overhead
for 5 seconds (5 seconds in both cases) -- it's only the latter that we
want to prevent.  You need to include the total time, so your check
would look something like this:

    if (100*cum_time/(now - start_time) > THRESHOLD_PCT)
        handle_overhead_problem();

... or eliminate the division:

    if (100*cum_time > THRESHOLD_PCT*(now - start_time))
        handle_overhead_problem();
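
As a rough sketch, the division-free check can be wrapped in a small
predicate.  The function name overhead_exceeded() and the use of 64-bit
unsigned counters are my assumptions here, not anything in the patch.
Note that with integer arithmetic the two forms are not quite identical:
the division truncates, so the multiplied form trips slightly earlier,
and it also avoids dividing by zero when now == start_time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Returns true when the average overhead since start_time is above
 * threshold_pct percent.  All times are in the same (unspecified)
 * clock units.  Hypothetical helper for illustration. */
static bool overhead_exceeded(uint64_t cum_time, uint64_t start_time,
                              uint64_t now, unsigned threshold_pct)
{
    assert(now >= start_time);
    /* Cross-multiplied form of 100*cum_time/(now - start_time) > pct. */
    return 100 * cum_time > (uint64_t)threshold_pct * (now - start_time);
}
```

For example, 5 units of probe time over 100 elapsed units stays under a
10% limit, while the same 5 units over 5 elapsed units does not.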

One problem with this is that it only measures the *average* overhead.
If a script has been running a long time with low overhead, and suddenly
the conditions change such that the overhead spikes, it may take a long
time for the average to be brought up enough to abort.  One solution is
to periodically reset the cum_time and start_time, and if you take this
a little further and only make the check right before doing the reset,
then you pretty much have my concept of an interval.

Thus, in my implementation, the interval defines how often we should
check our overhead, and the threshold defines how much overhead is
allowed within that interval.  After making the check, I reset the
numbers for the next interval.
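
The interval scheme described above can be sketched roughly as follows.
This is not the code from the patch: the struct overhead_acct, its
field names, and the functions acct_init() and acct_probe_done() are
hypothetical, and the time units are left unspecified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-context accounting state. */
struct overhead_acct {
    uint64_t interval;        /* how often the overhead is checked */
    uint64_t threshold;       /* probe time allowed within one interval */
    uint64_t interval_start;  /* when the current interval began */
    uint64_t cum_time;        /* probe time accumulated this interval */
};

static void acct_init(struct overhead_acct *a, uint64_t interval,
                      uint64_t threshold, uint64_t now)
{
    a->interval = interval;
    a->threshold = threshold;
    a->interval_start = now;
    a->cum_time = 0;
}

/* Record one probe handler run.  Returns true only when an interval
 * has just ended and the time spent in probes exceeded the threshold. */
static bool acct_probe_done(struct overhead_acct *a,
                            uint64_t probe_start, uint64_t probe_end)
{
    bool exceeded;

    assert(probe_end >= probe_start);
    a->cum_time += probe_end - probe_start;

    if (probe_end - a->interval_start < a->interval)
        return false;                 /* interval not over yet: no check */

    exceeded = a->cum_time > a->threshold;

    /* Reset the numbers for the next interval. */
    a->cum_time = 0;
    a->interval_start = probe_end;
    return exceeded;
}
```

Because the check only happens at the end of an interval, a long history
of low overhead cannot mask a sudden burst: each interval is judged on
its own numbers.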

As a side note, we may need a better name to describe this --
"accounting" is pretty vague, and "throttling" implies that we're
holding back but not quitting (which would be cool, but much harder).
Perhaps "cutoff" better describes what we're about here.

Right now I'm treating too much overhead as a hard error, but we could
make it more like a forced exit instead.  This way the end probes would
still run, so if you have a long-running script that aborts suddenly due
to overhead, you can still get some sort of report out of it.  Any
preference?


Josh

