This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



RE: Making the transport layer more robust


Hi Turgis,

On Mon, 2011-09-05 at 16:04 +0200, Turgis, Frederic wrote:
> >The kernel side "polling" is not just for exit, it is for any
> >cmd message that is generated from a possible "unsafe"
> >context
> 
> I have then probably not understood the code correctly (the code was
> before the latest changes; now this polling is mandatory, as I
> mentioned later in the mail). An "unsafe" context is associated with
> unannounced messages, no?

Yes, it depends on the probe point the message is generated from. Since
in general we cannot know whether a probe is "safe" or not, we always
just queue the message and continue. The kernel timer then periodically
checks the queue (and the exit flag) and announces it. I added a
particularly nasty example of a probe point where immediately announcing
the message to user space would be really bad:
testsuite/systemtap.base/wakeup.exp

>  Well, even for announced messages, I have the impression that reading
> messages relies only on user-side polling, because the user side is
> not waiting for a wake-up of _stp_ctl_ready_q. Here is my
> understanding, but I didn't take the time to perform traces:
> 
> - for announced messages or on kernel polling (_stp_ctl_work_callback),
> I understand that we trigger a read through
> wake_up_interruptible(&_stp_ctl_wq);

Correct.

> - on user side, main reading loop does:
> flags |= O_NONBLOCK;
> fcntl(control_channel, F_SETFL, flags);
> nb = read(control_channel, &recvbuf, sizeof(recvbuf));
> So I expect a non-blocking read (however, there may be another place where we read cmd messages)

Indeed. That is the first pass: if there is anything to read, we read
it, but then (if select is supported) we call select() on the
control_channel. This means we will wait there in user space until the
kernel explicitly wakes us up to tell us there are messages waiting, so
we don't need to poll ourselves in user space. Only when select()
returns will we try to read a message from the kernel again.

> This ends in "_stp_ctl_read_cmd" in the kernel doing:
> while (list_empty(&_stp_ctl_ready_q)) {
>                 spin_unlock_irqrestore(&_stp_ctl_ready_lock, flags);
>                 if (file->f_flags & O_NONBLOCK) -> non-blocking read, we rely on polling to recheck _stp_ctl_ready_q
>                         return -EAGAIN;
>                 if (wait_event_interruptible(_stp_ctl_wq, !list_empty(&_stp_ctl_ready_q))) -> code not reached, so kernel polling (or even message announcement) useless?
>                         return -ERESTARTSYS;

Yeah, you are right, that second bit is a little strange: we kind of
depend on user space always calling us with O_NONBLOCK. But if it
doesn't, then at least we do something somewhat sensible.

> >I am very interested in any results you get from the new code.
> 
> Never tested bulk mode. We quite like filling up the buffer and doing
> a big dump, but doing very small regular dumps could also work.

I think you should also try bulk mode (stap -b) if possible. It won't
give you immediate results, but because of the extra buffering it
should definitely incur less overhead.

> Our modifications are just ugly hacks to understand the internals.
> Some of them make sense, but for other parts we probably have
> different requirements between a server and an embedded platform.
> The capability to tune a timer would be OK (or maybe bulk mode would
> be good).
> 
> Here are the v1.3 experiments we performed a few months ago (the last
> few months have been too busy with customers to share them before
> :-( ). The goal was to stop the "stapio:5192" process and the
> "systemtap/1:5189" workqueue from waking up too often, here every
> 200ms and 7.8125ms (HZ=128) respectively. The attached picture shows
> what each core runs over time (blue = core 0, red = core 1; the lower
> a process, the more it runs, except swapper = idle tasks, always at
> the bottom).

Looks like the attachment somehow didn't reach the mailing list. Odd.

> Control channel userspace polling (well, I consider "control" everything that is not trace/data output from the script)
> -      usleep (250*1000); /* sleep 250ms between polls */
> +      usleep (2000*1000); /* sleep 2s between polls */

So with the current git trunk you get pselect support, so you don't
need to tweak this anymore.

> Control channel kernel polling (you might find it a bit extreme ;-) )
> -       if (likely(_stp_ctl_attached))
> -               queue_delayed_work(_stp_wq, &_stp_work, STP_WORK_TIMER);
> +       //if (likely(_stp_ctl_attached))
> +       //      queue_delayed_work(_stp_wq, &_stp_work, STP_WORK_TIMER);

Yeah, that is somewhat extreme. If possible, try just
-DSTP_WORK_TIMER=... to find a good balance. I am interested in what
value works well for you.

> Data channel userspace timeout of select()
> -       struct timespec tim = {.tv_sec=0, .tv_nsec=200000000}, *timeout = &tim;
> +       struct timespec tim = {.tv_sec=5, .tv_nsec=0}, *timeout = &tim;

With bulk mode this shouldn't be necessary. But it might be nice to
make this tunable with some -D option too.

> Data channel kernel polling
> -#define STP_RELAY_TIMER_INTERVAL               ((HZ + 99) / 100)
> +#define STP_RELAY_TIMER_INTERVAL       HZ              /* ((HZ + 99) / 100) */  -> wake up every second

Hopefully you can use -DSTP_RELAY_TIMER_INTERVAL on the command line
for this too. I am really hoping we can make it so that you don't have
to hack the sources anymore just to experiment with the timers/poll
intervals.

Thanks for your experiments and report.

Cheers,

Mark

