This is the mail archive of the systemtap@sources.redhat.com mailing list for the systemtap project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]
Other format: [Raw text]

Re: [RFC PATCH 0/6] kprobes: remove global kprobe_lock


Mathieu Desnoyers wrote:
* Ananth N Mavinakayanahalli (ananth@in.ibm.com) wrote:

Hi,

The following set of patches replaces the global spinlock (kprobe_lock)
with an rwlock. With this change, it is now possible to have parallel
execution of kprobes (same or different), without having to spin on the
kprobe_lock. Of course, the handlers are required to be reentrant in
order to obtain accurate results, or they have to take care of
serializing if they share variables (counters, for example).



Well, it looks like a problem I had to face in my experimental LTT
implementation. Since tracing should have a minimal impact on
performance, your goal is to have the fastest possible locks in the
critical path. In fact, rwlocks are not made for that purpose: they keep
a count of readers, which makes that value bounce from one CPU to
another, causing cache invalidations.

Yes, cacheline bouncing is an issue. But note that this is just the first step in making kprobes scalable. And since context switches are involved with kprobes in any case, due to breakpointing and single-stepping, I don't know if it is that big an issue.

As the locking guide in Documentation/DocBook/ says, rwlocks are meant to
protect code paths that take a long time to execute; otherwise they are not
worth the performance cost compared to the contention caused by a spinlock.

If what you are looking for is scalability, here are the two final locking
schemes I came up with:

* Use atomic operations (no locking at all)
* Use a per_cpu spinlock:
  This is interesting in cases where you almost never write to a data
  structure but read it very often. Here is the basic idea:
  - writers take every spinlock, in the very same order. Once a writer has
    them all, it has write access to the structure.
  - readers only take their own per-CPU spinlock. This ensures that no
    writer is currently modifying the data.

And then there is RCU. With RCU you can run handlers without *any* locking.

Someone at OLS suggested that it was like the brlock (big reader lock).
The subtle enhancement in my implementation is the use of per_cpu variables
to hold the spinlocks; the benefits:
  - No false sharing of the spinlocks.
  - No precious CPU cache space wasted by aligning each spinlock on cache
    line boundaries: the alignment is implicit in the per_cpu variables.


What do you think about it?

Well, I have an RCU-based prototype which could potentially be the
fastest of the lot. However, I saw some weird issues on an 8-way x86 SMP
box with a kprobe on "schedule" and a "make -j8" of the Linux kernel:
rmmod on the kprobe module never returned. Still working on it; it needs
more polishing and testing :-)


I've heard from most users that they don't care about probe
insertion/removal overheads; they are only concerned with handler
execution times.


The goal is to finally have an RCU based mechanism for kprobes. The rwlock patchset is the first step in that direction.

Ananth

