This is the mail archive of the systemtap@sourceware.org mailing list for the systemtap project.



breakpoint assistance: single-step out of line


The method of single-stepping over an out of line copy of the instruction
clobbered by breakpoint insertion has been proven by kprobes.  The
complexities are mitigated in that implementation by the constrained
context of the kernel and the fixed subset of possible machine code known
to validly occur in any kernel or module text.

There are two core problem areas in implementing single-step out of line
for user mode code.  These are where to store the out of line copies, and
arch issues with instruction semantics.


Starting with arch issues, I'll talk about the only ones I know in detail,
which are x86 and x86_64.  kprobes has done the basic work here.  For the
user mode context, on the one hand the risks of munging an instruction's
behavior are confined to the user address space in question, but on the
other hand we have to deal robustly with the full range of instructions
that can be executed on the processor in user mode.

Instruction decoding needs to be robust: it cannot presume the canonical
subset of encodings normally produced by the compiler, as the kernel-mode
decoding does.  On
machines other than x86, this tends to be quite simple.  On x86, it means
parsing all the instruction prefixes correctly and so forth.  I think the
parsing should be done at breakpoint insertion time, caching just a few
bits saying what fixup strategy we need to use after the step.  If we can't
positively decode the instruction enough to be confident that we know how
to fix it up, refuse to insert the breakpoint.  (If it's an invalid
instruction, you don't need a breakpoint because you'll get a trap anyway.)
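As a sketch of what decoding at insertion time might look like, here is a
hypothetical prefix scanner for 64-bit x86.  The names `fixup_kind` and
`skip_prefixes` are my own inventions, not from any existing implementation;
a real decoder would go on to classify the opcode bytes and cache the
resulting fixup strategy in the breakpoint record:

```c
#include <stddef.h>

/* Fixup strategies we might cache at breakpoint-insertion time
 * (hypothetical names; the real set is arch-specific). */
enum fixup_kind {
    FIXUP_NONE,         /* no PC reference: step, then just advance */
    FIXUP_RIP_RELATIVE, /* memory operand uses %rip-relative addressing */
    FIXUP_CONTROL_FLOW, /* jmp/call/ret: PC set by the instruction itself */
    FIXUP_REFUSE        /* could not decode: refuse to insert breakpoint */
};

/* Scan the legacy and REX prefixes of a 64-bit x86 instruction.
 * Returns the offset of the first opcode byte, or -1 if the prefix
 * run exhausts the buffer.  Sets *seg to the last segment-override
 * prefix byte seen (0 if none). */
static int skip_prefixes(const unsigned char *insn, size_t len,
                         unsigned char *seg)
{
    size_t i = 0;
    *seg = 0;
    while (i < len) {
        unsigned char b = insn[i];
        switch (b) {
        case 0x26: case 0x2e: case 0x36: case 0x3e: /* es, cs, ss, ds */
        case 0x64: case 0x65:                       /* fs, gs */
            *seg = b;
            i++;
            continue;
        case 0x66: case 0x67:                       /* operand/address size */
        case 0xf0: case 0xf2: case 0xf3:            /* lock, repne, rep */
            i++;
            continue;
        default:
            if (b >= 0x40 && b <= 0x4f) {           /* REX, 64-bit mode */
                i++;
                continue;
            }
            return (int)i;                          /* opcode byte */
        }
    }
    return -1;                                      /* ran off the end */
}
```

A decoder like this errs on the side of returning FIXUP_REFUSE for anything
it cannot positively classify, per the policy above.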

The instructions of concern are those that refer to the PC.
On 32-bit x86, these are only the few control flow instructions.
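For those control-flow cases, the standard kprobes-style fixup after the
step is just to translate the trap frame's PC back by the delta between the
copy and the original.  A minimal sketch (the function name is mine):

```c
#include <stdint.h>

/* After single-stepping the out-of-line copy of a control-flow
 * instruction, the trap frame's PC is relative to the copy; shifting
 * it by the copy/original delta recovers where execution should
 * resume.  (Sketch; a call additionally needs the return address it
 * pushed fixed up the same way.) */
static uint64_t fixup_pc_after_step(uint64_t pc_after_step,
                                    uint64_t copy_addr, uint64_t orig_addr)
{
    return pc_after_step - copy_addr + orig_addr;
}
```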

On x86_64, there is also %rip-relative addressing.  We cannot presume
addresses are within the same 4GB window so that the displacement can just
be adjusted, as we do in the kernel.  However, we can use some other
tricks.  The only instruction that computes a %rip-relative address as a
result is lea.  It is not difficult to recognize that one and just emulate
it outright; there are only a few variations of address-size, data-size,
and output register.  It's not much easier to fix it up after the step.
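The emulation itself is a one-line computation once the displacement and
destination register are decoded; a sketch, with a function name of my own:

```c
#include <stdint.h>

/* Emulating a %rip-relative lea outright: the result is the address
 * of the *next* instruction plus the sign-extended 32-bit
 * displacement, so no out-of-line step is needed at all.  (Sketch; a
 * real implementation also decodes the data size and destination
 * register from REX and ModRM before storing the result.) */
static uint64_t emulate_rip_lea(uint64_t insn_addr, unsigned insn_len,
                                int32_t disp)
{
    return insn_addr + insn_len + (int64_t)disp;
}
```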

Unless I'm overlooking something, all other %rip-relative uses are implicit
in the effective address for a memory access.  For these, we can use the fs
or gs segment prefix on the copied instruction, and adjust the displacement
and the fs or gs base value to come up with the original target address.
In the unlikely event that the instruction already uses the fs or gs
prefix, just adjust the appropriate base value and use the instruction as
it is.  Otherwise, insert a gs prefix in the copied instruction, and set
the gs base to the difference between the breakpoint address and the
address of the copy (just past the inserted prefix), so that the base
cancels the relocation of the instruction.  It is a little costly to set
the fs or gs base value and reset it after the step, much more than setting
a register in the trap frame; but it's probably not too bad.
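The base value is simple modular arithmetic; a sketch of the computation
(function name mine):

```c
#include <stdint.h>

/* With a gs prefix inserted on the copied instruction, the hardware
 * computes gs.base + %rip(copy) + disp.  Choosing the base as the
 * delta between the original and copied instruction addresses makes
 * that equal %rip(original) + disp, the address the instruction
 * would have referenced in place.  The subtraction is intended to
 * wrap mod 2^64. */
static uint64_t gs_base_for_copy(uint64_t orig_next_rip,
                                 uint64_t copy_next_rip)
{
    return orig_next_rip - copy_next_rip;
}
```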


Next we come to the problem of where to store copied instructions for
stepping.  The idea of stealing a stack page for this is a non-starter.
For both security and robustness, it's never acceptable to introduce a user
mapping that is both writable and executable, even temporarily.  We need
to use an otherwise unused page in the address space that is
read/execute-only for the user; we write to it only from kernel mode.

In some meeting notes I've seen mention of "do what the vdso does".  I
don't know what this referred to specifically.  There are two things this
might mean, and those are the two main options I see.  What the i386 vDSO
used to do (CONFIG_COMPAT_VDSO), what the ia64 vDSO does, and what the
x86-64 vsyscall page does (not a vDSO but similar), is the fixmap area.
What the i386 vDSO, the ia32 vDSO on x86_64, and the powerpc vDSO do,
is insert a vma.

The fixmap area is a region of address space that shares some page tables
across all tasks in the system.  The advantages are that it has no vm setup
cost since it is done once at boot, and that it is completely outside the
range of virtual addresses the user task can map normally and so does not
perturb any mapping behavior or appear in /proc/PID/maps or via
access_process_vm or such things that might have unintended side effects on
the user process.  On 32-bit x86, a disadvantage is that when NX page
protection is not available (older CPUs or non-PAE kernel builds), the
exec-shield approximation of NX protection via segmentation is defeated by
having an executable page high in the address space; this can be worked
around on the exec-shield kernel with some extra effort.  Other machines
may not already have an analogous region of reserved address space where a
page can be made user-readable/executable.  Other potential disadvantages
are the fixed amount of space (chosen at compile-time or boot-time, with
some small limit on the number of pages available), and the security
implications of global pages visible to all users on the system.  The
limited size might mean that slots need to be assigned only momentarily
while doing the step, meaning fresh icache flushing every time.  Then you'd
ideally use only one slot per CPU, but that needs some work to be right
given preemption.  The briefness of this window may mitigate the security
concerns, but still there are a few bytes of information about a traced
thread leaking to anyone in the system who wants to try to see them.  The
setup every time necessitated by the fixed space is costly, but on the
other hand its CPU use scales linearly with more breakpoints and more
occurrences and its memory use stays constant, compared to open-ended
allocation scaling with the number of breakpoints.

Inserting a vma means essentially doing an mmap from inside the kernel.
Both the advantages and the disadvantages of this stem from its normalcy.
Any stray mmap/munmap/mprotect call from the user might wind up clobbering
this mapping.  It appears in /proc/PID/maps and will become known to other
debugging facilities tracing the process, so they will think it's a normal
user allocation; it might appear in core dumps.  This might have other bad
effects on processes that look at their own maps file to see what heap
pages there are, which some GC libraries or suchlike might well do.  The
mapping also has subtler effects perturbing the process's own mapping
behavior, which could introduce anomalies or even break some programs that
need a lot of control over their address space.  The advantages are that
it's straightforward to implement and easy to be sure that it does the
right thing vis-à-vis paging and so forth; it provides the option of using
an open-ended amount of storage to optimize the use of many breakpoints,
and it's wholly private to the user address space in question.

A third option I didn't mention before is doing something in the page
tables behind the vm system's back (this is distinct from, and somewhat
simpler than, the fancy VM ideas like per-thread page tables).  I don't
know enough
about this to comment in detail.  The attraction is that it would avoid
some of the interactions I just mentioned with vma's, and might have lower
overhead to set up.  It might be difficult to make this do reasonable
things about paging and such.  This is probably not a good bet, but I don't
know much about it.

The fixmap is somewhat attractive at least for x86, x86-64, and ia64.  It's
nice that it doesn't interact with the normal user address range and set of
visible mappings.  The overhead of resetting and icache-flushing an
instruction slot on every use is less than what the uprobes prototype
using a stack page already incurs.  I don't know if the performance of
that will be
good enough in the long run, or if priming a slot once and using it
repeatedly will perform enough better that we care about this overhead.

The vma is the most straightforward thing to implement, and is generic
across machines.  It makes sense to implement this first generically and
then experiment later with the fixmap approach as an arch-specific
alternative.  The stack randomization done on at least x86/x86-64 means
that there is normally a good little stretch of address space free above
the stack vma (the top part of which holds environ and auxv).  (Just try
"tail -1 /proc/self/maps" a few times.)  This area is unlikely to conflict
with address space the user's own mappings would otherwise have occupied.
Allocating at one page above the end of the stack vma (leaving a red zone)
seems good.  I'm really more concerned about things monitoring the
mappings.  Perhaps we could add a VM_* flag that says to omit the vma from
listings, but I don't know how that would be received by kernel people, let
alone a flag to disallow user munmap/mmap/mprotect calls to change a mapping.
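The placement arithmetic above, parsing a vma's range as it appears in
/proc/PID/maps and skipping one red-zone page past its end, can be
sketched as follows (the function name is mine, and 4 KiB pages are
assumed):

```c
#include <stdint.h>
#include <stdio.h>

/* Parse the "start-end" range at the head of a /proc/PID/maps line
 * (e.g. the stack vma's line) and return the address one page above
 * the vma's end, leaving a one-page red zone below the slot area.
 * Returns 0 on parse failure. */
static uint64_t slot_addr_above(const char *maps_line)
{
    unsigned long long start, end;
    if (sscanf(maps_line, "%llx-%llx", &start, &end) != 2)
        return 0;
    return (uint64_t)end + 0x1000ULL;   /* skip one red-zone page */
}
```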

I can go into further detail on how I envision implementing the vma and/or
fixmap plans if it is not clear.


Thanks,
Roland

