This is the mail archive of the
libc-alpha@sourceware.org
mailing list for the glibc project.
Re: [x86-64 psABI] RFC: Extend x86-64 PLT entry to support MPX
- From: Roland McGrath <roland at hack dot frob dot com>
- To: "H.J. Lu" <hjl dot tools at gmail dot com>
- Cc: GNU C Library <libc-alpha at sourceware dot org>, GCC Development <gcc at gcc dot gnu dot org>, Binutils <binutils at sourceware dot org>, "Girkar, Milind" <milind dot girkar at intel dot com>, "Kreitzer, David L" <david dot l dot kreitzer at intel dot com>
- Date: Wed, 24 Jul 2013 16:36:21 -0700 (PDT)
- Subject: Re: [x86-64 psABI] RFC: Extend x86-64 PLT entry to support MPX
- References: <CAMe9rOp=1v38F_aV-pbv50YOGSEr_ju+byZP1L_G_h4bm5Ad3w at mail dot gmail dot com>
I've read through the MPX spec once, but most of it is still not very
clear to me. So please correct any misconceptions. (HJ, if you answer
any or all of these questions in your usual style with just, "It's not a
problem," I will find you and I will kill you. Explain!)
Will an MPX-using binary require an MPX-supporting dynamic linker to run
correctly?
* An old dynamic linker won't clobber %bndN directly, so that's not a
problem.
* Does having the bounds registers set have any effect on regular/legacy
code, or only when bndc[lun] instructions are used?
If it doesn't affect normal instructions, then I don't entirely
understand why it would matter to clear %bnd* when entering or leaving
legacy code. Is it solely for the case of legacy code returning a
pointer value, so that the new code would expect the new ABI wherein
%bnd0 has been set to correspond to the pointer returned in %rax?
* What's the effect of entering the dynamic linker via "bnd jmp"
(i.e. new MPX-using binary with new PLT, old dynamic linker)? The old
dynamic linker will leave %bndN et al exactly as they are, until its
first unadorned branching instruction implicitly clears them. So the
only problem would be if the work _dl_runtime_{resolve,profile} does
before its first branch/call were affected by the %bndN state.
If there are indeed any problems with this scenario, then you need a
plan to make new binaries require a new dynamic linker (and fail
gracefully in the absence of one, and have packaging systems grok the
dependency, etc.)
In a related vein, what's the effect of entering some legacy code via
"bnd jmp" (i.e. new binary using PLT call into legacy DSO)?
* If the state of %bndN et al does not affect legacy code directly, then
it's not a problem. The legacy code will eventually use an unadorned
branch instruction, and that will implicitly clear %bnd*. (Even if
it's a leaf function that's entirely branch-free, its return will
count as such an unadorned branch instruction.)
* If that's not the case, then a PLT entry that jumps to legacy code
will need to clear the %bndN state. I see one straightforward
approach, at the cost of an extra bounce (i.e. turning the normal
double-bounce into a triple-bounce) when going from MPX code to legacy
code. Each PLT entry can be:
        bnd jmp *foo@GOTPCREL(%rip)
        pushq $N
        bnd jmp .Lplt0
        .balign 16
        jmp *foo@GOTPCREL+8(%rip)
        .balign 32
and now each of those gets two (adjacent) GOT slots rather than just
one. When the dynamic linker resolves "foo" and sees that it's in a
legacy DSO, it sets the foo GOT slot to point to .plt+(N*32 + 16) and
the foo+1 GOT slot to point to the real target (resolution of "foo").
After fixup, entering that PLT entry will do "bnd jmp" to the second
half of the entry, which does (unadorned) "jmp" to the real target,
implicitly clearing %bndN state.
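
To make the fixup arithmetic above concrete, here is a minimal C sketch of
what the dynamic linker would store in the two GOT slots. It is only an
illustration under stated assumptions: the names got_pair,
plt_legacy_half and fixup are hypothetical, PLT entries are taken to be
32 bytes with the unadorned "jmp" at offset 16 as in the listing above,
and the handling of an MPX-aware target (first slot points straight at
the target, second slot unused) is my assumption rather than something
spelled out above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout from the text: 32-byte PLT entries, with the
   unadorned "jmp" placed 16 bytes into each entry. */
enum { PLT_ENTRY_SIZE = 32, PLT_LEGACY_OFFSET = 16 };

/* Hypothetical pair of adjacent GOT slots for one symbol. */
struct got_pair {
    uint64_t first;   /* read by "bnd jmp *foo@GOTPCREL(%rip)" */
    uint64_t second;  /* read by "jmp *foo@GOTPCREL+8(%rip)" */
};

/* Address of the second half of PLT entry n: .plt + (n*32 + 16). */
static uint64_t plt_legacy_half(uint64_t plt_base, uint64_t n)
{
    /* The ".balign 32" per entry implies a 32-byte-aligned .plt. */
    assert(plt_base % PLT_ENTRY_SIZE == 0);
    return plt_base + n * PLT_ENTRY_SIZE + PLT_LEGACY_OFFSET;
}

/* Values for the two GOT slots once "foo" has been resolved. */
static struct got_pair fixup(uint64_t plt_base, uint64_t n,
                             uint64_t target, int target_is_legacy)
{
    struct got_pair g;
    if (target_is_legacy) {
        /* Bounce through the unadorned jmp so %bndN gets cleared. */
        g.first = plt_legacy_half(plt_base, n);
        g.second = target;
    } else {
        /* Assumption: MPX-aware targets are entered directly. */
        g.first = target;
        g.second = 0;
    }
    return g;
}
```

For a legacy target, fixup(plt, N, addr, 1) yields a first slot inside
the PLT entry itself and a second slot holding the real address, which
is the triple-bounce described above.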
Those are the background questions to help me understand better.
Now, to your specific questions.
I can't tell if you are proposing that a single object might contain
both 16-byte and 32-byte PLT slots next to each other in the same .plt
section. That seems like a bad idea. I can think of two things offhand
that expect PLT entries to be of uniform size, and there may well
be more.
* The foo@plt pseudo-symbols that e.g. objdump will display are based on
the BFD backend knowing the size of PLT entries. Arguably this ought
to look at sh_entsize of .plt instead of using baked-in knowledge, but
it doesn't.
* The linker-generated CFI for .plt is a single FDE for the whole
section, using a DWARF expression covering all normal PLT entries
together based on them having uniform size and contents. (You could
of course make the linker generate per-entry CFI, or partition the PLT
into short and long entries and have the CFI treat the two partitions
appropriately differently. But that seems like a complication best
avoided.)
Now, assuming we are talking about a uniform PLT in each object, there
is the question of whether to use a new PLT layout everywhere, or only
when linking an object with some input files that use MPX.
* My initial reaction was to say that we should just change it
unconditionally to keep things simple: use new linker, get new format,
end of story. Simplicity is good.
* But, doubling the size of PLT entries means more i-cache pressure. If
cache lines are 64 bytes, then today you fit four entries into a cache
line. Assuming PLT entries are more used than unused, this is a good
thing. Reducing that to two entries per cache line means twice as
many i-cache misses if you hit a given PLT frequently (with even
distribution of which entries you actually use--at any rate, it's
"more" even if it's not "twice as many"). Perhaps this is enough cost
in real-world situations to be worried about. I really don't know.
* As I mentioned before, there are things floating around that think
they know the size of PLT entries. Realistically, there will be
plenty of people using new tools to build binaries but not using MPX
at all, and these people will give those binaries to people who have
old tools. In the case of someone running an old objdump on a new
binary, they would see bogus foo@plt pseudo-symbols and be misled and
confused. Not to mention the unknown unknowns, i.e. other things that
"know" the size of PLT entries that we don't know about or haven't
thought of here. It's just basic conservatism not to perturb things
for these people who don't care about or need anything related to MPX
at all.
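
The cache-line arithmetic in the i-cache bullet above is just a
division. A tiny C sketch, assuming 64-byte lines as in the text (the
function name plt_entries_per_line is hypothetical, and real cache
behavior depends on much more than this count):

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole PLT entries that fit in one i-cache line. */
static size_t plt_entries_per_line(size_t line_size, size_t entry_size)
{
    /* Sketch only handles entry sizes that divide the line evenly. */
    assert(entry_size > 0 && line_size % entry_size == 0);
    return line_size / entry_size;
}
```

With 64-byte lines this gives four entries per line for 16-byte PLT
entries and two for 32-byte entries.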
How a relocatable object is marked (so that the linker knows at link
time whether its code is MPX-compatible) and how a DSO/executable is
marked (so that the dynamic linker knows at runtime) are two separate
subjects.
For relocatable objects, I don't think there is really any precedent for
using ELF notes to tell the linker things. It seems much nicer if the
linker continues to treat notes completely normally, i.e. appending
input files' same-named note sections together like with any other named
section rather than magically recognizing and swallowing certain notes.
OTOH, the SHT_GNU_ATTRIBUTES mechanism exists for exactly this sort of
purpose and is used on other machines for very similar sorts of issues.
There is both precedent and existing code in binutils to have the linker
merge attribute sections from many input files together in a fashion
aware of the semantics of those sections, and to have those attributes
affect the linker's behavior in machine-specific ways. I think you have
to make a very strong case to use anything other than SHT_GNU_ATTRIBUTES
for this sort of purpose in relocatable objects.
For linked objects, there are a couple of obvious choices. They all require
that the linker have special knowledge to create the markings. One
option is a note. We use .note.ABI-tag for a similar purpose in libc,
but I don't know of any precedent for the linker synthesizing notes.
The most obvious choice is e_flags bits. That's what other machines use
to mark ABI variants. There are no bits assigned for x86 yet. There
are obvious limitations to using e_flags, in that it's part of the
universal ELF psABI rather than something with vendor extensibility
built in like notes have, and in that there are only 32 bits available
to assign rather than being a wholly open-ended format like notes. But
using e_flags is certainly simpler to synthesize in the linker and
simpler to recognize in the dynamic linker than a note format. I think
you have to make at least a reasonable (objective) case to use a note
rather than e_flags, though I'm certainly not firmly against a note.
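
As a rough illustration of why e_flags is simple to recognize, here is a
minimal C sketch of the check a dynamic linker could make. The flag
EF_X86_64_MPX_SKETCH and its value are purely hypothetical (as said
above, no e_flags bits are assigned for x86), and ehdr_is_mpx_marked is
an invented name; only Elf64_Ehdr and EM_X86_64 come from <elf.h>.

```c
#include <assert.h>
#include <elf.h>
#include <stddef.h>

/* HYPOTHETICAL bit: no e_flags bits are assigned for x86 today. */
#define EF_X86_64_MPX_SKETCH 0x1u

/* Nonzero if the ELF header carries the (hypothetical) MPX marking. */
static int ehdr_is_mpx_marked(const Elf64_Ehdr *ehdr)
{
    assert(ehdr != NULL);
    return ehdr->e_machine == EM_X86_64
        && (ehdr->e_flags & EF_X86_64_MPX_SKETCH) != 0;
}
```

A note-based marking would instead require walking PT_NOTE segments and
parsing each note's name and type before making the same decision.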
Finally, you've only mentioned x86-64. The hardware details apply about
the same to x86-32 AFAICT. If this is something that we'll eventually
want to do for x86-32 as well, then I think we should at least hash out
the plan for x86-32 fairly thoroughly before committing to a plan for
x86-64 (even if the actual implementation for x86-32 lags). Probably
it's all much the same and working it through for x86-32 won't give us
any pause in our x86-64 plans, but we won't know until we actually do it.
Thanks,
Roland