
Re: [x86-64 psABI] RFC: Extend x86-64 PLT entry to support MPX


I've read through the MPX spec once, but most of it is still not very
clear to me.  So please correct any misconceptions.  (HJ, if you answer
any or all of these questions in your usual style with just, "It's not a
problem," I will find you and I will kill you.  Explain!)

Will an MPX-using binary require an MPX-supporting dynamic linker to run
correctly?

* An old dynamic linker won't clobber %bndN directly, so that's not a
  problem.

* Does having the bounds registers set have any effect on regular/legacy
  code, or only when bndc[lun] instructions are used?  

  If it doesn't affect normal instructions, then I don't entirely
  understand why it would matter to clear %bnd* when entering or leaving
  legacy code.  Is it solely for the case of legacy code returning a
  pointer value, so that the new code would expect the new ABI wherein
  %bnd0 has been set to correspond to the pointer returned in %rax?

* What's the effect of entering the dynamic linker via "bnd jmp"
  (i.e. new MPX-using binary with new PLT, old dynamic linker)?  The old
  dynamic linker will leave %bndN et al exactly as they are, until its
  first unadorned branching instruction implicitly clears them.  So the
  only problem would be if the work _dl_runtime_{resolve,profile} does
  before its first branch/call were affected by the %bndN state.

If there are indeed any problems with this scenario, then you need a
plan to make new binaries require a new dynamic linker (and fail
gracefully in the absence of one, and have packaging systems grok the
dependency, etc.).

In a related vein, what's the effect of entering some legacy code via
"bnd jmp" (i.e. new binary using PLT call into legacy DSO)?  

* If the state of %bndN et al does not affect legacy code directly, then
  it's not a problem.  The legacy code will eventually use an unadorned
  branch instruction, and that will implicitly clear %bnd*.  (Even if
  it's a leaf function that's entirely branch-free, its return will
  count as such an unadorned branch instruction.)

* If that's not the case, then a PLT entry that jumps to legacy code
  will need to clear the %bndN state.  I see one straightforward
  approach, at the cost of a double-bounce (i.e. turning the normal
  double-bounce into a triple-bounce) when going from MPX code to legacy
  code.  Each PLT entry can be:

	bnd jmp *foo@GOTPCREL(%rip)	# GOT slot 0: initially the pushq
					#   below; the real target after fixup
	pushq $N			# relocation index of "foo"
	bnd jmp .Lplt0			# into PLT0 for lazy resolution
	.balign 16
	jmp *foo@GOTPCREL+8(%rip)	# GOT slot 1: real target; unadorned
					#   jmp implicitly clears %bndN
	.balign 32

  and now each of those gets two (adjacent) GOT slots rather than just
  one.  When the dynamic linker resolves "foo" and sees that it's in a
  legacy DSO, it sets the foo GOT slot to point to .plt+(N*32 + 16) and
  the foo+1 GOT slot to point to the real target (resolution of "foo").
  After fixup, entering that PLT entry will do "bnd jmp" to the second
  half of the entry, which does (unadorned) "jmp" to the real target,
  implicitly clearing %bndN state.
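
To make the bookkeeping concrete, here's a minimal C sketch of that
fixup step.  The names (fixup_for_legacy_target, plt_base, and so on)
are illustrative, not actual glibc internals, and it assumes the
32-byte entry layout above:

	#include <stddef.h>
	#include <stdint.h>

	#define PLT_ENTRY_SIZE 32  /* the doubled entry sketched above */

	/* got points at the pair of adjacent GOT slots for "foo";
	   reloc_index is the N that the PLT entry pushes.  */
	static void
	fixup_for_legacy_target (uintptr_t *got, size_t reloc_index,
	                         uintptr_t plt_base, uintptr_t real_target)
	{
	  /* Slot 0: aim the "bnd jmp" at the second half of PLT entry N,
	     i.e. .plt+(N*32 + 16), where the unadorned "jmp" lives.  */
	  got[0] = plt_base + reloc_index * PLT_ENTRY_SIZE + 16;
	  /* Slot 1: the real resolution of "foo"; the unadorned "jmp"
	     through this slot implicitly clears the %bndN state.  */
	  got[1] = real_target;
	}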

Those are the background questions to help me understand better.
Now, to your specific questions.

I can't tell if you are proposing that a single object might contain
both 16-byte and 32-byte PLT slots next to each other in the same .plt
section.  That seems like a bad idea.  I can think of two things
offhand that expect PLT entries to be of uniform size, and there may
well be more.

* The foo@plt pseudo-symbols that e.g. objdump will display are based on
  the BFD backend knowing the size of PLT entries.  Arguably this ought
  to look at sh_entsize of .plt instead of using baked-in knowledge, but
  it doesn't.  (A sketch of what that would look like follows this
  list.)

* The linker-generated CFI for .plt is a single FDE for the whole
  section, using a DWARF expression covering all normal PLT entries
  together based on them having uniform size and contents.  (You could
  of course make the linker generate per-entry CFI, or partition the PLT
  into short and long entries and have the CFI treat the two partitions
  appropriately differently.  But that seems like a complication best
  avoided.)
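
Here's a rough sketch of what consulting sh_entsize would mean (not
actual BFD code; it assumes an ELF64 image mapped in full and omits
all error checking):

	#include <elf.h>
	#include <stdio.h>
	#include <string.h>

	/* Report .plt's sh_entsize instead of using baked-in knowledge.  */
	static void
	show_plt_entsize (const unsigned char *image)
	{
	  const Elf64_Ehdr *ehdr = (const Elf64_Ehdr *) image;
	  const Elf64_Shdr *shdrs
	    = (const Elf64_Shdr *) (image + ehdr->e_shoff);
	  const char *shstrtab
	    = (const char *) image + shdrs[ehdr->e_shstrndx].sh_offset;

	  for (int i = 0; i < ehdr->e_shnum; ++i)
	    if (strcmp (shstrtab + shdrs[i].sh_name, ".plt") == 0)
	      printf (".plt sh_entsize: %llu\n",
	              (unsigned long long) shdrs[i].sh_entsize);
	}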

Now, assuming we are talking about a uniform PLT in each object, there
is the question of whether to use a new PLT layout everywhere, or only
when linking an object with some input files that use MPX.  

* My initial reaction was to say that we should just change it
  unconditionally to keep things simple: use new linker, get new format,
  end of story.  Simplicity is good.

* But doubling the size of PLT entries means more i-cache pressure.  If
  cache lines are 64 bytes, then today four entries fit into a cache
  line.  Assuming PLT entries are more used than unused, this is a good
  thing.  Reducing that to two entries per cache line means twice as
  many i-cache misses if you hit a given PLT frequently, assuming an
  even distribution of which entries you actually use; at any rate,
  it's "more" even if it's not "twice as many".  Perhaps this is enough
  cost in real-world situations to be worried about.  I really don't
  know.  (The arithmetic is spelled out in a trivial sketch after this
  list.)

* As I mentioned before, there are things floating around that think
  they know the size of PLT entries.  Realistically, there will be
  plenty of people using new tools to build binaries but not using MPX
  at all, and these people will give those binaries to people who have
  old tools.  In the case of someone running an old objdump on a new
  binary, they would see bogus foo@plt pseudo-symbols and be misled and
  confused.  Not to mention the unknown unknowns, i.e. other things that
  "know" the size of PLT entries that we don't know about or haven't
  thought of here.  It's just basic conservatism not to perturb things
  for these people who don't care about or need anything related to MPX
  at all.
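
For what it's worth, the arithmetic in the i-cache point above amounts
to no more than this (a trivial illustration assuming 64-byte lines):

	#include <stdio.h>

	int
	main (void)
	{
	  const unsigned line_size = 64;  /* assumed i-cache line size */
	  printf ("16-byte PLT entries per line: %u\n", line_size / 16); /* 4 */
	  printf ("32-byte PLT entries per line: %u\n", line_size / 32); /* 2 */
	  return 0;
	}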

How a relocatable object is marked so that the linker knows whether its
code is MPX-compatible at link time and how a DSO/executable is marked
so that the dynamic linker knows at runtime are two separate subjects.

For relocatable objects, I don't think there is really any precedent for
using ELF notes to tell the linker things.  It seems much nicer if the
linker continues to treat notes completely normally, i.e. appending
input files' same-named note sections together like with any other named
section rather than magically recognizing and swallowing certain notes.
OTOH, the SHT_GNU_ATTRIBUTES mechanism exists for exactly this sort of
purpose and is used on other machines for very similar sorts of issues.
There is both precedent and existing code in binutils to have the linker
merge attribute sections from many input files together in a fashion
aware of the semantics of those sections, and to have those attributes
affect the linker's behavior in machine-specific ways.  I think you have
to make a very strong case to use anything other than SHT_GNU_ATTRIBUTES
for this sort of purpose in relocatable objects.

For linked objects, there are a couple of obvious choices.  They all
require
that the linker have special knowledge to create the markings.  One
option is a note.  We use .note.ABI-tag for a similar purpose in libc,
but I don't know of any precedent for the linker synthesizing notes.
The most obvious choice is e_flags bits.  That's what other machines use
to mark ABI variants.  There are no bits assigned for x86 yet.  There
are obvious limitations to using e_flags, in that it's part of the
universal ELF psABI rather than something with vendor extensibility
built in like notes have, and in that there are only 32 bits available
to assign rather than being a wholly open-ended format like notes.  But
using e_flags is certainly simpler to synthesize in the linker and
simpler to recognize in the dynamic linker than a note format.  I think
you have to make at least a reasonable (objective) case to use a note
rather than e_flags, though I'm certainly not firmly against a note.
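
To illustrate the e_flags direction, here's a hypothetical sketch of
what the dynamic linker's check could look like.  EF_X86_64_MPX is a
made-up name and value, since, as noted, no x86 bits are assigned yet:

	#include <elf.h>
	#include <stdbool.h>

	/* Hypothetical flag; nothing is actually assigned in the psABI.  */
	#define EF_X86_64_MPX 0x1

	static bool
	object_uses_mpx (const Elf64_Ehdr *ehdr)
	{
	  return (ehdr->e_machine == EM_X86_64
	          && (ehdr->e_flags & EF_X86_64_MPX) != 0);
	}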

Finally, you've only mentioned x86-64.  The hardware details apply about
the same to x86-32 AFAICT.  If this is something that we'll eventually
want to do for x86-32 as well, then I think we should at least hash out
the plan for x86-32 fairly thoroughly before committing to a plan for
x86-64 (even if the actual implementation for x86-32 lags).  Probably
it's all much the same and working it through for x86-32 won't give us
any pause in our x86-64 plans, but we won't know until we actually do it.


Thanks,
Roland

