[Ltrace-devel] Tracing library calls on PowerPC

Petr Machata pmachata at redhat.com
Thu Mar 22 15:55:15 UTC 2012


Hi there,

I've finally gotten around to taking a serious look at the ppc support
in ltrace this week.  I'd like to present my conclusions, to let people
interested in ltrace know, and to keep this archived for posterity.

First, let's deal with the case of 32-bit PPC.  There are two PLT types
on 32-bit PPC: the old-style BSS PLT, and the new-style "secure" PLT.
We can tell one from the other by the flags on the .plt section: if it's
+X (executable), it's BSS PLT, otherwise it's secure.

BSS PLT works the same way as on most architectures: the .plt section
contains trampolines and we put breakpoints on those.  So this case is
trivial to take care of.

With secure PLT, the .plt section doesn't contain instructions but
addresses.  The real PLT table is stored in .text.  Addresses of those
PLT entries can be computed.  Some time ago I adapted the computation
that BFD does; it lived in ltrace-elf.c.  On my libs branch (more on
this in some other e-mail), I moved it to the PPC back end where it
belongs.

If not prelinked, BSS PLT entries in the .plt section contain zeroes
that are overwritten by the dynamic linker during start-up.  For that
reason, ltrace enables those breakpoints only after .start is hit.

This was all rather straightforward (unless I forgot about some corner
case that is), and the libs branch currently happily traces PPC32
processes, BSS or secure, prelinked or not, stripped or not, straight or
cross.


The 64-bit PPC case is much more involved.  PPC64 only ever has secure
PLTs.
Right now tracing of PPC64 processes is done by reading the contents of
.plt (which contains addresses), and putting breakpoints on those
addresses.  Thus ltrace traces entry points, not library calls.
Intra-library calls fire as well.  That's not bad per se, but there is a
separate mechanism for that: the -x option.

On PPC64, callers call _stubs_, not PLT entries.  There may be more than
one stub for each PLT entry.  Stubs play the role of the PLT: they read
a value from .plt and dispatch the call.  PLT entries themselves are
essentially just curried calls to the resolver: all they do is set up r0
with the number of the function to be resolved, and call the resolver.
When a symbol is resolved, the resolver updates the value stored in
.plt, and so the PLT entry is only ever called once, if at all.
Correspondingly, PLT entries are useless as breakpoint sites.

When not prelinked, .plt contains zeroes, and we have to wait for the
dynamic linker to fill this with addresses of PLT entries.  When
prelinked, .plt is initialized to contain target addresses right away,
and the stub thus completely avoids calling the PLT entry.  So it seems
like stubs are where we would like to have breakpoints.

Which is kind of tricky, because there doesn't seem to be any way to
reliably compute addresses of stubs.  There are symbols for stubs in the
symbol table, but those go away when the binary is stripped, and
therefore can't be used (look for xxxxxxxx.plt_call.name, xxxxxxxx being
a hex number).
They appear to always be at the beginning of their respective section
(typically .text), but I don't know whether that's an actual rule.  It
seems it's like that just because I happened to look into binaries that
use the standard linker script.  Generally they are ordered differently
from PLT entries.  Figuring out which stub belongs to which PLT entry is
possible, but it implies assumptions and disassembling, and is a mess.

We might still use them if they are in the symbol table, but for the
general case, something else is necessary.

So when a process is started, we do two things: inspect the .plt
section, and put breakpoints on PLT entries (whose addresses we compute
as on PPC32).

If the binary was prelinked, .plt is filled with addresses.  We remember
those and rewrite them to point to PLT entries instead.  When a call is
made, the corresponding PLT breakpoint hits.  We don't even bother
removing and re-enabling the breakpoint, instead we simply move the IP
to the previously-remembered address.  The .plt changes should be undone
before a process is detached, but even if they are not, the worst that
happens is that those symbols are resolved again.

If the binary was not prelinked, we need to smuggle a breakpoint
somewhere where it hits every time that .plt is updated: a post-fixup
breakpoint.

One way to do this is to put a breakpoint on ._dl_fixup (or .fixup in
older linkers, or whatever else in non-standard linkers).  When this
hits, we remove that breakpoint, and put a breakpoint on the return
address instead (read from the link register), which will be somewhere
in ._dl_runtime_resolve.  That is our post-fixup breakpoint.  Another
way is to scan ._dl_runtime_resolve for a "bctr" instruction, and put
the breakpoint there.  Neither is too nice or robust.  Yet another way
is to single-step the process through the resolver and watch for changes
in .plt.  That seems like the most robust solution.

When the post-fixup breakpoint hits, we know a .plt slot was updated,
and treat it as if it was prelinked, as described above.

The whole resolver gambit has to be handled the same way that breakpoint
re-enablement in multi-threaded processes is: when a PLT breakpoint hits
for the first time, we stop all threads before proceeding.
Single-stepping one thread through the resolver by itself is not enough,
because there's a brief window after the value is updated, but before
ltrace gets to handle it, and in that window other threads may race
through and ltrace misses those calls.

When we attach to a running process, .plt will be in a mixed state: some
slots were already fixed up, some weren't.  We can easily tell the
former case from the latter by comparing the address stored in the slot
with the address of the corresponding PLT entry.


As a bonus, I'll mention one blind alley that I went down.  Under this
scheme, we would put breakpoints on PLT entries.  When one of these hit,
we would read the return address, which leads back to the original
caller.  Decoding the instruction just before the return address would
give us the stub address, and we would put a breakpoint there.  That's
tempting, but unfortunately there is generally more than one stub per
PLT entry, and a breakpoint would be set only in the first called stub.
So this would need to be combined with the post-fixup scheme as well,
and would eventually simply collapse to the previous case, except
breakpoints would hit in stubs instead of in the PLT.


As a pleasant side effect, the post-fixup scheme takes care of tracing
re-entrant calls, and of the case where several threads make the same
first-time PLT call in parallel, both of which are currently broken.
The deal is that we know that the address in .plt changes after the
first call.  But we don't get around to updating the breakpoint until
the function returns, mostly because we don't know when it does.
Unfortunately this means that we fail to trace any calls made between
the first call and the first return, as we don't have breakpoints in the
right places.  I don't see any other way to fix this than to put a
breakpoint somewhere in the dynamic linker, which is what I propose
above.

Thanks,
PM


