Monday 22 April 2013

DTrace/ARM

In my last write-up, I mentioned the "3 chdir" problem,
whereby the process (bash) would crash on the 3rd chdir.

After a lot of head scratching...I finally figured out my mistake.
Interestingly, it takes me back to the earliest part of the x86 port,
where I made the exact same mistake.

When a breakpoint (in the case of ARM, an undefined-instruction trap) fires,
we are sitting on a kernel stack - a frame nested just below the point
where the breakpoint occurred. As I recounted in the last blog post,
we don't have a single-step trap on the ARM, so we cannot use the
same logic flow as x86 (single-step over the instruction where we
placed the breakpoint probe).

Instead, we have to emulate the instruction. The initial tactic was
to emulate the PUSH instruction - really a class of instructions
(compared with x86, ARM encodes a multitude of operations into a single
32-bit instruction). When emulating a PUSH, we cannot simply "push",
because the place we want to push to, just below where the SP register
was, is already occupied by the stack frame for the trap we just took.


---------------
|  original   |
|  caller's   |
|  stack      |
|             |
--------------- SP at the time of the FBT trap
|  saved      |
|  registers  |
|             |
--------------- SP in the FBT handler
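
To make the collision in the picture concrete, here is a minimal sketch
(not the actual driver code) of recognising a PUSH and working out where
it would write. PUSH {reglist} is the ARM encoding of STMDB sp!, {reglist};
the function names are made up for illustration, and the condition field
is ignored for brevity.

```c
#include <stdint.h>

/* PUSH {reglist} == STMDB sp!, {reglist}
 * Encoding: cond 1001 0010 1101 <16-bit register list>
 * The condition field (top 4 bits) is masked out in this sketch. */
static int is_push(uint32_t insn)
{
    return (insn & 0x0FFF0000u) == 0x092D0000u;
}

/* Count the registers named in the low 16 bits (one bit per register). */
static unsigned push_count(uint32_t insn)
{
    unsigned n = 0;
    for (uint32_t list = insn & 0xFFFFu; list != 0; list &= list - 1)
        n++;                      /* clears the lowest set bit each pass */
    return n;
}

/* Lowest address the PUSH would write to, given SP at the time of the
 * trap. Each register takes one 32-bit word below the old SP. */
static uint32_t push_low_addr(uint32_t insn, uint32_t sp_at_trap)
{
    return sp_at_trap - 4u * push_count(insn);
}
```

For example, "push {r4, lr}" (0xE92D4010) taken with SP at 0xC0008000
would write to 0xC0007FF8..0xC0007FFF, which is exactly where the trap
handler has just saved its registers.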


This is solvable by moving the saved-registers area around.
The alternate trick is to call into a scratch buffer instead:


---------------
| orig instr  |
---------------
| jmp OPC+4   |
---------------


where OPC is the original trap program counter.

This actually works well, except for two scenarios. The first is
that the original instruction may use PC-relative addressing. Executed
from the scratch buffer it would see the wrong PC, so we either have
to emulate it, or rewrite the instruction to avoid the PC-relative
addressing.
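
As a rough sketch of the "emulate" option (an illustration only, covering
just the immediate-offset LDR/STR form, with hypothetical function names):
an instruction is PC-relative when its base register field is r15, and in
ARM state the PC reads as the address of the instruction plus 8.

```c
#include <stdint.h>

/* LDR/STR with immediate offset: cond 010 P U B W L Rn Rt imm12.
 * Rn == 15 means the base register is the PC.
 * Other PC-relative forms (register offset, ADR, branches, ...) are
 * not covered by this sketch. */
static int is_pc_relative_ldr_str(uint32_t insn)
{
    return (insn & 0x0E000000u) == 0x04000000u   /* bits 27..25 == 010 */
        && ((insn >> 16) & 0xFu) == 15u;         /* Rn == r15 (PC)     */
}

/* Address the instruction would have accessed had it run at its
 * original location, opc. */
static uint32_t pc_relative_target(uint32_t insn, uint32_t opc)
{
    uint32_t imm12 = insn & 0xFFFu;
    uint32_t pc = opc + 8u;       /* ARM state: PC = instruction + 8 */

    /* U bit (bit 23) selects add or subtract of the offset. */
    return (insn & (1u << 23)) ? pc + imm12 : pc - imm12;
}
```

So "ldr r0, [pc, #4]" (0xE59F0004) sitting at 0x8000 reads from 0x800C,
wherever the copy of the instruction ends up being executed.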

The second issue is that we cannot have a single scratch buffer per CPU,
because another trap or interrupt may fire and overwrite it while it is
in use. So, we need a scratch buffer per probe. (The scratch buffer is
sized at 5 x 32-bit instruction slots: the longest "rewrite" requires
3 slots, and we need 2 more to handle a 32-bit JMP instruction.)
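
A per-probe buffer along those lines might look like the sketch below.
The struct and function names are hypothetical, and the choice of
"ldr pc, [pc, #-4]" followed by a literal address as the two-slot JMP is
an assumption; it is one common way to do a full 32-bit jump on ARM, not
necessarily what the driver uses.

```c
#include <stdint.h>

#define SCRATCH_SLOTS     5
#define MAX_REWRITE       3            /* longest rewrite, per the text  */
#define ARM_LDR_PC_LIT    0xE51FF004u  /* ldr pc, [pc, #-4]              */
#define ARM_NOP           0xE1A00000u  /* mov r0, r0                     */

/* One of these per probe, so nested traps cannot clobber each other. */
struct fbt_scratch {
    uint32_t slot[SCRATCH_SLOTS];
};

/* Copy the (possibly rewritten) instruction(s) into the buffer and
 * append a jump back to the instruction after the probe point.
 * Returns 0 on success, -1 if the rewrite does not fit.
 * A real kernel would also have to flush the instruction cache. */
static int scratch_fill(struct fbt_scratch *s, const uint32_t *rewrite,
                        unsigned n, uint32_t opc)
{
    unsigned i;

    if (n == 0 || n > MAX_REWRITE)
        return -1;

    for (i = 0; i < n; i++)
        s->slot[i] = rewrite[i];

    s->slot[i++] = ARM_LDR_PC_LIT;     /* loads the next word into PC    */
    s->slot[i++] = opc + 4u;           /* resume after the probed instr  */

    while (i < SCRATCH_SLOTS)          /* pad any unused slots           */
        s->slot[i++] = ARM_NOP;

    return 0;
}
```

The "ldr pc, [pc, #-4]" trick works because the PC reads as the current
instruction plus 8, so PC - 4 is the word immediately after the LDR.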

Now, with that done, I have modified dtrace to start channelling
"proven" instructions through the FBT probe handler - a probe only
fires if we have implemented rewrite support for its instruction. And
the results are striking! From a single FBT probe to 1,000,000+ probes
(until I ^C'ed dtrace).
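
One simple way to express a "proven instructions only" gate is a table of
mask/value pairs, as sketched below. The table contents and names are
purely illustrative (two common function-prologue instructions), not the
list the driver actually supports, and the condition field is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* An instruction matches a pattern when (insn & mask) == value. */
struct insn_pattern {
    uint32_t    mask;
    uint32_t    value;
    const char *name;
};

/* Hypothetical whitelist: only instructions we know how to handle. */
static const struct insn_pattern proven[] = {
    { 0x0FFF0000u, 0x092D0000u, "push {reglist}" },
    { 0x0FFFFFFFu, 0x01A0C00Du, "mov ip, sp"     },
};

/* Returns the pattern name if the instruction may be probed, or NULL
 * if no probe should be placed on it. */
static const char *fbt_proven_name(uint32_t insn)
{
    for (size_t i = 0; i < sizeof(proven) / sizeof(proven[0]); i++) {
        if ((insn & proven[i].mask) == proven[i].value)
            return proven[i].name;
    }
    return NULL;
}
```

Anything that falls through to NULL simply gets no probe, so coverage can
be widened one instruction class at a time.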

As I ramp up coverage of all kernel functions, I am hitting a few exceptions
(instructions not being properly rewritten, or functions needed by
the undefined-instruction trap handler itself; the latter are
handled by the blacklist in toxic.c).

So, we are nearly finished with FBT function tracing - and after that
comes the syscall provider (which should be much easier to do than FBT,
although there is a lot of quirky code to handle the various arguments
passed to the syscalls).

This DTrace/ARM port does not handle SMP configurations, since my
VM (or Raspberry Pi) doesn't include SMP support - but this can come
later.

I'll release the new code when I am happy fbt is "done".

Post created by CRiSP v11.0.16a-b6552

