In the beginning was COFF - the common object file format. It
was a great step up from prior binary file formats, since it had
a structure and allowed enhancements to what was embodied in an
executable or object file.
COFF was replaced with ELF - overcoming some of the limitations
and hardcodings of COFF. It has been very successful. (Microsoft
stuck with COFF; the Unixes didn't).
ELF has a set of tables and sizes and allows arbitrary things in a binary
such as debug symbols, relocatable symbols, and so on.
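As a rough illustration of those "tables and sizes", here is a minimal Python sketch that decodes the fixed ELF file header and pulls out the fields locating the section header table. The function name and the returned dictionary keys are made up for this example; the field layout follows the published ELF header, and nothing here is taken from the DTrace sources.

```python
import struct

def parse_elf_header(data: bytes) -> dict:
    """Decode the ELF file header and return where the section table lives."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    ei_class = data[4]   # 1 = 32-bit, 2 = 64-bit
    ei_data = data[5]    # 1 = little endian, 2 = big endian
    endian = "<" if ei_data == 1 else ">"
    # Fields after the 16-byte e_ident: e_type, e_machine, e_version,
    # e_entry, e_phoff, e_shoff, e_flags, e_ehsize, e_phentsize,
    # e_phnum, e_shentsize, e_shnum, e_shstrndx.
    layout = "HHIIIIIHHHHHH" if ei_class == 1 else "HHIQQQIHHHHHH"
    f = struct.unpack_from(endian + layout, data, 16)
    return {
        "bits": 32 if ei_class == 1 else 64,
        "machine": f[1],      # e.g. 40 is EM_ARM
        "shoff": f[5],        # file offset of the section header table
        "shentsize": f[10],   # size of one section header entry
        "shnum": f[11],       # number of section headers
        "shstrndx": f[12],    # index of the section holding section names
    }
```

Everything else in the file (symbol tables, relocations, the debug sections) is found by walking the `shnum` entries that start at `shoff`.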
When it came to debugging, the initial attempts at containing symbol
tables extended from symbol name/value pairs to structured descriptions
of what an application does - no longer are name/value pairs enough
to describe arrays-of-structs or class-members etc. These descriptions
live in dedicated ELF sections, and use a binary format called DWARF.
Most of this technology (ELF, DWARF) comes out of the AT&T System V Release 4
work, which Sun adopted for Solaris.
DWARF was ported to other operating systems, and gcc/gdb and Linuxes
happily support this.
Then libdw was created (as part of elfutils), covering much the same ground
as libdwarf but with a different API. Alas, documentation
on both libraries is poor at best, and your eyes start to glaze over
trying to figure out what the subtle differences are.
DTrace relies on libdwarf support for the CTF tools. When people
build dtrace on a system without libdwarf, they see a diagnostic
saying that ctfdump/ctfconvert are not available, and the build proceeds
anyway.
CTF is the mechanism for taking the dwarf debug symbols from the kernel
and making them available in DTrace D scripts, so you can refer
to structures and members. When that is not present, it doesn't matter
if all you want is to use the standard FBT and SYSCALL providers.
It does matter if you want to do more advanced things.
Ubuntu seems to support libdwarf but not libdw. Other Linuxes support
libdw and not libdwarf. (It's difficult to tell if this is the truth,
since my installations may be polluted by whatever I did a few years
ago).
It's annoying today, e.g. when I install Arch Linux, that libdwarf is not
supported but libdw is. So, trying to modify the code in cmd/ctfconvert
to compile either way requires being an expert in both.
I may try over the next few days to see if I can fix this. With
DWARF4 becoming more common in later gcc and clang releases, it
becomes more important to keep up with the latest standards and tools.
Post created by CRiSP v11.0.16a-b6555
CRiSP, DTrace, and other technobabble
Technoblog with some random mutterings and ramblings
Saturday, 4 May 2013
Tuesday, 30 April 2013
DTrace/ARM syscall provider
After a small amount of hard brain thunking...this now works. First off
was separating out the x86 specific systrace code to make room for ARM.
All bar 9 syscalls work - the remaining 9 need special assembler preambles
to get the calling arguments correct.
Next, tackling the clone() syscall. After a bit of brainstorming and
trial and error, I was doing everything right...but in the wrong order.
Once I realised that - clone() works. A few minutes later, execve() is
working.
And after a tedious bit of typing, the remaining 7 are good.
So, that effectively completes the first phase of DTrace/ARM.
Next up is to fix some DWARF issues on 3.8 and above kernels and
gcc toolchains, and decide what to do next.
I'll hopefully put up a new release tonight if I can
validate I haven't broken anything too much.
Post created by CRiSP v11.0.16a-b6555
Saturday, 27 April 2013
DTrace/ARM update
I am getting ready to push out a new DTrace release, which includes
the ARM support. I enclose in this blog post, a copy of doc/ARM.txt
which explains what works and what does not.
--- doc/ARM.txt
DTrace can run on the ARM processor. The ARM CPU exists as a number
of variants, from tiny embedded CPUs, to full blown general purpose
CPUs, commonly found in smart phones and other systems. As of this writing,
ARM/64 is coming.
The earliest ARM processors had limited memory support (many instructions
refer to 26-bit addresses); later processors can support 4GB of memory.
The ARM port of DTrace has been done in a KVM virtual machine, targeting
a custom kernel (Debian/Wheezy and 3.6.11 kernel) for the RaspberryPi,
which is an ARMv6 architecture kernel and CPU.
This specific kernel was chosen, simply so that I could tally the kernel
binary with the source code, in order to clarify how the ARM
architecture worked, and specifics of debugging probe functions.
In theory DTrace should work on earlier and later kernels.
As of this writing, SMP kernels have not been tried. Almost certainly,
DTrace will not work on an SMP system (because I have not validated
the xcall CPU code). It *might*, but I suspect it will not, and this
needs to be verified.
This port relies on the register_undef_hook() kernel function
to intercept the FBT probes. FBT probes are implemented by planting
an undefined instruction and handling the traps it generates.
This is different to x86, where 0xCC (INT3) and single-step mode are
used to manage probes which are taken. (ARM appears not to have
single-step mode execution).
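The mechanism can be pictured with a toy model in Python. This is not kernel code and does not use the real register_undef_hook() interface; the class, its method names, and the choice of 0xE7F000F0 (the ARM-mode "udf #0" encoding) as the planted instruction are all assumptions made for illustration.

```python
UDF_INSN = 0xE7F000F0  # ARM-mode "udf #0"; assumed here, the port may use another value

class ToyKernel:
    """Toy model: kernel text as a dict of address -> 32-bit instruction."""

    def __init__(self, text):
        self.text = dict(text)
        self.probes = {}   # address -> (probe name, original instruction)
        self.fired = []    # names of probes that have fired, in order

    def install_probe(self, addr, name):
        # Remember the original instruction, then overwrite it so that
        # executing this address raises an undefined-instruction trap.
        self.probes[addr] = (name, self.text[addr])
        self.text[addr] = UDF_INSN

    def undef_hook(self, pc):
        # Called when an undefined instruction traps at pc.
        probe = self.probes.get(pc)
        if probe is None:
            return None    # not one of ours; let the normal handler deal with it
        name, original = probe
        self.fired.append(name)
        # With no single-step mode, the caller has to emulate or otherwise
        # re-execute the original instruction before resuming at pc + 4.
        return original
```

The important difference from x86 is in the last step: the handler hands back the original instruction, and something has to carry out its effect by hand.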
The file toxic.c is updated to avoid those parts of the
interrupt fabric which are needed for the probes to fire, so we are
more conservative in what can be probed (this won't matter
to most people). "dtrace -n fbt:::" is *safe*, as far as my testing
is concerned, and the toxic probe functions reflect areas which have
caused trouble. It is possible that more research is needed if you
attempt to step outside what I have personally tested.
Summary:
* ARMv6 architecture only
* No support for Thumb mode (or, not tested with Thumb user apps)
* Validated against RaspberryPi
* Not validated against Android
* Not validated on SMP
* Not validated on ARM/64
* Not validated on < ARMv6 or > ARMv6
* FBT works
* USDT has not been validated
* SYSCALL is dummied out - to be fixed in subsequent release
Future:
Some of the items above will be addressed in future versions, especially
SYSCALL, followed by SMP, and eventually, Android.
Post created by CRiSP v11.0.16a-b6555
Monday, 22 April 2013
DTrace/ARM
After my last write up, I mentioned the issue of the "3 chdir" problem,
whereby the process (bash) would crash on the 3rd chdir.
After a lot of head scratching...I finally figured out my mistake.
Interestingly, it takes me back to the earliest part of the x86 port,
where I made the exact same mistake.
When a breakpoint (in the case of ARM, an undefined-instruction trap) fires,
we are sitting on a kernel stack - a nested stack of the point
where the breakpoint occurred. As I recounted in the last blog post,
we don't have a single-step trap on the ARM, so we cannot do the
same logic flow as x86 (single step over the instruction where we
placed the breakpoint probe).
Instead, we have to emulate the instruction. The initial tactic was
to emulate the PUSH instruction - a class of instructions (ARM
encodes a multitude of operations into a single 32-bit instruction,
to do things in parallel, compared with x86). When emulating a PUSH,
we cannot "push" because the place we want to push is between where
the SP register was, and the stack frame for the trap we just took.
---------------
|  original   |
|  callers    |
|  stack      |
|             |
---------------  SP at the time of the FBT trap
|  saved      |
|  registers  |
|             |
---------------  SP in the FBT handler
This is solvable, by moving the saved-registers area around.
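The collision is easy to see as address arithmetic. The sketch below is illustrative only: the 72-byte frame size (18 words, matching the ARM Linux struct pt_regs) is an assumption, as is the premise that the saved registers sit directly below the trapped SP, and the function names are invented for this example.

```python
WORD = 4
TRAP_FRAME_BYTES = 18 * WORD  # assumed size of the saved-registers area

def push_range(sp_at_trap, nregs):
    """Half-open byte range [lo, hi) that a PUSH of nregs registers would write."""
    return (sp_at_trap - nregs * WORD, sp_at_trap)

def trap_frame_range(sp_at_trap, frame_bytes=TRAP_FRAME_BYTES):
    """Half-open byte range occupied by the saved registers of the trap."""
    return (sp_at_trap - frame_bytes, sp_at_trap)

def overlaps(a, b):
    """True if two half-open ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]

def relocated_frame_base(sp_at_trap, nregs, frame_bytes=TRAP_FRAME_BYTES):
    """Lowest address of the saved-registers area after sliding it down
    far enough to leave room for the pushed words."""
    return sp_at_trap - nregs * WORD - frame_bytes
```

For a two-register push, the eight bytes the PUSH wants to write are the top eight bytes of the saved-registers area, so the area has to slide down by eight bytes before the emulated push can store anything.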
So, the alternate trick is to call into a scratch buffer:
---------------
| orig instr  |
---------------
| jmp OPC+4   |
---------------
where OPC is the original trap program counter.
This actually works well, except for two scenarios. The first is
the original instruction may be doing PC relative addressing, so
we either have to emulate that, or rewrite the instruction to avoid
the PC relative addressing.
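To make the PC-relative problem concrete, here is a small Python sketch that spots an ARM-mode LDR/STR using the PC as its base register and computes the address it refers to. It only covers the single-data-transfer encodings with an immediate offset, so it is a starting point rather than a complete classifier (data-processing instructions that read the PC are not handled), and the function names are made up for the example.

```python
def uses_pc_as_base(insn: int) -> bool:
    """True for an ARM-mode LDR/STR (word or byte) whose base register is r15."""
    if (insn >> 26) & 0x3 != 0b01:
        return False                      # not the single data transfer class
    if (insn >> 25) & 1 and (insn >> 4) & 1:
        return False                      # media/undefined space, not LDR/STR
    return (insn >> 16) & 0xF == 15       # Rn field is the PC

def pc_relative_target(insn: int, insn_addr: int) -> int:
    """Address referenced by an immediate-offset, PC-based LDR/STR at insn_addr."""
    if not uses_pc_as_base(insn) or (insn >> 25) & 1:
        raise ValueError("not an immediate-offset PC-relative LDR/STR")
    offset = insn & 0xFFF
    if not (insn >> 23) & 1:              # U bit clear means subtract
        offset = -offset
    # In ARM mode the PC reads as the instruction's own address plus 8.
    return (insn_addr + 8 + offset) & 0xFFFFFFFF
```

The same instruction word, for example 0xE59F3004 (ldr r3, [pc, #4]), refers to a different address once it has been copied into a scratch buffer, which is exactly why it has to be emulated or rewritten.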
The second issue is we cannot have a scratch buffer (per cpu), because
the scratch buffer may be modified by another trap or interrupt which
fires. So, we need a scratch buffer per probe. (The scratch buffer is
sized to be 5 x 32-bit instructions long - the longest "rewrite" requires
3 instruction slots and we need 2 slots to handle a 32-bit JMP instruction).
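A per-probe scratch buffer along these lines could be laid out as in the Python sketch below. The two-slot jump is assumed here to be the common "ldr pc, [pc, #-4]" instruction followed by a literal word holding the target address; that is one way to get a full 32-bit jump on ARM, not necessarily the one used in the port, and the function name is invented for the example.

```python
import struct

SCRATCH_SLOTS = 5                 # 3 rewrite slots + 2 jump slots
LDR_PC_PC_MINUS_4 = 0xE51FF004    # "ldr pc, [pc, #-4]": load PC from the next word

def build_scratch(rewrite, opc):
    """Pack a scratch buffer: the rewritten instruction(s), then a jump to opc + 4.

    rewrite is a list of 1 to 3 ARM instruction words; opc is the address
    of the original (trapped) instruction.
    """
    if not 1 <= len(rewrite) <= 3:
        raise ValueError("rewrite must occupy 1 to 3 slots")
    words = list(rewrite)
    words.append(LDR_PC_PC_MINUS_4)
    words.append((opc + 4) & 0xFFFFFFFF)   # literal read by the ldr above
    words += [0] * (SCRATCH_SLOTS - len(words))
    return struct.pack("<%dI" % SCRATCH_SLOTS, *words)
```

Because each probe owns its own 20-byte buffer, a nested trap or interrupt that hits a different probe cannot overwrite the instructions that are about to be executed.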
Now, with that done, I have modified dtrace to start channelling
"proven" instructions through the FBT probe handler - only if we have
implemented the rewrite support will we fire. And the results are
striking! From a single FBT probe to 1,000,000+ probes (until I ^C'ed
dtrace).
As I ramp up coverage of all kernel functions, I am hitting a few exceptions
(instructions not being properly rewritten, or functions needed by
the undefined-instruction trap handler itself; the latter are
handled by the blacklist in toxic.c).
So, we are nearly finished in FBT function tracing - and after that
is the syscall provider (which should be much easier to do than FBT, albeit
there is a lot of quirky code to handle the various arguments passed
to the syscalls).
This DTrace/ARM port does not handle SMP configurations, since my
VM (or RaspberryPi) doesn't include SMP support - but this can happen
later.
I'll release the new code when I am happy fbt is "done".
Post created by CRiSP v11.0.16a-b6552
Wednesday, 17 April 2013
Number 3
I wrote last time about getting the first ARM based FBT probe working.
In extending the ARM emulation, I started cleaning up the way the entry
probes are handled - so I can ensure we only trap what is supported by
the emulator. Supporting:
push {r4, lr}
is sufficient to handle a large number of entry points (that push
instruction approximately handles all single arg functions). By
generalising to handle any PUSH instruction, we can handle a lot more, and
so on. (In ARM assembler, you can push any subset of the 16 registers
on the stack in one instruction - the registers are bit encoded).
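The bit encoding is simple enough to show in a few lines of Python. This sketch only recognises the multi-register ARM-mode form (STMDB sp!, {...}); the function and table names are made up for the example, and a real emulator would also need the single-register form and the condition field.

```python
REG_NAMES = ["r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7",
             "r8", "r9", "r10", "r11", "r12", "sp", "lr", "pc"]

def decode_push(insn: int):
    """Return the pushed register names if insn is an ARM-mode multi-register
    PUSH (STMDB sp!, {...}), otherwise None."""
    # Bits 27..16 must be 1001 0010 1101: block transfer, pre-decrement,
    # writeback, store, base register r13. Bits 31..28 (condition) are ignored.
    if (insn & 0x0FFF0000) != 0x092D0000:
        return None
    # The low 16 bits are the register list: bit N set means rN is pushed.
    return [REG_NAMES[i] for i in range(16) if insn & (1 << i)]
```

For 0xE92D4010 the low half-word is 0x4010, i.e. bits 4 and 14, which is `push {r4, lr}`. The number of bytes the instruction moves SP by is four times the length of the returned list.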
However, I hit a problem. Scaling back, I found that:
$ bash
$ cd
$ cd
$ cd
would cause bash to get a segmentation violation. Why it's the third chdir -
I haven't figured out. A simple test app doesn't show this. Internal
prints and close code review don't make it obvious what I am doing wrong,
even if I distil this down to a basic trap handler.
It seems like something is being corrupted on return back
to the invoking process, but the trace is not logical. I have to try
really hard to push all preconceptions from my mind and look for the
"unobvious". Hopefully, I can find it (I don't think any debugging tool
can help, unless I could somehow trace execution forwards, at the instruction
level on the third chdir() syscall).
If I can solve this, then it opens up FBT to handle as many sequences
as I care to emulate - but, because of the way ARM probes are implemented
(by me), I will have to be careful of some functions which could cause
a recursion issue (which doesn't exist on the x86 DTrace implementations).
Nobody said this was gonna be simple...
Post created by CRiSP v11.0.16a-b6552
Sunday, 14 April 2013
DTrace/ARM .. some progress at last
After spending a long time and effort trying to get FBT to work, I have
a working example:
/home/fox/src/dtrace@raspberrypi: uname -a
Linux raspberrypi 3.6.11 #4 Mon Mar 18 21:26:49 GMT 2013 armv6l GNU/Linux
/home/fox/src/dtrace@raspberrypi: build/dtrace -n fbt::sys_chdir:entry
dtrace: description 'fbt::sys_chdir:entry' matched 1 probe
CPU ID FUNCTION:NAME
0 3170 sys_chdir:entry
^C
My prior attempts at intercepting the invalid opcode handler failed.
I don't fully understand why, but having gotten this far, I may understand
well enough what I did wrong to be able to tackle this again.
The prior attempt tried to come in at a low level. The current
attempt does what kprobes does and uses the register_undef_hook() kernel
function to add a handler for invalid opcodes. This is nicer,
in that it is pure C code - no assembler - but it is not so good because it
will preclude tracing certain low level functions.
The example above is special - the ARM does not (does it?) have single-step
support, so handling FBT probes requires an ARM emulator
to continue execution after a probe fires. The code in dtrace handles
the PUSH instruction on entry to sys_chdir. (sys_chdir was chosen as
it's easy to fire on demand, and there's no background activity firing
it except when I want it to).
The next step is to start advancing the ARM emulator (I have been studying
what kprobes does to get an idea of how complex this is - it is complex,
but it only needs certain instructions to be emulated - those on entry
and exit of a function - not every instruction. Now I can start looking
at all entry points to see what are the common functions).
I realise that this only targets the RaspberryPi CPU (armv6l)
and fully expect any other system to require more hard work to handle the
various ARM chipsets, along with ARM64. My only other target ARM chip
is a Galaxy Note 2 (Android), so eventually, the goal is to try and
get this working on Android, but that's a later step.
(The sources aren't released yet - there's no mileage in doing so, but I will
release them in a few days or weeks [most likely], when I feel the
substance of the ARM/DTrace port is more functional).
In case anyone asks: Why am I doing this? Because I can. No more to it than
that. If anyone wants to pay me or send me appropriate hardware, I am happy
to consider prioritising the work, but there's no guarantee progress will be fast.
BTW, I created my first "Hello world" Android app the other day. It didn't
work (some cross compilation issue). I want to solve that so I can get CRiSP
running on Android. (CRiSP actually works quite nicely with the ConnectBot
ssh emulator and running remotely; but that's not really very useful).
Post created by CRiSP v11.0.16a-b6552