It's caused me a few days of frustration, but I didn't want to be beaten.
I have mostly cracked it now. The issue is around syscalls. The
way dtrace intercepts the system calls is by patching the system
call table. Each entry normally points to the C function that implements
the call, and a small amount of assembler glue and C is used
to "hook" or "wrap" the system call.
One of the problems is that all system calls in dtrace go through
the same hooking code. When a system call is invoked, the %RAX register
contains the system call number, and dtrace assumed this register was
still intact when the hooking code ran.
It turns out it isn't. But why it isn't is interesting.
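
To make the hooking scheme concrete, here is a minimal user-space sketch, not the real dtrace code. Everything in it is a hypothetical stand-in: `demo_table` plays the system call table, `demo_rax` plays the %RAX register holding the call number, and `common_hook` is the single wrapper every patched entry points at.

```c
#include <assert.h>

/* Hypothetical stand-in for the kernel's system call table: an array of
 * function pointers indexed by system call number. */
typedef long (*syscall_fn)(long a0, long a1);

enum { NR_DEMO_READ, NR_DEMO_WRITE, NR_DEMO_MAX };

static long demo_read(long a0, long a1)  { return a0 + a1; }
static long demo_write(long a0, long a1) { return a0 - a1; }

static syscall_fn demo_table[NR_DEMO_MAX] = { demo_read, demo_write };
static syscall_fn saved_table[NR_DEMO_MAX];  /* original handlers */
static long entry_count[NR_DEMO_MAX];        /* "entry probe" firings */
static long last_ret[NR_DEMO_MAX];           /* "return probe" values */

static long demo_rax;  /* models %RAX carrying the syscall number */

/* The one wrapper shared by every patched slot. Its only way of knowing
 * which call it is wrapping is the number left in demo_rax. */
static long common_hook(long a0, long a1)
{
    long nr = demo_rax;
    entry_count[nr]++;
    long ret = saved_table[nr](a0, a1);  /* call the original handler */
    last_ret[nr] = ret;
    return ret;
}

/* Patch the table: remember each original entry, then redirect it. */
static void hook_all(void)
{
    for (int i = 0; i < NR_DEMO_MAX; i++) {
        saved_table[i] = demo_table[i];
        demo_table[i] = common_hook;
    }
}

/* Models the syscall entry path: load the number, jump through the table. */
static long demo_syscall(long nr, long a0, long a1)
{
    assert(nr >= 0 && nr < NR_DEMO_MAX);
    demo_rax = nr;
    return demo_table[nr](a0, a1);
}
```

After `hook_all()`, a call such as `demo_syscall(NR_DEMO_WRITE, 7, 4)` still returns what the original handler returns, but passes through `common_hook` first. If anything overwrote `demo_rax` between the dispatch and the hook, the wrapper would call the wrong original, which is the shape of the problem described here.
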
The opensuse kernel appears to be one of the most crash-resilient
kernels. Many a time an array of general protection faults
and illegal instruction traps fired, but I have rarely had to reboot.
Two things distinguish this kernel: DWARF stack walking and
stack protection. The opensuse kernel has a DWARF stack walker which
helps to ensure more accurate stacks are displayed when a fault
occurs. (Similar to the work I started but then abandoned.
Maybe I can look at that code and see what approach they used.)
[Stack walking is problematic in general, because of the 32- and 64-bit
kernels, along with all the permutations of GCC compiler switches,
which make it difficult to ensure the code base can handle them all.]
The stack protection ensures that if a buffer overflow or some
other bad thing happens, then this is caught very fast.
The approach that GCC takes is to snapshot a random value
on the stack at the start of the function and validate the value
is still there on exit. This code utilises the %RAX register, which is
what was tickling my problem.
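
As a rough illustration of that check, here is a hand-written model of what the compiler inserts. It is only a sketch: `struct demo_frame`, `demo_guard` and `demo_copy_checked` are made-up names, the guard value is fixed rather than random, and the real compiler-generated code stages the guard through a register instead of a C variable.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a protected stack frame: a local buffer followed
 * by the guard value ("canary") that sits between it and the rest of the
 * frame. */
struct demo_frame {
    unsigned char buf[16];
    uint64_t canary;
};

/* Fixed for the sketch; a real guard is chosen randomly. */
static uint64_t demo_guard = 0x5bd1e995c0ffee42ULL;

/* Copy len bytes into the frame without checking against buf's size, then
 * run the exit check. Returns 1 if the canary survived, 0 if it was
 * overwritten. */
static int demo_copy_checked(const unsigned char *src, size_t len)
{
    struct demo_frame f;

    assert(src != NULL);
    f.canary = demo_guard;           /* function entry: plant the guard */

    if (len > sizeof f)
        len = sizeof f;              /* keep the demo itself in bounds */
    memcpy(&f, src, len);            /* may run past buf into the canary */

    return f.canary == demo_guard;   /* function exit: is it still there? */
}
```

A copy of 16 bytes or fewer leaves the canary alone and the check passes. A longer copy spills into it and the check fails, which is the point where the real code would report the corruption instead of returning.
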
After various attempts to "jam" the uncorrupted %RAX into the C
arguments to the dtrace handler, I gave up. On 64-bit code, arguments
can be passed on the stack or via registers (the first 6 arguments only),
which means some degree of register fiddling, but also sensitivity
to compiler regimes and kernel compilation modes.
What I did was create a new per-cpu data structure so that
the %RAX register could be saved without being corrupted by another cpu.
This data structure can then be used by the syscall wrapper code and
the results look good.
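
The idea can be sketched in plain C as below. This is a simplified model, not the actual driver code: `DEMO_NR_CPUS`, `struct demo_percpu`, `demo_save_rax` and `demo_saved_rax` are hypothetical names, the cpu number is passed in by hand, and the sketch assumes the code does not move to a different cpu between the save and the read.

```c
#include <assert.h>

#define DEMO_NR_CPUS 4  /* hypothetical cpu count for the sketch */

/* Hypothetical per-cpu record: each cpu gets its own private slot, so a
 * save on one cpu can never overwrite a value saved on another. */
struct demo_percpu {
    unsigned long saved_rax;
};

static struct demo_percpu demo_cpu_data[DEMO_NR_CPUS];

/* Done first thing in the wrapper, before any compiled C code has had a
 * chance to reuse the register. */
static void demo_save_rax(int cpu, unsigned long rax)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    demo_cpu_data[cpu].saved_rax = rax;
}

/* Done later by the C handler to recover the system call number. */
static unsigned long demo_saved_rax(int cpu)
{
    assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
    return demo_cpu_data[cpu].saved_rax;
}
```

Two cpus entering different system calls at the same moment each write their own slot, so the handler on cpu 0 still reads back the number that cpu 0 saved. A single global variable would not give that guarantee.
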
The existing systrace code has to handle 6 of the syscalls specially
(e.g. clone, fork and a few others), because, by definition, these
syscalls don't take the normal "exit" route from the kernel, but
hopefully I can fix these.
The next step is to see if this "better" code works for the other
kernels I did not have problems with, and then to contemplate looking
at the 32-bit code implementation.
Hopefully, an update in a few days (or over the long weekend).