Saturday, 3 December 2011

What's a nice byte doing in an instruction like this?

Just spent a few days trying to debug a strange scenario in Fedora Core 16.

Trying to enable all probes would crash the kernel. After a binary
search, the function flush_old_exec was found to be the culprit.

Nothing special in that function makes it stand out, but putting
a fbt::flush_old_exec:return probe in would cause the next fork/exec
to kill the process. I tried every conceivable thing, and nothing worked.

Obviously a bug in dtrace - could it be the trap handler? Interrupts?
Pre-emption? CPU rescheduling?

Whilst analysing and trying to resolve this, I did some interesting things.

The nice thing about localising the error to a single probe was that I
could test it simply by starting a new process, and it wouldn't crash
the kernel, only kill the new process. So I had a very controlled
environment for making small changes and adding monitoring.

Firstly, which probe was firing? Looking at /proc/dtrace/stats showed
that *no* probe was firing. I added some extra debug to the int1 and int3
handlers (single step and breakpoint), and this too, showed no
probe was firing.

Not possible! Really not possible!

Ok, so next: had we actually armed the probe? Well, we can use
/proc/dtrace/fbt to examine probes, and we can tell if a probe is armed
(the tell-tale sign is a 0xCC byte, the INT3 breakpoint opcode, at the
probe location). Yes, we are arming the probe, but, no, it does not fire.
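
To make the "armed" check concrete, here is a minimal sketch of the
usual breakpoint-patching idea: save the original byte, overwrite it with
0xCC, and put it back on disarm. The function names and the sample bytes
are made up for illustration; this is not the dtrace driver code, just
the general technique.

```python
INT3 = 0xCC  # single-byte x86 breakpoint opcode


def arm(code: bytearray, offset: int) -> int:
    """Overwrite the byte at offset with INT3 and return the saved byte."""
    saved = code[offset]
    code[offset] = INT3
    return saved


def is_armed(code: bytearray, offset: int) -> bool:
    """A probe site looks armed if its first byte is the INT3 opcode."""
    return code[offset] == INT3


def disarm(code: bytearray, offset: int, saved: int) -> None:
    """Restore the original byte at the probe site."""
    code[offset] = saved


# Illustrative bytes only: push %rbp; mov %rsp,%rbp; ret
text = bytearray([0x55, 0x48, 0x89, 0xE5, 0xC3])
saved = arm(text, 4)          # place a probe on the RET at offset 4
print(is_armed(text, 4), hex(saved))
disarm(text, 4, saved)
```

The catch, as the rest of this story shows, is that the check only tells
you a 0xCC was written somewhere. It says nothing about whether that
somewhere is really the start of an instruction.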

Next up is to disassemble the function itself. I have found it very
annoying with Linux that there is no /vmlinux binary on the system -
only the /vmlinuz (and /boot equivalents), which are not proper ELF
files, but bootable images. Something as simple as examining the instructions
and bytes at a given kernel address is tricky. I had written a
"vmlinux" extractor, but it never worked reliably.

One trick I have to do this is the following:

$ sudo gdb /bin/ls /proc/kcore

We don't care about /bin/ls, but gdb wants an executable before it will
open /proc/kcore as a core file. From this, we can read kernel memory at
the addresses reported by kernel stack traces or the dtrace probes.
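
Since /bin/ls carries no kernel symbols, gdb needs a raw address to
disassemble. One way to get it (an assumption on my part, not something
the session above depended on) is to look the function up in
/proc/kallsyms, whose lines have the form "address type name". A small
sketch, with made-up sample addresses:

```python
def lookup_symbol(kallsyms_lines, name):
    """Return the address of name from /proc/kallsyms-style lines, or None."""
    for line in kallsyms_lines:
        fields = line.split()
        # Each line is "<hex address> <type> <name> [module]".
        if len(fields) >= 3 and fields[2] == name:
            return int(fields[0], 16)
    return None


# Hypothetical sample lines; real addresses differ on every kernel build.
sample = [
    "ffffffff81100000 T do_execve",
    "ffffffff81101230 T flush_old_exec",
]
addr = lookup_symbol(sample, "flush_old_exec")

# Build a gdb examine command to paste at the (gdb) prompt.
print(f"x/40i {addr:#x}")
```

On a real system you would pass open("/proc/kallsyms") instead of the
sample list. Depending on kernel settings, the addresses may read as
zeros unless you are root.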

What I found was curious. The two RET instructions in the function
were slap in the middle of a CALL instruction.

This meant the instruction disassembler was wrong. Looking back
a few instructions to see why, we came across the infamous "UD2"
instruction. UD2 is a special instruction to generate an
undefined-opcode fault. In the old days, lots of opcodes could do this,
but Intel formally added this instruction so that compilers and
operating systems had a real instruction, that would never change
in future CPUs, for the purpose of generating an illegal instruction fault.

The Linux kernel uses this in the BUG and BUG_ON macros. Since
these are hit rarely, the kernel compiles them down to a UD2 instruction,
and the fault handler can gracefully report the fault and the location
of the error.

When the INSTR provider was implemented, I came across these
instructions and had put some special "jump-over-it" code in place
to handle this, but either I misread the assembler, or
the kernel changed. Whenever a UD2 instruction is met, the
disassembler would jump 10 bytes forward and continue from there.

In flush_old_exec, that landing point just so happened to be in the
middle of a call instruction which happened to have 0xC3 as part of the
relative address field. Dtrace took that 0xC3 byte for a RET, slapped
a breakpoint on it, and so changed the call target to something that
was wrong.
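
The failure is easy to reproduce in miniature. The sketch below is a toy
linear sweep, not the real dtrace disassembler: it knows only a handful
of opcodes (NOP, RET, CALL rel32, UD2), the byte sequence is invented,
and find_rets is a hypothetical name. It compares stepping over UD2 by
its true length (2 bytes) with the old "jump 10 bytes" rule.

```python
# Toy instruction lengths for the few x86 opcodes used below.
# Illustration only; anything not listed is treated as 1 byte.
LENGTHS = {0x90: 1, 0xC3: 1, 0xE8: 5}   # NOP, RET, CALL rel32
UD2 = bytes([0x0F, 0x0B])


def find_rets(code, ud2_skip):
    """Linear sweep; return the offsets this sweep believes are RETs.

    ud2_skip is how many bytes to advance when a UD2 is met.
    """
    rets = []
    pc = 0
    while pc < len(code):
        if code[pc:pc + 2] == UD2:
            pc += ud2_skip
            continue
        op = code[pc]
        if op == 0xC3:
            rets.append(pc)
        pc += LENGTHS.get(op, 1)
    return rets


CODE = bytes([
    0x90,                           # 0:  nop
    0x0F, 0x0B,                     # 1:  ud2
    0xE8, 0x10, 0x00, 0x00, 0x00,   # 3:  call rel32
    0xE8, 0x34, 0x12, 0xC3, 0xFF,   # 8:  call rel32 (0xC3 inside rel32)
    0xC3,                           # 13: ret (the only real one)
])

print(find_rets(CODE, 2))    # step over UD2 by its real length
print(find_rets(CODE, 10))   # the old 10-byte jump lands mid-call

# What a breakpoint on the bogus RET does to the call displacement.
patched = bytearray(CODE)
patched[11] = 0xCC
before = int.from_bytes(CODE[9:13], "little", signed=True)
after = int.from_bytes(patched[9:13], "little", signed=True)
print(hex(before & 0xFFFFFFFF), hex(after & 0xFFFFFFFF))
```

With the 2-byte step the sweep stays in sync and reports only the RET at
offset 13. With the 10-byte jump it lands on offset 11, inside the second
call, and reports that byte as a RET too. Writing 0xCC there never fires
as a breakpoint; it just silently rewrites where the call goes.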

As soon as we hit the call, all bets were off, and we were lucky
not to crash the kernel.

Interesting that this didn't show up in Ubuntu 11.x or FC15, despite
that code not really changing in a while. It could be that the code
generated by GCC changed, so that the byte at the landing point
just happened to match something plausible, whilst never tickling
the bug on other kernels.

So, one more bug down for now.

Post created by CRiSP v10.0.19a-b6122
