Tuesday 22 November 2011

1 in a million

So, Nigel had reported on the sporadic crashes. Had me very
stumped - running the test suite and being really nasty to the kernel
caused everything to pass. However, I did get a "simple" test to try:


$ dtrace -n fbt:::entry'{cnt++;}'


Some minutes after leaving this running would cause a kernel
assertion about interrupts being disabled when they shouldnt be.

This was a real needle in a haystack - a problem, which occurs
after millions of probes is almost impossible to find. Was it one
specific probe, or maybe single-step, or a race condition in
terminating interrupt routines.

How to find it?

Well, I started by adding /proc/dtrace/fbt - which gives a birds eye
view of every FBT probe, along with state, and how many times it had
been hit. I augmented this with the instruction which was probed.

Running a full fbt::: would cause a large fraction of all kernel
functions to be hit, but, equally, large numbers of functions that are
never called (eg for device drivers which are not active).

In a sense, this device gives a coverage view of FBT traps.

By casting out the known "good" probes, I was left with potentially
hundreds of distinct instructions/probes to wade through.

I didnt get far, other than to look for patterns. If we have a 1:1,000,000
failure, then something has to be happening infrequently. Using the
probe counter for each probe, we get some idea of the rarely called functions.

Whatever was causing the issue wasnt so much a single function being called
and blowing up the kernel, but a small piece of "damage" because the
trap handler didnt do the right thing.

The kernel kept saying the same thing: interrupts were unexpectedly disabled,
but knowing which trap was called just prior to this event was nearly
impossible to determine.

Finally, I started the "binary search" mode of attack: Lets ignore
all instructions which start 0xFn, then 0xEn, ... At some point, around about
0x9n, the problem seemed to disappear. (Absence of evidence isnt evidence of
absence!). Disabling all 0x9n instructions would allow dtrace to run
for hours without issue. Re-enabling would cause a warning within minutes.

Ok, so what are the 0x9n family of instructions? Well, 0x90 is a NOP - quite
common in the kernel, and nothing of interest. Heres the full list:


90 nop
91 xchg %eax,%ecx
92 xchg %eax,%edx
93 xchg %eax,%ebx
94 xchg %eax,%esp
95 xchg %eax,%ebp
96 xchg %eax,%esi
97 xchg %eax,%edi
98 cwtl
99 cltd
9a (bad)
9b fwait
9c pushfq
9d popfq
9e sahf
9f lahf


The main instruction of interest above is pushfq. So, exactly how many probes
have PUSHFQ in the first instruction of the function? Well, exactly two, on my
kernel:


# count patchpoint opcode inslen modrm name
...
21603 0000 ffffffff81032704 9c 1 -1 kernel:end_pv_irq_ops_restore_fl:entry 9c
...
24295 0010 ffffffff81539c00 9c 1 -1 kernel:native_load_gs_index:entry 9c


Now, env_pv_irq_ops_restore_fl hasnt been called (column 2), but
native_load_gs_index, was called 16 times. So, this isnt on every
syscall or interrupt, but its a rare occurrence. And that is what
we are after.

So, off to examine the code, and I found a bug. PUSHFQ was deliberately
disabling interrupts (because of a bug), and PUSHFQ doesnt touch the
interrupt flag. So, bingo! We could return back to the kernel with interrupts
disabled, and, luckily, the sporadic consistency checks would tell
us that interrupts were disabled when they shouldnt be.

When writing code in an SMP kernel, having interrupts disabled and sleeping
waiting for an event, can lead to deadlock. Thats why the checks are there.
And they helped, very nicely, to say "something is wrong" although, not
exactly where.

So far, dtrace is holding up well.

Lets see if Nigel is happy now.


Post created by CRiSP v10.0.17a-b6112


No comments:

Post a Comment