Saturday 14 January 2012

The 'Great Bug' of 2011/2012

In my recent blog postings, I wrote about the "most difficult bug in the
world" to resolve. On i386, loading dtrace and patching the page fault
interrupt vector would panic/hang/double fault the kernel.

I have spent the last few weeks trying absolutely everything conceivable,
to no avail. When faced with a bug like this, you have to try everything -
yet in the corner of your mind, you know the fault may not even be in the
place you are looking, which is why it can be so elusive.

Let's recall the issue: when the driver is first loaded, it works - the
driver loads and full functionality is available. If we remove the driver,
the system still works. If we reload the driver again, we get panics, to
the point where even the kernel stack dumper panics the kernel. The
evidence points to corrupt page tables or interrupt stacks, and, using the
VirtualBox debugger, we can probe (a little) to see what is happening
inside the VM.

Removing the providers in dtrace, removing the calls that activate timers,
and stripping out just about anything that does real work still leaves the
problem in place. The interrupt-patching code was modified to be more
careful about how it does its job. The key difference between "it works
very well" and "it reliably crashes" is patching the page-fault vector (0x0E).

Even if the page-fault handler in dtrace is reduced to a bare jump to
the original vector - touching no registers at all - the problem persists.
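
For anyone wondering what "patching the vector" boils down to, here is a
rough sketch of the mechanics: find the IDT with sidt, save the original
0x0E gate, and point its offset at our own handler. The structure layouts
are the architectural 32-bit ones; dtrace_page_fault_stub is a made-up
name, and the real driver does this per CPU, with interrupts disabled, and
with rather more bookkeeping.

    /* 32-bit interrupt gate descriptor, as laid out by the CPU. */
    struct idt_gate {
        unsigned short off_lo;   /* handler offset, bits 0..15  */
        unsigned short sel;      /* code segment selector       */
        unsigned char  zero;
        unsigned char  flags;    /* present bit, DPL, gate type */
        unsigned short off_hi;   /* handler offset, bits 16..31 */
    } __attribute__((packed));

    struct idt_ptr {
        unsigned short limit;
        unsigned long  base;
    } __attribute__((packed));

    static struct idt_gate saved_pf_gate;

    /* Hypothetical asm stub which eventually chains to the saved gate. */
    extern void dtrace_page_fault_stub(void);

    static void patch_page_fault_vector(void)
    {
        struct idt_ptr idtr;
        struct idt_gate *idt;
        unsigned long handler = (unsigned long) dtrace_page_fault_stub;

        asm volatile("sidt %0" : "=m" (idtr));   /* find this CPU's IDT */
        idt = (struct idt_gate *) idtr.base;

        saved_pf_gate = idt[0x0E];               /* keep the original gate */
        idt[0x0E].off_lo = handler & 0xffff;     /* redirect vector 0x0E   */
        idt[0x0E].off_hi = handler >> 16;
    }

    static void restore_page_fault_vector(void)
    {
        struct idt_ptr idtr;
        struct idt_gate *idt;

        asm volatile("sidt %0" : "=m" (idtr));
        idt = (struct idt_gate *) idtr.base;
        idt[0x0E] = saved_pf_gate;               /* put the old gate back */
    }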

I have modified dtrace to avoid touching the page fault vector at load
time, and instead let me switch the vector to the new or the old interrupt
location on demand, via an "echo" to the /proc/dtrace/trace device.
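
The toggle itself is nothing exotic - something along these lines would do
it, although the command strings and function names here are illustrative
rather than the driver's real interface:

    #include <linux/fs.h>
    #include <linux/proc_fs.h>
    #include <linux/string.h>
    #include <linux/uaccess.h>

    static ssize_t
    dtrace_trace_write(struct file *fp, const char __user *buf,
                       size_t len, loff_t *off)
    {
        char cmd[16];

        if (len >= sizeof(cmd))
            return -EINVAL;
        if (copy_from_user(cmd, buf, len))
            return -EFAULT;
        cmd[len] = '\0';

        if (strncmp(cmd, "patch", 5) == 0)
            patch_page_fault_vector();       /* take over vector 0x0E  */
        else if (strncmp(cmd, "unpatch", 7) == 0)
            restore_page_fault_vector();     /* restore the old gate   */

        return len;
    }

    static const struct file_operations dtrace_trace_fops = {
        .write = dtrace_trace_write,
    };

Hooked up to the proc entry, an "echo" of one word flips the vector long
after module load, and another word puts the old gate back.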

This is useful, because it separates everything else that happens when a
driver is loaded from the fault at hand.

Still, it's erratic. Sometimes I can reload the driver and patch the
vector zillions of times; other times, we go bang on the second attempt.

I've been studying the CPU's task register and TSS in more detail, getting
a better understanding of how Linux, and x86 in general, handles nested
interrupts and multiple interrupt stacks. I have been using kgdb and the
kdebug debugger in the kernel, along with the VirtualBox debugger.

Working Backwards



Nothing works deterministically: it either works brilliantly or fails
dismally.

Let's take a trip to a different place: let's work backwards.
A "double fault" occurs when the handling of one interrupt itself raises a
fault (an invalid segment selector, an invalid address, a page fault, etc).
So how can this happen? Well, a likely scenario is that the segment
registers (%GS, %FS, %DS, %ES) hold the wrong values.

There are really two scenarios for a page fault: it either happens in
user space or in kernel space. A user page fault might happen because a
reference to an mmapped area touches a page that has yet to be faulted in.
Another case for user page faults is the stack - as the level of nesting
in an application increases, lower pages in the stack may need to be
allocated and mapped.

User page faults are typically *rare*, especially on a small system,
because the working set can be mapped into the address space on startup.

Kernel-mode faults are probably the most common (are they?!) - for
example, a read() into a large buffer which has been mmap()ed but not yet
touched could cause one.
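
A trivial way to provoke that kernel-side fault, just to convince yourself
it happens, is a read() into an untouched anonymous mapping:

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1024 * 1024;
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int fd = open("/etc/passwd", O_RDONLY);

        if (buf == MAP_FAILED || fd < 0) {
            perror("setup");
            return 1;
        }

        /* No page is behind buf yet: the copy inside read() faults it in,
           and that page fault is taken while the CPU is in kernel mode. */
        ssize_t n = read(fd, buf, len);
        printf("read %zd bytes\n", n);
        return 0;
    }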

I may look into which faults really are most common. (Nothing in the
kernel tracks the types of faults, I think.) The difference is the segment
registers: when a user app faults, the DS/ES/CS registers will hold
user-space selectors, so the interrupt routine needs to reload them with
kernel values. (Otherwise, something as simple as referring to a .data or
.bss object in C will generate a fault.) If we trap from the kernel, these
registers already hold the correct values.
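
The simplest discriminator is the saved CS on the trap frame - the bottom
two bits are the privilege level of the interrupted code, and the kernel's
own user_mode() macro does much the same check. A minimal sketch (the
pt_regs field names are the 32-bit ones of this era):

    #include <linux/ptrace.h>

    /* Did the fault arrive from user space or from the kernel?  CPL 3 in
       the saved CS means user mode; the entry stubs must then reload the
       data segment selectors with known-good values before touching any
       C-level data such as counters in .data/.bss. */
    static int fault_came_from_user(struct pt_regs *regs)
    {
        return (regs->cs & 3) == 3;
    }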

[There's more complication here - the Linux interrupt vectors handle
nested interrupts, double faults, NMIs and other things - but let's keep
it simple for now.]

Now, the dtrace page fault handler keeps a count of how many interrupts
it sees (/proc/dtrace/stats), so when it works, we can see it working
reliably. The figures are actually lower than I expected, but that may be
the kernel doing a good job of preloading new processes to minimize page
faults; I actually have to work hard to generate a high rate of page
faults.
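
The counter itself is about as simple as it gets - something like this,
with the names being illustrative:

    #include <linux/atomic.h>

    static atomic_t dtrace_pf_count = ATOMIC_INIT(0);

    /* Called from the interposed page-fault handler before chaining on to
       the kernel's original handler; /proc/dtrace/stats just reports it. */
    static inline void dtrace_count_page_fault(void)
    {
        atomic_inc(&dtrace_pf_count);
    }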

So, when we have a double fault, we know an interrupt routine had a
problem. The trouble is that, in this case, we don't know which interrupt
routine caused the original violation, because the double-fault handler
generates a new fault of its own. (I think the i386 kernel code is broken
here - it is walking a stack whose CPU registers are not consistent; I see
streams of stack dumps, where each dump causes another, nested dump, and
it all scrolls off the screen. I have used the VirtualBox debugger to see
the true stack, and this is only partially helpful.)

When a nested fault occurs, the CPU switches from the offending stack to a
new stack, set up just for this purpose. This is a brilliant feature, but
it is causing me a problem, as I am having trouble finding the original
offending stack.
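
For reference, this is where the offending context ends up hiding: on i386
the double-fault vector is a task gate, so the hardware task switch saves
the faulting EIP/ESP/segments into the outgoing TSS. A rough sketch of
digging them back out (the TSS layout is the architectural one, the GDT
slot used for the main TSS matches 32-bit Linux of this vintage, and the
code is illustrative rather than the kernel's own):

    #include <linux/kernel.h>

    /* Architectural 32-bit hardware TSS. */
    struct hw_tss32 {
        unsigned short back_link, __blh;
        unsigned long  esp0;
        unsigned short ss0, __ss0h;
        unsigned long  esp1;
        unsigned short ss1, __ss1h;
        unsigned long  esp2;
        unsigned short ss2, __ss2h;
        unsigned long  cr3, eip, eflags;
        unsigned long  eax, ecx, edx, ebx, esp, ebp, esi, edi;
        unsigned short es, __esh, cs, __csh, ss, __ssh, ds, __dsh;
        unsigned short fs, __fsh, gs, __gsh, ldt, __ldth;
        unsigned short trace, io_bitmap_base;
    } __attribute__((packed));

    /* Pull the 32-bit base address out of an 8-byte GDT descriptor. */
    static unsigned long desc_base(const unsigned char *d)
    {
        return d[2] | (d[3] << 8) | (d[4] << 16) |
               ((unsigned long) d[7] << 24);
    }

    /* Called from a double-fault context: find the per-CPU TSS in the GDT
       (entry 16 on 32-bit Linux of this era) and print the EIP/ESP that
       were live when the second fault hit. */
    static void dump_faulting_context(void)
    {
        struct {
            unsigned short limit;
            unsigned long  base;
        } __attribute__((packed)) gdtr;
        struct hw_tss32 *tss;

        asm volatile("sgdt %0" : "=m" (gdtr));
        tss = (struct hw_tss32 *)
              desc_base((unsigned char *) gdtr.base + 16 * 8);

        printk(KERN_EMERG "faulting eip=%08lx esp=%08lx ss=%04x cs=%04x\n",
               tss->eip, tss->esp, tss->ss, tss->cs);
    }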

Let's patch the kernel



OK, so we know something is wrong. Maybe we can detect the issue before
it happens and get a deterministic panic. I modified the kernel code -
just before we dispatch to a new process - to validate the page table,
and also to monitor the low-water mark of the kernel stack. Suggestions
on Google are that the symptoms I am seeing are due to stack overflow.
Neither of these modifications has helped. Typically, the kernel uses
about 2K of the 4K of stack space available to it - it rarely gets close
to 3K - so I don't believe we are overflowing the stack and corrupting
key data structures.
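
A check in roughly the same shape as the kernel's own
CONFIG_DEBUG_STACKOVERFLOW test looks something like this (the function
name is invented and the threshold arbitrary):

    #include <linux/kernel.h>
    #include <linux/sched.h>
    #include <linux/thread_info.h>

    /* How close is the current kernel stack to the thread_info that lives
       at its base?  Called just before dispatching to a new process. */
    static void dtrace_check_stack(void)
    {
        unsigned long sp = (unsigned long) &sp;   /* roughly current %esp */
        long free;

        free = sp - (unsigned long) current_thread_info()
                  - sizeof(struct thread_info);

        if (free < 512)
            printk(KERN_WARNING
                   "dtrace: only %ld bytes of kernel stack left\n", free);
    }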

Let's give up?


I refuse to give up on this. I am mentally walking through kernel code and
scenarios, trying to conjure up the "it doesn't happen often" case, to
work out what could be happening.

Today, I was wondering whether giving the VM a nice round amount of
memory, rather than an odd number, could make a difference. I typically
give my VM about 730MB of memory. I tried giving it exactly 256MB and it
worked perfectly! Brilliant! Then I rebooted and tried again at 256MB, and
it failed.

There's almost a flavor to the underlying problem. Often I get a scenario
where it works for extended periods of time and I cannot crash the
machine. Other times, it crashes exactly on the second load (sometimes
even on the first load, although this is rare).

It's almost as if the problem depends on exactly what is allocated in
memory. I tried a test of filling memory with a large file (full of
zeros) and checking to see whether the file was mutating. (Maybe the
interrupt routine was firing with incorrect DS/ES registers, and an
attempt to increment a counter was randomly patching a random page in
memory - that would exactly explain the kind of problem I am seeing.)
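
Something along these lines does the job (the path and size are
illustrative; the file would be created from /dev/zero beforehand):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 512UL * 1024 * 1024;     /* 512MB of zeros */
        int fd = open("/tmp/canary.dat", O_RDONLY);
        unsigned char *p;
        size_t i;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        for (;;) {                            /* rescan forever */
            for (i = 0; i < len; i++)
                if (p[i] != 0)
                    printf("mutation at offset %zu: %02x\n", i, p[i]);
            sleep(5);
        }
    }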

So far: nothing. No deterministic test is locating the root cause of this
fault. (I am also wondering whether VirtualBox itself is broken - I have
seen the many bug reports and complaints about VB on the web, but I have
no evidence that those bugs are my problem. I must get qemu or VMware up
and running to do side-by-side comparisons.)



Post created by CRiSP v10.0.22a-b6154

