What's the worst thing you can do to a CPU while it is executing code?
How about a buffer overflow: writing beyond the end of a buffer.
Very soon a segmentation violation (or GPF) will happen, and the application
will terminate, or try to recover.
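To make the user space case concrete, here is a minimal sketch, assuming
Linux and glibc. A real overflow of a stack or heap buffer is undefined
behavior and may not fault straight away, so this version places a
no-access guard page right after the buffer to make the fault
deterministic. The names overflow_faults, on_segv and recover_point are
made up for illustration.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and sigsetjmp on glibc */
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;

/* SIGSEGV handler: jump back to the point saved by sigsetjmp(). */
static void on_segv(int sig)
{
    (void)sig;
    siglongjmp(recover_point, 1);
}

/*
 * Write `len` bytes into a one-page buffer that is followed by a
 * no-access guard page. Returns 1 if the write faulted, 0 if it did
 * not, and -1 if the setup failed.
 */
int overflow_faults(size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    unsigned char *region = mmap(NULL, 2 * (size_t)page,
                                 PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;
    if (mprotect(region + page, (size_t)page, PROT_NONE) != 0) {
        munmap(region, 2 * (size_t)page);
        return -1;
    }

    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_segv;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old);

    int result;
    if (sigsetjmp(recover_point, 1) == 0) {
        volatile unsigned char *p = region;
        for (size_t i = 0; i < len; i++)
            p[i] = 0xAA;          /* faults on the first guard-page byte */
        result = 0;
    } else {
        result = 1;               /* we arrived here via the handler */
    }

    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    munmap(region, 2 * (size_t)page);
    return result;
}
```

Calling overflow_faults(16) stays inside the buffer, while a length larger
than one page runs into the guard page and takes the "try to recover" path.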
How about inside the kernel? Well, pretty much the same thing.
The x86 architecture is well thought out. When some form of memory
access goes awry, an interrupt is generated (technically a 'fault' or 'trap'),
and the kernel will attempt to recover from this.
The act of taking an interrupt involves pushing the current program counter
(along with the flags and code segment) on the stack, and jumping to a
predefined location.
Great. So - whether a GPF occurs in user space or kernel space, something
will happen. This is either recoverable, or a panic/blue-screen can
happen if the kernel doesn't know what to do.
The predefined location is set up in a table called the IDT (Interrupt
Descriptor Table).
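As a rough illustration of what one slot in that table looks like, here is
a sketch of the 8-byte gate layout used in 32-bit protected mode. The
struct and function names are my own, not the kernel's, and the selector
value used in the usage note is just an example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 8-byte entry of a 32-bit protected-mode IDT. */
struct idt_gate {
    uint16_t offset_low;    /* handler address, bits 0..15   */
    uint16_t selector;      /* code segment selector to load */
    uint8_t  reserved;      /* must be zero                  */
    uint8_t  type_attr;     /* present bit, DPL, gate type   */
    uint16_t offset_high;   /* handler address, bits 16..31  */
};

_Static_assert(sizeof(struct idt_gate) == 8, "an IDT gate is 8 bytes");

#define IDT_PRESENT 0x80u   /* bit 7 of type_attr */

/* Split a 32-bit handler address across the two offset fields. */
struct idt_gate idt_make_gate(uint32_t handler, uint16_t selector,
                              uint8_t type_attr)
{
    struct idt_gate g;
    g.offset_low  = (uint16_t)(handler & 0xFFFFu);
    g.selector    = selector;
    g.reserved    = 0;
    g.type_attr   = type_attr;
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}

/* Reassemble the address the CPU would jump to for this gate. */
uint32_t idt_gate_handler(const struct idt_gate *g)
{
    assert(g != NULL);
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}

/* A gate with the present bit clear cannot be used to deliver anything. */
int idt_gate_present(const struct idt_gate *g)
{
    assert(g != NULL);
    return (g->type_attr & IDT_PRESENT) != 0;
}
```

For example, idt_make_gate(0xC0123456, 0x10, 0x8E) describes a present,
ring 0, 32-bit interrupt gate. If a stray write scribbles over any of these
fields, the CPU jumps somewhere else entirely, or cannot deliver the
interrupt at all.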
If the handler for a GPF takes a fault itself, the system
will generate a double-fault. Double-faults are very rare. (Page faults,
by contrast, are very common, and occur under normal circumstances with
memory mapped/anonymous memory, as pages are faulted into existence).
A double fault typically indicates a flaw in a driver and can
be caused by using an invalid pointer or a stack exception in an
A triple fault is what can happen if a double fault generates an
exception. This would indicate the double-fault handling code hit
an unexpected condition. On the Intel/AMD architectures, a triple fault
will typically reset and reboot the CPU.
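The escalation can be written down as a toy model. This is only a sketch
of the chain described above: a real CPU escalates to a double fault only
for certain combinations of exceptions, and the names here are invented
for illustration.

```c
#include <assert.h>
#include <stdbool.h>

enum fault_outcome {
    FAULT_HANDLED,          /* the first handler ran to completion    */
    DOUBLE_FAULT_HANDLED,   /* first handler faulted, #DF handler ran */
    TRIPLE_FAULT_RESET      /* #DF handler faulted too: CPU resets    */
};

/*
 * Model the delivery of one fault. Each flag says whether the CPU can
 * reach and run the corresponding handler without faulting again.
 */
enum fault_outcome deliver_fault(bool first_handler_ok,
                                 bool double_fault_handler_ok)
{
    if (first_handler_ok)
        return FAULT_HANDLED;
    if (double_fault_handler_ok)
        return DOUBLE_FAULT_HANDLED;
    return TRIPLE_FAULT_RESET;
}

/* Human-readable name, handy when logging the model's result. */
const char *fault_outcome_name(enum fault_outcome o)
{
    switch (o) {
    case FAULT_HANDLED:        return "handled";
    case DOUBLE_FAULT_HANDLED: return "double fault";
    case TRIPLE_FAULT_RESET:   return "triple fault (reset)";
    }
    assert(!"unknown fault_outcome");
    return "?";
}
```

The point of the model is that there is no fourth level: once the
double-fault handler cannot run, nothing is left to catch the problem.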
Normally, the kernel and CPU operate together on some very key data
structures. We mentioned the IDT, above. There's also the GDT - which describes
how segments of memory map to real memory. And then there's the LDT - which
is a per-process view of memory. Corrupting any of these can
lead to double/triple fault behavior.
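To show how fragile these structures are, here is a sketch that decodes a
raw 8-byte GDT segment descriptor. The base and limit are scattered across
the entry, so a single corrupted byte changes where a segment starts or how
big it is. The function names are hypothetical; the bit layout is the
standard 32-bit one.

```c
#include <assert.h>
#include <stdint.h>

/* Base address: bits 16..39 hold base[23:0], bits 56..63 hold base[31:24]. */
uint32_t gdt_base(uint64_t desc)
{
    uint32_t low  = (uint32_t)((desc >> 16) & 0xFFFFFFu);
    uint32_t high = (uint32_t)((desc >> 56) & 0xFFu);
    return (high << 24) | low;
}

/* Raw 20-bit limit: bits 0..15 hold limit[15:0], bits 48..51 limit[19:16]. */
uint32_t gdt_raw_limit(uint64_t desc)
{
    uint32_t low  = (uint32_t)(desc & 0xFFFFu);
    uint32_t high = (uint32_t)((desc >> 48) & 0xFu);
    return (high << 16) | low;
}

/* Granularity bit (55): when set, the limit counts 4 KiB pages, not bytes. */
int gdt_page_granular(uint64_t desc)
{
    return (int)((desc >> 55) & 1u);
}

/* Present bit (47): a not-present segment faults when it is loaded. */
int gdt_present(uint64_t desc)
{
    return (int)((desc >> 47) & 1u);
}

/* Highest valid offset in the segment, after applying the granularity. */
uint32_t gdt_effective_limit(uint64_t desc)
{
    uint32_t limit = gdt_raw_limit(desc);
    assert(limit <= 0xFFFFFu);
    return gdt_page_granular(desc) ? ((limit << 12) | 0xFFFu) : limit;
}
```

Feeding in the usual flat ring 0 code descriptor, 0x00CF9A000000FFFF, gives
a base of 0 and an effective limit of 0xFFFFFFFF, i.e. the whole 4 GB
address space.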
But there's another data structure: the page table directory. If the
page table is corrupted then all bets are off. The page table can be used to
indicate what blocks of memory are present/not-present in the system and
is the mechanism for virtual memory support. If the page table were
corrupt, then an application would generate a page fault interrupt and the
kernel would quickly shut down the offending process.
But what if the kernel version of the page table were corrupt? On an
interrupt, the CPU wouldn't be able to access the code to execute the
interrupt handler, which in turn would lead to a double fault, and thence
to a triple fault.
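Here is a small software model of the lookup the CPU performs, assuming
classic 32-bit paging with 4 KB pages and no PAE. The structures are
simplified (the directory holds C pointers rather than physical frame
addresses) and every name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PG_PRESENT  0x1u
#define PG_ENTRIES  1024

struct page_table {
    uint32_t pte[PG_ENTRIES];            /* frame address | flag bits */
};

struct page_dir {
    uint32_t pde_flags[PG_ENTRIES];      /* flag bits of each entry   */
    const struct page_table *tables[PG_ENTRIES];
};

/*
 * Translate a virtual address. Returns 0 and stores the physical
 * address in *pa on success, or -1 where the CPU would raise a page
 * fault because an entry is marked not-present.
 */
int translate(const struct page_dir *dir, uint32_t va, uint32_t *pa)
{
    assert(dir != NULL && pa != NULL);

    uint32_t dir_index   = va >> 22;             /* top 10 bits    */
    uint32_t table_index = (va >> 12) & 0x3FFu;  /* middle 10 bits */
    uint32_t offset      = va & 0xFFFu;          /* low 12 bits    */

    if (!(dir->pde_flags[dir_index] & PG_PRESENT))
        return -1;
    if (dir->tables[dir_index] == NULL)
        return -1;

    uint32_t pte = dir->tables[dir_index]->pte[table_index];
    if (!(pte & PG_PRESENT))
        return -1;

    *pa = (pte & 0xFFFFF000u) | offset;
    return 0;
}
```

If the entry covering the interrupt handler's own code is the one that got
trashed, translate() fails for the handler too. On real hardware that is
the fault-within-a-fault that sets off the chain to a triple fault.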
All of this is well documented on the web.
But I am having a hard time with dtrace on i386 architectures. After
loading dtrace, and then removing it from the system, on a subsequent
reload of the driver, the system crashes/hangs. Most of the time there
is no output on the console; when there is output on the console,
it's confused and corrupted. Which indicates that one of the
key data structures in the kernel is corrupt (IDT, Page Tables or GDT).
And, because of this, it is nearly impossible to debug. Nothing in the kernel
can help debug this scenario - we cannot print or signal what has happened
or where we were prior to the crash.
At the moment I am using the VirtualBox debugger to poke around after
a crash, but the debugger won't let me examine memory exactly because the
page table is corrupt (or the CR3 register is corrupt, but I cannot
tell the difference; CR3 is the register which points to the start
of the page table).
So, this is the worst bug to resolve - no kernel debugger, printk statement,
or anything else in the kernel will help find the cause of the strange hang.
(Strangely, this problem does not exist in the 64-bit kernel).