Nigel was asking me today why I was bothering to spend so much
time on a bug which is uninteresting - and, if this issue happens
on i386, why we don't see it on x64.
Let's catch up: dtrace, when reloaded on an i386 system, can panic or
hang the system. This doesn't happen on x64.
As much as I like to dismiss i386 as yesterday's technology, it demonstrates
that *something* is wrong. Ignoring this warning sign is perilous.
Before going into the subject in detail: why does this show up after a
driver load? Why not halfway through debugging your production system?
The underlying scenario may be rare, but without a deep understanding of it,
such a tool can never promote itself in the reliability stakes.
OK, so let's deep dive. Every process has a page table - which describes
what the process can see. In Linux, process #0 (the 'swapper') also
has a page table, but it's a "master page table". It describes what the
kernel can see.
A process spends most of its time dealing with its own address space, but
on a system call or interrupt, we are dealing with the kernel. The CPU
contains the circuitry to make kernel space visible when those
interrupts or system calls happen.
But how do the kernel map and the per-process map keep in sync? When
you load a device driver (or even plug in a USB drive, for instance),
the kernel will allocate space for the code and data. This belongs to
the swapper/kernel. If, whilst your process is executing, the
USB drive generates an interrupt, leading to the USB driver executing,
it will do so in the context of your process page table. You cannot
see this (normally). But those pages are *not* in your page table.
So, as the CPU tries to jump or access this memory, a page fault
will be generated. The page fault handler *IS* in your page table
(as is the whole monolithic kernel). The page fault handler will realise
the page fault happened in kernel space, and will notice that
the swapper page table and your process page table do not agree.
It will copy the offending page table entry from kernel(swapper) to your
process. And the system will continue - as if "by magic". (Function vmalloc_fault()
is the one that does this magic).
Linux/Dtrace is special
Linux/dtrace is very special compared to all other implementations of
dtrace and all other drivers on Linux: it is not only dynamically loaded,
but it contains a page fault handler. (Why? Because when you do silly
things in your D scripts, dtrace wants to prevent you from panicking the
kernel; it has to intercept invalid page faults caused by D scripts.
It doesn't care about normal page faults, and leaves the kernel to do its
normal job.)
If the page fault handler is not in the user page table (and why should it
be? after a module load, it won't be), then we are in dangerous territory.
You cannot simply ignore a "page fault" - you *must* process it. So, here's
the scenario: when dtrace is loaded, it only exists in the kernel page
tables - not in any process's page table. Under normal use of dtrace,
invoking probes or syscalls, the act of these probes firing causes
a page fault which ensures the dtrace code is mapped into the page table
of the process.
What is happening...
After dtrace is loaded, we have two scenarios to consider: system processes
(especially kernel threads, irqbalance, etc.) and user procs. The system
processes run in kernel space and have the page fault handler mapped.
(In theory these system procs shouldn't have page faults, but they
might do.) The user procs have no knowledge of dtrace, and as they page fault,
the CPU will try to invoke the page fault handler - which is not mapped
into the user proc page table. This causes another fault, and
we eventually have stack overflow, page table corruption and a double
fault.
The solution is to ensure that every process has the page fault
handler mapped as the module is loaded. I've written/borrowed code
to walk the process table, and ensure the page fault handler
is properly "faulted" into the per-process page table.
My first experiments were a failure: even the tiniest of coding blips
will show up as a crash/hang/panic. After validating the code very
carefully, it appears to work.
When a process forks, the new process gets a copy of the same page
tables as its parent. So, if a process has the page fault handler
mapped, so will its child. I.e. we just need to "seed" every process
on the module load, and we are done.
Why doesn't this happen on x64?
I believe the reason is: probability. It *can* happen, but either has not
been observed, or was assumed to be a different bug. I haven't
directly measured this (yet) on x64, but all it requires is that the
page where the dtrace page fault handler is loaded into memory
be mapped into every process's page table *by accident*. This might
happen due to bootup modprobes and other things, or it could be caused
by the layout of the page table directory structure leading to a higher
likelihood of dtrace landing on a previously mapped page.
(Maybe even the layout of the ELF format module file might "help".)
But on a large-memory system, that accident might not happen, and it is
likely the same bug would crash - at the least opportune time.
Next up is to tidy up the horror of code bodges I have in my VM
and push them back to the master dtrace source; see if I can prove it could
be a problem on x64; and ensure the new code is palatable to x64.
(The Linux kernel, in arch/x86/mm/fault.c has two implementations of
vmalloc_fault - one for i386 and one for x64, so I cannot assume the
i386 "fix" will work for x64).