Monday 16 January 2012

Naughty naughty bug.

I believe I have finally found / confirmed the root cause of
the "impossible" bug.

Let's go on a journey...

The i386 virtual memory architecture relies on page tables.
Each process has a (complex) array of descriptions, one for each page.
Each page in the 4GB address space either has an entry, or is missing
an entry. (A process does not need the full 4MB of page tables to describe
its address space if it is not using every page of the 4GB space;
4MB is what is needed to describe each and every page for a fat 4GB process).
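
The arithmetic behind that 4MB figure, and the way a 32-bit address gets carved up, can be sketched in plain user-space C. This assumes classic non-PAE paging with 4KB pages and 4-byte entries; the function names are mine, purely for illustration, and are not taken from the kernel.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdint.h>

#define PAGE_SHIFT  12                  /* 4KB pages */
#define PAGE_SIZE   (1u << PAGE_SHIFT)
#define ENTRY_SIZE  4u                  /* bytes per entry (non-PAE assumption) */
#define ENTRIES     1024u               /* entries per directory or table */

/* Bytes of page table entries needed to describe the whole 4GB space. */
uint32_t full_table_bytes(void)
{
    uint64_t pages = (1ull << 32) >> PAGE_SHIFT;   /* 1,048,576 pages */
    return (uint32_t)(pages * ENTRY_SIZE);
}

/* Top 10 bits: index into the page directory. */
uint32_t pgd_index(uint32_t vaddr)
{
    return vaddr >> 22;
}

/* Middle 10 bits: index into the page table that directory entry points at. */
uint32_t pte_index(uint32_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (ENTRIES - 1);
}

/* Low 12 bits: byte offset within the page. */
uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & (PAGE_SIZE - 1);
}
```

With these definitions, full_table_bytes() comes out at 4MB, and an address such as 0xc10277f0 lands in directory slot 772, table slot 39, at offset 0x7f0 within the page.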

Now, typically in that 4GB range is user-space (typically everything
below around 3GB) and the kernel (everything in the last 1GB).
[Details differ from release to release of the kernel].

Now the kernel can see all the user space pages, but typically, the
user space process *cannot* see the kernel pages. (It would be nice if you
could, but that would be a security hole). Physical RAM is
mapped so that the kernel can see every page of memory, but the kernel
pages are marked so you cannot (in user space) "see them". (With root
access and access to /dev/kmem, you can poke around, but that's not normal
behaviour).
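
As a rough sketch of what "marked" means: each x86 page table entry carries a Present bit (bit 0) and a User/Supervisor bit (bit 2). The helper below is a hypothetical user-space model of the access check, not kernel code, and it ignores the read/write bit and everything else in the entry.

```c
#include <assert.h>   /* only needed for the checks in the note below */
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u   /* bit 0: the mapping exists */
#define PTE_USER    0x4u   /* bit 2: user mode may touch the page */

/* Would an access to a page with this entry succeed? */
bool can_access(uint32_t pte, bool user_mode)
{
    if (!(pte & PTE_PRESENT))
        return false;          /* not mapped: faults in ANY mode */
    if (user_mode && !(pte & PTE_USER))
        return false;          /* kernel-only page, seen from user space */
    return true;
}
```

Note the first test: a page that is not present faults regardless of privilege. Supervisor mode only gets you past the user/supervisor check, it does not conjure up a mapping that is not in the current page table.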

Now, let's consider what happens when a (module) device driver is loaded.
The kernel locates some free memory and loads the image into memory.
The kernel does a lot of housekeeping to link the module into various
lists and expose the /proc, /dev and other entries.

Here it gets interesting...



The driver is loaded into memory - the kernel knows about the memory
before the driver is loaded - it's the physical RAM in your box. But
maybe/maybe not - the unallocated "free" memory in the kernel is not
really addressable - certainly not by user space, and possibly not
even by the kernel - trying to access free memory would indicate
a rogue pointer or an array-out-of-bounds error. When the kernel
needs a free page, it can ask the kernel allocator for it.

So, this means, if you were to examine the page table for each process
in the system, these free pages are effectively "not there" and this can
help detect rogue pointers and bugs in the kernel.

As the driver is loaded, the pages are flipped to "being there" and visible.
E.g. the code for the driver has to be visible to the rest of the kernel,
because you are going to do an open/read/write/close, for example.

Now, from user space - it doesn't know or care about the physical memory
for a driver. You cannot just blindly execute a subroutine in a driver - you
can only get to it, by executing a system call, which takes us from
user space to kernel space, and, once in kernel space, we can see the
code + data for the driver.

The Bombshell



Now consider a driver which embeds its own interrupt routine. When
an interrupt fires, we normally switch to supervisor mode, and the
page where the interrupt routine resides, is visible and executable.

I have been trying to track down a kernel blow-up with dtrace, when
it's loaded one or more times and a page fault fires. (Only observed
in the i386 kernel; I have not seen it in the x64 kernel).

When the user space fires a page fault, we switch to supervisor mode
and run the page fault handler. The first bit of this is in the dtrace
driver. If we decide this is not interesting, we jump to the existing kernel
handler.

Half way through the kernel handler, it decides to take a context switch.
(I don't know why - maybe it's just being polite, and giving other high
priority tasks a chance to run). As we load the %cr3 register (which points
to the page directory for the new process), we suddenly lose visibility
of the dtrace driver. It is no longer visible, in the context of that
process, *EVEN FROM SUPERVISOR MODE*.

That new process which just got the CPU takes a page-fault and *BANG*!
GAME OVER!

The page fault handler is no longer visible. In fact, trying to take
the interrupt fires a page fault exception, which in turn fires a page
fault exception. The stack overflows and the CPU merrily trundles along
overwriting the entirety of memory until it shoots both its feet off.
(E.g. it starts to overwrite the page table itself or some other important
structure). I strongly suspect that using the *page table* as a *stack*
is what is causing the CPU to triple fault, and VMware and VirtualBox to
report that an unexpected unrecoverable event has happened and shut down the VM.

Eh? Whaddya say?



The evidence suggests that when a driver is loaded into kernel memory,
ONLY SOME PROCESSES HAVE IT MAPPED INTO THEIR PAGE TABLE.

I did an experiment: I wrote some kernel code to let me probe
each process on the system, to see if that process could see, in kernel
space, a specific address. I tried a kernel address and that was fine (e.g. sys_open).
I tried the dtrace interrupt (dtrace_page_fault), and it wasn't. I
loaded a random other driver, and confirmed the same.

So, let's revisit. When a driver is loaded into kernel memory - it
is touch and go as to whether the driver ends up in the page
tables of all processes in the system. Loading a driver could cause a lot
of page table updates, as each process's page table would need to be
updated to reflect the mappings. Or, instead, the kernel might decide it's
not worth the bother: user space cannot access system space, except via
a trap into the kernel via a syscall or interrupt.

So, why do half of the procs in the system have the driver loaded and the
others do not?

Here's my guess: when a driver is loaded into memory, at least one
page table needs to be modified. This is a special page table which
belongs to process zero (the swapper process). [A data structure
called swapper_pg_dir holds the kernel page table]. Under normal
circumstances, every time a new process is created, it is a fork/clone
of an existing process, so that new process gets a copy of the kernel
page tables.

But loading a driver means we cause a "warp" effect - the kernel gets
the new mappings, but some (or all) of the existing user procs do not.
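
To make the guess concrete, here is a toy user-space model of it. Everything in it is an assumption for illustration: the sim_* names are invented, the page directory is reduced to a flat array, and the kernel is taken to own the top quarter of the slots (a 3GB/1GB split). It models my hypothesis, not what the kernel actually does.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PGD_ENTRIES   1024
#define KERNEL_START  768    /* first kernel slot, assuming a 3GB/1GB split */

struct pgd {
    unsigned int entry[PGD_ENTRIES];
};

struct pgd master;           /* stands in for swapper_pg_dir */

/* fork: the child gets a snapshot of the kernel slots as they are right now. */
void sim_fork(struct pgd *child)
{
    memset(child, 0, sizeof(*child));
    memcpy(&child->entry[KERNEL_START], &master.entry[KERNEL_START],
           (PGD_ENTRIES - KERNEL_START) * sizeof(child->entry[0]));
}

/* module load: only the master directory learns about the new mapping. */
void sim_load_module(int slot, unsigned int value)
{
    assert(slot >= KERNEL_START && slot < PGD_ENTRIES);
    master.entry[slot] = value;
}

/* Is this slot populated in the given page directory? */
bool sim_can_see(const struct pgd *p, int slot)
{
    return p->entry[slot] != 0;
}
```

A process forked before sim_load_module() keeps its stale snapshot and cannot see the module; one forked afterwards can. That is the same split the probe showed between processes on the real system.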

The solution



Is this a bug? Is this a misinterpretation by me? It feels like a bug.
Maybe the dtrace driver is miscompiled and I haven't put the
interrupt code into the right ELF section (so I will go and check).

If it's not a compile/declaration problem, then either I need to
update every process's page table to see the driver pages, or find
a way to ensure that a kmalloc()ed page is visible to all processes.
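
The first option (patch every process) might look something like the following, again as a user-space toy rather than real kernel code. The struct, the slot numbering and sync_kernel_slots() are all hypothetical; a real implementation would have to walk the task list and worry about locking, which this sketch ignores entirely.

```c
#include <assert.h>
#include <stddef.h>

#define PGD_ENTRIES   1024
#define KERNEL_START  768    /* first kernel slot, assuming a 3GB/1GB split */

struct pgd {
    unsigned int entry[PGD_ENTRIES];
};

/*
 * Copy every kernel slot that the master directory has, and a process
 * lacks, into that process. Returns the number of slots patched.
 */
int sync_kernel_slots(const struct pgd *master, struct pgd *procs, size_t nprocs)
{
    int patched = 0;

    assert(master != NULL);
    for (size_t i = 0; i < nprocs; i++) {
        for (int s = KERNEL_START; s < PGD_ENTRIES; s++) {
            if (master->entry[s] != 0 && procs[i].entry[s] == 0) {
                procs[i].entry[s] = master->entry[s];
                patched++;
            }
        }
    }
    return patched;
}
```

Running it a second time should patch nothing, which makes it a cheap way to check whether anything had drifted in the first place.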

The evidence



Here's some evidence to support my findings. I invoke a kernel
function to probe three addresses: a static kernel function, the dtrace page
fault handler, and an "other" loadable module:


...
5.568007271 #0 1490:1294 d87fa000
5.568007271 #0 1490: lookup1: c182009c 1
5.568007271 #0 1490: lookup2: ebc236d0 1
5.568007271 #0 1490: lookup3: 00000000 3
5.568007271 #0 1490:1489 d86ef000
5.568007271 #0 1490: lookup1: c182009c 1
5.568007271 #0 1490: lookup2: ebc236d0 1
5.568007271 #0 1490: lookup3: eb0c52b4 1
...


"1490" is the PID of the process invoking the tracing. The first entry
is for pid 1294. This PID can see the kernel function and the dtrace function,
but not the "other" driver. Pid 1489 can see all three addresses I
specified. There's no real logic to why pid 1294 cannot see the new
driver.


root 1294 1 0 21:38 ? 00:00:00 /usr/sbin/console-kit-daemon --no-daemon
fox 1489 1289 0 21:38 ? 00:00:03 sshd: fox@pts/1


Here's the kernel code, in case anyone is interested, which
dumps the output:


void xx_procs(void)
{
	struct task_struct *t;

	// hack to call a GPL function in the kernel; the address is
	// specific to this kernel build
	pte_t *(*lookup_address)(unsigned long, unsigned int *) =
		(pte_t *(*)(unsigned long, unsigned int *)) 0xc10277f0;
	unsigned int level = 0;
	pte_t *p;

	printk("process list:\n");
	p = lookup_address(0xc10277f0, &level);
	printk(" lookup: %p %u\n", p, level);
	for_each_process(t) {
		struct mm_struct *mm;

		printk("%d %p\n", t->pid, t->mm ? t->mm->pgd : NULL);
		// kernel threads have no mm of their own - skip them
		if ((mm = t->mm) == NULL)
			continue;

		// lookup_address1() is the same as the kernel's
		// lookup_address() - but a private copy to allow a
		// proc's mm_struct to be passed so we can probe another
		// process's page table.

		// random kernel address
		p = lookup_address1(mm, 0xc10277f0, &level);
		printk(" lookup1: %p %u\n", p, level);

		p = lookup_address1(mm, (unsigned long) dtrace_page_fault, &level);
		printk(" lookup2: %p %u\n", p, level);

		// random "other" module address, gained from /proc/modules
		p = lookup_address1(mm, 0xeecad000, &level);
		printk(" lookup3: %p %u\n", p, level);
	}
}



Post created by CRiSP v10.0.22a-b6154

