CRiSP, DTrace, and other technobabble: Update on the impossible

I've been working through why some kernel addresses are not visible
to some processes. Think of a block of memory allocated by the kernel
for internal use that nevertheless triggers a page fault, e.g. because
the page is swapped out, or hasn't been touched yet by user space.

When the page fault handler is invoked, the address of the buffer
is mapped in only some processes' page tables.

Turns out this is a clever (nice!) trick of the kernel. When something
is allocated in the kernel and should be visible to all processes,
e.g. a driver module or other buffer, the mapping goes into the
"master page table". Process #0, created at kernel boot, houses the master table.
The currently running process may not have the mapping in place, and instead
of paying a large cost to update every process's page tables to reflect
these kernel pages, the page fault handler checks, when the fault occurs,
whether the page is valid in the master table and, if so, updates the
local process page table.
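
To make the trick concrete, here is a toy user-space model of it. This is
only a sketch: the names (toy_pgd, toy_kernel_map, toy_lazy_fault) and the
8-slot table are made up for illustration and are not kernel APIs. The real
kernel works on multi-level page tables rooted at init_mm.pgd.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 8  /* toy table size; an assumption for the example */

/* A toy page table: a slot value of 0 means "not present". */
struct toy_pgd {
    uint64_t slot[NSLOTS];
};

/* Stand-in for the master ("reference") table owned by process #0. */
struct toy_pgd master_pgd;

/* Kernel allocates something: only the master table is updated. */
void toy_kernel_map(size_t idx, uint64_t entry)
{
    if (idx < NSLOTS)
        master_pgd.slot[idx] = entry;
}

/*
 * Fault path: if the master table has a valid entry, copy it into the
 * current process's table. No locks, nothing else is touched.
 * Returns true if the fault was resolved, false for a genuinely bad address.
 */
bool toy_lazy_fault(struct toy_pgd *cur, size_t idx)
{
    if (idx >= NSLOTS || master_pgd.slot[idx] == 0)
        return false;
    if (cur->slot[idx] == 0)
        cur->slot[idx] = master_pgd.slot[idx];
    return true;
}
```

With two toy processes, calling toy_kernel_map(3, 0x1000) and then
toy_lazy_fault() for only one of them leaves slot 3 filled in that
process and still empty in the other, which is exactly the "some can
see it, some can't" situation.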

This explains why some processes can see the page in question and others
cannot.

Bear in mind we are putting in place a page-fault interrupt handler.
This *must*, repeat *MUST*, be mapped at the time of a page fault, else
we get a cascade of nested page faults because the handler isn't mapped
in the process page tables.

So, we need to arrange for this to be true. At the moment, the options
are: (a) see if anything in the kernel allows us to propagate the
page-table mapping across all procs (nobody else, other than possibly
a virtualisation guest, such as Xen/VMware/VirtualBox, is likely to do this);
(b) do the hard work myself; (c) move the interrupt handler
into an existing page of mapped memory [hard]; or (d) don't patch the
IDT, but patch the existing page fault handler [not sure if this doesn't
just put off the problem].

Let me scrape some cobwebs off my brain...

BTW - here's the relevant comment in kernel/fault.c, function do_page_fault:

        /*
         * We fault-in kernel-space virtual memory on-demand. The
         * 'reference' page table is init_mm.pgd.
         *
         * NOTE! We MUST NOT take any locks for this case. We may
         * be in an interrupt or a critical region, and should
         * only copy the information from the master page table,
         * nothing more.
         *
         * This verifies that the fault happens in kernel space
         * (error_code & 4) == 0, and that the fault was not a
         * protection error (error_code & 9) == 0.
         */
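
The two masks in that comment line up with the x86 page-fault error code
bits: bit 0 is set for a protection violation (clear means page not
present), bit 2 is set when the fault came from user mode, and bit 3 is set
for a reserved-bit violation. A small sketch of the test as a standalone
function follows. The PF_* names are modelled on kernel sources of this era
but treat them, and the function name toy_is_lazy_kernel_fault, as
illustrative rather than the kernel's exact code.

```c
#include <stdbool.h>

/* x86 page-fault error code bits. */
#define PF_PROT  0x1UL  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2UL  /* 0: read access,      1: write access         */
#define PF_USER  0x4UL  /* 0: kernel mode,      1: user mode            */
#define PF_RSVD  0x8UL  /* 1: reserved bit set in a page-table entry    */

/*
 * True when the fault is a candidate for the "copy from the master
 * page table" path: it happened in kernel mode and was caused by a
 * missing mapping rather than a protection or reserved-bit error.
 */
bool toy_is_lazy_kernel_fault(unsigned long error_code)
{
    bool from_kernel = (error_code & PF_USER) == 0;             /* & 4 */
    bool not_present = (error_code & (PF_PROT | PF_RSVD)) == 0; /* & 9 */

    return from_kernel && not_present;
}
```

Note that the write bit plays no part: a kernel-mode read or write to a
not-yet-copied mapping both qualify.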

Post created by CRiSP v10.0.22a-b6154

Tuesday, 17 January 2012
