Sunday, 4 November 2012

Xen blog ... strangeness

I am finding that running DTrace in a Xen guest is a painful thing
to debug. I haven't managed to get a decent debugger to help
diagnose the issue I am currently investigating, but thought it
worth writing up. This might help jog my own memory later.

I have DTrace working with the various key interrupts (INT1, INT3), but
in trying to get the page_fault handler to work, I keep breaking the
guest. We want the page_fault handler so that DTrace can intercept
faults at certain locations within itself, where a user D script might
dereference memory incorrectly. Consider:

$ dtrace -n 'syscall::: { printf("%s", stringof(arg0)); }'

When the arg0 to a syscall is not a string pointer, we will get a
warning from DTrace about a bad memory reference. (Technically, the
kernel generates a GPF, but we save ourselves from panicking the kernel.)

What is special about the page_fault handler compared to say, INT1
(single step interrupt)? I don't know.

Looking at the kernel code and searching Google is not helpful at all.
Let's ignore Xen and just visit some basics of assembler.

In assembler, we have subroutines - a CALL instruction jumps to the
target subroutine, and the return address is on the stack. The simplest
subroutine is:

ret // for an interrupt routine, this is an IRET instruction

An interrupt handler has to be careful to preserve all registers as
it does its stuff. (In user land we have to be careful too, but we have
some registers we can use without having to save them, such as the
incoming arg list).

So let's modify the above function, and do something as a no-op:

// Example 1
push %rax
pop %rax

This will crash the Xen guest. The following will not:

// Example 2

call silly
call silly

What's the difference between examples 1 and 2? I don't know. If I look
at example 1, I might hazard a guess that we have an invalid stack,
or a non-writable stack. But example 2 seems to work - we write to the
stack to call function silly and return.

The actual Linux page fault handler does something slightly
weird, along the lines of:

call *xen_handler // see below

sub $0x78,%rsp
call save_regs

mov %rdi,0x78(%rsp)
mov %rsi,0x70(%rsp)
mov %rdx,0x68(%rsp)
mov %rcx,0x60(%rsp)

It's a strange sequence - the initial "sub $0x78,%rsp" decrements the stack
pointer, leaving room on the stack for the registers, and then a subroutine
is called to populate the saved area, rather than using a sequence of
"push/push/push.." instructions. The kernel is like this with or without
Xen, and possibly this is a good thing to do for various reasons.
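
One way to read the 0x78 and the offsets in the listing is as plain
arithmetic on 8-byte slots. The Python sketch below is only a model. The slot
order is an assumption, taken from the general-purpose part of struct pt_regs
in x86-64 kernels of this era. It also assumes the four mov lines run inside
the called routine, where the return address pushed by the call occupies
0(%rsp). Under those assumptions the numbers line up with the listing.

```python
SLOT = 8  # bytes per 64-bit register slot

# ASSUMPTION: order of the general-purpose slots in the save area,
# lowest address first. This is not stated in the post.
SAVE_AREA = [
    "r15", "r14", "r13", "r12", "rbp", "rbx",
    "r11", "r10", "r9", "r8",
    "rax", "rcx", "rdx", "rsi", "rdi",
]


def area_size():
    """Bytes reserved by the 'sub' instruction in this model."""
    return len(SAVE_AREA) * SLOT


def offset_in_callee(reg):
    """Offset of a slot from %rsp as seen inside the called routine.

    The 'call' pushed a return address, so every slot is one word
    further away than it would be in the caller.
    """
    return SAVE_AREA.index(reg) * SLOT + SLOT


if __name__ == "__main__":
    print("save area:", hex(area_size()))
    for reg in ("rdi", "rsi", "rdx", "rcx"):
        print(reg, hex(offset_in_callee(reg)))
```

If the assumptions hold, 0x78 is simply 15 registers times 8 bytes, and rdi
lands at 0x78(%rsp) only because of the extra return-address word.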

Now "xen_handler" is a very interesting function; firstly, it's not a
function but a pointer to a function. I think it's like this because
the same kernel can run as a Xen guest or natively, so the target
function is either a no-op or some actual code. Inside a Xen guest, the
eventual function is:

0xffffffff8100aae0: mov 0x8(%rsp),%rcx
0xffffffff8100aae5: mov 0x10(%rsp),%r11
0xffffffff8100aaea: retq $0x10

That is a very weird function. Examination of the entry_64.S file in the
kernel shows that registers %RCX and %R11 need to be extracted - the
Xen hypervisor is pushing these registers on the stack in addition to the
normal semantics of a page fault. The "retq $0x10" is returning
from the subroutine, and also *removing* the two extra registers.

Let's rewrite the code, calling the Xen function directly under the
name xen_pop:

call xen_pop
sub $0x78,%rsp

xen_pop:
mov 0x8(%rsp),%rcx
mov 0x10(%rsp),%r11
retq $0x10

By simplification, this becomes:

pop %rcx
pop %r11
sub $0x78,%rsp
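
The claim that the two sequences are equivalent can be checked with a toy
stack model. The Python sketch below is hypothetical throughout: the
addresses, the stored values and the Cpu class are invented for illustration.
The only thing taken from the post is that %rcx and %r11 sit on top of the
normal fault frame. It tracks registers and %rsp only, so it cannot say
anything about what the hypervisor itself checks.

```python
class Cpu:
    """Tiny model of %rsp, a few registers and a downward-growing stack."""

    def __init__(self, rsp, memory):
        self.rsp = rsp
        self.mem = dict(memory)  # address -> 64-bit value
        self.regs = {}

    def push(self, value):
        self.rsp -= 8
        self.mem[self.rsp] = value

    def pop(self, reg):
        self.regs[reg] = self.mem[self.rsp]
        self.rsp += 8

    def load(self, offset, reg):
        # mov offset(%rsp),%reg
        self.regs[reg] = self.mem[self.rsp + offset]

    def ret(self, extra=0):
        # retq $extra: pop the return address, then drop `extra` more bytes
        self.rsp += 8 + extra


def frame_from_xen():
    # Made-up stack on handler entry: %rcx, %r11, then the usual frame.
    top = 0x1000
    return top, {top: 0xC0C0, top + 8: 0x1111, top + 16: 0x0E44}


def via_call():
    cpu = Cpu(*frame_from_xen())
    cpu.push(0xDEAD)       # call xen_pop pushes a return address
    cpu.load(0x8, "rcx")
    cpu.load(0x10, "r11")
    cpu.ret(0x10)          # retq $0x10
    return cpu


def via_pops():
    cpu = Cpu(*frame_from_xen())
    cpu.pop("rcx")
    cpu.pop("r11")
    return cpu


if __name__ == "__main__":
    a, b = via_call(), via_pops()
    print(hex(a.rsp), hex(b.rsp))
    print(a.regs == b.regs)
```

In this model both versions end with the same %rsp, %rcx and %r11. The one
visible difference is that the call version writes a return address into the
word just below the frame, while the pop version never touches it.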

But this appears not to work. It looks like the Xen hypervisor knows something
about the code in a page fault handler, and unless the code obeys what
it is expecting, we get a guest reboot.

Debugging here is very difficult - when things are wrong, the guest
reboots with very few, if any, console messages. Various web references
point to debug tools which aren't available in the Ubuntu apt cache.



  1. Hi Paul!

    Thanks for your work on DTrace/Xen

    And what xen debugging tools are you actually using?