Sunday 15 February 2015

address: 0000f00000000000

Strange. I am still trying to find out why dtrace is not passing my tests.
I have narrowed it down to a strange exception. If the user script
accesses an invalid address, we get either a page fault or a GPF
(general protection fault). DTrace handles this and stubs out the
offending memory access. Here's a script:


build/dtrace -n '
BEGIN {
        cnt = 0;
        tstart = timestamp;
}
syscall::: {
        this->pid = pid;
        this->ppid = ppid;
        this->execname = execname;
        this->arg0 = stringof(arg0);
        this->arg1 = stringof(arg1);
        this->arg2 = stringof(arg2);
        cnt++;
}
tick-1s { printf("count so far: %d", cnt); }
tick-500s { exit(0); }
'


This script examines all syscalls and tries to access the string
for arg0/1/2 - and for most syscalls, there isn't one. So we end up
dereferencing a bad pointer. But only some pointers cause me pain.
Most are handled properly. The address in the title is one such address.
I *think* what we have is the difference between a page fault and a GPF.
Despite a lot of hacking on the code, I cannot easily debug this, since
once the exception happens the kernel doesn't recover. I have modified
the script above to only do syscall::chdir: which means I can test
manually from a shell by doing a "cd" command. On my 3-cpu VM, I lose one
of the CPUs and the machine behaves erratically. Now I need to figure out
if we are getting a GPF or some other exception.

I tried memory addresses 0x00..00f, 0x00..0f0, 0x00..f00, ... (moving the
0xf one hex digit to the left each time) in order to find this. I suspect
there is no page table mapping here or it's special in some other way.
May need to dig into the kernel GDT or page table to see what is causing
this.
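
One possible explanation, which the experiments above do not confirm: on
x86-64 a virtual address must be "canonical", i.e. the unused upper bits
must all be copies of the highest implemented bit. An access to a
non-canonical address raises a GPF instead of a page fault, whatever the
page tables say. The sketch below is a minimal user-space check that
assumes 48 implemented address bits (4-level paging); the function names
are made up for illustration and are not part of the dtrace sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, so 48 virtual address bits are implemented. */
#define VA_BITS 48

/* Canonical means bits 63..47 are all zero or all one. */
static bool is_canonical(uint64_t addr)
{
        uint64_t top = addr >> (VA_BITS - 1);   /* the 17 bits 63..47 */

        return top == 0 || top == 0x1ffff;
}

/*
 * Walk the probe pattern 0x..00f, 0x..0f0, 0x..f00, ... and return the
 * index of the first hex digit position (0 = lowest) at which the
 * probe address stops being canonical, or -1 if none does.
 */
static int first_noncanonical_nibble(void)
{
        for (int i = 0; i < 16; i++) {
                uint64_t addr = (uint64_t)0xf << (4 * i);

                if (!is_canonical(addr))
                        return i;
        }
        return -1;
}

/* Sanity check against the address in the title of this post. */
static void check_title_address(void)
{
        assert(is_canonical(0x00000f0000000000ULL));
        assert(!is_canonical(0x0000f00000000000ULL));
}
```

Under that assumption, 0000f00000000000 is the first address in the
sequence with bit 47 set while bits 48..63 are clear, which would fit
with it being the first one to misbehave.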

UPDATE: 20150215

After a bunch of digging I found that the GPF interrupt handler had
been commented out. There was a bit more to this than that, because
even when I re-enabled it, I was getting some other spurious issues.
All in all, various bits of hack code and debugging had got in the way
of a clear message.

I have been updating the sources to merge the fixes for the 3.16 kernel
back in, but I have a regression in syscall tracing which can cause
spurious panics. I need to fix that before I do the next release.

Post created by CRiSP v12.0.3a-b6801
