CRiSP, DTrace, and other technobabble: Dtrace progress

I've spent the last week or so looking at why Dtrace on the 3.x
kernel is misbehaving and found some interesting things on the journey.

Firstly, printk() seems to be broken. In the 3.x kernel, they
appear to have added a recursion detection and stronger locking
semantics, which means attempts to call printk() from an interrupt
(or some other strong locking mode), will hang the kernel.

That has made debugging the other problems very difficult.
So be it, I have removed or disabled most of the printk() console
output.

I have an internal function - dtrace_printf() which does what
printk() does, but writes to an internal circular buffer, which is
safe, but isnt much help if the kernel locks up.

I have decided to move totally away from the kernel spin locks
and mutex/semaphore support, and use my own mutex implementation.
This avoids problems when the kernel mutexes can allow preemption
and other re-entrant problems, probably leading to some of the lock up.

I even went the whole hog and added the kernel debuggers to try and
debug the problem (kdebug/kgdb). This was totally unuseful for two
reasons. One, you cannot set breakpoints in the kernel (despite
trying). I had to turn off the kernel RODATA config option
since this prevents some parts of the kernel being writable,
but I still got errors. I could never get the debugger to wake
up on an RCU lockup or allow interactive "break-in" to the locked
up kernel.

The kernel rarely panics - it just locks up. Very frustrating.

Having fixed various things, it appears dtrace works much better.
The "dtrace -n fbt:::" works very well, EXCEPT ctrl-c can hang the kernel.

What is interesting is that during ctrl-c of dtrace, the amount
of work to teardown the probes is huge - specifically, the number of
xcalls called on teardown is about 3x the number of probes actually
played. So, down an fbt::: probe, which is placing around 40,000
probes, means on the ^C, the system is sluggish for a number of seconds as
tens/hundreds of thousands of xcalls are invoked. I think this is
problem which needs to be addressed in the real dtrace.

The problem is that as each probe is torn down, a number of dtrace_sync()
calls are invoked, and each one is talking to all the other CPUs
to ensure synchronisation. For reasons I cannot tell, this
is very expensive - presumably the other CPUs may be in a disabled
interrupt state, so we have to wait for that
cpu to come out of the interrupt region, or leverage the NMI
interrupt to break the deadlock.

I'm still trying to understand how xcall works -- if we ask
another cpu to take an IPI interrupt and it is stuck in a long
spinlock, then the xcall may take a while or forever to be invoked.
I dont understand how Solaris breaks this deadlock, but it does
appear solaris tries hard to keep interrupts enabled at all times,
which is different from Linux. I may have that misunderstood in Solaris.

MacOS does something different/similar in that on a xcall, the
invoking CPU keeps an eye out for recursive TLB shootdowns.

I hope to make a new release of dtrace in next few days if
I feel happy I havent made things worse.

I really dont like the Ctrl-C/teardown performance - it can
lead to the kernel complaining about stuck CPUs.

Post created by CRiSP v10.0.17a-b6103

CRiSP, DTrace, and other technobabble

Saturday, 5 November 2011

Dtrace progress

No comments:

Post a Comment