dtrace was looking good. Surviving various torture tests.
But then Nigel had a go, and he reported a no-go. Slightly better
but not robust enough and could crash/hang the kernel within a few
minutes. The test was very simple:
$ dtrace -n fbt::[a-e]*:'{exit(0);}'
and
$ dtrace -ln syscall:::
running in parallel. The long teardown from the first dtrace
would allow the much shorter syscall::: list to interleave, and
sooner or later, lead to deadlock.
I modified the first dtrace to fbt::[a-m]*: to improve the chance
of the deadlock. (fbt::: is very draconian, enabling probes on all
functions, and makes ^C or other checks for responsiveness, a little
excessive. fbt::[a-m]*: was a middle ground - quick to fail.
In investigating this, I ended up writing a couple of very simple
but useful tools. The mode of failure was to cause "stuck CPU" messages
to display on the console. Linux has a periodic timer fire which
can be used to detect CPUs stuck in an infinite loop - a useful feature.
When the stuck cpu message fires, you should get a stack
trace in /var/log/messages (Fedora), or /var/log/kern.log (Ubuntu).
Unfortunately, once a CPU is stuck, chances are high that the system
is in an unrecoverable state - and attempting to look at the stacks
doesnt help. I occasionally could get the kernel logs.
Each time, I could see a suggestion that the cpu was stuck
in the kfree() function, freeing memory. As the dtrace session ends,
all the probes need to be unwound, and the memory freed. For fbt::[a-m]*:,
this is around 29,000 probes (and some multiple of this in memory allocations).
Freeing 29000 memory blocks should take well under a millisecond.
But it takes *seconds*. (See prior threads on slow teardowns of
fbt).
The reason it is so slow, is that dtrace needs to broadcast to the
other cpus, effectively saying "Hands off this memory block". This
involves a cpu cross-call to synchronise state. Unfortunately, this is
slow. It also does 3 of these per probe that is dismantled.
Previous attempts to optimise this failed. (Nigel had pointed out
this is significantly faster on real hardware vs a VM). I didnt
quite get to the bottom of why it failed...
So, we know we have a stuck cpu, something to do with freeing memory.
Very difficult to debug. I wanted to know what the other CPUs were doing.
If one cpu is waiting for something (a lock, semaphore, spinlock, whatever),
then the others must be causing the lock. The evidence showed they werent
really doing much.
During a teardown, the other cpus may continue to fire probes, so the
chances were that another cpu was causing a deadlock.
The big implication was that either my mutex implementation or my
xcall implementation was at fault. I spent a lot of time going through
both with a fine tooth comb, and trying various experiments.
It really didnt make sense. To cut a long story short, eventually,
after despairing I could ever find it, it started working! Having
made lots of little changes, eventually it survived over an hour, vs
the 1-3 minutes previously.
In the end, a couple of small changes were made (it wasnt obvious
which of the many changes were the cause of the fix). By suitable
diffing, and cutting back of the noise, I found that the xcall code
and mutex code needed to occasionally drain the pending xcall
buffers. If we run a cpu with interrupts disabled, and another cpu
is trying to call us, then we put a long delay in responding to the
function call. Worse, we can deadlock - the other cpu is waiting for us,
and we are waiting for that cpu. By suitable "draining" calls, the
problems disappeared.
So, hopefully it is hardened.
Lets see how Nigel copes with this, over the weekend.
Its worth noting that debugging this kind of code is very difficult.
I tried using the kernel debuggers, but faced a number of problems.
One is that you cannot interrupt the kernel to see whats going on if
the cpus have turned off interrupts. (I believe).
I'll document a couple of the tools I added to the war chest in a
follow up blog post.
Post created by CRiSP v10.0.19a-b6127
No comments:
Post a Comment