Sunday, 11 December 2011

Ding dong! Who's there? Anybody? Someone say hello!

Nigel has been up to his tricks again. No sooner do I give him
a new release than, after a 21h marathon, some dtrace problems surface.

I'm not sure if I have fixed, or even found, that issue...but whilst
trying to reproduce it (and I am not happy having to wait 21h to get
a feel for it), I found a new bad scenario.

On a 3-CPU VM, if I repeatedly run a simple syscall:::{exit(0);} type
of probe, then running *four* of these at once will deadlock the system.

After a lot of poking around, adding debug, and trying to understand it,
I think I located the source of the problem.

So, a process runs on cpu#0 and, whilst holding on to the
locks, is suspended by the kernel. Next, another dtrace process
comes along and waits on the mutex held by that suspended process.
Repeat two more times.

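Now, the mutex implementation is effectively a spinlock, so each waiter
keeps its CPU busy while it waits. Once the waiters are spinning on all
three CPUs, the first process, the one holding the locks, never gets to
run. So we have a deadlock and a hung system.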

The cure appears to be calling schedule() in the middle of the mutex-wait
loop, avoiding cpu starvation.

This appears to work fine.
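
To illustrate the idea, here is a minimal userspace sketch, not the actual
driver code. It assumes a spin-style mutex built on a C11 atomic_flag, and it
uses sched_yield() as a stand-in for the kernel's schedule(). The demo_mutex_*
names are made up for this example.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the dtrace mutex; not the driver's real type. */
typedef struct {
    atomic_flag locked;
} demo_mutex_t;

#define DEMO_MUTEX_INIT { ATOMIC_FLAG_INIT }

/* Try once. test_and_set returns the old value, so false means we got it. */
static bool demo_mutex_trylock(demo_mutex_t *m)
{
    return !atomic_flag_test_and_set_explicit(&m->locked,
                                              memory_order_acquire);
}

/*
 * Wait for the mutex, but give the CPU away on every failed attempt so a
 * suspended lock holder can be scheduled and release the lock. Without the
 * yield this is a pure spin, which is the starvation case described above.
 * Returns how many times we yielded, purely for illustration.
 */
static unsigned long demo_mutex_lock(demo_mutex_t *m)
{
    unsigned long yields = 0;

    while (!demo_mutex_trylock(m)) {
        sched_yield();  /* userspace analogue of calling schedule() */
        yields++;
    }
    return yields;
}

static void demo_mutex_unlock(demo_mutex_t *m)
{
    atomic_flag_clear_explicit(&m->locked, memory_order_release);
}
```

The point is only where the yield sits: inside the wait loop, so a waiter
cannot monopolise its CPU while the holder is off the run queue.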

New release with this fix, and Nigel can spend another 21h finding
more bugs for me. :-)

Post created by CRiSP v10.0.20a-b6134

