Nigel has been up to his tricks again. No sooner do I give him
a new release than, after a 21h marathon, some dtrace problems surface.
I'm not sure if I have fixed, or even found, that issue, but whilst trying
to reproduce it (and I am not happy having to wait 21h to get
a feel for it), I found a new bad scenario.
On a 3 CPU VM, if I run a simple syscall:::{exit(0);} type of probe,
repeatedly, then running *four* of these will deadlock the system.
After a lot of poking around, adding debug, and trying to understand it,
I think I located the source of the problem.
So, a process runs on cpu#0 and, whilst holding on to the
locks, is suspended by the kernel. Next, another dtrace process
comes along and blocks waiting on the mutex held by that first
process. Repeat two more times.
Now, the mutex implementation is effectively a spinlock, so the waiters
spin without giving up their cpus, and we don't allow the first process,
the one holding the locks, to run. So we have a deadlock and a hung system.
The cure appears to be calling schedule() in the middle of the mutex-wait
loop, avoiding cpu starvation.
This appears to work fine.
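To make the idea concrete, here is a minimal user-space sketch of a wait loop that gives up the cpu between attempts. It is not the driver code: sketch_mutex_t and the sketch_mutex_* functions are names made up for illustration, and sched_yield() stands in for the kernel's schedule().

```c
#include <sched.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the dtrace mutex; not the real driver type. */
typedef struct {
    atomic_flag locked;
} sketch_mutex_t;

#define SKETCH_MUTEX_INIT { ATOMIC_FLAG_INIT }

/* Single attempt to take the lock. Returns 1 on success, 0 if it is held. */
static int sketch_mutex_tryenter(sketch_mutex_t *m)
{
    return !atomic_flag_test_and_set_explicit(&m->locked,
                                              memory_order_acquire);
}

/*
 * Wait for the lock, but yield the cpu after every failed attempt so a
 * suspended lock holder gets a chance to run and release it.
 * sched_yield() is the user-space stand-in for schedule() here.
 * Returns how many times we had to yield.
 */
static unsigned long sketch_mutex_enter(sketch_mutex_t *m)
{
    unsigned long yields = 0;

    while (!sketch_mutex_tryenter(m)) {
        sched_yield();
        yields++;
    }
    return yields;
}

/* Release the lock. */
static void sketch_mutex_exit(sketch_mutex_t *m)
{
    atomic_flag_clear_explicit(&m->locked, memory_order_release);
}
```

Without the yield in the loop, every waiter just burns its cpu until the holder lets go, which is the starvation described above when the holder is not running.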
New release with this fix, and Nigel can spend another 21h finding
more bugs for me. :-)
Post created by CRiSP v10.0.20a-b6134