In the beginning was the mutex. And lo! This was mapped to a Linux mutex.
But this caused problems, because a mutex cannot be used inside an interrupt
routine.
On the second day, this was mapped to a semaphore. Semaphores are good.
They can be used in an interrupt routine.
On the third day, the semaphores were replaced with custom mutexes
(effectively spinlocks), because a semaphore could suspend the calling
process inside a nested interrupt.
On the fourth day, the custom mutexes were replaced by semaphores,
because a timer probe would invoke calls to spinlocks and preempt
disables, and lead to recursive probe faults.
On the fifth day, the semaphores were replaced with different custom
mutexes. Ones which supported nested operation, and avoided depending on
Linux functions.
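To make the "nested, no Linux functions" idea concrete, here is a minimal sketch of what such a mutex could look like. It is not the code in the dtrace module: the names (nmutex_t, nmutex_enter and friends) are hypothetical, the caller is assumed to pass in its own CPU id, and C11 atomics stand in for whatever primitives the real driver uses.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical nestable spin mutex; all names are illustrative only. */
#define NMUTEX_NO_OWNER (-1)

typedef struct nmutex {
    atomic_int owner;   /* CPU id of the holder, or NMUTEX_NO_OWNER */
    int        depth;   /* nesting count; only touched by the holder */
} nmutex_t;

static void nmutex_init(nmutex_t *m)
{
    atomic_init(&m->owner, NMUTEX_NO_OWNER);
    m->depth = 0;
}

/* Single attempt. Returns 1 if the lock is now held by 'cpu', else 0. */
static int nmutex_tryenter(nmutex_t *m, int cpu)
{
    int expected = NMUTEX_NO_OWNER;

    /* Nested acquire, e.g. from an interrupt taken on the holding CPU. */
    if (atomic_load_explicit(&m->owner, memory_order_relaxed) == cpu) {
        m->depth++;
        return 1;
    }
    /* First acquire: claim the lock only if nobody owns it. */
    if (atomic_compare_exchange_strong_explicit(&m->owner, &expected, cpu,
            memory_order_acquire, memory_order_relaxed)) {
        m->depth = 1;
        return 1;
    }
    return 0;
}

/* Spin until acquired; never sleeps, so it cannot suspend the caller. */
static void nmutex_enter(nmutex_t *m, int cpu)
{
    while (!nmutex_tryenter(m, cpu))
        ;
}

/* Release one nesting level; the lock is freed when depth reaches 0. */
static void nmutex_exit(nmutex_t *m, int cpu)
{
    assert(atomic_load_explicit(&m->owner, memory_order_relaxed) == cpu);
    if (--m->depth == 0)
        atomic_store_explicit(&m->owner, NMUTEX_NO_OWNER,
            memory_order_release);
}
```

The trade-off is that a nested acquire lets an interrupt handler into a critical section the interrupted code has only half finished, so this only avoids the self-deadlock; it does not make the protected data consistent.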
Get the picture? It's complicated.
Why is it so complicated? Because Solaris interrupt management
(via splx()) doesn't map to the cli/sti mode of a processor. Solaris's
interrupt mechanism has existed for nearly two decades, based on
a processor model defined for the PDP-11. Linux's model is sophisticated,
and different.
One thing I realised this week (why it took me so long, I don't know)
is that when you take a breakpoint trap, interrupts are left enabled.
This means, during an FBT probe, the timer can fire and you have a nested
interrupt.
This week has seen me try to solve Nigel's Fedora Core issues where,
under load, the system would panic and reboot. The root cause here is nested
interrupts, and the dtrace module not segregating itself strongly enough
from the real kernel, allowing nested operation and deadlocks to arise.
I have to be very careful to get the key execution paths working in my
head (an fbt/breakpoint interrupt, a timer/tick interrupt - both standalone
and on top of an fbt/breakpoint interrupt, and the xcall/deadlock issue).
I also believe I have caught another implementation issue. Interestingly,
I believe Linux's ftrace mechanism suffers from the same issue. Namely,
when single stepping, one has to be careful of 64-bit instructions
which use %RIP relative addressing modes. Both dtrace and ftrace almost
do the right thing, but can fail if dtrace is more than 2GB away from
where the kernel is loaded, since a %RIP relative displacement is a signed
32-bit value. (I think this might explain some erratic behaviour
on large RAM machines).
I am not sure there is a cure for this (well, not a quick/easy one), and
I may have to disable those offending probes (typically only a handful on
a normal kernel - not a great loss).
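The check involved can be sketched as follows, assuming the instruction is single stepped from a copy placed somewhere other than its original address. The function name riprel_fixup and its parameters are hypothetical, not taken from dtrace or the Linux source. The target of a %RIP relative operand is the address of the next instruction plus a signed 32-bit displacement, so moving the instruction means the displacement must be shifted by the distance moved, and the result must still fit in 32 bits.

```c
#include <stdint.h>

/*
 * Hypothetical helper: recompute the displacement of a %RIP relative
 * instruction that was copied from orig_addr to copy_addr.
 *
 * target   = orig_addr + insn_len + disp
 * new_disp = target - (copy_addr + insn_len)
 *          = disp + (orig_addr - copy_addr)     (insn_len cancels)
 *
 * Returns 1 and stores the new displacement if it fits in a signed
 * 32-bit field, or 0 if the copy is too far away to reach the target.
 */
static int riprel_fixup(uint64_t orig_addr, uint64_t copy_addr,
                        int32_t disp, int32_t *new_disp)
{
    int64_t delta = (int64_t)(orig_addr - copy_addr);
    int64_t adj;

    /* Reject absurd distances first so the addition cannot overflow. */
    if (delta > (int64_t)UINT32_MAX || delta < -(int64_t)UINT32_MAX)
        return 0;

    adj = (int64_t)disp + delta;
    if (adj < INT32_MIN || adj > INT32_MAX)
        return 0;   /* unreachable: this probe site cannot be relocated */

    *new_disp = (int32_t)adj;
    return 1;
}
```

A zero return corresponds to the "offending probes" case: there is no displacement that works, so the probe either has to be disabled or the instruction emulated some other way.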
Anyway, let me continue experimenting with mutexes and see how close I
can get. (My current experiment is real good, except for problems when
kzalloc is called with a mutex held... let's see if I can fix that).