Wednesday 13 April 2011

smp_call_function and friends

Dtrace has one reliability bug in it, and it affects multicore cpus.
I have tried hard to get the reliability up in recent releases, torture
testing the code on various Linux platforms - many survive very well, some
don't.

The torture test involves use of copyinstr() against an argument which is
not a string pointer. Good ones to test are fbt:::return, where the
argN parameters are whatever is left on the stack or in registers.
Dereferencing these leads to kernel level GPFs or page faults.
Dtrace should be resilient to these.

Yet, it still can crash.

Some kernels are better at handling this than others - the later releases
generally have better interrupt handling routines or BUG_ON warnings
which log a message when a driver breaks a programming contract.

Which brings us to smp_call_function and its friends. From day 1, these
functions caused me pain; fortunately, Linux provides a set of these
functions, very similar to those in Solaris, so we won, handsomely.

Or didn't.

What are they? On a multicpu system, it is sometimes necessary for
drivers to invoke synchronisation barriers on other cpus. Typically
an SMP kernel and driver will utilise arrays of structures - one per cpu,
so that each cpu can process work independent of what another cpu is doing.
In the case of dtrace, we may be handling an fbt trap on one cpu, whilst
another is doing a system call. As the number of cpus goes up, so does
the number of permutations of user code, kernel code, drivers, shared
structures, etc.

What does dtrace do to warrant inter-cpu function calls?

Well, dtrace, the user space program, works by periodically polling
the driver for trace information. Each cpu utilises a double-buffering
approach for tracing: traps and probes are recorded in an event buffer.
When the buffer is full, it can switch to an alternate buffer. When
bin/dtrace asks for a buffer dump, the buffer is emptied, and dtrace
asks the kernel driver to switch to the alternate buffer - just like
video double buffering.
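
As a rough illustration of the double-buffering idea, here is a
userspace sketch in C. It is not the dtrace driver code: the names
(cpu_bufs, trace_record, trace_switch), the CPU count and the buffer
size are all made up for the example, and a full buffer simply rejects
the record rather than switching by itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NCPU     4    /* hypothetical CPU count for this sketch */
#define BUF_SIZE 64   /* hypothetical capacity of one buffer, in bytes */

/* One trace buffer: raw bytes plus how many of them are in use. */
struct trace_buf {
    char   data[BUF_SIZE];
    size_t used;
};

/* Per-CPU state: two buffers, and the index of the one being filled. */
struct cpu_bufs {
    struct trace_buf buf[2];
    int              active;
};

static struct cpu_bufs cpus[NCPU];

/* Append a record to the active buffer of `cpu`.
 * Returns 0 on success, -1 if the record does not fit. */
static int trace_record(int cpu, const void *rec, size_t len)
{
    struct trace_buf *b = &cpus[cpu].buf[cpus[cpu].active];

    if (len > BUF_SIZE - b->used)
        return -1;
    memcpy(b->data + b->used, rec, len);
    b->used += len;
    return 0;
}

/* Flip `cpu` to its alternate buffer and return the retired one,
 * which the consumer can now read without racing new records. */
static struct trace_buf *trace_switch(int cpu)
{
    struct cpu_bufs *c = &cpus[cpu];
    struct trace_buf *retired = &c->buf[c->active];

    c->active ^= 1;                 /* 0 -> 1 or 1 -> 0 */
    c->buf[c->active].used = 0;     /* start the new buffer empty */
    return retired;
}
```

The sketch is only safe if trace_switch() runs on the cpu that owns the
buffers (or while that cpu cannot be recording), which is exactly the
problem the rest of this post is about.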

A single bin/dtrace is effectively asking for data from all cpus.
So, the cpu which takes the ioctl() to ask for the buffer dump has
to tell the other cpus to "empty their buffers". It could do this
by swizzling the pointers in the per-cpu buffer, but this is dangerous -
the other cpu may be executing code which is using them. You could
disable interrupts and lock the other cpus out with a normal type of
mutex, but the number of places which would need to be littered with
that mutex can be high. Instead, because a buffer switch is so rare, the
smp_call_function functions are invoked to ask each CPU in turn to
do an action on behalf of the invoker. It's quite elegant.

This is all based around the APIC implementation on the Intel/AMD
cpus, which provides a mechanism to send an inter-processor interrupt
(IPI) to another cpu. The target cpu takes the interrupt (assuming
interrupts are not presently disabled on that cpu), performs the task,
and exits the interrupt.
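
To make the "assuming interrupts are not presently disabled" caveat
concrete, here is a single-threaded userspace model in C of a cross-cpu
call. Nothing here is kernel API: sim_cpu, sim_send_ipi and sim_deliver
are hypothetical names, and a real APIC queues and delivers interrupts
in hardware rather than through a function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NCPU 4   /* hypothetical CPU count for this sketch */

typedef void (*xcall_fn)(void *arg);

/* A simulated cpu: its interrupt-enable state and one pending request. */
struct sim_cpu {
    bool     irqs_enabled;
    xcall_fn pending_fn;
    void    *pending_arg;
};

static struct sim_cpu sim_cpus[NCPU];

/* Post a request to `target`. Returns false if one is already pending. */
static bool sim_send_ipi(int target, xcall_fn fn, void *arg)
{
    struct sim_cpu *c = &sim_cpus[target];

    if (c->pending_fn != NULL)
        return false;
    c->pending_fn = fn;
    c->pending_arg = arg;
    return true;
}

/* Try to deliver the pending request on `target`. The handler runs
 * only if the target has interrupts enabled; otherwise the request
 * stays pending. Returns true if the handler ran. */
static bool sim_deliver(int target)
{
    struct sim_cpu *c = &sim_cpus[target];

    if (!c->irqs_enabled || c->pending_fn == NULL)
        return false;

    xcall_fn fn = c->pending_fn;
    void *arg = c->pending_arg;

    c->pending_fn = NULL;
    c->pending_arg = NULL;
    fn(arg);
    return true;
}

/* Example handler: increment the counter the invoker passed in. */
static void bump(void *arg)
{
    ++*(int *)arg;
}
```

While the target has interrupts disabled, the request just sits there.
An invoker which waits for completion waits too, and if the invoker is
itself holding off interrupts, two cpus can end up waiting on each other.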

The Solaris and Linux implementations are similar, but different. That
difference hurts.

Ok, so now to the tricky bit. Consider what a cpu is doing at any
point in time. It could be in user space, or kernel space. In kernel
space, interrupts may be enabled or disabled.

Let's consider the kernel running with interrupts enabled - another cpu
invokes smp_call_function; the first cpu takes the interrupt and returns.

Now, Linux has a programming contract: when invoking the function, we
must honor the following (taken from comments in kernel/smp.c):


* You must not call this function with disabled interrupts or from a
* hardware interrupt handler or from a bottom half handler. Preemption
* must be disabled when calling this function.


It is saying that the invoker of the function cannot be an interrupt
handler. Guess where (my) dtrace implementation invokes the
smp_call_function functions from? The hrtimer code. A timer callback,
by definition, runs in interrupt context. So, because we have broken
this contract, profile tick probes can interrupt the kernel when they
should not, leading to either kernel warnings or kernel deadlocks.
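
The contract can be written down as a simple predicate. The following
is a userspace sketch in C, not kernel code: struct call_ctx and
xcall_contract_ok are hypothetical names, and the four flags stand in
for state which the real kernel tracks itself (it has its own helpers,
such as irqs_disabled() and in_interrupt(), which this sketch does not
use).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the context a caller is running in. */
struct call_ctx {
    bool irqs_disabled;     /* interrupts are off on this cpu */
    bool in_hardirq;        /* inside a hardware interrupt handler */
    bool in_bottom_half;    /* inside a bottom half handler */
    bool preempt_disabled;  /* preemption is disabled */
};

/* Returns true if the quoted contract from kernel/smp.c is honored:
 * interrupts enabled, not in a hardware interrupt or bottom half
 * handler, and preemption disabled. */
static bool xcall_contract_ok(const struct call_ctx *ctx)
{
    if (ctx->irqs_disabled)
        return false;
    if (ctx->in_hardirq)
        return false;
    if (ctx->in_bottom_half)
        return false;
    return ctx->preempt_disabled;
}
```

An ioctl() path which has disabled preemption passes the check; a timer
callback running as a hardware interrupt handler does not, which is the
situation the profile tick probes are in.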

I recently attempted to avoid recursive dtrace probes, but if we are
not careful, we will lose timer tick probes, which can break
scripts. (My own regression test, which should terminate after 5s, never
terminates because the tick it needs is discarded).

So, we need to fix this problem. The SMP function calls rely on a fair
amount of fabric to control the APICs and other cpu register masks, so
it's not as simple as reimplementing the code to shield it from
kernel artefacts.

So, I need to come up with a plan.


Post created by CRiSP v10.0.5a-b5969

