Tuesday, 27 November 2012

The Case for a Safe XCall

I've written about this before, and it's time to write again:
dtrace_xcall(). This is the function DTrace uses on multiprocessor
systems to sync the CPUs so that they can agree, e.g., on
which buffer to use when logging the traced probes.

dtrace_xcall() lives in the driver. On Linux,
it maps to smp_call_function() and friends. A CPU cross-call is
an interesting concept: the ability for one CPU to make a function
call on another CPU. It is rarely needed, and if it is done in a
way that breaks the calling protocol, you can lock up one
or more CPUs or crash the system.

On Linux, the cross-call (or IPI, inter-processor interrupt) can
be seen by examining /proc/interrupts and looking for the CAL line:

$ cat /proc/interrupts
CAL: 52770 36972 Function call interrupts

My system has had many hours of uptime, yet the calls are rare (the line
above shows the call count for each CPU).
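
To watch these counters from a program rather than by eye, the CAL line
can be pulled apart as below. This is only a sketch: parse_cal_line() and
read_cal_counts() are names I made up for illustration, and it assumes
the line looks like the sample above (a "CAL:" label, one count per CPU,
then a text description).

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one line of /proc/interrupts. If it is the "CAL:" line, store
 * up to max per-CPU counts in counts[] and return how many were found.
 * Returns -1 if the line is not the CAL line.
 */
int parse_cal_line(const char *line, unsigned long *counts, int max)
{
    while (isspace((unsigned char)*line))
        line++;
    if (strncmp(line, "CAL:", 4) != 0)
        return -1;
    line += 4;

    int n = 0;
    while (n < max) {
        char *end;
        unsigned long v = strtoul(line, &end, 10);
        if (end == line)    /* no number here: the description follows */
            break;
        counts[n++] = v;
        line = end;
    }
    return n;
}

/* Scan a file (normally /proc/interrupts) for the CAL line. */
int read_cal_counts(const char *path, unsigned long *counts, int max)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    char buf[4096];
    int n = -1;
    while (n < 0 && fgets(buf, sizeof buf, fp) != NULL)
        n = parse_cal_line(buf, counts, max);
    fclose(fp);
    return n;
}
```

Calling read_cal_counts("/proc/interrupts", counts, 64) twice, a second
apart, and subtracting gives a rough cross-call rate per CPU.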

When dtrace is called on to do a heavy action, like:

$ dtrace -n fbt:::

tens or hundreds of thousands of probes may be collected per second.
DTrace has two internal buffers to log these probes, so the buffers
fill up quickly and the IPI xcall code is invoked very
frequently.

I've been pondering how this actually works - both on Linux and Solaris.

Let's try a thought experiment:

Imagine a dual-CPU system. One CPU is sitting inside a locked region,
with interrupts disabled. The other CPU is trying to access the same lock/region.
This other CPU is blocked until the first CPU exits the lock.

Now, imagine this again. This time, the first CPU holds the lock
for a very long time. This would block the other CPU indefinitely.
Normally this is rare: the kernel arranges never to hold locks for
long periods of time.

Now, let's modify this scenario. One CPU is holding on to a lock, interrupts
disabled, and it does an IPI cross-call. The other CPU is holding on to a
different lock and also has interrupts disabled. The first CPU cannot interrupt
the other CPU and so we deadlock. In normal scenarios, this mutual deadlock
cannot happen (other than through bugs in the kernel or drivers).

IPI interrupts are just like normal interrupts: they are held pending
while interrupts are disabled, and processed when interrupts are re-enabled.
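
The scenario can be modelled in a few lines of user-space C. This is a
toy model, not kernel code: struct cpu and the functions below are
invented for illustration, and the deadlock case assumes each CPU is
spinning on a cross-call to the other while both have interrupts off.

```c
#include <stdbool.h>

struct cpu {
    bool irqs_on;      /* false inside an interrupts-disabled region */
    bool ipi_pending;  /* an IPI has been raised but not yet serviced */
    bool spinning;     /* stuck waiting for a cross-call acknowledgement */
};

/* Service a pending IPI, but only if this CPU accepts interrupts. */
static void service_ipi(struct cpu *c)
{
    if (c->irqs_on && c->ipi_pending)
        c->ipi_pending = false;
}

/* Begin a synchronous cross-call: raise the IPI, then wait for it. */
void xcall_start(struct cpu *src, struct cpu *dst)
{
    dst->ipi_pending = true;
    src->spinning = true;
    service_ipi(dst);
}

/* Poll: has the target serviced the IPI, so the caller can stop waiting? */
bool xcall_done(struct cpu *src, struct cpu *dst)
{
    if (!dst->ipi_pending)
        src->spinning = false;
    return !src->spinning;
}

/*
 * Leave the interrupts-disabled region. A CPU that is still spinning in
 * a cross-call never reaches this point, so the call fails in that case.
 */
bool irqs_enable(struct cpu *c)
{
    if (c->spinning)
        return false;
    c->irqs_on = true;
    service_ipi(c);
    return true;
}
```

With one target that has interrupts disabled, xcall_done() stays false
until irqs_enable() is called on the target. With two CPUs that both
have interrupts off and both call xcall_start() on each other, neither
xcall_done() nor irqs_enable() can ever succeed: that is the deadlock.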

The kernel smp_call_function() call has a contract: it must not be
called with interrupts disabled. Doing so generates a kernel
log/BUG warning, and indicates the kernel could deadlock.

When we use DTrace, we can place a probe on any function in the kernel,
including functions which run with interrupts disabled. This means
we break the contract.

(I note Oracle UEL Linux DTrace simply calls smp_call_function() and
suffers the bug, unless they have fixed it.) In my DTrace, I take
steps to avoid calling the Linux smp_call_function() and implement my
own. It *seems* to work.

Whilst examining Xen, I had great difficulty finding a way
to avoid smp_call_function() so someone invoking fbt probes
with interrupts enabled can cause deadlocks or long live locks. (DTrace
will detect a mutual deadlock and break the lock, but this is horrible
and can panic the kernel in some extreme circumstances).

DTrace is supposed to be reliable and the above behavior is horrible.
Simply *HORRIBLE*.

I have a (new) workaround...see below.

But why doesn't Solaris have this problem? Well, the Solaris xcall code
is intimate with the kernel interrupt code, and a CPU waiting in an xcall
runs with interrupts enabled. (The whole Solaris/BSD kernel uses interrupt
priorities to allow much of the kernel to run with interrupts enabled,
and even when interrupts are disabled, deadlocks cannot occur.)

I wish I understood the above paragraph more, but experiments and
user success stories demonstrate that Solaris has no deadlock issue.

Ok, so the solution for Linux.

If you followed the above carefully, you will note the problem
is caused if we try to do a cross-call whilst interrupts are disabled.
And interrupts are disabled either when probing a function in
interrupt handler code, or inside a locked region.

So, let's disable probes whilst interrupts are disabled. If we did
this, then the kernel should be safe for fbt::*: probes and never
deadlock. Most interesting scenarios are in the non-locked kernel regions.

But that's bizarre! How could we do this? That defeats one of the
deep probing aspects of DTrace.

Well, to resolve this conflict of interest, we can have DTrace run
in 'safe' mode by default, and when the user wants to remove this safety
barrier, they can do so by sending a message to the driver.

And this is what I am going to do for the next release.
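
As a rough sketch of the idea (not the actual driver code), the check
could look like this. The names dtrace_safe_mode, dtrace_set_safe_mode()
and probe_may_fire() are hypothetical; in the kernel, the irqs_off
argument would come from something like irqs_disabled() at the probe
site, and the setter stands in for the message sent to the driver.

```c
#include <stdbool.h>

/* Hypothetical driver state: safe mode is on by default. */
static bool dtrace_safe_mode = true;

/* Hypothetical control hook: the user turns the safety barrier on or off. */
void dtrace_set_safe_mode(bool on)
{
    dtrace_safe_mode = on;
}

/*
 * Decide whether a probe is allowed to fire.
 * irqs_off: true if the probed code is running with interrupts disabled.
 */
bool probe_may_fire(bool irqs_off)
{
    if (dtrace_safe_mode && irqs_off)
        return false;   /* safe mode: skip probes in interrupts-off code */
    return true;        /* interrupts enabled, or the user opted out */
}
```

The cost of the default is that probes inside interrupt handlers and
locked regions are silently skipped until the user opts out.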


1 comment:

  1. Paul, Good to see you are looking at optimizing the xcalls on Linux again. It will be interesting to see the results of your latest experiment with this.