Saturday 5 November 2011

How fast is fast?

While digging deeper into the dtrace lockup on 3.x kernels, I have
been revisiting the xcall code.

Here's a test for you to try (on FreeBSD, MacOS, Solaris and Linux):


$ dtrace -n fbt:::


This will trace all the FBT probes (on my Linux VM, that's about 48000
probes). Time how long it takes for the output to start scrolling. (About 1-2s?)

Now ^C (twice; for some reason dtrace needs it twice).

How long until you get your shell prompt back? (On my Linux system,
it's down to 1-2s vs maybe 3-4s on a dual core MacOS box).

Why is it *slower* to exit than it is to invoke?

The answer is: IPIs, or interprocessor interrupts. Now, this may
be fast on Solaris - it's engineered nicely. On Mac and Linux at least,
it's not. It's really difficult to work out "why".

When you ^C dtrace, it has to tear down all the probes. In theory
this is easy, and faster than the initial construction. But the
Solaris/dtrace code has a nasty performance issue. For every probe,
dtrace_sync() is called three times, and each call involves
communicating with the N-1 other CPUs, each of which has to process an IPI.

This is what I emulate on Linux. But it's slow. My 48000 probes
involve nearly 200k IPIs to the N-1 processors. (I am
testing on a 4-CPU VM). And the IPI is either "delivered slowly" or
"received slowly" on the target CPUs.
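As a rough cross-check of those numbers, here is a back-of-envelope count in C. It assumes exactly three sync rounds per probe and that every round interrupts each of the other CPUs; the function names are made up for illustration and are not part of dtrace. Under that simple model, 48000 probes come to 144,000 sync rounds, the same order of magnitude as the figure above, and the per-CPU delivery count is N-1 times larger.

```c
#include <assert.h>

/* Sync rounds needed to tear down `probes` probes, under the assumed
 * model of a fixed number of dtrace_sync() calls per probe. */
static long sync_rounds(long probes, int syncs_per_probe)
{
    assert(probes >= 0 && syncs_per_probe >= 0);
    return probes * syncs_per_probe;
}

/* Interrupts delivered if every round reaches each of the other
 * (ncpus - 1) CPUs. This is an assumption, not a measurement. */
static long ipi_deliveries(long rounds, int ncpus)
{
    assert(ncpus >= 1);
    return rounds * (ncpus - 1);
}
```

For the 4-CPU VM, sync_rounds(48000, 3) gives 144000 and ipi_deliveries(144000, 4) gives 432000. The real count depends on how many probes actually take all three syncs.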

What is worse, far far worse, is the Linux 3.0.4 kernel I am
using in Ubuntu 11.10 (I compiled my own; the distro default is
3.0.12). If the teardown takes too long, the kernel may notice
a "hung" CPU and, after a minute or two, hang hard
due to the lack of responsiveness. (I will need to see
if I can find out how it knows the CPU is not idle, and maybe fool it).

I really dislike this 3.0 kernel - it's a very harsh environment
for a buggy driver to live in, and dtrace has to work even harder
to avoid being caught in the searchlight of the kernel.

I have a hack/optimisation for this problem, which is proving
rewarding: if the "other" CPU is not sitting inside a dtrace probe
handler, then we have nothing to do, so we can skip the IPI.
But the few lines of code I added are not bullet-proof.
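To make the idea concrete, here is a minimal user-space sketch in C of the skip decision. It is not the driver code: in_probe[], MAX_CPUS and sync_round() are hypothetical names, and all the function does is count how many IPIs one sync round would send with and without the shortcut.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_CPUS 64

/* Hypothetical per-CPU flag: true while that CPU is inside a probe handler. */
static bool in_probe[MAX_CPUS];

/* Count the IPIs one sync round from CPU `self` would send.
 * With skip_idle set, CPUs that are not in a probe handler are skipped. */
static int sync_round(int self, int ncpus, bool skip_idle)
{
    int sent = 0;

    assert(ncpus > 0 && ncpus <= MAX_CPUS);
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu == self)
            continue;           /* never interrupt ourselves */
        if (skip_idle && !in_probe[cpu])
            continue;           /* nothing to wait for on this CPU */
        sent++;                 /* a real driver would send the IPI here */
    }
    return sent;
}
```

On a 4-CPU system with only CPU 2 inside a handler, sync_round(0, 4, false) counts 3 IPIs and sync_round(0, 4, true) counts 1. A plain flag read like this can race with a CPU that enters a handler just after the check, which may be one reason a few-line version is not bullet-proof. A real implementation would presumably need atomics and memory barriers around the flag.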

How does Solaris handle this? Well, on Solaris, direct interrupt
disabling does not happen. Instead, a software processor-level flag
is set, and the interrupt handler can allow interrupts even if,
logically, the code in question does not want to be interrupted.
I believe that, by making direct dtrace checks, IPIs across CPUs
can happen even if the other CPU is in a critical section. I wish
I could prove my understanding of the code (but that would be a distraction).

Hm. Just took a look at Oracle's version:


void dtrace_xcall(processorid_t cpu, dtrace_xcall_t func, void *arg)
{
        if (cpu == DTRACE_CPUALL) {
                smp_call_function(func, arg, 1);
        } else
                smp_call_function_single(cpu, func, arg, 1);
}


All I can say is good luck to them if that's what they think is sufficient
to do the job. That's pretty much the code I implemented originally, and
it doesn't work.

Oh well.

Post created by CRiSP v10.0.17a-b6103

