Friday, 20 May 2011

Dtrace progress - update on xcall

I have spent the last few weeks on trying to perfect the dtrace_xcall
emulation. Its quite possible I have been wasting my time and/or
looking at the wrong problem and solution.

What is dtrace_xcall? Very occasionally, dtrace needs to broadcast to the
other CPUs to ensure they are in sync. Typically this happens when
shutting down the user space application as the trace buffers need to be
dropped, but happens under other scenarios.

Solaris provides a cross cpu function mechanism, as does Linux.
But they do not work the same.

On later Linux kernels, you may see a BUG warning in the
kernel logs, like this:

[ 245.215564] WARNING: at kernel/smp.c:421 smp_call_function_many+0x69/0x1b9()

This is the kernel being nice, and warning that the smp_call_function_many()
function is being abused -- specifically being called from an interrupt

The problem is to do with locking and rescheduling should delays happen.
At worst, this can lead to deadlock and a hang of the system you are using.

In practise, this problem is rare - but that is not good enough.
Dtrace needs to be rock solid. As part of the distro, I include a
torture test ("make test") which tries various things, whilst putting
excessive load on the system (forked processes, opening files and doing
things which I have found to cause problems on earlier dtrace releases).

The test runs well, despite the problem described above, but
very occasionally, one sees horrible "failed to open shlib" type errors
as Linux applications are spawned - rare, but enough to demonstrate we
are doing something wrong and/or have race and locking problems.

In reviewing carefully the Solaris code, and the Linux code, I have
tried about 3 algorithms to resolve the problem. The latest code
is good, and is a fairly close emulation of what Solaris does, but
it is difficult to prove correctness. In this latest code, the
driver allocates its own distinct interrupt vector, and uses this to
send an inter-cpu interrupt (IPI), whilst spin locking waiting for the
other CPUs to complete.

It works real well, except occasionally - an attempt to raise an interrupt
on the other cpus fails, and we have deadlock. I put some counters
and delays into the code and some "get out of jail" logic to avoid
total machine hang, but thats not good enough. I have tried various
other hacks, but it is demonstrable that we ask an interrupt be fired
and sometimes it never arrives. This, I fear is me not fully understanding
how the APIC works or having interference with other interrupt sources.

But an important question in all of this is why do we get
that BUG warning, as illustrated above?

The reason is the timer interrupt. If you use the tick/profile probes to
get periodic probes (and even if you do not), the clock will fire
and occasionally this will happen whilst a probe is being serviced.
(You get dtrace timer interrupts even without the use of the
profile provider, since dtrace internally uses this for deadlock or
lack of system responsiveness detection). The more probes you fire
the higher the chance of collision.

Add to this the complexity of a multicpu system, and debugging is very
difficult. One of the most difficult things to do is to see what another
CPU is doing from another. Its easy to litter the driver code
with printk() tracing, and to generate a stack trace when interesting events
happen - but doing this for another cpu to see what/where it is stuck is
not easy (or, at least, I havent found simple code to do it - I am sure
its doable, since the kernel provides support via SYSRQ to do this, but,
possible using the same smp_call_function_many calls we are trying to debug).

But - back to the timer interrupt -- should they even be happening?

I am about to research this: dtrace_probe() disables interrupts during
operation, so the CPU invoking this function cannot have a reentrancy problem.

But the cyclic timer code eventually invokes the dtrace_xcall function:

[ 245.215564] <IRQ> [<ffffffff8104d9ad>] warn_slowpath_common+0x85/0x9d
[ 245.215564] [<ffffffffa02472d6>] ? dtrace_sync_func+0x0/0xb [dtracedrv]
[ 245.215564] [<ffffffffa02472d6>] ? dtrace_sync_func+0x0/0xb [dtracedrv]
[ 245.215564] [<ffffffff8104d9df>] warn_slowpath_null+0x1a/0x1c
[ 245.215564] [<ffffffff81077c34>] smp_call_function_many+0x69/0x1b9
[ 245.215564] [<ffffffffa02472d6>] ? dtrace_sync_func+0x0/0xb [dtracedrv]
[ 245.215564] [<ffffffff81077da6>] smp_call_function+0x22/0x26
[ 245.215564] [<ffffffffa025575d>] orig_dtrace_xcall+0x35/0x4b [dtracedrv]
[ 245.215564] [<ffffffffa0255a85>] dtrace_xcall+0xe/0x1b [dtracedrv]
[ 245.215564] [<ffffffffa0248c9f>] dtrace_sync+0x1a/0x1c [dtracedrv]
[ 245.215564] [<ffffffffa022f9cc>] dtrace_state_deadman+0x46/0x89 [dtracedrv]
[ 245.215564] [<ffffffffa022c875>] be_callback+0x1d/0x2f [dtracedrv]
[ 245.215564] [<ffffffff810692ed>] __run_hrtimer+0xbb/0x143
[ 245.215564] [<ffffffffa022c858>] ? be_callback+0x0/0x2f [dtracedrv]
[ 245.215564] [<ffffffff81069aa6>] hrtimer_interrupt+0xd4/0x1b3
[ 245.215564] [<ffffffff8146ff95>] smp_apic_timer_interrupt+0x79/0x8c
[ 245.215564] [<ffffffff8100a693>] apic_timer_interrupt+0x13/0x20
[ 245.215564] <EOI> [<ffffffff81397010>] ? read_pmtmr+0x10/0x17
[ 245.215564] [<ffffffff81397025>] acpi_pm_read+0xe/0x12
[ 245.215564] [<ffffffff8106dce9>] timekeeping_get_ns+0x1b/0x3d
[ 245.215564] [<ffffffff8106e1ce>] getnstimeofday+0x54/0x89
[ 245.215564] [<ffffffff81065f2c>] sys_clock_gettime+0x61/0x90
[ 245.215564] [<ffffffff81009cf2>] system_call_fastpath+0x16/0x1b

So, now I question whether the invocation of a timer interrupt
should invoke a direct call into dtrace, or whether this should be
queued up and deferred. I am 50:50 on this - deferring would imply
complication which doesnt appear in the dtrace code. Its possible
that in the non-dtrace Solaris code, this scenario is handled in the interrupt
handlers - maybe some form of decouple of the interrupt from the
probe firing. (I cant figure out if/how that can be done, without
the kernel doing occasional checks).

It is possible that Solaris (and MacOS) are both broken and that
it is possible for the system to be deadlocked, but I havent
found evidence this can happen - so, somehow, this is resolved.

(Note that dtrace linux takes some cheap shortcuts in the hrtimer
code and avoids some of the complexity of the cycle timer code - I havent
understood enough of the details to see what it does and if we need it
in Linux; maybe eventually for accurate and non-skewable clock sources).

Before I release a new dtrace, I need to decide what to do with some
of the experimental code (presently, there is missing code on the i386
side, and issues with older kernel compilations, along with a bit of
dirtyness in stealing an interrupt rather than properly allocating one
via the kernel APIs).

If anyone is still reading this, you can occasionally pick up private
beta releases (dtrace-tmp.tar.bz2) in the dtrace ftp dir, but I dont
advise touching these - as the code will be in an indeterminate state
along with too much debugging enabled.

Post created by CRiSP v10.0.10a-b6012

No comments:

Post a Comment