worthwhile to give a brief update of what is annoying me.
The most annoying thing in dtrace at the moment is me. I have spent
the last two months trying hard to resolve some resilience issues.
At the moment, there are two of them: (1) the xcall code, and
(2) something to do with syscall tracing.
As related in prior blog posts, dtrace_xcall() (which does inter-cpu
synchronisation) doesn't map to Linux very well. On
Solaris, the inter-cpu calling code works from interrupt context,
but we cannot do that in Linux. (Linux will write a warning to
the /var/log/messages file when this happens - although it does mostly
I have tried a number of variants of a native IPI system in dtrace and
they have failed with various problems. The biggest problem is that
on an SMP system, cpu#1 will invoke a call to cpu#2 but cpu#2 won't respond
until cpu#1 finishes the xcall (a deadlock). In the code, I have
resolved the deadlock by giving up after a suitable period of time,
but that's not good enough. Trying to find out what cpu#2 is doing
when it refuses to respond to the interrupt is very tricky. Various
ad-hoc debug tricks (like using the native smp_call_function() to dump
stacks) failed. Additionally, the ordering of messages
written to /var/log/messages is horrendous when I am running my implementation
of xcall - the cpus write out of order, with timestamps going backwards.
(I can understand why, but it doesn't help).
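To illustrate the "give up after a suitable period" workaround described
above, here is a minimal user-space sketch in C. It is not the actual dtrace
code: the names xcall_slot, xcall_post(), xcall_ack() and xcall_wait(), and
the idea of counting polls rather than measuring time, are all assumptions
made for illustration. A real kernel version would also send the IPI and
would probably relax the cpu between polls.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cpu mailbox: the caller sets `pending`,
 * the target cpu clears it once it has run the request. */
struct xcall_slot {
    atomic_bool pending;
};

/* Caller side: mark a request as outstanding.
 * (A real implementation would also raise an IPI here.) */
static void xcall_post(struct xcall_slot *slot)
{
    assert(slot != NULL);
    atomic_store_explicit(&slot->pending, true, memory_order_release);
}

/* Target side: acknowledge that the request has been handled. */
static void xcall_ack(struct xcall_slot *slot)
{
    assert(slot != NULL);
    atomic_store_explicit(&slot->pending, false, memory_order_release);
}

/* Caller side: poll at most max_spins times for the acknowledgement.
 * Returns true if the target responded, false if we gave up. */
static bool xcall_wait(struct xcall_slot *slot, unsigned long max_spins)
{
    assert(slot != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        if (!atomic_load_explicit(&slot->pending, memory_order_acquire))
            return true;
    }
    return false; /* timed out: avoids the deadlock, but the xcall was lost */
}
```

The false return is exactly the "not good enough" case: the caller escapes
the deadlock, but it has no guarantee that the other cpu ever ran the
requested function.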
I have given up on resolving the SMP cross-call issue, and instead have
been trying something different. The only place where the
xcall issue is a problem is the profile/tick provider hr_timer clock
interrupts. So I have modified the code to use a tasklet structure instead.
This seems to work (I have a race condition to fix before I
can release it).
But, during all this testing, I hit another strange and annoying scenario when running:
$ dtrace -n syscall:::
and doing intensive things in another window, like:
$ while true ; do date ; done
date: error while loading shared libraries: /lib64/ld-linux-x86-64.so.2: cannot apply additional memory protection after relocation: Error 9
Very occasionally, a system barfs. I have seen the output from dtrace
hang (it hangs until I press a key on the keyboard). I have tracked
this down: when a write() syscall is being executed, it's being
turned into a read() syscall.
The event is very rare - 1 in hundreds of thousands of syscalls, but it's
horrible. And it's *this* problem which is likely what prompted me to go
on the xcall wild-goose chase. The "make test" regression suite is very
good at pushing the cpu load to the max whilst doing dtrace things, but
it would occasionally have issues.
So, if I can chase the 1:100,000 issue in syscall tracing, then I can
move forward. (I suspect a timer interrupt coming in during a syscall might
be causing the issue).
As always, I will release the code when I feel it's better than where