Brendan Gregg has a good write-up of dtrace vs SystemTap, here:
I have been thinking about systemtap and ftrace, and stumbled
across a peculiarity in the performance game.
Solaris/dtrace has very little code devoted to placing or removing
probes. On x86, a probe translates into a single-byte breakpoint
instruction. Dropping this on top of an instruction is effectively
atomic (forget about barriers and all the other SMP complexities...for
a moment).
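As a rough sketch of what "dropping a breakpoint" amounts to, the following C fragment patches one byte in an ordinary buffer that stands in for kernel text. The struct and function names are made up for illustration and are not taken from the dtrace sources; real code would also have to deal with page protections and the SMP issues set aside above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BREAKPOINT_OPCODE 0xCC  /* x86 int3, exactly one byte long */

/* Hypothetical probe record: where the probe sits and what it replaced. */
struct probe {
    volatile uint8_t *addr;  /* first byte of the patched instruction */
    uint8_t saved_byte;      /* original opcode byte, needed later */
};

/*
 * Place the probe: remember the original first byte, then overwrite it
 * with int3. The patch is a single byte store, so another CPU sees either
 * the old byte or the new one, never a half-written mixture.
 */
static void probe_place(struct probe *p, volatile uint8_t *addr)
{
    p->addr = addr;
    p->saved_byte = *addr;
    *addr = BREAKPOINT_OPCODE;
}
```

Only the first byte of the instruction changes; the remaining bytes are left alone, which is why the saved byte is all that is needed to put things back.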
But in diagnosing issues in Linux/dtrace, I had to look at ftrace to
check my sanity.
On x86, you can safely write self-modifying code. For future portability,
Intel recommends specific instruction sequences to avoid problems with
other CPUs. The term "cross-modifying code" is used for the case where one
CPU modifies instructions which another CPU might be executing, or which
might already be sitting in its i-cache.
Now, in general, replacing an N-byte instruction with an M-byte one, where
N != M, is a problem, particularly when M > 1 or M > N: the write is no
longer a single byte store, and it may spill into the following
instruction. Atomically modifying memory which might be executed by
another CPU is "hard".
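To make the "hard" part concrete, here is a small model, not real patching code. It assumes, pessimistically, that a multi-byte patch becomes visible one byte at a time, and counts the intermediate states that match neither the old nor the new instruction bytes. The function name and the 16-byte limit are assumptions for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy `len` bytes of `new_code` over `text` one byte at a time and count
 * how many intermediate states match neither the old nor the new bytes.
 * Another CPU fetching during one of those states would decode a mixture
 * of the two instructions.
 */
static int count_torn_states(uint8_t *text, const uint8_t *new_code, size_t len)
{
    uint8_t old_code[16];
    int torn = 0;

    assert(len <= sizeof(old_code));
    memcpy(old_code, text, len);

    for (size_t i = 0; i < len; i++) {
        text[i] = new_code[i];
        if (memcmp(text, old_code, len) != 0 &&
            memcmp(text, new_code, len) != 0)
            torn++;
    }
    return torn;
}
```

In this model a one-byte patch has no torn states at all, while a five-byte patch in which every byte differs has four. That is the gap between the breakpoint trick and the general case.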
Linux 3.x and the ftrace code honor the Intel recommendation. They
*stop* all the other CPUs whilst making these probe changes and
use NMI interrupts to implement a hard locking region. Neat trick.
But very costly.
Solaris/dtrace doesn't. Neither does Linux/dtrace. We just "plop" our breakpoint
wherever we see fit and the CPU does the rest of the work.
This is great. But Solaris/dtrace has a problem which they haven't
noticed yet. In Linux/dtrace, I provide the instruction provider (INSTR).
This provides additional ways to drop probes on "interesting instructions",
and there is a chance that the user could place a probe on the same
address via both FBT and INSTR.
So, when a breakpoint is hit, which one wins? On Solaris, I don't think
this can happen. It could in Linux, and at the moment, FBT will win
and INSTR won't get a chance to handle the probe.
That's not so much of a problem.
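The "FBT wins" behaviour falls out of any first-match dispatch scheme. The sketch below is an assumption about the general shape, not the actual Linux/dtrace trap path: the hook type, the two demo address variables and the function names are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical provider hook: returns 1 if the provider owns this trap. */
typedef int (*trap_hook)(uintptr_t addr);

static uintptr_t fbt_probe_addr;    /* address carrying an FBT probe (demo state) */
static uintptr_t instr_probe_addr;  /* address carrying an INSTR probe (demo state) */

static int fbt_hook(uintptr_t addr)   { return addr == fbt_probe_addr; }
static int instr_hook(uintptr_t addr) { return addr == instr_probe_addr; }

/* Providers are asked in this order, so FBT always gets first refusal. */
static const trap_hook hooks[] = { fbt_hook, instr_hook };
static const char *const hook_names[] = { "fbt", "instr" };

/* Return the name of the provider that claims the trap, or NULL if none. */
static const char *dispatch_trap(uintptr_t addr)
{
    for (size_t i = 0; i < sizeof(hooks) / sizeof(hooks[0]); i++) {
        if (hooks[i](addr))
            return hook_names[i];  /* first claim wins; later hooks never run */
    }
    return NULL;
}
```

With both providers registered on the same address, `dispatch_trap` returns "fbt" and the INSTR hook is never consulted, which is the situation described above.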
But consider this: how do you remove a probe? Well, you
just overwrite the instruction where you placed a breakpoint with
the original opcode byte.
There. Done. Nice. Neat.
Solaris/FBT does not have a separate notion of a probe being enabled or not.
Well, it does...but it uses the breakpoint byte itself to indicate whether
the probe is set. In fact, Solaris/FBT doesn't really care.
So consider this: when an FBT probe is disabled by overwriting the
instruction, another CPU which has yet to see the modified byte may
still execute the breakpoint trap, even though the probe has been
undone. So now another CPU fires an FBT probe which was disabled.
And Solaris/FBT lets it happen. It blindly fires the probe.
But, at this time, *nobody is listening*. This is fine, but a probe
is potentially firing when it should not.
The reason is that Solaris doesn't flush the other CPUs' i-caches.
I noticed this on Linux: probes were firing even after dtrace had
terminated, because I was tracking the enabled/disabled state. This
caused a problem. If FBT thinks the probe cannot fire, then it won't
intercept the trap. And if this happens, then we have a breakpoint
trap which the rest of dtrace won't know how to handle: we
won't know what the original byte of the instruction was.
I had to disable this feature and allow FBT to process probe traps
even if the probe points had been disabled.
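The difference between the two behaviours can be sketched as follows. This is a simplified model with invented names (`fbt_probe`, `fbt_trap`, the `check_enabled` flag), not the real FBT handler: the "strict" variant refuses traps for disabled probes, while the "lenient" one claims any trap at a known probe address so the original opcode byte can still be recovered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BREAKPOINT_OPCODE 0xCC  /* x86 int3 */

/* Hypothetical FBT probe record. */
struct fbt_probe {
    volatile uint8_t *addr;  /* patched instruction address */
    uint8_t saved_byte;      /* original first opcode byte */
    int enabled;             /* software view of the probe state */
};

static void fbt_enable(struct fbt_probe *p)
{
    p->saved_byte = *p->addr;
    p->enabled = 1;
    *p->addr = BREAKPOINT_OPCODE;
}

static void fbt_disable(struct fbt_probe *p)
{
    *p->addr = p->saved_byte;  /* other CPUs may not see this store yet */
    p->enabled = 0;
}

/*
 * Breakpoint handler. Returns 1 and stores the original opcode byte in
 * *opcode if the trap is claimed, 0 if it is left unhandled.
 * check_enabled != 0 models the strict variant that caused trouble;
 * check_enabled == 0 models claiming the trap by address alone.
 */
static int fbt_trap(const struct fbt_probe *p, const volatile uint8_t *trap_addr,
                    int check_enabled, uint8_t *opcode)
{
    if (trap_addr != p->addr)
        return 0;
    if (check_enabled && !p->enabled)
        return 0;  /* late trap goes unclaimed; original byte is unknown */
    *opcode = p->saved_byte;
    return 1;
}
```

A late trap arriving after `fbt_disable` is dropped by the strict variant and claimed by the lenient one, which still has the saved byte to hand so the interrupted instruction can be completed.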
Which all comes down to the fact that cross-modifying code is
extremely intrusive, but you can get away with it, until one day,
you might not. ("One day" might mean on a different CPU architecture
or some future Intel chips).
And finally: if Solaris had done the "right thing" it might be very
slow at placing lots of traps or removing them. And this might account
for ftrace losing some aspect of performance compared to dtrace.
BTW the current latest release of dtrace is proving remarkably resilient.
I am up to 5.3 billion probes running the torture tests, on real hardware,
and no problems so far.