I've been looking at profiling systems on Linux and other OSes. It's
an interesting landscape. With the advent of ever more powerful
CPUs over the last decade, along with multicore and the impact of
cache misses, it's necessary for people to have low-level tools
to do performance analysis.
Performance measuring is a large topic, and I can only cover it briefly
here. Statistical sampling (similar to classic Unix "prof" and "gprof")
is great for weeding out hot spots in code. The first time you profile,
it's easy to quickly find areas to optimise.
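
To make the idea of statistical sampling concrete, here is a minimal
sketch in Python (not how prof or gprof are implemented). It assumes a
Unix-like system and must run in the main thread; the names
sample_profile, hot_loop, cheap and workload are made up for
illustration. A CPU-time timer fires periodically, and each tick is
charged to whichever function happened to be running.

```python
import collections
import signal
import time


def sample_profile(func, interval=0.001):
    """Run func() and count which function is executing on each timer tick."""
    hits = collections.Counter()

    def on_tick(signum, frame):
        # 'frame' is the Python frame that was executing when the signal
        # was handled; charge one sample to that function.
        if frame is not None:
            hits[frame.f_code.co_name] += 1

    old_handler = signal.signal(signal.SIGPROF, on_tick)
    # ITIMER_PROF counts CPU time used by the process and sends SIGPROF.
    signal.setitimer(signal.ITIMER_PROF, interval, interval)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0, 0)  # disarm the timer
        if old_handler is None:
            old_handler = signal.SIG_DFL
        signal.signal(signal.SIGPROF, old_handler)
    return hits


def hot_loop():
    # Hypothetical hot spot: burn roughly 0.3 s of CPU time.
    end = time.process_time() + 0.3
    total = 0
    while time.process_time() < end:
        total += 1
    return total


def cheap():
    return sum(range(100))


def workload():
    cheap()
    hot_loop()


if __name__ == "__main__":
    for name, count in sample_profile(workload).most_common():
        print(name, count)
```

The function with the most samples is the likely hot spot. Note what is
missing: the samples say where time went, not why (lock waits and cache
misses all look like "time spent here").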
After a while, those tools run out of steam. In multithreaded
applications on multicore CPUs, other factors quickly come into play,
e.g. lock contention and cache misses.
The Intel and AMD chips provide quite sophisticated counters for measuring
all sorts of things you may never have thought about. Unfortunately,
not only are they different between Intel and AMD, but the counters
supported will vary by chip family. (I don't even know if every
new CPU is a superset of all older ones.)
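
Because the available counters depend on vendor and chip family, a tool
first has to work out which CPU it is running on. As a rough sketch,
assuming an x86 Linux system where /proc/cpuinfo carries "vendor_id",
"cpu family" and "model" lines, something like the following would do;
parse_cpu_identity and read_cpu_identity are hypothetical helper names.

```python
def parse_cpu_identity(cpuinfo_text):
    """Return (vendor, family, model) for the first CPU in cpuinfo text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor's entry
            continue
        key, sep, value = line.partition(":")
        if sep:
            # Keys are matched exactly, so "model" and "model name" differ.
            fields.setdefault(key.strip(), value.strip())

    def as_int(name):
        value = fields.get(name)
        return int(value) if value is not None and value.isdigit() else None

    return fields.get("vendor_id"), as_int("cpu family"), as_int("model")


def read_cpu_identity(path="/proc/cpuinfo"):
    # Assumption: the file exists and uses the x86 field names above.
    with open(path) as handle:
        return parse_cpu_identity(handle.read())
```

A profiler would then use the (vendor, family, model) triple to look up
which counter events that chip actually supports.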
In user space, tools like "oprofile" and "perf" provide a way to gain
access to these counters, and are great for diving deeper into hot spots.
You may know 90% of your time is spent in a matrix multiply, but you
may not realise that 50% of that time is wasted in cache-thrashing.
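
As an illustration of the kind of question the counters answer, the
sketch below wraps "perf stat" to count the generic cache-references and
cache-misses events and turns them into a miss ratio. It assumes perf is
installed and that its CSV mode ("-x") puts the count in the first
column and the event name in the third; check that against your perf
version. The function names are invented for this example.

```python
import subprocess


def parse_perf_stat_csv(text):
    """Map event name -> count (None if perf could not count it)."""
    counts = {}
    for line in text.splitlines():
        fields = line.split(",")
        if line.startswith("#") or len(fields) < 3:
            continue  # comments, blank lines, or the program's own stderr
        value, event = fields[0].strip(), fields[2].strip()
        # Unsupported events show up as text such as "<not supported>".
        counts[event] = int(value) if value.isdigit() else None
    return counts


def cache_miss_ratio(counts):
    refs = counts.get("cache-references")
    misses = counts.get("cache-misses")
    if not refs or misses is None:
        return None
    return misses / refs


def run_perf_stat(command):
    # perf writes its statistics to stderr; the program's stdout is dropped.
    result = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", "cache-references,cache-misses",
         "--"] + list(command),
        stdout=subprocess.DEVNULL, stderr=subprocess.PIPE,
        text=True, check=False)
    return parse_perf_stat_csv(result.stderr)
```

For example, cache_miss_ratio(run_perf_stat(["./matmul"])) would give
the fraction of cache references that missed, where ./matmul stands in
for whatever program you are measuring.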
Linux has had a varied past of not adopting, and subsequently adopting,
profiling subsystems, and although it should be easy, it isn't. The
difficulty of CPU family differences, and complexity due to the
hardware of a system, means that providing a chip-independent API is hard.
In recent years, AMD and Intel have provided new monitoring facilities
which aim to allow instruction-accurate samples to be made of performance.
(Prior facilities relied on counters and interrupts which couldn't
pinpoint the exact instruction, e.g. where a cache miss occurred.)
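
To see why an interrupt-driven counter can blame the wrong instruction,
here is a toy model (purely illustrative, not a description of any real
chip). It assumes the interrupt is taken a fixed number of instructions
("skid") after the one that caused the counted event; sample_events and
the addresses are made up.

```python
import collections


def sample_events(trace, period, skid=0):
    """Sample every `period`-th event in an instruction trace.

    trace is a list of (address, caused_event) pairs in execution order.
    Each sample is charged to the instruction running `skid` steps after
    the one that caused the event (skid=0 models a precise sample).
    """
    hits = collections.Counter()
    seen = 0
    for i, (addr, caused_event) in enumerate(trace):
        if not caused_event:
            continue
        seen += 1
        if seen % period == 0:
            j = min(i + skid, len(trace) - 1)
            hits[trace[j][0]] += 1
    return hits


# A four-instruction loop in which only the load at 0x10 misses the cache.
loop = [(0x10, True), (0x14, False), (0x18, False), (0x1C, False)]
trace = loop * 100

precise = sample_events(trace, period=10, skid=0)  # all 10 samples at 0x10
skidded = sample_events(trace, period=10, skid=2)  # all 10 samples at 0x18
```

With skid, every sample lands on 0x18, an instruction that never missed
the cache, which is exactly the attribution problem described above.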
In Solaris, DTrace gained the CPC provider, which allows
probes to be placed based on the counter interrupts. The documentation
is somewhat vague, because everyone is trying hard not to
replicate the Intel/AMD documents which list the counters, since they
evolve so rapidly. The CPC provider is not (currently) in Linux/DTrace.
It's been on my TODO list and I am just checking it out. It relies on
Solaris handling user-level requests and abstracts the CPU away, but,
from what I read on the web, it appears to be unable to handle
the "new style" counters from AMD and Intel.
[I believe that the old-style counters are simply counters which can
be set up to generate an interrupt, either on reaching a threshold
or on a periodic basis, i.e. sampling-based monitoring. The new counters
likely require an area of RAM to fill up, and the code in Solaris,
and probably Linux may not be ready to support this, at least not on older
I may experiment with adding a CPC provider, just because I am interested
in seeing these counters and the issues they present.
[I have tried oprofile, and hit problems since it does not work inside
a VM; the newer 'perf' subsystem does appear to work inside a VM, but requires
rebuilding the kernel to enable the subsystem].