Sunday, 10 April 2011

Diary of a compiler bug

I do hate compiler bugs. It wastes a lot of time and energy just to realise
you have hit one. The quality of today's compilers is excellent, spurred
on by automatic regression tests and by the vast amount of code which users
want to compile.

GCC - an excellent compiler - has been around for many years. I think
I first learned of it back in 1987 or so. Not only has it had a huge number
of updates and releases, but it works on nearly every CPU/OS combination.

But, when you hit a problem with the compiler, you are hosed -- you
could report a bug, but this may be fruitless, especially when the compiler
version is quite a few years out of date.

Today, I hit a compiler bug causing strange behavior in DTrace on Ubuntu 8.04,
using the gcc 4.2.4 compiler. Yes, this is old, and there may be patches
for it, but who actually runs Ubuntu 8.04 today? Is my Ubuntu even
patched up to date? (No, it isn't.)

And does this bug exist in other compiler versions, earlier or later?

There isn't enough time in the world to validate this. So a kludge, but a nice
one, was applied:

    if (dp->dtdo_rtype.dtdt_kind ==
            char c = '\0' + 1;
            int intuple = act->dta_intuple;
            size_t s;

            for (s = 0; s < size; s++) {
                    if (c != '\0')
                            c = dtrace_load8(val++);

    #if linux
                    /* This pointless code, which will never */
                    /* fire, is to work around a gcc compiler */
                    /* bug which causes a page fault because */
                    /* 'act' gets overwritten. I haven't exactly */
                    /* figured out what's going on here, but */
                    /* turning off optimisation (which is not a */
                    /* good plan for __dtrace_probe()) isn't */
                    /* viable. I have seen this on Ubuntu 8.04, */
                    /* gcc 4.2.4, i386. */
                    if (act == valoffs) {
                            printk("defeat compiler bug! %p act=%p s=%x/%x %x %x\n",
                                &act, valoffs, end, act, s, size);
                    }
    #endif
                    DTRACE_STORE(uint8_t, tomax,
                        valoffs++, c);

                    if (c == '\0' && intuple)

The above code is embedded in the middle of the very important dtrace_probe()
function. (It has been renamed to __dtrace_probe so that re-entrancy problems
can be detected, which is what led to finding this bug.)

The code is too large to easily find this bug or refactor. I verified that
turning off optimisation avoids the bug, but that's not a good thing,
as per the comment above.

What seems to be going wrong is register allocation/spilling in the compiler.
I *hope* it only affects this code fragment, but it's very difficult to tell.
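
As a rough user-space sketch of the shape of this workaround (not the actual
kernel code): the loop below copies a string byte by byte, and a guard that
can never be true mentions the pointer "act" inside the loop, which gives the
compiler an extra use of it there. The names "struct action" and
"copy_string" are hypothetical stand-ins invented for this illustration, and
whether such a guard changes register allocation at all depends on the
compiler version. On a correct compiler it should make no difference.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the DTrace action record (one field only). */
struct action {
    int intuple;
};

/*
 * Copy up to 'size' bytes from 'src' into 'dst'. Once the terminating
 * NUL has been read, either stop (intuple set) or keep padding with NULs.
 * Returns the number of bytes written.
 */
static size_t copy_string(uint8_t *dst, const char *src, size_t size,
                          const struct action *act)
{
    char c = '\0' + 1;          /* any non-NUL value, so the first load happens */
    int intuple;
    size_t off = 0;
    size_t s;

    assert(dst != NULL && src != NULL && act != NULL);
    intuple = act->intuple;

    for (s = 0; s < size; s++) {
        if (c != '\0')
            c = *src++;

        /*
         * Never-true guard: 'act' and the output buffer are distinct
         * objects, so this branch should not fire. Its only purpose is
         * to reference 'act' inside the loop body.
         */
        if ((const void *)act == (const void *)(dst + off)) {
            fprintf(stderr, "unexpected alias: act=%p off=%lu\n",
                    (const void *)act, (unsigned long)off);
        }

        dst[off++] = (uint8_t)c;

        if (c == '\0' && intuple)
            break;
    }
    return off;
}
```

Calling copy_string() with intuple set stops right after the NUL is stored.
With intuple clear, the rest of the buffer is padded with NULs and no further
bytes are read from the source.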

The effect of the bug was to cause a key variable ("act") to be overwritten.
Fortunately, the kernel survived the subsequent panic/page-fault, but
it's not comforting to have this happening in an unexpected way.

BTW I managed to nuke my first VM last night - Fedora Core 14. After
debugging something similar on FC14, I started getting strange filesystem
bugs. Even a reboot didn't fix things. I got annoyed and fsck'ed the root
filesystem manually, despite fsck telling me this was a bad idea. And it
was right. Because of the LVM, I nuked the root filesystem, and had to reinstall.

I do love VMs - took less than an hour to recover (oh, I wish I had
snapshotted the filesystem first, but, well, you know, I didn't!)

New dtrace later tonite to fix the compiler issue above.

Post created by CRiSP v10.0.3b-b5955
