Sunday, 17 June 2012

A Bad API: XtAppAddTimeOut

Thought I would recount tails of woe with the above API. This
is the API used for timer callbacks in X Windows. You can
define a function to be called after N milliseconds, and you
can use XtRemoveTimeOut to remove the timer. When the timer
is invoked, the timer is automatically deleted, and there is no need
to remove the timer.

This API is used in CRiSP, and has been for around 20 years.
Recently, I encountered a strange bug, which was annoying me. The
flashing cursor would periodically stop. At first I thought it was
a regression in some aspect of performance or an issue with one of
the newer features, but it wasnt. It *was* being tickled by the new
features, but they were not directly responsible.

Lets consider malloc() and friends. People who use malloc() (or new[]
for the C++ folks), know that you can free memory and two types of
problems present itself: (a) using memory after it is freed, and (b)
forgetting to delete a memory block, leading to a memory leak.

Now, the X11 timer API is similar to malloc. If you fire a timer,
it has a finite life, and if you end up with multiple timers for the
same callback, they will all fire. This can cause issues, such as
"frantic cursor" flashing, or whatever the code is which handles the
callback. Its typically easy to detect this scenario, and callbacks
will often have preventative measures to avoid core dumps which could
be caused in this kind of scenario.

Now, XtRemoveTimeOut is particular nasty. The current X window implementations
tend to reuse an internal timer structure. You can do this:


XtRemoveTimeOut(id1);
...
XtRemoveTimeOut(id1);


and although the second XtRemoveTimeOut is redundant, it can have
a strange side effect. If the code between the first call and the
second calls XtAppAddTimeOut, then the memory freed for the original
timer is reused by another timer. The second call to XtRemoveTimeOut
then removes the timer for the "other code". We may have taken
away someone elses timer.

This is what was (erratically) happening in CRiSP. Multiple calls
to remove the same timeout were not protected, and this lead to
a piece of code removing the timeout for another piece of code.

CRiSP doesnt use many timers, but one is tied to cursor flashing,
and if that gets removed, it will never fire again. So, the cursor
stopped flashing. (It would flash if you typed in as the
screen display code would need to hide/unhide the cursor, but it
wasnt obvious this was happening).

I ended up debugging this by adding an LD_PRELOAD trace library
to observe the "double-free" scenario, and eventually found a style
of coding that could lead to this (and, in v11.0.7, is fixed).

Strangely, I had hit a similar problem on MacOSX, where CRiSP
implements the X11 primitives as a layer on top of the Cocoa interface,
but hadnt noticed the same issue on X11, as it took a number of
events, in the right order, to reproduce the scenario.

XtAppAddTimeOut()/XtRemoveTimeOut() need to either not reuse
memory or provide a debugging API to detect cancels of freed timers,
and avoid timer reuse.

Post created by CRiSP v11.0.6a-b6400


No comments:

Post a Comment