Sunday, 5 June 2011

dtrace -- some updates

After spending a lot of effort on the xcall issue, I hit a problem
where, occasionally, system calls would fail. The regression
test shows this up by running a perl script which continuously
opens an existing and a non-existing file, plus a variety of other things.

Very occasionally, Perl would emit a warning about a file handle
belonging to a file which couldn't be opened
(/etc/hosts - which always exists).

Similarly, other apps would occasionally fail to start with rtld
linker errors.

This proved very hard to track down: I was pretty certain it was
related to the xcall work I was doing. The errors were rare - fewer
than 1 in a million - and almost impossible to pin down.

I moved away from xcall debugging and found that running two
simple perl scripts at once (on a dual core machine), each continuously
opening files and nothing else, increased the error rate for as long
as the two scripts ran.
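
A minimal sketch of that kind of stress loop, written in Python rather
than perl. It is not the actual regression script: the function name
stress_open and its parameters are invented for illustration. It opens
one file that should exist and one that should not, and records every
result that contradicts what the filesystem ought to report.

```python
import errno
import os


def stress_open(existing, missing, iterations):
    """Open an existing and a missing file repeatedly.

    Returns a list of (iteration, description) tuples, one for each
    result that should be impossible on a healthy system.
    """
    anomalies = []
    for i in range(iterations):
        # The existing file must always open successfully.
        try:
            fd = os.open(existing, os.O_RDONLY)
            os.close(fd)
        except OSError as exc:
            anomalies.append(
                (i, "open(%s) failed: errno %d" % (existing, exc.errno)))

        # The missing file must always fail, and only with ENOENT.
        try:
            fd = os.open(missing, os.O_RDONLY)
            os.close(fd)
            anomalies.append((i, "open(%s) unexpectedly succeeded" % missing))
        except OSError as exc:
            if exc.errno != errno.ENOENT:
                anomalies.append(
                    (i, "open(%s) gave errno %d" % (missing, exc.errno)))
    return anomalies
```

To mimic the setup above, one would call something like
stress_open("/etc/hosts", "/some/missing/path", 1000000) from two
processes at once while the syscall probes are enabled, and print
whatever comes back. With failures rarer than 1 in a million, an
empty list after a single run proves very little.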

To try and get a better handle on this, I moved from 64-bit kernel
debugging to a 32-bit kernel, where the error rate was significantly
higher.

After a lot of experimentation, it transpired that the error wasn't to do
with xcall, but with the syscall provider. Specifically, a piece of
assembler glue turned out to be rubbish. I am not sure why it appeared to
work, but it didn't. (I had made some changes earlier on which may
have broken the syscall tracing on 32-bit kernels).

After recoding the assembler glue, things looked much better. The
errors in syscall processing appeared to be gone. But a new problem
surfaced - one I wasn't too surprised to see. There are a handful
of 32-bit syscalls which use a different calling convention from the others.
(The 64-bit code handles this, but not the 32-bit code).
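
The shape of that problem can be modelled in a few lines of Python.
This is only a toy, not the dtrace code: every name below is
hypothetical, and "regs" is just a dictionary standing in for the
saved registers. The point is that most table entries can share one
wrapper, while a small set of syscall numbers needs a second wrapper
that passes its arguments differently.

```python
def normal_wrapper(handler, regs):
    """Common convention: pass the argument registers individually."""
    return handler(*regs["args"])


def regs_wrapper(handler, regs):
    """Odd convention: hand the whole saved register set to the handler."""
    return handler(regs)


def build_table(handlers, special):
    """Pair each syscall number with the wrapper its convention needs.

    handlers maps syscall number -> handler function; special is the
    set of syscall numbers that use the odd convention.
    """
    return {
        nr: (regs_wrapper if nr in special else normal_wrapper, fn)
        for nr, fn in handlers.items()
    }


def dispatch(table, nr, regs):
    """Call the handler for syscall nr through the wrapper chosen for it."""
    wrapper, handler = table[nr]
    return wrapper(handler, regs)
```

If a tracing layer sends one of the special entries through the
normal wrapper, the handler receives the wrong arguments, which is
the kind of breakage that only shows up for that handful of syscalls.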

I have nearly finished redoing the 32-bit syscall tracing, and, once
done, will need to validate the 64-bit syscall tracing.

With luck, in the next few days or weeks the resiliency
issues will disappear and I can put out a new release.

The syscall tracing code is horribly ugly - because we have to support
different calling conventions across the two CPU architectures.
I may split the code up into separate x86 and x86_64 source files.

Post created by CRiSP v10.0.11a-b6022
