Sunday 22 April 2012

Are your backups *good*?

I want to recount a little scenario that happened to me, as a warning
to others.

I've been writing a "makefile" generator tool. The makefiles for CRiSP
have become a little unwieldy over the years - specifically, they suffer
from parallelisation issues, incomplete dependencies, and an inability
to be adapted quickly to a new scenario.

CRiSP runs on MacOS, Linux and Windows. The Mac and Linux makefiles are
shared (as are all the Unix makefiles), with an auxiliary small makefile
to handle the Cocoa part of the build cycle. Unfortunately, for Windows,
nothing is shared - it has a complete set of its own makefiles.

I wanted to compile CRiSP under MinGW - a GCC port to Windows - because I
wanted to use GCC and avoid issues with Microsoft Visual Studio.
MinGW provides a Unix-like environment, but as far as my makefiles are
concerned, Windows is too far from Unix and MinGW is too far from Windows.

So, I built a makefile generator. It has some good features - auto/recursive
header file dependencies, avoiding "cd" and creating a parallelisable
tree, along with proper cleaning - clean only what is built.

That latter feature had a bug in it: the initial version did the
equivalent of "rm -f */*". Unfortunately, when I ran it, it hadn't
cd'ed to the build directory, so it deleted all my files in subdirectories.
Oops!
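
As an illustration of "clean only what is built", here is a minimal
sketch in Python. It is not the generator's actual code: the function
name, the build_dir argument and the built_files list are all
hypothetical. The idea is to delete only files the tool itself recorded
as outputs, and to refuse any path that resolves outside the build
directory, so that running from the wrong directory cannot touch sources.

```python
from pathlib import Path


def clean_built_files(build_dir, built_files):
    """Remove only recorded build outputs that live under build_dir.

    build_dir and built_files are hypothetical names for this sketch;
    the real generator's bookkeeping is not shown in the post.
    """
    root = Path(build_dir).resolve()
    removed = []
    for name in built_files:
        target = (root / name).resolve()
        # Skip anything that escapes the build directory (e.g. "../x").
        if root not in target.parents:
            continue
        # Missing files are ignored, like "rm -f" would.
        if target.is_file():
            target.unlink()
            removed.append(target)
    return removed
```

Unlike a glob such as "*/*", an explicit list can never match a file the
build did not create.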

Not to worry - I do backups every few days and propagate the sources to
other machines. Mount the USB drive and go look at the .tar.bz2 file
containing the current sources. (Alas, about 2-3 days old, but that
didn't matter).

What I found was a corrupted .tar.bz2 file. My initial thoughts were,
*how* did this happen? My backup script is used all the time, and I
have validated the backups, but this was strange and new.

Never mind, I reconstituted the missing files from my other backups
and systems.

But I was curious: what could cause this to happen?

On investigation, I found the following worrying sign. I tend to back up
to a USB flash drive. I use Linux. I use swsusp to suspend to RAM. My current
kernel/distro has a bug in it. When you suspend with mounted USB drives, on
wakeup, it doesn't understand that the filesystem was mounted or that the
hardware needs to be reprobed. "mount" would show the filesystem mounted
but the device was totally empty. If I unmounted the rogue device and
remounted, all the files were present.
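
One cheap guard against this "mounted but empty" state is to check for a
known marker file before writing a backup. The sketch below is only an
assumption about how such a check might look: the function names and the
".backup-volume" sentinel file (which would be created once by hand on
the flash drive) are hypothetical, not part of any existing backup script.

```python
import os


def has_sentinel(mount_point, sentinel=".backup-volume"):
    """True if the marker file is visible under mount_point."""
    return os.path.isfile(os.path.join(mount_point, sentinel))


def backup_target_ready(mount_point, sentinel=".backup-volume"):
    """True only if mount_point is a mount point AND shows the marker.

    A filesystem that is listed as mounted but appears empty fails the
    sentinel check, so the backup can stop before writing anything.
    """
    return os.path.ismount(mount_point) and has_sentinel(mount_point, sentinel)
```

A backup script could call backup_target_ready() first and, on failure,
ask for an unmount and remount rather than carrying on.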

I am guessing that I did a backup and left the device mounted, suspended
the system, and didn't notice for a while (1-2 days), and eventually this
may have led to the corruption.

Moral: even if your backup system is perfect and doing validation -
your operating system (or some other component) may work against you.
You may not know this.

In my case, I may have to strengthen the backup system by applying
md5sums to the files and validating them before writing to the device,
or maybe by caching the backups on HD and verifying the copy on the
device before dropping the local one.
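
A minimal sketch of that kind of check, in Python, might look like the
following. The function names are hypothetical, and MD5 is used only
because md5sums are what is mentioned above; any hashlib algorithm would
do. The sketch checksums the local archive, copies it, compares the
checksum of the copy, and then reads every member of the .tar.bz2 so
that truncated or corrupt data shows up as an error.

```python
import hashlib
import shutil
import tarfile


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hex digest of a file, read in chunks to keep memory use small."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def archive_is_readable(path):
    """Read every file in a .tar.bz2; return False if anything fails."""
    try:
        with tarfile.open(path, "r:bz2") as tar:
            for member in tar:
                if member.isfile():
                    f = tar.extractfile(member)
                    while f.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, OSError, EOFError, ValueError):
        return False


def copy_and_verify(src, dst):
    """Copy src to dst; True if the digests match and dst is readable."""
    expected = file_digest(src)
    shutil.copyfile(src, dst)
    return file_digest(dst) == expected and archive_is_readable(dst)
```

One caveat: a read-back straight after the copy may be served from the
operating system's cache rather than from the device, so repeating the
check after an unmount and remount is the stronger test.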

How good are *your* backups?

Post created by CRiSP v10.0.31a-b6278

