The statute of limitations is past on this one, so a tale of sysadminry.

Once upon a time there were two servers, which we shall call Robert and David. Between them they ran an important system that was built on a somewhat clunky framework. They were load balanced, but for database purposes one was a Master and one was a Slave, with a magic switch so that they could swap should it become necessary. They ran Gentoo, because when they were bought it was the only “released” distro that supported the RAID cards they had.

After a while a new system ran on these servers, which for operational reasons wasn’t load balanced, and that ran within PHP. These servers were too important to ever take down for maintenance, and compiling PHP on Gentoo was – at the time – a heartache akin to losing your first true love. So the sysadmin at the time installed a Debian chroot on Robert and ran Apache within that. Shortly after that, I’m hired, and eventually the old coder leaves, and I take over both those projects.

A while later, a drive in the RAID on David fails. Me and the lead dev wander down to London to fix it, meeting a courier with the new drive at the datacentre. In the process of replacing the drive, David’s RAID controller entirely fails and takes the remaining drives with it. Now, because Gentoo compiles its own packages, system upgrades require a large amount of load, a great deal of planning and a significant amount of downtime while this goes on (no, you can’t recompile the system while it’s still running as a heavily loaded real server, don’t be silly). So updates have been confined to security updates, but that’s fine, because it’s only been about nine months since it was new.

We can’t install the same version of Gentoo anymore, because it’s so very far out of date that the package management server doesn’t support it. The idea of the mirrored servers running different versions of the core OS cannot be borne, but we can’t update the other server either, because there’s no upgrade route from there to here. Fortunately, by this time Debian has updated and now supports the RAID drivers of the servers. So we install Debian on it, and it works. And we install the Big System on it, with its clunky framework. But the version of the clunky framework is a major revision after the one the system was written for, and we don’t support that version (because it doesn’t work). We don’t have time to rewrite the software. The old package for the framework doesn’t work, and the binary download of the old version doesn’t work either.

Can’t go over it, can’t go under it, have to go around it.

So we copy the contents of Robert onto the new Debianised David, and set up a chroot on David running Robert’s Gentoo root, and thus the old working version of the framework.

So now we have Robert, running Gentoo with a crunchy Debian centre; and we have David, running Debian with an orangey Gentoo centre. Except that because the Gentoo centre is a copy of Robert, that too has a toffee flavoured Debian install, although it’s not activated. Inception eat your brains out, six years early.

At this point, me and the other coder had spent eight hours in the air-conditioned white-noise of a hosting centre room, during which the number of things that had gone right was rapidly drowning in the sea of everything going wrong. Since it was working, we went home and fixed it later.

So, things to learn:

  • RAID is not a backup solution.
  • Factor in a method to test and install security and system upgrades, otherwise you’re going to be forced to find all the holes in sudden short order.
  • Note that this isn’t specifically “Gentoo sucks”, it’s that “Not having an upgrade path sucks”.
  • RAID is not a backup solution.
  • A major problem was that Gentoo was the exception: all the rest of the servers ran Debian.
  • RAID is not a backup solution.
  • Just because you have made all the backups in the world before you start does not stop everything going wrong in short order.
  • Bring a packed lunch to the datacentre.
  • RAID is not a backup solution.