Back in 2006, when I was a semi-regular watcher of Top Gear, I learned a valuable lesson.
Show host Richard Hammond was seriously injured when he crashed a rocket-powered car doing close to 250 km/h on the last run of the day. I’m reliably informed that another of Hammond’s death-defying accidents, in 2017 (this time in a million-dollar concept car doing 130 km/h), also occurred on the last run of the day.
As the 2021 working year came to a close, I was reminded that the last run of the day — or in this case, the last change you make in a day — is often one you should walk away from without doing.
The day was like any other but around lunchtime I had reached an impasse. I was stuck waiting on other people to complete some tasks and the work I was doing involved a cold shutdown of a production database — something you definitely don’t want to do on a Friday afternoon, as we all know how badly that can go.
I decided instead to finish off some physical work that was in progress, electing to decommission and unrack some old sleds (servers that sit in a shared chassis rather than being standalone). Plenty of planning had already gone into this, lots of migrations were performed to empty them, so it was just a matter of double-checking things and pulling power cables.
The list of things to do was pretty straightforward:
- Identify sleds and their chassis for this batch.
- Migrate any straggling virtual machines.
- Verify API endpoints and ensure the server knew virtual machines had moved.
- Turn the sled and its chassis off.
- Remove from the rack and put in blanking plates.
Things moved along okay and by 1:00 pm sled one was done. By 2:00 pm I’d knocked over sleds two and three, and removed the chassis for sleds one and two from the rack. That pace continued until I’d removed all five sleds and their chassis. By 5:00 pm it was a job well done and I was headed for home.
Around an hour later I had someone on the phone telling me that sled six was turned off. The caller had noticed the sled was down, driven to the office to investigate and found the sled’s chassis wasn’t even in the rack.
Despite my good intentions at lunchtime, I had broken the golden rule that applies in IT on Friday afternoons — don’t make any changes. I’d tempted fate and the last change came back to bite me.
Of course, it transpired that sled six shared a chassis with sled five. When I removed sled five, I had dutifully removed the shared chassis and its cables too, and put them aside for storage.
As is the way with fixing something after the fact, it took a bit of time to rectify. My colleague was onsite, so he located and re-racked the chassis (along with its sleds) and cabled them in. I got the server’s virtual machines running again, fixed up the related configuration and dealt with tickets. We did a mail-out to affected users and finished the clean-up just before 8:30 that night.
After a bit of analysis, I identified the following causes of the failure, which I hope to mitigate in the future:
- Assuming a sled that is not mentioned in documentation is not in production.
- Not checking other data sources to confirm the status of the mystery sled.
- Not visually checking the display of the mystery sled, which would have told me what it was. This doesn’t scale but is a useful last resort.
- Pulling the sleds’ power cords rather than powering them off in software. Doing it in software would have provided a chance to validate there were no running virtual machines before striking them down.
- Performing a shutdown-and-remove instead of shutting the sleds down and leaving them in the rack for a couple of hours (or days).
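The first two causes come down to trusting one list of sleds. As a rough sketch of a pre-flight check (not something we had at the time), the batch could be compared against an inventory of which sled sits in which chassis, so a chassis is only cleared for removal when every sled in it is part of the batch. The function name, the inventory format and the sled and chassis labels below are all hypothetical, and the sketch assumes the inventory comes from a source that sees every physical sled, documented or not.

```python
from collections import defaultdict


def chassis_safe_to_remove(inventory, batch):
    """Work out which chassis can leave the rack.

    inventory: dict mapping sled name -> chassis name (hypothetical format).
    batch: set of sled names planned for decommissioning.
    Returns (safe, blocked): a sorted list of chassis whose sleds are all
    in the batch, and a dict of chassis -> sleds that are NOT in the batch.
    """
    batch = set(batch)

    # A sled we know nothing about is a reason to stop, not to assume.
    unknown = batch - set(inventory)
    if unknown:
        raise ValueError("sleds missing from inventory: " + ", ".join(sorted(unknown)))

    # Group every known sled by the chassis it lives in.
    sleds_by_chassis = defaultdict(set)
    for sled, chassis in inventory.items():
        sleds_by_chassis[chassis].add(sled)

    safe, blocked = [], {}
    for chassis in sorted({inventory[sled] for sled in batch}):
        leftover = sleds_by_chassis[chassis] - batch
        if leftover:
            blocked[chassis] = sorted(leftover)  # someone else still lives here
        else:
            safe.append(chassis)
    return safe, blocked
```

Fed an inventory in which sleds five and six share a chassis and a batch that stops at sled five, a check like this would report that chassis as blocked by sled six instead of clearing it for removal.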
At least the root cause of the issue here was simple to identify — our decommissioning process is manual and not comprehensively documented so it is easy to make a mistake like this.
Following this particular incident we made some changes. I’ve written some automation to help us perform shutdowns in software, including a step to check that all systems being powered off contain no virtual machines. Other process improvements will most likely happen when we come to our next batch of decommissioning around Q2 2022.
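To give a feel for the kind of check involved, here is a minimal sketch of the "no virtual machines" gate. It is not the automation itself: the function name and data shapes are made up for illustration, and it assumes something else has already collected the list of running virtual machines per host from whatever API the platform provides.

```python
def plan_power_off(hosts, running_vms):
    """Split hosts into those safe to power off and those to leave alone.

    hosts: iterable of host names queued for shutdown.
    running_vms: dict mapping host name -> list of running VM names
                 (hypothetical format; an empty list means no VMs).
    Returns (to_power_off, refused), where refused maps host -> reason.
    """
    to_power_off, refused = [], {}
    for host in hosts:
        vms = running_vms.get(host)
        if vms is None:
            # No data is not the same as no VMs, so refuse rather than guess.
            refused[host] = "no VM data available"
        elif vms:
            refused[host] = "still running: " + ", ".join(sorted(vms))
        else:
            to_power_off.append(host)
    return to_power_off, refused
```

The important design choice is that a host with missing data is refused rather than treated as empty, which is the software equivalent of not assuming an undocumented sled is out of production.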
While the results aren’t as dramatic as a high-speed car accident, attempting major IT changes late in the day — and week — can still be a recipe for disaster. Fortunately, I’ve lived and learned from this one and have implemented some improvements to avoid similar issues in the future. But, thanks to Richard Hammond, I’ll definitely think twice before embarking on major changes late in the day from here on in.