Friday, October 03, 2008

A crazy week of work

I went in to work on Monday expecting what we had internally billed as "Manic Monday". We're working on a big change to our systems at the moment which will, I believe, transform the way in which information flows in supply chains. Monday was the day when all the new bits of code that the development team had been working on individually would be stitched together, and for the first time we could see what we'd been planning for months.

That alone was enough to make it a manic day. However, mid-morning a phone call came in which transformed the next 3 days for me. It went like this: "Hello Simon, is there a problem with email?" I investigated... and 30 minutes later was in a state of shock. One of the disks on one of our oldest but still quite busy servers (used mainly by small businesses we know well, and by friends and family) had decided to pack up, and the course of events was set: I had to get the data back on-line as quickly as possible. This machine was pretty dated and did not even have RAID disks, unlike every one of the other servers we use.

What's worse... is that as the disk was failing, a full backup was in progress. So, given that we are paying for a professional backup solution, I assumed the previous full backup would have to be used and we'd have to layer on the incrementals... However, what the support engineer told me next nearly made me fall off my seat. The full backup overwrites the previous full backup! Yikes! That meant no backups either for this server. I swore quite a lot at that point.

Being a professional sort of person who worries about this sort of thing, I used to have my own set of backups in place before we started paying for a service. My backups were, however, only partial, and covered just the data that mattered to my company, not the emails of friends and family who also use this server...

As the day unfolded my worst fears gradually subsided, as the superb engineers at UKFast somehow managed to recover the disk, build a new machine and put the "fixed" disk in an external enclosure attached to the new machine so I could pull off everything needed. Phew... No data loss, everything was recoverable. I think we lost a few emails as the server started to fail, but that's all. Very lucky. I believe that changes will be put in place at our backup service provider to ensure such an edge case failure is catered for in future.

It took me the next few days to get everyone back online and everything back how it was before, even down to the same usernames and passwords. I worked through to 1am on Monday night, started again at 7 the next day and kept going until it was all done. I need to learn some lessons from this, as does my service provider.

If anyone who was affected by this is reading, please accept my apologies once again.

The rest of the week has been spent catching up on what I didn't do on Monday and Tuesday. Now I am exhausted, having failed to catch up, and have decided to take today off in lieu of the night I worked.

It's sunny outside and I intend to make the most of it, get a haircut and get a new tyre to fix the puncture I had yesterday on my Subaru (a nasty incident where I nearly lost control at over 60mph when the front tyre blew out).
