So this past weekend I did the unthinkable: I accidentally recycled the wrong dedicated server at work. Usually this is not much of an issue (not that I make a habit of it), given the continuous data protection we have implemented at the data center (R1Soft CDP), except that the backup server this particular client's system was using had suffered a catastrophic RAID failure the very night before. We have had RAID arrays go bust on us before; it is rare, but it does happen. The result was that the client's site and databases got absolutely toasted, and with only a static tar.gz cPanel backup available, one that was over a week old, they were none too happy about the loss of the database content.
I have dealt with data loss of various degrees in the past, but never in a situation where a format had occurred WITH data being rewritten to the disk afterwards. We are also not talking about just a few hundred megs of rewritten data, but a complete OS reload along with a cPanel install, which comprises multiple gigabytes of data and countless software compilations, each consisting of countless write-delete cycles on the disk.
So, the first thing I did on realizing the “incident” was stop everything on the server, remount all file systems read-only, and then have an “omg wtf” moment. Once I had collected myself, I did the usual data-loss chore of making a dd image of the disk to an NFS share over our gigabit private network while contemplating my next step. My last big data recovery task was some years ago, perhaps two or more, and since I am such a pack rat I still had a custom data recovery tarball on my workstation containing a number of the tools I used back then, the main ones being testdisk and The Sleuth Kit (TSK); together these tools are invaluable.
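For anyone curious, the triage amounted to something along these lines (the hostnames, paths, and device names here are purely illustrative, not the client's actual layout):

    # stop services, then remount everything read-only to prevent further writes
    mount -o remount,ro /
    mount -o remount,ro /var
    # mount the NFS share on the private network and image the whole disk
    mount -t nfs backup1:/exports/recovery /mnt/recovery
    dd if=/dev/sda of=/mnt/recovery/sda-full.img bs=1M conv=noerror,sync

The conv=noerror,sync part matters on a disk you no longer trust: it keeps dd running past read errors and pads the bad blocks so offsets in the image stay aligned with the disk.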
The testdisk tool is designed to recover partition data from a formatted disk, even one that has had minimal data rewrites, and it does this exceptionally well. In this case I went in a bit unsure of the success I would have, but sure enough, after some poking and prodding of testdisk's options, I was able to recover the partition table from the previous system installation. This was an important step, as any data that had not been overwritten instantly became available once the old partition scheme was restored. Sadly, it did not directly provide the data I required, the client's databases, but the restored partitions still gave me some metadata to work with and a relative range on the disk of where the data was located, instead of having to ferret over the whole disk. With that, I created a new dd image of the disk with a more limited scope, covering just the /var partition, which effectively cut the amount of unallocated space I needed to search from 160GB down to 40GB.
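Once testdisk had written the old partition table back, the /var partition showed up again as an ordinary block device, so the narrower image was just a dd of that partition rather than the whole disk (the device name below is an assumption for illustration):

    # image only the recovered /var partition instead of the entire 160GB disk
    dd if=/dev/sda3 of=/mnt/recovery/var-part.img bs=1M conv=noerror,sync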
It was now time to crack out the latest version of The Sleuth Kit and its companion web application, Autopsy. I installed them into shared memory through /dev/shm and then went through the chore of remembering how to use the Autopsy web app. After a few minutes of poking around it started to come back to me, and before I knew it I was browsing my image files, a painfully tedious task done in the hope that the metadata can lead you to what you are looking for through associative information in file-name-to-inode relationships. That is really pretty pointless in the end, though, as ext3, when it deletes data, zeroes the metadata (as I understand it) before completing the unlink process in the journal. I quickly scrapped anything to do with metadata and moved on to generating ASCII string dumps of the images' allocated and unallocated space, which allows for quick pattern-based searches to find data.
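The string dumps themselves are straightforward to produce with TSK plus GNU strings; a rough sketch of the equivalent command-line steps (Autopsy drives the same process from its web UI, and the filenames are my own):

    # pull just the unallocated blocks out of the /var partition image
    blkls var-part.img > var-unalloc.blkls
    # build searchable ASCII string dumps, keeping the byte offset of every string
    strings -a -t d var-part.img > var-part.str
    strings -a -t d var-unalloc.blkls > var-unalloc.str

Keeping the decimal byte offset (-t d) on every string is the whole point: it is what lets you jump from a keyword hit back to the exact place on the disk it came from.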
The string dumps took a couple of hours to generate, after which I was able to keyword/regexp search the disk's contents with relative ease (do not try searching large images without the string dumps, it is absurdly slow). I began some string searches looking for the backup SQL dumps that had been taken less than 24 hours earlier during the weekend backups. Although I did eventually find these dumps, it turned out some of them were so large they spanned non-sequential parts of the disk. This made my job very difficult, as it became a matter of trying to string together various chunks of an SQL dump for which I had no real knowledge of the underlying database structure. After many hours of effort and some hit-or-miss results, I managed to recover a smaller database for the client, which in the end turned out to be absolutely useless to them. That was it for the night for me; I needed sleep.
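Searching the dumps is then just grep over big text files; the patterns below are only illustrative of the sort of thing I was hunting for in the mysqldump output:

    # look for the header mysqldump writes at the top of each backup file
    grep -n "MySQL dump" var-unalloc.str
    # or hunt for table definitions and data directly
    grep -nE "CREATE TABLE|INSERT INTO" var-unalloc.str | less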
Sunday morning brought an individual from the client's organization who was familiar with the database structure of their custom web application and was able to give me the exact table names they needed recovered, which was exactly what I needed. I was then able to craft some regexp queries that found all insert, update, and structure definitions for each of the tables they required, and despite parts of these tables being spread across the disk, knowing exactly what they needed allowed my regexp queries to be accurate and give me the locations of all the data. With the locations in hand, it was just a matter of browsing the data, modifying the fragment range I was viewing so that it included the beginning and end of each data element, and then exporting the data into Notepad, where I reconstructed the SQL dumps into what turned out to be a very consistent state. This did take a little while, but it was nowhere near as painful a process as my efforts from the night before, so I was very happy with where we ended up.
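With real table names in hand the searches become precise, and each match offset maps straight back onto the image it was dumped from, so pulling a window of raw data around a hit is simple. The table name and numbers below are made up, and the searches here run against the dump of the whole partition image since its offsets line up directly with var-part.img:

    # find every statement touching the table the client cares about
    grep -nE "(CREATE TABLE|INSERT INTO|UPDATE).*orders_archive" var-part.str
    # a hit at byte offset 2147483648 is block 524288 with 4096-byte blocks;
    # grab a generous window around it and trim the ends by hand afterwards
    dd if=var-part.img of=orders_archive.frag bs=4096 skip=524288 count=256

The manual part is exactly the trimming: deciding where one recovered fragment ends and the next begins, and stitching them back into a dump that mysql will actually accept.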
A couple of hours after I turned the data over to the client, they were restoring the tables they most needed to get back online. This was followed by a ping from the client on AIM saying they had successfully restored all of the data and were back online in a state nearly identical to just before they went offline. What the client took from this is to never trust anyone else alone with safeguarding their data; they now intend to keep regular backups of their own data in addition to the backups we retain, which is a very sensible practice to say the least.