I recently had an issue at a remote location (12,000 km away): an old multi-purpose Linux server that had been working for the past 5 years wouldn’t boot after a nasty power failure.
The server was used as a firewall, a local email store, a file server and a backup server, so its failure was a big deal for the small business that relied on it.
You can’t always have complete redundancy, so a few bad crashes are to be expected over the years. Fortunately, I always build my servers around a simple software RAID1 array, which leaves some hope for recovery.
In this instance, the server would start and then fail miserably in a way that suggested a hardware failure of some sort. Since I could not be physically present and there was no dedicated system administrator on location, I directed the most knowledgeable person there to use a spare router to restore Internet connectivity and to connect one of the disks to another Linux server (their fax server) through a USB external drive enclosure.
Doing this, I was able to connect remotely to the working server, mount the disk and access the data.
### Salvaging the data ###
Once one of the RAID1 drives was placed into the USB enclosure and connected to the other available Linux box, it was easy to remount the drives.
`fdisk` will tell us which partitions are interesting, assuming that `/dev/sdc` is our USB hard drive:

    [root@fax ~]# fdisk -l /dev/sdc
    Disk /dev/sdc: 81.9 GB, 81964302336 bytes
    16 heads, 63 sectors/track, 158816 cylinders
    Units = cylinders of 1008 * 512 = 516096 bytes
       Device Boot      Start         End      Blocks   Id  System
    /dev/sdc1   *           1         207      104296+  fd  Linux raid autodetect
    /dev/sdc2             208       20526    10240776   fd  Linux raid autodetect
    /dev/sdc3           20527       22615     1052856   fd  Linux raid autodetect
    /dev/sdc4           22616      158816    68645304    f  W95 Ext'd (LBA)
    /dev/sdc5           22616      158816    68645272+  fd  Linux raid autodetect

We can’t simply `mount` the partitions; they need to be assembled into a RAID array first:

    [root@fax ~]# mdadm --assemble /dev/md6 /dev/sdc1 --run
    mdadm: /dev/md6 has been started with 1 drive (out of 2).

The `--run` argument forces the array to start in degraded mode; without it, `mdadm` will complain that only a single drive is available instead of the 2 (or more) it expects.
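To confirm that the array really is running on a single member, the kernel's view in `/proc/mdstat` can be checked. The helper below is only a sketch: `degraded_arrays` is a name made up for this example, and it assumes the usual `/proc/mdstat` layout in which each array's status line ends with a marker such as `[U_]`, where `_` stands for a missing member.

```shell
#!/bin/sh
# degraded_arrays (hypothetical helper): reads text in /proc/mdstat format on
# stdin and prints the name of every array whose status marker contains "_",
# i.e. an array running with at least one member missing.
degraded_arrays() {
    awk '
        /^md[^ ]* :/ { name = $1 }          # header line, e.g. "md6 : active raid1 sdc1[0]"
        /\[[U_]+\]/ {                       # status line, e.g. "104192 blocks [2/1] [U_]"
            if (name != "" && $NF ~ /_/)
                print name
            name = ""
        }
    '
}

# Typical use on the rescue box:
#   degraded_arrays < /proc/mdstat
```

An array assembled from a single RAID1 disk should show up in that list, which is expected here; `mdadm --detail /dev/md6` gives a more verbose report on the same array.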
Now simply mount the assembled array to make it accessible, in `/mnt` for instance:

    [root@fax ~]# mount /dev/md6 /mnt

We can now salvage our data by repeating this process for every partition.
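Repeating the process by hand works fine for a handful of partitions, but it can also be scripted. What follows is a cautious sketch rather than the exact procedure I used: the function names, the starting md number and the mount root are all assumptions made for the example, and the script only prints the commands so they can be reviewed before anything touches the disk. Mounting read-only (`-o ro`) is my own conservative choice for a drive that has just been through a crash.

```shell
#!/bin/sh
# raid_partitions (hypothetical helper): reads `fdisk -l` output on stdin and
# prints the partitions reported as "Linux raid autodetect".
raid_partitions() {
    awk '/^\/dev\// && /Linux raid autodetect/ { print $1 }'
}

# salvage_commands (hypothetical helper): reads partition names on stdin and
# PRINTS the assemble and mount commands for each one; it does not run them.
#   $1 = first md number that is free on the rescue box (assumption, e.g. 6)
#   $2 = directory under which to mount the arrays (assumption, e.g. /mnt/rescue)
salvage_commands() {
    n=$1
    mount_root=$2
    while read -r part; do
        echo "mdadm --assemble /dev/md$n $part --run"
        echo "mkdir -p $mount_root/md$n"
        echo "mount -o ro /dev/md$n $mount_root/md$n"
        n=$((n + 1))
    done
}

# Typical use on the rescue box:
#   fdisk -l /dev/sdc | raid_partitions | salvage_commands 6 /mnt/rescue
```

Once the printed commands look right, they can be pasted into the shell one by one. Not every array necessarily holds a mountable filesystem (one of them may be swap, for example), so a failed `mount` on one array is not by itself a sign that the data is lost.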
Using RAID1 means you have at least 2 disks to choose from, so if one is damaged beyond repair, you may be lucky and find that its mirror on the other drive still works.
If the drives are not physically damaged but the server won’t boot, you can always use a pair (or more) of USB HDD enclosures and reconstruct the RAID arrays manually from another Linux box.
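With both mirrors attached, each array is assembled from its two members and `--run` is no longer needed. As a rough sketch, assuming the two drives appear as `/dev/sdc` and `/dev/sdd` and share the same partition layout (both assumptions, as is the helper name `mirror_commands`), the commands could be generated like this and reviewed before being run:

```shell
#!/bin/sh
# mirror_commands (hypothetical helper): PRINTS one assemble command per
# mirrored partition pair; it does not run anything.
# Usage: mirror_commands DISK_A DISK_B FIRST_MD PARTNUM...
#   FIRST_MD is the first md number that is free on the rescue box.
mirror_commands() {
    disk_a=$1
    disk_b=$2
    n=$3
    shift 3
    for p in "$@"; do
        echo "mdadm --assemble /dev/md$n $disk_a$p $disk_b$p"
        n=$((n + 1))
    done
}

# Example, using the partition numbers from the fdisk listing:
#   mirror_commands /dev/sdc /dev/sdd 6 1 2 3 5
```

Running `mdadm --examine` on a member partition before assembling shows its RAID superblock, and `cat /proc/mdstat` afterwards shows whether both members actually joined each array.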
### Planning for disasters ###
The lesson here is about planning: you can’t foresee every possible event and have contingencies for each one of them, either because of complexity or cost, but you can easily make your life much easier by planning ahead a little bit.
Most small businesses cannot afford dedicated IT staff, so they will usually end up having the least IT-phobic person become their ‘system administrator’.
It’s your job as a consultant or technical support provider to ensure that they have the minimum tools at hand to perform emergency recovery, especially if you cannot intervene quickly yourself.
### On-Site emergency tools ###
In every small business’s spare parts closet there should be at least:
* whenever possible, a __spare Linux box__, even if it’s just built from older salvaged components (like a decommissioned PC). Just have a generic Linux install on it and make sure it is configured so it can be plugged in and accessed from the network.
* a __spare US$50 router__, preferably pre-configured to be a temporary drop-in replacement for the existing router/firewall. Ideally, configure it to forward port 22 (SSH) to the spare Linux box so you can easily access the spare box from outside.
* a USB external __hard-drive enclosure__.
* a spare PC __power supply__.
* some network cables and a couple of screwdrivers.
There are many more tools that can be kept, such as rescue CDs (like bootable Linux distributions), spare HDDs, etc., but you have to remember that your point of contact needs to be able to act as your eyes and hands, so the number of tools you provide should match their technical abilities.
Don’t forget to clearly label confusing things like network ports (LAN, WAN) on routers, cables and PCs.
The point is that if you can’t be on site within a short period of time, then having these cheap tools and accessories already on site means that your customers can quickly recover just by following your instructions on the phone.
Once everything is plugged in, you should be able to carry out most repairs remotely.
### Resources ###
* [Linux software RAID1 setup]
* [Fedora System Recovery Week]