Linux sysadmin: a short RAID trouble-shooting story

I recently had an issue at a remote location (12,000 km away) where the old multi-purpose Linux server that had been working for the past 5 years wouldn’t boot after a nasty power failure.
The server was used as a firewall, a local email store, a file server and a backup server, so its failure was a big deal for the small business that relied on it.

![RAID explained][1]
RAID configurations explained
You can’t always have complete redundancy, so a serious crash or two is to be expected over the years. Fortunately, I always build my servers around a simple software RAID1 array, which leaves some hope for recovery.
In this instance, the server would start and then fail miserably in a way that suggested a hardware failure of some sort. Since I could not be physically present and there was no dedicated system admin on location, I directed the most knowledgeable person there to use a spare internet router to restore Internet connectivity, and to connect one of the disks to another Linux server (their fax server) through a USB external drive enclosure.

Doing this, I was able to remotely connect to the working server and access the disk, mount it and access the data.

### Salvaging the data ###
Once one of the RAID1 drives was placed into the USB enclosure and connected to the other available Linux box, getting at its partitions was easy.

`fdisk` will tell us which partitions are interesting, assuming that `/dev/sdc` is our USB hard drive:
~~~~~~~~
[root@fax ~]# fdisk -l /dev/sdc

Disk /dev/sdc: 81.9 GB, 81964302336 bytes
16 heads, 63 sectors/track, 158816 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1         207      104296+  fd  Linux raid autodetect
/dev/sdc2             208       20526    10240776   fd  Linux raid autodetect
/dev/sdc3           20527       22615     1052856   fd  Linux raid autodetect
/dev/sdc4           22616      158816    68645304    f  W95 Ext'd (LBA)
/dev/sdc5           22616      158816    68645272+  fd  Linux raid autodetect
~~~~~~~~
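To pull just the interesting partitions out of a listing like the one above, a small filter is enough. This is a minimal sketch: `raid_partitions` is a hypothetical helper name, and it assumes the MBR-style output shown here, where RAID members are labelled `Linux raid autodetect`.

```shell
#!/bin/sh
# Hypothetical helper: read `fdisk -l` output on stdin and print the device
# name (first column) of every partition marked "Linux raid autodetect".
raid_partitions() {
    awk '/Linux raid autodetect/ { print $1 }'
}

# Usage (as root), assuming the USB drive is /dev/sdc as in the article:
#   fdisk -l /dev/sdc | raid_partitions
```

The extended partition (`/dev/sdc4` here) is skipped because it is only a container for `/dev/sdc5`.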
We can’t simply `mount` the partitions; each one needs to be assembled into a RAID array first:
~~~~~~~~
[root@fax ~]# mdadm --assemble /dev/md6 /dev/sdc1 --run
mdadm: /dev/md6 has been started with 1 drive (out of 2).
~~~~~~~~
The `--run` argument forces the RAID array to be assembled; otherwise, `mdadm` will complain that only a single drive is available instead of the 2 (or more) it expects.
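To confirm that the array really did start with a missing member, you can look at `/proc/mdstat`, where a degraded mirror shows an underscore in its status brackets (for example `[U_]`). The sketch below is one possible way to list such arrays; `degraded_arrays` is a hypothetical helper name, not part of `mdadm`.

```shell
#!/bin/sh
# Hypothetical helper: read /proc/mdstat-style text on stdin and print the
# name of every md array whose member status contains "_" (a missing disk).
degraded_arrays() {
    awk '/^md[0-9]+ :/         { name = $1 }
         /\[[U_]*_[U_]*\]/     { print name }'
}

# Only run against the live file when the md driver is actually loaded.
if [ -r /proc/mdstat ]; then
    degraded_arrays < /proc/mdstat
fi
```

For an array assembled from a single drive with `--run`, seeing it listed here is expected and is not in itself a sign of further damage.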

Now simply mount the assembled partition to make it accessible in `/mnt` for instance:
~~~~~~~~
[root@fax ~]# mount /dev/md6 /mnt
~~~~~~~~

We can now salvage our data by repeating this process for every partition.
Using RAID1 means you have at least 2 disks to choose from, so if one is damaged beyond repair, you may be lucky and find that its mirror on the other drive still works.
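Repeating the assemble-and-mount steps for every partition can be scripted. The following is a dry-run sketch that only prints the commands so they can be reviewed first. The function name `plan_recovery`, the md numbering starting at `md6`, the `/mnt/raidN` mount points and the read-only mount option are all assumptions made for illustration.

```shell
#!/bin/sh
# Dry run: print (do not execute) the commands needed to assemble and mount
# each RAID partition of one disk. Usage: plan_recovery DISK PARTNUM...
plan_recovery() {
    disk=$1
    shift
    md=6                      # first md device number to use (assumption)
    for part in "$@"; do
        echo "mdadm --assemble /dev/md${md} ${disk}${part} --run"
        echo "mkdir -p /mnt/raid${part}"
        # Read-only is a cautious choice for salvage work, not a requirement.
        echo "mount -o ro /dev/md${md} /mnt/raid${part}"
        md=$((md + 1))
    done
}

# Partitions 1, 2, 3 and 5 are the RAID members in the fdisk listing above.
plan_recovery /dev/sdc 1 2 3 5
```

Once the printed commands look right, they can be pasted into a root shell or piped to `sh`. If one of the partitions holds swap rather than a filesystem, its `mount` step will simply fail and can be ignored.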

If the drives are not physically damaged but they won’t boot, you can always use a pair (or more) of USB HDD enclosures and reconstruct the RAID arrays manually from another Linux box.

### Planning for disasters ###
The lesson here is about planning: you can’t foresee every possible event and have contingencies for each one of them, either because of complexity or cost, but you can easily make your life much easier by planning ahead a little bit.

Most small businesses cannot afford dedicated IT staff, so they will usually end up having the least IT-phobic person become their ‘system administrator’.
It’s your job as a consultant/technical support to ensure that they have the minimum tools at hand to perform emergency recovery, especially if you cannot intervene yourself quickly.

### On-Site emergency tools ###
In every small business’s spare-parts closet there should be at least:

* a __spare Linux box__ whenever possible, even if it’s just using older salvaged components (like a decommissioned PC). Just have a generic Linux install on it and make sure it is configured so it can be plugged in and accessed from the network.
* a __spare US$50 router__, preferably pre-configured to be a temporary drop-in replacement for the existing router/firewall. Ideally, configure it to forward port 22 (SSH) to the spare Linux box so you can easily access the spare box from outside.
* a USB external __hard-drive enclosure__.
* a spare PC __power supply__.
* some network cables and a couple of screwdrivers.

There are many more tools that can be kept, such as rescue CDs (like bootable Linux distributions), spare HDDs, etc., but you have to remember that your point of contact needs to be your eyes and hands, so the tools you provide should match their technical abilities.
Don’t forget to clearly label confusing things like network ports (LAN, WAN) on routers, cables and PCs.

The point is that if you can’t be on site within a short period of time, then having these cheap tools and accessories already on site means that your customers can quickly recover just by following your instructions on the phone.
Once everything is plugged in, you should be able to carry out most repairs remotely.

### Resources ###
* [Linux software RAID1 setup][2]
* [Fedora System Recovery Week][3]

[1]: http://46.101.162.45/wp-content/uploads/2008/06/266169495_863b3d935f.jpg
[2]: http://www.linuxconfig.org/Linux_Software_Raid_1_Setup
[3]: http://dailypackage.fedorabook.com/index.php?/categories/11-System-Recovery-Week

Comments

RobNY commented 16 years ago

Hi how are you doing? I managed to find this article about you trying to connect your MyBook 1GB! You may be able to help me. The 500 GB WD mybook stopped working, so after removing it from the enclosure and using a USB adaptor, Ubuntu won’t automount it. all of my files are in /dev/sdb4 Here’s what I see using fdisk -l Disk /dev/sdb: 500.1 GB, 500107862016 bytes 255 heads, 63 sectors/track, 60801 cylinders Units = cylinders of 16065 * 512 = 8225280 bytes Disk identifier: 0x00007c00 Device Boot Start End Blocks Id System /dev/sdb1 4 369 2939895 fd Linux raid autodetect /dev/sdb2 370 382 104422+ fd Linux raid autodetect /dev/sdb3 383 505 987997+ fd Linux raid autodetect /dev/sdb4 506 60801 484327620 fd Linux raid autodetect ////////////////////////////////////////////////////////////////PLEASE HELP

Author

Renaud commented 16 years ago

@RobNY: if you just follow the instructions in the article you should have no problem recovering your data (as long as it’s not a physical failure of the drive or the data is irremediably corrupted of course). Just do something like `mdadm --assemble /dev/md10 /dev/sdb4 --run`. Then, if it worked, assuming you have a `/mnt` folder, run `mount /dev/md10 /mnt`. Now your data should be accessible from `/mnt`.

IGadget commented 16 years ago

I love the RAID types picture, Is it under Creative Commons?

@ ‘IGadget’: that picture has been on the internets for a while and I couldn’t find the original for it. If anybody does, please let me know so I can give proper credit. Just search for water+cooler+raid in google…

Joeri commented 15 years ago

Thanks for the article, it saved my life 😉 For Knoppix users booting from the live CD: run `modprobe md` before the `mdadm` command to load the md module into the kernel.
