Archive for June, 2008

Linux sysadmin: a short RAID trouble-shooting story

Linux I recently had an issue at a remote location (12000km away) where the old multi-purpose Linux server that had been working for the past 5 years wouldn’t boot again after a nasty power failure.
The server was used as a firewall, a local email store, a file server and a backup server, so its failure is a big deal for the small business that was using it.

RAID explained
RAID configurations explained
You can’t always have complete redundancy, so some amount of bad crash is to be expected over the years. Fortunately, I always construct my servers around a simple software RAID1 array and that leaves some hope for recovery.
In this instance, the server would start and then miserably fail in a fashion that would suggest a hardware failure of some sort. Not being able to be physically present and having no dedicated system admin on location, I directed the most knowledgeable person there to use a spare internet router to recover Internet connectivity and connect one of the disk to another Linux server (their fax server) through a USB external drive.

Doing this, I was able to remotely connect to the working server and access the disk, mount it and access the data.

Salvaging the data

Once one of the RAID1 drives was placed into the USB enclosure and connected to the other available Linux box it was easy to just remount the drives:

fdisk will tell us which partitions are interesting, assuming that /dev/sdc is our usb harddrive:

[root@fax ~]# fdisk -l /dev/sdc
Disk /dev/sdc: 81.9 GB, 81964302336 bytes
16 heads, 63 sectors/track, 158816 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Device      Boot    Start         End      Blocks   Id  System
/dev/sdc1   *           1         207      104296+  fd  Linux raid autodetect
/dev/sdc2             208       20526    10240776   fd  Linux raid autodetect
/dev/sdc3           20527       22615     1052856   fd  Linux raid autodetect
/dev/sdc4           22616      158816    68645304    f  W95 Ext'd (LBA)
/dev/sdc5           22616      158816    68645272+  fd  Linux raid autodetect

We can’t simply mount the partitions, they need to be assembled into a RAID partition first:

[root@fax ~]# mdadm --assemble /dev/md6 /dev/sdc1 --run
mdadm: /dev/md6 has been started with 1 drive (out of 2).

The --run argument forces the RAID partition to be assembled, otherwise, mdadm will complain that there is only a single drive available instead of the 2 -or more- it would expect.

Now simply mount the assembled partition to make it accessible in /mnt for instance:

[root@fax ~]# mount /dev/md6 /mnt

We can now salvage our data by repeating this process for every partition.
Using RAID1 means you have at least 2 disks to choose from, so if one is damaged beyond repair, you may be lucky and the mirror one on the other drive should work.

If the drives are not physically damaged but they won’t boot, you can always use a pair (or more) of USB HDD enclosures and reconstruct the RAID arrays manually from another Linux box.

Planning for disasters

The lesson here is about planning: you can’t foresee every possible event and have contingencies for each one of them, either because of complexity or cost, but you can easily make your life much easier by planning ahead a little bit.

Most small businesses cannot afford dedicated IT staff, so they will usually end-up having the least IT-phobic person become their ‘system administrator’.
It’s your job as a consultant/technical support to ensure that they have the minimum tools at hand to perform emergency recovery, especially if you cannot intervene yourself quickly.

On-Site emergency tools

In every small business spare parts closet there should be at least:

  • Whenever possible, a spare Linux box, even if it’s just using older salvaged components (like a decommissioned PC). Just have a generic Linux install on it and make sure it is configured so it can be plugged in and accessed from the network.
  • a spare US$50 router, preferably pre-configured to be a temporary drop-in replacement for the existing router/firewall. Ideally, configure it to forward port 22 (SSH) to the spare Linux box so you can easily access the spare box from outside.
  • USB external hard-drive enclosure.
  • a spare PC power supply.
  • some network cables, a couple of screwdrivers.

There are many more tools, such as rescue-CDs (like bootable Linux distributions), spare HDD, etc that can be kept but you have to remember that your point of contact need to be able to be your eyes and hands, so the amount of tools you provide should match their technical abilities.
Don’t forget to clearly label confusing things like network ports (LAN, WAN) on routers, cables and PCs.

The point is that if you can’t be on site within a short period of time, then having these cheap tools and accessories already on site mean that your customers can quickly recover just by following your instructions on the phone.
Once everything is plugged-in, you should be able to remotely carry-out most repairs.


5 comments June 7th, 2008

Next Posts

Most Recent Posts



Posts by Month