Linux sysadmin: a short RAID trouble-shooting story

Saturday, June 7, 2008

I recently had an issue at a remote location (12,000 km away) where the old multi-purpose Linux server that had been working for the past 5 years wouldn’t boot again after a nasty power failure.
The server was used as a firewall, a local email store, a file server and a backup server, so its failure was a big deal for the small business that was using it.

RAID configurations explained
You can’t always have complete redundancy, so the occasional bad crash is to be expected over the years. Fortunately, I always build my servers around a simple software RAID1 array, which leaves some hope for recovery.
In this instance, the server would start and then fail miserably in a way that suggested a hardware failure of some sort. Since I couldn’t be physically present and there was no dedicated system admin on location, I directed the most knowledgeable person there to use a spare router to restore Internet connectivity and to connect one of the disks to another Linux server (their fax server) through an external USB drive enclosure.

With that done, I was able to connect remotely to the working server, mount the disk and access the data.

Salvaging the data

Once one of the RAID1 drives was placed into the USB enclosure and connected to the other available Linux box, it was easy to mount its partitions again.

fdisk will tell us which partitions are interesting, assuming that /dev/sdc is our USB hard drive:

[root@fax ~]# fdisk -l /dev/sdc
Disk /dev/sdc: 81.9 GB, 81964302336 bytes
16 heads, 63 sectors/track, 158816 cylinders
Units = cylinders of 1008 * 512 = 516096 bytes
Device      Boot    Start         End      Blocks   Id  System
/dev/sdc1   *           1         207      104296+  fd  Linux raid autodetect
/dev/sdc2             208       20526    10240776   fd  Linux raid autodetect
/dev/sdc3           20527       22615     1052856   fd  Linux raid autodetect
/dev/sdc4           22616      158816    68645304    f  W95 Ext'd (LBA)
/dev/sdc5           22616      158816    68645272+  fd  Linux raid autodetect
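
On a disk with many partitions, picking out the RAID members by eye gets tedious. Here is a minimal sketch that filters the fdisk listing for them. The helper name list_raid_partitions is my own invention, and it assumes fdisk labels the partition type "Linux raid autodetect" as it does in the output above:

```shell
#!/bin/sh
# Sketch: print the partitions that fdisk reports as "Linux raid autodetect".
# list_raid_partitions is a hypothetical helper, not part of fdisk or mdadm.
list_raid_partitions() {
    # Reads "fdisk -l" output on stdin. The Id column shifts one field to the
    # right when the Boot flag (*) is set, so match on the type text rather
    # than on a fixed column number.
    awk '/^\/dev\// && /Linux raid autodetect/ { print $1 }'
}

# Usage (as root): fdisk -l /dev/sdc | list_raid_partitions
```

With the listing above, this should print /dev/sdc1, /dev/sdc2, /dev/sdc3 and /dev/sdc5, skipping the extended partition container /dev/sdc4.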

We can’t simply mount the partitions; each one needs to be assembled into a RAID array first:

[root@fax ~]# mdadm --assemble /dev/md6 /dev/sdc1 --run
mdadm: /dev/md6 has been started with 1 drive (out of 2).

The --run argument forces the RAID array to be started even though it is degraded. Without it, mdadm would complain that only a single drive is available instead of the 2 (or more) it expects.
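
To double-check which arrays are running degraded, the kernel's /proc/mdstat can be inspected. The following is only a sketch: degraded_arrays is a hypothetical helper, and it assumes the usual mdstat layout where each array's status line carries a "[total/active]" pair such as [2/1]:

```shell
#!/bin/sh
# Sketch: list md arrays running with fewer members than configured.
# degraded_arrays is a hypothetical helper; it reads /proc/mdstat-style
# text on stdin.
degraded_arrays() {
    awk '
        # Array header line, e.g. "md6 : active raid1 sdc1[0]"
        /^md[0-9]+ :/ { name = $1 }
        # Status line holds "[total/active]", e.g. "[2/1]"
        name != "" && match($0, /\[[0-9]+\/[0-9]+\]/) {
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[2] + 0 < n[1] + 0) print name
            name = ""
        }
    '
}

# Usage: degraded_arrays < /proc/mdstat
```

An array assembled from a single half of a mirror, like /dev/md6 here, is expected to show up in that list.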

Now simply mount the assembled array to make it accessible, in /mnt for instance:

[root@fax ~]# mount /dev/md6 /mnt

We can now salvage our data by repeating this process for every partition.
Using RAID1 means you have at least 2 disks to choose from, so if one is damaged beyond repair, with luck its mirror on the other drive will still work.
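
Repeating the assemble-and-mount step for every partition can be sketched as a small loop. Everything specific in it is an assumption for illustration: the helper name plan_salvage, the md device numbers starting at 10, and the /mnt/salvage mount points. It only prints the commands so they can be reviewed before anything touches the disk, and it mounts read-only as an extra precaution that the steps above don't use:

```shell
#!/bin/sh
# Sketch: print the assemble/mount commands for each RAID member partition.
# plan_salvage, the md numbers and the /mnt/salvage paths are illustrative
# assumptions, not something mdadm requires.
plan_salvage() {
    n=10
    for part in "$@"; do
        # --run starts the array even with a single member present
        echo "mdadm --assemble /dev/md$n $part --run"
        echo "mkdir -p /mnt/salvage/md$n"
        # Read-only mount: a precaution added in this sketch
        echo "mount -o ro /dev/md$n /mnt/salvage/md$n"
        n=$((n + 1))
    done
}

# Review the plan first, then run it as root, for example:
#   plan_salvage /dev/sdc1 /dev/sdc2 /dev/sdc3 /dev/sdc5 | sh -x
```

Members that don't hold a filesystem (swap, for instance) will assemble but refuse to mount, which is harmless.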

If the drives are not physically damaged but they won’t boot, you can always use a pair (or more) of USB HDD enclosures and reconstruct the RAID arrays manually from another Linux box.
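
When both halves of a mirror are attached through USB enclosures, the same mdadm --assemble call can simply be given both partitions. A rough sketch follows, where assemble_cmd is a made-up helper and /dev/sdd1 stands in for wherever the second drive happens to appear:

```shell
#!/bin/sh
# Sketch: build the mdadm command line for assembling one array.
# assemble_cmd is a hypothetical helper; device names are examples only.
assemble_cmd() {
    md=$1
    shift
    if [ "$#" -ge 2 ]; then
        # Two or more members given: a plain assemble
        echo "mdadm --assemble $md $*"
    else
        # Single member: --run starts the array degraded
        echo "mdadm --assemble $md $* --run"
    fi
}

# Usage: assemble_cmd /dev/md6 /dev/sdc1 /dev/sdd1
```

If the two halves have diverged since the crash, mdadm may start the array with only the more recent one. Running mdadm --examine on each partition shows its event count, which helps tell which half is newer.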

Planning for disasters

The lesson here is about planning: you can’t foresee every possible event and have contingencies for each one of them, either because of complexity or cost, but you can easily make your life much easier by planning ahead a little bit.

Most small businesses cannot afford dedicated IT staff, so they will usually end up having the least IT-phobic person become their ‘system administrator’.
It’s your job as a consultant/technical support to ensure that they have the minimum tools at hand to perform emergency recovery, especially if you cannot intervene yourself quickly.

On-Site emergency tools

In every small business’s spare parts closet there should be at least:

  • a spare Linux box whenever possible, even if it’s just built from older salvaged components (like a decommissioned PC). Just have a generic Linux install on it and make sure it is configured so it can be plugged in and accessed from the network.
  • a spare US$50 router, preferably pre-configured to be a temporary drop-in replacement for the existing router/firewall. Ideally, configure it to forward port 22 (SSH) to the spare Linux box so you can easily access the spare box from outside.
  • a USB external hard-drive enclosure.
  • a spare PC power supply.
  • some network cables and a couple of screwdrivers.
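
A generic install does not always include everything needed for a recovery like the one above, so it can be worth checking the spare box before it goes into the closet. This is a small sketch: missing_tools is a hypothetical helper, and the list of tools passed to it is my assumption about what such a recovery needs:

```shell
#!/bin/sh
# Sketch: report which recovery tools are missing from the spare box.
# missing_tools is a hypothetical helper; pass it whatever commands you
# expect to need during a remote recovery.
missing_tools() {
    for tool in "$@"; do
        # command -v succeeds only if the tool is found in PATH
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage (as root, so that /sbin is in PATH):
#   missing_tools mdadm fdisk mount sshd
```

No output means every listed tool was found.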

There are many more tools that can be kept, such as rescue CDs (like bootable Linux distributions), spare HDDs, etc., but remember that your point of contact needs to be your eyes and hands, so the tools you provide should match their technical abilities.
Don’t forget to clearly label confusing things like network ports (LAN, WAN) on routers, cables and PCs.

The point is that if you can’t be on site within a short period of time, then having these cheap tools and accessories already on site means that your customers can quickly recover just by following your instructions on the phone.
Once everything is plugged in, you should be able to carry out most repairs remotely.

Entry Filed under  :  Linux,sysadmin

5 Comments

  • 1. RobNY  |  November 19th, 2008 at 11:54 pm

    Hi how are you doing?
    I managed to find this article about you trying to connect your MyBook 1GB! You may be able to help me.

    The 500 GB WD mybook stopped working, so after removing it from the enclosure and using a USB adaptor, Ubuntu won’t automount it.
    all of my files are in /dev/sdb4

    Here’s what I see using fdisk -l

    Disk /dev/sdb: 500.1 GB, 500107862016 bytes
    255 heads, 63 sectors/track, 60801 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes
    Disk identifier: 0x00007c00
    Device      Boot    Start         End      Blocks   Id  System
    /dev/sdb1               4         369     2939895   fd  Linux raid autodetect
    /dev/sdb2             370         382      104422+  fd  Linux raid autodetect
    /dev/sdb3             383         505      987997+  fd  Linux raid autodetect
    /dev/sdb4             506       60801   484327620   fd  Linux raid autodetect

    ////////////////////////////////////////////////////////////////PLEASE HELP

  • 2. Renaud  |  November 20th, 2008 at 9:29 am

    @RobNY: if you just follow the instructions in the article you should have no problem recovering your data (as long as it’s not a physical failure of the drive or the data is irremediably corrupted of course).

    Just do something like:

    mdadm --assemble /dev/md10 /dev/sdb4 --run

    Then, if it worked, assuming you have a /mnt folder:

    mount /dev/md10 /mnt

    Now your data should be accessible from /mnt.

  • 3. IGadget  |  December 5th, 2008 at 11:50 am

    I love the RAID types picture, Is it under Creative Commons?

  • 4. Renaud  |  December 5th, 2008 at 12:03 pm

    @ ‘IGadget’: that picture has been on the internets for a while and I couldn’t find the original for it. If anybody does, please let me know so I can give proper credit.
    Just search for water+cooler+raid in google…

  • 5. Joeri  |  January 31st, 2009 at 2:59 am

    Thanks for the article, it saved my life ;-)

    For Knoppix users booting from the live CD: run modprobe md before the mdadm command to load the md module into the kernel.



about

Renaud This is a simple technical weblog where I dump thoughts and experiences from my computer-related world.
It is mostly focused on software development but I also have wider interests and dabble in architecture, business and system administration.