Bug#674527: MDADM RAID1 Catastrophic Failure with SSDs

Kyle kyle at peopleplex.net
Fri May 25 08:39:08 UTC 2012


Package:  mdadm
Version:  Lenny (I think)

Hello Wonderful Debian Sages -

I have a RAID1 MDADM array built from two identical SSDs.  We had lost
our VPN to the box, but I could still SSH in, and wow, what did I find.

For a while I was able to browse around in the file system, but files
kept slowly turning into ?????? entries like the ones below.

-rwxr-xr-x  1 root root   4872 Jan  1  2011 runlevel
-?????????  ? ?    ?         ?            ? sfdisk
-rwxr-xr-x  1 root root    879 Feb 15  2011 shadowconfig
-?????????  ? ?    ?         ?            ? shorewall
-rwxr-xr-x  1 root root  15976 Jan 13 16:07 showmount
-rwxr-xr-x  1 root root  23696 Jan  1  2011 shutdown
-rwxr-xr-x  1 root root  31728 Mar 16  2009 slattach
-rwxr-xr-x  1 root root  44464 Jan 13 16:07 sm-notify

Eventually every command that touches storage reports an
"Input/output error", which presumably means it can't read the
drive.

It appears that one of the SSDs failed and one stayed up, but the
surviving one was either slowly getting corrupted itself or was slowly
copying bad sectors off the failed SSD.  This is just a guess; all I
could see was a progressive loss of access to files right in front of
my eyes.

However, I did check mdadm in the middle of this, and it had marked
one of the drives as faulty, but the degradation of the system
continued anyway.

I was able to copy a couple of critical config files, but if anyone
knows a trick for grabbing the /etc/ directory from this box, that
would be a huge help and would likely save me many hours.

I haven't done anything yet but will likely be yanking both SSDs and
reverting to old technology.  One thing to be said for platter
drives: when they fail, RAID actually works right.

Does anyone have any hope or a prayer here that might save the day
(at least enough to read the /etc files)?

Are there any tricks to do this:

  - Grab the firewall configuration out of memory?  I'm using
    shorewall but I can't access the directory.

  - Get a list of the interfaces and IP addresses and tunnel
    configurations.

Or best of all any tricks that might get me access to the filesystem
here again on EITHER volume?
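
On the interface and routing part of the question: /proc is served from
kernel memory rather than from the failed disks, so /proc/net/dev and
/proc/net/route may still be readable with shell builtins alone (for
example a "while read" loop redirected from the file) and pasted out
through the open SSH session.  I have not confirmed this on the box.
The route table there is hex-encoded, so below is a rough sketch of a
decoder to run on another machine.  The function names and the SAMPLE
text are made up for illustration, and the byte order is assumed to be
little-endian, as on amd64.

```python
import socket
import struct

def hex_to_ipv4(field):
    """Decode one hex address field (assumed little-endian, as on amd64)."""
    return socket.inet_ntoa(struct.pack("<I", int(field, 16)))

def parse_routes(text):
    """Return (iface, destination, gateway, netmask) tuples from route text."""
    routes = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        cols = line.split()
        if len(cols) < 8:
            continue  # ignore truncated or blank lines
        routes.append((cols[0],
                       hex_to_ipv4(cols[1]),   # Destination
                       hex_to_ipv4(cols[2]),   # Gateway
                       hex_to_ipv4(cols[7])))  # Mask
    return routes

# Made-up sample in the /proc/net/route layout, NOT from the affected machine.
SAMPLE = (
    "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n"
    "eth0\t0000A8C0\t00000000\t0001\t0\t0\t0\t00FFFFFF\t0\t0\t0\n"
    "eth0\t00000000\t0100A8C0\t0003\t0\t0\t0\t00000000\t0\t0\t0\n"
)

for route in parse_routes(SAMPLE):
    print(route)
```

This only recovers kernel routing state; it says nothing about the
shorewall rule files themselves.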

The processes in memory are running some critical functions and,
amusingly, they seem fine (they don't use disk), so things are still
running, but obviously if I reboot ...

Don't ask for logs (unless I can get them out of memory or /proc,
etc.), as all of the log files are inaccessible.  Depressing, because
I thought using RAID1 here would protect me from these issues.  At
least the system didn't halt.

Thanks for any tips/help at all.

This machine will likely stay in this state for another 24 hours
before I rebuild it with non-SSDs.

Is there anything else I can retrieve from this machine that may help
keep this from happening to someone else?


-------------------------------------------------------------------------


XXXXXX:/sbin# uname -r
2.6.32-5-amd64

-------------------------------------------------------------------------

XXXXXX:/# ls
ls: reading directory .: Input/output error

-------------------------------------------------------------------------

XXXXXX:/sbin# mdadm --examine /dev/md0
mdadm: No md superblock detected on /dev/md0.

-------------------------------------------------------------------------

XXXXXX:/dev# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Tue Oct 18 19:04:03 2011
     Raid Level : raid1
     Array Size : 59939768 (57.16 GiB 61.38 GB)
  Used Dev Size : 59939768 (57.16 GiB 61.38 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu May 24 23:50:02 2012
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1

       0       8        1        -      faulty spare   /dev/sda1

-------------------------------------------------------------------------

XXXXXX:/dev# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda5[0](F) sdb5[1]
      2578420 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sda1[0](F) sdb1[1]
      59939768 blocks super 1.2 [2/1] [_U]

unused devices: <none>
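
-------------------------------------------------------------------------

In case it helps anyone triage similar output, here is a minimal sketch
that pulls the members flagged (F) out of mdstat text.  It runs on the
text pasted above, not on the live box, and the function name and
regular expressions are my own assumptions rather than anything mdadm
provides.

```python
import re

# The /proc/mdstat text exactly as shown in the transcript above.
MDSTAT = """Personalities : [raid1]
md1 : active raid1 sda5[0](F) sdb5[1]
      2578420 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sda1[0](F) sdb1[1]
      59939768 blocks super 1.2 [2/1] [_U]

unused devices: <none>
"""

def failed_members(text):
    """Map each md array name to the member devices marked (F)."""
    result = {}
    for line in text.splitlines():
        # Array lines look like: "md0 : active raid1 sda1[0](F) sdb1[1]"
        match = re.match(r"^(md\d+) : \w+ \w+ (.*)$", line)
        if not match:
            continue
        array, members = match.groups()
        result[array] = re.findall(r"(\w+)\[\d+\]\(F\)", members)
    return result

print(failed_members(MDSTAT))
```

For the output above this reports sda5 as failed in md1 and sda1 as
failed in md0, i.e. both failed members sit on the same physical
drive, sda.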







