Bug#674527: MDADM RAID1 Catastrophic Failure with SSDs
Kyle
kyle at peopleplex.net
Fri May 25 08:39:08 UTC 2012
Package: mdadm
Version: Lenny (I think)
Hello Wonderful Debian Sages -
I have a RAID1 MDADM array using two identical SSDs. We had lost
our VPN to the box but I could still SSH in, and wow, what did I find.
For a while I was able to browse around in the file system, but files
kept slowly turning into ??????'s, as in the listing below.
-rwxr-xr-x 1 root root 4872 Jan 1 2011 runlevel
-????????? ? ? ? ? ? sfdisk
-rwxr-xr-x 1 root root 879 Feb 15 2011 shadowconfig
-????????? ? ? ? ? ? shorewall
-rwxr-xr-x 1 root root 15976 Jan 13 16:07 showmount
-rwxr-xr-x 1 root root 23696 Jan 1 2011 shutdown
-rwxr-xr-x 1 root root 31728 Mar 16 2009 slattach
-rwxr-xr-x 1 root root 44464 Jan 13 16:07 sm-notify
Eventually every command that touches storage reported an
"Input/output error", which presumably means the system can no longer
read the drive.
It appears that one of the SSDs failed and one stayed up, but the
one that was live was either slowly getting corrupted itself or slowly
copying bad sectors off of the failed SSD. This is just a guess at what
was happening, since I was progressively losing access to files right
in front of my eyes.
However, I did check mdadm in the middle of this and it had marked
one of the drives as faulty, yet the degradation of the system
continued anyway.
I was able to copy a couple of critical config files, but if anyone
knows a trick by which I might grab the /etc/ directory from this, that
would be a huge help and would likely save me many hours.
I haven't done anything yet but will likely be yanking both SSDs and
reverting back to old technology. One thing to be said for platter
drives: when they fail, RAID actually works right.
Does anyone have any hope or a prayer here that might save the day
(or at least let me read the /etc files)?
Are there any tricks to do this:
- Grab the firewall configuration out of memory? I'm using
shorewall but I can't access the directory.
- Get a list of the interfaces and IP addresses and tunnel
configurations.
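A minimal sketch of one way to read interface and routing state when
external binaries such as cat, ip or ifconfig can no longer be loaded
from disk. It assumes the login shell is still running and /proc is
still readable, and it uses only shell builtins. The function names
(show_file, hex_to_ip, show_routes) are made up for illustration. It
covers interfaces and IPv4 routes only, not the firewall rules.

```shell
# Print a file using only shell builtins (stand-in for cat).
show_file() {
    while IFS= read -r sf_line || [ -n "$sf_line" ]; do
        printf '%s\n' "$sf_line"
    done < "$1"
}

# Convert an address as printed in /proc/net/route (8 hex digits in
# host byte order, i.e. little-endian on amd64) to dotted-quad form.
hex_to_ip() {
    h2i_b0=${1%??????}                        # hex digits 1-2
    h2i_t=${1#??};   h2i_b1=${h2i_t%????}     # hex digits 3-4
    h2i_t=${1#????}; h2i_b2=${h2i_t%??}       # hex digits 5-6
    h2i_b3=${1#??????}                        # hex digits 7-8
    printf '%d.%d.%d.%d\n' "$((0x$h2i_b3))" "$((0x$h2i_b2))" \
        "$((0x$h2i_b1))" "$((0x$h2i_b0))"
}

# Read /proc/net/route format on stdin and print
# "interface destination gateway netmask" for each route.
show_routes() {
    read -r sr_header    # skip the column header line
    while read -r sr_if sr_dst sr_gw sr_flags sr_ref sr_use sr_met sr_mask sr_rest; do
        printf '%s %s %s %s\n' "$sr_if" "$(hex_to_ip "$sr_dst")" \
            "$(hex_to_ip "$sr_gw")" "$(hex_to_ip "$sr_mask")"
    done
}
```

Possible usage on the affected box: `show_file /proc/net/dev` to list
the interfaces and `show_routes < /proc/net/route` for the routing
table. Copy the output from the terminal rather than redirecting it to
a file on the failing array.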
Or best of all any tricks that might get me access to the filesystem
here again on EITHER volume?
The processes in memory are running some critical functions and amusingly
they seem fine (they don't use disk) so things are still running but
obviously if I reboot ...
Don't ask for logs (unless I can get them out of memory, /proc, etc.) as
all of the log files are inaccessible. Depressing, because I thought
using RAID1 here would protect me from these issues. At least the
system didn't halt.
Thanks for any tips/help at all.
This machine will likely stay in this state for another 24 hours before
I rebuild it with non-SSDs.
Is there anything else I can retrieve from this machine that may help
isolate the cause and prevent a repeat of this for someone else?
-------------------------------------------------------------------------
XXXXXX:/sbin# uname -r
2.6.32-5-amd64
-------------------------------------------------------------------------
XXXXXX:/# ls
ls: reading directory .: Input/output error
-------------------------------------------------------------------------
XXXXXX:/sbin# mdadm --examine /dev/md0
mdadm: No md superblock detected on /dev/md0.
-------------------------------------------------------------------------
XXXXXX:/dev# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Oct 18 19:04:03 2011
Raid Level : raid1
Array Size : 59939768 (57.16 GiB 61.38 GB)
Used Dev Size : 59939768 (57.16 GiB 61.38 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Update Time : Thu May 24 23:50:02 2012
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       17        1      active sync   /dev/sdb1
       0       8        1        -      faulty spare   /dev/sda1
-------------------------------------------------------------------------
XXXXXX:/dev# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sda5[0](F) sdb5[1]
2578420 blocks super 1.2 [2/1] [_U]
md0 : active raid1 sda1[0](F) sdb1[1]
59939768 blocks super 1.2 [2/1] [_U]
unused devices: <none>
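
In the /proc/mdstat output above, failed members carry an (F) suffix.
As a rough sketch, the helper below (failed_members is a hypothetical
name) pulls those out of mdstat-format text on stdin using only shell
builtins, so it does not depend on grep or awk being readable. It
assumes the line layout shown above and has not been checked against
other mdstat variants.

```shell
# Print "array member" for every member flagged (F) in mdstat-format
# text read from stdin.
failed_members() {
    while IFS= read -r fm_line; do
        case "$fm_line" in
            md*" : "*)
                fm_array=${fm_line%% *}       # e.g. "md0"
                fm_rest=$fm_line
                while [ -n "$fm_rest" ]; do
                    fm_word=${fm_rest%% *}    # next space-separated word
                    case "$fm_word" in
                        *"(F)") printf '%s %s\n' "$fm_array" "${fm_word%%\[*}" ;;
                    esac
                    [ "$fm_rest" = "$fm_word" ] && break   # last word done
                    fm_rest=${fm_rest#* }
                done
                ;;
        esac
    done
}
```

Possible usage: `failed_members < /proc/mdstat`.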