Bug#588965: Please add support for replacing a failing but still usable drive with a spare without marking the first drive as failed

Tue Jul 13 20:51:42 UTC 2010

Package: mdadm
Version: 3.1.2-2
Severity: wishlist
Tags: upstream

Hi,

Especially with RAID5 arrays, it would often be a life-saver to be able to
activate a hot-spare and start replacing a live drive with it without first
marking that drive as failed.

Consider the following scenario. Let's say we have a RAID5 array composed of
sdb, sdc and sdd, with sde added as a spare (i.e. 3 active drives).

sdc starts to noticeably fail. Unknown to the user, sdd has also developed a
bad sector. The user marks sdc as failed and waits for sde to be synced;
however, during the resync, the system hits the bad sector on sdd, causing
sdd to also be marked as failed, the resync to fail and the array to become
unusable. (The same can happen if an intermittent bit error occurs during
the resync operation.)

The algorithm I'd like to see implemented would work as follows:

sdc starts to noticeably fail. The user marks it for replacement. sde is
activated and the system copies everything from sdc to sde, using the
redundancy provided by the other drives if/when necessary. Temporarily,
while this operation is in progress, sdc and sde are both active and in the
same slot; any writes that hit the array get committed to both. When sde is
completely up to date, sdc gets deactivated and marked as failed. The bad
sector on sdd doesn't compromise our ability to sync the hot-spare, because
the affected stripe can still be read directly from sdc. At this point,
another spare could be added, sdd marked for replacement, and so on.
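
To make the difference between the two approaches concrete, here is a
minimal toy model in Python. It is only a sketch of the idea, not mdadm or
kernel code: the Drive class, ReadError, rebuild_after_fail() and
hot_replace() are hypothetical names, each "block" is a single integer, and
I assume the simplest RAID5 property that the XOR of all member blocks in a
stripe is zero, so any one block equals the XOR of the others.

```python
from functools import reduce


class ReadError(Exception):
    """Raised when a simulated drive hits an unreadable sector."""


class Drive:
    """Hypothetical in-memory drive: a list of blocks plus bad sectors."""

    def __init__(self, name, blocks, bad=()):
        self.name = name
        self.blocks = list(blocks)
        self.bad = set(bad)

    def read(self, i):
        if i in self.bad:
            raise ReadError(f"{self.name}: bad sector {i}")
        return self.blocks[i]


def xor_all(values):
    """XOR a sequence of integer blocks together."""
    return reduce(lambda a, b: a ^ b, values)


def rebuild_after_fail(peers, n):
    """Fail-first approach: the old drive is already out of the array,
    so every block of the spare must be reconstructed from the peers."""
    return [xor_all(p.read(i) for p in peers) for i in range(n)]


def hot_replace(old, peers, n):
    """Proposed approach: copy from the old drive and fall back to the
    peers only for blocks the old drive cannot deliver."""
    spare = []
    for i in range(n):
        try:
            spare.append(old.read(i))
        except ReadError:
            spare.append(xor_all(p.read(i) for p in peers))
    return spare


# The scenario above: sdc is failing (block 1 unreadable) and sdd has
# an undetected bad sector in a different stripe (block 3).
sdb = Drive("sdb", [1, 2, 3, 4])
sdd = Drive("sdd", [5, 6, 7, 8], bad={3})
sdc = Drive("sdc", [b ^ d for b, d in zip(sdb.blocks, sdd.blocks)], bad={1})

sde_blocks = hot_replace(sdc, [sdb, sdd], 4)
```

In this model hot_replace() fills sde completely, because block 1 comes from
sdb and sdd while block 3 comes straight from sdc. Calling
rebuild_after_fail([sdb, sdd], 4) instead raises ReadError at block 3, which
corresponds to the double failure described earlier. The sketch leaves out
the mirroring of new writes to both sdc and sde while the copy is running.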

I realise this also requires changes to the kernel. Apologies if it's
already planned; I haven't seen it discussed anywhere.

Best regards,

Andras

-- 
 Andras Korn <korn at elan.rulez.org> - <http://chardonnay.math.bme.hu/~korn/>
  All that glitters may not be gold, but it sure has a high refractive index.