Bug#588516: success/error reporting for checkarray cronjobs

Mon Oct 11 16:10:47 UTC 2010

Package: mdadm
Version: 3.1.4-1+8efb9d1
Severity: normal
Tags: patch

I proposed a (maybe partial) fix for this at the end of #405919, which I'm
attaching below.

mdadm includes logcheck rules which mean that non-zero mismatch counts
get reported. However, due to a kernel bug (or at least weirdness), the
mismatch count is basically meaningless for RAID1 and RAID10 anyway.

The attached patches change the logcheck rules so that mismatches are no
longer reported there, and extend the daily cron job so that the mismatch
count is checked on array types where it is actually worth looking at.

So, you don't get an error message immediately, but you should get one
within 24 hours.
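
To see in advance which arrays the daily job would warn about, something like
the following could be run by hand. It is only a sketch: the function names
and the optional sysfs root argument are made up for illustration and are not
part of the mdadm package, and it assumes the same RAID1/RAID10 exclusion as
the patch below.

```shell
# Sketch only: level_is_checked, list_mismatches and the sysroot argument
# are hypothetical names, not part of the mdadm package.

# Succeed when the mismatch count is worth acting on for this RAID level.
level_is_checked() {
	case "$1" in
		raid1|raid10) return 1 ;;	# count is unreliable on these levels
		*) return 0 ;;
	esac
}

# Print one line per array: name, level, count, and whether it would warn.
list_mismatches() {
	sysroot=${1:-/sys/block}
	for mcnt in "$sysroot"/md*/md/mismatch_cnt; do
		[ -f "$mcnt" ] || continue	# glob matched nothing
		mddir=$(dirname "$mcnt")
		read -r cnt < "$mcnt"
		read -r level < "$mddir/level"
		if [ "$cnt" != 0 ] && level_is_checked "$level"; then
			verdict=warn
		else
			verdict=quiet
		fi
		echo "$(basename "$(dirname "$mddir")") $level $cnt $verdict"
	done
}

list_mismatches "$@"
```

Passing a directory as the first argument lets the logic be tried against a
fake tree instead of the real /sys/block.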

Tim.


*** mdadm-logcheck-patch.diff

--- mdadm.orig	2010-09-28 16:45:03.000000000 +0100
+++ /etc/logcheck/ignore.d.server/mdadm	2010-09-28 16:58:25.000000000 +0100
@@ -17,7 +17,7 @@
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])? RAID([01456]|10) conf printout:$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])?[[:space:]]+---( [wrf]d:[[:digit:]]+){2,3}$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ kernel:( \[ *[[:digit:]]+\.[[:digit:]]+\])?[[:space:]]+disk [[:digit:]]+,( wo:[[:digit:]]+,)? o:[[:digit:]]+, dev:[[:alnum:]]+$
-^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+$
+^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+(, component device  ?mismatches found: [[:digit:]]+)?$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: SpareActive event detected on md device /dev/[-_./[:alnum:]]+, component device /dev/[-_./[:alnum:]]+$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: (New|Degraded)Array event detected on md device /dev/[-_./[:alnum:]]+$
 ^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: DeviceDisappeared event detected on md device /dev/[-_./[:alnum:]]+$
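
The changed rule can be sanity-checked outside logcheck with grep -E. This is
a rough sketch which assumes GNU grep (for \w); the two syslog lines are
invented samples, not taken from a real log, and rule_matches is a
hypothetical helper.

```shell
# New rule from the patch above, copied verbatim.
new_rule='^\w{3} [ :0-9]{11} [._[:alnum:]-]+ mdadm(\[[[:digit:]]+\])?: Rebuild((Start|Finish)ed|[[:digit:]]+) event detected on md device /dev/[-_./[:alnum:]]+(, component device  ?mismatches found: [[:digit:]]+)?$'

# Hypothetical syslog lines, invented for illustration.
plain='Sep 28 16:58:25 host mdadm[1234]: RebuildFinished event detected on md device /dev/md0'
with_count='Sep 28 16:58:25 host mdadm[1234]: RebuildFinished event detected on md device /dev/md0, component device  mismatches found: 128'

# Succeed when the extended regex in $1 matches the whole line in $2.
rule_matches() {
	printf '%s\n' "$2" | grep -Eq -- "$1"
}

# A line matched by an ignore rule is dropped by logcheck; others get mailed.
for line in "$plain" "$with_count"; do
	if rule_matches "$new_rule" "$line"; then
		echo "ignored:  $line"
	else
		echo "reported: $line"
	fi
done
```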

*** /home/tim/mdadm-mismatch-fix.diff
--- /etc/cron.daily/mdadm.old	2010-09-28 15:35:15.954390947 +0100
+++ /etc/cron.daily/mdadm	2010-09-28 17:07:19.954518154 +0100
@@ -15,4 +15,59 @@
 MDADM=/sbin/mdadm
 [ -x $MDADM ] || exit 0 # package may be removed but not purged
 
+PRINT_SUMMARY=0
+
+for mcnt in /sys/block/md*/md/mismatch_cnt
+do
+	if [ -f "$mcnt" ]
+	then
+		read cnt < "$mcnt"
+		read level < "$( dirname "$mcnt" )/level"
+		if [ "$cnt" != 0 ] && [ "$level" != "raid1" ] && [ "$level" != "raid10" ]
+		then
+			cat << WARN_TEXT
+
+Warning - $mcnt indicates that the associated RAID
+device has $cnt sectors in which the data on one array member is inconsistent
+with the data on the other array member(s).
+WARN_TEXT
+			PRINT_SUMMARY=1
+		fi
+	fi
+done
+
+# Explain the warnings once if any array above reported a mismatch.
+
+
+if [ $PRINT_SUMMARY != 0 ]
+then
+	cat << WARN_TEXT
+
+DATA LOSS MAY HAVE OCCURRED.
+
+This condition may have been caused by one or more of the following events:
+
+. A power failure whilst the array was being written to.
+. Data corruption by faulty hard disk drive, drive controller, cabling, RAM,
+    motherboard, PSU etc.
+. A kernel bug.
+. An array being forcibly created in an inconsistent state using the
+    "--assume-clean" argument to mdadm.
+
+This count is updated when the md subsystem carries out a 'check' or
+'repair' action.  In the case of 'repair' it reflects the number of
+mismatched sectors prior to carrying out the repair.
+
+Once you have fixed the error, carry out a 'check' action to reset the count
+to zero.
+
+Note that this check is only applied to arrays which aren't RAID1 or RAID10,
+due to a kernel limitation.  See the md (section 4) manual page, and the
+following URL for details:
+
+https://raid.wiki.kernel.org/index.php/Linux_Raid#Frequently_Asked_Questions_-_FAQ
+
+WARN_TEXT
+fi
+
 exec $MDADM --monitor --scan --oneshot
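
The warning text asks the admin to carry out a 'check' action to reset the
count. One way to do that is to write to the array's sync_action file in
sysfs, roughly as below. This is a sketch: start_check and its optional
sysfs root argument are hypothetical names, and on a real system it needs to
run as root.

```shell
# Hypothetical helper: request a 'check' scrub on one md array via sysfs.
# $1 is the array name (e.g. md0), $2 an optional sysfs root for testing.
start_check() {
	array=$1
	sysroot=${2:-/sys/block}
	action="$sysroot/$array/md/sync_action"
	if [ ! -w "$action" ]; then
		echo "cannot write $action" >&2
		return 1
	fi
	echo check > "$action"
}

# Usage (as root):  start_check md0
```

The count in mismatch_cnt is only final once the check has run to
completion, so the next daily run is the natural place to look at it again.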