multipath-tools: cascading failures

Vincent McIntyre Vince.McIntyre at atnf.csiro.au
Fri Oct 16 00:52:42 UTC 2009


Hi,

I am trying to understand a strange fault and I would like to get some
advice from you multipath experts. If I should be asking elsewhere,
please excuse me and let me know where I should be asking.

I'm asking for help so we can best direct our debugging efforts -
see the 'Questions' section below. We also wanted to raise this
because the system behaviour is probably not what the Linux kernel and
multipath-tools programmers intended.


System setup
------------
I have a Dell PowerEdge 1750 i386 box running Debian etch.

We are running the kernel in the Debian package linux-image-2.6.18-6-686:
% uname -a
Linux horus 2.6.18-6-686 #1 SMP Thu Aug 20 21:56:59 UTC 2009 i686 GNU/Linux
# apt-cache policy linux-image-2.6.18-6-686
linux-image-2.6.18-6-686:
   Installed: 2.6.18.dfsg.1-24etch4
   Candidate: 2.6.18.dfsg.1-24etch4
   Version table:
  *** 2.6.18.dfsg.1-24etch4 0
         500 http://debian-archive etch/updates/main Packages
         100 /var/lib/dpkg/status
      2.6.18.dfsg.1-24 0
         500 http://debian-archive etch/main Packages


I have the host connected via an LSI FC card to a Qlogic switch
(model SB5202). LSI card details -
03:06.0 0c04: 1000:0626
03:06.0 Fibre Channel: LSI Logic / Symbios Logic FC929X Fibre Channel Adapter
03:06.1 0c04: 1000:0626
03:06.1 Fibre Channel: LSI Logic / Symbios Logic FC929X Fibre Channel Adapter

The host connects via the switch to a Promise VTrak E610f rev: 1005.
The VTrak is configured as a 'stack' containing an E-class chassis and
two J-class chassis. The E-class has the smarts; the J-class units are JBODs
with a simpler controller.

The E-class and one of the J-class units have 750 GB disks; the other
J-class was added later and has 1 TB disks. To support this
configuration, we had to upgrade the firmware in the E-class unit,
which we did according to the manufacturer's recommendation.
This may be the root cause of our issue, but we are trying first to
understand what Linux is doing in response - we doubt it is the
behaviour the kernel and multipath-tools programmers intended.

There is a multipathed connection from the storage unit to the Qlogic
switch. The E-class has two controllers, each with two FC connections,
for a total of four paths. All four go back to the same switch; we don't
have switch redundancy.

Each chassis (or enclosure) is configured as a single logical disk
and each logical disk is presented to the host as a different LUN.
Each LUN is formatted as a single large XFS filesystem.
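
As a rough sketch of how this looks on the host (the partition and mount
point names below are illustrative, not copied verbatim from the box):

# kpartx -l /dev/mapper/mpath1      # list the partition mappings kpartx creates
# blkid /dev/mapper/mpath1p1        # should report TYPE="xfs"
# mount /dev/mapper/mpath1p1 /data/BAR_2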

We have multipath-tools set up on the system, and it has been working
pretty well up until a couple of weeks ago.
# apt-cache policy multipath-tools
multipath-tools:
   Installed: 0.4.7-1.1etch2
   Candidate: 0.4.7-1.1etch2
   Version table:
  *** 0.4.7-1.1etch2 0
         500 http://debian-archive etch/updates/main Packages
         100 /var/lib/dpkg/status
      0.4.7-1.1etch1 0
         500 http://debian-archive etch/main Packages
with a small hack to kpartx (see bug #512601).

In a working state, multipath -l shows this:
# multipath -l
mpath2 (222ae000155cb4c63) dm-9 Promise,VTrak E610f
[size=13T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
  \_ 1:0:0:2 sdc 8:32  [active][undef]
  \_ 1:0:1:2 sdf 8:80  [active][undef]
  \_ 1:0:2:2 sdi 8:128 [active][undef]
  \_ 1:0:3:2 sdl 8:176 [active][undef]
mpath1 (222aa00015518499c) dm-8 Promise,VTrak E610f
[size=9.5T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
  \_ 1:0:0:1 sdb 8:16  [active][undef]
  \_ 1:0:1:1 sde 8:64  [active][undef]
  \_ 1:0:2:1 sdh 8:112 [active][undef]
  \_ 1:0:3:1 sdk 8:160 [active][undef]
mpath0 (222a60001559596f8) dm-7 Promise,VTrak E610f
[size=9.5T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
  \_ 1:0:0:0 sda 8:0   [active][undef]
  \_ 1:0:1:0 sdd 8:48  [active][undef]
  \_ 1:0:2:0 sdg 8:96  [active][undef]
  \_ 1:0:3:0 sdj 8:144 [active][undef]

The host's config files are given at the end of this tract^Wmail.



Fault
-----
We are seeing a fault show up on one of the multipath connections
to one of the LUNs (the one enumerated as mpath1).

The fault starts like this:

Oct 15 23:55:45 horus kernel: mptscsih: ioc0: attempting task abort! (sc=dfcaa380)
Oct 15 23:56:27 horus kernel: sd 1:0:3:1:
Oct 15 23:56:27 horus kernel:         command: Write(16): 8a 00 00 00 00 04 12 15 1d f2 00 00 01 40 00 00
Oct 15 23:56:27 horus kernel: mptscsih: ioc0: task abort: SUCCESS(sc=dfcaa380)

then we see
Oct 16 00:04:45 horus kernel: mptscsih: ioc0: task abort: SUCCESS(sc=d3d97680)
Oct 16 00:04:45 horus kernel: sd 1:0:3:1: timing out command, waited 180s
Oct 16 00:04:45 horus kernel: sd 1:0:3:1: SCSI error: return code = 0x00000008
Oct 16 00:04:45 horus kernel: end_request: I/O error, dev sdk, sector 17483242994
Oct 16 00:04:45 horus kernel: device-mapper: multipath: Failing path 8:160.
Oct 16 00:04:46 horus kernel: sd 1:0:3:1: timing out command, waited 180s
Oct 16 00:04:46 horus kernel: sd 1:0:3:1: SCSI error: return code = 0x00000008
Oct 16 00:04:46 horus kernel: end_request: I/O error, dev sdk, sector 14914238666

and
Oct 16 00:05:15 horus kernel: mptscsih: ioc0: attempting task abort! (sc=ded92b00)
Oct 16 00:05:15 horus kernel: sd 1:0:2:1:
Oct 16 00:05:15 horus kernel:         command: Write(16): 8a 00 00 00 00 03 c3 d5 a6 aa 00 00 01 40 00 00
Oct 16 00:05:16 horus kernel: mptscsih: ioc0: task abort: SUCCESS(sc=ded92b00)

and then
Oct 16 00:05:19 horus kernel: mptscsih: ioc0: bus reset: SUCCESS(sc=ded92b00)
Oct 16 00:05:29 horus kernel: mptscsih: ioc0: Attempting host reset! (sc=ded92b00)
Oct 16 00:05:29 horus kernel: mptbase: Initiating ioc0 recovery
Oct 16 00:06:12 horus kernel: sd 1:0:2:1: scsi: Device offlined - not ready after error recovery
Oct 16 00:06:12 horus last message repeated 7 times
Oct 16 00:06:12 horus kernel: sd 1:0:2:1: rejecting I/O to offline device
Oct 16 00:06:12 horus kernel: device-mapper: multipath: Failing path 8:112.
Oct 16 00:06:12 horus kernel: sd 1:0:2:1: rejecting I/O to offline device
Oct 16 00:06:15 horus last message repeated 62 times
Oct 16 00:06:15 horus kernel: sd 1:0:2:1: SCSI error: return code = 0x00010000
Oct 16 00:06:15 horus kernel: end_request: I/O error, dev sdh, sector 16170455738
Oct 16 00:06:16 horus kernel: sd 1:0:2:1: SCSI error: return code = 0x00010000
Oct 16 00:06:16 horus kernel: end_request: I/O error, dev sdh, sector 16170455418
....
Oct 16 00:06:21 horus kernel: sd 1:0:2:1: SCSI error: return code = 0x00010000
Oct 16 00:06:21 horus kernel: end_request: I/O error, dev sdh, sector 16170461162
Oct 16 00:06:22 horus kernel: sd 1:0:2:1: rejecting I/O to offline device

Oct 16 00:06:22 horus multipathd: sdh: readsector0 checker reports path is down
Oct 16 00:06:22 horus multipathd: checker failed path 8:112 in map mpath1
Oct 16 00:06:22 horus multipathd: mpath1: remaining active paths: 3
Oct 16 00:06:22 horus kernel: sd 1:0:2:1: rejecting I/O to offline device
Oct 16 00:06:22 horus last message repeated 5 times


This then repeats for the other disks over a period of time:
# grep readsector0 /var/log/syslog|sort -u -k6,6|sort -k3,3
Oct 16 00:06:22 horus multipathd: sdh: readsector0 checker reports path is down
Oct 16 00:11:41 horus multipathd: sdi: readsector0 checker reports path is down
Oct 16 00:16:44 horus multipathd: sdj: readsector0 checker reports path is down
Oct 16 00:21:44 horus multipathd: sdk: readsector0 checker reports path is down
Oct 16 00:26:44 horus multipathd: sdl: readsector0 checker reports path is down
Oct 16 00:31:49 horus multipathd: sda: readsector0 checker reports path is down
Oct 16 00:31:49 horus multipathd: sdb: readsector0 checker reports path is down
Oct 16 00:36:49 horus multipathd: sdd: readsector0 checker reports path is down
Oct 16 00:36:49 horus multipathd: sde: readsector0 checker reports path is down
Oct 16 00:41:49 horus multipathd: sdg: readsector0 checker reports path is down
Oct 16 02:37:05 horus multipathd: sdc: readsector0 checker reports path is down
Oct 16 02:47:05 horus multipathd: sdf: readsector0 checker reports path is down


By the end of this we are in this state:
# multipath -l
mpath2 (222ae000155cb4c63) dm-9 Promise,VTrak E610f
[size=13T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][active]
    \_ 1:0:0:2 sdc 8:32  [failed][undef]
    \_ 1:0:1:2 sdf 8:80  [failed][undef]
    \_ 1:0:2:2 sdi 8:128 [failed][undef]
    \_ 1:0:3:2 sdl 8:176 [failed][undef]
mpath1 (222aa00015518499c) dm-8 Promise,VTrak E610f
[size=9.5T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
    \_ 1:0:0:1 sdb 8:16  [failed][faulty]
    \_ 1:0:1:1 sde 8:64  [failed][faulty]
    \_ 1:0:2:1 sdh 8:112 [failed][undef]
    \_ 1:0:3:1 sdk 8:160 [failed][undef]
mpath0 (222a60001559596f8) dm-7 Promise,VTrak E610f
[size=9.5T][features=1 queue_if_no_path][hwhandler=0]
\_ round-robin 0 [prio=0][enabled]
    \_ 1:0:0:0 sda 8:0   [failed][undef]
    \_ 1:0:1:0 sdd 8:48  [failed][undef]
    \_ 1:0:2:0 sdg 8:96  [failed][undef]
    \_ 1:0:3:0 sdj 8:144 [failed][undef]

and any process attempting to talk to these devices is wedged.


The trigger for the fault seems to be related to transferring data
across the network interface to the multipathed devices.

We have successfully transferred a couple of terabytes from mpath1 to mpath2
and back again using a single 'rsync' process, i.e. just exercising the
FC data paths and not touching the network. This works fine.
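
(The local copy was essentially just

   horus# rsync -a /data/ONE/ /data/TWO/

where the two directories stand in for the mount points of mpath1 and
mpath2; the paths and options shown here are illustrative.)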

However, when people try to send data from a remote host, like this:
   remote% rsync -t -essh /data/FOO/2009-10-* user@horus:/data/BAR_2/
and start more than one simultaneous run, the problem is triggered.
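
To be concrete, 'more than one simultaneous run' means something like the
following (the date globs are only illustrative):

   remote% rsync -t -essh /data/FOO/2009-10-01* user@horus:/data/BAR_2/ &
   remote% rsync -t -essh /data/FOO/2009-10-02* user@horus:/data/BAR_2/ &
   remote% rsync -t -essh /data/FOO/2009-10-03* user@horus:/data/BAR_2/ &

i.e. two or three concurrent rsync-over-ssh streams all writing into the
same multipathed filesystem.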

The incoming data rate is not huge. The rsync runs over 1 Gbit/s links
but is mainly rate-limited by SSH.
We are not certain that multiple simultaneous sessions are required to
trigger the problem, but at present they seem to be.

With the host in this state, the VTrak device shows no error conditions.

The only recovery, which seems to always work, is to reboot the host.

There are two network cards, but only one is connected. They are (lspci):
02:00.0 0200: 14e4:1648 (rev 02)
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)
02:00.1 0200: 14e4:1648 (rev 02)
02:00.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 02)



Things we've tried
------------------
Checked the physical connections. All look sound; the FC management tools
on the switch and on the storage unit show no link errors.

Remade the filesystem on mpath1p1, which seems to be where the problem
always starts. This doesn't seem to change the behaviour.

Straced multipathd from boot until failure (see the bottom of this mail).
It doesn't seem to show anything useful.
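
Next time it fails we plan to capture a bit more state before rebooting,
roughly along these lines (a sketch only, assuming the 2.6.18 sysfs layout
and that the tools still respond while the paths are down):

# multipath -ll                         # re-run the path checkers and show state
# dmsetup status                        # device-mapper's view of each map
# cat /sys/block/sdh/device/state       # per-path SCSI device state
# echo running > /sys/block/sdh/device/state  # try to revive an offlined path (untested here)

We haven't verified yet that any of this tells us more than the logs above.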


Questions
---------
Has anyone seen a problem like this before?

Does this look like a multipath problem, or a problem at the SCSI/block layer?

Is this a known kernel bug?

What other diagnostic tests can you suggest?


Configs
-------
% cat /etc/multipath.conf
#
defaults {
         user_friendly_names     yes
#       udev_dir                /dev
#       polling_interval        10
#       prio_callout            /bin/true
}
blacklist {
#       devnode "^sda$"
         devnode "^(ram|raw|loop|fd|md|dm-|sr|scd|st)[0-9]*"
         devnode "^hd[a-z][[0-9]*]"
         devnode "^cciss!c[0-9]d[0-9]*[p[0-9]*]"
         device {
                 vendor MegaRAID
         }
         device {
                 vendor APPLE
         }
}
devices {
         device {
                 vendor                  "Promise"
                 product                 "VTrak"
                 path_grouping_policy    multibus
                 getuid_callout          "/sbin/scsi_id -g -u -s /block/%n"
                 path_checker            readsector0
                 path_selector           "round-robin 0"
                 hardware_handler        "0"
                 failback                immediate
                 rr_weight               uniform
                 rr_min_io               100
                 no_path_retry           20
                 features                "1 queue_if_no_path"
                 product_blacklist       "VTrak V-LUN"
         }
}
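
(If it helps, this is how we would verify what multipath actually makes of
that file - a dry run, so it doesn't touch the existing maps:

# multipath -v3 -d

We can post that output if it would be useful.)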

% cat /var/lib/multipath/bindings
# Multipath bindings, Version : 1.0
# NOTE: this file is automatically maintained by the multipath program.
# You should not need to edit this file in normal circumstances.
#
# Format:
# alias wwid
#
mpath0 222a60001559596f8
mpath1 222aa00015518499c
mpath2 222ae000155cb4c63


strace output during fault
--------------------------
straced with:
# strace -t -F -ff -s 2048 -p 6263

(initial fault - 23:55:45 mptscsih: ioc0: attempting task abort!)

23:55:43 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
23:55:43 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
23:55:43 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
23:55:43 nanosleep({1, 0}, [{fd=19, events=POLLIN}], 1, 5000) = 0
23:55:43 poll({1, 0})                   = 0
23:55:44 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
23:55:44 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
23:55:44 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
23:55:44 nanosleep({1, 0}, {1, 0})      = 0
23:55:45 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
23:55:45 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
23:55:45 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
23:55:45 nanosleep({1, 0}, {1, 0})      = 0
23:55:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
23:55:46 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
23:55:46 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
23:55:46 nanosleep({1, 0}, {1, 0})      = 0
23:55:47 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
23:55:47 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
23:55:47 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
23:55:47 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
23:55:47 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
23:55:47 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
23:55:47 nanosleep({1, 0}, {1, 0})      = 0


(00:04:45 multipath: Failing path 8:160)

00:04:41 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:41 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:41 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:41 nanosleep({1, 0}, {1, 0})      = 0
00:04:42 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
00:04:42 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
00:04:42 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
00:04:42 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:42 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:42 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:42 nanosleep({1, 0}, [{fd=19, events=POLLIN}], 1, 5000) = 0
00:04:43 poll({1, 0})                   = 0
00:04:43 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:43 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:43 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:43 nanosleep({1, 0}, {1, 0})      = 0
00:04:44 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:44 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:44 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:44 nanosleep({1, 0}, {1, 0})      = 0
00:04:45 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:45 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:45 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:45 nanosleep({1, 0}, {1, 0})      = 0
00:04:46 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:46 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:46 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:46 nanosleep({1, 0}, {1, 0})      = 0
00:04:47 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:04:47 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:04:47 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:04:47 nanosleep({1, 0}, [{fd=19, events=POLLIN}], 1, 5000) = 0


(00:06:22 multipathd: sdh: readsector0 checker reports path is down)

00:06:20 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:06:20 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:06:20 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:06:20 nanosleep({1, 0}, {1, 0})      = 0
00:06:21 ioctl(4, SG_IO, 0xb7c9e118)    = 0
00:06:21 ioctl(5, SG_IO, 0xb7c9e118)    = 0
00:06:21 ioctl(6, SG_IO, 0xb7c9e118)    = 0
00:06:21 ioctl(7, SG_IO, 0xb7c9e118)    = 0
00:06:21 ioctl(8, SG_IO, 0xb7c9e118)    = 0
00:06:21 ioctl(9, SG_IO, 0xb7c9e118)    = 0
00:06:21 ioctl(10, SG_IO, 0xb7c9e118)   = 0
00:06:21 ioctl(11, SG_IO, 0xb7c9e118)   = 0
00:06:21 ioctl(12, SG_IO, 0xb7c9e118)   = 0
00:06:21 ioctl(13, SG_IO, 0xb7c9e118)   = 0
00:06:21 ioctl(14, SG_IO, 0xb7c9e118)   = 0
00:06:21 ioctl(15, SG_IO, 0xb7c9e118)   = 0
00:06:21 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:06:21 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:06:21 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:06:21 nanosleep({1, 0}, {1, 0})      = 0
00:06:22 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:06:22 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:06:22 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:06:22 nanosleep({1, 0}, {1, 0})      = 0
00:06:23 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:06:23 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:06:23 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:06:23 nanosleep({1, 0}, [{fd=19, events=POLLIN}], 1, 5000) = 0
00:06:23 poll({1, 0})                   = 0
00:06:24 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:06:24 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:06:24 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:06:24 nanosleep({1, 0}, {1, 0})      = 0
00:06:25 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
00:06:25 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
00:06:25 ioctl(16, DM_DEV_STATUS, 0x8090000) = 0
00:06:25 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
00:06:25 rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
00:06:25 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
00:06:25 nanosleep({1, 0}, {1, 0})      = 0





