Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"

Bill MacAllister whm at stanford.edu
Tue Mar 4 23:35:08 UTC 2014



--On Tuesday, March 04, 2014 11:49:52 PM +0530 Ritesh Raj Sarraf <rrs at debian.org> wrote:

> Let's attempt the email in chunks.
>
>
> On 03/04/2014 01:19 PM, Bill MacAllister wrote:
>> Package: multipath-tools
>> Version: 0.4.9+git0.9c3c5172-1
>> Severity: important
>>
>> Dear Maintainer,
>>
>> * What led up to the situation?
>>
>> We use multipath connected devices on our OpenAFS servers.  We are in the
>> process of attempting to upgrade the systems to wheezy.  Our normal
>> procedure is to completely evacuate a server and rebuild it from scratch
>> and reconfigure the storage.
>
> What is your Storage Array like?

I don't know the details.  All I know is that it is a NetApp device.  I have
included the folks that manage the array on this message.  They can provide
the details.

>> * What exactly did you do (or not do) that was effective (or
>>   ineffective)?
>>
>> On the systems we have tested:
>>
>>   - mkfs on squeeze with a 2.6.32-5-amd64 kernel SUCCEEDS
>>   - mkfs on squeeze with a 3.2 backport kernel FAILS
>>   - mkfs on wheezy FAILS
>>   - mkfs on jessie FAILS
>>
>> There is one exception to the above list.  We do have one wheezy system
>> where we are able to mkfs file systems on multipath-served devices.  These
>> systems are built using FAI, with the configuration supplied by Puppet.
>> We have not been able to figure out what is unique about this system.
>>
>> The current system we have available to test with has a QLogic HBA card.
>> We have also tested with Emulex cards with the same results.
>
> This alone doesn't help. Can you please provide the make and model?

The QLogic card identifies itself as "QLogic QLE2460".
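
If the full HBA identification is useful, I believe it is also
available from sysfs for the FC hosts, e.g. (host3 guessed from the
h:b:t:l values in the logs below; untested on this box):

    cat /sys/class/fc_host/host3/symbolic_name

which on QLogic cards should report the model plus firmware and driver
versions.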

>> When mkfs fails multipathd starts spewing errors on the console:
>>
>> afssvr20:/etc/rsyslog.d# mkfs -t ext4 /dev/mapper/mpathb-part1
>> mke2fs 1.42.9 (4-Feb-2014)
>> [11712.731701] sd 3:0:0:20: [sdd] Unhandled sense code
>> [11712.800335] sd 3:0:0:20: [sdd]
>> [11712.838877] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> [11712.905537] sd 3:0:0:20: [sdd]
>> [11712.944077] Sense Key : Medium Error [current]
>> [11712.998246] sd 3:0:0:20: [sdd]
>> [11713.036788] Add. Sense: Incompatible medium installed
>> [11713.097195] sd 3:0:0:20: [sdd] CDB:
>> [11713.139903] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
>> [11713.212987] end_request: I/O error, dev sdd, sector 128
>> [11713.276032] device-mapper: multipath: Failing path 8:48.
>> [11713.340113] sd 4:0:1:20: [sdc] Unhandled sense code
>> [11713.398455] sd 4:0:1:20: [sdc]
>> [11713.437003] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> [11713.503669] sd 4:0:1:20: [sdc]
>> [11713.542218] Sense Key : Medium Error [current]
>> [11713.596408] sd 4:0:1:20: [sdc]
>> [11713.634948] Add. Sense: Incompatible medium installed
>> [11713.695357] sd 4:0:1:20: [sdc] CDB:
>> [11713.738077] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
>> [11713.811160] end_request: I/O error, dev sdc, sector 128
>> [11713.874213] device-mapper: multipath: Failing path 8:32.
>> [11713.938320] sd 3:0:1:20: [sde] Unhandled sense code
>> [11713.996662] sd 3:0:1:20: [sde]
>> [11714.035206] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> [11714.101868] sd 3:0:1:20: [sde]
>> [11714.140413] Sense Key : Medium Error [current]
>> [11714.194590] sd 3:0:1:20: [sde]
>> [11714.233136] Add. Sense: Incompatible medium installed
>> [11714.293553] sd 3:0:1:20: [sde] CDB:
>> [11714.336268] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
>> [11714.409350] end_request: I/O error, dev sde, sector 128
>> [11714.472357] device-mapper: multipath: Failing path 8:64.
>> [11714.536430] sd 4:0:0:20: [sdb] Unhandled sense code
>> [11714.594768] sd 4:0:0:20: [sdb]
>> [11714.633306] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> [11714.699970] sd 4:0:0:20: [sdb]
>> [11714.738506] Sense Key : Medium Error [current]
>> [11714.792681] sd 4:0:0:20: [sdb]
>> [11714.831221] Add. Sense: Incompatible medium installed
>> [11714.891625] sd 4:0:0:20: [sdb] CDB:
>> [11714.934333] Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 00
>> [11715.007395] end_request: I/O error, dev sdb, sector 128
>> [11715.070374] device-mapper: multipath: Failing path 8:16.
>> [11715.134475] device-mapper: multipath: Failing path 8:48.
>> [11715.134690] sd 4:0:1:20: [sdc] Unhandled sense code
>> [11715.134691] sd 4:0:1:20: [sdc]
>> [11715.134692] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>> .....
>
> Hmmm!!! Counting from these logs, it looks like all of your paths
> went faulty.
> Either you had permanently faulty paths, or else the status of the
> paths wasn't updated because of a bug.

The fact that the paths are fine when using the 2.6 kernel and fail with
the 3.2 kernel argues against faulty paths.
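
One thing I do notice in the failure logs is that the rejected command
is an Unmap (CDB opcode 42).  My understanding is that newer mkfs
versions discard blocks before formatting, which the squeeze-era tools
did not do, so if it is relevant I can retry with discards disabled,
for example:

    mkfs -t ext4 -E nodiscard /dev/mapper/mpathb-part1

That is just a guess on my part, though.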

>> Here is the output from multipath -ll -v3
>
> A query with -v3 will always instruct multipath to update its status
> entries. So while you might think it was reporting the current status,
> it is actually reporting the newly probed status.
>
>>
>> Mar 03 23:16:44 | loading /lib/multipath/libcheckdirectio.so checker
>> Mar 03 23:16:44 | loading /lib/multipath/libprioconst.so prioritizer
>> Mar 03 23:16:44 | sda: udev property ID_WWN whitelisted
>> Mar 03 23:16:44 | sda: not found in pathvec
>> Mar 03 23:16:44 | sda: mask = 0x25
>> Mar 03 23:16:44 | sda: dev_t = 8:0
>> Mar 03 23:16:44 | sda: size = 142082048
>> Mar 03 23:16:44 | sda: vendor = DELL
>> Mar 03 23:16:44 | sda: product = PERC 6/i
>> Mar 03 23:16:44 | sda: rev = 1.21
>> Mar 03 23:16:44 | sda: h:b:t:l = 0:2:0:0
>> Mar 03 23:16:44 | sda: path state = running
>>
>> Mar 03 23:16:44 | sda: get_state
>> Mar 03 23:16:44 | sda: path checker = directio (internal default)
>> Mar 03 23:16:44 | sda: checker timeout = 90 ms (sysfs setting)
>> Mar 03 23:16:44 | directio: starting new request
>> Mar 03 23:16:44 | directio: io finished 4096/0
>> Mar 03 23:16:44 | sda: state = up
>> Mar 03 23:16:44 | sdc: udev property ID_WWN whitelisted
>> Mar 03 23:16:44 | sdc: not found in pathvec
>> Mar 03 23:16:44 | sdc: mask = 0x25
>> Mar 03 23:16:44 | sdc: dev_t = 8:32
>> Mar 03 23:16:44 | sdc: size = 11488198656
>> Mar 03 23:16:44 | sdc: vendor = NETAPP
>> Mar 03 23:16:44 | sdc: product = LUN
>> Mar 03 23:16:44 | sdc: rev = 811a
>> Mar 03 23:16:44 | sdc: h:b:t:l = 3:0:0:20
>> Mar 03 23:16:44 | SCSI target 3:0:0 -> FC rport 3:0-0
>> Mar 03 23:16:44 | sdc: tgt_node_name = 0x500a09808d44a0ea
>> Mar 03 23:16:44 | sdc: path state = running
>
> Ah!!! You're on a NetApp array. :-)
>
>> Mar 03 23:16:44 | sdc: get_state
>> Mar 03 23:16:44 | loading /lib/multipath/libchecktur.so checker
>> Mar 03 23:16:44 | sdc: path checker = tur (controller setting)
>> Mar 03 23:16:44 | sdc: checker timeout = 30 ms (sysfs setting)
>> Mar 03 23:16:44 | sdc: state = up
>> Mar 03 23:16:44 | sde: udev property ID_WWN whitelisted
>> Mar 03 23:16:44 | sde: not found in pathvec
>> Mar 03 23:16:44 | sde: mask = 0x25
>> Mar 03 23:16:44 | sde: dev_t = 8:64
>> Mar 03 23:16:44 | sde: size = 11488198656
>> Mar 03 23:16:44 | sde: vendor = NETAPP
>> Mar 03 23:16:44 | sde: product = LUN
>> Mar 03 23:16:44 | sde: rev = 811a
>> Mar 03 23:16:44 | sde: h:b:t:l = 3:0:1:20
>> Mar 03 23:16:44 | SCSI target 3:0:1 -> FC rport 3:0-1
>> Mar 03 23:16:44 | sde: tgt_node_name = 0x500a09808d44a0ea
>> Mar 03 23:16:44 | sde: path state = running
>> Mar 03 23:16:44 | sde: get_state
>> Mar 03 23:16:44 | sde: path checker = tur (controller setting)
>> Mar 03 23:16:44 | sde: checker timeout = 30 ms (sysfs setting)
>> Mar 03 23:16:44 | sde: state = up
>> Mar 03 23:16:44 | sdb: udev property ID_WWN whitelisted
>> Mar 03 23:16:44 | sdb: not found in pathvec
>> Mar 03 23:16:44 | sdb: mask = 0x25
>> Mar 03 23:16:44 | sdb: dev_t = 8:16
>> Mar 03 23:16:44 | sdb: size = 11488198656
>> Mar 03 23:16:44 | sdb: vendor = NETAPP
>> Mar 03 23:16:44 | sdb: product = LUN
>> Mar 03 23:16:44 | sdb: rev = 811a
>> Mar 03 23:16:44 | sdb: h:b:t:l = 4:0:0:20
>> Mar 03 23:16:44 | SCSI target 4:0:0 -> FC rport 4:0-0
>> Mar 03 23:16:44 | sdb: tgt_node_name = 0x500a09808d44a0ea
>> Mar 03 23:16:44 | sdb: path state = running
>>
>> Mar 03 23:16:44 | sdb: get_state
>> Mar 03 23:16:44 | sdb: path checker = tur (controller setting)
>> Mar 03 23:16:44 | sdb: checker timeout = 30 ms (sysfs setting)
>> Mar 03 23:16:44 | sdb: state = up
>> Mar 03 23:16:44 | sdd: udev property ID_WWN whitelisted
>> Mar 03 23:16:44 | sdd: not found in pathvec
>> Mar 03 23:16:44 | sdd: mask = 0x25
>> Mar 03 23:16:44 | sdd: dev_t = 8:48
>> Mar 03 23:16:44 | sdd: size = 11488198656
>> Mar 03 23:16:44 | sdd: vendor = NETAPP
>> Mar 03 23:16:44 | sdd: product = LUN
>> Mar 03 23:16:44 | sdd: rev = 811a
>> Mar 03 23:16:44 | sdd: h:b:t:l = 4:0:1:20
>> Mar 03 23:16:44 | SCSI target 4:0:1 -> FC rport 4:0-1
>> Mar 03 23:16:44 | sdd: tgt_node_name = 0x500a09808d44a0ea
>> Mar 03 23:16:44 | sdd: path state = running
>>
>> Mar 03 23:16:44 | sdd: get_state
>> Mar 03 23:16:44 | sdd: path checker = tur (controller setting)
>> Mar 03 23:16:44 | sdd: checker timeout = 30 ms (sysfs setting)
>> Mar 03 23:16:44 | sdd: state = up
>> Mar 03 23:16:44 | sr0: blacklisted, udev property missing
>> Mar 03 23:16:44 | dm-0: blacklisted, udev property missing
>> Mar 03 23:16:44 | dm-1: blacklisted, udev property missing
>> Mar 03 23:16:44 | dm-2: blacklisted, udev property missing
>> ===== paths list =====
>> uuid hcil     dev dev_t pri dm_st chk_st vend/prod/rev             dev_st
>>      0:2:0:0  sda 8:0   -1  undef ready  DELL    ,PERC 6/i         running
>>      3:0:0:20 sdc 8:32  -1  undef ready  NETAPP  ,LUN              running
>>      3:0:1:20 sde 8:64  -1  undef ready  NETAPP  ,LUN              running
>>      4:0:0:20 sdb 8:16  -1  undef ready  NETAPP  ,LUN              running
>> Mar 03 23:16:44 | params = 4 queue_if_no_path pg_init_retries 50 retain_attached
>> _hw_handler 0 2 1 service-time 0 2 2 8:64 1 1 8:16 1 1 service-time 0 2 2 8:32 1
>>  1 8:48 1 1
>> Mar 03 23:16:44 | status = 2 0 0 0 2 1 A 0 2 2 8:64 A 0 0 1 8:16 A 0 0 1 E 0 2 2
>>  8:32 A 0 0 1 8:48 A 0 0 1
>> Mar 03 23:16:44 | mpathb: disassemble map [4 queue_if_no_path pg_init_retries 50
>>  retain_attached_hw_handler 0 2 1 service-time 0 2 2 8:64 1 1 8:16 1 1 service-t
>> ime 0 2 2 8:32 1 1 8:48 1 1 ]
>> Mar 03 23:16:44 | sde: mask = 0x8
>> Mar 03 23:16:44 | sde: path state = running
>>
>> Mar 03 23:16:44 | sde: detect_prio = 2 (controller default)
>> Mar 03 23:16:44 | loading /lib/multipath/libprioalua.so prioritizer
>> Mar 03 23:16:44 | sde: prio = alua (detected setting)
>> Mar 03 23:16:44 | reported target port group is 0
>> Mar 03 23:16:44 | aas = 00 [active/optimized]
>> Mar 03 23:16:44 | sde: alua prio = 50
>> Mar 03 23:16:44 | sdb: mask = 0x8
>> Mar 03 23:16:44 | sdb: path state = running
>>
>> Mar 03 23:16:44 | sdb: detect_prio = 2 (controller default)
>> Mar 03 23:16:44 | sdb: prio = alua (detected setting)
>> Mar 03 23:16:44 | reported target port group is 0
>> Mar 03 23:16:44 | aas = 00 [active/optimized]
>> Mar 03 23:16:44 | sdb: alua prio = 50
>> Mar 03 23:16:44 | sdc: mask = 0x8
>> Mar 03 23:16:44 | sdc: path state = running
>>
>> Mar 03 23:16:44 | sdc: detect_prio = 2 (controller default)
>> Mar 03 23:16:44 | sdc: prio = alua (detected setting)
>> Mar 03 23:16:44 | reported target port group is 1
>> Mar 03 23:16:44 | aas = 01 [active/non-optimized]
>> Mar 03 23:16:44 | sdc: alua prio = 10
>> Mar 03 23:16:44 | sdd: mask = 0x8
>> Mar 03 23:16:44 | sdd: path state = running
>>
>> Mar 03 23:16:44 | sdd: detect_prio = 2 (controller default)
>> Mar 03 23:16:44 | sdd: prio = alua (detected setting)
>> Mar 03 23:16:44 | reported target port group is 1
>> Mar 03 23:16:44 | aas = 01 [active/non-optimized]
>> Mar 03 23:16:44 | sdd: alua prio = 10
>> Mar 03 23:16:44 | mpathb: disassemble status [2 0 0 0 2 1 A 0 2 2 8:64 A 0 0 1 8
>> :16 A 0 0 1 E 0 2 2 8:32 A 0 0 1 8:48 A 0 0 1 ]
>> mpathb (360a9800064704842534a6c5063513577) dm-0 NETAPP  ,LUN
>> size=5.3T features='4 queue_if_no_path pg_init_retries 50 retain_attached_hw_han
>> dle' hwhandler='0' wp=rw
>> |-+- policy='service-time 0' prio=50 status=active
>> | |- 3:0:1:20 sde 8:64 active ready running
>> | `- 4:0:0:20 sdb 8:16 active ready running
>> `-+- policy='service-time 0' prio=10 status=enabled
>>   |- 3:0:0:20 sdc 8:32 active ready running
>>   `- 4:0:1:20 sdd 8:48 active ready running
>
> And as expected, now everything is perfect.
>
> From what I can see in your logs, it is hard to conclude the cause. It
> looks like a case of stale path status, where the paths actually failed
> when you did a write() using the mkfs command.
> If you have the resources, you may want to try the directio path checker
> to see if it provides the real status.

How do I do that?
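My untested guess would be a device stanza in /etc/multipath.conf that
matches the vendor/product strings from the logs, something like:

    devices {
            device {
                    vendor       "NETAPP"
                    product      "LUN"
                    path_checker directio
            }
    }

followed by a reload (e.g. multipathd -k"reconfigure") to pick it up,
but please correct me if that is wrong.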

> Also, at the time when mkfs failed, was the SCSI device responding? I
> mean, the TUR command definitely was getting a response, but what
> about, say, running mkfs directly on the SCSI device?

Do you mean using the /dev/dm-n device instead of the mapper device?
I did try that once today, but it didn't seem to make a difference.
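
If it would help, I can also try reading one of the underlying SCSI
paths directly, bypassing device-mapper entirely.  Something like a
direct-I/O read of a single path (sdb being one of the four paths of
this LUN in the logs above):

    dd if=/dev/sdb of=/dev/null bs=4096 count=1 iflag=direct

should fail in the same way if the path itself is bad.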

> The other observation is that the SCSI devices have an ALUA priority
> defined (inherited the default, I guess), whereas your multipath
> hardware handler is set to none. For NetApp, on the newer ONTAP
> versions, it is highly recommended to use ALUA.

In talking to the folks who manage the storage array, they indicated
that there is a new device configuration available from NetApp.  We
plan to try that.  I'll let you know what we find out.
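
If enabling ALUA on the host side amounts to a multipath.conf change,
my untested guess is a device stanza along these lines (again reusing
the vendor/product strings from the logs):

    devices {
            device {
                    vendor           "NETAPP"
                    product          "LUN"
                    hardware_handler "1 alua"
                    prio             alua
            }
    }

but I will defer to whatever the NetApp host utilities recommend.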

> The conclusion is simple. You've seen mkfs fail because all the paths
> were faulty.

Thanks for your help,

Bill

-- 

Bill MacAllister
Infrastructure Delivery Group, Stanford University


