Bug#740701: multipath-tools: mkfs fails "Add. Sense: Incompatible medium installed"

Hans van Kranenburg hans.van.kranenburg at mendix.com
Tue Jun 17 23:07:49 UTC 2014


On 06/17/2014 07:43 PM, Hans van Kranenburg wrote:
>
> But I have to leave now, will continue later.

Btw, netapp-linux-community: I kept the Cc in my last update, which is 
now in the mailing list's moderation queue. I joined the list; I didn't 
even know it existed before... Read up at 
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=740701 if you're 
interested.

This evening I tried to reproduce the problem in a test setup, issuing 
unmap (discard) requests with fstrim and mkfs in the following situations:

1. Just /dev/sdf, single path to single lun
2. Use multipath to single lun, /dev/mapper/mpatha
3. Use encryption on top of multipath, /dev/mapper/mpatha_luks
4. Use lvm on top of the encryption, /dev/vg_discard/lv_discard
5. Start using a second, equally sized lun on the other netapp 
controller: multipath to it, put encryption on top of it, pvcreate it, 
create a new volume group containing both pvs, and create a striped lv 
across them.

The sad part of the story is that I have not yet managed to get my iSCSI 
connections toasted in any of these test cases.
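
For completeness, case 1 was the trivial one; it looked roughly like 
this (the mountpoint is just whatever I used in the test setup):

# mkfs.ext4 /dev/sdf
# mount /dev/sdf /mnt/test
# fstrim -v /mnt/test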

For reference, this is what step 5 looks like:

# multipath -l
mpathb (360a9800042576c32412b4532614a6750) dm-2 NETAPP,LUN
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
   |- 36:0:0:1 sdc 8:32  active undef running
   |- 38:0:0:1 sde 8:64  active undef running
   |- 35:0:0:1 sdb 8:16  active undef running
   `- 37:0:0:1 sdd 8:48  active undef running
mpatha (360a9800042577239353f4532614a6339) dm-1 NETAPP,LUN
size=10G features='1 queue_if_no_path' hwhandler='0' wp=rw
`-+- policy='round-robin 0' prio=0 status=active
   |- 40:0:0:1 sdf 8:80  active undef running
   |- 42:0:0:1 sdi 8:128 active undef running
   |- 39:0:0:1 sdg 8:96  active undef running
   `- 41:0:0:1 sdh 8:112 active undef running

# cryptsetup --verbose --verify-passphrase luksFormat /dev/mapper/mpatha
# cryptsetup --verbose --verify-passphrase luksFormat /dev/mapper/mpathb
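
(I didn't paste the luksOpen step above; it was along these lines. Note 
that dm-crypt only passes discard requests down if the mapping is opened 
with --allow-discards:)

# cryptsetup luksOpen --allow-discards /dev/mapper/mpatha mpatha_luks
# cryptsetup luksOpen --allow-discards /dev/mapper/mpathb mpathb_luks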

# cryptsetup status /dev/mapper/mpatha_luks
/dev/mapper/mpatha_luks is active.
   type:    LUKS1
   cipher:  aes-cbc-essiv:sha256
   keysize: 256 bits
   device:  /dev/mapper/mpatha
   offset:  4096 sectors
   size:    20967424 sectors
   mode:    read/write

# pvcreate /dev/mapper/mpatha_luks
# pvcreate /dev/mapper/mpathb_luks

# vgcreate vg_discard /dev/mapper/mpatha_luks /dev/mapper/mpathb_luks

# pvs
   PV                      VG         Fmt  Attr PSize  PFree
   /dev/mapper/mpatha_luks vg_discard lvm2 a--  10.00g 9.50g
   /dev/mapper/mpathb_luks vg_discard lvm2 a--  10.00g 9.50g

# lvcreate -i 2 -L 10G -n lv_discard --addtag $(hostname) vg_discard
# mkfs.ext4 /dev/vg_discard/lv_discard

or, since fstrim is the other obvious way to generate unmaps (mkfs.ext4 
already discards the whole device by default), do something like:

# dd if=/dev/zero of=sparse bs=1048576 seek=1024 count=0
# shred -n 1 -v sparse
# sync
# rm sparse
# sync
# fstrim -v -o 0MB -l 512MB ./
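
To double-check that the discards actually travel down through each 
layer of the stack, lsblk's discard columns and blktrace are useful; 
this is a generic sketch rather than output from my session (nonzero 
DISC-GRAN/DISC-MAX in lsblk -D means the device advertises discard 
support, and discard requests show up in blkparse with a 'D' in the 
RWBS column):

# lsblk -D /dev/mapper/mpatha
# blktrace -d /dev/mapper/mpatha -o - | blkparse -i -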

So, conclusions for now:
  - This is not easily reproducible; it's not simply a matter of "set up 
multipath plus this and that, then run mkfs or fstrim, and it fails". 
But the problem is real, and in the production setup I've seen it happen 
more than once now, yesterday being the case in which we could connect 
the dots and pinpoint where the actual problem is. (1st time: what the 
.. just happened, collect logs; 2nd time: different situation, same 
cause; compare, etc... blam! it's the iscsi unmap.)
  - I can find very few search hits on this on the web, so it does not 
seem to be a known issue, apart from the OP of this bug and me reporting 
it. Google for "CDB: Unmap/Read sub-channel: 42 00 00 00 00 00 00 00 18 
00" (see the breakdown of that CDB below).
  - There must be something different about the production setup. It is 
running separately from this one physical server and these test luns, on 
the exact same type of hardware, using identical software and identical 
configuration, yet it fails all I/O after any UNMAP request. The 
differences are that the production luns are accessed concurrently from 
multiple physical servers, that there's a lot more I/O going on at any 
moment, that there are a lot more logical volumes and a lot more data 
written to the luns, etc. etc...
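
For reference, my decoding of that CDB (an SBC-3 UNMAP; the kernel 
prints "Unmap/Read sub-channel" because opcode 0x42 means UNMAP on disks 
and READ SUB-CHANNEL on cd/dvd devices):

   42            opcode 0x42: UNMAP
   00            ANCHOR bit not set
   00 00 00 00   reserved
   00            group number
   00 18         parameter list length 0x0018 = 24 bytes
                 (8-byte header + one 16-byte block descriptor)
   00            control byte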

Any ideas?

-- 
Hans van Kranenburg - System / Network Engineer
T +31 (0)10 2760434 | hans.van.kranenburg at mendix.com | www.mendix.com


