Bug#623613: Removing SAN mapping spawn uninterruptible process

Ritesh Raj Sarraf Ritesh.Sarraf at netapp.com
Thu Apr 28 12:25:18 UTC 2011


On 04/28/2011 03:51 PM, Laurent Bigonville wrote:
>> Yes. Recently, the flush_on_last_del feature was added. Can you check
>> if that is active by default?
>> (multipath -v3 should show you all the applied options)
> 
> According to redhat doc [0] this is disabled by default.
>

Okay! But I'm not sure what upstream's default is. Can you confirm that
the feature is NOT in use on your setup?
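One quick way to check: this is just the verbose run mentioned above,
grepped for the feature (a sketch; needs multipath-tools installed and
root privileges):

```shell
# Dump the verbose output, which includes the applied options,
# and look for the flush_on_last_del feature.
multipath -v3 2>&1 | grep -i flush_on_last_del
```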


> But it seems that when doing:
> 
> echo 1 > /sys/block/sdX/device/delete
>

You shouldn't do this. Deleting the SCSI paths directly while the
multipath map still references them is not the correct order of
operations.

> to remove the paths with flush_on_last_del enabled, the mapping is
> not released if there are other processes already waiting for
> IO (kpartx and blkid, again in this case)
> 
> multipathd: cc_fleet_otf_test_4_1 Last path deleted, disabling queueing
> multipathd: cc_fleet_otf_test_4_1: map in use
> multipathd: cc_fleet_otf_test_4_1: can't flush
> multipathd: flush_on_last_del in progress
> multipathd: cc_fleet_otf_test_4_1: failed in domap for removal of path
> sda
> 
> I think that this should be reported upstream.
>

If there is pending IO on the map and queue_if_no_path is enabled, you
are asking for trouble by not releasing the map first. Just as a check,
disable queue_if_no_path and observe the behavior on your box. It will
be very different: no hangs.
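For reference, queueing can be toggled at runtime through dmsetup; a
sketch, run as root, using the map name from your logs:

```shell
MAP=cc_fleet_otf_test_4_1   # example map name taken from the logs above

# The table output lists the map's features; look for queue_if_no_path.
dmsetup table "$MAP"

# Disable queueing so pending I/O fails instead of blocking in D state:
dmsetup message "$MAP" 0 "fail_if_no_path"

# Re-enable queueing afterwards if you want the original behavior back:
dmsetup message "$MAP" 0 "queue_if_no_path"
```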



>>
>>> 2) as soon as the mapping on the san is removed
>>>
>>> all the paths get faulty
>>>
>>> cc_fleet_otf_test_4_1 (3600a0b80004725120000152a4d5cdd9e) dm-0
>>> IBM,1815      FAStT size=500G features='0' hwhandler='1 rdac' wp=rw
>>> |-+- policy='round-robin 0' prio=0 status=active
>>> | |- 0:0:1:1 sdb 8:16 active faulty running
>>> | `- 1:0:1:1 sdd 8:48 active faulty running
>>> `-+- policy='round-robin 0' prio=0 status=enabled
>>>  |- 0:0:0:1 sda 8:0  active faulty running
>>>  `- 1:0:0:1 sdc 8:32 active faulty running
>>>
>>> And /var/log/daemon.log gets spammed with:
>>>
>>> [...]
>>>
> so it seems that the status is flipping (the up and ghost paths
> seem the same as when the LUN was mapped), is this expected?
>>>
>>>
>>
>> This might be weird. The pathchecker reports that sdd and sdb are up
>> but the overall active paths listed go down to 0.
>> At this point, if you run `multipath -v3`, does the status change to
>> what pathchecker is reporting?
> 
> With the LUN unmapped on the san manager, no, the paths are still
> considered 'faulty'
> 
> But with the version from unstable I get the following entries in the
> logs which look more sensible:
> 
> multipathd: cc_fleet_otf_test_4_1: sdb - rdac checker reports path is down
> multipathd: checker failed path 8:16 in map cc_fleet_otf_test_4_1
> multipathd: cc_fleet_otf_test_4_1: remaining active paths: 3
> multipathd: cc_fleet_otf_test_4_1: sdc - rdac checker reports path is down
> multipathd: checker failed path 8:32 in map cc_fleet_otf_test_4_1
> multipathd: cc_fleet_otf_test_4_1: remaining active paths: 2
> multipathd: cc_fleet_otf_test_4_1: sdd - rdac checker reports path is down
> multipathd: checker failed path 8:48 in map cc_fleet_otf_test_4_1
> multipathd: cc_fleet_otf_test_4_1: remaining active paths: 1
> multipathd: cc_fleet_otf_test_4_1: sde - rdac checker reports path is down
> multipathd: checker failed path 8:64 in map cc_fleet_otf_test_4_1
> multipathd: cc_fleet_otf_test_4_1: remaining active paths: 0
> multipathd: dm-0: add map (uevent)
> multipathd: dm-0: devmap already registered
> multipathd: cc_fleet_otf_test_4_1: sdb - rdac checker reports path is down
> multipathd: cc_fleet_otf_test_4_1: sdc - rdac checker reports path is down
> multipathd: cc_fleet_otf_test_4_1: sdd - rdac checker reports path is down
> multipathd: cc_fleet_otf_test_4_1: sde - rdac checker reports path is down
> [...]
> 
> So this seems fixed in unstable/testing.
>

An unmap on the target should not be done without first releasing the
device on the host, because to the host it is not an unmap: it looks
like a sporadic path failure, and the host will wait for the paths to
come back.



We've done a lot of work on this, and we've concluded that _this_ is the
Right Approach. To quote from the Red Hat docs:

====
Procedure 21.1. Ensuring a Clean Device Removal

1. Close all users of the device and back up device data as needed.
2. Use umount to unmount any file systems that mounted the device.
3. Remove the device from any md and LVM volume using it. If the device
   is a member of an LVM volume group, it may be necessary to move data
   off the device using the pvmove command, then use the vgreduce
   command to remove the physical volume, and (optionally) pvremove to
   remove the LVM metadata from the disk.
4. If the device uses multipathing, run multipath -l and note all the
   paths to the device. Afterwards, remove the multipathed device using
   multipath -f device.
5. Run blockdev --flushbufs device to flush any outstanding I/O to all
   paths to the device. This is particularly important for raw devices,
   where there is no umount or vgreduce operation to cause an I/O
   flush.
6. Remove any reference to the device's path-based name, like /dev/sd,
   /dev/disk/by-path or the major:minor number, in applications,
   scripts, or utilities on the system. This is important in ensuring
   that different devices added in the future will not be mistaken for
   the current device.
7. Finally, remove each path to the device from the SCSI subsystem. To
   do so, use the command echo 1 > /sys/block/device-name/device/delete,
   where device-name may be sde, for example.

   Another variation of this operation is
   echo 1 > /sys/class/scsi_device/h:c:t:l/device/delete, where h is
   the HBA number, c is the channel on the HBA, t is the SCSI target
   ID, and l is the LUN.

You can determine the device-name, HBA number, HBA channel, SCSI target
ID and LUN for a device from various commands, such as lsscsi, scsi_id,
multipath -l, and ls -l /dev/disk/by-*.

After performing Procedure 21.1, "Ensuring a Clean Device Removal", a
device can be physically removed safely from a running system. It is
not necessary to stop I/O to other devices while doing so.
====
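Condensed into commands, the procedure looks roughly like this. The
mount point is hypothetical and the path names are examples from this
report; substitute your own (run as root):

```shell
MAP=cc_fleet_otf_test_4_1           # example map name from this report

umount /mnt/example                 # steps 1-2: stop users, unmount
multipath -l "$MAP"                 # step 4: note all paths to the device
multipath -f "$MAP"                 # ... then release the multipath map

for p in sda sdb sdc sdd; do        # example path names from this report
    blockdev --flushbufs /dev/$p            # step 5: flush outstanding I/O
    echo 1 > /sys/block/$p/device/delete    # step 7: delete each SCSI path
done
```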



Laurent, also note:
====
Other procedures, such as the physical removal of the device, followed
by a rescan of the SCSI bus (as described in Section 21.9, “Scanning
Storage Interconnects”) to cause the operating system state to be
updated to reflect the change, are not recommended. This will cause
delays due to I/O timeouts, and devices may be removed unexpectedly. If
it is necessary to perform a rescan of an interconnect, it must be done
while I/O is paused, as described in Section 21.9, “Scanning Storage
Interconnects”.
====


More details:
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/removing_devices.html

>>
>>> 3) The whole problem I have first described here could be due to the
>>> queue_if_no_path feature and the fact that some udev rules call
>>> kpartx and blkid, because when I issues dmsetup message
>>> cc_fleet_otf_test_4_1 0 fail_if_no_path the process that were stuck
>>> exit and I can then remove the previously stuck mapping.
>>
>> queue_if_no_path is used to ensure that applications don't fail when
>> all paths go down (commonly seen during target cluster faults, where
>> there is a window in which all paths are unavailable). All it does is
>> suspend the affected processes (notice in the ps output that they'll
>> all be in the 'D' uninterruptible state).
>> Now, _if_ the udev rules touch device-mapper devices at that moment,
>> yes, you will see all those processes stuck.
>>
>> When those processes are stuck, can you take a ps output to see what
>> those commands (kpartx/blkid) are trying to access?
> 
> With both versions (from stable and unstable), as soon as the LUN is
> unmapped in the SAN manager, a kpartx and a blkid are spawned waiting
> for the IOs to resume, preventing the removal of the mapping from
> multipath. This seems to be related to the fact that multipath tries
> to re-add the dm device (see logs above) at that point.
> 
> This is spawned when multipathd see that the paths are faulty:
> 
> root      2884  0.0  0.0   8096   592 ?        D<   11:17
> 0:00 /sbin/blkid -o udev -p /dev/dm-1
> root      2885  0.0  0.0  12408   868 ?        S    11:17
> 0:00 /sbin/kpartx -a -p -part /dev/dm-0
> 
> (and surprisingly, blkid didn't appear the first time I tried today,
> but anyway...)
> 
> Moreover, after some time I get:
> 
> udevd[540]: worker [2549] unexpectedly returned with status 0x0100
> udevd[540]: worker [2549] failed while
> handling'/devices/virtual/block/dm-1'
> udevd[540]: worker [2373] unexpectedly returned with status 0x0100
> udevd[540]: worker [2373] failed while
> handling'/devices/virtual/block/dm-0'
> 
> But the previous processes are still stuck.
> 
> At that point if I do multipath -v3 a new kpartx process is spawned.
> 
> I'm not sure that the udev rules could be modified to prevent the
> spawning of kpartx and/or blkid in such cases, but I guess that a
> warning about queueing should at least be added to the README, like
> the redhat[1] one.
> 

I agree. I'll try to cook something up. In case you have something,
please feel free to add.
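As a starting point for such a note, a one-liner to spot the stuck
processes (D state) and the kernel symbol each one is blocked in:

```shell
# List uninterruptible (D-state) processes; the wchan column shows the
# kernel function each one is sleeping in.
ps -eo pid,stat,wchan:32,cmd | awk 'NR==1 || $2 ~ /^D/'
```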


-- 
Ritesh Raj Sarraf | Linux SAN Engineering | NetApp Inc.


