Bug#549660: md raid1 + lvm2 + snapshot resulted in lvm2 hang

Phil Ten phil.info at dafweb.com
Mon Oct 5 10:30:01 UTC 2009


Package: lvm2
Version: 2.02.39-7
Severity: important

Hello,

This problem is somewhat similar to #419209, but I believe
it is different because in my case the snapshot
was created successfully and the volume stalled
later, while writing to the snapshot.

I use Proxmox 1.3:
Linux ns300364.ovh.net 2.6.24-7-pve #1 SMP PREEMPT Fri Aug 21 09:07:39 CEST 2009 x86_64 GNU/Linux

The system was installed a couple of weeks ago and is not an upgrade.
Snapshots worked fine until the problem occurred,
despite no changes to the disk configuration.


My configuration:

md1: md RAID 1 + ext3, mounted as /
md0: md RAID 1 + lvm2, divided into two ext3 volumes, vmdata and vmbackup, mounted as /var/lib/vz and /backups.
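
For reference, the "data" volume group over md0 was created roughly like this (reconstructed from the lvdisplay output below; the exact commands and options may have differed slightly):

  pvcreate /dev/md0
  vgcreate data /dev/md0
  lvcreate -l 79250 -n vmdata data
  lvcreate -l 79250 -n vmbackup data
  mkfs.ext3 /dev/data/vmdata
  mkfs.ext3 /dev/data/vmbackup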


root@ns300364:/backups/tmp# lvdisplay
  --- Logical volume ---
  LV Name                /dev/data/vmdata
  VG Name                data
  LV UUID                9CzFBp-k7fV-wlls-qeeG-v7Or-u1pq-9XhKKy
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                309.57 GB
  Current LE             79250
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:1

  --- Logical volume ---
  LV Name                /dev/data/vmbackup
  VG Name                data
  LV UUID                jzCjXx-IodU-chBx-Aw3L-JUbv-dRho-vaOFCl
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                309.57 GB
  Current LE             79250
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           254:4

root@ns300364:~# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Tue Sep 15 17:48:43 2009
     Raid Level : raid1
     Array Size : 664986496 (634.18 GiB 680.95 GB)
  Used Dev Size : 664986496 (634.18 GiB 680.95 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Sun Oct  4 02:02:04 2009
          State : active, recovering
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

 Rebuild Status : 29% complete

           UUID : ab296276:ea3e622e:7008e345:84b8f442 (local to host ns300364.ovh.net)
         Events : 0.17

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3

Symptoms:

- snapshot creation for vmdata: OK
- backup onto vmbackup started: OK (the sequence is sketched after this list)
- after writing about 1 GB, the snapshot stalled. By that I mean that all requests
to read files on the lvm volumes hang.
However, "ls" and "cd" still work and I can get directory listings.
Any command that reads file contents stalls the ssh session (e.g. cat, cp, mv).
In particular, "cat /backups/phil.log" also stalls the ssh session.
Remember that the snapshot is for volume vmdata, while the "cat" above concerns volume vmbackup.
- smartctl does not report any problem (including the long test)
- "wa" in "top" is stuck at 99%; cpu usage is near zero.
- the snapshot is visible in /dev/mapper
- the snapshot cannot be removed (lvremove -f): again no error is reported, the command just hangs with no output at all.
- the system seems to work fine as long as nothing tries to read from one of the
two lvm2 volumes.
- no errors are reported in messages or syslog.
- it seems an md check started after the snapshot was created. This check also
stalled, at 29% (speed=0K/sec); again, no error is reported.
- a soft reboot did not work
- a hard reboot worked, but an md resync started and stalled at 0.1%, leaving the system in the same state as before the hard reboot.
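
The backup sequence that triggers the hang is roughly the following; the snapshot name, size and mount point here are illustrative, not the exact values used by the backup script:

  lvcreate -s -L 10G -n vmdata-snap /dev/data/vmdata   # snapshot creation: OK
  mount -o ro /dev/data/vmdata-snap /mnt/snap          # mounting: OK
  rsync -a /mnt/snap/ /backups/                        # stalls after about 1 GB
  lvremove -f /dev/data/vmdata-snap                    # attempted afterwards: hangs, no output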

To recover a working system I marked sdb3 as faulty, removed it from the raid1
array and hard rebooted. That worked: I could remove the snapshot and access the data
on both lvm volumes. Since then I have not tried to create another snapshot and the
system seems to work fine.
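
The recovery commands were roughly the following (from memory):

  mdadm /dev/md0 --fail /dev/sdb3      # mark sdb3 as faulty
  mdadm /dev/md0 --remove /dev/sdb3    # remove it from the raid1 array
  # then a hard reboot; after that the snapshot could be removed
  # and both lvm volumes were accessible again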

Greetings,

Phil Ten




