cinder volume hanging on removing snapshots

Bug #1270192 reported by Dirk Mueller on 2014-01-17
This bug affects 3 people
Affects     Importance   Assigned to
Cinder      Undecided    Dirk Mueller
Icehouse    Undecided    Unassigned

Bug Description

On any lvm2 version without lvmetad running, I can get cinder-volume to hang when it issues LVM-related commands after deleting snapshots (non-thin-provisioned LVM snapshots with volume_clear set to zero).

The issue is that LVM locks up when it tries to access suspended device-mapper entries; at some point cinder-volume runs an LVM command and hangs on it. Setting ignore_suspended_devices = 1 in lvm.conf helps, because lvremove otherwise hangs while scanning the device state (which it needs to do because it has no current information available via lvmetad).
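
To see whether a host already has this setting, something like the following could be used. It is only a sketch: the function name is made up for illustration, /etc/lvm/lvm.conf is assumed to be the config location, and the match is purely textual (it does not check that the line sits inside the devices { } section, where the option belongs).

```bash
# Return 0 if the given lvm.conf contains an uncommented
# "ignore_suspended_devices = 1" line, non-zero otherwise.
# Hypothetical helper; the default path is an assumption.
ignore_suspended_enabled() {
    conf=${1:-/etc/lvm/lvm.conf}
    # Commented-out lines start with "#" and therefore do not match;
    # a trailing "# comment" after the value is tolerated.
    grep -Eq '^[[:space:]]*ignore_suspended_devices[[:space:]]*=[[:space:]]*1[[:space:]]*(#.*)?$' "$conf"
}
```

Usage: ignore_suspended_enabled && echo "suspended devices are ignored".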

I can use this script to trigger the issue:

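=== cut hang.sh ===
#!/bin/bash
# Reproducer: needs root and an existing LVM volume group with a few GiB free.

enable_fix=1
#enable_fix=0

vg=cinder-volumes
v=testvol.$$

# Create a 1 GiB origin volume and a classic (non-thin) snapshot of it.
lvcreate --name "$v" "$vg" -L 1g
sleep 2
lvcreate --name "snap-$v" --snapshot "$vg/$v" -L 1g

# Device-mapper path of the snapshot; every "-" inside a VG or LV name is doubled.
vgp=/dev/mapper/${vg//-/--}-snap--${v//-/--}

sleep 2

# Keep a file descriptor open on the snapshot's -cow device for 10 seconds.
( sleep 10 < "$vgp-cow" ) &
# Workaround: deactivate the snapshot before removing it.
test "$enable_fix" -eq 1 && lvchange -y -an "$vg/snap-$v"
lvremove -f "$vg/snap-$v"
sleep 1
lvremove -f "$vg/$v"
=== cut hang.sh ===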

vg needs to be set to an existing LVM volume group that has a few GiB of free space. Whenever enable_fix is set to 0, lvremove -f ends with:

  Unable to deactivate open cinder--volumes-snap--testvol.27700-cow (252:5)
  Failed to resume snap-testvol.27700.
  libdevmapper exiting with 1 device(s) still suspended.

This is because the preceding sleep command keeps a file descriptor open on the -cow device. The script then never finishes, and any other LVM command hangs as well.
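
To confirm which device-mapper entries were left suspended, the plain output of dmsetup info can be filtered. The sketch below assumes that output consists of "Name:" and "State:" lines per device; the function name is made up for illustration.

```bash
# Read `dmsetup info` output on stdin and print the name of every
# device whose State line reads SUSPENDED.
# Hypothetical helper, not part of LVM or Cinder.
list_suspended() {
    awk '$1 == "Name:"  { name = $2 }
         $1 == "State:" && $2 == "SUSPENDED" { print name }'
}
```

Usage (as root): dmsetup info | list_suspended. A device listed there can usually be brought back with dmsetup resume NAME, although whether that is safe depends on why it was suspended.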

Apparently, in real life it is either udev or the dd command that still has the fd open, for a reason I have not yet understood.
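
One way to find out who holds the fd is to scan /proc for open descriptors that point at the -cow device. This is a Linux-only sketch with a made-up function name; it only sees processes whose /proc entries the caller may read (so run it as root), and it assumes the process opened the device under the same resolved path.

```bash
# Print the PIDs of processes that have the file or device given as $1 open.
# Hypothetical helper; fuser or lsof provide similar information.
find_holders() {
    target=$(readlink -f -- "$1") || return 1
    for fd in /proc/[0-9]*/fd/*; do
        # Each fd entry is a symlink to the file it refers to.
        if [ "$(readlink -- "$fd" 2>/dev/null)" = "$target" ]; then
            pid=${fd#/proc/}
            echo "${pid%%/*}"
        fi
    done | sort -un
}
```

Usage: find_holders "$vgp-cow", run while the snapshot removal is in progress.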

The deactivation before removing seems to help.

Fix proposed to branch: master
Review: https://review.openstack.org/67499

Changed in cinder:
assignee: nobody → Dirk Mueller (dmllr)
status: New → In Progress
Dirk Mueller (dmllr) on 2014-01-17
description: updated
Dirk Mueller (dmllr) on 2014-01-17
description: updated

Reviewed: https://review.openstack.org/67499
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=3764cecfc3b0a5b35634b15a4b049f433a8a22de
Submitter: Jenkins
Branch: master

commit 3764cecfc3b0a5b35634b15a4b049f433a8a22de
Author: Dirk Mueller <email address hidden>
Date: Fri Jan 17 16:45:32 2014 +0100

    Deactivate LV before removing

    With certain versions of LVM2, removing an active LV can end up with

      Unable to deactivate open XXX
      libdevmapper exiting with 1 device(s) still suspended.

    which causes any lvm command afterwards to hang endlessly on
    trying to access the suspended volume. This seems to be caused
    by a race with udev, so lets be conservative and do the deactivation,
    then wait for udev and then finish the removal.

    Closes-Bug: #1270192

    Change-Id: I4703133180567090878ea5047dd29d9f97ad85ab
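
The order of operations described in this commit message can be sketched in shell as follows. This is an illustration, not the code that was merged into Cinder; the function name is made up, and the LV is passed as VG/LV.

```bash
# Deactivate the LV, wait for pending udev events, then remove it.
# Hypothetical helper illustrating the sequence from the commit message.
remove_lv_deactivated() {
    lv=$1
    # Deactivate first so lvremove does not have to tear down an open device.
    lvchange -y -an "$lv" || return 1
    # Let udev finish processing the resulting events.
    udevadm settle
    lvremove -f "$lv"
}
```

Usage: remove_lv_deactivated cinder-volumes/snap-testvol. A later commit in this report notes that lvchange -an on a snapshot also deactivates the origin LV in the thick-volume case.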

Changed in cinder:
status: In Progress → Fix Committed
lirenke (lvhancy) wrote :

It seems that lvm2 (2.02.98-5) has fixed it, but I'm not sure. See: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=659762

Thierry Carrez (ttx) on 2014-03-05
Changed in cinder:
milestone: none → icehouse-3
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2014-04-17
Changed in cinder:
milestone: icehouse-3 → 2014.1
Attila Fazekas (afazekas) wrote :

On Ubuntu 14.04 something goes really wrong:
According to dmesg, the LVM-related tasks are blocked in the I/O scheduler.
[ 1440.468099] INFO: task lvm:21729 blocked for more than 120 seconds.
[ 1440.469298] Not tainted 3.13.0-24-generic #47-Ubuntu
[ 1440.470334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1440.471736] lvm D ffff88021fd94440 0 21729 21726 0x00000000
[ 1440.471741] ffff8801c259dae0 0000000000000002 ffff8800ae3ac7d0 ffff8801c259dfd8
[ 1440.471743] 0000000000014440 0000000000014440 ffff8800ae3ac7d0 ffff88021fd94cd8
[ 1440.471745] 0000000000000000 ffff880212174f00 0000000000000000 ffff8800ae3ac7d0
[ 1440.471748] Call Trace:
[ 1440.471755] [<ffffffff8171a20d>] io_schedule+0x9d/0x140
[ 1440.471760] [<ffffffff811f6d74>] do_blockdev_direct_IO+0x1ce4/0x2910
[ 1440.471762] [<ffffffff811f2640>] ? bdev_inode_switch_bdi+0x120/0x160
[ 1440.471765] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
[ 1440.471767] [<ffffffff811f79f5>] __blockdev_direct_IO+0x55/0x60
[ 1440.471768] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
[ 1440.471770] [<ffffffff811f22d6>] blkdev_direct_IO+0x56/0x60
[ 1440.471772] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
[ 1440.471777] [<ffffffff81150b5b>] generic_file_aio_read+0x69b/0x700
[ 1440.471781] [<ffffffff811c8e18>] ? path_openat+0x158/0x620
[ 1440.471783] [<ffffffff811f275b>] blkdev_aio_read+0x4b/0x70
[ 1440.471786] [<ffffffff811b8d1a>] do_sync_read+0x5a/0x90
[ 1440.471789] [<ffffffff811b93b5>] vfs_read+0x95/0x160
[ 1440.471790] [<ffffffff811b9ec9>] SyS_read+0x49/0xa0
[ 1440.471794] [<ffffffff817266bf>] tracesys+0xe1/0xe6
[ 1440.471796] INFO: task lvremove:21759 blocked for more than 120 seconds.
[ 1440.473042] Not tainted 3.13.0-24-generic #47-Ubuntu
[ 1440.474019] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1440.475436] lvremove D ffff88021fc14440 0 21759 21544 0x00000000
[ 1440.475440] ffff8800ae1c3ae0 0000000000000002 ffff8801dad12fe0 ffff8800ae1c3fd8
[ 1440.475442] 0000000000014440 0000000000014440 ffff8801dad12fe0 ffff88021fc14cd8
[ 1440.475444] 0000000000000000 ffff8801daeaf200 0000000000000000 ffff8801dad12fe0
[ 1440.475446] Call Trace:
[ 1440.475449] [<ffffffff8171a20d>] io_schedule+0x9d/0x140
[ 1440.475451] [<ffffffff811f6d74>] do_blockdev_direct_IO+0x1ce4/0x2910
[ 1440.475454] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
[ 1440.475456] [<ffffffff811f79f5>] __blockdev_direct_IO+0x55/0x60
[ 1440.475457] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
[ 1440.475459] [<ffffffff811f22d6>] blkdev_direct_IO+0x56/0x60
[ 1440.475461] [<ffffffff811f1be0>] ? I_BDEV+0x10/0x10
[ 1440.475463] [<ffffffff81150b5b>] generic_file_aio_read+0x69b/0x700
[ 1440.475466] [<ffffffff811c8e18>] ? path_openat+0x158/0x620
[ 1440.475468] [<ffffffff811f275b>] blkdev_aio_read+0x4b/0x70
[ 1440.475470] [<ffffffff811b8d1a>] do_sync_read+0x5a/0x90
[ 1440.475472] [<ffffffff811b93b5>] vfs_read+0x95/0x160
[ 1440.475474] [<ffffffff811b9ec9>] SyS_read+0x49/0xa0
[ 1440.475476] [<ffffffff817266bf>] tracesys+0xe1/0xe6

It does not seem to be a Cinder bug; please consider adding the kernel, lvm or udev to the affected components.
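
To pull just the blocked tasks out of a long dmesg log like the one above, a small filter can help. It is a sketch that assumes the "INFO: task NAME:PID blocked for more than N seconds." wording shown in the log; the function name is made up.

```bash
# Read kernel log text on stdin and print "name pid seconds"
# for every hung-task report.
# Hypothetical helper for triaging logs.
hung_tasks() {
    sed -n 's/.*INFO: task \([^:]*\):\([0-9]*\) blocked for more than \([0-9]*\) seconds.*/\1 \2 \3/p'
}
```

Usage: dmesg | hung_tasks.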

I was n...


Mike Perez (thingee) on 2014-05-21
tags: added: drivers lvm

Related fix proposed to branch: master
Review: https://review.openstack.org/94828

Dirk Mueller (dmllr) on 2014-06-12
description: updated

Related fix proposed to branch: master
Review: https://review.openstack.org/99784

Reviewed: https://review.openstack.org/99784
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=da9597aed0186e68dbf1c7304b30e49f8e6a54ff
Submitter: Jenkins
Branch: master

commit da9597aed0186e68dbf1c7304b30e49f8e6a54ff
Author: Dirk Mueller <email address hidden>
Date: Fri Jun 13 00:24:23 2014 +0200

    Retry lvremove with ignore_suspended_devices

    A lvremove -f might leave behind suspended devices
    when it is racing with udev or other processes
    still accessing any of the device files. The previous
    solution of using lvchange -an on the LV had the
    side-effect of deactivating origin LVs alongway in
    the thick volume case, which was undesired.

    It turns out retrying the deactivation twice and
    ignoring the suspended devices on the second iteration
    avoids the hang of all LVM operations after an initial
    failure.

    Change-Id: I0d6fb74084d049ea184e68f2dcc4e74f400b7dbd
    Closes-Bug: #1317075
    Related-Bug: #1270192
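
The retry idea from this commit message can be sketched in shell as follows. It is an illustration under assumptions, not the merged Cinder code: the function name is made up, and the udevadm settle between the two attempts is an assumption borrowed from the earlier fix.

```bash
# Try a plain lvremove first; if it fails, retry once while telling
# LVM to ignore suspended devices during its device scan.
# Hypothetical helper illustrating the approach.
lvremove_retry() {
    lv=$1
    if lvremove -f "$lv"; then
        return 0
    fi
    # Assumption: give udev time to release the device before retrying.
    udevadm settle
    lvremove --config 'devices { ignore_suspended_devices = 1 }' -f "$lv"
}
```

Usage: lvremove_retry cinder-volumes/snap-testvol. Passing the override with --config limits it to this one command instead of changing lvm.conf for the whole host.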

Change abandoned by afazekas (<email address hidden>) on branch: master
Review: https://review.openstack.org/94828
Reason: https://review.openstack.org/#/c/99784/ is merged.

Reviewed: https://review.openstack.org/106300
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=c74efd7765e33379c50056359123eb6f00fedb07
Submitter: Jenkins
Branch: stable/icehouse

commit c74efd7765e33379c50056359123eb6f00fedb07
Author: Dirk Mueller <email address hidden>
Date: Fri Jun 13 00:24:23 2014 +0200

    Retry lvremove with ignore_suspended_devices

    A lvremove -f might leave behind suspended devices
    when it is racing with udev or other processes
    still accessing any of the device files. The previous
    solution of using lvchange -an on the LV had the
    side-effect of deactivating origin LVs alongway in
    the thick volume case, which was undesired.

    It turns out retrying the deactivation twice and
    ignoring the suspended devices on the second iteration
    avoids the hang of all LVM operations after an initial
    failure.

    Change-Id: I0d6fb74084d049ea184e68f2dcc4e74f400b7dbd
    Closes-Bug: #1317075
    Related-Bug: #1270192
    (cherry picked from commit da9597aed0186e68dbf1c7304b30e49f8e6a54ff)

tags: added: in-stable-icehouse
Matt Riedemann (mriedem) wrote :
tags: added: volumes
Matt Riedemann (mriedem) wrote :

Actually bug 1191960 is a better fit for the nova issue.

Matt Riedemann (mriedem) on 2015-10-30
no longer affects: nova

