Cannot attach volume after volume snapshot create/delete when using lvm>=2.02.99

Bug #1317075 reported by Attila Fazekas
Affects    Status        Importance  Assigned to   Milestone
Cinder     Fix Released  Medium      Dirk Mueller
Icehouse   Fix Released  Undecided   Unassigned
tempest    Won't Fix     Undecided   Unassigned

Bug Description

LVM backend, with the tgtadm iSCSI helper.

1. nova boot test2 --image cirros-0.3.2-x86_64-uec --flavor 42 # wait for 'active'
2. cinder create 1 # wait for 'available'
3. cinder snapshot-create <vol_id> # wait for create
4. cinder snapshot-delete <snap_id> # wait for delete
5. nova volume-attach test2 <vol_id> /dev/vdc
6. cinder list # shows the volume in 'attaching' status for a long time (>10s)
7. cinder list # the volume ends up back in 'available' status (the attach failed)

The cinder snapshot-delete <snap_id> causes the volume to lose its (a)ctive LVM flag:

 LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
volume-c86347f9-3f42-4824-aee1-ae4aa33a2cf9 stack-volumes -wi------- 1.00g

This command also affects the original volume, even though it is run on the snapshot:
$ lvchange -y -an stack-volumes/_snapshot-1b5cfb86-be8e-4ad0-89f5-ecc4e64a5f5d

Revision history for this message
Eric Harney (eharney) wrote :

I looked into this, seeing the same behavior.

The problem is that brick's vg.delete() calls lvchange -y -an on the snapshot LV, which has the side effect of also deactivating the origin volume's LV. This is not what we want to happen here.
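
For illustration, the sequence brick ends up running for a snapshot delete looks roughly like this (a sketch based on the commands named in this thread, with placeholder IDs, not a verbatim trace):

$ lvchange -y -an stack-volumes/_snapshot-<snap_id>  # deactivate the snapshot LV; side
                                                     # effect: the origin LV is
                                                     # deactivated as well
$ lvremove -f stack-volumes/_snapshot-<snap_id>      # then remove the snapshot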

Changed in cinder:
status: New → Confirmed
Revision history for this message
Attila Fazekas (afazekas) wrote :

Adding tempest to this bug so that a similar test case can be created after the issue is fixed in cinder.

This issue also prevents deletion of the original LV when [DEFAULT]volume_clear is not 'none'. In the gate jobs it is 'none', so the tests can pass.
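
For reference, the option in question is set in cinder.conf; a minimal sketch (the value names are the standard ones, and I believe 'zero' was the default at the time):

[DEFAULT]
# 'none' skips wiping the LV on delete; 'zero' or 'shred' overwrite it,
# which requires the LV to still be active
volume_clear = none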

Revision history for this message
Attila Fazekas (afazekas) wrote :

The snapshot deactivation was introduced by https://bugs.launchpad.net/cinder/+bug/1270192.
The change may have tried to work around a lower-level issue: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=659762.

I was not able to reproduce the original issue mentioned in #1270192 with another lvm version.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94051

Changed in cinder:
assignee: nobody → Attila Fazekas (afazekas)
status: Confirmed → In Progress
Revision history for this message
John Griffith (john-griffith) wrote : Re: Cannot attache volume after volume snapshot create/delete

I'm able to perform the sequence listed on an Ubuntu devstack setup with no issues. I think the activate changes in general have introduced quite a bit of confusion, and honestly I have to admit I've never been able to get a clear understanding of the issues they were introduced to fix in the first place. Maybe we need to do some work on documenting exactly when this is an issue (what scenarios, LVM versions, OS versions, etc.).

That being said, I don't necessarily object to reverting the patch (I've proposed reverting it once already for other reasons), but I think we need more detailed data and understanding (like why I can't reproduce this particular issue, but also why I could never reproduce the issues that prompted adding all of this activate work in the first place).

Revision history for this message
Attila Fazekas (afazekas) wrote :

This was the LVM change that actually made it possible to deactivate the snapshot before the origin: http://www.redhat.com/archives/lvm-devel/2011-November/msg00124.html. Before that version the deactivation was refused; now the system can ask whether you want to deactivate the original volume as well.

It looks like the gate setup is using an older lvchange, and the feature was not even back-ported:
http://logs.openstack.org/05/94105/1/check/check-tempest-dsvm-full/10de203/logs/screen-c-vol.txt.gz?#_2014-05-18_19_21_12_224

The c-vol logs are full of:
RESPONSE: Can't change snapshot logical volume "_snapshot-737b3cb3-44c7-4608-9850-d2284ecf122c"

So the deactivation is not working there, and thus the issue does not exist.

Do you have any system where lvchange -y -an stack-volumes/_snapshot-689e8acf-e1d0-4b4a-b2bc-7fd1bfc3ac3e does not deactivate the original volume but does deactivate the snapshot?

$ lvs
  LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
  _snapshot-689e8acf-e1d0-4b4a-b2bc-7fd1bfc3ac3e stack-volumes swi-a-s--- 1.00g volume-968901e1-1f5b-4828-ae87-3d937f3f829e 0.00
  volume-968901e1-1f5b-4828-ae87-3d937f3f829e stack-volumes owi-a-s--- 1.00g
$ lvchange -y -an stack-volumes/_snapshot-689e8acf-e1d0-4b4a-b2bc-7fd1bfc3ac3e
$ lvs
 LV VG Attr LSize Pool Origin Data% Move Log Cpy%Sync Convert
  _snapshot-689e8acf-e1d0-4b4a-b2bc-7fd1bfc3ac3e stack-volumes swi---s--- 1.00g volume-968901e1-1f5b-4828-ae87-3d937f3f829e
  volume-968901e1-1f5b-4828-ae87-3d937f3f829e stack-volumes owi---s--- 1.00g

The 'a' flag is removed from both the origin and the snapshot volume.

Revision history for this message
Attila Fazekas (afazekas) wrote :

Up to lvm2-2.02.103-5.fc20.x86_64, two different 'lvchange -y -an stack-volumes/snapshot' behaviors are confirmed:

1. The command fails and deactivates neither the origin volume nor the snapshot volume.
2. Both the origin and the snapshot volume are deactivated.

In both cases the patch should be reverted, even if it solves or works around some issue related to certain versions of udev or udev scripts. Deactivating the origin volume (even temporarily) MUST be avoided while the volume is attached.
The volume MUST be active before exporting it via iSCSI.

My google-fu was not enough to find any LVM config option or any patch that makes 'lvchange -y -an stack-volumes/_snapshot-' deactivate only the snapshot.

The original patch does not cause a noticeable issue with an LVM version/config that is able to deactivate only the snapshot volume with the above command; in that case the only question would be whether it is really required.
Until I see such an LVM version/config with my own eyes, I would say it does not exist.

On F20 with the tgtadm iSCSI helper, if the volume is already attached, the lvchange for snapshot deactivation fails ('Logical volume in use') instead of causing a catastrophe.

Revision history for this message
Attila Fazekas (afazekas) wrote :

The same behavior is confirmed up to lvm2-2.02.106-1.fc20.x86_64.

Mike Perez (thingee)
summary: - Cannot attache volume after volume snapshot create/delete
+ Cannot attach volume after volume snapshot create/delete
Revision history for this message
Attila Fazekas (afazekas) wrote : Re: Cannot attach volume after volume snapshot create/delete

The lvchange on a snapshot just prints errors on Ubuntu 12.04, but on 14.04 it works as it does everywhere else.

The strange thing on 14.04 is that when you remove a snapshot, it activates EVERY logical volume, not just the snapshot's origin.
That is totally unexpected behavior to me, but in the end the origin volume is active.

The commands used to reproduce the hang issue in #1270192 are valid and must not cause hanging;
since the hang issue exists on 14.04 outside of cinder, #1270192 is not a cinder bug.

Revision history for this message
Attila Fazekas (afazekas) wrote :

lvremove normally does not lead to activating anything.

Revision history for this message
Attila Fazekas (afazekas) wrote :

An alternative to reverting is to always ensure the volume is active before attaching and before zeroing (secure delete).
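
A minimal sketch of that alternative (VG/LV names are placeholders):

$ lvchange -ay stack-volumes/volume-<vol_id>  # re-activate the origin in case a
                                              # snapshot delete deactivated it
# ...then proceed with the iSCSI export (tgtadm) or the zeroing dd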

Revision history for this message
Dirk Mueller (dmllr) wrote :

@Attila, it would have been nice to CC me on the bug. I'll try looking into your report now.

Revision history for this message
John Griffith (john-griffith) wrote :

After a bit of conversation with Attila and Eric Harney, and after performing some comparisons, it seems that the behavior described in this bug is specific to F20 (for now).

Ubuntu versions through 14.04 seem to ignore the fact that you're trying to deactivate a thick LVM snap (the LVM docs apparently point out that doing this will deactivate the parent as well), which is why we didn't see this as a problem in the gates or in my own tests. Also, I don't think this behavior exists in versions of LVM prior to .105, but I'm not sure of that yet.

It seems that we should not be deactivating thick snapshots regardless.

summary: - Cannot attach volume after volume snapshot create/delete
+ Cannot attach volume after volume snapshot create/delete when using
+ Fedora 20
Changed in cinder:
status: In Progress → Triaged
importance: Undecided → Medium
Revision history for this message
Attila Fazekas (afazekas) wrote : Re: Cannot attach volume after volume snapshot create/delete when using Fedora 20

The issue is not visible below lvm 2.02.99, because there lvremove still performs the unexpected activation, re-activating the origin at the end. This change fixed the unexpected activation in lvm2 2.02.99, so from that version on the origin stays deactivated:
http://www.redhat.com/archives/lvm-devel/2012-December/msg00072.html
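
To tell which side of that change a given host is on, checking the version is enough (example output, not taken from the original report):

$ lvm version | head -1
  LVM version:     2.02.98(2) (2012-10-15)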

summary: Cannot attach volume after volume snapshot create/delete when using
- Fedora 20
+ lvm>=2.02.99
Revision history for this message
Dirk Mueller (dmllr) wrote :

The deactivation of other, connected snapshots or volumes when invoking lvchange -an is unexpected to me, and I agree that the patch needs to be reverted or modified. I'm currently trying to figure out another workaround, but this will take a few more days as I'm under pressure to finish something else first.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (master)

Fix proposed to branch: master
Review: https://review.openstack.org/94828

Changed in cinder:
status: Triaged → In Progress
Revision history for this message
Dirk Mueller (dmllr) wrote :

For the record, the issues I'm trying to solve occur with 2.02.98.

Alan Pevec (apevec)
tags: added: icehouse-backport-potential
Revision history for this message
Dirk Mueller (dmllr) wrote :

The issue is not specific to lvm2 versions: I was able to reproduce it on Fedora 20 as well, if lvmetad was not running.

The reason lvremove hangs is that by default lvm.conf contains this entry:

ignore_suspended_devices = 0

If there is still an open handle on the -cow device (see the original bug), the first lvremove fails and leaves the device in a suspended state. Any LVM command afterwards will wait for the device to resume (due to the above config setting), which never happens. Having lvmetad running works around the issue because LVM then does not need to scan the device state (and hence does not hang on a suspended device).

You can reproduce the issue on Fedora 20 with use_lvmetad = 0 set.
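
A sketch of how the stuck state can be observed (names are illustrative; device-mapper names double the dashes of the VG/LV names):

$ lvremove -f stack-volumes/_snapshot-<snap_id>            # fails while the -cow device is open
$ dmsetup info stack--volumes-_snapshot--<snap_id>-cow | grep State
State:              SUSPENDED
$ lvs                                                      # now blocks, waiting for the
                                                           # suspended device to resume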

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/99784

Changed in cinder:
assignee: Attila Fazekas (afazekas) → Dirk Mueller (dmllr)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (master)

Reviewed: https://review.openstack.org/99784
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=da9597aed0186e68dbf1c7304b30e49f8e6a54ff
Submitter: Jenkins
Branch: master

commit da9597aed0186e68dbf1c7304b30e49f8e6a54ff
Author: Dirk Mueller <email address hidden>
Date: Fri Jun 13 00:24:23 2014 +0200

    Retry lvremove with ignore_suspended_devices

    A lvremove -f might leave behind suspended devices
    when it is racing with udev or other processes
    still accessing any of the device files. The previous
    solution of using lvchange -an on the LV had the
    side-effect of deactivating origin LVs alongway in
    the thick volume case, which was undesired.

    It turns out retrying the deactivation twice and
    ignoring the suspended devices on the second iteration
    avoids the hang of all LVM operations after an initial
    failure.

    Change-Id: I0d6fb74084d049ea184e68f2dcc4e74f400b7dbd
    Closes-Bug: #1317075
    Related-Bug: #1270192
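
In LVM CLI terms, the retry described above looks roughly like this (a sketch; the actual change lives in cinder's brick LVM code, see the review above):

$ lvremove -f stack-volumes/_snapshot-<snap_id>
# if that fails, retry once more while telling LVM to ignore suspended devices:
$ lvremove --config 'devices { ignore_suspended_devices = 1 }' -f stack-volumes/_snapshot-<snap_id>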

Changed in cinder:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on cinder (master)

Change abandoned by afazekas (<email address hidden>) on branch: master
Review: https://review.openstack.org/94828
Reason: https://review.openstack.org/#/c/99784/ is merged.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cinder (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/106300

Changed in cinder:
milestone: none → juno-2
status: Fix Committed → Fix Released
Jay Bryant (jsbryant)
tags: added: in-stable-icehouse
removed: icehouse-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to cinder (stable/icehouse)

Reviewed: https://review.openstack.org/106300
Committed: https://git.openstack.org/cgit/openstack/cinder/commit/?id=c74efd7765e33379c50056359123eb6f00fedb07
Submitter: Jenkins
Branch: stable/icehouse

commit c74efd7765e33379c50056359123eb6f00fedb07
Author: Dirk Mueller <email address hidden>
Date: Fri Jun 13 00:24:23 2014 +0200

    Retry lvremove with ignore_suspended_devices

    A lvremove -f might leave behind suspended devices
    when it is racing with udev or other processes
    still accessing any of the device files. The previous
    solution of using lvchange -an on the LV had the
    side-effect of deactivating origin LVs alongway in
    the thick volume case, which was undesired.

    It turns out retrying the deactivation twice and
    ignoring the suspended devices on the second iteration
    avoids the hang of all LVM operations after an initial
    failure.

    Change-Id: I0d6fb74084d049ea184e68f2dcc4e74f400b7dbd
    Closes-Bug: #1317075
    Related-Bug: #1270192
    (cherry picked from commit da9597aed0186e68dbf1c7304b30e49f8e6a54ff)

Changed in tempest:
status: New → Confirmed
tags: added: test-needed
Thierry Carrez (ttx)
Changed in cinder:
milestone: juno-2 → 2014.2
Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The original bug happened over 2 years ago and there has not been any activity on the Tempest side.
So I don't think it is worth keeping this open on the Tempest side at this time.

Changed in tempest:
status: Confirmed → Won't Fix