OpenStack Compute (nova)

Nova rescue causes LVM timeouts after moving attachments

Bug #1423654 reported by John Griffith on 2015-02-19

This bug affects 1 person

Affects                    Status     Importance  Assigned to    Milestone
OpenStack Compute (nova)   Won't Fix  Medium      John Griffith

Bug Description

The Nova rescue feature powers off a running instance and boots a rescue instance, attaching the ephemeral disk of the original instance to it so that an admin can try to recover the instance. The problem is that if a Cinder volume is attached to that instance when we do a rescue, we don't detach it or do any sort of maintenance on the block mapping that we have set up for it. We do check whether we have one and verify it's attached, but that's it.

The result is that after the rescue operation, subsequent LVM calls such as lvs and vgs attempt to open a device file that no longer exists, which takes up to 60 seconds for each device. An example is the current tempest test:
tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[gate,negative,volume]

If you look at the tempest results, you'll notice that this test always takes in excess of 100 seconds. That's not just because it's a long test; it's the blocking LVM calls.
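One way to confirm that the time is going into the LVM calls is to time them directly on the affected host. The following is a minimal sketch, not something taken from the test or from Nova; the helper name time_command is hypothetical, and it only wraps the standard library's subprocess.run.

```python
import subprocess
import time


def time_command(argv, timeout=120.0):
    """Run argv once and return (elapsed_seconds, return_code).

    Output is discarded; only the wall-clock duration matters here.
    subprocess.TimeoutExpired is raised if the command exceeds timeout.
    """
    start = time.monotonic()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
        check=False,
    )
    return time.monotonic() - start, completed.returncode


# Assumed usage on a Cinder LVM host (usually needs root):
#   elapsed, rc = time_command(["lvs"])
#   elapsed, rc = time_command(["vgs"])
```

If the description above is right, the elapsed time for lvs or vgs should grow by up to roughly a minute per stale device after a rescue.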

We should detach any cinder volumes that are attached to an instance during the rescue process. One concern with this that came from folks on the Nova team was 'what about boot from volume'? Rescue of a volume booted instance is currently an invalid case as is evident by the code that checks for it and fails here:
https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2822

There's probably no reason we can't automate this as part of rescue in the future, but for now it's a separate enhancement independent of this bug.

John Griffith (john-griffith) on 2015-02-19

Changed in nova:
assignee:	nobody → John Griffith (john-griffith)

Davanum Srinivas (DIMS) (dims-v) on 2015-02-20

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Medium

John Griffith (john-griffith) wrote on 2015-02-27:

It seems that during the attach swap we run into a problem with devices not being quiesced. If we add 10 seconds (yes, 10 seconds) between when we issue the unrescue call and when we report the instance as ACTIVE again, we eliminate the problem. Ideally we'd figure out exactly what state we're waiting for and how to detect it, but across a number of tempest runs 5 seconds isn't enough, 8 is marginal, and 10 seems to consistently eliminate the problem.

Note this also appears to be the root cause of the remaining issues with:
https://bugs.launchpad.net/cinder/+bug/1373513
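
The alternative to a fixed 10 second sleep is to poll for a condition with an upper bound on the wait. The sketch below only illustrates that pattern; wait_for and its parameters are hypothetical names, and the predicate is a placeholder because the state that actually needs to be detected is not identified in this report.

```python
import time


def wait_for(predicate, timeout=30.0, interval=0.5):
    """Poll predicate() until it returns True or timeout seconds elapse.

    Returns True as soon as the predicate succeeds, False on timeout.
    The predicate is always checked at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Assumed usage: devices_settled is a hypothetical check that would have
# to be written once the real condition is known.
#   if not wait_for(devices_settled, timeout=30.0):
#       raise RuntimeError("devices did not settle after unrescue")
```

Compared with an unconditional sleep, this returns as soon as the condition holds and fails loudly when it never does.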

OpenStack Infra (hudson-openstack) on 2015-02-27

Changed in nova:
status:	Confirmed → In Progress

Daniel Berrange (berrange) wrote on 2015-03-10:

> We should detach any cinder volumes that are attached to an instance during
> the rescue process. One concern with this that came from folks on the Nova
> team was 'what about boot from volume'? Rescue of a volume booted instance
> is currently an invalid case as is evident by the code that checks for it and fails here:

No, we should *not* detach cinder volumes during rescue. The storage attached to a rescue VM should be *identical* to the storage attached to a normally running VM, with the exception of the extra rescue disk being added. The administrator may well need data from the volumes during the rescue process. Nova libvirt doesn't currently handle cinder vols correctly during rescue, but we should fix that, not remove them altogether.

OpenStack Infra (hudson-openstack) wrote on 2015-03-10: Change abandoned on nova (master)

Change abandoned by John Griffith (<email address hidden>) on branch: master
Review: https://review.openstack.org/159713
Reason: I give up, no hard feelings but honestly this is silly.

Davanum Srinivas (DIMS) (dims-v) on 2015-03-14

Changed in nova:
status:	In Progress → Confirmed

Mike Perez (thingee) on 2015-07-08

tags:

added: volumes

John Griffith (john-griffith) wrote on 2015-07-08:

FYI, we couldn't get a change into Nova, so we modified the setup to work around the issue.

We added the use of lvm.conf to isolate the disks Cinder looks at and uses. It's automated in the devstack setup here:
https://review.openstack.org/#/c/165281/

and it's documented in the install guide here:
http://docs.openstack.org/kilo/install-guide/install/apt/content/cinder-install-storage-node.html
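
For reference, this kind of isolation is done with a filter entry in the devices section of /etc/lvm/lvm.conf that accepts the disks LVM should scan and rejects everything else. The sketch below just builds such a line; build_lvm_filter is a hypothetical helper and sdb is an assumed device name, so check the linked change and guide for the values that apply to a given host.

```python
def build_lvm_filter(allowed_devices):
    """Return an lvm.conf filter line that accepts only the given devices.

    Each name becomes an accept pattern ("a/<name>/"), followed by a
    final reject-all pattern ("r/.*/"). Names are used as-is, so they
    are assumed to contain no regex metacharacters.
    """
    accept = [f'"a/{name}/"' for name in allowed_devices]
    reject_rest = '"r/.*/"'
    return "filter = [ " + ", ".join(accept + [reject_rest]) + " ]"


# Assumed usage, with sdb as the disk backing the cinder volume group:
#   print(build_lvm_filter(["sdb"]))
```

With a filter like this in place, lvs and vgs no longer try to open device files for volumes attached to instances, which is what avoids the timeouts described in this bug.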

Andrea Rosa (andrea-rosa-m) wrote on 2016-07-20:

From the last comment, it seems to me that this bug is not going to be fixed in Nova, but a workaround is in place in Cinder. Considering that, and considering that the last comment was more than a year ago, I am going to close this ticket as "Won't Fix".
Please feel free to comment here if you disagree.

Changed in nova:
status:	Confirmed → Won't Fix
