Nova rescue causes LVM timeouts after moving attachments

Bug #1423654 reported by John Griffith
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Medium
Assigned to: John Griffith
Milestone: (none)

Bug Description

The Nova rescue feature powers off a running instance and boots a rescue instance, attaching the ephemeral disk of the original instance to it so an admin can try to recover the instance. The problem is that if a Cinder volume is attached to that instance when we do a rescue, we don't detach it or do any sort of maintenance on the block device mapping we have set up for it. We do check to see that we have it and verify that it's attached, but that's it.
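Paraphrasing from memory (the names below follow the Nova compute API but may not match the actual code exactly), the check-only behaviour described above is roughly:

    # Sketch of what rescue currently does with attached volumes: look up the
    # block device mappings and verify each volume is attached -- nothing more,
    # no detach and no maintenance of the mapping.
    bdms = objects.BlockDeviceMappingList.get_by_instance_uuid(
        context, instance.uuid)
    for bdm in bdms:
        if bdm.volume_id:
            volume = self.volume_api.get(context, bdm.volume_id)
            self.volume_api.check_attached(context, volume)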

The result is that after the rescue operation, subsequent LVM calls such as lvs and vgs will attempt to open a device file that no longer exists, which takes up to 60 seconds per device. An example is the current tempest test:
tempest.api.compute.servers.test_server_rescue_negative.ServerRescueNegativeTestJSON.test_rescued_vm_detach_volume[gate,negative,volume]

If you look at the tempest results you'll notice that this test always takes in excess of 100 seconds, and that's not just because it's a long test; it's the blocking LVM calls.
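You can see the delay directly by timing a bare vgs call on the affected host; a minimal diagnostic sketch (assumes the LVM tools are installed, nothing Nova-specific):

    import subprocess
    import time

    # On a host left with a stale device after rescue, each missing device can
    # add up to ~60 seconds to this call.
    start = time.time()
    subprocess.run(['vgs', '--noheadings'], check=False)
    print('vgs took %.1f seconds' % (time.time() - start))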

We should detach any Cinder volumes that are attached to an instance during the rescue process. One concern with this that came from folks on the Nova team was 'what about boot from volume?' Rescue of a volume-booted instance is currently an invalid case, as is evident from the code that checks for it and fails here:
https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2822
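That guard is roughly the following (paraphrased from memory, so names and the exact message may differ from the linked code):

    # Approximate paraphrase of the boot-from-volume guard in
    # nova/compute/api.py: rescue of a volume-backed instance is refused.
    if self.is_volume_backed_instance(context, instance, bdms):
        reason = _("Cannot rescue a volume-backed instance")
        raise exception.InstanceNotRescuable(instance_id=instance.uuid,
                                             reason=reason)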

There's probably no reason we can't automate this as part of rescue in the future, but for now that's a separate enhancement independent of this bug.

Tags: volumes
Changed in nova:
assignee: nobody → John Griffith (john-griffith)
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
John Griffith (john-griffith) wrote :

It seems that what happens here is that during the attach swap we run into a problem with devices not being quiesced. If we add 10 seconds (yes, 10 seconds) between when we issue the unrescue call and when we report the instance as ACTIVE again, we eliminate the problem. Ideally we'd figure out exactly what state we're waiting for and how to detect it, but in a number of tempest runs it seems that 5 seconds isn't enough, 8 is marginal, and 10 seems to consistently eliminate the problem.
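For reference, the experiment amounts to something like this (purely a hypothetical sketch; unrescue_with_delay, the driver call, and the state handling are stand-ins, not the real Nova code path):

    import time

    QUIESCE_DELAY = 10  # seconds; 5 wasn't enough, 8 was marginal, 10 was reliable

    def unrescue_with_delay(driver, context, instance, network_info):
        # Give the re-attached devices time to quiesce before reporting ACTIVE.
        driver.unrescue(context, instance, network_info)  # stand-in call
        time.sleep(QUIESCE_DELAY)
        instance.vm_state = 'active'  # vm_states.ACTIVE in the real code
        instance.save()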

Note this also appears to be the root cause of the remaining issues with:
https://bugs.launchpad.net/cinder/+bug/1373513

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Daniel Berrange (berrange) wrote :

> We should detach any cinder volumes that are attached to an instance during
> the rescue process. One concern with this that came from folks on the Nova
> team was 'what about boot from volume'? Rescue of a volume booted instance
> is currently an invalid case as is evident by the code that checks for it and fails here:

No, we should *not* detach Cinder volumes during rescue. The storage attached to a rescue VM should be *identical* to the storage attached to a normally running VM, with the exception of the extra rescue disk being added. The administrator may well need data from the volumes during the rescue process. The Nova libvirt driver doesn't currently handle Cinder volumes correctly during rescue, but we should fix that, not remove them altogether.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by John Griffith (<email address hidden>) on branch: master
Review: https://review.openstack.org/159713
Reason: I give up, no hard feelings but honestly this is silly.

Changed in nova:
status: In Progress → Confirmed
Mike Perez (thingee)
tags: added: volumes
Revision history for this message
John Griffith (john-griffith) wrote :

FYI, we couldn't get a change into Nova, so we modified our setup to work around the issue.

We added an LVM filter (lvm.conf) to isolate the disks Cinder looks at/uses. It's automated in the devstack setup here:
https://review.openstack.org/#/c/165281/

and it's documented in the install guide here:
http://docs.openstack.org/kilo/install-guide/install/apt/content/cinder-install-storage-node.html
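The workaround boils down to an LVM filter in /etc/lvm/lvm.conf along these lines (sdb is just an example name for the Cinder backing disk; the exact filter depends on your layout):

    devices {
        # accept only the Cinder backing device and reject everything else,
        # so lvs/vgs never try to open the instance-attached volumes
        filter = [ "a/sdb/", "r/.*/" ]
    }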

Revision history for this message
Andrea Rosa (andrea-rosa-m) wrote :

From the last comment it seems to me that this bug is not going to be fixed in Nova, but a workaround is in place in Cinder. Considering that, and considering that the last comment was over a year ago, I am going to close this ticket as "Won't Fix".
Please feel free to comment here if you disagree.

Changed in nova:
status: Confirmed → Won't Fix