Nova rescue causes LVM timeouts after moving attachments
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Compute (nova) | Won't Fix | Medium | John Griffith | |
Bug Description
The Nova rescue feature powers off a running instance and boots a rescue instance, attaching the ephemeral disk of the original instance to it so that an admin can try to recover the instance. The problem is that if a Cinder volume is attached to that instance when we do a rescue, we don't detach it or do any sort of maintenance on the block mapping that we have set up for it. We do check whether we have one and verify that it's attached, but that's it.
The result is that, after the rescue operation, subsequent LVM calls such as lvs and vgs will attempt to open a device file that no longer exists, which takes up to 60 seconds for each device. An example is the current tempest test:
tempest.
If you look at tempest results, you'll notice that this test always takes in excess of 100 seconds. That's not just because it's a long test; it's the blocking LVM calls.
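One way to see the blocking behaviour directly is to time the LVM commands on the affected host. The following is a minimal sketch, not part of the original report: the helper name `time_command` and the 120-second ceiling are assumptions chosen for illustration.

```python
import subprocess
import time


def time_command(cmd, timeout=120):
    """Run cmd and return (elapsed_seconds, return_code).

    return_code is None if the command did not finish within `timeout`.
    The helper name and default timeout are illustrative assumptions.
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = result.returncode
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        code = None
    return time.monotonic() - start, code
```

On a host with LVM installed, calling `time_command(["lvs"])` and `time_command(["vgs"])` before and after a rescue should make any added delay visible; these commands usually need root privileges.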
We should detach any Cinder volumes that are attached to an instance during the rescue process. One concern with this that came from folks on the Nova team was 'what about boot from volume'? Rescue of a volume-booted instance is currently an invalid case, as is evident from the code that checks for it and fails here:
https:/
Probably no reason we can't automate this as part of rescue in the future but for now it's a separate enhancement independent of this bug.
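The proposal above can be sketched roughly as follows. This is an illustration only, not Nova's code: plain dictionaries stand in for block device mappings, and the keys `volume_id` and `boot_index` as well as the function name `volumes_to_detach` are assumptions made for the example.

```python
def volumes_to_detach(block_device_mappings):
    """Return the IDs of attached volumes that should be detached before rescue.

    Each mapping is a plain dict (a stand-in for the real mapping object).
    Assumed keys: "volume_id" (None for ephemeral disks) and "boot_index"
    (0 for the boot device).
    """
    to_detach = []
    for bdm in block_device_mappings:
        volume_id = bdm.get("volume_id")
        if volume_id is None:
            # Ephemeral or swap disk: nothing to detach.
            continue
        if bdm.get("boot_index") == 0:
            # Boot from volume is treated as an invalid case, as in the report.
            raise ValueError("rescue of a volume-booted instance is not supported")
        to_detach.append(volume_id)
    return to_detach
```

Keeping the boot-from-volume check as a hard failure mirrors the current behaviour described above, so automating that case stays a separate enhancement.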
Changed in nova: | |
assignee: | nobody → John Griffith (john-griffith) |
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Changed in nova: | |
status: | Confirmed → In Progress |
Changed in nova: | |
status: | In Progress → Confirmed |
tags: | added: volumes |
It seems that what happens here is that during the attach swap we run into a problem with devices not being quiesced. If we add 10 seconds (yes, 10 seconds) between when we issue the unrescue call and when we report the instance as ACTIVE again, we eliminate the problem. Ideally we'd figure out exactly what state we're waiting for and how to detect it, but across a number of tempest runs 5 seconds isn't enough, 8 is marginal, and 10 seems to consistently eliminate the problem.
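If the state being waited for could be detected, the fixed delay could be replaced by a bounded poll. The following is a minimal sketch under that assumption: `wait_until` and its `predicate` argument are hypothetical, and what the predicate should actually check is exactly the open question above.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.5):
    """Poll predicate() until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded, False if the deadline passed.
    The 10-second default mirrors the delay observed in the tempest runs;
    the predicate itself is a placeholder for a real "devices settled" check.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Compared with an unconditional 10-second sleep, this returns as soon as the condition holds and still gives up after the same upper bound.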
Note this also appears to be the root cause of the remaining issues with: https://bugs.launchpad.net/cinder/+bug/1373513