Reboot with bad volume fails ungracefully
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Compute (nova) | Fix Released | Critical | Rick Harris | |
Bug Description
If a user has an instance that has a Cinder volume attached and, for whatever reason, that volume becomes inoperable, a subsequent reboot operation may cause the instance to go into a permanent halted state.
This affects the `xenapi` driver for sure; it's unknown whether a similar issue exists in the other virt-drivers.
Steps to replicate:
1. Build an instance
2. Attach a Cinder volume (using the lvm+iscsi driver)
3. Sever the iSCSI connection: run `killall -s9 tgtd` on the Cinder volume server
4. Reboot instance
5. Verify that instance goes to halted and can't be started
Proposed solution:
The proposed solution has a few different steps:
1. Detect that reboot failed due to bad-volumes being attached
2. Detect exactly which volumes are bad
3. Detach these volumes in the virt-layer so that the VM operation can be retried
4. Raise an exception to notify the compute-manager layer that a driver operation had the *side-effect* of detaching a set of 'bad' volumes so that any compute level cleanups (destroy BDM, Cinder volume detach) can be made
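The four steps above can be sketched roughly as follows. This is a minimal, self-contained illustration and not nova's actual code: `BadVolumesDetached`, `RebootFailed`, `FakeDriver`, and `manager_reboot` are hypothetical names, and volume health is simulated with a plain dict rather than detected through XenAPI.

```python
class RebootFailed(Exception):
    """Hypothetical: the VM could not be started."""


class BadVolumesDetached(Exception):
    """Hypothetical: a driver operation detached bad volumes as a side effect."""

    def __init__(self, volume_ids):
        super().__init__("detached bad volumes: %s" % ", ".join(volume_ids))
        self.volume_ids = list(volume_ids)


class FakeDriver:
    """Stand-in for a virt driver; `attached` maps volume id -> healthy flag."""

    def __init__(self, attached):
        self.attached = dict(attached)

    def _start_vm(self):
        # Simulated failure mode: any unhealthy attached volume halts the VM.
        if not all(self.attached.values()):
            raise RebootFailed("VM halted")

    def _find_bad_volumes(self):
        return [vol for vol, healthy in self.attached.items() if not healthy]

    def reboot(self):
        try:
            self._start_vm()
        except RebootFailed:
            bad = self._find_bad_volumes()   # steps 1 and 2: detect the cause
            if not bad:
                raise                        # failure unrelated to volumes
            for vol in bad:                  # step 3: detach in the virt layer
                del self.attached[vol]
            self._start_vm()                 # retry the VM operation
            raise BadVolumesDetached(bad)    # step 4: notify the layer above


def manager_reboot(driver, bdms):
    """Compute-manager side: `bdms` is the set of volume ids it tracks."""
    try:
        driver.reboot()
    except BadVolumesDetached as exc:
        # Compute-level cleanup (destroy BDM, Cinder volume detach) goes here.
        for vol in exc.volume_ids:
            bdms.discard(vol)
    return bdms
```

In this sketch the reboot itself succeeds on retry; the exception only reports the side effect, so the manager treats it as a signal to clean up its own records rather than as a failed reboot.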
Note:
The current method of detecting which volume is 'bad' indirectly makes use of a 120-second timeout within the XenServer code. An upstream patch from Citrix that lets us 'fail fast' here would speed up error recovery dramatically.
For example, on a given network, we might want to say that a connection that has hung for more than 10 seconds is inaccessible, rather than having to wait a full two minutes.
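To illustrate the fail-fast idea outside XenServer, here is a small sketch of a reachability probe with a configurable timeout. It is an assumption-laden stand-in, not the XenServer mechanism: `target_reachable` is a hypothetical helper, and a plain TCP connect to the iSCSI target port (3260 by default) only shows that the target accepts connections, not that an existing session is healthy.

```python
import socket


def target_reachable(host, port=3260, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: treats a refused or timed-out connect as 'inaccessible'.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and socket timeouts.
        return False
```

With a 10-second timeout, a dead target is reported after at most about 10 seconds per attempt instead of the two minutes mentioned above.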
Changed in nova: | |
assignee: | nobody → Rick Harris (rconradharris) |
importance: | Undecided → Critical |
status: | New → In Progress |
Changed in nova: | |
status: | In Progress → Confirmed |
tags: | added: cloud-archive |
Changed in nova: | |
status: | Confirmed → In Progress |
Changed in nova: | |
milestone: | none → grizzly-rc1 |
Changed in nova: | |
status: | Fix Committed → Fix Released |
Changed in nova: | |
milestone: | grizzly-rc1 → 2013.1 |
Fix proposed to branch: master
Review: https://review.openstack.org/23662