Instance stuck in reboot on libvirt failure
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack Compute (nova) | Fix Released | High | Vish Ishaya | | |
Bug Description
Let's say an instance reboot is in progress. At some point, the libvirt driver is asked to reboot the actual VM. If the VM itself has disappeared by then, the instance will be stuck in rebooting forever.
(This was accidentally discovered when libvirtd was killed using "sudo kill -9" while a reboot was in progress. "virsh list" would also not list any instances.)
Refer to the following code snippet from nova/virt/
def _wait_for_reboot():
try:
except exception.NotFound:
if state == power_state.
Here exception.NotFound block should NOT raise "utils.
Instead it should just "raise" or "raise exception.
Instances stuck in "rebooting" can't be deleted. Since the VM has already disappeared, marking the instance as Error (thus allowing delete) seems like the correct solution.
There may be similar problems in _wait_for_boot(), _wait_for_running() etc.
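A minimal sketch of the behaviour proposed above, not the actual nova code. Every name here (InstanceNotFound, wait_for_reboot, get_power_state, set_vm_state, the state strings) is a hypothetical stand-in chosen for illustration. It only shows the idea: when the VM lookup fails, record an error state and let the failure propagate instead of waiting forever.

```python
class InstanceNotFound(Exception):
    """Hypothetical stand-in for the NotFound exception in the report."""


RUNNING = "running"   # hypothetical power state value
ERROR = "error"       # hypothetical vm state value


def wait_for_reboot(get_power_state, set_vm_state, max_polls=10):
    """Poll until the VM reports RUNNING.

    If the VM has disappeared, mark the instance as ERROR (so it can be
    deleted) and re-raise, rather than leaving it in "rebooting".
    Returns True once running, False if max_polls is exhausted.
    """
    for _ in range(max_polls):
        try:
            state = get_power_state()
        except InstanceNotFound:
            # The VM is gone, so the reboot can never complete.
            set_vm_state(ERROR)
            raise
        if state == RUNNING:
            return True
    return False
```

The same pattern would presumably apply to the other wait loops mentioned above, if they turn out to have the same problem.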
Changed in nova: | |
importance: | Medium → High |
assignee: | nobody → Yun Mao (yunmao) |
assignee: | Yun Mao (yunmao) → Vish Ishaya (vishvananda) |
Changed in nova: | |
status: | Triaged → In Progress |
Changed in nova: | |
status: | Fix Committed → Fix Released |
Changed in nova: | |
milestone: | folsom-rc1 → 2012.2 |
I'm not sure that changing the exception that is raised is really the fix, but I think there is probably some state cleanup that needs to be done after the loopingcall if it fails.
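One way to picture the "state cleanup after the loopingcall" idea is the sketch below. It is an assumption about the shape of such a cleanup, not the committed fix: the instance is modelled as a plain dict, and reboot_with_cleanup, poll_until_rebooted, vm_state and task_state are hypothetical names used only for illustration.

```python
def reboot_with_cleanup(instance, poll_until_rebooted):
    """Run the polling step and always leave the instance in a sane state.

    instance: dict with "vm_state" and "task_state" keys (hypothetical model).
    poll_until_rebooted: callable that returns normally on success and
    raises on any failure (hypothetical stand-in for the looping call).
    """
    try:
        poll_until_rebooted()
    except Exception:
        # Whatever went wrong, do not leave the instance "rebooting".
        instance["vm_state"] = "error"
        raise
    else:
        instance["vm_state"] = "active"
    finally:
        # Clear the in-progress marker on both success and failure.
        instance["task_state"] = None
```

With this shape the cleanup does not depend on which exception the polling code raises, which sidesteps the question of changing the exception type.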
Targeting for folsom-rc1 since this is a state corruption bug that has the potential to block normal users, requiring them to manually poke their database.