Nova can lose track of running VM if live migration raises an exception
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Daniel Berrange |
Juno | Fix Released | Undecided | Unassigned |
Bug Description
There is a fairly serious bug in VM state handling during live migration: if libvirt raises an error *after* the VM has successfully live migrated to the target host, Nova can end up thinking the VM is shutoff everywhere, despite it still being active. The consequences of this are quite dire, as the user can then manually start the VM again and corrupt any data in shared volumes and the like.
The fun starts in the _live_migration method in nova.virt.
At the start of migration, we see an event received by Nova for the new QEMU process starting on the target host
2015-01-23 15:39:57.743 DEBUG nova.compute.
Upon migration completion we see CPUs start running on the target host
2015-01-23 15:40:02.794 DEBUG nova.compute.
And finally an event saying that the QEMU on the source host has stopped
2015-01-23 15:40:03.629 DEBUG nova.compute.
It is the last event that causes the trouble. It causes Nova to mark the VM as shutoff at this point.
Normally the '_live_migrate' method would succeed and so Nova would then immediately & explicitly mark the guest as running on the target host. If an exception occurs though, this explicit update of VM state doesn't happen, so Nova considers the guest shutoff even though it is still running :-(
The lifecycle events from libvirt have an associated "reason", so we could see that the shutoff event from libvirt corresponds to a migration being completed, and so not mark the VM as shutoff in Nova. We would also have to make sure the target host processes the 'resume' event upon migrate completion.
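As a rough illustration of the "reason" idea, here is a minimal Python sketch, not the actual Nova or libvirt code. The `Event` and `Detail` enums and the `next_power_state` function are hypothetical stand-ins for libvirt's lifecycle event type and its associated detail; the real values and the real handler in Nova are not reproduced here.

```python
from enum import Enum, auto


class Event(Enum):
    """Hypothetical stand-in for the libvirt lifecycle event type."""
    STARTED = auto()
    RESUMED = auto()
    STOPPED = auto()


class Detail(Enum):
    """Hypothetical stand-in for the "reason" attached to an event."""
    MIGRATED = auto()
    SHUTDOWN = auto()
    OTHER = auto()


def next_power_state(current, event, detail):
    """Return the power state to record after a lifecycle event."""
    if event is Event.STOPPED:
        if detail is Detail.MIGRATED:
            # QEMU on the source exited because the guest moved away.
            # The guest is still running elsewhere, so do not mark it shutoff.
            return current
        return "shutoff"
    if event in (Event.STARTED, Event.RESUMED):
        # Covers the target host seeing the guest resume after migration.
        return "running"
    return current
```

With this shape, a stop event caused by migration leaves the recorded state untouched, while an ordinary shutdown still moves the instance to shutoff.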
A safer approach, though, might be to just mark the VM as being in an ERROR state if any exception occurs during migration.
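The ERROR-state approach could look roughly like the following sketch. `FakeInstance`, `live_migrate` and the `do_migrate` callable are all hypothetical names used for illustration; this is not the fix that was merged into Nova, only the general pattern of recording an error state before re-raising.

```python
class FakeInstance:
    """Hypothetical, simplified stand-in for a Nova instance record."""

    def __init__(self):
        self.vm_state = "active"
        self.host = "source"


def live_migrate(instance, dest, do_migrate):
    """Run do_migrate and record an error state if it raises."""
    try:
        do_migrate(instance, dest)
    except Exception:
        # We cannot tell where the guest is actually running, so flag the
        # instance for operator attention instead of leaving it "shutoff".
        instance.vm_state = "error"
        raise
    # Only on success do we explicitly mark the guest active on the target.
    instance.host = dest
    instance.vm_state = "active"
```

The point of the design is that an instance in an error state cannot simply be started again by the user, which avoids the double-start and data corruption scenario described above.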
Changed in nova: | |
importance: | Undecided → High |
Changed in nova: | |
assignee: | nobody → Daniel Berrange (berrange) |
tags: | added: libvirt |
tags: | added: juno-backport-potential |
Changed in nova: | |
milestone: | none → kilo-3 |
status: | Fix Committed → Fix Released |
summary: |
- Nova can loose track of running VM if live migration raises an exception
+ Nova can lose track of running VM if live migration raises an exception |
Changed in nova: | |
milestone: | kilo-3 → 2015.1.0 |
There is actually a callback that the libvirt driver _live_migrate method invokes upon seeing an exception from libvirt. It ends up calling the nova.compute.manager._rollback_live_migration method. This method blindly assumes the VM will be running on the source, so it attempts to re-setup networks & volumes and destroy storage on the target. So we're doubly doomed, because it is tearing down stuff that the VM is using on the target.
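One way to avoid the blind assumption is to check where the guest is actually running before deciding what to tear down. The sketch below is purely illustrative: `rollback_plan` and its two boolean inputs are hypothetical, and how Nova would obtain them (for example by querying libvirt on each host) is left out.

```python
def rollback_plan(running_on_source, running_on_target):
    """Return the list of rollback steps that are safe to perform."""
    if running_on_source and not running_on_target:
        # Migration really failed: the usual rollback is safe.
        return [
            "re-setup networks and volumes on source",
            "destroy storage on target",
        ]
    if running_on_target and not running_on_source:
        # The guest already moved: touching the target would break it.
        return ["mark instance ERROR", "leave target resources in place"]
    # Ambiguous (running on both or neither): do nothing destructive.
    return ["mark instance ERROR"]
```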