Nova can lose track of running VM if live migration raises an exception

Bug #1414065 reported by Daniel Berrange on 2015-01-23
Affects: OpenStack Compute (nova) | Importance: High | Assigned to: Daniel Berrange
Affects: nova (Juno) | Importance: Undecided | Assigned to: Unassigned

Bug Description

There is a fairly serious bug in VM state handling during live migration, with the result that if libvirt raises an error *after* the VM has successfully live migrated to the target host, Nova can end up thinking the VM is shutoff everywhere, despite it still being active. The consequences of this are quite dire, as the user can then manually start the VM again and corrupt any data in shared volumes and the like.

The fun starts in the _live_migration method in nova.virt.libvirt.driver, if the 'migrateToURI2' method fails *after* the guest has completed migration.

At the start of migration, we see an event received by Nova for the new QEMU process starting on the target host:

2015-01-23 15:39:57.743 DEBUG nova.compute.manager [-] [instance: 12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state after lifecycle event "Started"; current vm_state: active, current task_state: migrating, current DB power_state: 1, VM power_state: 1 from (pid=19494) handle_lifecycle_event /home/berrange/src/cloud/nova/nova/compute/manager.py:1134

Upon migration completion, we see the CPUs start running on the target host:

2015-01-23 15:40:02.794 DEBUG nova.compute.manager [-] [instance: 12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state after lifecycle event "Resumed"; current vm_state: active, current task_state: migrating, current DB power_state: 1, VM power_state: 1 from (pid=19494) handle_lifecycle_event /home/berrange/src/cloud/nova/nova/compute/manager.py:1134

And finally, an event saying that the QEMU process on the source host has stopped:

2015-01-23 15:40:03.629 DEBUG nova.compute.manager [-] [instance: 12bac45e-aca8-40d1-8f39-941bc6bb59f0] Synchronizing instance power state after lifecycle event "Stopped"; current vm_state: active, current task_state: migrating, current DB power_state: 1, VM power_state: 4 from (pid=23081) handle_lifecycle_event /home/berrange/src/cloud/nova/nova/compute/manager.py:1134

It is this last event that causes the trouble, as it causes Nova to mark the VM as shutoff at this point.

Normally the '_live_migration' method would succeed, and so Nova would then immediately & explicitly mark the guest as running on the target host. If an exception occurs though, this explicit update of VM state doesn't happen, so Nova considers the guest shutoff even though it is still running :-(
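
The failure mode can be sketched as a small decision function. This is a simplified, hypothetical rendering of the lifecycle-event sync logic, not the actual nova.compute.manager code:

```python
# Power-state codes as used in Nova's DB (mirroring nova.compute.power_state).
RUNNING = 1
SHUTDOWN = 4

def sync_power_state(vm_state, task_state, hypervisor_power_state):
    """Simplified sketch of the lifecycle-event sync decision.

    The bug: a 'Stopped' event from the *source* host during migration
    is treated like any other stop, so the instance ends up marked
    stopped even though it is still running on the target host.
    """
    if hypervisor_power_state == SHUTDOWN:
        # No special-casing of task_state == 'migrating' here -- this
        # is exactly the gap described above.
        return 'stopped'
    return vm_state

# The 'Stopped' event from the source host wrongly wins, even
# mid-migration:
sync_power_state('active', 'migrating', SHUTDOWN)
```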

The lifecycle events from libvirt have an associated "reason", so we could see that the shutoff event from libvirt corresponds to a migration being completed, and so not mark the VM as shutoff in Nova. We would also have to make sure the target host processes the 'resume' event upon migrate completion.
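
As a sketch of that reason-based check: the numeric values below mirror libvirt-python's event enums, but the function itself is an illustrative assumption, not the eventual fix:

```python
# libvirt event codes (values match libvirt-python's enums).
VIR_DOMAIN_EVENT_STOPPED = 5
VIR_DOMAIN_EVENT_STOPPED_MIGRATED = 3  # domain migrated away successfully

def should_mark_shutoff(event, detail):
    """Only treat a 'stopped' event as a real shutoff if it was not
    caused by a successful outbound migration."""
    if (event == VIR_DOMAIN_EVENT_STOPPED and
            detail == VIR_DOMAIN_EVENT_STOPPED_MIGRATED):
        return False  # guest is alive on the target host
    return True
```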

A safer approach, though, might be to just mark the VM as being in an ERROR state if any exception occurs during migration.

Changed in nova:
importance: Undecided → High
Daniel Berrange (berrange) wrote :

There is actually a callback that the libvirt driver _live_migration method invokes upon seeing an exception from libvirt. It ends up calling the nova.compute.manager._rollback_live_migration method. This method blindly assumes the VM will be running on the source, so it attempts to re-setup networks & volumes and destroy storage on the target. So we're doubly doomed, because it is tearing down stuff that the VM is using on the target.
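
A safer rollback would first establish where the guest actually ended up. A hypothetical decision table (not Nova code) might look like:

```python
def choose_recovery(source_running, dest_running):
    """Decide a recovery action after a live-migration exception.

    A blind rollback (the behaviour described above) always tears down
    the destination; this sketch instead inspects where the guest is
    actually running before acting. Hypothetical helper, not Nova code.
    """
    if dest_running and not source_running:
        return 'finish-on-dest'      # migration actually completed
    if source_running and not dest_running:
        return 'rollback-to-source'  # the classic failed migration
    if source_running and dest_running:
        return 'abort-and-error'     # ambiguous: flag ERROR, tear nothing down
    return 'error'                   # guest lost everywhere; mark ERROR
```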

Changed in nova:
assignee: nobody → Daniel Berrange (berrange)
tags: added: libvirt

Related fix proposed to branch: master
Review: https://review.openstack.org/151663

Changed in nova:
status: New → In Progress

Reviewed: https://review.openstack.org/151663
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=584a44f0157e84ce0100da6ee4f7b94bbb4088e3
Submitter: Jenkins
Branch: master

commit 584a44f0157e84ce0100da6ee4f7b94bbb4088e3
Author: Daniel P. Berrange <email address hidden>
Date: Wed Jan 28 17:46:55 2015 +0000

    libvirt: remove pointless loop after live migration finishes

    The libvirt 'migrateToURI' API(s) all block the caller until the
    live migration operation has completed. As such, the timer call
    used to check if live migration has completed is entirely pointless.
    It appears this is code left over from the very first impl of live
    migration in Nova, when Nova would simply shell out to the 'virsh'
    command instead of using the libvirt APIs. Even back then though
    it looks like it was redundant, since the command being spawned
    would also block until live migration was finished.

    Related-bug: #1414065
    Change-Id: Ib3906ef8564a986f7c0e980774e4ed76b3f93a38

Reviewed: https://review.openstack.org/151664
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7dd6a4a19311136c02d89cd2afd97236b0f4cc27
Submitter: Jenkins
Branch: master

commit 7dd6a4a19311136c02d89cd2afd97236b0f4cc27
Author: Daniel P. Berrange <email address hidden>
Date: Thu Jan 29 14:33:32 2015 +0000

    libvirt: proper monitoring of live migration progress

    The current live migration code simply invokes migrateToURI
    and waits for it to finish, or raise an exception. It considers
    all exceptions to mean the live migration aborted and the VM is
    still running on the source host. This is totally bogus, as there
    are a number of reasons why an error could be raised from the
    migrateToURI call. There are at least 5 different scenarios for
    what the VM might be doing on source + dest host upon error.
    The migration might even still be going on, even if after the
    error has occurred.

    A more reliable way to deal with this is to actively query
    libvirt for the domain job status. This gives an indication
    of whether the job is completed, failed or cancelled. Even
    with that though, there is a need for a few heuristics to
    distinguish some of the possible error scenarios.

    This change to do active monitoring of the live migration process
    also opens the door for being able to tune live migration on the
    fly to adjust max downtime or bandwidth to improve chances of
    getting convergence, or to automatically abort it after too much
    time has elapsed instead of letting it carry on until the end of
    the universe. This change merely records memory transfer progress
    and leaves tuning improvements to a later date.

    Closes-bug: #1414065
    Change-Id: I6fcbfa31a79c7808c861bb3a84b56bd096882004
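
The active-monitoring idea in the commit above can be sketched as follows. The constants mirror libvirt's virDomainJobType enum; in real code a monitor loop would sample dom.jobInfo() while migrateToURI runs in a separate thread. The classification function here is an illustrative assumption, not the merged change verbatim:

```python
# libvirt job types (values match libvirt's virDomainJobType enum).
VIR_DOMAIN_JOB_NONE = 0
VIR_DOMAIN_JOB_BOUNDED = 1
VIR_DOMAIN_JOB_UNBOUNDED = 2
VIR_DOMAIN_JOB_COMPLETED = 3
VIR_DOMAIN_JOB_FAILED = 4
VIR_DOMAIN_JOB_CANCELLED = 5

def classify_job(job_type, migrate_call_failed):
    """Map a sampled job type plus the migrateToURI outcome to a
    migration result, instead of trusting the exception alone."""
    if job_type in (VIR_DOMAIN_JOB_BOUNDED, VIR_DOMAIN_JOB_UNBOUNDED):
        return 'running'    # migration still in progress
    if job_type == VIR_DOMAIN_JOB_COMPLETED:
        return 'completed'  # guest is on the target, even if the call raised
    if job_type == VIR_DOMAIN_JOB_CANCELLED:
        return 'cancelled'
    if job_type == VIR_DOMAIN_JOB_FAILED or migrate_call_failed:
        return 'failed'
    return 'unknown'
```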

Changed in nova:
status: In Progress → Fix Committed
tags: added: juno-backport-potential
Thierry Carrez (ttx) on 2015-03-20
Changed in nova:
milestone: none → kilo-3
status: Fix Committed → Fix Released

Sahid, Daniel Berrange: without having had debug enabled, is there a way, from "regular" logging, for me to determine that I'm running into this in Juno? All of the obvious symptoms are there:
Live migration failed, and instances go to SHUTDOWN. (This was when live-migrating to evacuate a node for maintenance.)

David Medberry (med) on 2015-04-02
summary: - Nova can loose track of running VM if live migration raises an exception
+ Nova can lose track of running VM if live migration raises an exception
Daniel Berrange (berrange) wrote :

The only sure way to know that you hit this bug is if you see the same VM instance running on two hosts at the same time. You might also see an exception in the Nova compute logs mentioning migrateToURI in the stack trace, but that's not a 100% reliable test.

Reviewed: https://review.openstack.org/162112
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6833176b56ecbef9565bccb06a372acba8487691
Submitter: Jenkins
Branch: stable/juno

commit 6833176b56ecbef9565bccb06a372acba8487691
Author: Daniel P. Berrange <email address hidden>
Date: Wed Jan 28 17:46:55 2015 +0000

    libvirt: remove pointless loop after live migration finishes

    The libvirt 'migrateToURI' API(s) all block the caller until the
    live migration operation has completed. As such, the timer call
    used to check if live migration has completed is entirely pointless.
    It appears this is code left over from the very first impl of live
    migration in Nova, when Nova would simply shell out to the 'virsh'
    command instead of using the libvirt APIs. Even back then though
    it looks like it was redundant, since the command being spawned
    would also block until live migration was finished.

    Conflicts:
     nova/virt/libvirt/driver.py

    Related-bug: #1414065
    Change-Id: Ib3906ef8564a986f7c0e980774e4ed76b3f93a38
    (cherry-pick from commit 584a44f0157e84ce0100da6ee4f7b94bbb4088e3)

tags: added: in-stable-juno
Thierry Carrez (ttx) on 2015-04-30
Changed in nova:
milestone: kilo-3 → 2015.1.0

Change abandoned by sahid (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/162113

David Medberry (med) wrote :

So this shows that this is fixed in Juno 2014.2.4, but there's no info here documenting that. Is it true? The change was abandoned in August.

Billy Olsen (billy-olsen) wrote :

Med, that's likely because there are two patch sets related to this fix. The first patch set removed a pointless loop after live migration finishes (that one was merged into stable/juno), and the second patch set, titled 'proper monitoring of live migration progress', was not included in stable/juno.
