live migration does not clean up at target node if a failure occurs during post migration

Bug #1628606 reported by Paul Carlton on 2016-09-28
This bug affects 4 people
Affects: OpenStack Compute (nova)
Importance: Low
Assigned to: Artom Lifshitz

Bug Description

If a live migration fails during the post processing on the source (e.g. a failure to disconnect volumes), it can lead to the instance being shut down on the source node and left in a migrating task state. The copy of the instance on the target node will also be left running, although it is not usable because neutron networking has not yet been switched to the target and nova still records the instance as being on the source node.

This situation can be resolved as follows:

On the target:
    virsh destroy <instance domain id>

If the compute nodes are NOT using shared storage:
    sudo rm -rf <instance uuid directory>

Then use the nova client as admin to restart the instance on the source node:
    nova reset-state --active <instance uuid>
    nova reboot --hard <instance uuid>
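
The same reset/reboot step can also be driven from the API. A minimal sketch, assuming python-novaclient and keystoneauth1 are installed; the credential values are placeholders for your own cloud, and the virsh cleanup on the target node still has to be done by hand as above:

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from novaclient import client

    auth = v3.Password(auth_url='http://controller:5000/v3',  # placeholder
                       username='admin', password='secret',   # placeholders
                       project_name='admin',
                       user_domain_id='default', project_domain_id='default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    server = nova.servers.get('<instance uuid>')
    nova.servers.reset_state(server, state='active')  # nova reset-state --active
    nova.servers.reboot(server, reboot_type='HARD')   # nova reboot --hard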

I will investigate how to address this issue.

Changed in nova:
assignee: nobody → Paul Carlton (paul-carlton2)
Changed in nova:
importance: Undecided → Low
tags: added: live-migration
Paul Carlton (paul-carlton2) wrote :

Thinking about this, it is not that simple. Once the instance has been started on the target it could do work that would be lost if we destroy it and resurrect the instance on the source. As we found out when Matt Booth was fixing the post-copy network bug with certain neutron providers, the instance at the target becomes accessible to the network as soon as it starts up (due to ARPing), so once libvirt has un-paused the instance on the target and destroyed the instance on the source we are effectively beyond the point of no return.

The trouble is that the instance host does not get updated until the end of the post migration processing, so the instance still looks like it is on the source in a migrating state. If any step in post migration gives rise to an exception, the rest of post migration is skipped and the migration is marked as failed, but the instance is left as is.

The best solution I can think of is to wrap the call to the post method in a try/except that will set the instance to the target host if any exception occurs. Given that in some circumstances the source instance could still be present (i.e. not cleaned up) and the networking to the target might not be set up correctly, I'm thinking maybe the instance on the target should be placed in an error state to indicate that there may be an issue? Alternatively, is the fact that the migration status will be 'failed' enough to indicate that some further operator action might be needed?
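
A minimal sketch of that idea (not the actual nova code): run the post-migration step inside a guard that records the instance on the destination on any failure. `do_post_migration`, `instance` and `dest` are hypothetical stand-ins for the real objects:

    def guarded_post_live_migration(do_post_migration, instance, dest):
        try:
            do_post_migration()
        except Exception:
            # The guest is already running on the destination, so point the
            # record there even though cleanup did not finish, and flag the
            # instance for operator attention.
            instance.host = dest
            instance.task_state = None
            instance.vm_state = 'error'  # or rely on migration status 'failed'
            instance.save()
            raise  # preserve the traceback so the migration is marked failed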

Fix proposed to branch: master
Review: https://review.openstack.org/379491

Changed in nova:
status: New → In Progress

Change abandoned by Paul Carlton (<email address hidden>) on branch: master
Review: https://review.openstack.org/379491
Reason: Leaving for someone else to fix as they see fit

Since Paul Carlton abandoned his patch, removing him as assignee.

Changed in nova:
assignee: Paul Carlton (paul-carlton2) → nobody
status: In Progress → Confirmed
Matthew Booth (mbooth-9) wrote :

I think this bug is pretty serious. Say we get a Cinder error in driver.post_live_migration() (this specific example is taken from a customer bug):

ComputeManager._post_live_migration() does:

  ...
  self.driver.post_live_migration(ctxt, instance, block_device_info,
                                  migrate_data)
  ...
  self.compute_rpcapi.post_live_migration_at_destination(ctxt,
      instance, block_migration, dest)

The above code runs on the source compute. We update instance.host to the destination in post_live_migration_at_destination. Therefore, if driver.post_live_migration() above fails, we never call post_live_migration_at_destination, and we never update instance.host to point to the destination.

However, _post_live_migration is called via callback from the driver *after* migration has occurred. So at this point the VM is *actually running* on the destination, but Nova thinks it's still on the source. The instance will be in an error state, and a hard reboot at this point will cause it to start running again on the source, at which point it will be running on 2 compute hosts simultaneously.
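
Given that hazard, it is worth confirming where nova believes the instance lives before issuing a hard reboot. A small sketch using python-novaclient, where `nova` is a client built as in the earlier sketch (the `OS-EXT-SRV-ATTR:host` attribute is admin-only):

    def recorded_host(nova, instance_uuid):
        # The compute host nova believes owns this instance; while this bug
        # is in play it still names the source even though the guest is
        # actually running on the destination.
        server = nova.servers.get(instance_uuid)
        return getattr(server, 'OS-EXT-SRV-ATTR:host')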

Fix proposed to branch: master
Review: https://review.openstack.org/609517

Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/609517
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5513f48dea529fe4e690f50a462300129594210c
Submitter: Zuul
Branch: master

commit 5513f48dea529fe4e690f50a462300129594210c
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 10 14:53:14 2018 -0400

    Handle volume API failure in _post_live_migration

    Previously, if the call to Cinder in _post_live_migration failed, the
    exception went unhandled and prevented us from calling
    post_live_migration_at_destination - which is where we set instance
    host and task state. This left the system in an inconsistent state,
    with the instance actually running on the destination, but
    with instance.host still set to the source. This patch simply wraps
    the Cinder API calls in a try/except, and logs the exception instead
    of blowing up. While "dumb", this has the virtue of being simple and
    minimizing potential side effects. A comprehensive refactoring of
    when, where and how we set instance host and task state to try to
    guarantee consistency is left as a TODO.

    Partial-bug: 1628606
    Change-Id: Icb0bdaf454935b3713c35339394d260b33520de5
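
The shape of that change, condensed into a hedged sketch rather than the exact diff (the real code lives in nova/compute/manager.py; the function name and signatures below are illustrative): swallow and log volume API failures so the flow still reaches post_live_migration_at_destination.

    import logging

    LOG = logging.getLogger(__name__)

    def disconnect_volumes_best_effort(volume_api, context, bdms, connector):
        for bdm in bdms:
            try:
                # Either call can raise if Cinder is down; log and carry on
                # so post_live_migration_at_destination still runs and
                # updates instance.host.
                volume_api.terminate_connection(context, bdm.volume_id,
                                                connector)
                volume_api.detach(context, bdm.volume_id)
            except Exception:
                LOG.exception('Ignoring volume API failure for volume %s '
                              'during post live migration', bdm.volume_id)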

Reviewed: https://review.openstack.org/611083
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cf3c2f391ad0f9d2a2d94247509bcb2709413e4f
Submitter: Zuul
Branch: stable/rocky

commit cf3c2f391ad0f9d2a2d94247509bcb2709413e4f
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 10 14:53:14 2018 -0400

    Handle volume API failure in _post_live_migration

    Partial-bug: 1628606
    Change-Id: Icb0bdaf454935b3713c35339394d260b33520de5
    (cherry picked from commit 5513f48dea529fe4e690f50a462300129594210c)

tags: added: in-stable-rocky

Reviewed: https://review.openstack.org/611084
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53f9c8e5104076872021ed33e221e5526ca5059b
Submitter: Zuul
Branch: stable/queens

commit 53f9c8e5104076872021ed33e221e5526ca5059b
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 10 14:53:14 2018 -0400

    Handle volume API failure in _post_live_migration

    Partial-bug: 1628606
    Change-Id: Icb0bdaf454935b3713c35339394d260b33520de5
    (cherry picked from commit 5513f48dea529fe4e690f50a462300129594210c)
    (cherry picked from commit cf3c2f391ad0f9d2a2d94247509bcb2709413e4f)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/611093
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=28bc3c8221c6f12bd9f1dd6701026f389b1fc2f6
Submitter: Zuul
Branch: stable/pike

commit 28bc3c8221c6f12bd9f1dd6701026f389b1fc2f6
Author: Artom Lifshitz <email address hidden>
Date: Wed Oct 10 14:53:14 2018 -0400

    Handle volume API failure in _post_live_migration

    Conflicts in nova/compute/manager.py due to absence of new Cinder flow
    conditional (and corresponding modifications to tests).

    Partial-bug: 1628606
    Change-Id: Icb0bdaf454935b3713c35339394d260b33520de5
    (cherry picked from commit 5513f48dea529fe4e690f50a462300129594210c)
    (cherry picked from commit cf3c2f391ad0f9d2a2d94247509bcb2709413e4f)
    (cherry picked from commit 53f9c8e5104076872021ed33e221e5526ca5059b)

tags: added: in-stable-pike
Matt Riedemann (mriedem) wrote :

Bug 1818873 is related, possibly a duplicate.

Change abandoned by huanhongda (<email address hidden>) on branch: stable/pike
Review: https://review.opendev.org/670016
Reason: Maybe cherry-pick 013f421bca4067bd430a9fac1e3b290cf1388ee4 is a better way.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/pike
Review: https://review.opendev.org/670016
Reason: https://review.opendev.org/#/c/683008/ is better.
