port binding 'migrating_to' attribute not cleaned up on failed live migration if using local shared disk

Bug #1757292 reported by Matt Riedemann
Affects                    Status         Importance   Assigned to      Milestone
OpenStack Compute (nova)   Fix Released   Low          Matt Riedemann
  Pike                     Confirmed      Low          Unassigned
  Queens                   Fix Committed  Low          Brian Haley

Bug Description

This code was added back in Newton: https://review.openstack.org/#/c/275073/

That plumbs a 'migrating_to' attribute in the port binding profile during live migration. It's needed on the neutron side for live migrating an instance with floating IPs using DVR.

When live migration completes, either successfully or due to failure, the migrating_to attribute should be cleaned up. This happens via the setup_networks_on_host() method in the network API (nova.network.neutronv2.api.API).
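
As a rough illustration of the attribute's lifecycle, here is a minimal sketch that models a port as a plain dict. The helper names and the sample port are hypothetical and only mimic the idea; this is not nova's or neutron's actual implementation.

```python
# Toy model of a port's binding profile; not nova's real code.
BINDING_PROFILE = "binding:profile"
MIGRATING_ATTR = "migrating_to"


def setup_migration_port_profile(port, dest_host):
    """Record the destination host in the port's binding profile."""
    profile = dict(port.get(BINDING_PROFILE) or {})
    profile[MIGRATING_ATTR] = dest_host
    port[BINDING_PROFILE] = profile


def clear_migration_port_profile(port):
    """Drop 'migrating_to' once the migration has finished or failed."""
    profile = dict(port.get(BINDING_PROFILE) or {})
    profile.pop(MIGRATING_ATTR, None)
    port[BINDING_PROFILE] = profile


# Hypothetical port and host names, for illustration only.
port = {"id": "port-1", BINDING_PROFILE: {}}
setup_migration_port_profile(port, "dest-host")
during = dict(port[BINDING_PROFILE])   # profile while the migration runs
clear_migration_port_profile(port)
after = dict(port[BINDING_PROFILE])    # profile after cleanup
```

The bug described here amounts to the second helper never running on one of the failure paths.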

The problem is that on a failed live migration, that cleanup only happens if the instance is not using shared local disk storage because of this do_cleanup flag:

https://github.com/openstack/nova/blob/3fd863d8bf2fa1fc09acd08d976689462cffd2e3/nova/compute/manager.py#L6540

https://github.com/openstack/nova/blob/3fd863d8bf2fa1fc09acd08d976689462cffd2e3/nova/compute/manager.py#L6506

https://github.com/openstack/nova/blob/3fd863d8bf2fa1fc09acd08d976689462cffd2e3/nova/compute/manager.py#L6185
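
To make the gating easier to follow, here is a simplified sketch of the pre-fix control flow as I read it. The function bodies are stand-ins (assumptions), the binding profile is a plain dict, and the derivation of do_cleanup from "instance path is shared" is my simplification of the linked code rather than a copy of it.

```python
def live_migration_cleanup_flags(is_shared_instance_path, is_shared_block_storage):
    """Simplified stand-in: destination cleanup is skipped on shared storage."""
    do_cleanup = not is_shared_instance_path
    destroy_disks = not is_shared_block_storage
    return do_cleanup, destroy_disks


def rollback_live_migration(profile, is_shared_instance_path, is_shared_block_storage):
    """Pre-fix rollback: networking teardown is tied to do_cleanup."""
    do_cleanup, _destroy_disks = live_migration_cleanup_flags(
        is_shared_instance_path, is_shared_block_storage)
    if do_cleanup:
        # Stands in for rollback_live_migration_at_destination, which is
        # where the 'migrating_to' entry gets cleared.
        profile.pop("migrating_to", None)
    return profile


# Non-shared storage: the attribute is removed on rollback.
local = rollback_live_migration({"migrating_to": "dest"}, False, False)
# Shared storage (e.g. rbd or NFS): the attribute is left behind.
shared = rollback_live_migration({"migrating_to": "dest"}, True, True)
```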

This is based purely on code inspection, since I don't have a multinode DVR setup with the rbd imagebackend for the libvirt driver to test this out (we could create a CI job to do all that if we wanted to). But it seems pretty obvious that the ComputeManager.rollback_live_migration_at_destination code was primarily meant for cleaning up the disks created on the destination host during pre_live_migration, and for anything set up on the physical destination host for nova-network. It doesn't take into account this 'migrating_to' scenario, which can be cleaned up from the source host.

Having said all this, this code has been in nova since Newton and the DVR migrating_to changes have been in neutron since Mitaka, and no one has reported this problem. So either it's not widely used, or it doesn't cause much of a problem if we don't clean up the migrating_to entry in the binding profile on a failed live migration, although I'd think neutron should clean up the floating IP router gateway that DVR creates on the dest host.

Matt Riedemann (mriedem) wrote :

FWIW, calling setup_networks_on_host() from post_live_migration_at_destination() was added way back in Essex:

https://github.com/openstack/nova/commit/0c7a54b3b44f849bf397bb4068ab16c576c3559c

So it was primarily a nova-network thing (was Quantum even being used in Essex?).

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/555481

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/555481
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bb8ba2cf568fda4c5c59352296b758705869fb2f
Submitter: Zuul
Branch: master

commit bb8ba2cf568fda4c5c59352296b758705869fb2f
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 22 17:38:00 2018 -0400

    Teardown networking when rolling back live migration even if shared disk

    Change I2c86989ab7c6593bf346611cde8c043116d55bc5 way back in Essex
    added the "setup_network_on_host" network API calls to the migration
    flows, including rollback_live_migration_at_destination. The initial
    implementation of that method for Quantum (Neutron) was a no-op.

    Change Ib1cc44bf9d01baf4d1f1d26c2a368a5ca7c6ab68 in Newton added the
    Neutron implementation for the setup_networks_on_host method in order
    to track the destination host being migrated to for instances that
    have floating IPs with DVR.

    When rolling back from a live migration failure on the destination host,
    the "migrating_to" attribute in the port binding profile, added in
    pre_live_migration() on the destination compute, is cleared.

    However, that only happens in rollback_live_migration_at_destination,
    which is only called if the instance is not on shared storage (think
    libvirt with the rbd image backend or with NFS). That's controlled
    via the "do_cleanup" flag returned from _live_migration_cleanup_flags().

    If the live migration is happening over shared storage and fails, then
    rollback_live_migration_at_destination isn't called which means
    setup_network_on_host isn't called, which means the "migrating_to"
    attribute in the port binding profile isn't cleaned up.

    This change simply adds the cleanup in _rollback_live_migration in the
    event that neutron is being used and we're live migrating over shared
    storage so rollback_live_migration_at_destination isn't called.

    Change-Id: I658e0a749e842163ed74f82c975bcaf19f9f7f07
    Closes-Bug: #1757292
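
The shape of the change described in this commit message can be sketched roughly as follows. This is a toy model under stated assumptions: the binding profile is a plain dict, the `using_neutron` parameter and the strings recorded in `calls` are hypothetical labels, and the real change lives in nova's compute manager rather than in a standalone function like this.

```python
def rollback_live_migration(profile, do_cleanup, using_neutron=True):
    """Post-fix rollback sketch: clear 'migrating_to' on both paths."""
    calls = []
    if do_cleanup:
        # Non-shared storage: the destination-side rollback runs and
        # clears the binding profile as part of its network teardown.
        calls.append("rollback_live_migration_at_destination")
        profile.pop("migrating_to", None)
    elif using_neutron:
        # Shared storage: the destination rollback is skipped, so the
        # source side tears down the port binding profile itself.
        calls.append("setup_networks_on_host teardown from source")
        profile.pop("migrating_to", None)
    return calls


shared_profile = {"migrating_to": "dest"}
shared_calls = rollback_live_migration(shared_profile, do_cleanup=False)

local_profile = {"migrating_to": "dest"}
local_calls = rollback_live_migration(local_profile, do_cleanup=True)
```

In both cases the profile ends up without the 'migrating_to' entry; only the code path that removes it differs.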

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.0.0.0b1

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/658149

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/658149
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=89063ecb61edb31522b6f1ab3f10a682a6c17f5d
Submitter: Zuul
Branch: stable/queens

commit 89063ecb61edb31522b6f1ab3f10a682a6c17f5d
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 22 17:38:00 2018 -0400

    Teardown networking when rolling back live migration even if shared disk

    Change I2c86989ab7c6593bf346611cde8c043116d55bc5 way back in Essex
    added the "setup_network_on_host" network API calls to the migration
    flows, including rollback_live_migration_at_destination. The initial
    implementation of that method for Quantum (Neutron) was a no-op.

    Change Ib1cc44bf9d01baf4d1f1d26c2a368a5ca7c6ab68 in Newton added the
    Neutron implementation for the setup_networks_on_host method in order
    to track the destination host being migrated to for instances that
    have floating IPs with DVR.

    When rolling back from a live migration failure on the destination host,
    the "migrating_to" attribute in the port binding profile, added in
    pre_live_migration() on the destination compute, is cleared.

    However, that only happens in rollback_live_migration_at_destination,
    which is only called if the instance is not on shared storage (think
    libvirt with the rbd image backend or with NFS). That's controlled
    via the "do_cleanup" flag returned from _live_migration_cleanup_flags().

    If the live migration is happening over shared storage and fails, then
    rollback_live_migration_at_destination isn't called which means
    setup_network_on_host isn't called, which means the "migrating_to"
    attribute in the port binding profile isn't cleaned up.

    This change simply adds the cleanup in _rollback_live_migration in the
    event that neutron is being used and we're live migrating over shared
    storage so rollback_live_migration_at_destination isn't called.

    Change-Id: I658e0a749e842163ed74f82c975bcaf19f9f7f07
    Closes-Bug: #1757292
    (cherry picked from commit bb8ba2cf568fda4c5c59352296b758705869fb2f)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.11

This issue was fixed in the openstack/nova 17.0.11 release.
