post_live_migration does not handle Neutron errors

Bug #1879787 reported by Artom Lifshitz
This bug affects 2 people
Affects                   Status        Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released  Medium      Artom Lifshitz
Rocky                     Fix Released  Undecided   Unassigned
Stein                     Fix Released  Undecided   Unassigned
Train                     Fix Released  Undecided   Unassigned
Ussuri                    Fix Released  Undecided   Unassigned

Bug Description

Description
===========

_post_live_migration() on the source host has a call to self.network_api.get_instance_nw_info(), which eventually ends up calling the Neutron REST API (via nova.network.neutron.API._build_network_info_model()). Any exceptions in that call are unhandled, meaning that if Neutron fails we'll never reach post_live_migration_at_destination(), which is where we update the database to reflect the instance's new host and do other housekeeping. In other words, if Neutron fails we'll be left with the instance actually running on the destination, but still on the source according to the database.
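
To make the failure mode concrete, here is a minimal toy model of the control flow described above. It is not Nova code: FakeNetworkAPI, FakeComputeManager, NeutronDown and the dict used as a "database" are all hypothetical stand-ins, and only the two method names mirror the real ones.

```python
class NeutronDown(Exception):
    """Hypothetical stand-in for any error raised while talking to Neutron."""


class FakeNetworkAPI:
    def get_instance_nw_info(self, instance):
        # Simulates the REST call to Neutron failing.
        raise NeutronDown("Neutron is unreachable")


class FakeComputeManager:
    def __init__(self, network_api, db):
        self.network_api = network_api
        self.db = db  # maps instance name -> host recorded in the "database"

    def post_live_migration_at_destination(self, instance, dest):
        # In this toy model, the host is only updated here.
        self.db[instance] = dest

    def _post_live_migration(self, instance, dest):
        # An exception raised here propagates and skips the call below.
        self.network_api.get_instance_nw_info(instance)
        self.post_live_migration_at_destination(instance, dest)


db = {"vm-1": "source-host"}
manager = FakeComputeManager(FakeNetworkAPI(), db)
try:
    manager._post_live_migration("vm-1", "dest-host")
except NeutronDown:
    pass

# The record was never updated, even though the guest has already moved.
print(db["vm-1"])
```

The point of the sketch is only the ordering: anything that raises before the destination step leaves the stale host in place.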

Steps to reproduce
==================

1. Boot an instance.
2. Live migrate it.
3. While that's happening, Neutron explodes.

Expected result
===============

While it may not be reasonable to expect VIFs to get cleaned up correctly in such a case, Nova should at least correctly update the database to reflect where the instance is actually running - the destination.

Actual result
=============

The database shows the instance as running on the source.

Environment
===========

This is still a problem on master, but it has been reported on OSP13/queens [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1818829

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/729763

Changed in nova:
assignee: nobody → Artom Lifshitz (notartom)
status: New → In Progress
melanie witt (melwitt)
tags: added: live-migration
Changed in nova:
importance: Undecided → Medium
melanie witt (melwitt) wrote :

I just wanted to note that we landed a change recently to retry up to 3 times upon connect failure from neutron [1][2] which should at least help with resiliency to temporary network glitches in an environment.

That said, I agree we should leave the failed migration in a better state than we do today if we can, in the case of neutron failures other than connection failures or if retries have been exceeded.

[1] https://review.opendev.org/712226
[2] https://bugs.launchpad.net/nova/+bug/1866937
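
For illustration, "retry up to 3 times upon connect failure" could look roughly like the generic sketch below. This is not how Nova or its HTTP client implements it; call_with_retries and the flaky() helper are hypothetical, and Python's built-in ConnectionError stands in for whatever connect failure the real client raises.

```python
def call_with_retries(func, retries=3):
    """Call func(), retrying up to `retries` times on ConnectionError."""
    attempts = 0
    while True:
        try:
            return func()
        except ConnectionError:
            attempts += 1
            if attempts > retries:
                # Retries exhausted: let the caller see the failure.
                raise


calls = {"count": 0}


def flaky():
    # Fails twice with a connect error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary glitch")
    return "ok"


result = call_with_retries(flaky)
```

Note that errors other than ConnectionError, or a failure that outlasts the retries, still propagate, which is the case this bug is about.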

Changed in nova:
assignee: Artom Lifshitz (notartom) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Artom Lifshitz (notartom)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/729763
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9f205c620e57c9af760d3b52c93a4ff6d5c0e618
Submitter: Zuul
Branch: master

commit 9f205c620e57c9af760d3b52c93a4ff6d5c0e618
Author: Artom Lifshitz <email address hidden>
Date: Wed May 20 17:14:47 2020 -0400

    Handle Neutron errors in _post_live_migration()

    _post_live_migration() on the destination host has a call to
    self.network_api.get_instance_nw_info(), which eventually ends up
    calling the Neutron REST API (via
    nova.network.neutron.API._build_network_info_model()). Any exceptions
    in that call were unhandled, and caused both the server and the
    migration to end up in ERROR. This patch handles the error, and
    modifies the functional test to reflect the newly fixed behavior.

    Closes-bug: 1879787
    Change-Id: I4c89c8ba0153ec01ff57979b9bf1cfd5b2a9da89
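
As a rough sketch of what "handling the error" can mean here (not the actual patch, whose details are in the review linked above), the network lookup can be wrapped so that a failure is logged and the rest of the cleanup still runs. The post_live_migration() helper, its callable parameters, and the empty-list fallback are all assumptions made for this example.

```python
import logging

LOG = logging.getLogger(__name__)


def post_live_migration(get_nw_info, update_database, instance, dest):
    """Hypothetical helper: finish the migration even if the lookup fails."""
    try:
        network_info = get_nw_info(instance)
    except Exception:
        # Log and carry on so the database update below still happens.
        LOG.exception("Failed to get network info for %s", instance)
        network_info = []  # assumed fallback for this sketch only
    update_database(instance, dest)
    return network_info


def failing_lookup(instance):
    raise RuntimeError("Neutron is unreachable")


db = {"vm-1": "source-host"}
result = post_live_migration(failing_lookup, db.__setitem__, "vm-1", "dest-host")
```

With the lookup failing, the record still ends up pointing at the destination, at the cost of proceeding without fresh network info.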

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/746529

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: stable/ussuri
Review: https://review.opendev.org/746529

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/747451

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/747451
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6488a5dfb293831a448596e2084f484dd0bfa916
Submitter: Zuul
Branch: master

commit 6488a5dfb293831a448596e2084f484dd0bfa916
Author: Artom Lifshitz <email address hidden>
Date: Fri Aug 21 13:06:58 2020 -0400

    post live migration: don't call Neutron needlessly

    In bug 1879787, the call to network_api.get_instance_nw_info() in
    _post_live_migration() on the source compute manager eventually calls
    out to the Neutron REST API. If this fails, the exception is
    unhandled, and the migrating instance - which is fully running on the
    destination at this point - will never be updated in the database.
    This update normally happens later in
    post_live_migration_at_destination().

    The network_info variable obtained from get_instance_nw_info() is used
    for two things: notifications - which aren't critical - and unplugging
    the instance's vifs on the source - which is very important!

    It turns out that at the time of the get_instance_nw_info() call, the
    network info in the instance info cache is still valid for unplugging
    the source vifs. The port bindings on the destination are only
    activated by the network_api.migrate_instance_start() [1] call that
    happens shortly *after* the problematic get_instance_nw_info() call.
    In other words, get_instance_nw_info() will always return the source
    ports. Because of that, we can replace it with a call to
    instance.get_network_info().

    [1] https://opendev.org/openstack/nova/src/commit/d9e04c4ff0b1a9c3383f1848dc846e93030d83cb/nova/network/neutronv2/api.py#L2493-L2522

    Change-Id: If0fbae33ce2af198188c91638afef939256c2556
    Closes-bug: 1879787
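
The idea in the commit message above can be sketched with a toy instance object: read the vifs from the locally cached network info instead of asking Neutron. FakeInstance and unplug_source_vifs() are hypothetical; only the get_network_info() method name comes from the commit message, and the vif dicts are simplified to an "id" key.

```python
class FakeInstance:
    """Hypothetical stand-in for an instance with a network info cache."""

    def __init__(self, cached_network_info):
        self._cached_network_info = cached_network_info

    def get_network_info(self):
        # Reads the local cache; no REST call that could fail.
        return list(self._cached_network_info)


def unplug_source_vifs(instance):
    # Until the destination port bindings are activated, the cache
    # still describes the vifs plugged on the source.
    return [vif["id"] for vif in instance.get_network_info()]


instance = FakeInstance([{"id": "port-1"}, {"id": "port-2"}])
unplugged = unplug_source_vifs(instance)
```

Because nothing in this path leaves the compute host, a Neutron outage at this point can no longer interrupt the post-migration steps.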

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/750374

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/750670

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/750673

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/750674

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/750993

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/ussuri)

Reviewed: https://review.opendev.org/750374
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2c949cb3eea9cd9282060da12d32771582953aa2
Submitter: Zuul
Branch: stable/ussuri

commit 2c949cb3eea9cd9282060da12d32771582953aa2
Author: Artom Lifshitz <email address hidden>
Date: Fri Aug 21 13:06:58 2020 -0400

    post live migration: don't call Neutron needlessly

    In bug 1879787, the call to network_api.get_instance_nw_info() in
    _post_live_migration() on the source compute manager eventually calls
    out to the Neutron REST API. If this fails, the exception is
    unhandled, and the migrating instance - which is fully running on the
    destination at this point - will never be updated in the database.
    This update normally happens later in
    post_live_migration_at_destination().

    The network_info variable obtained from get_instance_nw_info() is used
    for two things: notifications - which aren't critical - and unplugging
    the instance's vifs on the source - which is very important!

    It turns out that at the time of the get_instance_nw_info() call, the
    network info in the instance info cache is still valid for unplugging
    the source vifs. The port bindings on the destination are only
    activated by the network_api.migrate_instance_start() [1] call that
    happens shortly *after* the problematic get_instance_nw_info() call.
    In other words, get_instance_nw_info() will always return the source
    ports. Because of that, we can replace it with a call to
    instance.get_network_info().

    [1] https://opendev.org/openstack/nova/src/commit/d9e04c4ff0b1a9c3383f1848dc846e93030d83cb/nova/network/neutronv2/api.py#L2493-L2522

    Change-Id: If0fbae33ce2af198188c91638afef939256c2556
    Closes-bug: 1879787
    (cherry picked from commit 6488a5dfb293831a448596e2084f484dd0bfa916)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/750670
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7ace26e4bcd69a0a3d780cfae9f6745e4177b6c2
Submitter: Zuul
Branch: stable/train

commit 7ace26e4bcd69a0a3d780cfae9f6745e4177b6c2
Author: Artom Lifshitz <email address hidden>
Date: Fri Aug 21 13:06:58 2020 -0400

    post live migration: don't call Neutron needlessly

    In bug 1879787, the call to network_api.get_instance_nw_info() in
    _post_live_migration() on the source compute manager eventually calls
    out to the Neutron REST API. If this fails, the exception is
    unhandled, and the migrating instance - which is fully running on the
    destination at this point - will never be updated in the database.
    This update normally happens later in
    post_live_migration_at_destination().

    The network_info variable obtained from get_instance_nw_info() is used
    for two things: notifications - which aren't critical - and unplugging
    the instance's vifs on the source - which is very important!

    It turns out that at the time of the get_instance_nw_info() call, the
    network info in the instance info cache is still valid for unplugging
    the source vifs. The port bindings on the destination are only
    activated by the network_api.migrate_instance_start() [1] call that
    happens shortly *after* the problematic get_instance_nw_info() call.
    In other words, get_instance_nw_info() will always return the source
    ports. Because of that, we can replace it with a call to
    instance.get_network_info().

    NOTE(artom) The functional test has been excised, as in stable/train
    the NeutronFixture does not properly support live migration with
    ports, making the test worthless. The work to support this was done as
    part of bp/support-move-ops-with-qos-ports-ussuri, and starts at
    commit b2734b5a9ae8b869fc9e8e229826343da3b47fcb.

    NOTE(artom) The
    test_post_live_migration_no_shared_storage_working_correctly and
    test_post_live_migration_cinder_v3_api unit tests had to be adjusted
    as part of the backport to pass with the new code.

    [1] https://opendev.org/openstack/nova/src/commit/d9e04c4ff0b1a9c3383f1848dc846e93030d83cb/nova/network/neutronv2/api.py#L2493-L2522

    Change-Id: If0fbae33ce2af198188c91638afef939256c2556
    Closes-bug: 1879787
    (cherry picked from commit 6488a5dfb293831a448596e2084f484dd0bfa916)
    (cherry picked from commit 2c949cb3eea9cd9282060da12d32771582953aa2)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/750674
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=703c8ef4e6d282ce26c3527ecbbe6dc3f47b363f
Submitter: Zuul
Branch: stable/stein

commit 703c8ef4e6d282ce26c3527ecbbe6dc3f47b363f
Author: Artom Lifshitz <email address hidden>
Date: Fri Aug 21 13:06:58 2020 -0400

    post live migration: don't call Neutron needlessly

    In bug 1879787, the call to network_api.get_instance_nw_info() in
    _post_live_migration() on the source compute manager eventually calls
    out to the Neutron REST API. If this fails, the exception is
    unhandled, and the migrating instance - which is fully running on the
    destination at this point - will never be updated in the database.
    This update normally happens later in
    post_live_migration_at_destination().

    The network_info variable obtained from get_instance_nw_info() is used
    for two things: notifications - which aren't critical - and unplugging
    the instance's vifs on the source - which is very important!

    It turns out that at the time of the get_instance_nw_info() call, the
    network info in the instance info cache is still valid for unplugging
    the source vifs. The port bindings on the destination are only
    activated by the network_api.migrate_instance_start() [1] call that
    happens shortly *after* the problematic get_instance_nw_info() call.
    In other words, get_instance_nw_info() will always return the source
    ports. Because of that, we can replace it with a call to
    instance.get_network_info().

    [1] https://opendev.org/openstack/nova/src/commit/d9e04c4ff0b1a9c3383f1848dc846e93030d83cb/nova/network/neutronv2/api.py#L2493-L2522

    Change-Id: If0fbae33ce2af198188c91638afef939256c2556
    Closes-bug: 1879787
    (cherry picked from commit 6488a5dfb293831a448596e2084f484dd0bfa916)
    (cherry picked from commit 2c949cb3eea9cd9282060da12d32771582953aa2)
    (cherry picked from commit 7ace26e4bcd69a0a3d780cfae9f6745e4177b6c2)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/750673
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a1d3ae3876eb394bf594625e3c3cd1f0892d9ea6
Submitter: Zuul
Branch: stable/rocky

commit a1d3ae3876eb394bf594625e3c3cd1f0892d9ea6
Author: Artom Lifshitz <email address hidden>
Date: Fri Aug 21 13:06:58 2020 -0400

    post live migration: don't call Neutron needlessly

    In bug 1879787, the call to network_api.get_instance_nw_info() in
    _post_live_migration() on the source compute manager eventually calls
    out to the Neutron REST API. If this fails, the exception is
    unhandled, and the migrating instance - which is fully running on the
    destination at this point - will never be updated in the database.
    This update normally happens later in
    post_live_migration_at_destination().

    The network_info variable obtained from get_instance_nw_info() is used
    for two things: notifications - which aren't critical - and unplugging
    the instance's vifs on the source - which is very important!

    It turns out that at the time of the get_instance_nw_info() call, the
    network info in the instance info cache is still valid for unplugging
    the source vifs. The port bindings on the destination are only
    activated by the network_api.migrate_instance_start() [1] call that
    happens shortly *after* the problematic get_instance_nw_info() call.
    In other words, get_instance_nw_info() will always return the source
    ports. Because of that, we can replace it with a call to
    instance.get_network_info().

    Small conflict in nova/tests/unit/compute/test_compute_mgr.py due to
    different mocks in the context.

    [1] https://opendev.org/openstack/nova/src/commit/d9e04c4ff0b1a9c3383f1848dc846e93030d83cb/nova/network/neutronv2/api.py#L2493-L2522

    Change-Id: If0fbae33ce2af198188c91638afef939256c2556
    Closes-bug: 1879787
    (cherry picked from commit 6488a5dfb293831a448596e2084f484dd0bfa916)
    (cherry picked from commit 2c949cb3eea9cd9282060da12d32771582953aa2)
    (cherry picked from commit 7ace26e4bcd69a0a3d780cfae9f6745e4177b6c2)
    (cherry picked from commit 703c8ef4e6d282ce26c3527ecbbe6dc3f47b363f)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova rocky-eol

This issue was fixed in the openstack/nova rocky-eol release.
