Failed evacuations leave neutron ports on destination host

Bug #1659062 reported by Matt Rabe
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Wishlist
Assigned to: Jack Ding
Milestone: (none listed)

Bug Description

Description
===========
This is related to https://bugs.launchpad.net/nova/+bug/1430042 and the associated fix https://review.openstack.org/#/c/169827/. If an evacuation fails, the neutron ports' host_id binding is not reverted to the source host.

This may or may not be a bug, but if the evacuation fails and the source host comes back up and VMs are expected to be running, then the neutron ports should probably be rolled back.

Steps to reproduce
==================
* Raise an exception at some point in the evacuation flow after the setup_instance_network_on_host calls in _do_rebuild_instance in the manager
* Issue an evacuation of a VM to the host that will fail
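
The reproduction steps can be illustrated with a small self-contained simulation. This is only a sketch: `FakeNeutron`, `do_rebuild_instance`, and the host and port names are hypothetical stand-ins rather than nova or neutron code, and the only assumption carried over from the report is that the ports are rebound before the injected exception is raised.

```python
class FakeNeutron:
    """Hypothetical in-memory stand-in for the Neutron port API."""

    def __init__(self):
        self.ports = {}

    def update_port(self, port_id, host):
        # Neutron exposes a port's host binding as 'binding:host_id'.
        self.ports[port_id]["binding:host_id"] = host


def setup_instance_network_on_host(neutron, port_ids, host):
    """Stand-in for the nova call of the same name: bind each port to host."""
    for port_id in port_ids:
        neutron.update_port(port_id, host)


def do_rebuild_instance(neutron, port_ids, dest_host, fail=False):
    """Simplified evacuation flow; not the real _do_rebuild_instance."""
    setup_instance_network_on_host(neutron, port_ids, dest_host)
    if fail:
        # Injected fault, raised after the ports were already rebound.
        raise RuntimeError("injected evacuation failure")


neutron = FakeNeutron()
neutron.ports["port-1"] = {"binding:host_id": "source-host"}

try:
    do_rebuild_instance(neutron, ["port-1"], "dest-host", fail=True)
except RuntimeError:
    pass

binding_after_failure = neutron.ports["port-1"]["binding:host_id"]
print(binding_after_failure)  # dest-host
```

In this simulation the port stays bound to the destination host after the failure, which mirrors the behavior described in this report.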

Expected result
===============
* If the evacuation fails, the neutron ports' host_id binding should be updated back to the source host.
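
One way to picture the expected behavior is a rollback wrapper around the rebind step. The sketch below is an assumption about how such a rollback could look, not how nova implements it: `rollback_port_bindings` is a hypothetical helper, and the ports are plain dicts instead of Neutron API calls.

```python
from contextlib import contextmanager


@contextmanager
def rollback_port_bindings(ports, source_host):
    """Restore each port's host binding to source_host if the body raises.

    `ports` is a hypothetical dict of port_id -> port dict; real code would
    update the ports through the Neutron API instead of mutating a dict.
    """
    try:
        yield
    except Exception:
        for port in ports.values():
            port["binding:host_id"] = source_host
        raise  # re-raise so the caller still sees the evacuation failure


ports = {"port-1": {"binding:host_id": "source-host"}}

try:
    with rollback_port_bindings(ports, "source-host"):
        ports["port-1"]["binding:host_id"] = "dest-host"  # rebind step
        raise RuntimeError("injected evacuation failure")
except RuntimeError:
    pass

print(ports["port-1"]["binding:host_id"])  # source-host
```

The exception is re-raised after the rollback so that the failure is still reported to the caller.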

Actual result
=============
* The ports' host_id bindings remain set to the destination host.

Environment
===========
1. Exact version of OpenStack you are running. See the following
   Newton

2. Which hypervisor did you use?
   PowerVM

3. Which storage type did you use?
   N/A

4. Which networking type did you use?
   Neutron with SEA

Tags: evacuate
Sean Dague (sdague) wrote :

Evacuate behavior changes are so dicey at this point that I think anything like this probably needs a spec to actually think through the edge conditions.

Please dive in here if you are interested - https://specs.openstack.org/openstack/nova-specs/readme.html

tags: added: evacuate
Changed in nova:
status: New → Opinion
importance: Undecided → Wishlist
Jack Ding (jackding)
Changed in nova:
assignee: nobody → Jack Ding (jackding)
status: Opinion → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/603844

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/640516

sean mooney (sean-k-mooney) wrote :

I think the actual cause of this bug is this neutron bug:
https://bugs.launchpad.net/neutron/+bug/1815345
As such, I'm not sure we should be "fixing" this in nova,
as we are actually working around a neutron bug.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/603844
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=542635034882e1b6897e1935f09d6feb6e77d1ce
Submitter: Zuul
Branch: master

commit 542635034882e1b6897e1935f09d6feb6e77d1ce
Author: Jack Ding <email address hidden>
Date: Wed Sep 19 11:54:44 2018 -0400

    Correct instance port binding for rebuilds

    The following 2 scenarios could result in an instance with incorrect
    port binding and cause subsequent rebuilds to fail.

    If an evacuation of an instance fails part way through, after the point
    where we reassign the port binding to the new host but before we change
    the instance host, we end up with the ports assigned to the wrong host.
    This change adds a check to determine if there's any port binding host
    mismatches and if so trigger setup of instance network.

    During recovery of failed hosts, neutron could get overwhelmed and lose
    messages, for example when the active controller was powered off in the
    middle of instance evacuations. In this case the vif_type was set to
    'binding_failed' or 'unbound'. We subsequently hit "Unsupported VIF
    type" exception during instance hard_reboot or rebuild, leaving the
    instance unrecoverable.

    This commit changes _heal_instance_info_cache periodic task to update
    port binding if evacuation fails due to above errors so that the
    instance can be recovered later.

    Closes-Bug: #1659062
    Related-Bug: #1784579

    Co-Authored-By: Gerry Kopec <email address hidden>
    Co-Authored-By: Jim Gauld <email address hidden>
    Change-Id: I75fd15ac2a29e420c09499f2c41d11259ca811ae
    Signed-off-by: Jack Ding <email address hidden>
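
The mismatch check described in the commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code that was merged: `ports_needing_rebind` is a hypothetical helper, and the ports are plain dicts shaped like Neutron port resources, using the `binding:host_id` and `binding:vif_type` attributes.

```python
# vif_type values the commit message names as leaving the instance stuck.
FAILED_VIF_TYPES = ("binding_failed", "unbound")


def ports_needing_rebind(ports, instance_host):
    """Return IDs of ports whose binding looks wrong for instance_host.

    A port qualifies if it is bound to a different host than the instance,
    or if its vif_type shows that binding failed or never happened.
    """
    stale = []
    for port in ports:
        wrong_host = port.get("binding:host_id") != instance_host
        failed_vif = port.get("binding:vif_type") in FAILED_VIF_TYPES
        if wrong_host or failed_vif:
            stale.append(port["id"])
    return stale


ports = [
    {"id": "p1", "binding:host_id": "source-host", "binding:vif_type": "ovs"},
    {"id": "p2", "binding:host_id": "dest-host", "binding:vif_type": "ovs"},
    {"id": "p3", "binding:host_id": "source-host",
     "binding:vif_type": "binding_failed"},
]
stale_ports = ports_needing_rebind(ports, "source-host")
print(stale_ports)  # ['p2', 'p3']
```

A periodic task could run a check like this for each instance on the host and trigger network setup only for the ports it returns.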

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/640516
