Port binding failures after failover

Bug #1795299 reported by Lucian Petrut
Affects         Status        Importance  Assigned to     Milestone
compute-hyperv  Fix Released  Undecided   Lucian Petrut
os-win          Invalid       Undecided   Unassigned

Bug Description

In some cases, after instances are failed over, the destination host cannot find the instance vNICs or switch ports.

It looks like the ports are not ready yet by the time we get the failover event. Adding a retry to the method that connects ports to vswitches seems to avoid this issue.
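
A minimal sketch of the retry approach mentioned above, in Python. The connect_vnic_to_vswitch callable, attempt count and delay are assumptions for illustration, not the actual os-win/compute-hyperv code; as the later edit notes, retrying alone did not turn out to be the right fix.

    import time

    def connect_with_retries(connect_vnic_to_vswitch, vswitch_name,
                             vnic_name, attempts=5, delay=2):
        """Retry connecting a vNIC to a vswitch until the port shows up."""
        for attempt in range(1, attempts + 1):
            try:
                return connect_vnic_to_vswitch(vswitch_name, vnic_name)
            except Exception:
                # Right after a failover event the port may not exist yet.
                if attempt == attempts:
                    raise
                time.sleep(delay)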

Trace: http://paste.openstack.org/raw/730451/

Later edit: the root cause is that we're not waiting for "pending" cluster groups; while a group is pending, its VM isn't registered in Hyper-V yet.
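
For illustration, a sketch of waiting for a "pending" cluster group before acting on the failed-over VM, in Python. The get_cluster_group_state helper and the 'pending' state string are placeholders; the real state handling lives in compute-hyperv's cluster utilities.

    import time

    def wait_for_cluster_group(get_cluster_group_state, group_name,
                               timeout=60, interval=2):
        """Block until the cluster group leaves the 'pending' state."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = get_cluster_group_state(group_name)
            if state != 'pending':
                # The VM should now be registered in Hyper-V on this host.
                return state
            time.sleep(interval)
        raise TimeoutError("Cluster group %s is still pending" % group_name)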

description: updated
Changed in os-win:
status: New → Invalid
Changed in compute-hyperv:
assignee: nobody → Lucian Petrut (petrutlucian94)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-win (master)

Change abandoned by Lucian Petrut (<email address hidden>) on branch: master
Review: https://review.openstack.org/606893
Reason: Retrying here isn't the right thing to do. We have to wait for pending cluster groups.

Changed in compute-hyperv:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to compute-hyperv (master)

Reviewed: https://review.openstack.org/609017
Committed: https://git.openstack.org/cgit/openstack/compute-hyperv/commit/?id=9f5628e55bc70bc57bcf66e456fb18af7fe47859
Submitter: Zuul
Branch: master

commit 9f5628e55bc70bc57bcf66e456fb18af7fe47859
Author: Lucian Petrut <email address hidden>
Date: Tue Oct 9 17:20:16 2018 +0300

    Improve clustered instance failover handling

    Instances can bounce between hosts many times in a short interval,
    especially if the CSVs go down (as much as 20 times in less than 2
    minutes).

    We're not handling this properly. The failover handling logic is
    prone to race conditions, as multiple hosts may attempt to claim
    the instance, leaving it in an inconsistent state.

    We're introducing distributed locks, preventing races between hosts.
    At the same time, we're validating the events, as the instances can
    move again by the time we process the event.

    The distributed lock backend will have to be configured.

    At the same time, we're now waiting for "pending" cluster groups;
    while a group is pending, the VM may not even be registered in
    Hyper-V yet, so any action we take on it would fail.

    Closes-Bug: #1795299
    Closes-Bug: #1796673

    Change-Id: I3dbdcf208bb7a96bd516b41e4725a5fcb37280d6
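
The commit above describes the final approach: a distributed lock serializes failover handling across hosts, and the event is re-validated once the lock is held, since the instance can move again in the meantime. Below is a minimal sketch of that pattern, assuming the tooz coordination library as the configurable lock backend; the instance_still_owned_by_this_host and claim_instance helpers are hypothetical, not the actual compute-hyperv code.

    from tooz import coordination

    def handle_failover_event(backend_url, this_host, instance_name,
                              instance_still_owned_by_this_host,
                              claim_instance):
        coordinator = coordination.get_coordinator(
            backend_url, this_host.encode('utf-8'))
        coordinator.start()
        try:
            lock_name = ('failover-%s' % instance_name).encode('utf-8')
            with coordinator.get_lock(lock_name):
                # The instance may have moved again while we waited for
                # the lock, so validate the event before claiming it.
                if not instance_still_owned_by_this_host(instance_name):
                    return
                claim_instance(instance_name)
        finally:
            coordinator.stop()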

Changed in compute-hyperv:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/compute-hyperv 9.0.0.0rc1

This issue was fixed in the openstack/compute-hyperv 9.0.0.0rc1 release candidate.
