Port binding failures after failover

Bug #1795299 reported by Lucian Petrut
Affects         Status        Importance  Assigned to     Milestone
compute-hyperv  Fix Released  Undecided   Lucian Petrut
os-win          Invalid       Undecided   Unassigned

Bug Description

In some cases, after instances are failed over, the destination host cannot find the instance vNICs or switch ports.

It looks like the ports are not ready yet by the time we get the failover event. Adding a retry to the method that connects ports to vswitches seems to avoid this issue.
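
A minimal sketch of the retry approach mentioned above, in Python. The connect_vnic_to_vswitch callable, attempt count and delay are assumptions for illustration, not the actual os-win/compute-hyperv code; as the later edit notes, retrying alone did not turn out to be the right fix.

    import time

    def connect_with_retries(connect_vnic_to_vswitch, vswitch_name,
                             vnic_name, attempts=5, delay=2):
        """Retry connecting a vNIC to a vswitch until the port shows up."""
        for attempt in range(1, attempts + 1):
            try:
                return connect_vnic_to_vswitch(vswitch_name, vnic_name)
            except Exception:
                # Right after a failover event the port may not exist yet.
                if attempt == attempts:
                    raise
                time.sleep(delay)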

Trace: http://paste.openstack.org/raw/730451/

Later edit: the root cause is that we're not waiting for "pending" cluster groups; while a group is pending, its VM isn't registered in Hyper-V yet.
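
For illustration, a sketch of waiting for a "pending" cluster group before acting on the failed-over VM, in Python. The get_cluster_group_state helper and the 'pending' state string are placeholders; the real state handling lives in compute-hyperv's cluster utilities.

    import time

    def wait_for_cluster_group(get_cluster_group_state, group_name,
                               timeout=60, interval=2):
        """Block until the cluster group leaves the 'pending' state."""
        deadline = time.time() + timeout
        while time.time() < deadline:
            state = get_cluster_group_state(group_name)
            if state != 'pending':
                # The VM should now be registered in Hyper-V on this host.
                return state
            time.sleep(interval)
        raise TimeoutError("Cluster group %s is still pending" % group_name)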

description: updated
Changed in os-win:
status: New → Invalid
Changed in compute-hyperv:
assignee: nobody → Lucian Petrut (petrutlucian94)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on os-win (master)

Change abandoned by Lucian Petrut (<email address hidden>) on branch: master
Review: https://review.openstack.org/606893
Reason: Retrying here isn't the right thing to do. We have to wait for pending cluster groups.

Changed in compute-hyperv:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to compute-hyperv (master)

Reviewed: https://review.openstack.org/609017
Committed: https://git.openstack.org/cgit/openstack/compute-hyperv/commit/?id=9f5628e55bc70bc57bcf66e456fb18af7fe47859
Submitter: Zuul
Branch: master

commit 9f5628e55bc70bc57bcf66e456fb18af7fe47859
Author: Lucian Petrut <email address hidden>
Date: Tue Oct 9 17:20:16 2018 +0300

    Improve clustered instance failover handling

    Instances can bounce between hosts many times in a short interval,
    especially if the CSVs go down (as much as 20 times in less than 2
    minutes).

    We're not handling this properly. The failover handling logic is
    prone to race conditions, as multiple hosts may attempt to claim
    the instance, leaving it in an inconsistent state.

    We're introducing distributed locks, preventing races between hosts.
    At the same time, we're validating the events, as the instances can
    move again by the time we process the event.

    The distributed lock backend will have to be configured.

    At the same time, we're now waiting for "pending" cluster groups;
    while a group is pending, the VM may not even be registered in
    Hyper-V yet, so any action we take on it would fail.

    Closes-Bug: #1795299
    Closes-Bug: #1796673

    Change-Id: I3dbdcf208bb7a96bd516b41e4725a5fcb37280d6
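
The commit above describes the final approach: a distributed lock serializes failover handling across hosts, and the event is re-validated once the lock is held, since the instance can move again in the meantime. Below is a minimal sketch of that pattern, assuming the tooz coordination library as the configurable lock backend; the instance_still_owned_by_this_host and claim_instance helpers are hypothetical, not the actual compute-hyperv code.

    from tooz import coordination

    def handle_failover_event(backend_url, this_host, instance_name,
                              instance_still_owned_by_this_host,
                              claim_instance):
        coordinator = coordination.get_coordinator(
            backend_url, this_host.encode('utf-8'))
        coordinator.start()
        try:
            lock_name = ('failover-%s' % instance_name).encode('utf-8')
            with coordinator.get_lock(lock_name):
                # The instance may have moved again while we waited for
                # the lock, so validate the event before claiming it.
                if not instance_still_owned_by_this_host(instance_name):
                    return
                claim_instance(instance_name)
        finally:
            coordinator.stop()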

Changed in compute-hyperv:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/compute-hyperv 9.0.0.0rc1

This issue was fixed in the openstack/compute-hyperv 9.0.0.0rc1 release candidate.
