Cluster driver failover race condition

Bug #1796673 reported by Lucian Petrut
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
compute-hyperv   Fix Released   Undecided    Unassigned

Bug Description

In some cases, clustered instances may bounce between hosts multiple times when a failure occurs. For example, if a CSV (Cluster Shared Volume) goes down, none of the hosts will be eligible to run the instances, which may then bounce between the hosts more than 20 times in a couple of minutes [1].

Our driver doesn't handle this gracefully. Multiple services will attempt to claim the instance, which will end up in an inconsistent state: its ports will no longer be bound, and the Nova DB will not record the right host for the instance.

[1] http://paste.openstack.org/raw/731678/

We should probably use distributed locks and double check the instance state when processing failovers.
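
The suggestion above can be sketched roughly as follows. This is only an illustration, not the driver's code: `FailoverHandler`, `get_owner` and `get_lock` are hypothetical names, and an in-process `threading.Lock` stands in for a real distributed lock so that the example runs on its own.

```python
import threading
from collections import defaultdict


class FailoverHandler:
    """Hypothetical per-host failover handler (illustration only)."""

    def __init__(self, this_host, get_owner, get_lock):
        self.this_host = this_host
        self._get_owner = get_owner  # callable: instance name -> current owner host
        self._get_lock = get_lock    # callable: instance name -> lock (context manager)
        self.claimed = []

    def handle_failover(self, instance_name, new_host):
        # Serialize the handling of a given instance across all hosts.
        with self._get_lock(instance_name):
            # Double check: the instance may have moved again since the
            # failover event was emitted.
            if self._get_owner(instance_name) != new_host:
                return False  # stale event, ignore it
            if new_host != self.this_host:
                return False  # another host owns the instance
            # A real driver would rebind the ports and update the Nova DB here.
            self.claimed.append(instance_name)
            return True


# Stand-in for a distributed lock: one in-process lock per instance name.
_locks = defaultdict(threading.Lock)
owners = {"instance-1": "host-b"}  # what the cluster currently reports

handler_a = FailoverHandler("host-a", owners.get, lambda name: _locks[name])
handler_b = FailoverHandler("host-b", owners.get, lambda name: _locks[name])

# The instance bounced from host-a to host-b, so the event that names
# host-a is already stale by the time it is processed.
results = [
    handler_a.handle_failover("instance-1", "host-a"),
    handler_b.handle_failover("instance-1", "host-b"),
]
print(results)  # [False, True]
```

Only the host that still owns the instance once the lock is held claims it; every other event is dropped instead of racing.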

OpenStack Infra (hudson-openstack) wrote : Related fix merged to compute-hyperv (master)

Reviewed: https://review.openstack.org/609016
Committed: https://git.openstack.org/cgit/openstack/compute-hyperv/commit/?id=91476d7417148b1f0e89c47c0417a787d00e77d7
Submitter: Zuul
Branch: master

commit 91476d7417148b1f0e89c47c0417a787d00e77d7
Author: Lucian Petrut <email address hidden>
Date: Tue Oct 9 16:23:25 2018 +0300

    Add distributed lock helpers

    This change imports the "coordination" module, which is shared by
    Cinder, Manila and a few other projects. At some point, it should
    probably be submitted to oslo.

    It uses tooz, an OpenStack library, in order to provide distributed
    locks. Tooz supports various backends, such as etcd, mysql, file
    locks, redis, zookeeper, etc.

    The lock backend can be selected using the CONF.coordination.backend_url
    config option.

    A subsequent change will use distributed locks for the cluster driver,
    preventing race conditions when handling failovers.

    Related-Bug: #1796673

    Change-Id: I5a7d79fe1cf6ce13ff9d20d7618886add6221300
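
As a rough illustration of the config option named in the commit message, the snippet below parses a sample `[coordination]` section. The URL, its address and the choice of an etcd backend are assumptions made for the example, and the stdlib `configparser` is used only to keep it self-contained; the driver itself reads the value as `CONF.coordination.backend_url`.

```python
import configparser
from urllib.parse import urlsplit

# Hypothetical config fragment; the backend and address are placeholders.
SAMPLE_CONF = """
[coordination]
backend_url = etcd3+http://192.0.2.10:2379
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CONF)

backend_url = parser.get("coordination", "backend_url")
parts = urlsplit(backend_url)

# The URL scheme is what selects the lock backend.
print(parts.scheme, parts.hostname, parts.port)
```

Every host in the cluster would need to point at the same backend for the locks to be shared between them.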

OpenStack Infra (hudson-openstack) wrote : Fix merged to compute-hyperv (master)

Reviewed: https://review.openstack.org/609017
Committed: https://git.openstack.org/cgit/openstack/compute-hyperv/commit/?id=9f5628e55bc70bc57bcf66e456fb18af7fe47859
Submitter: Zuul
Branch: master

commit 9f5628e55bc70bc57bcf66e456fb18af7fe47859
Author: Lucian Petrut <email address hidden>
Date: Tue Oct 9 17:20:16 2018 +0300

    Improve clustered instance failover handling

    Instances can bounce between hosts many times in a short interval,
    especially if the CSVs go down (as much as 20 times in less than 2
    minutes).

    We're not handling this properly. The failover handling logic is
    prone to race conditions, as multiple hosts may attempt to claim
    the instance, which will end up in an inconsistent state.

    We're introducing distributed locks, preventing races between hosts.
    At the same time, we're validating the events, as the instances can
    move again by the time we process the event.

    The distributed lock backend will have to be configured.

    At the same time, we're now waiting for "pending" cluster groups,
    which may not even be registered in Hyper-V, so any action we take
    on the VM would fail.

    Closes-Bug: #1795299
    Closes-Bug: #1796673

    Change-Id: I3dbdcf208bb7a96bd516b41e4725a5fcb37280d6
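
The "waiting for pending cluster groups" step mentioned in the commit message could look roughly like the polling loop below. This is a sketch under assumptions: `wait_for_group`, the state names and the timeout values are all hypothetical and are not taken from the actual change.

```python
import time


def wait_for_group(get_state, pending_states=("pending",),
                   timeout=10.0, interval=0.01):
    """Poll get_state() until the group leaves a pending state."""
    deadline = time.monotonic() + timeout
    while True:
        state = get_state()
        if state not in pending_states:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError("cluster group is still %s" % state)
        time.sleep(interval)


# Simulated cluster group that settles after two polls.
states = iter(["pending", "pending", "online"])
final_state = wait_for_group(lambda: next(states))
print(final_state)  # online
```

Acting on the VM only after the group has settled avoids operations that would fail against a group that is not registered yet.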

Changed in compute-hyperv:
status: New → Fix Released

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/compute-hyperv 9.0.0.0rc1

This issue was fixed in the openstack/compute-hyperv 9.0.0.0rc1 release candidate.
