Cluster driver failover race condition
Bug #1796673 reported by Lucian Petrut
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
compute-hyperv | Fix Released | Undecided | Unassigned | |
Bug Description
In some cases, clustered instances may bounce between hosts multiple times when a failure occurs. For example, if a CSV (Cluster Shared Volume) goes down, none of the hosts will be eligible to run the instances, and they may bounce between hosts more than 20 times within a couple of minutes [1].
Our driver does not handle this gracefully. Multiple services attempt to claim the same instance, which leaves it in an inconsistent state: its ports are no longer bound and the Nova DB does not record the right host for the instance.
[1] http://
We should probably use distributed locks and double check the instance state when processing failovers.
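As a rough illustration of that suggestion (not the fix that eventually landed), the sketch below serializes failover handling per instance and re-checks ownership once the lock is held. Everything in it is hypothetical: `FakeCluster`, `FakeNovaDB`, `get_instance_lock` and `handle_failover` are made-up names, and a local `threading.Lock` stands in for a real distributed lock so the example runs on its own.

```python
import threading
from collections import defaultdict

# Hypothetical stand-in for a distributed lock backend: one lock per instance.
# A real deployment would need a lock shared across hosts, not a local one.
_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


def get_instance_lock(instance_name):
    """Return the lock guarding failover handling for one instance."""
    with _locks_guard:
        return _locks[instance_name]


class FakeCluster:
    """Hypothetical view of which host currently owns each instance."""

    def __init__(self):
        self.owner = {}

    def get_owner(self, instance_name):
        return self.owner.get(instance_name)


class FakeNovaDB:
    """Hypothetical stand-in for the instance host recorded in the Nova DB."""

    def __init__(self):
        self.host = {}


def handle_failover(cluster, db, instance_name, this_host):
    """Claim the instance only if the cluster still reports this host as owner."""
    with get_instance_lock(instance_name):
        # Double check: the instance may have bounced again since the
        # failover event was received, so a stale event must be ignored.
        if cluster.get_owner(instance_name) != this_host:
            return False
        db.host[instance_name] = this_host
        return True


cluster = FakeCluster()
db = FakeNovaDB()
cluster.owner["vm1"] = "hostB"  # the instance finally settled on hostB

stale = handle_failover(cluster, db, "vm1", "hostA")    # stale event on hostA
current = handle_failover(cluster, db, "vm1", "hostB")  # current owner claims it
```

Here the service on `hostA` drops its stale event, and only `hostB` updates the recorded host, so the two services never claim the instance at the same time.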
Reviewed: https://review.openstack.org/609016
Committed: https://git.openstack.org/cgit/openstack/compute-hyperv/commit/?id=91476d7417148b1f0e89c47c0417a787d00e77d7
Submitter: Zuul
Branch: master
commit 91476d7417148b1f0e89c47c0417a787d00e77d7
Author: Lucian Petrut <email address hidden>
Date: Tue Oct 9 16:23:25 2018 +0300
Add distributed lock helpers
This change imports the "coordination" module, which is shared by
Cinder, Manila and a few other projects. At some point, it should
probably be submitted to oslo.
It uses tooz, an OpenStack library, in order to provide distributed
locks. Tooz supports various backends, such as etcd, mysql, file
locks, redis, zookeeper, etc.
The lock backend can be selected using the CONF.coordination.backend_url
config option.
A subsequent change will use distributed locks for the cluster driver,
preventing race conditions when handling failovers.
Related-Bug: #1796673
Change-Id: I5a7d79fe1cf6ce13ff9d20d7618886add6221300
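
For anyone configuring this, selecting a backend might look like the fragment below. It is only a sketch: the `[coordination]` section name is inferred from the `CONF.coordination.backend_url` option named in the commit message, the host name is a made-up placeholder, and the zookeeper URL is just one of the tooz backends listed above. Which config file the compute service reads it from is not stated here.

```ini
[coordination]
# Hypothetical example value; replace with the URL of the chosen tooz backend.
backend_url = zookeeper://coordination-host.example:2181
```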