Close OVN VIP race by adding an ordering constraint
Currently there is a race with the high-availability of ovn when resetting a
controller. Namely, the VIP that OVN uses (namely the internal_api VIP
by default) only has a colocation constraint with the master role of the
ovn-dbs resource. This leaves the following race open:
1) We reboot ctrl-0 hosting the master role of ovn-dbs
2) OVN becomes master on ctrl-1 from pacemaker's POV (but the
promotion operation running in the background is not completed)
3) OVN VIP moves to ctrl-1 even though it is still in slave mode
(there is only a colocation constraint between vip and master role for
ovn)
4) OVN controllers on the overcloud connect to the VIP but it is in
read-only mode because it was a slave
5) OVN controllers that connected at 4) stay in read-only forever
until they get restarted manually.
With the addition of this constraint we force the VIP move only after
the master role has been promoted. This makes it much more unlikely
for a client to connect to the VIP and get a read-only db in the
background. With only this patch applied I did not manage to reproduce
the issue (even after 7 reboots of controllers).
Note that there is still a small race window possible because the
current OVN resource agent has a bug: it promotes a resource to master
after issuing the promotion command to the DB but without waiting for
this promotion to complete. A patch for OVN-ra will also be submitted
but from initial testing this change seems to be largely sufficient.
Also note that this change introduces a small less desirable
side-effect:
A failover of the internal VIP will now take a bit longer because it
will happen only after ovn-dbs gets promoted to master.
We plan to take care of this fully by decoupling the OVN VIP from the
internal_api one. This change addresses the immediate issue related
to ovn_controllers being stuck in read-only due to premature promotion.
(OVN upstream is discussing how to make connections to read-only VIP
trigger a reconnection eventually)
Reviewed: https:/ /review. opendev. org/669610 /git.openstack. org/cgit/ openstack/ puppet- tripleo/ commit/ ?id=5c10f331977 152d4e69e91ed67 cd14304cef6a18
Committed: https:/
Submitter: Zuul
Branch: master
commit 5c10f331977152d 4e69e91ed67cd14 304cef6a18
Author: Michele Baldessari <email address hidden>
Date: Mon Jul 8 08:53:27 2019 +0200
Close OVN VIP race by adding an ordering constraint
Currently there is a race with the high-availability of ovn when resetting a
controller. Namely, the VIP that OVN uses (namely the internal_api VIP
by default) only has a colocation constraint with the master role of the
ovn-dbs resource. This leaves the following race open:
1) We reboot ctrl-0 hosting the master role of ovn-dbs
2) OVN becomes master on ctrl-1 from pacemaker's POV (but the
promotion operation running in the background is not completed)
3) OVN VIP moves to ctrl-1 even though it is still in slave mode
(there is only a colocation constraint between vip and master role for
ovn)
4) OVN controllers on the overcloud connect to the VIP but it is in
read-only mode because it was a slave
5) OVN controllers that connected at 4) stay in read-only forever
until they get restarted manually.
With the addition of this constraint we force the VIP move only after
the master role has been promoted. This makes it much more unlikely
for a client to connect to the VIP and get a read-only db in the
background. With only this patch applied I did not manage to reproduce
the issue (even after 7 reboots of controllers).
Note that there is still a small race window possible because the
current OVN resource agent has a bug: it promotes a resource to master
after issuing the promotion command to the DB but without waiting for
this promotion to complete. A patch for OVN-ra will also be submitted
but from initial testing this change seems to be largely sufficient.
Also note that this change introduces a small less desirable
side-effect:
A failover of the internal VIP will now take a bit longer because it
will happen only after ovn-dbs gets promoted to master.
We plan to take care of this fully by decoupling the OVN VIP from the
internal_api one. This change addresses the immediate issue related
to ovn_controllers being stuck in read-only due to premature promotion.
(OVN upstream is discussing how to make connections to read-only VIP
trigger a reconnection eventually)
Closes-Bug: #1835830
Change-Id: I3fa07e28c4e371 97890664d12a265 f1673c780f2