HA: moving VIP away during minor update can leave some VIP stopped
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
tripleo | Confirmed | Undecided | Damien Ciabrini | |
Bug Description
In an HA control plane, controller nodes are stopped and updated one after the other during a minor update. To minimise service outage, we proactively move the VIPs away from the node being updated.
There is a race between the moment we determine which VIPs to move and the moment we execute the move action. As a result, a VIP can be moved twice, which can leave a ban flag in the cluster.
# seen while updating controller-0. Note that the ban of the ovn VIP is performed for controller-2 instead of controller-0
[root@messaging-0 pacemaker]# grep cli-ban /var/log/
Mar 25 00:32:44 messaging-
-0" score="-INFINITY"/>
Mar 25 00:32:47 messaging-
" score="-INFINITY"/>
Mar 25 00:33:25 messaging-
Mar 25 00:33:25 messaging-
Mar 25 00:33:25 messaging-
.150" score="-INFINITY"/>
Mar 25 00:33:25 messaging-
40" score="-INFINITY"/>
When that happens, the VIP can no longer be hosted on the node that is left with the ban flag. Under some circumstances, the cluster can then host ovn_dbs on the node with the leftover ban while the VIP is unable to move with it, causing an outage of the OVN service.
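To make the race easier to reason about, here is a minimal Python sketch of the check-then-act sequence described above. It is a toy model, not the tripleo code: the `ToyCluster` class, its methods, and the resource name `ovn-vip` are all hypothetical, and the assumption that an unconditional move bans whichever node currently hosts the VIP is a simplification of what the cluster does.

```python
class ToyCluster:
    """Toy model: where each VIP runs and which (vip, node) pairs are banned."""

    def __init__(self, nodes, placement):
        self.nodes = list(nodes)
        self.placement = dict(placement)  # vip -> node currently hosting it
        self.bans = set()                 # (vip, node) pairs, like a -INFINITY constraint

    def vips_on(self, node):
        return [vip for vip, host in self.placement.items() if host == node]

    def relocate(self, vip, target):
        """Cluster-initiated move; refused if the target node is banned."""
        if (vip, target) in self.bans:
            return False
        self.placement[vip] = target
        return True

    def move_away(self, vip):
        """Unconditional move: ban whichever node hosts the VIP right now."""
        self.bans.add((vip, self.placement[vip]))
        allowed = [n for n in self.nodes if (vip, n) not in self.bans]
        self.placement[vip] = allowed[0] if allowed else None


cluster = ToyCluster(
    ["controller-0", "controller-1", "controller-2"],
    {"ovn-vip": "controller-0"},  # hypothetical resource name
)

to_move = cluster.vips_on("controller-0")    # 1. decide which VIPs to move
cluster.relocate("ovn-vip", "controller-2")  # meanwhile the VIP moves on its own
for vip in to_move:
    cluster.move_away(vip)                   # 2. act on the now-stale list

print(sorted(cluster.bans))
print(cluster.relocate("ovn-vip", "controller-2"))
```

In this toy run the ban ends up on controller-2 rather than controller-0, and the later attempt to place the VIP on controller-2 is refused, which mirrors the symptom reported here.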
Changed in tripleo:
milestone: wallaby-rc1 → xena-1
Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3
Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/785205
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/de98fdb208dac1cf01d474c4363f6e2177e22c7e
Submitter: "Zuul (22348)"
Branch: stable/victoria
commit de98fdb208dac1cf01d474c4363f6e2177e22c7e
Author: Damien Ciabrini <email address hidden>
Date: Thu Mar 25 10:23:48 2021 +0100
HA: fix race when moving VIP during minor update
During a minor update of a controller node, we list the
VIP hosted on the node and we force them to move away
with a pcs command.
The way we call the pcs command is racy, so change the
action to make sure the VIP move is performed only if
the VIP is still hosted on the node when we run the action.
Change-Id: Id379f4fe1668d01fdd5f91b46e2f75d7cdb577ae
Closes-Bug: #1921351
(cherry picked from commit 3da1e7661036a73c9f222bcdcc489912502d23cb)
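The idea in the commit message (only move a VIP if it is still hosted on the node when the action runs) can be illustrated with a small Python sketch. This is not the code from the change: `ToyCluster`, `move_away_if_hosted`, the `fallback` parameter, and the VIP names are hypothetical, and the lock is only a stand-in for "check and act as one step" in a toy model.

```python
import threading


class ToyCluster:
    """Toy model of VIP placement with a guarded move action."""

    def __init__(self, placement):
        self.placement = dict(placement)  # vip -> node currently hosting it
        self.bans = set()                 # (vip, node) pairs that are banned
        self._lock = threading.Lock()     # stand-in for "check and act as one step"

    def move_away_if_hosted(self, vip, node, fallback):
        """Ban `node` for `vip` only if `vip` still runs there at action time."""
        with self._lock:
            if self.placement.get(vip) != node:
                return False              # VIP already left: no move, no ban
            self.bans.add((vip, node))
            self.placement[vip] = fallback
            return True


# Hypothetical VIP names; both start on the node being updated.
cluster = ToyCluster({"ovn-vip": "controller-0", "internal-vip": "controller-0"})
stale_list = ["ovn-vip", "internal-vip"]        # decided earlier
cluster.placement["ovn-vip"] = "controller-2"   # ovn-vip moved in the meantime

results = {
    vip: cluster.move_away_if_hosted(vip, "controller-0", fallback="controller-1")
    for vip in stale_list
}
print(results)
print(sorted(cluster.bans))
```

Because the action names the node explicitly and re-checks the placement when it runs, the VIP that already left controller-0 is skipped and no ban is created on controller-2; only the VIP still hosted on controller-0 is moved.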