HA: moving VIP away during minor update can leave some VIP stopped

Bug #1921351 reported by Damien Ciabrini
This bug affects 1 person
Affects: tripleo
Status: Confirmed
Importance: Undecided
Assigned to: Damien Ciabrini

Bug Description

In an HA control plane, during minor updates, controller nodes are stopped and updated one after the other. To minimise service outage, we proactively move the VIPs away from the node we're updating.

There is a race between the moment we determine which VIPs to move and the moment we execute the move action. As a result, a VIP can be moved twice, which can leave a stale ban flag in the cluster.

# seen while updating controller-0. Note that the ban of the ovn VIP is performed for controller-2 instead of controller-0

[root@messaging-0 pacemaker]# grep cli-ban /var/log/pacemaker/pacemaker.log
Mar 25 00:32:44 messaging-0.redhat.local pacemaker-based [2019] (cib_perform_op) info: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-ovn-dbs-bundle-on-controller-0" rsc="ovn-dbs-bundle" role="Started" node="controller-0" score="-INFINITY"/>
Mar 25 00:32:47 messaging-0.redhat.local pacemaker-based [2019] (cib_perform_op) info: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-ip-10.0.0.150-on-controller-0" rsc="ip-10.0.0.150" role="Started" node="controller-$
" score="-INFINITY"/>
Mar 25 00:33:25 messaging-0.redhat.local pacemaker-based [108231] (cib_perform_op) info: -- /cib/configuration/constraints/rsc_location[@id='cli-ban-ip-10.0.0.150-on-controller-0']
Mar 25 00:33:25 messaging-0.redhat.local pacemaker-based [108231] (cib_perform_op) info: + /cib/configuration/constraints/rsc_location[@id='cli-ban-ovn-dbs-bundle-on-controller-0']: @node=controller-0, @rsc=ovn-dbs-bundle
Mar 25 00:33:25 messaging-0.redhat.local pacemaker-based [108231] (cib_perform_op) info: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-ip-172.17.4.150-on-controller-0" node="controller-0" role="Started" rsc="ip-172.17.$
.150" score="-INFINITY"/>
Mar 25 00:33:25 messaging-0.redhat.local pacemaker-based [108231] (cib_perform_op) info: ++ /cib/configuration/constraints: <rsc_location id="cli-ban-ip-172.17.1.40-on-controller-2" node="controller-2" role="Started" rsc="ip-172.17.1$
40" score="-INFINITY"/>

When that happens, the VIP can no longer be hosted on the node that is left with the ban flag. Under some circumstances, the cluster can end up hosting ovn_dbs on the node with the leftover ban while the VIP is prevented from moving there with it, causing an outage of the OVN service.
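
A leftover ban of this kind can be spotted from the same kind of log lines as in the excerpt above. The following is a minimal sketch, not part of the reported fix: `leftover_bans` is a hypothetical helper that reads pacemaker-based log lines on stdin and prints the `cli-ban-*` constraint ids that were added (`++`) but never deleted (`--`). It assumes the log format shown in this report and only knows about the lines it is given, so a truncated log can produce false positives.

```shell
#!/bin/sh
# Hypothetical helper (not from the patch): list cli-ban constraints that
# appear as added but never removed in pacemaker-based log lines on stdin.
# Possible usage: leftover_bans < /var/log/pacemaker/pacemaker.log
leftover_bans() {
    awk '
        # "++" lines record a location constraint being added to the CIB.
        /info: \+\+ .*rsc_location id="cli-ban-/ {
            if (match($0, /id="cli-ban-[^"]*"/)) {
                # Skip the leading id=" (4 chars) and the closing quote.
                present[substr($0, RSTART + 4, RLENGTH - 5)] = 1
            }
        }
        # "--" lines record a location constraint being deleted from the CIB.
        /info: -- .*rsc_location\[@id=.cli-ban-/ {
            if (match($0, /@id=.cli-ban-[^]]*/)) {
                # Skip the leading @id= plus quote (5 chars) and the closing quote.
                delete present[substr($0, RSTART + 5, RLENGTH - 6)]
            }
        }
        END { for (id in present) print id }
    ' | sort
}
```

On a live cluster, `pcs constraint` lists the constraints currently in place, and a stale ban can be removed with `pcs resource clear <resource> <node>`.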

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/victoria)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/785205
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/de98fdb208dac1cf01d474c4363f6e2177e22c7e
Submitter: "Zuul (22348)"
Branch: stable/victoria

commit de98fdb208dac1cf01d474c4363f6e2177e22c7e
Author: Damien Ciabrini <email address hidden>
Date: Thu Mar 25 10:23:48 2021 +0100

    HA: fix race when moving VIP during minor update

    During a minor update of a controller node, we list the
    VIP hosted on the node and we force them to move away
    with a pcs command.

    The way we call the pcs command is racy, so change the
    action to make sure the VIP move is performed only if
    the VIP is still hosted on the node when we run the action.

    Change-Id: Id379f4fe1668d01fdd5f91b46e2f75d7cdb577ae
    Closes-Bug: #1921351
    (cherry picked from commit 3da1e7661036a73c9f222bcdcc489912502d23cb)
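
The idea in this commit message can be sketched as follows. This is an illustration under stated assumptions, not the code from the patch: `vip_location` and `ban_vip_on` are hypothetical stand-ins for cluster commands (on a real cluster they might wrap something like `crm_resource --locate` and `pcs resource ban <resource> <node>`), and here they only manipulate shell variables so the logic can be run without a cluster.

```shell
#!/bin/sh
# Simulated cluster state, for illustration only.
VIP_NODE="controller-2"   # node currently hosting the VIP
BANS=""                   # bans recorded so far, as " <vip>@<node>" entries

# Hypothetical stand-in for a cluster query: print the node hosting the VIP.
vip_location() { echo "$VIP_NODE"; }

# Hypothetical stand-in for a node-specific ban: record the vip/node pair.
ban_vip_on() { BANS="$BANS $1@$2"; }

# Move a VIP away from a node only if it is still hosted there at the
# time the action runs, and always ban the explicitly named node.
move_vip_away() {
    vip="$1"
    node="$2"
    if [ "$(vip_location "$vip")" = "$node" ]; then
        ban_vip_on "$vip" "$node"
    else
        echo "skipping $vip: no longer hosted on $node"
    fi
}
```

With a node-specific ban, naming the node matters as much as the check: even if the VIP moves between the check and the action, the constraint lands on the node being updated rather than on whichever node happens to host the VIP at that moment.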

tags: added: in-stable-victoria
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/788601
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/eafdee6aedaab6d210cd29970b08932f48221495
Submitter: "Zuul (22348)"
Branch: stable/ussuri

commit eafdee6aedaab6d210cd29970b08932f48221495
Author: Damien Ciabrini <email address hidden>
Date: Thu Mar 25 10:23:48 2021 +0100

    HA: fix race when moving VIP during minor update

    During a minor update of a controller node, we list the
    VIP hosted on the node and we force them to move away
    with a pcs command.

    The way we call the pcs command is racy, so change the
    action to make sure the VIP move is performed only if
    the VIP is still hosted on the node when we run the action.

    Change-Id: Id379f4fe1668d01fdd5f91b46e2f75d7cdb577ae
    Closes-Bug: #1921351
    (cherry picked from commit 3da1e7661036a73c9f222bcdcc489912502d23cb)
    (cherry picked from commit de98fdb208dac1cf01d474c4363f6e2177e22c7e)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/c/openstack/tripleo-heat-templates/+/782931
Committed: https://opendev.org/openstack/tripleo-heat-templates/commit/2416eb3b145bb2d0b2e93ec2bbf1ebf0f95abadb
Submitter: "Zuul (22348)"
Branch: stable/train

commit 2416eb3b145bb2d0b2e93ec2bbf1ebf0f95abadb
Author: Damien Ciabrini <email address hidden>
Date: Thu Mar 25 10:23:48 2021 +0100

    HA: fix race when moving VIP during minor update

    During a minor update of a controller node, we list the
    VIP hosted on the node and we force them to move away
    with a pcs command.

    The way we call the pcs command is racy, so change the
    action to make sure the VIP move is performed only if
    the VIP is still hosted on the node when we run the action.

    Change-Id: Id379f4fe1668d01fdd5f91b46e2f75d7cdb577ae
    Closes-Bug: #1921351
    (cherry picked from commit 3da1e7661036a73c9f222bcdcc489912502d23cb)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 14.1.0

This issue was fixed in the openstack/tripleo-heat-templates 14.1.0 release.

Changed in tripleo:
milestone: wallaby-rc1 → xena-1
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 13.3.0

This issue was fixed in the openstack/tripleo-heat-templates 13.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.6.0

This issue was fixed in the openstack/tripleo-heat-templates 11.6.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.4.4

This issue was fixed in the openstack/tripleo-heat-templates 12.4.4 release.

Changed in tripleo:
milestone: xena-1 → xena-2
Changed in tripleo:
milestone: xena-2 → xena-3