galera resource replicas don't start properly during scale up

Bug #1892530 reported by Damien Ciabrini
Affects: tripleo
Importance: Medium
Assigned to: Damien Ciabrini

Bug Description

When scaling up the overcloud, each time a node is added, two pacemaker
resources are reconfigured to update the number of galera replicas and
an internal galera<->pacemaker name mapping.

Currently, the number of replicas is updated before the name mapping,
so there's a time window where pacemaker is allowed to start new galera
replicas but the resource agent cannot do so because the name mapping
is missing. This results in errors in the cluster that shouldn't be
there, even if the cluster eventually recovers:

Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op_failure) warning: Unexpected result (not configured: Could not determine galera name from pacemaker node <controller-1>.) was recorded for start of galera:1 on galera-bundle-1 at Aug 20 18:25:58 2020 | rc=6 id=galera_last_failure_0
Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op) error: Preventing galera-bundle-master from restarting anywhere because of fatal failure (not configured: Could not determine galera name from pacemaker node <controller-1>.) | rc=6 id=galera_last_failure_0
Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op_failure) warning: Unexpected result (not configured: Could not determine galera name from pacemaker node <controller-1>.) was recorded for start of galera:1 on galera-bundle-1 at Aug 20 18:25:58 2020 | rc=6 id=galera_last_0
Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op) error: Preventing galera-bundle-master from restarting anywhere because of fatal failure (not configured: Could not determine galera name from pacemaker node <controller-1>.) | rc=6 id=galera_last_0
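
The window described above can be illustrated with a small Python model. This is only a sketch, not the resource agent or pacemaker code: `ClusterConfig`, `try_start_replica`, `scale_up`, and the galera names are hypothetical. The model assumes that a start is attempted as soon as the replica count grows, and that it fails when the node has no entry in the name mapping.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    """Toy stand-in for the two settings reconfigured on scale up."""
    replicas: int = 1
    # pacemaker node name -> galera node name (hypothetical representation)
    name_map: dict = field(default_factory=dict)


def try_start_replica(cfg, node):
    """Mimic the resource agent: a start needs a galera name for the node."""
    if node not in cfg.name_map:
        return ("not configured: Could not determine galera name "
                f"from pacemaker node <{node}>.")
    return "ok"


def scale_up(cfg, node, galera_name, mapping_first):
    """Apply both updates in the chosen order and collect the result of
    every start attempted right after the replica count grows."""
    def bump_replicas():
        cfg.replicas += 1

    def add_mapping():
        cfg.name_map[node] = galera_name

    if mapping_first:
        steps = [add_mapping, bump_replicas]
    else:
        steps = [bump_replicas, add_mapping]

    results = []
    for step in steps:
        before = cfg.replicas
        step()
        # In this model a new replica may be started as soon as the count grows.
        if cfg.replicas > before:
            results.append(try_start_replica(cfg, node))
    return results


# Replica count first: the start lands in the window and fails.
buggy = scale_up(ClusterConfig(1, {"controller-0": "galera-0"}),
                 "controller-1", "galera-1", mapping_first=False)

# Name mapping first: the start finds its galera name.
fixed = scale_up(ClusterConfig(1, {"controller-0": "galera-0"}),
                 "controller-1", "galera-1", mapping_first=True)
```

In this model the only difference between the two runs is the order of the updates, which is what opens or closes the window.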

summary: - galera resource don't start properly during scale up
+ galera resource replicas don't start properly during scale up
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/747150
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=16a6ba465d420b23da77bab2f64286037d1ced37
Submitter: Zuul
Branch: master

commit 16a6ba465d420b23da77bab2f64286037d1ced37
Author: Damien Ciabrini <email address hidden>
Date: Thu Aug 20 14:06:13 2020 +0200

    HA: ensure scaling up galera does not cause promotion errors

    During scale up, two galera resources are being updated in the
    pacemaker cluster. Force a specific ordering in puppet to make
    sure the galera resource agent always picks up the up-to-date
    config when it starts new replicas.

    Closes-Bug: #1892530

    Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
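
The ordering constraint mentioned in the commit message can be pictured as a dependency between the two updates. The sketch below is not the puppet-tripleo change itself: the step names are made up for illustration, Python's standard `graphlib` stands in for Puppet's ordering relationships, and it assumes the intended order is name mapping first, then replica count.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key lists the steps that must run before it.
dependencies = {
    "update_name_mapping": set(),
    "update_replica_count": {"update_name_mapping"},
}

# static_order() yields every step after all of its prerequisites.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Declaring the dependency explicitly means the order no longer depends on how the two updates happen to be evaluated.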

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/747851

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/747852

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/ussuri)

Reviewed: https://review.opendev.org/747851
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e75842e53eb4563734b55055384018b77d22072d
Submitter: Zuul
Branch: stable/ussuri

commit e75842e53eb4563734b55055384018b77d22072d
Author: Damien Ciabrini <email address hidden>
Date: Thu Aug 20 14:06:13 2020 +0200

    HA: ensure scaling up galera does not cause promotion errors

    During scale up, two galera resources are being updated in the
    pacemaker cluster. Force a specific ordering in puppet to make
    sure the galera resource agent always picks up the up-to-date
    config when it starts new replicas.

    Closes-Bug: #1892530

    Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
    (cherry picked from commit 16a6ba465d420b23da77bab2f64286037d1ced37)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/747852
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=48f35b5274547e3844c62031b1ae68b2fbc75989
Submitter: Zuul
Branch: stable/train

commit 48f35b5274547e3844c62031b1ae68b2fbc75989
Author: Damien Ciabrini <email address hidden>
Date: Thu Aug 20 14:06:13 2020 +0200

    HA: ensure scaling up galera does not cause promotion errors

    During scale up, two galera resources are being updated in the
    pacemaker cluster. Force a specific ordering in puppet to make
    sure the galera resource agent always picks up the up-to-date
    config when it starts new replicas.

    Closes-Bug: #1892530

    Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
    (cherry picked from commit 16a6ba465d420b23da77bab2f64286037d1ced37)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.5.0

This issue was fixed in the openstack/puppet-tripleo 11.5.0 release.
