tripleo

galera resource replicas don't start properly during scale up

Bug #1892530 reported by Damien Ciabrini on 2020-08-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Medium	Damien Ciabrini	tripleo victoria-3 "tripleo victoria"

Bug Description

When scaling up the overcloud, each time a node is added, two pacemaker
resources are reconfigured to update the number of galera replicas and
an internal galera<->pacemaker name mapping.

Currently, the number of replicas is updated before the name mapping,
so there is a time window in which pacemaker is allowed to start new galera
replicas but the resource agent cannot do so because the name mapping is
still missing. This results in errors in the cluster that shouldn't
be there, even if the cluster eventually recovers:

Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op_failure) warning: Unexpected result (not configured: Could not determine galera name from pacemaker node <controller-1>.) was recorded for start of galera:1 on galera-bundle-1 at Aug 20 18:25:58 2020 | rc=6 id=galera_last_failure_0
Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op) error: Preventing galera-bundle-master from restarting anywhere because of fatal failure (not configured: Could not determine galera name from pacemaker node <controller-1>.) | rc=6 id=galera_last_failure_0
Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op_failure) warning: Unexpected result (not configured: Could not determine galera name from pacemaker node <controller-1>.) was recorded for start of galera:1 on galera-bundle-1 at Aug 20 18:25:58 2020 | rc=6 id=galera_last_0
Aug 20 18:25:58 controller-0 pacemaker-schedulerd[64083] (unpack_rsc_op) error: Preventing galera-bundle-master from restarting anywhere because of fatal failure (not configured: Could not determine galera name from pacemaker node <controller-1>.) | rc=6 id=galera_last_0
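
The time window described above can be illustrated with a toy model. This is only a sketch and not the actual pacemaker or resource agent logic: the `Cluster` class, the `scale_up` helper and the `*.internalapi` galera names are hypothetical and exist only to show why the order of the two updates matters.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy model of the two pieces of configuration updated during scale up."""
    replicas: int = 1
    # pacemaker node name -> galera node name (hypothetical values)
    name_map: dict = field(
        default_factory=lambda: {"controller-0": "controller-0.internalapi"}
    )
    errors: list = field(default_factory=list)

    def reconcile(self, nodes):
        """Try to start a replica on every node allowed by the replica count."""
        for node in nodes[: self.replicas]:
            if node not in self.name_map:
                self.errors.append(
                    f"Could not determine galera name from pacemaker node <{node}>."
                )


def scale_up(cluster, nodes, new_node, galera_name, mapping_first):
    """Apply both updates in the chosen order, reconciling after each one."""
    def update_mapping():
        cluster.name_map[new_node] = galera_name

    def update_replicas():
        cluster.replicas += 1

    if mapping_first:
        steps = [update_mapping, update_replicas]
    else:
        steps = [update_replicas, update_mapping]
    for step in steps:
        step()
        cluster.reconcile(nodes)  # the cluster may act between the two updates
    return cluster.errors


nodes = ["controller-0", "controller-1"]
bad = scale_up(Cluster(), nodes, "controller-1", "controller-1.internalapi", False)
good = scale_up(Cluster(), nodes, "controller-1", "controller-1.internalapi", True)
```

In this model, updating the replica count first records one error for controller-1 before the mapping arrives, while updating the mapping first records none.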

Tags:

Damien Ciabrini (dciabrin) on 2020-08-21

summary:

- galera resource don't start properly during scale up
+ galera resource replicas don't start properly during scale up

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2020-08-24: Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/747150
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=16a6ba465d420b23da77bab2f64286037d1ced37
Submitter: Zuul
Branch: master

commit 16a6ba465d420b23da77bab2f64286037d1ced37
Author: Damien Ciabrini <email address hidden>
Date: Thu Aug 20 14:06:13 2020 +0200

HA: ensure scaling up galera does not cause promotion errors

    During scale up, two galera resources are being updated in the
    pacemaker cluster. Force a specific ordering in puppet to make
    sure the galera resource agent always picks up the up-to-date
    config when it starts new replicas.

Closes-Bug: #1892530

Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
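
The ordering constraint described in this commit message can be pictured as a small dependency graph. The sketch below is not the puppet change itself; the step names are hypothetical and the standard-library `graphlib` module (Python 3.9+) merely stands in for a declared ordering between the two resource updates.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each key maps to the set of steps that must run first.
dependencies = {
    "update_replica_count": {"update_name_mapping"},
    "update_name_mapping": set(),
}


def ordered_steps(deps):
    """Return the steps in an order that respects every declared dependency."""
    return list(TopologicalSorter(deps).static_order())


steps = ordered_steps(dependencies)
```

With the dependency declared this way, the name mapping update always precedes the replica count update, regardless of the order in which the steps are listed.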

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-08-25: Fix proposed to puppet-tripleo (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/747851

OpenStack Infra (hudson-openstack) wrote on 2020-08-25: Fix proposed to puppet-tripleo (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/747852

OpenStack Infra (hudson-openstack) wrote on 2020-08-26: Fix merged to puppet-tripleo (stable/ussuri)

Reviewed: https://review.opendev.org/747851
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=e75842e53eb4563734b55055384018b77d22072d
Submitter: Zuul
Branch: stable/ussuri

commit e75842e53eb4563734b55055384018b77d22072d
Author: Damien Ciabrini <email address hidden>
Date: Thu Aug 20 14:06:13 2020 +0200

HA: ensure scaling up galera does not cause promotion errors

Closes-Bug: #1892530

Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
(cherry picked from commit 16a6ba465d420b23da77bab2f64286037d1ced37)

tags:

added: in-stable-ussuri

OpenStack Infra (hudson-openstack) wrote on 2020-08-27: Fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/747852
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=48f35b5274547e3844c62031b1ae68b2fbc75989
Submitter: Zuul
Branch: stable/train

commit 48f35b5274547e3844c62031b1ae68b2fbc75989
Author: Damien Ciabrini <email address hidden>
Date: Thu Aug 20 14:06:13 2020 +0200

HA: ensure scaling up galera does not cause promotion errors

Closes-Bug: #1892530

Change-Id: Id40ac8c10fd0348ce4fd99ce319dab933312acfa
(cherry picked from commit 16a6ba465d420b23da77bab2f64286037d1ced37)

tags:

added: in-stable-train

OpenStack Infra (hudson-openstack) wrote on 2021-02-08: Fix included in openstack/puppet-tripleo 11.5.0

This issue was fixed in the openstack/puppet-tripleo 11.5.0 release.

This report contains Public information

Everyone can see this information.