tripleo

HA overcloud, controller replacement or node scale up broken with pcs 0.10+

Bug #1839209 reported by Damien Ciabrini on 2019-08-06

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
tripleo	In Progress	High	Unassigned	victoria-3

Bug Description

Since Stein and RHEL/CentOS 8, the code path that handles the replacement of a controller node in a pacemaker cluster appears to be broken. The same applies to automatic scale-up of the control plane.

When redeploying a stack with a list of new controllers to add to the cluster, the deployment times out and the new controllers are never added to the cluster. Inspecting the journal shows that the puppet run on the bootstrap node fails when it tries to add the new nodes:

# journalctl -t puppet-user
[...]
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: Host 'controller-3' is not known to pcs, try>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: None of hosts is known to pcs.
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unabl>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Error: '/sbin/pcs cluster node add controller-3 addr=172.17.1.148 --start --wait' returned 1 instead of one of [0]
Aug 06 12:08:22 controller-0 puppet-user[595062]: Error: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: change from 'notrun' to ['0'] failed: '/sbin/pcs clu>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-controller-3]: Dependency Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster] has failur>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Warning: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-controller-3]: Skipping because of failed dependencies
Aug 06 12:08:23 controller-0 puppet-user[595062]: Notice: Applied catalog in 222.91 seconds

The untruncated error message looks something like:

Error: Host 'vm3' is not known to pcs, try to authenticate the host using 'pcs host auth vm3' command
Error: None of hosts is known to pcs.
Error: Errors have occurred, therefore pcs is unable to continue

Since Stein and RHEL/CentOS 8, the pacemaker cluster that makes up the HA control plane is configured with pcs 0.10, which introduces a breaking change: before adding a node to the cluster, we must now explicitly authenticate that node to every pcsd instance.
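
For illustration, the two-step sequence that pcs 0.10 expects could be sketched as below. This is a minimal sketch, not necessarily how the puppet-pacemaker fix is implemented. The node name and address are taken from the log above; the function name `add_cluster_node`, the helper `run`, and the variables `DRY_RUN`, `PCS_USER` and `PCS_PASSWORD` are hypothetical names introduced for this example, and `hacluster` is assumed as the pcsd user.

```shell
#!/bin/bash
# Sketch only: authenticate a new node with pcs, then add it to the cluster.
# DRY_RUN, PCS_USER and PCS_PASSWORD are hypothetical variables for this example.

# Run a command, or only print it when DRY_RUN=1.
# Note: in dry-run mode the password is printed as well.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

add_cluster_node() {
    local node="$1" addr="$2"

    # Step 1 (required since pcs 0.10): make the new host known to pcs.
    run pcs host auth "$node" "addr=$addr" \
        -u "${PCS_USER:-hacluster}" -p "${PCS_PASSWORD:-}" || return 1

    # Step 2: the command from the failing puppet run in the log above.
    run pcs cluster node add "$node" "addr=$addr" --start --wait
}
```

Calling `add_cluster_node controller-3 172.17.1.148` with `DRY_RUN=1` set prints the two commands without touching the cluster, which makes it possible to check the ordering before running it on the bootstrap node.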

Alex Schultz (alex-schultz) on 2019-09-11

Changed in tripleo:
milestone:	train-3 → ussuri-1

OpenStack Infra (hudson-openstack) wrote on 2019-10-22: Fix included in openstack/puppet-pacemaker 0.8.0

This issue was fixed in the openstack/puppet-pacemaker 0.8.0 release.

Emilien Macchi (emilienm) on 2019-12-19

Changed in tripleo:
milestone:	ussuri-1 → ussuri-2

wes hayutin (weshayutin) on 2020-02-10

Changed in tripleo:
milestone:	ussuri-2 → ussuri-3

wes hayutin (weshayutin) on 2020-04-13

Changed in tripleo:
milestone:	ussuri-3 → ussuri-rc3

wes hayutin (weshayutin) on 2020-05-26

Changed in tripleo:
milestone:	ussuri-rc3 → victoria-1

Emilien Macchi (emilienm) on 2020-07-28

Changed in tripleo:
milestone:	victoria-1 → victoria-3
