HA overcloud, controller replacement or node scale up broken with pcs 0.10+

Bug #1839209 reported by Damien Ciabrini
Affects: tripleo
Status: In Progress
Importance: High
Assigned to: Unassigned

Bug Description

Since Stein and RHEL/CentOS 8, the code path that handles the replacement of a controller node in a pacemaker cluster seems to be broken. The same applies to automatic scale-up of the control plane.

When redeploying a stack with a list of new controllers to add to the cluster, the deployment times out and the new controllers are never added to the cluster. Inspecting the journal shows that the puppet run on the bootstrap node fails when it tries to add the new nodes:

# journalctl -t puppet-user
[...]
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: Host 'controller-3' is not known to pcs, try>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: None of hosts is known to pcs.
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unabl>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Error: '/sbin/pcs cluster node add controller-3 addr=172.17.1.148 --start --wait' returned 1 instead of one of [0]
Aug 06 12:08:22 controller-0 puppet-user[595062]: Error: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster]/returns: change from 'notrun' to ['0'] failed: '/sbin/pcs clu>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Notice: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-controller-3]: Dependency Exec[Adding Cluster node: controller-3 addr=172.17.1.148 to Cluster tripleo_cluster] has failur>
Aug 06 12:08:22 controller-0 puppet-user[595062]: Warning: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-controller-3]: Skipping because of failed dependencies
Aug 06 12:08:23 controller-0 puppet-user[595062]: Notice: Applied catalog in 222.91 seconds

The untruncated error message looks something like:

Error: Host 'vm3' is not known to pcs, try to authenticate the host using 'pcs host auth vm3' command
Error: None of hosts is known to pcs.
Error: Errors have occurred, therefore pcs is unable to continue

Since Stein and RHEL/CentOS 8, the pacemaker cluster that composes the HA control plane is configured with pcs 0.10, which introduces a breaking change: before adding a node to the cluster, we must now explicitly authenticate the node to all the pcsd daemons.
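
For illustration, the two-step sequence that pcs 0.10 expects could be sketched as below. This is not the fix that landed in puppet-pacemaker, only a manual equivalent under stated assumptions: the helper name add_node_cmds is hypothetical, the node name and address are taken from the log above, and hacluster is assumed to be the pcsd user. The helper only prints the commands so they can be reviewed before being run on the bootstrap node.

```shell
#!/bin/sh
# Sketch only: print the two pcs steps needed to add a node under pcs 0.10+.
# add_node_cmds is a hypothetical helper; the "hacluster" user is an assumption.

add_node_cmds() {
    node="$1"
    addr="$2"
    # Step 1: authenticate the new host to pcsd (required since pcs 0.10).
    # No -p is given, so pcs would prompt for the password when run.
    echo "pcs host auth ${node} addr=${addr} -u hacluster"
    # Step 2: the same node-add command that the puppet run attempted.
    echo "pcs cluster node add ${node} addr=${addr} --start --wait"
}

# Print the commands for the node from the journal excerpt.
add_node_cmds controller-3 172.17.1.148
```

Running the printed commands in order on the bootstrap node should avoid the "not known to pcs" error, since the new host is authenticated before the node add is attempted.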

Changed in tripleo:
milestone: train-3 → ussuri-1
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-pacemaker 0.8.0

This issue was fixed in the openstack/puppet-pacemaker 0.8.0 release.

Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3