FFU: HA controller scale-up logic fails to match regex

Bug #1950294 reported by Damien Ciabrini
This bug affects 1 person
Affects   Status        Importance   Assigned to       Milestone
tripleo   In Progress   Low          Damien Ciabrini

Bug Description

During FFU to Train, a new pacemaker cluster is recreated node by node. puppet-pacemaker handles the logic that detects whether a new node is being added to the pacemaker cluster, and there is a safeguard check to make sure the scale-up is idempotent:

          exec {"Adding Cluster node: ${node_to_add} to Cluster ${cluster_name}":
            unless => "${::pacemaker::pcs_bin} status 2>&1 | grep -e \"^Online:.* ${node_name} .*\"",
            command => "${::pacemaker::pcs_bin} cluster node add ${node_to_add} ${node_add_start_part} --wait",
            timeout => $cluster_start_timeout,
            tries => $cluster_start_tries,
            try_sleep => $cluster_start_try_sleep,
            notify => Exec["node-cluster-start-${node_name}"],
            tag => 'pacemaker-scaleup',
          }
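
For illustration, on a controller the guard expands to a shell pipeline along the lines of the sketch below. The sample "pcs status" output is an assumption about the pacemaker 2 formatting found on Train, not output captured from this deployment:

          # Hypothetical "pcs status" excerpt on pacemaker 2: the "Online:" line
          # is indented, so it no longer starts at the beginning of the line.
          #     * Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

          # Guard as currently generated: the "^" anchor prevents a match against
          # the indented line, so the exec runs even though the node is online.
          /sbin/pcs status 2>&1 | grep -e "^Online:.* overcloud-controller-2 .*"

          # Without the anchor the indented line matches and the exec is skipped.
          /sbin/pcs status 2>&1 | grep -e "Online:.* overcloud-controller-2 .*"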

However, the regex used in the "unless" attribute fails to match the node name due to an unnecessary beginning-of-line anchor. Consequently, under specific circumstances, puppet-pacemaker may try to add a node to the cluster even though it is already present (a relaxed guard is sketched after the log excerpt below):

<13>Oct 20 23:44:22 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Authenticating new cluster node: overcloud-controller-2 addr=10.1.0.22]/returns: executed successfully
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Node name 'overcloud-controller-2' is already used by existing nodes; please, use other name
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Node address '10.1.0.22' is already used by existing nodes; please, use other address
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: Running cluster services: 'corosync', 'pacemaker', the host seems to be in a cluster already, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: overcloud-controller-2: Cluster configuration files found, the host seems to be in a cluster already, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Some nodes are already in a cluster. Enforcing this will destroy existing cluster on those nodes. You should remove the nodes from their clusters instead to keep the clusters working properly, use --force to override
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: Error: Errors have occurred, therefore pcs is unable to continue
<13>Oct 20 23:48:23 puppet-user: Error: '/sbin/pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait' returned 1 instead of one of [0]
<13>Oct 20 23:48:23 puppet-user: Error: /Stage[main]/Pacemaker::Corosync/Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster]/returns: change from 'notrun' to ['0'] failed: '/sbin/pcs cluster node add overcloud-controller-2 addr=10.1.0.22 --start --wait' returned 1 instead of one of [0]
<13>Oct 20 23:48:23 puppet-user: Notice: /Stage[main]/Pacemaker::Corosync/Exec[node-cluster-start-overcloud-controller-2]: Dependency Exec[Adding Cluster node: overcloud-controller-2 addr=10.1.0.22 to Cluster tripleo_cluster] has failures: true
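
A minimal sketch of a relaxed guard, with the beginning-of-line anchor dropped; this is an illustration only and not necessarily the exact change released in openstack/puppet-pacemaker 1.4.0:

          exec {"Adding Cluster node: ${node_to_add} to Cluster ${cluster_name}":
            # Same resource as above, but the "unless" regex no longer anchors
            # "Online:" to the start of the line, so an indented status line
            # still marks the node as already present and the add is skipped.
            unless => "${::pacemaker::pcs_bin} status 2>&1 | grep -e \"Online:.* ${node_name} .*\"",
            command => "${::pacemaker::pcs_bin} cluster node add ${node_to_add} ${node_add_start_part} --wait",
            timeout => $cluster_start_timeout,
            tries => $cluster_start_tries,
            try_sleep => $cluster_start_try_sleep,
            notify => Exec["node-cluster-start-${node_name}"],
            tag => 'pacemaker-scaleup',
          }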

Changed in tripleo:
importance: Undecided → Low
OpenStack Infra (hudson-openstack) wrote: Fix included in openstack/puppet-pacemaker 1.4.0

This issue was fixed in the openstack/puppet-pacemaker 1.4.0 release.
