[Rocky to Stein] Galera bundle failing during the upgrade with: Could not determine galera name from pacemaker node <controller-2>.

Bug #1859961 reported by Jose Luis Franco on 2020-01-16
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Damien Ciabrini

Bug Description

This Launchpad bug is based on Red Hat Bugzilla https://bugzilla.redhat.com/show_bug.cgi?id=1791675:

When upgrading from OSP14 to OSP15, we start with the first controller: we upgrade its OS and create a new cluster containing only that node. The remaining controllers then join the cluster.
While upgrading the second controller, pcs status shows that the galera-bundle replica on the newly added node is stopped, and an error appears:

Online: [ controller-0 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-2 ]

Full list of resources:

 Container bundle set: galera-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Stopped controller-2
 Container bundle set: rabbitmq-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
 Container bundle set: redis-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1 (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.21 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.16 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.12 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.4.26 (ocf::heartbeat:IPaddr2): Started controller-0
 Container bundle set: haproxy-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
   haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-2
   haproxy-bundle-podman-2 (ocf::heartbeat:podman): Stopped
 Container bundle: openstack-cinder-backup [192.168.24.1:8787/rh-osbs/rhosp15-openstack-cinder-backup:pcmklatest]
   openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started controller-0
 Container bundle: openstack-cinder-volume [192.168.24.1:8787/rh-osbs/rhosp15-openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-0

Failed Resource Actions:
* galera_start_0 on galera-bundle-1 'not configured' (6): call=39, status=complete, exitreason='Could not determine galera name from pacemaker node <controller-2>.',
    last-rc-change='Thu Jan 16 07:37:41 2020', queued=0ms, exec=91ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

The cause seems to be the cluster_host_map parameter of the galera pacemaker resource, which is generated incorrectly:

When running the first controller upgrade (controller-0):

2020-01-16 01:07:40 | "Debug: try 1/10: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20200116-9-w2c054 resource create galera ocf:heartbeat:galera log='/var/log/mysql/mysqld.log' additional_parameters='--open-files-limit=16384' enable_creation=true wsrep_cluster_address='gcomm://controller-0.internalapi.redhat.local' cluster_host_map='controller-0:controller-0.internalapi.redhat.local;controller-1:;controller-2:' meta master-max=1 ordered=true container-attribute-target=host op promote timeout=300s on-fail=block bundle galera-bundle",

However, when running the deploy steps for the second controller (controller-2):

2020-01-16 02:42:10 | "Debug: pcs_offline: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20200116-9-1366jzl resource update galera ocf:heartbeat:galera log='/var/log/mysql/mysqld.log' additional_parameters='--open-files-limit=16384' enable_creation=true wsrep_cluster_address='gcomm://controller-0.internalapi.redhat.local,controller-2.internalapi.redhat.local' cluster_host_map='controller-0:controller-0.internalapi.redhat.local;controller-1:controller-2.internalapi.redhat.local;controller-2:' meta master-max=2 ordered=true container-attribute-target=host op promote timeout=300s on-fail=block bundle galera-bundle. Output: ",

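The cluster_host_map wrongly maps controller-1 to controller-2.internalapi.redhat.local and leaves controller-2 with an empty value, so the galera resource agent cannot determine the galera name for controller-2.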
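
To make the failure concrete, here is a minimal Python sketch of the lookup. It is only an illustration: the real galera resource agent is a shell script, and the function names parse_cluster_host_map and galera_name_for are hypothetical. The map string is copied from the second log line above, and the error text from the pcs status output.

```python
def parse_cluster_host_map(raw):
    """Split 'node:galera_name;node:galera_name' into a dict (hypothetical helper)."""
    mapping = {}
    for entry in raw.split(";"):
        if not entry:
            continue
        # Everything before the first ':' is the pacemaker node name.
        node, _, galera_name = entry.partition(":")
        mapping[node] = galera_name
    return mapping


def galera_name_for(node, raw_map):
    """Return the galera name for a pacemaker node, or fail like the agent does."""
    name = parse_cluster_host_map(raw_map).get(node, "")
    if not name:
        raise LookupError(
            f"Could not determine galera name from pacemaker node <{node}>."
        )
    return name


# cluster_host_map as generated during the controller-2 deploy steps.
BROKEN_MAP = (
    "controller-0:controller-0.internalapi.redhat.local;"
    "controller-1:controller-2.internalapi.redhat.local;"
    "controller-2:"
)

if __name__ == "__main__":
    print(galera_name_for("controller-0", BROKEN_MAP))
    try:
        galera_name_for("controller-2", BROKEN_MAP)
    except LookupError as err:
        print(err)
```

The controller-0 lookup succeeds, while controller-2 has an empty value and hits the error path, which matches the failed galera_start_0 action.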

Fix proposed to branch: master
Review: https://review.opendev.org/702851

Changed in tripleo:
status: Triaged → In Progress

Reviewed: https://review.opendev.org/702851
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=0a64eebb6454483e823c4cf12c55832935c2319f
Submitter: Zuul
Branch: master

commit 0a64eebb6454483e823c4cf12c55832935c2319f
Author: Damien Ciabrini <email address hidden>
Date: Thu Jan 16 12:57:10 2020 +0100

    HA: Honour all hiera override variables in mysql_bundle

    During a major upgrade, upgrade tasks can rebuild a new pacemaker
    cluster by adding nodes one at a time. This is implemented by
    using two special hiera variables mysql_node_names_override and
    mysql_short_node_names_override.

    Make sure the mysql_bundle puppet module uses both variables
    when such cluster rebuild is in progress.

    Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
    Closes-Bug: #1859961
    Co-Authored-By: Jose Luis Franco Arza <email address hidden>
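
As a rough illustration of what the commit message describes, the Python sketch below pairs a list of short node names with a list of FQDNs to build a cluster_host_map string. This is not the puppet-tripleo code: build_cluster_host_map is a hypothetical helper, and the assumption that only the FQDN list was overridden before the fix is inferred from the logs in the bug description, not confirmed by the source.

```python
from itertools import zip_longest


def build_cluster_host_map(short_names, fqdns):
    """Pair short names and FQDNs positionally (hypothetical helper)."""
    # Missing entries on either side become empty strings, as in the logs.
    pairs = zip_longest(short_names, fqdns, fillvalue="")
    return ";".join(f"{short}:{fqdn}" for short, fqdn in pairs)


# Assumed pre-fix inputs: full short-name list, overridden FQDN list.
all_short_names = ["controller-0", "controller-1", "controller-2"]
override_fqdns = [
    "controller-0.internalapi.redhat.local",
    "controller-2.internalapi.redhat.local",
]
# Assumed post-fix inputs: both lists come from the override variables.
override_short_names = ["controller-0", "controller-2"]

if __name__ == "__main__":
    # Lists of different lengths shift controller-2's FQDN onto controller-1.
    print(build_cluster_host_map(all_short_names, override_fqdns))
    # Lists built from the same set of nodes line up correctly.
    print(build_cluster_host_map(override_short_names, override_fqdns))
```

Under these assumptions, the first call reproduces the broken map seen in the controller-2 log line, and the second yields one correct entry per node that has already joined the rebuilt cluster.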

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/703029
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=450c23519480e10912677505c59ef7a9a1ea6b58
Submitter: Zuul
Branch: stable/stein

commit 450c23519480e10912677505c59ef7a9a1ea6b58
Author: Damien Ciabrini <email address hidden>
Date: Thu Jan 16 12:57:10 2020 +0100

    HA: Honour all hiera override variables in mysql_bundle

    During a major upgrade, upgrade tasks can rebuild a new pacemaker
    cluster by adding nodes one at a time. This is implemented by
    using two special hiera variables mysql_node_names_override and
    mysql_short_node_names_override.

    Make sure the mysql_bundle puppet module uses both variables
    when such cluster rebuild is in progress.

    Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
    Closes-Bug: #1859961
    Co-Authored-By: Jose Luis Franco Arza <email address hidden>
    (cherry picked from commit 0a64eebb6454483e823c4cf12c55832935c2319f)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/703016
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=7aec7fc5cf422da88d682beff8102e52c532baf7
Submitter: Zuul
Branch: stable/train

commit 7aec7fc5cf422da88d682beff8102e52c532baf7
Author: Damien Ciabrini <email address hidden>
Date: Thu Jan 16 12:57:10 2020 +0100

    HA: Honour all hiera override variables in mysql_bundle

    During a major upgrade, upgrade tasks can rebuild a new pacemaker
    cluster by adding nodes one at a time. This is implemented by
    using two special hiera variables mysql_node_names_override and
    mysql_short_node_names_override.

    Make sure the mysql_bundle puppet module uses both variables
    when such cluster rebuild is in progress.

    Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
    Closes-Bug: #1859961
    Co-Authored-By: Jose Luis Franco Arza <email address hidden>
    (cherry picked from commit 0a64eebb6454483e823c4cf12c55832935c2319f)

tags: added: in-stable-train

This issue was fixed in the openstack/puppet-tripleo 12.1.0 release.
