[Rocky to Stein] Galera bundle failing during the upgrade with: Could not determine galera name from pacemaker node <controller-2>.

Bug #1859961 reported by Jose Luis Franco
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Damien Ciabrini

Bug Description

This Launchpad bug is based on BZ https://bugzilla.redhat.com/show_bug.cgi?id=1791675:

When upgrading from OSP14 to OSP15, we start with the first controller: we upgrade its OS and create a new cluster containing only that node. The remaining controllers then join the cluster one at a time.
While upgrading the second controller, pcs status shows that the galera-bundle replica on the newly added node is stopped, and an error is reported:

Online: [ controller-0 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-0 galera-bundle-1@controller-2 rabbitmq-bundle-0@controller-0 rabbitmq-bundle-1@controller-2 redis-bundle-0@controller-0 redis-bundle-1@controller-2 ]

Full list of resources:

 Container bundle set: galera-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-mariadb:pcmklatest]
   galera-bundle-0 (ocf::heartbeat:galera): Master controller-0
   galera-bundle-1 (ocf::heartbeat:galera): Stopped controller-2
 Container bundle set: rabbitmq-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0 (ocf::heartbeat:rabbitmq-cluster): Started controller-0
   rabbitmq-bundle-1 (ocf::heartbeat:rabbitmq-cluster): Started controller-2
 Container bundle set: redis-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-redis:pcmklatest]
   redis-bundle-0 (ocf::heartbeat:redis): Master controller-0
   redis-bundle-1 (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.21 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-10.0.0.101 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.10 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.1.16 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.3.12 (ocf::heartbeat:IPaddr2): Started controller-0
 ip-172.17.4.26 (ocf::heartbeat:IPaddr2): Started controller-0
 Container bundle set: haproxy-bundle [192.168.24.1:8787/rh-osbs/rhosp15-openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0 (ocf::heartbeat:podman): Started controller-0
   haproxy-bundle-podman-1 (ocf::heartbeat:podman): Started controller-2
   haproxy-bundle-podman-2 (ocf::heartbeat:podman): Stopped
 Container bundle: openstack-cinder-backup [192.168.24.1:8787/rh-osbs/rhosp15-openstack-cinder-backup:pcmklatest]
   openstack-cinder-backup-podman-0 (ocf::heartbeat:podman): Started controller-0
 Container bundle: openstack-cinder-volume [192.168.24.1:8787/rh-osbs/rhosp15-openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0 (ocf::heartbeat:podman): Started controller-0

Failed Resource Actions:
* galera_start_0 on galera-bundle-1 'not configured' (6): call=39, status=complete, exitreason='Could not determine galera name from pacemaker node <controller-2>.',
    last-rc-change='Thu Jan 16 07:37:41 2020', queued=0ms, exec=91ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled
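
For triage, the failed action can be pulled out of the pcs status output with a small script. The following is a minimal sketch and not part of the original report: the regular expression is an assumption that matches the line format shown above, and other pcs versions may format this section differently. The names FAILED_ACTION and parse_failed_actions are hypothetical.

```python
import re

# Assumed format, taken from the "Failed Resource Actions" entry above:
# * <action> on <node> '<rc text>' (<rc>): ..., exitreason='<reason>',
FAILED_ACTION = re.compile(
    r"^\*\s+(?P<action>\S+) on (?P<node>\S+) "
    r"'(?P<rc_text>[^']*)' \((?P<rc>\d+)\):"
    r".*exitreason='(?P<reason>[^']*)'"
)


def parse_failed_actions(status_text):
    """Return one dict per failed resource action found in pcs status text."""
    failures = []
    for line in status_text.splitlines():
        match = FAILED_ACTION.match(line.strip())
        if match:
            info = match.groupdict()
            info["rc"] = int(info["rc"])
            failures.append(info)
    return failures


# The entry from this report, copied verbatim.
SAMPLE = """Failed Resource Actions:
* galera_start_0 on galera-bundle-1 'not configured' (6): call=39, status=complete, exitreason='Could not determine galera name from pacemaker node <controller-2>.',
    last-rc-change='Thu Jan 16 07:37:41 2020', queued=0ms, exec=91ms
"""

if __name__ == "__main__":
    for failure in parse_failed_actions(SAMPLE):
        print(failure["action"], failure["node"], failure["reason"])
```

The continuation line starting with last-rc-change is ignored because it does not begin with an asterisk.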

The cause appears to be the cluster_host_map parameter of the galera resource, which is built incorrectly:

When running the first controller upgrade (controller-0):

2020-01-16 01:07:40 | "Debug: try 1/10: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20200116-9-w2c054 resource create galera ocf:heartbeat:galera log='/var/log/mysql/mysqld.log' additional_parameters='--open-files-limit=16384' enable_creation=true wsrep_cluster_address='gcomm://controller-0.internalapi.redhat.local' cluster_host_map='controller-0:controller-0.internalapi.redhat.local;controller-1:;controller-2:' meta master-max=1 ordered=true container-attribute-target=host op promote timeout=300s on-fail=block bundle galera-bundle",

However, when running the deploy steps for the second controller (controller-2):

2020-01-16 02:42:10 | "Debug: pcs_offline: pcs -f /var/lib/pacemaker/cib/puppet-cib-backup20200116-9-1366jzl resource update galera ocf:heartbeat:galera log='/var/log/mysql/mysqld.log' additional_parameters='--open-files-limit=16384' enable_creation=true wsrep_cluster_address='gcomm://controller-0.internalapi.redhat.local,controller-2.internalapi.redhat.local' cluster_host_map='controller-0:controller-0.internalapi.redhat.local;controller-1:controller-2.internalapi.redhat.local;controller-2:' meta master-max=2 ordered=true container-attribute-target=host op promote timeout=300s on-fail=block bundle galera-bundle. Output: ",

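In the second command, cluster_host_map maps controller-1 to controller-2.internalapi.redhat.local and leaves controller-2 without a value, so the galera resource agent cannot determine a galera name for controller-2.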
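
To make the problem easier to spot, here is a minimal sketch that parses a cluster_host_map value and flags suspicious entries. It is not part of the original report. It assumes the 'node:name;node:name' format seen in the two log lines above, and it assumes that in this deployment the first label of each galera name should equal the pacemaker node name. The function names are hypothetical.

```python
def parse_cluster_host_map(raw):
    """Split 'node:galera_name;node:galera_name' into a dict."""
    mapping = {}
    for entry in raw.split(";"):
        if not entry:
            continue
        node, _, galera_name = entry.partition(":")
        mapping[node] = galera_name
    return mapping


def find_mismatches(raw):
    """Return entries whose galera name belongs to a different node.

    Assumption: the short host name (first DNS label) of the galera name
    should match the pacemaker node name. Empty values are not reported here.
    """
    problems = {}
    for node, galera_name in parse_cluster_host_map(raw).items():
        if galera_name and galera_name.split(".")[0] != node:
            problems[node] = galera_name
    return problems


def unmapped_nodes(raw):
    """Return the pacemaker nodes that have no galera name at all."""
    mapping = parse_cluster_host_map(raw)
    return sorted(node for node, name in mapping.items() if not name)


# Values copied from the two log lines in this report.
FIRST_RUN = ("controller-0:controller-0.internalapi.redhat.local;"
             "controller-1:;controller-2:")
SECOND_RUN = ("controller-0:controller-0.internalapi.redhat.local;"
              "controller-1:controller-2.internalapi.redhat.local;"
              "controller-2:")

if __name__ == "__main__":
    print(find_mismatches(SECOND_RUN))
    print(unmapped_nodes(SECOND_RUN))
```

With the value from the second run, controller-1 is reported as a mismatch and controller-2 as unmapped, which is consistent with the start failure on galera-bundle-1.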

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (master)

Fix proposed to branch: master
Review: https://review.opendev.org/702851

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (master)

Reviewed: https://review.opendev.org/702851
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=0a64eebb6454483e823c4cf12c55832935c2319f
Submitter: Zuul
Branch: master

commit 0a64eebb6454483e823c4cf12c55832935c2319f
Author: Damien Ciabrini <email address hidden>
Date: Thu Jan 16 12:57:10 2020 +0100

    HA: Honour all hiera override variables in mysql_bundle

    During a major upgrade, upgrade tasks can rebuild a new pacemaker
    cluster by adding nodes one at a time. This is implemented by
    using two special hiera variables mysql_node_names_override and
    mysql_short_node_names_override.

    Make sure the mysql_bundle puppet module uses both variables
    when such cluster rebuild is in progress.

    Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
    Closes-Bug: #1859961
    Co-Authored-By: Jose Luis Franco Arza <email address hidden>
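
One plausible reading of the commit message, sketched in Python rather than Puppet: if the short node names and the FQDNs are paired by position, then honouring the override for only one of the two lists shifts the pairing. This is an illustrative assumption and not the actual mysql_bundle code. The helper build_cluster_host_map and the list variables are hypothetical, although the host names come from the logs in this report.

```python
from itertools import zip_longest


def build_cluster_host_map(short_names, fqdns):
    """Pair short names with FQDNs by position, padding gaps with ''."""
    pairs = zip_longest(short_names, fqdns, fillvalue="")
    return ";".join(f"{short}:{fqdn}" for short, fqdn in pairs)


# Full deployment list of short names (override NOT applied).
ALL_SHORT_NAMES = ["controller-0", "controller-1", "controller-2"]

# Nodes already in the rebuilt cluster (override applied to both lists).
OVERRIDE_SHORT_NAMES = ["controller-0", "controller-2"]
OVERRIDE_FQDNS = [
    "controller-0.internalapi.redhat.local",
    "controller-2.internalapi.redhat.local",
]

# Only the FQDN list honours the override: controller-1 gets
# controller-2's FQDN and controller-2 is left empty.
BROKEN = build_cluster_host_map(ALL_SHORT_NAMES, OVERRIDE_FQDNS)

# Both lists honour the override: each node gets its own FQDN.
CONSISTENT = build_cluster_host_map(OVERRIDE_SHORT_NAMES, OVERRIDE_FQDNS)

if __name__ == "__main__":
    print(BROKEN)
    print(CONSISTENT)
```

Under these assumptions, BROKEN reproduces the cluster_host_map value logged during the controller-2 deploy steps. CONSISTENT is only what this sketch produces when both lists are overridden; it is not a value taken from a log.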

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/703016

OpenStack Infra (hudson-openstack) wrote : Fix proposed to puppet-tripleo (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/703029

OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/stein)

Reviewed: https://review.opendev.org/703029
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=450c23519480e10912677505c59ef7a9a1ea6b58
Submitter: Zuul
Branch: stable/stein

commit 450c23519480e10912677505c59ef7a9a1ea6b58
Author: Damien Ciabrini <email address hidden>
Date: Thu Jan 16 12:57:10 2020 +0100

    HA: Honour all hiera override variables in mysql_bundle

    During a major upgrade, upgrade tasks can rebuild a new pacemaker
    cluster by adding nodes one at a time. This is implemented by
    using two special hiera variables mysql_node_names_override and
    mysql_short_node_names_override.

    Make sure the mysql_bundle puppet module uses both variables
    when such cluster rebuild is in progress.

    Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
    Closes-Bug: #1859961
    Co-Authored-By: Jose Luis Franco Arza <email address hidden>
    (cherry picked from commit 0a64eebb6454483e823c4cf12c55832935c2319f)

tags: added: in-stable-stein
OpenStack Infra (hudson-openstack) wrote : Fix merged to puppet-tripleo (stable/train)

Reviewed: https://review.opendev.org/703016
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=7aec7fc5cf422da88d682beff8102e52c532baf7
Submitter: Zuul
Branch: stable/train

commit 7aec7fc5cf422da88d682beff8102e52c532baf7
Author: Damien Ciabrini <email address hidden>
Date: Thu Jan 16 12:57:10 2020 +0100

    HA: Honour all hiera override variables in mysql_bundle

    During a major upgrade, upgrade tasks can rebuild a new pacemaker
    cluster by adding nodes one at a time. This is implemented by
    using two special hiera variables mysql_node_names_override and
    mysql_short_node_names_override.

    Make sure the mysql_bundle puppet module uses both variables
    when such cluster rebuild is in progress.

    Change-Id: I6a06269f55a38071c34d2a95109d213fe7e2452c
    Closes-Bug: #1859961
    Co-Authored-By: Jose Luis Franco Arza <email address hidden>
    (cherry picked from commit 0a64eebb6454483e823c4cf12c55832935c2319f)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 12.1.0

This issue was fixed in the openstack/puppet-tripleo 12.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo 11.5.0

This issue was fixed in the openstack/puppet-tripleo 11.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/puppet-tripleo stein-eol

This issue was fixed in the openstack/puppet-tripleo stein-eol release.
