Deployment failure due to ERROR: Could not determine galera name from pacemaker node <galera-bundle-0>.

Bug #1724920 reported by Alex Schultz
This bug affects 1 person
Affects: tripleo
Status: Invalid
Importance: Critical
Assigned to: Unassigned

Bug Description

We see overcloud deployment failures in the gate because the galera-ready task fails. The error from galera/pacemaker is:

Oct 7 02:44:53 localhost journal: Error: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]
Oct 7 02:44:53 localhost journal: Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from notrun to 0 failed: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]
Oct 7 02:44:53 localhost journal: Info: Class[Tripleo::Profile::Pacemaker::Database::Mysql_bundle]: Unscheduling all events on Class[Tripleo::Profile::Pacemaker::Database::Mysql_bundle]
Oct 7 02:44:53 localhost journal: Info: Creating state file /var/lib/puppet/state/state.yaml
Oct 7 02:44:53 localhost journal: Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 "No such file or directory")
Oct 7 02:44:53 localhost galera(galera)[74751]: ERROR: Could not determine galera name from pacemaker node <galera-bundle-0>.
Oct 7 02:44:53 localhost pacemaker_remoted[13]: notice: galera_start_0:74751:stderr [ ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0>. ] ]

http://logs.openstack.org/12/510212/3/check/gate-tripleo-ci-centos-7-scenario004-multinode-oooq-container/54ca592/logs/subnode-2/var/log/messages.txt.gz#_Oct__7_02_44_53
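To pull the relevant errors out of a full messages log like the one linked above, a minimal Python sketch could strip the ANSI colour codes that Puppet emits (syslog escapes the ESC control character as `#033`, i.e. octal 33) and keep only Puppet errors and resource-agent exit reasons. The function name and the pattern list are illustrative choices, not part of any TripleO tooling.

```python
import re

# Syslog escapes ESC as "#033", so Puppet's ANSI colour codes appear
# in messages as e.g. "#033[1;31m" ... "#033[0m".
ANSI_ESCAPED = re.compile(r"#033\[[0-9;]*m")

# Lines worth surfacing when triaging a galera-ready failure (illustrative).
PATTERNS = (
    re.compile(r"Error: "),                          # Puppet errors
    re.compile(r"ocf-exit-reason:"),                 # resource-agent exit reasons
    re.compile(r"galera\(galera\)\[\d+\]: ERROR:"),  # galera RA errors
)


def relevant_errors(lines):
    """Yield lines matching any pattern, with escaped ANSI codes removed."""
    for line in lines:
        clean = ANSI_ESCAPED.sub("", line.rstrip("\n"))
        if any(p.search(clean) for p in PATTERNS):
            yield clean
```

Usage: `for line in relevant_errors(open("messages.txt")): print(line)` on a downloaded, decompressed copy of the log.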

Tags: ci containers
Michele Baldessari (michele) wrote:

OK, so in http://logs.openstack.org/12/510212/3/check/gate-tripleo-ci-centos-7-scenario004-multinode-oooq-container/54ca592/logs/subnode-2/var/log/cluster/corosync.log.txt.gz I see this:
Oct 07 02:14:31 [28250] centos-7-2-node-inap-mtl01-11266984-941703 cib: info: cib_perform_op: ++ <nvpair id="galera-meta_attributes-container-attribute-target" name="container-attribute-target" value="host"/>

This should only happen in one situation, namely when both of the following are true:
A) puppet-tripleo includes commit 6bcb011723ad7b75f18914c887dc4fa4bad4d620 (which was reverted because we did not have a proper process for updating the mariadb/rabbitmq/redis containers with the latest pacemaker/resource-agents RPMs), and
B) the containers have an old pacemaker/resource-agents combination.
We already know about B), which will hopefully be solved either by a master promotion or by manually fixing the containers. A) is quite surprising, since the patch was reverted in:
091f92d6f0e8 - (2017-10-07 03:36:52 +0000) Merge "Revert "Set meta container-attribute-target=host attribute"" <Jenkins>

So is the reason for A) that this job has an old puppet-tripleo (i.e. one that predates the revert)?
puppet-tripleo-8.0.0-0.20171006214736.5e54b7e.el7.centos.noarch
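The package release string appears to embed a build timestamp (20171006214736, i.e. 2017-10-06 21:47:36) followed by a short git hash (5e54b7e). Assuming that stamp is UTC, comparing it with the revert's merge time (2017-10-07 03:36:52 +0000) supports the suspicion that this puppet-tripleo predates the revert. This is only a heuristic sketch with hypothetical function names; assuming 5e54b7e is the upstream puppet-tripleo commit the package was built from, a definitive check would be `git merge-base --is-ancestor 091f92d6f0e8 5e54b7e` in a puppet-tripleo clone (exit status 0 means the revert is included).

```python
import re
from datetime import datetime, timezone

# Merge time of the revert, taken from the git log line quoted above.
REVERT_MERGED = datetime(2017, 10, 7, 3, 36, 52, tzinfo=timezone.utc)


def build_time(nvr):
    """Extract the 14-digit YYYYMMDDhhmmss stamp from the package name.

    Assumption: the stamp is a UTC build time. This is a heuristic and
    does not check git ancestry.
    """
    m = re.search(r"\.(\d{14})\.", nvr)
    if m is None:
        raise ValueError(f"no build timestamp in {nvr!r}")
    stamp = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
    return stamp.replace(tzinfo=timezone.utc)


def predates_revert(nvr):
    """True if the package was (apparently) built before the revert merged."""
    return build_time(nvr) < REVERT_MERGED
```

For the package above, `predates_revert("puppet-tripleo-8.0.0-0.20171006214736.5e54b7e.el7.centos.noarch")` returns `True`, since the build stamp is roughly six hours before the merge.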

The way out of this is to either fix B) or make sure we have a newer puppet-tripleo which contains the revert.

I will recap B) in an email tomorrow (I am totally knackered atm).

Alex Schultz (alex-schultz) wrote:

I just realized I was looking at the Jenkins logs, not the Zuul logs. This was already fixed.

Changed in tripleo:
status: Triaged → Invalid