galera(galera)[34]: ERROR: Could not determine galera name from pacemaker node <galera-bundle-0>

Bug #1721497 reported by Michele Baldessari on 2017-10-05
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: John Trowbridge

Bug Description

In http://logs.openstack.org/97/509397/2/check/gate-tripleo-ci-centos-7-scenario004-multinode-oooq-container/fc97d0c/logs/subnode-2/var/log/messages.txt.gz#_Oct__5_08_24_28 we currently fail to bring up galera:

Oct 5 08:24:28 localhost galera(galera)[34]: ERROR: Could not determine galera name from pacemaker node <galera-bundle-0>.
Oct 5 08:24:28 localhost pacemaker_remoted[13]: notice: galera_start_0:34:stderr [ ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0>. ]
Oct 5 08:24:28 localhost crmd[30980]: notice: Result of start operation for galera on galera-bundle-0: 6 (not configured)
Oct 5 08:24:28 localhost crmd[30980]: notice: galera-bundle-0-galera_start_0:7 [ ocf-exit-reason:Could not determine galera name from pacemaker node <galera-bundle-0>.\n ]
Oct 5 08:24:28 localhost crmd[30980]: warning: Action 37 (galera:0_start_0) on galera-bundle-0 failed (target: 0 vs. rc: 6): Error
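
To pull just the failure reason out of a large messages log, a filter along these lines may help. This is a minimal sketch and not part of the original report: the function name `extract_exit_reasons` is hypothetical, and it assumes the `ocf-exit-reason:` text always sits inside square brackets, as in the lines above.

```shell
# Print the unique ocf-exit-reason strings found in syslog-style input on stdin.
# Assumption: each reason is terminated by the closing "]" of the log entry.
extract_exit_reasons() {
    grep -o 'ocf-exit-reason:[^]]*' \
        | sed -e 's/^ocf-exit-reason://' \
              -e 's/\\n *$//' \
              -e 's/[[:space:]]*$//' \
        | sort -u
}
```

Fed the log excerpt above (for example via `zcat messages.txt.gz | extract_exit_reasons`), it should collapse the repeated entries into the single "Could not determine galera name" reason.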

The reason is that, now that https://review.openstack.org/497766 has merged, the deployment needs new pacemaker and resource-agents packages. Those already exist on the host:
pacemaker-1.1.16-12.el7_4.2.0.0.rdo1.x86_64
resource-agents-3.9.5-105.el7.0.0.rdo1.x86_64

The problem is that the following three containers (the ones with OCF resources inside) need to be rebuilt with those packages:
- rabbitmq
- mariadb/galera
- redis

The repos are here: https://buildlogs.centos.org/centos/7/cloud/x86_64/openstack-pike/

tags: added: containers
Michele Baldessari (michele) wrote :

Once https://review.openstack.org/#/c/504454/ lands we should get:
docker.io/tripleomaster/centos-binary-mariadb:passed-ci-test has pacemaker-1.1.16-12.el7_4.2.0.0.rdo1.x86_64 and resource-agents-3.9.5-105.el7.0.0.rdo1.x86_64

Martin André (mandre) wrote :

This should be fixed with https://review.openstack.org/#/c/504454/.

Alex Schultz (alex-schultz) wrote :

Patch is merged, moving to fixed release. If this is still a problem let's reopen it.

Changed in tripleo:
assignee: nobody → John Trowbridge (trown)
status: Triaged → Fix Released
Martin André (mandre) wrote :

Re-opened, this is still occurring in gate-tripleo-ci-centos-7-scenario004-multinode-oooq-container even with the new images from tripleomaster:

http://logs.openstack.org/75/462975/17/gate/gate-tripleo-ci-centos-7-scenario004-multinode-oooq-container/55c64de/logs/subnode-2/var/log/messages.txt.gz#_Oct__5_21_31_56

Changed in tripleo:
status: Fix Released → Confirmed
Michele Baldessari (michele) wrote :

So when I checked packages yesterday I used this one:
[root@bandini ~]# docker run -it docker.io/tripleomaster/centos-binary-mariadb:passed-ci-test /bin/bash -c "rpm -q pacemaker resource-agents"
pacemaker-1.1.16-12.el7_4.2.0.0.rdo1.x86_64
resource-agents-3.9.5-105.el7.0.0.rdo1.x86_64

But CI pulls 'passed-ci' (not 'passed-ci-test', which is the tag I looked at):
[root@bandini ~]# docker run -it docker.io/tripleomaster/centos-binary-mariadb:passed-ci /bin/bash -c "rpm -q pacemaker resource-agents"
pacemaker-1.1.16-12.el7_4.2.x86_64
resource-agents-3.9.5-105.el7.x86_64
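
A mismatch like this is easy to miss by eye, so the two `rpm -q` outputs can be compared mechanically. The following is a rough sketch under stated assumptions: the helper name `compare_rpm_outputs` is invented for illustration, and it expects each output as a plain string with one package per line (so capture it without `-t`, since a TTY adds carriage returns).

```shell
# Compare two `rpm -q` outputs (one package per line) and print every line
# that appears on only one side, prefixed with that side's label.
# Arguments: label_a output_a label_b output_b
compare_rpm_outputs() {
    label_a=$1 out_a=$2 label_b=$3 out_b=$4
    # Lines present in A but missing from B.
    printf '%s\n' "$out_a" | while IFS= read -r pkg; do
        printf '%s\n' "$out_b" | grep -Fxq -- "$pkg" \
            || printf '%s: %s\n' "$label_a" "$pkg"
    done
    # Lines present in B but missing from A.
    printf '%s\n' "$out_b" | while IFS= read -r pkg; do
        printf '%s\n' "$out_a" | grep -Fxq -- "$pkg" \
            || printf '%s: %s\n' "$label_b" "$pkg"
    done
}
```

With the two outputs above captured into shell variables, the helper would list all four package lines, since neither package matches between the `passed-ci-test` and `passed-ci` images. Identical outputs produce no output at all.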

Gabriele Cerami (gcerami) wrote :

passed-ci-test was only a tag used to test uploading container images after promotion. Disregard it; I'll remove the tag from all the containers.

John Trowbridge (trown) wrote :

I think the correct thing to do is to revert https://review.openstack.org/#/c/497766/

It clearly would have failed the scenario004 job, but that job did not run and is now broken.

The passed-ci-test tag on dockerhub is/was just there for testing the new pipeline, and the full set of containers with that tag has not actually passed the CI pipeline.

If we don't want to revert the patch that actually broke this, the only other option is to manually tag a new mariadb container that has not passed the promote CI with everything else. This seems fine(ish) for the current issue, but is a pretty bad habit to carry forward.

Changed in tripleo:
status: Confirmed → In Progress
Gabriele Cerami (gcerami) wrote :

The passed-ci-test tag and its associated hash tag were removed from all the containers. Sorry for the confusion.

Reviewed: https://review.openstack.org/510094
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=1681d3bceb2834e8788cc4456d65a76bcf4e1e55
Submitter: Jenkins
Branch: master

commit 1681d3bceb2834e8788cc4456d65a76bcf4e1e55
Author: John Trowbridge <email address hidden>
Date: Fri Oct 6 12:44:16 2017 +0000

    Revert "Set meta container-attribute-target=host attribute"

    This patch broke the containers scenario004 test because it relies on a
    newer mariadb container than has actually passed CI at this time.

    To revert this revert, we need to make sure we test
    scenario004-containers against that patch.

    This reverts commit 6bcb011723ad7b75f18914c887dc4fa4bad4d620.

    Closes-Bug: 1721497

    Change-Id: I34c7c388eed94db1735c45e26661a0af8cdce8e9

Changed in tripleo:
status: In Progress → Fix Released
John Trowbridge (trown) wrote :

Removed alert and lowered to High since the revert landed. We still need to either promote next week or upload updated HA containers to get the original patch landed.

Changed in tripleo:
importance: Critical → High
tags: removed: alert

This issue was fixed in the openstack/puppet-tripleo 8.0.0 release.
