periodic-tripleo-ci-rhel-8-scenario004-standalone-master fails to deploy standalone - pacemaker/mysql

Bug #1851847 reported by Ronelle Landy on 2019-11-08
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: yatin

Bug Description

Updating the bug description because the check and periodic jobs now fail with two different errors. We suspect the check job errors will be fixed by a promotion, so to focus on the periodic/promotion jobs, the errors there are:

2019-11-12 14:39:41 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 14:39:41 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 14:39:41 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)",
2019-11-12 14:39:41 | "+ rc=1",
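
A minimal sketch for pulling the puppet "Error:" lines out of a deploy log shaped like the excerpt above. The function name `extract_errors` and the regular expression are assumptions made for illustration; they are not part of tripleo-ci tooling.

```python
import re

# Matches lines shaped like:  2019-11-12 14:39:41 | "Error: ...",
# (format inferred from the excerpt in this report; other logs may differ)
ERROR_LINE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "(?P<msg>Error: .*?)",?\s*$'
)


def extract_errors(log_text):
    """Return (timestamp, message) pairs for every quoted 'Error:' line."""
    found = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line)
        if match:
            found.append((match.group("ts"), match.group("msg")))
    return found


sample = (
    "2019-11-12 14:39:41 | \"Error: '/usr/bin/clustercheck >/dev/null' "
    "returned 1 instead of one of [0]\",\n"
    '2019-11-12 14:39:41 | "+ rc=1",\n'
)
errors = extract_errors(sample)
for timestamp, message in errors:
    print(timestamp, message)
```

Lines that are not quoted "Error:" entries, such as the `+ rc=1` trace line, are skipped.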

Full deploy log:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/39a2b42/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

History of the test pass/fail stats:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-rhel-8-scenario004-standalone-master

periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master also fails out

--------------------------------------------------------------------

tripleo-ci-rhel-8-scenario004-standalone-rdo has been failing consistently with errors related to pacemaker/rabbit.

Error: unable to get cib
Error: unable to get cib
shows up in the container logs.

2019-11-08 13:45:41 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",
2019-11-08 13:45:41 | "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191108-9-1ww1cq7 failed with code: 1 -> Error: unable to get cib",
2019-11-08 13:45:41 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster ci

The full deploy log is below:

http://logs.rdoproject.org/85/692985/4/openstack-check/tripleo-ci-rhel-8-scenario004-standalone-rdo/d06da86/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

The link below shows the paunch errors:

logs.rdoproject.org/85/692985/4/openstack-check/tripleo-ci-rhel-8-scenario004-standalone-rdo/d06da86/logs/undercloud/var/log/extra/errors.txt.txt.gz

Ronelle Landy (rlandy) on 2019-11-08
Changed in tripleo:
milestone: none → ussuri-1
importance: Undecided → High
status: New → Triaged
tags: added: ci
Ronelle Landy (rlandy) wrote :

Moving this to a promotion blocker because the periodic job is failing as well ...

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/e5e3d98/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

available)",
2019-11-08 14:29:13 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-08 14:29:13 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-08 14:29:13 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)",

tags: added: promotion-blocker
Changed in tripleo:
importance: High → Critical
Ronelle Landy (rlandy) on 2019-11-08
tags: removed: promotion-blocker
Ronelle Landy (rlandy) wrote :

So - actually scenario004 has never passed, so I removed the promotion blocker here - but I am keeping the bug open because this is in fact a different error.

Ronelle Landy (rlandy) wrote :

OK - let's put back the promotion-blocker status, as rhel-8 OVB has this showing up as well now

tags: added: promotion-blocker
Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
Michele Baldessari (michele) wrote :

The "error while resolving custom fact" message is an (annoying) benign error and can be safely ignored. The bug is here because the versions of pacemaker/pcs/corosync/libqb are mismatched between host and containers.

Ronelle Landy (rlandy) wrote :

Question: what version of pacemaker on rhel is appropriate for master and train?

This is what we see currently:

pacemaker.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pacemaker-cli.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pacemaker-cluster-libs.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-appstream-rhui-rpms
pacemaker-libs.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-appstream-rhui-rpms
pacemaker-remote.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pacemaker-schemas.noarch 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-appstream-rhui-rpms
pam.x86_64 1.3.1-4.el8 @anaconda

pcs.x86_64 0.10.2-4.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
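
A rough sketch of how the host/container mismatch mentioned above could be checked, assuming package listings in the same three-column shape as the one pasted here. The helper names `parse_installed` and `find_mismatches` are hypothetical, and the container listing is made-up sample data, since this report does not give the container-side versions.

```python
def parse_installed(listing):
    """Map package name (arch suffix stripped) to its version-release string."""
    versions = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name = fields[0].rsplit(".", 1)[0]  # drop ".x86_64" / ".noarch"
        versions[name] = fields[1]
    return versions


def find_mismatches(host, container, packages):
    """Return {pkg: (host_version, container_version)} where the two differ."""
    return {
        pkg: (host.get(pkg), container.get(pkg))
        for pkg in packages
        if host.get(pkg) != container.get(pkg)
    }


# Host lines copied from the listing in this comment.
host_listing = """\
pacemaker.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pcs.x86_64 0.10.2-4.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
"""

# Hypothetical container listing: placeholder versions, not taken from the job logs.
container_listing = """\
pacemaker.x86_64 0.0.0-0.example @example-repo
pcs.x86_64 0.10.2-4.el8 @example-repo
"""

mismatches = find_mismatches(
    parse_installed(host_listing),
    parse_installed(container_listing),
    ["pacemaker", "pcs", "corosync", "libqb"],
)
print(mismatches)
```

Packages absent from both listings compare equal and are not reported, so the check only flags versions that actually differ.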

Ronelle Landy (rlandy) wrote :

<ykarel_> rlandy|rover, if the answer is yes, it's possibly due to unpin of puppet modules in master:- https://review.rdoproject.org/r/#/c/23533/
<ykarel_> and some unpinned modules do not working with rhel

Alfredo Moralejo (amoralej) wrote :

The error in the periodic jobs for scenario004 on rhel8 looks like a different one:

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

2019-11-12 02:27:47 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 02:27:47 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 02:27:47 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)",

Looking at system logs:

Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80291]: INFO: running container galera-bundle-podman-0 for the first time
Nov 12 01:55:26 standalone.localdomain podman[75335]: 2019-11-12 01:55:26.734 7f7af8b06700 0 log_channel(cluster) log [DBG] : pgmap v87: 320 pgs: 320 active+clean; 4.9 KiB data, 1.5 GiB used, 7.0 GiB / 9.4 GiB avail
Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80315]: ERROR: Error: error checking path "/var/log/mariadb": stat /var/log/mariadb: no such file or directory
Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80322]: ERROR: podman failed to launch container

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/var/log/journal.txt.gz
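
As an illustration of how the missing path can be spotted in a journal dump, here is a small sketch that scans for the podman "error checking path" message quoted above. The pattern is inferred from that single log line and `missing_bind_paths` is a hypothetical helper name.

```python
import re

# Pattern inferred from the journal excerpt above; the path must appear
# both in quotes and after "stat", as it does in that message.
MISSING_PATH = re.compile(
    r'ERROR: Error: error checking path "(?P<path>[^"]+)": '
    r"stat (?P=path): no such file or directory"
)


def missing_bind_paths(journal_text):
    """Return each distinct path that podman reported as missing, in order."""
    paths = []
    for line in journal_text.splitlines():
        match = MISSING_PATH.search(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths


journal = (
    "Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80315]: "
    'ERROR: Error: error checking path "/var/log/mariadb": '
    "stat /var/log/mariadb: no such file or directory\n"
    "Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80322]: "
    "ERROR: podman failed to launch container\n"
)
paths = missing_bind_paths(journal)
print(paths)
```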

Ronelle Landy (rlandy) on 2019-11-12
summary: - rhel-8-scenario004 fails to deploy standalone - error while resolving
- custom fact \"rabbitmq_nodename\": undefined method `[]' for
- nil:NilClass"
+ rhel-8-scenario004 fails to deploy standalone - pacemaker/mysql

There is also an error in mysql_init_bundle ('/usr/bin/clustercheck >/dev/null' returned 1):

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/var/log/extra/errors.txt.txt.gz

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/var/log/extra/podman/containers/mysql_init_bundle/stdout.log.txt.gz

Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Pacemaker::Resource::Ocf[galera]/Pcmk_resource[galera]/ensure: created
Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]
Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]
Info: Creating state file /var/lib/puppet/state/state.yaml
Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

Ronelle Landy (rlandy) on 2019-11-12
description: updated
summary: - rhel-8-scenario004 fails to deploy standalone - pacemaker/mysql
+ periodic-tripleo-ci-rhel-8-scenario004-standalone-master fails to deploy
+ standalone - pacemaker/mysql
yatin (yatinkarel) wrote :

<<< <ykarel_> rlandy|rover, if the answer is yes, it's possibly due to unpin of puppet modules in master:- https://review.rdoproject.org/r/#/c/23533/
<<< <ykarel_> and some unpinned modules do not working with rhel

This would only be true if both the check and promotion jobs had started failing after the unpin for the same reason, which is not the case here: periodic and check fail for different reasons.
The periodic job fails due to Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0], as mentioned by Alfredo, and the check jobs are possibly failing due to a pacemaker version mismatch between container and host, as mentioned by bandini.

The promotion job failure is caused by https://review.opendev.org/#/c/692850/; I have tested the revert (https://review.opendev.org/#/c/693974/2) to confirm it. The check failure is possibly caused by the rhel8.1 update: containers are outdated in check because there has been no container promotion yet, so it should be green after the rhel 8.1 containers are promoted. We can try container updates with the HA repo just to confirm the second case.

Fix proposed to branch: master
Review: https://review.opendev.org/693997

Changed in tripleo:
assignee: Ronelle Landy (rlandy) → yatin (yatinkarel)
status: Triaged → In Progress
yatin (yatinkarel) wrote :

https://review.opendev.org/693997 will fix the promotion jobs, tested with https://review.rdoproject.org/r/#/c/23681/

<< check one is possibly caused by rhel8.1 update, and containers are outdated in check as there is no promotion of container yet, it should be green after rhel 8.1 containers are promoted. We can try container updates with ha repo just to confirm the second case.

Tested container updates in check from the rhel HA repo with https://review.opendev.org/#/c/693992/ and rhel004 is green:- https://logs.rdoproject.org/92/693992/2/openstack-check/tripleo-ci-rhel-8-scenario004-standalone-rdo/c500985/job-output.txt, so as soon as the rhel8 repo/containers are promoted, the check jobs will be green.

Reviewed: https://review.opendev.org/693997
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=189d9b9211a65857cfd9191a00e9d86894af8293
Submitter: Zuul
Branch: master

commit 189d9b9211a65857cfd9191a00e9d86894af8293
Author: yatinkarel <email address hidden>
Date: Wed Nov 13 12:55:19 2019 +0530

    Readd creation of /var/log/mariadb directory

    https://review.opendev.org/#/c/692850/ cleaned up the
    legacy directories, but since then rhel8 jobs fails while
    starting galera containers with error of missing
    directory /var/log/mariadb, this patch adds it again.

    Closes-Bug: #1851847
    Change-Id: Iea081ecb3fc021fc796c93631ed6f663fd9580db
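
The actual patch lives in tripleo-heat-templates and is not reproduced here. Purely as an illustration of the idea in the commit message (make sure the host directory exists before the container that bind-mounts it is started), a minimal idempotent sketch might look like the following. `ensure_log_dirs` is a hypothetical helper, and the example runs against a temporary directory rather than the real /var/log.

```python
import tempfile
from pathlib import Path


def ensure_log_dirs(root, relative_dirs):
    """Create each directory under root if missing; return the ones created."""
    created = []
    for rel in relative_dirs:
        target = Path(root) / rel
        if not target.is_dir():
            # parents=True builds intermediate directories as needed
            target.mkdir(parents=True, exist_ok=True)
            created.append(str(target))
    return created


# Scratch root stands in for "/" so the sketch never touches the real host.
with tempfile.TemporaryDirectory() as scratch:
    first = ensure_log_dirs(scratch, ["var/log/mariadb"])
    second = ensure_log_dirs(scratch, ["var/log/mariadb"])  # nothing left to do
    exists = (Path(scratch) / "var/log/mariadb").is_dir()

print(len(first), len(second), exists)
```

Running the step twice is harmless: the second call finds the directory already present and creates nothing.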

Changed in tripleo:
status: In Progress → Fix Released

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/693974
Reason: https://review.opendev.org/693997
