periodic-tripleo-ci-rhel-8-scenario004-standalone-master fails to deploy standalone - pacemaker/mysql

Bug #1851847 reported by Ronelle Landy
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: yatin

Bug Description

Updating the bug description, as the check and periodic jobs now fail with two different errors. We suspect that the check job errors will be fixed by a promotion, so to focus on the periodic/promotion errors, we have:

2019-11-12 14:39:41 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 14:39:41 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 14:39:41 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)",
2019-11-12 14:39:41 | "+ rc=1",
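
A minimal sketch of how the Puppet "Error:" lines could be pulled out of a deploy log like the one quoted above. The line shape (timestamp, " | ", then a quoted message) is assumed from the excerpts in this report only; `extract_errors` and `SAMPLE` are hypothetical names used for illustration.

```python
import re

# Assumed line shape, taken from the excerpts in this report:
#   2019-11-12 14:39:41 | "Error: <message>",
ERROR_LINE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| "Error: (?P<msg>.*?)",?\s*$'
)


def extract_errors(log_text):
    """Return (timestamp, message) pairs for lines carrying a Puppet 'Error:'."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            errors.append((match.group("ts"), match.group("msg")))
    return errors


# Two lines copied from the excerpt above; only the first is an error line.
SAMPLE = '''\
2019-11-12 14:39:41 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 14:39:41 | "+ rc=1",
'''

for timestamp, message in extract_errors(SAMPLE):
    print(timestamp, message)
```

Filtering this way makes it easier to compare the error set of a periodic run against a check run, which is the distinction this report turns on.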

Full deploy log:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/39a2b42/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

History of the test pass/fail stats:

https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-rhel-8-scenario004-standalone-master

periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master also fails out

--------------------------------------------------------------------

tripleo-ci-rhel-8-scenario004-standalone-rdo has been failing consistently with errors related to pacemaker/rabbit.

Error: unable to get cib
Error: unable to get cib
shows up in the container logs.

2019-11-08 13:45:41 | "Error: Facter: error while resolving custom fact \"rabbitmq_nodename\": undefined method `[]' for nil:NilClass",
2019-11-08 13:45:41 | "Error: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Could not evaluate: backup_cib: Running: pcs cluster cib /var/lib/pacemaker/cib/puppet-cib-backup20191108-9-1ww1cq7 failed with code: 1 -> Error: unable to get cib",
2019-11-08 13:45:41 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Rabbitmq_bundle/Pacemaker::Property[rabbitmq-role-standalone]/Pcmk_property[property-standalone-rabbitmq-role]: Could not evaluate: backup_cib: Running: pcs cluster ci

The full deploy log is below:

http://logs.rdoproject.org/85/692985/4/openstack-check/tripleo-ci-rhel-8-scenario004-standalone-rdo/d06da86/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

The link below shows the paunch errors:

logs.rdoproject.org/85/692985/4/openstack-check/tripleo-ci-rhel-8-scenario004-standalone-rdo/d06da86/logs/undercloud/var/log/extra/errors.txt.txt.gz

Ronelle Landy (rlandy)
Changed in tripleo:
milestone: none → ussuri-1
importance: Undecided → High
status: New → Triaged
tags: added: ci
Ronelle Landy (rlandy) wrote :

Moving this to a promotion blocker because the periodic job is failing as well ...

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/e5e3d98/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

available)",
2019-11-08 14:29:13 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-08 14:29:13 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-08 14:29:13 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)",

tags: added: promotion-blocker
Changed in tripleo:
importance: High → Critical
Ronelle Landy (rlandy)
tags: removed: promotion-blocker
Ronelle Landy (rlandy) wrote :

So - actually scenario004 has never passed, so I removed the promotion blocker here - but I am keeping the bug open because this is in fact a different error.

Ronelle Landy (rlandy) wrote :
Ronelle Landy (rlandy) wrote :

OK - let's put back the promotion-blocker status, as rhel-8 OVB has this showing up as well now.

tags: added: promotion-blocker
Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
Michele Baldessari (michele) wrote :

The "error while resolving custom fact" message is an (annoying) benign error and can be safely ignored. The bug is here because the versions of pacemaker/pcs/corosync/libqb are mismatched between the host and the containers.

Alfredo Moralejo (amoralej) wrote :
Ronelle Landy (rlandy) wrote :

Question: which version of pacemaker on rhel is appropriate for master and train?

This is what we see currently:

pacemaker.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pacemaker-cli.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pacemaker-cluster-libs.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-appstream-rhui-rpms
pacemaker-libs.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-appstream-rhui-rpms
pacemaker-remote.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pacemaker-schemas.noarch 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-appstream-rhui-rpms
pam.x86_64 1.3.1-4.el8 @anaconda

pcs.x86_64 0.10.2-4.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
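
A rough sketch of how a host package listing like the one above could be compared with a listing taken inside a container, to check for the host/container version mismatch mentioned earlier in this report. The host lines are copied from the listing above; the container listing, its version string, its repo name, and the helper names are hypothetical placeholders, since this report does not show the container package versions.

```python
def parse_package_list(text):
    """Map package name (arch suffix dropped) to version.

    Assumes 'name.arch version repo' lines, as in the listing above.
    """
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name = fields[0].rsplit(".", 1)[0]
        versions[name] = fields[1]
    return versions


def find_mismatches(host, container, names):
    """Return {name: (host_version, container_version)} where they differ."""
    mismatches = {}
    for name in names:
        host_version = host.get(name)
        container_version = container.get(name)
        if host_version != container_version:
            mismatches[name] = (host_version, container_version)
    return mismatches


# Host lines copied from the listing above.
HOST_LISTING = """\
pacemaker.x86_64 2.0.2-3.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
pcs.x86_64 0.10.2-4.el8 @rhui-rhel-8-for-x86_64-highavailability-rhui-rpms
"""

# HYPOTHETICAL container listing: placeholder values, not real data.
CONTAINER_LISTING = """\
pacemaker.x86_64 0.0.0-hypothetical @hypothetical-repo
pcs.x86_64 0.10.2-4.el8 @hypothetical-repo
"""

host = parse_package_list(HOST_LISTING)
container = parse_package_list(CONTAINER_LISTING)
print(find_mismatches(host, container, ["pacemaker", "pcs"]))
```

The list of names to compare is passed in explicitly so that unrelated packages (for example pam in the listing above) do not produce noise.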

Ronelle Landy (rlandy) wrote :

<ykarel_> rlandy|rover, if the answer is yes, it's possibly due to unpin of puppet modules in master:- https://review.rdoproject.org/r/#/c/23533/
<ykarel_> and some unpinned modules do not working with rhel

Alfredo Moralejo (amoralej) wrote :

The error in the periodic jobs for scenario004 on rhel8 looks like a different one:

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/home/zuul/standalone_deploy.log.txt.gz

2019-11-12 02:27:47 | "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 02:27:47 | "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]",
2019-11-12 02:27:47 | "Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)",

Looking at system logs:

Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80291]: INFO: running container galera-bundle-podman-0 for the first time
Nov 12 01:55:26 standalone.localdomain podman[75335]: 2019-11-12 01:55:26.734 7f7af8b06700 0 log_channel(cluster) log [DBG] : pgmap v87: 320 pgs: 320 active+clean; 4.9 KiB data, 1.5 GiB used, 7.0 GiB / 9.4 GiB avail
Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80315]: ERROR: Error: error checking path "/var/log/mariadb": stat /var/log/mariadb: no such file or directory
Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80322]: ERROR: podman failed to launch container

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/var/log/journal.txt.gz
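
A small sketch of how the missing host path could be picked out of journal text like the excerpt above. The pattern is assumed from the single podman error line quoted in this report and may not cover other podman messages; `missing_paths` and `JOURNAL_SAMPLE` are hypothetical names.

```python
import re

# Pattern assumed from the podman error line quoted in this report.
MISSING_PATH = re.compile(
    r'error checking path "(?P<path>[^"]+)": stat .*: no such file or directory'
)


def missing_paths(journal_text):
    """Return the distinct paths podman reported as missing, in first-seen order."""
    paths = []
    for line in journal_text.splitlines():
        match = MISSING_PATH.search(line)
        if match and match.group("path") not in paths:
            paths.append(match.group("path"))
    return paths


# Line copied from the journal excerpt above.
JOURNAL_SAMPLE = (
    'Nov 12 01:55:26 standalone.localdomain podman(galera-bundle-podman-0)[80315]: '
    'ERROR: Error: error checking path "/var/log/mariadb": '
    'stat /var/log/mariadb: no such file or directory'
)

print(missing_paths(JOURNAL_SAMPLE))
```

Running it over the full journal rather than one line would show whether /var/log/mariadb is the only bind-mount source that is missing on the host.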

Ronelle Landy (rlandy)
summary: - rhel-8-scenario004 fails to deploy standalone - error while resolving
- custom fact \"rabbitmq_nodename\": undefined method `[]' for
- nil:NilClass"
+ rhel-8-scenario004 fails to deploy standalone - pacemaker/mysql
Alfredo Moralejo (amoralej) wrote : Re: rhel-8-scenario004 fails to deploy standalone - pacemaker/mysql

There is also an error in mysql_init_bundle ('/usr/bin/clustercheck >/dev/null' returned 1):

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/var/log/extra/errors.txt.txt.gz

https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-scenario004-standalone-master/5be9f6f/logs/undercloud/var/log/extra/podman/containers/mysql_init_bundle/stdout.log.txt.gz

Notice: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Pacemaker::Resource::Ocf[galera]/Pcmk_resource[galera]/ensure: created
Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]
Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql_bundle/Exec[galera-ready]/returns: change from 'notrun' to ['0'] failed: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]
Info: Creating state file /var/lib/puppet/state/state.yaml
Error: Failed to apply catalog: Execution of '/usr/bin/mysql --defaults-extra-file=/root/.my.cnf -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

Ronelle Landy (rlandy)
description: updated
summary: - rhel-8-scenario004 fails to deploy standalone - pacemaker/mysql
+ periodic-tripleo-ci-rhel-8-scenario004-standalone-master fails to deploy
+ standalone - pacemaker/mysql
yatin (yatinkarel) wrote :

<<< <ykarel_> rlandy|rover, if the answer is yes, it's possibly due to unpin of puppet modules in master:- https://review.rdoproject.org/r/#/c/23533/
<<< <ykarel_> and some unpinned modules do not working with rhel

This would only be true if both the check and promotion jobs had started failing after the unpin for the same reason, which is not the case here: the periodic and check jobs fail for different reasons.
The periodic jobs fail due to "Error: '/usr/bin/clustercheck >/dev/null' returned 1 instead of one of [0]", as mentioned by Alfredo, and the check jobs are possibly failing due to a pacemaker version mismatch between container and host, as mentioned by bandini.

The promotion job failure is caused by https://review.opendev.org/#/c/692850/; I have tested the revert (https://review.opendev.org/#/c/693974/2) to confirm it. The check failure is possibly caused by the rhel8.1 update: the containers are outdated in check because there has been no container promotion yet, so it should be green after the rhel 8.1 containers are promoted. We can try container updates with the ha repo just to confirm the second case.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/693997

Changed in tripleo:
assignee: Ronelle Landy (rlandy) → yatin (yatinkarel)
status: Triaged → In Progress
yatin (yatinkarel) wrote :

https://review.opendev.org/693997 will fix the promotion jobs, tested with https://review.rdoproject.org/r/#/c/23681/

<< check one is possibly caused by rhel8.1 update, and containers are outdated in check as there is no promotion of container yet, it should be green after rhel 8.1 containers are promoted. We can try container updates with ha repo just to confirm the second case.

I tested container updates in check from the rhel HA repo with https://review.opendev.org/#/c/693992/ and rhel004 is green:- https://logs.rdoproject.org/92/693992/2/openstack-check/tripleo-ci-rhel-8-scenario004-standalone-rdo/c500985/job-output.txt, so as soon as the rhel8 repo/containers promote, the check jobs will be green.

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/693997
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=189d9b9211a65857cfd9191a00e9d86894af8293
Submitter: Zuul
Branch: master

commit 189d9b9211a65857cfd9191a00e9d86894af8293
Author: yatinkarel <email address hidden>
Date: Wed Nov 13 12:55:19 2019 +0530

    Readd creation of /var/log/mariadb directory

    https://review.opendev.org/#/c/692850/ cleaned up the
    legacy directories, but since then rhel8 jobs fails while
    starting galera containers with error of missing
    directory /var/log/mariadb, this patch adds it again.

    Closes-Bug: #1851847
    Change-Id: Iea081ecb3fc021fc796c93631ed6f663fd9580db

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.opendev.org/693974
Reason: https://review.opendev.org/693997

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.0.0

This issue was fixed in the openstack/tripleo-heat-templates 12.0.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/706903

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/706903
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=e7351d44c71c007310db5e6481fce6b0b7bb44dc
Submitter: Zuul
Branch: stable/train

commit e7351d44c71c007310db5e6481fce6b0b7bb44dc
Author: Alex Schultz <email address hidden>
Date: Mon Nov 4 08:48:24 2019 -0700

    [train-squash] Backport legacy log folder and readme cleanups

    These 3 backports will save a bit of time during train deployments.

    Ensure service log folder permissions

    We should ensure that the service folders are 0750. We're setting
    /var/log/containers but we should also ensure the service folders also
    have the correct permissions.

    Change-Id: I28e8017edc7e30a60288adf846da722fd6ab310e
    (cherry picked from commit f2147c9974c5e4d9fec91e87a2a42a7c0b8c9d5d)

    Drop legacy log folder and readme

    We switched to containers a long time ago. This patch drops the
    management of a /var/log/<service> directory and the creation of a
    readme indicating that we've moved to containers which makes the logging
    available under /var/log/containers/<service>

    Change-Id: Ia4e991d5d937031ac3312f639b726a944743dd1e
    (cherry picked from commit 7906fb43be72a150b5d10e0e18b21b568895b6e0)

    Readd creation of /var/log/mariadb directory

    https://review.opendev.org/#/c/692850/ cleaned up the
    legacy directories, but since then rhel8 jobs fails while
    starting galera containers with error of missing
    directory /var/log/mariadb, this patch adds it again.

    Closes-Bug: #1851847
    Change-Id: Iea081ecb3fc021fc796c93631ed6f663fd9580db
    (cherry picked from commit 189d9b9211a65857cfd9191a00e9d86894af8293)

tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
