tripleo fails to deploy in CI: Failed to call refresh: /usr/bin/clustercheck

Bug #1713127 reported by Joe Talerico on 2017-08-25
This bug affects 1 person
Affects: tripleo
Importance: High
Assigned to: Unassigned

Bug Description

The patch I had didn't touch clustercheck, but I see: http://logs.openstack.org/24/497524/6/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/0b63e4b/logs/undercloud/home/jenkins/failed_deployment_list.log.txt.gz

Capturing the specific error output:

            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Exec[galera-ready]: Failed to call refresh: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Exec[galera-ready]: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
            "Error: Failed to apply catalog: Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 \"No such file or directory\")",

This seems to be happening frequently enough to track:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%20%5C%22Failed%20to%20call%20refresh%3A%20%2Fusr%2Fbin%2Fclustercheck%5C%22

Tags: ci
Changed in tripleo:
milestone: none → pike-rc2
Bogdan Dobrelya (bogdando) wrote:

There are multiple hits reported (14 across multiple tripleo CI gates); this should be a High bug to get fixed within the Pike scope.

Changed in tripleo:
importance: Medium → High
tags: added: ci
tags: added: alert
Michele Baldessari (michele) wrote:

Meh, I think we lost the CIB collection capability in the CI when we moved to oooq? :/

Michele Baldessari (michele) wrote:

So the issue seems to be that stonith-enabled=false is never set, so pacemaker does not start the DB and hence clustercheck fails.

On my working deployment I can see the following:
Aug 29 13:42:07 [19665] overcloud-controller-0 cib: info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>

I can't see this in the failing run, so maybe we have a dependency issue. What is a bit odd is that in the pacemaker tripleo profile we have the following (at step 1):
    class { '::pacemaker':
      hacluster_pwd => hiera('hacluster_pwd'),
    }
    -> class { '::pacemaker::corosync':
      cluster_members      => $pacemaker_cluster_members,
      setup_cluster        => $pacemaker_master,
      cluster_setup_extras => $cluster_setup_extras,
      remote_authkey       => $remote_authkey,
    }
    if $pacemaker_master {
      class { '::pacemaker::stonith':
        disable => !$enable_fencing,
        tries   => $pcs_tries,
      }
    }

The creation of the galera resource happens at step 2, so we should be guaranteed to have the stonith property set to false by then.
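
For context, "setting the stonith property" at step 1 ultimately comes down to the pacemaker::stonith class declaring the Pcmk_property that shows up in the logs further down. A rough, hedged approximation of that resource (not the verbatim class body) looks like:

    # Disable fencing cluster-wide so pacemaker will start resources even
    # without a configured stonith device; approximation of what
    # Pcmk_property[property--stonith-enabled] applies.
    pcmk_property { 'property--stonith-enabled':
      property => 'stonith-enabled',
      value    => 'false',
    }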

Michele Baldessari (michele) wrote:

Oh I see the problem:
Aug 25 15:44:33 localhost os-collect-config: "Error: Execution of '/usr/bin/yum -d 0 -e 0 -y install fence-agents-all' returned 1: Error downloading packages:",
Aug 25 15:44:33 localhost os-collect-config: "Error: /Stage[main]/Pacemaker::Install/Package[fence-agents-all]/ensure: change from purged to present failed: Execution of '/usr/bin/yum -d 0 -e 0 -y install fence-agents-all' returned 1: Error downloading packages:",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Service/Service[pcsd]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/File[etc-pacemaker]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Service/Service[corosync]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Service/Service[pacemaker]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Dependency Package[fence-agents-all] has failures: true",

Maybe, since this is a multinode job, we should just preinstall fence-agents-all on the nodes in order to avoid these failures?
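
A hedged sketch of that workaround, assuming the package can be pulled in by an early manifest (or an equivalent image-build/CI setup step) before the pacemaker profile runs:

    # Make sure the fencing agents are already present so a transient yum
    # download failure cannot break the whole step 1 dependency chain.
    package { 'fence-agents-all':
      ensure => installed,
    }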

Michele Baldessari (michele) wrote:

Added elastic-recheck query here: https://review.openstack.org/499516

tags: removed: alert
Changed in tripleo:
milestone: pike-rc2 → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Triaged → Incomplete
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1