tripleo fails to deploy in CI: Failed to call refresh: /usr/bin/clustercheck

Bug #1713127 reported by Joe Talerico
This bug affects 1 person
Affects: tripleo
Status: Incomplete
Importance: Wishlist
Assigned to: Unassigned

Bug Description

The patch I had didn't touch clustercheck, but I see: http://logs.openstack.org/24/497524/6/check/gate-tripleo-ci-centos-7-scenario001-multinode-oooq/0b63e4b/logs/undercloud/home/jenkins/failed_deployment_list.log.txt.gz

Capturing the specific error output:

            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Exec[galera-ready]: Failed to call refresh: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
            "Error: /Stage[main]/Tripleo::Profile::Pacemaker::Database::Mysql/Exec[galera-ready]: /usr/bin/clustercheck >/dev/null returned 1 instead of one of [0]",
            "Error: Failed to apply catalog: Execution of '/usr/bin/mysql -NBe SELECT CONCAT(User, '@',Host) AS User FROM mysql.user' returned 1: ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2 \"No such file or directory\")",

This seems to be happening frequently enough to track:
http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%20%5C%22Failed%20to%20call%20refresh%3A%20%2Fusr%2Fbin%2Fclustercheck%5C%22
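
For context, the failing resource is an exec in the tripleo mysql pacemaker profile that polls clustercheck until galera reports itself healthy. A minimal sketch of what it looks like (parameter values are illustrative, not verbatim from puppet-tripleo):

    exec { 'galera-ready':
      # clustercheck exits 0 once the local galera node reports Synced;
      # puppet retries until then and fails the deploy if it never does.
      command     => '/usr/bin/clustercheck >/dev/null',
      timeout     => 30,
      tries       => 180,
      try_sleep   => 10,
      environment => ['AVAILABLE_WHEN_READONLY=0'],
    }

The third error above is consistent with that: mysqld never came up, so there is no socket at /var/lib/mysql/mysql.sock for the client to connect to.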

Tags: ci
Changed in tripleo:
milestone: none → pike-rc2
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

There are multiple hits reported (14 across multiple tripleo CI gates); this should be a High bug to get fixed in the Pike scope.

Changed in tripleo:
importance: Medium → High
tags: added: ci
tags: added: alert
Revision history for this message
Michele Baldessari (michele) wrote :

Meh, I think we lost the CIB collection capability in CI when we moved to oooq? :/

Revision history for this message
Michele Baldessari (michele) wrote :

So the issue seems to be that stonith-enabled=false is never set, so pacemaker does not start the DB, and hence clustercheck fails.

On my working deployment I can see the following:
Aug 29 13:42:07 [19665] overcloud-controller-0 cib: info: cib_perform_op: ++ /cib/configuration/crm_config/cluster_property_set[@id='cib-bootstrap-options']: <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>

I can't see this in the failed run, so maybe we have a dependency issue. What is a bit odd is that in the pacemaker tripleo profile we have the following (at step 1):
    class { '::pacemaker':
      hacluster_pwd => hiera('hacluster_pwd'),
    }
    -> class { '::pacemaker::corosync':
      cluster_members      => $pacemaker_cluster_members,
      setup_cluster        => $pacemaker_master,
      cluster_setup_extras => $cluster_setup_extras,
      remote_authkey       => $remote_authkey,
    }
    if $pacemaker_master {
      class { '::pacemaker::stonith':
        disable => !$enable_fencing,
        tries   => $pcs_tries,
      }
    }

The creation of the galera resource happens at step 2, so we should be guaranteed to have the stonith property set to false by then.
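
For reference, the disable branch of ::pacemaker::stonith amounts to setting a single cluster property, roughly like this (a sketch, not verbatim from puppet-pacemaker):

    # Sets stonith-enabled=false in the CIB, i.e. the nvpair shown in the
    # working-deployment log above; without it, pacemaker refuses to start
    # resources on a cluster that has no fencing devices configured.
    # $pcs_tries is the retry count passed down from the tripleo profile.
    pacemaker::property { 'Disable STONITH':
      property => 'stonith-enabled',
      value    => false,
      tries    => $pcs_tries,
    }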

Revision history for this message
Michele Baldessari (michele) wrote :

Oh I see the problem:
Aug 25 15:44:33 localhost os-collect-config: "Error: Execution of '/usr/bin/yum -d 0 -e 0 -y install fence-agents-all' returned 1: Error downloading packages:",
Aug 25 15:44:33 localhost os-collect-config: "Error: /Stage[main]/Pacemaker::Install/Package[fence-agents-all]/ensure: change from purged to present failed: Execution of '/usr/bin/yum -d 0 -e 0 -y install fence-agents-all' returned 1: Error downloading packages:",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Service/Service[pcsd]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/User[hacluster]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[reauthenticate-across-all-nodes]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[auth-successful-across-all-nodes]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/File[etc-pacemaker]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/File[etc-pacemaker-authkey]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[Create Cluster tripleo_cluster]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[Start Cluster tripleo_cluster]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Service/Service[corosync]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Service/Service[pacemaker]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]: Dependency Package[fence-agents-all] has failures: true",
Aug 25 15:44:33 localhost os-collect-config: "Notice: /Stage[main]/Pacemaker::Stonith/Pacemaker::Property[Disable STONITH]/Pcmk_property[property--stonith-enabled]: Dependency Package[fence-agents-all] has failures: true",

Maybe, since this is a multinode job, we should just preinstall fence-agents-all on the nodes to avoid these failures?
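
(For anyone reading the wall of Notices above: Puppet skips every resource downstream of a failed dependency, so the single failed Package[fence-agents-all] takes out the whole corosync/pacemaker chain, including the stonith property. Schematically, and not verbatim from puppet-pacemaker:)

    # One transient yum failure here...
    package { 'fence-agents-all':
      ensure => installed,
    }
    # ...marks everything ordered after it as skipped, so the cluster is
    # never created and stonith-enabled=false is never set.
    -> service { 'pcsd':
      ensure => running,
      enable => true,
    }

Pre-seeding the package in the node image would make the Package resource a no-op at deploy time, so a flaky mirror could no longer fail the chain.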

Revision history for this message
Michele Baldessari (michele) wrote :

Added elastic-recheck query here: https://review.openstack.org/499516

Michele Baldessari (michele)
tags: removed: alert
Changed in tripleo:
milestone: pike-rc2 → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Triaged → Incomplete
Changed in tripleo:
milestone: stein-3 → stein-rc1
Changed in tripleo:
milestone: stein-rc1 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
importance: High → Wishlist