queens branch tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades is broken

Bug #1777132 reported by Marios Andreou on 2018-06-15
This bug affects 2 people
Affects: tripleo
Importance: High
Assigned to: Quique Llorente

Bug Description

The queens scenario000 upgrade job is broken. The error happens during the controller upgrade, with an example trace at [0] (and another example at [1]):

  u'TASK [Check pacemaker cluster running before upgrade] **************************',
  u'fatal: [192.168.24.16]: FAILED! => {"ansible_job_id": "89193676466.46747", "changed": false, "cmd": "pcs cluster status", "failed": true, "finished": 1, "msg": "[Errno 2] No such file or directory", "rc": 2}',
  u'',
  u'PLAY RECAP *********************************************************************',
  u'192.168.24.16 : ok=32 changed=9 unreachable=0 failed=1 ',
  u'']

The failing task is a step0 validation that the pacemaker cluster is running; it is defined at [2]. AFAICS the problem may in fact be that pacemaker is not deployed here: there are no 'cluster' or 'pacemaker' directories in the controller's /var/log/ at [3], unlike the passing master run at [4].
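As a rough illustration of why the trace above points at a missing binary rather than a stopped cluster: "[Errno 2] No such file or directory" is what the OS reports when the executable itself cannot be found, before any cluster check runs. The sketch below reproduces that in plain Python. The helper name run_status and the deliberately nonexistent binary name are assumptions made for the example; this is not the code Ansible runs.

```python
import errno
import subprocess


def run_status(cmd):
    """Run cmd and return (rc, msg), loosely mirroring the fields in the failure above."""
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:
        # Raised before the command starts: the executable is not on PATH.
        return exc.errno, "[Errno {}] {}".format(exc.errno, exc.strerror)
    return proc.returncode, proc.stderr.decode(errors="replace")


# Hypothetical binary name chosen so that it does not exist on any host.
rc, msg = run_status(["pcs-binary-that-is-not-installed", "cluster", "status"])
print(rc, msg)
```

If pcs were installed but the cluster were down, the command would start and exit with its own return code and stderr instead of ENOENT.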

Filing the bug to capture this info for now.

[0] http://logs.openstack.org/24/567224/23/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/6fdcee8/logs/undercloud/home/zuul/overcloud_upgrade_run_Controller.log.txt.gz
[1] http://logs.openstack.org/24/575424/1/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/6389692/logs/undercloud/home/zuul/overcloud_upgrade_run_Controller.log.txt.gz
[2] https://github.com/openstack/tripleo-heat-templates/blob/b7dcbd8da79b6119b0b9e35f5cd221338f1f6306/puppet/services/pacemaker.yaml#L148
[3] http://logs.openstack.org/24/575424/1/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/6389692/logs/subnode-2/var/log/
[4] http://logs.openstack.org/86/575186/4/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/bedb420/logs/subnode-2/var/log/

tags: added: ci quickstart
Changed in tripleo:
assignee: Marios Andreou (marios-b) → Quique Llorente (quiquell)
Marios Andreou (marios-b) wrote :

Had another pass today; updating with info since I know quique (rover) is also checking this.
Still no major breakthrough, but there is definitely no pacemaker here, so it is no surprise that the task which checks cluster status is failing. Don't know why yet though.

I've been comparing a queens run from http://logs.openstack.org/24/567224/28/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/76476d9/ with a master run from http://logs.openstack.org/46/575146/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/59a34e5/

Evidence for 'definitely no pacemaker' on queens:

 * queens: no pacemaker/corosync packages in yum.log, only puppet-pacemaker and ansible-pacemaker @ http://logs.openstack.org/24/567224/28/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/76476d9/logs/subnode-2/var/log/yum.log.txt.gz
          but master has pacemaker-libs/cli, clusterlibs etc. @ http://logs.openstack.org/46/575146/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/59a34e5/logs/subnode-2/var/log/yum.log.txt.gz

 * the controller (subnode-2) rpm-qa log has no pacemaker in queens
  http://logs.openstack.org/24/567224/28/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/76476d9/logs/subnode-2/rpm-qa.txt.gz
  vs master at http://logs.openstack.org/46/575146/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/59a34e5/logs/subnode-2/rpm-qa.txt.gz

 * the queens controller (subnode-2) journal has this error and no other pcs/cluster related entries: "Jun 18 09:32:44 centos-7-rax-iad-0000192611 puppet-user[8812]: Puppet::Type::Service::ProviderPacemaker_xml: file crm_node does not exist"
  http://logs.openstack.org/24/567224/28/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/76476d9/logs/subnode-2/var/log/journal.txt.gz#_Jun_18_09_32_44
   but on master see e.g. the cluster start @ http://logs.openstack.org/46/575146/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/59a34e5/logs/subnode-2/var/log/journal.txt.gz#_Jun_15_23_53_59
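For anyone repeating the rpm-qa comparison from the list above, a minimal sketch of the filtering step might look like this. The package-name prefixes are an assumption based on the names mentioned in this report, and the sample lines use placeholder versions rather than real log contents. The pattern is anchored at the start of the line so that puppet-pacemaker and ansible-pacemaker are not counted as cluster packages.

```python
import re

# Assumption: prefixes of the real cluster packages we care about.
# Anchored so "puppet-pacemaker-..." and "pcsc-lite-..." do not match.
CLUSTER_PKG = re.compile(r"^(pacemaker|corosync|corosynclib|pcs)-")


def cluster_packages(rpm_qa_lines):
    """Return the sorted rpm-qa entries that look like cluster packages."""
    stripped = (line.strip() for line in rpm_qa_lines)
    return sorted(line for line in stripped if CLUSTER_PKG.match(line))


# Placeholder data (hypothetical versions), shaped like rpm -qa output.
queens_sample = [
    "puppet-pacemaker-0.0.0-0.el7.noarch",
    "ansible-pacemaker-0.0.0-0.el7.noarch",
    "pcsc-lite-libs-0.0.0-0.el7.x86_64",
]
master_sample = [
    "pcs-0.0.0-0.el7.x86_64",
    "pacemaker-libs-0.0.0-0.el7.x86_64",
    "pacemaker-0.0.0-0.el7.x86_64",
    "puppet-pacemaker-0.0.0-0.el7.noarch",
]

print("queens:", cluster_packages(queens_sample))
print("master:", cluster_packages(master_sample))
```

Feeding the lines of the two downloaded rpm-qa.txt files into cluster_packages should give an empty list for the queens run if pacemaker really is absent there.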

One diff I can see in the templates, not sure yet if it is relevant: on queens we include docker.yaml but on master we do not. Compare queens @ http://logs.openstack.org/24/567224/28/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/76476d9/logs/undercloud/home/zuul/overcloud_deploy.sh.txt.gz with master @ http://logs.openstack.org/46/575146/2/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/59a34e5/logs/undercloud/home/zuul/overcloud_deploy.sh.txt.gz
BUT
both then also include the scenario000 multinode containers environment, and in both queens and master pacemaker is enabled and set in the controller services: https://github.com/openstack/tripleo-heat-templates/blob/master/ci/environments/scenario000-multinode-containers.yaml vs
https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/ci/environments/scenario000-multinode-containers.yaml

Marios Andreou (marios-b) wrote :

just sanity checked with rlandy (thanks!) that I'm at least looking in the right places. Adding a note which might be helpful for anyone else catching up:

biggest unknown/problem and ultimate root of this bug is: why no pacemaker on queens, but pacemaker on master?

job is defined here https://github.com/openstack-infra/tripleo-ci/blob/cf6b217b2e4f15edbf08dc60f60845e3eb500abc/zuul.d/multinode-jobs.yaml#L384 with toci_jobtype: multinode-1ctlr-featureset051

which means https://github.com/openstack/tripleo-quickstart/blob/master/config/general_config/featureset051.yml and that is specifying https://github.com/openstack/tripleo-heat-templates/blob/stable/queens/ci/environments/scenario000-multinode-containers.yaml

On queens, it looks like puppet _tries_ to start the cluster; from the controller log http://logs.openstack.org/24/567224/33/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-upgrades/b846f03/logs/subnode-2/var/log/journal.txt.gz

 Jun 22 09:36:59 centos-7-ovh-bhs1-0000291507 puppet-user[8404]: Puppet::Type::Service::ProviderPacemaker: file pcs does not exist

So there might be a bug-bug here: the cluster didn't start, yet the deployment passed OK.
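A quick way to spot this condition in other runs is to scan the controller journal for the puppet provider messages quoted in this report. The sketch below is only that: the regular expression is an assumption derived from the two log lines seen so far (ProviderPacemaker and ProviderPacemaker_xml), and the function name is made up for the example.

```python
import re

# Assumption: message shape taken from the two journal lines quoted in this bug.
MISSING_FILE = re.compile(
    r"Puppet::Type::Service::ProviderPacemaker\w*: file (\S+) does not exist"
)


def missing_cluster_binaries(journal_lines):
    """Return the sorted names of binaries the pacemaker providers could not find."""
    found = set()
    for line in journal_lines:
        match = MISSING_FILE.search(line)
        if match:
            found.add(match.group(1))
    return sorted(found)


# The two lines below are the ones quoted earlier in this report.
sample = [
    "Jun 18 09:32:44 centos-7-rax-iad-0000192611 puppet-user[8812]: "
    "Puppet::Type::Service::ProviderPacemaker_xml: file crm_node does not exist",
    "Jun 22 09:36:59 centos-7-ovh-bhs1-0000291507 puppet-user[8404]: "
    "Puppet::Type::Service::ProviderPacemaker: file pcs does not exist",
]

print(missing_cluster_binaries(sample))
```

A non-empty result on a deployment that still reported success would be a hint that the cluster never started.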

Matt Young (halcyondude) on 2018-06-25
tags: removed: quickstart

Reviewed: https://review.openstack.org/577783
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=6b2ab26582e66839b4c199ae74729a2e66ad983f
Submitter: Zuul
Branch: stable/pike

commit 6b2ab26582e66839b4c199ae74729a2e66ad983f
Author: Sofer Athlan-Guyot <email address hidden>
Date: Mon Jun 25 12:39:50 2018 +0200

    Adding missing pacemaker definition for scenario000.

    During queens jobs upgrade, we install the overcloud using the pike
    tripleo heat templates. So we need this scenario000 to have the
    pacemaker resource definition in order for the queen upgrade ci to
    work.

    Change-Id: Id067bb7a365ebfd1f8cb9d8d3c4518accaa3fa5b
    Closes-bug: #1777132

tags: added: in-stable-pike

This issue was fixed in the openstack/tripleo-heat-templates 7.0.14 release.

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Natal Ngétal (hobbestigrou) wrote :

The patch was merged. So this ticket can be closed.

Changed in tripleo:
milestone: stein-2 → stein-3
Changed in tripleo:
status: Triaged → Fix Released