stable/newton gate-tripleo-ci-centos-7-nonha-multinode-oooq broken

Bug #1690373 reported by Marios Andreou on 2017-05-12
This bug affects 2 people
Affects: tripleo
Importance: Critical
Assigned to: Marios Andreou

Bug Description

Examples include https://review.openstack.org/#/c/463529/ and https://review.openstack.org/#/c/463985/
https://review.openstack.org/#/c/463763/1

The overcloud deploy task fails, but I can't spot the error in the logs. For example, the console shows the failure at http://logs.openstack.org/29/463529/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/efc5520/console.html#_2017-05-11_14_00_49_785454 but I can't see any errors in http://logs.openstack.org/29/463529/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/efc5520/logs/subnode-2/var/log/messages.txt.gz

Strangely on the same review that job passed successfully on the 9th http://logs.openstack.org/29/463529/1/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/362437e/console.html#_2017-05-09_12_36_45_454626

The stable/newton gate-tripleo-ci-centos-7-ovb-ha-oooq job was also failing previously, as tracked at https://bugs.launchpad.net/tripleo/+bug/1690132. The fix for that bug, https://review.openstack.org/#/c/463985/, is now blocked on this bug because gate-tripleo-ci-centos-7-nonha-multinode-oooq is failing.

Tags: ci
Michele Baldessari (michele) wrote :

Looking at:
logs.openstack.org/85/463985/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/1067ca3/logs

The last operation from subnode-2 is the following:
May 12 06:56:56 centos-7-2-node-rax-ord-8800008-577002 os-collect-config: [2017-05-12 06:56:56,622] (os-refresh-config) [INFO] Completed phase post-configure

May 12 06:56:53 centos-7-2-node-rax-ord-8800008-577002 os-collect-config: ++ curl -s -w '%{http_code}' -X POST -H 'Content-Type: application/json' -o /tmp/tmp.sj2FHanUyh --data-binary '{"deploy_stdout": "os-apply-config deployment 0e65c5de-e457-440a-9c09-4a51723bf532 completed", "deploy_status_code": "0"}' 'http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A70e0eeedf2db4f6790f3bbb50c76e223%3Astacks%2Fovercloud-Controller-varysnuo6nyx-0-whpm4maqfizz-Controller-ppc2knqolux2%2F04dc8360-364d-4647-a7c2-90965a94140e%2Fresources%2FInstanceIdDeployment?Timestamp=2017-05-12T06%3A56%3A29Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=a58bdc75277f472aa7690c18184b164f&SignatureVersion=2&Signature=vQ2ZL0avTlmW%2Fzoba4%2BaSQR1i0PofnnpKbCScL6ZBZw%3D'
May 12 06:56:56 centos-7-2-node-rax-ord-8800008-577002 os-collect-config: + status=200
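
The signal URL in the curl call above is hard to read because the Heat resource identifier is percent-encoded into a single path segment. A minimal sketch for decoding it, assuming the ARN follows the usual arn:openstack:heat::<tenant>:stacks/<name>/<id>/resources/<resource> layout; decode_signal_url is a hypothetical helper, not part of os-collect-config or Heat, and the query string is trimmed to the Timestamp parameter:

```python
from urllib.parse import parse_qs, unquote, urlsplit

# URL copied from the os-collect-config log line; the query string is
# trimmed to the Timestamp parameter to keep the example short.
SIGNAL_URL = (
    "http://192.168.24.1:8000/v1/signal/"
    "arn%3Aopenstack%3Aheat%3A%3A70e0eeedf2db4f6790f3bbb50c76e223"
    "%3Astacks%2Fovercloud-Controller-varysnuo6nyx-0-whpm4maqfizz-Controller-ppc2knqolux2"
    "%2F04dc8360-364d-4647-a7c2-90965a94140e"
    "%2Fresources%2FInstanceIdDeployment"
    "?Timestamp=2017-05-12T06%3A56%3A29Z"
)


def decode_signal_url(url):
    """Hypothetical helper: split a Heat signal URL into readable fields."""
    parts = urlsplit(url)
    # The last path segment is the percent-encoded ARN (its slashes are %2F).
    arn = unquote(parts.path.rsplit("/", 1)[-1])
    # Assumed layout:
    # arn:openstack:heat::<tenant>:stacks/<name>/<id>/resources/<resource>
    prefix, resource_path = arn.rsplit(":", 1)
    tenant = prefix.rsplit(":", 1)[-1]
    _, stack_name, stack_id, _, resource = resource_path.split("/")
    query = parse_qs(parts.query)
    return {
        "tenant": tenant,
        "stack_name": stack_name,
        "stack_id": stack_id,
        "resource": resource,
        "timestamp": query["Timestamp"][0],
    }


for key, value in decode_signal_url(SIGNAL_URL).items():
    print(f"{key}: {value}")
```

Decoded this way, the request is a signal for the InstanceIdDeployment resource of the nested Controller stack, which makes it easier to match against the POST line in heat-api-cfn.log.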

On the undercloud we see the following:
From heat-api-cfn.log:
2017-05-12 06:56:56.594 4047 INFO eventlet.wsgi.server [req-e2a159fe-3806-45a2-9972-4d510300d84a 47c98f069b0d4633a7a17156c6ba04e1 4af916985a214dd5955f7ed87971413c - 6e59edb5d4cf4033b30791e96cfcb654 6e59edb5d4cf4033b30791e96cfcb654] 192.168.24.3 - - [12/May/2017 06:56:56] "POST /v1/signal/arn%3Aopenstack%3Aheat%3A%3A70e0eeedf2db4f6790f3bbb50c76e223%3Astacks%2Fovercloud-Controller-varysnuo6nyx-0-whpm4maqfizz-Controller-ppc2knqolux2%2F04dc8360-364d-4647-a7c2-90965a94140e%2Fresources%2FInstanceIdDeployment?Timestamp=2017-05-12T06%3A56%3A29Z&SignatureMethod=HmacSH...


Changed in tripleo:
importance: High → Critical
tags: added: alert ci
Marios Andreou (marios-b) wrote :

Just poked again and came across the overcloud-deploy log. bandini's comment about the undercloud swift/glance notwithstanding, it seems to fail at exactly the timeout, see http://logs.openstack.org/29/463529/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/c5a4cd3/logs/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2017-05-15_15_59_59 - maybe we need to increase it?

the last time it was successful it ran well within the timeout 1209-1228 @ http://logs.openstack.org/29/463529/1/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/362437e/logs/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2017-05-09_12_30_11 (same review but on the 9th May)

Both those logs also contain this http://logs.openstack.org/29/463529/1/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/362437e/logs/undercloud/home/jenkins/overcloud_deploy.log.txt.gz#_2017-05-09_12_07_08 ("Error finding 'bm-deploy-kernel' in glance") but it must be non-fatal/unrelated since it exists in the job that passed too.

Marios Andreou (marios-b) wrote :

o/ bandini FYI about glance/nova - I had a look too and it is useful to compare with the successful run from the 9th. There are plenty of errors, but they also exist in the job that passed... for example http://logs.openstack.org/29/463529/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/c5a4cd3/logs/undercloud/var/log/glance/api.log.txt.gz#_2017-05-15_14_38_03_690 and http://logs.openstack.org/29/463529/1/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/362437e/logs/undercloud/var/log/glance/api.log.txt.gz#_2017-05-09_12_06_09_593 for glance-api, like "2017-05-09 12:06:09.593 1920 ERROR swiftclient [req-d81dfb6b-0f50-4dcc-9c1c-8a00344960ff 5e8396de58da4f1cb4a0983503c46a04 a244c3723d104ea8bf00ebf293fc5e0e - default default] Container HEAD failed: http://192.168.24.1:8080/v1/AUTH_62ccccc9d8cf48939e5b1fd01ef55536/glance 404 Not Found"

Michele Baldessari (michele) wrote :

Thanks marios, big meh for us not fixing irrelevant errors in logs ;)

- Working:
http://logs.openstack.org/29/463529/1/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/362437e/console.html.gz

From subnode-2 messages:
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 Script Seconds
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 --------------------------------------- ----------
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 20-os-apply-config 0.605
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 20-os-net-config 0.567
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 25-set-network-gateway 0.572
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 40-hiera-datafiles 0.582
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 40-truncate-nova-config 0.005
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 51-hosts 0.574
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 55-heat-config 4.240
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: dib-run-parts Tue May 9 12:09:06 UTC 2017 --------------------- END PROFILING ---------------------
May 9 12:09:06 centos-7-2-node-osic-cloud1-s3500-8770310-574527 os-collect-config: [2017-05-09 12:09:06,232] (os-refresh-config) [INFO] Completed phase configure

- Not working:
http://logs.openstack.org/29/463529/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/c5a4cd3/

From subnode-2 messages:
May 15 22:26:58 centos-7-2-node-osic-cloud1-s3500-8826984-579074 os-collect-config: dib-run-parts Mon May 15 22:26:58 UTC 2017 Script Seconds
May 15 22:26:58 centos-7-2-node-osic-cloud1-s3500-8826984-579074 os-collect-config: dib-run-parts Mon May 15 22:26:58 UTC 2017 --------------------------------------- ----------
May 15 22:26:58 centos-7-2-node-osic-cloud1-s3500-8826984-579074 os-collect-config: dib-run-parts Mon May 15 22:26:58 UTC 2017
May 15 22:26:58 centos-7-2-node-osic-cloud1-s3500-8826984-579074 os-collect-config: dib-run-parts Mon May 15 22:26:58 UTC 2017 20-os-apply-co...


Michele Baldessari (michele) wrote :

Also:
- Broken:
May 15 21:53:23 centos-7-2-node-osic-cloud1-s3500-8826984-579074 yum[31250]: Installed: os-collect-config-5.2.0-0.20170428091843.3d7835d.el7.centos.noarch
May 15 22:01:15 centos-7-2-node-osic-cloud1-s3500-8826984-579074 yum[15626]: Updated: os-collect-config-5.2.0-1.el7.noarch

- Working:
May 9 11:37:19 centos-7-2-node-osic-cloud1-s3500-8770310-574527 yum[31151]: Installed: os-collect-config-5.2.0-0.20170428091843.3d7835d.el7.centos.noarch

Michele Baldessari (michele) wrote :

Mmmmh I really wonder if this is not yet again a manifestation of BZ:
https://bugzilla.redhat.com/show_bug.cgi?id=1347652

Cf. the update logs:
- Not working
May 15 21:53:22 centos-7-2-node-osic-cloud1-s3500-8826984-579074 yum[31250]: Installed: openstack-tripleo-image-elements-5.3.0-0.20170428024030.3ecf477.el7.centos.noarch
May 15 22:01:13 centos-7-2-node-osic-cloud1-s3500-8826984-579074 yum[15626]: Updated: openstack-tripleo-image-elements-5.3.0-1.el7.noarch

- Working
May 9 11:37:16 centos-7-2-node-osic-cloud1-s3500-8770310-574527 yum[31151]: Installed: openstack-tripleo-image-elements-5.3.0-0.20170428024030.3ecf477.el7.centos.noarch
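
To check quickly whether a run updated any package after the initial install, the yum lines in messages can be filtered by action. A rough sketch, assuming the syslog-style "yum[pid]: Action: package" format quoted above (the plain yum.log file uses a shorter prefix, so the pattern may need adjusting); yum_events and updated_packages are hypothetical helper names:

```python
import re

# Matches syslog-style yum lines such as:
#   "... yum[15626]: Updated: openstack-tripleo-image-elements-5.3.0-1.el7.noarch"
YUM_LINE = re.compile(
    r"yum\[\d+\]: (?P<action>Installed|Updated|Erased): (?P<package>\S+)"
)

# Lines copied from the two runs compared in this comment.
FAILING_LINES = [
    "May 15 21:53:22 centos-7-2-node-osic-cloud1-s3500-8826984-579074 yum[31250]: "
    "Installed: openstack-tripleo-image-elements-5.3.0-0.20170428024030.3ecf477.el7.centos.noarch",
    "May 15 22:01:13 centos-7-2-node-osic-cloud1-s3500-8826984-579074 yum[15626]: "
    "Updated: openstack-tripleo-image-elements-5.3.0-1.el7.noarch",
]
WORKING_LINES = [
    "May 9 11:37:16 centos-7-2-node-osic-cloud1-s3500-8770310-574527 yum[31151]: "
    "Installed: openstack-tripleo-image-elements-5.3.0-0.20170428024030.3ecf477.el7.centos.noarch",
]


def yum_events(lines):
    """Return (action, package) tuples for every yum line found."""
    events = []
    for line in lines:
        match = YUM_LINE.search(line)
        if match:
            events.append((match.group("action"), match.group("package")))
    return events


def updated_packages(lines):
    """Return only the packages that yum reported as Updated."""
    return [pkg for action, pkg in yum_events(lines) if action == "Updated"]


print("failing run updates:", updated_packages(FAILING_LINES))
print("working run updates:", updated_packages(WORKING_LINES))
```

Only the failing run reports an update of openstack-tripleo-image-elements, which is consistent with the package update (and its %post script) being the trigger.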

Marios Andreou (marios-b) wrote :

nice catch - that might be it - I mean there is indeed no os-refresh-config updated here http://logs.openstack.org/29/463529/1/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/362437e/logs/subnode-2/var/log/yum.log.txt.gz where it passed, and it is updated here http://logs.openstack.org/29/463529/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/c5a4cd3/logs/subnode-2/var/log/yum.log.txt.gz on the failing job.

we could versionlock os-refresh-config to see if it fixes the job? not sure how to do that yet

Michele Baldessari (michele) wrote :

As an additional data point via rdo dist-git commit:
commit 943d53e96ef5dce6d4261100a9f00b12e53644c7 (HEAD -> rpm-master, origin/rpm-master, origin/HEAD)
Author: James Slagle <email address hidden>
Date: Sun Jan 8 15:41:58 2017 -0500

    Remove %post script for orc scripts

    All os-refresh-config scripts and os-apply-config templates are now
    delivered via packaging or via Heat SoftwareDeployment's directly. We no
    longer need this %post script to resync the scripts from the element
    manifest on package %update.

    Change-Id: Ie205c93a3cdcc3c68668327fde6327cd373a8739

The offending %post was removed in master but not in our openstack-tripleo-image-elements-5.3.0 that is being used here.

Yeah version-locking openstack-tripleo-image-elements should help here (or even better just removing it completely from the image/system, as it is needed only on the undercloud?)

Emilien Macchi (emilienm) wrote :

Please look at https://review.rdoproject.org/r/6681, which addresses the latest comment.

Reviewed: https://review.openstack.org/465934
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=748f4680bb49e6c63012d77cda1133fd498d9b34
Submitter: Jenkins
Branch: master

commit 748f4680bb49e6c63012d77cda1133fd498d9b34
Author: Emilien Macchi <email address hidden>
Date: Thu May 18 08:13:32 2017 -0400

    Dummy patch to build a new rpm in RDO

    The version of openstack-tripleo-image-elements in RDO for stable
    branches is too old and it might be one of the reasons why TripleO CI is
    failing to deploy TripleO on Newton.

    See: https://bugs.launchpad.net/tripleo/+bug/1690373

    This patch is doing nothing important but by merging it, we'll build a
    new package in RDO for master and all branches and hopefully get a new
    build that will help us to resolve the issue.

    Change-Id: Ie6c21aa8d071e791eb0df6db6d700074d3141cf0
    Related-Bug: #1690373

Emilien Macchi (emilienm) wrote :

I'm doing some investigation and comparing packages between the earlier working run and the current failing run:

undercloud:
https://www.diffchecker.com/ExAwPKMS

overcloud:
https://www.diffchecker.com/Ud4urbft

I'm looking at the results and will keep digging, but any help is more than welcome.
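
For anyone who wants to repeat this kind of comparison locally rather than through a diff site, here is a small sketch that reports packages whose version or release differs between two package lists. It assumes each entry is a name-version-release.arch string as printed by rpm -qa; split_nvra and changed_packages are hypothetical helpers, and the sample entries are the package versions already quoted earlier in this bug:

```python
def split_nvra(package):
    """Split 'name-version-release.arch' into (name, 'version-release.arch').

    RPM versions and releases cannot contain '-', so the last two
    hyphens always separate name, version and release.arch.
    """
    name, version, release_arch = package.rsplit("-", 2)
    return name, f"{version}-{release_arch}"


def changed_packages(before, after):
    """Map package name -> (old, new) for every package that differs.

    A value of None means the package is missing from that list.
    """
    old = dict(split_nvra(pkg) for pkg in before)
    new = dict(split_nvra(pkg) for pkg in after)
    changes = {}
    for name in sorted(old.keys() | new.keys()):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


# Versions quoted from the working (May 9) and failing (May 15) runs.
WORKING = [
    "os-collect-config-5.2.0-0.20170428091843.3d7835d.el7.centos.noarch",
    "openstack-tripleo-image-elements-5.3.0-0.20170428024030.3ecf477.el7.centos.noarch",
]
FAILING = [
    "os-collect-config-5.2.0-1.el7.noarch",
    "openstack-tripleo-image-elements-5.3.0-1.el7.noarch",
]

for name, (old, new) in changed_packages(WORKING, FAILING).items():
    print(f"{name}: {old} -> {new}")
```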

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/465936

Change abandoned by Emilien Macchi (<email address hidden>) on branch: stable/newton
Review: https://review.openstack.org/465935

Alan Pevec (apevec) wrote :

openstack-tripleo-image-elements-5.3.0-2.el7, which includes the %post fix, was built for the RDO Newton testing repo.

Julie Pichon (jpichon) wrote :

I believe this one is indeed resolved as per comment #17. E.g. https://review.openstack.org/#/c/421541/ used to fail with this and merged on the newton branch yesterday.

Alan Pevec (apevec) on 2017-05-26
tags: removed: alert
Changed in tripleo:
status: Triaged → Fix Released