teclient returns failures when attempting to provision a stack in rdo-cloud

Bug #1797918 reported by Marios Andreou
This bug affects 3 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

(infra related?) ovb jobs are failing in run-v3.yaml with "(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:164): sudo kill -9 4733".
The trace looks like:

 2018-10-15 08:32:43.318279 | primary | ERROR

 2018-10-15 08:10:58.065475 | primary | Installing collected packages: lockfile, docutils, python-daemon, extras, pbr, gear
 2018-10-15 08:10:58.631124 | primary | Successfully installed docutils-0.14 extras-1.0.0 gear-0.12.0 lockfile-0.12.2 pbr-4.3.0 python-daemon-2.2.0
 2018-10-15 08:10:58.720575 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:167): NETISO_ENV=multi-nic
 2018-10-15 08:10:58.721003 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:170): ./testenv-client -b 192.168.100.250:4730 -t 10800 --envsize 4 --ucinstance 4aa4e750-2270-407d-b65d-24e5ea108108 --net-iso multi-nic -- ./toci_quickstart.sh
 2018-10-15 08:10:58.721160 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:164): sleep 1200
 2018-10-15 08:30:58.724894 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:164): '[' '!' -e /tmp/toci.started ']'
 2018-10-15 08:30:58.725166 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:164): sudo kill -9 4733
 2018-10-15 08:32:43.318279 | primary | ERROR
 2018-10-15 08:32:43.318517 | primary | {
 2018-10-15 08:32:43.318595 | primary | "delta": "0:21:34.554787",
 2018-10-15 08:32:43.318657 | primary | "end": "2018-10-15 08:31:08.747351",
 2018-10-15 08:32:43.318716 | primary | "msg": "non-zero return code",
 2018-10-15 08:32:43.318772 | primary | "rc": -9,
 2018-10-15 08:32:43.318826 | primary | "start": "2018-10-15 08:09:34.192564"
 2018-10-15 08:32:43.318879 | primary | }
 2018-10-15 08:32:43.336155 |
 2018-10-15 08:32:43.336280 | PLAY RECAP
 2018-10-15 08:32:43.336381 | primary | ok: 25 changed: 14 unreachable: 0 failed: 1
 2018-10-15 08:32:43.336444 |
 2018-10-15 08:32:43.618175 | RUN END RESULT_NORMAL: [untrusted : git.openstack.org/openstack-infra/tripleo-ci/playbooks/tripleo-ci/run-v3.yaml@master]
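
The trace above suggests a start watchdog: the script sleeps 1200 seconds, checks whether /tmp/toci.started exists, and sends SIGKILL to the testenv-client process if it does not, which is why the task reports "rc": -9. The following is a minimal Python sketch of that pattern, not the actual toci_gate_test.sh logic; the function name `run_with_start_watchdog` and its parameters are hypothetical.

```python
import os
import subprocess
import sys
import threading


def run_with_start_watchdog(cmd, marker, grace_seconds):
    """Run cmd; SIGKILL it if `marker` does not exist after grace_seconds.

    Hypothetical helper that mirrors the behavior inferred from the trace:
        sleep 1200; [ ! -e /tmp/toci.started ] && kill -9 <pid>
    """
    proc = subprocess.Popen(cmd)

    def watchdog():
        # Only kill if the job never signalled that it started
        # and the process is still running.
        if not os.path.exists(marker) and proc.poll() is None:
            proc.kill()  # sends SIGKILL on POSIX

    timer = threading.Timer(grace_seconds, watchdog)
    timer.daemon = True
    timer.start()
    try:
        # On POSIX a process killed by signal N has returncode -N.
        return proc.wait()
    finally:
        timer.cancel()
```

With a command that never creates the marker file, the call returns -9 on POSIX systems, matching the "rc": -9 in the Ansible output. The return code alone therefore says only that the watchdog fired, not why the environment never started.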

Examples include the master gate at [1,2,3], pike/rocky [4,5] and queens [6], failing on what looks like a transient network issue (downloading packages).

[1] http://logs.rdoproject.org/98/604298/39/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset053/3cbe87e//job-output.txt.gz
[2] http://logs.rdoproject.org/98/604298/39/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset053/3cbe87e//job-output.txt.gz
[3] http://logs.rdoproject.org/98/604298/39/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/192993e/job-output.txt.gz
[4] pike http://logs.rdoproject.org/48/602248/8/openstack-check/tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset022/2a7ef75//job-output.txt.gz
[5] rocky http://logs.rdoproject.org/93/604293/29/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/033a869//job-output.txt.gz
[6] queens http://logs.rdoproject.org/24/567224/119/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/51539f0//job-output.txt.gz

wes hayutin (weshayutin)
Changed in tripleo:
milestone: none → stein-1
importance: Undecided → Critical
Marios Andreou (marios-b) wrote :
description: updated
wes hayutin (weshayutin) wrote :

This is normal behavior when rdo-cloud or the te-broker is down.

In this case rdo-cloud was down.

wes hayutin (weshayutin) wrote :

This command failed:

 ./testenv-client -b 192.168.100.250:4730 -t 10800 --envsize 4 --ucinstance 4aa4e750-2270-407d-b65d-24e5ea108108 --net-iso multi-nic -- ./toci_quickstart.sh

summary:
- (infra related?) ovb jobs failing in run-v3.yaml with "(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:164): sudo kill -9 4733"
+ teclient returns failures when attempting to provision a stack in rdo-cloud
wes hayutin (weshayutin) wrote :

The issue was with RDO-Cloud and is now resolved. Closing.

Changed in tripleo:
status: Triaged → Won't Fix
Marios Andreou (marios-b) wrote :
Changed in tripleo:
status: Won't Fix → Triaged
Alan Pevec (apevec) wrote :

needinfo: in job-output.txt.gz there's only "msg": "non-zero return code" after the testenv-client call

wes hayutin (weshayutin) wrote :

Alan, that is done for security purposes, so as not to expose heat stack creation details.
See my IRC messages for logs, please.

wes hayutin (weshayutin) wrote :

2018-10-16 18:34:06.324807 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:170): ./testenv-client -b 192.168.100.250:4730 -t 10800 --envsize 4 --ucinstance be675ea1-6cf3-47c2-ba11-0410aed2c7fe --net-iso multi-nic -- ./toci_quickstart.sh
2018-10-16 18:34:06.324946 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:164): sleep 1200
2018-10-16 18:36:41.743516 | primary | 2018-10-16 18:36:41,742 - testenv-client - INFO - Received job : Couldn't retrieve env
2018-10-16 18:36:41.743694 | primary | 2018-10-16 18:36:41,743 - testenv-client - WARNING - 155.4 seconds waiting for a worker.
2018-10-16 18:36:41.743795 | primary | 2018-10-16 18:36:41,743 - testenv-client - ERROR - Couldn't retrieve env

https://logs.rdoproject.org/60/608460/3/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/a091061/job-output.txt.gz
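
When the testenv-client output does reach job-output.txt.gz, as in the excerpt above, the failure reason and the time spent waiting for a worker can be pulled out mechanically. The sketch below assumes the line layout shown in that excerpt (Zuul timestamp, node name, then the testenv-client log record); the regular expressions and the function name `summarize_testenv_client` are illustrative and not part of tripleo-ci.

```python
import re

# Assumed layout: "<zuul timestamp> | <node> | <client timestamp> - testenv-client - <LEVEL> - <message>"
LINE_RE = re.compile(
    r"^\s*(?P<zuul_ts>\S+ \S+) \| (?P<node>\S+) \| "
    r"(?P<client_ts>\S+ \S+) - testenv-client - (?P<level>[A-Z]+) - (?P<msg>.*)$"
)
WAIT_RE = re.compile(r"(?P<seconds>\d+(?:\.\d+)?) seconds waiting for a worker")


def summarize_testenv_client(lines):
    """Return (error_messages, wait_seconds) from testenv-client log lines.

    wait_seconds is None if no "seconds waiting for a worker" line is found.
    """
    errors = []
    wait_seconds = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # shell trace lines and other output are ignored
        if match.group("level") == "ERROR":
            errors.append(match.group("msg"))
        wait = WAIT_RE.search(match.group("msg"))
        if wait:
            wait_seconds = float(wait.group("seconds"))
    return errors, wait_seconds
```

Fed the three testenv-client lines from the excerpt, this returns the single error "Couldn't retrieve env" and a wait of 155.4 seconds, which distinguishes a broker-side failure from the 1200-second watchdog kill in the original description.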

wes hayutin (weshayutin) wrote :

2018-10-18 13:23:22.465147 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:167): NETISO_ENV=multi-nic
2018-10-18 13:23:22.465519 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:170): ./testenv-client -b 192.168.100.250:4730 -t 10800 --envsize 4 --ucinstance 48dcbd12-fdd8-453a-8610-f4ee8473ef69 --net-iso multi-nic -- ./toci_quickstart.sh

https://logs.rdoproject.org/58/611558/3/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/35dbba6/job-output.txt.gz#_2018-10-18_13_23_22_465147

Martin Schuppert (mschuppert) wrote :

2018-10-22 06:52:04.675579 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:168): NETISO_ENV=multi-nic
2018-10-22 06:52:04.675938 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:171): ./testenv-client -b 192.168.100.250:4730 -t 10800 --envsize 4 --ucinstance b797b00d-b1f9-4784-b6fa-45ff2a14ed7d --net-iso multi-nic -- ./toci_quickstart.sh
2018-10-22 06:52:04.676082 | primary | +(/home/zuul/src/git.openstack.org/openstack/tripleo-ci/toci_gate_test.sh:165): sleep 1200
2018-10-22 06:54:45.560654 | primary | 2018-10-22 06:54:45,560 - testenv-client - INFO - Received job : Couldn't retrieve env
2018-10-22 06:54:45.560848 | primary | 2018-10-22 06:54:45,560 - testenv-client - WARNING - 160.8 seconds waiting for a worker.
2018-10-22 06:54:45.560965 | primary | 2018-10-22 06:54:45,560 - testenv-client - ERROR - Couldn't retrieve env

https://logs.rdoproject.org/17/611617/2/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/0060922/job-output.txt.gz#_2018-10-22_06_54_45_560654

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The most common errors in te-broker are:

ERROR (ConnectFailure): Unable to establish connection to https://phx2.cloud.rdoproject.org:13774/v2.1: HTTPSConnectionPool(host='phx2.cloud.rdoproject.org', port=13774): Max retries exceeded with url: /v2.1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f59525b2110>: Failed to establish a new connection: [Errno -2] Name or service not known',))

ERROR (ConnectFailure): Unable to establish connection to https://phx2.cloud.rdoproject.org:13774/v2.1: HTTPSConnectionPool(host='phx2.cloud.rdoproject.org', port=13774): Max retries exceeded with url: /v2.1 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f7756ea6190>: Failed to establish a new connection: [Errno 113] No route to host',))

te-broker uses 38.145.33.91 as its nameserver, which is in the tripleo-infra tenant (a different tenant from the one te-broker is in), so it seems that it fails to reach the DNS server to resolve hostnames.
When I check locally I don't see any problem with resolution, so these look like random network failures. The second error ("No route to host") also supports this.
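
To tell intermittent DNS failures apart from routing failures, one option is to repeat the connection attempt several times and count each failure type. The sketch below is an illustration only and is not what te-broker runs; `classify_connect_error` and `probe` are hypothetical names, and the mapping relies on standard Python socket behavior, where name resolution errors raise `socket.gaierror` and "No route to host" is an `OSError` with `errno.EHOSTUNREACH` (113 on Linux).

```python
import errno
import socket


def classify_connect_error(exc):
    """Map a low-level socket exception to a coarse failure category."""
    # socket.gaierror subclasses OSError, so it must be checked first.
    if isinstance(exc, socket.gaierror):
        return "dns"      # e.g. "[Errno -2] Name or service not known"
    if isinstance(exc, OSError) and exc.errno in (
        errno.EHOSTUNREACH,
        errno.ENETUNREACH,
    ):
        return "routing"  # e.g. "[Errno 113] No route to host"
    return "other"


def probe(host, port, attempts=5, connect=socket.create_connection, timeout=5.0):
    """Attempt a TCP connection several times and count outcomes per category.

    `connect` is injectable so the function can be exercised without a network.
    """
    counts = {"ok": 0, "dns": 0, "routing": 0, "other": 0}
    for _ in range(attempts):
        try:
            conn = connect((host, port), timeout)
        except OSError as exc:
            counts[classify_connect_error(exc)] += 1
        else:
            conn.close()
            counts["ok"] += 1
    return counts
```

Running something like `probe("phx2.cloud.rdoproject.org", 13774)` from the te-broker host would show whether failures are all of one kind or a mix, which is what random network failures between tenants would be expected to produce.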

Changed in tripleo:
assignee: Marios Andreou (marios-b) → Sagi (Sergey) Shnaidman (sshnaidm)
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

logs from today

Rafael Folco (rafaelfolco) wrote :
Rafael Folco (rafaelfolco) wrote :

note: had to remove the false ha router from the other controller (infra)

Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The problem was solved yesterday in rdo-cloud; let's open a new issue if needed.

Changed in tripleo:
status: Triaged → Fix Released
tags: removed: alert