Master/stein check and promotion OVB jobs are randomly giving ansible time out while overcloud deploy

Bug #1840616 reported by chandan kumar on 2019-08-19
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Ronelle Landy

Bug Description

We are seeing random Ansible timeouts in the following jobs while deploying the overcloud.
In fs02:
https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/edd7af2/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-14_03_56_46

2019-08-14 04:11:40 | changed: [overcloud-controller-0 -> 192.168.24.30] => {
2019-08-14 04:11:40 | "changed": true,
2019-08-14 04:11:40 | "cmd": "/bin/rsync --delay-updates -F --compress --archive --out-format=<<CHANGED>>%i %n%L /opt/puppetlabs/ /var/lib/container-puppet/puppetlabs/",
2019-08-14 04:11:40 | "rc": 0
2019-08-14 04:11:40 | }
2019-08-14 04:11:40 |
2019-08-14 04:11:40 | MSG:
2019-08-14 04:11:40 |
2019-08-14 04:11:40 | .d..t...... ./
2019-08-14 04:11:40 | cd+++++++++ facter/
2019-08-14 04:11:40 | cd+++++++++ facter/cache/
2019-08-14 04:11:40 | cd+++++++++ facter/cache/cached_facts/
2019-08-14 04:11:40 | >f+++++++++ facter/cache/cached_facts/kernel
2019-08-14 04:11:40 | >f+++++++++ facter/cache/cached_facts/memory
2019-08-14 04:11:40 | >f+++++++++ facter/cache/cached_facts/networking
2019-08-14 04:11:40 | >f+++++++++ facter/cache/cached_facts/operating system
2019-08-14 04:11:40 | >f+++++++++ facter/cache/cached_facts/processor
2019-08-14 04:11:40 |
2019-08-14 04:11:40 |
2019-08-14 04:11:40 | TASK [Run container-puppet tasks (generate config) during step 1] **************
2019-08-14 04:11:40 | Wednesday 14 August 2019 03:55:36 +0000 (0:00:00.647) 0:44:07.399 ******
2019-08-14 04:11:40 | ok: [overcloud-novacompute-0] => {
2019-08-14 04:11:40 |
2019-08-14 04:11:40 | "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
2019-08-14 04:11:40 | "changed": false
2019-08-14 04:11:40 | }
2019-08-14 04:11:40 |
2019-08-14 04:11:40 | Ansible timed out at 3612 seconds.
2019-08-14 04:11:40 | + status_code=1
2019-08-14 04:11:40 | + openstack stack list

In fs035:
https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-master/532cba5/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-14_03_42_20

2019-08-14 03:57:14 | TASK [Run container-puppet tasks (generate config) during step 1] **************
2019-08-14 03:57:14 | Wednesday 14 August 2019 03:40:59 +0000 (0:00:00.893) 0:44:34.074 ******
2019-08-14 03:57:14 | ok: [overcloud-novacompute-0] => {
2019-08-14 03:57:14 |
2019-08-14 03:57:14 | "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
2019-08-14 03:57:14 | "changed": false
2019-08-14 03:57:14 | }
2019-08-14 03:57:14 | ok: [overcloud-controller-1] => {
2019-08-14 03:57:14 |
2019-08-14 03:57:14 | "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
2019-08-14 03:57:14 | "changed": false
2019-08-14 03:57:14 | }
2019-08-14 03:57:14 | ok: [overcloud-controller-2] => {
2019-08-14 03:57:14 | "censored": "the output has been hidden due to the fact that 'no_log: true' was specified for this result",
2019-08-14 03:57:14 | "changed": false
2019-08-14 03:57:14 | }
2019-08-14 03:57:14 |
2019-08-14 03:57:14 | Ansible timed out at 3652 seconds.

And in fs01, in a check job:

https://logs.rdoproject.org/64/676364/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/19bb5b1/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-15_17_07_02

We need to find out why the Ansible timeout is happening randomly.

summary: Master check and promotion jobs are giving ansible time out while
- overcloud deploy
+ overcloud deploy in fs01/02/35
yatin (yatinkarel) on 2019-08-19
summary: - Master check and promotion jobs are giving ansible time out while
- overcloud deploy in fs01/02/35
+ Master check and promotion OVB jobs are randomly giving ansible time out
+ while overcloud deploy

I noticed that network operations (for example, docker pull) are slow on an overcloud node.
Some images take more than 10 minutes to pull; ideally this should take a few seconds, since it's a local network.
On an affected node, every docker pull is slow.

It would be good to check whether this is master-only or affects all releases, to properly isolate the issue.

Example 1 (14 minutes 21 seconds): https://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-1ctlr_1comp-featureset002-master-upload/edd7af2/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-14_03_45_18
2019-08-14 03:45:18 | changed: [overcloud-controller-0] => {
2019-08-14 03:45:18 |
2019-08-14 03:45:18 | "changed": true,
2019-08-14 03:45:18 | "cmd": "docker pull 192.168.24.1:8787/tripleomaster/centos-binary-cinder-volume:dff0f7b2c1ce50d9929321f65f1ab290f2d042a9_e8dbf8af-updated-20190814013900",
2019-08-14 03:45:18 | "delta": "0:14:21.982103",
2019-08-14 03:45:18 | "end": "2019-08-14 03:39:27.913156",
2019-08-14 03:45:18 | "rc": 0,
2019-08-14 03:45:18 | "start": "2019-08-14 03:25:05.931053"
2019-08-14 03:45:18 | }

Example 2 (10 minutes 32 seconds): https://logs.rdoproject.org/64/676364/1/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001/19bb5b1/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-15_16_38_15

2019-08-15 16:38:15 | changed: [overcloud-controller-2] => {
2019-08-15 16:38:15 |
2019-08-15 16:38:15 | "changed": true,
2019-08-15 16:38:15 | "cmd": "docker pull 192.168.24.1:8787/tripleomaster/centos-binary-cinder-volume:6cd9e55d69e90ea73219abced4e9a3f2372b204e_10e135ca-updated-20190815144647",
2019-08-15 16:38:15 | "delta": "0:10:32.870128",
2019-08-15 16:38:15 | "end": "2019-08-15 16:37:34.438094",
2019-08-15 16:38:15 | "rc": 0,
2019-08-15 16:38:15 | "start": "2019-08-15 16:27:01.567966"
2019-08-15 16:38:15 | }
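To spot slow pulls like the two above without reading the whole log, a small script can pair each "cmd" field with the "delta" that follows it. This is only a sketch: the function names and the 5-minute threshold are assumptions made for illustration, and it relies on the log keeping the field layout shown in these excerpts.

```python
import re
from datetime import timedelta

# Field patterns as they appear in the Ansible result blocks quoted above.
# The log-timestamp prefix before "|" is ignored.
CMD_RE = re.compile(r'"cmd":\s*"(docker pull [^"]+)"')
DELTA_RE = re.compile(r'"delta":\s*"(\d+):(\d+):(\d+)(?:\.(\d+))?"')


def parse_delta(line):
    """Return the "delta" value on this line as a timedelta, or None."""
    match = DELTA_RE.search(line)
    if match is None:
        return None
    hours, minutes, seconds, fraction = match.groups()
    # Pad or trim the fractional part to microseconds.
    micros = int((fraction or "0").ljust(6, "0")[:6])
    return timedelta(hours=int(hours), minutes=int(minutes),
                     seconds=int(seconds), microseconds=micros)


def slow_pulls(lines, threshold=timedelta(minutes=5)):
    """Return (command, delta) pairs for docker pulls slower than threshold.

    The 5-minute default is a hypothetical cut-off, not a value from the jobs.
    """
    results = []
    current_cmd = None
    for line in lines:
        cmd_match = CMD_RE.search(line)
        if cmd_match:
            current_cmd = cmd_match.group(1)
            continue
        delta = parse_delta(line)
        if delta is not None and current_cmd is not None:
            if delta > threshold:
                results.append((current_cmd, delta))
            current_cmd = None
    return results
```

Feeding it the decompressed lines of overcloud_deploy.log.txt.gz would list only the pulls worth looking at, which makes it easier to compare affected and unaffected nodes.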

chandan kumar (chkumar246) wrote :

We are seeing the same on stein as well: https://logs.rdoproject.org/openstack-periodic-latest-released/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein/d2efe9c/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-18_12_16_18

2019-08-18 12:16:18 | Status: Downloaded newer image for 192.168.24.1:8787/tripleostein/centos-binary-cinder-volume:31a5bc909cd08ac5121137803138766659ec76a8_ea18ecf2-updated-20190818101815
2019-08-18 12:16:18 | changed: [overcloud-controller-1] => {
2019-08-18 12:16:18 |
2019-08-18 12:16:18 | "changed": true,
2019-08-18 12:16:18 | "cmd": "docker pull 192.168.24.1:8787/tripleostein/centos-binary-cinder-volume:31a5bc909cd08ac5121137803138766659ec76a8_ea18ecf2-updated-20190818101815",
2019-08-18 12:16:18 | "delta": "0:12:25.999018",
2019-08-18 12:16:18 | "end": "2019-08-18 12:11:43.181600",
2019-08-18 12:16:18 | "rc": 0,

It is taking a long time at just one place.

summary: - Master check and promotion OVB jobs are randomly giving ansible time out
- while overcloud deploy
+ Master/stein check and promotion OVB jobs are randomly giving ansible
+ time out while overcloud deploy
Ronelle Landy (rlandy) wrote :
Changed in tripleo:
assignee: nobody → Ronelle Landy (rlandy)
Ronelle Landy (rlandy) wrote :

tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001 openstack/python-tripleoclient master openstack-check 674011,4 11694 2019-08-19T15:56:58 SUCCESS
tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001 openstack/tripleo-ansible master openstack-check 677164,1 11127 2019-08-19T16:03:57 SUCCESS

and then some successes ...

Stein promotion results are mixed:

periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein openstack/tripleo-ci master openstack-periodic-latest-released master 14628 2019-08-19T10:12:00 FAILURE
periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein openstack/tripleo-ci master openstack-periodic-latest-released master 9339 2019-08-18T16:14:21 SUCCESS
periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein openstack/tripleo-ci master openstack-periodic-latest-released master 10741 2019-08-18T10:07:32 FAILURE
periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein openstack/tripleo-ci master openstack-periodic-latest-released master 12376 2019-08-17T22:13:22 SUCCESS
periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein openstack/tripleo-ci master openstack-periodic-latest-released master 10019 2019-08-17T16:12:04 SUCCESS
periodic-tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035-stein openstack/tripleo-ci master openstack-periodic-latest-released master 9315 2019-08-17T10:07:18 SUCCESS
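To put a number on "mixed", rows like the ones above can be tallied by result. The sketch below assumes the last whitespace-separated field is the result and the third-from-last is a numeric column; the thread does not label the columns, so treating that number as the job duration in seconds is an assumption, and the function name is made up for illustration.

```python
from collections import defaultdict


def summarize(lines):
    """Group job rows by result; return {result: (count, mean of numeric column)}.

    Assumes the layout "... <number> <timestamp> <result>" seen in the rows
    above. What <number> measures is not stated in the thread.
    """
    by_result = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # blank or unrelated line
        result, number = fields[-1], fields[-3]
        if number.isdigit():
            by_result[result].append(int(number))
    return {
        result: (len(values), sum(values) / len(values))
        for result, values in by_result.items()
    }
```

If the numeric column does turn out to be a duration, comparing the two means gives a rough idea of how much a higher deploy timeout would need to absorb.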

Could we increase the deploy timeout?

tags: added: alert ci
chandan kumar (chkumar246) wrote :

Increasing the overcloud deploy timeout might affect all the jobs.

Related fix proposed to branch: master
Review: https://review.opendev.org/678242

Ronelle Landy (rlandy) wrote :

 TASK [Sync cached facts] *******************************************************
takes a long time in the jobs that time out.

Examples:

http://logs.rdoproject.org/41/678141/3/openstack-check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset035/c558c7b/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

2019-08-23 15:38:22 | changed: [overcloud-controller-1 -> 192.168.24.21] => {
2019-08-23 15:38:22 | "changed": true,
2019-08-23 15:59:04 | "cmd": "/bin/rsync --delay-updates -F --compress --archive --out-format=<<CHANGED>>%i %n%L /opt/puppetlabs/Overcloud configuration failed.
2019-08-23 15:59:04 | /var/lib/container-puppet/puppetlabs/",
2019-08-23 15:59:04 | "rc": 0
2019-08-23 15:59:04 | }
2019-08-23 15:59:04 |
2019-08-23 15:59:04 | MSG:

https://logs.rdoproject.org/82/21882/4/check/tripleo-ci-centos-7-ovb-3ctlr_1comp-featureset001-master-vexxhost/9483159/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2019-08-23_02_24_55

2019-08-23 02:15:32 | skipping: [overcloud-controller-0] => {
2019-08-23 02:24:55 | Overcloud configuration failed.
2019-08-23 02:24:55 | "changed": false,
2019-08-23 02:24:55 | "skip_reason": "Conditional result was False"
2019-08-23 02:24:55 | }
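Both excerpts show a jump in the log-line timestamps (15:38:22 to 15:59:04, and 02:15:32 to 02:24:55). A sketch for finding such jumps automatically is below. It assumes every relevant line starts with a "YYYY-MM-DD HH:MM:SS" prefix that reflects when the line was written; in some of the earlier excerpts all lines carry the same timestamp, so this will not help there. The function names are hypothetical.

```python
from datetime import datetime

TS_FORMAT = "%Y-%m-%d %H:%M:%S"
TS_LENGTH = 19  # length of "2019-08-23 02:15:32"


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if there is none."""
    try:
        return datetime.strptime(line.strip()[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        return None


def largest_gaps(lines, top=3):
    """Return up to `top` (gap, line_before, line_after) tuples, largest first."""
    gaps = []
    previous = None  # (timestamp, line) of the last timestamped line seen
    for line in lines:
        timestamp = parse_timestamp(line)
        if timestamp is None:
            continue
        if previous is not None:
            gaps.append((timestamp - previous[0], previous[1], line))
        previous = (timestamp, line)
    gaps.sort(key=lambda gap: gap[0], reverse=True)
    return gaps[:top]
```

The line printed just before the biggest gap is usually the task that was still running, which is how the rsync in [Sync cached facts] stands out in the first excerpt.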

Maybe this change impacted it:
https://github.com/openstack/tripleo-heat-templates/commit/9581614e9353576c6a7b83136793631473851b31

Related fix proposed to branch: master
Review: https://review.opendev.org/678290

Ronelle Landy (rlandy) wrote :

fs001 is doing better. fs035 is still a disaster zone. https://review.opendev.org/678290 adds to fs035 the settings that fs001 already had.

Change abandoned by Chandan Kumar (raukadah) (<email address hidden>) on branch: master
Review: https://review.opendev.org/677458
Reason: not needed currently

Reviewed: https://review.opendev.org/678290
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=d0ffb87d4aa157de28a921f647d5887faa8e45b7
Submitter: Zuul
Branch: master

commit d0ffb87d4aa157de28a921f647d5887faa8e45b7
Author: Ronelle Landy <email address hidden>
Date: Fri Aug 23 15:19:00 2019 -0400

    Add settings from fs001 missing in fs035

    fs035 should match fs001 - except for the ipv6-related
    settings. This patch adds some settings recently added
    to fs001 - not yet added to fs035.

    Change-Id: Idd6a347a50c0ef50cd62e792acfef9401eb5d25a
    Related-Bug: #1840616
