tripleo gate jobs timing out, duplicate container pulls a possible cause

Bug #1776796 reported by wes hayutin
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman
Milestone: rocky-rc1

Bug Description

On 6/13/2018 the tripleo gate had 14 gate jobs fail, which reset all the jobs in the gate and resulted in a 25 hour wait time. The normal acceptable range for gate jobs in tripleo is 5-7 hours.

Several of the failures were due to jobs timing out.
A possible root cause for the job timeouts is that containers are pulled during both the undercloud and the overcloud setup and deployment. Each job pulling containers twice could put load on the mirrors and the docker.io registry, causing network slowdowns.
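To make the idea concrete, here is a minimal, hypothetical sketch (not part of the actual tripleo-ci code) of how a job step could skip images that are already in the local Docker cache, so the second deployment phase does not hit the mirrors and docker.io a second time. The image names are placeholders for illustration.

    # Illustrative sketch only: skip pulling container images that are already
    # cached locally, to avoid the duplicate undercloud + overcloud pulls
    # described above. Image names below are hypothetical examples.
    import subprocess

    def image_present(image):
        """Return True if the image already exists in the local docker cache."""
        result = subprocess.run(
            ["docker", "images", "-q", image],
            capture_output=True, text=True, check=True,
        )
        return bool(result.stdout.strip())

    def pull_if_missing(images):
        for image in images:
            if image_present(image):
                print("skipping %s: already cached locally" % image)
            else:
                subprocess.run(["docker", "pull", image], check=True)

    if __name__ == "__main__":
        pull_if_missing([
            "docker.io/tripleomaster/centos-binary-nova-api:current-tripleo",
            "docker.io/tripleomaster/centos-binary-keystone:current-tripleo",
        ])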

Metrics:
https://review.rdoproject.org/grafana/dashboard/db/tripleo-ci?orgId=1&var-pipeline=All&var-branch=All&var-cloud=All&var-type=All&var-jobtype=All&from=1528335538536&to=1528940338536

http://38.145.34.131:3000/d/pgdr_WVmk/ruck-rover?orgId=1&from=1528681453608&to=1528929853000

After reviewing the situation with Emilien Macchi (a.k.a. Vanilla) and Alex Schultz, a decision was made to revert the recent changes that enabled more jobs to use the containerized undercloud setup [1].

Additionally, Steve Baker has been working on a blueprint to improve the containerized workflow and performance. By reverting [1] and pushing forward on [2], we hope to improve the performance of the job workflow and avoid future timeouts in check and gate. Once performance gains have been realized, we will re-enable [1] across most of the upstream master jobs.

[1] https://review.openstack.org/#/c/575264/
[2] https://review.openstack.org/#/q/topic:bp/container-prepare-workflow+(status:open+OR+status:merged)

Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Revision history for this message
Matt Young (halcyondude) wrote :

(triage)

This is more of a tracking issue and/or documentation of the current state of jobs/work.

A concrete action for the tripleo-ci squad is to raise this issue at the next #tripleo meeting to determine next steps.

Revision history for this message
Quique Llorente (quiquell) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raising this to critical as it blocks the containerized undercloud feature for Rocky; see the CI jobs switching attempt https://review.openstack.org/#/c/575330/ which has been suffering timeouts constantly.

Changed in tripleo:
importance: High → Critical
tags: added: containers workflows
Revision history for this message
Quique Llorente (quiquell) wrote :

Looking at cloud providers, limestone is the one that fails with a timeout most of the time.
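As an illustration of the kind of tally behind this observation, here is a minimal sketch that counts timed-out results per cloud provider. The input format (a JSON list of job results with "cloud" and "result" fields) is hypothetical, not the actual Zuul/RDO data model.

    # Illustrative sketch: count TIMED_OUT job results per cloud provider from a
    # hypothetical job_results.json file.
    import json
    from collections import Counter

    def timeouts_per_cloud(path):
        with open(path) as handle:
            jobs = json.load(handle)
        return Counter(
            job["cloud"] for job in jobs if job.get("result") == "TIMED_OUT"
        )

    if __name__ == "__main__":
        for cloud, count in timeouts_per_cloud("job_results.json").most_common():
            print("%s: %d timeouts" % (cloud, count))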

Revision history for this message
Quique Llorente (quiquell) wrote :

After fixing a bug related to unneeded repos in the containers (https://bugs.launchpad.net/tripleo/+bug/1779642), job times have come down slightly.
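For reference, a quick, hypothetical way to check which yum repos remain enabled inside a container image after such a fix; the image name below is a placeholder.

    # Illustrative sketch: list the yum repos enabled inside a container image,
    # to confirm no unneeded repos remain. The image name is a placeholder.
    import subprocess

    def enabled_repos(image):
        """Run `yum repolist enabled` inside the given image and return its output."""
        result = subprocess.run(
            ["docker", "run", "--rm", image, "yum", "repolist", "enabled"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(enabled_repos("docker.io/tripleomaster/centos-binary-keystone:current-tripleo"))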

Revision history for this message
chandan kumar (chkumar246) wrote :

On 5th July 2018 we have more timeouts (a sketch for locating the failing step in each deploy log follows this list):
- tripleo-ci-centos-7-scenario002-multinode-oooq-container - timed out at overcloud deploy
   * [Write the config_step hieradata] step timed out - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/2c3d93a/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-05_02_51_10

- tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates - timed out at overcloud deploy
   * 2018-07-05 02:51:29 | 2018-07-05 02:51:23Z [overcloud.SshKnownHostsConfig]: CREATE_COMPLETE state changed - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/04f7b4a/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-05_02_51_29

- tripleo-ci-centos-7-scenario003-multinode-oooq-container - timed out at 2018-07-05 02:50:13.014946 | primary | TASK [overcloud-deploy : Run post-deploy script]
  * 2018-07-05 02:50:43 | + openstack role add --project admin --user admin heat_stack_owner - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/6195e6e/logs/undercloud/home/zuul/overcloud_deploy_post.log.txt.gz#_2018-07-05_02_50_43
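A minimal, illustrative sketch of how the "timed out at ..." lines above can be pulled out of a downloaded overcloud_deploy.log: print the last few timestamped lines so the step where the deploy stalled is visible. The timestamp pattern matches the excerpts quoted above; the file path is whichever log was fetched from the logs.openstack.org links.

    # Illustrative sketch: show the last few timestamped lines of a deploy log
    # to see where the run stalled before the job timeout.
    import re
    import sys

    TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \|")

    def last_timestamped_lines(path, count=5):
        with open(path, errors="replace") as handle:
            lines = [line.rstrip() for line in handle if TIMESTAMP.match(line)]
        return lines[-count:]

    if __name__ == "__main__":
        for line in last_timestamped_lines(sys.argv[1]):
            print(line)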

Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
assignee: Quique Llorente (quiquell) → Sagi (Sergey) Shnaidman (sshnaidm)
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The problem was in the docker image caching in infra, so this problem has gone away.
@Wes, do we still want to keep it open?
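For context, a minimal, hypothetical sketch of how such an infra-side cache can be wired up on a node: Docker's daemon configuration supports a "registry-mirrors" option that points pulls at a pull-through cache. The mirror URL below is a placeholder, not the actual OpenStack infra mirror.

    # Illustrative sketch: point the local docker daemon at a registry mirror
    # (pull-through cache). The mirror URL is a placeholder.
    import json
    from pathlib import Path

    DAEMON_JSON = Path("/etc/docker/daemon.json")

    def configure_mirror(mirror_url):
        config = {}
        if DAEMON_JSON.exists():
            config = json.loads(DAEMON_JSON.read_text() or "{}")
        config["registry-mirrors"] = [mirror_url]
        DAEMON_JSON.write_text(json.dumps(config, indent=2) + "\n")

    if __name__ == "__main__":
        configure_mirror("http://mirror.example.org:8082/registry-1.docker")
        # The docker daemon must be restarted afterwards, e.g.:
        #   systemctl restart docker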

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Brad P. Crochet (brad-9)
tags: removed: workflows
Changed in tripleo:
status: Triaged → Fix Released