tripleo gate jobs timing out, duplicate container pulls a possible cause

Bug #1776796 reported by wes hayutin on 2018-06-14
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman
Milestone: rocky-rc1

Bug Description

On 6/13/2018 the tripleo gate had 14 gate jobs fail, resetting all the jobs in the gate and resulting in a 25 hour wait time. The normal acceptable wait time for gate jobs in tripleo is between 5 and 7 hours.

Several of the failures were due to jobs timing out.
A possible root cause for the job timeouts is that containers were pulled during both the undercloud and the overcloud setup and deployment. Each job pulling containers twice could create load on the mirrors and the docker.io registry, causing network slowdowns.
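
As a minimal illustration only (not the TripleO CI implementation), the sketch below shows one way to avoid paying for the same pull twice: check whether an image is already present in the local docker store before pulling it. The image names and tags are placeholders; it assumes the docker CLI is available.

    #!/usr/bin/env python3
    # Illustrative sketch: skip pulls for images that are already cached locally.
    # The image list below is hypothetical, not the actual TripleO container set.
    import subprocess

    IMAGES = [
        "docker.io/tripleomaster/centos-binary-nova-api:current-tripleo",
        "docker.io/tripleomaster/centos-binary-neutron-server:current-tripleo",
    ]

    def image_cached(image: str) -> bool:
        """Return True if the image already exists in the local docker store."""
        result = subprocess.run(
            ["docker", "image", "inspect", image],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0

    def pull_once(image: str) -> None:
        """Pull the image only when it is not already cached locally."""
        if image_cached(image):
            print(f"skip {image}: already present")
            return
        subprocess.run(["docker", "pull", image], check=True)

    if __name__ == "__main__":
        for image in IMAGES:
            pull_once(image)

With something like this, the second (overcloud) phase of a job would reuse the images fetched for the undercloud phase instead of hitting the mirrors and docker.io again.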

Metrics:
https://review.rdoproject.org/grafana/dashboard/db/tripleo-ci?orgId=1&var-pipeline=All&var-branch=All&var-cloud=All&var-type=All&var-jobtype=All&from=1528335538536&to=1528940338536

http://38.145.34.131:3000/d/pgdr_WVmk/ruck-rover?orgId=1&from=1528681453608&to=1528929853000

After reviewing the situation with Emilien Macchi (a.k.a. Vanilla) and Alex Schultz, a decision was made to revert the recent changes that enabled more jobs to use the containerized undercloud setup [1].

Additionally, Steve Baker has been working on a blueprint to improve the containerized workflow and its performance. By reverting [1] and pushing forward on [2], we hope to improve the performance of the job workflow and avoid future timeouts in check and gate. Once performance gains have been realized, we will re-enable [1] across most of the upstream master jobs.

[1] https://review.openstack.org/#/c/575264/
[2] https://review.openstack.org/#/q/topic:bp/container-prepare-workflow+(status:open+OR+status:merged)

Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Matt Young (halcyondude) wrote:

(triage)

This is more of a tracking issue and/or documentation of the current state of the jobs and work.

The concrete action for the tripleo-ci squad is to raise this issue in the next #tripleo meeting to determine next steps.

Bogdan Dobrelya (bogdando) wrote:

Raising this to critical as it blocks the containerized undercloud feature for Rocky; see the CI job switching attempt https://review.openstack.org/#/c/575330/, which has been suffering timeouts constantly.

Changed in tripleo:
importance: High → Critical
tags: added: containers workflows
Quique Llorente (quiquell) wrote:

Looking at the cloud providers, limestone is the one that fails with a timeout most of the time.

Quique Llorente (quiquell) wrote:

Fixing a bug related to unneeded repos in the containers (https://bugs.launchpad.net/tripleo/+bug/1779642) reduced the run times somewhat.

chandan kumar (chkumar246) wrote:

On 5th July 2018, we had more timeouts:
- tripleo-ci-centos-7-scenario002-multinode-oooq-container - timed out at overcloud deploy
   * timed out at the [Write the config_step hieradata] step - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/2c3d93a/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-05_02_51_10

- tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates - timed out at overcloud deploy
   * 2018-07-05 02:51:29 | 2018-07-05 02:51:23Z [overcloud.SshKnownHostsConfig]: CREATE_COMPLETE state changed - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/04f7b4a/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-05_02_51_29

- tripleo-ci-centos-7-scenario003-multinode-oooq-container - timed out at 2018-07-05 02:50:13.014946 | primary | TASK [overcloud-deploy : Run post-deploy script]
  * 2018-07-05 02:50:43 | + openstack role add --project admin --user admin heat_stack_owner - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/6195e6e/logs/undercloud/home/zuul/overcloud_deploy_post.log.txt.gz#_2018-07-05_02_50_43

wes hayutin (weshayutin) wrote:
Changed in tripleo:
assignee: Quique Llorente (quiquell) → Sagi (Sergey) Shnaidman (sshnaidm)

The problem was in the caching of docker images in infra, so this problem has gone away.
@Wes, do we still want to keep it open?

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Brad P. Crochet (brad-9) on 2018-07-30
tags: removed: workflows
Changed in tripleo:
status: Triaged → Fix Released