tripleo gate jobs timing out, duplicate container pulls a possible cause

Bug #1776796 reported by wes hayutin
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Sagi (Sergey) Shnaidman
Milestone: rocky-rc1

Bug Description

On 6/13/2018 the tripleo gate had 14 gate jobs fail, which reset all the jobs in the gate and resulted in a 25 hour wait time. The normal acceptable range for gate jobs in tripleo is 5-7 hours.

Several of the failures were due to jobs timing out.
A possible root cause for the job timeouts is that containers are pulled during both the undercloud and the overcloud setup and deployment. Each job pulling containers twice could put load on the mirrors and the docker.io registry, causing network slowdowns.
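To make the idea concrete, here is a minimal, hypothetical sketch (not part of the actual tripleo-ci code) of how a job step could skip images that are already in the local Docker cache, so the second deployment phase does not hit the mirrors and docker.io a second time. The image names are placeholders for illustration.

    # Illustrative sketch only: skip pulling container images that are already
    # cached locally, to avoid the duplicate undercloud + overcloud pulls
    # described above. Image names below are hypothetical examples.
    import subprocess

    def image_present(image):
        """Return True if the image already exists in the local docker cache."""
        result = subprocess.run(
            ["docker", "images", "-q", image],
            capture_output=True, text=True, check=True,
        )
        return bool(result.stdout.strip())

    def pull_if_missing(images):
        for image in images:
            if image_present(image):
                print("skipping %s: already cached locally" % image)
            else:
                subprocess.run(["docker", "pull", image], check=True)

    if __name__ == "__main__":
        pull_if_missing([
            "docker.io/tripleomaster/centos-binary-nova-api:current-tripleo",
            "docker.io/tripleomaster/centos-binary-keystone:current-tripleo",
        ])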

Metrics:
https://review.rdoproject.org/grafana/dashboard/db/tripleo-ci?orgId=1&var-pipeline=All&var-branch=All&var-cloud=All&var-type=All&var-jobtype=All&from=1528335538536&to=1528940338536

http://38.145.34.131:3000/d/pgdr_WVmk/ruck-rover?orgId=1&from=1528681453608&to=1528929853000

After reviewing the situation with Emilien Macchi (a.k.a. Vanilla) and Alex Schultz, a decision was made to revert the recent changes that enabled more jobs to use the containerized undercloud setup [1].

Additionally, Steve Baker has been working on a blueprint to improve the containerized workflow and performance. By reverting [1] and pushing forward on [2], we hope to improve the performance of the job workflow and avoid future timeouts in check and gate. Once performance gains have been realized, we will re-enable [1] across most of the upstream master jobs.

[1] https://review.openstack.org/#/c/575264/
[2] https://review.openstack.org/#/q/topic:bp/container-prepare-workflow+(status:open+OR+status:merged)

Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Revision history for this message
Matt Young (halcyondude) wrote :

(triage)

This is more of a tracking issue and/or documentation of the current state of jobs/work.

A concrete action for the tripleo-ci squad is to raise this issue at the next #tripleo meeting to determine next steps.

Revision history for this message
Quique Llorente (quiquell) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Raising this to critical as it blocks the containerized undercloud feature for Rocky; see the CI jobs switching attempt https://review.openstack.org/#/c/575330/ which has been suffering timeouts constantly.

Changed in tripleo:
importance: High → Critical
tags: added: containers workflows
Revision history for this message
Quique Llorente (quiquell) wrote :

Looking at cloud providers, limestone is the one that fails with a timeout most of the time.
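As an illustration of the kind of tally behind this observation, here is a minimal sketch that counts timed-out results per cloud provider. The input format (a JSON list of job results with "cloud" and "result" fields) is hypothetical, not the actual Zuul/RDO data model.

    # Illustrative sketch: count TIMED_OUT job results per cloud provider from a
    # hypothetical job_results.json file.
    import json
    from collections import Counter

    def timeouts_per_cloud(path):
        with open(path) as handle:
            jobs = json.load(handle)
        return Counter(
            job["cloud"] for job in jobs if job.get("result") == "TIMED_OUT"
        )

    if __name__ == "__main__":
        for cloud, count in timeouts_per_cloud("job_results.json").most_common():
            print("%s: %d timeouts" % (cloud, count))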

Revision history for this message
Quique Llorente (quiquell) wrote :

After fixing a bug related to unneeded repos in the containers (https://bugs.launchpad.net/tripleo/+bug/1779642), job times have come down slightly.
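For reference, a quick, hypothetical way to check which yum repos remain enabled inside a container image after such a fix; the image name below is a placeholder.

    # Illustrative sketch: list the yum repos enabled inside a container image,
    # to confirm no unneeded repos remain. The image name is a placeholder.
    import subprocess

    def enabled_repos(image):
        """Run `yum repolist enabled` inside the given image and return its output."""
        result = subprocess.run(
            ["docker", "run", "--rm", image, "yum", "repolist", "enabled"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(enabled_repos("docker.io/tripleomaster/centos-binary-keystone:current-tripleo"))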

Revision history for this message
chandan kumar (chkumar246) wrote :

On 5th July 2018 we have more timeouts (a sketch for locating the failing step in each deploy log follows this list):
- tripleo-ci-centos-7-scenario002-multinode-oooq-container - timed out at overcloud deploy
   * [Write the config_step hieradata] step timed out - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario002-multinode-oooq-container/2c3d93a/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-05_02_51_10

- tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates - timed out at overcloud deploy
   * 2018-07-05 02:51:29 | 2018-07-05 02:51:23Z [overcloud.SshKnownHostsConfig]: CREATE_COMPLETE state changed - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/04f7b4a/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz#_2018-07-05_02_51_29

- tripleo-ci-centos-7-scenario003-multinode-oooq-container - timed out at 2018-07-05 02:50:13.014946 | primary | TASK [overcloud-deploy : Run post-deploy script]
  * 2018-07-05 02:50:43 | + openstack role add --project admin --user admin heat_stack_owner - http://logs.openstack.org/45/560445/78/check/tripleo-ci-centos-7-scenario003-multinode-oooq-container/6195e6e/logs/undercloud/home/zuul/overcloud_deploy_post.log.txt.gz#_2018-07-05_02_50_43
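A minimal, illustrative sketch of how the "timed out at ..." lines above can be pulled out of a downloaded overcloud_deploy.log: print the last few timestamped lines so the step where the deploy stalled is visible. The timestamp pattern matches the excerpts quoted above; the file path is whichever log was fetched from the logs.openstack.org links.

    # Illustrative sketch: show the last few timestamped lines of a deploy log
    # to see where the run stalled before the job timeout.
    import re
    import sys

    TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \|")

    def last_timestamped_lines(path, count=5):
        with open(path, errors="replace") as handle:
            lines = [line.rstrip() for line in handle if TIMESTAMP.match(line)]
        return lines[-count:]

    if __name__ == "__main__":
        for line in last_timestamped_lines(sys.argv[1]):
            print(line)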

Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
assignee: Quique Llorente (quiquell) → Sagi (Sergey) Shnaidman (sshnaidm)
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

The problem was in the docker image caching in infra, so this problem has gone away.
@Wes, do we still want to keep it open?
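For context, a minimal, hypothetical sketch of how such an infra-side cache can be wired up on a node: Docker's daemon configuration supports a "registry-mirrors" option that points pulls at a pull-through cache. The mirror URL below is a placeholder, not the actual OpenStack infra mirror.

    # Illustrative sketch: point the local docker daemon at a registry mirror
    # (pull-through cache). The mirror URL is a placeholder.
    import json
    from pathlib import Path

    DAEMON_JSON = Path("/etc/docker/daemon.json")

    def configure_mirror(mirror_url):
        config = {}
        if DAEMON_JSON.exists():
            config = json.loads(DAEMON_JSON.read_text() or "{}")
        config["registry-mirrors"] = [mirror_url]
        DAEMON_JSON.write_text(json.dumps(config, indent=2) + "\n")

    if __name__ == "__main__":
        configure_mirror("http://mirror.example.org:8082/registry-1.docker")
        # The docker daemon must be restarted afterwards, e.g.:
        #   systemctl restart docker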

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Brad P. Crochet (brad-9)
tags: removed: workflows
Changed in tripleo:
status: Triaged → Fix Released