tripleo gate jobs timing out, duplicate container pulls a possible cause

Bug #1776796 reported by wes hayutin on 2018-06-14
This bug affects 2 people
Assigned to: Sagi (Sergey) Shnaidman

Bug Description

On 6/13/2018 the tripleo gate had 14 jobs fail, resetting all the gate jobs and resulting in a 25-hour wait time. The normal acceptable range for gate jobs in tripleo is 5-7 hours.

Several of the failures were due to jobs timing out.
A possible root cause for the job timeouts is that containers are pulled during both the undercloud and overcloud setup and deployment. Each job pulling containers twice could create load on the mirrors and registry, causing network slowdowns.
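One common mitigation for duplicate pulls of this kind (a sketch of the general technique, not what was actually deployed for these jobs; the registry name and port below are illustrative) is to run a local pull-through registry cache on each node, so the undercloud and overcloud phases fetch each image over the network only once:

```shell
# Hypothetical sketch: run a local pull-through cache backed by the upstream
# registry, so repeated pulls within one job are served from the local cache.
docker run -d --name registry-cache \
  -e REGISTRY_PROXY_REMOTEURL=https://registry-1.docker.io \
  -p 5000:5000 registry:2

# Point the Docker daemon at the cache (requires a daemon restart).
cat >/etc/docker/daemon.json <<'EOF'
{
  "registry-mirrors": ["http://localhost:5000"]
}
EOF
systemctl restart docker
```

With this in place, the second round of pulls during the overcloud deploy hits the warm local cache instead of the shared mirrors, which is the same class of fix as the infra-side Docker caching mentioned later in this report.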


After reviewing the situation with Emilien Macchi (a.k.a. Vanilla) and Alex Schultz, a decision was made to revert the recent changes that enabled more jobs to use the containerized undercloud setup [1].

Additionally, Steve Baker has been working on a blueprint to improve the containerized workflow and its performance. By reverting [1] and pushing forward on [2], we hope to improve the performance of the job workflow and avoid future timeouts in check and gate. Once performance gains have been realized, we will re-enable [1] across most of the upstream master jobs.


Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Matt Young (halcyondude) wrote :


This is more of a tracking issue and/or documentation of the current state of jobs/work.

A concrete action for the tripleo-ci squad is to raise this issue in the next #tripleo meeting to determine next steps.

Bogdan Dobrelya (bogdando) wrote :

Raising this to critical as it blocks the containerized undercloud feature for Rocky; see the CI jobs switching attempt, which has been suffering timeouts constantly.

Changed in tripleo:
importance: High → Critical
tags: added: containers workflows
Quique Llorente (quiquell) wrote :

Looking at cloud providers, limestone is the one that fails with timeouts most of the time.

Quique Llorente (quiquell) wrote :

After fixing a bug related to unneeded repos in the containers, the fix reduced the job times somewhat.

chandan kumar (chkumar246) wrote :

On 5th July 2018, we have more timeouts:
- tripleo-ci-centos-7-scenario002-multinode-oooq-container - timed out at overcloud deploy
   * [Write the config_step hieradata] step timed out -

- tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates - timed out at overcloud deploy
   * 2018-07-05 02:51:29 | 2018-07-05 02:51:23Z [overcloud.SshKnownHostsConfig]: CREATE_COMPLETE state changed -

- tripleo-ci-centos-7-scenario003-multinode-oooq-container - timed out at 2018-07-05 02:50:13.014946 | primary | TASK [overcloud-deploy : Run post-deploy script]
  * 2018-07-05 02:50:43 | + openstack role add --project admin --user admin heat_stack_owner -

wes hayutin (weshayutin) wrote :
Changed in tripleo:
assignee: Quique Llorente (quiquell) → Sagi (Sergey) Shnaidman (sshnaidm)

The problem was with Docker image caching in infra, and it has since gone away.
@Wes, do we still want to keep this open?

Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Brad P. Crochet (brad-9) on 2018-07-30
tags: removed: workflows
Changed in tripleo:
status: Triaged → Fix Released