tripleo gate jobs timing out; duplicate container pulls a possible cause
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Sagi (Sergey) Shnaidman |
Bug Description
On 6/13/2018, 14 jobs failed in the tripleo gate and reset all the gate jobs, resulting in a 25-hour wait time. The normally acceptable range for tripleo gate jobs is 5-7 hours.
Several of the failures were due to jobs timing out.
A possible root cause of the job timeouts was that containers were pulled during both the undercloud and the overcloud setup and deployment. Each job pulling containers twice could put extra load on the mirrors and the docker.io registry, causing network slowdowns.
http://
After reviewing the situation with Emilien Macchi (a.k.a. Vanilla) and Alex Schultz, a decision was made to revert recent changes that enabled more jobs to use the containerized undercloud setup [1].
Additionally, Steve Baker has been working on a blueprint to improve the containerized workflow and its performance. By reverting [1] and pushing forward on [2], we hope to improve the performance of the job workflow and avoid future timeouts in check and gate. Once the performance gains have been realized, we will re-enable [1] across most of the upstream master jobs.
[1] https:/
[2] https:/
Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
tags: removed: workflows
Changed in tripleo:
status: Triaged → Fix Released
(triage)
This is more of a tracking issue and/or documentation of the current state of jobs/work.
The concrete action for the tripleo-ci squad is to raise this issue in the next #tripleo meeting to determine next steps.