tripleo-ci-centos-7-containers-multinode job is timing out more often than it passes & blocking gate

Bug #1806632 reported by Marios Andreou on 2018-12-04
This bug affects 2 people
Affects: tripleo
Importance: Critical
Assigned to: Marios Andreou

Bug Description

The tripleo-ci-centos-7-containers-multinode job is timing out in the gate, requiring multiple rechecks to get a successful run. This is a gate blocker since the job is voting.

An example of a timeout is at [1], but there are many more. Some more notes from [2]:

    17:02 tripleo-ci-centos-7-containers-multinode job is failing too often and requires multiple rechecks in order to pass the gate; look at the build history of this patch: https://review.openstack.org/#/c/621259/

    found: install-undercloud and deploy-overcloud each take ~1h 30m, so the build duration is very likely to exceed the 3h timeout.

    inside install-undercloud, the **tripleo-container-image-prepare** took 23m by itself: http://logs.openstack.org/59/621259/5/gate/tripleo-ci-centos-7-containers-multinode/cd70f4c/logs/undercloud/home/zuul/install-undercloud.log.txt.gz#_2018-12-03_09_48_22_508

    this is *not* caused by flaky infrastructure; this is caused by our jobs being too long.

    (mwhahaha) This is likely caused by container pulls running long. If you look at the history it can run in 2:16 total so it's not the code itself that's causing the excessive run length.

[1] http://logs.openstack.org/59/621259/5/gate/tripleo-ci-centos-7-containers-multinode/cd70f4c/
[2] https://review.rdoproject.org/etherpad/p/ruckrover-sprint23
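
As a rough illustration of the timing argument in the notes above, the sketch below adds up the two quoted step durations and compares them with the 3h timeout. It uses only the figures from the notes (~1h 30m per step, a 3h timeout, and a passing run of 2:16); the constant and function names are hypothetical and not part of any TripleO or Zuul tooling.

```python
# Timing budget sketch, using only the durations quoted in the notes.
# All names here are illustrative assumptions, not real CI settings.

JOB_TIMEOUT_MIN = 3 * 60         # the 3h job timeout
INSTALL_UNDERCLOUD_MIN = 90      # install-undercloud, ~1h 30m
DEPLOY_OVERCLOUD_MIN = 90        # deploy-overcloud, ~1h 30m
FAST_RUN_MIN = 2 * 60 + 16       # a passing run reported at 2:16 total


def headroom(timeout_min, *step_minutes):
    """Return the minutes left before the timeout once the given steps have run."""
    return timeout_min - sum(step_minutes)


# Slow run: the two big steps alone consume the whole budget.
slow_headroom = headroom(JOB_TIMEOUT_MIN, INSTALL_UNDERCLOUD_MIN, DEPLOY_OVERCLOUD_MIN)

# Fast run: the whole job fits with time to spare.
fast_headroom = headroom(JOB_TIMEOUT_MIN, FAST_RUN_MIN)

print(f"slow run headroom: {slow_headroom} min")
print(f"fast run headroom: {fast_headroom} min")
```

With these numbers the slow run has no headroom left for any other step, while the 2:16 run finishes with 44 minutes to spare, which is consistent with both observations in the notes.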

Changed in tripleo:
assignee: nobody → Marios Andreou (marios-b)
Changed in tripleo:
milestone: none → stein-2
tags: added: alert
Changed in tripleo:
importance: Undecided → Critical
tags: added: deployment-time
Alex Schultz (alex-schultz) wrote :

Not sure if this is accurate, as we haven't had a TIMEOUT in over a day: http://zuul.openstack.org/builds?pipeline=gate&result=post_failure&result=timed_out&result=failure&job_name=tripleo-ci-centos-7-containers-multinode

I'm wondering if this was related to all the scenario jobs we had on stable branches in the gate causing extra transit load.

Thomas Herve (therve) wrote :

I think this is definitely flaky infrastructure. If you look at the recheck that succeeded here: http://logs.openstack.org/59/621259/5/gate/tripleo-ci-centos-7-containers-multinode/96b58e9/ the same operations are just much faster. Creating the plan takes 6 minutes instead of 15. The deploy step playbook takes 30 mins instead of 50, etc.
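
To put rough numbers on the comparison above, this minimal sketch computes how much slower each quoted step was in the timed-out run than in the successful recheck. The durations are the ones given in the comment; the dictionary layout and the `slowdown` helper are hypothetical, for illustration only.

```python
# Minutes per step as quoted in the comment: (successful recheck, timed-out run).
# The structure and names are illustrative assumptions.
STEP_MINUTES = {
    "create plan": (6, 15),
    "deploy step playbook": (30, 50),
}


def slowdown(fast_min, slow_min):
    """Return how many times longer the slow run took for one step."""
    return slow_min / fast_min


ratios = {
    step: round(slowdown(fast, slow), 2)
    for step, (fast, slow) in STEP_MINUTES.items()
}

for step, ratio in ratios.items():
    print(f"{step}: {ratio}x slower")
```

The same operations running roughly 1.7x to 2.5x slower between two runs of the same patch is the kind of spread the comment attributes to the infrastructure rather than to the code.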

Emilien Macchi (emilienm) wrote :

Removing the alert then.

tags: removed: alert
Changed in tripleo:
milestone: stein-2 → stein-3
wes hayutin (weshayutin) wrote :

No longer seeing this job time out; I think it's safe to close.

Changed in tripleo:
status: Triaged → Fix Released