Upgrades jobs timing out regularly

Bug #1680259 reported by Ben Nemec on 2017-04-05
This bug affects 1 person
Affects: tripleo
Importance: Critical
Assigned to: Unassigned

Bug Description

http://logs.openstack.org/27/453127/1/check/gate-tripleo-ci-centos-7-scenario004-multinode-oooq/76a4600/logs/postci.txt.gz

http://logs.openstack.org/27/453127/1/check/gate-tripleo-ci-centos-7-scenario001-multinode-upgrades/caccdec/console.html

As seen in the two logs above, patches are failing on different steps due to timeouts. It's not obvious to me what is taking too long in those jobs, but in the meantime we're going to make them non-voting again on the ocata branch because they're contributing to blocking good patches there.

Christian Schwede (cschwede) wrote :

Same for https://review.openstack.org/#/c/456264/ - when it fails, it seems to fail at around 2:38. Shouldn't the timeout be 180 minutes, though? The tests passed sometimes, and when they passed the run time was always well below 2:38.

Ben Nemec (bnemec) wrote :

The reason the jobs time out a little early is to give us time to collect logs. If we let them run right up to the full gate timeout, the job gets killed with prejudice and there's no way for us to debug what happened. Per "Timeout set to 170 minutes with 10 minutes reserved for cleanup." from the logs, we only actually have 170 minutes for our part of the job, too.

The -15 in http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/toci_quickstart.sh#n68 is what controls the grace period, I believe. We might be able to drop that to 10, but much lower than that and I think we'd start running the risk of timing out during postci tasks. Collecting logs and debug data can take a while in the more complex jobs.
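A minimal sketch of the arithmetic behind the grace period, assuming the 180-minute gate timeout mentioned earlier in the thread and the 10 minutes reserved for cleanup quoted from the logs. The names below are hypothetical and are not taken from toci_quickstart.sh.

```python
# Hypothetical constants, taken from the figures quoted in this thread.
GATE_TIMEOUT_MIN = 180      # overall gate timeout
CLEANUP_RESERVED_MIN = 10   # "10 minutes reserved for cleanup" in the job log


def job_budget_minutes(grace_min: int) -> int:
    """Minutes left for the job itself once cleanup and the
    log-collection grace period have been set aside."""
    return GATE_TIMEOUT_MIN - CLEANUP_RESERVED_MIN - grace_min


# Compare the current 15-minute grace period with the proposed 10 minutes.
for grace in (15, 10):
    print(f"grace={grace} min -> job budget={job_budget_minutes(grace)} min")
```

Dropping the grace period from 15 to 10 minutes would buy the job only five more minutes, at the cost of less time for postci log collection.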

Christian Schwede (cschwede) wrote :

Thanks, now that makes sense - the job needs to finish in 155 minutes then, and looking at the failures they always run a bit longer than that:

2h 40m 22s
2h 39m 03s
2h 39m 24s
2h 39m 48s
2h 39m 11s
2h 38m 36s
2h 40m 01s
2h 40m 02s
2h 40m 13s
2h 39m 22s

If the same test passes, it runs in less than 2:35. So how do we proceed with this? Decrease the buffer, or increase the overall timeout? In the long run it would be nice to decrease the required time, but that might be enough in the short term?
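As a rough check of the numbers above, the following sketch converts the listed failure durations to seconds and compares them with the 155-minute budget. It assumes, as the comment does, that the reported durations can be compared directly against that budget; the helper name is hypothetical.

```python
import re

# Failed run durations copied from the list above.
FAILED_RUNS = [
    "2h 40m 22s", "2h 39m 03s", "2h 39m 24s", "2h 39m 48s", "2h 39m 11s",
    "2h 38m 36s", "2h 40m 01s", "2h 40m 02s", "2h 40m 13s", "2h 39m 22s",
]

BUDGET_SECONDS = 155 * 60  # the 155-minute figure from the comment above


def to_seconds(duration: str) -> int:
    """Convert a string such as '2h 39m 03s' to a number of seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", duration)
    if match is None:
        raise ValueError(f"unrecognised duration: {duration!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# How far each failed run went past the budget, in seconds.
overruns = [to_seconds(run) - BUDGET_SECONDS for run in FAILED_RUNS]
print(f"shortest overrun: {min(overruns)} s, longest overrun: {max(overruns)} s")
```

Under that assumption, every listed failure exceeds the budget by only a few minutes, which fits the observation that the failing jobs always run "a bit longer".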

Christian Schwede (cschwede) wrote :

Proposed temporary fix for the upgrade jobs: https://review.openstack.org/#/c/458364/

Emilien Macchi (emilienm) wrote :

I'm working on making TripleO CI use AFS mirrors everywhere: https://review.openstack.org/458474

Changed in tripleo:
assignee: nobody → Emilien Macchi (emilienm)
tags: added: alert
Changed in tripleo:
status: Triaged → In Progress
Emilien Macchi (emilienm) wrote :

Let's see if it helped: https://review.openstack.org/458714

Emilien Macchi (emilienm) wrote :

Something interesting I just found:

Jobs that time out run on the Internap or RAX clouds. Jobs that don't time out run on the OSIC cloud.

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → nobody
Alan Pevec (apevec) wrote :

OSIC cloud is crazy fast b/c it has SSDs on compute nodes

Emilien Macchi (emilienm) wrote :

I'll close the bug now, because I haven't seen many timeouts over the last few days. Feel free to reopen it if needed.

tags: removed: alert
Changed in tripleo:
status: In Progress → Fix Released