tripleo

Upgrades jobs timing out regularly

Bug #1680259 reported by Ben Nemec on 2017-04-05

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Unassigned	tripleo pike-2 "pike-2"

Bug Description

http://logs.openstack.org/27/453127/1/check/gate-tripleo-ci-centos-7-scenario004-multinode-oooq/76a4600/logs/postci.txt.gz

http://logs.openstack.org/27/453127/1/check/gate-tripleo-ci-centos-7-scenario001-multinode-upgrades/caccdec/console.html

As seen in the two logs above, patches are failing on different steps due to timeouts. It's not obvious to me what is taking too long in those jobs, but in the meantime we're going to make them non-voting again on the ocata branch because they're contributing to blocking good patches there.

Tags:

Christian Schwede (cschwede) wrote on 2017-04-19:

#1

Same for https://review.openstack.org/#/c/456264/ - when it fails, it seems to fail at around 2:38. The timeout should be 180 minutes though? The tests pass sometimes, and when they pass the run is always well below 2:38.

Ben Nemec (bnemec) wrote on 2017-04-19:

#2

The reason the jobs time out a little early is to give us time to collect logs. If we let them run right up to the full gate timeout, the job gets killed with prejudice and there's no way for us to debug what happened. Per "Timeout set to 170 minutes with 10 minutes reserved for cleanup." from the logs, we only actually have 170 minutes for our part of the job too.

The -15 in http://git.openstack.org/cgit/openstack-infra/tripleo-ci/tree/toci_quickstart.sh#n68 is what controls the grace period, I believe. We might be able to drop that to 10, but if we go much lower than that I think we'd start running the risk of timing out during postci tasks. Collecting logs and debug data can take a while in the more complex jobs.
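
The timeout arithmetic described in this comment can be sketched roughly as follows. This is only an illustration: the function and constant names are hypothetical, and the 180, 10 and 15 minute values are taken from the comments in this bug, not read from toci_quickstart.sh itself.

```python
# Values quoted in this bug report (assumed, not read from the CI scripts).
GATE_TIMEOUT_MIN = 180     # full gate timeout mentioned in comment #1
CLEANUP_RESERVE_MIN = 10   # "10 minutes reserved for cleanup" from the job log
GRACE_MIN = 15             # the "-15" grace period kept for postci log collection


def job_budget_minutes(gate_timeout=GATE_TIMEOUT_MIN,
                       cleanup_reserve=CLEANUP_RESERVE_MIN,
                       grace=GRACE_MIN):
    """Minutes available to the job proper before log collection must start."""
    return gate_timeout - cleanup_reserve - grace


def format_hm(minutes):
    """Render a whole number of minutes as H:MM."""
    hours, mins = divmod(minutes, 60)
    return f"{hours}:{mins:02d}"


print(format_hm(job_budget_minutes()))          # 2:35 with the current grace period
print(format_hm(job_budget_minutes(grace=10)))  # 2:40 if the grace period dropped to 10
```

Under these assumptions, dropping the grace period from 15 to 10 minutes would give the job five more minutes, at the cost of less time for postci tasks.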

Christian Schwede (cschwede) wrote on 2017-04-19:

#3

Thanks, now that makes sense - the job needs to finish in 155 minutes then, and looking at the failures they always run a bit longer than that:

2h 40m 22s
2h 39m 03s
2h 39m 24s
2h 39m 48s
2h 39m 11s
2h 38m 36s
2h 40m 01s
2h 40m 02s
2h 40m 13s
2h 39m 22s

If the same test passes, it runs for less than 2:35. So how do we proceed with this? Decrease the buffer, or increase the overall timeout? In the long run it would be nice to decrease the required time, but would that be enough in the short term?
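
As a rough illustration of how far the failed runs listed above land past the 155 minute mark, here is a minimal sketch that parses those durations. The helper name `to_seconds` is hypothetical, and the bug does not state exactly what the listed durations measure (they may include log collection), so the numbers show the spread rather than the extra budget the job would need.

```python
import re

# Durations of the failed runs, copied from the list in this comment.
FAILED_RUNS = [
    "2h 40m 22s", "2h 39m 03s", "2h 39m 24s", "2h 39m 48s", "2h 39m 11s",
    "2h 38m 36s", "2h 40m 01s", "2h 40m 02s", "2h 40m 13s", "2h 39m 22s",
]

BUDGET_SECONDS = 155 * 60  # the 155 minute budget mentioned in this comment


def to_seconds(text):
    """Convert a duration such as '2h 39m 03s' to seconds."""
    match = re.fullmatch(r"(\d+)h (\d+)m (\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Seconds by which each failed run exceeds the 155 minute budget.
overshoots = [to_seconds(run) - BUDGET_SECONDS for run in FAILED_RUNS]
print(min(overshoots), max(overshoots))  # 216 322
```

All ten failures fall in a narrow window of roughly three and a half to five and a half minutes past 2:35.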

Christian Schwede (cschwede) wrote on 2017-04-20:

#4

Proposed temporary fix for the upgrade jobs: https://review.openstack.org/#/c/458364/

Emilien Macchi (emilienm) wrote on 2017-04-20:

#5

I'm working on TripleO CI to use AFS mirrors everywhere: https://review.openstack.org/458474

Changed in tripleo:
assignee:	nobody → Emilien Macchi (emilienm)
tags:	added: alert

Emilien Macchi (emilienm) on 2017-04-20

Changed in tripleo:
status:	Triaged → In Progress

Emilien Macchi (emilienm) wrote on 2017-04-21:

#6

Let's see if it helped: https://review.openstack.org/458714

Emilien Macchi (emilienm) wrote on 2017-04-21:

#7

Something interesting I just found:

Jobs that time out run on the Internap or RAX clouds. Jobs that don't time out run on the OSIC cloud.

Emilien Macchi (emilienm) on 2017-04-21

Changed in tripleo:
assignee:	Emilien Macchi (emilienm) → nobody

Alan Pevec (apevec) wrote on 2017-04-24:

#8

The OSIC cloud is crazy fast because it has SSDs on its compute nodes.

Emilien Macchi (emilienm) wrote on 2017-04-28:

#9

I'll close the bug now, because I haven't seen many timeouts over the last few days. Feel free to re-open it if needed.

tags:	removed: alert
Changed in tripleo:
status:	In Progress → Fix Released
