Update timeout too long in CI

Bug #1674770 reported by Ben Nemec on 2017-03-21
Affects: tripleo | Importance: Critical | Assigned to: Unassigned

Bug Description

Looking through the logs on http://logs.openstack.org/83/445883/3/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/05ee057/console.html I see that it started the deployment at about 10:25, then the entire job was killed at 12:18. The deploy timeout is set to 80 minutes, which means we should have errored out long before the job was killed.

Unfortunately, the job timeout means we get no logs for anything so it's going to be hard to debug this. It seems to be happening on a pretty regular basis right now though.

Logstash for general gate timeouts: http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%20*tripleo-ci*%20AND%20build_status%3A%20FAILURE%20AND%20message%3A%20%5C%22exit%20code%3A%20137%5C%22

Tags: ci
Changed in tripleo:
milestone: none → pike-1
tags: added: alert
Steven Hardy (shardy) wrote :

AFAICT the timeout is respected when set directly via heat:

(undercloud) [stack@undercloud ~]$ heat stack-create test -f hosts-config.yaml -e hosts_env.yaml -t 333

(undercloud) [stack@undercloud ~]$ heat stack-show test | grep timeout
WARNING (shell) "heat stack-show" is deprecated, please use "openstack stack show" instead
| timeout_mins | 333
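As the deprecation warning above notes, the same check can be done with the non-deprecated openstackclient commands (stack name, template files, and the 333-minute timeout are taken from the example above; this is a sketch, not output from the affected environment):

```shell
# Create a stack with an explicit timeout (in minutes), then confirm
# the timeout_mins field on the created stack.
openstack stack create test \
    -t hosts-config.yaml -e hosts_env.yaml --timeout 333
openstack stack show test -c timeout_mins
```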

And also via tripleoclient:

openstack overcloud deploy --templates --timeout 333

| timeout_mins | 333

Could this be an issue specific to CI, e.g. have we messed up the deploy arguments?

Michele Baldessari (michele) wrote :

So we definitely have -t 80 in the deploy command:
http://logs.openstack.org/83/445883/3/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/05ee057/console.html#_2017-03-21_10_24_39_793081

2017-03-21 10:24:39.793081 | tripleo.sh -- Deploy command arguments: --libvirt-type=qemu -t 80 -e /usr/share/openstack-tripleo-heat-templates/environments/debug.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml -e /opt/stack/new/tripleo-ci/test-environments/ipv6-network-templates/network-environment.yaml -e /opt/stack/new/tripleo-ci/test-environments/net-iso.yaml -e /opt/stack/new/tripleo-ci/test-environments/enable-tls-ipv6.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml -e /opt/stack/new/tripleo-ci/test-environments/inject-trust-anchor-hiera-ipv6.yaml --ceph-storage-scale 1 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /opt/stack/new/tripleo-ci/test-environments/worker-config.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml --templates --validation-warnings-fatal

For the update we call:
 /opt/stack/new/tripleo-ci/scripts/tripleo.sh --overcloud-update

And it seems to me we do have -t 80 there as well?
2017-03-21 09:54:41.637335 | +++(/opt/stack/new/tripleo-ci/deploy.env:31): OVERCLOUD_UPDATE_ARGS='-e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml
2017-03-21 09:54:41.637459 | --libvirt-type=qemu -t 80 -e /usr/share/openstack-tripleo-heat-templates/environments/debug.yaml
2017-03-21 09:54:41.637566 | -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml
2017-03-21 09:54:41.637675 | -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml
2017-03-21 09:54:41.637778 | -e /opt/stack/new/tripleo-ci/test-environments/ipv6-network-templates/network-environment.yaml
2017-03-21 09:54:41.637847 | -e /opt/stack/new/tripleo-ci/test-environments/net-iso.yaml
2017-03-21 09:54:41.637902 | -e /opt/stack/new/tripleo-ci/test-environments/enable-tls-ipv6.yaml
2017-03-21 09:54:41.637970 | -e /usr/share/openstack-tripleo-heat-templates/environments/tls-endpoints-public-ip.yaml
2017-03-21 09:54:41.638029 | -e /opt/stack/new/tripleo-ci/test-environments/inject-trust-anchor-hiera-ipv6.yaml
2017-03-21 09:54:41.638063 | --ceph-storage-scale 1
2017-03-21 09:54:41.638119 | -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml
2017-03-21 09:54:41.638228 | -e /opt/stack/new/tripleo-ci/test-environments/worker-config.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/low-memory-usage.yaml'

Thomas Herve (therve) wrote :

I think it's worth noting that the timeout value is per operation, not per job. So the 80 minutes apply separately to the create and to the update. Looking at your logs, the update starts at 11:23, so 80 minutes brings it to around 12:43. But the overall job timeout kicks in earlier, at 12:18.
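That timeline can be checked with a little shell arithmetic (the 11:23 update start and 12:18 job kill times are taken from the console log discussed above):

```shell
# Compare when Heat's per-operation timeout would fire against when the
# overall CI job is killed.
update_start=$(date -d "11:23" +%s)   # update phase begins
job_kill=$(date -d "12:18" +%s)       # overall job timeout fires
heat_timeout_min=80
heat_expiry=$((update_start + heat_timeout_min * 60))
# How long after the job is killed would Heat's timeout have fired?
echo "$(( (heat_expiry - job_kill) / 60 )) minutes too late"   # -> 25 minutes too late
```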

On successful jobs, the update seems to take about 30 minutes, so maybe the default should be lower than 80 minutes so that the per-operation timeout can fire before the job is killed?

Ben Nemec (bnemec) wrote :

Oh, crud. I totally missed that the create completed and moved on to the update in this job. I think you're right that we need a shorter timeout for update.

Looking at the graphite metrics for the update job it looks like the average is around 40 minutes when the cloud is heavily loaded. We'd probably need to go at least 45 to account for normal runtime variations.
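One minimal way to sketch that in the CI scripts, assuming OVERCLOUD_UPDATE_ARGS is assembled as in the deploy.env excerpt above (the 45-minute value follows the estimate here, and the sample arguments below are abbreviated placeholders, not the full real list):

```shell
# Hypothetical tweak: lower the per-operation timeout for the update
# step only, using bash pattern substitution on the args string.
OVERCLOUD_UPDATE_ARGS='--libvirt-type=qemu -t 80 -e low-memory-usage.yaml'
OVERCLOUD_UPDATE_ARGS="${OVERCLOUD_UPDATE_ARGS/-t 80/-t 45}"
echo "$OVERCLOUD_UPDATE_ARGS"   # -> --libvirt-type=qemu -t 45 -e low-memory-usage.yaml
```

This keeps the 80-minute timeout for the initial create while giving the update a tighter budget.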

summary: - Timeout passed to overcloud deploy not effective
+ Update timeout too long in CI
Ben Nemec (bnemec) wrote :

Okay, maybe the problem was that I linked the wrong log. I just checked another job and it did indeed fail on create after considerably longer than 80 minutes. Since these are probably separate issues I opened a new bug for that one: https://bugs.launchpad.net/tripleo/+bug/1675174

Ben Nemec (bnemec) wrote :
Changed in tripleo:
status: Triaged → Fix Released
tags: removed: alert