Timeout passed to overcloud deploy not effective

Bug #1675174 reported by Ben Nemec on 2017-03-22
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Medium
Unassigned

Bug Description

As seen in http://logs.openstack.org/46/444746/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/186b755/console.html

Deployment started at approximately 16:55, the entire job timed out at ~18:40, 105 minutes later by my math. The deploy process never timed out the way it should have so we get no logs to debug the hang.

Here's a snippet of the log in case it expires:

2017-03-22 16:54:09.780248 | Started Mistral Workflow tripleo.deployment.v1.deploy_plan. Execution ID: f801016d-2d9e-4451-b409-1dc19f31ad7b
2017-03-22 16:54:49.497060 | 2017-03-22 16:54:37.662874 [overcloud]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 16:54:49.497493 | 2017-03-22 16:54:37.753079 [overcloud.MysqlRootPassword]: CREATE_IN_PROGRESS state changed

...

2017-03-22 16:55:52.409300 | 2017-03-22 16:55:50.186774 [overcloud.ControllerServiceChain.ServiceChain.83]: CREATE_IN_PROGRESS state changed
2017-03-22 16:55:52.532207 | 2017-03-22 16:55:50.178450 [overcloud.ControllerServiceChain.ServiceChain.44]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 16:55:52.536552 | 2017-03-22 16:55:50.258160 [overcloud.ControllerServiceChain.ServiceChain.3.CephBase]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 18:39:59.020136 | /home/jenkins/workspace/gate-tripleo-ci-centos-7-ovb-updates/devstack-gate/functions.sh: line 1074: 15643 Killed timeout -s 9 ${REMAINING_TIME}m bash -c "source $WORKSPACE/devstack-gate/functions.sh && $cmd"

Tags: ci Edit Tag help
Ben Nemec (bnemec) wrote :

If heat-engine OOM'd, is it possible that the stack timeout would not take effect? I've heard reports that 8 GB underclouds are no longer sufficient in some developer environments, so it's possible we're hitting the same problem in ci now.

Please take a look at my old patch that can save logs in case of timeout: https://review.openstack.org/#/c/410470/

We have logs from job with timeout: http://logs.openstack.org/23/449023/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/25d26eb/

See the patch for quickstart jobs which helps to collect logs even if deployment is timeouted: https://review.openstack.org/#/c/449023/

Steven Hardy (shardy) wrote :

> If heat-engine OOM'd, is it possible that the stack timeout would not take effect?

In the case where it dies and is restarted the stack would fail when the engine starts after OOM killing, but if it dies and never starts again then that could explain this - the calls to the heat-api will just hang and eventually RPC timeout, and the engine is the thing that asserts status, e.g calculates the timeout and sets the stack status to FAILED.

Steven Hardy (shardy) wrote :

2017-03-22 16:55:52.355958 | 2017-03-22 16:55:49.839462 [overcloud.ControllerServiceChain.ServiceChain.3.CephBase]: CREATE_IN_PROGRESS state changed
2017-03-22 16:55:52.409300 | 2017-03-22 16:55:50.186774 [overcloud.ControllerServiceChain.ServiceChain.83]: CREATE_IN_PROGRESS state changed
2017-03-22 16:55:52.532207 | 2017-03-22 16:55:50.178450 [overcloud.ControllerServiceChain.ServiceChain.44]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 16:55:52.536552 | 2017-03-22 16:55:50.258160 [overcloud.ControllerServiceChain.ServiceChain.3.CephBase]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 18:39:59.020136 | /home/jenkins/workspace/gate-tripleo-ci-centos-7-ovb-updates/devstack-gate/functions.sh: line 1074: 15643 Killed timeout -s 9 ${REMAINING_TIME}m bash -c "source $WORKSPACE/devstack-gate/functions.sh && $cmd"
2017-03-22 18:39:59.020649 |

Looking at the timestamps, we can see there were no events collected from heat for a while before the timeout, which could support the theory heat-engine got killed, but I'd expect to see client errors from the RPC timeouts here?

Steven Hardy (shardy) wrote :

I did a quick sanity test to prove the heat timestamp is respected on recent heat master (3d4547f):

[stack@undercloud ~]$ heat stack-create timeout_test_3d4547f -f timeout-test.yaml -t 5
...
(undercloud) [stack@undercloud ~]$ openstack stack event list timeout_test_3d4547f --follow
2017-03-24 13:51:14Z [timeout_test_3d4547f]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-24 13:51:14Z [timeout_test_3d4547f.Test]: CREATE_IN_PROGRESS state changed
2017-03-24 13:56:14Z [timeout_test_3d4547f.Test]: CREATE_FAILED CREATE aborted
2017-03-24 13:56:14Z [timeout_test_3d4547f]: CREATE_FAILED Create timed out

One other thing we probably need to check is that the event handling on failure hasn't changed, as IIRC there have been some changes to the zaqar event handling quite recently in tripleo-common and tripleoclient.

tags: removed: alert
Changed in tripleo:
milestone: pike-1 → pike-2
Changed in tripleo:
milestone: pike-2 → pike-3
Ben Nemec (bnemec) wrote :

I don't know that we can call this fixed, but we have enough workarounds in ci now that this isn't blocking work from proceeding. Dropping importance accordingly.

Changed in tripleo:
importance: Critical → Medium
Changed in tripleo:
milestone: pike-3 → pike-rc1
Changed in tripleo:
milestone: pike-rc1 → queens-1
Changed in tripleo:
milestone: queens-1 → queens-2
Ben Nemec (bnemec) on 2017-10-09
Changed in tripleo:
status: Triaged → Incomplete
Changed in tripleo:
milestone: queens-2 → queens-3
Changed in tripleo:
milestone: queens-3 → queens-rc1
Changed in tripleo:
milestone: queens-rc1 → rocky-1
Changed in tripleo:
milestone: rocky-1 → rocky-2
Changed in tripleo:
milestone: rocky-2 → rocky-3
Changed in tripleo:
milestone: rocky-3 → rocky-rc1
Changed in tripleo:
milestone: rocky-rc1 → stein-1
Changed in tripleo:
milestone: stein-1 → stein-2
Changed in tripleo:
milestone: stein-2 → stein-3

Is this still an issue?

Changed in tripleo:
milestone: stein-3 → train-1
Changed in tripleo:
milestone: train-1 → train-2
Changed in tripleo:
milestone: train-2 → train-3
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers