tripleo

Timeout passed to overcloud deploy not effective

Bug #1675174 reported by Ben Nemec on 2017-03-22

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Medium	Unassigned	tripleo ussuri-1

Bug Description

As seen in http://logs.openstack.org/46/444746/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-updates/186b755/console.html

Deployment started at approximately 16:55, the entire job timed out at ~18:40, 105 minutes later by my math. The deploy process never timed out the way it should have so we get no logs to debug the hang.

Here's a snippet of the log in case it expires:

2017-03-22 16:54:09.780248 | Started Mistral Workflow tripleo.deployment.v1.deploy_plan. Execution ID: f801016d-2d9e-4451-b409-1dc19f31ad7b
2017-03-22 16:54:49.497060 | 2017-03-22 16:54:37.662874 [overcloud]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 16:54:49.497493 | 2017-03-22 16:54:37.753079 [overcloud.MysqlRootPassword]: CREATE_IN_PROGRESS state changed

...

2017-03-22 16:55:52.409300 | 2017-03-22 16:55:50.186774 [overcloud.ControllerServiceChain.ServiceChain.83]: CREATE_IN_PROGRESS state changed
2017-03-22 16:55:52.532207 | 2017-03-22 16:55:50.178450 [overcloud.ControllerServiceChain.ServiceChain.44]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 16:55:52.536552 | 2017-03-22 16:55:50.258160 [overcloud.ControllerServiceChain.ServiceChain.3.CephBase]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 18:39:59.020136 | /home/jenkins/workspace/gate-tripleo-ci-centos-7-ovb-updates/devstack-gate/functions.sh: line 1074: 15643 Killed timeout -s 9 ${REMAINING_TIME}m bash -c "source $WORKSPACE/devstack-gate/functions.sh && $cmd"

Tags:

Revision history for this message

Ben Nemec (bnemec) wrote on 2017-03-22:

If heat-engine OOM'd, is it possible that the stack timeout would not take effect? I've heard reports that 8 GB underclouds are no longer sufficient in some developer environments, so it's possible we're hitting the same problem in ci now.

Revision history for this message

Sagi (Sergey) Shnaidman (sshnaidm) wrote on 2017-03-23:

Please take a look at my old patch that can save logs in case of timeout: https://review.openstack.org/#/c/410470/

Revision history for this message

Sagi (Sergey) Shnaidman (sshnaidm) wrote on 2017-03-23:

We have logs from job with timeout: http://logs.openstack.org/23/449023/2/check/gate-tripleo-ci-centos-7-nonha-multinode-oooq/25d26eb/

See the patch for quickstart jobs which helps to collect logs even if deployment is timeouted: https://review.openstack.org/#/c/449023/

Revision history for this message

Steven Hardy (shardy) wrote on 2017-03-24:

> If heat-engine OOM'd, is it possible that the stack timeout would not take effect?

In the case where it dies and is restarted the stack would fail when the engine starts after OOM killing, but if it dies and never starts again then that could explain this - the calls to the heat-api will just hang and eventually RPC timeout, and the engine is the thing that asserts status, e.g calculates the timeout and sets the stack status to FAILED.

Revision history for this message

Steven Hardy (shardy) wrote on 2017-03-24:

2017-03-22 16:55:52.355958 | 2017-03-22 16:55:49.839462 [overcloud.ControllerServiceChain.ServiceChain.3.CephBase]: CREATE_IN_PROGRESS state changed
2017-03-22 16:55:52.409300 | 2017-03-22 16:55:50.186774 [overcloud.ControllerServiceChain.ServiceChain.83]: CREATE_IN_PROGRESS state changed
2017-03-22 16:55:52.532207 | 2017-03-22 16:55:50.178450 [overcloud.ControllerServiceChain.ServiceChain.44]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 16:55:52.536552 | 2017-03-22 16:55:50.258160 [overcloud.ControllerServiceChain.ServiceChain.3.CephBase]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-22 18:39:59.020136 | /home/jenkins/workspace/gate-tripleo-ci-centos-7-ovb-updates/devstack-gate/functions.sh: line 1074: 15643 Killed timeout -s 9 ${REMAINING_TIME}m bash -c "source $WORKSPACE/devstack-gate/functions.sh && $cmd"
2017-03-22 18:39:59.020649 |

Looking at the timestamps, we can see there were no events collected from heat for a while before the timeout, which could support the theory heat-engine got killed, but I'd expect to see client errors from the RPC timeouts here?

Revision history for this message

Steven Hardy (shardy) wrote on 2017-03-24:

I did a quick sanity test to prove the heat timestamp is respected on recent heat master (3d4547f):

[stack@undercloud ~]$ heat stack-create timeout_test_3d4547f -f timeout-test.yaml -t 5
...
(undercloud) [stack@undercloud ~]$ openstack stack event list timeout_test_3d4547f --follow
2017-03-24 13:51:14Z [timeout_test_3d4547f]: CREATE_IN_PROGRESS Stack CREATE started
2017-03-24 13:51:14Z [timeout_test_3d4547f.Test]: CREATE_IN_PROGRESS state changed
2017-03-24 13:56:14Z [timeout_test_3d4547f.Test]: CREATE_FAILED CREATE aborted
2017-03-24 13:56:14Z [timeout_test_3d4547f]: CREATE_FAILED Create timed out

One other thing we probably need to check is that the event handling on failure hasn't changed, as IIRC there have been some changes to the zaqar event handling quite recently in tripleo-common and tripleoclient.

Emilien Macchi (emilienm) on 2017-03-25

tags:

removed: alert

Emilien Macchi (emilienm) on 2017-04-11

Changed in tripleo:
milestone:	pike-1 → pike-2

Emilien Macchi (emilienm) on 2017-06-08

Changed in tripleo:
milestone:	pike-2 → pike-3

Revision history for this message

Ben Nemec (bnemec) wrote on 2017-06-13:

I don't know that we can call this fixed, but we have enough workarounds in ci now that this isn't blocking work from proceeding. Dropping importance accordingly.

Changed in tripleo:
importance:	Critical → Medium

Emilien Macchi (emilienm) on 2017-07-30

Changed in tripleo:
milestone:	pike-3 → pike-rc1

Emilien Macchi (emilienm) on 2017-07-31

Changed in tripleo:
milestone:	pike-rc1 → queens-1

Alex Schultz (alex-schultz) on 2017-10-04

Changed in tripleo:
milestone:	queens-1 → queens-2

Ben Nemec (bnemec) on 2017-10-09

Changed in tripleo:
status:	Triaged → Incomplete

Alex Schultz (alex-schultz) on 2017-11-02

Changed in tripleo:
milestone:	queens-2 → queens-3

Emilien Macchi (emilienm) on 2018-01-26

Changed in tripleo:
milestone:	queens-3 → queens-rc1

Alex Schultz (alex-schultz) on 2018-02-20

Changed in tripleo:
milestone:	queens-rc1 → rocky-1

Alex Schultz (alex-schultz) on 2018-04-20

Changed in tripleo:
milestone:	rocky-1 → rocky-2

Emilien Macchi (emilienm) on 2018-06-05

Changed in tripleo:
milestone:	rocky-2 → rocky-3

Emilien Macchi (emilienm) on 2018-07-26

Changed in tripleo:
milestone:	rocky-3 → rocky-rc1

Emilien Macchi (emilienm) on 2018-07-26

Changed in tripleo:
milestone:	rocky-rc1 → stein-1

Juan Antonio Osorio Robles (juan-osorio-robles) on 2018-10-30

Changed in tripleo:
milestone:	stein-1 → stein-2

Emilien Macchi (emilienm) on 2019-01-13

Changed in tripleo:
milestone:	stein-2 → stein-3

Revision history for this message

Juan Antonio Osorio Robles (juan-osorio-robles) wrote on 2019-01-22:

Is this still an issue?

Alex Schultz (alex-schultz) on 2019-03-13

Changed in tripleo:
milestone:	stein-3 → train-1

Alex Schultz (alex-schultz) on 2019-06-07

Changed in tripleo:
milestone:	train-1 → train-2

Alex Schultz (alex-schultz) on 2019-07-29

Changed in tripleo:
milestone:	train-2 → train-3

Alex Schultz (alex-schultz) on 2019-09-11

Changed in tripleo:
milestone:	train-3 → ussuri-1

Sagi (Sergey) Shnaidman (sshnaidm) on 2019-10-28

Changed in tripleo:
status:	Incomplete → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.