We have some failures in the gate that manifest as:
Captured traceback:
~~~~~~~~~~~~~~~~~~~
Traceback (most recent call last):
  File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 362, in _stack_delete
    success_on_not_found=True)
  File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 352, in _wait_for_stack_status
    raise exceptions.TimeoutException(message)
heat_integrationtests.common.exceptions.TimeoutException: Request timed out
Details: Stack InstanceGroupBasicTest-1200481671 failed to reach DELETE_COMPLETE status within the required time (1200 s).
(from http://logs.openstack.org/36/371536/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2/cba18a9/)
This is because we are unable to acquire the lock for a nested stack, in this case the stack InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql.
In the logs we can see that there is a WaitCondition timeout:
2016-09-16 15:19:20.547 16424 INFO heat.engine.resources.openstack.heat.wait_condition [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] WaitCondition "waiter" Stack "InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql" [2b7aa97d-46d0-4293-a098-3239891da119] Timed out (0 of 1 received)
2016-09-16 15:19:20.547 16424 INFO heat.engine.resource [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] CREATE: WaitCondition "waiter" Stack "InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql" [2b7aa97d-46d0-4293-a098-3239891da119]
This causes the stack to fail:
2016-09-16 15:19:20.675 16424 INFO heat.engine.stack [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Stack CREATE FAILED (InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql): Resource CREATE failed: WaitConditionTimeout: resources.waiter: 0 of 1 received
And, later, we also attempt to cancel any in-progress update (likely because a sibling nested stack with a similar WaitCondition failed at the same time) by putting a cancel message in the message queues of any active threads:
2016-09-16 15:19:20.704 16426 INFO heat.engine.service [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Starting cancel of updating stack InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql
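To make that mechanism concrete, here is a minimal, self-contained sketch of the pattern (it is not Heat's actual ThreadGroupManager API; the class and queue names are assumptions for illustration only): each long-running stack operation owns a message queue and cooperatively checks it between steps, and a cancel is delivered by dropping a sentinel message onto that queue.

import queue
import threading
import time

CANCEL = object()  # sentinel cancel message (illustrative, not Heat's)

class StackOperation:
    """Hypothetical stand-in for a long-running stack update task."""

    def __init__(self):
        self.msg_queue = queue.Queue()

    def run(self):
        # Pretend to do work in small steps, checking for a cancel
        # message between steps, as a cooperative task would.
        for step in range(100):
            try:
                msg = self.msg_queue.get_nowait()
            except queue.Empty:
                msg = None
            if msg is CANCEL:
                print("operation cancelled at step", step)
                return
            time.sleep(0.01)
        print("operation finished")

op = StackOperation()
worker = threading.Thread(target=op.run)
worker.start()

time.sleep(0.05)
op.msg_queue.put(CANCEL)   # "Starting cancel of updating stack ..."
worker.join()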
This prompts a bizarre message ("task on engine None"):
2016-09-16 15:19:20.747 16426 DEBUG heat.engine.service [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Successfully sent cancel message to remote task on engine None stack_cancel_update /opt/stack/new/heat/heat/engine/service.py:1154
That's because the code assumes that acquiring the lock can never succeed. We did just check that the stack was still IN_PROGRESS, but there is a race in which the stack can be marked FAILED and its lock released before we attempt to acquire it.
# as we cancel only running update, the acquire_result is
# always some engine_id, not None
Bzzzzt.
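For illustration, here is a paraphrased, self-contained sketch of that logic (the names and the FakeLock helper are simplifications I am assuming for this example, not the real code in heat/engine/service.py or heat/engine/stack_lock.py): try_acquire() returns the holding engine's id if the lock is already taken, or None if we just acquired it ourselves, which is exactly the case the comment above says can never happen.

class FakeLock:
    """Hypothetical stand-in for the stack lock."""

    def __init__(self, holder=None):
        self.holder = holder          # engine id currently holding the lock

    def try_acquire(self):
        if self.holder is None:       # lock was free: we grab it ourselves
            self.holder = "us"
            return None               # None means "we now own the lock"
        return self.holder            # otherwise report the current owner


def stack_cancel_update(lock):
    acquire_result = lock.try_acquire()
    # as we cancel only running update, the acquire_result is
    # always some engine_id, not None       <-- the faulty assumption
    engine_id = acquire_result
    print("Successfully sent cancel message to remote task on engine %s"
          % engine_id)
    # No release here: if engine_id is None we have just leaked the lock.


# Normal case: another engine holds the lock, so the cancel goes to it.
stack_cancel_update(FakeLock(holder="fcdec68a-..."))

# Racy case: the stack failed and dropped its lock just before the call,
# so we acquire it ourselves, log "engine None", and never release it.
stack_cancel_update(FakeLock(holder=None))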
And of course, since we assume we can never get the lock, we never bother to release it. When we come to delete, we find that the lock is still held by an active engine, even after killing all of the threads running operations on it:
2016-09-16 15:19:27.362 16422 DEBUG heat.engine.stack_lock [req-a71fa271-8210-4dd4-848a-f96536f45139 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Lock on stack 2b7aa97d-46d0-4293-a098-3239891da119 is owned by engine fcdec68a-81bb-4205-b61d-cbf23e72b8e3 acquire /opt/stack/new/heat/heat/engine/stack_lock.py:71
2016-09-16 15:19:27.363 16422 DEBUG oslo_messaging.rpc.server [req-a71fa271-8210-4dd4-848a-f96536f45139 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Expected exception during message handling () _process_incoming /usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py:136
And the stack never gets deleted.
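A minimal sketch of why the delete hangs, under the assumptions above (try_steal_lock is a hypothetical helper for illustration, not the real logic in heat/engine/stack_lock.py): the lock record still names an engine that is alive, so the lock is never stolen, and every delete attempt gives up until the test's 1200 s timeout expires.

def try_steal_lock(lock_owner, live_engines):
    if lock_owner is None:
        return True                  # lock is free: take it
    if lock_owner not in live_engines:
        return True                  # owner is dead: steal the lock
    # Owner is alive -- but it is the engine that leaked the lock in
    # stack_cancel_update, so it will never release it.
    print("Lock on stack is owned by engine %s" % lock_owner)
    return False

# The leaked lock is held by an engine that is still running, so the
# delete can never acquire it and the stack stays stuck.
assert try_steal_lock("fcdec68a-81bb-4205-b61d-cbf23e72b8e3",
                      live_engines={"fcdec68a-81bb-4205-b61d-cbf23e72b8e3"}) is False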
Note that this occurs only in the legacy (non-convergence) path.