Cancelling completed stack retains lock forever

Bug #1624538 reported by Zane Bitter
Affects: OpenStack Heat
Status: Fix Released
Importance: High
Assigned to: Zane Bitter
Milestone: (none)

Bug Description

We have some failures in the gate that manifest as:

 Captured traceback:
 ~~~~~~~~~~~~~~~~~~~
     Traceback (most recent call last):
       File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 362, in _stack_delete
         success_on_not_found=True)
       File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 352, in _wait_for_stack_status
         raise exceptions.TimeoutException(message)
     heat_integrationtests.common.exceptions.TimeoutException: Request timed out
     Details: Stack InstanceGroupBasicTest-1200481671 failed to reach DELETE_COMPLETE status within the required time (1200 s).

(from http://logs.openstack.org/36/371536/2/check/gate-heat-dsvm-functional-orig-mysql-lbaasv2/cba18a9/)

This is because we are unable to acquire the lock for a nested stack, in this case the stack InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql.

In the logs we can see that there is a WaitCondition timeout:

2016-09-16 15:19:20.547 16424 INFO heat.engine.resources.openstack.heat.wait_condition [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] WaitCondition "waiter" Stack "InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql" [2b7aa97d-46d0-4293-a098-3239891da119] Timed out (0 of 1 received)
2016-09-16 15:19:20.547 16424 INFO heat.engine.resource [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] CREATE: WaitCondition "waiter" Stack "InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql" [2b7aa97d-46d0-4293-a098-3239891da119]

This causes the stack to fail:

2016-09-16 15:19:20.675 16424 INFO heat.engine.stack [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Stack CREATE FAILED (InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql): Resource CREATE failed: WaitConditionTimeout: resources.waiter: 0 of 1 received

And, later, we also attempt to cancel any update (likely due to a sibling nested stack with a similar WaitCondition failing at the same time) by putting a cancel message in the message queues of any active threads:

2016-09-16 15:19:20.704 16426 INFO heat.engine.service [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Starting cancel of updating stack InstanceGroupBasicTest-1200481671-JobServerGroup-xuaww475lm4i-qirolrfolsh4-2rsausnnujql

Which prompts a bizarre message ("task on engine None"):

2016-09-16 15:19:20.747 16426 DEBUG heat.engine.service [req-bb51aa25-b4ce-47c6-a730-cb77101f76a7 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Successfully sent cancel message to remote task on engine None stack_cancel_update /opt/stack/new/heat/heat/engine/service.py:1154

That's because the code assumes that acquiring the lock can never succeed. We did just check that the stack was still IN_PROGRESS, but there is a race: the stack can be marked FAILED and its lock released in the window between that check and our attempt to acquire the lock.

        # as we cancel only running update, the acquire_result is
        # always some engine_id, not None

Bzzzzt.

And of course since we assume we can never get the lock, we never bother to release it. When we come to delete, we find that the lock is still being held by an active engine, even after killing all of the threads running operations on it:

2016-09-16 15:19:27.362 16422 DEBUG heat.engine.stack_lock [req-a71fa271-8210-4dd4-848a-f96536f45139 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Lock on stack 2b7aa97d-46d0-4293-a098-3239891da119 is owned by engine fcdec68a-81bb-4205-b61d-cbf23e72b8e3 acquire /opt/stack/new/heat/heat/engine/stack_lock.py:71
2016-09-16 15:19:27.363 16422 DEBUG oslo_messaging.rpc.server [req-a71fa271-8210-4dd4-848a-f96536f45139 ab538bc4af40433598e5ce0c96cbef49 955da3dc7c454d328d98ffb582226c4f - - -] Expected exception during message handling () _process_incoming /usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py:136

And the stack never gets deleted.
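
For illustration, here is a minimal, self-contained sketch of the failure mode. The names (ToyStackLock, try_acquire, get_owner) are stand-ins and not Heat's actual classes or API; the point is just that the cancel path takes the lock purely to learn which engine holds it, assumes the attempt can never succeed, and so leaks the lock on the one path where it does.

import threading
import uuid

class ToyStackLock(object):
    """Toy stand-in for a stack lock. try_acquire() returns None when we
    take the lock ourselves, or the holder's engine ID if another engine
    already has it."""

    def __init__(self):
        self._owner = None
        self._mutex = threading.Lock()

    def try_acquire(self, engine_id):
        with self._mutex:
            if self._owner is None:
                self._owner = engine_id  # we now hold the lock
                return None
            return self._owner           # someone else holds it

    def get_owner(self):
        # Report the current holder (or None) without acquiring anything;
        # a stand-in for the get_engine_id() lookup used in the fix.
        with self._mutex:
            return self._owner

    def release(self, engine_id):
        with self._mutex:
            if self._owner == engine_id:
                self._owner = None


def buggy_cancel_update(lock, my_engine_id):
    # The buggy assumption: "we only cancel running updates, so
    # try_acquire() always returns the other engine's ID, never None".
    # In the race described above, the worker has already failed the stack
    # and released the lock, so try_acquire() quietly succeeds instead,
    # and nothing ever releases it again.
    other_engine = lock.try_acquire(my_engine_id)
    print("Successfully sent cancel message to remote task on engine %s"
          % other_engine)   # "engine None" means we are now the lock holder


lock = ToyStackLock()
buggy_cancel_update(lock, uuid.uuid4().hex)
print("lock owner after cancel: %s" % lock.get_owner())  # still held: leaked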

Revision history for this message
Zane Bitter (zaneb) wrote :

Note that this occurs only in the legacy (non-convergence) path.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/371809

Changed in heat:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/371809
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e2ba3390cda4d7bfcf8397da097ac5358ead1ca5
Submitter: Jenkins
Branch: master

commit e2ba3390cda4d7bfcf8397da097ac5358ead1ca5
Author: Zane Bitter <email address hidden>
Date: Fri Sep 16 17:05:25 2016 -0400

    Don't acquire the lock when cancelling a stack

    We used to try to acquire the stack lock in order to find out which engine
    to cancel a running update on, in the misguided belief that it could never
    succeed. Accordingly, we never released the lock.

    Since it is entirely possible to encounter a race where the lock has
    already been released, use the get_engine_id() method instead to look up
    the ID of the engine holding the lock without attempting to acquire it.

    Change-Id: I1d026f8c67dddcf840ccbc2f3f1537693dc266fb
    Closes-Bug: #1624538
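
Continuing the toy sketch from the bug description above (where get_owner() is a stand-in for the get_engine_id() lookup named in this commit), the fixed flow only reads the lock's holder and never acquires it, so there is nothing left to leak:

def fixed_cancel_update(lock, my_engine_id):
    # Look up who (if anyone) holds the lock without acquiring it.
    other_engine = lock.get_owner()
    if other_engine is None:
        # The update already finished or failed and dropped the lock, so
        # there is no remote task to cancel; since we acquired nothing,
        # there is also nothing to release.
        return
    print("sending cancel message to remote task on engine %s" % other_engine)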

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/372550

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/newton)

Reviewed: https://review.openstack.org/372550
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=4695f47f68031195b23857a1fce827c39c626faa
Submitter: Jenkins
Branch: stable/newton

commit 4695f47f68031195b23857a1fce827c39c626faa
Author: Zane Bitter <email address hidden>
Date: Fri Sep 16 17:05:25 2016 -0400

    Don't acquire the lock when cancelling a stack

    We used to try to acquire the stack lock in order to find out which engine
    to cancel a running update on, in the misguided belief that it could never
    succeed. Accordingly, we never released the lock.

    Since it is entirely possible to encounter a race where the lock has
    already been released, use the get_engine_id() method instead to look up
    the ID of the engine holding the lock without attempting to acquire it.

    Change-Id: I1d026f8c67dddcf840ccbc2f3f1537693dc266fb
    Closes-Bug: #1624538
    (cherry picked from commit e2ba3390cda4d7bfcf8397da097ac5358ead1ca5)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 7.0.0.0rc2

This issue was fixed in the openstack/heat 7.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 7.0.0

This issue was fixed in the openstack/heat 7.0.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 8.0.0.0b1

This issue was fixed in the openstack/heat 8.0.0.0b1 development milestone.
