deleting in_progress stack with nested stacks fails with convergence enabled

Bug #1592374 reported by Steven Hardy on 2016-06-14
Affects: OpenStack Heat
Status: Fix Released
Importance: Critical
Assigned to: Thomas Herve

Bug Description

Since the switch to convergence (unless there's been some other regression), I'm seeing stacks get stuck forever when I try to delete a CREATE_IN_PROGRESS stack that's got an IN_PROGRESS nested stack.

E.g.:

heat_template_version: 2016-10-14

resources:
  config:
    type: OS::Heat::SoftwareConfig
    properties:
      group: script

  deployment:
    type: OS::Heat::SoftwareDeploymentGroup
    properties:
      config: {get_resource: config}
      servers:
        '0': dummy0
        '1': dummy1
        '2': dummy2
        '3': dummy3

This waits forever (until timeout) unless you manually signal each nested deployment resource, but trying to delete the top-level stack just stalls:

$ heat stack-list -n | grep b2
| 4c78032d-86f6-4566-a483-87ac140ee81b | b2 | DELETE_IN_PROGRESS | 2016-06-14T11:34:07 | 2016-06-14T11:34:24 | None |
| 87c577c9-2969-4df7-8944-a418e1932244 | b2-deployment-a3q6cxaqxchu | CREATE_IN_PROGRESS | 2016-06-14T11:34:08 | None | 4c78032d-86f6-4566-a483-87ac140ee81b |

Here we can see the nested stack is still CREATE_IN_PROGRESS. Even trying to delete the nested stack doesn't seem to work - the only workaround I can find is to restart heat-engine, then manually delete both stacks (again).

Steven Hardy (shardy) wrote :

Another exciting thing I noticed: if you switch back to convergence_engine=False without first deleting the parent and all nested stacks, and then delete the parent, all the nested stacks are orphaned. We leave them around and they then require manual cleanup.

Changed in heat:
milestone: none → newton-2
importance: Undecided → High
status: New → Triaged
Steven Hardy (shardy) wrote :

Note re the previous comment, this appears to only happen if the stack was IN_PROGRESS at the point the engine got restarted and switched back to non-convergence mode.

Thomas Herve (therve) wrote :

Reproduced, here's the error that I get:

Error acquiring lock for resource id:X with atomic_key:Y, engine_id:Z.

Changed in heat:
status: Triaged → Confirmed
tags: added: convergence-bugs
Steven Hardy (shardy) wrote :

Actually, it may be unrelated to convergence - I'm seeing this after switching convergence off:

| a2bd165e-4ee3-429b-8de5-c39fc4492773 | SoftwareDeploymentGroupTest-2043582773 | DELETE_IN_PROGRESS | 2016-06-14T13:40:51 | None | None |
| 2dcb5325-d280-43bd-be1a-39ed5ad5656a | SoftwareDeploymentGroupTest-2043582773-deployment-ev7qtvvfbh3t | CREATE_IN_PROGRESS | 2016-06-14T13:40:52 | None | a2bd165e-4ee3-429b-8de5-c39fc4492773 |
+--------------------------------------+----------------------------------------------------------------+--------------------+---------------------+---------------------+--------------------------------------+
-bash-4.3$ heat stack-delete 2dcb5325-d280-43bd-be1a-39ed5ad5656a
WARNING (shell) "heat stack-delete" is deprecated, please use "openstack stack delete" instead
Are you sure you want to delete this stack(s) [y/N]? y
ERROR: Stack SoftwareDeploymentGroupTest-2043582773-deployment-ev7qtvvfbh3t already has an action (CREATE) in progress.

I'm trying to manually delete the nested stack, and it's not working.

Only clue in the log is:
2016-06-14 14:46:02.123 DEBUG heat.engine.stack_lock [req-ced0d1a1-7914-4628-9058-b5aaaa5704f4 None admin] Lock on stack 2dcb5325-d280-43bd-be1a-39ed5ad5656a is owned by engine c118faa9-dec1-4a74-83a7-37261c96ad2f from (pid=32638) acquire /opt/stack/heat/heat/engine/stack_lock.py:69

So, maybe we're just not stealing the lock correctly in general. I know there are a couple of bugs open for this sort of issue for non-convergence when failures require cancelling in-progress operations on nested stacks, but I'm not clear atm if this is the same issue.
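To make the lock-stealing question concrete, here is a minimal Python sketch of an engine-owned stack lock that can only be stolen when its owner is no longer alive. This is a toy model and not the code in heat/engine/stack_lock.py: the class `ToyStackLock`, the `alive_engines` set and the exception name are all hypothetical, chosen only for illustration.

```python
class ActionInProgress(Exception):
    """Raised when a live engine already holds the stack lock (hypothetical name)."""


class ToyStackLock:
    """Toy lock table mapping stack_id -> owning engine_id (not Heat's real class)."""

    def __init__(self, alive_engines):
        # Assumption: a set of engine ids that would answer a liveness check.
        self.alive_engines = alive_engines
        self.owners = {}

    def acquire(self, stack_id, engine_id):
        owner = self.owners.get(stack_id)
        if owner is None:
            # Nobody holds the lock: take it.
            self.owners[stack_id] = engine_id
            return "acquired"
        if owner not in self.alive_engines:
            # The owning engine is gone: steal the lock.
            self.owners[stack_id] = engine_id
            return "stolen"
        # The owner is alive and busy: the new action is refused.
        raise ActionInProgress(
            f"Lock on stack {stack_id} is owned by engine {owner}")
```

In this model a delete handled by a second engine is refused for as long as the engine running the CREATE is alive and holding the lock, and succeeds once that engine is gone, which would be consistent with restarting heat-engine working as a workaround.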

Steven Hardy (shardy) wrote :

I created a functional test which aims to reproduce this:

https://review.openstack.org/329460

Rabi Mishra (rabi) wrote :

It seems this issue occurs with convergence only, as you can see from the gate failures for these patches [1][2].

[1] https://review.openstack.org/#/c/330468/
[2] https://review.openstack.org/#/c/330467/

Rabi Mishra (rabi) wrote :

When the engine is in convergence mode, it seems that without a way to cancel the CREATE tasks of the resources (which does not seem to be implemented atm), there is no way to acquire or steal the lock to process the DELETE action as long as the resource is in CREATE_IN_PROGRESS.

It seems the parent stack gets stuck in DELETE_IN_PROGRESS and resource-list does not show anything.

However, the child/nested stack is there and is stuck in CREATE_IN_PROGRESS along with its resources.
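The "Error acquiring lock for resource" message quoted earlier mentions an atomic_key and an engine_id. A minimal Python sketch of how such a per-resource lock could behave is below. It is an assumption-laden toy, not Heat's implementation: `ToyResourceRow`, `try_lock` and `release` are hypothetical names, and the compare-and-swap on `atomic_key` is only a guess at the general technique based on the field names in the error.

```python
class ToyResourceRow:
    """Toy stand-in for a resource DB row; field names mirror the error message."""

    def __init__(self):
        self.engine_id = None  # engine currently working on the resource, if any
        self.atomic_key = 0    # bumped on every successful write


def try_lock(row, engine_id, expected_key):
    """Compare-and-swap: lock only if the row is free and the key is unchanged."""
    if row.engine_id is not None or row.atomic_key != expected_key:
        return False
    row.engine_id = engine_id
    row.atomic_key += 1
    return True


def release(row):
    """Drop the lock, bumping the key so stale readers notice the change."""
    row.engine_id = None
    row.atomic_key += 1
```

Under these assumptions, a worker that wants to run DELETE on a resource whose CREATE is still holding the row can never win `try_lock` until the CREATE worker calls `release`, which is the stall described above.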

Anant Patil (ananta) wrote :

Need https://review.openstack.org/#/c/278231/ for the resource-list to work.

If something is in IN_PROGRESS and if it is stuck, then the action needs to be cancelled first before issuing another update. The next update will wait for the in progress resource to complete/fail before moving forward.

Steven Hardy (shardy) wrote :

> If something is in IN_PROGRESS and if it is stuck, then the action needs to be cancelled first before issuing another update. The next update will wait for the in progress resource to complete/fail before moving forward.

This isn't about updating, it's simply about deleting an IN_PROGRESS stack. To me this is a critical regression, as it becomes impossible to ever delete anything without waiting for it to either complete or fail.

Rabi Mishra (rabi) wrote :

@shardy,

I forgot to update this after our discussion yesterday.

As you've probably mentioned above, it's possible to delete an IN_PROGRESS stack once the IN_PROGRESS resources go to COMPLETE/FAILED state or time out before the stack build timeout happens. However, I think the issue is when a resource gets stuck in IN_PROGRESS forever and the stack timeout happens. As Anant mentioned above, we would need the convergence stack cancel patch to land. I assume Zane has some patches on the cancel API and would like that patch to be done differently. Not sure though.

Also, I think Jay has been working on a timeout property of a deployment resource. Probably that can help to timeout deployments quickly. Though we need to fix this asap.

Steven Hardy (shardy) wrote :

> Also, I think Jay has been working on a timeout property of a deployment resource. Probably that can help to timeout deployments quickly. Though we need to fix this asap.

Yes this will be useful, but I don't think it really solves this case, where I don't want any timeout to be observed because I'm deleting the stack.

Thanks for the info all, hopefully we can get some clarity on the proposed solution here soon and I'll rebase my test to prove the fix.

Anant Patil (ananta) wrote :

Delete is treated as an update (with an empty template) in convergence. To delete the in progress resource, the resource needs to be cancelled. When a delete request is received, cancel the in progress traversal, and then delete it (which is again an update).
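The idea in the comment above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Heat's convergence code: `ToyStack`, `start_traversal`, `work` and `delete` are hypothetical names, and real resource handling is reduced to a status dictionary.

```python
import uuid


class ToyStack:
    """Toy stack: a template (set of resource names) plus a current traversal id."""

    def __init__(self):
        self.template = set()
        self.resources = {}            # resource name -> status
        self.current_traversal = None

    def start_traversal(self, template):
        """Any update, including a delete, supersedes the current traversal."""
        self.template = set(template)
        self.current_traversal = uuid.uuid4().hex
        return self.current_traversal

    def work(self, traversal_id, name):
        """One worker step; steps belonging to a superseded traversal are dropped."""
        if traversal_id != self.current_traversal:
            return "cancelled"
        if name in self.template:
            self.resources[name] = "CREATE_COMPLETE"
            return "created"
        self.resources.pop(name, None)
        return "deleted"


def delete(stack):
    """Delete modelled as an update to an empty template."""
    return stack.start_traversal([])
```

In this sketch, issuing `delete(stack)` while a create traversal is still running makes every remaining create step return "cancelled", and the steps of the new traversal then remove whatever was already created.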

Thomas Herve (therve) on 2016-07-12
Changed in heat:
milestone: newton-2 → newton-3
Steven Hardy (shardy) wrote :

Should this be marked as a duplicate of https://bugs.launchpad.net/heat/+bug/1545063 ?

I rebased the test onto the fixes for that, so we can establish whether they fix my use case.

Fix proposed to branch: master
Review: https://review.openstack.org/354000

Changed in heat:
assignee: nobody → Anant Patil (ananta)
status: Confirmed → In Progress
Zane Bitter (zaneb) on 2016-08-31
Changed in heat:
importance: High → Critical
Thomas Herve (therve) on 2016-09-01
Changed in heat:
milestone: newton-3 → newton-rc1
Changed in heat:
assignee: Anant Patil (ananta) → Thomas Herve (therve)

Reviewed: https://review.openstack.org/354000
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6dc3ab9b01c89e0eb6695daa140679661f266519
Submitter: Jenkins
Branch: master

commit 6dc3ab9b01c89e0eb6695daa140679661f266519
Author: Anant Patil <email address hidden>
Date: Tue Sep 6 13:28:53 2016 +0530

    Deletion of in-progress stack

    Unlike in legacy engine case, the stack in in-progress state couldn't be
    deleted in convergence. This patch fixes this issue so that the
    behaviour is compatible.

    Change-Id: I79bb14d5d7744ba2a012e23e690608dc7db1fbc5
    Closes-Bug: #1545063
    Closes-Bug: #1592374

Changed in heat:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/329460
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=9b302b82fd72a546e6375af447155f05b613f41e
Submitter: Jenkins
Branch: master

commit 9b302b82fd72a546e6375af447155f05b613f41e
Author: Steven Hardy <email address hidden>
Date: Tue Jun 14 14:57:28 2016 +0100

    Add test for SoftwareDeploymentGroup resource

    Adds an initial test which tests create/delete for this resource,
    and that signalling works.

    Also tests deleting an in-progress stack as this has been a repeated
    source of bugs related to deleting the child stack correctly.

    Change-Id: I0d5acdca50467da344388d6c262e61aaaaae22eb
    Related-Bug: #1592374

This issue was fixed in the openstack/heat 7.0.0.0rc1 release candidate.
