Deleting with an in-progress stack update can fail

Bug #1384750 reported by Steven Hardy
This bug affects 1 person
Affects          Status         Importance   Assigned to            Milestone
OpenStack Heat   Fix Released   High         Pavlo Shchelokovskyy
Juno             Fix Released   High         Zane Bitter

Bug Description

If you try to interrupt a long-running stack-update, the stack can end up in an undeletable state. It seems the update is cancelled before the new template is persisted, so on delete we refer to the old template, which doesn't match the current resource-list output.

I hit this when doing a TripleO stack update (the two templates are attached; the update was actually a mistake, as it would be destructive if you tried it on a real overcloud, but it shouldn't break heat). Basically I did:

devtest.sh --trash-my-machine
<all OK, overcloud stack launched>

devtest_overcloud.sh -c --without-mergepy

(testing https://review.openstack.org/#/c/123761/)

I'm working on a more minimal reproducer, but here's the info I have at the moment:

heat stack-delete gives us this in the engine log:

Traceback (most recent call last):
  File "/opt/stack/venvs/openstack/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 455, in fire_timers
    timer()
  File "/opt/stack/venvs/openstack/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/opt/stack/venvs/openstack/lib/python2.7/site-packages/eventlet/greenthread.py", line 212, in main
    result = function(*args, **kwargs)
  File "/opt/stack/venvs/openstack/lib/python2.7/site-packages/heat/engine/service.py", line 113, in _start_with_trace
    return func(*args, **kwargs)
  File "/opt/stack/venvs/openstack/lib/python2.7/site-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/venvs/openstack/lib/python2.7/site-packages/heat/engine/stack.py", line 852, in delete
    current_resource = self.resources[key]
KeyError: u'NovaCompute1Passthrough'

but heat resource-list gives us a list which doesn't contain NovaCompute1Passthrough (it existed in the pre-update stack):

$ heat resource-list overcloud
+------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
| resource_name          | physical_resource_id                 | resource_type              | resource_status | updated_time         |
+------------------------+--------------------------------------+----------------------------+-----------------+----------------------+
| NovaCompute0           | 09ac6d1a-a53a-47ec-8137-a3b304c33ddc | OS::Nova::Server           | CREATE_COMPLETE | 2014-10-23T13:22:00Z |
| controller0            | 64ca2b98-f47e-4032-ab03-7ac1573847bc | OS::Nova::Server           | CREATE_COMPLETE | 2014-10-23T13:22:02Z |
| MysqlClusterUniquePart | qBA1dmGko5                           | OS::Heat::RandomString     | CREATE_COMPLETE | 2014-10-23T13:22:03Z |
| MysqlRootPassword      | jlxYxXiN75                           | OS::Heat::RandomString     | CREATE_COMPLETE | 2014-10-23T13:22:03Z |
| NovaCompute1           | 554971c1-6ab7-41fc-952f-7c383497a793 | OS::Nova::Server           | CREATE_COMPLETE | 2014-10-23T13:22:03Z |
| RabbitCookie           | JRyUtJmBBmNJ5BICTGlp                 | OS::Heat::RandomString     | CREATE_COMPLETE | 2014-10-23T13:22:03Z |
| allNodesConfig         | 34f6d9f5-5595-4dc8-82a0-faade92e097b | OS::Heat::StructuredConfig | CREATE_COMPLETE | 2014-10-23T13:27:15Z |
| PublicVirtualIP        | d3c66308-da38-4144-a35d-e81bfd880a43 | OS::Neutron::Port          | CREATE_COMPLETE | 2014-10-23T14:00:54Z |
| ControlVirtualIP       | 281bd0cc-96e2-445e-9b3a-e556971133bb | OS::Neutron::Port          | CREATE_COMPLETE | 2014-10-23T14:00:55Z |
| Compute                | 32bbcfb0-5c81-4910-9893-4cd69d4677e5 | OS::Heat::ResourceGroup    | CREATE_FAILED   | 2014-10-23T14:00:57Z |
| Controller             | ecc3b574-d8ba-4681-9e89-0127099d19b5 | OS::Heat::ResourceGroup    | CREATE_FAILED   | 2014-10-23T14:01:02Z |
+------------------------+--------------------------------------+----------------------------+-----------------+----------------------+

From this point, you're stuck, as the stack can't be deleted :(

Tags: tripleo
Steven Hardy (shardy)
summary: - Deleting in-progress stack update can fail
+ Deleting with an in-progress stack update can fail
Changed in heat:
importance: Undecided → High
tags: added: tripleo
Changed in heat:
status: New → Triaged
Steven Hardy (shardy)
tags: removed: tripleo
tags: added: tripleo
Pavlo Shchelokovskyy (pshchelo) wrote :

I think I have a "minimal" reproducer.

For that I use a custom resource plugin that takes forever to update:
https://github.com/pshchelo/stackdev/tree/9525f896cd93cf3a0b0ae6f1321245fed7201eba/heat_plugins/stuck

and register it with Heat. Then I use these templates:
https://github.com/pshchelo/stackdev/tree/9525f896cd93cf3a0b0ae6f1321245fed7201eba/templates/stuck

Create a stack with two resources:
$ heat stack-create stuck -f stuck2.yaml

Update the stack, updating the first resource and deleting the second:
$ heat stack-update stuck -f stuck1.yaml

The stack hangs in UPDATE_IN_PROGRESS by design. Try to delete the stack:
$ heat stack-delete stuck

Now the stack is stuck in DELETE_IN_PROGRESS, and nothing can be done with it.

heat-engine log has the following traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 455, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 212, in main
    result = function(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/service.py", line 113, in _start_with_trace
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/stack.py", line 972, in delete
    self._delete_backup_stack(backup_stack)
  File "/opt/stack/heat/heat/engine/stack.py", line 851, in _delete_backup_stack
    curr_res = self.resources[key]
KeyError: u'second'
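
The traceback points at a plain dictionary-style lookup, `self.resources[key]`, failing for a name that is no longer in the live stack. The following is only a minimal sketch of that failure mode, not Heat code: it assumes the delete path walks resource names taken from the backup stack and indexes the live stack with them. The function name `delete_backup_strict` and the dict contents are hypothetical stand-ins.

```python
def delete_backup_strict(current_resources, backup_resources):
    """Mimic the failing lookup: index the live stack by each backup name."""
    visited = []
    for key in backup_resources:
        # Raises KeyError when the interrupted update already removed `key`
        # from the live stack.
        curr_res = current_resources[key]
        visited.append((key, curr_res))
    return visited


# Assumed state after the interrupted update: 'second' is already gone
# from the live stack but is still known to the backup stack.
current = {"first": "updating"}
backup = {"first": "old-first", "second": "old-second"}

missing = None
try:
    delete_backup_strict(current, backup)
except KeyError as exc:
    missing = exc.args[0]

print("lookup failed for:", missing)
```

Under these assumptions the lookup fails for `second`, matching the resource name in the KeyError above.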

Changed in heat:
status: Triaged → Confirmed
Clint Byrum (clint-fewbar) wrote :

Is there a workaround? I don't see one documented here. No workaround should mean this is a critical bug.

Angus Salkeld (asalkeld)
Changed in heat:
milestone: none → kilo-2
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147461

Changed in heat:
assignee: nobody → Pavlo Shchelokovskyy (pshchelo)
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/147461
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e44629dab01057c65a14218f5d3ae19d5cdcf9e0
Submitter: Jenkins
Branch: master

commit e44629dab01057c65a14218f5d3ae19d5cdcf9e0
Author: Pavlo Shchelokovskyy <email address hidden>
Date: Thu Jan 15 11:02:23 2015 +0000

    Prevent hanging in DELETE_IN_PROGRESS

    When during update one resource is deleted and another independent
    resource takes long to update itself, deleting such stack when it is
    UPDATE_IN_PROGRESS led to a stack being stuck in DELETE_IN_PROGRESS as
    deleting backup stack was not finding the already deleted resource.

    Change-Id: Ib0c42b718a88ac994c53165362a38da2ad5b6b41
    Closes-Bug: #1384750
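
The commit message says that deleting the backup stack "was not finding the already deleted resource". One common way to tolerate that situation is to look the name up without raising and skip the reconciliation step when it is absent. The sketch below illustrates that general pattern under stated assumptions; it is not the merged patch, and `delete_backup_tolerant` and the dict contents are hypothetical.

```python
def delete_backup_tolerant(current_resources, backup_resources):
    """Walk backup names, tolerating names absent from the live stack."""
    paired = []       # names found in both stacks
    backup_only = []  # names the update already removed from the live stack
    for key in backup_resources:
        curr_res = current_resources.get(key)  # None instead of KeyError
        if curr_res is None:
            # Nothing to reconcile with the live stack for this name.
            backup_only.append(key)
            continue
        paired.append(key)
    return paired, backup_only


# Assumed state after the interrupted update, as in the reproducer:
# 'second' was already deleted from the live stack.
paired, backup_only = delete_backup_tolerant(
    {"first": "updating"},
    {"first": "old-first", "second": "old-second"},
)
print(paired, backup_only)
```

With a guard like this the loop completes instead of raising, so the delete can proceed rather than leaving the stack in DELETE_IN_PROGRESS.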

Changed in heat:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: kilo-2 → 2015.1.0
Zane Bitter (zaneb)
tags: added: juno-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/227065

Zane Bitter (zaneb)
no longer affects: heat/kilo
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/juno)

Reviewed: https://review.openstack.org/227065
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=fe5333395c3ff3f4be4bc6393fca673e38b21cc3
Submitter: Jenkins
Branch: stable/juno

commit fe5333395c3ff3f4be4bc6393fca673e38b21cc3
Author: Pavlo Shchelokovskyy <email address hidden>
Date: Thu Jan 15 11:02:23 2015 +0000

    Prevent hanging in DELETE_IN_PROGRESS

    When during update one resource is deleted and another independent
    resource takes long to update itself, deleting such stack when it is
    UPDATE_IN_PROGRESS led to a stack being stuck in DELETE_IN_PROGRESS as
    deleting backup stack was not finding the already deleted resource.

    Change-Id: Ib0c42b718a88ac994c53165362a38da2ad5b6b41
    Closes-Bug: #1384750
    (cherry picked from commit e44629dab01057c65a14218f5d3ae19d5cdcf9e0)

Zane Bitter (zaneb)
tags: removed: juno-backport-potential