Heat gets stuck in DELETE_IN_PROGRESS for some input data

Bug #1499669 reported by Oleksii Chuprykov on 2015-09-25
This bug affects 6 people
Affects         Status         Importance  Assigned to  Milestone
OpenStack Heat  Fix Released   High        Steve Baker
Liberty         Fix Committed  High        Zane Bitter
Mitaka          Fix Committed  High        Zane Bitter

Bug Description

Steps to reproduce:

rg.yaml:
heat_template_version: 2013-05-23
resources:
    rg:
        type: OS::Heat::ResourceGroup
        properties:
            count: 125
            resource_def:
                type: rand_str.yaml

rand_str.yaml:
heat_template_version: 2013-05-23
resources:

(yes, intentionally without any resources defined)

Run:
heat stack-create abc --template-file rg.yaml
wait about 20-30 seconds, then run:
heat stack-delete abc
(before the stack reaches CREATE_COMPLETE)

Heat gets stuck in DELETE_IN_PROGRESS

Found this in logs:

2015-09-25 12:59:15.450 ERROR heat.engine.resource [-] DB error Not found

or

2015-09-25 13:04:52.109 ERROR heat.engine.resource [-] DB error This result object does not return rows. It has been closed automatically.
2015-09-25 13:04:52.110 ERROR sqlalchemy.pool.QueuePool [-] Exception during reset or similar
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool Traceback (most recent call last):
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 636, in _finalize_fairy
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool fairy._reset(pool)
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 776, in _reset
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool pool._dialect.do_rollback(self)
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool dbapi_connection.rollback()
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 711, in rollback
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool self._read_ok_packet()
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 687, in _read_ok_packet
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool raise err.OperationalError(2014, "Command Out of Sync")
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool OperationalError: (2014, 'Command Out of Sync')

or:

 File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/service.py", line 117, in _start_with_trace
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/stack.py", line 1449, in delete
    self.state_set(action, stack_status, reason)
  File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/stack.py", line 723, in state_set
    stack = stack_object.Stack.get_by_id(self.context, self.id)
  File "/opt/stack/heat/heat/objects/stack.py", line 90, in get_by_id
    db_stack = db_api.stack_get(context, stack_id, **kwargs)
  File "/opt/stack/heat/heat/db/api.py", line 134, in stack_get
    eager_load=eager_load)
  File "/opt/stack/heat/heat/db/sqlalchemy/api.py", line 344, in stack_get
    result = query.get(stack_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 819, in get
    return self._get_impl(ident, loading.load_on_ident)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 852, in _get_impl
    return fallback_fn(self, key)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
    return q.one()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2473, in one
    ret = list(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2516, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2529, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2520, in _connection_from_session
    **kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 882, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 889, in _connection_for_bind
    conn = engine.contextual_connect(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2041, in contextual_connect
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 92, in __init__
    self.dispatch.engine_connect(self, self.__branch)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/event/attr.py", line 256, in __call__
    fn(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 72, in _connect_ping_listener
    connection.scalar(select([1]))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 844, in scalar
    return self.execute(object, *multiparams, **params).scalar()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 1064, in scalar
    row = self.first()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 1038, in first
    return self._non_result(None)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 920, in _non_result
    "This result object does not return rows. "
ResourceClosedError: This result object does not return rows. It has been closed automatically.

The rest of the time, Heat spams the log with messages like:

2015-09-25 13:13:32.333 DEBUG heat.engine.scheduler [-] Task destroy_resource running from (pid=32479) step /opt/stack/heat/heat/engine/scheduler.py:214
2015-09-25 13:13:32.359 INFO heat.engine.environment [-] Registering file:///home/oleksii/rand_str.yaml -> file:///home/oleksii/rand_str.yaml
2015-09-25 13:13:32.360 DEBUG heat.engine.scheduler [-] Task DependencyTaskGroup((destroy_resource) {ResourceGroup "rg" [6c19089b-9ed9-4d0f-93ca-9c76ba9fa401] Stack "abc" [b5d8388e-68fc-4279-bbc7-9c8ae3637c89]: {}}) sleeping from (pid=32479) _sleep /opt/stack/heat/heat/engine/scheduler.py:160

This is not 100% reproducible; experiment with different delays (20, 30, 40, or 50 seconds).

Run heat stack-delete abc again, and the stack will be deleted as usual.

description: updated
Changed in heat:
status: New → Triaged
importance: Undecided → Medium
milestone: none → mitaka-1
Zane Bitter (zaneb) wrote :

Looks like we are failing to catch a DB exception correctly. We should get NotFound and recognise that as the stack having been deleted.

The "spamming" in the log is the normal polling of the status of the nested stack.

Changed in heat:
milestone: mitaka-1 → mitaka-2
Changed in heat:
assignee: nobody → zhaozhilong (zhaozhilong)
Changed in heat:
milestone: mitaka-2 → mitaka-3
Changed in heat:
assignee: zhaozhilong (zhaozhilong) → nobody
Changed in heat:
milestone: mitaka-3 → mitaka-rc1
Changed in heat:
milestone: mitaka-rc1 → newton-1
Thomas Herve (therve) on 2016-05-25
Changed in heat:
milestone: newton-1 → ongoing
Zane Bitter (zaneb) wrote :

It appears this error results from accessing the same session in multiple threads simultaneously:

http://stackoverflow.com/questions/17317344/celery-and-sqlalchemy-this-result-object-does-not-return-rows-it-has-been-clo#17348307

I'm not sure how that could happen in Heat. In the case of the traceback from therve's comment above, it's failing in the link()ed functions that run after a thread has completed (in this case, it's being cancelled):
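The hazard described above can be demonstrated in isolation with the stdlib sqlite3 module (an analogy only, not Heat's MySQL setup): sqlite3 detects cross-thread use of a connection eagerly, whereas drivers like PyMySQL do not and instead fail later with errors such as "Command Out of Sync" when two greenthreads interleave on one connection.

```python
import sqlite3
import threading

# A connection created in the main thread; by default sqlite3 refuses
# to let any other thread use it (check_same_thread=True).
conn = sqlite3.connect(":memory:")
errors = []

def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)

t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()

# The shared connection was rejected from the second thread.
print(len(errors))  # 1
```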

2016-06-04 02:51:54.091 3205 INFO heat.engine.stack [req-d3d7848b-ea6e-4557-8759-37626e7f5f63 demo demo - default default] Stopped due to GreenletExit() in create
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 217, in main
    self._resolve_links()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 232, in _resolve_links
    f(self, *ca, **ckw)
  File "/opt/stack/new/heat/heat/engine/service.py", line 187, in release
    stack.persist_state_and_release_lock(lock.engine_id)
  File "/opt/stack/new/heat/heat/engine/stack.py", line 913, in persist_state_and_release_lock
    stack = stack_object.Stack.get_by_id(self.context, self.id)
  File "/opt/stack/new/heat/heat/objects/stack.py", line 85, in get_by_id
    db_stack = db_api.stack_get(context, stack_id, **kwargs)
  File "/opt/stack/new/heat/heat/db/api.py", line 146, in stack_get
    eager_load=eager_load)
  File "/opt/stack/new/heat/heat/db/sqlalchemy/api.py", line 401, in stack_get
    result = query.get(stack_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 831, in get
    return self._get_impl(ident, loading.load_on_ident)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 864, in _get_impl
    return fallback_fn(self, key)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
    return q.one()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2718, in one
    ret = list(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2761, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2774, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2765, in _connection_from_session
    **kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 893, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 900, in _connection_for_bind
    conn = engine.contextual_connect(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2041, in contextual_connect
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 92, in __init__
    self.dispatch.engine_connect(sel...


Zane Bitter (zaneb) wrote :

I suspect the solution is to explicitly use transactions in all of the database updates. That way, if an exception occurs due to the thread being killed in the middle of a transaction then the transaction will be rolled back, and there's no risk of attempting to do something else to the DB in the middle of an implicit transaction.

An alternative would be to try to catch exceptions and roll back at a higher level, but that makes it much easier to miss some cases.
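The explicit-transaction pattern suggested above can be sketched with the stdlib sqlite3 module (an illustration only, not Heat's oslo.db session handling): if the thread dies mid-transaction, the transaction context rolls everything back, so no half-written state reaches the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stack (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO stack VALUES (1, 'CREATE_IN_PROGRESS')")
conn.commit()

# Explicit transaction: any exception raised mid-update rolls the whole
# transaction back, so a killed thread never leaves a partial write behind.
try:
    with conn:  # sqlite3 connection as context manager == one transaction
        conn.execute("UPDATE stack SET status = 'DELETE_IN_PROGRESS' WHERE id = 1")
        raise RuntimeError("thread killed mid-transaction")  # simulated GreenletExit
except RuntimeError:
    pass

status = conn.execute("SELECT status FROM stack WHERE id = 1").fetchone()[0]
print(status)  # CREATE_IN_PROGRESS -- the partial update was rolled back
```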

Rabi Mishra (rabi) wrote :

We probably already use explicit transactions for all updates, including making them subtransactions when required, e.g. [1].

I tried the reproducer and hit a different issue which moved the stack to DELETE_FAILED.

2016-06-22 07:36:42.966 13243 INFO heat.engine.stack [req-e1cb88b8-6b9d-4d0f-a473-14ea0e35a675 - demo - default default] Stack DELETE FAILED (test_stack-rg-zk273ddlyrxh): Resource DELETE failed: TimeoutError: resources[43]: QueuePool limit of size 5 overflow 64 reached, connection timed out, timeout 30

[1] https://github.com/openstack/heat/blob/master/heat/db/sqlalchemy/api.py#L272

Evgeny Sikachev (esikachev) wrote :

Looks like this also affects Sahara on master.

Luigi Toscano (ltoscano) wrote :

I concur; the mitaka branch is affected as well.

Changed in heat:
assignee: nobody → Jason Dunsmore (jasondunsmore)
status: Triaged → In Progress
Zane Bitter (zaneb) wrote :

Just noticed that bug 1546431 is the same issue. Jason, what do you think about closing this as a duplicate and transferring the work to that one?

Thomas Herve (therve) wrote :

I've run a good number of tests with the given reproducer, and things are looking good on master. We still get the tracebacks, but we seem to be able to recover nicely.

I've traced it down a little bit, and it seems that https://review.openstack.org/#/c/332963/ is the one improving the situation quite a bit.

Zane Bitter (zaneb) wrote :

That's probably just covering up the issue. If we're trying to write to the database it's because there's something important we want to record, and if the write is failing then we're losing some data that we needed to keep. The Logstash query shows that's still happening plenty: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Command%20Out%20of%20Sync%5C%22%20tags:%5C%22screen-h-eng%5C%22

It's nice if we don't completely hose the stack in the process though :) We could consider backporting that patch as a mitigation.

Changed in heat:
importance: Medium → High
milestone: ongoing → newton-rc1

Fix proposed to branch: master
Review: https://review.openstack.org/369827

Changed in heat:
assignee: Jason Dunsmore (jasondunsmore) → Steve Baker (steve-stevebaker)
Zane Bitter (zaneb) on 2016-09-14
tags: added: gate-failure
Thomas Herve (therve) on 2016-09-16
Changed in heat:
milestone: newton-rc1 → newton-rc2

Change abandoned by Jason Dunsmore (<email address hidden>) on branch: master
Review: https://review.openstack.org/291931
Reason: Abandoning in favor of https://review.openstack.org/#/c/369827/

Reviewed: https://review.openstack.org/369827
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3000f904080d8dcd841d913dcd2ae658fb526c1a
Submitter: Jenkins
Branch: master

commit 3000f904080d8dcd841d913dcd2ae658fb526c1a
Author: Steve Baker <email address hidden>
Date: Fri Sep 16 03:29:59 2016 +0000

    Legacy delete attempt thread cancel before stop

    The error messages 'Command Out of Sync' are due to the threads being
    stopped in the middle of the database operations. This happens in the
    legacy action when delete is requested during a stack create.

    We have the thread cancel message but that was not being used in this
    case. Thread cancel should provide a more graceful way of ensuring the
    stack is in a FAILED state before the delete is attempted.

    This change does the following in the delete_stack service method for
    the legacy engine:
    - if the stack is still locked, send thread cancel message
    - in a subthread wait for the lock to be released, or until a
      timeout based on the 4 minute cancel grace period
    - if the stack is still locked, do a thread stop as before

    Closes-Bug: #1499669
    Closes-Bug: #1546431
    Closes-Bug: #1536451
    Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0
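
The cancel-then-stop sequence described in the commit message can be sketched with stdlib threading primitives (hypothetical names; this is not Heat's actual thread-group API):

```python
import threading
import time

# Step 1: send a cancel message the worker can honour gracefully.
# Step 2: wait up to a grace period for the stack lock to be released.
# Step 3: only if the worker ignored the cancel, fall back to a hard stop.
cancel_requested = threading.Event()
lock_released = threading.Event()

def worker():
    while not cancel_requested.is_set():
        time.sleep(0.01)  # simulated steps of the stack operation
    lock_released.set()   # graceful exit: release the stack lock

t = threading.Thread(target=worker)
t.start()

cancel_requested.set()                      # step 1: thread cancel message
graceful = lock_released.wait(timeout=4.0)  # step 2: wait for the lock
if not graceful:
    pass  # step 3 would be the hard thread stop, as before
t.join()
print("graceful" if graceful else "forced")
```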

Changed in heat:
status: In Progress → Fix Released
Zane Bitter (zaneb) on 2016-09-20
no longer affects: heat/newton

Reviewed: https://review.openstack.org/373518
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=2dd44db1b9cf4b789d8a083df6f97ae1fb5e22d5
Submitter: Jenkins
Branch: stable/newton

commit 2dd44db1b9cf4b789d8a083df6f97ae1fb5e22d5
Author: Steve Baker <email address hidden>
Date: Fri Sep 16 03:29:59 2016 +0000

    Legacy delete attempt thread cancel before stop

    The error messages 'Command Out of Sync' are due to the threads being
    stopped in the middle of the database operations. This happens in the
    legacy action when delete is requested during a stack create.

    We have the thread cancel message but that was not being used in this
    case. Thread cancel should provide a more graceful way of ensuring the
    stack is in a FAILED state before the delete is attempted.

    This change does the following in the delete_stack service method for
    the legacy engine:
    - if the stack is still locked, send thread cancel message
    - in a subthread wait for the lock to be released, or until a
      timeout based on the 4 minute cancel grace period
    - if the stack is still locked, do a thread stop as before

    Closes-Bug: #1499669
    Closes-Bug: #1546431
    Closes-Bug: #1536451
    Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0
    (cherry picked from commit 3000f904080d8dcd841d913dcd2ae658fb526c1a)

tags: added: in-stable-newton
Zane Bitter (zaneb) wrote :

The database errors turn out to be a sqlalchemy issue: https://bitbucket.org/zzzeek/sqlalchemy/issues/3803/dbapi-connections-go-invalid-on

Surprisingly enough though, those aren't actually the cause of the problem here. For the most part we deal with an error writing to the DB quite gracefully. The reason the root stack is "hanging" IN_PROGRESS (it's not really hanging; it will eventually time out normally) is that the child stack (the ResourceGroup) doesn't start deleting after we've cancelled its update. And the reason it doesn't start deleting is because we don't wait long enough for the running threads to be stopped before we give up and don't bother starting the delete.

The length of time we wait is configurable as engine_life_check_timeout. The default is 2s - it turns out that it takes at least 4-5s to cancel a stack of this size. A user could work around this problem by increasing the engine_life_check_timeout, however it's probably just inappropriate for us to be using this value (I think it happened in a historical accident).
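
The workaround mentioned above is a one-line change in heat.conf; the value shown here is illustrative:

```ini
[DEFAULT]
# Default is 2 (seconds); a large stack can need several seconds to cancel
# all running threads before the delete can start.
engine_life_check_timeout = 10
```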

We're much less likely to encounter this issue now that https://review.openstack.org/369827 has merged, but a fix would not only benefit master but also be easily backportable to earlier stable branches.

Zane Bitter (zaneb) wrote :

One mystery remains: when the cancellation times out we raise the exception StopActionFailed, and prior to https://review.openstack.org/369827 this should have resulted in the parent stack getting an exception when calling stack_delete (it's now asynchronous, but previously was raised synchronously), and in turn the ResourceGroup resource in the parent stack being marked DELETE_FAILED.

The reason is that we're using cast() instead of call(), which means that we never see the response. This goes back to the original implementation of the RPC client in July 2012 (https://review.openstack.org/#/c/10614/):

    delete_stack seems to be the only method which returns nothing, so it can be
    invoked as cast or call, with cast being the default.

That wasn't true even at the time: while it did not return anything, it very much could raise exceptions, and the caller almost always needs to know about them. Both heat-cfn-api and heat-api have always passed cast=False, but we forgot to do the same here.
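
The difference can be modelled with a toy dispatcher (not oslo.messaging's real API): a cast-style send discards the result, so the caller never sees a server-side exception, while a call-style send waits for and receives it.

```python
import queue
import threading

# Toy RPC server: processes (func, reply_q) tasks. With a reply queue
# (call) the outcome, including any exception, goes back to the caller;
# without one (cast) an exception is silently lost.
def server(task_q):
    while True:
        func, reply_q = task_q.get()
        if func is None:
            return  # shutdown sentinel
        try:
            result = (func(), None)
        except Exception as exc:
            result = (None, exc)
        if reply_q is not None:  # call(): the caller is waiting
            reply_q.put(result)
        # cast(): reply_q is None, the failure vanishes

tasks = queue.Queue()
threading.Thread(target=server, args=(tasks,), daemon=True).start()

def delete_stack():
    raise RuntimeError("StopActionFailed")

# cast: no reply queue, so the caller never learns the delete failed
tasks.put((delete_stack, None))

# call: reply queue, so the exception propagates back to the caller
reply = queue.Queue()
tasks.put((delete_stack, reply))
_, exc = reply.get()
print(type(exc).__name__)  # RuntimeError
tasks.put((None, None))  # stop the server
```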

Fix proposed to branch: master
Review: https://review.openstack.org/374443

Reviewed: https://review.openstack.org/374442
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e5cec71e52c3fed0ffb4385990758db8ebf367da
Submitter: Jenkins
Branch: master

commit e5cec71e52c3fed0ffb4385990758db8ebf367da
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669

Reviewed: https://review.openstack.org/374844
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=a96d89b2005dcef2f06b6ae55260fdb0f358abab
Submitter: Jenkins
Branch: stable/newton

commit a96d89b2005dcef2f06b6ae55260fdb0f358abab
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669
    (cherry picked from commit e5cec71e52c3fed0ffb4385990758db8ebf367da)

Reviewed: https://review.openstack.org/374443
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e56fc689e19d92b5a3d23736d472c9a1fc698537
Submitter: Jenkins
Branch: master

commit e56fc689e19d92b5a3d23736d472c9a1fc698537
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669

Reviewed: https://review.openstack.org/375469
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ee86435e44065f9d59425023b1b8220826a6e0f2
Submitter: Jenkins
Branch: stable/newton

commit ee86435e44065f9d59425023b1b8220826a6e0f2
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669
    (cherry picked from commit e56fc689e19d92b5a3d23736d472c9a1fc698537)

This issue was fixed in the openstack/heat 7.0.0.0rc2 release candidate.

Reviewed: https://review.openstack.org/374847
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=9a2c7ecf34d0fdacad11ce484c96ae022c7b2062
Submitter: Jenkins
Branch: stable/liberty

commit 9a2c7ecf34d0fdacad11ce484c96ae022c7b2062
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669
    (cherry picked from commit e5cec71e52c3fed0ffb4385990758db8ebf367da)

tags: added: in-stable-liberty

Reviewed: https://review.openstack.org/375473
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=cd4ab44929447f40f63cd75cdb760dcc5b9ae92e
Submitter: Jenkins
Branch: stable/liberty

commit cd4ab44929447f40f63cd75cdb760dcc5b9ae92e
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669
    (cherry picked from commit e56fc689e19d92b5a3d23736d472c9a1fc698537)

Reviewed: https://review.openstack.org/374846
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=adae45d76268eb57bf94600a404aa56d9769ca9c
Submitter: Jenkins
Branch: stable/mitaka

commit adae45d76268eb57bf94600a404aa56d9769ca9c
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669
    (cherry picked from commit e5cec71e52c3fed0ffb4385990758db8ebf367da)

tags: added: in-stable-mitaka

Reviewed: https://review.openstack.org/375472
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=33d2395bfa0ed6b9305e8cc231e66b81e1887ef0
Submitter: Jenkins
Branch: stable/mitaka

commit 33d2395bfa0ed6b9305e8cc231e66b81e1887ef0
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    A functional test is added, but skipped when convergence is enabled.
    This is because cancelling in-progress operations upon delete was not
    supported under convergence in Mitaka (support was added in Newton). The
    bug fix affects only the legacy path anyway.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669
    (cherry picked from commit e56fc689e19d92b5a3d23736d472c9a1fc698537)

This issue was fixed in the openstack/heat 5.0.3 release.

This issue was fixed in the openstack/heat 7.0.0 release.

This issue was fixed in the openstack/heat 8.0.0.0b1 development milestone.

This issue was fixed in the openstack/heat 6.1.1 release.

Reviewed: https://review.openstack.org/374444
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=73a886d806254b30067441a321bf816b0f624828
Submitter: Jenkins
Branch: master

commit 73a886d806254b30067441a321bf816b0f624828
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 19:13:02 2016 -0400

    RPC Client: don't cast() delete_stack by default

    The delete_stack() RPC call in the client can be sent using either call()
    or cast(), with cast() the default. This is never what you want, because
    the call could raise an exception and you want to hear about that.

    We're now passing cast=False explicitly everywhere. We always were in
    heat-cfn-api and heat-api, but failing to do so in StackResource caused bug
    1499669, which was corrected by I039eb8f6c6a262653c1e9edc8173e5680d81e31b.
    Changing the default to call() will prevent anyone making that same mistake
    again.

    Change-Id: Idd6a27988dadbf1cd8376de24b19f2226f6ae5b7
    Related-Bug: #1499669
