Heat gets stuck in DELETE_IN_PROGRESS for some input data

Bug #1499669 reported by Oleksii Chuprykov on 2015-09-25
This bug affects 6 people
Affects         Status         Importance  Assigned to  Milestone
OpenStack Heat  Fix Released   High        Steve Baker
Liberty         Fix Committed  High        Zane Bitter
Mitaka          Fix Committed  High        Zane Bitter

Bug Description

Steps to reproduce:

rg.yaml:
heat_template_version: 2013-05-23
resources:
    rg:
        type: OS::Heat::ResourceGroup
        properties:
            count: 125
            resource_def:
                type: rand_str.yaml

rand_str.yaml:
heat_template_version: 2013-05-23
resources:

(yes, intentionally without any resources defined)

Run:
heat stack-create abc --template-file rg.yaml
wait about 20-30 seconds, then run:
heat stack-delete abc
(before the stack reaches CREATE_COMPLETE)

Heat gets stuck in DELETE_IN_PROGRESS

Found this in logs:

2015-09-25 12:59:15.450 ERROR heat.engine.resource [-] DB error Not found

or

2015-09-25 13:04:52.109 ERROR heat.engine.resource [-] DB error This result object does not return rows. It has been closed automatically.
2015-09-25 13:04:52.110 ERROR sqlalchemy.pool.QueuePool [-] Exception during reset or similar
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool Traceback (most recent call last):
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 636, in _finalize_fairy
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool fairy._reset(pool)
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/pool.py", line 776, in _reset
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool pool._dialect.do_rollback(self)
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/dialects/mysql/base.py", line 2519, in do_rollback
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool dbapi_connection.rollback()
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 711, in rollback
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool self._read_ok_packet()
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool File "/usr/local/lib/python2.7/dist-packages/pymysql/connections.py", line 687, in _read_ok_packet
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool raise err.OperationalError(2014, "Command Out of Sync")
2015-09-25 13:04:52.110 TRACE sqlalchemy.pool.QueuePool OperationalError: (2014, 'Command Out of Sync')

or:

 File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/service.py", line 117, in _start_with_trace
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/stack.py", line 1449, in delete
    self.state_set(action, stack_status, reason)
  File "/usr/local/lib/python2.7/dist-packages/osprofiler/profiler.py", line 105, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/heat/heat/engine/stack.py", line 723, in state_set
    stack = stack_object.Stack.get_by_id(self.context, self.id)
  File "/opt/stack/heat/heat/objects/stack.py", line 90, in get_by_id
    db_stack = db_api.stack_get(context, stack_id, **kwargs)
  File "/opt/stack/heat/heat/db/api.py", line 134, in stack_get
    eager_load=eager_load)
  File "/opt/stack/heat/heat/db/sqlalchemy/api.py", line 344, in stack_get
    result = query.get(stack_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 819, in get
    return self._get_impl(ident, loading.load_on_ident)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 852, in _get_impl
    return fallback_fn(self, key)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
    return q.one()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2473, in one
    ret = list(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2516, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2529, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2520, in _connection_from_session
    **kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 882, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 889, in _connection_for_bind
    conn = engine.contextual_connect(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2041, in contextual_connect
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 92, in __init__
    self.dispatch.engine_connect(self, self.__branch)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/event/attr.py", line 256, in __call__
    fn(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/oslo_db/sqlalchemy/engines.py", line 72, in _connect_ping_listener
    connection.scalar(select([1]))
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 844, in scalar
    return self.execute(object, *multiparams, **params).scalar()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 1064, in scalar
    row = self.first()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 1038, in first
    return self._non_result(None)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/result.py", line 920, in _non_result
    "This result object does not return rows. "
ResourceClosedError: This result object does not return rows. It has been closed automatically.

The rest of the time, Heat spams the log with messages like:

2015-09-25 13:13:32.333 DEBUG heat.engine.scheduler [-] Task destroy_resource running from (pid=32479) step /opt/stack/heat/heat/engine/scheduler.py:214
2015-09-25 13:13:32.359 INFO heat.engine.environment [-] Registering file:///home/oleksii/rand_str.yaml -> file:///home/oleksii/rand_str.yaml
2015-09-25 13:13:32.360 DEBUG heat.engine.scheduler [-] Task DependencyTaskGroup((destroy_resource) {ResourceGroup "rg" [6c19089b-9ed9-4d0f-93ca-9c76ba9fa401] Stack "abc" [b5d8388e-68fc-4279-bbc7-9c8ae3637c89]: {}}) sleeping from (pid=32479) _sleep /opt/stack/heat/heat/engine/scheduler.py:160

This is not 100% reproducible; experiment with different delays (20, 30, 40, or 50 seconds).

Run heat stack-delete abc again, and the stack will be deleted as usual.

description: updated
Changed in heat:
status: New → Triaged
importance: Undecided → Medium
milestone: none → mitaka-1
Zane Bitter (zaneb) wrote :

Looks like we are failing to catch a DB exception correctly. We should get NotFound and recognise that as the stack having been deleted.

The "spamming" in the log is the normal polling of the status of the nested stack.

Changed in heat:
milestone: mitaka-1 → mitaka-2
Changed in heat:
assignee: nobody → zhaozhilong (zhaozhilong)
Changed in heat:
milestone: mitaka-2 → mitaka-3
Changed in heat:
assignee: zhaozhilong (zhaozhilong) → nobody
Changed in heat:
milestone: mitaka-3 → mitaka-rc1
Changed in heat:
milestone: mitaka-rc1 → newton-1
Thomas Herve (therve) on 2016-05-25
Changed in heat:
milestone: newton-1 → ongoing
Zane Bitter (zaneb) wrote :

It appears this error results from accessing the same session in multiple threads simultaneously:

http://stackoverflow.com/questions/17317344/celery-and-sqlalchemy-this-result-object-does-not-return-rows-it-has-been-clo#17348307

I'm not sure how that could happen in Heat. In the case of the traceback from therve's comment above, it's failing in the link()ed functions that run after a thread has completed (in this case, it's being cancelled):
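The hazard described above can be demonstrated in isolation with the stdlib sqlite3 module (an analogy only, not Heat's MySQL setup): sqlite3 detects cross-thread use of a connection eagerly, whereas drivers like PyMySQL do not and instead fail later with errors such as "Command Out of Sync" when two greenthreads interleave on one connection.

```python
import sqlite3
import threading

# A connection created in the main thread; by default sqlite3 refuses
# to let any other thread use it (check_same_thread=True).
conn = sqlite3.connect(":memory:")
errors = []

def use_from_other_thread():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        errors.append(exc)

t = threading.Thread(target=use_from_other_thread)
t.start()
t.join()

# The shared connection was rejected from the second thread.
print(len(errors))  # 1
```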

2016-06-04 02:51:54.091 3205 INFO heat.engine.stack [req-d3d7848b-ea6e-4557-8759-37626e7f5f63 demo demo - default default] Stopped due to GreenletExit() in create
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 217, in main
    self._resolve_links()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 232, in _resolve_links
    f(self, *ca, **ckw)
  File "/opt/stack/new/heat/heat/engine/service.py", line 187, in release
    stack.persist_state_and_release_lock(lock.engine_id)
  File "/opt/stack/new/heat/heat/engine/stack.py", line 913, in persist_state_and_release_lock
    stack = stack_object.Stack.get_by_id(self.context, self.id)
  File "/opt/stack/new/heat/heat/objects/stack.py", line 85, in get_by_id
    db_stack = db_api.stack_get(context, stack_id, **kwargs)
  File "/opt/stack/new/heat/heat/db/api.py", line 146, in stack_get
    eager_load=eager_load)
  File "/opt/stack/new/heat/heat/db/sqlalchemy/api.py", line 401, in stack_get
    result = query.get(stack_id)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 831, in get
    return self._get_impl(ident, loading.load_on_ident)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 864, in _get_impl
    return fallback_fn(self, key)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/loading.py", line 219, in load_on_ident
    return q.one()
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2718, in one
    ret = list(self)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2761, in __iter__
    return self._execute_and_instances(context)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2774, in _execute_and_instances
    close_with_result=True)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/query.py", line 2765, in _connection_from_session
    **kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 893, in connection
    execution_options=execution_options)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/orm/session.py", line 900, in _connection_for_bind
    conn = engine.contextual_connect(**kw)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 2041, in contextual_connect
    **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 92, in __init__
    self.dispatch.engine_connect(sel...


Zane Bitter (zaneb) wrote :

I suspect the solution is to explicitly use transactions in all of the database updates. That way, if an exception occurs due to the thread being killed in the middle of a transaction then the transaction will be rolled back, and there's no risk of attempting to do something else to the DB in the middle of an implicit transaction.

An alternative would be to try to catch exceptions and roll back at a higher level, but that makes it much easier to miss some cases.
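The explicit-transaction pattern suggested above can be sketched with the stdlib sqlite3 module (an illustration only, not Heat's oslo.db session handling): if the thread dies mid-transaction, the transaction context rolls everything back, so no half-written state reaches the database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stack (id INTEGER PRIMARY KEY, status TEXT)")
conn.execute("INSERT INTO stack VALUES (1, 'CREATE_IN_PROGRESS')")
conn.commit()

# Explicit transaction: any exception raised mid-update rolls the whole
# transaction back, so a killed thread never leaves a partial write behind.
try:
    with conn:  # sqlite3 connection as context manager == one transaction
        conn.execute("UPDATE stack SET status = 'DELETE_IN_PROGRESS' WHERE id = 1")
        raise RuntimeError("thread killed mid-transaction")  # simulated GreenletExit
except RuntimeError:
    pass

status = conn.execute("SELECT status FROM stack WHERE id = 1").fetchone()[0]
print(status)  # CREATE_IN_PROGRESS -- the partial update was rolled back
```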

Rabi Mishra (rabi) wrote :

We probably already use explicit transactions for all updates, including making them subtransactions when required, e.g. [1].

I tried the reproducer and hit a different issue which moved the stack to DELETE_FAILED.

2016-06-22 07:36:42.966 13243 INFO heat.engine.stack [req-e1cb88b8-6b9d-4d0f-a473-14ea0e35a675 - demo - default default] Stack DELETE FAILED (test_stack-rg-zk273ddlyrxh): Resource DELETE failed: TimeoutError: resources[43]: QueuePool limit of size 5 overflow 64 reached, connection timed out, timeout 30

[1] https://github.com/openstack/heat/blob/master/heat/db/sqlalchemy/api.py#L272

Evgeny Sikachev (esikachev) wrote :

Looks like this also affects Sahara on master.

Luigi Toscano (ltoscano) wrote :

I concur; the mitaka branch is affected as well.

Changed in heat:
assignee: nobody → Jason Dunsmore (jasondunsmore)
status: Triaged → In Progress
Zane Bitter (zaneb) wrote :

Just noticed that bug 1546431 is the same issue. Jason, what do you think about closing this as a duplicate and transferring the work to that one?

Thomas Herve (therve) wrote :

I've run a good number of tests with the given reproducer, and things are looking good on master. We still get the tracebacks, but we seem to be able to recover nicely.

I've traced it down a little bit, and it seems that https://review.openstack.org/#/c/332963/ is the one improving the situation quite a bit.

Zane Bitter (zaneb) wrote :

That's probably just covering up the issue. If we're trying to write to the database it's because there's something important we want to record, and if the write is failing then we're losing some data that we needed to keep. The Logstash query shows that's still happening plenty: http://logstash.openstack.org/#/dashboard/file/logstash.json?query=message:%5C%22Command%20Out%20of%20Sync%5C%22%20tags:%5C%22screen-h-eng%5C%22

It's nice if we don't completely hose the stack in the process though :) We could consider backporting that patch as a mitigation.

Changed in heat:
importance: Medium → High
milestone: ongoing → newton-rc1

Fix proposed to branch: master
Review: https://review.openstack.org/369827

Changed in heat:
assignee: Jason Dunsmore (jasondunsmore) → Steve Baker (steve-stevebaker)
Zane Bitter (zaneb) on 2016-09-14
tags: added: gate-failure
Thomas Herve (therve) on 2016-09-16
Changed in heat:
milestone: newton-rc1 → newton-rc2

Change abandoned by Jason Dunsmore (<email address hidden>) on branch: master
Review: https://review.openstack.org/291931
Reason: Abandoning in favor of https://review.openstack.org/#/c/369827/

Reviewed: https://review.openstack.org/369827
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3000f904080d8dcd841d913dcd2ae658fb526c1a
Submitter: Jenkins
Branch: master

commit 3000f904080d8dcd841d913dcd2ae658fb526c1a
Author: Steve Baker <email address hidden>
Date: Fri Sep 16 03:29:59 2016 +0000

    Legacy delete attempt thread cancel before stop

    The error messages 'Command Out of Sync' are due to the threads being
    stopped in the middle of the database operations. This happens in the
    legacy action when delete is requested during a stack create.

    We have the thread cancel message but that was not being used in this
    case. Thread cancel should provide a more graceful way of ensuring the
    stack is in a FAILED state before the delete is attempted.

    This change does the following in the delete_stack service method for
    the legacy engine:
    - if the stack is still locked, send thread cancel message
    - in a subthread wait for the lock to be released, or until a
      timeout based on the 4 minute cancel grace period
    - if the stack is still locked, do a thread stop as before

    Closes-Bug: #1499669
    Closes-Bug: #1546431
    Closes-Bug: #1536451
    Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0
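
The cancel-then-stop sequence described in the commit message can be sketched with stdlib threading primitives (hypothetical names; this is not Heat's actual thread-group API):

```python
import threading
import time

# Step 1: send a cancel message the worker can honour gracefully.
# Step 2: wait up to a grace period for the stack lock to be released.
# Step 3: only if the worker ignored the cancel, fall back to a hard stop.
cancel_requested = threading.Event()
lock_released = threading.Event()

def worker():
    while not cancel_requested.is_set():
        time.sleep(0.01)  # simulated steps of the stack operation
    lock_released.set()   # graceful exit: release the stack lock

t = threading.Thread(target=worker)
t.start()

cancel_requested.set()                      # step 1: thread cancel message
graceful = lock_released.wait(timeout=4.0)  # step 2: wait for the lock
if not graceful:
    pass  # step 3 would be the hard thread stop, as before
t.join()
print("graceful" if graceful else "forced")
```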

Changed in heat:
status: In Progress → Fix Released
Zane Bitter (zaneb) on 2016-09-20
no longer affects: heat/newton

Reviewed: https://review.openstack.org/373518
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=2dd44db1b9cf4b789d8a083df6f97ae1fb5e22d5
Submitter: Jenkins
Branch: stable/newton

commit 2dd44db1b9cf4b789d8a083df6f97ae1fb5e22d5
Author: Steve Baker <email address hidden>
Date: Fri Sep 16 03:29:59 2016 +0000

    Legacy delete attempt thread cancel before stop

    The error messages 'Command Out of Sync' are due to the threads being
    stopped in the middle of the database operations. This happens in the
    legacy action when delete is requested during a stack create.

    We have the thread cancel message but that was not being used in this
    case. Thread cancel should provide a more graceful way of ensuring the
    stack is in a FAILED state before the delete is attempted.

    This change does the following in the delete_stack service method for
    the legacy engine:
    - if the stack is still locked, send thread cancel message
    - in a subthread wait for the lock to be released, or until a
      timeout based on the 4 minute cancel grace period
    - if the stack is still locked, do a thread stop as before

    Closes-Bug: #1499669
    Closes-Bug: #1546431
    Closes-Bug: #1536451
    Change-Id: I4cd613681f07d295955c4d8a06505d72d83728a0
    (cherry picked from commit 3000f904080d8dcd841d913dcd2ae658fb526c1a)

tags: added: in-stable-newton
Zane Bitter (zaneb) wrote :

The database errors turn out to be a sqlalchemy issue: https://bitbucket.org/zzzeek/sqlalchemy/issues/3803/dbapi-connections-go-invalid-on

Surprisingly enough though, those aren't actually the cause of the problem here. For the most part we deal with an error writing to the DB quite gracefully. The reason the root stack is "hanging" IN_PROGRESS (it's not really hanging; it will eventually time out normally) is that the child stack (the ResourceGroup) doesn't start deleting after we've cancelled its update. And the reason it doesn't start deleting is because we don't wait long enough for the running threads to be stopped before we give up and don't bother starting the delete.

The length of time we wait is configurable as engine_life_check_timeout. The default is 2s - it turns out that it takes at least 4-5s to cancel a stack of this size. A user could work around this problem by increasing the engine_life_check_timeout, however it's probably just inappropriate for us to be using this value (I think it happened in a historical accident).
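
The workaround mentioned above is a one-line change in heat.conf; the value shown here is illustrative:

```ini
[DEFAULT]
# Default is 2 (seconds); a large stack can need several seconds to cancel
# all running threads before the delete can start.
engine_life_check_timeout = 10
```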

We're much less likely to encounter this issue now that https://review.openstack.org/369827 has merged, but a fix would not only benefit master but also be easily backportable to earlier stable branches.

Zane Bitter (zaneb) wrote :

One mystery remains: when the cancellation times out we raise the exception StopActionFailed, and prior to https://review.openstack.org/369827 this should have resulted in the parent stack getting an exception when calling stack_delete (it's now asynchronous, but previously was raised synchronously), and in turn the ResourceGroup resource in the parent stack being marked DELETE_FAILED.

The reason is that we're using cast() instead of call(), which means that we never see the response. This goes back to the original implementation of the RPC client in July 2012 (https://review.openstack.org/#/c/10614/):

    delete_stack seems to be the only method which returns nothing, so it can be
    invoked as cast or call, with cast being the default.

That wasn't true even at the time: while it did not return anything, it very much could raise exceptions, and the caller almost always needs to know about them. Both heat-cfn-api and heat-api have always passed cast=False, but we forgot to do the same here.
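
The difference can be modelled with a toy dispatcher (not oslo.messaging's real API): a cast-style send discards the result, so the caller never sees a server-side exception, while a call-style send waits for and receives it.

```python
import queue
import threading

# Toy RPC server: processes (func, reply_q) tasks. With a reply queue
# (call) the outcome, including any exception, goes back to the caller;
# without one (cast) an exception is silently lost.
def server(task_q):
    while True:
        func, reply_q = task_q.get()
        if func is None:
            return  # shutdown sentinel
        try:
            result = (func(), None)
        except Exception as exc:
            result = (None, exc)
        if reply_q is not None:  # call(): the caller is waiting
            reply_q.put(result)
        # cast(): reply_q is None, the failure vanishes

tasks = queue.Queue()
threading.Thread(target=server, args=(tasks,), daemon=True).start()

def delete_stack():
    raise RuntimeError("StopActionFailed")

# cast: no reply queue, so the caller never learns the delete failed
tasks.put((delete_stack, None))

# call: reply queue, so the exception propagates back to the caller
reply = queue.Queue()
tasks.put((delete_stack, reply))
_, exc = reply.get()
print(type(exc).__name__)  # RuntimeError
tasks.put((None, None))  # stop the server
```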

Fix proposed to branch: master
Review: https://review.openstack.org/374443

Reviewed: https://review.openstack.org/374442
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e5cec71e52c3fed0ffb4385990758db8ebf367da
Submitter: Jenkins
Branch: master

commit e5cec71e52c3fed0ffb4385990758db8ebf367da
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669

Reviewed: https://review.openstack.org/374844
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=a96d89b2005dcef2f06b6ae55260fdb0f358abab
Submitter: Jenkins
Branch: stable/newton

commit a96d89b2005dcef2f06b6ae55260fdb0f358abab
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669
    (cherry picked from commit e5cec71e52c3fed0ffb4385990758db8ebf367da)

Reviewed: https://review.openstack.org/374443
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e56fc689e19d92b5a3d23736d472c9a1fc698537
Submitter: Jenkins
Branch: master

commit e56fc689e19d92b5a3d23736d472c9a1fc698537
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669

Reviewed: https://review.openstack.org/375469
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ee86435e44065f9d59425023b1b8220826a6e0f2
Submitter: Jenkins
Branch: stable/newton

commit ee86435e44065f9d59425023b1b8220826a6e0f2
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669
    (cherry picked from commit e56fc689e19d92b5a3d23736d472c9a1fc698537)

This issue was fixed in the openstack/heat 7.0.0.0rc2 release candidate.

Reviewed: https://review.openstack.org/374847
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=9a2c7ecf34d0fdacad11ce484c96ae022c7b2062
Submitter: Jenkins
Branch: stable/liberty

commit 9a2c7ecf34d0fdacad11ce484c96ae022c7b2062
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669
    (cherry picked from commit e5cec71e52c3fed0ffb4385990758db8ebf367da)

tags: added: in-stable-liberty

Reviewed: https://review.openstack.org/375473
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=cd4ab44929447f40f63cd75cdb760dcc5b9ae92e
Submitter: Jenkins
Branch: stable/liberty

commit cd4ab44929447f40f63cd75cdb760dcc5b9ae92e
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669
    (cherry picked from commit e56fc689e19d92b5a3d23736d472c9a1fc698537)

Reviewed: https://review.openstack.org/374846
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=adae45d76268eb57bf94600a404aa56d9769ca9c
Submitter: Jenkins
Branch: stable/mitaka

commit adae45d76268eb57bf94600a404aa56d9769ca9c
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 18:37:04 2016 -0400

    Don't use cast() to do StackResource delete

    If an exception was raised in delete_stack when deleting a nested stack,
    the parent stack would never hear about it because we were accidentally
    using cast() instead of call() to do the stack delete. This meant the
    parent resource would remain DELETE_IN_PROGRESS until timeout when the
    nested stack had already failed and raised an exception.

    In the case of bug 1499669, the exception being missed was
    StopActionFailed.

    Change-Id: I039eb8f6c6a262653c1e9edc8173e5680d81e31b
    Partial-Bug: #1499669
    (cherry picked from commit e5cec71e52c3fed0ffb4385990758db8ebf367da)

tags: added: in-stable-mitaka

Reviewed: https://review.openstack.org/375472
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=33d2395bfa0ed6b9305e8cc231e66b81e1887ef0
Submitter: Jenkins
Branch: stable/mitaka

commit 33d2395bfa0ed6b9305e8cc231e66b81e1887ef0
Author: Zane Bitter <email address hidden>
Date: Thu Sep 22 09:44:56 2016 -0400

    Increase the timeout for the stop_stack message

    Previously, the stop_stack message accidentally used the
    engine_life_check_timeout (by default, 2s). But unlike other messages sent
    using that timeout, stop_stack needs to synchronously kill all running
    threads operating on the stack. For a very large stack, this can easily
    take much longer than a couple of seconds. This patch increases the timeout
    to give a better chance of being able to start the delete.

    A functional test is added, but skipped when convergence is enabled.
    This is because cancelling in-progress operations upon delete was not
    supported under convergence in Mitaka (support was added in Newton). The
    bug fix affects only the legacy path anyway.

    Change-Id: I4b36ed7f1025b6439aeab63d71041bb2000363a0
    Closes-Bug: #1499669
    (cherry picked from commit e56fc689e19d92b5a3d23736d472c9a1fc698537)

This issue was fixed in the openstack/heat 5.0.3 release.

This issue was fixed in the openstack/heat 7.0.0 release.

This issue was fixed in the openstack/heat 8.0.0.0b1 development milestone.

This issue was fixed in the openstack/heat 6.1.1 release.

Reviewed: https://review.openstack.org/374444
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=73a886d806254b30067441a321bf816b0f624828
Submitter: Jenkins
Branch: master

commit 73a886d806254b30067441a321bf816b0f624828
Author: Zane Bitter <email address hidden>
Date: Wed Sep 21 19:13:02 2016 -0400

    RPC Client: don't cast() delete_stack by default

    The delete_stack() RPC call in the client can be sent using either call()
    or cast(), with cast() the default. This is never what you want, because
    the call could raise an exception and you want to hear about that.

    We're now passing cast=False explicitly everywhere. We always were in
    heat-cfn-api and heat-api, but failing to do so in StackResource caused bug
    1499669, which was corrected by I039eb8f6c6a262653c1e9edc8173e5680d81e31b.
    Changing the default to call() will prevent anyone making that same mistake
    again.

    Change-Id: Idd6a27988dadbf1cd8376de24b19f2226f6ae5b7
    Related-Bug: #1499669
