Node delete is broken

Bug #1749426 reported by Steven Hardy
This bug affects 1 person
Affects         Status        Importance  Assigned to      Milestone
OpenStack Heat  Fix Released  High        Zane Bitter
tripleo         Fix Released  High        Brad P. Crochet

Bug Description

I deployed a containerized overcloud on recent master and tried to do a node delete. It doesn't seem to work: the stack is not updated, and the workflow eventually times out waiting for the stack status:

(undercloud) [stack@undercloud ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks               |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| 578d57de-e603-4202-8994-b59b3ca5a824 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 8352f170-ebff-458a-a756-7d0c1b330281 | overcloud-novacompute-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud ~]$ openstack overcloud node delete 8352f170-ebff-458a-a756-7d0c1b330281
Deleting the following nodes from stack overcloud:
- 8352f170-ebff-458a-a756-7d0c1b330281
Started Mistral Workflow tripleo.scale.v1.delete_node. Execution ID: 2dc7b5be-5dc1-4cd4-ae36-241a9a81f385
Waiting for messages on queue 'tripleo' with no timeout.

{u'result': u'Failure caused by error in tasks: wait_for_stack_status\n\n wait_for_stack_status [task_ex_id=3f6728bc-dcd1-4502-9d78-b0af651e17b5] -> Task timed out [timeout(s)=600].\n'}
(undercloud) [stack@undercloud ~]$
(undercloud) [stack@undercloud ~]$ nova list
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| ID                                   | Name                    | Status | Task State | Power State | Networks               |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
| 578d57de-e603-4202-8994-b59b3ca5a824 | overcloud-controller-0  | ACTIVE | -          | Running     | ctlplane=192.168.24.6  |
| 8352f170-ebff-458a-a756-7d0c1b330281 | overcloud-novacompute-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
+--------------------------------------+-------------------------+--------+------------+-------------+------------------------+
(undercloud) [stack@undercloud ~]$ openstack stack list
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| ID                                   | Stack Name | Project                          | Stack Status    | Creation Time        | Updated Time |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+
| 61b7c1b5-6c26-45bf-9a77-e08f7de4fb9e | overcloud  | 23fc8df51d39499c85bf47aa71317761 | CREATE_COMPLETE | 2018-02-14T10:38:38Z | None         |
+--------------------------------------+------------+----------------------------------+-----------------+----------------------+--------------+

Steven Hardy (shardy)
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
milestone: none → queens-rc1
tags: added: workflows
Brad P. Crochet (brad-9) wrote :

Seeing this error:

2018-02-15 13:32:42.221 28439 ERROR oslo_messaging.rpc.server DBError: (pymysql.err.IntegrityError) (1048, u"Column 'resource_id' cannot be null") [SQL: u'INSERT INTO resource_data (created_at, updated_at, `key`, value, redact, decrypt_method, resource_id) VALUES (%(created_at)s, %(updated_at)s, %(key)s, %(value)s, %(redact)s, %(decrypt_method)s, %(resource_id)s)'] [parameters: {'resource_id': None, 'created_at': datetime.datetime(2018, 2, 15, 13, 32, 42, 219223), 'updated_at': None, 'value': u'1', 'redact': 0, 'decrypt_method': '', 'key': 'name_blacklist'}] (Background on this error at: http://sqlalche.me/e/gkpj)

Changed in tripleo:
assignee: nobody → Brad P. Crochet (brad-9)
Brad P. Crochet (brad-9) wrote :

And this from the mistral/executor.log:

2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters [req-eaa77ed8-52d1-4b5f-9ccd-0b58dbb4ad09 470b7746a5e744bdac6331ca5a025ef8 85641a8dc0784a7d82cfd9967221f813 - default default] Error validating environment for plan overcloud: ERROR: Internal Error: HTTPInternalServerError: ERROR: Internal Error
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters Traceback (most recent call last):
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters File "/usr/lib/python2.7/site-packages/tripleo_common/actions/parameters.py", line 180, in run
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters 'heat_resource_tree': heat.stacks.validate(**fields),
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters File "/usr/lib/python2.7/site-packages/heatclient/v1/stacks.py", line 332, in validate
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters resp = self.client.post(url, **args)
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 289, in post
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters return self.client_request("POST", url, **kwargs)
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 279, in client_request
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters resp, body = self.json_request(method, url, **kwargs)
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 268, in json_request
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters resp = self._http_request(url, method, **kwargs)
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters File "/usr/lib/python2.7/site-packages/heatclient/common/http.py", line 231, in _http_request
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters raise exc.from_response(resp)
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters HTTPInternalServerError: ERROR: Internal Error
2018-02-15 14:08:05.413 32149 ERROR tripleo_common.actions.parameters

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/544999

Changed in tripleo:
status: Triaged → In Progress
Thomas Herve (therve)
Changed in heat:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Thomas Herve (therve)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/545056

Thomas Herve (therve) wrote :

https://review.openstack.org/545056 should fix the issue in Heat.

Rabi Mishra (rabi) wrote :

I think node delete was broken by this tripleo-common change [1], where we started doing stack validation with 'show_nested=True', which does a resource preview [2].

But yeah, the blacklist for RG (ResourceGroup) has all along been updated during stack/update preview in Heat, which is incorrect, and that is not related to the change [3] that added 'removal_policies_mode' for RG.

[1] https://github.com/openstack/tripleo-common/commit/a0e84ef7a78d9ff95c345253035967386b2e2b97#diff-af3979e2051542273417a78b796164c4R175

[2] https://github.com/openstack/heat/blob/master/heat/engine/service.py#L1240-L1244

[3] https://github.com/openstack/heat/commit/7a42ec8657f2524265dc78f011f28236309975eb#diff-60be04c9cc608aa5906c801287e70d68

Thomas Herve (therve)
Changed in heat:
milestone: none → queens-rc2
Thomas Herve (therve) wrote :

Should be fixed with https://review.openstack.org/#/c/531931/ merged.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Thomas Herve (<email address hidden>) on branch: master
Review: https://review.openstack.org/545056

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/545360

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/queens)

Reviewed: https://review.openstack.org/545360
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=cbcc15b4046d456032441105941a4a4ea93b48e6
Submitter: Zuul
Branch: stable/queens

commit cbcc15b4046d456032441105941a4a4ea93b48e6
Author: Zane Bitter <email address hidden>
Date: Mon Jan 8 17:23:12 2018 -0500

    Don't load nested stack to get ResourceGroup blacklist

    In ResourceGroup, we (unwisely in retrospect) allow blacklisting of
    resources by reference ID (i.e. physical resource ID in most cases). In
    order to both determine whether a removal policy entry was an existing
    resource name and (if not) to find the resource by physical ID, we would
    always load the nested stack into memory.

    To avoid this, unconditionally create an output in the nested stack that
    returns a mapping of all the resource reference IDs and use it for finding
    a resource with a matching reference ID. Only if the output is not
    available (because the nested stack hasn't been updated since prior to the
    output being defined), fall back to the old way of doing it.

    Avoid looking up IDs for names that already appear in the blacklist
    (because they were blacklisted in a previous update and have been removed
    from the stack.)

    Since this is more expensive in time (though hopefully cheaper in space),
    update the blacklist only once on resource creation and whenever it
    actually changes during an update. This is accomplished by moving the
    update to a separate function that is called explicitly, rather than making
    it a side-effect of getting the current blacklist (which occurs up to 4
    times in each create/update).

    This also avoids updating the blacklist during a preview, which the
    previous code erroneously did.

    Change-Id: Ic32d7d99bad06f92b3d2919d58cd1f9128bcfe83
    Partial-Bug: #1731349
    Closes-Bug: #1749426
    (cherry picked from commit 32ec5141de1a8017ccfddc5dccf114deff8271ba)
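
Editorial illustration (not part of the commit): the last two paragraphs of the commit message describe moving the blacklist update out of the getter and into a function that is called explicitly, so that read-only paths such as a preview never write. The sketch below shows that pattern in isolation. It is not Heat's code, and every name in it (`ResourceGroupSketch`, `current_blacklist`, `update_blacklist`, `preview`, `ref_ids`, `writes`) is hypothetical; `ref_ids` merely stands in for the nested-stack output of reference IDs that the commit describes.

```python
class ResourceGroupSketch:
    """Toy stand-in for a resource group; hypothetical, not Heat's class."""

    def __init__(self, removal_policies, ref_ids):
        # removal_policies: entries given as member names or reference IDs.
        # ref_ids: member name -> reference (physical) ID; stands in for the
        # nested-stack output mentioned in the commit message.
        self.removal_policies = list(removal_policies)
        self.ref_ids = dict(ref_ids)
        self.stored_blacklist = []  # stands in for persisted resource data
        self.writes = 0             # counts simulated database writes

    def current_blacklist(self):
        """Read-only getter: returns what is stored and never writes."""
        return list(self.stored_blacklist)

    def update_blacklist(self):
        """Resolve removal policies to member names; persist only on change."""
        name_by_ref = {ref: name for name, ref in self.ref_ids.items()}
        names = set(self.stored_blacklist)
        for entry in self.removal_policies:
            if entry in names:
                continue  # already blacklisted, no lookup needed
            if entry in self.ref_ids:
                names.add(entry)               # entry is a member name
            elif entry in name_by_ref:
                names.add(name_by_ref[entry])  # entry is a reference ID
        new_blacklist = sorted(names)
        if new_blacklist != self.stored_blacklist:
            self.stored_blacklist = new_blacklist
            self.writes += 1
        return new_blacklist

    def preview(self):
        """Members that would remain; only reads the stored blacklist."""
        blacklisted = set(self.current_blacklist())
        return [name for name in sorted(self.ref_ids) if name not in blacklisted]
```

In this sketch, calling `preview()` before `update_blacklist()` leaves `writes` at 0, which is the property the original getter-with-side-effect design lacked.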

tags: added: in-stable-queens
Zane Bitter (zaneb)
Changed in heat:
assignee: Thomas Herve (therve) → Zane Bitter (zaneb)
status: In Progress → Fix Committed
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 10.0.0.0rc2

This issue was fixed in the openstack/heat 10.0.0.0rc2 release candidate.

Changed in tripleo:
milestone: queens-rc1 → rocky-1
Brad P. Crochet (brad-9) wrote :

Marking as Fix Released for tripleo

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/544999
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=71e30eebe06ce3b5733e62ff1397e9230c293d35
Submitter: Zuul
Branch: master

commit 71e30eebe06ce3b5733e62ff1397e9230c293d35
Author: Brad P. Crochet <email address hidden>
Date: Thu Feb 15 09:59:25 2018 -0500

    Return any errors from called actions in node delete

    The node delete action (ScaleDownAction) wasn't passing errors back
    from called actions. This results in the workflow continuing to look
    for a stack to go to INPROGRESS that never will.

    Change-Id: Ia43e0a06215ad19d4b49f1721026db7519220799
    Partial-Bug: #1749426
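
Editorial illustration (not part of the commit): the commit message says the node delete action did not pass errors back from the actions it called, so the workflow kept waiting for a stack status that would never arrive. A minimal sketch of the corrected pattern follows, assuming a simple result object with an error field. It is not tripleo-common's code: `Result`, `scale_down`, and the step callables are hypothetical names chosen for illustration.

```python
class Result:
    """Hypothetical action result carrying either data or an error."""

    def __init__(self, data=None, error=None):
        self.data = data
        self.error = error

    def is_error(self):
        return self.error is not None


def scale_down(steps):
    """Run each step in order and return the first error unchanged.

    Returning the error, rather than discarding it, lets the caller stop
    immediately instead of waiting for a stack update that never started.
    """
    for step in steps:
        result = step()
        if isinstance(result, Result) and result.is_error():
            return result
    return Result(data="stack update started")
```

With this shape, a caller can check `is_error()` on the returned value and fail fast, rather than polling until a timeout such as the 600 seconds seen in the bug description.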
