Overcloud update is broken: UPDATE_FAILED NotFound_Remote: resources[0]: Software config with id not found

Bug #1616550 reported by Emilien Macchi
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
OpenStack Heat   Fix Released   High         Zane Bitter

Bug Description

When testing an overcloud update, Heat fails with this error:

2016-08-24 14:40:38.806928 | 2016-08-24 14:40:34 [0]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:40:38.806984 | 2016-08-24 14:40:34 [0]: UPDATE_FAILED NotFound_Remote: resources[0]: Software config with id 91df76f0-ec0d-4d6c-8ad9-d27ff6a17634 not found
2016-08-24 14:40:38.807010 | Traceback (most recent call last):
2016-08-24 14:40:38.807025 |
2016-08-24 14:40:38.807064 | File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 424, in wrapped
2016-08-24 14:40:38.807087 | return func(self, ctx, *ar
2016-08-24 14:40:38.807157 | 2016-08-24 14:40:34 [overcloud-ControllerAllNodesDeployment-4dljqiv2j6cj]: UPDATE_FAILED NotFound_Remote: resources[0]: Software config with id 91df76f0-ec0d-4d6c-8ad9-d27ff6a17634 not found
2016-08-24 14:40:38.807202 | Traceback (most recent call last):
2016-08-24 14:40:38.807219 |
2016-08-24 14:40:38.807257 | File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 424, in wrapped
2016-08-24 14:40:38.807280 | return func(self, ctx, *ar

Full trace:
http://logs.openstack.org/30/351330/10/experimental/gate-tripleo-ci-centos-7-nonha-multinode-updates-nv/e5bc760/console.html#_2016-08-24_14_40_38_806984

Tags: update-bugs
Changed in tripleo:
importance: Undecided → High
status: New → Confirmed
milestone: none → newton-3
tags: added: update-bugs
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/360015

Revision history for this message
Zane Bitter (zaneb) wrote :

It's possible that this was caused by the following change in Heat, which landed at around the same time:

https://review.openstack.org/#/c/352605/

as this loads the *previous* config as well as the new one during an update of the deployment.

That said, if the config changes then it should be handled using the UpdateReplace workflow, so anything that relies on the old one should have it still available during its own update (this is the whole point of the UpdateReplace workflow). So I'm not at all sure that this is the cause.

Revision history for this message
Zane Bitter (zaneb) wrote :

Oh, I found the problem. The software config is actually buried inside a nested stack, which prevents the UpdateReplace workflow from operating correctly. (Well, it works, but only inside the nested stack, so the delete phase completes before the update of the parent resource, and therefore by the time anything depending on it in the parent stack updates, it's gone.)
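
The ordering problem described above can be sketched with a toy model. This is not Heat code: ConfigStore, update_deployment and the two update functions are hypothetical names, and the only behaviour assumed is what this comment states, namely that the deployment loads the previous config as well as the new one, and that a nested stack finishes its own delete phase before the parent stack updates its dependents.

```python
# Toy model only: not Heat code. All names here are hypothetical.

class ConfigStore:
    """Stands in for wherever software configs are stored."""

    def __init__(self):
        self.configs = {}

    def create(self, config_id, body):
        self.configs[config_id] = body

    def delete(self, config_id):
        del self.configs[config_id]

    def get(self, config_id):
        if config_id not in self.configs:
            raise LookupError(
                "Software config with id %s not found" % config_id)
        return self.configs[config_id]


def update_deployment(store, old_id, new_id):
    # The deployment loads the *previous* config as well as the new one.
    return store.get(old_id), store.get(new_id)


def flat_stack_update(store):
    """Config and deployment live in the same stack."""
    store.create("new", "v2")                        # create the replacement
    result = update_deployment(store, "old", "new")  # dependents update
    store.delete("old")                              # cleanup runs last
    return result


def nested_stack_update(store):
    """Config lives in a nested stack; the deployment is in the parent."""
    store.create("new", "v2")   # nested stack creates the replacement
    store.delete("old")         # nested stack cleans up before the parent continues
    return update_deployment(store, "old", "new")    # old config is already gone
```

In the flat case the old config is still present when the deployment updates; in the nested case the lookup of the old config fails, which matches the shape of the "not found" error in the bug description.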

This is a known limitation of Heat that we've discussed fixing, most recently at https://etherpad.openstack.org/p/mitaka-heat-break-stack-barrier but never had time to make a start on.

TripleO makes extensive use of nested stacks as an alternative to template generation for customising deployments, rather than just to create logical groupings of resources, which leaves it extremely vulnerable to this kind of problem. The Heat workflow is shattered into a thousand tiny islands so there is no global view of dependencies, and an otherwise legitimate change in any plugin could violate assumptions that TripleO makes about Heat's model that aren't actually guaranteed by Heat.

Since TripleO is unlikely to mend its ways, I don't see a choice here other than to try to reimplement https://review.openstack.org/#/c/352605/ in a less efficient way that is still usable in this situation.

Zane Bitter (zaneb)
Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
importance: Undecided → High
status: New → Triaged
status: Triaged → In Progress
milestone: none → newton-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/360122

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on heat (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.openstack.org/360015

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/360122
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=bbb1dfb06e074dd6da285723393cea7cf419f9b7
Submitter: Jenkins
Branch: master

commit bbb1dfb06e074dd6da285723393cea7cf419f9b7
Author: Zane Bitter <email address hidden>
Date: Wed Aug 24 18:48:11 2016 -0400

    Fix SoftwareDeployment when dealing with deleted configs

    When a SoftwareConfig has been updated, we can't rely on the previous
    one not having been deleted because TripleO has a habit of putting all
    their SoftwareConfigs inside a nested stack, breaking the UpdateReplace
    workflow by splitting the dependency graph.

    Instead, load the previous derived config from the Software Deployment
    itself.

    Change-Id: I9a399a676836be3106268c3640c5edb0c6d8472c
    Closes-Bug: #1616550
    Related-Bug: #1595040
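
The idea in this commit message can be illustrated with a minimal sketch, assuming only what the message says: the previous state is read from the deployment itself rather than from the old config. The Deployment class, its derived_config field and the dict used as a config store are hypothetical and do not mirror Heat's actual classes.

```python
# Toy model only; names are hypothetical and do not mirror Heat's classes.

class Deployment:
    def __init__(self, config_id, config_body, input_values):
        # Snapshot of what was actually deployed, kept with the deployment.
        self.derived_config = {
            "source_config_id": config_id,
            "config": config_body,
            "inputs": dict(input_values),
        }

    def update(self, store, new_config_id, input_values):
        # The previous state comes from the deployment's own snapshot, so it
        # does not matter whether the old source config still exists.
        previous = self.derived_config
        self.derived_config = {
            "source_config_id": new_config_id,
            "config": store[new_config_id],  # only the new config is looked up
            "inputs": dict(input_values),
        }
        return previous
```

With this shape, an update still succeeds when the store no longer contains the old config, because the old config is never looked up.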

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
Jiří Stránský (jistr) wrote :

I pulled in the latest Heat; it seems to fail on the same resources, but differently:

2016-08-25 15:01:48 [0]: UPDATE_FAILED TypeError: resources[0]: type object got multiple values for keyword argument 'value'
2016-08-25 15:01:48 [overcloud-ControllerAllNodesDeployment-ikqpgd5u6a3b]: UPDATE_FAILED TypeError: resources[0]: type object got multiple values for keyword argument 'value'
2016-08-25 15:01:48 [overcloud-ComputeAllNodesDeployment-ohgsrrpch2x3]: UPDATE_IN_PROGRESS Stack UPDATE started
2016-08-25 15:01:48 [BlockStorageAllNodesDeployment]: UPDATE_IN_PROGRESS state changed
2016-08-25 15:01:49 [0]: UPDATE_IN_PROGRESS state changed
2016-08-25 15:01:49 [CephStorageAllNodesDeployment]: UPDATE_IN_PROGRESS state changed
2016-08-25 15:01:49 [0]: UPDATE_FAILED TypeError: resources[0]: type object got multiple values for keyword argument 'value'
2016-08-25 15:01:49 [overcloud-ComputeAllNodesDeployment-ohgsrrpch2x3]: UPDATE_FAILED TypeError: resources[0]: type object got multiple values for keyword argument 'value'
2016-08-25 15:01:50 [ComputeAllNodesDeployment]: UPDATE_FAILED resources.ComputeAllNodesDeployment: TypeError: resources[0]: type object got multiple values for keyword argument 'value'
2016-08-25 15:01:50 [BlockStorageAllNodesDeployment]: UPDATE_FAILED UPDATE aborted
2016-08-25 15:01:50 [CephStorageAllNodesDeployment]: UPDATE_FAILED UPDATE aborted
2016-08-25 15:01:50 [ObjectStorageAllNodesDeployment]: UPDATE_FAILED UPDATE aborted
2016-08-25 15:01:50 [ControllerAllNodesDeployment]: UPDATE_FAILED UPDATE aborted
2016-08-25 15:01:50 [overcloud]: UPDATE_FAILED resources.ComputeAllNodesDeployment: TypeError: resources[0]: type object got multiple values for keyword argument 'value'
Stack overcloud UPDATE_FAILED
Heat Stack update failed.

Stack trace from heat-engine log:

2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource Traceback (most recent call last):
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 743, in _action_recorder
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource yield
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 1293, in update
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource prop_diff])
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource step = next(subtask)
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 792, in action_handler_task
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource done = check(handler_data)
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 396, in check_update_complete
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource if not checker.step():
2016-08-25 15:01:50.378 16453 ERROR heat.engine.resource File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/360831

Revision history for this message
Zane Bitter (zaneb) wrote :

Thanks. Our unit tests did not provide comprehensive enough mocking to cover that case, so I missed it. The second patch I just posted should fix it.

Revision history for this message
Steven Hardy (shardy) wrote :

Removing tripleo from this bug, as it seems to be confirmed as a heat issue. I just approved the second patch from zaneb, thanks!

Changed in tripleo:
milestone: newton-3 → none
no longer affects: tripleo
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/360831
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=8fcebfae3c2a9e86bffb8a66f8bc84fbf4237d22
Submitter: Jenkins
Branch: master

commit 8fcebfae3c2a9e86bffb8a66f8bc84fbf4237d22
Author: Zane Bitter <email address hidden>
Date: Thu Aug 25 20:08:10 2016 -0400

    Fix building derived inputs from a derived config

    The previous patch for this bug, bbb1dfb06e074dd6da285723393cea7cf419f9b7,
    changed from using the previous software config to calculate the
    previous inputs to using the previous derived config. Since the derived
    config includes values, this caused an error when trying to initialise
    the InputConfig objects in _build_derived_inputs(). That'd be easily
    fixed, but it's probably overkill anyway now that we have the entire
    derived config, so just use the stored values.

    Change-Id: I7424df9a564d63eb197da93b16409149a4f37fdb
    Closes-Bug: #1616550
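
One plausible way an error of this kind arises can be reproduced in a few lines. This is a sketch under assumptions: the InputConfig class and build_input function below are hypothetical stand-ins, not Heat's real InputConfig or _build_derived_inputs(). The only point illustrated is that passing value explicitly while also unpacking a dict that already contains a 'value' key raises a TypeError.

```python
# Illustration only: InputConfig is a hypothetical stand-in, not Heat's class.

class InputConfig(object):
    def __init__(self, name, type="String", value=None):
        self.name = name
        self.type = type
        self.value = value


def build_input(definition, value):
    # `value` is passed explicitly; if the definition dict also carries a
    # 'value' key, the call receives the keyword twice and raises TypeError.
    return InputConfig(value=value, **definition)


# An input definition as found in a plain config: no value attached.
plain_definition = {"name": "hostname", "type": "String"}
# An input definition as found in a derived config: the value is included.
derived_definition = {"name": "hostname", "type": "String", "value": "ctrl-0"}

ok = build_input(plain_definition, "ctrl-0")

try:
    build_input(derived_definition, "ctrl-1")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The exact wording of the message varies between Python versions, but it contains "multiple values for keyword argument 'value'", as in the log above.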

Revision history for this message
Jiří Stránský (jistr) wrote :

Thanks Zane! I can confirm that after adding the 2nd patch I no longer hit this problem.

(Heh, I hit an OOM on the overcloud instead, but anyway I got further than before, so this seems to be fixed.)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 7.0.0.0b3

This issue was fixed in the openstack/heat 7.0.0.0b3 development milestone.
