Convergence: resource failure can result in missing dependencies

Bug #1663388 reported by Zane Bitter
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Heat
Fix Released
Medium
Zane Bitter

Bug Description

Say you have two resources:

resources:
  A:
    type: MyResourceType
  B:
    type: MyResourceType
    properties:
      a: {get_resource: A}

and during a stack update resource A is replaced and resource B fails for whatever reason. At that point resource B depends on either the old A or the new A, but there's no way of knowing which. If we get it wrong then a subsequent update may try to remove a version of A that B still depends on.

The convergence prototype handled this with a two-step update - when resource B goes to UPDATE_IN_PROGRESS, any new requirements are added to the list but the old ones are not removed. Only when it goes UPDATE_COMPLETE are any out of date requirements cleared out:

https://github.com/zaneb/heat-convergence-prototype/commit/a7d59810ff11233e987610ba72d8c416b023ea9a

However, as implemented, the requirements are only updated once, at the end of the resource update, unconditionally (i.e. regardless of whether the update succeeded or failed):

https://github.com/openstack/heat/blob/stable/ocata/heat/engine/resource.py#L1185-L1192

This means that in the example above, if B had actually started using the second A and e.g. in a subsequent update A is replaced again, Heat will try to remove the second A concurrently with replacing B, before the original B (which still depends on the second A) is deleted.

Prior to the fix for bug 1495363, the dependencies were updated only on success. That suggests that simply implementing the design is likely to cause a recurrence of the hang in the test_stack_update_from_failed fails functional test. It appears that somehow the forward node dependencies were expecting all of the required nodes in the SyncPoint. The requirements list should probably only be used to generate the dependencies for the *reverse* (cleanup) nodes in the graph. The forward (update) nodes should get their dependencies exclusively from the new template. I haven't yet verified whether this was a fault in the original design - that's possible, since there is no way to fail a resource in the prototype.

Zane Bitter (zaneb)
Changed in heat:
milestone: none → pike-1
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-1 → pike-2
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-2 → pike-3
Revision history for this message
Zane Bitter (zaneb) wrote :

OK, so I believe the reason that the design wasn't carried through is because of an issue with UpdateReplace.

If the update succeeds then we want it to have the new requires, and if it fails we want it to have the union of the new and old requires (and in both cases we want to have the new current_template_id). The obvious way to implement this is to set the new template ID and set requires to the union of old and new when the resource goes UPDATE_IN_PROGRESS, and to the new set when it goes UPDATE_COMPLETE. This means we'll always have the right value, even if we don't e.g. correctly handle exceptions.

However, we also have the UpdateReplace flow, and in *that* case, we *need* to have the *old* current_template_id and we'd probably prefer to have the old requires.
The problem is that UpdateReplace can be raised after the resource goes UPDATE_IN_PROGRESS, even though it's usually raised earlier (by _needs_update()) - a subtlety not captured in the prototype.

At the moment, we achieve this by writing nothing when the resource goes UPDATE_IN_PROGRESS, writing the new current_template_id and new requires when the resource goes UPDATE_COMPLETE, and doing a separate DB transaction to write the same data if an exception other than UpdateReplace occurs.

I don't think it'd be unreasonable to say that if _needs_update() raises UpdateReplace then we stick with the old requires, but if it gets raised after the resource is UPDATE_IN_PROGRESS then you get both the old and new requires.

It would be nice to find a better solution for the current_template_id too, but it seems like it's always going to rely on error handling after the fact in one case or the other (replace or failure).

Revision history for this message
Zane Bitter (zaneb) wrote :

It's also worth noting that if no update is required we want to end up in the same state as we would from a successful update.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/486267

Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: New → In Progress
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-3 → pike-rc1
Rabi Mishra (rabi)
Changed in heat:
milestone: pike-rc1 → pike-rc2
Rico Lin (rico-lin)
Changed in heat:
milestone: pike-rc2 → queens-1
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-1 → queens-2
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-2 → queens-3
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-3 → queens-rc1
Rico Lin (rico-lin)
Changed in heat:
milestone: queens-rc1 → rocky-1
Rico Lin (rico-lin)
Changed in heat:
milestone: rocky-1 → rocky-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/486267
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=e649574d4751ffd8578f63787621c3a26d383a34
Submitter: Zuul
Branch: master

commit e649574d4751ffd8578f63787621c3a26d383a34
Author: Zane Bitter <email address hidden>
Date: Fri Jul 21 21:32:45 2017 -0400

    Merge before/after 'requires' list on update failure

    If an update of a resource fails, its 'requires' should be set to a union
    of the previous and new requires lists. This is because if the resource
    depends on a resource that has been replaced in this stack update, we can't
    know if the current resource now depends on the new or old version of the
    replaced resource if the current resource failed.

    This is achieved by splitting up the setting of 'requires' and
    'current_template_id', and changing them directly in the update() method
    instead of via a callback.

    When the resource state is changed to UPDATE_IN_PROGRESS, the new
    requirements are added to the old ones. When the state is changed to
    UPDATE_COMPLETE, the new requirements replace the old ones altogether. If
    the update fails or handle_update() raises UpdateReplace, the union of the
    requirements is kept. If _needs_update() raises UpdateReplace, the old
    requirements are kept.

    The current_template_id is updated when the state is changed to either
    UPDATE_COMPLETE or UPDATE_FAILED, or when no update is required
    (_needs_update() returns False).

    This also saves an extra database write when the update fails.

    Change-Id: If70d457fba5c64611173e3f9a0ae6b155ec69e06
    Closes-Bug: #1663388

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 11.0.0.0b2

This issue was fixed in the openstack/heat 11.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/740940

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/queens)

Reviewed: https://review.opendev.org/740940
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=da991316e4ac4813797b96c129ba35f6ddd0a043
Submitter: Zuul
Branch: stable/queens

commit da991316e4ac4813797b96c129ba35f6ddd0a043
Author: Zane Bitter <email address hidden>
Date: Fri Jul 21 21:32:45 2017 -0400

    Merge before/after 'requires' list on update failure

    If an update of a resource fails, its 'requires' should be set to a union
    of the previous and new requires lists. This is because if the resource
    depends on a resource that has been replaced in this stack update, we can't
    know if the current resource now depends on the new or old version of the
    replaced resource if the current resource failed.

    This is achieved by splitting up the setting of 'requires' and
    'current_template_id', and changing them directly in the update() method
    instead of via a callback.

    When the resource state is changed to UPDATE_IN_PROGRESS, the new
    requirements are added to the old ones. When the state is changed to
    UPDATE_COMPLETE, the new requirements replace the old ones altogether. If
    the update fails or handle_update() raises UpdateReplace, the union of the
    requirements is kept. If _needs_update() raises UpdateReplace, the old
    requirements are kept.

    The current_template_id is updated when the state is changed to either
    UPDATE_COMPLETE or UPDATE_FAILED, or when no update is required
    (_needs_update() returns False).

    This also saves an extra database write when the update fails.

    Change-Id: If70d457fba5c64611173e3f9a0ae6b155ec69e06
    Closes-Bug: #1663388
    (cherry picked from commit e649574d4751ffd8578f63787621c3a26d383a34)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat queens-eol

This issue was fixed in the openstack/heat queens-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.