Say you have two resources:
resources:
  A:
    type: MyResourceType
  B:
    type: MyResourceType
    properties:
      a: {get_resource: A}
Now suppose that during a stack update resource A is replaced and resource B fails for whatever reason. At that point resource B depends on either the old A or the new A, but there's no way of knowing which. If we get it wrong, a subsequent update may try to remove a version of A that B still depends on.
The convergence prototype handled this with a two-step update - when resource B goes to UPDATE_IN_PROGRESS, any new requirements are added to its list but the old ones are not removed. Only when it goes UPDATE_COMPLETE are any out-of-date requirements cleared out:
https://github.com/zaneb/heat-convergence-prototype/commit/a7d59810ff11233e987610ba72d8c416b023ea9a
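To make the two-step bookkeeping concrete, here is a minimal sketch paraphrasing the commit above (the function names and data shapes are illustrative, not the prototype's actual API):

def update_started(stored_requirements, new_requirements):
    # Step 1 (UPDATE_IN_PROGRESS): merge rather than replace. If the
    # update later fails we can't tell which requirements still hold,
    # so keep the superset of old and new.
    return set(stored_requirements) | set(new_requirements)

def update_succeeded(new_requirements):
    # Step 2 (UPDATE_COMPLETE): the resource now definitely uses only
    # the new requirements, so out-of-date entries can be dropped.
    return set(new_requirements)

In the example stack, if B's requirement changes from the old A to the new A, a failed update leaves {old A, new A} stored, and only a successful one narrows it down to {new A}.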
However, as implemented, the requirements are only updated once, at the end of the resource update, unconditionally (i.e. regardless of whether the update succeeded or failed):
https://github.com/openstack/heat/blob/stable/ocata/heat/engine/resource.py#L1185-L1192
This means that, in the example above, if B had actually started using the second A and A is then replaced again in a subsequent update, Heat will try to remove the second A concurrently with replacing B - i.e. before the original B (which still depends on the second A) has been deleted.
Prior to the fix for bug 1495363, the dependencies were updated only on success. That suggests that simply implementing the design as described is likely to cause a recurrence of the hang in the test_stack_update_from_failed functional test. It appears that somehow the forward node dependencies were expecting all of the required nodes in the SyncPoint. The requirements list should probably only be used to generate the dependencies for the *reverse* (cleanup) nodes in the graph; the forward (update) nodes should get their dependencies exclusively from the new template. I haven't yet verified whether this was a fault in the original design - that's possible, since there is no way to fail a resource in the prototype.
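To illustrate the split I'm suggesting, here is a hedged sketch of where each half of the graph would source its edges; the helpers (references(), .requires, .id) are assumptions for illustration, not Heat's real API:

def forward_edges(new_template_defns):
    # Update (forward) nodes: dependencies come exclusively from the new
    # template, so stale requirements from a failed update can't leak in.
    return {name: set(defn.references())
            for name, defn in new_template_defns.items()}

def cleanup_edges(stored_resources):
    # Cleanup (reverse) nodes: dependencies come from the stored requires
    # lists, which deliberately over-approximate after a failure.
    return {res.id: set(res.requires) for res in stored_resources}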
OK, so I believe the reason the design wasn't carried through is an issue with UpdateReplace.
If the update succeeds then we want it to have the new requires, and if it fails we want it to have the union of the new and old requires (and in both cases we want to have the new current_template_id). The obvious way to implement this is to set the new template ID and set requires to the union of old and new when the resource goes UPDATE_IN_PROGRESS, and to the new set when it goes UPDATE_COMPLETE. This means we'll always have the right value, even if we don't e.g. correctly handle exceptions.
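A sketch of that obvious implementation (the write points are what matter here; the method names and store() helper are assumed for illustration, while requires and current_template_id correspond to the real fields):

def start_update(res, new_template_id, new_requires):
    # UPDATE_IN_PROGRESS: record the new template ID and widen requires
    # to the union of old and new, so the stored state stays a safe
    # superset even if a later exception goes unhandled.
    res.current_template_id = new_template_id
    res.requires = set(res.requires) | set(new_requires)
    res.store()

def finish_update(res, new_requires):
    # UPDATE_COMPLETE: narrow requires down to just the new set.
    res.requires = set(new_requires)
    res.store()

The point of writing the union up front is that a failure between the two writes leaves a superset of the true requirements, never a subset.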
However, we also have the UpdateReplace flow, and in *that* case, we *need* to have the *old* current_template_id and we'd probably prefer to have the old requires.
The problem is that UpdateReplace can be raised after the resource goes UPDATE_IN_PROGRESS, even though it's usually raised earlier (by _needs_update()) - a subtlety not captured in the prototype.
At the moment, we achieve this by writing nothing when the resource goes UPDATE_IN_PROGRESS, writing the new current_template_id and new requires when the resource goes UPDATE_COMPLETE, and doing a separate DB transaction to write the same data if an exception other than UpdateReplace occurs.
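Roughly, the current flow has this shape (heavily simplified, not a verbatim excerpt of resource.py):

class UpdateReplace(Exception):
    pass  # stands in for heat.engine.resource.UpdateReplace

def update(res, new_template_id, new_requires):
    res.state_set('UPDATE', 'IN_PROGRESS')  # nothing else written here
    try:
        res.handle_update()
    except UpdateReplace:
        raise  # keep the old current_template_id and old requires
    except Exception:
        # Separate DB transaction writing the same data as the success
        # path, so a failed resource still records the new template.
        res.store_update(new_template_id, new_requires)
        raise
    res.state_set('UPDATE', 'COMPLETE')
    res.store_update(new_template_id, new_requires)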
I don't think it'd be unreasonable to say that if _needs_update() raises UpdateReplace then we stick with the old requires, but if it gets raised after the resource is UPDATE_IN_PROGRESS then we get both the old and new requires.
It would be nice to find a better solution for the current_template_id too, but it seems like it's always going to rely on error handling after the fact in one case or the other (replace or failure).