Resource failure causes nested stacks to be rolled back
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
OpenStack Heat | Fix Released | High | Rico Lin | |
Kilo | Won't Fix | High | Unassigned | |

Bug Description
The fix for bug 1446252 was to issue an update_cancel RPC call for nested stacks whenever an update operation was cancelled in the parent stack. Unfortunately this always triggers a rollback. Previously (in Juno), if an update of a nested stack was cancelled then the disable_rollback flag was respected - which meant never rolling back, since the disable_rollback flag is always True for nested stacks (the parent stack will do an update with the previous template if it wants to roll back a change). The main downside to this is that it leaves the stack in a half-finished state, but that is after all what the user requested.
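To make the difference concrete, here is a minimal toy model of the two cancel behaviours described above. It is only a sketch: `ToyStack`, `cancel_update`, the `respect_flag` parameter and the state strings are hypothetical names chosen for illustration, not Heat's actual classes or RPC API.

```python
from dataclasses import dataclass


@dataclass
class ToyStack:
    """Hypothetical stand-in for a stack; not Heat's real Stack class."""
    name: str
    disable_rollback: bool = True  # per the description, always True for nested stacks
    state: str = "UPDATE_IN_PROGRESS"  # illustrative state name


def cancel_update(stack: ToyStack, respect_flag: bool) -> str:
    """Cancel an in-progress update and return the resulting toy state.

    respect_flag=True models the Juno behaviour (disable_rollback is honoured).
    respect_flag=False models the behaviour after the fix for bug 1446252
    (a cancel always triggers a rollback).
    """
    if stack.state != "UPDATE_IN_PROGRESS":
        return stack.state  # nothing to cancel
    if respect_flag and stack.disable_rollback:
        stack.state = "UPDATE_FAILED"  # left half-finished, no rollback
    else:
        stack.state = "ROLLBACK_IN_PROGRESS"
    return stack.state


juno = cancel_update(ToyStack("nested"), respect_flag=True)
liberty = cancel_update(ToyStack("nested"), respect_flag=False)
print(juno, liberty)
```

In this model the nested stack only ever rolls back when the flag is ignored, which is the regression being reported.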
Since the patch was, perhaps fortunately, accidentally left out of Kilo, the update there is neither cancelled nor rolled back (that is, bug 1446252 still exists). This sucks because once a resource fails, we have to wait for all stacks in the tree to finish on their own or time out before we can issue another top-level stack update.
In Liberty the patch has landed, so the nested stack will always be rolled back unless we fix it. This could be a big problem for e.g. TripleO, where it is not uncommon for an individual resource to fail and we really don't want to roll back any sibling stacks as a result.
The ideal solution, as usual, is phase 1 of convergence, since in that case there is no need to do anything to nested stacks except when the user requests a rollback - if the user issues a subsequent update it will be accepted with no danger of a locking error.
In the meantime, I suspect the best thing is a change to cancel any in-progress update but not roll back.
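A rough sketch of what "cancel but don't roll back" could look like across a tree of stacks, assuming a simplified in-memory model. `ToyNode`, `cancel_tree` and the state strings are hypothetical and exist only for this example; the real change would go through Heat's engine and RPC layer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyNode:
    """Hypothetical stack node; children stand in for nested stacks."""
    name: str
    state: str = "UPDATE_IN_PROGRESS"  # illustrative state name
    children: List["ToyNode"] = field(default_factory=list)


def cancel_tree(node: ToyNode) -> List[str]:
    """Stop every in-progress update in the tree without rolling back.

    Returns the names of the stacks whose update was cancelled.
    Stacks that already finished or failed are left untouched.
    """
    cancelled = []
    if node.state == "UPDATE_IN_PROGRESS":
        node.state = "UPDATE_FAILED"  # stopped where it was, no rollback
        cancelled.append(node.name)
    for child in node.children:
        cancelled.extend(cancel_tree(child))
    return cancelled


tree = ToyNode("parent", children=[
    ToyNode("failed-sibling", state="UPDATE_FAILED"),
    ToyNode("sibling-a"),
    ToyNode("sibling-b", state="UPDATE_COMPLETE"),
])
cancelled = cancel_tree(tree)
states = {n.name: n.state for n in [tree, *tree.children]}
print(cancelled, states)
```

The point of the sketch is that no stack ends up in a rollback state and nothing is left in progress, so a subsequent top-level update would not have to wait for siblings to finish or time out.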
Changed in heat:
assignee: nobody → Rico Lin (rico-lin)
Changed in heat:
status: New → Triaged
tags: added: kilo-backport-potential
Changed in heat:
status: Fix Committed → Fix Released
tags: removed: kilo-backport-potential
Changed in heat:
milestone: liberty-3 → 5.0.0
+1. I agree the desired behaviour is just to cancel any in-progress update (like when a stack timeout occurs) and let the operator retry, with bonus points for enabling the top-level rollback flag to be respected (but not rolling back by default).