Failed update attempts might not preserve in heat the actual status of a resource

Bug #1521944 reported by Giulio Fidente on 2015-12-02
This bug affects 1 person
Affects: OpenStack Heat
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

After an initial (and legitimate) update failure, a further update attempt (which should succeed) might instead fail, due to what seems to be a misrepresentation of the resources' status in heat.

Here is a sample workflow:

1. create a stack (neutron net), with a nested stack (neutron port)

2. try to update the stack with an invalid, different implementation of a resource, which would trigger a deletion

2bis. the deletion is supposed to fail, because the nested stack (neutron port) prevents neutron from deleting the parent (neutron net)

3. try to update the stack again with a valid implementation of the same resource

3bis. the update will keep failing, because heat tries to re-create the parent stack (neutron network), which was never actually deleted

Giulio Fidente (gfidente) wrote :

The attachment contains the templates and environments to be used to reproduce the issue as follows:

heat stack-create -e registry.yaml --template-file test.yaml test
heat stack-update -e registry-nooped.yaml --template-file test.yaml test
heat stack-update -e registry.yaml --template-file test.yaml test
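The registry files themselves are in the attachment rather than inlined. For readers following along, a minimal sketch of what they might contain (the nested-template file names, and the use of Heat's built-in OS::Heat::None no-op resource in the nooped variant, are assumptions, not taken from the attachment):

```yaml
# registry.yaml (sketch): maps the custom types to real nested-stack templates.
# net.yaml / port.yaml are assumed file names.
resource_registry:
  MY::Net: net.yaml
  MY::Port: port.yaml
---
# registry-nooped.yaml (sketch): swaps both types for a no-op implementation,
# which turns the second update into a stealth delete of the nested stacks.
resource_registry:
  MY::Net: OS::Heat::None
  MY::Port: OS::Heat::None
```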

summary: changed from "Failed update attempts might fail to preserve the heat representation of a resource status" to "Failed update attempts might fail to preserve in heat the actual status of a resource"
summary: changed from "Failed update attempts might fail to preserve in heat the actual status of a resource" to "Failed update attempts might not preserve in heat the actual status of a resource"
Steven Hardy (shardy) wrote :

Ok, here's my analysis of what is happening here:

1. We have a template which creates a Net/Subnet and a Port in two different nested stacks:

resources:
  network:
    type: MY::Net

  port:
    type: MY::Port
    depends_on: network

This works fine on create, as we create the network->subnet->port

2. We update the stack, switching MY::Net and MY::Port to a noop implementation

This is a stealth delete applied as an update, so we traverse the graph in the normal (forward) order, e.g.:

Updating MY::Net causes deletion of the subnet, then the port - which fails, because MY::Port is still in its previous state and still contains a port.

The example templates use depends_on instead of wiring the actual subnet ID from one nested stack to the other, but I think the same issue exists regardless. This is a kind of odd corner case where we're asking for an update, but really deleting things.
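To illustrate the alternative mentioned above, the dependency could be wired explicitly by passing the subnet ID from the network stack's outputs into the port stack, instead of a bare depends_on. The property and output names here are assumptions, not taken from the attached templates:

```yaml
resources:
  network:
    type: MY::Net

  port:
    type: MY::Port
    properties:
      # 'subnet_id' (a MY::Port parameter) and 'subnet' (a MY::Net output)
      # are assumed names; get_attr on a nested stack reads its outputs,
      # which also gives Heat an implicit port -> network dependency.
      subnet_id: { get_attr: [network, subnet] }
```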

A solution would be to ensure the network, subnet and port resources were always in the same heat template, as then we'd be able to properly manage the dependencies.
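For illustration, a single-template version along those lines might look like the following sketch (property names as in recent OS::Neutron resource types; the CIDR is arbitrary):

```yaml
# With all three resources in one template, Heat sees the full dependency
# graph (port -> subnet -> net) and can delete in the correct reverse order.
resources:
  net:
    type: OS::Neutron::Net

  subnet:
    type: OS::Neutron::Subnet
    properties:
      network: { get_resource: net }
      cidr: 192.168.0.0/24

  port:
    type: OS::Neutron::Port
    properties:
      network: { get_resource: net }
      fixed_ips:
        - subnet: { get_resource: subnet }
```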

In terms of workarounds and recovery, I'm not sure - the main thing would be a user-initiated way to force reverting to the backup stack, but we'd need a way to not do that in the normal rollback method (because that's just a stack update, which will have this same problem).

Giulio Fidente (gfidente) wrote :

Unfortunately, using --rollback on the first update triggers the same issue when actually rolling back, so it does not look like a viable option.

Zane Bitter (zaneb) wrote :

The reason Heat is not restoring the Subnet resource is because it's in the DELETE_FAILED state, and Heat has no way of knowing that it is actually unchanged and so must assume that it needs replacing.

Zane Bitter (zaneb) wrote :

Note that this is under control of the resource type, since it can override _needs_update():

http://git.openstack.org/cgit/openstack/heat/tree/heat/engine/resource.py?h=stable/kilo#n732

So while this is a challenge to fix in general, in this specific case of undeleted Subnet resources we could absolutely fix it to allow us to reuse the existing physical resource without attempting to replace it.

Steven Hardy (shardy) wrote :

Ok, so to follow up on this: from the ML thread and the discussion above, it's clear we can't recover from this via rollback, and we might consider special-casing neutron resources, since we know it's possible in this case to get stuck in a bad state with FAILED but otherwise working resources.

Giulio Fidente (gfidente) wrote :

So I wonder whether this is a valid bug where we want to change the behaviour of needs_update for certain resources, or whether we should close it, given that the current behaviour isn't really wrong. Ideas?

Al Bailey (albailey1974) wrote :

I think I am seeing a similar issue.
The key point is that a failed stack-update in which the "environment" was changed leads to an unrecoverable scenario and an undeletable stack. It becomes undeletable because the resource-registry entry no longer exists. I believe that is the main difference between my environment and the original stacks provided for this bug.

(I will attach my sample files as well.)

Step 1) heat stack-create -f Step1.yaml -e Step1.env STAK
-This creates a stack with a network and a subnet. The subnet is part of the custom resource.

Step 2) neutron port-create --name JUNK <uuid of network>
-This implicitly creates a port on the subnet, which will prevent us from being able to delete the subnet

Step 3) heat stack-update -f Step2.yaml STAK
- The update tries to delete the subnet (the ENV::SubNet custom resource) because Step2.yaml no longer references it. The update will fail because of the port created in Step 2, which leads to the real problem.

Step 4) neutron port-delete <UUID of port created in Step 2>
- We should now be able to delete the subnet, and our stack

Step 5) heat stack-delete STAK
ERROR: The Resource Type (ENV::SubNet) could not be found.
- I have not been able to determine a way to recover or remove this stack without going into the DB.
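The attached Step files aren't inlined here, but the failure hinges on the resource_registry entry disappearing between updates. A minimal sketch of the environment, with an assumed nested-template file name, would be:

```yaml
# Step1.env (sketch): ENV::SubNet maps to a nested template that creates
# the subnet. Step 3 updates without '-e Step1.env', so the mapping is gone
# when Heat later needs to delete the ENV::SubNet resource in Step 5.
resource_registry:
  ENV::SubNet: subnet.yaml
```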

Al Bailey (albailey1974) wrote :

Attaching an example where a failed stack-update that alters the resource_registry leads to an undeletable stack.

Rico Lin (rico-lin) on 2018-05-07
Changed in heat:
milestone: none → no-priority-tag-bugs