Comment 7 for bug 1389178

Steven Hardy (shardy) wrote : Re: heat stack-update failure when scaling resource group

Ladislav provided some good analysis via IRC; here's my understanding of it:

There's a global SoftwareDeployments resource which references servers in the ResourceGroup:

https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud-without-mergepy.yaml#L739

Because changing the "servers" list causes the resource to be replaced, we end up with a race on update where two *AllNodesDeployment resources exist at once (the original in UPDATE_IN_PROGRESS and a new replacement in CREATE_IN_PROGRESS), and signals don't necessarily hit the right resource.

I'm not quite sure whether all the members should signal on update or just those being added. Either way, it seems we shouldn't replace the Deployments resources on update if all that has changed is the servers list. Instead, we should compare the before and after lists and wait for the appropriate number of signals (how many is still to be confirmed, after testing and understanding the agent behavior better).
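
A rough sketch of that before/after comparison, in plain Python rather than actual Heat code. The function name and the signal_all flag are hypothetical, and it assumes "servers" is a map of server name to server ID; whether every member or only the added ones should signal is still the open question, so both options are shown:

```python
def servers_to_wait_for(old_servers, new_servers, signal_all=False):
    """Compare the old and new "servers" maps for an in-place update.

    Hypothetical helper, not part of Heat. Both arguments are assumed
    to be dicts mapping server name -> server ID.
    """
    old_names = set(old_servers)
    new_names = set(new_servers)

    added = new_names - old_names      # members introduced by the scale-up
    removed = old_names - new_names    # members dropped by a scale-down

    # Open question from the comment: do all members signal on update,
    # or only the new ones? signal_all selects between the two guesses.
    wait_for = new_names if signal_all else added

    return {
        "added": sorted(added),
        "removed": sorted(removed),
        "wait_for": sorted(wait_for),
    }


if __name__ == "__main__":
    before = {"controller0": "id-a", "compute0": "id-b"}
    after = {"controller0": "id-a", "compute0": "id-b", "compute1": "id-c"}
    result = servers_to_wait_for(before, after)
    print(len(result["wait_for"]), "signal(s) expected from", result["wait_for"])
```

The point is only that the expected signal count falls out of a set difference on the two maps, so the existing resource could be updated in place instead of being replaced.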