heat stack-update failure when scaling resource group due to deficient signal handling in o-r-c scripts

Bug #1389178 reported by Ladislav Smola on 2014-11-04
Affects          Status         Importance   Assigned to    Milestone
OpenStack Heat   Fix Released   High         Steven Hardy
Juno             Fix Released   High         Steven Hardy
tripleo          Fix Released   High         Steven Hardy

Bug Description

The update from 1 to 2 computes was successful in this case, but it failed for me in 80% of attempts.

A log of an unsuccessful update from 2 to 3 computes is attached as heat_debug_error.txt.
heat stack-update timed out after 60 minutes.

CLI paste
http://paste.openstack.org/show/129053/

This was an Instack heat deployment, using this https://github.com/agroup/instack-undercloud/blob/master/README-source.md

Ladislav Smola (lsmola) wrote :

This is a log of a successful update from 1 to 2 computes.

Ladislav Smola (lsmola) wrote :

When I run stack-update with the same parameters on the failed stack, it succeeds. So there is probably a signaling race that prevents the resource from completing in some situations.

Changed in heat:
status: New → Triaged
importance: Undecided → High
Steve Baker (steve-stevebaker) wrote :

When I get the chance I'll see if this can be replicated with a simple server/deployment inside a resource group

Changed in heat:
assignee: nobody → Steve Baker (steve-stevebaker)
Steven Hardy (shardy) wrote :

> When I get the chance I'll see if this can be replicated with a simple server/deployment inside a resource group

FWIW I've been doing a bit of testing with server+deployment inside ResourceGroup (using the cirros hook mostly), and I've not seen any such problem.

That said, the tripleo scenario is so much more complex it's quite possible there's something specific to those templates causing this.

Steven Hardy (shardy) on 2014-11-11
tags: added: tripleo
Ladislav Smola (lsmola) wrote :

OK, so I think I found the reason.

There is a race condition in signaling/resource creation.

When we run heat stack-update, there are two ComputeAllNodesDeployment events in quick sequence:
| ComputeAllNodesDeployment | 2607455f-2f34-4275-95dd-205e5ba3cd1b | state changed | CREATE_IN_PROGRESS | 2014-11-11T09:27:46Z |
| ComputeAllNodesDeployment | aea956f1-aa3a-4ce3-8ae4-7a4c841333f2 | state changed | UPDATE_IN_PROGRESS | 2014-11-11T09:27:45Z |

But the CREATE_IN_PROGRESS event references the new ComputeAllNodesDeployment resource.

Now, if the config starts too quickly, the signal can be sent to the old resource by https://github.com/openstack/tripleo-image-elements/blob/master/elements/os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#L5 (the signal references the old ComputeAllNodesDeployment).

Then the new ComputeAllNodesDeployment also waits for the signal, but never receives it and times out.
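The race described above can be illustrated with a tiny Python sketch (the IDs are taken from the event list above; this is an illustration, not Heat's actual code):

```python
from urllib.parse import quote

# Hypothetical scenario: the update replaces the deployment resource,
# generating a new ID, while the server's cached metadata still
# carries the old one.
old_deployment_id = "110b50b7-6dd7-44e2-b6d7-4d0861285691"
new_deployment_id = "2607455f-2f34-4275-95dd-205e5ba3cd1b"

# The signal URL baked into the node references the OLD resource.
signal_url = "http://192.0.2.1:8000/v1/signal/" + quote(old_deployment_id)

# Heat is now waiting on the NEW resource, so the signal misses it
# and the new resource eventually times out.
waiting_on = {new_deployment_id}
signalled = old_deployment_id
missed = signalled not in waiting_on
print(missed)  # True -> stack-update times out
```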

Steven Hardy (shardy) wrote :

So, Ladislav provided some good analysis via IRC; here's my understanding of it:

There's a global SoftwareDeployments resource which references servers in the ResourceGroup:

https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud-without-mergepy.yaml#L739

Because changing the "servers" list causes the resource to be replaced, we end up with a race on update where two *AllNodesDeployment resources get created (one UPDATE_IN_PROGRESS and a new replacement CREATE_IN_PROGRESS), and signals don't necessarily hit the right resource.

I'm not quite sure whether all the members should signal on update, or just those being added, but either way it seems we shouldn't replace the Deployments resources on update if all that's changed is the servers list; instead we should compare the before/after lists and wait for the appropriate number of signals (how many is tbc after testing and understanding the agent behavior better).

Steve Baker (steve-stevebaker) wrote :

Yes, all properties of SoftwareDeployments should be update_allowed=True, especially 'servers'

> I'm not quite sure if all the members should signal on update, or just those being added

I think all members should be signalling on update since the whole point of SoftwareDeployments is to propagate the cluster list to the cluster.

If restarting services is expensive, then the orc scripts probably need to get better at only restarting a service when they really need to (such as when the service config has changed).
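One way the orc scripts could avoid unnecessary restarts is to record a hash of each rendered config and restart only when it changes. A hypothetical Python sketch (the real scripts are shell; `restart_needed` and the state directory are invented names):

```python
import hashlib
import os
import tempfile

def restart_needed(config_path, state_dir):
    """Return True when the config file's content hash differs from the
    hash recorded on the previous run, i.e. the config really changed."""
    with open(config_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(state_dir, exist_ok=True)
    marker = os.path.join(state_dir, os.path.basename(config_path) + ".sha256")
    old = None
    if os.path.exists(marker):
        with open(marker) as f:
            old = f.read()
    with open(marker, "w") as f:
        f.write(digest)
    return digest != old

# Demo in a throwaway directory (hypothetical service config).
tmp = tempfile.mkdtemp()
cfg = os.path.join(tmp, "svc.conf")
with open(cfg, "w") as f:
    f.write("bind=0.0.0.0")
state = os.path.join(tmp, "state")
first = restart_needed(cfg, state)   # True: no previous hash recorded
second = restart_needed(cfg, state)  # False: config unchanged
print(first, second)  # True False
```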

Steven Hardy (shardy) wrote :

OK then, so modifying the properties schema to allow updates may be all that's required, given that SoftwareDeployments is really just a ResourceGroup itself and thus should already support updating the nested stack?

Sounds like a nice simple fix if so :)
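The idea hinges on how Heat decides between in-place update and replacement: a changed property forces replacement unless its schema marks it update-allowed. A self-contained sketch of that decision logic (not Heat's actual properties code; the schema entries are illustrative):

```python
# Hypothetical property schema for a SoftwareDeployments-like resource;
# update_allowed controls whether changing the property updates the
# resource in place or replaces it entirely.
SCHEMA = {
    "servers": {"update_allowed": True},   # the proposed fix
    "config": {"update_allowed": False},
}

def needs_replacement(changed_props):
    """A change forces replacement unless every changed property
    is marked update_allowed in the schema."""
    return any(not SCHEMA[name]["update_allowed"] for name in changed_props)

# Scaling the group only changes 'servers', so no replacement occurs:
print(needs_replacement({"servers"}))  # False
print(needs_replacement({"config"}))   # True
```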

Excerpts from Steve Baker's message of 2014-11-11 20:40:35 UTC:
> Yes, all properties of SoftwareDeployments should be
> update_allowed=True, especially 'servers'
>
> > I'm not quite sure if all the members should signal on update, or just
> those being added
>
> I think all members should be signalling on update since the whole point
> of SoftwareDeployments is to propagate the cluster list to the cluster.
>
> If restarting services is expensive then the orc scripts probably need
> to get better at only restarting a service if they really need to (such
> as if the service config has changed)
>

Yes we started talking about that at the summit and some of us have a
few ideas on how to do it with a minimal amount of work. That is a
separate thing from signaling, which also needs to work better.


Steve Baker (steve-stevebaker) wrote :

The attached templates should be enough to replicate this by changing
server_count

So, trying this: http://paste.fedoraproject.org/150029/80221614/

A new ComputeAllNodesDeployment is not created, so that is good.

But signaling is broken:
1. The stack doesn't wait for the signal from the newly created node and finishes once the nova instances deploy, so it doesn't wait for os-collect-config.
2. Only new nodes run os-collect-config (not sure if that's right or wrong).
3. Finally, os-collect-config is not able to obtain the signal data from the metadata server; here is a log from the journal:
 http://paste.openstack.org/show/132466/

The event list shows that events still point to the same resource:
http://paste.openstack.org/show/132467/

I will paste heat-engine.log tomorrow and do some more debugging

Steve Baker (steve-stevebaker) wrote :

I'm going to write a simple functional test to replicate this

Steve Baker (steve-stevebaker) wrote :

I can't replicate this using a simple template and the heat-templates software-config hook and current heat master - it appears to do the right thing when scaling up and down.

I think I see what the issue might be. Currently templates have a pattern like this:

  allNodesConfig:
    type: OS::Heat::StructuredConfig
    properties:
      config:
        completion-signal: {get_input: deploy_signal_id}

and os-refresh-config/post-configure.d/99-refresh-completed will signal the *one* URL set by completion-signal. What 99-refresh-completed should be doing is iterating all deployments and signalling every deploy_signal_id it finds. This will allow multiple deployment resources to be signalled instead of just one random one.

Once 99-refresh-completed is modified to do this, the templates no longer need to specify completion-signal: {get_input: deploy_signal_id} (although 99-refresh-completed could continue to signal an optional completion-signal)

Steve Baker (steve-stevebaker) wrote :

If my analysis is correct I think the bug title should become something like

  99-refresh-completed only signals one deployment resource instead of all

Steve Baker (steve-stevebaker) wrote :

Also, 99-refresh-completed should only signal deployments which belong to group 'os-apply-config' or 'Heat::Ungrouped'
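Taken together, the two comments above suggest 99-refresh-completed should walk every collected deployment, filter on group, and signal each deploy_signal_id. A hedged Python sketch (the real script is shell, and the metadata field names here are assumptions about the deployments format):

```python
# Sketch of the proposed 99-refresh-completed behavior: collect the
# completion URL of every deployment in the groups this script owns.
# The metadata layout (deployments / group / inputs name-value pairs)
# is assumed for illustration.
def completion_signal_urls(metadata):
    urls = []
    for dep in metadata.get("deployments", []):
        # Other groups are already signalled per-deployment by 55-heat-config.
        if dep.get("group") not in ("os-apply-config", "Heat::Ungrouped"):
            continue
        for inp in dep.get("inputs", []):
            # NO_SIGNAL deployments carry no deploy_signal_id,
            # so they are skipped naturally.
            if inp.get("name") == "deploy_signal_id" and inp.get("value"):
                urls.append(inp["value"])
    return urls

metadata = {"deployments": [
    {"group": "os-apply-config",
     "inputs": [{"name": "deploy_signal_id",
                 "value": "http://192.0.2.1:8000/v1/signal/abc"}]},
    {"group": "script",
     "inputs": [{"name": "deploy_signal_id",
                 "value": "http://192.0.2.1:8000/v1/signal/ignored"}]},
]}
print(completion_signal_urls(metadata))
# ['http://192.0.2.1:8000/v1/signal/abc']
```

Each collected URL would then be POSTed, as the existing curl call does for the single completion-signal.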

Clint Byrum (clint-fewbar) wrote :

Steve, I agree. I was surprised that we didn't already have a bug to fix completion signals to allow for many instead of just one.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
summary: - heat stack-update failure when scaling resource group
+ heat stack-update failure when scaling resource group due to deficient
+ signal handling in o-r-c scripts
Changed in heat:
status: Triaged → Invalid
Steven Hardy (shardy) wrote :

stevebaker: can you clarify, is the Heat part of this really invalid? I thought at least we'd need to make SoftwareDeployments not replace the entire ResourceGroup containing all the SoftwareDeployment resources every update?

Or is that actually expected behavior, because you don't want the deployment resources in the nested resource group stack to be updated, but rather replaced and recreated on every update?

Ladislav Smola (lsmola) wrote :

stevebaker: I don't think I entirely get it

So, when I don't patch heat, each configured node sends a signal like this from 99-refresh-completed:

http://192.0.2.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3Abc1620a67879427f9fbda235d2558bc0%3Astacks%2Fovercloud-ComputeAllNodesDeployment-isqr55mpumd4%2F110b50b7-6dd7-44e2-b6d7-4d0861285691%2Fresources%2F0?Timestamp=2014-11-13T08%3A54%3A22Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=05e65a07fd364908a22b5b456c121a5e&SignatureVersion=2&Signature=IX7H8EzzxrYcQMt5A%2BdkFZ0Fs7l6Q4cjUTqpZd1xJDU%3D

which has the encoded id of ComputeAllNodesDeployment = 110b50b7-6dd7-44e2-b6d7-4d0861285691
and resource_name=0.

So this exactly tracks which node has been configured and completes the correct nested resource of ComputeAllNodesDeployment. That seems like correct behavior to me.

So this still looks like a heat bug to me: sometimes the id of ComputeAllNodesDeployment is wrong, because heat creates a new one in the meantime. Otherwise I would expect it to work like this.
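The signal URL quoted above can be decoded with stdlib tools to show exactly what it targets (a sketch; the ARN layout follows the URL as quoted, with the query string omitted):

```python
from urllib.parse import urlsplit, unquote

# Signal URL as quoted in the comment above (query string omitted).
url = ("http://192.0.2.1:8000/v1/signal/"
       "arn%3Aopenstack%3Aheat%3A%3Abc1620a67879427f9fbda235d2558bc0"
       "%3Astacks%2Fovercloud-ComputeAllNodesDeployment-isqr55mpumd4"
       "%2F110b50b7-6dd7-44e2-b6d7-4d0861285691%2Fresources%2F0")

# The path component after /signal/ is a percent-encoded ARN:
# arn:openstack:heat::<tenant>:stacks/<stack_name>/<stack_id>/resources/<name>
arn = unquote(urlsplit(url).path.split("/signal/")[1])
path = arn.split(":")[5]
parts = path.split("/")
# parts[2] is the ComputeAllNodesDeployment nested stack's id,
# parts[4] the group member's resource name.
stack_id, resource_name = parts[2], parts[4]
print(stack_id)       # 110b50b7-6dd7-44e2-b6d7-4d0861285691
print(resource_name)  # 0
```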

Ladislav Smola (lsmola) wrote :

BTW, this is the list of steps for reproducing it with Instack:

http://paste.openstack.org/show/132692/

Ladislav Smola (lsmola) wrote :

So using this patch http://paste.fedoraproject.org/150029/80221614/

It creates a new nova server, but it doesn't create a new deployment. That is why it doesn't wait for the signal,
and why it can't retrieve the signal url using os-apply-config (because the resource doesn't exist):

http://paste.openstack.org/show/132725/
http://paste.openstack.org/show/132726/

Ladislav Smola (lsmola) wrote :

It might be that the bug is already fixed in heat. I am updating Instack to the latest stable/juno.

Steve Baker (steve-stevebaker) wrote :

update_allowed=True is probably the right thing to do, but my point is that the signalling logic in heat-templates 55-heat-config works just fine with the current behaviour, and I can't see how 99-refresh-completed can do the right thing in its current state no matter what update_allowed is set to.

Steven Hardy (shardy) wrote :

@stevebaker: Ok, thanks for the clarification. If it helps, I can revive the heat part of this and post a patch moving the DeploymentsResource to update_allowed=True, and you can continue the investigation of the heat-config hook?

It looks like the symptoms Ladislav describes in comment #23 were due to running a pre-juno version of heat which didn't contain the fix for bug #1377681, but I'm waiting on confirmation that's the case after re-testing with a newer heat.

Ladislav Smola (lsmola) wrote :

@stevebaker: Hm, probably catching you for a chat will be the quickest way for me to understand :-)

So this looks promising; it worked on shardy's test templates:
http://paste.fedoraproject.org/150029/80221614/
and
https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1389499,n,z

But I am getting a failure:
| stack_status_reason | Error: Cannot update swift-storage-1-servers, stack not created |
on a resource group that was created with count 0. We should allow that; this happens e.g. when I am not deploying storage.

Steven Hardy (shardy) on 2014-11-14
Changed in heat:
status: Invalid → Triaged
Steven Hardy (shardy) wrote :

Fix proposed via bug #1392796 for the issue mentioned in comment #27

Ladislav Smola (lsmola) wrote :

So these 3 links below (4 patches) fix the update:

http://paste.fedoraproject.org/150029/80221614/
https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1389499,n,z
bug #1392796

The heat stack-update completes, all the nodes are signaling back (newly created and the updated ones).

I have one last issue: I can't ssh to the newly created node. I guess that is for another bug; I need to investigate.
But this bug should be closed by the ^ patches.

Fix proposed to branch: master
Review: https://review.openstack.org/135689

Changed in heat:
assignee: Steve Baker (steve-stevebaker) → Steven Hardy (shardy)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/135689
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=218e1038f2828258eaf1e15d85cf074697f851a9
Submitter: Jenkins
Branch: master

commit 218e1038f2828258eaf1e15d85cf074697f851a9
Author: Steven Hardy <email address hidden>
Date: Wed Nov 19 18:22:57 2014 +0000

    SoftwareDeployments make servers property updateable

    Allow the servers list to be updated, which will cause an update
    of the resource group instead of replacing it.

    This looks inconvenient to test in the unit tests due to needing
    to create a coupled nested stack (which will further impede the
    decouple-nested work), so I'll post a followup patch with a
    functional test instead, if that's viewed as acceptable.

    Change-Id: I1714d2cef2d0edbeb82c16e5a83acc458009bd47
    Closes-Bug: #1389178

Changed in heat:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/136729
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6f5dfaa964987c569506ee6224a38dbbd2ee2831
Submitter: Jenkins
Branch: stable/juno

commit 6f5dfaa964987c569506ee6224a38dbbd2ee2831
Author: Steven Hardy <email address hidden>
Date: Wed Nov 19 18:22:57 2014 +0000

    SoftwareDeployments make servers property updateable

    Allow the servers list to be updated, which will cause an update
    of the resource group instead of replacing it.

    This looks inconvenient to test in the unit tests due to needing
    to create a coupled nested stack (which will further impede the
    decouple-nested work), so I'll post a followup patch with a
    functional test instead, if that's viewed as acceptable.

    Change-Id: I1714d2cef2d0edbeb82c16e5a83acc458009bd47
    Closes-Bug: #1389178
    (cherry picked from commit 218e1038f2828258eaf1e15d85cf074697f851a9)

Changed in heat:
milestone: none → kilo-1
Steven Hardy (shardy) on 2014-12-10
Changed in tripleo:
assignee: nobody → Steven Hardy (shardy)
Thierry Carrez (ttx) on 2014-12-18
Changed in heat:
status: Fix Committed → Fix Released

Fix proposed to branch: master
Review: https://review.openstack.org/145764

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
assignee: Steven Hardy (shardy) → Dan Prince (dan-prince)
Changed in tripleo:
assignee: Dan Prince (dan-prince) → Steven Hardy (shardy)

Reviewed: https://review.openstack.org/145764
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=93e05f569e29639e759ea1314af89340fee95227
Submitter: Jenkins
Branch: master

commit 93e05f569e29639e759ea1314af89340fee95227
Author: Steven Hardy <email address hidden>
Date: Thu Jan 8 12:05:15 2015 +0000

    Signal all o-a-c deployments in 99-refresh-completed

    Currently we have a TripleO specific template pattern, where all
    deployment resources are configured NO_SIGNAL, regardless of what
    DefaultSignalTransport is set to, and only one signal for all
    deployments is sent via the *AllNodesDeployment resources, by adding
    the heat-generated deploy_signal_id to the structured config data
    consumed by os-apply-config (as "completion-signal").

    This is inconsistent with all signalling done via heat-config (e.g
    everything other than o-a-c, which is triggered via a hook via
    55-heat-config, this transparently consumes the heat-generated
    deploy_signal_id and signals heat after each deployment hook is
    run.

    To allow per-deployment signalling for os-apply-config configs,
    this adds logic which looks in the deployment data processed by
    os-apply-config and signals all deployments deploying a config with
    group "os-apply-config" (everything else should be handled by 55-heat-config)

    Note that if the deployment is configured NO_SIGNAL, no deploy_signal_id
    will be set, thus this will do nothing, and currently this won't work with
    HEAT_SIGNAL, only CFN_SIGNAL (which is the default for deployments with no
    signal_transport specified). In future it would be good to add support for
    HEAT_SIGNAL to break the dependency on heat-api-cfn.

    This is backwards compatible, but after it's merged we can remove all the
    NO_SIGNAL's in the templates, and the completion-signal key from the
    allNodesConfig, which should in future allow more flexible control of
    the ordering of metadata update for configs applied via o-a-c (as well
    as better visibility of progress during deployment).

    Co-Authored-By: Dan Prince <email address hidden>

    Change-Id: I72ea524effd07deeb432fb38ee7da5f3dc7990a7
    Closes-Bug: #1389178

Changed in tripleo:
status: In Progress → Fix Committed
Jay Dobies (jdob) on 2015-04-10
Changed in tripleo:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-04-30
Changed in heat:
milestone: kilo-1 → 2015.1.0

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/134089
