heat stack-update failure when scaling resource group due to deficient signal handling in o-r-c scripts

Bug #1389178 reported by Ladislav Smola on 2014-11-04
Affects          Status         Importance   Assigned to    Milestone
OpenStack Heat   Fix Released   High         Steven Hardy
Juno             Fix Released   High         Steven Hardy
tripleo          Fix Released   High         Steven Hardy

Bug Description

The update from 1 to 2 computes was successful in this case, but it failed for me in 80% of attempts.

A log of an unsuccessful update from 2 to 3 computes is attached as heat_debug_error.txt.
heat stack-update timed out after 60 minutes.

CLI paste
http://paste.openstack.org/show/129053/

This was an Instack heat deployment, using this https://github.com/agroup/instack-undercloud/blob/master/README-source.md

Ladislav Smola (lsmola) wrote :

This is a log of a successful update from 1 to 2 computes.

Ladislav Smola (lsmola) wrote :

When I run stack-update with the same parameters on the failed stack, it succeeds. So there is probably a signaling race that prevents the resource from completing in some situations.

Changed in heat:
status: New → Triaged
importance: Undecided → High
Steve Baker (steve-stevebaker) wrote :

When I get the chance I'll see if this can be replicated with a simple server/deployment inside a resource group

Changed in heat:
assignee: nobody → Steve Baker (steve-stevebaker)
Steven Hardy (shardy) wrote :

> When I get the chance I'll see if this can be replicated with a simple server/deployment inside a resource group

FWIW I've been doing a bit of testing with server+deployment inside ResourceGroup (using the cirros hook mostly), and I've not seen any such problem.

That said, the tripleo scenario is so much more complex it's quite possible there's something specific to those templates causing this.

Steven Hardy (shardy) on 2014-11-11
tags: added: tripleo
Ladislav Smola (lsmola) wrote :

OK, so I think I found the reason.

There is a race condition in signaling/resource creation.

When we run heat stack-update, there are two ComputeAllNodesDeployment events in quick sequence:
| ComputeAllNodesDeployment | 2607455f-2f34-4275-95dd-205e5ba3cd1b | state changed | CREATE_IN_PROGRESS | 2014-11-11T09:27:46Z |
| ComputeAllNodesDeployment | aea956f1-aa3a-4ce3-8ae4-7a4c841333f2 | state changed | UPDATE_IN_PROGRESS | 2014-11-11T09:27:45Z |

But the CREATE_IN_PROGRESS event references the new ComputeAllNodesDeployment resource.

Now, if the config starts too quickly, the signal can be sent to the old resource by https://github.com/openstack/tripleo-image-elements/blob/master/elements/os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#L5 (the signal references the old ComputeAllNodesDeployment).

Then the new ComputeAllNodesDeployment also waits for the signal, but never receives it and times out.
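The race described above can be illustrated with a tiny Python sketch (the IDs are taken from the event list above; this is an illustration, not Heat's actual code):

```python
from urllib.parse import quote

# Hypothetical scenario: the update replaces the deployment resource,
# generating a new ID, while the server's cached metadata still
# carries the old one.
old_deployment_id = "110b50b7-6dd7-44e2-b6d7-4d0861285691"
new_deployment_id = "2607455f-2f34-4275-95dd-205e5ba3cd1b"

# The signal URL baked into the node references the OLD resource.
signal_url = "http://192.0.2.1:8000/v1/signal/" + quote(old_deployment_id)

# Heat is now waiting on the NEW resource, so the signal misses it
# and the new resource eventually times out.
waiting_on = {new_deployment_id}
signalled = old_deployment_id
missed = signalled not in waiting_on
print(missed)  # True -> stack-update times out
```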

Steven Hardy (shardy) wrote :

So, Ladislav provided some good analysis via IRC; here's my understanding of it:

There's a global SoftwareDeployments resource which references servers in the ResourceGroup:

https://github.com/openstack/tripleo-heat-templates/blob/master/overcloud-without-mergepy.yaml#L739

Because changing the "servers" list causes the resource to be replaced, we end up with a race on update where two *AllNodesDeployment resources get created (one UPDATE_IN_PROGRESS and a new replacement CREATE_IN_PROGRESS), and signals don't necessarily hit the right resource.

I'm not quite sure whether all the members should signal on update, or just those being added, but either way it seems we shouldn't replace the Deployments resources on update if all that's changed is the servers list; instead we should compare the before/after lists and wait for the appropriate number of signals (how many is tbc after testing and understanding the agent behavior better).

Steve Baker (steve-stevebaker) wrote :

Yes, all properties of SoftwareDeployments should be update_allowed=True, especially 'servers'

> I'm not quite sure if all the members should signal on update, or just those being added

I think all members should be signalling on update since the whole point of SoftwareDeployments is to propagate the cluster list to the cluster.

If restarting services is expensive, then the orc scripts probably need to get better at only restarting a service when they really need to (such as when the service config has changed).
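One way the orc scripts could avoid unnecessary restarts is to record a hash of each rendered config and restart only when it changes. A hypothetical Python sketch (the real scripts are shell; `restart_needed` and the state directory are invented names):

```python
import hashlib
import os
import tempfile

def restart_needed(config_path, state_dir):
    """Return True when the config file's content hash differs from the
    hash recorded on the previous run, i.e. the config really changed."""
    with open(config_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    os.makedirs(state_dir, exist_ok=True)
    marker = os.path.join(state_dir, os.path.basename(config_path) + ".sha256")
    old = None
    if os.path.exists(marker):
        with open(marker) as f:
            old = f.read()
    with open(marker, "w") as f:
        f.write(digest)
    return digest != old

# Demo in a throwaway directory (hypothetical service config).
tmp = tempfile.mkdtemp()
cfg = os.path.join(tmp, "svc.conf")
with open(cfg, "w") as f:
    f.write("bind=0.0.0.0")
state = os.path.join(tmp, "state")
first = restart_needed(cfg, state)   # True: no previous hash recorded
second = restart_needed(cfg, state)  # False: config unchanged
print(first, second)  # True False
```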

Steven Hardy (shardy) wrote :

OK then, so modifying the properties schema to allow updates may be all that's required, given that SoftwareDeployments is really just a ResourceGroup itself and thus should already support updating the nested stack?

Sounds like a nice simple fix if so :)
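The idea hinges on how Heat decides between in-place update and replacement: a changed property forces replacement unless its schema marks it update-allowed. A self-contained sketch of that decision logic (not Heat's actual properties code; the schema entries are illustrative):

```python
# Hypothetical property schema for a SoftwareDeployments-like resource;
# update_allowed controls whether changing the property updates the
# resource in place or replaces it entirely.
SCHEMA = {
    "servers": {"update_allowed": True},   # the proposed fix
    "config": {"update_allowed": False},
}

def needs_replacement(changed_props):
    """A change forces replacement unless every changed property
    is marked update_allowed in the schema."""
    return any(not SCHEMA[name]["update_allowed"] for name in changed_props)

# Scaling the group only changes 'servers', so no replacement occurs:
print(needs_replacement({"servers"}))  # False
print(needs_replacement({"config"}))   # True
```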

Excerpts from Steve Baker's message of 2014-11-11 20:40:35 UTC:
> Yes, all properties of SoftwareDeployments should be
> update_allowed=True, especially 'servers'
>
> > I'm not quite sure if all the members should signal on update, or just
> those being added
>
> I think all members should be signalling on update since the whole point
> of SoftwareDeployments is to propagate the cluster list to the cluster.
>
> If restarting services is expensive then the orc scripts probably need
> to get better at only restarting a service if they really need to (such
> as if the service config has changed)
>

Yes we started talking about that at the summit and some of us have a
few ideas on how to do it with a minimal amount of work. That is a
separate thing from signaling, which also needs to work better.


Steve Baker (steve-stevebaker) wrote :

The attached templates should be enough to replicate this by changing
server_count

So, trying this: http://paste.fedoraproject.org/150029/80221614/

A new ComputeAllNodesDeployment is not created, so that is good.

But signaling is broken:
1. The stack doesn't wait for the signal from the newly created node and finishes once the nova instances deploy, so it doesn't wait for os-collect-config.
2. Only new nodes run os-collect-config (not sure if that's right or wrong).
3. Finally, os-collect-config is not able to obtain the signal data from the metadata server; here is a log from the journal:
 http://paste.openstack.org/show/132466/

The event list shows that events still point to the same resource:
http://paste.openstack.org/show/132467/

I will paste heat-engine.log tomorrow and do some more debugging

Steve Baker (steve-stevebaker) wrote :

I'm going to write a simple functional test to replicate this

Steve Baker (steve-stevebaker) wrote :

I can't replicate this using a simple template and the heat-templates software-config hook and current heat master - it appears to do the right thing when scaling up and down.

I think I see what the issue might be. Currently templates have a pattern like this:

  allNodesConfig:
    type: OS::Heat::StructuredConfig
    properties:
      config:
        completion-signal: {get_input: deploy_signal_id}

and os-refresh-config/post-configure.d/99-refresh-completed will signal the *one* URL set by completion-signal. What 99-refresh-completed should be doing is iterating all deployments and signalling every deploy_signal_id it finds. This will allow multiple deployment resources to be signalled instead of just one random one.

Once 99-refresh-completed is modified to do this, the templates no longer need to specify completion-signal: {get_input: deploy_signal_id} (although 99-refresh-completed could continue to signal an optional completion-signal)

Steve Baker (steve-stevebaker) wrote :

If my analysis is correct I think the bug title should become something like

  99-refresh-completed only signals one deployment resource instead of all

Steve Baker (steve-stevebaker) wrote :

Also, 99-refresh-completed should only signal deployments which belong to group 'os-apply-config' or 'Heat::Ungrouped'
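Taken together, the two comments above suggest 99-refresh-completed should walk every collected deployment, filter on group, and signal each deploy_signal_id. A hedged Python sketch (the real script is shell, and the metadata field names here are assumptions about the deployments format):

```python
# Sketch of the proposed 99-refresh-completed behavior: collect the
# completion URL of every deployment in the groups this script owns.
# The metadata layout (deployments / group / inputs name-value pairs)
# is assumed for illustration.
def completion_signal_urls(metadata):
    urls = []
    for dep in metadata.get("deployments", []):
        # Other groups are already signalled per-deployment by 55-heat-config.
        if dep.get("group") not in ("os-apply-config", "Heat::Ungrouped"):
            continue
        for inp in dep.get("inputs", []):
            # NO_SIGNAL deployments carry no deploy_signal_id,
            # so they are skipped naturally.
            if inp.get("name") == "deploy_signal_id" and inp.get("value"):
                urls.append(inp["value"])
    return urls

metadata = {"deployments": [
    {"group": "os-apply-config",
     "inputs": [{"name": "deploy_signal_id",
                 "value": "http://192.0.2.1:8000/v1/signal/abc"}]},
    {"group": "script",
     "inputs": [{"name": "deploy_signal_id",
                 "value": "http://192.0.2.1:8000/v1/signal/ignored"}]},
]}
print(completion_signal_urls(metadata))
# ['http://192.0.2.1:8000/v1/signal/abc']
```

Each collected URL would then be POSTed, as the existing curl call does for the single completion-signal.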

Clint Byrum (clint-fewbar) wrote :

Steve, I agree. I was surprised that we didn't already have a bug to fix completion signals to allow for many instead of just one.

Changed in tripleo:
status: New → Triaged
importance: Undecided → High
summary: - heat stack-update failure when scaling resource group
+ heat stack-update failure when scaling resource group due to deficient
+ signal handling in o-r-c scripts
Changed in heat:
status: Triaged → Invalid
Steven Hardy (shardy) wrote :

stevebaker: can you clarify, is the Heat part of this really invalid? I thought at least we'd need to make SoftwareDeployments not replace the entire ResourceGroup containing all the SoftwareDeployment resources every update?

Or is that actually expected behavior, because you don't want the deployment resources in the nested resource group stack to be updated, but rather replaced and recreated on every update?

Ladislav Smola (lsmola) wrote :

stevebaker: I don't think I entirely get it

So, when I don't patch heat, each configured node sends a signal like this from 99-refresh-completed:

http://192.0.2.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3Abc1620a67879427f9fbda235d2558bc0%3Astacks%2Fovercloud-ComputeAllNodesDeployment-isqr55mpumd4%2F110b50b7-6dd7-44e2-b6d7-4d0861285691%2Fresources%2F0?Timestamp=2014-11-13T08%3A54%3A22Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=05e65a07fd364908a22b5b456c121a5e&SignatureVersion=2&Signature=IX7H8EzzxrYcQMt5A%2BdkFZ0Fs7l6Q4cjUTqpZd1xJDU%3D

which has the encoded id of ComputeAllNodesDeployment = 110b50b7-6dd7-44e2-b6d7-4d0861285691
and resource_name=0.

So this exactly tracks which node has been configured and completes the correct nested resource of ComputeAllNodesDeployment. That seems like correct behavior to me.

So this still looks like a heat bug to me: sometimes the id of ComputeAllNodesDeployment is wrong, because heat creates a new one in the meantime. Otherwise I would expect it to work like this.
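The signal URL quoted above can be decoded with stdlib tools to show exactly what it targets (a sketch; the ARN layout follows the URL as quoted, with the query string omitted):

```python
from urllib.parse import urlsplit, unquote

# Signal URL as quoted in the comment above (query string omitted).
url = ("http://192.0.2.1:8000/v1/signal/"
       "arn%3Aopenstack%3Aheat%3A%3Abc1620a67879427f9fbda235d2558bc0"
       "%3Astacks%2Fovercloud-ComputeAllNodesDeployment-isqr55mpumd4"
       "%2F110b50b7-6dd7-44e2-b6d7-4d0861285691%2Fresources%2F0")

# The path component after /signal/ is a percent-encoded ARN:
# arn:openstack:heat::<tenant>:stacks/<stack_name>/<stack_id>/resources/<name>
arn = unquote(urlsplit(url).path.split("/signal/")[1])
path = arn.split(":")[5]
parts = path.split("/")
# parts[2] is the ComputeAllNodesDeployment nested stack's id,
# parts[4] the group member's resource name.
stack_id, resource_name = parts[2], parts[4]
print(stack_id)       # 110b50b7-6dd7-44e2-b6d7-4d0861285691
print(resource_name)  # 0
```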

Ladislav Smola (lsmola) wrote :

BTW, this is the list of steps for reproducing it with Instack:

http://paste.openstack.org/show/132692/

Ladislav Smola (lsmola) wrote :

So using this patch http://paste.fedoraproject.org/150029/80221614/

It creates a new nova server, but it doesn't create a new deployment. That is why it doesn't wait for the signal,
and why it can't retrieve the signal url using os-apply-config (because the resource doesn't exist):

http://paste.openstack.org/show/132725/
http://paste.openstack.org/show/132726/

Ladislav Smola (lsmola) wrote :

It might be that the bug is already fixed in heat. I am updating Instack to the latest stable/juno.

Steve Baker (steve-stevebaker) wrote :

update_allowed=True is probably the right thing to do, but my point is that the signalling logic in heat-templates 55-heat-config works just fine with the current behaviour, and I can't see how 99-refresh-completed can do the right thing in its current state no matter what update_allowed is set to.

Steven Hardy (shardy) wrote :

@stevebaker: Ok, thanks for the clarification. If it helps, I can revive the heat part of this and post a patch moving the DeploymentsResource to update_allowed=True, and you can continue the investigation of the heat-config hook?

It looks like the symptoms Ladislav describes in comment #23 were due to running a pre-juno version of heat which didn't contain the fix for bug #1377681, but I'm waiting on confirmation that's the case after re-testing with a newer heat.

Ladislav Smola (lsmola) wrote :

@stevebaker: Hm, probably catching you for a chat will be the quickest way for me to understand :-)

So this looks promising; it worked on shardy's test templates:
http://paste.fedoraproject.org/150029/80221614/
and
https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1389499,n,z

But I am getting a failure:
| stack_status_reason | Error: Cannot update swift-storage-1-servers, stack not created |
on a resource group that was created with count 0. We should allow that; this happens e.g. when I am not deploying storage.

Steven Hardy (shardy) on 2014-11-14
Changed in heat:
status: Invalid → Triaged
Steven Hardy (shardy) wrote :

Fix proposed via bug #1392796 for the issue mentioned in comment #27

Ladislav Smola (lsmola) wrote :

So these 3 links below (4 patches) fix the update:

http://paste.fedoraproject.org/150029/80221614/
https://review.openstack.org/#/q/status:open+project:openstack/heat+branch:master+topic:bug/1389499,n,z
bug #1392796

The heat stack-update completes, all the nodes are signaling back (newly created and the updated ones).

I have one last issue: I can't ssh to the newly created node. I guess that is for another bug; I need to investigate.
But this bug should be closed by the ^ patches.

Fix proposed to branch: master
Review: https://review.openstack.org/135689

Changed in heat:
assignee: Steve Baker (steve-stevebaker) → Steven Hardy (shardy)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/135689
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=218e1038f2828258eaf1e15d85cf074697f851a9
Submitter: Jenkins
Branch: master

commit 218e1038f2828258eaf1e15d85cf074697f851a9
Author: Steven Hardy <email address hidden>
Date: Wed Nov 19 18:22:57 2014 +0000

    SoftwareDeployments make servers property updateable

    Allow the servers list to be updated, which will cause an update
    of the resource group instead of replacing it.

    This looks inconvenient to test in the unit tests due to needing
    to create a coupled nested stack (which will further impede the
    decouple-nested work), so I'll post a followup patch with a
    functional test instead, if that's viewed as acceptable.

    Change-Id: I1714d2cef2d0edbeb82c16e5a83acc458009bd47
    Closes-Bug: #1389178

Changed in heat:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/136729
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6f5dfaa964987c569506ee6224a38dbbd2ee2831
Submitter: Jenkins
Branch: stable/juno

commit 6f5dfaa964987c569506ee6224a38dbbd2ee2831
Author: Steven Hardy <email address hidden>
Date: Wed Nov 19 18:22:57 2014 +0000

    SoftwareDeployments make servers property updateable

    Allow the servers list to be updated, which will cause an update
    of the resource group instead of replacing it.

    This looks inconvenient to test in the unit tests due to needing
    to create a coupled nested stack (which will further impede the
    decouple-nested work), so I'll post a followup patch with a
    functional test instead, if that's viewed as acceptable.

    Change-Id: I1714d2cef2d0edbeb82c16e5a83acc458009bd47
    Closes-Bug: #1389178
    (cherry picked from commit 218e1038f2828258eaf1e15d85cf074697f851a9)

Changed in heat:
milestone: none → kilo-1
Steven Hardy (shardy) on 2014-12-10
Changed in tripleo:
assignee: nobody → Steven Hardy (shardy)
Thierry Carrez (ttx) on 2014-12-18
Changed in heat:
status: Fix Committed → Fix Released

Fix proposed to branch: master
Review: https://review.openstack.org/145764

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
assignee: Steven Hardy (shardy) → Dan Prince (dan-prince)
Changed in tripleo:
assignee: Dan Prince (dan-prince) → Steven Hardy (shardy)

Reviewed: https://review.openstack.org/145764
Committed: https://git.openstack.org/cgit/openstack/tripleo-image-elements/commit/?id=93e05f569e29639e759ea1314af89340fee95227
Submitter: Jenkins
Branch: master

commit 93e05f569e29639e759ea1314af89340fee95227
Author: Steven Hardy <email address hidden>
Date: Thu Jan 8 12:05:15 2015 +0000

    Signal all o-a-c deployments in 99-refresh-completed

    Currently we have a TripleO specific template pattern, where all
    deployment resources are configured NO_SIGNAL, regardless of what
    DefaultSignalTransport is set to, and only one signal for all
    deployments is sent via the *AllNodesDeployment resources, by adding
    the heat-generated deploy_signal_id to the structured config data
    consumed by os-apply-config (as "completion-signal").

    This is inconsistent with all signalling done via heat-config (e.g
    everything other than o-a-c, which is triggered via a hook via
    55-heat-config, this transparently consumes the heat-generated
    deploy_signal_id and signals heat after each deployment hook is
    run.

    To allow per-deployment signalling for os-apply-config configs,
    this adds logic which looks in the deployment data processed by
    os-apply-config and signals all deployments deploying a config with
    group "os-apply-config" (everything else should be handled by 55-heat-config)

    Note that if the deployment is configured NO_SIGNAL, no deploy_signal_id
    will be set, thus this will do nothing, and currently this won't work with
    HEAT_SIGNAL, only CFN_SIGNAL (which is the default for deployments with no
    signal_transport specified). In future it would be good to add support for
    HEAT_SIGNAL to break the dependency on heat-api-cfn.

    This is backwards compatible, but after it's merged we can remove all the
    NO_SIGNAL's in the templates, and the completion-signal key from the
    allNodesConfig, which should in future allow more flexible control of
    the ordering of metadata update for configs applied via o-a-c (as well
    as better visibility of progress during deployment).

    Co-Authored-By: Dan Prince <email address hidden>

    Change-Id: I72ea524effd07deeb432fb38ee7da5f3dc7990a7
    Closes-Bug: #1389178

Changed in tripleo:
status: In Progress → Fix Committed
Jay Dobies (jdob) on 2015-04-10
Changed in tripleo:
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-04-30
Changed in heat:
milestone: kilo-1 → 2015.1.0

Change abandoned by Steve Baker (<email address hidden>) on branch: master
Review: https://review.openstack.org/134089
