new resources are not available for signalling during an update -> heat stack update timeout

Bug #1291905 reported by Ladislav Smola
Affects          Status         Importance   Assigned to        Milestone
OpenStack Heat   Fix Released   Medium       Zane Bitter        2014.2
tripleo          Fix Released   High         Ladislav Smola

Bug Description

After increasing, e.g., the number of compute nodes and running heat stack-update:

The Nova instance is created successfully, but the CompletionCondition never finishes; it looks like an auth problem with the undercloud Keystone.

From the new node I try curl with the URL from NovaCompute1CompletionCondition:

curl -X POST "http://192.0.2.3:8000/v1/waitcondition/arn%3Aopenstack%3Aheat%3A%3Ace7d690c041e46dc982846f4e4d0fa5e%3Astacks%2Fovercloud%2F917434b0-3592-43a2-ae24-a6a0904f5a15%2Fresources%2FNovaCompute1CompletionHandle?Timestamp=2014-03-13T09%3A48%3A55Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=3ab500ceebfb43c9a8ce41b1f90a52df&SignatureVersion=2&Signature=JeMY902pIZz1lqwhM9J2Stdkfo5FwSGyKIlEte66M6U%3D"

On the undercloud node I get:

Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.599 4183 INFO heat.api.aws.ec2token [-] Checking AWS credentials..
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.599 4183 INFO heat.api.aws.ec2token [-] AWS credentials found, checking against keystone.
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.599 4183 INFO heat.api.aws.ec2token [-] Authenticating with http://127.0.0.1:5000/v2.0/ec2tokens
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.602 4183 INFO requests.packages.urllib3.connectionpool [-] Starting new HTTP connection (1): 127.0.0.1
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy keystone-all[16507]: 2014-03-13 10:08:15.610 16507 WARNING keystone.common.wsgi [-] Authorization failed. The request you have made requires authentication. from 127.0.0.1
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.612 4183 DEBUG requests.packages.urllib3.connectionpool [-] "POST /v2.0/ec2tokens HTTP/1.1" 401 114 _make_request /opt/stack/venvs/heat/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:344
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.613 4183 INFO heat.api.aws.ec2token [-] AWS authentication failure.
Mar 13 10:08:15 undercloud-undercloud-ojsfmisefovy heat-api-cfn[4183]: 2014-03-13 10:08:15.613 4183 DEBUG root [-] XML response : <ErrorResponse><Error><Message>User is not authorized to pe

I am now comparing stack-create and stack-update, because create works correctly.

Ladislav Smola (lsmola)
Changed in tripleo:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Ladislav Smola (lsmola) wrote :

OK, so it is not an auth issue.

The issue is that it doesn't create the CompletionHandle resource:
http://paste.openstack.org/show/73382/

So even when the Nova instance is active, the Compute1 resources are not there:
http://paste.openstack.org/show/73404/

And they are created after the timeout:
http://paste.openstack.org/show/73407/

which is weird.

I need to investigate the updating process more closely.

Revision history for this message
Ladislav Smola (lsmola) wrote :

The parameters are used as template defaults in both calls.

heat template for create
http://paste.openstack.org/show/73427/

heat template for update
http://paste.openstack.org/show/73423/

Revision history for this message
Zane Bitter (zaneb) wrote :

The problem is that NovaCompute1 does not depend on NovaCompute1CompletionHandle, which is only referenced from NovaCompute1Config. NovaCompute1 relies on the metadata of NovaCompute1Config, but there is no dependency relationship expressed in the template. I believe if you add "DependsOn": "NovaCompute1Config" to NovaCompute1, the update will work correctly.
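
Something like this, as a sketch only (existing Properties omitted; use whatever resource type your template already declares for NovaCompute1, OS::Nova::Server is just my assumption here):

    "NovaCompute1": {
        "Type": "OS::Nova::Server",
        "DependsOn": "NovaCompute1Config"
    }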

Changed in tripleo:
status: Triaged → Invalid
Revision history for this message
Ladislav Smola (lsmola) wrote :

This particular dependency results in:

ERROR: Remote error: CircularDependencyException Circular Dependency Found: {LaunchConfiguration "NovaCompute0Config": {Server "NovaCompute1"}, Server "NovaCompute1": {LaunchConfiguration "NovaCompute1Config"}, LaunchConfiguration "NovaCompute1Config": {Server "NovaCompute1"}}

I will experiment with the dependencies more; it seems this could be the right track.

BTW, the bug needs to stay even if it is just a dependency issue; we need to fix tripleo-heat-templates https://github.com/openstack/tripleo-heat-templates

Changed in tripleo:
status: Invalid → Triaged
Revision history for this message
Ladislav Smola (lsmola) wrote :

It seems the dependencies don't help. The resources are there, but they appear to be in a different stack from the one we look at when we finish the CompletionCondition.

When it fails, the updated stack becomes the current stack and the resources are there, but that is too late.

Ladislav Smola (lsmola)
Changed in heat:
assignee: nobody → Ladislav Smola (lsmola)
Zane Bitter (zaneb)
Changed in heat:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Zane Bitter (zaneb) wrote :

Hmm, it seems a bit wrong that NovaCompute1Config gets attributes of NovaCompute1, which is the thing it is configuring. But this whole LaunchConfiguration business is a hack anyway.

You _will_ need to get around it by having NovaCompute1 depend on the NovaCompute1CompletionHandle directly.

That said, based on IRC discussions, there is still a Heat bug to be addressed here, in that during an update the newly-created WaitCondition is not writable until after the update is complete, which is obviously too late.

(Apologies for setting the wrong status on this; I just assumed it was in the Heat component without looking.)

Changed in heat:
milestone: none → icehouse-rc1
Revision history for this message
Ladislav Smola (lsmola) wrote :

Zane:
No problem. It would be nice to put the feedback on the template into some tripleo bug. I didn't know it was doing something hacky. :-)

About the bug:
I tried to investigate how to do a quick workaround for Icehouse, but I don't see any trivial solution.

So I tried:

loading the resource directly from the DB using resource_get_by_name_and_stack(cnxt, resource_name, s.id) here: https://github.com/openstack/heat/blob/fe8f7f1b80953a671e52dc0e07d61269a6818f53/heat/engine/service.py#L960

The existence check is easy, but then I need to initialize the resource.Resource so I can call metadata_update here:
https://github.com/openstack/heat/blob/fe8f7f1b80953a671e52dc0e07d61269a6818f53/heat/engine/service.py#L965

I didn't find a way to do it, because it seems Resource init needs the template data, which I don't have (example here:
https://github.com/openstack/heat/blob/fe8f7f1b80953a671e52dc0e07d61269a6818f53/heat/engine/parser.py#L117)

So I see two possible ways to solve this:
1. The resource stores enough info in the DB that you can initialize it from the DB record alone.
2. The current stack also saves the updated_template, so it is accessible in this method.

Both of the above are non-trivial architecture changes, so I am not sure this is the right way. Did you have some easier hack in mind? Since the DB record doesn't even have the resource_type, I can't see any easy hack.

Please, if you have some quick solution in mind, it would be nice if you could do it. Or if you point me at it, I can try to fix this next week (we have big planning this week, so I probably can't do anything :-( )

Changed in heat:
assignee: Ladislav Smola (lsmola) → nobody
Changed in tripleo:
assignee: nobody → Zane Bitter (zaneb)
Revision history for this message
Ladislav Smola (lsmola) wrote :

Zane:
It seems I don't have the rights to assign you the Heat part, so I am assigning you the tripleo part. :-)

Revision history for this message
Zane Bitter (zaneb) wrote :

Thanks, I grabbed the heat part :)

Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
Changed in tripleo:
assignee: Zane Bitter (zaneb) → nobody
Revision history for this message
Zane Bitter (zaneb) wrote :

It's difficult to see this getting fixed in Icehouse, for all of the reasons you've outlined above. (The plan is to fix this in Juno.)

I believe the workaround is to use the OS::Heat::UpdateWaitConditionHandle resource in place of AWS::CloudFormation::WaitConditionHandle. This resource gets replaced on every stack update, so you don't need to rename it to trigger a replacement. That should ensure you can always access it, since it's always in the template (with the same name).
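
As a sketch, keeping the handle name used in this bug (the existing WaitCondition keeps referencing it via Ref):

    "NovaCompute1CompletionHandle": {
        "Type": "OS::Heat::UpdateWaitConditionHandle"
    }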

Perhaps Clint can confirm that this is the standard practice for TripleO at the moment?

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Yes, Zane, that is how TripleO handles wait conditions currently, and it works quite well (except when something doesn't work inside the instance, and then we get UPDATE_FAILED ...)

Changed in heat:
milestone: icehouse-rc1 → next
Revision history for this message
Zane Bitter (zaneb) wrote :

I see from IRC scrollback that I was not especially clear. I've done a more thorough analysis of the templates - which BTW are incredibly difficult to read because there's a notCompute0 and a NovaCompute0 (why??? name things after what they are, not what they aren't), and because stuff moves around unnecessarily in the file (It looks like NovaCompute1 replaces NovaCompute0, but in fact NovaCompute0 has just moved elsewhere in the file). It appears the change you're trying to make is to add a new server (NovaCompute1) with a new WaitCondition and WaitConditionHandle, while not modifying the previous server.

What you want to do instead is for NovaCompute0CompletionHandle to have been an OS::Heat::UpdateWaitConditionHandle instead of an AWS::CloudFormation::WaitConditionHandle for the new server (I think it's OK if you create a new WaitCondition; it's probably not important either way). Unlike a WaitConditionHandle, an UpdateWaitConditionHandle (which stores the data for the WaitCondition) will get replaced (i.e. the data cleared) during an update. The Count on the WaitCondition should be the number of servers you are adding and/or replacing. So if NovaCompute0 is not being replaced, you only need/want a Count of 1 (for NovaCompute1, which is new).

You'll need to rearrange the dependencies too, obviously.
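
Roughly this shape, as a sketch only using the names from this bug; the Timeout value is a placeholder, the dependency arrangement is just one possibility, and any properties not shown stay as they are:

    "NovaCompute0CompletionHandle": {
        "Type": "OS::Heat::UpdateWaitConditionHandle"
    },
    "NovaCompute1CompletionCondition": {
        "Type": "AWS::CloudFormation::WaitCondition",
        "DependsOn": "NovaCompute1",
        "Properties": {
            "Handle": {"Ref": "NovaCompute0CompletionHandle"},
            "Count": "1",
            "Timeout": "1800"
        }
    }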

The other workaround would be to do a two-stage update. Add the WaitConditionHandle in the first update, then add the server and the WaitCondition in a subsequent update.

Ladislav Smola (lsmola)
summary: - heat stack-update fails on timeout
+ new resources are not available for signalling during an update -> heat stack update timeout
Changed in tripleo:
assignee: nobody → Ladislav Smola (lsmola)
status: Triaged → In Progress
Revision history for this message
Ladislav Smola (lsmola) wrote :

OK, so the workaround might not be possible to do the right way.

The right way of signaling back to the CompletionCondition is this:
https://github.com/openstack/tripleo-image-elements/blob/master/elements/os-refresh-config/os-refresh-config/post-configure.d/99-refresh-completed#L10

That signal means the configuration has finished, but it first calls os-collect-config, which requires the appropriate Config resource to exist, so it fails.
I actually get an authorization error from os-collect-config when the resource (e.g. NovaCompute2Config) is not there, and it works when the resource is there.

Zane Bitter (zaneb)
Changed in heat:
status: Triaged → In Progress
milestone: next → juno-2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/100047

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/100047
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=40693910beb336518bf06d7e0621b7b14414fc84
Submitter: Jenkins
Branch: master

commit 40693910beb336518bf06d7e0621b7b14414fc84
Author: Zane Bitter <email address hidden>
Date: Mon Jun 16 11:05:18 2014 -0400

    Update: persist current template on change

    After each resource is updated, persist the modified template to the
    database so that any new resources will be accessible to API calls -
    including signals to new WaitConditionHandles. This also ensures that if
    the Heat engine dies completely during an update, the template in the
    database will still be in a consistent state.

    Change-Id: Ie6f234302cf72213d4b0e1f5b963cd8def422498
    Closes-Bug: #1291905
    Closes-Bug: #1206702
    Closes-Bug: #1328342
    Implements: partial-blueprint update-failure-recovery

Changed in heat:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/101288

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/101288
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=3ff80e61285b763a9d625692f1d790b9a36dd380
Submitter: Jenkins
Branch: master

commit 3ff80e61285b763a9d625692f1d790b9a36dd380
Author: Zane Bitter <email address hidden>
Date: Fri Jun 27 16:22:45 2014 -0400

    Pass the context when updating raw_templates

    Without a context, the raw_template was being stored using a different DB
    session, with the result that changes did not end up in the database.

    Also, don't use the same raw_template DB entry for the backup stack, since
    that causes competing modifications to the DB that break rollback.

    Change-Id: I6c71c19acac0b87943f36c57c2300cf2b0478aa3
    Closes-Bug: #1331872
    Related-Bug: #1291905
    Related-Bug: #1206702
    Related-Bug: #1328342

Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: juno-2 → 2014.2
Revision history for this message
Ladislav Smola (lsmola) wrote :

Fixed by Heat. We have another issue with ResourceGroup now; I will create another bug for it.

Changed in tripleo:
status: In Progress → Fix Released