6 upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment, SoftwareDeployment resources signals get response code 400

Bug #1699463 reported by Jiří Stránský
This bug affects 1 person
Affects         Status         Importance  Assigned to   Milestone
OpenStack Heat  Fix Released   High        Zane Bitter
tripleo         Fix Released   Medium      James Slagle

Bug Description

All our upgrade jobs (both containerized and non-containerized) get stuck on UpgradeInitDeployment. This is not intermittent; the failure rate is 100%.

tags: added: alert
Revision history for this message
Jiří Stránský (jistr) wrote :

From containerized upgrade job:

http://logs.openstack.org/52/475952/1/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/32c9192/logs/subnode-2/var/log/messages.txt.gz#_Jun_21_00_39_25

Jun 21 00:39:25 centos-7-2-node-osic-cloud1-s3500-9417693-648176 os-collect-config: [2017-06-21 00:39:25,589] (heat-config) [INFO]
Jun 21 00:39:25 centos-7-2-node-osic-cloud1-s3500-9417693-648176 os-collect-config: [2017-06-21 00:39:25,589] (heat-config) [DEBUG] [2017-06-21 00:39:25,068] (heat-config-notify) [DEBUG] Signaling to http://192.168.24.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A868043c70bc64d8bb6cb3b189b169583%3Astacks%2Fovercloud-Controller-3qt54xb32c7d-0-yifjre2binay%2Fcc9edd2a-990d-4620-b030-f0e3a897cc7c%2Fresources%2FControllerUpgradeInitDeployment?Timestamp=2017-06-20T23%3A58%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=24e7fe4e5297497ca11100b27736f3f3&SignatureVersion=2&Signature=%2FtPvTWy9711dBODRB2RbLkLCgMT8NWzpDj4W4%2B1tB1E%3D via POST
Jun 21 00:39:25 centos-7-2-node-osic-cloud1-s3500-9417693-648176 os-collect-config: [2017-06-21 00:39:25,556] (heat-config-notify) [DEBUG] Response <Response [400]>
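
The signal URL in the log above is percent-encoded, which makes it hard to see which stack and resource the 400 belongs to. A minimal sketch for taking it apart with the Python standard library follows. The helper name `describe_signal_url` and the returned keys are made up for illustration, and the ARN layout is inferred from this one log line only, not from Heat documentation.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Signal URL copied from the heat-config-notify log line above.
SIGNAL_URL = (
    "http://192.168.24.1:8000/v1/signal/"
    "arn%3Aopenstack%3Aheat%3A%3A868043c70bc64d8bb6cb3b189b169583"
    "%3Astacks%2Fovercloud-Controller-3qt54xb32c7d-0-yifjre2binay"
    "%2Fcc9edd2a-990d-4620-b030-f0e3a897cc7c"
    "%2Fresources%2FControllerUpgradeInitDeployment"
    "?Timestamp=2017-06-20T23%3A58%3A12Z&SignatureMethod=HmacSHA256"
    "&AWSAccessKeyId=24e7fe4e5297497ca11100b27736f3f3&SignatureVersion=2"
    "&Signature=%2FtPvTWy9711dBODRB2RbLkLCgMT8NWzpDj4W4%2B1tB1E%3D"
)


def describe_signal_url(url):
    """Split a signal URL into the pieces that matter when debugging.

    Hypothetical helper: assumes the last path segment is an encoded ARN of
    the form arn:openstack:heat::<tenant>:stacks/<name>/<id>/resources/<res>.
    """
    parts = urlsplit(url)
    # Slashes inside the ARN are encoded as %2F, so the ARN is one path segment.
    arn = unquote(parts.path.rsplit("/", 1)[-1])
    fields = arn.split(":", 5)
    _, stack_name, stack_id, _, resource = fields[5].split("/")
    query = parse_qs(parts.query)
    return {
        "endpoint": parts.netloc,
        "tenant": fields[4],
        "stack_name": stack_name,
        "stack_id": stack_id,
        "resource": resource,
        "timestamp": query["Timestamp"][0],
    }


info = describe_signal_url(SIGNAL_URL)
for key, value in info.items():
    print(f"{key}: {value}")
```

The decoded resource name and stack name can then be matched against the event list in the upgrade console log.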

==========================================

http://logs.openstack.org/52/475952/1/check/gate-tripleo-ci-centos-7-containers-multinode-upgrades-nv/32c9192/logs/undercloud/home/jenkins/overcloud_upgrade_console.log.txt.gz#_2017-06-21_01_55_41

2017-06-21 00:39:20 | 2017-06-21 00:39:14Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.StorageMgmtPort]: UPDATE_COMPLETE state changed
2017-06-21 00:39:20 | 2017-06-21 00:39:14Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.TenantPort]: UPDATE_COMPLETE state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:14Z [overcloud-Controller-3qt54xb32c7d-0-Heat Stack update failed.
2017-06-21 01:55:41 | Heat Stack update failed.
2017-06-21 01:55:41 | yifjre2binay.InternalApiPort]: UPDATE_COMPLETE state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:14Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.NetIpMap]: UPDATE_IN_PROGRESS state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:15Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.NetworkConfig]: UPDATE_IN_PROGRESS state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:17Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.NetworkConfig]: UPDATE_COMPLETE state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:17Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.NetIpMap]: UPDATE_COMPLETE state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:17Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.UpdateDeployment]: UPDATE_IN_PROGRESS state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:17Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.ControllerUpgradeInitDeployment]: UPDATE_IN_PROGRESS state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:17Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.ControllerConfig]: UPDATE_IN_PROGRESS state changed
2017-06-21 01:55:41 | 2017-06-21 00:39:17Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.ControllerConfig]: UPDATE_COMPLETE The Resource ControllerConfig requires replacement.
2017-06-21 01:55:41 | 2017-06-21 00:39:18Z [overcloud-Controller-3qt54xb32c7d-0-yifjre2binay.ControllerConfig]: CR...


Revision history for this message
Jiří Stránský (jistr) wrote :

I've added Heat to the bug because it looks like it might be a Heat bug. From what I'm able to see, the time when we started hitting this bug coincides with a recent RDO repo promotion (meaning TripleO CI started using newer Heat packages).

Also, the logs suggest that resources other than SoftwareDeployment are able to report progress to heat-cfn-api just fine. Only SoftwareDeployment resources seem to be getting stuck.

This is very reminiscent of bug 1328342.

summary: - Upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment
+ Six upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment
Revision history for this message
Jiří Stránský (jistr) wrote : Re: Six upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment

This also affects the 4 scenario upgrade jobs we have.

summary: - Six upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment
+ 6 upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment
summary: - 6 upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment
+ 6 upgrade jobs (cont. and non-cont.) stuck on UpgradeInitDeployment,
+ SoftwareDeployment resources signals get response code 400
Revision history for this message
Thomas Herve (therve) wrote :

So yeah, the CFN signals to ControllerUpgradeInitDeployment and UpdateDeployment return a 400, but there is no detail at all in the Heat logs :/. Some of the signals pass earlier on, so it's not a systematic issue, but for now I'm not able to find anything revealing in the logs.
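
Since the client side only logged `Response [400]`, the reason for the rejection has to come from the response body. A minimal sketch for pulling the error code and the first line of the message out of such a body follows, assuming it is XML shaped like the `ErrorResponse` quoted later in this report and carries no XML namespace. `summarize_error_body` is a hypothetical helper, and the sample body is abbreviated from that quote.

```python
import xml.etree.ElementTree as ET


def summarize_error_body(body):
    """Return (code, first line of message) from a CFN-style XML error body.

    Hypothetical helper: assumes <ErrorResponse><Error> with <Message> and
    <Code> children and no XML namespace.
    """
    root = ET.fromstring(body)
    error = root.find("Error")
    if error is None:
        return None, ""
    code = error.findtext("Code")
    message = (error.findtext("Message") or "").strip()
    # The message may carry a full traceback; the first line names the cause.
    first_line = message.splitlines()[0] if message else ""
    return code, first_line


# Abbreviated from the error body quoted in this report.
SAMPLE_BODY = (
    "<ErrorResponse><Error><Message>A bad or out-of-range value was supplied:"
    "resources.DeploymentActions.properties.value.if: "
    'Invalid condition "server_not_blacklisted"\n'
    "Traceback (most recent call last):\n"
    "  (frames omitted)\n"
    "</Message><Code>InvalidParameterValue</Code><Type>Sender</Type>"
    "</Error></ErrorResponse>"
)

code, reason = summarize_error_body(SAMPLE_BODY)
print(code, "|", reason)
```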

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-heat-templates (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/476519

Revision history for this message
Zane Bitter (zaneb) wrote :

Summarising the findings of Thomas's investigation: when convergence is disabled in Heat, during an update we copy resources from the new template to the existing template one at a time as we update them. Only at the end of a successful update will the entire template be copied over. So if a conditional is added in a template update, it will not be accessible by any resources that reference it in "if" functions. (Also, if the definition of an _existing_ conditional changes, it'll continue to use the old one.) If the stack update succeeds then there's no long-term problem, but if it fails or if you're trying to load the stack in the meantime (e.g. to signal a resource, as in this case) then we'll run into problems:

 **************************************** BODY
 <ErrorResponse><Error><Message>A bad or out-of-range value was supplied:resources.DeploymentActions.properties.value.if: Invalid condition "server_not_blacklisted"
 Traceback (most recent call last):

   File "/usr/lib/python2.7/site-packages/heat/common/context.py", line 399, in wrapped
     return func(self, ctx, *args, **kwargs)

   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1853, in resource_signal
     self._verify_stack_resource(stack, resource_name)

   File "/usr/lib/python2.7/site-packages/heat/engine/service.py", line 1794, in _verify_stack_resource
     resource = stack.resource_get(resource_name)

   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 390, in resource_get
     res = self.resources.get(name)

   File "/usr/lib/python2.7/site-packages/heat/engine/stack.py", line 308, in resources
     res_defns = self.t.resource_definitions(self)

   File "/usr/lib/python2.7/site-packages/heat/engine/hot/template.py", line 261, in resource_definitions
     return dict(defns())

   File "/usr/lib/python2.7/site-packages/heat/engine/hot/template.py", line 240, in defns
     snippet))

   File "/usr/lib/python2.7/site-packages/heat/engine/hot/template.py", line 495, in _rsrc_defn_args
     data):

   File "/usr/lib/python2.7/site-packages/heat/engine/template_common.py", line 87, in _rsrc_defn_args
     name, data, parse))

   File "/usr/lib/python2.7/site-packages/heat/engine/template_common.py", line 56, in _parse_resource_field
     key]))

   File "/usr/lib/python2.7/site-packages/heat/engine/template.py", line 252, in parse
     return parse(self.functions, stack, snippet, path, self)

   File "/usr/lib/python2.7/site-packages/heat/engine/template.py", line 349, in parse
     for k, v in six.iteritems(snippet))

   File "/usr/lib/python2.7/site-packages/heat/engine/template.py", line 349, in &lt;genexpr&gt;
     for k, v in six.iteritems(snippet))

   File "/usr/lib/python2.7/site-packages/heat/engine/template.py", line 346, in parse
     message=six.text_type(e))

 StackValidationFailed: resources.DeploymentActions.properties.value.if: Invalid condition "server_not_blacklisted"
 </Message><Code>InvalidParameterValue</Code><Type>Sender</Type></Error></ErrorResponse>

This problem is as old as conditionals in Heat, but since the TripleO upgrades job is non-gating it was possible to merge in new conditionals and resour...
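
The failure mode summarised above can be reproduced with a toy model. This is a sketch under stated assumptions, not Heat's code: templates are plain dicts, `resolve` and `load_stack` are hypothetical stand-ins for template parsing, and the property values are invented. Only the resource name `DeploymentActions` and the condition name `server_not_blacklisted` come from the error output.

```python
import copy


class InvalidCondition(Exception):
    """Raised when an "if" refers to a condition the template does not define."""


def resolve(snippet, conditions):
    """Recursively evaluate {"if": [name, then, else]} against known conditions."""
    if isinstance(snippet, dict):
        if set(snippet) == {"if"}:
            name, if_true, if_false = snippet["if"]
            if name not in conditions:
                raise InvalidCondition('Invalid condition "%s"' % name)
            chosen = if_true if conditions[name] else if_false
            return resolve(chosen, conditions)
        return {k: resolve(v, conditions) for k, v in snippet.items()}
    if isinstance(snippet, list):
        return [resolve(v, conditions) for v in snippet]
    return snippet


def load_stack(template):
    """Loading a stack parses every resource definition, not just the target."""
    return {
        name: resolve(defn, template["conditions"])
        for name, defn in template["resources"].items()
    }


# Old template: no conditions at all (values are illustrative).
old = {
    "conditions": {},
    "resources": {
        "DeploymentActions": {"value": ["CREATE", "UPDATE"]},
        "ControllerUpgradeInitDeployment": {"config": "init"},
    },
}
# New template: adds a condition and uses it in DeploymentActions.
new = {
    "conditions": {"server_not_blacklisted": True},
    "resources": {
        "DeploymentActions": {
            "value": {"if": ["server_not_blacklisted", ["CREATE", "UPDATE"], []]}
        },
        "ControllerUpgradeInitDeployment": {"config": "init"},
    },
}

# Mid-update state: one resource definition was copied over, but the
# conditions section is only replaced once the whole update succeeds.
stored = copy.deepcopy(old)
stored["resources"]["DeploymentActions"] = copy.deepcopy(
    new["resources"]["DeploymentActions"]
)

try:
    load_stack(stored)  # what signalling any resource has to do first
    load_error = None
except InvalidCondition as exc:
    load_error = str(exc)

print(load_error)
```

In this model the signal targets `ControllerUpgradeInitDeployment`, yet loading fails on `DeploymentActions`, which mirrors the traceback: the whole template is parsed before the signalled resource is looked up.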


Zane Bitter (zaneb)
Changed in heat:
importance: Undecided → High
status: New → Triaged
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to heat (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/476697

Changed in heat:
assignee: nobody → Zane Bitter (zaneb)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/476519
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=69936229f4def703cd44ab164d8d1989c9fa37cb
Submitter: Jenkins
Branch: master

commit 69936229f4def703cd44ab164d8d1989c9fa37cb
Author: Alex Schultz <email address hidden>
Date: Thu Jun 22 13:35:19 2017 +0000

    Revert "Blacklist support for ExtraConfig"

    This reverts commit d6c0979eb3de79b8c3a79ea5798498f0241eb32d.

    This seems to be causing issues in Heat in upgrades.

    Change-Id: I379fb2133358ba9c3c989c98a2dd399ad064f706
    Related-Bug: #1699463

Thomas Herve (therve)
Changed in heat:
milestone: none → pike-3
Changed in tripleo:
importance: Critical → Medium
milestone: none → pike-3
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/477542

Changed in tripleo:
assignee: nobody → James Slagle (james-slagle)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/476622
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ba0b570054f89c72d261e2e2d0aedd55c4fe2344
Submitter: Jenkins
Branch: master

commit ba0b570054f89c72d261e2e2d0aedd55c4fe2344
Author: Zane Bitter <email address hidden>
Date: Thu Jun 22 12:18:03 2017 -0400

    Resolve Macros when copying templates

    During stack updates in the legacy path, we copy resource definitions
    back and forth between templates. In the case where the definitions
    contain macros (which in practice means the If macro for conditionals),
    they may rely on external state (in practice, the conditional
    definitions) that is not available in the template they're being copied
    into (e.g. in the case of an If macro referencing a new condition).

    This change means that when we copy a template, the macros get resolved
    so that only the chosen path of the If macro is represented.

    This resolves the issue where trying to signal a resource during an
    update fails when one of the already-updated resources in the template
    contains an If macro that refers to a condition definition that is
    newly-added in the new template.

    Change-Id: I6d08507f43b0fcc4c0b5e848e97fa26033d839b2
    Closes-Bug: #1699463
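
The idea in the commit message above can be illustrated with a small sketch: resolve each If macro against the source template's conditions at copy time, so the copied definition no longer depends on them. This is a toy model with plain dicts, not the actual Heat patch; `resolve_macros` and the sample values are hypothetical.

```python
def resolve_macros(snippet, conditions):
    """Replace every {"if": [name, then, else]} with its chosen branch."""
    if isinstance(snippet, dict):
        if set(snippet) == {"if"}:
            name, if_true, if_false = snippet["if"]
            if name not in conditions:
                raise ValueError('Invalid condition "%s"' % name)
            chosen = if_true if conditions[name] else if_false
            return resolve_macros(chosen, conditions)
        return {k: resolve_macros(v, conditions) for k, v in snippet.items()}
    if isinstance(snippet, list):
        return [resolve_macros(v, conditions) for v in snippet]
    return snippet


# Conditions and definition as they exist in the new template (illustrative).
new_conditions = {"server_not_blacklisted": True}
new_defn = {"value": {"if": ["server_not_blacklisted", ["CREATE", "UPDATE"], []]}}

# Copy with macros resolved: only the chosen branch is carried over.
copied = resolve_macros(new_defn, new_conditions)

# The stored template still has no conditions, yet the copied definition
# parses, because no "if" is left in it.
reloaded = resolve_macros(copied, {})

print(copied)
```

The trade-off in this model is that the copied definition loses the conditional and keeps only its outcome, which is acceptable for an intermediate template that is replaced wholesale once the update succeeds.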

Changed in heat:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to heat (master)

Reviewed: https://review.openstack.org/476697
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=b6692d6152db0cec7c9b36f9a2f7a727a960d2d7
Submitter: Jenkins
Branch: master

commit b6692d6152db0cec7c9b36f9a2f7a727a960d2d7
Author: Thomas Herve <email address hidden>
Date: Thu Jun 22 22:40:50 2017 +0200

    Add functional test for conditions during updates

    Change-Id: I64dab0e6ec6f5758ccba936b007f3453fb847f8f
    Depends-On: I6d08507f43b0fcc4c0b5e848e97fa26033d839b2
    Related-Bug: #1699463

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/480625

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/477542
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=11b3cb25a9884d8eaed0544a5ef2862e4a046652
Submitter: Jenkins
Branch: master

commit 11b3cb25a9884d8eaed0544a5ef2862e4a046652
Author: James Slagle <email address hidden>
Date: Mon Jun 26 09:48:34 2017 -0400

    Revert "Revert "Blacklist support for ExtraConfig""

    There is a Heat patch posted (via Depends-On) that resolves the issue
    that caused this to be reverted. This reverts the revert and we need to
    make sure all the upgrades jobs pass before we merge this patch.

    This reverts commit 69936229f4def703cd44ab164d8d1989c9fa37cb.
    Closes-Bug: #1699463
    implements blueprint disable-deployments

    Change-Id: Iedf680fddfbfc020d301bec8837a0cb98d481eb5

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 7.0.0.0b3

This issue was fixed in the openstack/tripleo-heat-templates 7.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 9.0.0.0b3

This issue was fixed in the openstack/heat 9.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (stable/ocata)

Reviewed: https://review.openstack.org/480625
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=6e8534bef01cc068f02adc061c1f8a478402d522
Submitter: Jenkins
Branch: stable/ocata

commit 6e8534bef01cc068f02adc061c1f8a478402d522
Author: Zane Bitter <email address hidden>
Date: Thu Jun 22 12:18:03 2017 -0400

    Resolve Macros when copying templates

    During stack updates in the legacy path, we copy resource definitions
    back and forth between templates. In the case where the definitions
    contain macros (which in practice means the If macro for conditionals),
    they may rely on external state (in practice, the conditional
    definitions) that is not available in the template they're being copied
    into (e.g. in the case of an If macro referencing a new condition).

    This change means that when we copy a template, the macros get resolved
    so that only the chosen path of the If macro is represented.

    This resolves the issue where trying to signal a resource during an
    update fails when one of the already-updated resources in the template
    contains an If macro that refers to a condition definition that is
    newly-added in the new template.

    Change-Id: I6d08507f43b0fcc4c0b5e848e97fa26033d839b2
    Closes-Bug: #1699463
    (cherry picked from commit ba0b570054f89c72d261e2e2d0aedd55c4fe2344)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/heat 8.0.4

This issue was fixed in the openstack/heat 8.0.4 release.
