Heat: cfn API not returning Metadata. tripleo-ci: Failing to connect to overcloud controller

Bug #1337772 reported by Derek Higgins
This bug affects 2 people
Affects          Status         Importance   Assigned to    Milestone
OpenStack Heat   Fix Released   High         Steven Hardy
tripleo          Invalid        Critical     Unassigned

Bug Description

As of 2014-07-03 16:00 GMT (approx), ci overcloud job is failing

e.g. http://logs.openstack.org/19/104619/1/check-tripleo/check-tripleo-overcloud-f20/cb88d8d/console.html
2014-07-03 18:51:32.322 | + init-keystone -o 192.0.2.3 -t 7dfcb6eab99831cf32e7a7b4ab93ae9b734ddc0d -e admin.example.com -p 1b2b5e82c77b8704d35f7e8491927859d8f9eb27 -u heat-admin
2014-07-03 19:04:36.783 | keystoneclient.openstack.common.apiclient.exceptions.ConnectionRefused: Unable to establish connection to http://192.0.2.3:35357/v2.0/OS-KSADM/roles

get_state_from_hosts is failing to get logs from overcloud-controller0, so we can't ssh to the node (although it does get to an active state and the wait for it to finish completes)

Tags: ci
Derek Higgins (derekh) wrote :

Reproduced what looks like the same problem locally

o-r-c completes successfully and sends the SUCCESS signal to heat, and then:

Jul 04 08:38:41 undercloud-undercloud-il6jftz7thss os-collect-config[711]: INFO:os-refresh-config:Completed phase migration
Jul 04 08:38:41 undercloud-undercloud-il6jftz7thss sudo[4910]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovs-ofctl dump-flows br-int table=23
Jul 04 08:38:42 undercloud-undercloud-il6jftz7thss os-collect-config[711]: 2014-07-04 08:38:42.719 711 CRITICAL os-collect-config [-] expected string or buffer
Jul 04 08:38:42 undercloud-undercloud-il6jftz7thss systemd[1]: os-collect-config.service: main process exited, code=exited, status=1/FAILURE
Jul 04 08:38:42 undercloud-undercloud-il6jftz7thss dhclient[3210]: Received signal 15, initiating shutdown.
Jul 04 08:38:42 undercloud-undercloud-il6jftz7thss dhclient[3210]: DHCPRELEASE on br-ctlplane to 192.0.2.2 port 67 (xid=0x2a508744)
Jul 04 08:38:43 undercloud-undercloud-il6jftz7thss sudo[4937]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovs-ofctl dump-flows br-int table=23
Jul 04 08:38:44 undercloud-undercloud-il6jftz7thss ntpd[3615]: Deleting interface #3 br-ctlplane, 192.0.2.3#123, interface stats: received=0, sent=0, dropped=0, active_time=57 secs
Jul 04 08:38:45 undercloud-undercloud-il6jftz7thss sudo[4940]: neutron : TTY=unknown ; PWD=/ ; USER=root ; COMMAND=/usr/bin/neutron-rootwrap /etc/neutron/rootwrap.conf ovs-ofctl dump-flows br-int table=23
Jul 04 08:38:46 undercloud-undercloud-il6jftz7thss systemd[1]: Unit os-collect-config.service entered failed state.

Derek Higgins (derekh) wrote :

Ok, I think the IP getting lost is just a symptom of os-collect-config crashing (systemd kills dhclient).

The error I think we should be looking at is
CRITICAL os-collect-config [-] expected string or buffer

which is coming from here

http://git.openstack.org/cgit/openstack/os-collect-config/tree/os_collect_config/cfn.py?id=1dc89292dc998b3a1883b0e5b405caa4bb718a04#n121
    value = json.loads(sub_element.text)

sub_element.text is None

The first time o-r-c runs, the XML stack description contains:
<Metadata>{.json metadata.}</Metadata>

Later, after o-r-c has run successfully, the Metadata element in the XML is empty and o-c-c crashes:
<Metadata/>
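
A minimal sketch of the failure mode described above, using only the Python standard library. The wrapping element name (Resource), the sample payload, and the parse_metadata helper are assumptions made for illustration; this is not the os-collect-config code. It shows that an empty <Metadata/> element has text set to None, that json.loads(None) raises TypeError (worded "expected string or buffer" on the Python 2 interpreters of the time), and one possible guard.

```python
import json
import xml.etree.ElementTree as ET


def parse_metadata(xml_text):
    """Hypothetical helper: decode the Metadata payload, or return None if empty."""
    root = ET.fromstring(xml_text)
    element = root.find("Metadata")
    if element is None or element.text is None:
        # <Metadata/> parses to an element whose .text is None, and
        # json.loads(None) raises TypeError, so bail out before decoding.
        return None
    return json.loads(element.text)


# Assumed document shapes, standing in for the two responses quoted above.
populated = '<Resource><Metadata>{"key": "value"}</Metadata></Resource>'
empty = "<Resource><Metadata/></Resource>"

print(parse_metadata(populated))
print(parse_metadata(empty))
```

Whether a collector should skip an empty response, as here, or treat it as an error is a design choice; the guard only avoids the unhandled exception.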

Derek Higgins (derekh) wrote :

I've reproduced locally, the problem goes away when I set heat back a few commits
DIB_REPOREF_heat=706f7289fa859fa4561b3fbb163f60b63efbd746

Clint Byrum (clint-fewbar) wrote :

Looks like signal handling was changed and signal handling is wiping out metadata.

Note this may also be a bug in os-collect-config as it should not be running with that CRITICAL.

summary: - ci: Failing to connect to overcloud controller
+ Heat: cfn API not returning Metadata. tripleo-ci: Failing to connect to
+ overcloud controller
Changed in heat:
status: New → Triaged
importance: Undecided → Critical
OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/104938

Changed in heat:
assignee: nobody → Clint Byrum (clint-fewbar)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/104939

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/104940

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/104938
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=26ce899edf7ddfd43a8259373944f5671e3f6994
Submitter: Jenkins
Branch: master

commit 26ce899edf7ddfd43a8259373944f5671e3f6994
Author: Clint Byrum <email address hidden>
Date: Fri Jul 4 10:16:23 2014 -0700

    Revert "Refactor waitcondition resources to allow easier subclassing"

    This reverts commit 850ac0c1b552d2bae91d5718df56c3e3a739b2f5.

    One of the prior two commits prior are causing Metadata to come back empty
    after signals are processed. This must be reverted because it conflicts
    with those commits.

    Change-Id: I37c11e3b7e4c7528ebd21736ed4a32288ff4e77e
    Partial-Bug: #1337772

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/104939
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=f16a0f64a3454145d22ce794054879edfd53a8ef
Submitter: Jenkins
Branch: master

commit f16a0f64a3454145d22ce794054879edfd53a8ef
Author: Clint Byrum <email address hidden>
Date: Fri Jul 4 10:18:51 2014 -0700

    Revert "Update waitcondition API to use signal RPC interface"

    This reverts commit f2f2697c9d8f926869f93f34c0a6df0df7cba20a.

    This and/or the previous commit are causing Metadata to return blank
    after signals are processed.

    Change-Id: I1c8b3f18e63d9c74b00b27fc32c5d5746faa8235
    Partial-Bug: #1337772

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/104940
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=896fe97c80ff629c8cd202ce5e8bba3dbef61740
Submitter: Jenkins
Branch: master

commit 896fe97c80ff629c8cd202ce5e8bba3dbef61740
Author: Clint Byrum <email address hidden>
Date: Fri Jul 4 10:19:39 2014 -0700

    Revert "Convert WaitConditionHandle to use handle_signal"

    This reverts commit a60f2722f8e084c4d68d2e439c6cc7c1d049b696.

    This and/or the next commit (f2f2697c9d8f926869f93f34c0a6df0df7cba20a)
    are causing Metadata to return blank after signals are processed.

    Change-Id: Ic4e068fd1e26e51e826055cebed6a392a5595bf9
    Partial-Bug: #1337772

Steven Hardy (shardy) wrote :

Hmm, sorry for the breakage, my local tests did not catch this issue unfortunately (for reference, would check experimental have caught this?)

Is the problem just that we don't do the stack-wide metadata refresh in resource_signal?

https://github.com/openstack/heat/blob/master/heat/engine/service.py#L1023

If you can confirm that check-experimental will catch this, I'll add that update logic to resource_signal and re-post the reverted patches, then we can check if the regression you encountered is resolved.

Clint Byrum (clint-fewbar) wrote :

NP Steven, happens to the best of us. check experimental did expose this issue. I'll be building a test on top of the tempest test that tests os-collect-config as well.

Let's be careful about the metadata churn in signals. We expect to have a lot of signals coming in during stack creation and updates, so it would be better if we don't have to do a stack-wide operation just because a single resource received a signal.

Changed in heat:
status: In Progress → Triaged
assignee: Clint Byrum (clint-fewbar) → Steven Hardy (shardy)
Changed in tripleo:
status: Triaged → Invalid
Steven Hardy (shardy) wrote :

Ok, I'm working on a fix which should allow us to reinstate the reverted patches, but since the troublesome patches have now been reverted, I'll drop this to high.

Basically, as mentioned above, I think the resource_signal interface needs to trigger a metadata refresh, like metadata_update does. I propose we copy the existing code initially, then work on an optimisation where only the dependent resources are updated, to mitigate the performance concerns raised by Clint.
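
To illustrate the proposed optimisation, here is a rough sketch of selecting only the resources that depend on a signalled resource instead of refreshing the whole stack. Everything in it is hypothetical: the required_by mapping, the resource names, and the dependents_of helper are invented for the example and do not correspond to Heat's engine code.

```python
from collections import deque


def dependents_of(signalled, required_by):
    """Return every resource that directly or transitively depends on `signalled`."""
    seen = set()
    queue = deque([signalled])
    while queue:
        name = queue.popleft()
        for dependent in required_by.get(name, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical stack: each key maps to the resources that reference it.
required_by = {
    "wait_handle": ["wait_condition"],
    "wait_condition": ["server_b"],
    "server_a": [],
    "server_b": [],
}

# Only these resources would need their metadata re-resolved after the signal.
to_refresh = dependents_of("wait_handle", required_by)
print(sorted(to_refresh))
```

With this shape, a signal on a resource that nothing references yields an empty set, so no refresh work is done for it.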

Probably things only work by luck at the moment, because in most cases signals don't directly affect resource metadata, whereas after my patches they do.

Changed in heat:
importance: Critical → High
status: Triaged → In Progress
milestone: none → juno-2
Steven Hardy (shardy) wrote :

Hmm, looking more closely, my analysis above is incorrect, as I already copied the metadata refresh logic in the original patch:

https://review.openstack.org/#/c/101351/4/heat/engine/service.py

Currently trying to figure out how to reproduce this locally.

Steven Hardy (shardy) wrote :

Oh! Is os-collect-config relying on this interface returning the metadata after it's set?

https://github.com/openstack/heat/blob/master/heat/engine/service.py#L1034

OpenStack Infra (hudson-openstack) wrote : Fix proposed to heat (master)

Fix proposed to branch: master
Review: https://review.openstack.org/105470

Steven Hardy (shardy) wrote :

Turns out the issue was that OS::Nova::Server overwrites the deployments metadata (which is set in the resource, not the template), and I couldn't reproduce it because I wasn't testing with SoftwareDeployment resources.

Patch proposed above which I believe fixes the problem, test feedback appreciated! :)

OpenStack Infra (hudson-openstack) wrote : Fix merged to heat (master)

Reviewed: https://review.openstack.org/105470
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=1bade6da7cbacd91d15b450a8d92983341cf5e03
Submitter: Jenkins
Branch: master

commit 1bade6da7cbacd91d15b450a8d92983341cf5e03
Author: Steven Hardy <email address hidden>
Date: Tue Jul 8 15:29:36 2014 +0100

    Don't overwrite deployments metadata in Server resource

    Currently, when metadata_update is called and local metadata has been
    added for software deployments (e.g stuff not defined in the template)
    we will silently discard the non-template-defined stuff when calling
    metadata_update.

    Arguably we are abusing the metadata section by pushing arbitrary additional
    data into it but since that issue has already happened, as part of the
    SoftwareDeployments implementation, the simplest fix seems to be to merge
    the current and template re-resolved state of the metadata.

    Change-Id: Id68d11b584734fbb828aabaca6cc75aab0f4ee4c
    Closes-Bug: #1337772
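
A minimal sketch of the merge described in the commit message above, assuming a shallow dict merge in which template-resolved values win on key conflicts. The merge_metadata helper and the sample keys are hypothetical, and the actual Heat change may differ in detail.

```python
def merge_metadata(current, template_resolved):
    """Hypothetical helper: keep non-template keys, refresh template-defined ones."""
    merged = dict(current)            # copy so the stored metadata is not mutated
    merged.update(template_resolved)  # template-defined keys take the fresh value
    return merged


# "deployments" stands in for data added outside the template.
current = {
    "deployments": [{"name": "config-1"}],
    "os-collect-config": {"polling_interval": 30},
}
# Re-resolving the template only yields the template-defined section.
template_resolved = {"os-collect-config": {"polling_interval": 60}}

merged = merge_metadata(current, template_resolved)
print(merged)
```

Replacing the stored metadata with template_resolved alone would drop the "deployments" key, which matches the empty or incomplete metadata reported earlier in this bug.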

Changed in heat:
status: In Progress → Fix Committed
Changed in heat:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in heat:
milestone: juno-2 → 2014.2