M/N upgrades UPDATE_FAILED .enabled_services.list_join: Incorrect arguments to "list_join" should be: "list_join" : [ " ", [ "str1", "str2"]]

Bug #1620696 reported by Michele Baldessari on 2016-09-06
This bug affects 3 people
Affects          Status         Importance   Assigned to
OpenStack Heat   Fix Released   High         Thomas Herve
tripleo          Fix Released   High         mathieu bultel

Bug Description

Just got the following error during an M/N upgrade:
+ openstack overcloud deploy --templates --libvirt-type qemu --control-flavor oooq_control --compute-flavor oooq_compute --ceph-storage-flavor oooq_ceph --timeout 75 --ntp-server clock.redhat.com --control-scale 3 --neutron-network-type vxlan --neutron-tunnel-types vxlan -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e /home/stack/network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-init.yaml -e /home/stack/overcloud-repo.yaml --ntp-server clock.redhat.com -e /tmp/deploy_env.yaml
WARNING: openstackclient.common.utils is deprecated and will be removed after Jun 2017. Please use osc_lib.utils
WARNING: openstackclient.common.exceptions is deprecated and will be removed after Jun 2017. Please use osc_lib.exceptions
Deploying templates in the directory /usr/share/openstack-tripleo-heat-templates
2016-09-06 08:25:36 [overcloud]: UPDATE_IN_PROGRESS Stack UPDATE started
2016-09-06 08:25:48 [ServiceNetMap]: CREATE_IN_PROGRESS state changed

...snip..

2016-09-06 08:29:55 [overcloud-allNodesConfig-swx4hduwudr3]: UPDATE_COMPLETE Stack UPDATE completed successfully
2016-09-06 08:29:55 [1]: SIGNAL_COMPLETE Unknown
2016-09-06 08:29:57 [ControllerDeployment]: SIGNAL_COMPLETE Unknown
2016-09-06 08:29:59 [allNodesConfig]: UPDATE_COMPLETE state changed
2016-09-06 08:30:00 [ControllerSwiftDeployment]: UPDATE_FAILED UPDATE aborted
2016-09-06 08:30:00 [1]: SIGNAL_COMPLETE Unknown
2016-09-06 08:30:00 [UpdateWorkflow]: UPDATE_FAILED UPDATE aborted
2016-09-06 08:30:00 [overcloud]: UPDATE_FAILED .enabled_services.list_join: Incorrect arguments to "list_join" should be: "list_join" : [ " ", [ "str1", "str2"]]
2016-09-06 08:30:00 [0]: CREATE_IN_PROGRESS state changed
2016-09-06 08:30:01 [1]: UPDATE_FAILED UPDATE aborted
2016-09-06 08:30:01 [0]: SIGNAL_IN_PROGRESS Signal: deployment 669d746d-932c-4bd2-9099-cd781127d05a succeeded
2016-09-06 08:30:01 [ComputeDeliverUpgradeScriptDeployment]: CREATE_FAILED CREATE aborted
2016-09-06 08:30:01 [UpgradeInitComputeDeployment]: CREATE_FAILED CREATE aborted
2016-09-06 08:30:01 [0]: UPDATE_FAILED UPDATE aborted
2016-09-06 08:30:01 [UpgradeInitControllerDeployment]: CREATE_FAILED CREATE aborted
2016-09-06 08:30:01 [2]: UPDATE_FAILED UPDATE aborted
2016-09-06 08:30:01 [overcloud-UpdateWorkflow-xrcqsywmwkrr]: UPDATE_FAILED Operation cancelled
2016-09-06 08:30:01 [overcloud-ControllerSwiftDeployment-3zkecnhcjnc4]: UPDATE_FAILED Operation cancelled
2016-09-06 08:30:02 [2]: CREATE_IN_PROGRESS state changed
2016-09-06 08:30:06 [2]: SIGNAL_COMPLETE Unknown
Stack overcloud UPDATE_FAILED
Heat Stack update failed.

This is during the major-upgrade-pacemaker-init.yaml step. The repos I used to upgrade:
# undercloud
export CURRENT_VERSION=mitaka
export NEW_VERSION=newton
mkdir /home/stack/REPOBACKUP
sudo mv /etc/yum.repos.d/delorean* /home/stack/REPOBACKUP/
sudo curl -o /etc/yum.repos.d/delorean-$NEW_VERSION.repo http://trunk.rdoproject.org/centos7-$NEW_VERSION/current/delorean.repo
sudo curl -o /etc/yum.repos.d/delorean-deps-$NEW_VERSION.repo http://trunk.rdoproject.org/centos7-$NEW_VERSION/delorean-deps.repo
sudo yum clean all
sudo yum repolist

# overcloud
cat > ~/overcloud-repo.yaml <<EOF
parameter_defaults:
  UpgradeInitCommand: |
    set -e
    curl -L -o /etc/yum.repos.d/delorean-deps.repo http://trunk.rdoproject.org/centos7-newton/delorean-deps.repo
    curl -L -o /etc/yum.repos.d/delorean.repo http://trunk.rdoproject.org/centos7-newton/current-passed-ci/delorean.repo
    yum clean all
EOF

The heat-engine logs are not very helpful, even though "debug = True" is set in the conf:
heat-engine.log-20160906:2016-09-06 08:30:00.543 539 INFO heat.engine.stack [req-062cb50d-439b-4fdc-90cb-4c73b5155041 b829ee63ca3d4597abf9327135e30510 d26412647be54c849cb1d30a4e4b3998 - - -] Stack UPDATE FAILED (overcloud): .enabled_services.list_join: Incorrect arguments to "list_join" should be: "list_join" : [ " ", [ "str1", "str2"]]

Michele Baldessari (michele) wrote :

rpm versions:
[root@undercloud heat]# rpm -qa |grep -E "heat|tripleo"
puppet-heat-9.2.0-0.20160901072004.4d7b5be.el7.centos.noarch
openstack-heat-templates-0.0.1-0.20160905224105.ac2db55.el7.centos.noarch
openstack-tripleo-puppet-elements-5.0.0-0.20160902162220.01fb147.el7.centos.noarch
python2-heatclient-1.4.0-0.20160831084943.fb7802e.el7.centos.noarch
python-heat-tests-7.0.0-0.20160906023509.8a2f4dd.el7.centos.noarch
openstack-heat-common-7.0.0-0.20160906023509.8a2f4dd.el7.centos.noarch
openstack-tripleo-image-elements-5.0.0-0.20160811131857.98b9c6a.el7.centos.noarch
openstack-heat-api-7.0.0-0.20160906023509.8a2f4dd.el7.centos.noarch
openstack-tripleo-common-5.0.1-0.20160905143814.6c39473.el7.centos.noarch
openstack-tripleo-0.0.1-0.20160831005944.15f0afe.el7.centos.noarch
openstack-heat-engine-7.0.0-0.20160906023509.8a2f4dd.el7.centos.noarch
python-tripleoclient-5.0.0-0.20160905171224.b0d7ce7.el7.centos.noarch
puppet-tripleo-5.0.0-0.20160905160927.8b0e161.el7.centos.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160906091634.4488b0f.el7.centos.noarch
openstack-heat-api-cfn-7.0.0-0.20160906023509.8a2f4dd.el7.centos.noarch

Michele Baldessari (michele) wrote :

So if I check out tht from git and go back in time to at least this commit:
commit 67d3a774e55a6c27aa19d1f00de9cec4a02e7866
Author: karthik s <email address hidden>
Date: Tue Jun 14 17:44:38 2016 +0530

 Configure the pci_passthrough_whitelist via THT

The command then completes successfully. Once it has, I can go back to the latest master
and it succeeds again. It's as if some fairly recent patch to THT has issues when the stack gets updated the first time.

Michele Baldessari (michele) wrote :

Setting to confirmed since Mathieu is hitting it as well

Changed in tripleo:
status: New → Confirmed
Michele Baldessari (michele) wrote :

So it seems the patch breaking this is:
commit 753131d6b5520552f73c9489274f2bd3c25b9e50
Author: Steven Hardy <email address hidden>
Date: Thu Aug 25 17:39:54 2016 +0100

    Create hiera service_enabled keys for enabled services

I still need to test with a heat version on the undercloud that includes the following bugfix:
https://bugs.launchpad.net/heat/+bug/1617019

Michele Baldessari (michele) wrote :

So I confirm that the following commit breaks the upgrades:
commit 753131d6b5520552f73c9489274f2bd3c25b9e50
Author: Steven Hardy <email address hidden>
Date: Thu Aug 25 17:39:54 2016 +0100

    Create hiera service_enabled keys for enabled services

I also made sure I have a heat version on the undercloud which has the patch for bug
https://bugs.launchpad.net/heat/+bug/1617019, but the issue still persists.

Michele Baldessari (michele) wrote :

So if I dump the arguments in heat/engine/cfn/functions.py where it fails:
class Join(function.Function):
...
         try:
             self._delim, self._strings = self.args
         except ValueError:
+            with open("/tmp/test", "w") as f:
+                for i in args:
+                    f.write("Arg: %s\n" % i)
+
             raise ValueError(_('Incorrect arguments to "%(fn_name)s" '
                                'should be: %(example)s') % fmt_data)

I see the following:
Arg: ,
Arg: <heat.engine.hot.functions.GetAtt {get_attr: [u'ControllerServiceChain', u'role_data', u'service_names']} -> None>
Arg: <heat.engine.hot.functions.GetAtt {get_attr: [u'ComputeServiceChain', u'role_data', u'service_names']} -> None>
Arg: <heat.engine.hot.functions.GetAtt {get_attr: [u'BlockStorageServiceChain', u'role_data', u'service_names']} -> None>
Arg: <heat.engine.hot.functions.GetAtt {get_attr: [u'ObjectStorageServiceChain', u'role_data', u'service_names']} -> None>
Arg: <heat.engine.hot.functions.GetAtt {get_attr: [u'CephStorageServiceChain', u'role_data', u'service_names']} -> None>

So it seems those functions return None and this confuses the function?
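
A minimal sketch of why the dumped arguments trip this check, assuming only what the snippet above shows: Join unpacks its arguments into exactly two names, so a delimiter followed by five separate get_attr entries raises ValueError before the values themselves are looked at. That suggests the number of arguments, rather than the None values, may be what triggers the message. The unpack_join_args helper and the placeholder strings are hypothetical, not Heat code.

```python
def unpack_join_args(args):
    """Mimic the two-name unpacking shown in the Join snippet above."""
    try:
        delim, strings = args
    except ValueError:
        raise ValueError('Incorrect arguments to "list_join"')
    return delim, strings


# Shape the function expects: a delimiter plus ONE list of strings.
ok = unpack_join_args([",", ["a", "b"]])

# Shape dumped to /tmp/test: a delimiter plus five separate entries
# (placeholder strings stand in for the get_attr objects).
dumped = [",", "controller", "compute", "block", "object", "ceph"]
try:
    unpack_join_args(dumped)
    failed = False
except ValueError:
    failed = True
```

With six entries the unpacking fails regardless of what each entry resolves to.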

mathieu bultel (mat-bultel) wrote :

I think I have a fix for this issue.
I'm testing it at the moment in both the upgrade and the native install scenarios.
I'll paste the review here when it's done.

It's a fix in heat itself.

Changed in tripleo:
assignee: nobody → mbu (mat-bultel)
Changed in tripleo:
importance: Undecided → Critical
tags: added: upgrade-bugs
Steven Hardy (shardy) wrote :

We've probably run out of time to fix this for RC1, but it sounds like a release blocker so I've targeted it to RC2.

Changed in tripleo:
milestone: none → newton-rc2
Changed in tripleo:
status: Confirmed → In Progress
mathieu bultel (mat-bultel) wrote :

The real fix is much simpler and more obvious than what I was trying to fix in heat.
This should be enough:
https://review.openstack.org/370069

Maybe heat needs to raise a clearer exception, but that isn't critical or a blocker anymore.

mathieu bultel (mat-bultel) wrote :

This issue no longer reproduces, even without applying the review above.
I'll run one last test to be sure that it's solved, and then I'll close it.

Change abandoned by mathieu bultel (<email address hidden>) on branch: master
Review: https://review.openstack.org/370069
Reason: I am abandoning the review, since I could not reproduce the issue in any of my recent tests (neither could Michele)

mathieu bultel (mat-bultel) wrote :

I have abandoned the review. Could you please close the issue?
Thx

Changed in tripleo:
status: In Progress → Won't Fix
Changed in tripleo:
milestone: newton-rc2 → none

Hi,

I'm in the process of doing an upgrade from RDO Mitaka to RDO Newton and I am hitting this error. It definitely still needs to be fixed

resources.allNodesConfig.properties.enabled_services.list_join: Incorrect arguments to "list_join" should be: "list_join" : [ " ", [ "str1", "str2"]]

I did not hit it the first time I tried a stack update (with major-upgrade-pacemaker-init.yaml) which failed for another issue. I fixed that issue and I am now hitting this one and can't continue further.

Can we please re-open this issue?

Regards,

Graeme

Applying the fix from

https://review.openstack.org/#/c/370069/

does not fix the issue

mathieu bultel (mat-bultel) wrote :

Hi,

Can you please add more information about the versions of the RDO packages you are using, in particular tripleo-heat-templates and tripleoclient?

Thank you in advance.

I am using the latest packages from RDO Newton CBS at

http://mirror.centos.org/centos/7/cloud/$basearch/openstack-newton/

openstack-tripleo-heat-templates-5.0.0-0.3.0rc2.el7.noarch
python-tripleoclient-5.2.1-0.2.2ee369fgit.el7.centos.noarch

Note I am using my own rebuilt version of python-tripleoclient which is running off commit ee369f from stable/newton to work around issue

https://bugs.launchpad.net/tripleo/+bug/1632568

I've also tried using the latest heat from delorean on stable/newton

https://trunk.rdoproject.org/centos7-newton/4e/16/4e16041f1125c3109788a48436c9eb8e62065ae1_efe46336/

openstack-heat-api-7.0.1-0.20161012234819.4e16041.el7.centos.noarch.rpm
openstack-heat-api-cfn-7.0.1-0.20161012234819.4e16041.el7.centos.noarch.rpm
openstack-heat-common-7.0.1-0.20161012234819.4e16041.el7.centos.noarch.rpm
openstack-heat-engine-7.0.1-0.20161012234819.4e16041.el7.centos.noarch.rpm

and the problem persists

mathieu bultel (mat-bultel) wrote :

Hey Gillies,

I don't think you are on the latest tht packages.
I used these:
http://buildlogs.centos.org/centos/7/cloud/x86_64/rdo-trunk-master-tested/

Maybe it would be better to try with this repo and see what happens.

Hi,

Even using the latest tht version

openstack-tripleo-heat-templates-5.0.0-0.4.0rc3.el7.noarch

which was tagged today, I can still reproduce the issue.

Can we please have this bug re-opened?

Dougal Matthews (d0ugal) on 2016-10-28
Changed in tripleo:
status: Won't Fix → Confirmed
Michele Baldessari (michele) wrote :

Graeme, are you still hitting this? I have never seen it again, so I was wondering.

I haven't been attempting M/N upgrades at all recently, so I haven't had a chance to try to reproduce it. If no one else besides me has seen it, I'm happy to close it and re-open it if I see it again.

Regards,

Graeme

Changed in tripleo:
importance: Critical → High

We're also hitting this issue.

We're hitting the same issue as well. I don't know if it's related, but I also have a failed nested stack.

The failed nested stack reports a missing user on the overcloud hosts. This was true the first time, but I have since installed all the packages required for gnocchi so that the gnocchi user exists. If I run the manifest manually on the hosts, it returns 0 (not 6, as overcloud deploy reports).

The command I'm running is this:

openstack overcloud deploy --templates -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /home/stack/scheduler-hints.yaml -e /home/stack/templates/network-environment.yaml -e /home/stack/templates/puppet-ceph-external.yaml --neutron-bridge-mappings datacentre:br-ex,storage-pub:br-stg-pub -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --ntp-server timeserver.company --control-scale 3 --compute-scale 3 --ceph-storage-scale 0 --control-flavor control --compute-flavor compute --neutron-network-type vxlan --neutron-tunnel-types vxlan --verbose --debug --log-file ~stack/log/overcloud_`date +"%F-%T"`.log -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-ceilometer-wsgi-mitaka-newton.yaml

Attached outputs reporting the issue.

Continuing to debug, I found that the problem is related to this piece of the heat template:

      enabled_services:
        list_join:
          - ','
          - {get_attr: [ControllerServiceChain, role_data, service_names]}
          - {get_attr: [ComputeServiceChain, role_data, service_names]}
          - {get_attr: [BlockStorageServiceChain, role_data, service_names]}
          - {get_attr: [ObjectStorageServiceChain, role_data, service_names]}
          - {get_attr: [CephStorageServiceChain, role_data, service_names]}

As you can see, it is a list_join with multiple parameters: the first is the delimiter for the join, and the others are the lists of elements to be joined.
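
For illustration, a rough Python sketch of the behaviour this template relies on: a delimiter followed by several lists, all flattened and joined into one string. The function name join_multiple and the service names are made up for the example, and skipping None is an assumption of the sketch; this is not Heat's implementation.

```python
def join_multiple(delim, *lists):
    """Join the items of several lists into one delimited string."""
    flat = []
    for items in lists:
        if items is None:
            # Assumption for this sketch: an unresolved list adds nothing.
            continue
        flat.extend(items)
    return delim.join(flat)


# Hypothetical per-role service name lists.
controller = ["keystone", "glance_api"]
compute = ["nova_compute"]
result = join_multiple(",", controller, compute, None)
```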

The error I'm getting is an exception raised from heat/engine/hot/functions.py:531 (https://github.com/openstack/heat/blob/stable/newton/heat/engine/hot/functions.py#L531). But as you can see, that line is inside a function designed to join only the elements of a single list, not of multiple lists!

What decides which function to call?

I see that JoinMultiple is called since heat template version 2015-10-15, but in the overcloud bucket in swift there are several heat templates (and not only in user-files) with a version older than that date. See the attached list.

Can this affect the function resolution?

I checked my overcloud template and found this:

[stack@opstrio1101 ~]$ openstack stack template show overcloud | grep heat_template_version
heat_template_version: '2015-04-30'

That heat_template_version maps list_join to Join (https://github.com/openstack/heat/blob/stable/newton/heat/engine/hot/functions.py#L502) and not JoinMultiple (https://github.com/openstack/heat/blob/stable/newton/heat/engine/hot/functions.py#L560)

@David Hill are you still experiencing the issue? I need some confirmation of my findings.
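
The lookup described above can be pictured as a per-version function table. This is a simplified sketch, not Heat's code: the FUNCTIONS_BY_VERSION dict and the resolve helper are hypothetical, and only the two versions and the two class names mentioned in this thread are modelled.

```python
# Hypothetical table: template version -> handler name for each function.
FUNCTIONS_BY_VERSION = {
    "2015-04-30": {"list_join": "Join"},
    "2015-10-15": {"list_join": "JoinMultiple"},
}


def resolve(template, fn_name):
    """Pick the handler from the version recorded in the template itself."""
    version = template["heat_template_version"]
    return FUNCTIONS_BY_VERSION[version][fn_name]


# A template stored with the old version keeps resolving to the old handler,
# even if its body already uses the multi-list syntax.
stored = {"heat_template_version": "2015-04-30"}
updated = {"heat_template_version": "2015-10-15"}
old_handler = resolve(stored, "list_join")
new_handler = resolve(updated, "list_join")
```

In this picture, a multi-list list_join only works if the version recorded on the template is one whose table points at the multi-list handler.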

2015-04-30 comes from the overcloud.yaml used when deploying mitaka

After a lot of debugging inside heat, I still don't understand what makes heat use version 2015-04-30 instead of 2016-10-14 for that specific list_join.
While debugging I saw other calls to list_join that were resolved correctly (maybe from other files, maybe from newly included files).

The only way I found to get past this blocking problem was to change this line:

https://github.com/openstack/heat/blob/stable/newton/heat/engine/hot/template.py#L300

from

'list_join': hot_funcs.Join,

to:

'list_join': hot_funcs.JoinMultiple,

It's a bad workaround, but I haven't seen any other way to continue so far.

Thomas Herve (therve) wrote :

This is a Heat issue, it looks similar to bug 1508096

Thomas Herve (therve) on 2017-04-20
Changed in heat:
status: New → Confirmed
importance: Undecided → High
milestone: none → pike-2
assignee: nobody → Thomas Herve (therve)

Fix proposed to branch: master
Review: https://review.openstack.org/458497

Changed in heat:
status: Confirmed → In Progress
Changed in tripleo:
status: Confirmed → Triaged
milestone: none → pike-2

Reviewed: https://review.openstack.org/458497
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=45fde101979d9d3f446cd937e5cc4c47539cc5bc
Submitter: Jenkins
Branch: master

commit 45fde101979d9d3f446cd937e5cc4c47539cc5bc
Author: Thomas Herve <email address hidden>
Date: Thu Apr 20 15:11:08 2017 +0200

    Copy template version when update fails

    When an update fails, we may have copy some chunk of resources or
    parameters to the new template. If the version was updated and the new
    resources require the version, this can lead to a state where the stack
    is in an usable state. This synchronizes the version when a failure
    happens.

    Change-Id: I2faf8f3541fc800ea61c417e5575f4a56a83665b
    Closes-Bug: #1620696
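
As a loose illustration of the idea in this commit message (not the actual patch): if a failed update copies new resources into the stored template, copying the version along with them keeps the two consistent. The dict layout and the copy_on_failure helper are assumptions made for this sketch.

```python
def copy_on_failure(stored, new, sync_version):
    """Merge resources from the new template into the stored one."""
    merged = dict(stored)
    merged["resources"] = {**stored["resources"], **new["resources"]}
    if sync_version:
        # The step the commit message describes: keep the version in step
        # with the resources that were copied over.
        merged["heat_template_version"] = new["heat_template_version"]
    return merged


# Hypothetical templates, using the two versions mentioned in this thread.
stored = {"heat_template_version": "2015-04-30", "resources": {"a": 1}}
new = {"heat_template_version": "2016-10-14", "resources": {"b": 2}}

without_sync = copy_on_failure(stored, new, sync_version=False)
with_sync = copy_on_failure(stored, new, sync_version=True)
```

Without the sync step, the merged template carries new-style resources under the old version, which matches the mismatch reported earlier in this thread.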

Changed in heat:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/464585
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=d0309937b328340eb3ce1da0ff8860e998e77c69
Submitter: Jenkins
Branch: stable/ocata

commit d0309937b328340eb3ce1da0ff8860e998e77c69
Author: Thomas Herve <email address hidden>
Date: Thu Apr 20 15:11:08 2017 +0200

    Copy template version when update fails

    When an update fails, we may have copy some chunk of resources or
    parameters to the new template. If the version was updated and the new
    resources require the version, this can lead to a state where the stack
    is in an usable state. This synchronizes the version when a failure
    happens.

    Change-Id: I2faf8f3541fc800ea61c417e5575f4a56a83665b
    Closes-Bug: #1620696
    (cherry picked from commit 45fde101979d9d3f446cd937e5cc4c47539cc5bc)

tags: added: in-stable-ocata

This issue was fixed in the openstack/heat 9.0.0.0b2 development milestone.

Changed in tripleo:
milestone: pike-2 → pike-3
mathieu bultel (mat-bultel) wrote :

Hi, I think we can close this one, since all the fixes have landed in their respective branches.

This issue was fixed in the openstack/heat 8.0.2 release.

Changed in tripleo:
status: Triaged → In Progress
Changed in tripleo:
milestone: pike-3 → pike-rc1

Reviewed: https://review.openstack.org/464584
Committed: https://git.openstack.org/cgit/openstack/heat/commit/?id=ff47d9d6f327f92cedd2a30094bf5d50a83b68f4
Submitter: Jenkins
Branch: stable/newton

commit ff47d9d6f327f92cedd2a30094bf5d50a83b68f4
Author: Thomas Herve <email address hidden>
Date: Thu Apr 20 15:11:08 2017 +0200

    Copy template version when update fails

    When an update fails, we may have copy some chunk of resources or
    parameters to the new template. If the version was updated and the new
    resources require the version, this can lead to a state where the stack
    is in an usable state. This synchronizes the version when a failure
    happens.

    Change-Id: I2faf8f3541fc800ea61c417e5575f4a56a83665b
    Closes-Bug: #1620696
    (cherry picked from commit 45fde101979d9d3f446cd937e5cc4c47539cc5bc)

tags: added: in-stable-newton
Ben Nemec (bnemec) wrote :

Sounds like this was fixed in Heat. Closing the tripleo bug.

Changed in tripleo:
status: In Progress → Fix Released

This issue was fixed in the openstack/heat 7.0.6 release.
