fs037 updates: Failed to update nodes - Controller

Bug #1783866 reported by Rafael Folco
This bug affects 2 people
Status: Fix Released
Assigned to: Jiri Stransky

Bug Description

fs037 updates: Failed to update nodes - Controller

While fixing
https://launchpad.net/bugs/1783399 --> TAGS regression that stopped updates from running in CI jobs
https://launchpad.net/bugs/1783857 --> false positive that reports SUCCESS for failed jobs/playbooks

Found this issue:

fs037 updates job is consistently failing at the same point with the same error.

***Not sure whether this was introduced while the gate was reporting false positives.

However, for some reason this is a successful update from patchset3:

and it is also getting success here in this other patch:

2018-07-26 16:37:22 | u'TASK [Set docker_startup_configs_with_default fact] ****************************',
2018-07-26 16:37:22 | u'Thursday 26 July 2018 16:37:20 +0000 (0:00:00.220) 0:02:55.340 ********* ',
2018-07-26 16:37:22 | u'An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ^',
2018-07-26 16:37:22 | u'fatal: [centos-7-rax-ord-0000988567]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}',


2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack Traceback (most recent call last):
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/cliff/app.py", line 402, in run_subcommand
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack result = cmd.run(parsed_args)
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/tripleoclient/command.py", line 25, in run
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack super(Command, self).run(parsed_args)
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/osc_lib/command/command.py", line 41, in run
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack return super(Command, self).run(parsed_args)
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/cliff/command.py", line 184, in run
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack return_code = self.take_action(parsed_args) or 0
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/tripleoclient/v1/overcloud_update.py", line 190, in take_action
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack parsed_args.ssh_user)
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/tripleoclient/utils.py", line 959, in run_update_ansible_action
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack node_user=ssh_user, skip_tags=skip_tags)
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack File "/usr/lib/python2.7/site-packages/tripleoclient/workflows/package_update.py", line 98, in update_ansible
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack raise RuntimeError('Update failed with: {}'.format(payload))
2018-07-26 16:37:22 | 2018-07-26 16:37:22.981 8878 ERROR openstack RuntimeError: Update failed with: {u'status': u'FAILED', u'message': u'Failed to update nodes - Controller, please see the logs.', u'execution': {u'name': u'tripleo.package_update.v1.update_nodes', u'created_at': u'2018-07-26 16:34:18', u'updated_at': u'2018-07-26 16:37:21', u'id': u'fc5288d5-f461-40d0-af2f-902074d85561', u'params': {u'namespace': u'', u'env': {}}, u'input': {u'inventory_file': u'Undercloud:\n hosts:\n undercloud: {}\n vars:\n ansible_connection: local\n ansible_host: localhost\n ansible_remote_tmp: /tmp/ansible-${USER}\n auth_url:\n cacert: null\n os_auth_token: gAAAAABbWfgEt5Gy1D4jmAOx9nRndzad7UBzCGebdpNnzjZTq8LqJJvY9nCr0cuIFaR12Fakg74uU7FJ3C2uDc3PoSR70U9CKaZ-tVU9jRRayMbhLOMPWLSXLcGliXlMDoULswbQnovB35mtuITvRMrYwHHgl-cQKZdqkmynG5_tHXAPpFB3-S0\n overcloud_admin_password: xmWu48EB1GLLPHUheidfbFtN8\n overcloud_horizon_url:\n overcloud_keystone_url:\n plan: overcloud\n project_name: admin\n undercloud_service_list: [openstack-nova-compute, openstack-heat-engine, openstack-ironic-conductor,\n openstack-swift-container, openstack-swift-object, openstack-mistral-engine]\n undercloud_swift_url:\n username: admin\nController:\n hosts:\n centos-7-rax-ord-0000988567:\n ansible_host:\n ctlplane_ip:\n deploy_server_id: 448f84fc-62a9-4cbb-815c-835b5607989c\n enabled_networks: [management, storage, ctlplane, external, internal_api, storage_mgmt,\n tenant]\n external_ip:\n internal_api_ip:\n management_ip:\n storage_ip:\n storage_mgmt_ip:\n tenant_ip:\n vars: {ansible_ssh_user: tripleo-admin, bootstrap_server_id: 448f84fc-62a9-4cbb-815c-835b5607989c,\n tripleo_role_name: Controller}\novercloud:\n children:\n Controller: {}\n vars: {ctlplane_vip:, external_vip:, internal_api_vip:,\n redis_vip:, storage_mgmt_vip:, storage_vip:}\nhaproxy:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nkernel:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nsshd:\n 
children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\ntripleo_firewall:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\noslo_messaging_notify:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nmysql_client:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nntp:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nclustercheck:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nsnmp:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nkeystone:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\ntripleo_packages:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\npacemaker:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nmysql:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nca_certs:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\ntimezone:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\ndocker:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\nmemcached:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\noslo_messaging_rpc:\n children:\n Controller: {}\n vars: {ansible_ssh_user: tripleo-admin}\n', u'work_dir': u'/var/lib/mistral', u'verbosity': 1, u'skip_tags': u'', u'playbook': u'update_steps_playbook.yaml', u'ansible_extra_env_variables': {u'ANSIBLE_HOST_KEY_CHECKING': u'False', u'ANSIBLE_LOG_PATH': u'/var/log/mistral/package_update.log'}, u'module_path': u'/usr/share/ansible-modules', u'nodes': u'Controller', u'node_user': u'tripleo-admin', u'ansible_queue_name': u'update'}, u'spec': {u'tasks': {u'get_private_key': {u'name': u'get_private_key', u'on-success': u'node_update', u'publish': {u'private_key': u'<% task().result %>'}, u'version': u'2.0', u'action': u'tripleo.validations.get_privkey', u'type': u'direct'}, u'node_update_failed': 
{u'version': u'2.0', u'type': u'direct', u'name': u'node_update_failed', u'publish': {u'status': u'FAILED', u'message': u'Failed to update nodes - <% $.nodes %>, please see the logs.'}, u'on-success': u'send_message'}, u'node_update_passed': {u'version': u'2.0', u'type': u'direct', u'name': u'node_update_passed', u'publish': {u'status': u'SUCCESS', u'message': u'Updated nodes - <% $.nodes %>'}, u'on-success': u'send_message'}, u'send_message': {u'input': {u'status': u"<% $.get('status', 'SUCCESS') %>", u'message': u"<% $.get('message', '') %>", u'queue_name': u'<% $.ansible_queue_name %>', u'type': u'<% execution().name %>', u'execution': u'<% execution() %>'}, u'version': u'2.0', u'type': u'direct', u'name': u'send_message', u'workflow': u'tripleo.messaging.v1.send'}, u'node_update': {u'name': u'node_update', u'on-error': u'node_update_failed', u'on-success': [{u'node_update_passed': u'<% task().result.returncode = 0 %>'}, {u'node_update_failed': u'<% task().result.returncode != 0 %>'}], u'publish': {u'output': u'<% task().result %>'}, u'version': u'2.0', u'action': u'tripleo.ansible-playbook', u'input': {u'remote_user': u'<% $.node_user %>', u'limit_hosts': u'<% $.nodes %>', u'become_user': u'root', u'verbosity': u'<% $.verbosity %>', u'queue_name': u'<% $.ansible_queue_name %>', u'extra_env_variables': u'<% $.ansible_extra_env_variables %>', u'skip_tags': u'<% $.skip_tags %>', u'inventory': u'<% $.inventory_file %>', u'become': True, u'module_path': u'<% $.module_path %>', u'playbook': u'<% $.work_dir %>/<% execution().id %>/<% $.playbook %>', u'trash_output': True, u'execution_id': u'<% execution().id %>', u'ssh_private_key': u'<% $.private_key %>'}, u'type': u'direct'}, u'download_config': {u'name': u'download_config', u'on-error': u'node_update_failed', u'on-success': u'get_private_key', u'version': u'2.0', u'action': u'tripleo.config.download_config', u'input': {u'work_dir': u'<% $.work_dir %>/<% execution().id %>'}, u'type': u'direct'}}, u'description': 
u'Take a container and perform an update nodes by nodes', u'tags': [u'tripleo-common-managed'], u'version': u'2.0', u'input': [{u'node_user': u'tripleo-admin'}, u'nodes', u'playbook', u'inventory_file', {u'ansible_queue_name': u'tripleo'}, {u'module_path': u'/usr/share/ansible-modules'}, {u'ansible_extra_env_variables': {u'ANSIBLE_HOST_KEY_CHECKING': u'False', u'ANSIBLE_LOG_PATH': u'/var/log/mistral/package_update.log'}}, {u'verbosity': 1}, {u'work_dir': u'/var/lib/mistral'}, {u'skip_tags': u''}], u'name': u'update_nodes'}}, u'plan_name': None, u'execution_id': u'fc5288d5-f461-40d0-af2f-902074d85561', u'deployment_status': None}

Rafael Folco (rafaelfolco) wrote :
Changed in tripleo:
milestone: none → rocky-rc1
Quique Llorente (quiquell) wrote :

The RDO update job is failing at the same point, and it uses the old workflow (run.yaml), so it looks like a defect has passed the gates.


Quique Llorente (quiquell) wrote :

Review with a Depends-On to get the logs from the failing task

Quique Llorente (quiquell) wrote :

Checking Ronelle's findings; also, TRIPLEO_DEPLOY_IDENTIFIER= is empty


Quique Llorente (quiquell) wrote :

Ronelle Landy (rlandy) wrote 30 minutes ago: #5
I have a reproducer set up with fs037.
Manually removing that misplaced/extra start_order and the root line allows the update to continue

Ronelle Landy (rlandy) wrote 22 minutes ago: #6
Each time /home/zuul/overcloud_update_run-Controller.sh is run, a new dir is created in /var/lib/mistral and the task "Set docker_startup_configs_with_default fact" fails at https://github.com/openstack/tripleo-heat-templates/blob/master/common/deploy-steps-tasks.yaml#L80.

u'An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ^',
 u'fatal: [subnode-1]: FAILED! => {"msg": "Unexpected failure during module execution.", "stdout": ""}',

With no_log removed, the step before prints out:

 u'TASK [Set docker_config_default fact] ******************************************',
 u'Thursday 26 July 2018 21:20:22 +0000 (0:00:00.548) 0:02:04.493 ********* ',
 u'ok: [subnode-1] => (item=1) => {"ansible_facts": {"docker_config_default": {"step_1": {}}}, "changed": false, "item": "1"}',
 u'ok: [subnode-1] => (item=2) => {"ansible_facts": {"docker_config_default": {"step_1": {}, "step_2": {}}}, "changed": false, "item": "2"}',
 u'ok: [subnode-1] => (item=3) => {"ansible_facts": {"docker_config_default": {"step_1": {}, "step_2": {}, "step_3": {}}}, "changed": false, "item": "3"}',
 u'ok: [subnode-1] => (item=4) => {"ansible_facts": {"docker_config_default": {"step_1": {}, "step_2": {}, "step_3": {}, "step_4": {}}}, "changed": false, "item": "4"}',
 u'ok: [subnode-1] => (item=5) => {"ansible_facts": {"docker_config_default": {"step_1": {}, "step_2": {}, "step_3": {}, "step_4": {}, "step_5": {}}}, "changed": false, "item": "5"}',
 u'ok: [subnode-1] => (item=6) => {"ansible_facts": {"docker_config_default": {"step_1": {}, "step_2": {}, "step_3": {}, "step_4": {}, "step_5": {}, "step_6": {}}}, "changed": false, "item": "6"}',

^^ which is fine. The error is on /var/lib/mistral/xxxx/Controller/docker_config.yaml.
(never on /var/lib/mistral/overcloud/Controller/docker_config.yaml - just the update)
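
One way to spot the stray lines when working on a reproducer is to diff the update-time copy of docker_config.yaml against the deploy-time copy. This is only a sketch: the helper name diff_configs is made up for illustration, and the two paths are whatever copies you want to compare (for example the two paths mentioned above). Differences can be legitimate; the point is just to make leftover trailing lines easy to see.

```python
import difflib
import sys


def diff_configs(deploy_path, update_path):
    """Return unified-diff lines between two copies of a config file."""
    with open(deploy_path) as f:
        deploy_lines = f.readlines()
    with open(update_path) as f:
        update_lines = f.readlines()
    # Lines starting with "+" exist only in the update-time copy.
    return list(
        difflib.unified_diff(
            deploy_lines,
            update_lines,
            fromfile=deploy_path,
            tofile=update_path,
        )
    )


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python diff_configs.py <deploy copy> <update copy>
    sys.stdout.writelines(diff_configs(sys.argv[1], sys.argv[2]))
```

Lines prefixed with "+" near the end of the output are the candidates for leftover content.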

Ronelle Landy (rlandy) wrote 6 minutes ago: #7
Possibly something falls apart in https://github.com/openstack/tripleo-heat-templates/blob/master/common/deploy-steps.j2#L342

Quique Llorente (quiquell) wrote :

Checked again with a Depends-On to activate logs, and the update passes:

Installed packages: http://logs.openstack.org/44/586444/1/check/tripleo-ci-centos-7-scenario000-multinode-oooq-container-updates/a08dd57/logs/undercloud/var/log/yum.log.txt.gz

Jul 26 05:52:35 Installed: centos-release-virt-common-1-1.el7.centos.noarch
Jul 26 05:52:35 Installed: centos-release-qemu-ev-1.0-3.el7.centos.noarch
Jul 26 05:52:35 Installed: centos-release-storage-common-1-2.el7.centos.noarch
Jul 26 05:52:35 Installed: centos-release-ceph-luminous-1.0-1.el7.centos.noarch
Jul 26 05:52:35 Installed: centos-release-openstack-queens-1-1.el7.centos.x86_64
Jul 26 05:52:42 Installed: numactl-libs-2.0.9-7.el7.x86_64
Jul 26 05:52:43 Installed: 1:openvswitch-2.9.0-3.el7.x86_64
Jul 26 05:52:46 Erased: centos-release-openstack-queens-1-1.el7.centos.x86_64
Jul 26 05:55:30 Installed: python-six-1.9.0-2.el7.noarch
Jul 26 05:55:30 Installed: python-urllib3-1.10.2-5.el7.noarch
Jul 26 05:55:30 Installed: python-requests-2.6.0-1.el7_1.noarch
Jul 26 05:57:11 Installed: keyutils-libs-devel-1.5.8-3.el7.x86_64
Jul 26 05:57:11 Installed: libcom_err-devel-1.42.9-12.el7_5.x86_64
Jul 26 05:57:11 Installed: libkadm5-1.15.1-19.el7.x86_64
Jul 26 05:57:11 Installed: libsepol-devel-2.5-8.1.el7.x86_64
Jul 26 05:57:11 Installed: pcre-devel-8.32-17.el7.x86_64
Jul 26 05:57:11 Installed: libselinux-devel-2.5-12.el7.x86_64
Jul 26 05:57:11 Installed: libverto-devel-0.2.5-4.el7.x86_64
Jul 26 05:57:11 Installed: krb5-devel-1.15.1-19.el7.x86_64
Jul 26 05:57:11 Installed: zlib-devel-1.2.7-17.el7.x86_64
Jul 26 05:57:12 Installed: 1:openssl-devel-1.0.2k-12.el7.x86_64
Jul 26 05:57:12 Installed: libffi-devel-3.0.13-18.el7.x86_64
Jul 26 05:57:12 Installed: libyaml-0.1.4-11.el7_0.x86_64
Jul 26 06:00:22 Installed: yum-plugin-priorities-1.1.31-45.el7.noarch
Jul 26 06:00:25 Erased: centos-release-qemu-ev-1.0-3.el7.centos.noarch
Jul 26 06:00:25 Erased: centos-release-ceph-luminous-1.0-1.el7.centos.noarch
Jul 26 06:00:31 Installed: python2-idna-2.5-1.el7.noarch
Jul 26 06:00:31 Installed: python2-six-1.11.0-4.el7.noarch
Jul 26 06:00:31 Updated: python-ipaddress-1.0.16-3.el7.noarch
Jul 26 06:00:31 Installed: python-ply-3.4-11.el7.noarch
Jul 26 06:00:31 Installed: python-pycparser-2.14-1.el7.noarch
Jul 26 06:00:31 Installed: python2-cffi-1.11.2-1.el7.x86_64
Jul 26 06:00:32 Installed: python2-asn1crypto-0.23.0-2.el7.noarch
Jul 26 06:00:32 Installed: python2-pysocks-1.5.6-3.el7.noarch
Jul 26 06:00:32 Installed: 3:mariadb-config-10.1.20-2.el7.x86_64
Jul 26 06:00:32 Installed: 3:mariadb-common-10.1.20-2.el7.x86_64
Jul 26 06:00:32 Installed: python-enum34-1.0.4-1.el7.noarch
Jul 26 06:00:32 Installed: python2-cryptography-2.1.4-2.el7.x86_64
Jul 26 06:00:32 Installed: python2-pyOpenSSL-17.3.0-3.el7.noarch
Jul 26 06:00:32 Installed: python2-urllib3-1.21.1-1.el7.noarch
Jul 26 06:00:32 Installed: python2-requests-2.14.2-1.el7.noarch
Jul 26 06:00:32 Updated: 3:mariadb-libs-10.1.20-2.el7.x86_64
Jul 26 06:00:33 Installed: python2-setuptools-22.0.5-1.el7.noarch
Jul 26 06:00:33 Erased: python-requests-2.6.0-1.el7_1.noarch
Jul 26 06:00:33 Erased: python-urllib3-1.10.2-5.el7.noarch
Jul 26 06:00:33 Erased: python-six-1.9.0-2.el7.noarch

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/586499

Changed in tripleo:
assignee: Gabriele Cerami (gcerami) → Jiří Stránský (jistr)
status: Triaged → In Progress
Quique Llorente (quiquell) wrote :

Testing the patch; it fixes the gates: https://review.openstack.org/#/c/585528/

Rafael Folco (rafaelfolco) wrote :

Just for the record: https://review.openstack.org/#/c/586333/ --> the job failed the same way running the legacy workflow, which confirms that the updates failure is a separate issue.

Changed in tripleo:
assignee: Jiří Stránský (jistr) → Quique Llorente (quiquell)
Changed in tripleo:
assignee: Quique Llorente (quiquell) → Jiri Stransky (jistran)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/586499
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=f6767eb21c184bb4a8f1c56ae4c6f3f8320e6242
Submitter: Zuul
Branch: master

commit f6767eb21c184bb4a8f1c56ae4c6f3f8320e6242
Author: Jiri Stransky <email address hidden>
Date: Fri Jul 27 12:10:37 2018 +0200

    Fix overwriting downloaded config files

    When we wrote downloaded config files over previously existing ones,
    we sometimes got garbled files with remainders of the previous state,

        action: exec
        - keystone
        - pkill
        - --signal
        - USR1
        - httpd
        start_order: 1
        user: root
    start_order: 1
        user: root

    This was because we opened the file for writing but didn't truncate
    its size, so this commit adds O_TRUNC to the flags for opening
    files. It's very similar to bug 1434187 which we had a long time ago.

    Revert "Fix deploy health checks"
    Depends-On: Ia2c12d7455564b6297c5f0934812b10fabbdc914

    Change-Id: Ib1d3c68ec3c4048ffc7277daf84834288ea50e48
    Closes-Bug: #1783866
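
The truncation problem described in the commit message can be reproduced outside TripleO. The following is a minimal Python sketch, not the tripleo-common code itself: the helper names overwrite and demo, the file contents, and the temporary path are made up for illustration. It writes a file, then overwrites it with shorter content, once without and once with O_TRUNC.

```python
import os
import tempfile


def overwrite(path, data, truncate):
    """Write data to path via os.open, optionally passing O_TRUNC."""
    flags = os.O_WRONLY | os.O_CREAT
    if truncate:
        flags |= os.O_TRUNC
    fd = os.open(path, flags, 0o600)
    # Wrapping an already-open descriptor does not truncate the file,
    # even with mode "w"; only the os.open flags decide that.
    with os.fdopen(fd, "w", newline="") as f:
        f.write(data)


def demo(truncate):
    """Write a long file, overwrite it with shorter content, return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "docker_config.yaml")
        # First write: the "previous state" of the file.
        overwrite(path, "start_order: 1\nuser: root\nextra: old\n", truncate=True)
        # Second write: shorter content over the existing file.
        overwrite(path, "start_order: 1\nuser: root\n", truncate)
        with open(path, newline="") as f:
            return f.read()


if __name__ == "__main__":
    print(repr(demo(truncate=False)))  # tail of the old content survives
    print(repr(demo(truncate=True)))   # only the new content remains
```

Without O_TRUNC the write starts at offset 0 but the file keeps its old length, so the tail of the previous content survives. That is consistent with the duplicated start_order/user lines shown in the commit message.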

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/589527

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/589527
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=3eedf929e82ea0570219d280e19524fd14e39cb0
Submitter: Zuul
Branch: stable/queens

commit 3eedf929e82ea0570219d280e19524fd14e39cb0
Author: Jiri Stransky <email address hidden>
Date: Fri Jul 27 12:10:37 2018 +0200

    Fix overwriting downloaded config files

    When we wrote downloaded config files over previously existing ones,
    we sometimes got garbled files with remainders of the previous state,

        action: exec
        - keystone
        - pkill
        - --signal
        - USR1
        - httpd
        start_order: 1
        user: root
    start_order: 1
        user: root

    This was because we opened the file for writing but didn't truncate
    its size, so this commit adds O_TRUNC to the flags for opening
    files. It's very similar to bug 1434187 which we had a long time ago.

    Revert "Fix deploy health checks"
    Depends-On: Ia2c12d7455564b6297c5f0934812b10fabbdc914

    Change-Id: Ib1d3c68ec3c4048ffc7277daf84834288ea50e48
    Closes-Bug: #1783866
    (cherry picked from commit f6767eb21c184bb4a8f1c56ae4c6f3f8320e6242)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 9.3.0

This issue was fixed in the openstack/tripleo-common 9.3.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.6.5

This issue was fixed in the openstack/tripleo-common 8.6.5 release.
