[tripleo] rocky baremetal deployment fails with jq: error: Could not open file /var/lib/heat-config/deployed/<id>.notify.json

Bug #1792343 reported by Raoul Scarazzini on 2018-09-13
This bug affects 3 people
Affects: tripleo
Importance: Critical
Assigned to: James Slagle

Bug Description

Rocky's RDOPhase2 job is failing because it cannot find the notify json file it is looking for:

2018-09-12 17:14:17 | TASK [Run deployment NetworkDeployment] ****************************************
2018-09-12 17:14:17 | Wednesday 12 September 2018 17:13:58 +0000 (0:00:00.254) 0:00:27.104 ***
2018-09-12 17:14:17 | fatal: [overcloud-controller-0]: FAILED! => {"changed": true, "cmd": "/usr/libexec/os-refresh-config/configure.d/55-heat-config\n exit $(jq .deploy_status_code /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json)", "delta": "0:00:00.047339", "end": "2018-09-12 17:14:16.472866", "msg": "non-zero return code", "rc": 2, "start": "2018-09-12 17:14:16.425527", "stderr": "[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed\n[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json\njq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory", "stderr_lines": ["[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed", "[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json", "jq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory"], "stdout": "", "stdout_lines": []}

All the logs concerning the failed job are available here [1] and there's already a proposed review [2] that might address the issue.

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-rocky-rdo_trunk-bmu-ha-lab-cygnus-float_nic_with_vlans-3
[2] https://review.openstack.org/#/c/602270/

Raoul Scarazzini (rasca) on 2018-09-13
Changed in tripleo:
milestone: none → rocky-rc2
Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
Quique Llorente (quiquell) wrote :

We are doing a lot of includes
http://git.openstack.org/cgit/openstack/tripleo-heat-templates/tree/common/deploy-steps.j2#n495
2018-09-14 03:09:48 | TASK [include_tasks] ***********************************************************
2018-09-14 03:09:48 | Friday 14 September 2018 03:09:18 +0000 (0:00:01.519) 0:00:32.351 ******
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0, overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0

Quique Llorente (quiquell) wrote :

Let's try reverting parallelization https://review.openstack.org/602594

Changed in tripleo:
importance: High → Critical

Change abandoned by Quique Llorente (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/602572
Reason: This was just a testing patch, it's not fixing anything

wes hayutin (weshayutin) on 2018-09-18
Changed in tripleo:
assignee: Quique Llorente (quiquell) → nobody
wes hayutin (weshayutin) wrote :

no need to use alert for this bug, just a promotion blocker.

tags: removed: alert
Changed in tripleo:
milestone: rocky-rc2 → stein-1
James Slagle (james-slagle) wrote :

I debugged this issue in the reproducer environment and found that when os-net-config was applying the network configuration on the overcloud node, it caused ssh to drop the connection.

Since we have ssh retries set to 8 in ansible.cfg, ansible would retry the task because it had failed with an ssh connection error.

However, the first task was actually still running, and it eventually succeeds.

The second task, kicked off by ansible as a retry, sees that the deployment is already applied, but the notification file (*.notify.json) does not yet exist since the first task is still in progress. This causes the second task to fail with the error reported in the bug, and the whole ansible-playbook run then fails.
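
To make the race concrete, here is a minimal Python sketch of a retry that waits for the notification file instead of reading it immediately. It is only an illustration, not the fix that was merged: the function name wait_for_deploy_status and its timeout and poll parameters are hypothetical, and only the deploy_status_code key and the *.notify.json file come from the log above.

```python
import json
import time
from pathlib import Path


def wait_for_deploy_status(notify_path, timeout=60.0, poll=1.0):
    """Return deploy_status_code once the notify file can be read.

    Hypothetical helper: raises TimeoutError if the file has not
    appeared (or is still incomplete) after `timeout` seconds.
    """
    path = Path(notify_path)
    deadline = time.monotonic() + timeout
    while True:
        try:
            return json.loads(path.read_text())["deploy_status_code"]
        except (FileNotFoundError, json.JSONDecodeError):
            # The first run is still in progress or mid-write: keep waiting.
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{path} not ready after {timeout}s")
            time.sleep(poll)
```

A retried task written this way would block until the first run finishes, rather than failing the moment jq cannot open the file.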

James Slagle (james-slagle) wrote :

Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix the issue as ssh doesn't drop the first connection when these are configured.

Changed in tripleo:
assignee: nobody → James Slagle (james-slagle)
status: Triaged → In Progress
Changed in tripleo:
assignee: James Slagle (james-slagle) → Quique Llorente (quiquell)

Reviewed: https://review.openstack.org/604171
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296
Submitter: Zuul
Branch: master

commit c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296
Author: James Slagle <email address hidden>
Date: Thu Sep 20 13:36:03 2018 -0400

    Set SSH server keep alive options

    When os-net-config configures the network configuration on the overcloud nodes
    ssh connections can be dropped.

    Since we have ssh retries set to 8 in ansible.cfg, ansible would retry the task
    since it was failed by a ssh connection error.

    However, the first task was actually still running and it eventually succeeds.

    The second task that was kicked off by ansible as a retry, sees that the
    deployment is already applied, but the notification file (*.notify.json) does
    not yet exist since the first task is still in progress. This causes the second
    task to fail with the error reported in the bug and the whole ansible-playbook
    run to then fail.

    Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix
    the issue as ssh doesn't drop the first connection when these are configured.

    Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
    Closes-Bug: #1792343

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/604455
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=b17791ae2ce2bd6dd5e888bdbb5318e53af4cea6
Submitter: Zuul
Branch: stable/rocky

commit b17791ae2ce2bd6dd5e888bdbb5318e53af4cea6
Author: James Slagle <email address hidden>
Date: Thu Sep 20 13:36:03 2018 -0400

    Set SSH server keep alive options

    When os-net-config configures the network configuration on the overcloud nodes
    ssh connections can be dropped.

    Since we have ssh retries set to 8 in ansible.cfg, ansible would retry the task
    since it was failed by a ssh connection error.

    However, the first task was actually still running and it eventually succeeds.

    The second task that was kicked off by ansible as a retry, sees that the
    deployment is already applied, but the notification file (*.notify.json) does
    not yet exist since the first task is still in progress. This causes the second
    task to fail with the error reported in the bug and the whole ansible-playbook
    run to then fail.

    Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix
    the issue as ssh doesn't drop the first connection when these are configured.

    Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
    Closes-Bug: #1792343
    (cherry picked from commit c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296)

tags: added: in-stable-rocky
Ben Nemec (bnemec) wrote :

I'm still seeing this on master. I've attached the deployment output and I can see that I do have the keepalive fix on the undercloud:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5

This is doing a fairly basic deployment using the OVB multinic templates: https://github.com/cybertron/openstack-virtual-baremetal/tree/master/overcloud-templates/network-templates-v2

Deploy command is this: openstack overcloud deploy --templates --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-isolation-absolute.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-environment.yaml

Changed in tripleo:
status: Fix Released → Triaged

This issue was fixed in the openstack/tripleo-common 10.0.0 release.

Noam Angel (noama) wrote :

I have the same issue on the latest Rocky:

rpm -qa | grep tripleo
openstack-tripleo-validations-9.3.1-0.20181008110747.4064fb7.el7.noarch
openstack-tripleo-image-elements-9.0.1-0.20181007200834.2dc678a.el7.noarch
openstack-tripleo-common-containers-9.4.1-0.20181014200404.52fe2f3.el7.noarch
openstack-tripleo-puppet-elements-9.0.0-0.20181007201103.daf9069.el7.noarch
python2-tripleo-repos-0.0.1-0.20181007202255.dca903e.el7.noarch
openstack-tripleo-common-9.4.1-0.20181014200404.52fe2f3.el7.noarch
python-tripleoclient-10.6.1-0.20181014192805.b5c17c1.el7.noarch
puppet-tripleo-9.3.1-0.20181014201134.e44b525.el7.noarch
ansible-tripleo-ipsec-9.0.1-0.20181012162415.8b37e93.el7.noarch
openstack-tripleo-heat-templates-9.0.1-0.20181015031903.9967382.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20181014201444.36f4481.el7.noarch
python2-tripleo-common-9.4.1-0.20181014200404.52fe2f3.el7.noarch
python-tripleoclient-heat-installer-10.6.1-0.20181014192805.b5c17c1.el7.noarch

Noam Angel (noama) wrote :

I was able to work around it with "-o ServerAliveInterval=5 -o ServerAliveCountMax=800"

https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html#deployment-log
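
For a sense of why the larger ServerAliveCountMax helps: OpenSSH gives up on an unresponsive server after roughly ServerAliveInterval × ServerAliveCountMax seconds. The Python sketch below computes that window from an ssh_args string. The helper name keepalive_window is hypothetical; the option names and values are the ones quoted in this bug.

```python
import re


def keepalive_window(ssh_args):
    """Approximate seconds an unresponsive server is tolerated, or None."""
    # Collect every "-o Name=value" pair from the argument string.
    opts = dict(re.findall(r"-o\s+(\w+)=(\S+)", ssh_args))
    interval = int(opts.get("ServerAliveInterval", 0))
    if interval == 0:
        return None  # keepalive probes are disabled (the OpenSSH default)
    count = int(opts.get("ServerAliveCountMax", 3))  # OpenSSH default is 3
    return interval * count


print(keepalive_window("-o ServerAliveInterval=5 -o ServerAliveCountMax=5"))    # 25
print(keepalive_window("-o ServerAliveInterval=5 -o ServerAliveCountMax=800"))  # 4000
```

With the merged defaults the connection survives about 25 seconds of silence; the workaround above stretches that to about 4000 seconds, which may simply outlast the network reconfiguration.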

Ben Nemec (bnemec) wrote :

I should note that I was able to reproduce this consistently for a couple of days, and since then I haven't seen it again.

Fix proposed to branch: master
Review: https://review.openstack.org/611712

Changed in tripleo:
assignee: Quique Llorente (quiquell) → James Slagle (james-slagle)
status: Triaged → In Progress

Reviewed: https://review.openstack.org/611712
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ac4ac838e1a3eda32bbf8e9c0a61a89142688194
Submitter: Zuul
Branch: master

commit ac4ac838e1a3eda32bbf8e9c0a61a89142688194
Author: James Slagle <email address hidden>
Date: Thu Oct 18 16:11:32 2018 -0400

    Run NetworkDeployment as async task

    This commit adds special handling of the NetworkDeployment such that it
    will be run as an async task with ansible. This should prevent any
    issues where the network configuration causes the ssh connection to drop
    and the ansible task to either be unnecessarily retried or failed.

    Also added are three variables that can be used to control the async
    behavior:

    async_deployment: boolean which will toggle running all deployments in
                      async mode.
    async_timeout: timeout in seconds to wait for async tasks
    async_poll: interval in seconds to check async task status

    These variables can only be set if running the config-download process
    manually, however a future patch could wire them up to Heat parameters.

    Change-Id: If1f35980a98a9015ca65f2c6a3e4db04725f1c10
    Closes-Bug: #1792343

Changed in tripleo:
status: In Progress → Fix Released
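
As a rough picture of what async_timeout and async_poll mean in the commit above, the Python sketch below starts a command in the background and checks on it at a fixed interval until it finishes or the timeout expires. It is not how Ansible implements async tasks, and the run_async helper is hypothetical; only the two parameter names are taken from the commit message.

```python
import subprocess
import sys
import time


def run_async(cmd, async_timeout, async_poll):
    """Start cmd in the background and poll until it exits or times out."""
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + async_timeout
    while proc.poll() is None:  # None means the job is still running
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()
            raise TimeoutError(f"job exceeded {async_timeout}s")
        time.sleep(async_poll)
    return proc.returncode


if __name__ == "__main__":
    # Stand-in job; a real deployment command would go here.
    rc = run_async([sys.executable, "-c", "print('configured')"], 30, 0.5)
    print("exit status:", rc)
```

The point of the pattern is that the job's lifetime is decoupled from any single status check, so a dropped check does not by itself restart or fail the job.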

Change abandoned by Quique Llorente (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/602595
Reason: This was an old, non-working attempt to fix the bug.

Change abandoned by Quique Llorente (<email address hidden>) on branch: master
Review: https://review.openstack.org/602594
Reason: This was an old, non-working attempt to fix the bug.

Reviewed: https://review.openstack.org/612387
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=d213435a61072b0a0d559eecb30dcdf1167293ee
Submitter: Zuul
Branch: stable/rocky

commit d213435a61072b0a0d559eecb30dcdf1167293ee
Author: James Slagle <email address hidden>
Date: Thu Oct 18 16:11:32 2018 -0400

    Run NetworkDeployment as async task

    This commit adds special handling of the NetworkDeployment such that it
    will be run as an async task with ansible. This should prevent any
    issues where the network configuration causes the ssh connection to drop
    and the ansible task to either be unnecessarily retried or failed.

    Also added are three variables that can be used to control the async
    behavior:

    async_deployment: boolean which will toggle running all deployments in
                      async mode.
    async_timeout: timeout in seconds to wait for async tasks
    async_poll: interval in seconds to check async task status

    These variables can only be set if running the config-download process
    manually, however a future patch could wire them up to Heat parameters.

    Change-Id: If1f35980a98a9015ca65f2c6a3e4db04725f1c10
    Closes-Bug: #1792343
    (cherry picked from commit ac4ac838e1a3eda32bbf8e9c0a61a89142688194)

Reviewed: https://review.openstack.org/618118
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=56bf1d6db54e943cda5a5936e596eff3ceece024
Submitter: Zuul
Branch: stable/queens

commit 56bf1d6db54e943cda5a5936e596eff3ceece024
Author: James Slagle <email address hidden>
Date: Thu Sep 20 13:36:03 2018 -0400

    Set SSH server keep alive options

    When os-net-config configures the network configuration on the overcloud nodes
    ssh connections can be dropped.

    Since we have ssh retries set to 8 in ansible.cfg, ansible would retry the task
    since it was failed by a ssh connection error.

    However, the first task was actually still running and it eventually succeeds.

    The second task that was kicked off by ansible as a retry, sees that the
    deployment is already applied, but the notification file (*.notify.json) does
    not yet exist since the first task is still in progress. This causes the second
    task to fail with the error reported in the bug and the whole ansible-playbook
    run to then fail.

    Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix
    the issue as ssh doesn't drop the first connection when these are configured.

    Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
    Closes-Bug: #1792343
    (cherry picked from commit c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296)

tags: added: in-stable-queens

This issue was fixed in the openstack/tripleo-common 10.1.0 release.

This issue was fixed in the openstack/tripleo-common 8.6.7 release.

kobig (kobi.ginon) wrote :

In my experience with the latest Rocky version, the issue is still there even with the fix suggested above.
See my suggested fix in the similar thread:
https://bugs.launchpad.net/tripleo/+bug/1769622

Reviewed: https://review.openstack.org/636978
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=a2550158e1aff6c0d6f23ec1bb7e19ab4c043b5e
Submitter: Zuul
Branch: stable/queens

commit a2550158e1aff6c0d6f23ec1bb7e19ab4c043b5e
Author: ekultails <email address hidden>
Date: Wed Feb 27 15:32:28 2019 -0500

    Run NetworkDeployment as async task

    This commit adds special handling of the NetworkDeployment such that it
    will be run as an async task with ansible. This should prevent any
    issues where the network configuration causes the ssh connection to drop
    and the ansible task to either be unnecessarily retried or failed.

    Also added are three variables that can be used to control the async
    behavior:

    async_deployment: boolean which will toggle running all deployments in
                      async mode.
    async_timeout: timeout in seconds to wait for async tasks
    async_poll: interval in seconds to check async task status

    These variables can only be set if running the config-download process
    manually, however a future patch could wire them up to Heat parameters.

    Change-Id: If1f35980a98a9015ca65f2c6a3e4db04725f1c10
    Closes-Bug: #1792343
    (cherry picked from commit ac4ac838e1a3eda32bbf8e9c0a61a89142688194)

This issue was fixed in the openstack/tripleo-common 9.5.0 release.

This issue was fixed in the openstack/tripleo-common 8.7.0 release.
