[tripleo] rocky baremetal deployment fails with jq: error: Could not open file /var/lib/heat-config/deployed/<id>.notify.json

Bug #1792343 reported by Raoul Scarazzini
This bug affects 3 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: James Slagle

Bug Description

Rocky's RDOPhase2 job is failing because it cannot find the notify json file it is looking for:

2018-09-12 17:14:17 | TASK [Run deployment NetworkDeployment] ****************************************
2018-09-12 17:14:17 | Wednesday 12 September 2018 17:13:58 +0000 (0:00:00.254) 0:00:27.104 ***
2018-09-12 17:14:17 | fatal: [overcloud-controller-0]: FAILED! => {"changed": true, "cmd": "/usr/libexec/os-refresh-config/configure.d/55-heat-config\n exit $(jq .deploy_status_code /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json)", "delta": "0:00:00.047339", "end": "2018-09-12 17:14:16.472866", "msg": "non-zero return code", "rc": 2, "start": "2018-09-12 17:14:16.425527", "stderr": "[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed\n[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json\njq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory", "stderr_lines": ["[2018-09-12 17:14:16,444] (heat-config) [WARNING] Skipping config a557a476-c788-4b44-8e8c-744b0d7120d0, already deployed", "[2018-09-12 17:14:16,445] (heat-config) [WARNING] To force-deploy, rm /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.json", "jq: error: Could not open file /var/lib/heat-config/deployed/a557a476-c788-4b44-8e8c-744b0d7120d0.notify.json: No such file or directory"], "stdout": "", "stdout_lines": []}

All the logs concerning the failed job are available here [1] and there's already a proposed review [2] that might address the issue.

[1] https://thirdparty.logs.rdoproject.org/jenkins-oooq-rocky-rdo_trunk-bmu-ha-lab-cygnus-float_nic_with_vlans-3
[2] https://review.openstack.org/#/c/602270/

Raoul Scarazzini (rasca)
Changed in tripleo:
milestone: none → rocky-rc2
Quique Llorente (quiquell) wrote :
Changed in tripleo:
assignee: nobody → Quique Llorente (quiquell)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/602572

Quique Llorente (quiquell) wrote :
tags: added: alert
Quique Llorente (quiquell) wrote :

We are doing a lot of includes; see the template and the log below:
http://git.openstack.org/cgit/openstack/tripleo-heat-templates/tree/common/deploy-steps.j2#n495
2018-09-14 03:09:48 | TASK [include_tasks] ***********************************************************
2018-09-14 03:09:48 | Friday 14 September 2018 03:09:18 +0000 (0:00:01.519) 0:00:32.351 ******
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0, overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-controller-2, overcloud-controller-1, overcloud-controller-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0
2018-09-14 03:09:48 | included: /var/lib/mistral/overcloud/deployments.yaml for overcloud-novacompute-0

Quique Llorente (quiquell) wrote :

Let's try reverting parallelization: https://review.openstack.org/602594

Changed in tripleo:
importance: High → Critical
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (stable/rocky)

Change abandoned by Quique Llorente (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/602572
Reason: This was just a testing patch; it's not fixing anything.

wes hayutin (weshayutin)
Changed in tripleo:
assignee: Quique Llorente (quiquell) → nobody
wes hayutin (weshayutin) wrote :

No need to use the alert tag for this bug; it is just a promotion blocker.

tags: removed: alert
Changed in tripleo:
milestone: rocky-rc2 → stein-1
James Slagle (james-slagle) wrote :

I debugged this issue in the reproducer environment and found that when os-net-config applied the network configuration on the overcloud node, ssh dropped the connection.

Because ssh retries are set to 8 in ansible.cfg, Ansible retried the task after it failed with an ssh connection error.

However, the first task was actually still running, and it eventually succeeded.

The second task, kicked off by Ansible as a retry, sees that the deployment is already applied, but the notification file (*.notify.json) does not exist yet because the first task is still in progress. The second task therefore fails with the error reported in this bug, and the whole ansible-playbook run fails with it.
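
The race described above can be modelled with the two files involved. What follows is a minimal Python sketch, not TripleO code: the function `deployment_state`, the state labels and the config id `example-id` are hypothetical. The only details taken from this report are the `<id>.json` and `<id>.notify.json` file names and the `deploy_status_code` key.

```python
import json
import tempfile
from pathlib import Path


def deployment_state(deployed_dir, config_id):
    """Classify a deployment by the marker files present in deployed_dir.

    Hypothetical helper for illustration; the labels are not TripleO terms.
    """
    deployed = Path(deployed_dir) / f"{config_id}.json"
    notify = Path(deployed_dir) / f"{config_id}.notify.json"
    if not deployed.exists():
        return "not-started"
    if not notify.exists():
        # The window the retried task fell into: the config is already
        # marked as deployed, but the first run has not written its result.
        return "in-progress"
    status = json.loads(notify.read_text())["deploy_status_code"]
    return "succeeded" if status == 0 else "failed"


def demo():
    """Walk one made-up config id through the three phases."""
    states = []
    with tempfile.TemporaryDirectory() as tmp:
        states.append(deployment_state(tmp, "example-id"))
        (Path(tmp) / "example-id.json").write_text("{}")
        states.append(deployment_state(tmp, "example-id"))
        (Path(tmp) / "example-id.notify.json").write_text(
            json.dumps({"deploy_status_code": 0})
        )
        states.append(deployment_state(tmp, "example-id"))
    return states
```

A check that reads `deploy_status_code` directly, as the failing command does with jq, has no way to tell the middle state apart from a real failure.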

James Slagle (james-slagle) wrote :

Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix the issue as ssh doesn't drop the first connection when these are configured.
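
As a rough illustration of adding these options to an existing ssh argument string, here is a Python sketch. It is not the tripleo-common change itself: the helper `with_keepalive` is hypothetical, the numeric values passed to it are placeholders, and it only recognises the `-o Name=value` spelling.

```python
import shlex


def with_keepalive(ssh_args, interval, count_max):
    """Return ssh_args with the ServerAlive* options appended if absent."""
    tokens = shlex.split(ssh_args)
    # Names of options already given as "-o Name=value" pairs.
    present = {
        tok.split("=", 1)[0]
        for flag, tok in zip(tokens, tokens[1:])
        if flag == "-o"
    }
    wanted = {
        "ServerAliveInterval": interval,
        "ServerAliveCountMax": count_max,
    }
    for name, value in wanted.items():
        if name not in present:
            tokens += ["-o", f"{name}={value}"]
    return shlex.join(tokens)
```

Calling the helper a second time on its own output leaves the string unchanged, so existing settings are never overridden.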

Changed in tripleo:
assignee: nobody → James Slagle (james-slagle)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/604171

Changed in tripleo:
assignee: James Slagle (james-slagle) → Quique Llorente (quiquell)
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/604171
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296
Submitter: Zuul
Branch: master

commit c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296
Author: James Slagle <email address hidden>
Date: Thu Sep 20 13:36:03 2018 -0400

    Set SSH server keep alive options

    When os-net-config configures the network configuration on the overcloud nodes
    ssh connections can be dropped.

    Since we have ssh retries set to 8 in ansible.cfg, ansible would retry the task
    since it was failed by a ssh connection error.

    However, the first task was actually still running and it eventually succeeds.

    The second task that was kicked off by ansible as a retry, sees that the
    deployment is already applied, but the notification file (*.notify.json) does
    not yet exist since the first task is still in progress. This causes the second
    task to fail with the error reported in the bug and the whole ansible-playbook
    run to then fail.

    Setting ServerAliveInterval and ServerAliveCountMax ssh options seems to fix
    the issue as ssh doesn't drop the first connection when these are configured.

    Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
    Closes-Bug: #1792343

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/604455

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.openstack.org/604455
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=b17791ae2ce2bd6dd5e888bdbb5318e53af4cea6
Submitter: Zuul
Branch: stable/rocky

commit b17791ae2ce2bd6dd5e888bdbb5318e53af4cea6
Author: James Slagle <email address hidden>
Date: Thu Sep 20 13:36:03 2018 -0400

    Set SSH server keep alive options

    Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
    Closes-Bug: #1792343
    (cherry picked from commit c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296)

tags: added: in-stable-rocky
Ben Nemec (bnemec) wrote :

I'm still seeing this on master. I've attached the deployment output and I can see that I do have the keepalive fix on the undercloud:

[ssh_connection]
ssh_args = -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -o ControlMaster=auto -o ControlPersist=30m -o ServerAliveInterval=5 -o ServerAliveCountMax=5

This is doing a fairly basic deployment using the OVB multinic templates: https://github.com/cybertron/openstack-virtual-baremetal/tree/master/overcloud-templates/network-templates-v2

Deploy command is this: openstack overcloud deploy --templates --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/disable-telemetry.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-isolation-absolute.yaml -e /home/centos/openstack-virtual-baremetal/overcloud-templates/network-templates-v2/network-environment.yaml

Changed in tripleo:
status: Fix Released → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.0.0

This issue was fixed in the openstack/tripleo-common 10.0.0 release.

Noam Angel (noama) wrote :

I have the same issue on the latest Rocky:

rpm -qa | grep tripleo
openstack-tripleo-validations-9.3.1-0.20181008110747.4064fb7.el7.noarch
openstack-tripleo-image-elements-9.0.1-0.20181007200834.2dc678a.el7.noarch
openstack-tripleo-common-containers-9.4.1-0.20181014200404.52fe2f3.el7.noarch
openstack-tripleo-puppet-elements-9.0.0-0.20181007201103.daf9069.el7.noarch
python2-tripleo-repos-0.0.1-0.20181007202255.dca903e.el7.noarch
openstack-tripleo-common-9.4.1-0.20181014200404.52fe2f3.el7.noarch
python-tripleoclient-10.6.1-0.20181014192805.b5c17c1.el7.noarch
puppet-tripleo-9.3.1-0.20181014201134.e44b525.el7.noarch
ansible-tripleo-ipsec-9.0.1-0.20181012162415.8b37e93.el7.noarch
openstack-tripleo-heat-templates-9.0.1-0.20181015031903.9967382.el7.noarch
ansible-role-tripleo-modify-image-1.0.1-0.20181014201444.36f4481.el7.noarch
python2-tripleo-common-9.4.1-0.20181014200404.52fe2f3.el7.noarch
python-tripleoclient-heat-installer-10.6.1-0.20181014192805.b5c17c1.el7.noarch

Noam Angel (noama) wrote :

I was able to work around it with "-o ServerAliveInterval=5 -o ServerAliveCountMax=800".

https://docs.openstack.org/tripleo-docs/latest/install/advanced_deployment/ansible_config_download.html#deployment-log
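
To compare this workaround with the 5/5 setting quoted earlier in the thread, the sketch below estimates how long ssh waits for an unresponsive server before disconnecting. It assumes the usual reading of the ssh_config manual, where that time is approximately ServerAliveInterval multiplied by ServerAliveCountMax. The helper `keepalive_window` is hypothetical.

```python
import shlex


def keepalive_window(ssh_args):
    """Approximate seconds of server silence ssh tolerates before it disconnects.

    Returns None if either ServerAlive* option is missing from ssh_args.
    """
    opts = {}
    tokens = shlex.split(ssh_args)
    for flag, tok in zip(tokens, tokens[1:]):
        if flag == "-o" and "=" in tok:
            name, value = tok.split("=", 1)
            opts[name] = value
    try:
        return int(opts["ServerAliveInterval"]) * int(opts["ServerAliveCountMax"])
    except KeyError:
        return None


default_fix = keepalive_window("-o ServerAliveInterval=5 -o ServerAliveCountMax=5")
workaround = keepalive_window("-o ServerAliveInterval=5 -o ServerAliveCountMax=800")
```

Under that assumption the 5/5 setting tolerates about 25 seconds of silence, while the 5/800 workaround tolerates about 4000 seconds, roughly 67 minutes.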

Ben Nemec (bnemec) wrote :

I should note that I was able to reproduce this consistently for a couple of days, and since then I haven't seen it again.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.openstack.org/611712

Changed in tripleo:
assignee: Quique Llorente (quiquell) → James Slagle (james-slagle)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.openstack.org/611712
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=ac4ac838e1a3eda32bbf8e9c0a61a89142688194
Submitter: Zuul
Branch: master

commit ac4ac838e1a3eda32bbf8e9c0a61a89142688194
Author: James Slagle <email address hidden>
Date: Thu Oct 18 16:11:32 2018 -0400

    Run NetworkDeployment as async task

    This commit adds special handling of the NetworkDeployment such that it
    will be run as an async task with ansible. This should prevent any
    issues where the network configuration causes the ssh connection to drop
    and the ansible task to either be unnecessarily retried or failed.

    Also added are three variables that can be used to control the async
    behavior:

    async_deployment: boolean which will toggle running all deployments in
                      async mode.
    async_timeout: timeout in seconds to wait for async tasks
    async_poll: interval in seconds to check async task status

    These variables can only be set if running the config-download process
    manually, however a future patch could wire them up to Heat parameters.

    Change-Id: If1f35980a98a9015ca65f2c6a3e4db04725f1c10
    Closes-Bug: #1792343
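
The start-then-poll pattern that the commit message describes can be sketched in Python as follows. This is an illustration only, not Ansible or tripleo-common code. The parameter names mirror the `async_timeout` and `async_poll` variables named above, `run_async` is a hypothetical helper, and a local thread stands in for the job running on the remote node.

```python
import threading
import time


def run_async(job, async_timeout, async_poll):
    """Start job in the background, then poll until it finishes or times out.

    Returns the job's result, re-raises the job's exception, or raises
    TimeoutError. The job runs on regardless of how the poller fares.
    """
    result = {}

    def worker():
        try:
            result["value"] = job()
        except Exception as exc:  # hand job errors over to the poller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    deadline = time.monotonic() + async_timeout
    while thread.is_alive():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still running after {async_timeout}s")
        thread.join(timeout=async_poll)  # wait one poll interval
    if "error" in result:
        raise result["error"]
    return result["value"]
```

The point of the pattern is that each poll is a short, repeatable status check, so a dropped connection between polls does not restart the job itself.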

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (stable/rocky)

Change abandoned by Quique Llorente (<email address hidden>) on branch: stable/rocky
Review: https://review.openstack.org/602595
Reason: This was an old, non-working attempt to fix the bug.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-heat-templates (master)

Change abandoned by Quique Llorente (<email address hidden>) on branch: master
Review: https://review.openstack.org/602594
Reason: This was an old, non-working attempt to fix the bug.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/612387

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/rocky)

Reviewed: https://review.openstack.org/612387
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=d213435a61072b0a0d559eecb30dcdf1167293ee
Submitter: Zuul
Branch: stable/rocky

commit d213435a61072b0a0d559eecb30dcdf1167293ee
Author: James Slagle <email address hidden>
Date: Thu Oct 18 16:11:32 2018 -0400

    Run NetworkDeployment as async task

    Change-Id: If1f35980a98a9015ca65f2c6a3e4db04725f1c10
    Closes-Bug: #1792343
    (cherry picked from commit ac4ac838e1a3eda32bbf8e9c0a61a89142688194)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/618118

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/618118
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=56bf1d6db54e943cda5a5936e596eff3ceece024
Submitter: Zuul
Branch: stable/queens

commit 56bf1d6db54e943cda5a5936e596eff3ceece024
Author: James Slagle <email address hidden>
Date: Thu Sep 20 13:36:03 2018 -0400

    Set SSH server keep alive options

    Change-Id: I08781fe2aa6472d3fae5c5f5d0babd1f7a3b9b2d
    Closes-Bug: #1792343
    (cherry picked from commit c0f41cae9f672c21f05fa7b0cfbfeb66d1cfe296)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 10.1.0

This issue was fixed in the openstack/tripleo-common 10.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.6.7

This issue was fixed in the openstack/tripleo-common 8.6.7 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/636978

kobig (kobi.ginon) wrote :

In my experience with the latest Rocky version, the issue is still there even with the fix suggested above.
See my suggested fix in the similar thread:
https://bugs.launchpad.net/tripleo/+bug/1769622

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/queens)

Reviewed: https://review.openstack.org/636978
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=a2550158e1aff6c0d6f23ec1bb7e19ab4c043b5e
Submitter: Zuul
Branch: stable/queens

commit a2550158e1aff6c0d6f23ec1bb7e19ab4c043b5e
Author: ekultails <email address hidden>
Date: Wed Feb 27 15:32:28 2019 -0500

    Run NetworkDeployment as async task

    Change-Id: If1f35980a98a9015ca65f2c6a3e4db04725f1c10
    Closes-Bug: #1792343
    (cherry picked from commit ac4ac838e1a3eda32bbf8e9c0a61a89142688194)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 9.5.0

This issue was fixed in the openstack/tripleo-common 9.5.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 8.7.0

This issue was fixed in the openstack/tripleo-common 8.7.0 release.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/743626

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/743626
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=43aaaaa1190d7f5e808c906e82abcf116e0508c9
Submitter: Zuul
Branch: master

commit 43aaaaa1190d7f5e808c906e82abcf116e0508c9
Author: Alex Schultz <email address hidden>
Date: Tue Jul 28 14:02:53 2020 -0600

    Switch 55-heat-config to async

    If while a task is executing the ssh connection is severed, ansible will
    automagically rerun the command under the covers. This causes problems
    for long running 55-heat-config tasks as first process may have written
    out the deployed json but not the notify.json that we use use to
    determine if it was successful or not. This can lead to a failure
    because the process either never runs to completion. This change
    switches the execution to always be run async to ensure that ssh
    interruptions won't cause inconsistent failures.

    We previously saw a similar issue when invoking the NetworkDeployments
    using this process. We've moved the network configurations to the
    NetworkConfig task in THT/common/deploy-steps.j2 but this code is still
    used to invoked with OS::Heat::SoftwareDeploymentGroup

    Change-Id: Ic911bb6d999caf2dc4afd4cff3d44047c03dc8e4
    Related-Bug: #1792343
    Closes-Bug: #1887846

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/ussuri)

Related fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/743780

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to tripleo-common (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/743781

OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/ussuri)

Reviewed: https://review.opendev.org/743780
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=f38f6bdfa7632d06cfec5e407de9759656045071
Submitter: Zuul
Branch: stable/ussuri

commit f38f6bdfa7632d06cfec5e407de9759656045071
Author: Alex Schultz <email address hidden>
Date: Tue Jul 28 14:02:53 2020 -0600

    Switch 55-heat-config to async

    Change-Id: Ic911bb6d999caf2dc4afd4cff3d44047c03dc8e4
    Related-Bug: #1792343
    Closes-Bug: #1887846
    (cherry picked from commit 43aaaaa1190d7f5e808c906e82abcf116e0508c9)

tags: added: in-stable-ussuri
OpenStack Infra (hudson-openstack) wrote : Related fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/743781
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2e50794163f115d4bb83e73a4be96a5058f6e346
Submitter: Zuul
Branch: stable/train

commit 2e50794163f115d4bb83e73a4be96a5058f6e346
Author: Alex Schultz <email address hidden>
Date: Tue Jul 28 14:02:53 2020 -0600

    Switch 55-heat-config to async

    Change-Id: Ic911bb6d999caf2dc4afd4cff3d44047c03dc8e4
    Related-Bug: #1792343
    Closes-Bug: #1887846
    (cherry picked from commit 43aaaaa1190d7f5e808c906e82abcf116e0508c9)

tags: added: in-stable-train