jq: error: Could not open file /var/lib/heat-config/deployed/XXXXXX.notify.json: No such file or directory

Bug #1887846 reported by Emilien Macchi
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: Medium
Assigned to: Alex Schultz

Bug Description

It's possible that a deployment hits this error:

"jq: error: Could not open file /var/lib/heat-config/deployed/4aeb1e17-6d45-4bc5-ad30-d0d7134a5bd5.notify.json: No such file or directory"

This can happen when 55-heat-config has created the notify.json file but the file is not yet visible on the filesystem at the moment jq tries to read it to get the return code.

To address this race condition, we should retry fetching the file with a timeout value.
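
As an illustration of the retry-with-timeout idea, here is a minimal Python sketch. It is not the actual patch, which changes tasks in tripleo-common; the function name `read_status_code`, the polling interval, and the `deploy_status_code` key are assumptions made for this example.

```python
import json
import time


def read_status_code(notify_path, timeout=20.0, interval=1.0):
    """Poll for a notify.json file and return its status code.

    Hypothetical helper: the "deploy_status_code" key is an assumption
    about the file layout, not something stated in this report.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(notify_path) as fh:
                return json.load(fh)["deploy_status_code"]
        except (FileNotFoundError, json.JSONDecodeError) as exc:
            # The file is missing or only partially written: retry until
            # the deadline passes, then report a clear error.
            if time.monotonic() >= deadline:
                raise TimeoutError(
                    f"{notify_path} not readable after {timeout}s"
                ) from exc
            time.sleep(interval)
```

A caller would pass the full path of the `<uuid>.notify.json` file and treat `TimeoutError` as a failed deployment step.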

Changed in tripleo:
milestone: none → victoria-1
assignee: nobody → Emilien Macchi (emilienm)
status: New → Triaged
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/741520

Changed in tripleo:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/741526

OpenStack Infra (hudson-openstack) wrote : Change abandoned on tripleo-common (master)

Change abandoned by Emilien Macchi (<email address hidden>) on branch: master
Review: https://review.opendev.org/741526
Reason: this is too complex; we'll need to refactor the whole playbook at some point. I'll go with https://review.opendev.org/#/c/741520

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/741520
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=e6634751e5bb127e4513bb7fc6ce51ca4770a7e9
Submitter: Zuul
Branch: master

commit e6634751e5bb127e4513bb7fc6ce51ca4770a7e9
Author: Emilien Macchi <email address hidden>
Date: Thu Jul 16 14:34:40 2020 -0400

    Retry fetching {{ deployment_uuid }}.notify.json file

    During a deployment, there is a small chance that right after
    55-heat-config runs, the {{ deployment_uuid }}.notify.json file doesn't
    exist yet on the filesystem.

    This patch retries fetching the file for 20s and returns an error
    message if it can't be found.

    Change-Id: Ia29944a07900ff36d5d4a5826cdd076d11864962
    Closes-Bug: #1887846

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/742184

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/742185

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/ussuri)

Reviewed: https://review.opendev.org/742184
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=20f32171e4da9d029f40b54e7a0c61074fcda1ba
Submitter: Zuul
Branch: stable/ussuri

commit 20f32171e4da9d029f40b54e7a0c61074fcda1ba
Author: Emilien Macchi <email address hidden>
Date: Thu Jul 16 14:34:40 2020 -0400

    Retry fetching {{ deployment_uuid }}.notify.json file

    During a deployment, there is a small chance that right after
    55-heat-config runs, the {{ deployment_uuid }}.notify.json file doesn't
    exist yet on the filesystem.

    This patch retries fetching the file for 20s and returns an error
    message if it can't be found.

    Change-Id: Ia29944a07900ff36d5d4a5826cdd076d11864962
    Closes-Bug: #1887846
    (cherry picked from commit e6634751e5bb127e4513bb7fc6ce51ca4770a7e9)

tags: added: in-stable-ussuri
tags: added: in-stable-train
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/742185
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=258a4af14698b09be588d5eba0f9449fce994dde
Submitter: Zuul
Branch: stable/train

commit 258a4af14698b09be588d5eba0f9449fce994dde
Author: Emilien Macchi <email address hidden>
Date: Thu Jul 16 14:34:40 2020 -0400

    Retry fetching {{ deployment_uuid }}.notify.json file

    During a deployment, there is a small chance that right after
    55-heat-config runs, the {{ deployment_uuid }}.notify.json file doesn't
    exist yet on the filesystem.

    This patch retries fetching the file for 20s and returns an error
    message if it can't be found.

    Change-Id: Ia29944a07900ff36d5d4a5826cdd076d11864962
    Closes-Bug: #1887846
    (cherry picked from commit e6634751e5bb127e4513bb7fc6ce51ca4770a7e9)

Alex Schultz (alex-schultz) wrote :

It is still occurring. It looks like an issue with ansible execution where the task is being rerun on the same node. The 55-heat-config script has a race condition: if the first execution is still running when the second one comes in, our task will fail because the notify file hasn't been created yet.

Changed in tripleo:
status: Fix Released → Triaged
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (master)

Fix proposed to branch: master
Review: https://review.opendev.org/743626

Changed in tripleo:
assignee: Emilien Macchi (emilienm) → Alex Schultz (alex-schultz)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (master)

Reviewed: https://review.opendev.org/743626
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=43aaaaa1190d7f5e808c906e82abcf116e0508c9
Submitter: Zuul
Branch: master

commit 43aaaaa1190d7f5e808c906e82abcf116e0508c9
Author: Alex Schultz <email address hidden>
Date: Tue Jul 28 14:02:53 2020 -0600

    Switch 55-heat-config to async

    If the ssh connection is severed while a task is executing, ansible will
    automagically rerun the command under the covers. This causes problems
    for long-running 55-heat-config tasks, as the first process may have
    written out the deployed json but not the notify.json that we use to
    determine whether it was successful. This can lead to a failure
    because the process never runs to completion. This change
    switches the execution to always run async to ensure that ssh
    interruptions won't cause inconsistent failures.

    We previously saw a similar issue when invoking the NetworkDeployments
    using this process. We've moved the network configurations to the
    NetworkConfig task in THT/common/deploy-steps.j2, but this code is still
    invoked with OS::Heat::SoftwareDeploymentGroup.

    Change-Id: Ic911bb6d999caf2dc4afd4cff3d44047c03dc8e4
    Related-Bug: #1792343
    Closes-Bug: #1887846
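
The "start once, then poll" pattern that the commit message describes can be sketched in Python as follows. This is only a loose analogy, not Ansible's async implementation or the tripleo-common change: `start_job`, `poll_job`, the wrapper script and the status-file layout are all hypothetical names invented for the example, and `start_new_session` assumes a POSIX host.

```python
import json
import subprocess
import sys
import time

# Wrapper executed in a detached process: runs the command, then publishes
# the return code with an atomic rename so a poller never sees partial JSON.
_WRAPPER = """
import json, os, subprocess, sys
status_path, cmd = sys.argv[1], sys.argv[2:]
rc = subprocess.call(cmd)
tmp = status_path + ".tmp"
with open(tmp, "w") as fh:
    json.dump({"rc": rc}, fh)
os.replace(tmp, status_path)
"""


def start_job(cmd, status_path):
    """Launch cmd exactly once, detached from the caller's session."""
    subprocess.Popen(
        [sys.executable, "-c", _WRAPPER, status_path, *cmd],
        start_new_session=True,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )


def poll_job(status_path, timeout=30.0, interval=0.2):
    """Wait for the status file; polling again never reruns the command."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with open(status_path) as fh:
                return json.load(fh)["rc"]
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no result in {status_path}")
            time.sleep(interval)
```

Because the job runs independently of the connection that started it, a dropped connection only interrupts polling; reconnecting and calling `poll_job` again reads the same result instead of starting a second run.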

Changed in tripleo:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/743780

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-common (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/743781

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/ussuri)

Reviewed: https://review.opendev.org/743780
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=f38f6bdfa7632d06cfec5e407de9759656045071
Submitter: Zuul
Branch: stable/ussuri

commit f38f6bdfa7632d06cfec5e407de9759656045071
Author: Alex Schultz <email address hidden>
Date: Tue Jul 28 14:02:53 2020 -0600

    Switch 55-heat-config to async

    If the ssh connection is severed while a task is executing, ansible will
    automagically rerun the command under the covers. This causes problems
    for long-running 55-heat-config tasks, as the first process may have
    written out the deployed json but not the notify.json that we use to
    determine whether it was successful. This can lead to a failure
    because the process never runs to completion. This change
    switches the execution to always run async to ensure that ssh
    interruptions won't cause inconsistent failures.

    We previously saw a similar issue when invoking the NetworkDeployments
    using this process. We've moved the network configurations to the
    NetworkConfig task in THT/common/deploy-steps.j2, but this code is still
    invoked with OS::Heat::SoftwareDeploymentGroup.

    Change-Id: Ic911bb6d999caf2dc4afd4cff3d44047c03dc8e4
    Related-Bug: #1792343
    Closes-Bug: #1887846
    (cherry picked from commit 43aaaaa1190d7f5e808c906e82abcf116e0508c9)

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-common (stable/train)

Reviewed: https://review.opendev.org/743781
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=2e50794163f115d4bb83e73a4be96a5058f6e346
Submitter: Zuul
Branch: stable/train

commit 2e50794163f115d4bb83e73a4be96a5058f6e346
Author: Alex Schultz <email address hidden>
Date: Tue Jul 28 14:02:53 2020 -0600

    Switch 55-heat-config to async

    If the ssh connection is severed while a task is executing, ansible will
    automagically rerun the command under the covers. This causes problems
    for long-running 55-heat-config tasks, as the first process may have
    written out the deployed json but not the notify.json that we use to
    determine whether it was successful. This can lead to a failure
    because the process never runs to completion. This change
    switches the execution to always run async to ensure that ssh
    interruptions won't cause inconsistent failures.

    We previously saw a similar issue when invoking the NetworkDeployments
    using this process. We've moved the network configurations to the
    NetworkConfig task in THT/common/deploy-steps.j2, but this code is still
    invoked with OS::Heat::SoftwareDeploymentGroup.

    Change-Id: Ic911bb6d999caf2dc4afd4cff3d44047c03dc8e4
    Related-Bug: #1792343
    Closes-Bug: #1887846
    (cherry picked from commit 43aaaaa1190d7f5e808c906e82abcf116e0508c9)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-common 11.5.0

This issue was fixed in the openstack/tripleo-common 11.5.0 release.
