StarlingX

Prestage orchestration can hang indefinitely if one subcloud prestage hangs

Bug #1971994 reported by Kyle MacLeod on 2022-05-06

This bug affects 1 person

Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Kyle MacLeod

Bug Description

Brief Description
This issue was observed while testing a large-scale subcloud prestage orchestration. In one of many test rounds, ansible hung partway through the prestage of a subcloud, causing the whole strategy to hang for many hours. The process had to be killed manually because strategy abort did not work in this case.

Severity
Major

Steps to Reproduce
Repeat large-scale subcloud prestage orchestration a number of times

Expected Behavior
Prestage orchestration either fails or completes. It should never hang.

Actual Behavior
Prestage orchestration hung

Reproducibility
Very rare; this is the first time the issue has been reported.

System Configuration
Distributed Cloud

Load info
StarlingX master

Last Pass
Many times before

Alarms
N/A

Test Activity
Developer Testing

Workaround
Manually kill dcmanager-orchestrator
Kill hung ansible process

Kyle MacLeod (kmacleod) on 2022-05-06

Changed in starlingx:
assignee:	nobody → Kyle MacLeod (kmacleod)

OpenStack Infra (hudson-openstack) on 2022-05-06

Changed in starlingx:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2022-05-09: Fix merged to distcloud (master)

Reviewed:  https://review.opendev.org/c/starlingx/distcloud/+/839948
Committed: https://opendev.org/starlingx/distcloud/commit/886697755b21e09bca4f7640b637efb1675c2db5
Submitter: "Zuul (22348)"
Branch:    master

commit 886697755b21e09bca4f7640b637efb1675c2db5
Author: Kyle MacLeod <kyle.macleod@windriver.com>
Date:   Wed May 4 16:20:11 2022 -0400

Add timeout for prestage ansible playbooks
    
    Fix an issue observed during the testing of a large-scale
    subcloud prestage operation. In one of many rounds of test,
    ansible hung in the middle of prestage of a subcloud causing
    the whole strategy to hang for many hours. The process had
    to be manually killed as strategy abort did not work in
    this case.
    
    The issue is addressed by invoking the 'ansible-playbook' call
    via '/usr/bin/timeout'. The timeout command will kill the
    ansible-playbook tree if the given timeout value is hit.
    
    For now, only the prestaging operations are using the
    new timeout. The original 'run_playbook' method is
    preserved in order to reduce any risk in this new
    method of invoking a subprocess.
    
    When a timeout occurs, the ansible log is updated before
    the process is killed. Example:
    
        2022-04-28-17:28:44 TIMEOUT (1800s) - playbook is terminated
    
    Default timeout:
    - We use a global timeout (default: 3600s / 1hr)
    - The default can be modified from the [DEFAULTS] section
      in /etc/dcmanager/dcmanager.conf. To change it, add the
      'playbook_timeout' as shown below, then restart the
      dcmanager-manager service.
    
          playbook_timeout=3600
    
    Future considerations (not part of this commit):
    - In python3, this code can be simplified to
      use the new subprocess.run(timeout=val) method
      or Popen with p.wait(timeout=val)
    - Beginning with ansible 2.10, we can introduce
      the ANSIBLE_TASK_TIMEOUT value to set a
      task-level timeout. This is not available
      in our current version of ansible (2.7.5)
    
    Test Plan:
    
    PASS:
    Add unit tests covering:
      - no timeout given (maintain current functionality)
      - timeout given but not hit
      - timeout given; process is killed
      - timeout given; hung process (ignoring SIGTERM) is killed
    
    Run prestage operations as normal
      - no regression
    
    Modify default timeout to 5s, run prestage operations
      - verify that timeout occurs
      - verify that ansible-playbook is terminated
      - verify that ansible log file shows TIMEOUT log
    
    Modify default timeout to 5s for a single subcloud, then
    run prestage operations
      - verify that only the single subcloud operation is killed
    
    Modify prestage prestage-sw-packages/tasks/main.yml to use
    '--bwlimit=128' in the rsync from registry.central. This slows down
    the package prestaging, and the playbook timeout is reached.
    
    Add a 'pause' task in the prestage-sw-packages ansible for a
    single subcloud. Ensure just the one task times out.
    
    Exercise non-prestaging ansible playbook (to ensure subprocess
    Popen change does not impact other playbooks)
      - provisioned a new subcloud
    
    Closes-Bug: 1971994
    Change-Id: Iaf1bee786afc505594c6671c959cc2650202ee6c
    Signed-off-by: Kyle MacLeod <kyle.macleod@windriver.com>
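
For readers who want to see the timeout-and-kill behaviour in isolation, here is a minimal Python 3 sketch of the alternative mentioned under "Future considerations" (Popen with p.wait(timeout=val)). It is not the code from the commit, which wraps ansible-playbook in /usr/bin/timeout. The function name run_with_timeout, the grace_s parameter and the log callback are hypothetical names chosen for illustration. The sketch assumes a POSIX system, starts the child in its own session so that the whole process group can be signalled, and escalates from SIGTERM to SIGKILL for a child that ignores SIGTERM.

```python
import os
import signal
import subprocess
import sys
import time


def _signal_group(proc, sig):
    """Send sig to the child's process group; ignore an already-gone group."""
    try:
        # With start_new_session=True the child's pid is also its group id.
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        pass


def run_with_timeout(cmd, timeout_s, grace_s=2.0, log=None):
    """Run cmd and return (returncode, timed_out).

    Hypothetical helper, not the distcloud implementation. If cmd is still
    running after timeout_s seconds, a TIMEOUT line is passed to log (if
    given), the process group gets SIGTERM, and after grace_s more seconds
    it gets SIGKILL.
    """
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s), False
    except subprocess.TimeoutExpired:
        if log is not None:
            stamp = time.strftime("%Y-%m-%d-%H:%M:%S")
            log("%s TIMEOUT (%ds) - playbook is terminated"
                % (stamp, timeout_s))
        _signal_group(proc, signal.SIGTERM)
        try:
            proc.wait(timeout=grace_s)
        except subprocess.TimeoutExpired:
            # The child ignored SIGTERM; SIGKILL cannot be ignored.
            _signal_group(proc, signal.SIGKILL)
            proc.wait()
        return proc.returncode, True
```

A caller could pass the ansible-playbook argument list as cmd and the configured playbook_timeout as timeout_s. A negative return code means the child was ended by that signal number, so a second element of True combined with a return code of -9 indicates the SIGKILL escalation was needed.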

Changed in starlingx:
status:	In Progress → Fix Released

Ghada Khalil (gkhalil) on 2022-05-20

Changed in starlingx:
importance:	Undecided → Medium
tags:	added: stx.7.0 stx.distcloud
