Prestage orchestration can hang indefinitely if one subcloud prestage hangs
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Kyle MacLeod |
Bug Description
Brief Description
This issue was observed while testing a large-scale subcloud prestage orchestration. In one of many test rounds, ansible hung in the middle of prestaging a subcloud, causing the whole strategy to hang for many hours. The process had to be killed manually because strategy abort did not work in this case.
Severity
Major
Steps to Reproduce
Repeat large-scale subcloud prestage orchestration a number of times
Expected Behavior
Prestage orchestration either fails or completes. It should never hang.
Actual Behavior
Prestage orchestration hung
Reproducibility
Very rare; this is the first time the issue has been reported.
System Configuration
Distributed Cloud
Load info
StarlingX master
Last Pass
Many times before
Alarms
N/A
Test Activity
Developer Testing
Workaround
Manually kill dcmanager-
Kill hung ansible process
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud
Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/839948
Committed: https://opendev.org/starlingx/distcloud/commit/886697755b21e09bca4f7640b637efb1675c2db5
Submitter: "Zuul (22348)"
Branch: master
commit 886697755b21e09bca4f7640b637efb1675c2db5
Author: Kyle MacLeod <email address hidden>
Date: Wed May 4 16:20:11 2022 -0400
Add timeout for prestage ansible playbooks
Fix an issue observed during the testing of a large-scale
subcloud prestage operation. In one of many rounds of test,
ansible hung in the middle of prestage of a subcloud causing
the whole strategy to hang for many hours. The process had
to be manually killed as strategy abort did not work in
this case.
The issue is addressed by invoking the 'ansible-playbook' call
via '/usr/bin/timeout'. The timeout command will kill the
ansible-playbook tree if the given timeout value is hit.
For now, only the prestaging operations are using the
new timeout. The original 'run_playbook' method is
preserved in order to reduce any risk in this new
method of invoking a subprocess.
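The '/usr/bin/timeout' wrapping described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the helper names `build_timeout_argv` and `run_with_timeout`, the 10-second kill-after grace period, and the use of `shutil.which` to locate the binary are all assumptions made for the example.

```python
import shutil
import subprocess


def build_timeout_argv(argv, timeout_s, kill_after_s=10,
                       timeout_bin="/usr/bin/timeout"):
    """Prefix a command with coreutils 'timeout' (hypothetical helper).

    'timeout' sends SIGTERM once timeout_s elapses, and SIGKILL if the
    command is still running kill_after_s seconds after that.
    """
    return [
        timeout_bin,
        "--kill-after=%ss" % kill_after_s,
        "%ss" % timeout_s,
    ] + list(argv)


def run_with_timeout(argv, timeout_s, kill_after_s=10, timeout_bin=None):
    """Run argv under 'timeout'; return (returncode, timed_out)."""
    # Assumption: fall back to the path named in the commit message if
    # 'timeout' is not found on PATH.
    timeout_bin = timeout_bin or shutil.which("timeout") or "/usr/bin/timeout"
    full_argv = build_timeout_argv(argv, timeout_s, kill_after_s, timeout_bin)
    returncode = subprocess.call(full_argv)
    # coreutils 'timeout' exits with 124 when the command timed out and
    # stopped after SIGTERM. A forced SIGKILL is reported differently, so
    # callers should still treat any non-zero status as a failure.
    return returncode, returncode == 124


if __name__ == "__main__":
    # 'sleep 5' stands in for a hung ansible-playbook invocation.
    print(run_with_timeout(["sleep", "5"], timeout_s=1))
```

Wrapping the command this way keeps the timeout enforcement outside the Python process, which is why it works even on interpreters whose subprocess module has no timeout support.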
When a timeout occurs, the ansible log is updated before
the process is killed. Example:
Default timeout:
- We use a global timeout (default: 3600s / 1hr)
- The default can be modified from the [DEFAULTS] section
in /etc/dcmanager/dcmanager.conf. To change it, add the
'playbook_timeout' as shown below, then restart the
dcmanager-manager service.
Future considerations (not part of this commit):
- In python3, this code can be simplified to
use the new subprocess.run(timeout=val) method
or Popen with p.wait(timeout=val)
- Beginning with ansible 2.10, we can introduce
the ANSIBLE_TASK_TIMEOUT value to set a
task-level timeout. This is not available
in our current version of ansible (2.7.5)
Test Plan:
PASS:
Add unit tests covering:
- no timeout given (maintain current functionality)
- timeout given but not hit
- timeout given; process is killed
- timeout given; hung process (ignoring SIGTERM) is killed
Run prestage operations as normal
- no regression
Modify default timeout to 5s, run prestage operations
- verify that timeout occurs
- verify that ansible-playbook is terminated
- verify that ansible log file shows TIMEOUT log
Modify default timeout to 5s for a single subcloud, then
run prestage operations
- verify that only the single subcloud operation is killed
Modify prestage prestage-sw-packages/tasks/main.yml to use
'--bwlimit=128' in the rsync from registry.central. This slows down
the package prestaging, and the playbook timeout is reached.
Add a 'pause' task in the prestage-sw-packages ansible for a
single subcloud. Ensure just the one task times out.
Exercise non-prestaging ansible playbook (to ensure subprocess
Popen change does not impact other playbooks)
- provisioned a...