Patch Orchestration hanging when failed to obtain health report
Bug #1976109 reported by Bo Yuan Chang
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Bo Yuan Chang |
Bug Description
Brief Description
Patch orchestration hangs when the system fails to get the health report from the subclouds.
Severity
Critical: Unable to proceed with patch update orchestration
Steps to Reproduce
1. Apply patches in system controller
2. Upgrade all subclouds with patch
3. Delete patch from system controller
4. Patch orchestrate all subclouds
Expected Behavior
Patch orchestration can be aborted and then deleted
Actual Behavior
Patch orchestration is stuck in the updating state and cannot be aborted
Reproducibility
Yes
System Configuration
any DC setup
Branch/Pull Time/Commit
master
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Bo Yuan Chang (cby19961020)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud
Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/840216
Committed: https://opendev.org/starlingx/distcloud/commit/116c119541b2b7ac85b475d5b0e1c7c4292ea2fa
Submitter: "Zuul (22348)"
Branch: master
commit 116c119541b2b7ac85b475d5b0e1c7c4292ea2fa
Author: BoYuan Chang <email address hidden>
Date: Mon May 2 11:18:41 2022 -0500
Ensure one patching worker thread per subcloud
Remove the worker thread creation in STRATEGY_STATE_UPDATING_PATCHES
state since it's already done in STRATEGY_STATE_INITIAL state. This
code flaw was revealed from testing patch orchestration of large
max_parallel_subclouds size. The patch orch thread loops through the
list of subclouds of the current step and process each one of them
every 10s. When the batch size is large, the subcloud state
retrieved at the beginning of the loop and stored in memory may be
stale for a subcloud by the time it processes that subcloud.
A try/catch statement is also added to pre_check to prevent the state
from being stuck indefinitely when the health report cannot be obtained
from sysinv.
Test Plan:
1. Ensure all subclouds are free of mgmt affecting alarms. Execute patch
orchestration of large max_parallel_subclouds size and verify that it
works without any failure(s) due to "Patch in-progress" alarm.
2. Check when the get system health report timed out the subcloud will
fail rather than hanging indefinitely.
Closes-Bug: 1971172
Closes-Bug: 1976109
Signed-off-by: BoYuan Chang <email address hidden>
Change-Id: Ib70f00228ebd181d175c09a462b95077b0a8218b
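
For readers unfamiliar with the fix, the pre_check change described in the commit message can be pictured roughly as follows. This is only a minimal sketch and not the actual distcloud code: the state constants, the function signature, and the get_health_report callable are hypothetical names chosen for illustration. The one idea taken from the commit is that a failure to obtain the health report is caught and turned into a failed state for the subcloud instead of leaving it stuck.

```python
# Hypothetical state names, for illustration only (not the distcloud constants).
STATE_FAILED = "failed"
STATE_UPDATING = "updating patches"


def pre_check(get_health_report, subcloud_name):
    """Return (next_state, details) for one subcloud.

    get_health_report is an assumed callable that fetches the health
    report (for example from sysinv) and may raise on error or timeout.
    """
    try:
        report = get_health_report(subcloud_name)
    except Exception as exc:
        # Without this handler the exception would escape the worker and
        # the subcloud would never leave its current state.
        details = "Failed to obtain health report for %s: %s" % (
            subcloud_name, exc)
        return STATE_FAILED, details
    # The report was obtained; orchestration can continue for this subcloud.
    return STATE_UPDATING, report
```

With a handler like this, a timed-out health query makes the subcloud step fail, which matches item 2 of the test plan, and the strategy can then be aborted and deleted.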