Dcmanager subcloud audit is not aware of upgrade context
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX |
Fix Released
|
Medium
|
Al Bailey |
Bug Description
Brief Description
-----------------
During subcloud upgrade, failure can occur during remote install or data migration step for reasons such as misconfigurations, temporary network glitch. If this occurs, the subcloud deploy_status is set to install-
Severity
--------
Major
Steps to Reproduce
------------------
With the system controller running N+1 load and at least one subcloud running N load
- Create a subcloud upgrade strategy using the command "dcmanager upgrade-strategy create <subcloud-name>"
- Apply the upgrade strategy using the command "dcmanager upgrade-strategy apply"
- Induce a failure during upgrade simplex step by temporarily removing route to the subcloud bootstrap IP
- View the subcloud detail using the command "dcmanager subcloud show <subcloud-name>"
Expected Behavior
------------------
The affected subcloud should be listed as "offline".
Actual Behavior
----------------
The affected subcloud would alaywas be listed as "online" as dcmanager subcloud audit skips auditing any subclouds with deploy_status not equal to 'deploy-failed', 'deploying' or 'complete' status.
Reproducibility
---------------
Reproducible
System Configuration
-------
Distributed Cloud
Branch/Pull Time/Commit
-------
Jun 30th master build
Last Pass
---------
I don't think there an existing test case for this specific scenario.
Timestamp/Logs
--------------
N/A, there are no error logs. This is a design oversight.
Test Activity
-------------
Developer Testing
Workaround
----------
Change the deploy_status of the affected subcloud to 'complete' in the database, wait up to 20s for the subcloud audit to resume auditing the affected subcloud.
description: | updated |
tags: | added: stx.distcloud stx.update |
Changed in starlingx: | |
assignee: | nobody → Al Bailey (albailey1974) |
Bart, this is the block of code that causes it to skip the audit
if (subcloud. deploy_ status not in
[ consts. DEPLOY_ STATE_DONE,
consts. DEPLOY_ STATE_DEPLOYING ,
consts. DEPLOY_ STATE_DEPLOY_ FAILED] ):
LOG.debug( "Skip subcloud %s audit, deploy_status: %s" %
(subcloud. name, subcloud. deploy_ status) )
continue
Should failed be the only one removed from the list. I am assuming 'done' and 'deploying' will never get stuck.