Orchestrated upgrade API failures may cause strategy to fail with undescriptive error message
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Kyle MacLeod | | |
Bug Description
Brief Description
During some steps of an orchestrated upgrade, calls to a platform API (e.g. sysinv) may fail and raise an 'AttributeError' exception. The orchestration then fails with an undescriptive error message in the output of 'dcmanager strategy-step list'.
Severity
Major
Steps to Reproduce
Install both the system controller and a duplex subcloud using the N load
Upgrade the system controller to the N+1 load
Attempt to upgrade the duplex subcloud to the N+1 load, then check the strategy step list and the logs
Expected Behavior
If the upgrade fails, a descriptive failure message is shown to the user in 'dcmanager strategy-step list', explaining where to look for further information.
Actual Behavior
An undescriptive 'AttributeError: status_code' is shown in the reason field for the failed upgrade.
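For illustration only, the sketch below shows how a symptom like this can arise and one way to keep the original error visible. It is not the dcmanager code or the merged fix. `FakeClientError`, `describe_failure_fragile`, `describe_failure` and the message wording are hypothetical, and it assumes the failure handler reads `status_code` from an exception that does not always carry that attribute.

```python
class FakeClientError(Exception):
    """Hypothetical stand-in for a client exception that lacks status_code."""


def describe_failure_fragile(exc):
    # Direct attribute access: raises AttributeError when the attribute is
    # missing, which hides the original error from the user.
    return "API call failed with status %s" % exc.status_code


def describe_failure(exc, log_hint="/var/log/sysinv.log"):
    # getattr with a default never raises, so the original error text survives.
    status = getattr(exc, "status_code", None)
    detail = str(exc) or exc.__class__.__name__
    if status is None:
        return "API call failed: %s (see %s on the subcloud)" % (detail, log_hint)
    return "API call failed with status %s: %s (see %s on the subcloud)" % (
        status, detail, log_hint)
```

With a handler shaped like `describe_failure`, the reason field could carry the underlying message and a pointer to the relevant log instead of the attribute name alone.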
Reproducibility
100% reproducible in some scenarios (e.g. forcing a management alarm on the subcloud causes the pre-check to fail and return the described message).
System Configuration
Distributed cloud
Load info (eg: 2022-03-
N load: 2022-01-10_02-00-35
N+1 load: 2022-05-28_20-00-06
Last Pass
Not sure if this scenario used to work in the past.
Timestamp/Logs
One example of such failure can be seen in the orchestrator logs:
2022-06-11 20:53:38.662 939613 INFO dcmanager.
2022-06-11 21:04:59.882 939613 INFO dcmanager.
2022-06-11 21:05:09.386 939613 INFO dcmanager.
2022-06-11 21:05:13.869 939613 WARNING cgtsclient.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.
2022-06-11 21:05:19.396 939613 INFO dcmanager.
Alarms
N/A
Test Activity
Developer testing
Workaround
Looking up the logs for the specific failed call named in the exception stack trace may give more information about the error.
For instance, the orchestrator logs above show that the failed call was a swact request to the subcloud sysinv API. The subcloud's /var/log/sysinv.log contains the following warning, which explains the issue further:
sysinv 2022-06-11 21:05:13.868 113534 INFO sysinv.
ler-1 is not ready to take service, service is syncing data', u'action': u'swact-pre-check', u'error_code': u'-1001'}
sysinv 2022-06-11 21:05:13.868 113534 WARNING wsme.api [-] Client-side error: controller-services on controller-1 is not ready to take service, service is syncing data: ClientSideError: controller-services on controller-1 is not ready to take service, service is syncing data
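The workaround can be sketched as a small filter that pulls WARNING and ERROR lines logged close to the failure time. This is a hypothetical helper, not a StarlingX tool; `lines_near` and its parameters are made up for illustration, and the line layout (timestamp, PID, level) is assumed from the sample above.

```python
import re
from datetime import datetime, timedelta

# Assumed layout, taken from the sample line above:
# "sysinv 2022-06-11 21:05:13.868 113534 WARNING wsme.api [-] Client-side error: ..."
LINE_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) \d+ (?P<level>[A-Z]+) "
)
TS_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def lines_near(lines, failure_time, window_seconds=5,
               levels=("WARNING", "ERROR")):
    """Return lines of the given levels logged within the window around failure_time."""
    target = datetime.strptime(failure_time, TS_FORMAT)
    window = timedelta(seconds=window_seconds)
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        # Skip lines that do not parse or are below the requested levels.
        if not match or match.group("level") not in levels:
            continue
        stamp = datetime.strptime(match.group("ts"), TS_FORMAT)
        if abs(stamp - target) <= window:
            hits.append(line.rstrip("\n"))
    return hits
```

On the subcloud, this could be called with the lines of /var/log/sysinv.log and the timestamp of the orchestrator error, for example `lines_near(open("/var/log/sysinv.log"), "2022-06-11 21:05:13.870")`.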
Changed in starlingx: | |
assignee: | nobody → Kyle MacLeod (kmacleod) |
importance: | Undecided → Medium |
tags: | added: stx.7.0 stx.update |
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/845790