Orchestrated upgrade API failures may cause strategy to fail with undescriptive error message

Bug #1978499 reported by Kyle MacLeod
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Kyle MacLeod

Bug Description

Brief Description

During some steps of an orchestrated upgrade, calls to a platform API (e.g. sysinv) may fail. When that happens, an 'AttributeError' exception is thrown, and the orchestration fails with an undescriptive error message in the output of 'dcmanager strategy-step list'.

Severity

Major

Steps to Reproduce

    1. Install both the system controller and a duplex subcloud using the N load
    2. Upgrade the system controller to the N+1 load
    3. Attempt to upgrade the duplex subcloud to the N+1 load, then check the strategy step list and the logs

Expected Behavior

If the upgrade fails, 'dcmanager strategy-step list' shows the user a descriptive failure message that explains where to look for further information.

Actual Behavior

An undescriptive 'AttributeError: status_code' is shown in the reason field for the failed upgrade.

Reproducibility

Some scenarios reproduce it 100% of the time. For example, forcing a management alarm on the subcloud causes the pre-check to fail and return the described message.

System Configuration

Distributed cloud

Load info (eg: 2022-03-10_20-00-07)

N load: 2022-01-10_02-00-35
N+1 load: 2022-05-28_20-00-06

Last Pass

Not sure if this scenario used to work in the past.

Timestamp/Logs

One example of such failure can be seen in the orchestrator logs:

2022-06-11 20:53:38.662 939613 INFO dcmanager.orchestrator.orch_thread [-] (upgrade) Stage: 1, State: unlocking controller-1, Subcloud: subcloud4
2022-06-11 21:04:59.882 939613 INFO dcmanager.orchestrator.states.base [req-acc33fd8-dfcc-413d-9170-ce8d8711601d - - - - -] Stage: 1, State: unlocking controller-1, Subcloud: subcloud4, Details: Host: controller-1 is now: unlocked enabled available
2022-06-11 21:05:09.386 939613 INFO dcmanager.orchestrator.orch_thread [-] (upgrade) Stage: 1, State: swacting to controller-1, Subcloud: subcloud4
2022-06-11 21:05:13.869 939613 WARNING cgtsclient.common.http [req-358011e6-c512-4765-a8b0-40874da8d47b - - - - -] Request returned failure status.
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread [req-358011e6-c512-4765-a8b0-40874da8d47b - - - - -] (upgrade) Failed! Stage: 1, State: swacting to controller-1, Subcloud: subcloud4: AttributeError: status_code
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread Traceback (most recent call last):
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/orch_thread.py", line 539, in perform_state_action
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread next_state = state_operator.perform_state_action(strategy_step)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/states/swact_host.py", line 59, in perform_state_action
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread response = self.get_sysinv_client(region).swact_host(standby_host.id)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dccommon/drivers/openstack/sysinv_v1.py", line 189, in swact_host
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread return self._do_host_action(host_id, action_value)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dccommon/drivers/openstack/sysinv_v1.py", line 165, in _do_host_action
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread return self.sysinv_client.ihost.update(host_id, patch)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib64/python2.7/site-packages/cgtsclient/v1/ihost.py", line 115, in update
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread return self._update(self._path(ihost_id), patch)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib64/python2.7/site-packages/cgtsclient/common/base.py", line 89, in _update
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread _, body = self.api.json_request(http_method, url, body=body)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py", line 436, in json_request
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread method, **kwargs)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib64/python2.7/site-packages/cgtsclient/common/http.py", line 412, in _cs_request
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread error_json.get('debuginfo'), *args)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib64/python2.7/site-packages/cgtsclient/exc.py", line 157, in from_response
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread cls = _code_map.get(response.status_code, HTTPException)
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/httplib2/__init__.py", line 1708, in __getattr__
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread raise AttributeError, name
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread AttributeError: status_code
2022-06-11 21:05:13.870 939613 ERROR dcmanager.orchestrator.orch_thread
2022-06-11 21:05:19.396 939613 INFO dcmanager.orchestrator.orch_thread [-] (upgrade) Strategy application has failed.
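
The last frames of the traceback show cgtsclient reading 'response.status_code' and the response object's '__getattr__' raising 'AttributeError'. The sketch below reproduces that pattern in isolation. 'FakeResponse' and 'describe_failure' are hypothetical stand-ins written for illustration, not code from httplib2, cgtsclient, or dcmanager, and the status value 400 is arbitrary.

```python
class FakeResponse(dict):
    """Hypothetical stand-in for a response object that exposes its
    HTTP code as 'status' and has no 'status_code' attribute."""

    def __init__(self, status):
        super().__init__()
        self.status = status

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        raise AttributeError(name)


def describe_failure(response):
    """Mimic a caller that assumes 'status_code' always exists."""
    try:
        return "HTTP %d" % response.status_code
    except AttributeError as exc:
        # The original HTTP error is lost; only the attribute name remains.
        return "AttributeError: %s" % exc


message = describe_failure(FakeResponse(400))
print(message)
```

The point of the sketch is that the lookup failure replaces the real API error, which is why the reason field carries no useful detail.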

Alarms

N/A

Test Activity

Developer testing

Workaround

The exception stack trace identifies the specific call that failed. Searching the logs for that call may give more information about the error.

For instance, the orchestrator logs above show that the failed call was a swact request to the subcloud's sysinv API. On the subcloud, /var/log/sysinv.log contains the following entries, which explain the failure:

sysinv 2022-06-11 21:05:13.868 113534 INFO sysinv.common.rest_api [-] Response={u'origin': u'sm', u'oper': u'unknown', u'admin': u'unknown', u'hostname': u'controller-0', u'avail': u'', u'error_details': u'controller-services on controller-1 is not ready to take service, service is syncing data', u'action': u'swact-pre-check', u'error_code': u'-1001'}
sysinv 2022-06-11 21:05:13.868 113534 WARNING wsme.api [-] Client-side error: controller-services on controller-1 is not ready to take service, service is syncing data: ClientSideError: controller-services on controller-1 is not ready to take service, service is syncing data
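
When applying this workaround to a large sysinv.log, it can help to pull the error details out of the 'Response=' entry programmatically. The following is a minimal sketch, assuming the entry sits on a single line and that the text after 'Response=' is a Python dict literal, as in the log above. 'extract_response' is a hypothetical helper name, not part of sysinv.

```python
import ast

# The 'Response=' entry from the sysinv.log excerpt, on a single line.
LOG_LINE = (
    "sysinv 2022-06-11 21:05:13.868 113534 INFO sysinv.common.rest_api [-] "
    "Response={u'origin': u'sm', u'oper': u'unknown', u'admin': u'unknown', "
    "u'hostname': u'controller-0', u'avail': u'', "
    "u'error_details': u'controller-services on controller-1 is not ready "
    "to take service, service is syncing data', "
    "u'action': u'swact-pre-check', u'error_code': u'-1001'}"
)


def extract_response(line):
    """Return the dict logged after 'Response=', or None if absent."""
    marker = "Response="
    idx = line.find(marker)
    if idx == -1:
        return None
    # literal_eval parses the dict without executing arbitrary code.
    return ast.literal_eval(line[idx + len(marker):].strip())


response = extract_response(LOG_LINE)
print(response["action"], response["error_code"])
print(response["error_details"])
```

'ast.literal_eval' is used rather than 'eval' because log content should not be trusted as executable code.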

OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/845790

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/845790
Committed: https://opendev.org/starlingx/config/commit/edd548d8c165ac13d148c2c5d560bd098e345cc3
Submitter: "Zuul (22348)"
Branch: master

commit edd548d8c165ac13d148c2c5d560bd098e345cc3
Author: Kyle MacLeod <email address hidden>
Date: Mon Jun 13 15:35:02 2022 -0400

    Handle status_code / status error in HTTP response

    Checks for the existence of either 'status_code'
    'status_int', or 'status' in the HTTP response object,
    instantiating the HTTPException instance with the found
    code, or raises a generic Exception if not found.

    Test Plan:

    PASS:
    Tested code on system exhibiting the errors
    referenced in the bug. The HTTP response is now
    correctly caught and the correct HTTPException is
    raised.

    Closes-Bug: 1978499

    Signed-off-by: Kyle MacLeod <email address hidden>
    Change-Id: I53d1dd6583368de4f6713a838fe3304d362f1756
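
The approach described in the commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the merged change: 'HTTPException' and '_code_map' are names that appear in the traceback, but their definitions here, along with 'HTTPBadRequest' and 'exception_from_response', are hypothetical.

```python
import types


class HTTPException(Exception):
    """Hypothetical base exception that carries the HTTP status code."""

    def __init__(self, code, details=None):
        super().__init__("HTTP %s: %s" % (code, details))
        self.code = code
        self.details = details


class HTTPBadRequest(HTTPException):
    """Hypothetical subclass for status 400."""


# Hypothetical mapping from status code to exception class.
_code_map = {400: HTTPBadRequest}


def exception_from_response(response, details=None):
    """Build an exception from whichever status attribute exists."""
    for attr in ("status_code", "status_int", "status"):
        # getattr with a default absorbs the AttributeError seen in the bug.
        code = getattr(response, attr, None)
        if code is not None:
            cls = _code_map.get(code, HTTPException)
            return cls(code, details)
    # None of the known attributes is present: fall back to a generic error.
    raise Exception("Response has no recognizable status attribute")


# A response that only exposes 'status', as in the failing scenario.
resp = types.SimpleNamespace(status=400)
err = exception_from_response(resp, "service is syncing data")
print(type(err).__name__, err)
```

With a lookup of this kind, the caller receives an exception carrying the HTTP code and error details instead of a bare 'AttributeError'.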

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Kyle MacLeod (kmacleod)
importance: Undecided → Medium
tags: added: stx.7.0 stx.update