Orchestrated subcloud upgrade failed due to connection error at patching_finish step

Bug #1968321 reported by Gabriel Silva Trevisan
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Gabriel Silva Trevisan

Bug Description

Brief Description
-----------------
While testing a parallel subcloud upgrade, a few subclouds failed during the finishing_patch step.
The failure appears to have been caused by a connection error while retrieving the committed patches from the RegionOne patching API.

Severity
--------
Major

Steps to Reproduce
------------------
Prepare a DC system for orchestrated subcloud upgrade
Create and apply an upgrade strategy for a large number of simplex subclouds in parallel

Expected Behavior
------------------
The upgrade strategy completes successfully for all subclouds

Actual Behavior
----------------
The upgrade failed for some of the subclouds at the finishing_patch step, while trying to retrieve the list of committed patches from the system controller. All failures occurred within the same second, which suggests a transient problem that affected this batch of subclouds together.

Reproducibility
---------------
Intermittent. Reproduced while upgrading a large number of subclouds in parallel. Occurred in one of the two trials executed, with 16% of the subclouds failing due to this error.

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
2022-01-19 18:42:01

Last Pass
---------
Unknown. It is unclear whether a similar upgrade test with a large number of subclouds was previously performed for the stx/6.0 load.

Timestamp/Logs
--------------
Logs from /var/log/dcmanager/orchestrator.log, from the moment one of the failures occurred:

2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread [req-254e92cf-1567-40ed-bef5-39b7753b05e0 - - - - -] (upgrade) Failed! Stage: 1, State: finishing patch strategy, Subcloud: subcloud12: ConnectionError: ('Connection aborted.', BadStatusLine("''",))
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread Traceback (most recent call last):
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/orch_thread.py", line 516, in perform_state_action
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread next_state = state_operator.perform_state_action(strategy_step)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dcmanager/orchestrator/states/upgrade/finishing_patch_strategy.py", line 43, in perform_state_action
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread state=patching_v1.PATCH_STATE_COMMITTED)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/dccommon/drivers/openstack/patching_v1.py", line 64, in query
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread response = requests.get(url, headers=headers, timeout=timeout)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/requests/api.py", line 75, in get
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread return request('get', url, params=params, **kwargs)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/requests/api.py", line 60, in request
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread return session.request(method=method, url=url, **kwargs)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread resp = self.send(prep, **send_kwargs)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread r = adapter.send(request, **kwargs)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 498, in send
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread raise ConnectionError(err, request=request)
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread ConnectionError: ('Connection aborted.', BadStatusLine("''",))
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.orchestrator.orch_thread

No ansible-playbook logs were recorded for the failing subclouds, since the failure occurred during the initial patching step.

Test Activity
-------------
Developer Testing. Seen while verifying a fix for https://bugs.launchpad.net/starlingx/+bug/1965170.

Workaround
----------
Delete the failed strategy, create a new one and reapply it. This worked on the first attempt for the failure shown above.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/837120

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Gabriel Silva Trevisan (g-trevisan)
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/837120
Committed: https://opendev.org/starlingx/distcloud/commit/6473bb885b630a09aca9a1c84c4b75188d061dbc
Submitter: "Zuul (22348)"
Branch: master

commit 6473bb885b630a09aca9a1c84c4b75188d061dbc
Author: Gabriel Silva Trevisan <email address hidden>
Date: Fri Apr 8 08:55:28 2022 -0300

    Cache RegionOne patches on orchestrated upgrade

    Due to a potentially large number of parallel requests during the
    initial stages of orchestrated upgrade, the RegionOne patching API may
    become overloaded, causing the upgrade to fail for some subclouds.

    Add a patching cache to the upgrade strategy, so that RegionOne patches
    can be retrieved by workers without needing to make repeated requests to
    the client API, and potentially overloading the server.

    Test Plan:

    PASS:
    - Apply upgrade orchestration to a large number of simplex subclouds in parallel

    Closes-bug: 1968321
    Signed-off-by: Gabriel Silva Trevisan <email address hidden>
    Change-Id: I45201904fe0015179e31e98f419e6d82fc6806e5
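
The caching idea described in the commit message can be sketched roughly as follows. This is only an illustration, not the code merged in the review above: the class name RegionOnePatchCache, the fetch_patches callable and the "patchstate" key are assumptions made for the example. The first worker to ask for the patch list performs the single request to RegionOne; every other worker reuses the stored result instead of issuing its own request.

```python
import threading


class RegionOnePatchCache(object):
    """Hypothetical cache: query RegionOne once, share the result."""

    def __init__(self, fetch_patches):
        # fetch_patches is an assumed zero-argument callable that returns
        # a dict of {patch_id: patch_info}, e.g. a wrapper around the
        # patching client query.
        self._fetch_patches = fetch_patches
        self._lock = threading.Lock()
        self._patches = None

    def get_patches(self, state=None):
        # Only the first caller triggers the remote query; callers that
        # arrive meanwhile wait on the lock and then reuse the result.
        with self._lock:
            if self._patches is None:
                self._patches = self._fetch_patches()
            patches = self._patches
        if state is None:
            return dict(patches)
        # "patchstate" is an assumed key name for the patch state field.
        return {patch_id: info for patch_id, info in patches.items()
                if info.get("patchstate") == state}


# Simulate 50 parallel subcloud workers sharing one cache.
calls = []


def fake_fetch():
    # Stand-in for the RegionOne patching API request.
    calls.append(1)
    return {
        "PATCH_0001": {"patchstate": "Committed"},
        "PATCH_0002": {"patchstate": "Applied"},
    }


cache = RegionOnePatchCache(fake_fetch)
results = []


def worker():
    results.append(cache.get_patches(state="Committed"))


threads = [threading.Thread(target=worker) for _ in range(50)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print("remote queries:", len(calls))
print("workers served:", len(results))
```

With this shape, the number of requests reaching RegionOne during the step no longer grows with the number of subclouds upgraded in parallel. The sketch never refreshes the cached list, which assumes the RegionOne patches do not change while the strategy is being applied.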

Changed in starlingx:
status: In Progress → Fix Released