Orchestrated subcloud upgrade failed due to connection error at patching_finish step
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | Gabriel Silva Trevisan |
Bug Description
Brief Description
-----------------
While testing a parallel subcloud upgrade, it was noticed that a few subclouds failed during the finishing_patch step.
The issue seems to have been caused by a connection error while trying to retrieve the committed patches from the RegionOne patching API.
Severity
--------
Major
Steps to Reproduce
------------------
Prepare a DC system for orchestrated subcloud upgrade
Create and apply an upgrade strategy for a large number of simplex subclouds in parallel
Expected Behavior
------------------
The upgrade strategy completes successfully for all subclouds
Actual Behavior
----------------
The upgrade failed for some of the subclouds at the finishing_patch step, while retrieving the list of committed patches from the system controller. All failures occurred within the same second, which points to a possible intermittent issue affecting this batch of subclouds.
Reproducibility
---------------
Intermittent. Reproduced while upgrading a large number of subclouds in parallel. Occurred in one of the two trials executed, with 16% of the subclouds failing due to this error.
System Configuration
-------
Distributed Cloud
Branch/Pull Time/Commit
-------
2022-01-19 18:42:01
Last Pass
---------
Not sure if a similar upgrade test with a large number of subclouds was previously performed for the stx/6.0 load.
Timestamp/Logs
--------------
Logs from /var/log/
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
2022-03-18 16:37:18.252 3441053 ERROR dcmanager.
No ansible-playbook logs were recorded for the failing subclouds, since the failure occurred during the initial patching step.
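To check whether failures cluster in time, as observed here, the ERROR lines can be counted per second. This is a minimal sketch that assumes only the line prefix visible in the excerpt above (date, time with milliseconds, process ID, level, logger name); `errors_per_second` is a hypothetical helper, and it counts log lines rather than distinct failures, so a multi-line traceback contributes several entries.

```python
import re
from collections import Counter

# Prefix of the log lines shown above:
# "<date> <time>.<millis> <pid> <LEVEL> <logger>..."
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \d+ (\w+) "
)


def errors_per_second(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH:MM:SS' to ERROR line count."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "ERROR":
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2022-03-18 16:37:18.252 3441053 ERROR dcmanager.",
        "2022-03-18 16:37:18.252 3441053 ERROR dcmanager.",
    ]
    for second, count in sorted(errors_per_second(sample).items()):
        print(second, count)
```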
Test Activity
-------------
Developer Testing. Seen while verifying a fix for https:/
Workaround
----------
Delete the failed strategy, create a new one, and reapply it. This worked on the first attempt for the failure reported here.
Changed in starlingx:
assignee: nobody → Gabriel Silva Trevisan (g-trevisan)
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/837120