DC orchestration sometimes re-enters the state that has completed/failed
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
StarlingX | Fix Released | Medium | Jessica Castelino | | |
Bug Description
Brief Description
-----------------
Sometimes DC orchestration can re-enter the same state after that state has either failed or completed. This issue can only be observed when orchestration is performed on a large number of subclouds. As the problematic code appears to be in orch_thread.py, it affects all types of DC orchestration.
Severity
--------
Major
Steps to Reproduce
------------------
Perform batch load import
Expected Behavior
------------------
Each orchestration state in the pipeline is visited at most once.
Actual Behavior
----------------
A few subclouds timed out in the importing load state and threw an exception. However, new worker threads were created to repeat the importing load state for these subclouds. Similarly, one subcloud completed the importing load state, but a new worker thread was created to repeat the importing load state for that subcloud, which ultimately failed.
From inspecting the code, there are 2 issues:
1) The orch thread that wakes up periodically to apply/check/move the state along for each subcloud can get a strategy_steps snapshot
strategy_steps = db_api.
that is not the most up-to-date by the time the following logic is executed
for strategy_step in strategy_steps:
if strategy_step.stage == current_stage:
if self.stopped():
if strategy_step.state == {color}
So even though the subcloud worker thread has thrown an exception, invoked strategy_
def process_
"""manage the green thread for calling perform_
if region in self.subcloud_
# A worker already exists. Let it finish whatever it was doing.
if log_error:
else:
else:
# Create a greenthread to start processing the update for the
# subcloud and invoke the perform_
2) The subcloud_workers dictionary is not accessed in a thread-safe fashion, which can lead to a race condition.
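
For illustration only, the two issues above could be addressed along the following lines. This is a minimal sketch, not the code from the proposed fix: it uses the standard threading module rather than green threads, and WorkerRegistry, fetch_state, perform and TERMINAL_STATES are hypothetical names that do not come from orch_thread.py. The idea is to guard the worker dictionary with a lock and to re-read the step state under that lock before creating a worker, instead of trusting a snapshot taken earlier.

```python
import threading

# Hypothetical terminal state names; the real DC state constants are not
# shown in this report.
TERMINAL_STATES = {"complete", "failed"}


class WorkerRegistry:
    """Keeps at most one worker per region, guarded by a lock (illustrative)."""

    def __init__(self, fetch_state, perform):
        self._lock = threading.Lock()
        self._workers = {}                # region -> threading.Thread
        self._fetch_state = fetch_state   # assumed callable: region -> fresh state
        self._perform = perform           # assumed callable: region -> None

    def process(self, region):
        """Start a worker for region; return True only if one was started."""
        with self._lock:
            if region in self._workers:
                # A worker already exists. Let it finish whatever it was doing.
                return False
            # Re-read the state now rather than relying on an older snapshot,
            # so a step that has already completed or failed is not re-entered.
            if self._fetch_state(region) in TERMINAL_STATES:
                return False
            worker = threading.Thread(target=self._run, args=(region,))
            self._workers[region] = worker
        worker.start()
        return True

    def _run(self, region):
        try:
            self._perform(region)
        finally:
            # Remove the worker under the same lock that process() takes.
            with self._lock:
                self._workers.pop(region, None)
```

In this sketch, perform is expected to record the terminal state before it returns, so that by the time the worker entry is removed a fresh read already reports the step as completed or failed.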
Reproducibility
---------------
100% reproducible
System Configuration
-------
Distributed Cloud
Branch/Pull Time/Commit
-------
BUILD_DATE=
Last Pass
---------
N/A
Timestamp/Logs
--------------
N/A
Test Activity
-------------
Developer Testing
Workaround
----------
N/A
Changed in starlingx:
assignee: nobody → Jessica Castelino (jcasteli)
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/820903