DC orchestration sometimes re-enters a state that has completed/failed

Bug #1953519 reported by Jessica Castelino

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jessica Castelino

Bug Description

Brief Description
-----------------
Sometimes DC orchestration can re-enter the same state after it has either failed or completed. This issue is only observed when orchestration is performed across a large number of subclouds. As the problematic code appears to be in orch_thread.py, it affects all types of DC orchestration.

Severity
--------
Major

Steps to Reproduce
------------------
Perform a batch load import across a large number of subclouds.

Expected Behavior
------------------
Each orchestration state in the pipeline is visited at most once.

Actual Behavior
----------------
A few subclouds timed out in the importing load state and threw an exception; however, new worker threads were created to repeat the importing load state for these subclouds. Similarly, a subcloud completed the importing load state, but a new worker thread was created to repeat the importing load for that subcloud, which ultimately failed.

From inspecting the code, there are two issues:

1) The orch thread that wakes up periodically to apply/check/move the state along for each subcloud can take a strategy_steps snapshot

strategy_steps = db_api.strategy_step_get_all(self.context)

that is no longer up to date by the time the following logic executes:

        for strategy_step in strategy_steps:
            if strategy_step.stage == current_stage:
                region = self.get_region_name(strategy_step)
                if self.stopped():
                    LOG.info("(%s) Exiting because task is stopped"
                             % self.update_type)
                    return
                if strategy_step.state == \
                        consts.STRATEGY_STATE_FAILED:
                    LOG.info("(%s) Intermediate step is failed for %s"
                             % (self.update_type, strategy_step.subcloud.name))
                    continue
                elif strategy_step.state == \
                        consts.STRATEGY_STATE_COMPLETE:
                    LOG.info("(%s) Intermediate step is complete for %s"
                             % (self.update_type, strategy_step.subcloud.name))
                    continue
                elif strategy_step.state == \
                        consts.STRATEGY_STATE_INITIAL:
                    # Don't start upgrading this subcloud if it has been
                    # unmanaged by the user. If orchestration was already
                    # started, it will be allowed to complete.
                    if strategy_step.subcloud_id is not None and \
                            strategy_step.subcloud.management_state == \
                            consts.MANAGEMENT_UNMANAGED:
                        message = ("Subcloud %s is unmanaged." %
                                   strategy_step.subcloud.name)
                        LOG.warn(message)
                        self.strategy_step_update(
                            strategy_step.subcloud_id,
                            state=consts.STRATEGY_STATE_FAILED,
                            details=message)
                        continue
                    # We are just getting started, enter the first state
                    # Use the updated value for calling process_update_step
                    strategy_step = self.strategy_step_update(
                        strategy_step.subcloud_id,
                        state=self.starting_state)
                    # Starting state should log an error if greenthread exists
                    self.process_update_step(region,
                                             strategy_step,
                                             log_error=True)
                else:
                    self.process_update_step(region,
                                             strategy_step,
                                             log_error=False)

So even though the subcloud worker thread has thrown an exception, invoked strategy_step_update(), and finally been removed from the subcloud_workers dictionary, the orchestration thread still calls process_update_step() again for that subcloud. Since the worker thread no longer exists in the dictionary, a new one is created.
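
To make the stale-snapshot hazard concrete, here is a minimal standalone Python sketch (not distcloud code; the mock db dictionary, db_lock, and worker_fails are invented for illustration). The loop iterates over an outdated snapshot, so without a fresh read it would dispatch a new worker for a subcloud whose step has already reached a terminal state:

    import threading

    db = {"subcloud1": "importing load"}   # mock persisted step state
    db_lock = threading.Lock()

    def worker_fails():
        # The worker thread records a terminal state, much as the real
        # worker does via strategy_step_update().
        with db_lock:
            db["subcloud1"] = "failed"

    snapshot = dict(db)                    # orch thread takes its snapshot
    t = threading.Thread(target=worker_fails)
    t.start()
    t.join()                               # worker finishes after the snapshot

    for region in snapshot:
        # BUG: the snapshot still says "importing load", so the loop would
        # create a new worker for a subcloud that has already failed.
        # FIX: re-read the persisted state right before dispatching.
        with db_lock:
            current = db[region]
        if current in ("failed", "complete"):
            continue                       # terminal; do not start a worker
        print("dispatching worker for", region)

Run as-is, the re-read sees "failed" and nothing is dispatched; deciding from the snapshot value instead reproduces the duplicate worker described above.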

    def process_update_step(self, region, strategy_step, log_error=False):
        """manage the green thread for calling perform_state_action"""
        if region in self.subcloud_workers:
            # A worker already exists. Let it finish whatever it was doing.
            if log_error:
                LOG.error("(%s) Worker should not exist for %s."
                          % (self.update_type, region))
            else:
                LOG.debug("(%s) Update worker exists for %s."
                          % (self.update_type, region))
        else:
            # Create a greenthread to start processing the update for the
            # subcloud and invoke the perform_state_action method
            self.subcloud_workers[region] = \
                self.thread_group_manager.start(self.perform_state_action,
                                                strategy_step)

2) The subcloud_workers dictionary is not accessed in a thread-safe fashion, which can lead to a race condition.
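
A minimal sketch of one way to close that race, serializing every read and write of the dictionary behind a lock (threading.Lock, start_worker, remove_worker, and spawn_fn are illustrative assumptions, not the actual upstream fix; under eventlet the equivalent green-thread primitive would be used instead):

    import threading

    subcloud_workers = {}
    workers_lock = threading.Lock()

    def start_worker(region, spawn_fn):
        """Create a worker for region only if none is registered, atomically."""
        with workers_lock:
            if region in subcloud_workers:
                # A worker already exists; let it finish whatever it was doing.
                return subcloud_workers[region]
            worker = spawn_fn()            # e.g. start a greenthread
            subcloud_workers[region] = worker
            return worker

    def remove_worker(region):
        """Finished workers must deregister under the same lock."""
        with workers_lock:
            subcloud_workers.pop(region, None)

Holding one lock for both the membership check and the insert makes check-then-create atomic, so two threads can never each conclude that no worker exists and both spawn one.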

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2021-11-26 00:08:34 -0500"

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
N/A

OpenStack Infra (hudson-openstack) wrote: Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/820903

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Jessica Castelino (jcasteli)
Ghada Khalil (gkhalil) wrote (last edit):

screening: stx.7.0 / medium - appears to be a DC scalability issue. Therefore, won't hold up stx.6.0 at this stage.

tags: added: stx.distcloud
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0
OpenStack Infra (hudson-openstack) wrote: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/820903
Committed: https://opendev.org/starlingx/distcloud/commit/6e204a8e055e52a583152f8c70f35519468c395a
Submitter: "Zuul (22348)"
Branch: master

commit 6e204a8e055e52a583152f8c70f35519468c395a
Author: Jessica Castelino <email address hidden>
Date: Tue Dec 7 05:23:32 2021 -0500

    Fix: DC orch re-enters completed/failed state sometimes

    Sometimes DC orchestration can re-enter the same state
    after it has either failed or completed. This issue can
    only be observed when a large number of subcloud
    orchestration is performed. As the problematic code
    appears to be in orch_thread.py, it affects all types
    of DC orchestration.

    This commit fixes the issue described above.

    Test:
    Successfully completed load import (as part of duplex
    subcloud orchestration) with a large number of subclouds
    without re-entering the same state.

    Change-Id: I57802a07009ff50d300146869efa3ceb4c9a2749
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1953519

Changed in starlingx:
status: In Progress → Fix Released