DC orchestration sometimes re-enters a state that has completed/failed

Bug #1953519 reported by Jessica Castelino

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jessica Castelino

Bug Description

Brief Description
-----------------
Sometimes DC orchestration can re-enter the same state after it has either failed or completed. This issue is only observed when orchestration is performed across a large number of subclouds. As the problematic code appears to be in orch_thread.py, it affects all types of DC orchestration.

Severity
--------
Major

Steps to Reproduce
------------------
Perform a batch load import across a large number of subclouds.

Expected Behavior
------------------
Each orchestration state in the pipeline is visited at most once.

Actual Behavior
----------------
A few subclouds timed out in the importing load state and threw an exception; however, new worker threads were created to repeat the importing load state for these subclouds. Similarly, a subcloud completed the importing load state, but a new worker thread was created to repeat the importing load for that subcloud, which ultimately failed.

From inspecting the code, there are two issues:

1) The orch thread that wakes up periodically to apply/check/move the state along for each subcloud can take a strategy_steps snapshot

strategy_steps = db_api.strategy_step_get_all(self.context)

that is no longer up to date by the time the following logic executes:

        for strategy_step in strategy_steps:
            if strategy_step.stage == current_stage:
                region = self.get_region_name(strategy_step)
                if self.stopped():
                    LOG.info("(%s) Exiting because task is stopped"
                             % self.update_type)
                    return
                if strategy_step.state == \
                        consts.STRATEGY_STATE_FAILED:
                    LOG.info("(%s) Intermediate step is failed for %s"
                             % (self.update_type, strategy_step.subcloud.name))
                    continue
                elif strategy_step.state == \
                        consts.STRATEGY_STATE_COMPLETE:
                    LOG.info("(%s) Intermediate step is complete for %s"
                             % (self.update_type, strategy_step.subcloud.name))
                    continue
                elif strategy_step.state == \
                        consts.STRATEGY_STATE_INITIAL:
                    # Don't start upgrading this subcloud if it has been
                    # unmanaged by the user. If orchestration was already
                    # started, it will be allowed to complete.
                    if strategy_step.subcloud_id is not None and \
                            strategy_step.subcloud.management_state == \
                            consts.MANAGEMENT_UNMANAGED:
                        message = ("Subcloud %s is unmanaged." %
                                   strategy_step.subcloud.name)
                        LOG.warn(message)
                        self.strategy_step_update(
                            strategy_step.subcloud_id,
                            state=consts.STRATEGY_STATE_FAILED,
                            details=message)
                        continue
                    # We are just getting started, enter the first state
                    # Use the updated value for calling process_update_step
                    strategy_step = self.strategy_step_update(
                        strategy_step.subcloud_id,
                        state=self.starting_state)
                    # Starting state should log an error if greenthread exists
                    self.process_update_step(region,
                                             strategy_step,
                                             log_error=True)
                else:
                    self.process_update_step(region,
                                             strategy_step,
                                             log_error=False)

So even though the subcloud worker thread has thrown an exception, invoked strategy_step_update(), and finally been removed from the subcloud_workers dictionary, the orchestration thread still calls process_update_step() again for that subcloud. Since the worker thread no longer exists in the dictionary, a new one is created.
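
To make the stale-snapshot hazard concrete, here is a minimal standalone Python sketch (not distcloud code; the mock db dictionary, db_lock, and worker_fails are invented for illustration). The loop iterates over an outdated snapshot, so without a fresh read it would dispatch a new worker for a subcloud whose step has already reached a terminal state:

    import threading

    db = {"subcloud1": "importing load"}   # mock persisted step state
    db_lock = threading.Lock()

    def worker_fails():
        # The worker thread records a terminal state, much as the real
        # worker does via strategy_step_update().
        with db_lock:
            db["subcloud1"] = "failed"

    snapshot = dict(db)                    # orch thread takes its snapshot
    t = threading.Thread(target=worker_fails)
    t.start()
    t.join()                               # worker finishes after the snapshot

    for region in snapshot:
        # BUG: the snapshot still says "importing load", so the loop would
        # create a new worker for a subcloud that has already failed.
        # FIX: re-read the persisted state right before dispatching.
        with db_lock:
            current = db[region]
        if current in ("failed", "complete"):
            continue                       # terminal; do not start a worker
        print("dispatching worker for", region)

Run as-is, the re-read sees "failed" and nothing is dispatched; deciding from the snapshot value instead reproduces the duplicate worker described above.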

    def process_update_step(self, region, strategy_step, log_error=False):
        """manage the green thread for calling perform_state_action"""
        if region in self.subcloud_workers:
            # A worker already exists. Let it finish whatever it was doing.
            if log_error:
                LOG.error("(%s) Worker should not exist for %s."
                          % (self.update_type, region))
            else:
                LOG.debug("(%s) Update worker exists for %s."
                          % (self.update_type, region))
        else:
            # Create a greenthread to start processing the update for the
            # subcloud and invoke the perform_state_action method
            self.subcloud_workers[region] = \
                self.thread_group_manager.start(self.perform_state_action,
                                                strategy_step)

2) The subcloud_workers dictionary is not accessed in a thread-safe fashion, which can lead to a race condition.
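
A minimal sketch of one way to close that race, serializing every read and write of the dictionary behind a lock (threading.Lock, start_worker, remove_worker, and spawn_fn are illustrative assumptions, not the actual upstream fix; under eventlet the equivalent green-thread primitive would be used instead):

    import threading

    subcloud_workers = {}
    workers_lock = threading.Lock()

    def start_worker(region, spawn_fn):
        """Create a worker for region only if none is registered, atomically."""
        with workers_lock:
            if region in subcloud_workers:
                # A worker already exists; let it finish whatever it was doing.
                return subcloud_workers[region]
            worker = spawn_fn()            # e.g. start a greenthread
            subcloud_workers[region] = worker
            return worker

    def remove_worker(region):
        """Finished workers must deregister under the same lock."""
        with workers_lock:
            subcloud_workers.pop(region, None)

Holding one lock for both the membership check and the insert makes check-then-create atomic, so two threads can never each conclude that no worker exists and both spawn one.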

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2021-11-26 00:08:34 -0500"

Last Pass
---------
N/A

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Developer Testing

Workaround
----------
N/A

OpenStack Infra (hudson-openstack) wrote: Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/820903

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Jessica Castelino (jcasteli)
Ghada Khalil (gkhalil) wrote (last edit):

screening: stx.7.0 / medium - appears to be a DC scalability issue. Therefore, won't hold up stx.6.0 at this stage.

tags: added: stx.distcloud
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0
OpenStack Infra (hudson-openstack) wrote: Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/820903
Committed: https://opendev.org/starlingx/distcloud/commit/6e204a8e055e52a583152f8c70f35519468c395a
Submitter: "Zuul (22348)"
Branch: master

commit 6e204a8e055e52a583152f8c70f35519468c395a
Author: Jessica Castelino <email address hidden>
Date: Tue Dec 7 05:23:32 2021 -0500

    Fix: DC orch re-enters completed/failed state sometimes

    Sometimes DC orchestration can re-enter the same state
    after it has either failed or completed. This issue can
    only be observed when a large number of subcloud
    orchestration is performed. As the problematic code
    appears to be in orch_thread.py, it affects all types
    of DC orchestration.

    This commit fixes the issue described above.

    Test:
    Successfully completed load import (as part of duplex
    subcloud orchestration) with a large number of subclouds
    without re-entering the same state.

    Change-Id: I57802a07009ff50d300146869efa3ceb4c9a2749
    Signed-off-by: Jessica Castelino <email address hidden>
    Closes-Bug: 1953519

Changed in starlingx:
status: In Progress → Fix Released