Patch Orchestration failure due to mishandling of ready-to-apply state

Bug #1979237 reported by Yuxing
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tee Ngo

Bug Description

Brief Description
-----------------
NR patch orchestration failed because of an unexpected state: ready-to-apply.

[sysadmin@controller-0 ~(keystone_admin)]$ dcmanager strategy-step list | grep failed
-----------------
subcloudxxxx 2 failed Strategy apply failed for subcloudxxxx - unexpected state ready-to-apply 2022-06-14 03:12:45.342907 2022-06-14 03:17:28.563280

Severity
-----------------
Major: Unable to patch many subclouds

Steps to Reproduce
-----------------
Create and apply a patch strategy for a large number of subclouds in parallel

Expected Behavior
-----------------
Patch strategy completes successfully

Actual Behavior
-----------------
Patch strategy apply fails with "unexpected state ready-to-apply" for one or more subclouds

Reproducibility
-----------------
Reproducible

System Configuration
-----------------
DC (Distributed Cloud)

Load info
-----------------
Jun 7th stx 7.0 load

Last Pass
-----------------
Unknown

Timestamp/Logs
-----------------
na

Alarms
-----------------
na

Test Activity
-----------------
System Testing

Workaround
-----------------
na

Tee Ngo (teewrs)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/846798

Changed in starlingx:
status: New → In Progress
Yuxing (yuxing)
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/846798
Committed: https://opendev.org/starlingx/distcloud/commit/b393a8be9c41bbbdd4251f764eaf35c674155192
Submitter: "Zuul (22348)"
Branch: master

commit b393a8be9c41bbbdd4251f764eaf35c674155192
Author: Tee Ngo <email address hidden>
Date: Mon Jun 20 15:00:33 2022 -0400

    Fix race conditions in DC patch orchestration

    In the current design, the orch thread retrieves a snapshot of the
    strategy steps from the database, processes them serially and
    creates a worker thread to perform the operation dictated by each
    strategy step state. Each time a worker thread is created, it is
    added to the subcloud_workers dictionary. The worker thread, running
    independently, removes itself from the subcloud_workers dictionary
    after it has completed or failed the operation. The orch thread
    sleeps for 10 seconds between its periodic scans.

    It is possible for a subcloud worker thread to complete its state
    operation, set the next state in the database and remove itself
    from the dictionary before the orch thread has processed the
    strategy step of that subcloud in the current scan. When the orch
    thread then processes the strategy step for this subcloud, the
    worker is no longer in the dictionary, so the orch thread creates a
    new worker to perform the operation for a state that has already
    been completed.

    This race condition is likely to occur when max_parallel_subclouds
    is large, resulting in orchestration failures. The solution is to
    leave the dictionary updates to the orch thread. The worker thread
    is only responsible for performing the state operation and updating
    the strategy step state based on the operation result.

    Test Plan:
      - Verify successful patch orchestration of an in-service
        patch.
      - Verify successful patch orchestration of a
        reboot-required patch.

    Closes-Bug: 1979237
    Change-Id: I18e682b6877b5c2e39aaadd0f05399b4c5fa99c9
    Signed-off-by: Tee Ngo <email address hidden>
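
The race and the fix described in the commit message above can be
illustrated with a minimal Python sketch. The names below (Step,
OrchThread, scan, the state list) are simplified assumptions for
illustration only and do not mirror the actual distcloud code.

import threading


class Step(object):
    """Stand-in for a database-backed strategy step of one subcloud."""

    STATES = ["ready-to-apply", "applying", "finishing", "complete"]

    def __init__(self, subcloud):
        self.subcloud = subcloud
        self.state = self.STATES[0]

    def advance(self):
        index = self.STATES.index(self.state)
        self.state = self.STATES[min(index + 1, len(self.STATES) - 1)]


class OrchThread(object):
    def __init__(self, fixed=True):
        self.fixed = fixed              # False models the original design
        self.subcloud_workers = {}      # subcloud name -> worker thread
        self.lock = threading.Lock()

    def _do_state_operation(self, step):
        # Placeholder for the real per-state work (apply patch, etc.).
        pass

    def _worker(self, step):
        self._do_state_operation(step)
        step.advance()                  # record the next state
        if not self.fixed:
            # Original design: the worker removes itself from the
            # dictionary.  If the orch thread is still walking a snapshot
            # taken before this point, it will find no worker for the
            # subcloud and spawn a second one for a state that has
            # already been completed ("unexpected state ready-to-apply").
            with self.lock:
                self.subcloud_workers.pop(step.subcloud, None)

    def scan(self, steps_snapshot):
        # One periodic pass over a snapshot of the strategy steps.
        for step in steps_snapshot:
            with self.lock:
                worker = self.subcloud_workers.get(step.subcloud)
                if self.fixed and worker is not None and not worker.is_alive():
                    # Fixed design: only the orch thread updates the
                    # dictionary, so a stale snapshot can no longer
                    # trigger a duplicate worker.
                    del self.subcloud_workers[step.subcloud]
                    worker = None
                if worker is None and step.state != "complete":
                    worker = threading.Thread(target=self._worker,
                                              args=(step,))
                    self.subcloud_workers[step.subcloud] = worker
                    worker.start()

With this arrangement only the orch thread mutates subcloud_workers, and a
new worker for a subcloud is launched only after the previous one has been
observed to finish, so a stale snapshot can no longer re-run a state that
was already handled (the "unexpected state ready-to-apply" failure above).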

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Tee Ngo (teewrs)
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud