Patch orchestration stuck after service restart

Bug #1979097 reported by Yuxing
This bug affects 1 person
Affects    Status        Importance  Assigned to  Milestone
StarlingX  Fix Released  Low         Yuxing

Bug Description

Brief Description
-----------------
    During a patch orchestration, if the dcmanager-orchestrator service
    restarts (for example, unintentionally because of a swact or a
    reboot), the ongoing patch strategy on the subclouds becomes
    unhealthy: progress stalls because the green threads are gone and
    the cache of RegionOne patches is lost.

Severity
--------
Minor

Steps to Reproduce
------------------
1. Create and apply a patch orchestration strategy.
2. Manually restart the dcmanager-orchestrator service.

Expected Behavior
------------------
The patch strategy should remain in a stable state.

Actual Behavior
----------------
The patch strategy is stuck in the applying state and cannot be deleted.

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DC

Branch/Pull Time/Commit
-----------------------
Jun 16th load

Last Pass
---------
na

Timestamp/Logs
--------------
na

Test Activity
-------------
Developer Testing

Workaround
----------
Edit the DB to mark the patch strategy as failed.

Yuxing (yuxing)
Changed in starlingx:
assignee: nobody → Yuxing (yuxing)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/846447

Changed in starlingx:
status: New → In Progress
Yuxing (yuxing)
summary: - Patch orchestration stuck after serivce restart
+ Patch orchestration stuck after service restart
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/847059

OpenStack Infra (hudson-openstack) wrote : Change abandoned on distcloud (master)

Change abandoned by "Yuxing Jiang <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/846447
Reason: Combine this change with https://review.opendev.org/c/starlingx/distcloud/+/847059

Ghada Khalil (gkhalil)
tags: added: stx.distcloud
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/847059
Committed: https://opendev.org/starlingx/distcloud/commit/29fe24acb313cf9ff3c434dd535e314800503e4f
Submitter: "Zuul (22348)"
Branch: master

commit 29fe24acb313cf9ff3c434dd535e314800503e4f
Author: Yuxing Jiang <email address hidden>
Date: Tue Jun 21 11:57:02 2022 -0400

    Enhance handling on-going patch strategy

    This commit includes:
    1. Rebuild the region one patch cache during the service reload.
    2. Ignore the patch in progress alarm as this could be a patch
    orchestration retry.
    3. Re-create workers if they are cleared after a service restart.
    4. Properly handle reboot-required patching in case both system
    controller and subcloud patch orchestration is done in one single
    strategy.
    With these improvements, the patch strategy can continue after a
    service reload.

    Test plan(passed):
    1. Verify successful patch orchestration of an RR patch when both
    system controller and subclouds are patched in the same strategy.
    2. Induce a 300.005 alarm (mgmt-affecting) in a subcloud, verify
    that orchestrated patching fails for that subcloud.
    3. Induce a 900.001 alarm by partially applying a patch in a subcloud
    beforehand, verify that orchestrated patching completes for that
    subcloud.
    4. Induce process restart in the middle of a subcloud patch
    orchestration, verify that transitional strategy steps are set to
    failed and the subclouds still in "initial" state can continue.
    5. Induce process restart in the middle of a system controller patch
    orchestration, verify that system controller patching can resume and
    complete.

    Closes-Bug: 1979097
    Signed-off-by: Yuxing Jiang <email address hidden>
    Change-Id: I1b70d14b77c3e1be6301f011baff297502b9108b
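
The recovery behaviour described in the commit message above can be illustrated with a minimal sketch. It is not the distcloud implementation: the class, function, and state names (Orchestrator, recover, blocking_alarms, "initial", "failed", "complete") are hypothetical, and workers are plain strings rather than green threads. Only the alarm IDs 300.005 and 900.001 come from the test plan. System controller resume (test plan item 5) is not modelled.

```python
from dataclasses import dataclass, field

# Hypothetical state names; the real dcmanager constants may differ.
STATE_INITIAL = "initial"
STATE_FAILED = "failed"
STATE_COMPLETE = "complete"
TERMINAL_STATES = {STATE_FAILED, STATE_COMPLETE}

# Alarm ID from the commit's test plan (patch in progress).
PATCH_IN_PROGRESS_ALARM = "900.001"


@dataclass
class Step:
    """A persisted strategy step for one subcloud (hypothetical shape)."""
    subcloud: str
    state: str


@dataclass
class Orchestrator:
    """Holds the in-memory state that is lost when the process restarts."""
    workers: dict = field(default_factory=dict)      # subcloud -> worker handle
    patch_cache: dict = field(default_factory=dict)  # patch id -> patch state

    def recover(self, persisted_steps, load_region_one_patches):
        """Rebuild in-memory state from the persisted steps after a restart."""
        # Rebuild the RegionOne patch cache instead of assuming it exists.
        self.patch_cache = dict(load_region_one_patches())
        for step in persisted_steps:
            if step.state in TERMINAL_STATES:
                continue
            if step.state != STATE_INITIAL:
                # A step interrupted mid-transition is marked failed.
                step.state = STATE_FAILED
            elif step.subcloud not in self.workers:
                # Re-create the worker that the restart cleared.
                self.workers[step.subcloud] = f"worker-{step.subcloud}"
        return persisted_steps


def blocking_alarms(alarms):
    """Return management-affecting alarms, ignoring patch-in-progress."""
    return [
        a for a in alarms
        if a["mgmt_affecting"] and a["alarm_id"] != PATCH_IN_PROGRESS_ALARM
    ]
```

Under these assumptions, calling recover() once at service start-up leaves steps in the "initial" state with a fresh worker, fails any step that was in a transitional state, and leaves finished steps untouched. blocking_alarms() lets a retry proceed when the only outstanding alarm is the patch-in-progress one.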

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.7.0