Patch Orchestration hanging when failed to obtain health report

Bug #1976109 reported by Bo Yuan Chang
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Bo Yuan Chang

Bug Description

Brief Description

Patch orchestration hangs when the system fails to get the health report from the subclouds.

Severity

Critical: Unable to proceed with patch update orchestration

Steps to Reproduce

1. Apply patches on the system controller
2. Upgrade all subclouds with the patch
3. Delete the patch from the system controller
4. Run patch orchestration on all subclouds

Expected Behavior

Patch orchestration can be aborted and then deleted

Actual Behavior

Patch orchestration is stuck in the updating state and cannot be aborted

Reproducibility

Yes

System Configuration

Any Distributed Cloud (DC) setup

Branch/Pull Time/Commit

master

Changed in starlingx:
status: New → In Progress
Changed in starlingx:
assignee: nobody → Bo Yuan Chang (cby19961020)
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/840216
Committed: https://opendev.org/starlingx/distcloud/commit/116c119541b2b7ac85b475d5b0e1c7c4292ea2fa
Submitter: "Zuul (22348)"
Branch: master

commit 116c119541b2b7ac85b475d5b0e1c7c4292ea2fa
Author: BoYuan Chang <email address hidden>
Date: Mon May 2 11:18:41 2022 -0500

    Ensure one patching worker thread per subcloud

    Remove the worker thread creation in STRATEGY_STATE_UPDATING_PATCHES
    state since it's already done in STRATEGY_STATE_INITIAL state. This
    code flaw was revealed from testing patch orchestration of large
    max_parallel_subclouds size. The patch orch thread loops through the
    list of subclouds of the current step and process each one of them
    every 10s. When the batch size is large, the subcloud state
    retrieved at the beginning of the loop and stored in memory may be
    stale for a subcloud by the time it processes that subcloud.
    Try/Catch statement is also added to pre_check to prevent state stuck
    indefinitely when failed to obtain the health report from sysinv.

    Test Plan:

    1. Ensure all subclouds are free of mgmt affecting alarms. Execute patch
       orchestration of large max_parallel_subclouds size and verify that it
       works without any failure(s) due to "Patch in-progress" alarm.

    2. Check when the get system health report timed out the subcloud will
       fail rather than hanging indefinitely.

    Closes-Bug: 1971172
    Closes-Bug: 1976109
    Signed-off-by: BoYuan Chang <email address hidden>
    Change-Id: Ib70f00228ebd181d175c09a462b95077b0a8218b
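
The two changes described in the commit message above can be illustrated with a minimal sketch. This is not the distcloud code: the state strings, the `pre_check` signature, the `get_health_report` callable, and the `SubcloudWorkers` class are all hypothetical names chosen for illustration. It only shows the general idea of failing a subcloud when the health query raises, and of refusing to start a second worker thread for a subcloud that already has one.

```python
import threading

# Hypothetical state values; the real STRATEGY_STATE_* constants may differ.
STATE_UPDATING_PATCHES = "updating patches"
STATE_FAILED = "failed"


def pre_check(subcloud_name, get_health_report):
    """Return (next_state, details) for one subcloud.

    get_health_report is a hypothetical callable standing in for the
    sysinv health query. It may raise, for example on a timeout.
    """
    try:
        report = get_health_report(subcloud_name)
    except Exception as exc:
        # Move the step to a failed state instead of leaving it
        # in the updating state with no way to abort.
        details = "Failed to obtain health report for %s: %s" % (
            subcloud_name, exc)
        return STATE_FAILED, details
    return STATE_UPDATING_PATCHES, report


class SubcloudWorkers:
    """Tracks at most one live worker thread per subcloud (illustrative)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._workers = {}

    def start_once(self, subcloud_name, target):
        """Start a worker unless one is already running for this subcloud.

        Returns True if a new thread was started, False otherwise.
        """
        with self._lock:
            existing = self._workers.get(subcloud_name)
            if existing is not None and existing.is_alive():
                return False
            worker = threading.Thread(
                target=target, args=(subcloud_name,), daemon=True)
            self._workers[subcloud_name] = worker
            worker.start()
            return True
```

With this shape, a health query that times out yields `STATE_FAILED` for that subcloud, so the strategy can still be aborted or deleted. A second `start_once` call for the same subcloud returns `False` while the first worker is still alive, which avoids acting on subcloud state that went stale during a long loop over a large batch.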

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/845306

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/845306
Committed: https://opendev.org/starlingx/distcloud/commit/4f443a869ba73fa9c150b63c7e2f54fbe1a1914f
Submitter: "Zuul (22348)"
Branch: master

commit 4f443a869ba73fa9c150b63c7e2f54fbe1a1914f
Author: BoYuan Chang <email address hidden>
Date: Thu Jun 9 18:51:19 2022 -0400

    Fix log format flaw in previous commit

    Added the missing bracket inside the log call in
    https://review.opendev.org/c/starlingx/distcloud/+/840216

    Closes-Bug: 1971172
    Closes-Bug: 1976109
    Signed-off-by: BoYuan Chang <email address hidden>
    Change-Id: I2ec0db346a61a49a49fb85bdc6390b3bfd69a81f

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.distcloud