Patch orchestration failed due to mishandling of management affecting alarm check

Bug #1971172 reported by Bo Yuan Chang
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Bo Yuan Chang
Milestone: (none)

Bug Description

Brief Description
The patch orchestrator detects management-affecting alarms AFTER applying the patches. The alarm is legitimate: it is the "patching operation in progress" alarm.

Severity
Major

Steps to Reproduce
Apply a patch strategy to many subclouds.

Expected Behavior
Patch orchestration completes successfully on all subclouds.

Actual Behavior
Many subclouds failed to apply the patch, all for the same reason:

WARNING dcmanager.orchestrator.patch_orch_thread [-] Subcloud subcloudxxx contains one or more management affecting alarm(s). It will not be patched. Please resolve the alarm condition(s) and try again.

Reproducibility
Intermittent

System Configuration
22.02

Alarms
N/A

Test Activity
Developer Testing

Workaround
Manually finish the patch apply on the 30 failed subclouds, since orchestration cannot be retried for subclouds in this state.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/840216

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.distcloud
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/840216
Committed: https://opendev.org/starlingx/distcloud/commit/116c119541b2b7ac85b475d5b0e1c7c4292ea2fa
Submitter: "Zuul (22348)"
Branch: master

commit 116c119541b2b7ac85b475d5b0e1c7c4292ea2fa
Author: BoYuan Chang <email address hidden>
Date: Mon May 2 11:18:41 2022 -0500

    Ensure one patching worker thread per subcloud

    Remove the worker thread creation in STRATEGY_STATE_UPDATING_PATCHES
    state since it's already done in STRATEGY_STATE_INITIAL state. This
    code flaw was revealed from testing patch orchestration of large
    max_parallel_subclouds size. The patch orch thread loops through the
    list of subclouds of the current step and process each one of them
    every 10s. When the batch size is large, the subcloud state
    retrieved at the beginning of the loop and stored in memory may be
    stale for a subcloud by the time it processes that subcloud.
    Try/Catch statement is also added to pre_check to prevent state stuck
    indefinitely when failed to obtain the health report from sysinv.

    Test Plan:

    1. Ensure all subclouds are free of mgmt affecting alarms. Execute patch
       orchestration of large max_parallel_subclouds size and verify that it
       works without any failure(s) due to "Patch in-progress" alarm.

    2. Check when the get system health report timed out the subcloud will
       fail rather than hanging indefinitely.

    Closes-Bug: 1971172
    Closes-Bug: 1976109
    Signed-off-by: BoYuan Chang <email address hidden>
    Change-Id: Ib70f00228ebd181d175c09a462b95077b0a8218b
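
The commit message above attributes the failure to a second worker thread being created for a subcloud whose in-memory state was stale. The sketch below is a minimal illustration of the "one patching worker thread per subcloud" invariant, not the dcmanager code: the class `PatchOrchSketch`, its methods, and the registry-based guard are all hypothetical. The actual fix removed the second creation site in the STRATEGY_STATE_UPDATING_PATCHES state rather than adding a guard like this one.

```python
import threading


class PatchOrchSketch:
    """Hypothetical stand-in for an orchestrator loop; not dcmanager code."""

    def __init__(self):
        self._workers = {}             # subcloud name -> worker thread
        self._lock = threading.Lock()  # protects the registry
        self.applied = []              # records each patch application

    def _apply_patches(self, subcloud):
        # Placeholder for the real patching work on one subcloud.
        self.applied.append(subcloud)

    def ensure_worker(self, subcloud):
        """Start a worker for `subcloud` unless one was already created.

        Returns True if a new worker was started, False otherwise.
        """
        with self._lock:
            if subcloud in self._workers:
                return False
            worker = threading.Thread(
                target=self._apply_patches, args=(subcloud,))
            self._workers[subcloud] = worker
        worker.start()
        return True

    def wait(self):
        # Block until every registered worker has finished.
        for worker in list(self._workers.values()):
            worker.join()


orch = PatchOrchSketch()
first = orch.ensure_worker("subcloud1")   # e.g. from the initial state
second = orch.ensure_worker("subcloud1")  # e.g. a later pass over stale state
orch.wait()
```

In this sketch the second call is a no-op, so patches are applied once per subcloud and a later pass cannot race against a worker that is already raising the "patch in progress" alarm.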

Changed in starlingx:
status: In Progress → Fix Released
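
The same commit also wraps the pre-check in exception handling so that a failed health query fails the subcloud instead of leaving it stuck. A minimal sketch of that pattern follows, under stated assumptions: `pre_check`, the state constants, the `get_health_report` callable, and the `mgmt_affecting_alarms` report key are all hypothetical and do not reflect the real sysinv or dcmanager interfaces.

```python
import logging

LOG = logging.getLogger(__name__)

# Hypothetical state names, for illustration only.
STATE_FAILED = "failed"
STATE_UPDATING = "updating patches"


def pre_check(subcloud, get_health_report):
    """Return the next strategy state for `subcloud`.

    `get_health_report` is a hypothetical callable standing in for the
    health query; it may raise (for example on a timeout).
    """
    try:
        report = get_health_report(subcloud)
    except Exception:
        # Fail the step explicitly so the subcloud does not stay in the
        # pre-check state indefinitely.
        LOG.exception("Failed to obtain health report for %s", subcloud)
        return STATE_FAILED

    # The report layout is assumed: a dict with an alarm count.
    if report.get("mgmt_affecting_alarms", 0) > 0:
        LOG.warning("Subcloud %s has management-affecting alarms", subcloud)
        return STATE_FAILED
    return STATE_UPDATING


def timed_out(_subcloud):
    # Simulates a health query that times out.
    raise TimeoutError("health query timed out")


result_timeout = pre_check("subcloud1", timed_out)
result_alarm = pre_check("subcloud1", lambda _: {"mgmt_affecting_alarms": 1})
result_ok = pre_check("subcloud1", lambda _: {"mgmt_affecting_alarms": 0})
```

Here a timeout and an active alarm both move the subcloud to the failed state, matching item 2 of the test plan; only a clean report lets patching proceed.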
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/845306

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/845306
Committed: https://opendev.org/starlingx/distcloud/commit/4f443a869ba73fa9c150b63c7e2f54fbe1a1914f
Submitter: "Zuul (22348)"
Branch: master

commit 4f443a869ba73fa9c150b63c7e2f54fbe1a1914f
Author: BoYuan Chang <email address hidden>
Date: Thu Jun 9 18:51:19 2022 -0400

    Fix log format flaw in previous commit

    Added the missing bracket inside the log call in
    https://review.opendev.org/c/starlingx/distcloud/+/840216

    Closes-Bug: 1971172
    Closes-Bug: 1976109
    Signed-off-by: BoYuan Chang <email address hidden>
    Change-Id: I2ec0db346a61a49a49fb85bdc6390b3bfd69a81f

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0
Changed in starlingx:
assignee: nobody → Bo Yuan Chang (cby19961020)