Patching orchestration fails due to alarm not cleared

Bug #1999447 reported by Jessica Castelino
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jessica Castelino

Bug Description

Brief Description
-----------------
On an AIO-DX+ system, patching orchestration works for controller-0, but when patching controller-1 it gets stuck waiting for the host to become "patch current", and the orchestration does not proceed to the other hosts.

Even after manually running "sudo sw-patch host-install controller-1" and getting a success response, the controller-1 state did not change and remained "Patch Current=NO".

Severity
-----------------
<Critical: System/Feature is not usable after the defect>

Steps to Reproduce
-----------------
Install an AIO-DX+ system (AIO + 10 worker nodes)

Configure a patching orchestration strategy via the command line to patch the entire system using an RR patch.

sw-manager patch-strategy create --controller-apply-type serial --storage-apply-type parallel --worker-apply-type parallel --alarm-restrictions relaxed --max-parallel-worker-hosts 10

Apply the strategy

sw-manager patch-strategy apply

Monitor the patching orchestration procedure via /var/log/patching.log and sudo sw-patch query-hosts
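
The monitoring step can be partly scripted. The following is a minimal sketch, not part of the original report: the helper name hosts_not_patch_current is hypothetical, and it assumes that "sw-patch query-hosts" prints one whitespace-separated row per host with the hostname in the first column, the IP address in the second, and the "Patch Current" value (Yes/No) in the third. If the column layout differs on your load, adjust the field numbers.

```shell
#!/bin/sh
# Hypothetical helper: list hosts that are not yet "patch current".
# Reads `sw-patch query-hosts`-style output on stdin.
# ASSUMPTION: columns are whitespace-separated and ordered as
#   hostname, IP address, patch-current (Yes/No), ...
# Header and separator rows are skipped because their third field
# is neither "Yes" nor "No".
hosts_not_patch_current() {
    awk 'tolower($3) == "no" { print $1 }'
}

# Intended use on a controller (not executed here):
#   sudo sw-patch query-hosts | hosts_not_patch_current
```

An empty result would suggest that every listed host reports "Patch Current" as Yes; a host that stays in the list after a successful host-install matches the symptom described above.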

Expected Behavior
-----------------
All nodes are patched successfully

Actual Behavior
-----------------
Controller-1 is patched successfully

Controller-0 is not patched and orchestration is stuck after the sw-patch host-install command is sent to controller-0

Compute nodes are not patched

Eventually the patching strategy times out and fails

Reproducibility
-----------------
Intermittent with high frequency

It happened in 2 out of 3 attempts

System Configuration
-----------------
AIO-DX+

Load info (eg: 2022-03-10_20-00-07)
-----------------
[sysadmin@controller-0 ~(keystone_admin)]$ cat /etc/build.info
2022-11-28_18-00-09

Last Pass
-----------------
Not sure, but this scenario passed during system test execution for the 22.06 release.

Timestamp/Logs
-----------------
NA

Alarms
-----------------
NA

Test Activity
-----------------
System Test

Workaround
-----------------
NA

Changed in starlingx:
assignee: nobody → Jessica Castelino (jcasteli)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to update (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/update/+/867311

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to update (master)

Reviewed: https://review.opendev.org/c/starlingx/update/+/867311
Committed: https://opendev.org/starlingx/update/commit/83fb74ad74f2e6ef2e69df42bedbf7452cd3b9f2
Submitter: "Zuul (22348)"
Branch: master

commit 83fb74ad74f2e6ef2e69df42bedbf7452cd3b9f2
Author: Jessica Castelino <email address hidden>
Date: Mon Dec 12 21:20:45 2022 +0000

    Make /etc readable and writable after re-mount

    When we install an in-service patch, we remount the etc and usr
    directories. Currently, these remounts make /usr and /etc
    readable. We want /etc to be a rw mount and /usr to be a readable
    mount and this commit fixes the issue.

    Test Plan:
    [PASS] Install an in-service patch and verify /etc is rw

    Closes-Bug: 1999447
    Signed-off-by: Jessica Castelino <email address hidden>
    Change-Id: I0325efec1f1a91112c359997800c3deb2af7eb3a
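
The test plan item "verify /etc is rw" can be checked from the mount table. The following is a minimal sketch, not taken from the commit: the helper name mount_mode is hypothetical, and it assumes that /etc and /usr appear as their own entries in /proc/mounts on the patched host. It reads lines in /proc/mounts format and reports whether the last entry for a given mount point is mounted rw or ro.

```shell
#!/bin/sh
# Hypothetical helper: report "rw" or "ro" for a mount point.
# $1    : mount point to look up (for example /etc)
# stdin : lines in /proc/mounts format
#         (device, mount point, fstype, options, dump, pass)
# The last matching entry wins, since later mounts shadow earlier ones.
# Exits non-zero if the mount point is not listed at all.
mount_mode() {
    awk -v mp="$1" '
        $2 == mp {
            mode = "unknown"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "rw" || opts[i] == "ro")
                    mode = opts[i]
            found = mode
        }
        END {
            if (found == "")
                exit 1
            print found
        }
    '
}

# Intended use on the patched host (not executed here):
#   mount_mode /etc < /proc/mounts    # "rw" is the state the commit aims for
#   mount_mode /usr < /proc/mounts
```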

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.update