sw-manager patch-strategy failed to install due to timeout

Bug #2059305 reported by John Kung
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Vanathi Selvaraju

Bug Description

Brief Description
-----------------
sw-manager patch-strategy failed to install due to timeout

Severity
--------

Major

Steps to Reproduce
------------------

sudo sw-patch upload <patch>
sudo sw-patch apply <patch>

sw-manager patch-strategy create
sw-manager patch-strategy apply

Expected Behavior
-----------------

"sw-manager patch-strategy apply" installs patch successfully

Actual Behavior
---------------

"sw-manager patch-strategy apply" fails with the following error:
[sysadmin@controller-1 ~(keystone_admin)]$ sw-manager patch-strategy show

Strategy Patch Strategy:
  strategy-uuid: 2082ab5e-a387-4b6a-be23-50ac23317725
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: serial
  default-instance-action: stop-start
  alarm-restrictions: strict
  current-phase: abort
  current-phase-completion: 100%
  state: aborted
  apply-result: timed-out
  apply-reason:
  abort-result: success
  abort-reason:

+----------+----------------------------------+------------------------+----------+----------------------------+
| Alarm ID | Reason Text                      | Entity ID              | Severity | Time Stamp                 |
+----------+----------------------------------+------------------------+----------+----------------------------+
| 900.103  | Software patch auto-apply failed | orchestration=sw-patch | critical | 2024-03-21T13:45:26.933951 |
| 900.001  | Patching operation in progress   | host=controller        | minor    | 2024-03-21T02:16:12.255206 |
+----------+----------------------------------+------------------------+----------+----------------------------+

The following alarm took 1817 seconds to clear:
| 2024-03-26T03:17:08. | clear | 750.006 | A configuration change requires a reapply of the platform-integ-apps application. | k8s_application=platform-integ-apps | warning |
| 2024-03-26T02:46:51.903492 | set | 750.006 | A configuration change requires a reapply of the platform-integ-apps application. | k8s_application=platform-integ-apps | warning |
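
The 1817-second figure can be checked against the set and clear timestamps of alarm 750.006. The sketch below is only an arithmetic check, not part of the fix. It assumes both timestamps are truncated to whole seconds, because the fractional part of the clear timestamp is cut off in the report.

```python
from datetime import datetime

# Timestamps of alarm 750.006 from the event log, truncated to whole
# seconds (assumption: the clear entry's fractional part is not shown).
set_time = datetime.fromisoformat("2024-03-26T02:46:51")
clear_time = datetime.fromisoformat("2024-03-26T03:17:08")

# Time the alarm stayed raised, in seconds.
elapsed_secs = int((clear_time - set_time).total_seconds())
print(elapsed_secs)  # 1817
```

That is 30 minutes and 17 seconds, slightly longer than the 30-minute stale-alarm window introduced by the fix.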

Reproducibility
---------------

Intermittent, reproducible in certain labs.

Load:
----
2022-12-19_02-22-00

System Configuration
--------------------
AIO-DX

Workaround:
----------
Remove strategy and reapply or apply manually via sw-patch.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/nfv/+/914559

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nfv (master)

Change abandoned by "Vanathi Selvaraju <email address hidden>" on branch: master
Review: https://review.opendev.org/c/starlingx/nfv/+/914559
Reason: Abandoning due to change ID mismatch. Will be raising a new review.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/914559
Committed: https://opendev.org/starlingx/nfv/commit/eca1a05b8310cfb2878a1921fe65236fae78ec5b
Submitter: "Zuul (22348)"
Branch: master

commit eca1a05b8310cfb2878a1921fe65236fae78ec5b
Author: Vanathi.Selvaraju <email address hidden>
Date: Wed Mar 27 15:53:55 2024 -0400

    sw-manager patch-strategy failed to install due to timeout

    As part of this fix, a new parameter, ignore_alarm_conditional,
    is added. It holds the list of stale alarms that are to be
    ignored after 30 minutes.
    The alarm clear wait step checks for stale alarm 750.006 for
    30 minutes. If the alarm has still not cleared, patch-strategy
    ignores it.
    Because stale alarms are now monitored for 30 minutes,
    the overall alarm clear timeout is increased to 2400 seconds.

    In the reported case, alarm 750.006 was not being cleared
    and was not part of the ignore alarm list,
    so the patch-strategy timed out.

    Test Plan:
    PASSED: Applying a patch - On DX system(VM),
    Create and apply patch strategy,
    fm alarm-list to have an uncleared alarm(for test purpose
    100.103 - Memory threshold alarm was used). After 30mins
    alarm was ignored and patch strategy successfully applied.
    PASSED: Removing a patch - On DX system(VM),
    Create and apply patch strategy,
    fm alarm-list to have an uncleared alarm(for test purpose
    100.103 - Memory threshold alarm was used). After 1800sec
    alarm was ignored and patch strategy successfully applied.
    PASSED: On DX system(lab), 4 consecutive patch orchestration
    successfully applied. 750.006 - stale alarm tested.
    PASSED: On DX system, create and apply strategy,
    with alarm existing on system(not part of ignore list)
    strategy would wait for 2400sec before timing out.
    PASSED: On DX system, k8s upgrade from v1.21.8 to
    v1.22.5 successfully executed.

    Closes-Bug: 2059305
    Change-Id: I7ebaf5a24fa45a7e45f3af7e5ca588ce3ee06156
    Signed-off-by: Vanathi.Selvaraju <email address hidden>
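
The behavior described in the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual nfv code: the function names, the result strings and the way alarms are represented (plain alarm ID strings) are hypothetical. Only the ignore_alarm_conditional parameter name, the 30-minute window and the 2400-second timeout come from the commit message.

```python
# Values taken from the commit message.
STALE_ALARM_WINDOW_SECS = 1800   # conditional alarms are ignored after 30 minutes
ALARM_CLEAR_TIMEOUT_SECS = 2400  # overall alarm clear timeout

def blocking_alarms(active_alarms, ignore_alarms, ignore_alarm_conditional,
                    elapsed_secs):
    """Return the active alarm IDs that still block the strategy.

    Hypothetical helper: alarms on the ignore list never block; alarms on
    the conditional list stop blocking once the stale window has elapsed.
    """
    blocking = []
    for alarm_id in active_alarms:
        if alarm_id in ignore_alarms:
            continue
        if (alarm_id in ignore_alarm_conditional
                and elapsed_secs >= STALE_ALARM_WINDOW_SECS):
            continue
        blocking.append(alarm_id)
    return blocking

def alarm_clear_wait_result(active_alarms, ignore_alarms,
                            ignore_alarm_conditional, elapsed_secs):
    """Decide the outcome of one poll of the alarm clear wait step."""
    if not blocking_alarms(active_alarms, ignore_alarms,
                           ignore_alarm_conditional, elapsed_secs):
        return "success"
    if elapsed_secs >= ALARM_CLEAR_TIMEOUT_SECS:
        return "timed-out"
    return "wait"

# Stale alarm 750.006 blocks at first, then is ignored after 30 minutes.
print(alarm_clear_wait_result(["750.006"], set(), {"750.006"}, 600))
print(alarm_clear_wait_result(["750.006"], set(), {"750.006"}, 1800))
# An alarm on neither list blocks until the overall timeout.
print(alarm_clear_wait_result(["100.103"], set(), {"750.006"}, 2400))
```

Under these assumptions the three calls yield "wait", "success" and "timed-out" respectively, matching the test plan cases for a stale alarm and for an alarm that is not on the ignore list.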

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.10.0 stx.nfv
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Vanathi Selvaraju (vselvara)