sw-manager kube-upgrade-strategy orchestration failed due to wait-alarms-clear timeout

Bug #2003260 reported by Igor Pires Soares
Affects     Status         Importance   Assigned to         Milestone
StarlingX   Fix Released   Medium       Igor Pires Soares

Bug Description

Brief Description
-----------------
sw-manager kube-upgrade-strategy orchestration failed due to wait-alarms-clear timeout

Note: This is probably a timing-related, intermittent issue. I ran the orchestration for several rounds on several labs, and this is the only time I hit it.

Severity
-----------------
Major

Steps to Reproduce
-----------------
sw-manager kube-upgrade-strategy create --to-version v1.24.4
sw-manager kube-upgrade-strategy apply

Expected Behavior
-----------------
sw-manager kube-upgrade-strategy orchestration is applied successfully

Actual Behavior
-----------------
sw-manager kube-upgrade-strategy orchestration failed due to wait-alarms-clear timeout

Reproducibility
-----------------
Not sure

System Configuration
-----------------
multi-node

Timestamp/Logs
-----------------

[sysadmin@controller-1 ~(keystone_admin)]$ sw-manager kube-upgrade-strategy show

Strategy Kubernetes Upgrade Strategy:
  strategy-uuid: d38b6d46-410b-4a8c-98f1-33e95a6fb1f0
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: serial
  default-instance-action: stop-start
  alarm-restrictions: strict
  current-phase: abort
  current-phase-completion: 100%
  state: aborted
  apply-result: timed-out
  apply-reason:
  abort-result: success
  abort-reason:

[sysadmin@controller-1 ~(keystone_admin)]$ system kube-host-upgrade-list

+----+--------------+-------------+----------------+-----------------------+-----------------+------------------+
| id | hostname     | personality | target_version | control_plane_version | kubelet_version | status           |
+----+--------------+-------------+----------------+-----------------------+-----------------+------------------+
| 1  | controller-0 | controller  | v1.24.4        | v1.24.4               | v1.24.4         | upgraded-kubelet |
| 2  | compute-0    | worker      | v1.23.1        | N/A                   | v1.23.1         | None             |
| 3  | controller-1 | controller  | v1.24.4        | v1.24.4               | v1.24.4         | upgraded-kubelet |
+----+--------------+-------------+----------------+-----------------------+-----------------+------------------+

[sysadmin@controller-1 ~(keystone_admin)]$ sw-manager kube-upgrade-strategy show --details

...
            step-id: 3
            step-name: kube-host-upgrade-kubelet
            entity-type: hosts
            entity-names: ['controller-0']
            entity-uuids: ['519aaf06-bba4-4a99-bf60-f6d254272b5b']
            timeout: 900 seconds
            start-date-time: 2022-12-20 21:29:07
            end-date-time: 2022-12-20 21:30:10
            result: success
            reason:

            step-id: 4
            step-name: system-stabilize
            timeout: 15 seconds
            start-date-time: 2022-12-20 21:30:10
            end-date-time: 2022-12-20 21:30:27
            result: success
            reason:

            step-id: 5
            step-name: unlock-hosts
            entity-type: hosts
            entity-names: ['controller-0']
            entity-uuids: ['519aaf06-bba4-4a99-bf60-f6d254272b5b']
            timeout: 1800 seconds
            start-date-time: 2022-12-20 21:30:27
            end-date-time: 2022-12-20 21:35:04
            result: success
            reason:

            step-id: 6
            step-name: wait-alarms-clear
            timeout: 1800 seconds
            start-date-time: 2022-12-20 21:35:04
            end-date-time: 2022-12-20 22:05:05
            result: timed-out
            reason:

        stage-id: 6
        stage-name: kube-upgrade-kubelets-workers
        total-steps: 6
        current-step: 0
        timeout: 3736 seconds
        start-date-time:
        end-date-time:
        result: initial
        reason:
        steps:
...
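
To make the timeline in the step details above easier to read, the short Python sketch below recomputes how long each step ran from its start and end timestamps. The step names, timeouts, and timestamps are copied from the output above; the helper names (`elapsed_seconds`, `summarize`) are illustrative only and are not part of sw-manager.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (step-name, timeout in seconds, start, end), copied from the strategy details above.
STEPS = [
    ("kube-host-upgrade-kubelet", 900, "2022-12-20 21:29:07", "2022-12-20 21:30:10"),
    ("system-stabilize", 15, "2022-12-20 21:30:10", "2022-12-20 21:30:27"),
    ("unlock-hosts", 1800, "2022-12-20 21:30:27", "2022-12-20 21:35:04"),
    ("wait-alarms-clear", 1800, "2022-12-20 21:35:04", "2022-12-20 22:05:05"),
]


def elapsed_seconds(start, end):
    """Return the whole number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds())


def summarize(steps):
    """Return (step-name, elapsed seconds, timeout seconds) for each step."""
    return [(name, elapsed_seconds(s, e), timeout) for name, timeout, s, e in steps]


if __name__ == "__main__":
    for name, elapsed, timeout in summarize(STEPS):
        print(f"{name}: ran {elapsed}s (timeout {timeout}s)")
```

By this calculation the wait-alarms-clear step ran for 1801 seconds against its 1800 second timeout, so it waited for the full timeout period before being reported as timed-out.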

[sysadmin@controller-1 ~(keystone_admin)]$ fm alarm-list

+----------+---------------------------------------------------------------------------+----------------------+----------+---------------+
| Alarm ID | Reason Text                                                               | Entity ID            | Severity | Time Stamp    |
+----------+---------------------------------------------------------------------------+----------------------+----------+---------------+
| 750.006  | A configuration change requires a reapply of the oidc-auth-apps           | k8s_application=     | warning  | 2022-12-20T21 |
|          | application.                                                              | oidc-auth-apps       |          | :35:04.047739 |
|          |                                                                           |                      |          |               |
| 750.006  | A configuration change requires a reapply of the platform-integ-apps      | k8s_application=     | warning  | 2022-12-20T21 |
|          | application.                                                              | platform-integ-apps  |          | :35:02.837708 |
|          |                                                                           |                      |          |               |
| 750.006  | A configuration change requires a reapply of the cert-manager application | k8s_application=     | warning  | 2022-12-20T21 |
|          | .                                                                         | cert-manager         |          | :35:01.490398 |
|          |                                                                           |                      |          |               |
| 100.114  | NTP address 91.207.136.55 is not a valid or a reachable NTP server.       | host=controller-1=91 | minor    | 2022-12-20T21 |
|          |                                                                           | .207.136.55          |          | :34:47.116198 |
|          |                                                                           |                      |          |               |
| 900.007  | Kubernetes upgrade in progress.                                           | host=controller      | minor    | 2022-12-20T20 |
|          |                                                                           |                      |          | :55:00.596354 |
|          |                                                                           |                      |          |               |
+----------+---------------------------------------------------------------------------+----------------------+----------+---------------+

Test Activity
-----------------
Feature Testing

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
importance: Low → Medium
tags: added: stx.nfv
OpenStack Infra (hudson-openstack) wrote : Fix proposed to cert-manager-armada-app (master)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to platform-armada-app (master)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to oidc-auth-armada-app (master)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to rook-ceph (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/rook-ceph/+/871771

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/871772
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/3a95ea5d05a0e646e1dddd6e79386044c0f5e011
Submitter: "Zuul (22348)"
Branch: master

commit 3a95ea5d05a0e646e1dddd6e79386044c0f5e011
Author: Igor Soares <email address hidden>
Date: Wed Jan 25 13:07:32 2023 -0500

    Add kube-upgrade-complete trigger

    Add kube-upgrade-complete trigger for reevaluating app reapplies

    The kube-upgrade-complete trigger was introduced to resume the
    app reapply evaluation process after a Kubernetes upgrade is completed.

    Test Plan:
    PASS: Full image build
    PASS: AIO-SX deployment

    Relates-to: https://review.opendev.org/c/starlingx/config/+/870990
    Closes-Bug: 2003260

    Signed-off-by: Igor Soares <email address hidden>
    Change-Id: Ie3a13e4e5b4681f32db0de70bf13bb7fc3810793

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Igor Pires Soares (ipiresso)
tags: added: stx.9.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/870990
Committed: https://opendev.org/starlingx/config/commit/72d93297119a2fdb980e73d34f8bf3a239d2e3ab
Submitter: "Zuul (22348)"
Branch: master

commit 72d93297119a2fdb980e73d34f8bf3a239d2e3ab
Author: Igor Soares <email address hidden>
Date: Tue Jan 17 14:57:20 2023 -0500

    Defer app reapply evaluation during k8s upgrades

    Prevent triggering the app reapply evaluation process during
    Kubernetes upgrades.

    The app reapply evaluation process can raise alarms that would
    prevent Kubernetes upgrades from proceeding. Concurrently,
    reapplying apps would be deferred because of the upgrade process
    itself thus causing a deadlock condition.

    This commit prevents reapply alarms from being raised during
    Kubernetes upgrades and resumes app reapply evaluation when
    upgrades are complete.

    Test Plan:
    PASS: AIO-DX full system upgrade
    PASS: Kubernetes upgrade orchestration from v1.23.1 to v1.24.4

    Closes-Bug: 2003260
    Signed-off-by: Igor Soares <email address hidden>
    Change-Id: I999ca3f1a454954d2759d3a7d347e51d5875b187
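
The deadlock described in this commit message, and the deferral that avoids it, can be illustrated with a minimal, hypothetical sketch. This is not the sysinv implementation: the class, attribute, and method names below are invented for illustration. It only models the behaviour the commit message describes, namely that reapply evaluations requested during a Kubernetes upgrade raise no alarm and are re-run once the upgrade completes.

```python
class ReapplyEvaluator:
    """Toy model of deferring app reapply evaluation during a Kubernetes upgrade.

    All names here are hypothetical; they do not mirror the real sysinv code.
    """

    def __init__(self):
        self.upgrade_in_progress = False
        self.deferred = []  # apps whose evaluation was postponed
        self.alarms = []    # apps that currently have a reapply alarm raised

    def evaluate(self, app):
        """Raise a reapply alarm for app, or defer it while an upgrade is running."""
        if self.upgrade_in_progress:
            # Raising an alarm now could block the upgrade, and the upgrade
            # would in turn block the reapply, so only remember the request.
            if app not in self.deferred:
                self.deferred.append(app)
            return
        if app not in self.alarms:
            self.alarms.append(app)

    def kube_upgrade_complete(self):
        """Resume evaluation of every deferred app once the upgrade is done."""
        self.upgrade_in_progress = False
        pending, self.deferred = self.deferred, []
        for app in pending:
            self.evaluate(app)


if __name__ == "__main__":
    evaluator = ReapplyEvaluator()
    evaluator.upgrade_in_progress = True
    evaluator.evaluate("cert-manager")
    print("alarms during upgrade:", evaluator.alarms)
    evaluator.kube_upgrade_complete()
    print("alarms after upgrade:", evaluator.alarms)
```

In this model no alarm exists while the upgrade is running, so a step that waits for alarms to clear is not blocked by the reapply evaluation.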

OpenStack Infra (hudson-openstack) wrote : Fix merged to platform-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/platform-armada-app/+/871749
Committed: https://opendev.org/starlingx/platform-armada-app/commit/770e5f01d7600bda66af99beeeec00b43f5d0861
Submitter: "Zuul (22348)"
Branch: master

commit 770e5f01d7600bda66af99beeeec00b43f5d0861
Author: Igor Soares <email address hidden>
Date: Wed Jan 25 12:53:40 2023 -0500

    Add kube-upgrade-complete trigger

    Add kube-upgrade-complete trigger for reevaluating app reapplies

    The kube-upgrade-complete trigger was introduced to resume the
    app reapply evaluation process after a Kubernetes upgrade is completed.

    Test Plan:
    PASS: Full image build
    PASS: AIO-SX deployment

    Relates-to: https://review.opendev.org/c/starlingx/config/+/870990
    Closes-Bug: 2003260

    Signed-off-by: Igor Soares <email address hidden>
    Change-Id: I85973b076aa1d05ca60cc58d0d26dd93c84d7092

OpenStack Infra (hudson-openstack) wrote : Fix merged to oidc-auth-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/oidc-auth-armada-app/+/871770
Committed: https://opendev.org/starlingx/oidc-auth-armada-app/commit/5cbb99016fa61b3509cf357123c1a9939d27b094
Submitter: "Zuul (22348)"
Branch: master

commit 5cbb99016fa61b3509cf357123c1a9939d27b094
Author: Igor Soares <email address hidden>
Date: Wed Jan 25 12:57:22 2023 -0500

    Add kube-upgrade-complete trigger

    Add kube-upgrade-complete trigger for reevaluating app reapplies

    The kube-upgrade-complete trigger was introduced to resume the
    app reapply evaluation process after a Kubernetes upgrade is completed.

    Test Plan:
    PASS: Full image build
    PASS: AIO-SX deployment

    Relates-to: https://review.opendev.org/c/starlingx/config/+/870990
    Closes-Bug: 2003260

    Signed-off-by: Igor Soares <email address hidden>
    Change-Id: I57ca315e27f83c100e4efbb8788c3ef193cf7175

OpenStack Infra (hudson-openstack) wrote : Fix merged to rook-ceph (master)

Reviewed: https://review.opendev.org/c/starlingx/rook-ceph/+/871771
Committed: https://opendev.org/starlingx/rook-ceph/commit/6b717f4b1d757d8e73e1f367323dfc16bbb10bba
Submitter: "Zuul (22348)"
Branch: master

commit 6b717f4b1d757d8e73e1f367323dfc16bbb10bba
Author: Igor Soares <email address hidden>
Date: Wed Jan 25 13:01:07 2023 -0500

    Add kube-upgrade-complete trigger

    Add kube-upgrade-complete trigger for reevaluating app reapplies

    The kube-upgrade-complete trigger was introduced to resume the
    app reapply evaluation process after a Kubernetes upgrade is completed.

    Test Plan:
    PASS: Full image build
    PASS: AIO-SX deployment

    Relates-to: https://review.opendev.org/c/starlingx/config/+/870990
    Closes-Bug: 2003260

    Signed-off-by: Igor Soares <email address hidden>
    Change-Id: Ic8d280d4b5c8bf4b90d3d95143006b581ea68926

OpenStack Infra (hudson-openstack) wrote : Fix merged to cert-manager-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/cert-manager-armada-app/+/871748
Committed: https://opendev.org/starlingx/cert-manager-armada-app/commit/fe0b395d325b991aec3f5d3060971b1e19f5f113
Submitter: "Zuul (22348)"
Branch: master

commit fe0b395d325b991aec3f5d3060971b1e19f5f113
Author: Igor Soares <email address hidden>
Date: Wed Jan 25 12:23:23 2023 -0500

    Add kube-upgrade-complete trigger

    Add kube-upgrade-complete trigger for reevaluating app reapplies

    The kube-upgrade-complete trigger was introduced to resume the
    app reapply evaluation process after a Kubernetes upgrade is completed.

    Test Plan:
    PASS: Full image build
    PASS: AIO-SX deployment
    PASS: Issue kube-upgrade-complete trigger during app apply

    Relates-to: https://review.opendev.org/c/starlingx/config/+/870990
    Closes-Bug: 2003260

    Signed-off-by: Igor Soares <email address hidden>
    Change-Id: I9c40519ec598106d102eeb35a4f396b4d4c7d167
