DC - k8s upgrade orchestration to check health-query-upgrade

Bug #2012300 reported by Christopher de Oliveira Souza
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Christopher de Oliveira Souza
Milestone: (none)

Bug Description

Brief Description
--------
Subcloud k8s upgrade orchestration fails because 'system health-query-kube-upgrade' reports that the host has out-of-date configurations.

$ dcmanager strategy-step list | grep fail
| subcloud59 | 1 | failed | kube applying vim kube upgrade strategy: (kube-upgrade) Vim strategy apply failed. Unexpected State: aborted. | 2022-12-08 16:40:05.943226 | 2022-12-08 16:43:31.941503 |
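
When many subclouds are involved, the failed rows can also be pulled out of the table programmatically rather than with grep. The following is a minimal sketch, not part of dcmanager: the function name `parse_failed_steps` is hypothetical, and the column order (subcloud, stage, state, details, started, finished) is assumed from the single row shown above.

```python
def parse_failed_steps(table_text):
    """Return (subcloud, details) pairs for rows whose state is 'failed'."""
    failed = []
    for line in table_text.splitlines():
        line = line.strip()
        # Only pipe-delimited data rows are of interest; skip borders and blanks.
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Assumed column order: subcloud, stage, state, details, started, finished.
        if len(cells) >= 4 and cells[2] == "failed":
            failed.append((cells[0], cells[3]))
    return failed


# The row reported in this bug, used as sample input.
SAMPLE_ROW = (
    "| subcloud59 | 1 | failed | kube applying vim kube upgrade strategy: "
    "(kube-upgrade) Vim strategy apply failed. Unexpected State: aborted. "
    "| 2022-12-08 16:40:05.943226 | 2022-12-08 16:43:31.941503 |"
)

for subcloud, details in parse_failed_steps(SAMPLE_ROW):
    print(subcloud, "->", details)
```

The sketch assumes the details column itself contains no '|' characters, which holds for the row above but may not hold in general.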

Severity
---------
Major

Steps to Reproduce
---------
    Apply subcloud k8s upgrade orchestration (prerequisites: install the System Controller and subclouds with K8s 1.23, then upgrade K8s on the system controller first).
    $ dcmanager kube-upgrade-strategy create --max-parallel-subclouds 250 --subcloud-apply-type parallel --to-version v1.24.4
    $ dcmanager kube-upgrade-strategy apply

Expected Behavior
----------
Kubernetes is upgraded on all subclouds.

Actual behavior
-----------
Kubernetes is not upgraded on the subcloud because 'system health-query-kube-upgrade' reports a failure.

Reproducibility
-----------
1 out of 1000 subclouds.

System Configuration
-----------
DC

Load info
-----------
22.12

Last Pass
-----------
NA

Timestamp/Logs
-----------
NA

Alarms
------------
None

Test Activity
------------
Scalability Testing

Workaround
------------
Lock and unlock controller-0 on the subcloud, then re-apply the k8s strategy.

Changed in starlingx:
assignee: nobody → Christopher de Oliveira Souza (cdeolive)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/877994

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/877994
Committed: https://opendev.org/starlingx/distcloud/commit/46c0c59a9c2ddf7a4fd306ab958450dc90c5f10b
Submitter: "Zuul (22348)"
Branch: master

commit 46c0c59a9c2ddf7a4fd306ab958450dc90c5f10b
Author: Christopher Souza <email address hidden>
Date: Mon Mar 20 15:00:58 2023 -0300

    Add health check to kube upgrade pre-check

    In this commit, the DC k8s upgrade orchestrator was updated
    to include health-query-kube-upgrade check.

    Test Plan:
    PASS: Create and apply k8s upgrade strategy having a subcloud
    with mgmt affecting alarms.
    Then verify that the pre-check stage failed for that subcloud due
    to a health check failure.
    PASS: Create and apply k8s upgrade strategy having only
    healthy subclouds with no alarms.
    Then verify that the subclouds passed the pre-check stage.
    PASS: Create and apply k8s upgrade strategy having one subcloud
    with one non-mgmt affecting alarm. E.g: 100.103.
    Then verify that this subcloud passed the pre-check stage.

    Closes-Bug: 2012300

    Signed-off-by: Christopher Souza <email address hidden>
    Change-Id: Ib2b8f7a54c69f96234c0fdd9ad4d423d98a9d9ea
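
The idea behind this commit (run the health query before starting the upgrade and fail the pre-check if any item fails) can be illustrated with a short sketch. This is not the distcloud implementation: the function names are hypothetical, and the sample report text is an assumed, illustrative layout in which each check ends with "[OK]" or "[Fail]".

```python
FAIL_MARKER = "[Fail]"


def failed_health_checks(report):
    """Return the names of the checks that the health report marks as failed."""
    failed = []
    for line in report.splitlines():
        line = line.strip()
        if line.endswith(FAIL_MARKER):
            # Drop the marker and the trailing colon to keep only the check name.
            name = line[: -len(FAIL_MARKER)].rstrip().rstrip(":")
            failed.append(name)
    return failed


def pre_check_passes(report):
    """The pre-check passes only when no health item has failed."""
    return not failed_health_checks(report)


# Illustrative report text (assumed format, not captured from a real system).
SAMPLE_REPORT = """\
System Health:
All hosts are provisioned: [OK]
All hosts have current configurations: [Fail]
Hosts with out of date configurations: controller-0
No alarms: [OK]
"""

print(failed_health_checks(SAMPLE_REPORT))
print(pre_check_passes(SAMPLE_REPORT))
```

Failing early in the pre-check stage surfaces the out-of-date configuration before the VIM strategy is applied, instead of as an "Unexpected State: aborted" error afterwards.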

Changed in starlingx:
status: In Progress → Fix Released
tags: added: stx.9
tags: added: stx.9.0
removed: stx.9
tags: added: stx.distcloud
OpenStack Infra (hudson-openstack) wrote : Fix proposed to distcloud (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/distcloud/+/882331

OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/882331
Committed: https://opendev.org/starlingx/distcloud/commit/f8b3b48d853926536f99e769fa6a6f021b388e7e
Submitter: "Zuul (22348)"
Branch: master

commit f8b3b48d853926536f99e769fa6a6f021b388e7e
Author: Christopher Souza <email address hidden>
Date: Thu May 4 12:34:26 2023 -0300

    Ignore certain alarms during kube upgrade pre-check

    In this commit, the kubernetes upgrade orchestrator was updated to
    not fail in the pre-check stage when certain alarms are set. Before
    this change it was not possible to retry the kubernetes upgrade for
    a subcloud that failed because of the 900.007 alarm.

    Test Plan:
    PASS: Run kubernetes upgrade orchestration and re-run
    the orchestration for a subcloud that failed and verify that
    the orchestration finished successfully
    PASS: Set a management-affecting alarm that is not in the following
    list: ['100.003', '200.001', '700.004', '750.006','900.007', '900.401']
    on the subcloud and verify that the kubernetes upgrade orchestration
    failed for that subcloud.

    Closes-Bug: 2012300

    Signed-off-by: Christopher Souza <email address hidden>
    Change-Id: I25b97a98441db4bdbd1d3e546eda0b5a1588dcb1
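
The alarm handling described in the two commits can be sketched as a simple filter: an alarm blocks the pre-check only if it is management-affecting and not in the list quoted in the test plan above. This is a minimal sketch under that reading, not the distcloud code. The `Alarm` class, its field names, and the function name are hypothetical, and alarm ID 100.101 is used purely as an example of an ID outside the list.

```python
from dataclasses import dataclass

# Alarm IDs quoted in the commit's test plan; assumed here to be the ignore list.
IGNORED_ALARM_IDS = frozenset(
    ["100.003", "200.001", "700.004", "750.006", "900.007", "900.401"]
)


@dataclass(frozen=True)
class Alarm:
    """Hypothetical, simplified alarm record used only for this sketch."""
    alarm_id: str
    mgmt_affecting: bool


def blocking_alarms(alarms):
    """Return the alarms that should fail the kube upgrade pre-check."""
    return [
        alarm
        for alarm in alarms
        if alarm.mgmt_affecting and alarm.alarm_id not in IGNORED_ALARM_IDS
    ]


active = [
    Alarm("900.007", True),   # in the ignore list, so a retry is not blocked
    Alarm("100.103", False),  # not management-affecting, so it is not blocking
    Alarm("100.101", True),   # example ID outside the list, so it blocks
]

for alarm in blocking_alarms(active):
    print("pre-check blocked by alarm", alarm.alarm_id)
```

Under this reading, a subcloud that previously failed with the 900.007 alarm still set can be retried, while any other management-affecting alarm continues to fail the pre-check.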

Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium