robustness: error in k8s upgrade ends up stuck in upgrading-second-master

Bug #1999405 reported by sachin
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: sachin

Bug Description

While testing a bug in k8s upgrades that caused a late-stage failure while upgrading the second controller node, the "sudo kubeadm upgrade node" command failed right near the end, after the new /var/lib/kubelet/config.yaml had been written. This left us stuck in a state where "system kube-upgrade-show" shows the state as "upgrading-second-master" (as per the "state" field in the "kube_upgrade" table in the sysinv DB) rather than transitioning to "upgrading-second-master-failed". Similarly, the "status" field in the "kube_host_upgrade" table in the sysinv DB remains set, and it needs to be cleared before we can retry.

Either of the above issues on its own means that we can't retry the "system kube-host-upgrade controller-X control-plane" command, because the retry fails the pre-checks.

Severity

Major

Steps to Reproduce

Force a late-stage failure of the "kubeadm upgrade node" command when doing a kubernetes upgrade on a control plane node.

Expected Behavior

System transitions to state of "upgrading-second-master-failed".

Actual Behavior

System stays at "upgrading-second-master" and a retry fails the pre-checks.

Reproducibility

Reproducible (with a late-stage failure of the puppet manifest).

System Configuration

Multi-node standard lab. Might also show up in AIO-SX if there is a failure in the k8s control plane upgrade puppet manifest.

sachin (skrishn5)
Changed in starlingx:
assignee: nobody → sachin (skrishn5)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/867255

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.containers stx.update
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/867255
Committed: https://opendev.org/starlingx/config/commit/029e3eecf50d5000142c8f3ef6b203437e313c07
Submitter: "Zuul (22348)"
Branch: master

commit 029e3eecf50d5000142c8f3ef6b203437e313c07
Author: Sachin Gopala Krishna <email address hidden>
Date: Mon Dec 12 10:55:09 2022 -0500

    Add system command and periodic audit to transition state

    system kube-upgrade-* commands can get stuck in an upgrading-* state with
    no way to continue the upgrade. The 'system kube-upgrade-failed' command
    was created to manually set the state to *-failed.

    Created kube-upgrade-failed command to manually set status to *-failed.
    Created 30 minute periodic task _audit_kube_upgrade_states to
    automatically change the kube_upgrade state to *-failed if the specific
    state is stuck at 'upgrading-*' for more than 1 hour.
    Updated kube_upgrade_controller to support state transition to *-failed
    state.

    Test Plan:
    PASS: Manually edit kube_upgrade state to upgrading-* and execute
    'system kube-upgrade-failed' and verify the state transition to *-failed
    PASS: Manually edit kube_upgrade state to upgrading-* after kube_upgrade
    completion and wait for one hour and verify state transition to *-failed
    based on updated_at time stamp
    PASS: Verify the functionality of _audit_kube_upgrade_states and
    kube-upgrade-failed by building ISO

    Closes-Bug: 1999405

    Signed-off-by: Sachin Gopala Krishna <email address hidden>
    Change-Id: I499fb2909f11dc2b240dbf2e03ccfd95f1fd2e62
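
For readers who want to see the audit rule from the commit message in concrete terms, here is a minimal sketch of the decision it describes: a state that has sat in 'upgrading-*' for more than 1 hour, judged by its updated_at time stamp, moves to the matching '*-failed' state. This is not the merged sysinv code. The function names, the timeout constant, and the rule of appending "-failed" to the state name are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the 1 hour threshold quoted in the commit message.
STUCK_TIMEOUT = timedelta(hours=1)
FAILED_SUFFIX = "-failed"


def failed_state_for(state):
    """Return the *-failed counterpart of an in-progress state, else None.

    Hypothetical helper: assumes the failed state is named by appending
    "-failed", as in upgrading-second-master -> upgrading-second-master-failed.
    """
    if state.startswith("upgrading-") and not state.endswith(FAILED_SUFFIX):
        return state + FAILED_SUFFIX
    return None


def audit_kube_upgrade_state(state, updated_at, now=None):
    """Return the state the upgrade record should hold after one audit pass.

    state      -- current value of the kube_upgrade "state" field
    updated_at -- timezone-aware datetime of the last update to the record
    now        -- injectable clock, so the rule can be tested deterministically
    """
    if now is None:
        now = datetime.now(timezone.utc)
    failed = failed_state_for(state)
    if failed is not None and now - updated_at > STUCK_TIMEOUT:
        return failed
    return state
```

A periodic task running every 30 minutes would call something like audit_kube_upgrade_state for the current upgrade record and write the result back only when it differs from the stored state. States that are already '*-failed', or that are not 'upgrading-*' at all, pass through unchanged.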

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0