during K8s upgrade it's possible for needed images to be garbage collected

Bug #2044493 reported by Chris Friesen
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Chris Friesen
Milestone: (none)

Bug Description

Brief Description

SX subcloud K8s upgrade timeout in control plane 1.22.5

A retry worked fine. The issue was found on only one subcloud; another subcloud in the same DC lab passed.

Severity

Major

Steps to Reproduce

dcmanager kube-upgrade-strategy create --to-version v1.24.4

dcmanager kube-upgrade-strategy apply

Expected Behavior

K8s upgrade should be successful

Actual Behavior

[sysadmin@controller-0 scratch(keystone_admin)]$ dcmanager strategy-step list | grep subcloud2
subcloud2 3 failed kube applying vim kube upgrade strategy: (kube-upgrade) Vim strategy apply failed. Unexpected State: abort-failed. 2023-11-16 19:19:32.947260 2023-11-16 19:40:32.497926

Reproducibility

Intermittent

System Configuration

Distributed Cloud system controller

Bob Church analyzed the logs and found that some of the control plane images used by the static pods had been garbage-collected by kubelet just before they were actually needed. Because static pods can't use Secrets, the imagePullSecrets field cannot be set on these pods, so they had no way to supply registry credentials to pull the images again once they were gone.

The solution is to disable image garbage collection before pulling the new images and to re-enable it after the upgrade is complete.
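
As a rough illustration of what "disabling" can mean here: kubelet's image garbage collection is governed by the `imageGCHighThresholdPercent` field of its configuration, and the upstream kubelet documentation states that setting it to 100 disables image garbage collection. The sketch below is not the code from the StarlingX fix. The helper names are hypothetical, and a plain dict stands in for the parsed kubelet configuration file (reading and writing the file, and restarting kubelet so it picks up the change, are left out).

```python
import copy

# Field name from the upstream KubeletConfiguration API.
# Per the kubelet documentation, a value of 100 disables image GC.
GC_HIGH_FIELD = "imageGCHighThresholdPercent"
GC_DISABLED = 100


def disable_image_gc(kubelet_config):
    """Return (new_config, previous_value) with image GC disabled.

    previous_value is None when the field was unset, i.e. kubelet was
    relying on its built-in default. The input dict is not modified.
    """
    new_config = copy.deepcopy(kubelet_config)
    previous = new_config.get(GC_HIGH_FIELD)
    new_config[GC_HIGH_FIELD] = GC_DISABLED
    return new_config, previous


def restore_image_gc(kubelet_config, previous):
    """Return a copy of the config with the saved GC threshold put back."""
    new_config = copy.deepcopy(kubelet_config)
    if previous is None:
        # The field was unset originally, so drop it to restore the default.
        new_config.pop(GC_HIGH_FIELD, None)
    else:
        new_config[GC_HIGH_FIELD] = previous
    return new_config
```

Saving the previous value (rather than hard-coding a threshold on restore) keeps any site-specific setting intact once the upgrade is over.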

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/901816

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/901778
Committed: https://opendev.org/starlingx/stx-puppet/commit/4636be19a498eca43f028c79cb2b652ad552d26f
Submitter: "Zuul (22348)"
Branch: master

commit 4636be19a498eca43f028c79cb2b652ad552d26f
Author: Chris Friesen <email address hidden>
Date: Thu Nov 23 15:04:29 2023 -0600

    disable image gc when doing k8s upgrade

    Static pods cannot use image pull secrets, so it's important that the
    control plane images are not garbage-collected while we're doing a
    Kubernetes upgrade otherwise the upgrade can fail.

    Accordingly we want to disable garbage-collecting the images, then
    pre-pull the new images, then do the actual K8s upgrade, then re-enable
    image garbage collection.

    Also included are a couple of fixes for places where we were using
    subtly incorrect versions when retrieving the image list as part of
    a multi-version upgrade.

    TEST-PLAN:
    PASS: Perform multi-version K8s upgrade on AIO-SX, ensure upgrade
          passes and image garbage collection is disabled during the
          upgrade and re-enabled when kubelet gets upgraded to the
          final version.

    PASS: Perform single-version K8s upgrade on AIO-SX, ensure upgrade
          passes and image garbage collection is disabled during the
          upgrade and re-enabled when kubelet gets upgraded.

    PASS: Perform single-version K8s upgrade on Standard lab, ensure
          upgrade passes and image garbage collection is disabled on
          each node during the upgrade and re-enabled when kubelet is
          upgraded.

    Closes-Bug: 2044492
    Partial-Bug: 2044493

    Change-Id: I358ae922e5c2c5c047806a1e6773b1d23a74cbd0
    Signed-off-by: Chris Friesen <email address hidden>
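
The ordering described in this commit message (disable image garbage collection, pre-pull, upgrade, re-enable) can be sketched as follows. This is only an illustration of the sequence: the function name and the four callables are hypothetical placeholders, not functions from stx-puppet or sysinv, and failure handling is deliberately left out.

```python
def upgrade_with_gc_paused(disable_gc, pre_pull, upgrade, enable_gc):
    """Call four hypothetical steps in the order the commit describes."""
    disable_gc()  # 1. stop kubelet from garbage-collecting images
    pre_pull()    # 2. pull the new control plane images
    upgrade()     # 3. perform the actual Kubernetes upgrade
    enable_gc()   # 4. resume image garbage collection


# Record the call order with stand-in steps.
calls = []
upgrade_with_gc_paused(
    disable_gc=lambda: calls.append("disable_gc"),
    pre_pull=lambda: calls.append("pre_pull"),
    upgrade=lambda: calls.append("upgrade"),
    enable_gc=lambda: calls.append("enable_gc"),
)
print(calls)
```

The point of the ordering is that images pulled in step 2 cannot be removed before the static pods start using them in step 3.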

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/901816
Committed: https://opendev.org/starlingx/config/commit/c645ce21d6272e92b4546ded808205088874f4bd
Submitter: "Zuul (22348)"
Branch: master

commit c645ce21d6272e92b4546ded808205088874f4bd
Author: Chris Friesen <email address hidden>
Date: Fri Nov 24 02:00:39 2023 -0600

    disable image gc when doing k8s upgrade

    Static pods cannot use image pull secrets, so it's important that the
    control plane images are not garbage-collected while we're doing a
    Kubernetes upgrade otherwise the upgrade can fail.

    Accordingly we want to disable garbage-collecting the images, then
    pre-pull the new images, then do the actual K8s upgrade, then re-enable
    image garbage collection.

    For duplex systems we can disable garbage collection from the puppet
    manifest, but for simplex puppet isn't involved so we have to do it
    from sysinv.

    The re-enabling of the image garbage collection happens when we
    upgrade kubelet to the final desired version. It's done in the
    puppet commit linked below.

    Depends-On: https://review.opendev.org/c/starlingx/stx-puppet/+/901778

    TEST-PLAN:

    PASS: Perform single-verison K8s upgrade on AIO-SX, ensure upgrade
          passes and image garbage collection is disabled when we
          download images and re-enabled when kubelet gets upgraded.

    Closes-Bug: 2044493

    Change-Id: Ide258768c3b05a01c4e903e52380a348c2fcae65
    Signed-off-by: Chris Friesen <email address hidden>
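
The rule stated in this commit message, that image garbage collection is re-enabled only when kubelet reaches the final desired version, can be pictured with a small sketch. The function name is hypothetical and the intermediate version in the example path is an assumption chosen for illustration; only the target v1.24.4 comes from this report.

```python
def gc_enabled_after_each_step(upgrade_path):
    """Map each kubelet version in the path to the image GC state after it.

    Image GC stays disabled through intermediate versions and is
    re-enabled only once kubelet reaches the last version in the path.
    """
    target = upgrade_path[-1]
    return {version: version == target for version in upgrade_path}


# Hypothetical multi-version path; "v1.23.1" is an illustrative
# intermediate step, not a value taken from the bug report.
states = gc_enabled_after_each_step(["v1.23.1", "v1.24.4"])
print(states)
```

For a single-version upgrade the path has one entry, so garbage collection is re-enabled as soon as kubelet is upgraded.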

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.containers
Changed in starlingx:
assignee: nobody → Chris Friesen (cbf123)