Restore Operation failed with Backup done after K8s upgrade to 1.24

Bug #1999095 reported by Chris Friesen
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Chris Friesen

Bug Description

System Restore failed on AIO-DX (ip_31-32_k8s) when using a Backup taken after upgrading K8s (1.23.1 to 1.24.4).

According to Chris F.,

On controller-0, the log file /var/log/containers/kube-apiserver-controller-0_kube-system_kube-apiserver-9477bdffb6a844754f3d2ce9fcf86888318be26b74bf9879a35b257924b22469.log has the following:

2022-11-29T22:35:22.543496541Z stderr F Error: invalid argument "RemoveSelfLink=false" for "--feature-gates" flag: cannot set feature gate RemoveSelfLink to false, feature is locked to true
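
When triaging a failed restore like this, it can help to scan the kube-apiserver container log for this class of error. The following is a minimal sketch, assuming the error text has the same shape as the line quoted above; the function name `find_locked_gates` and the regular expression are illustrative and not part of StarlingX.

```python
import re

# Matches the "feature is locked" error text quoted in this report.
# The pattern is an assumption based on that single log line.
LOCKED_GATE_RE = re.compile(
    r"cannot set feature gate (?P<gate>\w+) to (?P<value>true|false), "
    r"feature is locked to (?P<locked>true|false)"
)


def find_locked_gates(log_lines):
    """Return (gate, requested_value, locked_value) for each matching line."""
    hits = []
    for line in log_lines:
        match = LOCKED_GATE_RE.search(line)
        if match:
            hits.append(
                (match.group("gate"), match.group("value"), match.group("locked"))
            )
    return hits


sample = [
    '2022-11-29T22:35:22.543496541Z stderr F Error: invalid argument '
    '"RemoveSelfLink=false" for "--feature-gates" flag: cannot set feature '
    'gate RemoveSelfLink to false, feature is locked to true'
]
print(find_locked_gates(sample))  # [('RemoveSelfLink', 'false', 'true')]
```

Any gate reported this way has to be removed from the saved configuration before kube-apiserver will start on the newer K8s version.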

Severity

Major

Steps to Reproduce

    1. Verify the System Controller is healthy and running kubelet version 1.23.1.
    2. Controller-0 was the Active Controller initially.
    3. Create and apply the kube upgrade strategy:
       sw-manager kube-upgrade-strategy create --to-version v1.24.4
       sw-manager kube-upgrade-strategy apply
    4. Watch progress with "sw-manager kube-upgrade-strategy show".
    5. After the K8s Upgrade completes, perform a Backup.
    6. Re-Initialize Debian with the same load, and attempt to Restore from the Platform_Backup.

Expected Behavior

System is Restored with K8s Upgraded, and All Pods are running

Actual Behavior

    [wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s

    [kubelet-check] Initial timeout of 40s passed.

    Unfortunately, an error has occurred:
            timed out waiting for the condition (likely kube-apiserver exited)

Chris Friesen (cbf123)
Changed in starlingx:
assignee: nobody → Chris Friesen (cbf123)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/866804
Committed: https://opendev.org/starlingx/integ/commit/15db2d6990a717f50cb7611b1e4ee76f3c626af7
Submitter: "Zuul (22348)"
Branch: master

commit 15db2d6990a717f50cb7611b1e4ee76f3c626af7
Author: Chris Friesen <email address hidden>
Date: Tue Dec 6 14:33:08 2022 -0600

    clean up feature gates on k8s upgrade

    During a K8s feature upgrade from 1.23 to 1.24 we need to remove
    the "RemoveSelfLink=false" feature gate from kube-apiserver.

    We had previously handled updating the kubeadm configmap, which
    was sufficient to handle the running system. However, in order
    to properly handle backup and restore after the K8s upgrade to
    1.24 (and just for general tidiness) we need to also remove the
    feature gate from the saved service parameters and from the
    last_kube_extra_config_bootstrap.yaml file.

    It's possible that there are other kube-apiserver feature gates
    specified by the end user, this adds a bit of complexity to the
    code.

    Test Plan:
    PASS: Test python script and bash script in isolation.
    PASS: End-to-end test with k8s upgrade and backup/restore with
          manual modification of service parameters and yaml file.
          Tested with AIO-DX, AIO-SX unoptimised restore, and
          AIO-SX optimised restore.
    PASS: K8s upgrade using the new code, ensure service parameter
          and last_kube_extra_config_bootstrap.yaml have been
          updated with "RemoveSelfLink=false" feature gate removed.

    Closes-Bug: 1999095
    Signed-off-by: Chris Friesen <email address hidden>
    Change-Id: I82ecd821d4e1745ab0f480f9f9c0178757521038
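
The complexity mentioned in the commit message comes from having to drop one gate while preserving any others the end user has set. A minimal sketch of that string handling follows, assuming the feature gates are stored as a comma-separated "Name=value" string as in the --feature-gates flag. The helper name `strip_feature_gate` and the gate `SomeUserGate` are hypothetical; this is not the code from the merged change.

```python
def strip_feature_gate(feature_gates, gate="RemoveSelfLink=false"):
    """Remove one entry from a comma-separated feature-gates string.

    Returns the remaining gates joined by commas, or None when nothing is
    left, so the caller can drop the parameter entirely instead of writing
    an empty value.
    """
    kept = []
    for entry in feature_gates.split(","):
        entry = entry.strip()
        # Skip empty fragments and the gate being removed; keep the rest
        # in their original order.
        if entry and entry != gate:
            kept.append(entry)
    return ",".join(kept) or None


# "SomeUserGate" stands in for a user-specified gate (hypothetical name).
print(strip_feature_gate("SomeUserGate=true,RemoveSelfLink=false"))  # SomeUserGate=true
print(strip_feature_gate("RemoveSelfLink=false"))  # None
```

Returning None for the empty case is a design choice for this sketch: it lets the caller delete the saved parameter rather than keep an empty feature-gates setting.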

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.containers stx.update
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/867587

OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/867743

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/867587
Committed: https://opendev.org/starlingx/config/commit/804a31613820d8064292495637928e5b6a66ee00
Submitter: "Zuul (22348)"
Branch: master

commit 804a31613820d8064292495637928e5b6a66ee00
Author: Chris Friesen <email address hidden>
Date: Tue Dec 13 20:39:02 2022 -0600

    clean up feature gates on k8s upgrade

    During a K8s feature upgrade from 1.23 to 1.24 we need to remove
    the "RemoveSelfLink=false" feature gate from kube-apiserver.

    We had previously handled updating the kubeadm configmap, which
    was sufficient to handle the running system. However, in order
    to properly handle backup and restore after the K8s upgrade to
    1.24 (and just for general tidiness) we need to also remove the
    feature gate from the service parameters and from the
    last_kube_extra_config_bootstrap.yaml file. These changes must
    be done on the active controller, so we can't easily do them
    from puppet where the existing feature-gate changes were done.

    Note that there may be other kube-apiserver feature gates
    specified by the end user, this adds a bit of complexity to the
    code.

    https://review.opendev.org/c/starlingx/integ/+/866804 was intended
    to deal with this problem, but it didn't work if the K8s control
    plane was upgraded on the inactive controller first.

    Test Plan:
    PASS: Test K8s control-plane upgrade on AIO-SX and ensure that
          problematic feature gate is removed from service parameters
          and yaml file.
    PASS: Test K8s control-plane upgrade on AIO-DX upgrading control
          plane on standby controller first. Ensure that the
          problematic feature gate is removed from service parameters
          and yaml file.

    Change-Id: Ief45be6e1dbae9eee68bb4d2be9535cd4b09f322
    Closes-Bug: 1999095
    Signed-off-by: Chris Friesen <email address hidden>

OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/867743
Committed: https://opendev.org/starlingx/integ/commit/5dbfd02db07f3b5f59ea774d9aedd229901f2bea
Submitter: "Zuul (22348)"
Branch: master

commit 5dbfd02db07f3b5f59ea774d9aedd229901f2bea
Author: Chris Friesen <email address hidden>
Date: Wed Dec 14 14:55:00 2022 -0600

    Revert "clean up feature gates on k8s upgrade"

    This reverts commit 15db2d6990a717f50cb7611b1e4ee76f3c626af7.

    While this works fine if we trigger the control plane upgrade on
    the active controller first, it fails miserably if we upgrade the
    inactive controller first.

    The fix is to revert this and instead do it in sysinv-conductor, as
    covered in https://bugs.launchpad.net/starlingx/+bug/1999095

    Partial-Bug: 1999095
    Signed-off-by: Chris Friesen <email address hidden>
    Change-Id: I8f9119ad0fa57bc337883a9263671048f5818c2f
