Optimized restore fails if kubeadm config is missing during backup

Bug #2047845 reported by Joshua Kraitberg
Affects    Status        Importance  Assigned to       Milestone
StarlingX  Fix Released  Medium      Joshua Kraitberg

Bug Description

Brief Description
-----------------
DC subcloud BnR: Subcloud restore, post k8s upgrade, failed to initialize Kubernetes master

Failure:

 TASK [optimized-restore/restore-data : Initializing Kubernetes master] *********
    Thursday 14 December 2023 18:49:49 +0000 (0:00:01.831) 0:05:59.424 *****
    fatal: [localhost]: FAILED! => changed=true
      cmd:
      - kubeadm
      - init
      - --ignore-preflight-errors=DirAvailable--var-lib-etc
      - --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-apiserver.yaml
      - --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-controller-manager.yaml
      - --ignore-preflight-errors=FileAvailable--etc-kubernetes-manifests-kube-scheduler.yaml
      - --config=/etc/kubernetes/kubeadm.yaml
      delta: '0:00:00.029620'
      end: '2023-12-14 18:49:49.419911'
      msg: non-zero return code
      rc: 1
      start: '2023-12-14 18:49:49.390291'
      stderr: |-
        W1214 18:49:49.412646 26252 common.go:84] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta2". Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.
        W1214 18:49:49.413923 26252 common.go:84] your configuration file uses a deprecated API spec: "kubeadm.k8s.io/v1beta2". Please use 'kubeadm config migrate --old-config old.yaml --new-config new.yaml', which will write the new, similar spec using a newer API version.
        W1214 18:49:49.414319 26252 initconfiguration.go:120] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/var/run/containerd/containerd.sock". Please update your configuration!
        this version of kubeadm only supports deploying clusters with the control plane version >= 1.23.0. Current version: v1.21.8
        To see the stack trace of this error execute with --v=5 or higher
      stderr_lines: <omitted>
      stdout: ''
      stdout_lines: <omitted>
subcloud state:

kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.4", GitCommit:"95ee5ab382d64cfe6c28967f36b53970b8374491", GitTreeState:"archive", BuildDate:"2023-11-25T04:11:10Z", GoVersion:"go1.18.5", Compiler:"gc", Platform:"linux/amd64"}

## check nodes control-plane version
kubectl get nodes -n deployment
Error from server (ServiceUnavailable): the server is currently unable to handle the request
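
The failure above comes down to a version gap: the installed kubeadm is v1.24.4, while the error reports a control plane version of v1.21.8 and a minimum of 1.23.0. The following is a minimal sketch of that comparison, not kubeadm's actual code. It assumes the minimum supported control plane is one minor release below the kubeadm version, which matches the ">= 1.23.0" message in the log; the function names are hypothetical.

```python
import re


def parse_version(text):
    """Return (major, minor, patch) from a string such as 'v1.24.4'."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def min_supported_control_plane(kubeadm_version):
    """Assumed rule: kubeadm accepts a control plane one minor release older."""
    major, minor, _ = parse_version(kubeadm_version)
    return (major, minor - 1, 0)


def can_init(kubeadm_version, control_plane_version):
    """True if the requested control plane is new enough for this kubeadm."""
    requested = parse_version(control_plane_version)
    return requested >= min_supported_control_plane(kubeadm_version)


if __name__ == "__main__":
    # Versions taken from the log and the "kubeadm version" output above.
    print(can_init("v1.24.4", "v1.21.8"))
```

With the versions from this report the check fails, which is consistent with kubeadm rejecting the restored configuration.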

Severity
-----------------
<Critical: System/Feature is not usable after the defect>

Steps to Reproduce
-----------------
Run BnR post subcloud platform upgrade and k8s upgrade

steps:

1. Deploy systemcontroller and subclouds with 21.12P10
2. Upgrade systemcontroller and subclouds
3. Upgrade k8s on systemcontroller and subclouds
4. Back up subcloud
5. Restore subcloud

Expected Behavior
-----------------
The subcloud should be restored successfully

Actual Behavior
-----------------
The subcloud restore failed

Reproducibility
-----------------
100%

System Configuration
-----------------

DC / subcloud

Load info (eg: 2022-03-10_20-00-07)

cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
SRC_BUILD_ID="38"
JOB="wrcp-22.12-debian"
BUILD_BY="jenkins"
BUILD_NUMBER="50"
BUILD_HOST="yow-wrcp-lx.wrs.com"
BUILD_DATE="2022-12-19 07:22:00 +0000"
[sysadmin@controller-0 ~(keystone_admin)]$ sw-patch query
      Patch ID         RR  Release  Patch State
=====================  ==  =======  ===========
WRCP_21.12_PATCH_0001  Y   21.12    Committed
WRCP_21.12_PATCH_0002  Y   21.12    Committed
WRCP_21.12_PATCH_0003  Y   21.12    Committed
WRCP_21.12_PATCH_0004  Y   21.12    Committed
WRCP_21.12_PATCH_0005  Y   21.12    Committed
WRCP_21.12_PATCH_0006  Y   21.12    Committed
WRCP_21.12_PATCH_0007  Y   21.12    Committed
WRCP_21.12_PATCH_0008  Y   21.12    Committed
WRCP_21.12_PATCH_0009  Y   21.12    Committed
WRCP_21.12_PATCH_0010  N   21.12    Committed
WRCP_22.12_PATCH_0001  Y   22.12    Committed
WRCP_22.12_PATCH_0002  Y   22.12    Committed
WRCP_22.12_PATCH_0003  Y   22.12    Committed
WRCP_22.12_PATCH_0004  Y   22.12    Committed

Last Pass
-----------------

Timestamp/Logs
-----------------

Alarms
-----------------
no alarms

Test Activity
-----------------
Manual regression

Workaround
-----------------
Change kubeadm.k8s.io/v1beta1 to kubeadm.k8s.io/v1beta3 in /usr/share/ansible/stx-ansible/playbooks/roles/common/files/kubeadm.yaml.j2 before creating the backup
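
As an illustration only, the manual edit described in the workaround could be scripted roughly as below. This is a sketch under assumptions: the helper name is hypothetical, it only rewrites the `apiVersion:` string on lines that declare it, and it does not migrate any fields whose schema differs between API versions.

```python
def bump_kubeadm_api_version(text,
                             old="kubeadm.k8s.io/v1beta1",
                             new="kubeadm.k8s.io/v1beta3"):
    """Rewrite matching apiVersion lines; return (new_text, lines_changed)."""
    out = []
    changed = 0
    for line in text.splitlines(keepends=True):
        stripped = line.strip()
        # Only touch lines of the form "apiVersion: <old>".
        if stripped.startswith("apiVersion:"):
            value = stripped.split(":", 1)[1].strip()
            if value == old:
                line = line.replace(old, new)
                changed += 1
        out.append(line)
    return "".join(out), changed


if __name__ == "__main__":
    # Template path quoted in the workaround above.
    path = ("/usr/share/ansible/stx-ansible/playbooks/roles/"
            "common/files/kubeadm.yaml.j2")
    with open(path) as handle:
        updated, count = bump_kubeadm_api_version(handle.read())
    # Printed rather than written back, so the result can be reviewed first.
    print(f"{count} apiVersion line(s) would change")
```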

Changed in starlingx:
assignee: nobody → Joshua Kraitberg (jkraitbe-wr)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/903720
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/c9e16717e76ec375ad598f063c44018e9f24a33b
Submitter: "Zuul (22348)"
Branch: master

commit c9e16717e76ec375ad598f063c44018e9f24a33b
Author: Joshua Kraitberg <email address hidden>
Date: Thu Dec 14 17:17:54 2023 -0500

    Do not use kubeadm during optimized restore

    During backup, the kubeadm config file is not guaranteed to be present
    on the system. This caused an issue during optimized restore because
    that file was used to recreate the cluster.

    A similar issue can also occur during restore after upgrade because
    the kubeadm config will contain deprecated fields.

    Rather than using "kubeadm init" to initialize a new cluster,
    the K8s certificates, control-plane static pod manifests,
    kubelet configuration, and etcd snapshot will be leveraged
    to bring up the previous cluster by simply starting kubelet.

    TEST PLAN
    PASS: Optimized restore on AIO-SX
    * stx8
    * stx9
    PASS: Optimized restore after upgrade, stx9

    Closes-Bug: 2047845
    Change-Id: Ia0a0f83cf6111e854776cc8967e6cba99d186b66
    Signed-off-by: Joshua Kraitberg <email address hidden>
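
The commit message lists four artifacts that the new restore path relies on instead of "kubeadm init": the K8s certificates, the control-plane static pod manifests, the kubelet configuration, and an etcd snapshot. A pre-restore sanity check for those artifacts might look like the sketch below. It is not taken from the ansible-playbooks change; the relative paths are assumptions based on a typical kubeadm layout, and the etcd snapshot location in particular is a placeholder.

```python
from pathlib import Path

# Assumed locations relative to the extracted backup root (hypothetical).
REQUIRED_ARTIFACTS = {
    "certificates": "etc/kubernetes/pki",
    "static pod manifests": "etc/kubernetes/manifests",
    "kubelet configuration": "var/lib/kubelet/config.yaml",
    "etcd snapshot": "opt/backups/etcd-snapshot.db",  # placeholder path
}


def missing_artifacts(root, required=REQUIRED_ARTIFACTS):
    """Return the names of required artifacts not found under root."""
    root = Path(root)
    return [name for name, rel in required.items()
            if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_artifacts("/")
    if missing:
        print("cannot start kubelet yet, missing:", ", ".join(missing))
    else:
        print("all restore artifacts present")
```

Checking for the artifacts up front gives a clearer failure than letting kubelet start against an incomplete set of files.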

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.update