vim traps on null kube_upgrade_obj leading to swact loop
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
StarlingX | Fix Released | Medium | Vanathi Selvaraju | |
Bug Description
Brief Description
-----------------
A k8s upgrade to v1.22.5 was performed successfully. However, after a subsequent k8s upgrade to v1.23.1, the System Controller is in a swact loop.
Severity
--------
Major: System/Feature is usable but degraded
Steps to Reproduce
------------------
- Install System Controller and use kubernetes_version: 1.21.8 in the localhost.yml
- Deploy multiple AIO-SX subclouds, also using v1.21.8
- Create a strategy to upgrade k8s on the System Controller to v1.22.5
$ fm alarm-list --mgmt_affecting
$ system health-
$ sw-manager kube-upgrade-
$ sw-manager kube-upgrade-
- Check that the k8s upgrade is complete, then create another strategy to upgrade to v1.23.1
$ fm alarm-list --mgmt_affecting
$ system health-
$ sw-manager kube-upgrade-
$ sw-manager kube-upgrade-
Expected Behavior
------------------
DC System Controller operational and running v1.23.1
Actual Behavior
----------------
Nodes/Control plane running v1.23.1; however, the controllers are in a swact loop.
Reproducibility
---------------
Seen once.
System Configuration
-------
Distributed Cloud
Branch/Pull Time/Commit
-------
2023-11-14_22-20-52
Timestamp/Logs
--------------
// kube-upgrade has been stuck at 94% for hours. It seems there is no TIMEOUT? The 600s timeout may not be respected.
$ date
Fri Nov 17 15:09:14 UTC 2023
$ sw-manager kube-upgrade-
Strategy Kubernetes Upgrade Strategy:
strategy-uuid: 86695601-
controller-
storage-
worker-
max-parallel-
default-
alarm-
current-phase: apply
current-
state: applying
apply-phase:
total-stages: 12
current-stage: 10
stop-at-stage: 12
timeout: 1203 seconds
completion-
start-
inprogress: true
stages:
stage-id: 10
stage-name: kube-upgrade-
timeout: 601 seconds
inprogress: true
steps:
result: wait
reason:
// Services disabled
$ sudo sm-dump-
oam-services standby standby
controller-services standby standby
cloud-services standby standby
patching-services standby standby
directory-services active active
web-services active active
storage-services active active
storage-
vim-services disabled disabled failed
distributed-
-------
oam-ip enabled-standby disabled
management-ip enabled-standby disabled
drbd-pg enabled-standby enabled-standby
drbd-rabbit enabled-standby enabled-standby
drbd-platform enabled-standby enabled-standby
pg-fs enabled-standby disabled
rabbit-fs enabled-standby disabled
nfs-mgmt enabled-standby disabled
platform-fs enabled-standby disabled
postgres enabled-standby disabled
rabbit enabled-standby disabled
platform-export-fs enabled-standby disabled
sysinv-inv enabled-standby disabled
sysinv-conductor enabled-standby disabled
mtc-agent enabled-standby disabled
hw-mon enabled-standby disabled
dnsmasq enabled-standby disabled
fm-mgr enabled-standby disabled
keystone enabled-standby disabled
open-ldap enabled-active enabled-active
lighttpd enabled-active enabled-active
horizon enabled-active enabled-active
patch-alarm-manager enabled-standby disabled
mgr-restful-plugin enabled-active enabled-active
ceph-manager enabled-standby disabled
vim disabled disabled
vim-api disabled disabled
vim-webserver disabled disabled
haproxy enabled-standby disabled
pxeboot-ip enabled-standby disabled
drbd-extension enabled-standby enabled-standby
extension-fs enabled-standby disabled
extension-export-fs enabled-standby disabled
dcorch-engine enabled-standby disabled
dcmanager-manager enabled-standby disabled
dcmanager-api enabled-standby disabled
dcmanager-audit enabled-standby disabled
dcorch-
drbd-dc-vault enabled-standby enabled-standby
dc-vault-fs enabled-standby disabled
dcorch-
dcorch-
etcd enabled-standby disabled
drbd-etcd enabled-standby enabled-standby
etcd-fs enabled-standby disabled
barbican-api enabled-standby disabled
barbican-
barbican-worker enabled-standby disabled
cluster-host-ip enabled-standby disabled
dcdbsync-api enabled-standby disabled
dcmanager-
dcmanager-
dcmanager-state enabled-standby disabled
docker-distribution enabled-standby disabled
dockerdistribut
drbd-dockerdist
helmrepository-fs enabled-standby disabled
registry-
dc-iso-fs enabled-standby disabled
cert-mon enabled-standby disabled
device-image-fs enabled-standby disabled
cert-alarm enabled-standby disabled
// On controller-0 pods do not seem happy
[sysadmin@
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
armada armada-
flux-helm helm-controller
flux-helm source-
kube-system kube-apiserver-
platform-
platform-
Alarms
Test Activity
-------------
System Test
Workaround
----------
Code change to stabilize vim:
sudo diff /usr/lib/
4000c4000
< if kube_upgrade_
---
> if kube_upgrade_obj and kube_upgrade_
4016a4017,4020
> if kube_upgrade_obj:
> kube_upgrade_
> else:
> kube_upgrade_
4019c4023
< kube_upgrade_
---
> kube_upgrade_
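The guard pattern used in the workaround diff above can be sketched in isolation as follows. This is a minimal illustration only, not the actual VIM code: the diff lines are truncated, so the `KubeUpgrade` class, its `state` and `to_version` attributes, the helper names, and the state strings below are all hypothetical stand-ins. The point is that the attribute is read only when `kube_upgrade_obj` is not None, so a missing upgrade object no longer raises an exception.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KubeUpgrade:
    """Hypothetical stand-in for the kube upgrade object held by VIM."""
    state: str
    to_version: str


def upgrade_state_or_default(kube_upgrade_obj: Optional[KubeUpgrade],
                             default: str = "unknown") -> str:
    """Return the upgrade state, tolerating a missing (None) upgrade object."""
    if kube_upgrade_obj:
        return kube_upgrade_obj.state
    # No upgrade object is available: fall back instead of raising
    # AttributeError on None.
    return default


def is_upgrade_in_state(kube_upgrade_obj: Optional[KubeUpgrade],
                        expected_state: str) -> bool:
    """Check the state; `and` short-circuits so None is never dereferenced."""
    return bool(kube_upgrade_obj and kube_upgrade_obj.state == expected_state)


if __name__ == "__main__":
    # "upgrade-complete" is an illustrative state string, not taken from VIM.
    present = KubeUpgrade(state="upgrade-complete", to_version="v1.23.1")
    print(upgrade_state_or_default(present))
    print(upgrade_state_or_default(None))
    print(is_upgrade_in_state(None, "upgrade-complete"))
```

Without a guard of this kind, accessing an attribute on a None object raises `AttributeError`, which is consistent with the repeated vim restarts and the resulting swacts described in this report.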
Changed in starlingx: | |
status: | New → In Progress |
Changed in starlingx: | |
importance: | Undecided → Medium |
tags: | added: stx.9.0 stx.nfv |
Changed in starlingx: | |
assignee: | nobody → Vanathi Selvaraju (vselvara) |
Reviewed: https://review.opendev.org/c/starlingx/nfv/+/901339
Committed: https://opendev.org/starlingx/nfv/commit/cdd758efb72d7a17d9f69e79475fdab0791fafc2
Submitter: "Zuul (22348)"
Branch: master
commit cdd758efb72d7a17d9f69e79475fdab0791fafc2
Author: Vanathi.Selvaraju <email address hidden>
Date: Fri Nov 17 17:59:49 2023 -0500
Post K8s upgrade DC system controller swacts in a loop
After K8s upgrade from 1.22 to 1.23 the system
swacts in a loop, this occurs as an attribute
of the VIM object is not available leading to
multiple restarts of VIM service resulting
in swact.
Test Plan:
PASSED: Induce condition Kube upgrade object null state
causing vim restarts, apply fix, system stabilizes.
Closes-Bug: 2043859
Change-Id: Iccb3106cade1308aa6e9232013366c2b9181557b
Signed-off-by: Vanathi.Selvaraju <email address hidden>