vim traps on null kube_upgrade_obj leading to swact loop

Bug #2043859 reported by John Kung
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Vanathi Selvaraju

Bug Description

Brief Description
-----------------
A k8s upgrade to v1.22.5 was performed successfully. However, the System Controller is now in a swact loop after a subsequent k8s upgrade to v1.23.1.

Severity
--------

Major: System/Feature is usable but degraded

Steps to Reproduce
------------------
- Install System Controller and use kubernetes_version: 1.21.8 in the localhost.yml

- Deploy multiple AIO-SX subclouds, also using v1.21.8

- Create a strategy to upgrade k8s on the System Controller to v1.22.5
$ fm alarm-list --mgmt_affecting
$ system health-query-kube-upgrade
$ sw-manager kube-upgrade-strategy create --to-version v1.22.5 --worker-apply-type parallel --max-parallel-worker-hosts 2 --alarm-restriction relaxed
$ sw-manager kube-upgrade-strategy apply

- Check that the k8s upgrade is complete, then create another strategy, now to v1.23.1
$ fm alarm-list --mgmt_affecting
$ system health-query-kube-upgrade
$ sw-manager kube-upgrade-strategy create --to-version v1.23.1 --worker-apply-type parallel --max-parallel-worker-hosts 2 --alarm-restriction relaxed
$ sw-manager kube-upgrade-strategy apply

Expected Behavior
------------------
DC System Controller operational and running v1.23.1

Actual Behavior
----------------
Nodes/control plane running v1.23.1; however, the controllers are in a swact loop.

Reproducibility
---------------
Seen once.

System Configuration
--------------------
Distributed Cloud

Branch/Pull Time/Commit
-----------------------
2023-11-14_22-20-52

Timestamp/Logs
--------------
// kube-upgrade has been stuck at 94% for hours. There appears to be no timeout being enforced; the 600s step timeout is not being respected.

$ date
Fri Nov 17 15:09:14 UTC 2023

$ sw-manager kube-upgrade-strategy show --active
Strategy Kubernetes Upgrade Strategy:
  strategy-uuid: 86695601-6ac2-4947-8018-f4f7e2b9771a
  controller-apply-type: serial
  storage-apply-type: serial
  worker-apply-type: parallel
  max-parallel-worker-hosts: 2
  default-instance-action: stop-start
  alarm-restrictions: relaxed
  current-phase: apply
  current-phase-completion: 94%
  state: applying
  apply-phase:
    total-stages: 12
    current-stage: 10
    stop-at-stage: 12
    timeout: 1203 seconds
    completion-percentage: 94%
    start-date-time: 2023-11-17 11:34:52
    inprogress: true
    stages:
        stage-id: 10
        stage-name: kube-upgrade-complete
        total-steps: 1
        current-step: 0
        timeout: 601 seconds
        start-date-time: 2023-11-17 12:44:50
        inprogress: true
        steps:
            step-id: 0
            step-name: kube-upgrade-complete
            timeout: 600 seconds
            start-date-time: 2023-11-17 12:44:50
            result: wait
            reason:
// Services disabled

$ sudo sm-dump
-Service_Groups------------------------------------------------------------------------
oam-services standby standby
controller-services standby standby
cloud-services standby standby
patching-services standby standby
directory-services active active
web-services active active
storage-services active active
storage-monitoring-services standby standby
vim-services disabled disabled failed
distributed-cloud-services standby standby
----------------------------------------------------------------------------------------
-Services------------------------------------------------------------------------------
oam-ip enabled-standby disabled
management-ip enabled-standby disabled
drbd-pg enabled-standby enabled-standby
drbd-rabbit enabled-standby enabled-standby
drbd-platform enabled-standby enabled-standby
pg-fs enabled-standby disabled
rabbit-fs enabled-standby disabled
nfs-mgmt enabled-standby disabled
platform-fs enabled-standby disabled
postgres enabled-standby disabled
rabbit enabled-standby disabled
platform-export-fs enabled-standby disabled
sysinv-inv enabled-standby disabled
sysinv-conductor enabled-standby disabled
mtc-agent enabled-standby disabled
hw-mon enabled-standby disabled
dnsmasq enabled-standby disabled
fm-mgr enabled-standby disabled
keystone enabled-standby disabled
open-ldap enabled-active enabled-active
lighttpd enabled-active enabled-active
horizon enabled-active enabled-active
patch-alarm-manager enabled-standby disabled
mgr-restful-plugin enabled-active enabled-active
ceph-manager enabled-standby disabled
vim disabled disabled
vim-api disabled disabled
vim-webserver disabled disabled
haproxy enabled-standby disabled
pxeboot-ip enabled-standby disabled
drbd-extension enabled-standby enabled-standby
extension-fs enabled-standby disabled
extension-export-fs enabled-standby disabled
dcorch-engine enabled-standby disabled
dcmanager-manager enabled-standby disabled
dcmanager-api enabled-standby disabled
dcmanager-audit enabled-standby disabled
dcorch-sysinv-api-proxy enabled-standby disabled
drbd-dc-vault enabled-standby enabled-standby
dc-vault-fs enabled-standby disabled
dcorch-patch-api-proxy enabled-standby disabled
dcorch-identity-api-proxy enabled-standby disabled
etcd enabled-standby disabled
drbd-etcd enabled-standby enabled-standby
etcd-fs enabled-standby disabled
barbican-api enabled-standby disabled
barbican-keystone-listener enabled-standby disabled
barbican-worker enabled-standby disabled
cluster-host-ip enabled-standby disabled
dcdbsync-api enabled-standby disabled
dcmanager-orchestrator enabled-standby disabled
dcmanager-audit-worker enabled-standby disabled
dcmanager-state enabled-standby disabled
docker-distribution enabled-standby disabled
dockerdistribution-fs enabled-standby disabled
drbd-dockerdistribution enabled-standby enabled-standby
helmrepository-fs enabled-standby disabled
registry-token-server enabled-standby disabled
dc-iso-fs enabled-standby disabled
cert-mon enabled-standby disabled
device-image-fs enabled-standby disabled
cert-alarm enabled-standby disabled

// On controller-0, the pods do not look healthy

[sysadmin@controller-0 ~(keystone_admin)]$ kubectl get pod --all-namespaces --field-selector=status.phase=Running -o=wide | grep --color=never -v -E '([0-9])+/\1'
NAMESPACE NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
armada armada-api-597fb9cbd5-szw29 1/2 CrashLoopBackOff 29 (8m1s ago) 3h7m dead:beef::a4ce:fec1:5423:e32a controller-1 <none> <none>
flux-helm helm-controller-596654dfb4-5kq7f 0/1 CrashLoopBackOff 22 (15m ago) 3h7m dead:beef::a4ce:fec1:5423:e337 controller-1 <none> <none>
flux-helm source-controller-64b74f7db6-fjplx 0/1 Running 25 (9m37s ago) 3h7m dead:beef::a4ce:fec1:5423:e32f controller-1 <none> <none>
kube-system kube-apiserver-controller-0 0/1 Running 26 (7m59s ago) 3h4m 2620:10a:a001:df1::3 controller-0 <none> <none>
platform-deployment-manager dm-monitor-77897b8867-wswpw 0/1 CrashLoopBackOff 30 (7m46s ago) 3h7m dead:beef::a4ce:fec1:5423:e307 controller-1 <none> <none>
platform-deployment-manager platform-deployment-manager-667c65599b-4w2wr 1/2 CrashLoopBackOff 25 (9m56s ago) 3h7m dead:beef::a4ce:fec1:5423:e32b controller-1 <none> <none>
Alarms
------

Test Activity
-------------
System Test

Workaround
----------

Code change to stabilize vim:
$ sudo diff /usr/lib/python3/dist-packages/nfv_vim/strategy/_strategy_steps.py.orig /usr/lib/python3/dist-packages/nfv_vim/strategy/_strategy_steps.py
4000c4000
<         if kube_upgrade_obj.state == self._success_state:
---
>         if kube_upgrade_obj and kube_upgrade_obj.state == self._success_state:
4016a4017,4020
>         if kube_upgrade_obj:
>             kube_upgrade_obj_state = kube_upgrade_obj.state
>         else:
>             kube_upgrade_obj_state = None
4019c4023
<             kube_upgrade_obj.state,
---
>             kube_upgrade_obj_state,
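
Below is a minimal, self-contained sketch of the null-guard pattern the diff applies. The class name, method, return values, and reason text are illustrative assumptions only, not the actual nfv_vim strategy-step code.

# Sketch only: guards a possibly-None kube_upgrade_obj the same way the
# workaround diff does; names here are hypothetical, not taken from nfv_vim.
class KubeUpgradeCompleteStep(object):
    """Illustrative stand-in for the kube-upgrade-complete strategy step."""

    def __init__(self, success_state='upgrade-complete'):
        self._success_state = success_state

    def evaluate(self, kube_upgrade_obj):
        # Before the fix, kube_upgrade_obj.state was dereferenced directly,
        # so a None object raised AttributeError, crashing (and restarting)
        # the VIM service and repeatedly triggering swacts.
        if kube_upgrade_obj and kube_upgrade_obj.state == self._success_state:
            return 'success', ''

        # Resolve the state defensively before using it in the failure reason.
        if kube_upgrade_obj:
            kube_upgrade_obj_state = kube_upgrade_obj.state
        else:
            kube_upgrade_obj_state = None

        reason = ("kube upgrade in unexpected state: %s"
                  % kube_upgrade_obj_state)
        return 'failed', reason

With the guard in place, evaluate(None) returns a failure result with a readable reason instead of raising AttributeError, which is what previously took the vim process down and drove the swact loop.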

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/901339
Committed: https://opendev.org/starlingx/nfv/commit/cdd758efb72d7a17d9f69e79475fdab0791fafc2
Submitter: "Zuul (22348)"
Branch: master

commit cdd758efb72d7a17d9f69e79475fdab0791fafc2
Author: Vanathi.Selvaraju <email address hidden>
Date: Fri Nov 17 17:59:49 2023 -0500

    Post K8s upgrade DC system controller swacts in a loop

    After a K8s upgrade from 1.22 to 1.23 the system
    swacts in a loop. This occurs because an attribute
    of the VIM object is not available, leading to
    multiple restarts of the VIM service and resulting
    in swacts.

    Test Plan:
    PASSED: Induce condition Kube upgrade object null state
    causing vim restarts, apply fix, system stabilizes.

    Closes-Bug: 2043859

    Change-Id: Iccb3106cade1308aa6e9232013366c2b9181557b
    Signed-off-by: Vanathi.Selvaraju <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.nfv
Changed in starlingx:
assignee: nobody → Vanathi Selvaraju (vselvara)