VIM k8s upgrade strategy apply failed with timeout. Unexpected State: aborted

Bug #1973781 reported by Heitor Matsui
This bug affects 1 person
Affects    Status        Importance  Assigned to    Milestone
StarlingX  Fix Released  Medium      Heitor Matsui

Bug Description

Brief Description
-----------------
The Kubernetes version upgrade strategy failed to apply on some subclouds because the kube-upgrade-networking step timed out.

Severity
--------
Major: failed to upgrade kube version in some subclouds

Steps to Reproduce
------------------
1. Install a DC (Distributed Cloud) system with 1000 subclouds
2. Upgrade the Kubernetes version on 250 subclouds in parallel

Expected Behavior
------------------
The Kubernetes version upgrade succeeds on all subclouds

Actual Behavior
----------------
The Kubernetes version upgrade fails on some subclouds with a timeout

Reproducibility
---------------
Reproducible

System Configuration
--------------------
DC with 1000 subclouds

Branch/Pull Time/Commit
-----------------------
2022-01-10

Last Pass
---------
N/A

Timestamp/Logs
--------------
/var/log/dcmanager/orchestrator.log
subcloud667
/var/log/nfv-vim-events.log
====================================================================================
log-id = 8
event-id = kube-upgrade-auto-apply-failed
event-type = action-event
event-context = admin
importance = high
entity = orchestration=kube-upgrade
reason_text = Kubernetes upgrade auto-apply failed
additional_text =
timestamp = 2022-05-01 07:13:01.519352
====================================================================================

/var/log/nfv-vim.log

2022-05-01T07:03:00.229 controller-0 VIM_Thread[101353] INFO _strategy_phase.py.344 Phase apply running kube-upgrade-networking stage.
2022-05-01T07:03:00.229 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.235 Stage (kube-upgrade-networking) cleanup called
2022-05-01T07:13:01.508 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.429 Stage (kube-upgrade-networking) step (kube-upgrade-networking) timed out, timeout_in_secs=600.
2022-05-01T07:13:01.509 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.235 Stage (kube-upgrade-networking) cleanup called
2022-05-01T07:13:01.513 controller-0 VIM_Thread[101353] INFO _strategy_phase.py.244 Phase (apply) cleanup called
2022-05-01T07:13:01.518 controller-0 VIM_Thread[101353] INFO _strategy.py.404 Apply Complete Callback, result=timed-out, reason=.
2022-05-01T07:13:01.520 controller-0 VIM_Thread[101353] INFO _strategy_step.py.177 Default strategy step abort for kube-upgrade-networking.
2022-05-01T07:13:01.520 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.268 Stage (kube-upgrade-networking) abort step (kube-upgrade-networking).
2022-05-01T07:13:01.520 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.270 Stage (kube-upgrade-networking) abort.
2022-05-01T07:13:01.520 controller-0 VIM_Thread[101353] INFO _strategy_phase.py.277 Phase (apply) abort stage (kube-upgrade-networking).
2022-05-01T07:13:01.520 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.270 Stage (kube-upgrade-download-images) abort.
2022-05-01T07:13:01.521 controller-0 VIM_Thread[101353] INFO _strategy_phase.py.277 Phase (apply) abort stage (kube-upgrade-download-images).
2022-05-01T07:13:01.522 controller-0 VIM_Thread[101353] INFO _strategy_stage.py.270 Stage (kube-upgrade-start) abort.
2022-05-01T07:13:01.522 controller-0 VIM_Thread[101353] INFO _strategy_phase.py.277 Phase (apply) abort stage (kube-upgrade-start).
2022-05-01T07:13:01.522 controller-0 VIM_Thread[101353] INFO _strategy_phase.py.279 Phase (apply) abort.
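
To spot the same failure on other subclouds, the "timed out" line in the excerpt above can be matched with a small script. This is only a sketch: the regular expression is modeled on the single log line shown here, and the helper name find_step_timeouts is made up for illustration, so other nfv-vim.log line formats may need adjustments.

```python
import re

# Pattern modeled on the "timed out" line from the nfv-vim.log excerpt.
# Assumption: other timeout lines follow the same layout.
TIMEOUT_RE = re.compile(
    r"^(?P<ts>\S+) (?P<host>\S+) .*?"
    r"Stage \((?P<stage>[^)]+)\) step \((?P<step>[^)]+)\) timed out, "
    r"timeout_in_secs=(?P<timeout>\d+)\."
)


def find_step_timeouts(lines):
    """Return one dict per strategy step timeout found in the given log lines."""
    results = []
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            results.append(
                {
                    "timestamp": match.group("ts"),
                    "host": match.group("host"),
                    "stage": match.group("stage"),
                    "step": match.group("step"),
                    "timeout_in_secs": int(match.group("timeout")),
                }
            )
    return results


# Two lines copied from the excerpt above: one normal, one timeout.
SAMPLE_LINES = [
    "2022-05-01T07:03:00.229 controller-0 VIM_Thread[101353] INFO "
    "_strategy_phase.py.344 Phase apply running kube-upgrade-networking stage.",
    "2022-05-01T07:13:01.508 controller-0 VIM_Thread[101353] INFO "
    "_strategy_stage.py.429 Stage (kube-upgrade-networking) step "
    "(kube-upgrade-networking) timed out, timeout_in_secs=600.",
]

timeouts = find_step_timeouts(SAMPLE_LINES)
for entry in timeouts:
    print(entry["timestamp"], entry["step"], entry["timeout_in_secs"])
```

To scan a real log, pass an open file object instead of SAMPLE_LINES, for example find_step_timeouts(open("/var/log/nfv-vim.log")).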

Test Activity
-------------
Regression Testing

Workaround
----------
Re-apply strategy

Changed in starlingx:
assignee: nobody → Heitor Matsui (heitormatsui)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.nfv
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/841469
Committed: https://opendev.org/starlingx/nfv/commit/a55b65b234329d6a88f06d26697afca8ba1fddf1
Submitter: "Zuul (22348)"
Branch: master

commit a55b65b234329d6a88f06d26697afca8ba1fddf1
Author: Heitor Matsui <email address hidden>
Date: Wed May 11 17:38:11 2022 -0300

    Increase timeout for networking step on k8s upgrade

    Kubernetes upgrade might fail during the Upgrade Networking
    Step with timeout message when upgrading subclouds. The default
    timeout of 600s from the parent class does not seem to be enough
    for some subclouds to download the networking images.

    This commit increases the timeout of the Upgrade Networking Step.

    Test Plan:
    PASS: upgrade k8s version with 90 subclouds in parallel

    Closes-bug: 1973781
    Change-Id: I686b9582daf14f0520fc0c2cb7100816c372107e
    Signed-off-by: Heitor Matsui <email address hidden>
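
The pattern described in the commit message (a step that inherits a 600s default from its parent class and overrides it) can be illustrated roughly as follows. This is a simplified sketch, not the actual nfv code: the class names, the has_timed_out helper, and the 900s value are all assumptions made for illustration, since the new timeout value is not stated in this report.

```python
class StrategyStep:
    """Hypothetical parent step class (not the real nfv class)."""

    # 600 matches the timeout_in_secs value reported in nfv-vim.log.
    DEFAULT_TIMEOUT_IN_SECS = 600

    def __init__(self, name, timeout_in_secs=DEFAULT_TIMEOUT_IN_SECS):
        self.name = name
        self.timeout_in_secs = timeout_in_secs

    def has_timed_out(self, elapsed_secs):
        """Return True once the step has run for its full timeout."""
        return elapsed_secs >= self.timeout_in_secs


class KubeUpgradeNetworkingStep(StrategyStep):
    """Hypothetical networking step that overrides the inherited default."""

    # Assumed placeholder: the value chosen by the fix is not given here.
    NETWORKING_TIMEOUT_IN_SECS = 900

    def __init__(self, timeout_in_secs=NETWORKING_TIMEOUT_IN_SECS):
        super().__init__("kube-upgrade-networking", timeout_in_secs=timeout_in_secs)


# The failing step in the log ran from 07:03:00.229 to 07:13:01.508.
elapsed = 601.279

default_step = StrategyStep("kube-upgrade-networking")
longer_step = KubeUpgradeNetworkingStep()

print(default_step.has_timed_out(elapsed))  # True with the 600s default
print(longer_step.has_timed_out(elapsed))   # False with the assumed 900s
```

Passing the timeout to the parent constructor keeps the change local to the one step, so other steps keep the 600s default.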

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0