Add instrumentation for kube upgrade commands
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Low | Jim Gauld |
Bug Description
Brief Description
-----------------
Need more instrumentation to debug kubernetes upgrades at various steps.
Puppet has a deficiency: when a command hits the puppet-configured timeout, its output is lost, so we cannot tell how far it progressed and lack the information needed to debug the failure. The output is also not verbose enough to show how things are progressing.
This problem has been noticed primarily with "kubeadm upgrade node", "kubeadm upgrade apply", and "kubectl drain". Debugging has always meant reproducing the situation and manually running the same commands in that state with "-v6" verbosity to find the actual cause.
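As an illustration of the kind of instrumentation being asked for, here is a minimal sketch of a wrapper that keeps partial output when a time limit is hit. It is not the stx-puppet change itself. The function name run_logged, the log path, and the use of the coreutils "timeout" command (which exits with status 124 on timeout) are assumptions made for this example.

```shell
# Hypothetical helper: run a command under a time limit and append its
# stdout/stderr to a log file as it is produced, so that whatever was
# printed before a timeout is still available afterwards.
run_logged() {
    limit="$1"      # duration understood by coreutils timeout, e.g. 300 or 5m
    logfile="$2"    # log path chosen by the caller (hypothetical)
    shift 2
    printf '%s START: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$logfile"
    timeout "$limit" "$@" >> "$logfile" 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # coreutils timeout reports an expired time limit with status 124
        printf '%s TIMEOUT after %s: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$limit" "$*" >> "$logfile"
    else
        printf '%s EXIT %s: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc" "$*" >> "$logfile"
    fi
    return "$rc"
}

# Example invocation (log path is hypothetical):
#   run_logged 300 /tmp/kube-upgrade-debug.log kubeadm upgrade node -v6
```

Because output is appended to the file while the command runs, the log shows how far the command got even when the time limit kills it.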
There have been very specific reasons like:
- timeout failure due to hitting pod-disruption-
- timeout failure due to upgrade taking too long
- timeout failure due to networking problem
- TBD unknown reasons
Severity
--------
Major: unable to diagnose/debug kubernetes upgrades for failure conditions.
Steps to Reproduce
------------------
Perform orchestrated kubernetes upgrade in a big lab.
This may have to be done many times under specific conditions to reproduce.
e.g.,
sw-manager --os-interface internal --os-region-name RegionOne \
kube-upgrade-
--worker-apply-type parallel \
--max-parallel-
--alarm-
--instance-action migrate \
--to-version v1.23.1
sw-manager kube-upgrade-
<wait for failure of some kind>
Expected Behavior
------------------
Puppet logs and other critical kubernetes logs should show the progression of upgrade steps, especially when there is a failure, so that we can diagnose what is wrong and create a proper solution.
Actual Behavior
----------------
4/42 orchestrated upgrades will fail on AIO-DX at the upgrade_
Reproducibility
---------------
Intermittent: 4/42 on orchestrated upgrades
System Configuration
--------------------
Generic to all configs. Problems have been seen on each of AIO-SX, AIO-DX, and STANDARD, specifically for 'kubeadm upgrade node'.
Branch/Pull Time/Commit
-----------------------
Current.
Last Pass
---------
NA.
Timestamp/Logs
--------------
NA
Test Activity
-------------
Sanity. Developer Testing.
Workaround
----------
Manually reproduce the failure.
Manually issue the specific command that is failing with increased verbosity.
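The manual re-run can be sketched as a small helper. The function name rerun_verbose is hypothetical, and the sketch assumes the failing tool accepts a "-v<level>" verbosity flag, as kubeadm and kubectl do.

```shell
# Hypothetical helper: re-run a command with a verbosity flag appended,
# merging stderr into stdout so the full trace can be captured in one stream.
rerun_verbose() {
    level="$1"   # verbosity level, e.g. 6
    shift
    "$@" "-v${level}" 2>&1
}

# Example invocations (node name is a placeholder):
#   rerun_verbose 6 kubeadm upgrade node
#   rerun_verbose 6 kubectl drain <node-name> --ignore-daemonsets
```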
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.containers stx.update
Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/901776