Add instrumentation for kube upgrade commands

Bug #2044413 reported by Jim Gauld
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
More instrumentation is needed to debug Kubernetes upgrades at their various steps.
Puppet has a deficiency: when a command hits the Puppet-configured timeout, its output is lost, so we cannot tell how far it progressed and we lack the information to debug what went wrong. The output is also not verbose enough to show how things are progressing.

This problem has been noticed primarily with "kubeadm upgrade node", "kubeadm upgrade apply", and "kubectl drain". Debugging has always meant reproducing the situation and manually rerunning the same commands in that state with "-v6" verbosity to find the actual cause.

Specific causes found so far include:
- timeout failure due to hitting a pod-disruption-budget
- timeout failure due to the upgrade taking too long
- timeout failure due to a networking problem
- other reasons not yet identified

Severity
--------
Major: unable to diagnose/debug kubernetes upgrades for failure conditions.

Steps to Reproduce
------------------
Perform an orchestrated Kubernetes upgrade in a large lab.
This may have to be repeated many times under specific conditions to reproduce the failure.

e.g.,
sw-manager --os-interface internal --os-region-name RegionOne \
kube-upgrade-strategy create \
--worker-apply-type parallel \
--max-parallel-worker-hosts 10 \
--alarm-restrictions relaxed \
--instance-action migrate \
--to-version v1.23.1

sw-manager kube-upgrade-strategy apply
<wait for failure of some kind>

Expected Behavior
-----------------
Puppet logs and other critical Kubernetes logs should show the progression of the upgrade steps, especially when there is a failure, so that we can diagnose what is wrong and create a proper solution.

Actual Behavior
---------------
4 of 42 orchestrated upgrades fail on AIO-DX at the upgrade_control_plane stage, which runs "kubeadm upgrade node". The command reaches the default Puppet timeout of 300 seconds, and we have no conclusive information about what is going on.

Reproducibility
---------------
Intermittent: 4/42 on orchestrated upgrades

System Configuration
--------------------
Generic to all configs. Problems have been seen on each of AIO-SX, AIO-DX, and STANDARD,
specifically with 'kubeadm upgrade node'.

Branch/Pull Time/Commit
-----------------------
Current.

Last Pass
---------
NA.

Timestamp/Logs
--------------
NA

Test Activity
-------------
Sanity. Developer Testing.

Workaround
----------
Manually reproduce the failure.
Manually issue the specific command that is failing with increased verbosity.

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/901776

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/config/+/901808

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/901776
Committed: https://opendev.org/starlingx/stx-puppet/commit/578b64b6ee77d7769af186a33dfba0f0c6184401
Submitter: "Zuul (22348)"
Branch: master

commit 578b64b6ee77d7769af186a33dfba0f0c6184401
Author: Jim Gauld <email address hidden>
Date: Thu Nov 23 15:12:22 2023 -0500

    Add kube_command helper for logging instrumentation

    This adds define platform::kubernetes::kube_command so that we can
    reuse common mechanism to log puppet exec output even in cases when
    puppet exec hits timeout.

    This is now being called in multiple places specifically for
    kubernetes upgrade commands that run generally long and are
    difficult to debug.

    This identical mechanism was used previously for:
    - 'kubeadm upgrade apply'
    - 'kubectl drain'

    This will add instrumentation for:
    - 'kubeadm upgrade node'

    TEST CASES:
    PASS: Run orchestrated kubernetes upgrade: AIO-SX, AIO-DX, STANDARD.
          Verify we get file output logs in /var/log/puppet/<dir>/
          for kube-upgrade-apply.log and kube-upgrade-node.log with
          verbose output.
    PASS: Issue 'system kube-config-kubelet' and verify we get output
          Verify we get file output logs in /var/log/puppet/<dir>/
          for: kubeadm-upgrade-node-phase-kubelet-config.log .
    PASS: Manually modify code to reduce timeout to 1 second,
          demonstrate that doing 'kubeadm upgrade node' will timeout
          and provide log output.

    Partial-Bug: #2044413

    Change-Id: Id898b4bd7e9ee3a1d833439ee71b9355edd7d865
    Signed-off-by: Jim Gauld <email address hidden>
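
The idea in the commit message above, keeping command output even when the run is cut off by a timeout, can be illustrated outside Puppet. The following is a minimal Python sketch, not the kube_command define itself: the function name run_logged and its parameters are hypothetical. It assumes the key point is that the child process writes directly to the log file, so anything it printed before the timeout is already on disk when the process is killed.

```python
import subprocess


def run_logged(cmd, log_path, timeout_s):
    """Run cmd with stdout and stderr sent straight to log_path.

    Returns the command's exit code, or None if it hit the timeout.
    Hypothetical helper for illustration; not part of stx-puppet.
    """
    with open(log_path, "ab") as log:
        # The child inherits the log file descriptor, so its output does not
        # pass through this process and is not lost if we stop waiting.
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        try:
            return proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # Stop the command; whatever it already wrote stays in the log.
            proc.kill()
            proc.wait()
            return None
```

A caller would pass the command as a list, for example the "kubeadm upgrade node" invocation with "-v6" verbosity, together with a log file path and the timeout in seconds, and then inspect the log file whenever the return value is None.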

OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/c/starlingx/config/+/901808
Committed: https://opendev.org/starlingx/config/commit/80cd44c22ac158fe106daf8461687bbfa41f28f6
Submitter: "Zuul (22348)"
Branch: master

commit 80cd44c22ac158fe106daf8461687bbfa41f28f6
Author: Jim Gauld <email address hidden>
Date: Thu Nov 23 21:03:47 2023 -0500

    Add instrumentation for kube_upgrade_control_plane

    This adds instrumentation for sysinv kube_upgrade_control_plane
    so that we see progress and more error reasons when retrieving
    kubernetes control-plane versions. This adds a few more places
    to generate exceptions so that a retry is performed.

    This enforces we must be able to get the versions from each
    control-plane component (kube-apiserver, kube-controller-manager,
    kube-scheduler) by querying pods that match expected pod name and
    container image.

    TEST CASES:
    PASS: Run orchestrated kubernetes upgrade: AIO-SX, AIO-DX, STANDARD.
          Verify we see new logs during upgrade control plane.
    PASS: Manually modify code to test likely exception paths causes retry.

    Closes-bug: #2044413

    Change-Id: Ic33cdbdf390804c7a0791609a350dd1df6e697e4
    Signed-off-by: Jim Gauld <email address hidden>
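
The behavior described in the commit message above (require a version from every control-plane component, and raise an exception so that a retry happens when one is missing) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the sysinv code: the names VersionLookupError, control_plane_versions, and with_retries are hypothetical, pods are modeled as plain (pod_name, image) tuples rather than Kubernetes API objects, and the pod name is assumed to be the component name followed by the host name.

```python
import time

COMPONENTS = ("kube-apiserver", "kube-controller-manager", "kube-scheduler")


class VersionLookupError(Exception):
    """Raised when a control-plane component version cannot be determined."""


def control_plane_versions(pods, hostname):
    """Return {component: version} from an iterable of (pod_name, image).

    Assumes pods are named '<component>-<hostname>' and that the image
    reference ends in '/<component>:<version>'.
    """
    versions = {}
    for component in COMPONENTS:
        expected_pod = f"{component}-{hostname}"
        for pod_name, image in pods:
            repo, sep, tag = image.rpartition(":")
            if pod_name == expected_pod and sep and repo.endswith("/" + component):
                versions[component] = tag
                break
        else:
            # No pod matched both the expected name and the expected image.
            raise VersionLookupError(f"no version found for {expected_pod}")
    return versions


def with_retries(fetch_pods, hostname, attempts=3, delay_s=0.0):
    """Call fetch_pods() and parse versions, retrying on lookup failures."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return control_plane_versions(fetch_pods(), hostname)
        except VersionLookupError as err:
            last_error = err
            # Log each failed attempt so progress is visible afterwards.
            print(f"attempt {attempt}/{attempts} failed: {err}")
            time.sleep(delay_s)
    raise last_error
```

Raising on a partial result, rather than returning whatever was found, is what lets the retry loop treat "pod not yet restarted" the same way as any other transient failure.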

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.containers stx.update