insufficient logs for kubeadm upgrade apply timeout

Bug #1998830 reported by Jim Gauld
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
For a Kubernetes upgrade, the step 'system kube-host-upgrade controller-0 control-plane' hits the 300 second default timeout when more than 250 subclouds are being upgraded in parallel via orchestration. Puppet has a deficiency in that it does not capture any exec command output and display it on timeout, so there is insufficient detail in our logs to identify the progression of a long operation. The command, 'kubeadm upgrade apply', can easily take 90 seconds and performs many API calls and upgrade steps. Most of the API calls have built-in 10s timeouts, while the control-plane upgrade step and the upgrades of kube-apiserver, kube-scheduler, and kube-controller-manager each have a built-in 5 minute timeout.

Recommendations: increase the timeout above 300s (value TBD), improve the method for capturing command output on timeout, and increase the verbosity of the progress output to identify what is getting stuck, or determine a more appropriate timeout value.
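
As an illustration of the progression problem (not part of the proposed fix): 'kubeadm upgrade apply' replaces the control-plane static pod manifests under /etc/kubernetes/manifests one component at a time, so a rough way to see which component a long-running apply is currently working on is to watch the image versions in those manifests from another shell, for example:

watch -n 10 "grep -H 'image:' /etc/kubernetes/manifests/kube-*.yaml"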

Severity
--------
Major: requires a repeat of the overall upgrade orchestration, after which the remaining nodes get upgraded.

Steps to Reproduce
------------------
Kubernetes version upgrade. Apply the upgrade strategy on a DC system with 250 subclouds upgrading in parallel.

Expected Behavior
------------------
The upgrade should run to completion without the repeat workaround.

Actual Behavior
----------------
A small subset of nodes hits the 300 second timeout for 'kubeadm upgrade apply', which upgrades the control plane, so the upgrade fails on those nodes.

Reproducibility
---------------
4/4 on a large system (250 subclouds in parallel); otherwise a rare intermittent issue.

System Configuration
--------------------
DC System (DC1000_2 1000 AWS subclouds)

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2022-06-07 20:11:37 -0400"

Last Pass
---------
This is a new issue, seen due to the size of the DC system under test.

Timestamp/Logs
--------------
sysinv.log shows failure traceback of puppet-manifest-apply.
/var/log/puppet/*/puppet.log shows kubeadm upgrade apply timeout and no supporting details.

Test Activity
-------------
DC System test.

Workaround
----------
Repeat the upgrade strategy; the remaining nodes that previously 'failed' due to the timeout will be upgraded.

Alternatively, manually capture output on a failed node by running the same command as the puppet exec, just to gather output. This will not reproduce the conditions of the original timeout, though.

kubeadm --kubeconfig=/etc/kubernetes/admin.conf upgrade apply v1.24.4 --allow-experimental-upgrades --allow-release-candidate-upgrades -y

To get more verbose output, also add the option "-v6".
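
For example, a full manual capture might look like the following (the output file path is illustrative only, not part of the proposed fix):

kubeadm -v6 --kubeconfig=/etc/kubernetes/admin.conf upgrade apply v1.24.4 --allow-experimental-upgrades --allow-release-candidate-upgrades -y 2>&1 | tee /tmp/kubeadm-upgrade-apply-manual.log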

The same puppet manifest can also be executed manually, like the following:
cat <<EOF >> /tmp/upgrade-first-control-plane.yaml
classes:
- platform::kubernetes::upgrade_first_control_plane
EOF

sudo /usr/local/bin/puppet-manifest-apply.sh \
 /opt/platform/puppet/22.12/hieradata 192.168.204.2 controller runtime \
 /tmp/upgrade-first-control-plane.yaml

Note that this script shows minimal output, including the options passed, in user.log.
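
To follow progress while the manual apply runs, the puppet log for that run can be tailed from another shell (this assumes the 'latest' directory under /var/log/puppet points at the current run, as the proposed fix below also assumes):

tail -f /var/log/puppet/latest/puppet.log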

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Revision history for this message
Jim Gauld (jgauld) wrote :

The following is the proposed fix, which captures the command output on success, failure, or timeout.

Note that the script /usr/local/bin/puppet-manifest-apply.sh is what actually calls the given class.

stx/stx-puppet$ git diff
diff --git a/puppet-manifests/src/modules/platform/manifests/kubernetes.pp b/puppet-manifests/src/modules/platform/manifests/kubernetes.pp
index 0c92fcf..c06eb65 100644
--- a/puppet-manifests/src/modules/platform/manifests/kubernetes.pp
+++ b/puppet-manifests/src/modules/platform/manifests/kubernetes.pp
@@ -724,9 +724,17 @@ class platform::kubernetes::upgrade_first_control_plane
   }

   # The --allow-*-upgrades options allow us to upgrade to any k8s release if necessary
+  # The -v6 gives verbose debug output includes health, GET response, delay.
+  # Puppet captures no command output on timeout. Workaround:
+  # - use 'stdbuf' to flush line buffer for stdout and stderr
+  # - redirect stderr to stdout
+  # - use 'tee' so we write output to both stdout and file
+  # Since we hit default 300 second timeout under load (i.e., upgrade 250 subclouds
+  # in parallel), specify larger timeout.
   exec { 'upgrade first control plane':
-    command   => "kubeadm --kubeconfig=/etc/kubernetes/admin.conf upgrade apply ${version} --allow-experimental-upgrades --allow-release-candidate-upgrades -y", # lint:ignore:140chars
+    command   => "stdbuf -oL -eL kubeadm -v6 --kubeconfig=/etc/kubernetes/admin.conf upgrade apply ${version} --allow-experimental-upgrades --allow-release-candidate-upgrades -y 2>&1 | tee /var/log/puppet/latest/kubeadm-upgrade-apply.log", # lint:ignore:140chars
     logoutput => true,
+    timeout   => 600,
     require   => Exec['update kubeadm-config']
   }

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/866619

Changed in starlingx:
status: New → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fixed By: https://review.opendev.org/c/starlingx/stx-puppet/+/866619
Unclear why the LP was not automatically updated to Fix Released.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.containers stx.update
Changed in starlingx:
status: In Progress → Fix Released