insufficient logs for kubeadm upgrade apply timeout

Bug #1998830 reported by Jim Gauld
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Jim Gauld

Bug Description

Brief Description
-----------------
For a Kubernetes upgrade, the step 'system kube-host-upgrade controller-0 control-plane' hits the 300 second default timeout when more than 250 subclouds are being upgraded in parallel via orchestration. Puppet has a deficiency in that it does not capture any exec command output and display it on timeout, so there is insufficient detail in our logs to identify the progression of a long operation. The command, 'kubeadm upgrade apply', can easily take 90 seconds and performs many API calls and upgrade steps. Most of the API calls have built-in 10s timeouts, while the control-plane upgrade step and the upgrades of kube-apiserver, kube-scheduler, and kube-controller-manager each have a built-in 5 minute timeout.

Recommendations: increase the timeout above 300s (value TBD), improve the method for capturing command output on timeout, and increase the verbosity of the progress output to identify what is getting stuck, or determine a more appropriate timeout value.
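
As an illustration of the progression problem (not part of the proposed fix): 'kubeadm upgrade apply' replaces the control-plane static pod manifests under /etc/kubernetes/manifests one component at a time, so a rough way to see which component a long-running apply is currently working on is to watch the image versions in those manifests from another shell, for example:

watch -n 10 "grep -H 'image:' /etc/kubernetes/manifests/kube-*.yaml"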

Severity
--------
Major: requires a repeat of the overall upgrade orchestration, after which the remaining nodes get upgraded.

Steps to Reproduce
------------------
Kubernetes version upgrade. Apply the upgrade strategy on a DC system with 250 subclouds upgrading in parallel.

Expected Behavior
------------------
The upgrade should run to completion without the repeat workaround.

Actual Behavior
----------------
A small subset of nodes hits the 300 second timeout for 'kubeadm upgrade apply', which upgrades the control plane, so the upgrade fails on those nodes.

Reproducibility
---------------
4/4 on a large system (250 subclouds in parallel); otherwise a rare intermittent issue.

System Configuration
--------------------
DC System (DC1000_2 1000 AWS subclouds)

Branch/Pull Time/Commit
-----------------------
BUILD_DATE="2022-06-07 20:11:37 -0400"

Last Pass
---------
This is a new issue, seen due to the size of the DC system under test.

Timestamp/Logs
--------------
sysinv.log shows failure traceback of puppet-manifest-apply.
/var/log/puppet/*/puppet.log shows kubeadm upgrade apply timeout and no supporting details.

Test Activity
-------------
DC System test.

Workaround
----------
Repeat the upgrade strategy; the remaining nodes that previously 'failed' due to the timeout will be upgraded.

Alternatively, manually capture output on a failed node by running the same command as the puppet exec, just to gather output. This will not reproduce the conditions of the original timeout, though.

kubeadm --kubeconfig=/etc/kubernetes/admin.conf upgrade apply v1.24.4 --allow-experimental-upgrades --allow-release-candidate-upgrades -y

To get more verbose output, also add the option "-v6".
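
For example, a full manual capture might look like the following (the output file path is illustrative only, not part of the proposed fix):

kubeadm -v6 --kubeconfig=/etc/kubernetes/admin.conf upgrade apply v1.24.4 --allow-experimental-upgrades --allow-release-candidate-upgrades -y 2>&1 | tee /tmp/kubeadm-upgrade-apply-manual.log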

The same puppet manifest can also be executed manually, like the following:
cat <<EOF >> /tmp/upgrade-first-control-plane.yaml
classes:
- platform::kubernetes::upgrade_first_control_plane
EOF

sudo /usr/local/bin/puppet-manifest-apply.sh \
 /opt/platform/puppet/22.12/hieradata 192.168.204.2 controller runtime \
 /tmp/upgrade-first-control-plane.yaml

Note that this script shows minimal output, including the options passed, in user.log.
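
To follow progress while the manual apply runs, the puppet log for that run can be tailed from another shell (this assumes the 'latest' directory under /var/log/puppet points at the current run, as the proposed fix below also assumes):

tail -f /var/log/puppet/latest/puppet.log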

Jim Gauld (jgauld)
Changed in starlingx:
assignee: nobody → Jim Gauld (jgauld)
Revision history for this message
Jim Gauld (jgauld) wrote :

The following is the proposed fix, which captures the command output on success, failure, or timeout.

Note that the script /usr/local/bin/puppet-manifest-apply.sh is what actually calls the given class.

stx/stx-puppet$ git diff
diff --git a/puppet-manifests/src/modules/platform/manifests/kubernetes.pp b/puppet-manifests/src/modules/platform/manifests/kubernetes.pp
index 0c92fcf..c06eb65 100644
--- a/puppet-manifests/src/modules/platform/manifests/kubernetes.pp
+++ b/puppet-manifests/src/modules/platform/manifests/kubernetes.pp
@@ -724,9 +724,17 @@ class platform::kubernetes::upgrade_first_control_plane
   }

   # The --allow-*-upgrades options allow us to upgrade to any k8s release if necessary
+  # The -v6 gives verbose debug output includes health, GET response, delay.
+  # Puppet captures no command output on timeout. Workaround:
+  # - use 'stdbuf' to flush line buffer for stdout and stderr
+  # - redirect stderr to stdout
+  # - use 'tee' so we write output to both stdout and file
+  # Since we hit default 300 second timeout under load (i.e., upgrade 250 subclouds
+  # in parallel), specify larger timeout.
   exec { 'upgrade first control plane':
-    command   => "kubeadm --kubeconfig=/etc/kubernetes/admin.conf upgrade apply ${version} --allow-experimental-upgrades --allow-release-candidate-upgrades -y", # lint:ignore:140chars
+    command   => "stdbuf -oL -eL kubeadm -v6 --kubeconfig=/etc/kubernetes/admin.conf upgrade apply ${version} --allow-experimental-upgrades --allow-release-candidate-upgrades -y 2>&1 | tee /var/log/puppet/latest/kubeadm-upgrade-apply.log", # lint:ignore:140chars
     logoutput => true,
+    timeout   => 600,
     require   => Exec['update kubeadm-config']
   }

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/866619

Changed in starlingx:
status: New → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

Fixed By: https://review.opendev.org/c/starlingx/stx-puppet/+/866619
Unclear why the LP was not automatically updated to Fix Released.

Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.8.0 stx.containers stx.update
Changed in starlingx:
status: In Progress → Fix Released