Rollout of statefulsets fails during Kubernetes Root CA Certificate Update
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | Medium | João Victor Portal |
Bug Description
Brief Description
-----------------
During the Kubernetes Root CA Certificate Update procedure, it was observed that the pods restart script ("kube-rootca-update-pods.erb") does not roll out statefulsets.
Severity
--------
Critical
Steps to Reproduce
------------------
Execute the Kubernetes Root CA Certificate Update procedure in any system that has at least one statefulset.
Expected Behavior
------------------
The rollout of statefulsets happens in the "trustbothcas" and "trustnewca" states.
Actual Behavior
----------------
The rollout of statefulsets does not happen in the "trustbothcas" and "trustnewca" states.
Reproducibility
---------------
100% reproducible.
System Configuration
--------------------
Any.
Branch/Pull Time/Commit
-----------------------
N/A.
Last Pass
---------
N/A.
Timestamp/Logs
--------------
N/A.
Test Activity
-------------
Developer Testing
Workaround
----------
N/A.
Changed in starlingx:
assignee: nobody → João Victor Portal (jvictorp)
Changed in starlingx:
status: New → In Progress
Changed in starlingx:
status: Fix Released → In Progress
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.security
tags: added: stx.config
Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/872563
Committed: https://opendev.org/starlingx/stx-puppet/commit/39cc00aab8a6632043ee9d58202d2e90db0db73a
Submitter: "Zuul (22348)"
Branch: master
commit 39cc00aab8a6632043ee9d58202d2e90db0db73a
Author: Joao Victor Portal <email address hidden>
Date: Thu Feb 2 12:58:51 2023 -0300
Fix pod rollout script of K8s root CA change
The fixes and improvements of the pod rollout script used in the Kubernetes
Root CA update procedure are listed below. This script is used in the
"trustbothcas" and "trustnewca" stages of the procedure. It is written
in the file "kube-rootca-update-pods.erb".
In file "kube-rootca-update-pods.erb":
Fix: the variable "statefullsets" had a typo and was renamed to
"statefulsets". Because of this typo, the variable was used on line 29
without being initialized. When the system had at least one
statefulset, the script always got stuck trying to restart a
statefulset inside the non-existent namespace "0", the internal script
timeout was always reached, and no existing statefulset was restarted.
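The effect of such a typo can be illustrated with a minimal sketch (variable and value names here are illustrative, not the script's actual code): expanding the misspelled, never-assigned name yields nothing, so the loop body never runs, while arithmetic expansion of the empty variable silently produces "0".

```shell
# Illustrative sketch only: shows how a misspelled variable name leaves
# the loop with nothing to iterate over. Names are hypothetical.

statefullsets="default/web monitoring/prom"  # typo: values stored here
statefulsets=""                              # name actually read below

restarted=0
for item in $statefulsets; do        # expands to nothing
  restarted=$((restarted + 1))
done

# An empty or unset variable evaluates to 0 in arithmetic context,
# which is one way a bogus value like "0" can leak into later commands
# (e.g. being treated as a namespace name).
echo "restarted=$restarted"
echo "as-number=$((statefulsets))"
```

The fix in the actual script was simply to use one consistent spelling so the loop iterates over the populated list.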
Improvement: the internal script timeout was increased from 8 minutes
(32 retries every 15s) to 30 minutes (40 retries every 45s).
Fix: it was noticed that the output of "kubectl rollout status
--watch=false" does not always contain "successfully" when the rollout
is complete, as in the example below:
"partitioned roll out complete: 2 new pods have been updated...".
The command also always exits with status "0". Because of this,
"--timeout=100ms" is now used instead of "--watch=false", so that the
command exits with "1" when the rollout is not complete and "0" when it is.
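A minimal sketch of the corrected check (the helper name and resource names are hypothetical, not taken from the script): rather than grepping the status text for "successfully", rely on the exit code of "--timeout=100ms", which is non-zero whenever the rollout is not yet complete.

```shell
# Sketch, not the actual script: completion is detected via exit code.
rollout_complete() {
  # $1 = namespace, $2 = statefulset name (hypothetical arguments)
  kubectl rollout status "statefulset/$2" -n "$1" \
    --timeout=100ms >/dev/null 2>&1
}

if rollout_complete default web; then
  echo "rollout of default/web complete"
else
  echo "rollout of default/web still in progress"
fi
```

Using the exit code avoids depending on human-readable status strings, which vary between rollout strategies.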
Improvement: more logs were added to the script.
Fix: the script now returns an error when the internal script timeout is
reached and the rollout has not completed. Before, it simply exited and
the procedure continued without reporting any problem.
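Put together, the retry logic with the new error propagation could look roughly like this (a sketch under assumed names; the real script's structure may differ). With the fixed parameters, 40 retries at 45s intervals give the roughly 30-minute internal timeout, and reaching it now surfaces a non-zero exit instead of a silent pass:

```shell
# Sketch: poll a completion check until it succeeds or retries run out.
wait_for_rollouts() {
  local retries="$1" interval="$2"
  shift 2                              # remaining args: check command
  local attempt=1
  while [ "$attempt" -le "$retries" ]; do
    if "$@"; then
      echo "rollout complete after $attempt attempt(s)"
      return 0
    fi
    echo "attempt $attempt/$retries: rollout not complete, retrying"
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  echo "internal timeout reached, rollout not complete" >&2
  return 1                             # propagate the error (the fix)
}

# With the fixed script's values this would be called as:
#   wait_for_rollouts 40 45 <completion-check-command>
```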
In file "kubernetes.pp":
Improvement: the hard Puppet timeout for the pod rollout script was
changed from 10 minutes to 60 minutes. This timeout should not be
reached, since the internal script timeout is around 30 minutes, but it
is not impossible, as those 30 minutes do not count the time spent
executing the "kubectl rollout status" commands.
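In Puppet, the hard timeout of an exec resource is its `timeout` parameter, in seconds; the change described above would take roughly this shape (the resource title and command path are hypothetical, not taken from "kubernetes.pp"):

```puppet
# Hypothetical sketch of the timeout change; names are illustrative.
exec { 'kube-rootca-update-pods-rollout':
  command => '/path/to/kube-rootca-update-pods',  # assumed path
  timeout => 3600,  # raised from 600 (10 min) to 3600 (60 min)
}
```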
Test Plan:
PASS: On an AIO-SX with at least one statefulset, all of these
statefulsets healthy, execute the Kubernetes Root CA update procedure
and check that it completes successfully without reaching any timeout
in the "trustbothcas" and "trustnewca" stages.
PASS: On an AIO-SX with at least one statefulset, with these
statefulsets in an error state, execute the Kubernetes Root CA update
procedure and check that it transitions to an error state due to the
pod rollout script reaching the internal script timeout (which is
around 30 minutes) and returning an error.
Closes-Bug: 2004594
Signed-off-by: Joao Victor Portal <email address hidden>