Rollout of statefulsets fails during Kubernetes Root CA Certificate Update

Bug #2004594 reported by João Victor Portal
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: João Victor Portal

Bug Description

Brief Description
-----------------
During the Kubernetes Root CA Certificate Update procedure, it was observed that the pod restart script (kube-rootca-update-pods.erb) gets stuck trying to restart a statefulset in the non-existent namespace "0", and the statefulsets that do exist on the system are never restarted. The script runs until either its internal timeout (8 minutes, plus the execution time of the "kubectl rollout status" commands) or the Puppet timeout (10 minutes) is reached. If the internal script timeout is reached, no error is returned and the update procedure continues as if nothing had gone wrong. If the Puppet timeout is reached, the update procedure transitions to a failed state.
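
For context, a sketch of the kind of per-namespace rollout-restart loop the script performs in these states is shown below. The structure and names are illustrative assumptions, not the actual contents of "kube-rootca-update-pods.erb"; the point is that when the list variable is misspelled, the loop iterates over unintended data instead of real namespace/name pairs:

    # Illustrative sketch only (assumed structure, not the real .erb).
    # Build a "namespace name" list of every statefulset in the cluster.
    statefulsets=$(kubectl get statefulsets --all-namespaces --no-headers \
                   | awk '{print $1 " " $2}')

    # Trigger a rolling restart of each one. With the reported typo
    # ("statefullsets" assigned but "statefulsets" read), a loop like
    # this runs over garbage and ends up targeting the non-existent
    # namespace "0".
    echo "$statefulsets" | while read -r ns name; do
        [ -n "$ns" ] || continue
        kubectl -n "$ns" rollout restart statefulset "$name"
    done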

Severity
--------
Critical

Steps to Reproduce
------------------
Execute the Kubernetes Root CA Certificate Update procedure on any system that has at least one statefulset.
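
If the system has no statefulsets, one can be created first to satisfy the precondition. A minimal, illustrative example (the namespace, names, and image below are arbitrary assumptions; any statefulset will do):

    # Create a throwaway statefulset so the system has at least one.
    kubectl create namespace rootca-test
    kubectl -n rootca-test apply -f - <<EOF
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: demo
    spec:
      serviceName: demo
      replicas: 1
      selector:
        matchLabels:
          app: demo
      template:
        metadata:
          labels:
            app: demo
        spec:
          containers:
          - name: web
            image: nginx:stable
    EOF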

Expected Behavior
------------------
The rollout of statefulsets happens in the "trustbothcas" and "trustnewca" states.

Actual Behavior
----------------
The rollout of statefulsets does not happen in the "trustbothcas" and "trustnewca" states.

Reproducibility
---------------
100% reproducible.

System Configuration
--------------------
Any.

Branch/Pull Time/Commit
-----------------------
N/A.

Last Pass
---------
N/A.

Timestamp/Logs
--------------
N/A.

Test Activity
-------------
Developer Testing

Workaround
----------
N/A.

Changed in starlingx:
assignee: nobody → João Victor Portal (jvictorp)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/872563
Committed: https://opendev.org/starlingx/stx-puppet/commit/39cc00aab8a6632043ee9d58202d2e90db0db73a
Submitter: "Zuul (22348)"
Branch: master

commit 39cc00aab8a6632043ee9d58202d2e90db0db73a
Author: Joao Victor Portal <email address hidden>
Date: Thu Feb 2 12:58:51 2023 -0300

    Fix pod rollout script of K8s root CA change

    The fixes and improvements to the pod rollout script used in the
    Kubernetes Root CA update procedure are listed below. This script is
    used in the "trustbothcas" and "trustnewca" stages of the procedure
    and is written in the file "kube-rootca-update-pods.erb".

    In file "kube-rootca-update-pods.erb":

    Fix: the variable "statefullsets" had a typo and was renamed to
    "statefulsets". Because of this typo, the variable was used on line
    29 without being initialized. When the system had at least one
    statefulset, the script always got stuck trying to restart a
    statefulset in the non-existent namespace "0", the internal script
    timeout was always reached, and no existing statefulset was
    restarted.

    Improvement: the internal script timeout was increased from 8 minutes
    (32 retries every 15s) to 30 minutes (40 retries every 45s).

    Fix: it was noticed that the output of "kubectl rollout status
    --watch=false" does not always contain "successfully" when the
    rollout is complete, as in the example below:
    "partitioned roll out complete: 2 new pods have been updated...".
    The command also always exited with status "0". Because of this,
    "--timeout=100ms" is now used instead of "--watch=false", making the
    command exit with "1" when the rollout is not complete and "0" when
    it is.

    Improvement: more logs were added to the script.

    Fix: the script now returns an error when the internal script
    timeout is reached and the rollout has not completed. Before, it
    simply exited, and the procedure continued without flagging any
    problem.

    In file "kubernetes.pp":

    Improvement: the hard Puppet timeout for the pod rollout script was
    changed from 10 minutes to 60 minutes. This timeout should not be
    reached, as the internal script timeout is around 30 minutes, but it
    is not impossible, since those 30 minutes do not count the time
    spent executing the "kubectl rollout status" commands.

    Test Plan:

    PASS: On an AIO-SX with at least one statefulset, all statefulsets
    healthy, execute the Kubernetes Root CA update procedure and check
    that it completes successfully without reaching any timeout in the
    "trustbothcas" and "trustnewca" stages.

    PASS: On an AIO-SX with at least one statefulset, with the
    statefulsets in an error state, execute the Kubernetes Root CA
    update procedure and check that it transitions to an error state due
    to the pod rollout script reaching the internal script timeout
    (around 30 minutes) and returning an error.

    Closes-Bug: 2004594
    Signed-off-by: Joao Victor Portal <email address hidden>
    Ch...

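Taken together, the fixes to "kube-rootca-update-pods.erb" amount to a bounded polling loop with a real exit status. Below is a sketch of the corrected pattern, reusing the illustrative "$statefulsets" list from the sketch in the Brief Description (assumed structure, not the actual script):

    # Illustrative sketch of the corrected polling loop (assumed
    # structure, not the real .erb). 40 retries x 45s sleep = 30 min.
    RETRIES=40
    SLEEP=45

    for ((i = 1; i <= RETRIES; i++)); do
        pending=0
        while read -r ns name; do
            [ -n "$ns" ] || continue
            # With --timeout=100ms the command exits 0 only when the
            # rollout is already complete, and non-zero otherwise
            # (unlike --watch=false, which always exited 0).
            if ! kubectl -n "$ns" rollout status statefulset "$name" \
                    --timeout=100ms >/dev/null 2>&1; then
                pending=1
            fi
        done <<< "$statefulsets"
        if [ "$pending" -eq 0 ]; then
            echo "All statefulset rollouts complete."
            exit 0
        fi
        sleep "$SLEEP"
    done

    # Timeout reached with rollouts still pending: fail loudly so the
    # update procedure does not continue as if nothing went wrong.
    echo "ERROR: statefulset rollouts did not complete in time" >&2
    exit 1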

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nfv (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/nfv/+/874040

Ghada Khalil (gkhalil)
Changed in starlingx:
status: Fix Released → In Progress
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.security
tags: added: stx.config
OpenStack Infra (hudson-openstack) wrote : Fix merged to nfv (master)

Reviewed: https://review.opendev.org/c/starlingx/nfv/+/874040
Committed: https://opendev.org/starlingx/nfv/commit/244b29636cb6e4c494a4b971c414272b82a58eae
Submitter: "Zuul (22348)"
Branch: master

commit 244b29636cb6e4c494a4b971c414272b82a58eae
Author: Joao Victor Portal <email address hidden>
Date: Wed Feb 15 20:31:11 2023 -0300

    Change pod rollout timeout for K8s root CA change

    The Puppet timeout for the pod rollout in the "trustbothcas" and
    "trustnewca" stages was recently changed from 600s to 3600s. In this
    commit, the timeouts for these stages in the Kubernetes root CA
    update strategy are updated to match.

    Test Plan:

    PASS: On an AIO-SX, execute the Kubernetes Root CA update through a
    sw-manager strategy and check that it completes successfully without
    reaching any timeout in the "trustbothcas" and "trustnewca" stages.

    PASS: On an AIO-SX, artificially make the pod rollout script hang
    for 15 minutes and check that the "trustbothcas" and "trustnewca"
    stages still complete successfully.

    Closes-Bug: 2004594
    Signed-off-by: Joao Victor Portal <email address hidden>
    Change-Id: I96d04de95e424e15bd79f049be644909bb0dcff7

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to distcloud (master)

Reviewed: https://review.opendev.org/c/starlingx/distcloud/+/874772
Committed: https://opendev.org/starlingx/distcloud/commit/78b64129a257bf1c72287392cddde5c7ade4ef8b
Submitter: "Zuul (22348)"
Branch: master

commit 78b64129a257bf1c72287392cddde5c7ade4ef8b
Author: Joao Victor Portal <email address hidden>
Date: Wed Feb 22 11:04:00 2023 -0300

    Increment timeout of DC K8s Root CA strategy

    Recently, the timeout of the internal pod rollout script used in the
    "trust-both-cas" and "trust-new-ca" stages of the Kubernetes Root CA
    update process was increased from ~8 minutes to ~30 minutes, so the
    maximum possible duration of a root CA update grew by ~44 minutes
    (~22 extra minutes in each of the two stages). This change increases
    the timeout for subclouds, used by the dcmanager
    kube-rootca-update-strategy, from 60 minutes to 120 minutes.

    Test Plan:

    PASS: In a DC deployment with one subcloud, successfully apply a
    dcmanager kube-rootca-update-strategy on the subcloud.

    PASS: Repeat the test above, but artificially make the root CA update
    process in the subcloud take more than 1 hour to complete and check
    that no timeout occurs on the central cloud.

    Closes-Bug: 2004594
    Signed-off-by: Joao Victor Portal <email address hidden>
    Change-Id: I6460846eba6a37d1d0cdc634ea4cf1314b1b6bc4
