During root CA update, some statefulsets may not complete the rollout restart within the time limit, causing the update to fail

Bug #1954303 reported by Andy
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Andy

Bug Description

Brief Description
-----------------
During the k8s root CA update, deployments, daemonsets and statefulsets are rollout-restarted so that the pods they manage pick up the new root CA certificate. It has been observed that some statefulsets may take longer than the time limit (10 minutes) to complete the rollout restart, causing the puppet manifest apply to time out and the update to fail.

Severity
--------
Major

Steps to Reproduce
------------------
- Deploy an application by statefulset with multiple replicas (such as a mysql database).

- Check whether the statefulset takes more than 10 minutes to complete a rollout restart. This can be checked with the following commands:

kubectl rollout restart statefulset <the statefulset> -n <namespace>
kubectl rollout status statefulset <the statefulset> -n <namespace>

If the second command (checking status) takes more than 10 minutes, the puppet apply will time out and the update will fail.

- Run the system commands to update the k8s root CA certificate.
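The timing check in the second step can be bounded with kubectl's standard `--timeout` flag instead of waiting and watching the clock. The sketch below is illustrative only: `check_rollout` is a hypothetical helper, and `KUBECTL`, `STS` and `NS` are placeholder variables, not part of StarlingX.

```shell
# Illustrative sketch only: KUBECTL, STS and NS are placeholder variables,
# and check_rollout is a hypothetical helper, not StarlingX code.
KUBECTL="${KUBECTL:-kubectl}"
STS="${STS:-mysql}"        # statefulset under test
NS="${NS:-default}"

check_rollout() {
    "$KUBECTL" rollout restart statefulset "$STS" -n "$NS" || return 1
    # --timeout is a standard kubectl flag; "rollout status" exits
    # non-zero if the rollout is not complete when the timeout expires.
    if ! "$KUBECTL" rollout status statefulset "$STS" -n "$NS" --timeout=10m; then
        echo "statefulset ${STS} exceeded 10 minutes; the root CA update would fail"
        return 1
    fi
    echo "statefulset ${STS} restarted within the limit"
}
```

If the helper prints the "exceeded 10 minutes" line, the deployment is in the situation this bug describes.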

Expected Behavior
------------------
The "system kube-rootca-pods-update --phase=trust-both-cas" command should succeed.

Actual Behavior
----------------
The "system kube-rootca-pods-update --phase=trust-both-cas" command fails.

Reproducibility
---------------
Reproducible whenever a statefulset takes more than 10 minutes to complete a rollout restart.

System Configuration
--------------------
Any

Branch/Pull Time/Commit
-----------------------
Latest from STX master

Last Pass
---------
N/A

Timestamp/Logs
--------------
puppet.log:
1401 2021-12-02T17:00:11.286 # Trigger rollout restart for all deployments and daemonsets so that they
1402 2021-12-02T17:00:11.292 # restart in parallel.
1403 2021-12-02T17:00:11.294 for namespace in $(kubectl get namespace -o jsonpath='{.items[*].metadata.name}'); do
1404 2021-12-02T17:00:11.298 for name in $(kubectl get deployments -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
1405 2021-12-02T17:00:11.301 kubectl rollout restart deployment ${name} -n ${namespace}
1406 2021-12-02T17:00:11.303 done
1407 2021-12-02T17:00:11.308 for name in $(kubectl get daemonsets -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
1408 2021-12-02T17:00:11.310 kubectl rollout restart daemonsets ${name} -n ${namespace}
1409 2021-12-02T17:00:11.313 done
1410 2021-12-02T17:00:11.325 for name in $(kubectl get statefulsets -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
1411 2021-12-02T17:00:11.336 kubectl rollout restart statefulsets ${name} -n ${namespace}
1412 2021-12-02T17:00:11.338 done
1413 2021-12-02T17:00:11.345 done
1414 2021-12-02T17:00:11.359
1415 2021-12-02T17:00:11.363 # Check the rollout status.
1416 2021-12-02T17:00:11.365 for namespace in $(kubectl get namespace -o jsonpath='{.items[*].metadata.name}'); do
1417 2021-12-02T17:00:11.367 for name in $(kubectl get deployments -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
1418 2021-12-02T17:00:11.369 kubectl rollout status deployment ${name} -n ${namespace}
1419 2021-12-02T17:00:11.371 done
1420 2021-12-02T17:00:11.374 for name in $(kubectl get daemonsets -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
1421 2021-12-02T17:00:11.382 kubectl rollout status daemonsets ${name} -n ${namespace}
1422 2021-12-02T17:00:11.398 done
1423 2021-12-02T17:00:11.402 for name in $(kubectl get statefulsets -n $namespace -o jsonpath='{.items[*].metadata.name}'); do
1424 2021-12-02T17:00:11.405 kubectl rollout status statefulsets ${name} -n ${namespace}
1425 2021-12-02T17:00:11.411 done
1426 2021-12-02T17:00:11.416 done
1427 2021-12-02T17:00:11.423 '^[[0m
1428 2021-12-02T17:10:11.278 ^[[1;31mError: 2021-12-02 17:10:11 +0000 Command exceeded timeout

Test Activity
-------------
Developer Testing

Workaround
----------
N/A

Andy (andy.wrs)
Changed in starlingx:
assignee: nobody → Andy (andy.wrs)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/stx-puppet/+/821274

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil) wrote :

Screening / medium - dependent on end-user deployment and only specific to particular operations, so it is sufficient to fix in stx master for now. Will not hold up stx.6.0.

tags: added: stx.7.0 stx.security
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/821274
Committed: https://opendev.org/starlingx/stx-puppet/commit/e25d16065a0983b8b9f1a35465c24b167892864a
Submitter: "Zuul (22348)"
Branch: master

commit e25d16065a0983b8b9f1a35465c24b167892864a
Author: Andy Ning <email address hidden>
Date: Thu Dec 9 13:11:23 2021 -0500

    Don't fail if pods restart timeout during root CA update

    During k8s root CA update, deployments, daemonsets and statefulsets
    are rollout restarted in order for the pods deployed by them to take
    the new root CA certificate. It is observed that some statefulsets
    may take longer than the time limit (10 minutes) to complete the
    rollout restart, causing puppet manifests apply timeout and fail
    the update.

    This change updates the puppet restart code so that it checks the
    rollout restart status periodically for a limited time (8 minutes)
    and generates an "ATTENTION" log in puppet.log for any of the
    deployments, daemonsets or statefulsets that do not complete the
    restart within the time limit. After the time limit, the puppet
    apply returns successfully so that the root CA update continues.

    This solution is a balance between "let the root CA update continue
    and finish" and "minimize service impact by restarting applications".

    Test Plan:
    PASS: Successful root CA update with all sets complete restart in
          allocated timeout.
    PASS: Successful root CA update with some sets not completing restart
          in the allocated timeout. Logs generated in puppet.log.

    Closes-Bug: 1954303
    Signed-off-by: Andy Ning <email address hidden>
    Change-Id: Ie2589701a9ba234928e06d659e58db5412486303

Changed in starlingx:
status: In Progress → Fix Released
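The behaviour the merged commit describes (poll the rollout status for a bounded time, log an ATTENTION line for anything still restarting, then return success anyway) can be sketched as a small bash helper. This is illustrative only; `wait_for_rollout`, `ROLLOUT_WAIT` and `POLL_INTERVAL` are hypothetical names, not the actual stx-puppet code.

```shell
# Illustrative sketch of the fix's approach, not the merged stx-puppet code.
# Polls a status command until it succeeds or a time budget runs out; on
# timeout it logs an ATTENTION line and still returns 0, so the puppet
# manifest apply (and hence the root CA update) is not failed.
wait_for_rollout() {
    local desc="$1"; shift
    local deadline=$(( SECONDS + ${ROLLOUT_WAIT:-480} ))   # 8-minute budget
    until "$@"; do
        if (( SECONDS >= deadline )); then
            echo "ATTENTION: ${desc} did not complete rollout restart in time"
            return 0   # deliberately succeed so the root CA update continues
        fi
        sleep "${POLL_INTERVAL:-5}"
    done
    echo "${desc} rollout restart complete"
}
```

In the real manifest, each deployment, daemonset and statefulset would be polled with a short-timeout `kubectl rollout status` command as the check, e.g. `wait_for_rollout "statefulset mysql" kubectl rollout status statefulset mysql -n default --timeout=5s`.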