StarlingX

Bug #1954303
Comment #3

OpenStack Infra (hudson-openstack) wrote on 2021-12-17: Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/821274
Committed: https://opendev.org/starlingx/stx-puppet/commit/e25d16065a0983b8b9f1a35465c24b167892864a
Submitter: "Zuul (22348)"
Branch: master

commit e25d16065a0983b8b9f1a35465c24b167892864a
Author: Andy Ning <email address hidden>
Date: Thu Dec 9 13:11:23 2021 -0500

Don't fail if pod restarts time out during root CA update

    During a k8s root CA update, deployments, daemonsets and statefulsets
    are restarted with a rollout restart so that their pods pick up the
    new root CA certificate. Some statefulsets have been observed to take
    longer than the time limit (10 minutes) to complete the rollout
    restart, which causes the puppet manifest apply to time out and the
    update to fail.

    This change updates the puppet restart code so that it checks the
    rollout restart status periodically for a limited time (8 minutes)
    and generates an "ATTENTION" log in puppet.log for any deployments,
    daemonsets or statefulsets that don't complete the restart within
    that limit. After the time limit, the puppet apply returns
    successfully so that the root CA update continues.

    This solution is a balance between "let the root CA update continue
    and finish" and "minimize service impact by restarting applications".
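
The bounded polling described above can be sketched roughly as follows. This is an illustrative Python sketch, not the code merged in stx-puppet: the function names, the 30-second polling interval, the exact log wording, and the use of `kubectl rollout status --timeout` as the status check are all assumptions. Only the 8-minute limit and the "ATTENTION" marker come from the commit message.

```python
import subprocess
import time
from typing import Callable, Iterable, List, Tuple

# (kind, namespace, name), e.g. ("statefulset", "kube-system", "example")
Workload = Tuple[str, str, str]


def kubectl_rollout_done(workload: Workload) -> bool:
    """Hypothetical status check: True if kubectl reports the rollout finished."""
    kind, namespace, name = workload
    result = subprocess.run(
        ["kubectl", "rollout", "status", f"{kind}/{name}",
         "-n", namespace, "--timeout=5s"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_for_restarts(
    workloads: Iterable[Workload],
    is_done: Callable[[Workload], bool] = kubectl_rollout_done,
    time_limit: float = 480.0,   # 8 minutes, as stated in the commit message
    interval: float = 30.0,      # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Workload]:
    """Poll until every workload has restarted or the time limit expires.

    Never raises on timeout: unfinished workloads are logged and returned,
    so the caller can carry on with the root CA update.
    """
    pending = list(workloads)
    deadline = clock() + time_limit
    while pending:
        # Keep only the workloads whose rollout has not completed yet.
        pending = [w for w in pending if not is_done(w)]
        if not pending or clock() >= deadline:
            break
        sleep(interval)
    for kind, namespace, name in pending:
        # Assumed log format; the real change writes to puppet.log.
        print(f"ATTENTION: {kind} {namespace}/{name} did not complete "
              f"restart within {time_limit:.0f}s")
    return pending
```

The function returns the unfinished workloads instead of raising, which mirrors the trade-off stated above: a slow restart is reported but does not block the update.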

    Test Plan:
    PASS: Successful root CA update with all sets completing restart
          within the allocated timeout.
    PASS: Successful root CA update with some sets not completing restart
          within the allocated timeout. Logs generated in puppet.log.

    Closes-Bug: 1954303
    Signed-off-by: Andy Ning <email address hidden>
    Change-Id: Ie2589701a9ba234928e06d659e58db5412486303