Subcloud rehome playbook failed with a timeout waiting for pods to restart
Affects | Status | Importance | Assigned to
---|---|---|---
StarlingX | Fix Released | Medium | Reinildes Oliveira
Bug Description

Brief Description
-------
Subcloud rehome playbook failed with a timeout waiting for pods to restart.
Scenario:
Subcloud rehoming
Subcloud type: AIO-SX
Additional k8s apps applied
*Note: Pods in a bad state were only observed with a custom app applied.
Error condition:
There is a task in the rehome playbook that restarts all pods and waits for them to reach the Ready state:
TASK [common/
Wednesday 28 February 2035 16:06:02 +0000 (0:00:00.735) 0:00:27.451 ****
changed: [subcloud1]
TASK [common/
cmd: kubectl get po -l '!job-name' -A --no-headers -o 'custom-
delta: '0:00:00.266306'
end: '2035-02-28 17:12:44.926166'
rc: 0
NAMESPACE NAME READY STATUS RESTARTS AGE
armada armada-
flux-helm helm-controller
flux-helm source-
kube-system calico-
kube-system calico-node-h2x59 1/1 Running 0 131m
kube-system ceph-pools-
kube-system ceph-pools-
kube-system ceph-pools-
kube-system cephfs-
kube-system cephfs-
kube-system cephfs-
kube-system coredns-
kube-system ic-nginx-
kube-system kube-apiserver-
kube-system kube-controller
kube-system kube-multus-
kube-system kube-proxy-78rvg 1/1 Running 0 131m
kube-system kube-scheduler-
kube-system kube-sriov-
kube-system kube-sriov-
kube-system rbd-nodeplugin-
kube-system rbd-provisioner
kube-system rbd-storage-
kube-system volume-
monitor mon-elastic-
monitor mon-filebeat-wmwlh 0/1 Running 0 130m
monitor mon-kube-
monitor mon-logstash-0 0/1 Init:3/4 150 (7m34s ago) 21h
monitor mon-metricbeat-
monitor mon-metricbeat-
monitor mon-metricbeat-
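The wait times out because pods such as mon-filebeat-wmwlh and mon-logstash-0 never reach READY. As a rough illustration (not the actual playbook code; the helper name `not_ready_pods` is made up), the readiness test the task performs on that listing can be sketched as:

```shell
# Given `kubectl get po -A --no-headers` output on stdin, print
# "namespace/pod" for every pod whose READY column (e.g. "0/1") shows
# fewer ready containers than the total.
not_ready_pods() {
    awk '{ split($3, c, "/"); if (c[1] != c[2]) print $1 "/" $2 }'
}

# Fed lines from the failing subcloud, it flags the stuck monitor pods:
printf '%s\n' \
    'kube-system kube-proxy-78rvg 1/1 Running 0 131m' \
    'monitor mon-filebeat-wmwlh 0/1 Running 0 130m' \
    'monitor mon-logstash-0 0/1 Init:3/4 150 21h' \
    | not_ready_pods
# prints:
# monitor/mon-filebeat-wmwlh
# monitor/mon-logstash-0
```

As long as that filter keeps returning pods, the playbook keeps polling until its timeout expires, which is the failure seen here.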
fm alarm-list
| Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
| 250.001 | controller-0 Configuration is out-of-date. (applied: 3ae3f09f-
| 500.210 | Certificate 'system certificate-show 5d502c2f-
| | | 5d502c2f-
| 200.006 | controller-0 'ntp' process has failed. Manual recovery is required. | host=controller
cd /var/log/
grep -r ERROR ./ | head -n 10
In the sysinv log I see the following errors:
sysinv 2035-02-28 15:58:40.811 3040284 ERROR sysinv.
Severity
-------
<Critical: System/Feature is not usable after the defect>
Steps to Reproduce
-------
Deploy a subcloud
Update the subcloud clock to the future (11 years ahead), so that both certificates and the license are expired
Trigger rehoming for this subcloud on a target SystemController which also has its clock set to the future
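The clock-jump step above can be sketched as follows (a hedged illustration, not the documented test procedure; GNU `date` relative-date syntax is assumed):

```shell
# Compute a timestamp 11 years in the future (GNU date extension).
future=$(date -d '+11 years' '+%Y-%m-%d %H:%M:%S')
echo "$future"
# On the subcloud this value could then be applied with, e.g.:
#   sudo date -s "$future"
# after which previously valid certificates and the license read as expired.
```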
Expected Behavior
-------
The rehome playbook should renew all certificates, with the exception of the docker cert.
The playbook should fail with an error message requesting the user to renew the docker cert.
Actual Behavior
-------
Subcloud rehome failed due to logstash pod in bad state
Reproducibility
-------
100% reproducible
System Configuration
-------
DC
[sysadmin@
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID=
BUILD_BY="jenkins"
BUILD_NUMBER="50"
BUILD_DATE=
Test Activity
-------
Regression Testing
Workaround
-------
no workarounds
Changed in starlingx:
status: New → In Progress
tags: added: stx.10.0 stx.config stx.security

Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Reinildes Oliveira (rjosemat)
Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913829
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/5a304af6e1f424fd5c5e2bba907428bc0d402cfd
Submitter: "Zuul (22348)"
Branch: master
commit 5a304af6e1f424fd5c5e2bba907428bc0d402cfd
Author: Rei Oliveira <email address hidden>
Date: Fri Mar 15 11:40:26 2024 -0300
Only wait for essential pods in cert recovery
The certificate recovery role will trigger a restart of every pod
in the k8s cluster so that they can be updated with the latest
certificate information.
After the pods restart, the procedure waits for every pod to recover and
become READY. This change modifies that behaviour to wait only for the
essential pods to recover, namely those in the core namespaces armada,
cert-manager, flux-helm and kube-system.
Test case:
PASS: Run certificate recovery with crashing pods in a custom namespace
Closes-Bug: 2058751
Signed-off-by: Rei Oliveira <email address hidden>
Change-Id: I3ea403a3e324ecbb5f2c1f56d6ce1c8bd80fabee
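The effect of the fix can be pictured with a small shell sketch (an assumption-laden illustration, not the actual role code; the helper name `essential_only` is made up): the pod listing is narrowed to the core namespaces named in the commit message before the readiness wait is applied.

```shell
# Keep only pods in the essential namespaces, so a crashing pod
# elsewhere (e.g. monitor/mon-logstash-0) no longer blocks the wait.
essential_only() {
    awk '$1 == "armada" || $1 == "cert-manager" || $1 == "flux-helm" || $1 == "kube-system"'
}

# The stuck logstash pod from the bug is filtered out of consideration:
printf '%s\n' \
    'kube-system coredns-abc 1/1 Running 0 131m' \
    'monitor mon-logstash-0 0/1 Init:3/4 150 21h' \
    | essential_only
# prints:
# kube-system coredns-abc 1/1 Running 0 131m
```

With that filter in place, a misbehaving pod in a custom app's namespace can no longer cause the certificate-recovery wait to time out.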