Subcloud rehome playbook failed with a timeout waiting for pods to restart

Bug #2058751 reported by Reinildes Oliveira
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Reinildes Oliveira

Bug Description

Brief Description
---------------------------------------------

Subcloud rehome playbook failed with a timeout waiting for pods to restart.
Scenario:

    Subcloud rehoming
    Subcloud type: AIO-SX
    Additional k8s apps applied
    *Note: Pods in a bad state were only observed with the custom app applied

Error condition:

    There is a task in the rehome playbook that restarts all pods and waits for them to reach the READY state:
        TASK [common/recover-subcloud-certificates : Trigger restart of networking pods first to avoid pod scheduling issues] ***
        Wednesday 28 February 2035 16:06:02 +0000 (0:00:00.735) 0:00:27.451 ****
        changed: [subcloud1]

        TASK [common/recover-subcloud-certificates : Wait pods to restart (become READY) on controller] ***
        cmd: kubectl get po -l '!job-name' -A --no-headers -o 'custom-columns=NAME:.metadata.name, READY:.status.containerStatuses[*].ready,NODE:.spec.nodeName' | grep -v calico-node | grep $(hostname) | grep -cv true
          delta: '0:00:00.266306'
          end: '2035-02-28 17:12:44.926166'
          failed_when_result: true
          rc: 0
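
        For debugging, the same check can be run by hand with the count flag (-c) dropped, so it lists the pods that are not READY instead of just counting them (a manual variant of the playbook's command, shown here for reference):

            kubectl get po -l '!job-name' -A --no-headers \
              -o 'custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[*].ready,NODE:.spec.nodeName' \
              | grep -v calico-node | grep $(hostname) | grep -v true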

        sysadmin@controller-0:~$ kubectl get pods -A
        NAMESPACE NAME READY STATUS RESTARTS AGE
        armada armada-api-7c4cdff774-bkr8m 2/2 Running 0 131m
        cert-manager cm-cert-manager-6c95587448-9hhtr 1/1 Running 0 131m
        cert-manager cm-cert-manager-cainjector-5cbdd6849-74mpj 1/1 Running 0 131m
        cert-manager cm-cert-manager-webhook-c9bddf7f4-9k82z 1/1 Running 0 131m
        flux-helm helm-controller-5c98bb9658-vhd67 1/1 Running 0 131m
        flux-helm source-controller-774d8dc5-49h4r 1/1 Running 0 131m
        kube-system calico-kube-controllers-5657f865d8-jhhht 1/1 Running 0 131m
        kube-system calico-node-h2x59 1/1 Running 0 131m
        kube-system ceph-pools-audit-34271645-hb5b7 0/1 Completed 0 12m
        kube-system ceph-pools-audit-34271650-rnzbm 0/1 Completed 0 7m57s
        kube-system ceph-pools-audit-34271655-zcdl6 0/1 Completed 0 2m57s
        kube-system cephfs-nodeplugin-7glwz 2/2 Running 0 131m
        kube-system cephfs-provisioner-7d6847777f-7h7fh 4/4 Running 0 131m
        kube-system cephfs-storage-init-7f8q2 0/1 Completed 0 11y
        kube-system coredns-5768b6dd65-c5kv7 1/1 Running 0 131m
        kube-system ic-nginx-ingress-ingress-nginx-controller-j4b9t 1/1 Running 0 131m
        kube-system kube-apiserver-controller-0 1/1 Running 219 (144m ago) 11y
        kube-system kube-controller-manager-controller-0 1/1 Running 6 (18h ago) 11y
        kube-system kube-multus-ds-amd64-wsjz7 1/1 Running 0 131m
        kube-system kube-proxy-78rvg 1/1 Running 0 131m
        kube-system kube-scheduler-controller-0 1/1 Running 6 (18h ago) 11y
        kube-system kube-sriov-cni-ds-amd64-kbthx 1/1 Running 0 131m
        kube-system kube-sriov-device-plugin-amd64-4wl59 1/1 Running 0 131m
        kube-system rbd-nodeplugin-vtbp5 2/2 Running 0 131m
        kube-system rbd-provisioner-5d994c59cd-dh2f2 6/6 Running 0 131m
        kube-system rbd-storage-init-crm9d 0/1 Completed 0 11y
        kube-system volume-snapshot-controller-0 1/1 Running 0 131m
        metrics-server ms-metrics-server-7466cbd75d-88qhz 1/1 Running 0 131m
        monitor mon-elastic-services-95875f958-hhlcx 2/2 Running 0 131m
        monitor mon-filebeat-wmwlh 0/1 Running 0 130m
        monitor mon-kube-state-metrics-c65697c8d-6wpkj 1/1 Running 0 131m
        monitor mon-logstash-0 0/1 Init:3/4 150 (7m34s ago) 21h
        monitor mon-metricbeat-metrics-769f99d8c5-l86t8 0/1 Running 0 131m
        monitor mon-metricbeat-metrics-7f6898b4c9-vmgkj 0/1 Running 0 21h
        monitor mon-metricbeat-z6g25 0/1 Running 0 131m
        platform-deployment-manager dm-monitor-7976d74cdc-vkhvz 1/1 Running 0 131m
        platform-deployment-manager platform-deployment-manager-7954d9cdd4-mxpvr 2/2 Running 0 131m

        sysadmin@controller-0:/var/log$ source /etc/platform/openrc
        fm alarm-list
        [sysadmin@controller-0 log(keystone_admin)]$ fm alarm-list
        +----------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------+----------------------------+
        | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
        +----------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------+----------------------------+
        | 250.001 | controller-0 Configuration is out-of-date. (applied: 3ae3f09f-53dd-4bbe-a5e6-a3a162685b80 target: 0f9f78f0-fe61-42db-9080-4bf6f2ea544b) | host=controller-0 | major | 2035-02-28T18:29:33.117080 |
        | 500.210 | Certificate 'system certificate-show 5d502c2f-e79d-42b0-89f4-92d1e92f56d5' (mode=ssl_ca) expired. | system.certificate.mode=ssl_ca.uuid= | critical | 2035-02-27T20:39:17.208325 |
        | | | 5d502c2f-e79d-42b0-89f4-92d1e92f56d5 | | |
        | | | | | |
        | 200.006 | controller-0 'ntp' process has failed. Manual recovery is required. | host=controller-0.process=ntp | minor | 2035-02-27T20:35:57.822232 |
        +----------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------+----------------------------+
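
        The expired ssl_ca certificate flagged by alarm 500.210 can be inspected with the platform certificate commands (system certificate-show is quoted directly in the alarm text; system certificate-list is assumed here as the companion listing command; output was not captured in this report):

            system certificate-list
            system certificate-show 5d502c2f-e79d-42b0-89f4-92d1e92f56d5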

        ----------------------
        cd /var/log/pods/monitor_mon-logstash-0_068b025d-5398-4b39-a539-8d60b5076051

        grep -r ERROR ./ | head -n 10

        ./d-logstash-setup/150.log:2035-02-28T18:11:55.780926851Z stdout F 2035-02-28 18:11:55.780 ERROR elastic-services /tmp/files/logstash_setup.py:193 Could not connect to elasticsearch cluster: <Elasticsearch([{'host': 'mon-elasticsearch.central', 'port': 31001, 'use_ssl': True}])>
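
        As a quick manual probe of the connection failure above (an assumed debugging step, not captured in the original logs), the elasticsearch endpoint named in the error can be queried directly from the subcloud:

            curl -kv https://mon-elasticsearch.central:31001
            # -k skips server certificate verification; without it, the expired
            # ssl_ca certificate would be expected to make TLS validation fail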

        ----------------------
        In the sysinv log I see the following error:

        sysinv 2035-02-28 15:58:40.811 3040284 ERROR sysinv.conductor.manager [-] Unexpected error during hook for app platform-integ-apps, error: Cannot load 'relative_timing' in the base class: NotImplementedError: Cannot load 'relative_timing' in the base class

Severity
---------------------------------------------
Critical: System/Feature is not usable after the defect

Steps to Reproduce
---------------------------------------------

    Deploy a subcloud
    Update the subcloud clock to the future (11 years ahead), so that both the certificates and the license expire (see the sketch after this list)
    Trigger rehoming for this subcloud on a target SystemController that also has its clock set to the future
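
A minimal sketch of step 2, assuming the clock is advanced manually with date(1) after stopping the time-sync service (the service name and target date below are illustrative assumptions, not taken from the original report):

    # Stop time synchronization so the manual date change is not reverted
    sudo systemctl stop ntp          # or chronyd/ntpd, depending on the configured service
    # Jump the clock ~11 years ahead so the certificates and license expire
    sudo date -s "2035-02-27 12:00:00"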

Expected Behavior
---------------------------------------------

    The rehome playbook should renew all certificates, with the exception of the docker certificate.
    The playbook should then fail with an error message requesting the user to renew the docker certificate.

Actual Behavior
---------------------------------------------

Subcloud rehome failed due to a logstash pod in a bad state.

Reproducibility
---------------------------------------------

100% reproducible

System Configuration
---------------------------------------------

DC

[sysadmin@controller-1 ~(keystone_admin)]$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
BUILD_BY="jenkins"
BUILD_NUMBER="50"
BUILD_DATE="2022-12-19 07:22:00 +0000"

Test Activity
---------------------------------------------

Regression Testing

Workaround
---------------------------------------------

no workarounds

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913829
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/5a304af6e1f424fd5c5e2bba907428bc0d402cfd
Submitter: "Zuul (22348)"
Branch: master

commit 5a304af6e1f424fd5c5e2bba907428bc0d402cfd
Author: Rei Oliveira <email address hidden>
Date: Fri Mar 15 11:40:26 2024 -0300

    Only wait for essential pods in cert recovery

    The certificate recovery role will trigger a restart of every pod
    in the k8s cluster so that they can be updated with the latest
    certificate information.

    After pods restart the procedure waits every pod to recover and become
    READY. This change modifies that behaviour to only wait for essential
    pods to recover, being those in the core namespaces armada,
    cert-manager, flux-helm and kube-system.

    Test case:

    PASS: Run certificate recovery with crashing pods in a custom namespace

    Closes-Bug: 2058751

    Signed-off-by: Rei Oliveira <email address hidden>
    Change-Id: I3ea403a3e324ecbb5f2c1f56d6ce1c8bd80fabee
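
As an illustration of the approach described in the commit message above (a hedged sketch only, not the merged change itself), the readiness wait can be limited to the core namespaces roughly like this:

    # Count not-READY pods, but only in the essential namespaces
    not_ready=0
    for ns in armada cert-manager flux-helm kube-system; do
      n=$(kubectl get po -l '!job-name' -n "$ns" --no-headers \
          -o 'custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[*].ready,NODE:.spec.nodeName' \
          | grep -v calico-node | grep "$(hostname)" | grep -cv true)
      not_ready=$((not_ready + n))
    done
    echo "$not_ready"   # the play can keep retrying until this reaches 0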

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.10.0 stx.config stx.security
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Reinildes Oliveira (rjosemat)