Subcloud rehome playbook failed with a timeout waiting for pods to restart

Bug #2058751 reported by Reinildes Oliveira
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Reinildes Oliveira

Bug Description

Brief Description
---------------------------------------------

Subcloud rehome playbook failed with a timeout waiting for pods to restart.
Scenario:

    Subcloud rehoming
    Subcloud type: AIO-SX
    Additional k8s apps applied
    *Note: Pods in a bad state were only observed with the custom app applied

Error condition:

    There is a task in the rehome playbook that restarts all pods and waits for them to reach the READY state:
        TASK [common/recover-subcloud-certificates : Trigger restart of networking pods first to avoid pod scheduling issues] ***
        Wednesday 28 February 2035 16:06:02 +0000 (0:00:00.735) 0:00:27.451 ****
        changed: [subcloud1]

        TASK [common/recover-subcloud-certificates : Wait pods to restart (become READY) on controller] ***
        cmd: kubectl get po -l '!job-name' -A --no-headers -o 'custom-columns=NAME:.metadata.name, READY:.status.containerStatuses[*].ready,NODE:.spec.nodeName' | grep -v calico-node | grep $(hostname) | grep -cv true
          delta: '0:00:00.266306'
          end: '2035-02-28 17:12:44.926166'
          failed_when_result: true
          rc: 0
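
        For debugging, the same check can be run by hand with the count flag (-c) dropped, so it lists the pods that are not READY instead of just counting them (a manual variant of the playbook's command, shown here for reference):

            kubectl get po -l '!job-name' -A --no-headers \
              -o 'custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[*].ready,NODE:.spec.nodeName' \
              | grep -v calico-node | grep $(hostname) | grep -v true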

        sysadmin@controller-0:~$ kubectl get pods -A
        NAMESPACE NAME READY STATUS RESTARTS AGE
        armada armada-api-7c4cdff774-bkr8m 2/2 Running 0 131m
        cert-manager cm-cert-manager-6c95587448-9hhtr 1/1 Running 0 131m
        cert-manager cm-cert-manager-cainjector-5cbdd6849-74mpj 1/1 Running 0 131m
        cert-manager cm-cert-manager-webhook-c9bddf7f4-9k82z 1/1 Running 0 131m
        flux-helm helm-controller-5c98bb9658-vhd67 1/1 Running 0 131m
        flux-helm source-controller-774d8dc5-49h4r 1/1 Running 0 131m
        kube-system calico-kube-controllers-5657f865d8-jhhht 1/1 Running 0 131m
        kube-system calico-node-h2x59 1/1 Running 0 131m
        kube-system ceph-pools-audit-34271645-hb5b7 0/1 Completed 0 12m
        kube-system ceph-pools-audit-34271650-rnzbm 0/1 Completed 0 7m57s
        kube-system ceph-pools-audit-34271655-zcdl6 0/1 Completed 0 2m57s
        kube-system cephfs-nodeplugin-7glwz 2/2 Running 0 131m
        kube-system cephfs-provisioner-7d6847777f-7h7fh 4/4 Running 0 131m
        kube-system cephfs-storage-init-7f8q2 0/1 Completed 0 11y
        kube-system coredns-5768b6dd65-c5kv7 1/1 Running 0 131m
        kube-system ic-nginx-ingress-ingress-nginx-controller-j4b9t 1/1 Running 0 131m
        kube-system kube-apiserver-controller-0 1/1 Running 219 (144m ago) 11y
        kube-system kube-controller-manager-controller-0 1/1 Running 6 (18h ago) 11y
        kube-system kube-multus-ds-amd64-wsjz7 1/1 Running 0 131m
        kube-system kube-proxy-78rvg 1/1 Running 0 131m
        kube-system kube-scheduler-controller-0 1/1 Running 6 (18h ago) 11y
        kube-system kube-sriov-cni-ds-amd64-kbthx 1/1 Running 0 131m
        kube-system kube-sriov-device-plugin-amd64-4wl59 1/1 Running 0 131m
        kube-system rbd-nodeplugin-vtbp5 2/2 Running 0 131m
        kube-system rbd-provisioner-5d994c59cd-dh2f2 6/6 Running 0 131m
        kube-system rbd-storage-init-crm9d 0/1 Completed 0 11y
        kube-system volume-snapshot-controller-0 1/1 Running 0 131m
        metrics-server ms-metrics-server-7466cbd75d-88qhz 1/1 Running 0 131m
        monitor mon-elastic-services-95875f958-hhlcx 2/2 Running 0 131m
        monitor mon-filebeat-wmwlh 0/1 Running 0 130m
        monitor mon-kube-state-metrics-c65697c8d-6wpkj 1/1 Running 0 131m
        monitor mon-logstash-0 0/1 Init:3/4 150 (7m34s ago) 21h
        monitor mon-metricbeat-metrics-769f99d8c5-l86t8 0/1 Running 0 131m
        monitor mon-metricbeat-metrics-7f6898b4c9-vmgkj 0/1 Running 0 21h
        monitor mon-metricbeat-z6g25 0/1 Running 0 131m
        platform-deployment-manager dm-monitor-7976d74cdc-vkhvz 1/1 Running 0 131m
        platform-deployment-manager platform-deployment-manager-7954d9cdd4-mxpvr 2/2 Running 0 131m

        sysadmin@controller-0:/var/log$ source /etc/platform/openrc
        fm alarm-list
        [sysadmin@controller-0 log(keystone_admin)]$ fm alarm-list
        +----------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------+----------------------------+
        | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
        +----------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------+----------------------------+
        | 250.001 | controller-0 Configuration is out-of-date. (applied: 3ae3f09f-53dd-4bbe-a5e6-a3a162685b80 target: 0f9f78f0-fe61-42db-9080-4bf6f2ea544b) | host=controller-0 | major | 2035-02-28T18:29:33.117080 |
        | 500.210 | Certificate 'system certificate-show 5d502c2f-e79d-42b0-89f4-92d1e92f56d5' (mode=ssl_ca) expired. | system.certificate.mode=ssl_ca.uuid= | critical | 2035-02-27T20:39:17.208325 |
        | | | 5d502c2f-e79d-42b0-89f4-92d1e92f56d5 | | |
        | | | | | |
        | 200.006 | controller-0 'ntp' process has failed. Manual recovery is required. | host=controller-0.process=ntp | minor | 2035-02-27T20:35:57.822232 |
        +----------+-----------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------+----------+----------------------------+
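
        The expired ssl_ca certificate flagged by alarm 500.210 can be inspected with the platform certificate commands (system certificate-show is quoted directly in the alarm text; system certificate-list is assumed here as the companion listing command; output was not captured in this report):

            system certificate-list
            system certificate-show 5d502c2f-e79d-42b0-89f4-92d1e92f56d5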

        ----------------------
        cd /var/log/pods/monitor_mon-logstash-0_068b025d-5398-4b39-a539-8d60b5076051

        grep -r ERROR ./ | head -n 10

        ./d-logstash-setup/150.log:2035-02-28T18:11:55.780926851Z stdout F 2035-02-28 18:11:55.780 ERROR elastic-services /tmp/files/logstash_setup.py:193 Could not connect to elasticsearch cluster: <Elasticsearch([{'host': 'mon-elasticsearch.central', 'port': 31001, 'use_ssl': True}])>
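
        As a quick manual probe of the connection failure above (an assumed debugging step, not captured in the original logs), the elasticsearch endpoint named in the error can be queried directly from the subcloud:

            curl -kv https://mon-elasticsearch.central:31001
            # -k skips server certificate verification; without it, the expired
            # ssl_ca certificate would be expected to make TLS validation fail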

        ----------------------
        In the sysinv log I see the following error:

        sysinv 2035-02-28 15:58:40.811 3040284 ERROR sysinv.conductor.manager [-] Unexpected error during hook for app platform-integ-apps, error: Cannot load 'relative_timing' in the base class: NotImplementedError: Cannot load 'relative_timing' in the base class

Severity
---------------------------------------------
Critical: System/Feature is not usable after the defect

Steps to Reproduce
---------------------------------------------

    Deploy a subcloud
    Update the subcloud clock to the future (11 years ahead), so that both the certificates and the license expire (see the sketch after this list)
    Trigger rehoming for this subcloud on a target SystemController that also has its clock set to the future
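
A minimal sketch of step 2, assuming the clock is advanced manually with date(1) after stopping the time-sync service (the service name and target date below are illustrative assumptions, not taken from the original report):

    # Stop time synchronization so the manual date change is not reverted
    sudo systemctl stop ntp          # or chronyd/ntpd, depending on the configured service
    # Jump the clock ~11 years ahead so the certificates and license expire
    sudo date -s "2035-02-27 12:00:00"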

Expected Behavior
---------------------------------------------

    The rehome playbook should renew all certificates, with the exception of the docker certificate.
    The playbook should then fail with an error message requesting the user to renew the docker certificate.

Actual Behavior
---------------------------------------------

Subcloud rehome failed due to a logstash pod in a bad state.

Reproducibility
---------------------------------------------

100% reproducible

System Configuration
---------------------------------------------

DC

[sysadmin@controller-1 ~(keystone_admin)]$ cat /etc/build.info
SW_VERSION="22.12"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="2022-12-19_02-22-00"
BUILD_BY="jenkins"
BUILD_NUMBER="50"
BUILD_DATE="2022-12-19 07:22:00 +0000"

Test Activity
---------------------------------------------

Regression Testing

Workaround
---------------------------------------------

no workarounds

Changed in starlingx:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/913829
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/5a304af6e1f424fd5c5e2bba907428bc0d402cfd
Submitter: "Zuul (22348)"
Branch: master

commit 5a304af6e1f424fd5c5e2bba907428bc0d402cfd
Author: Rei Oliveira <email address hidden>
Date: Fri Mar 15 11:40:26 2024 -0300

    Only wait for essential pods in cert recovery

    The certificate recovery role will trigger a restart of every pod
    in the k8s cluster so that they can be updated with the latest
    certificate information.

    After pods restart the procedure waits every pod to recover and become
    READY. This change modifies that behaviour to only wait for essential
    pods to recover, being those in the core namespaces armada,
    cert-manager, flux-helm and kube-system.

    Test case:

    PASS: Run certificate recovery with crashing pods in a custom namespace

    Closes-Bug: 2058751

    Signed-off-by: Rei Oliveira <email address hidden>
    Change-Id: I3ea403a3e324ecbb5f2c1f56d6ce1c8bd80fabee
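
As an illustration of the approach described in the commit message above (a hedged sketch only, not the merged change itself), the readiness wait can be limited to the core namespaces roughly like this:

    # Count not-READY pods, but only in the essential namespaces
    not_ready=0
    for ns in armada cert-manager flux-helm kube-system; do
      n=$(kubectl get po -l '!job-name' -n "$ns" --no-headers \
          -o 'custom-columns=NAME:.metadata.name,READY:.status.containerStatuses[*].ready,NODE:.spec.nodeName' \
          | grep -v calico-node | grep "$(hostname)" | grep -cv true)
      not_ready=$((not_ready + n))
    done
    echo "$not_ready"   # the play can keep retrying until this reaches 0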

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.10.0 stx.config stx.security
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Reinildes Oliveira (rjosemat)