Simplex: ansible restore failure on timeout on kube-sriov-cni-ds-amd64-bcl75

Bug #1974051 reported by Luis Eduardo Angelini Marquitti
This bug affects 1 person
Affects    Status        Importance  Assigned to               Milestone
StarlingX  Fix Released  Medium      Virginia Martins Perozim

Bug Description

Brief Description
-----------------
The standalone restore on a Simplex system failed during Ansible execution. The Ansible logs report a timeout while waiting for pods/kube-sriov-cni-ds-amd64-bcl75.

Severity
--------
Major

Steps to Reproduce
------------------
Run a backup, then attempt a restore.

Expected Behavior
------------------
The restore completes successfully.

Actual Behavior
----------------
During the restore, the Ansible logs report a timeout while waiting for pods/kube-sriov-cni-ds-amd64-bcl75.

Reproducibility
---------------
100% Reproducible

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
StarlingX master

Last Pass
---------
-

Timestamp/Logs
--------------
-

Test Activity
-------------
-

Workaround
----------
-

OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.7.0 stx.update
Changed in starlingx:
importance: Undecided → Medium
assignee: nobody → Virginia Martins Perozim (vmperozim)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/842451
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/4fd0212f96e4c86d508e974f29c7ac80ca839b08
Submitter: "Zuul (22348)"
Branch: master

commit 4fd0212f96e4c86d508e974f29c7ac80ca839b08
Author: Virginia Martins Perozim <email address hidden>
Date: Wed May 18 21:40:41 2022 -0400

    Delay wait for kubernetes pods be in ready state

    During the execution of k8s-upgrade-networking tasks as part of
    AIO-SX upgrade, the sriov pod changes its name causing the
    subsequent ansible task that verifies each pod in the
    kube_component_list to fail.

       Example:
       $ kubectl --kubeconfig=/etc/kubernetes/admin.conf
                 rollout restart ds -n kube-system
                 kube-sriov-cni-ds-amd64
       $ kubectl get pods -A
       NAMESPACE NAME READY STATUS
       ...
       kube-system kube-sriov-cni-ds-amd64-k5rk7 0/1 Terminating
       $ date
       Wed May 18 12:25:31 UTC 2022
       kubectl --kubeconfig=/etc/kubernetes/admin.conf wait
               --namespace=kube-system
               --for=condition=Ready pods
               --selector app=sriov-cni
               --field-selector spec.nodeName=controller-0
               --timeout=120s
       error: timed out waiting for the condition on
              pods/kube-sriov-cni-ds-amd64-k5rk7
       $ date
       Wed May 18 12:27:35 UTC 2022

       $ kubectl get pods -A
       NAMESPACE NAME READY STATUS
       ...
       kube-system kube-sriov-cni-ds-amd64-w2qlp 1/1 Running
       $ date
       Wed May 18 12:25:40 UTC 2022 <---- running before timeout

    The issue is resolved by moving a wait task further down which
    ensures the k8s pods have adequate time to be ready for the
    verification task in all 3 cases - fresh install, upgrade and B&R.

    Test Plan:

    PASS: AIO-SX upgrade
    PASS: Subcloud upgrade
    PASS: AIO-SX backup and restore
    PASS: AIO-SX system bring up (fresh install)

    Closes-Bug: 1974051
    Signed-off-by: Virginia Martins Perozim <email address hidden>
    Change-Id: I3b80e2ad67221900b1103b7e742d9a5a0586ae2f
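
The example in the commit message shows the wait timing out on the old pod name (k5rk7) even though its replacement (w2qlp) was already Running. The merged fix moves the wait task later in the playbook. The sketch below is not that fix. It is a hypothetical alternative that illustrates the race: a readiness check that re-lists pods by selector on every poll and skips pods that are being deleted. The function names, the polling interval and the kubectl_list helper are assumptions made for illustration; the namespace and selector are copied from the example above.

```python
import json
import subprocess
import time
from typing import Callable, Dict, List


def is_ready(pod: Dict) -> bool:
    """True if the pod reports a Ready condition with status "True"."""
    conditions = pod.get("status", {}).get("conditions", [])
    return any(c.get("type") == "Ready" and c.get("status") == "True"
               for c in conditions)


def is_terminating(pod: Dict) -> bool:
    """A pod that is being deleted carries metadata.deletionTimestamp."""
    return "deletionTimestamp" in pod.get("metadata", {})


def not_ready_pods(pod_list: Dict) -> List[str]:
    """Names of live (non-terminating) pods that are not Ready yet."""
    return [p["metadata"]["name"] for p in pod_list.get("items", [])
            if not is_terminating(p) and not is_ready(p)]


def wait_until_ready(list_pods: Callable[[], Dict],
                     timeout: float = 120.0,
                     interval: float = 5.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until at least one live pod exists and every live pod is Ready.

    Pods are listed again on each iteration, so a pod that was renamed by
    a rollout restart is picked up under its new name.
    """
    deadline = clock() + timeout
    while True:
        pod_list = list_pods()
        live = [p for p in pod_list.get("items", []) if not is_terminating(p)]
        if live and not not_ready_pods(pod_list):
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def kubectl_list() -> Dict:
    """Hypothetical helper: list the sriov-cni pods as a parsed JSON dict."""
    out = subprocess.run(
        ["kubectl", "--kubeconfig=/etc/kubernetes/admin.conf",
         "get", "pods", "--namespace=kube-system",
         "--selector", "app=sriov-cni", "-o", "json"],
        check=True, capture_output=True, text=True).stdout
    return json.loads(out)
```

On a controller this could be called as wait_until_ready(kubectl_list). Moving the wait task, as the commit does, avoids the extra logic entirely, which is likely why it was the preferred change.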

Changed in starlingx:
status: In Progress → Fix Released