User application pod stuck in Container Creating state after upgrade

Bug #1999074 reported by Andre Kantek
Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Andre Kantek

Bug Description

Brief Description

The customer had to manually delete a pod that was stuck in the ContainerCreating state after a platform upgrade.

The issue was seen BEFORE any K8s upgrade was attempted.

Severity

Major - Manual intervention is required to delete the POD so the deployment can create a new one.

Steps to Reproduce

A pod requesting a single SRIOV interface is running on an AIO_SX_Subcloud on the previous version.
The operator upgraded the subcloud and found the pod stuck in the ContainerCreating state. After the system had been left in that state for 2.5 hours, the customer deleted the pod and a new pod was created.

Expected Behavior

Pods controlled by a deployment/replicaset and labelled "restart-on-reboot: true" should be recreated after the CNI plugin reset.

        labels:
          app: adpf29991502555-dip0
          nename: adpf29991502555
          release: samsungadpf-29991502555
          restart-on-reboot: "true"
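
For illustration only, a minimal sketch of how pods carrying this label and still waiting in ContainerCreating could be picked out of `kubectl get pods -A -o json` style output. The function name, the namespace "example-ns" and the sample data are hypothetical; this is not the k8s-pod-recovery implementation.

```python
def stuck_restart_on_reboot_pods(pod_list):
    """Return (namespace, name) for labelled pods still waiting in ContainerCreating."""
    stuck = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        labels = meta.get("labels") or {}
        # Only pods that opted in via the label are candidates for a restart.
        if labels.get("restart-on-reboot") != "true":
            continue
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        reasons = [
            (cs.get("state", {}).get("waiting") or {}).get("reason")
            for cs in status.get("containerStatuses") or []
        ]
        if "ContainerCreating" in reasons:
            stuck.append((meta.get("namespace"), meta.get("name")))
    return stuck


# Hypothetical sample data shaped like the Kubernetes Pod JSON representation.
SAMPLE = {
    "items": [
        {
            "metadata": {
                "namespace": "example-ns",
                "name": "adpf29991502555-dip0-fake-c5c6b75f9-7khvp",
                "labels": {"restart-on-reboot": "true"},
            },
            "status": {
                "phase": "Pending",
                "containerStatuses": [
                    {"state": {"waiting": {"reason": "ContainerCreating"}}}
                ],
            },
        },
        {
            "metadata": {
                "namespace": "example-ns",
                "name": "healthy-pod",
                "labels": {"restart-on-reboot": "true"},
            },
            "status": {
                "phase": "Running",
                "containerStatuses": [{"state": {"running": {}}}],
            },
        },
    ]
}

if __name__ == "__main__":
    print(stuck_restart_on_reboot_pods(SAMPLE))
```

Pods returned by such a check are the ones an operator would otherwise have to delete by hand so that the deployment can recreate them.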

Actual Behavior

2022-11-01T16:53:57Z adpf29991502555-dip0-fake-c5c6b75f9-7khvp Pod network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized NetworkNotReady Warning

Followed by non-stop error messages:

2022-11-01T16:55:30Z adpf29991502555-dip0-fake-c5c6b75f9-7khvp Pod Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "e66c60537d9af5fd387be4eb1d9c91e50977eb4f2381b4f543fa8366dec1f66b": Multus: [welktxef-931883vzwcvdu-y-ss-ls6-00000000010/adpf29991502555-dip0-fake-c5c6b75f9-7khvp]: error adding container to network "adpf-f1c-nad": delegateAdd: error invoking DelegateAdd - "sriov": error in getting result from AddNetwork: SRIOV-CNI failed to load netconf: LoadConf(): VF pci addr is required FailedCreatePodSandBox Warning

Reproducibility

Although the pod entered the ContainerCreating state during the upgrade (with the same error message, "VF pci addr is required"), it was recreated by the deployment as expected.

System Configuration

AIO_SX_Subcloud

Andre Kantek (akantek)
Changed in starlingx:
assignee: nobody → Andre Kantek (akantek)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/integ/+/866878

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
summary: - Customer application pod stuck in Container Creating state after upgrade
+ User application pod stuck in Container Creating state after upgrade
tags: added: stx.8.0 stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to integ (master)

Reviewed: https://review.opendev.org/c/starlingx/integ/+/866878
Committed: https://opendev.org/starlingx/integ/commit/e3705e6046fed3f32041f3751f81fe27e3680bec
Submitter: "Zuul (22348)"
Branch: master

commit e3705e6046fed3f32041f3751f81fe27e3680bec
Author: Andre Fernando Zanella Kantek <email address hidden>
Date: Wed Dec 7 06:55:44 2022 -0500

    Execute one extra attempt to restore SRIOV device plugin

    The service k8s-pod-recovery failed to restore the SRIOV device
    plugin, necessary for pods that use SRIOV interfaces to create the
    resource, those pods need to add the label 'restart-on-reboot=true'
    to be restarted during boot. The failure was observed during an
    upgrade, and although rare, it left the operator to actuate by
    manually restarting the pods later.

    This change adds a wait for the pod stabilization (it is considered
    stable when stops the state transitions) and, if still in failure,
    execute 2 attempts to restore the plugin. Logs were added to better
    register the pod state in case of an error.

    Test Plan:
    [PASS] execute 7 upgrades in an AIO-SX lab

    Closes-Bug: 1999074

    Signed-off-by: Andre Fernando Zanella Kantek <email address hidden>
    Change-Id: I838c35d3e0a3557c71344945a8e00f22ccb50eb4
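
The logic described in the commit message (wait until the pod stops changing state, then retry the restore while it is still failing) can be sketched roughly as follows. This is an illustrative sketch, not the merged change: `get_state` and `restore` are hypothetical callables supplied by the caller, and the "Running" success state and the polling parameters are assumptions.

```python
import time


def wait_until_stable(get_state, checks=3, interval=0.0, max_polls=60, sleep=time.sleep):
    """Poll get_state() until it returns the same value `checks` times in a row."""
    last = get_state()
    same = 1
    polls = 1
    while same < checks and polls < max_polls:
        sleep(interval)
        current = get_state()
        polls += 1
        if current == last:
            same += 1
        else:
            # A state transition was seen, so restart the stability count.
            last, same = current, 1
    return last


def recover_plugin(get_state, restore, attempts=2, **wait_kwargs):
    """Wait for a stable state, then try restore() up to `attempts` times."""
    state = wait_until_stable(get_state, **wait_kwargs)
    for attempt in range(1, attempts + 1):
        if state == "Running":
            return True
        # Log the observed state to help diagnose a failed recovery.
        print(f"attempt {attempt}: plugin pod state is {state!r}, restoring")
        restore()
        state = wait_until_stable(get_state, **wait_kwargs)
    return state == "Running"
```

Passing the state query and the restore action as callables keeps the sketch independent of how the device plugin pod is actually inspected or restarted on the platform.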

Changed in starlingx:
status: In Progress → Fix Released