AIO-SX Upgrade : kube-sriov-device-plugin-amd64 failed after ansible run

Bug #1978344 reported by Luis Eduardo Angelini Marquitti
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Luis Eduardo Angelini Marquitti
Milestone: -

Bug Description

Brief Description
-----------------
During an AIO-SX upgrade, the kube-sriov-device-plugin-amd64 pod fails to start after the Ansible run completes.

Severity
--------
Major

Steps to Reproduce
------------------
1. Install Simplex
2. Follow simplex standalone upgrade steps
3. The Ansible run finishes, but kube-sriov-device-plugin-amd64 is not running.

Expected Behavior
------------------
The Ansible run finishes and kube-sriov-device-plugin-amd64 is running.

Actual Behavior
----------------
The Ansible run finishes, but kube-sriov-device-plugin-amd64 is not running.

Reproducibility
---------------
100% reproducible on a standalone Simplex system

System Configuration
--------------------
simplex - standalone

Branch/Pull Time/Commit
-----------------------
-

Last Pass
---------
-

Timestamp/Logs
--------------
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Scheduled 20h default-scheduler Successfully assigned kube-system/kube-sriov-device-plugin-amd64-9vbbt to controller-0
  Normal Created 20h (x4 over 20h) kubelet Created container kube-sriovdp
  Normal Started 20h (x4 over 20h) kubelet Started container kube-sriovdp
  Normal Pulled 20h (x4 over 20h) kubelet Container image "registry.local:9001/ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.3.2" already present on machine
  Warning BackOff 20h (x6 over 20h) kubelet Back-off restarting failed container
  Normal Started 20h kubelet Started container kube-sriovdp
  Normal SandboxChanged 20h kubelet Pod sandbox changed, it will be killed and re-created.
  Normal Pulled 20h kubelet Container image "registry.local:9001/ghcr.io/k8snetworkplumbingwg/sriov-network-device-plugin:v3.3.2" already present on machine
  Normal Created 20h kubelet Created container kube-sriovdp
  Warning FailedMount 4h21m (x88 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[log config device-info kube-api-access-d5xcg devicesock]: timed out waiting for the condition
  Warning FailedMount 82m (x84 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[config device-info kube-api-access-d5xcg devicesock log]: timed out waiting for the condition
  Warning FailedMount 20m (x86 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[device-info kube-api-access-d5xcg devicesock log config]: timed out waiting for the condition
  Warning FailedMount 11m (x154 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[devicesock log config device-info kube-api-access-d5xcg]: timed out waiting for the condition
  Warning FailedMount 6m39s (x581 over 19h) kubelet MountVolume.SetUp failed for volume "config" : hostPath type check failed: /etc/pcidp/config.json is not a file
  Warning FailedMount 34s (x83 over 19h) kubelet Unable to attach or mount volumes: unmounted volumes=[config], unattached volumes=[kube-api-access-d5xcg devicesock log config device-info]: timed out waiting for the condition
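
The "hostPath type check failed: /etc/pcidp/config.json is not a file" event suggests that the path is either missing or is not a regular file on the host. A minimal Python sketch for checking this on the controller follows. The helper name hostpath_state is hypothetical, and the check only approximates the condition; it is not the kubelet's own implementation.

```python
import os


def hostpath_state(path):
    """Classify a host path roughly the way a 'File' hostPath check cares about.

    Hypothetical helper for diagnosis only; not taken from kubelet code.
    """
    if not os.path.exists(path):
        return "missing"
    if os.path.isfile(path):
        return "file"
    if os.path.isdir(path):
        return "directory"
    return "other"


if __name__ == "__main__":
    # Path taken from the FailedMount event above.
    print(hostpath_state("/etc/pcidp/config.json"))
```

Any result other than "file" would be consistent with the FailedMount event shown in the log.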

Test Activity
-------------
Upgrade Regression Testing

Workaround
----------
Manually restore the /etc/pcidp/config.json file from the backup tarball before running upgrade_platform.yml.
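
The manual restore could be scripted roughly as below. This is a sketch under stated assumptions: the member path inside the backup tarball (etc/pcidp/config.json, optionally prefixed with ./ or /) is assumed and not confirmed by this report, and the function name restore_sriov_config is hypothetical. It is not the change that was merged into ansible-playbooks.

```python
import posixpath
import shutil
import tarfile
from pathlib import Path

# Assumption: the backup tarball stores the file under this relative path.
SRIOV_CONFIG = "etc/pcidp/config.json"


def restore_sriov_config(backup_tarball, dest_root="/"):
    """Copy the SR-IOV config.json out of the backup tarball.

    Returns True if the file was found and written, False otherwise.
    """
    target = Path(dest_root) / SRIOV_CONFIG
    with tarfile.open(backup_tarball) as tar:
        for member in tar:
            # Normalize names such as "./etc/..." or "/etc/..." before comparing.
            name = posixpath.normpath(member.name).lstrip("/")
            if name == SRIOV_CONFIG and member.isfile():
                target.parent.mkdir(parents=True, exist_ok=True)
                with tar.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return True
    return False
```

Calling restore_sriov_config with the path to the backup tarball and the default dest_root writes the file to /etc/pcidp/config.json. A return value of False means the tarball did not contain the file under the assumed path.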

Changed in starlingx:
assignee: nobody → Luis Eduardo Angelini Marquitti (leduard1)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/845612
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/8ae173369e1b304632e4828bca8bacadf44ed08f
Submitter: "Zuul (22348)"
Branch: master

commit 8ae173369e1b304632e4828bca8bacadf44ed08f
Author: Luis Eduardo Angelini Marquitti <email address hidden>
Date: Mon Jun 13 13:24:48 2022 -0400

    Adding SRIOV config.json file restore from backup

    During the upgrade process, the kube-sriov-device-plugin-amd64 pod
    needs the /etc/pcidp/config.json file which is not restored from backup
    during the upgrade process and is only created on the next unlock,
    causing the pod stays in 'ContainerCreating' state until unlock is
    executed.
    This change checks if the config.json file is present in the backup
    tarball and restores it if the bootstrap process is in restore mode.

    Test Plan:

    PASS: AIO-SX Upgrade in a SRIOV enabled hardware
    PASS: Check /etc/pcidp/config.json file is restored from backup
    PASS: Check pod kube-sriov-device-plugin-amd64 in Running state
    before the unlock

    Closes-Bug: 1978344

    Signed-off-by: Luis Eduardo Angelini Marquitti <email address hidden>
    Change-Id: Icd696d958836c0d837409481abdb531969cd366e

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.7.0 stx.config stx.update