The sriov device plugin pod may start before its config manifest is written

Bug #1850438 reported by Steven Webster on 2019-10-29
This bug affects 1 person
Affects: StarlingX
Importance: Medium
Assigned to: Steven Webster

Bug Description

Brief Description
-----------------
If the VF driver of an SR-IOV interface is changed from vfio to netdevice, the SR-IOV device plugin pod can start before the manifest that updates /etc/pcidp/config.json has been applied.

As a result, the SR-IOV device plugin looks for vfio-bound devices and finds none. It is then not possible to launch a pod that uses the SR-IOV interfaces until the SR-IOV device plugin is restarted (or the host is locked and unlocked again).

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------

system host-lock
system host-if-modify <worker> -n sriov0 -c pci-sriov -N <num_vfs> --vf-driver=vfio <interface_uuid>
system host-unlock ... wait for system to come up
system host-lock
system host-if-modify <worker> -n sriov0 -c pci-sriov -N <num_vfs> --vf-driver=netdevice <interface_uuid>
system host-unlock

Expected Behavior
------------------
The /etc/pcidp/config.json file should be updated before the SR-IOV device plugin starts.

Actual Behavior
----------------
It appears the pod started before the file was written. This was confirmed by looking at the device plugin logs.
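Besides reading the logs, one rough way to spot this ordering problem is to compare the modification time of the config file with the start time of the device plugin container. The following is a minimal sketch, assuming the plugin reads /etc/pcidp/config.json only at startup (as this report suggests) and that the start time is available as the RFC 3339 UTC string Kubernetes reports in a container's `startedAt` field. The function name and the timestamp in the usage note are hypothetical.

```python
from datetime import datetime, timezone


def config_written_after_start(config_mtime: float, pod_started_at: str) -> bool:
    """Return True if the config file was modified after the container started.

    config_mtime:   POSIX timestamp of the config file, for example the value
                    of os.path.getmtime("/etc/pcidp/config.json").
    pod_started_at: container start time as an RFC 3339 UTC string with
                    whole seconds, for example "2019-10-29T14:03:07Z"
                    (an illustrative value, not taken from the report).
    """
    started = datetime.strptime(pod_started_at, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    # A config file newer than the running container means the plugin
    # may have loaded the previous (stale) contents.
    return config_mtime > started.timestamp()
```

If this returns True for the running device plugin pod, the pod probably read the old config and needs a restart. The check is only a heuristic: it says nothing about whether the file contents actually changed.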

Reproducibility
---------------
Seen once (so far)

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------
master
BUILD_DATE="2019-10-24 20:01:49 -0400"

Last Pass
---------
I believe this is the first time this has been seen.

Workaround
----------
Delete the SR-IOV device plugin pod, or lock and unlock the host.
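The pod-deletion workaround can be scripted. The sketch below only builds the `kubectl delete pod` command line; it assumes the device plugin runs as a DaemonSet, so the deleted pod is recreated and rereads the config file. The namespace `kube-system` and the label `app=sriovdp` are assumptions, not values taken from this report, and should be checked against the actual deployment.

```python
import shlex
from typing import List


def build_restart_command(
    namespace: str = "kube-system",   # assumed namespace
    selector: str = "app=sriovdp",    # assumed pod label
) -> List[str]:
    """Build a kubectl command that deletes the device plugin pod(s).

    Deleting a DaemonSet-managed pod causes the controller to recreate it,
    which restarts the plugin.
    """
    return [
        "kubectl", "delete", "pod",
        "--namespace", namespace,
        "--selector", selector,
    ]


if __name__ == "__main__":
    cmd = build_restart_command()
    # Print the command for review instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The command is printed rather than executed so it can be reviewed first; passing the list to `subprocess.run(cmd, check=True)` would execute it.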

Test Activity
-------------
Developer testing

Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - would be nice to fix to avoid an extra lock/unlock for the host

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Steven Webster (swebster-wr)
tags: added: stx.3.0 stx.containers stx.networking
Changed in starlingx:
status: Triaged → In Progress
Steven Webster (swebster-wr) wrote :

Just a note that this only affects an AIO system. What's happening is that the pods start up after the Kubernetes master manifest is applied, but the config file is written by the worker manifest. Sometimes the pods are up and running just before the config file is written.

We'll want to change things so that the device plugin uses a ConfigMap, and have Helm apply it and restart the pod with the updated config.
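One common way to make a pod restart when its ConfigMap changes is to put a checksum of the config into a pod-template annotation, so that any change to the config also changes the pod spec and triggers a rollout. The sketch below illustrates only the checksum comparison behind that idea. It is not the fix that was implemented for this bug, and the config keys used in the usage note are hypothetical placeholders, not the real device plugin schema.

```python
import hashlib
import json


def config_checksum(config: dict) -> str:
    """Return a SHA-256 hex digest of the config in a canonical JSON form.

    Sorting keys and fixing separators makes the digest independent of
    key order and whitespace.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_restart(running_checksum: str, desired_config: dict) -> bool:
    """Return True if the running pod was started with a different config."""
    return running_checksum != config_checksum(desired_config)
```

For example, with placeholder configs `{"driver": "vfio"}` and `{"driver": "netdevice"}`, `needs_restart(config_checksum(old), new)` is True, while comparing a config against its own checksum gives False.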

Ghada Khalil (gkhalil) on 2019-11-21
Changed in starlingx:
status: In Progress → Triaged
Ghada Khalil (gkhalil) wrote :

This issue is highly intermittent, only affects a subset of configs (AIO only), and has a workaround. Therefore, we are not going to gate stx.3.0 for it.

tags: added: stx.4.0
removed: stx.3.0