The sriov device plugin pod may start before its config manifest is written

Bug #1850438 reported by Steven Webster
This bug affects 1 person
Affects    Status   Importance  Assigned to     Milestone
StarlingX  Triaged  Low         Steven Webster

Bug Description

Brief Description
-----------------
If an SR-IOV interface VF driver is changed from vfio to netdevice, it's possible that the SR-IOV device plugin pod can start before the manifest is applied to change the /etc/pcidp/config.json file appropriately.

As a result, the SR-IOV device plugin looks for vfio-bound devices and does not find any. It is then not possible to launch a pod that uses the SR-IOV interfaces until the SR-IOV device plugin is restarted (or the host is locked and unlocked again).

Severity
--------
Major: System/Feature is usable but degraded

Steps to Reproduce
------------------

system host-lock
system host-if-modify <worker> -n sriov0 -c pci-sriov -N <num_vfs> --vf-driver=vfio <interface_uuid>
system host-unlock ... wait for system to come up
system host-lock
system host-if-modify <worker> -n sriov0 -c pci-sriov -N <num_vfs> --vf-driver=netdevice <interface_uuid>
system host-unlock

Expected Behavior
------------------
The /etc/pcidp/config.json should be updated before the SR-IOV device plugin starts
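
One way to picture the expected ordering is a guard that refuses to proceed until the config file has been rewritten. The following is only a minimal sketch under stated assumptions, not how StarlingX or the device plugin implements startup: the function name `wait_for_fresh_config` and the idea of comparing the file's modification time against a reference timestamp (for example, the time of the unlock) are hypothetical.

```python
import os
import time


def wait_for_fresh_config(path, not_before, timeout=30.0, poll=0.5):
    """Wait until `path` exists, is non-empty, and was modified at or after
    `not_before` (a Unix timestamp). Returns True on success, False if the
    timeout expires first.

    Hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            st = os.stat(path)
            # A stale file left over from the previous configuration has an
            # older mtime and is rejected here.
            if st.st_size > 0 and st.st_mtime >= not_before:
                return True
        except FileNotFoundError:
            pass  # the manifest has not written the file yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

A caller could pass `/etc/pcidp/config.json` as `path` and start the plugin only when the function returns True. The mtime check matters because in this bug the file already exists with the old (vfio) content, so a plain existence check would not catch the race.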

Actual Behavior
----------------
It appears the pod has started before the file is written. This was confirmed by looking at the logs of the device plugin.

Reproducibility
---------------
Seen once (so far)

System Configuration
--------------------
One node system

Branch/Pull Time/Commit
-----------------------
master
BUILD_DATE="2019-10-24 20:01:49 -0400"

Last Pass
---------
I believe this is the first time this has been seen

Workaround
---------
Delete the sriov device plugin pod
Or
lock/unlock the host
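
The first workaround can be scripted. The sketch below only builds the `kubectl delete pod` command; the namespace `kube-system` and the label selector `app=sriovdp` are assumptions that do not appear in this report and should be checked against the actual deployment before use.

```python
import shlex


def delete_plugin_pod_cmd(namespace, selector):
    """Build a kubectl command that deletes the pods matching `selector`.

    When the pod is managed by a controller such as a DaemonSet, deleting it
    causes a replacement pod to be created, which re-reads its config file.
    """
    return [
        "kubectl", "delete", "pod",
        "--namespace", namespace,
        "--selector", selector,
    ]


# ASSUMPTION: namespace and label are illustrative, not taken from the report.
cmd = delete_plugin_pod_cmd("kube-system", "app=sriovdp")
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing `cmd` to `subprocess.run(cmd, check=True)` would execute it on a host where `kubectl` is configured.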

Test Activity
-------------
Developer testing

Ghada Khalil (gkhalil) wrote :

Marking as stx.3.0 / medium priority - would be nice to fix to avoid an extra lock/unlock for the host

Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Steven Webster (swebster-wr)
tags: added: stx.3.0 stx.containers stx.networking
Changed in starlingx:
status: Triaged → In Progress
Steven Webster (swebster-wr) wrote :

Just a note that this only affects an AIO system. What's happening is that the pods are starting up after the kubernetes master manifest is applied, but the config file is written by the worker manifest. Sometimes the pods will get up and running just before the config file is written.

We'll want to change things to have the device plugin use a config map and have helm apply and restart the pod with updated config.
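
The "restart the pod with updated config" idea is commonly realised by comparing a checksum of the desired config against the checksum the running pod was started with. The following is a minimal sketch of that comparison, assuming a hypothetical config schema (the `resourceName` and `driver` keys are illustrative and not the plugin's documented format); it is not the StarlingX implementation.

```python
import hashlib
import json


def config_checksum(config):
    """Return a stable SHA-256 hex digest of a parsed config.

    Keys are sorted so that two configs with the same content but different
    key order produce the same digest.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def needs_restart(running_checksum, desired_config):
    """True when the running pod was started with a different config."""
    return running_checksum != config_checksum(desired_config)


# Hypothetical configs: the VF driver changes from vfio to netdevice.
old_config = {"resourceList": [{"resourceName": "sriov0", "driver": "vfio"}]}
new_config = {"resourceList": [{"resourceName": "sriov0", "driver": "netdevice"}]}

running = config_checksum(old_config)
print(needs_restart(running, new_config))
```

Storing such a checksum as a pod template annotation is a common Helm pattern: when the config map content changes, the annotation changes and the pods are rolled, which removes the dependence on manifest ordering.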

Ghada Khalil (gkhalil)
Changed in starlingx:
status: In Progress → Triaged
Ghada Khalil (gkhalil) wrote :

This issue is highly intermittent, only affects a subset of configs (AIO only), and has a workaround. Therefore, we are not going to gate stx.3.0 for it.

tags: added: stx.4.0
removed: stx.3.0
Ghada Khalil (gkhalil) wrote :

As per Steve Webster, he hasn't been seeing this issue in recent testing. The race condition still exists, but is happening at a much lower frequency.

Ghada Khalil (gkhalil) wrote :

As per discussion with Steve Webster, we agreed to lower the priority of this issue for the following reasons:
- The issue is highly intermittent
- There is a workaround
- The scope of changes to transition the device plugin to a helm application is quite large

tags: removed: stx.4.0
Changed in starlingx:
importance: Medium → Low
Ghada Khalil (gkhalil) wrote :

The race condition between the device plugin starting and the interface puppet config completing has been seen more frequently in recent stx master loads, as reported in: https://bugs.launchpad.net/starlingx/+bug/1885229

A fix was merged for the above LP to restart the device plugin after completing the SR-IOV driver bind to ensure that the full allocatable set of VFs is inventoried.

This LP will be left open for longer term consideration of transitioning the plugin to helm.
