Incorrect SR-IOV interface in pod after lock/unlock/reboot
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Steven Webster |
Bug Description
Brief Description
-----------------
There is a race condition in the system initialization sequence related to the sriov device plugin. This results in an incorrect SR-IOV interface being available in pods after a node lock/unlock or reboot. This issue is reported with the N3000 FPGA device.
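As background, the sriov device plugin is what advertises the SR-IOV VFs to kubernetes as node resources. A minimal sketch of checking what the plugin is currently advertising on the node, using the kubernetes Python client (the node name controller-0 and the resource name intel.com/intel_sriov_netdevice are assumptions that depend on the deployment):

```python
# Minimal sketch: list what the sriov device plugin is currently advertising.
# Assumptions: node name "controller-0" (typical for AIO-SX) and resource name
# "intel.com/intel_sriov_netdevice"; both depend on the actual configuration.
from kubernetes import client, config

config.load_kube_config()          # or load_incluster_config() when run in a pod
v1 = client.CoreV1Api()

node = v1.read_node("controller-0")
allocatable = node.status.allocatable or {}

resource = "intel.com/intel_sriov_netdevice"
print(f"{resource}: {allocatable.get(resource, 'not advertised')}")
# If the device plugin has not re-registered yet (the race described above),
# the resource is missing or reports 0 even though VFs exist on the host.
```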
Severity
--------
Major
Steps to Reproduce
------------------
- Configure a node with the N3000 FPGA device
- Configure a pod that uses an SR-IOV interface (see the sketch after this list)
- Lock/unlock the node (or do a reboot)
- Verify that the correct SR-IOV interface is available in the pod
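For the pod configuration step above, a minimal sketch using the kubernetes Python client, assuming a NetworkAttachmentDefinition named sriov-net and a device plugin resource named intel.com/intel_sriov_netdevice (both placeholders for whatever is configured on the system under test):

```python
# Minimal sketch of a pod that requests one SR-IOV VF via Multus. Assumptions:
# a NetworkAttachmentDefinition "sriov-net" and a device plugin resource
# "intel.com/intel_sriov_netdevice" already exist; substitute the names
# configured on the system under test.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "sriov-test-pod",
        "annotations": {
            # Multus annotation that attaches the SR-IOV network to the pod
            "k8s.v1.cni.cncf.io/networks": "sriov-net",
        },
    },
    "spec": {
        "containers": [{
            "name": "test",
            "image": "busybox",
            "command": ["sleep", "86400"],
            "resources": {
                "requests": {"intel.com/intel_sriov_netdevice": "1"},
                "limits": {"intel.com/intel_sriov_netdevice": "1"},
            },
        }],
    },
}

v1.create_namespaced_pod(namespace="default", body=pod_manifest)
```

Verification can then be done by checking for the SR-IOV interface inside the pod (typically net1 for the first attached network); see the recovery sketch near the end of this report.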
Expected Behavior
------------------
After a lock/unlock or a reboot, the pod has the correct SR-IOV interface.
Actual Behavior
----------------
After a lock/unlock or a reboot, the pod does not have the correct SR-IOV interface.
Reproducibility
---------------
Intermittent; frequency is unknown
System Configuration
--------------------
AIO-SX
Branch/Pull Time/Commit
-----------------------
stx master, but will also be an issue for stx.4.0
Last Pass
---------
Unknown - the issue is related to a race condition
Timestamp/Logs
--------------
Test Activity
-------------
Workaround
----------
Delete and re-launch the pod after the system is up and the initialization sequence is complete.
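A minimal sketch of this workaround with the kubernetes Python client, assuming placeholder pod/namespace names:

```python
# Minimal sketch of the workaround, assuming the affected pod is named
# "sriov-test-pod" in the "default" namespace (placeholders): delete the pod
# once the node is unlocked and initialization has finished, then re-create it
# so the device plugin injects the correct SR-IOV VF again.
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

name, namespace = "sriov-test-pod", "default"

v1.delete_namespaced_pod(name=name, namespace=namespace)

# Wait for the old pod to terminate before re-creating it.
while v1.list_namespaced_pod(namespace, field_selector=f"metadata.name={name}").items:
    time.sleep(2)

# Re-create the pod (pod_manifest as in the reproduction sketch above).
# v1.create_namespaced_pod(namespace=namespace, body=pod_manifest)
```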
Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
tags: added: stx.networking
tags: added: stx.5.0
Note:
This can affect pods on an AIO system using SR-IOV interfaces on an N3000 FPGA device. There is a race between the kubernetes processes coming up after the controller manifest is applied and the application of the worker manifest. The interface in the pod is seen to 'disappear' after the FPGA device is reset in the worker manifest, and it does not get plugged back in unless the pod is restarted.

The fix for this would be quite extensive, requiring the creation of a new AIO manifest, or a separate kubernetes manifest, to coordinate the bring-up of k8s services and the worker configuration. To mitigate this in the meantime, we could probably plug into the recently introduced pod recovery mechanism.
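As an illustration of how such a recovery check might detect the symptom (not the actual StarlingX implementation), a minimal sketch that looks for the expected SR-IOV interface inside the pod, assuming it appears as net1 and using placeholder pod/namespace names:

```python
# Illustrative sketch only: detect the symptom described above by checking
# whether the SR-IOV interface is still present inside the pod. Assumes the
# interface shows up as "net1" (the usual name for the first attached network)
# and placeholder pod/namespace names.
from kubernetes import client, config
from kubernetes.stream import stream

config.load_kube_config()
v1 = client.CoreV1Api()

def sriov_interface_present(pod: str, namespace: str, iface: str = "net1") -> bool:
    """Return True if the expected SR-IOV interface exists inside the pod."""
    cmd = ["sh", "-c", f"ip link show {iface} >/dev/null 2>&1 && echo present || echo absent"]
    out = stream(
        v1.connect_get_namespaced_pod_exec,
        pod,
        namespace,
        command=cmd,
        stderr=True, stdin=False, stdout=True, tty=False,
    )
    return out.strip() == "present"

if not sriov_interface_present("sriov-test-pod", "default"):
    # This is where a recovery mechanism would delete/re-launch the pod,
    # as in the workaround above.
    print("SR-IOV interface missing; pod needs to be restarted")
```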