Incorrect SR-IOV interface in pod after lock/unlock/reboot

Bug #1896631 reported by Ghada Khalil
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Steven Webster

Bug Description

Brief Description
-----------------
There is a race condition in the system initialization sequence related to the SR-IOV device plugin. This results in an incorrect SR-IOV interface being available in pods after a node lock/unlock or reboot. The issue has been reported with the N3000 FPGA device.

Severity
--------
Major

Steps to Reproduce
------------------
- Configure a node with the N3000 FPGA device
- Configure a pod that uses an SR-IOV interface
- Lock/unlock the node (or do a reboot)
- Verify that the correct SR-IOV interface is available in the pod

Expected Behavior
------------------
After a lock/unlock or a reboot, the pod has the correct SR-IOV interface.

Actual Behavior
----------------
After a lock/unlock or a reboot, the pod does not have the correct SR-IOV interface.

Reproducibility
---------------
Intermittent; frequency is unknown

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
stx master, but will also be an issue for stx.4.0

Last Pass
---------
Unknown - the issue is related to a race condition

Timestamp/Logs
--------------

Test Activity
-------------

Workaround
----------
Delete and re-launch the pod after the system is up and the initialization sequence is complete.
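
For illustration only, a minimal sketch of this workaround using the Kubernetes Python client. The pod name and namespace below are placeholders, and the pod is assumed to be managed by a controller (e.g. a Deployment) that re-creates it after deletion:

    # Hypothetical illustration of the workaround: once the host has finished
    # its initialization sequence, delete the affected pod so its controller
    # re-launches it with the correct SR-IOV interface plugged in.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    POD_NAME = "sriov-app-pod"   # placeholder
    NAMESPACE = "default"        # placeholder

    v1.delete_namespaced_pod(name=POD_NAME, namespace=NAMESPACE)
    print(f"Deleted {NAMESPACE}/{POD_NAME}; waiting for its controller to re-create it")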

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
tags: added: stx.networking
Steven Webster (swebster-wr) wrote:

Note:

This can affect pods on an AIO system using SR-IOV interfaces on an N3000 FPGA device. There is a race between the kubernetes processes coming up after the controller manifest is applied and the application of the worker manifest. The interface in the pod will be seen to 'disappear' after the FPGA device is reset in the worker manifest, and it does not get plugged back in unless the pod is restarted. A full fix would be quite extensive, requiring a new AIO manifest (or a separate kubernetes manifest) to coordinate the bring-up of k8s services with the worker configuration. To mitigate this in the meantime, we could probably plug into the recently introduced pod recovery mechanism; a labeling sketch follows below.
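
As a hedged illustration of how a pod could opt in to such a recovery mechanism (the restart-on-reboot label introduced by the fix below), here is a sketch using the Kubernetes Python client to add that label to a Deployment's pod template. The Deployment name and namespace are assumptions, not part of the actual fix:

    # Hypothetical sketch: label a Deployment's pod template with
    # restart-on-reboot=true so the pods it creates can be restarted by the
    # recovery service after a host reboot. Names below are placeholders.
    from kubernetes import client, config

    config.load_kube_config()
    apps_v1 = client.AppsV1Api()

    patch = {"spec": {"template": {"metadata": {"labels": {"restart-on-reboot": "true"}}}}}

    # Note: changing the pod template triggers a rolling restart of the pods.
    apps_v1.patch_namespaced_deployment(name="sriov-app",     # placeholder
                                        namespace="default",  # placeholder
                                        body=patch)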

OpenStack Infra (hudson-openstack) wrote: Fix proposed to integ (master)

Fix proposed to branch: master
Review: https://review.opendev.org/753410

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.5.0
Ghada Khalil (gkhalil) wrote:

stx.5.0 / high priority - the fix addresses a race condition for pods using SR-IOV interfaces. It is specific to AIO and there is a workaround. For now, we won't plan a cherry-pick to stx.4.0; we can reconsider if there is a community need in the future.

Changed in starlingx:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote: Fix merged to integ (master)

Reviewed: https://review.opendev.org/753410
Committed: https://git.openstack.org/cgit/starlingx/integ/commit/?id=77562993032f89da5295ee9e4141f1665fbb5e9f
Submitter: Zuul
Branch: master

commit 77562993032f89da5295ee9e4141f1665fbb5e9f
Author: Steven Webster <email address hidden>
Date: Tue Sep 22 12:25:32 2020 -0400

    Enable pod restart based on a label

    This commit adds a mechanism to the pod recovery service to restart
    pods based on the restart-on-reboot label.

    This is a mitigation for an issue seen on an AIO system using SR-IOV
    interfaces on an N3000 FPGA device. Since the kubernetes services
    start coming up after the controller manifest has completed, a race
    can happen with the configuration of devices and the SR-IOV device
    plugin in the worker manifest. The symptom of this would be the
    SR-IOV device in the running pod disappearing as the FPGA device is
    reset.

    Notes:

    - The pod recovery service only runs on controller nodes.
    - The raciness between the kubernetes bring-up and worker configuration
      should be fixed in the future by a re-organization of the manifests to
      either have a separate AIO or kubernetes manifest. This would require
      extensive feature work. In the meantime, this mitigation will allow
      pods which experience this issue to recover.

    Change-Id: If84b66b3a632752bd08293105bb780ea8c7cf400
    Closes-Bug: #1896631
    Signed-off-by: Steven Webster <email address hidden>
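
As a rough, hypothetical sketch of the recovery pattern the commit message describes (not the actual StarlingX pod recovery service implementation), restarting pods that carry the restart-on-reboot label after a host boot could look like this with the Kubernetes Python client:

    # Hypothetical sketch of label-driven pod recovery after a host boot.
    # Not the actual StarlingX pod recovery service; for illustration only.
    from kubernetes import client, config

    config.load_kube_config()
    v1 = client.CoreV1Api()

    # Find every pod that has opted in via the restart-on-reboot label.
    pods = v1.list_pod_for_all_namespaces(label_selector="restart-on-reboot=true")

    for pod in pods.items:
        # Deleting the pod lets its controller re-create it, re-attaching the
        # SR-IOV device that was lost when the N3000 FPGA was reset.
        v1.delete_namespaced_pod(name=pod.metadata.name,
                                 namespace=pod.metadata.namespace)
        print(f"Restarted {pod.metadata.namespace}/{pod.metadata.name}")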

Changed in starlingx:
status: In Progress → Fix Released