SRIOV Interface has MAC address of all 0's after manual reboot

Bug #1900736 reported by Ghada Khalil
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Cole Walker
Milestone: none listed

Bug Description

Brief Description
-----------------
This is a follow-up on https://bugs.launchpad.net/starlingx/+bug/1896631
Even with that fix in stx master, there is still a scenario in which SR-IOV interfaces are not created properly in pods after multiple reboots. In this scenario, the MAC address of the SR-IOV interface is set to all zeros.

Severity
--------
Major

Steps to Reproduce
------------------
- Configure a node with the N3000 FPGA device
- Configure a pod that uses an SR-IOV interface
- Lock/unlock the node (or do a reboot)
- Verify that the correct SR-IOV interface is created in the pod
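
The verification step can be scripted. The following is a minimal Python sketch, not part of the original report, that flags the failure signature described here (a MAC address of all zeros). It assumes it runs where the pod's interfaces are visible under the standard Linux sysfs path `/sys/class/net/<ifname>/address`; the helper names and the interface name `net1` are illustrative assumptions.

```python
import re
from pathlib import Path

# Matches a colon-separated 6-octet MAC address such as 3c:fd:fe:00:00:01.
_MAC_RE = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$")


def is_zero_mac(mac: str) -> bool:
    """Return True if `mac` is a well-formed MAC address whose octets are all zero."""
    mac = mac.strip()
    if not _MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return all(octet == "00" for octet in mac.split(":"))


def read_mac(ifname: str, sysfs_root: str = "/sys/class/net") -> str:
    """Read the MAC address of `ifname` from sysfs (Linux only)."""
    return (Path(sysfs_root) / ifname / "address").read_text().strip()


def interface_looks_broken(ifname: str) -> bool:
    """True if the interface reports the all-zero MAC described in this bug."""
    return is_zero_mac(read_mac(ifname))
```

For example, `interface_looks_broken("net1")` could be called inside the pod after the node comes back up; `net1` is only a placeholder for whatever name the SR-IOV interface has in a given deployment.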

Expected Behavior
------------------
After a reboot, the pod has the proper SR-IOV configuration

Actual Behavior
----------------
After a reboot, the pod does not have the correct SR-IOV interface; the interface's MAC address is all zeros

Reproducibility
---------------
Intermittent; frequency is unknown

System Configuration
--------------------
AIO-SX

Branch/Pull Time/Commit
-----------------------
stx master, but will also be an issue for stx.4.0

Last Pass
---------
Unknown - the issue is related to a race condition

Timestamp/Logs
--------------

Test Activity
-------------

Workaround
----------
Delete / re-launch the pod after the system is up and the initialization sequence is complete
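
A minimal sketch of the workaround, assuming `kubectl` is on the PATH and configured for the cluster. The function names, pod name, and namespace are placeholders and do not come from the original report.

```python
import subprocess
from typing import List


def delete_pod_cmd(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that deletes one pod and waits for it to be gone."""
    return ["kubectl", "delete", "pod", pod, "--namespace", namespace, "--wait=true"]


def relaunch_pod(pod: str, namespace: str = "default") -> None:
    """Delete the pod; raises CalledProcessError if kubectl fails."""
    subprocess.run(delete_pod_cmd(pod, namespace), check=True)
```

A pod owned by a controller (for example a Deployment) is recreated automatically after the delete. A standalone pod has to be re-created from its manifest.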

Ghada Khalil (gkhalil) wrote :

stx.5.0 / high priority - the issue is a race condition affecting pods that use SR-IOV interfaces. It is specific to AIO and there is a workaround. For now, we won't plan a cherry-pick to stx.4.0; we can reconsider if there is a community need in the future.

Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
tags: added: stx.5.0 stx.networking
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
summary: - SRIOV Interface has MAC address of all O's after manual reboot
+ SRIOV Interface has MAC address of all 0's after manual reboot
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: Steven Webster (swebster-wr) → Cole Walker (cwalops)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759481

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/760181

OpenStack Infra (hudson-openstack) wrote : Change abandoned on stx-puppet (master)

Change abandoned by Cole Walker (<email address hidden>) on branch: master
Review: https://review.opendev.org/760181
Reason: Sent to wrong review, should be on https://review.opendev.org/#/c/759481/

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/759481
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=03324f173c366fe89fbe9b372ee8a148f33b45b6
Submitter: Zuul
Branch: master

commit 03324f173c366fe89fbe9b372ee8a148f33b45b6
Author: Cole Walker <email address hidden>
Date: Fri Oct 23 12:24:34 2020 -0400

    Ensure sriovdp is deleted after dev bindings

    This change replaces the daemonset rollout restart command with a more
    specific pod delete command that only runs if there is an
    sriov-device-plugin pod present on the node. Using the pod delete
    command ensures that an existing device-plugin pod is terminated before
    the worker manifest completes. The rollout restart command did not
    ensure that the pod was terminated before the manifest completed and
    could allow user pods to be assigned incorrect VFs if they started up
    before the device-plugin pod terminated.

    This addresses an issue where pods restarted by k8s-pod-recovery could
    be assigned to incorrect VFs if they were started while the
    sriov-device-plugin was shutting down. Waiting for the device-plugin
    to completely terminate before proceeding with pod-recovery ensures that
    the device-plugin will have an accurate view of all device bindings and
    can allocate VFs correctly.

    Closes-Bug: 1900736

    Change-Id: I30fd602208d14ac887d5417fd87f27f23050f670
    Co-Authored-By: Steven Webster <email address hidden>
    Signed-off-by: Cole Walker <email address hidden>
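
The difference between the two approaches in the commit message can be sketched as follows. This is an illustration only: the actual change lives in stx-puppet, and the label `app=sriovdp`, the `kube-system` namespace, the daemonset name passed in, and the timeout are assumptions rather than values taken from this report. `kubectl rollout restart` returns once the restart has been requested, whereas `kubectl delete --wait=true` blocks until the pod is gone.

```python
from typing import List


def rollout_restart_cmd(daemonset: str, namespace: str = "kube-system") -> List[str]:
    """Old approach: request a restart; does not wait for the old pod to terminate."""
    return ["kubectl", "rollout", "restart", f"daemonset/{daemonset}", "-n", namespace]


def list_plugin_pods_cmd(node: str, label: str = "app=sriovdp",
                         namespace: str = "kube-system") -> List[str]:
    """List device-plugin pods on one node, one 'pod/<name>' entry per line."""
    return ["kubectl", "get", "pods", "-n", namespace,
            f"--selector={label}",
            f"--field-selector=spec.nodeName={node}",
            "-o", "name"]


def parse_pod_names(output: str) -> List[str]:
    """Turn 'kubectl get -o name' output into a list of pod names."""
    return [line.split("/", 1)[1] for line in output.splitlines() if "/" in line]


def delete_plugin_pods_cmd(node: str, label: str = "app=sriovdp",
                           namespace: str = "kube-system",
                           timeout_s: int = 360) -> List[str]:
    """New approach: delete the pod(s) on this node and block until they are gone."""
    return ["kubectl", "delete", "pods", "-n", namespace,
            f"--selector={label}",
            f"--field-selector=spec.nodeName={node}",
            "--wait=true", f"--timeout={timeout_s}s"]
```

In this sketch the delete command would only be run when `parse_pod_names` returns a non-empty list, mirroring the commit's "only runs if there is an sriov-device-plugin pod present on the node" condition.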

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919
