sriovdp allocatable VF changed unexpected after host interface modify

Bug #1885229 reported by Yang Liu
| Affects | Status | Importance | Assigned to | Milestone |
| StarlingX | Fix Released | Medium | Steven Webster | |

Bug Description

Brief Description
-----------------
sriovdp VF count changed unexpectedly after host interface modify

Severity
--------
Major

Steps to Reproduce
------------------
# system is configured with sriov interfaces like the following:
| 2c9411af-9a1d-4b2d-8059-b075cb5c30af | sriov0 | pci-sriov | ethernet | None | [u'eno3'] | [] | [u'sriovfio'] | MTU=1500 |
| d2d473c0-057e-4873-9d5e-92eca96a7464 | sriovfio | pci-sriov | vf | None | [] | [u'sriov0'] | [] | MTU=1500 |

# interface datanetworks
| controller-0 | b721eb92-0633-493f-8ecb-da162dfd24a2 | sriov0 | group0-data0 |
| controller-0 | e9f1e7db-44a8-4513-a41e-161cd1195fc3 | sriovfio | group0-data1 |

# Allocatable SRIOV VFs via kubectl describe nodes controller-0
"intel.com/pci_sriov_net_group0_data0": "6",
"intel.com/pci_sriov_net_group0_data1": "10",

# lock host, add an additional sriov vfio interface, assign datanetwork and unlock host
system host-if-add controller-0 sriovtest vf sriov0 --num-vfs 1 --vf-driver vfio --ifclass pci-sriov
system interface-datanetwork-assign controller-0 sriovtest sriov-test-datanetwork

# check Allocatable SRIOV VFs via kubectl describe nodes again

TC-name: test_sriovdp_mixed_add_vf_interface[1]

Expected Behavior
------------------
"intel.com/pci_sriov_net_group0_data0": "5",
"intel.com/pci_sriov_net_group0_data1": "11",

Actual Behavior
----------------
"intel.com/pci_sriov_net_group0_data0": "16",
"intel.com/pci_sriov_net_group0_data1": "0",
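
The expected/actual comparison above can be scripted instead of read off `kubectl describe nodes`. The following is a minimal sketch, not part of the test suite referenced in this report: the helper names (`sriov_allocatable`, `diff_counts`, `fetch_node`) are hypothetical, and it assumes the allocatable counts appear under `status.allocatable` in the output of `kubectl get node <name> -o json`.

```python
import json
import subprocess

# Resource-name prefix taken from the values quoted in this report.
SRIOV_PREFIX = "intel.com/pci_sriov_net_"


def sriov_allocatable(node: dict) -> dict:
    """Return {resource_name: count} for SR-IOV resources of a Node object."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {
        name: int(count)
        for name, count in allocatable.items()
        if name.startswith(SRIOV_PREFIX)
    }


def diff_counts(expected: dict, actual: dict) -> dict:
    """Return {resource_name: (expected, actual)} for every mismatch."""
    names = sorted(set(expected) | set(actual))
    return {
        name: (expected.get(name), actual.get(name))
        for name in names
        if expected.get(name) != actual.get(name)
    }


def fetch_node(name: str) -> dict:
    """Fetch a Node object as JSON (hypothetical helper; needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "node", name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)
```

With cluster access, `diff_counts(expected, sriov_allocatable(fetch_node("controller-0")))` returns an empty dict when the allocatable counts match the expected ones, and one entry per mismatched resource otherwise.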

Reproducibility
---------------
Reproducible - happened 2 out of 2 times

System Configuration
--------------------
Simplex
Lab-name: SM-3

Branch/Pull Time/Commit
-----------------------
2020-06-24_22-16-59

Last Pass
---------
2020-06-23_20-00-00

Timestamp/Logs
--------------
https://files.starlingx.kube.cengn.ca/launchpad/1885229

[2020-06-25 21:25:35,961] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-if-add controller-0 sriovtest vf sriov0 --num-vfs 1 --vf-driver vfio --ifclass pci-sriov'

[2020-06-25 21:25:39,803] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne datanetwork-add sriov-test-datanetwork vlan'

[2020-06-25 21:25:42,744] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne interface-datanetwork-assign controller-0 sriovtest sriov-test-datanetwork'

[2020-06-25 21:25:51,511] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[abcd:204::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock controller-0'

Test Activity
-------------
Regression

Yang Liu (yliu12)
summary: - sriovdp VF count changed unexpected after host interface modify
+ sriovdp allocatable VF changed unexpected after host interface modify
Yang Liu (yliu12)
tags: added: stx.retestneeded
description: updated
Ghada Khalil (gkhalil)
tags: added: stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Steven Webster (swebster-wr)
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
tags: added: stx.4.0
Ghada Khalil (gkhalil) wrote :

This is likely the same root-cause as https://bugs.launchpad.net/starlingx/+bug/1850438
stx.4.0 / medium - frequency of this issue has increased recently

Ghada Khalil (gkhalil) wrote :

As per Steve Webster, the workaround is to do a rolling restart on the device plugin:
kubectl --kubeconfig=/etc/kubernetes/admin.conf rollout restart ds -n kube-system kube-sriov-device-plugin-amd64
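
For automated runs, the workaround can be wrapped in a small helper. This is only a sketch: the function names are hypothetical, and the follow-up `kubectl rollout status` wait (with its `120s` timeout) is an addition assumed to be useful here, not something prescribed in this report.

```python
import subprocess

# Values taken from the workaround command above.
KUBECONFIG = "/etc/kubernetes/admin.conf"
NAMESPACE = "kube-system"
DAEMONSET = "kube-sriov-device-plugin-amd64"


def restart_cmd() -> list:
    """Build the rollout-restart command quoted in the workaround."""
    return [
        "kubectl", f"--kubeconfig={KUBECONFIG}",
        "rollout", "restart", "ds", "-n", NAMESPACE, DAEMONSET,
    ]


def status_cmd(timeout: str = "120s") -> list:
    """Build a command that waits for the restarted pods (assumed extra step)."""
    return [
        "kubectl", f"--kubeconfig={KUBECONFIG}",
        "rollout", "status", f"ds/{DAEMONSET}",
        "-n", NAMESPACE, f"--timeout={timeout}",
    ]


def restart_device_plugin() -> None:
    """Restart the SR-IOV device plugin daemonset and wait for the rollout."""
    subprocess.run(restart_cmd(), check=True)
    subprocess.run(status_cmd(), check=True)
```

Once the rollout completes, the allocatable VF counts on the node should be checked again, since the plugin re-inventories the VFs only when it starts.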

OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (master)

Fix proposed to branch: master
Review: https://review.opendev.org/738299

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to config (master)

Fix proposed to branch: master
Review: https://review.opendev.org/738300

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/738299
Committed: https://git.openstack.org/cgit/starlingx/stx-puppet/commit/?id=ca6546f5621980654954504c44f203a03665a461
Submitter: Zuul
Branch: master

commit ca6546f5621980654954504c44f203a03665a461
Author: Steven Webster <email address hidden>
Date: Fri Jun 26 14:05:48 2020 -0400

    Enable SR-IOV device plugin restart

    In an AIO system, it is possible for the kube-system pods, including
    the SR-IOV device plugin to start before the worker manifest finishes
    enabling and binding drivers to network interface VFs.

    Since the device plugin does not periodically (re)scan the PCI bus,
    it is required to restart the plugin after completing the SR-IOV
    driver bind to ensure that the full allocatable set of VFs is
    inventoried.

    Note that this can probably be mitigated in the future when the
    device plugin is converted to use helm / config map rather than
    having puppet write it's /etc/sriovdp/config.json file.

    Change-Id: I7972d7a56c2d38884238f7c7818892d0a5b33a0e
    Closes-Bug: #1885229
    Signed-off-by: Steven Webster <email address hidden>
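
To illustrate the race described in the commit message above: a one-time inventory of VFs depends on which driver each VF is bound to at the moment of the scan. The sketch below is not the device plugin's implementation; it is a hypothetical helper that assumes the standard Linux sysfs layout, in which a PF's device directory contains `virtfnN` entries and each bound VF has a `driver` symlink.

```python
import os
from collections import Counter


def vf_drivers(pf_device_dir: str) -> Counter:
    """Count the VFs under a PF's sysfs device directory by bound driver."""
    counts = Counter()
    for entry in sorted(os.listdir(pf_device_dir)):
        if not entry.startswith("virtfn"):
            continue  # only the virtfnN entries represent VFs
        driver_link = os.path.join(pf_device_dir, entry, "driver")
        if os.path.islink(driver_link):
            # The symlink target's last path component is the driver name.
            counts[os.path.basename(os.readlink(driver_link))] += 1
        else:
            # No driver bound yet, e.g. the bind step has not finished.
            counts["<unbound>"] += 1
    return counts
```

Calling `vf_drivers("/sys/class/net/eno3/device")` before and after the worker manifest finishes binding drivers can return different counts. A plugin that scans only once at startup would keep the earlier result, which is why a restart after the bind is needed.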

Changed in starlingx:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to config (master)

Reviewed: https://review.opendev.org/738300
Committed: https://git.openstack.org/cgit/starlingx/config/commit/?id=ff1be48bf0bc22f75932af5be50b8d395275d4ae
Submitter: Zuul
Branch: master

commit ff1be48bf0bc22f75932af5be50b8d395275d4ae
Author: Steven Webster <email address hidden>
Date: Fri Jun 26 22:57:35 2020 -0400

    Fix N3000 FPGA SR-IOV config for split NICs

    This commit fixes an issue that can occur if a user
    creates an SR-IOV interface of type VF with a parent
    SR-IOV interface that belongs to a NIC on an Intel
    N3000 FPGA.

    The FPGA is reset on every worker node bootup (if present),
    clearing all SR-IOV config. Because of this, the current
    puppet code waits for the reset to be completed before
    restarting the interface via the sysconfig network interface
    scripts. Because the VF interface is a separate hieradata
    entry than the parent interface, this can cause a race
    condition where the device is re-initialized after one of
    the child/parent interfaces has already bound a driver to the
    interface.

    Since the whole point of the VF interfaces is to 'split'
    a NIC to allow multiple SR-IOV VF drivers on one physical NIC,
    this commit makes a single hieradata entry for the parent
    interface, rather than individual entries for the parent and
    child(ren). The information to bind the child interfaces
    appropriately is embedded in the vf_config dict of the parent
    interface.

    Change-Id: Iad34de8ae1b913a1c188e5473e0c92cdf8007ba2
    Partial-Bug: #1885229
    Signed-off-by: Steven Webster <email address hidden>

Yang Liu (yliu12) wrote :

Verification passed on 20200627 load on wcp112 and sm-3 with netdev driver in parent interface and vfio in child.

tags: removed: stx.retestneeded
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-puppet (f/centos8)

Fix proposed to branch: f/centos8
Review: https://review.opendev.org/762919
