ACC100 pod in pending state when vfDriver:igb_uio is configured using sriov-fec-operator

Bug #2020213 reported by Lucas Wizer da Silva
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Balendu Mouli Burla

Bug Description

Brief Description
-----------------
Cannot bring up a pod using acc100 when vfDriver:igb_uio is configured using sriov-fec-operator. There are no errors applying the config yaml.

If we use "system host-device-modify controller-0 pci_0000_c3_00_0 --driver igb_uio --vf-driver igb_uio -N 16" we can bring up the pod.

Severity
--------
<Minor: System/Feature is usable with minor issue>

Steps to Reproduce
------------------
system application-upload /usr/local/share/applications/helm/sriov-fec-operator-22.12-1.tgz
system application-apply sriov-fec-operator
kubectl apply -f sriov-fec-config.yaml (attachments)

wait until the configuration is done
kubectl get sriovfecnodeconfigs.sriovfec.intel.com -n sriov-fec-system controller-0 -o yaml

create the pod
kubectl create -f acc100.yml (attachments)

check that the pod remains in Pending state
default acc100 0/1 Pending

kubectl describe pod acc100
0/1 nodes are available: 1 Insufficient intel.com/intel_fec_acc100. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod

Expected Behavior
------------------
Pod in running state

Actual Behavior
----------------
Pod in pending state

Reproducibility
---------------
Reproducible

System Configuration
-------------------
One node system

Branch/Pull Time/Commit
-----------------------
BUILD_ID="2022-12-19_02-22-00"

Timestamp/Logs
--------------

- pod

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Warning FailedScheduling 3s (x26 over 120m) default-scheduler 0/1 nodes are available: 1 Insufficient intel.com/intel_fec_acc100. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

- config output

kubectl get sriovfecnodeconfigs.sriovfec.intel.com -n sriov-fec-system controller-0 -o yaml
apiVersion: sriovfec.intel.com/v2
kind: SriovFecNodeConfig
metadata:
  creationTimestamp: "2023-05-19T04:13:31Z"
  generation: 9
  name: controller-0
  namespace: sriov-fec-system
  resourceVersion: "235211"
  uid: 4dd150d9-c59e-4678-904c-51b0339ca751
spec:
  drainSkip: true
  physicalFunctions:
  - bbDevConfig:
      acc100:
        downlink4G:
          aqDepthLog2: 4
          numAqsPerGroups: 16
          numQueueGroups: 0
        downlink5G:
          aqDepthLog2: 4
          numAqsPerGroups: 16
          numQueueGroups: 4
        maxQueueSize: 1024
        numVfBundles: 16
        uplink4G:
          aqDepthLog2: 4
          numAqsPerGroups: 16
          numQueueGroups: 0
        uplink5G:
          aqDepthLog2: 4
          numAqsPerGroups: 16
          numQueueGroups: 4
    pciAddress: 0000:c3:00.0
    pfDriver: igb_uio
    vfAmount: 16
    vfDriver: igb_uio
status:
  conditions:
  - lastTransitionTime: "2023-05-19T13:19:16Z"
    message: Configured successfully
    observedGeneration: 9
    reason: Succeeded
    status: "True"
    type: Configured
  inventory:
    sriovAccelerators:
    - deviceID: 0d5c
      driver: igb_uio
      maxVirtualFunctions: 16
      pciAddress: 0000:c3:00.0
      vendorID: "8086"
      virtualFunctions:
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.0
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.1
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.2
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.3
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.4
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.5
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.6
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.7
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.2
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.3
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.4
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.5
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.6
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:00.7
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.0
      - deviceID: 0d5d
        driver: igb_uio
        pciAddress: 0000:c4:01.1

- sysfs

sysadmin@controller-0:~$ ls /sys/bus/pci/devices/0000:c3:00.0/driver -l
lrwxrwxrwx 1 root root 0 May 19 13:19 /sys/bus/pci/devices/0000:c3:00.0/driver -> ../../../../bus/pci/drivers/igb_uio
sysadmin@controller-0:~$ ls /sys/bus/pci/devices/0000:c4:00.0/driver -l
lrwxrwxrwx 1 root root 0 May 19 13:19 /sys/bus/pci/devices/0000:c4:00.0/driver -> ../../../../bus/pci/drivers/igb_uio
sysadmin@controller-0:~$ ls /sys/bus/pci/devices/0000:c4:00.1/driver -l
lrwxrwxrwx 1 root root 0 May 19 13:19 /sys/bus/pci/devices/0000:c4:00.1/driver -> ../../../../bus/pci/drivers/igb_uio
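
The same check can be scripted for every VF. This is a minimal sketch, assuming the standard Linux sysfs layout shown above, where the `driver` entry of a PCI device is a symlink to the bound driver. The helper name `bound_driver` and its `sysfs_root` parameter are hypothetical and exist only for illustration.

```python
import os

def bound_driver(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the name of the driver bound to a PCI device, or None if unbound."""
    link = os.path.join(sysfs_root, pci_address, "driver")
    try:
        # 'driver' is a symlink such as ../../../../bus/pci/drivers/igb_uio
        return os.path.basename(os.readlink(link))
    except OSError:
        # No symlink: no driver is bound, or the device does not exist
        return None

if __name__ == "__main__":
    for addr in ("0000:c3:00.0", "0000:c4:00.0", "0000:c4:00.1"):
        print(addr, bound_driver(addr))
```

On the system in this report, each address would be expected to report igb_uio, matching the symlinks listed above.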

- sriov-fec-daemonset pod shows that it is binding to igb_uio

{"file":"/workspace-go/pkg/daemon/node_management.go:120","func":"github.com/smart-edge-open/sriov-fec-operator/pkg/daemon.(*NodeConfigurator).bindDeviceToDriver","level":"info","msg":"driver bind path","path":"/sys/bus/pci/drivers/igb_uio/bind","time":"2023-05-19T13:19:12Z"}
{"file":"/workspace-go/pkg/daemon/node_management.go:113","func":"github.com/smart-edge-open/sriov-fec-operator/pkg/daemon.(*NodeConfigurator).bindDeviceToDriver","level":"info","msg":"device's driver_override path","path":"/sys/bus/pci/devices/0000:c4:01.1/driver_override","time":"2023-05-19T13:19:12Z"}
{"file":"/workspace-go/pkg/daemon/node_management.go:120","func":"github.com/smart-edge-open/sriov-fec-operator/pkg/daemon.(*NodeConfigurator).bindDeviceToDriver","level":"info","msg":"driver bind path","path":"/sys/bus/pci/drivers/igb_uio/bind","time":"2023-05-19T13:19:12Z"}
{"file":"/workspace-go/pkg/daemon/device_plugin_controller.go:77","func":"github.com/smart-edge-open/sriov-fec-operator/pkg/daemon.(*devicePluginController).waitForDevicePluginRestart.func1","level":"info","msg":"device-plugin is running","time":"2023-05-19T13:19:16Z"}

- kubectl logs sriov-device-plugin-qwdqc -n sriov-fec-system (igb_uio-sriov-device-plugin.txt)

...
        {
            "resourceName": "intel_fec_acc100",
            "deviceType": "accelerator",
            "selectors": {
                "vendors": ["8086"],
                "devices": ["0d5d"],
                "drivers": ["pci-pf-stub", "vfio-pci"]
            }
        },
...
I0519 13:19:14.847849 1 manager.go:110] Creating new ResourcePool: intel_fec_acc100
I0519 13:19:14.847852 1 manager.go:111] DeviceType: accelerator
I0519 13:19:14.853184 1 manager.go:125] no devices in device pool, skipping creating resource server for intel_fec_acc100
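
The log above suggests why the pod stays Pending: the selector lists only pci-pf-stub and vfio-pci, so VFs bound to igb_uio never enter the resource pool. The following is a simplified Python model of that selection, not the device plugin's actual code. The function `select_devices` and the dictionary layout are assumptions made for illustration.

```python
def select_devices(devices, selectors):
    """Keep devices whose vendor, device ID and driver all appear in the selectors."""
    return [
        d for d in devices
        if d["vendor"] in selectors["vendors"]
        and d["device"] in selectors["devices"]
        and d["driver"] in selectors["drivers"]
    ]

# The 16 ACC100 VFs from the inventory above, all bound to igb_uio
vfs = [{"vendor": "8086", "device": "0d5d", "driver": "igb_uio"} for _ in range(16)]

default_selectors = {
    "vendors": ["8086"],
    "devices": ["0d5d"],
    "drivers": ["pci-pf-stub", "vfio-pci"],
}

pool = select_devices(vfs, default_selectors)
print(len(pool))  # 0: empty pool, so no intel_fec_acc100 resource is advertised

# Hypothetical variant of the selector that also accepts igb_uio
with_igb_uio = dict(default_selectors, drivers=default_selectors["drivers"] + ["igb_uio"])
print(len(select_devices(vfs, with_igb_uio)))  # 16
```

With an empty pool the node advertises no intel.com/intel_fec_acc100 capacity, which is consistent with the "Insufficient intel.com/intel_fec_acc100" scheduling event.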

Test Activity
-------------
Feature Testing

Workaround
----------
1 - Change the vfDriver to vfio-pci if configured using sriov-fec-operator
or
2 - Configure vfDriver igb_uio using "system host-device-modify"

Balendu Mouli Burla (balendu) wrote :

Hi,

As shown in the problem description, the only drivers supported for binding the VF interface are "pci-pf-stub" and "vfio-pci".

This has been the FEC operator's configuration from the beginning. Using the igb_uio driver is not recommended for either PF or VF interfaces.

====================
                {
                    "resourceName": "intel_fec_acc100",
                    "deviceType": "accelerator",
                    "selectors": {
                        "vendors": ["8086"],
                        "devices": ["0d5d"],
                        "drivers": ["pci-pf-stub", "vfio-pci"]
                    }
                },
===================

The FEC Operator was never tested with the igb_uio driver, nor was that in scope.

This configuration is not supported in FEC operator.

Thanks,
Mouli

Balendu Mouli Burla (balendu) wrote :

Hi,

As mentioned above, the default configuration does not support the igb_uio driver for VF interfaces.

However, you can enable this support from the command line before applying the CRD.

Please see the attached document for how to enable igb_uio driver support for the VF interface.

We have tested this configuration on N3000 and ACC100.

SPR-EE (ACC200) will be tested soon; I will let you know the results.

Thanks,
Mouli

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Balendu Mouli Burla (balendu)
Gabriel Francischini (gfrancischini) wrote :

We are now able to bring the acc100 pod up to the Running state with sriov-fec-operator and run the test on it successfully. However, the one-line workaround, which is actually a configmap patch, "goes away" after the pod reaches the Running state. As a result, after a reboot the pod does not come back, because the configmap is outdated (see the log file).

I was only able to get the pod running again by removing the sriov-fec-operator .tgz file and repeating the whole process.

Any ideas on how to make this persist once the pod is running and also across reboots?

Balendu Mouli Burla (balendu) wrote :

Hi,

To enable support for the igb_uio driver on the VF interface, the only way we see in the current version is to patch/update the configMap with kubectl after the FEC Operator is loaded.

This configMap (which provides the config.json file for sriov-network-device-plugin) is regenerated by the FEC Operator manager every time the operator is loaded.

Hence, there is no persistent way to apply this configuration so that the changes survive a node reboot.

Thanks,
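
For illustration, the kind of change such a patch makes to the selector entries can be sketched in Python. This is only a sketch under stated assumptions: it edits an in-memory copy of the resource entry quoted earlier, the helper `add_vf_driver` is hypothetical, and the configMap name, its key, and the kubectl step that applies the result are deliberately left out because they are not given in this report.

```python
import json

def add_vf_driver(resources, resource_name, driver):
    """Return a copy of the resource entries with `driver` added to one selector."""
    patched = json.loads(json.dumps(resources))  # deep copy via JSON round-trip
    for entry in patched:
        drivers = entry["selectors"]["drivers"]
        if entry["resourceName"] == resource_name and driver not in drivers:
            drivers.append(driver)
    return patched

# Resource entry as quoted in the device-plugin configuration
entries = [{
    "resourceName": "intel_fec_acc100",
    "deviceType": "accelerator",
    "selectors": {
        "vendors": ["8086"],
        "devices": ["0d5d"],
        "drivers": ["pci-pf-stub", "vfio-pci"],
    },
}]

patched = add_vf_driver(entries, "intel_fec_acc100", "igb_uio")
print(json.dumps(patched[0]["selectors"]["drivers"]))
# ["pci-pf-stub", "vfio-pci", "igb_uio"]
```

Because the operator regenerates the configMap whenever it is loaded, an edit like this would have to be reapplied after each reload, which matches the lack of persistence described in this comment.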

Ghada Khalil (gkhalil)
tags: added: stx.networking
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0
Balendu Mouli Burla (balendu) wrote :

The SRIOV FEC operator v2.7.1 is now integrated into StarlingX.

In v2.7.1, the igb_uio driver is added to the default list of drivers supported for the VF interface:

====================
                {
                    "resourceName": "intel_fec_acc100",
                    "deviceType": "accelerator",
                    "selectors": {
                        "vendors": ["8086"],
                        "devices": ["0d5d"],
                        "drivers": ["pci-pf-stub", "vfio-pci", "igb_uio"]
                    }
                },
===================

This issue is fixed.

Changed in starlingx:
status: New → Fix Released
Ghada Khalil (gkhalil) wrote :

Reviews for the SRIOV FEC Operator v2.7.1:
* fec-operator 2.7.1 integration: https://review.opendev.org/c/starlingx/app-sriov-fec-operator/+/890364 – merged Aug 3
* image tagging: https://review.opendev.org/c/starlingx/root/+/890934 – merged Aug 10
* helm chart update: https://review.opendev.org/c/starlingx/app-sriov-fec-operator/+/890935 – merged Aug 11
