dual port SRIOV NIC with 64 VFs per PF is not configured with switchdev eswitch mode

Bug #1981721 reported by Itai Levy
This bug affects 1 person
Affects: netplan.io (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

Trying to deploy Charmed OpenStack (Yoga) Jammy series with OVN Hardware Offload.

# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04 LTS"

# uname -a
Linux node3 5.15.0-41-generic #44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/openstack-release
OPENSTACK_CODENAME=yoga

As part of the charms bundle the following config is used:
  ovn-chassis:
    charm: ch:ovn-chassis
    # Please update the `bridge-interface-mappings` to values suitable for the
    # hardware used in your deployment. See the referenced documentation at the
    # top of this file.
    options:
      ovn-bridge-mappings: tenantvlan:br-nvda
      bridge-interface-mappings: br-nvda:bond0
      enable-hardware-offload: true
      sriov-numvfs: "ens1f0:64 ens1f1:64"
    channel: 22.03/stable
    bindings:
      "": *internal-space
      data: *overlay-space

This is translated to the following netplan file on the deployed node:
# cat /etc/netplan/150-charm-ovn.yaml
###############################################################################
# [ WARNING ]
# Configuration file maintained by Juju. Local changes may be overwritten.
# Config managed by ovn-chassis charm
###############################################################################
network:
  version: 2
  ethernets:
    ens1f0:
      virtual-function-count: 64
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true

    ens1f1:
      virtual-function-count: 64
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true

After rebooting the deployed servers, the SR-IOV VFs are enabled on the NVIDIA NIC, but the embedded-switch-mode is not set to "switchdev". According to the logs, the `netplan apply --sriov-only` command run by udev fails.

#lspci | grep Virtual | wc -l
129

# devlink dev eswitch show pci/0000:08:00.0
pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic

NOTE: When using 50 VFs or below, the switchdev configuration is successful.
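
When comparing runs with different VF counts, it can help to read the eswitch mode programmatically instead of by eye. The following is a minimal sketch, assuming the output of `devlink dev eswitch show` keeps the single-line "device: key value key value ..." layout shown above. The helper name `parse_eswitch_show` is hypothetical and not part of netplan or devlink.

```python
def parse_eswitch_show(line: str) -> dict:
    """Parse one line of `devlink dev eswitch show` output into a dict.

    Assumes the layout seen in this report:
    "<device>: mode <m> inline-mode <i> encap-mode <e>"
    """
    # The device handle itself contains colons, so split on ": " (colon + space).
    dev, _, rest = line.strip().partition(": ")
    tokens = rest.split()
    # The remaining tokens come in key/value pairs.
    attrs = dict(zip(tokens[0::2], tokens[1::2]))
    attrs["dev"] = dev
    return attrs


# Line copied from the report above.
sample = "pci/0000:08:00.0: mode legacy inline-mode none encap-mode basic"
info = parse_eswitch_show(sample)
print(info["dev"], info["mode"])  # prints: pci/0000:08:00.0 legacy
```

A check such as `info["mode"] == "switchdev"` then distinguishes the failing 64-VF case from the working case with 50 VFs or fewer.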

Syslog (with udev debug):

Jul 14 14:24:19 node4 systemd-udevd[712]: Parsed configuration file /run/systemd/network/10-netplan-ens1f1.link
Jul 14 14:24:19 node4 systemd-udevd[712]: Parsed configuration file /run/systemd/network/10-netplan-ens1f0.link
Jul 14 14:24:19 node4 systemd-udevd[712]: Parsed configuration file /run/systemd/network/10-netplan-eno4.link
Jul 14 14:24:19 node4 systemd-udevd[712]: Parsed configuration file /run/systemd/network/10-netplan-eno3.link
Jul 14 14:24:19 node4 systemd-udevd[712]: Parsed configuration file /run/systemd/network/10-netplan-eno2.link
Jul 14 14:24:19 node4 systemd-udevd[712]: Parsed configuration file /run/systemd/network/10-netplan-eno1.link
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-netplan-eno1.rules
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-netplan-eno2.rules
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-netplan-eno3.rules
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-netplan-eno4.rules
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-netplan-ens1f0.rules
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-netplan-ens1f1.rules
Jul 14 14:24:19 node4 systemd-udevd[712]: Reading rules file: /run/udev/rules.d/99-sriov-netplan-setup.rules
Jul 14 14:24:55 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) 'Error: mlx5_core: Failed setting eswitch to offloads.'
Jul 14 14:24:55 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) 'kernel answers: Invalid argument'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) 'Traceback (most recent call last):'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/sbin/netplan", line 23, in <module>'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' netplan.main()'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/core.py", line 50, in main'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' self.run_command()'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/utils.py", line 247, in run_command'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' self.func()'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/commands/apply.py", line 61, in run'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' self.run_command()'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/utils.py", line 247, in run_command'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' self.func()'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/commands/apply.py", line 71, in command_apply'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' NetplanApply.process_sriov_config(config_manager, exit_on_error)'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/commands/apply.py", line 376, in process_sriov_config'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' apply_sriov_config(config_manager)'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/sriov.py", line 498, in apply_sriov_config'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' pcidev.devlink_set('eswitch', 'mode', eswitch_mode)'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/sriov.py", line 144, in devlink_set'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' subprocess.check_call('
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/lib/python3.10/subprocess.py", line 369, in check_call'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) ' raise CalledProcessError(retcode, cmd)'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: '/usr/sbin/netplan apply --sriov-only'(err) 'subprocess.CalledProcessError: Command '['/sbin/devlink', 'dev', 'eswitch', 'set', 'pci/0000:08:00.0', 'mode', 'switchdev']' returned non-zero exit status 1.'
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: Process '/usr/sbin/netplan apply --sriov-only' failed with exit code 1.
Jul 14 14:24:56 node4 systemd-udevd[753]: ens1f1np1: Command "/usr/sbin/netplan apply --sriov-only" returned 1 (error), ignoring.
Jul 14 14:24:56 node4 systemd-udevd[763]: ens1f1: Config file /run/systemd/network/10-netplan-ens1f1.link is applied
Jul 14 14:24:56 node4 systemd-networkd[1055]: ens1f1: found matching network '/run/systemd/network/10-netplan-ens1f1.network'.
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) 'Error: mlx5_core: Failed setting eswitch to offloads.'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) 'kernel answers: Invalid argument'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) 'Traceback (most recent call last):'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/sbin/netplan", line 23, in <module>'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' netplan.main()'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/core.py", line 50, in main'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' self.run_command()'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/utils.py", line 247, in run_command'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' self.func()'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/commands/apply.py", line 61, in run'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' self.run_command()'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/utils.py", line 247, in run_command'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' self.func()'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/commands/apply.py", line 71, in command_apply'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' NetplanApply.process_sriov_config(config_manager, exit_on_error)'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/commands/apply.py", line 376, in process_sriov_config'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' apply_sriov_config(config_manager)'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/sriov.py", line 498, in apply_sriov_config'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' pcidev.devlink_set('eswitch', 'mode', eswitch_mode)'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/share/netplan/netplan/cli/sriov.py", line 144, in devlink_set'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' subprocess.check_call('
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' File "/usr/lib/python3.10/subprocess.py", line 369, in check_call'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) ' raise CalledProcessError(retcode, cmd)'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: '/usr/sbin/netplan apply --sriov-only'(err) 'subprocess.CalledProcessError: Command '['/sbin/devlink', 'dev', 'eswitch', 'set', 'pci/0000:08:00.0', 'mode', 'switchdev']' returned non-zero exit status 1.'
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: Process '/usr/sbin/netplan apply --sriov-only' failed with exit code 1.
Jul 14 14:25:00 node4 systemd-udevd[754]: ens1f0np0: Command "/usr/sbin/netplan apply --sriov-only" returned 1 (error), ignoring.
Jul 14 14:25:00 node4 systemd-udevd[763]: ens1f0: Config file /run/systemd/network/10-netplan-ens1f0.link is applied
Jul 14 14:25:00 node4 systemd-networkd[1055]: ens1f0: found matching network '/run/systemd/network/10-netplan-ens1f0.network'.
Jul 14 14:25:22 node4 netplan[3268]: 0000:08:00.0: bound 64 VFs
Jul 14 14:25:22 node4 netplan[3268]: 0000:08:00.1: bound 0 VFs
Jul 14 14:25:22 node4 systemd[1]: netplan-sriov-rebind.service: Deactivated successfully.

Itai Levy (etlvnvda) wrote :

I suspect that not all of the VFs had been unbound from their driver when the devlink switchdev command was attempted.
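
One way to test this suspicion would be to list the VFs that still have a driver attached just before the eswitch mode is changed. The following is a diagnostic sketch only, not netplan's code. It assumes the standard sysfs layout, in which each PF directory holds `virtfnN` symlinks to its VFs and a bound device has a `driver` symlink. The function name `bound_vfs` is hypothetical.

```python
import os


def bound_vfs(pf_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return (vf_address, driver_name) for every VF of the PF that is still bound."""
    pf_dir = os.path.join(sysfs_root, pf_addr)
    result = []
    for entry in sorted(os.listdir(pf_dir)):
        # Each VF appears under the PF as a symlink named virtfn0, virtfn1, ...
        if not entry.startswith("virtfn"):
            continue
        vf_dir = os.path.realpath(os.path.join(pf_dir, entry))
        driver_link = os.path.join(vf_dir, "driver")
        # A bound PCI device has a "driver" symlink; an unbound one does not.
        if os.path.islink(driver_link):
            driver = os.path.basename(os.readlink(driver_link))
            result.append((os.path.basename(vf_dir), driver))
    return result
```

Calling `bound_vfs("0000:08:00.0")` right before the `devlink dev eswitch set ... mode switchdev` step would return an empty list if every VF had been unbound. A non-empty list at that point would support the suspicion above.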

Lukas Märdian (slyon)
tags: added: rls-jj-incoming rls-kk-incoming
Lukas Märdian (slyon) wrote :

I think the interesting part is this:

```
/sbin/devlink dev eswitch set pci/0000:08:00.0 mode switchdev returned non-zero exit status 1
Error: mlx5_core: Failed setting eswitch to offloads.
kernel answers: Invalid argument
```

Also, it tells us that it would have bound 64 VFs for the first PF, but none for the second. Can you confirm this?
```
Jul 14 14:25:22 node4 netplan[3268]: 0000:08:00.0: bound 64 VFs
Jul 14 14:25:22 node4 netplan[3268]: 0000:08:00.1: bound 0 VFs
```

Itai Levy (etlvnvda) wrote :

Lukas, I agree. As I mentioned in my previous comment, "I suspect that not all of the VFs were able to "unbind" before trying the devlink switchdev command".
However, I cannot confirm that this is indeed the reason for the failure (that is, that some of the VFs were still bound when the devlink command was attempted).

Regarding this:
```
Jul 14 14:25:22 node4 netplan[3268]: 0000:08:00.0: bound 64 VFs
Jul 14 14:25:22 node4 netplan[3268]: 0000:08:00.1: bound 0 VFs
```
I saw it as well. However, I cannot confirm it: by the time the system was finally up and netplan had already configured everything, I had 128 VFs bound, as mentioned in the description:
#lspci | grep Virtual | wc -l
129
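
Because the `lspci` count above is a total across both PFs, a per-PF count could help answer the question about the second PF. The following is a rough sketch, assuming the standard `sriov_numvfs` sysfs attribute and `virtfnN` links under each PF. The function name `vf_counts` is hypothetical.

```python
import os


def vf_counts(pf_addr, sysfs_root="/sys/bus/pci/devices"):
    """Return (configured, present) VF counts for one PF.

    configured: the value of the PF's sriov_numvfs attribute
    present:    the number of virtfnN links currently under the PF
    """
    pf_dir = os.path.join(sysfs_root, pf_addr)
    with open(os.path.join(pf_dir, "sriov_numvfs")) as f:
        configured = int(f.read().strip())
    present = sum(1 for e in os.listdir(pf_dir) if e.startswith("virtfn"))
    return configured, present
```

Running `vf_counts("0000:08:00.0")` and `vf_counts("0000:08:00.1")` would show each PF separately. Comparing the two values at boot time and again after the system has settled would show whether the second PF's VFs only appear later.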

Lukas Märdian (slyon)
tags: removed: rls-kk-incoming
Lukas Märdian (slyon)
tags: removed: rls-jj-incoming