[SRU][mlx5] Intermittent VF-LAG activation failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Committed | Undecided | Unassigned |
Jammy | Confirmed | Undecided | Unassigned |
Kinetic | Won't Fix | Undecided | Unassigned |
Mantic | Won't Fix | Undecided | Unassigned |
Noble | Fix Committed | Undecided | Unassigned |
netplan.io (Ubuntu) | Fix Released | Medium | Unassigned |
Jammy | Fix Committed | Undecided | Martin Kalcok |
Kinetic | Won't Fix | Medium | Unassigned |
Mantic | Won't Fix | Undecided | Unassigned |
Noble | Fix Released | Medium | Unassigned |
Bug Description
[ Impact ]
Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
feature found on Mellanox NICs could not be used. Certain configuration steps
must happen in a very specific order, and Netplan fails to perform the setup correctly.
Netplan must wait until the backend finishes adding interfaces to the bond
and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
the driver.
See also https:/
This problem is fixed by introducing a proper ordering in the configuration process
and monitoring the driver state until it reports as ready (or times out).
This fix is available on Ubuntu 24.04.
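The ordering described above can be sketched as follows. This is only an illustration, not Netplan's actual code: the function name and the three callables are hypothetical stand-ins for the real steps, and skipping the bind when the wait times out is an assumption made for the sketch.

```python
from typing import Callable


def configure_vf_lag(
    add_interfaces_to_bond: Callable[[], None],
    wait_for_lag_active: Callable[[], bool],
    bind_vfs: Callable[[], None],
) -> bool:
    """Run the three steps in the order the bug report requires.

    All three callables are hypothetical placeholders supplied by the caller.
    Returns True if the VFs were bound, False if the wait step gave up.
    """
    # Step 1: the backend finishes adding the interfaces to the bond.
    add_interfaces_to_bond()

    # Step 2: block until the driver reports VF-LAG as "active" (or a timeout).
    if not wait_for_lag_active():
        # Assumption: on timeout the VFs are left unbound.
        return False

    # Step 3: only now bind the VFs to the driver.
    bind_vfs()
    return True
```

Passing the steps in as callables keeps the ordering testable without SR-IOV hardware.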
[ Test Plan ]
To reproduce the problem addressed by this SRU one needs to
have access to specialized hardware (SR-IOV-capable Mellanox NICs).
The fix for the problem described above was already verified on Ubuntu 22.04 and
solved the problem (more details https:/
We will work with Canonical's OpenStack team to do the fix verification.
* Detailed instructions on how to reproduce the bug
A configuration file that looks like the one below can be used
to test the fix.
After booting the system with this configuration, the Mellanox driver
should report the LAG state as "active" for all the devices.
It can be checked in the debugfs file: /sys/kernel/
network:
  version: 2
  ethernets:
    ens4f0np0:
      virtual-
      embedded-
      delay-
    ens4f1np1:
      virtual-
      embedded-
      delay-
  bonds:
    bond0:
      interfaces:
        - ens4f0np0
        - ens4f1np1
      parameters:
        mode: active-backup
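The "active for all the devices" check from the test plan could be scripted roughly as below. This is a sketch under assumptions: the debugfs path is truncated in this report, so the caller must supply the state file for each device, and `devices_not_active` is a hypothetical helper name. A file that cannot be read is treated as not active.

```python
from pathlib import Path
from typing import Dict, List


def devices_not_active(state_files: Dict[str, Path]) -> List[str]:
    """Return the devices whose LAG state file does not read "active".

    state_files maps a device name to its state file; the real debugfs
    location is not spelled out in this report, so it is left to the caller.
    """
    not_active = []
    for device, path in state_files.items():
        try:
            state = Path(path).read_text().strip()
        except OSError:
            # Missing or unreadable file: count the device as not active.
            state = ""
        if state != "active":
            not_active.append(device)
    return sorted(not_active)
```

An empty result means every listed device reported "active".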
[ Where problems could occur ]
These changes should affect only SR-IOV related scenarios.
Undetected problems could cause Netplan to fail to configure the device,
in which case Virtual Functions would no longer be created.
[ Other Info ]
Related work:
https:/
https:/
A PPA for Ubuntu 22.04 can be found here https:/
---- Original bug description ----
During system initialization there is a specific sequence that must be followed to enable the use of hardware offload and VF-LAG.
Intermittently one may see that VF-LAG initialization fails:
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_
This is caused by rebinding the driver before VF-LAG is ready.
A debugfs knob has recently been added to the driver [0] and we should monitor it before attempting to rebind the driver:
$ cat /sys/kernel/
The kernel feature is available in the upcoming Kinetic 5.19 kernel and we should probably backport it to the Jammy 5.15 kernel.
0: https:/
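Monitoring the knob before rebinding might look like the polling loop below. It is a minimal sketch, not the code that was merged: `wait_for_state` is a hypothetical name, the file path is passed in by the caller because it is truncated in this report, and the timeout and interval defaults are arbitrary.

```python
import time
from pathlib import Path


def wait_for_state(path, expected="active", timeout=30.0, interval=0.5) -> bool:
    """Poll a state file until it reads `expected` or the timeout elapses.

    Returns True as soon as the expected state is seen, False on timeout.
    The defaults are illustrative only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if Path(path).read_text().strip() == expected:
                return True
        except OSError:
            # The file may not exist yet (or may be unreadable); keep polling.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would attempt the rebind only after this returns True.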
Related branches
- Lukas Märdian: Approve
- Ubuntu Core Development Team: Pending requested
Diff: 2460 lines (+2373/-0)
13 files modified
debian/changelog (+12/-0)
debian/libnetplan0.symbols (+1/-0)
debian/patches/lp1988018/0018-libnetplan-add-a-getter-for-bond-mode.patch (+115/-0)
debian/patches/lp1988018/0019-sriov-move-the-udev-logic-to-a-service-unit.patch (+163/-0)
debian/patches/lp1988018/0020-sriov-check-the-eswitch-mode-before-trying-to-change.patch (+100/-0)
debian/patches/lp1988018/0021-sriov_rebind-cooperate-with-VF-LAG-activation.patch (+172/-0)
debian/patches/lp1988018/0022-sriov_rebind-netplan-rebind-debug-setup.patch (+113/-0)
debian/patches/lp1988018/0023-tests-sriov-adapt-tests-to-the-last-sr-iov-related-c.patch (+554/-0)
debian/patches/lp1988018/0024-sriov_apply-execute-apply-sriov-only-before-network-.patch (+83/-0)
debian/patches/lp2020409/0025-sriov-accept-setting-the-eswitch-mode-without-VFs.patch (+148/-0)
debian/patches/lp2020409/0026-cli-sriov-refactoring.patch (+768/-0)
debian/patches/lp2020409/0027-cli-sriov-set-eswitch-regardless-of-pcidev.vfs.patch (+132/-0)
debian/patches/series (+12/-0)
Changed in linux (Ubuntu Kinetic):
status: New → Fix Committed
Changed in netplan.io (Ubuntu Kinetic):
status: New → Triaged
importance: Undecided → Medium
tags: added: foundations-triage-discuss
tags: removed: foundations-triage-discuss
Changed in netplan.io (Ubuntu Jammy):
assignee: nobody → Martin Kalcok (martin-kalcok)
status: New → In Progress
tags: added: sru-next
description: updated
summary:
- [mlx5] Intermittent VF-LAG activation failure
+ [SRU][mlx5] Intermittent VF-LAG activation failure
description: updated
description: updated
Ubuntu 22.10 (Kinetic Kudu) has reached end of life, so this bug will not be fixed for that specific release.