[SRU][mlx5] Intermittent VF-LAG activation failure

Bug #1988018 reported by Frode Nordahl
This bug affects 6 people
Affects               Status         Importance  Assigned to    Milestone
linux (Ubuntu)        Fix Committed  Undecided   Unassigned
  Jammy               Confirmed      Undecided   Unassigned
  Kinetic             Won't Fix      Undecided   Unassigned
  Mantic              Won't Fix      Undecided   Unassigned
  Noble               Fix Committed  Undecided   Unassigned
netplan.io (Ubuntu)   Fix Released   Medium      Unassigned
  Jammy               Fix Committed  Undecided   Martin Kalcok
  Kinetic             Won't Fix      Medium      Unassigned
  Mantic              Won't Fix      Undecided   Unassigned
  Noble               Fix Released   Medium      Unassigned

Bug Description

[ Impact ]

Due to limitations in how Netplan handles SR-IOV devices, the VF-LAG
feature found on Mellanox NICs couldn't be used. Certain configuration steps
must happen in a very specific order, and Netplan failed to perform the setup correctly.

Netplan must wait until the backend finishes adding interfaces to the Bond
and the Mellanox driver reports the VF-LAG feature as "active" before binding VFs to
the driver.

See also https://bugs.launchpad.net/netplan/+bug/2083008

This problem is fixed by introducing a proper ordering in the configuration process
and monitoring the driver state until it reports as ready (or times out).
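
The "monitor until ready or timed out" step can be illustrated with a minimal Python sketch. This is not the code from the Netplan fix; the function name, the default timeout and the polling interval are assumptions made for illustration. Only the "active" value and the debugfs state file come from this report.

```python
import time
from pathlib import Path


def wait_for_lag_active(state_file, timeout=30.0, interval=0.5):
    """Poll state_file until it reads "active".

    Returns True once the driver reports "active", or False if the
    (hypothetical) timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if Path(state_file).read_text().strip() == "active":
                return True
        except OSError:
            # File missing or unreadable, e.g. the kernel lacks the
            # debugfs entry or debugfs is not mounted.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A caller would only proceed to bind the VFs when the function returns True for a path such as /sys/kernel/debug/mlx5/{pci_addr}/lag/state, and would treat False as a configuration failure.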

This fix is available on Ubuntu 24.04.

[ Test Plan ]

To reproduce the problem addressed by this SRU one needs to
have access to specialized hardware (SR-IOV-capable Mellanox NICs).

The fix for the problem described above was already verified on Ubuntu 22.04 and
solved the problem (more details: https://bugs.launchpad.net/netplan/+bug/2083008).

We will work with Canonical's OpenStack team to do the fix verification.

 * Detailed instructions on how to reproduce the bug

A configuration file that looks like the one below can be used
to test the fix.

After booting the system with this configuration, the Mellanox driver
should report the LAG state as "active" for all the devices.
It can be checked in the debugfs file: /sys/kernel/debug/mlx5/{pci_addr}/lag/state

network:
  version: 2
  ethernets:
    ens4f0np0:
      virtual-function-count: 16
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true

    ens4f1np1:
      virtual-function-count: 16
      embedded-switch-mode: switchdev
      delay-virtual-functions-rebind: true

  bonds:
    bond0:
      interfaces:
        - ens4f0np0
        - ens4f1np1
      parameters:
        mode: active-backup
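
The post-boot check from the test plan (every device reporting "active" in its debugfs state file) could be scripted roughly as follows. This is a sketch only: the helper names are hypothetical, the PCI addresses must be supplied by the tester, and reading debugfs typically requires root.

```python
from pathlib import Path

# Location of the mlx5 debugfs entries, as given in the test plan.
DEBUGFS_MLX5 = "/sys/kernel/debug/mlx5"


def lag_states(pci_addrs, base=DEBUGFS_MLX5):
    """Map each PCI address to its reported LAG state (None if unreadable)."""
    states = {}
    for addr in pci_addrs:
        state_file = Path(base) / addr / "lag" / "state"
        try:
            states[addr] = state_file.read_text().strip()
        except OSError:
            states[addr] = None
    return states


def inactive_devices(pci_addrs, base=DEBUGFS_MLX5):
    """Return the sorted PCI addresses whose LAG state is not "active"."""
    states = lag_states(pci_addrs, base)
    return sorted(addr for addr, state in states.items() if state != "active")
```

The test passes when inactive_devices() returns an empty list for the PCI addresses of both bonded ports.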

[ Where problems could occur ]

These changes should affect only SR-IOV related scenarios.
Undetected problems could cause Netplan to fail to configure the device
and Virtual Functions wouldn't be created anymore.

[ Other Info ]

Related work:

https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/1988018
https://github.com/canonical/netplan/pull/439

A PPA for Ubuntu 22.04 can be found here https://launchpad.net/~danilogondolfo/+archive/ubuntu/netplan-sru

---- Original bug description ----

During system initialization there is a specific sequence that must be followed to enable the use of hardware offload and VF-LAG.

Intermittently one may see that VF-LAG initialization fails:
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: lag map port 1:1 port 2:2 shared_fdb:1
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_cmd_check:782:(pid 9): CREATE_LAG(0x840) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x7d49cb)
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_create_lag:248:(pid 9): Failed to create LAG (-22)
[Thu Jul 21 10:54:58 2022] mlx5_core 0000:08:00.0: mlx5_activate_lag:288:(pid 9): Failed to activate VF LAG
                           Make sure all VFs are unbound prior to VF LAG activation or deactivation

This is caused by rebinding the driver prior to the VF lag being ready.
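
When checking many boots for this intermittent failure, the kernel log can be scanned for the message shown above. The following is a minimal sketch; the function name is hypothetical, and the regular expression is modelled only on the log excerpt in this report, so the exact message layout may differ between kernel versions.

```python
import re

# Pattern modelled on the "Failed to activate VF LAG" line quoted above.
LAG_FAILURE = re.compile(
    r"mlx5_core (?P<pci>[0-9a-fA-F:.]+): "
    r"mlx5_activate_lag:\d+:\(pid \d+\): Failed to activate VF LAG"
)


def failed_lag_devices(log_text):
    """Return the sorted PCI addresses that logged a VF LAG activation failure."""
    return sorted({m.group("pci") for m in LAG_FAILURE.finditer(log_text)})
```

The log_text argument can be the captured output of dmesg or the contents of a saved kernel log; an empty result means no matching failure line was found.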

A debugfs knob has recently been added to the driver [0] and we should monitor it before attempting to rebind the driver:

    $ cat /sys/kernel/debug/mlx5/0000\:08\:00.0/lag/state

The kernel feature is available in the upcoming Kinetic 5.19 kernel and we should probably backport it to the Jammy 5.15 kernel.

0: https://github.com/torvalds/linux/commit/7f46a0b7327ae261f9981888708dbca22c283900


Frode Nordahl (fnordahl)
Changed in linux (Ubuntu Kinetic):
status: New → Fix Committed
Lukas Märdian (slyon)
Changed in netplan.io (Ubuntu Kinetic):
status: New → Triaged
importance: Undecided → Medium
tags: added: foundations-triage-discuss
Lukas Märdian (slyon)
tags: removed: foundations-triage-discuss
Utkarsh Gupta (utkarsh) wrote :

Ubuntu 22.10 (Kinetic Kudu) has reached end of life, so this bug will not be fixed for that specific release.

Changed in netplan.io (Ubuntu Kinetic):
status: Triaged → Won't Fix
Changed in netplan.io (Ubuntu Jammy):
assignee: nobody → Martin Kalcok (martin-kalcok)
status: New → In Progress
Lukas Märdian (slyon) wrote :
Frode Nordahl (fnordahl) wrote :

I think they are two distinct problems, and hopefully we will get a comment from NVIDIA/Mellanox, as the statements in bug 2020409 contradict the documentation [0] that the current Netplan implementation is based on.

Martin may have more details, but I wanted to mention that one of our suspected culprits is how Netplan lays out the udev rules for VF activation [1]:
1) It takes a long time when many VFs are configured, contrary to the expectation stated in the code comment.
2) The process appears to be executed multiple times, which, combined with how long it takes, may end up clashing with both the networking backend's creation of the bond and the systemd unit rebinding the VFs.

Bug 2020409 also raises the question of whether there are any bond/LAG related system bringup quirks for systems using only Scalable Functions (SF) or a combination of SFs and VFs. I have yet to see any documentation about that.

0: https://enterprise-support.nvidia.com/s/article/Configuring-VF-LAG-using-TC
1: https://github.com/canonical/netplan/blob/a7e4be03918c986020650743cb6cf0934696ef0c/src/sriov.c#L107-L112

Lukas Märdian (slyon) wrote :

This should be fixed as of Netplan v1.0: https://github.com/canonical/netplan/pull/439

Please re-open if you think this is still an issue.

Changed in netplan.io (Ubuntu Noble):
status: Triaged → Fix Released
Brian Murray (brian-murray) wrote :

Ubuntu 23.10 (Mantic Minotaur) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Mantic):
status: New → Won't Fix
Changed in netplan.io (Ubuntu Mantic):
status: New → Won't Fix
Brian Murray (brian-murray) wrote :

Ubuntu 22.10 (Kinetic Kudu) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux (Ubuntu Kinetic):
status: Fix Committed → Won't Fix
Lukas Märdian (slyon)
tags: added: sru-next
description: updated
summary: - [mlx5] Intermittent VF-LAG activation failure
+ [SRU][mlx5] Intermittent VF-LAG activation failure
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Jammy):
status: New → Confirmed
description: updated
description: updated
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Frode, or anyone else affected,

Accepted netplan.io into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/netplan.io/0.107.1-3ubuntu0.22.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in netplan.io (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-jammy
Lukas Märdian (slyon) wrote :

Dear OpenStack team (or anyone with the relevant hardware), can you please help test this using the following commands?

cat <<EOF >/etc/apt/sources.list.d/ubuntu-$(lsb_release -cs)-proposed.list
# Enable Ubuntu proposed archive
deb http://archive.ubuntu.com/ubuntu/ $(lsb_release -cs)-proposed restricted main multiverse universe
EOF

cat <<EOF >/etc/apt/preferences.d/proposed-updates
# Configure apt to allow selective installs of packages from proposed
Package: *
Pin: release a=$(lsb_release -cs)-proposed
Pin-Priority: 400
EOF

apt update
apt install -t jammy-proposed netplan.io

Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (netplan.io/0.107.1-3ubuntu0.22.04.2)

All autopkgtests for the newly accepted netplan.io (0.107.1-3ubuntu0.22.04.2) for jammy have finished running.
The following regressions have been reported in tests triggered by the package:

initramfs-tools/0.140ubuntu13.5 (arm64, armhf, ppc64el)
initramfs-tools/unknown (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/jammy/update_excuses.html#netplan.io

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!
