Bug in iavf driver v4.5.3 results in a system hang

Bug #2037692 reported by M. Vefa Bicakci
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: M. Vefa Bicakci

Bug Description

Brief Description
-----------------

iavf driver version 4.5.3 has a bug that causes a system hang when virtual function (VF) interfaces are reset in rapid succession, for example by repeatedly changing the trust and VLAN settings of a VF interface.

This bug was originally resolved in iavf version 4.6.1, and it is also resolved in iavf version 4.5.3.2.

Severity
--------
Major: Attaching VF interfaces to Kubernetes pods (or possibly detaching them) may intermittently result in a system hang; how often this happens cannot be predicted.

Steps to Reproduce
------------------

The most straightforward approach is to use a script like the following to reproduce the issue:

```
#!/bin/bash

if test "${EUID}" -ne 0; then
        echo "Please run this script as root"
        exit 1
fi

# Debugging options
sysctl -w kernel.sysrq=1 ;
dmesg -n debug;
dmesg -E;

sync
echo 3 > /proc/sys/vm/drop_caches

limit="${1:-4000}"
iter=1;

set -x;
while test "${iter}" -le "${limit}"; do
        tee <<<"Iteration: ${iter}" /dev/kmsg;
        let iter+=1;

        # enp81s17 is the first VF interface.
        # enp81s0f2 is the corresponding PF interface.

        ip l set dev enp81s17 up;
        ip l set dev enp81s0f2 vf 0 trust on;
        ip l set dev enp81s0f2 vf 0 vlan 333;
        ip l set dev enp81s0f2 vf 0 trust off;
        ip l set dev enp81s0f2 vf 0 vlan 310;
        ip l set dev enp81s17 down;

        sleep 0.1;
done
```

Expected Behavior
------------------
The script should not result in a system hang.

Actual Behavior
----------------
The script results in a system hang, usually within 200 iterations of the loop.

Eventually, the iavf driver emits the following log messages, which indicate that the issue has been triggered:

```
iavf 0000:xx:xx.x: Failed to init adminq: -53
iavf 0000:xx:xx.x: failed to allocate resources during reinit
```

Afterwards, the VF bring-down operation (ip l set dev ... down) hangs, and the rest of the system eventually becomes unresponsive.
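
As a rough aid for spotting the failure signature described above, the following sketch scans log text for the two iavf messages. The function name `iavf_reset_failure_seen` is hypothetical and not part of any tool; the pattern assumes the messages appear exactly as quoted in this report, preceded by a hexadecimal PCI address.

```bash
#!/bin/bash

# Hypothetical helper: reads log text on stdin and returns success (0)
# if either iavf failure message quoted in this report is present.
# Assumption: the PCI address consists of hex digits, ':' and '.'.
iavf_reset_failure_seen() {
        grep -Eq 'iavf [0-9a-fA-F:.]+: (Failed to init adminq: -53|failed to allocate resources during reinit)'
}
```

One possible use is `dmesg | iavf_reset_failure_seen && echo "iavf reset failure detected"`, run as root from a second terminal while the reproduction script is looping. Because the system may hang shortly after the messages appear, this is best treated as an early warning rather than a reliable monitor.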

Reproducibility
---------------

Reliably reproducible with the script.

System Configuration
--------------------

This issue was reproduced on an All-in-One system with a quad-port Intel E810 NIC managed by the ice driver and VF interfaces managed by the iavf driver.

Branch/Pull Time/Commit
-----------------------

The issue appears to be present in StarlingX versions that ship with iavf 4.5.3. Other versions of the iavf driver have not been verified, but we do know that the issue is fixed in iavf driver v4.6.1 and the intermediate release v4.5.3.2.
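
To check whether a given system falls into the affected range, a minimal sketch along these lines may help. The function name `classify_iavf_version` is hypothetical. It encodes only what this report states (4.5.3 is affected; 4.5.3.2 and 4.6.1 are fixed) plus one assumption that is not verified here: that releases later than 4.6.1 retain the fix. Every other version is reported as unknown. It relies on `sort -V` from GNU coreutils.

```bash
#!/bin/bash

# Hypothetical helper: classify an iavf version string as
# "affected", "fixed", or "unknown" based on this bug report.
classify_iavf_version() {
        local ver="$1"

        case "${ver}" in
                4.5.3)   echo "affected"; return ;;
                4.5.3.2) echo "fixed"; return ;;
        esac

        # Assumption (not verified in this report): versions at or
        # after 4.6.1 keep the fix. sort -V orders version strings.
        if [ "$(printf '%s\n' "4.6.1" "${ver}" | sort -V | head -n 1)" = "4.6.1" ]; then
                echo "fixed"
        else
                echo "unknown"
        fi
}
```

On a system with the driver installed, the version string can be obtained with `modinfo -F version iavf`, so a call might look like `classify_iavf_version "$(modinfo -F version iavf)"`.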

Last Pass
---------

Not sure.

Timestamp/Logs
--------------

(Please see the Actual Behavior section.)

Test Activity
-------------
Normal use.

Workaround
----------
None.

Changed in starlingx:
assignee: nobody → M. Vefa Bicakci (vbicakci)
status: New → In Progress
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kernel (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/starlingx/kernel/+/897242

OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/897242
Committed: https://opendev.org/starlingx/kernel/commit/06a162c47d25064e573c139e05e7fb3278d114f4
Submitter: "Zuul (22348)"
Branch: master

commit 06a162c47d25064e573c139e05e7fb3278d114f4
Author: M. Vefa Bicakci <email address hidden>
Date: Tue Sep 12 12:39:51 2023 +0000

    intel-iavf: Update from v4.5.3 to v4.5.3.2

    This commit updates the default Intel NIC driver bundle version of the
    iavf driver from v4.5.3 to v4.5.3.2 to resolve an issue involving system
    hangs after the following messages are printed out by the iavf driver:

    ```
    iavf 0000:51:11.0: Failed to init adminq: -53
    iavf 0000:51:11.0: failed to allocate resources during reinit
    ```

    This is reproduced with the following commands on iavf-4.5.3, which
    carry out rapid virtual function (VF) interface resets:

    ```
    while true; do
      # enp81s17 is the first VF interface
      ip l set dev enp81s17 up;

      # enp81s0f2 is the corresponding PF interface
      ip l set dev enp81s0f2 vf 0 trust on;
      ip l set dev enp81s0f2 vf 0 vlan 333;

      ip l set dev enp81s0f2 vf 0 trust off;
      ip l set dev enp81s0f2 vf 0 vlan 310;

      ip l set dev enp81s17 down;
      sleep 0.1 ;
    done
    ```

    Eventually, iavf reports the aforementioned error messages, and the VF
    bring down operation hangs. This is followed by the hang of many
    unrelated processes, likely due to the "rtnl" mutex.

    This commit updates iavf from v4.5.3 to v4.5.3.2 to resolve this issue
    and other issues that Intel has recommended to fix. Please note that
    this version of the iavf driver is found in the "unsupported" directory
    on Intel's Sourceforge project for NIC drivers, despite Intel having
    recommended this version of the iavf driver to fix the reported issue.
    This is how Intel provides fixed intermediate versions of their older
    NIC drivers on Sourceforge. Furthermore, this version of iavf has gone
    through testing by Intel as well as by the StarlingX community, despite
    the driver having been declared as an "unsupported" version by Intel.

    The corresponding mainline commits are as follows, but note that the
    changes in iavf 4.5.3.2 are only loosely based on these commits, due to
    the divergence between the out-of-tree and mainline versions of the iavf
    source code:

    * Commit 31071173771e ("iavf: Fix reset error handling")
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=31071173771e
      (This is the commit that resolves the issue the user in question has
      encountered.)

    * Commit c2ed2403f12c ("iavf: Wait for reset in callbacks which trigger
      it")
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2ed2403f12c

    * Commit 7598f4b40bd6 ("iavf: Move netdev_update_features() into
      watchdog task")
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7598f4b40bd6

    The iavf driver versions belonging to other Intel NIC driver bundle
    versions are not updated due...


Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
tags: added: stx.9.0 stx.distro.other