Comment 2 for bug 2037692

OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/897242
Committed: https://opendev.org/starlingx/kernel/commit/06a162c47d25064e573c139e05e7fb3278d114f4
Submitter: "Zuul (22348)"
Branch: master

commit 06a162c47d25064e573c139e05e7fb3278d114f4
Author: M. Vefa Bicakci <email address hidden>
Date: Tue Sep 12 12:39:51 2023 +0000

    intel-iavf: Update from v4.5.3 to v4.5.3.2

    This commit updates the default Intel NIC driver bundle version of the
    iavf driver from v4.5.3 to v4.5.3.2 to resolve an issue involving system
    hangs after the following messages are printed out by the iavf driver:

    ```
    iavf 0000:51:11.0: Failed to init adminq: -53
    iavf 0000:51:11.0: failed to allocate resources during reinit
    ```

    This is reproduced with the following commands on iavf-4.5.3, which
    carry out rapid virtual function (VF) interface resets:

    ```
    while true; do
      # enp81s17 is the first VF interface
      ip l set dev enp81s17 up;

      # enp81s0f2 is the corresponding PF interface
      ip l set dev enp81s0f2 vf 0 trust on;
      ip l set dev enp81s0f2 vf 0 vlan 333;

      ip l set dev enp81s0f2 vf 0 trust off;
      ip l set dev enp81s0f2 vf 0 vlan 310;

      ip l set dev enp81s17 down;
      sleep 0.1 ;
    done
    ```

    Eventually, iavf reports the aforementioned error messages, and the VF
    bring-down operation hangs. Many unrelated processes then hang as well,
    likely because they block on the "rtnl" mutex.

    This commit updates iavf from v4.5.3 to v4.5.3.2 to resolve this issue
    and other issues that Intel has recommended to fix. Please note that
    Intel publishes this version of the iavf driver in the "unsupported"
    directory of its Sourceforge project for NIC drivers, which is how Intel
    provides fixed intermediate versions of its older NIC drivers. Despite
    the "unsupported" label, Intel recommended this version to fix the
    reported issue, and it has been tested by Intel as well as by the
    StarlingX community.

    The corresponding mainline commits are as follows, but note that the
    changes in iavf 4.5.3.2 are only loosely based on these commits, due to
    the divergence between the out-of-tree and mainline versions of the iavf
    source code:

    * Commit 31071173771e ("iavf: Fix reset error handling")
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=31071173771e
      (This is the commit that resolves the issue the user in question has
      encountered.)

    * Commit c2ed2403f12c ("iavf: Wait for reset in callbacks which trigger
      it")
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c2ed2403f12c

    * Commit 7598f4b40bd6 ("iavf: Move netdev_update_features() into
      watchdog task")
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7598f4b40bd6

    The iavf driver versions belonging to other Intel NIC driver bundle
    versions are not updated for the following reasons:

    - intel-iavf-cvl-2.54: We do not yet know if this version of iavf
      (v4.0.1) is affected by this issue. The user reporting the issue fixed
      by this commit is currently using iavf v4.5.3, and we have not
      received field reports regarding a similar issue encountered with iavf
      v4.0.1.

    - intel-iavf-cvl-4.10: This version of iavf (v4.6.1) is not affected by
      this issue, as the changes included in iavf v4.5.3.2 were backported
      by Intel from iavf v4.6.1.

    Verification

    - The following command with this commit results in a successful iavf
      kernel module build for standard and PREEMPT_RT kernels:
        build-pkgs -c -p iavf

    - A StarlingX ISO image from 2023-09-28 was installed onto an All-in-One
      Duplex Dell XR11 lab with one quad-port Intel E810 NIC per server in
      low-latency mode (i.e., with the PREEMPT_RT kernel).

    - The issue was reproduced using a script similar to the one depicted at
      the beginning of this commit message. The issue usually manifests
      itself within ~200 iterations of the loop.

    - Next, in a StarlingX build environment, the kernel and all of the
      kernel modules were built from scratch with this commit applied. The
      resulting *.deb files were copied to controller-1 of the StarlingX
      installation and converted into a "sneaky" designer patch with a
      customized version of the "sneaky_patch.py" script, the original
      version of which is available in StarlingX.

    - The resulting designer patch was successfully applied onto
      controller-0 of the aforementioned StarlingX ISO image installation.
      Afterwards, it was confirmed that the iavf driver version changed from
      4.5.3 (prior to the designer patch) to 4.5.3.2 (after the application
      of the designer patch).

    - A shell script based on the snippet quoted above was then executed
      for 4000 iterations of the loop without reproducing the original
      issue.

    - Furthermore, basic tests with iavf-managed VF interfaces were carried
      out, involving creating two network namespaces on controller-0,
      assigning one iavf-managed VF interface to each network namespace, and
      finally, running iperf3 across the VF interfaces, from within the
      network namespaces.
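
    The reset-loop and version checks described above could be scripted
    roughly as follows. This is only a minimal sketch, not the script that
    was actually used for verification: the function names and the RUN
    dry-run variable are hypothetical, the interface names are the ones from
    the snippet quoted earlier, and the kernel log is assumed to have been
    cleared beforehand (for example with "dmesg -C") so that old messages do
    not trigger a false positive.

    ```shell
    # Hypothetical helpers; interface names come from the snippet above.
    VF_IF="${VF_IF:-enp81s17}"    # first VF interface
    PF_IF="${PF_IF:-enp81s0f2}"   # corresponding PF interface
    RUN="${RUN:-}"                # set RUN=echo to print commands only

    # One reset cycle, mirroring the body of the reproduction loop.
    reset_cycle() {
      $RUN ip link set dev "$VF_IF" up
      $RUN ip link set dev "$PF_IF" vf 0 trust on
      $RUN ip link set dev "$PF_IF" vf 0 vlan 333
      $RUN ip link set dev "$PF_IF" vf 0 trust off
      $RUN ip link set dev "$PF_IF" vf 0 vlan 310
      $RUN ip link set dev "$VF_IF" down
    }

    # Succeeds if the log text on stdin contains the adminq failure.
    log_has_adminq_failure() {
      grep -q 'iavf .*Failed to init adminq'
    }

    # Prints the "version:" field of `ethtool -i <iface>` output on stdin.
    driver_version() {
      sed -n 's/^version:[[:space:]]*//p'
    }

    # Runs up to $1 cycles; returns 1 as soon as the failure is logged.
    run_cycles() {
      max="$1"
      i=1
      while [ "$i" -le "$max" ]; do
        reset_cycle || return 2
        if dmesg | log_has_adminq_failure; then
          echo "adminq failure logged after $i iterations" >&2
          return 1
        fi
        sleep 0.1
        i=$((i + 1))
      done
    }
    ```

    With RUN=echo, reset_cycle only prints the commands it would run. On a
    test system, "run_cycles 4000" would correspond to the 4000-iteration
    run mentioned above, and "ethtool -i enp81s17 | driver_version" is
    expected to print the loaded iavf driver version.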

    Closes-Bug: 2037692
    Change-Id: I75415e5668b002b91c2208bff081775c9eced083
    Signed-off-by: M. Vefa Bicakci <email address hidden>