Comment 2 for bug 2058858

OpenStack Infra (hudson-openstack) wrote : Fix merged to kernel (master)

Reviewed: https://review.opendev.org/c/starlingx/kernel/+/914047
Committed: https://opendev.org/starlingx/kernel/commit/1a089999c41ef887d6cec66ae8071651a0db24d1
Submitter: "Zuul (22348)"
Branch: master

commit 1a089999c41ef887d6cec66ae8071651a0db24d1
Author: Jiping Ma <email address hidden>
Date: Fri Mar 22 06:19:04 2024 +0000

    iavf: upgrade to iavf-4.5.3.4

    This commit upgrades iavf from version 4.5.3.2 to 4.5.3.4 to fix the
    issue "iavf 0000:17:01.6: Never saw reset".

    The following root cause analysis comes from Intel.

      """
      The iavf_adminq_task() function processes the device Admin queue,
      which is used to handle receiving messages from the PF driver.

      It calls iavf_clean_arq_element() to extract the message at the head
      of the queue, and processes it by calling iavf_virtchnl_completion().

      There is a subtle race between iavf_adminq_task() and
      iavf_watchdog_task() involving the processing of
      VIRTCHNL_EVENT_RESET_IMPENDING. The race results in the iavf driver
      getting stuck waiting for a reset that has already completed, printing
      "Never saw reset" once every 5 seconds, and locking the driver in the
      __IAVF_RESET state, preventing normal operations from proceeding.

      The entire race can be avoided if the iavf_adminq_task() stops holding
      onto potentially stale data. To do this, acquire the
      __IAVF_IN_CRITICAL_TASK at the start of the function. With this, it is
      no longer possible for the function to be blocked holding the data in
      its event buffer while the iavf_watchdog_task() function processes the
      entire hardware reset.

      Instead of sleeping with a while loop, just re-queue the
      iavf_adminq_task() when we are unable to acquire the bit lock.
      Additionally, align with upstream and check the removal status to
      avoid re-queuing in the event that the driver has already started
      remove.

      This new flow also aligns with the way the upstream driver handles
      locking and completely avoids the race. If the iavf_adminq_task()
      happens to be delayed until the hardware reset completes, it will no
      longer see the VIRTCHNL_EVENT_RESET_IMPENDING data, as this will have
      been cleared by the hardware reset.
      """

    Verification:
    - With this commit, the following command successfully builds the iavf
      kernel module for both the standard and PREEMPT_RT kernels:
        build-pkgs -c -p iavf

    - A StarlingX ISO image was installed on an All-in-One Dell XR11 lab
      server with one Intel E810 NIC, in low-latency mode.

    - The user who reported this issue was provided with a StarlingX
      designer patch that incorporates this change. The user in question
      did not encounter any issues during their testing with the designer
      patch.

    Closes-Bug: 2058858

    Change-Id: I448ee1e302bdc7277a6c5db990d4d5cfc485a0f4
    Signed-off-by: Jiping Ma <email address hidden>
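
The locking flow described in the root cause analysis (take the critical-task lock before reading the queue, and re-queue instead of sleeping when the lock is busy) can be sketched in user-space C as below. This is a simplified model, not the iavf driver code: struct adapter_model, adminq_task_model() and the other names are hypothetical, a C11 atomic_bool stands in for the __IAVF_IN_CRITICAL_TASK bit lock, and a counter stands in for re-queuing the work item.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the adapter state; not the real driver struct. */
struct adapter_model {
    atomic_bool in_critical_task;  /* models the __IAVF_IN_CRITICAL_TASK bit lock */
    bool in_remove;                /* models "driver has already started remove" */
    bool reset_impending_queued;   /* models a queued VIRTCHNL_EVENT_RESET_IMPENDING */
    int requeue_count;             /* models re-queuing the admin queue task */
    int reset_events_handled;      /* reset events the admin queue task acted on */
};

enum task_result { TASK_PROCESSED, TASK_REQUEUED, TASK_DROPPED };

static void adapter_model_init(struct adapter_model *a)
{
    atomic_init(&a->in_critical_task, false);
    a->in_remove = false;
    a->reset_impending_queued = false;
    a->requeue_count = 0;
    a->reset_events_handled = 0;
}

/* Non-blocking acquire: returns true only if the caller now owns the lock. */
static bool critical_task_trylock(struct adapter_model *a)
{
    return !atomic_exchange(&a->in_critical_task, true);
}

static void critical_task_unlock(struct adapter_model *a)
{
    bool was_held = atomic_exchange(&a->in_critical_task, false);
    assert(was_held);  /* unlocking a lock nobody holds is a logic error */
    (void)was_held;
}

/* Model of the reworked admin queue task. */
static enum task_result adminq_task_model(struct adapter_model *a)
{
    if (!critical_task_trylock(a)) {
        /* Lock busy: do not sleep and do not read the queue yet. */
        if (a->in_remove)
            return TASK_DROPPED;   /* removal started, so do not re-queue */
        a->requeue_count++;        /* try again later */
        return TASK_REQUEUED;
    }

    /* The queue is read only while the lock is held, so no stale event
     * can be carried across a reset performed by another task. */
    if (a->reset_impending_queued) {
        a->reset_impending_queued = false;
        a->reset_events_handled++;
    }

    critical_task_unlock(a);
    return TASK_PROCESSED;
}
```

If another task (the watchdog, in the analysis above) holds the lock and completes the reset, clearing the queued event, the model's admin queue task is re-queued and later finds nothing stale to act on.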