qlcnic: Firmware aborts/hangs in QLogic NIC

Bug #1815033 reported by Guilherme G. Piccoli on 2019-02-07
This bug affects 1 person
Affects: linux (Ubuntu), Importance: High, Assigned to: Guilherme G. Piccoli
Affects: Bionic, Importance: High, Assigned to: Guilherme G. Piccoli

Bug Description

[Impact]

* In multi-queue configurations of the qlcnic driver, there is a corner case
  in which TX queue zero is used for regular data transmission by one CPU
  while another CPU uses the same queue's descriptors for MAC configuration.

* When this race happens, it can corrupt TX queue zero, which in turn
  triggers seemingly random firmware aborts/hangs. The following
  kernel log messages were collected during the corruption event:

  qlcnic 0000:01:00.0: Pause control frames disabled on all ports
  qlcnic 0000:01:00.0: firmware hang detected
  qlcnic 0000:01:00.0: Dumping hw/fw registers
  PEG_HALT_STATUS1: 0x40001502, PEG_HALT_STATUS2: 0x3de7a0,
  PEG_NET_0_PC: 0x6d268, PEG_NET_1_PC: 0x6d2ac,
  PEG_NET_2_PC: 0x149, PEG_NET_3_PC: 0x6e105,
  PEG_NET_4_PC: 0x1e00b
  [...]
  qlcnic 0000:01:00.0: Detected state change from DEV_NEED_RESET, skipping ack check

* The following device is known to suffer from the issue (lspci output),
  although a whole class of devices (the vendor's 82XX series) is
  susceptible to it:
  01:00.0 Ethernet controller [0200]: QLogic Corp. cLOM8214 1/10GbE Controller [1077:8020]

* The fix is the following patch, present in the mainline kernel as well as
  in supported stable branches:
  c333fa0c4f22 ("qlcnic: fix Tx descriptor corruption on 82xx devices").
  Link to the patch in Linus' tree: http://git.kernel.org/linus/c333fa0c4f22
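
To make the corner case above concrete, here is a toy model in Python. It is not driver code: the TxRing class, its methods and the two scenario functions are hypothetical names made up for this illustration. The model assumes that each sender is serialized only against other senders on its own queue, so a MAC configuration request posted to queue zero from another queue's context can interleave with regular transmission on queue zero. The second scenario assumes the "specific tx_ring" used for the filter request is the ring the sender is already transmitting on.

```python
"""Toy model of two senders sharing one TX descriptor ring (not driver code)."""


class TxRing:
    def __init__(self, size):
        self.slots = [None] * size
        self.producer = 0  # index of the next free descriptor

    def read_producer(self):
        return self.producer

    def write_and_advance(self, index, payload):
        # Write a descriptor at a previously read index, then publish it.
        self.slots[index] = payload
        self.producer = (index + 1) % len(self.slots)


def racy_interleaving():
    """Both contexts post to ring 0 without being serialized against each other."""
    ring0 = TxRing(8)
    data_idx = ring0.read_producer()    # CPU A: regular transmit on queue 0
    filter_idx = ring0.read_producer()  # CPU B: MAC request, also on queue 0
    ring0.write_and_advance(data_idx, "data frame")
    ring0.write_and_advance(filter_idx, "MAC config request")
    return ring0


def per_queue_requests():
    """Each context posts only to the ring it is already transmitting on."""
    ring0, ring1 = TxRing(8), TxRing(8)
    data_idx = ring0.read_producer()    # CPU A: regular transmit on queue 0
    filter_idx = ring1.read_producer()  # CPU B: MAC request on its own queue 1
    ring0.write_and_advance(data_idx, "data frame")
    ring1.write_and_advance(filter_idx, "MAC config request")
    return ring0, ring1
```

In the racy scenario both contexts read the same producer index, so the second write replaces the first descriptor and the ring advances by only one slot. In the per-queue scenario each ring ends up with exactly the descriptor its owner posted.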

[Test Case]

* Unfortunately this is not easy to reproduce in general, but we have a user
  report with a fairly reliable reproducer: the user runs an NFS
  workload on top of the above PCI adapter. The problem occurs on both the
  4.4 and 4.15 kernels, and goes away on both with the patch proposed here
  for SRU.
  (Note that this is a Bionic-only SRU, since the Ubuntu 4.4 kernel already
  got the patch from Greg's supported stable branch.)
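
When running such a long test, it may help to check mechanically that the machine has the adapter listed above and whether the kernel log shows the hang signature. The following is a minimal sketch in Python, assuming the lspci text comes from `lspci -nn` and the log text from `dmesg`. The function names are hypothetical, and only the one vendor:device ID quoted in this report is checked; other 82XX IDs are not covered.

```python
import re

HANG_MARKER = "firmware hang detected"
REGISTER_RE = re.compile(r"(PEG_[A-Z0-9_]+):\s*(0x[0-9a-fA-F]+)")
AFFECTED_ID = "1077:8020"  # the only ID quoted in this report


def has_affected_adapter(lspci_output):
    """True if any lspci -nn line carries the vendor:device ID from this report."""
    return any(f"[{AFFECTED_ID}]" in line for line in lspci_output.splitlines())


def parse_hang(log_text):
    """Return the PEG register values logged after the first hang marker, or None."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if HANG_MARKER in line:
            regs = {}
            for follow in lines[i + 1:]:
                for name, value in REGISTER_RE.findall(follow):
                    regs[name] = int(value, 16)
            return regs
    return None
```

A result of None from parse_hang() only means the marker string was not found in the text it was given; it does not prove that the firmware never hung.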

[Regression Potential]

* The patch scope is restricted to a single driver, and the code itself
  is self-contained: basically a restriction to a specific tx_ring when
  setting filters. A regression in this path of the driver could, for
  example, cause different firmware issues, but user testing showed the
  patched kernel to be reliable: without the patch the issue happens
  about 6h after machine boot, while with the patch the machine ran
  for more than 8 days without issues.

* Also, the patch is present in the mainline kernel as well as in supported
  stable branches, and is already present in the Ubuntu 4.4 kernel.

summary: - qlcnic: Firmware aborts/hangs in QLogic NIC (qlcnic driver)
+ qlcnic: Firmware aborts/hangs in QLogic NIC
Guilherme G. Piccoli (gpiccoli) wrote :

The patch was posted to the mailing list for the SRU process: https://lists.ubuntu.com/archives/kernel-team/2019-February/098380.html

Stefan Bader (smb) on 2019-02-07
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Bionic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Guilherme G. Piccoli (gpiccoli) wrote :

The kernel was validated by the user who reported the issue: it ran for more than 72h
with no problems.

tags: added: verification-done-bionic
removed: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-47.50

---------------
linux (4.15.0-47.50) bionic; urgency=medium

  * linux: 4.15.0-47.50 -proposed tracker (LP: #1819716)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * C++ demangling support missing from perf (LP: #1396654)
    - [Packaging] fix a mistype

  * arm-smmu-v3 arm-smmu-v3.3.auto: CMD_SYNC timeout (LP: #1818162)
    - iommu/arm-smmu-v3: Fix unexpected CMD_SYNC timeout

  * Crash in nvme_irq_check() when using threaded interrupts (LP: #1818747)
    - nvme-pci: fix out of bounds access in nvme_cqe_pending

  * CVE-2019-9213
    - mm: enforce min addr even if capable() in expand_downwards()

  * CVE-2019-3460
    - Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt

  * amdgpu with mst WARNING on blanking (LP: #1814308)
    - drm/amd/display: Don't use dc_link in link_encoder
    - drm/amd/display: Move wait for hpd ready out from edp power control.
    - drm/amd/display: eDP sequence BL off first then DP blank.
    - drm/amd/display: Fix unused variable compilation error
    - drm/amd/display: Fix warning about misaligned code
    - drm/amd/display: Fix MST dp_blank REG_WAIT timeout

  * tun/tap: unable to manage carrier state from userland (LP: #1806392)
    - tun: implement carrier change

  * CVE-2019-8980
    - exec: Fix mem leak in kernel_read_file

  * raw_skew in timer from the ubuntu_kernel_selftests failed on Bionic
    (LP: #1811194)
    - selftest: timers: Tweak raw_skew to SKIP when ADJ_OFFSET/other clock
      adjustments are in progress

  * [Packaging] Allow overlay of config annotations (LP: #1752072)
    - [Packaging] config-check: Add an include directive

  * CVE-2019-7308
    - bpf: move {prev_,}insn_idx into verifier env
    - bpf: move tmp variable into ax register in interpreter
    - bpf: enable access to ax register also from verifier rewrite
    - bpf: restrict map value pointer arithmetic for unprivileged
    - bpf: restrict stack pointer arithmetic for unprivileged
    - bpf: restrict unknown scalars of mixed signed bounds for unprivileged
    - bpf: fix check_map_access smin_value test when pointer contains offset
    - bpf: prevent out of bounds speculation on pointer arithmetic
    - bpf: fix sanitation of alu op with pointer / scalar type from different
      paths
    - bpf: add various test cases to selftests

  * CVE-2017-5753
    - bpf: properly enforce index mask to prevent out-of-bounds speculation
    - bpf: fix inner map masking to prevent oob under speculation

  * BPF: kernel pointer leak to unprivileged userspace (LP: #1815259)
    - bpf/verifier: disallow pointer subtraction

  * squashfs hardening (LP: #1816756)
    - squashfs: more metadata hardening
    - squashfs metadata 2: electric boogaloo
    - squashfs: more metadata hardening
    - Squashfs: Compute expected length from inode size rather than block length

  * efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted (LP: #1814982)
    - efi/arm/arm64: Allow SetVirtualAddressMap() to be omitted

  * Update ENA driver to version 2.0.3K (LP: #1816806)...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
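
To check whether an installed Bionic kernel package is at or past the fixed version, something like the following Python sketch could be used. It is a simplification: it compares only the numeric fields of versions shaped like 4.15.0-47.50 and is not the full Debian version comparison, so versions with extra suffixes may be ordered wrongly. The function names are hypothetical.

```python
import re

FIXED_IN = "4.15.0-47.50"  # package version named in the fix announcement above


def version_key(version):
    """Split a version such as '4.15.0-47.50' into a tuple of integers."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def has_fix(installed_version, fixed_in=FIXED_IN):
    """True if the installed version sorts at or after the fixed version."""
    return version_key(installed_version) >= version_key(fixed_in)
```

For example, has_fix("4.15.0-46.49") is False while has_fix("4.15.0-47.50") is True.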