bnx2x: fatal hardware error/reboot/tx timeout with LLDP enabled
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Critical | Mauricio Faria de Oliveira |
Xenial | Fix Released | Critical | Mauricio Faria de Oliveira |
Bionic | Fix Released | Critical | Mauricio Faria de Oliveira |
Disco | Won't Fix | Critical | Mauricio Faria de Oliveira |
Eoan | Fix Released | Critical | Mauricio Faria de Oliveira |
Bug Description
[Impact]
* The bnx2x driver may cause hardware faults (leading to
panic/reboot) and other failures such as transmit timeouts
once commit 3968d38917eb ("bnx2x: Fix Multi-Cos.") is
applied.
* This issue has been observed by a user shortly
after starting docker & kubelet, with adapters:
- Broadcom NetXtreme II BCM57800 [14e4:168a] from Dell [1028:1f5c]
- Broadcom NetXtreme II BCM57840 [14e4:16a1] from Dell [1028:1f79]
* If the kernel options to ignore hardware faults are used
(erst_disable=1 hest_disable=1 ghes.disable=1),
the system does not panic/reboot; instead it hits
timeouts on adapter stats, then transmit timeouts,
and prints some adapter firmware dumps, leaving
the network interface non-functional.
* The issue only happens when LLDP is enabled
on the network switches, and the crashdump shows
the bnx2x driver stuck waiting for firmware
to complete the stop traffic command in LLDP
handling. The workaround is to disable LLDP
on the network switches/ports.
* Analysis of the driver and firmware dumps
didn't help significantly towards finding
the root cause.
* Upstream/mainline recently reverted the
patch, due to similar problem reports, while
the root cause/proper fix is investigated.
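
A minimal sketch for checking whether a host carries one of the two adapters named above, matching on the PCI vendor:device IDs from this report. The function names are hypothetical, and the ID list is only what the reporter observed; other bnx2x devices may well be affected too.

```shell
#!/bin/sh
# PCI vendor:device IDs of the two adapters named in this report.
# Assumption: this list is not exhaustive for the bnx2x driver.
REPORTED_IDS="14e4:168a 14e4:16a1"

# is_reported_id ID: exit status 0 if ID is one of the reported adapters.
is_reported_id() {
    for id in $REPORTED_IDS; do
        [ "$1" = "$id" ] && return 0
    done
    return 1
}

# list_reported_adapters: print matching devices on this host (needs lspci).
list_reported_adapters() {
    for id in $REPORTED_IDS; do
        lspci -nn -d "$id"
    done
}
```

Running list_reported_adapters on a host prints one line per matching device, or nothing if neither adapter is present.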
[Test Case]
* No reproducible test case found outside
the user's systems/cluster, where it is
enough to start docker & kubelet & wait.
* The user verified test kernels for Xenial
and Bionic - the problem does not happen;
build-tested on Disco.
[Regression Potential]
* Users who rely heavily on non-default
traffic class (tc) / class of service (cos) settings
might see performance changes (if any at all)
in such applications; this is unclear for now.
* This is a recent revert upstream (v5.3-rc'ish),
so there is a chance things might change in this area.
* Nonetheless, the patch is authored by the driver
vendor, and made its way into stable kernels
(e.g., v5.2.8 which made Eoan/19.10 recently).
Changed in linux (Ubuntu): | |
status: | New → In Progress |
assignee: | nobody → Mauricio Faria de Oliveira (mfo) |
Changed in linux (Ubuntu Disco): | |
status: | New → In Progress |
Changed in linux (Ubuntu Bionic): | |
status: | New → In Progress |
Changed in linux (Ubuntu Xenial): | |
status: | New → In Progress |
Changed in linux (Ubuntu Disco): | |
assignee: | nobody → Mauricio Faria de Oliveira (mfo) |
Changed in linux (Ubuntu Bionic): | |
assignee: | nobody → Mauricio Faria de Oliveira (mfo) |
Changed in linux (Ubuntu Xenial): | |
assignee: | nobody → Mauricio Faria de Oliveira (mfo) |
description: | updated |
tags: | added: sts |
Changed in linux (Ubuntu Xenial): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Bionic): | |
importance: | Undecided → High |
Changed in linux (Ubuntu Xenial): | |
importance: | High → Critical |
Changed in linux (Ubuntu Bionic): | |
importance: | High → Critical |
Changed in linux (Ubuntu Disco): | |
importance: | Undecided → Critical |
Changed in linux (Ubuntu Eoan): | |
importance: | Undecided → Critical |
This fix is already present in Eoan and Unstable:
~/git/ubuntu-eoan$ git log --oneline origin/master-next -- drivers/net/ethernet/broadcom/bnx2x/ | head | grep cos
1c41d7b7cf60 bnx2x: Disable multi-cos feature.
~/git/ubuntu-eoan$ git describe --contains 1c41d7b7cf60
Ubuntu-5.2.0-12.13~51
~/git/ubuntu-unstable$ git log --oneline origin/master -- drivers/net/ethernet/broadcom/bnx2x/ | head | grep cos
d1f0b5dce8fd bnx2x: Disable multi-cos feature.
~/git/ubuntu-unstable$ git describe --contains d1f0b5dce8fd
Ubuntu-5.3.0-4.5~313^2~91
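
To repeat the same check on other trees, the grep on "cos" can be tightened to match the full commit subject. A minimal sketch; has_multicos_fix is a hypothetical helper that reads git log --oneline output on stdin:

```shell
#!/bin/sh
# has_multicos_fix: exit status 0 if stdin (git log --oneline output)
# contains the subject of the commit that disables multi-cos.
has_multicos_fix() {
    grep -qF 'bnx2x: Disable multi-cos feature.'
}
```

For example, `git log --oneline origin/master-next -- drivers/net/ethernet/broadcom/bnx2x/ | has_multicos_fix && echo present` prints "present" only when the commit subject is found on that branch.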