bnxt_en_bpo: TX timed out triggering Netdev Watchdog Timer
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | High | Nivedita Singhvi |
Xenial | Fix Released | High | Nivedita Singhvi |
Bug Description
[Impact]
The bnxt_en_bpo driver hit TX timeouts, causing network stalls on the system and failures to send data and heartbeat packets.
The following 25Gb Broadcom NIC error was seen once on Xenial,
running the 4.4.0-141-generic kernel on an amd64 host
under moderate-to-heavy network traffic:
* The bnxt_en_bpo driver froze on a "TX timed out" error
and triggered the Netdev Watchdog timer under load.
* From kernel log:
"NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
See attached kern.log excerpt file for full excerpt of error log.
* Release = Xenial
Kernel = 4.4.0-141-generic #167
eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet
* This caused the driver to reset in order to recover:
"bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"
driver: bnxt_en_bpo
version: 1.8.1
source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()
* The loss of connectivity and softirq stall caused other failures
on the system.
* The bnxt_en_bpo driver is the imported Broadcom driver
pulled in to support newer Broadcom HW (specific boards),
while the bnxt_en module continues to support the older
HW. The current Linux upstream driver does not compile
easily with the 4.4 kernel (too many changes).
* This fix, already present in upstream and in the in-tree
bnxt_en driver, is a likely solution:
"bnxt_en: Fix TX timeout during netpoll"
commit: 73f21c653f930f4
This fix has not been applied to the bnxt_en_bpo driver
version, but review of the code indicates that it is
susceptible to the bug, and the fix would be reasonable.
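When triaging a host for this failure, the watchdog message quoted in the kernel log above can be searched for mechanically. The following is a minimal Python sketch, not part of the fix: the regular expression is modeled only on the single log line quoted in this report, and the function name `find_tx_timeouts` is made up for illustration.

```python
import re

# Modeled on the line quoted in this report:
# "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return one dict per watchdog TX-timeout message found in log_text."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append({
                "iface": match.group("iface"),
                "driver": match.group("driver"),
                "queue": int(match.group("queue")),
            })
    return hits


# Sample input built from the two messages quoted in this report
# (timestamps and other kern.log prefixes are omitted).
sample = (
    "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out\n"
    "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!\n"
)

print(find_tx_timeouts(sample))
```

Because the pattern uses `search` rather than `match`, it should also work on full kern.log lines that carry a timestamp and hostname prefix, though that has only been reasoned about here, not checked against the attached log.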
[Test Case]
* Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver.
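Since the bug cannot be reproduced on demand, it may at least help to identify which hosts are candidates for it. A rough Python sketch of the condition stated above (4.4 kernel plus the bnxt_en_bpo driver) follows; the function name `is_candidate_host` and the 4.15 release string in the example are illustrative assumptions, not taken from the affected systems.

```python
def is_candidate_host(kernel_release, driver):
    """True if the host matches the conditions named in this report:
    a 4.4-series kernel with the NIC bound to the bnxt_en_bpo driver."""
    return kernel_release.startswith("4.4.") and driver == "bnxt_en_bpo"


# The first pair is the configuration from this report; the second is a
# hypothetical hwe-kernel host using the in-tree driver.
print(is_candidate_host("4.4.0-141-generic", "bnxt_en_bpo"))
print(is_candidate_host("4.15.0-45-generic", "bnxt_en"))
```

On a live system the two arguments could be taken from `platform.release()` and from the `driver:` line of `ethtool -i <interface>`.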
[Regression Potential]
* The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed driver).
* The patch is very small and backport is fairly minimal and simple.
* The fix has been running in the in-tree driver in upstream mainline as well as in the Ubuntu Linux in-tree driver. Although the Broadcom driver has a lot of lower-level code that is different, this piece is still the same.
Changed in linux (Ubuntu): | |
status: | Incomplete → Confirmed |
description: | updated |
Changed in linux (Ubuntu Xenial): | |
status: | New → Confirmed |
importance: | Undecided → High |
Changed in linux (Ubuntu Xenial): | |
status: | Confirmed → In Progress |
assignee: | nobody → Nivedita Singhvi (niveditasinghvi) |
Changed in linux (Ubuntu Xenial): | |
status: | In Progress → Fix Committed |
tags: | added: sts |
tags: | added: cscc |
Changed in linux (Ubuntu): | |
status: | Confirmed → Fix Released |
assignee: | nobody → Nivedita Singhvi (niveditasinghvi) |
Due to earlier NIC flapping observed on systems with the
25Gb Broadcom NIC, the firmware was upgraded to avoid a
known FW bug. The original configuration was:
$ cat ethtool_-i_enp59s0f1d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
expansion-rom-version:
bus-info: 0000:3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
The FW was upgraded on affected systems to:
$ cat ethtool_-i_eno2d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version:
bus-info: 0000:19:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no
Unfortunately, it's not quite clear which FW version the
current bug happened on (I believe the newer one, but can't
confirm; it happened in the midst of several reboots).
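To help pin down which firmware a given host was running, saved `ethtool -i` captures like the two above can be compared field by field. This is a small Python sketch under the assumption that each capture is plain "key: value" text; the helper names `parse_ethtool_i` and `changed_fields` are invented for this example, and only the fields relevant to the comparison are included in the sample data.

```python
def parse_ethtool_i(text):
    """Parse `ethtool -i` style "key: value" lines into a dict."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        # Split on the first colon only, so values such as the PCI
        # address in bus-info keep their own colons.
        key, _, value = line.partition(":")
        info[key.strip()] = value.strip()
    return info


def changed_fields(before, after):
    """Return {field: (old, new)} for every field whose value differs."""
    keys = sorted(set(before) | set(after))
    return {
        key: (before.get(key), after.get(key))
        for key in keys
        if before.get(key) != after.get(key)
    }


# Values copied from the two captures in this report.
old_capture = """driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
bus-info: 0000:3b:00.1
"""
new_capture = """driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
bus-info: 0000:19:00.1
"""

diff = changed_fields(parse_ethtool_i(old_capture), parse_ethtool_i(new_capture))
for field, (old, new) in diff.items():
    print(f"{field}: {old} -> {new}")
```

With the sample data, only `bus-info` and `firmware-version` differ; the driver name and version (1.8.1) are the same in both captures, which is consistent with the firmware being the only thing upgraded.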