bnxt_en_po: TX timed out triggering Netdev Watchdog Timer

Bug #1814095 reported by Nivedita Singhvi
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Nivedita Singhvi
Xenial           High         Nivedita Singhvi

Bug Description

[Impact]

The bnxt_en_bpo driver hit TX timeouts, which caused network stalls and left the system unable to send data and heartbeat packets.

The following 25Gb Broadcom NIC error was seen on Xenial
running the 4.4.0-141-generic kernel on an amd64 host
seeing moderate-heavy network traffic (just once):

* The bnxt_en_bpo driver froze on a "TX timed out" error
  and triggered the Netdev Watchdog timer under load.

* From kernel log:
  "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
  See attached kern.log excerpt file for full excerpt of error log.

* Release = Xenial
  Kernel = 4.4.0-141-generic #167
  eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

* This caused the driver to reset in order to recover:

  "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"

  driver: bnxt_en_bpo
  version: 1.8.1
  source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

* The loss of connectivity and softirq stall caused other failures
  on the system.

* The bnxt_en_bpo driver is the imported Broadcom driver
  pulled in to support newer Broadcom HW (specific boards)
  while the bnxt_en module continues to support the older
  HW. The current Linux upstream driver does not compile
  easily with the 4.4 kernel (too many changes).

* This upstream bnxt_en driver fix is a likely solution:
   "bnxt_en: Fix TX timeout during netpoll"
   commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

  This fix has not been applied to the bnxt_en_bpo driver
  version, but review of the code indicates that it is
  susceptible to the bug, and the fix would be reasonable.
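
A minimal sketch for checking a saved kernel log for the watchdog message quoted above. The function name find_tx_timeouts and the regular expression are assumptions derived only from the single log line shown in this report; real kern.log lines also carry a timestamp and host prefix, which the search simply skips over.

```python
import re

# Matches the watchdog line quoted in this report, e.g.
# "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_lines, driver="bnxt_en_bpo"):
    """Return (interface, queue) pairs for watchdog timeouts from `driver`."""
    hits = []
    for line in log_lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


if __name__ == "__main__":
    # The two messages quoted in the bug description; only the first is a
    # watchdog line, the second is the driver's own reset notice.
    sample = [
        "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out",
        "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!",
    ]
    print(find_tx_timeouts(sample))
```

Passing an open file object instead of the sample list works the same way, since the function only iterates over lines.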

[Test Case]

* Unfortunately, this is not easy to reproduce. Also, it is only seen on 4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver.

[Regression Potential]

* The patch is restricted to the bpo driver, with very constrained scope - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed driver).

* The patch is very small and backport is fairly minimal and simple.

* The fix has been running in the in-tree driver in upstream mainline as well as in the Ubuntu Linux in-tree driver. Although the Broadcom driver has a lot of lower-level code that is different, this piece is still the same.

Nivedita Singhvi (niveditasinghvi) wrote :

Due to earlier NIC flapping observed on systems for the
25Gb Broadcom NIC, with originally the following config,
the firmware was upgraded to avoid a known FW bug:

$ cat ethtool_-i_enp59s0f1d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03
expansion-rom-version:
bus-info: 0000:3b:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

The FW was upgraded on affected systems to:

$ cat ethtool_-i_eno2d1
driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version:
bus-info: 0000:19:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: no
supports-priv-flags: no

Unfortunately, it's not quite clear which FW version the
current bug happened on (I believe the newer one, but can't
confirm -- it happened in the midst of several reboots).
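
Since the firmware version in use matters here, a small sketch for pulling it out of saved `ethtool -i` output such as the two listings above. The helper name parse_ethtool_info is hypothetical, and the parsing assumes only the "key: value" layout visible in those listings.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style "key: value" lines into a dict.

    Only the first colon splits key from value, so values that contain
    colons themselves (such as bus-info) are kept intact.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    # Excerpt of the post-upgrade listing quoted in this comment.
    sample = """driver: bnxt_en_bpo
version: 1.8.1
firmware-version: 214.0.166/1.9.2 pkg 21.40.16.6
expansion-rom-version:
bus-info: 0000:19:00.1
"""
    info = parse_ethtool_info(sample)
    print(info["driver"], info["firmware-version"])
```

Running this over the saved listing from each host makes it easy to see which systems are still on the older firmware.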

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1814095

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Nivedita Singhvi (niveditasinghvi) wrote :

If anyone is interested and willing to test a 4.4 kernel
patched with the fix "bnxt_en: Fix TX timeout during netpoll"
backported to the bnxt_en_bpo driver, please find the packages
here:

http://people.canonical.com/~nivedita/bpo/

description: updated
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
importance: Undecided → High
Terry Rudd (terrykrudd) wrote :

Nivedita, per the request to test this patch, determining the correct FW version seems an open issue. There is also the issue of having hardware available to properly test.

Have you been able to determine a reproducer for this bug?
Do you know if anyone has been able to test the backport?
Can you confirm if the request is to actually get the patch merged to xenial at this time?

Changed in linux (Ubuntu Xenial):
status: Confirmed → In Progress
assignee: nobody → Nivedita Singhvi (niveditasinghvi)
Nivedita Singhvi (niveditasinghvi) wrote :

Terry,

We've had a lot of discussion over this bug. It does not have
a reliable reproducer, and I have not yet received any acks
on testing of the above.

Our thinking was that it was still better to patch it, since
it has been seen with the mainline driver as well and we'd like
to avoid a recurrence of the situation.

The need is to have the fix be available in the Xenial official
bits, for sure (rather than providing a temporary test kernel
via our ppa or something, for instance).

FWIW, here are the boards in question:
enum board_idx {
        BCM57301,
        BCM57417_NPAR,
        BCM58700,
        BCM57311,
        BCM57312,
        BCM57402,
        BCM57402_NPAR,
        BCM57407,
        BCM57412,
        BCM57414,
        BCM57416,
        BCM57417,
        BCM57412_NPAR,
        BCM57314,
        BCM57417_SFP,
        BCM57416_SFP,
        BCM57404_NPAR,
        BCM57406_NPAR,
        BCM57407_SFP,
        BCM57407_NPAR,
        BCM57414_NPAR,
        BCM57416_NPAR,
        BCM57452,
        BCM57454,
        NETXTREME_E_VF,
        NETXTREME_C_VF,
};

Per conversation with Brad and Jay, it was agreed that patching
the bnxt_en_bpo driver only with this fix was the way to go,
despite the lack of a reproducer, rather than pulling in an
entire new driver from Broadcom as also potentially mulled over.

The FW version the issue was hit on:
firmware-version: 20.8.163/1.8.4 pkg 20.08.04.03

But it might be best to test with latest available
firmware (214.0.166/1.9.2 pkg 21.40.16.6 or later).

Not sure if that helps? Let me know if I can address anything
else.

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Nivedita Singhvi (niveditasinghvi) wrote :

Just briefly wanted to say that this is one we've discussed at
length -- we may not be able to get someone who has the right
NIC to test with it in time.

I'm sanity checking the kernel, but that is not exercising the
key change here.

We may need to assume verification-done for our purposes
here, if that is possible.

Scott Smith (bscott.smith) wrote :

Are there repro steps that can be passed along to test?

Nivedita Singhvi (niveditasinghvi) wrote :

I am not sure we could deterministically provoke the
issue. At the very least to ensure no other regression
was introduced, I would run it under heavy network load.

The environment in question which saw the issue had
network load, contention for CPUs and several other
issues occurring at the same time.

The basic environment is:

1. For any 25Gb NIC/chipset that requires the 4.4 bnxt_en_bpo
   driver, set its 2 ports/interfaces up in bonding mode
   as follows:

bond-lacp-rate fast
bond-master bond0
bond-miimon 100
bond-mode 802.3ad
bond-xmit-hash-policy layer3+4
mtu 9000

2. Run any heavy TCP network load test over the systems
   (e.g. iperf, netperf, file transfer, etc.)

3. Theoretically, it would appear that if the number of tx
   ring descriptors were lower, then this would be more
   likely to be hit (not successfully proven by testing
   here), but you can lower it and see if that helps:

   # ethtool -G eno49 tx 128    (for example)

I am not sure if that helps, Scott. I'll try and smoke
up more specific steps but I cannot guarantee you will
see the issue.
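
Steps 2 and 3 could be scripted roughly as follows. This is only a sketch: the interface name eno49 comes from the example above, the peer address 192.0.2.10 is a documentation placeholder, the duration and stream count are arbitrary, and iperf is just one of the load tools mentioned. The commands are printed rather than executed, since `ethtool -G` needs root and resets the ring.

```python
def build_repro_commands(iface, peer, tx_ring=128, seconds=600, streams=8):
    """Return the command lines for one load-test round, without running them."""
    return [
        # Step 3: shrink the TX ring, as in "ethtool -G eno49 tx 128".
        ["ethtool", "-G", iface, "tx", str(tx_ring)],
        # Step 2: heavy TCP load toward the peer using parallel iperf streams.
        ["iperf", "-c", peer, "-t", str(seconds), "-P", str(streams)],
    ]


if __name__ == "__main__":
    # Placeholder interface and peer; substitute the bonded ports under test.
    for command in build_repro_commands("eno49", "192.0.2.10"):
        print(" ".join(command))
```

Each list can be handed to subprocess.run() once the values match the system under test; there is still no guarantee that the timeout will be triggered.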

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-145.171

---------------
linux (4.4.0-145.171) xenial; urgency=medium

  * linux: 4.4.0-145.171 -proposed tracker (LP: #1821724)

  * linux-generic should depend on linux-base >=4.1 (LP: #1820419)
    - [Packaging] Fix linux-base dependency

linux (4.4.0-144.170) xenial; urgency=medium

  * linux: 4.4.0-144.170 -proposed tracker (LP: #1819660)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts
    - [Packaging] resync retpoline extraction

  * C++ demangling support missing from perf (LP: #1396654)
    - [Packaging] fix a mistype

  * CVE-2019-9213
    - mm: enforce min addr even if capable() in expand_downwards()

  * CVE-2019-3460
    - Bluetooth: Check L2CAP option sizes returned from l2cap_get_conf_opt

  * Xenial update: 4.4.176 upstream stable release (LP: #1818815)
    - net: fix IPv6 prefix route residue
    - vsock: cope with memory allocation failure at socket creation time
    - hwmon: (lm80) Fix missing unlock on error in set_fan_div()
    - net: Fix for_each_netdev_feature on Big endian
    - net: Add header for usage of fls64()
    - tcp: tcp_v4_err() should be more careful
    - net: Do not allocate page fragments that are not skb aligned
    - tcp: clear icsk_backoff in tcp_write_queue_purge()
    - vxlan: test dev->flags & IFF_UP before calling netif_rx()
    - net: stmmac: Fix a race in EEE enable callback
    - net: ipv4: use a dedicated counter for icmp_v4 redirect packets
    - x86: livepatch: Treat R_X86_64_PLT32 as R_X86_64_PC32
    - mfd: as3722: Handle interrupts on suspend
    - mfd: as3722: Mark PM functions as __maybe_unused
    - net/x25: do not hold the cpu too long in x25_new_lci()
    - mISDN: fix a race in dev_expire_timer()
    - ax25: fix possible use-after-free
    - Linux 4.4.176

  * sky2 ethernet card don't work after returning from suspension
    (LP: #1798921) // Xenial update: 4.4.176 upstream stable release
    (LP: #1818815)
    - sky2: Increase D3 delay again

  * Xenial update: 4.4.175 upstream stable release (LP: #1818813)
    - drm/bufs: Fix Spectre v1 vulnerability
    - staging: iio: adc: ad7280a: handle error from __ad7280_read32()
    - ASoC: Intel: mrfld: fix uninitialized variable access
    - scsi: lpfc: Correct LCB RJT handling
    - ARM: 8808/1: kexec:offline panic_smp_self_stop CPU
    - dlm: Don't swamp the CPU with callbacks queued during recovery
    - x86/PCI: Fix Broadcom CNB20LE unintended sign extension (redux)
    - powerpc/pseries: add of_node_put() in dlpar_detach_node()
    - serial: fsl_lpuart: clear parity enable bit when disable parity
    - ptp: check gettime64 return code in PTP_SYS_OFFSET ioctl
    - staging:iio:ad2s90: Make probe handle spi_setup failure
    - staging: iio: ad7780: update voltage on read
    - ARM: OMAP2+: hwmod: Fix some section annotations
    - modpost: validate symbol names also in find_elf_symbol
    - perf tools: Add Hygon Dhyana support
    - soc/tegra: Don't leak device tree node reference
    - f2fs: move dir data flush to write checkpoint process
    - f2fs: fix wrong return value of f2fs_acl_create
    - sunvdc: Do not spin in an infin...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
tags: added: sts
Brad Figg (brad-figg)
tags: added: cscc
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
assignee: nobody → Nivedita Singhvi (niveditasinghvi)