NETDEV WATCHDOG: eno12399np0 (bnxt_en): transmit queue 4 timed out

Bug #2067712 reported by yuan.lu
This bug affects 3 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned
Focal           Confirmed  Undecided   Unassigned

Bug Description

Issue Description:
We encountered a network device timeout on our server, reported as a NETDEV WATCHDOG event. The timeout occurred on transmit queue 4 of the network interface eno12399np0, which uses the bnxt_en driver.

Error Log:

Time of Incident: May 31 03:53:35
Error Message:
NETDEV WATCHDOG: eno12399np0 (bnxt_en): transmit queue 4 timed out
WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:472 dev_watchdog+0x270/0x280
Kernel Version: 5.4.0-182-generic #202-Ubuntu
Hardware: Dell Inc. PowerEdge R650, BIOS 1.13.2 dated 12/19/2023
Modules Linked:
The full list of kernel modules loaded at the time appears in the kernel log below. It includes networking and system management modules, which may be relevant to diagnosing the issue.

Steps Taken:
We checked the physical connections and rebooted the server, but neither resolved the issue. The network interface appears to fail sporadically, leading to these watchdog timeouts.

Questions:

- Has anyone experienced similar issues with the bnxt_en driver or similar hardware configurations?
- Are there known issues with this driver version on Ubuntu 20.04 LTS that could lead to transmit queue timeouts?
- Any recommendations on driver updates, kernel patches, or configuration changes that could help mitigate this problem?

Additional Context:

- The server is critical to our operations, handling high network traffic loads.
- This is the first occurrence after a recent system update.

Request for Assistance:

- Insights on debugging further at the kernel level or specific logs that would be useful to examine.
- Suggestions for temporary workarounds or permanent fixes from community members with experience in network management and kernel troubleshooting.
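
One piece of information that is commonly attached to reports like this is the driver and firmware version that `ethtool -i` prints for the interface. The following is a minimal Python sketch for collecting it, assuming `ethtool` is installed and that its output uses the usual `key: value` line format. The function names `parse_ethtool_info` and `driver_info` are hypothetical helpers introduced here, not part of any existing tool.

```python
import subprocess


def parse_ethtool_info(text):
    """Parse 'key: value' lines (the format printed by `ethtool -i`) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI
        # addresses (which contain colons) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface):
    """Run `ethtool -i <iface>` and return its fields as a dict.

    Assumes ethtool is on PATH; raises CalledProcessError if the
    command fails (for example, if the interface does not exist).
    """
    result = subprocess.run(
        ["ethtool", "-i", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool_info(result.stdout)
```

On the affected host, something like `driver_info("eno12399np0")` would then return the driver name, driver version, and firmware version fields, which can be pasted into the report alongside the kernel log.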

May 31 03:53:35 onf-hk-comp006 kernel: [16160.756411] ------------[ cut here ]------------
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756415] NETDEV WATCHDOG: eno12399np0 (bnxt_en): transmit queue 4 timed out
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756450] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:472 dev_watchdog+0x270/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756452] Modules linked in: nf_conntrack_netlink vhost_net vhost tap xsk_diag udp_diag raw_diag unix_diag af_packet_diag netlink_diag tcp_diag inet_diag ip6table_raw xt_CT xt_mac xt_set xt_multiport xt_tcpudp xt_state xt_conntrack xt_comment xt_physdev ip_set_hash_net ip_set iptable_raw veth sch_ingress vxlan ebtable_filter ip6_udp_tunnel udp_tunnel ebtables ip6table_filter nfnetlink_cttimeout nfnetlink iptable_filter bpfilter aufs rdma_ucm ib_uverbs rdma_cm iw_cm ib_cm ib_core overlay 8021q garp mrp bonding nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif binfmt_misc intel_rapl_msr intel_rapl_common joydev nfit x86_pkg_temp_thermal intel_powerclamp dell_smbios input_leds dcdbas dell_wmi_descriptor wmi_bmof coretemp kvm_intel kvm mei_me isst_if_mbox_pci isst_if_mmio isst_if_common mei ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6_tables msr
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756505] br_netfilter bridge ramoops efi_pstore reed_solomon stp llc ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid1 raid0 multipath linear dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor mgag200 drm_vram_helper i2c_algo_bit ttm hid_generic drm_kms_helper syscopyarea raid6_pq sysfillrect sysimgblt libcrc32c usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel fb_sys_fops aesni_intel crypto_simd cryptd nvme glue_helper ahci drm nvme_core bnxt_en tg3 i2c_i801 libahci wmi
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756543] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.4.0-182-generic #202-Ubuntu
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756546] Hardware name: Dell Inc. PowerEdge R650/0FGCWW, BIOS 1.13.2 12/19/2023
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756551] RIP: 0010:dev_watchdog+0x270/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756556] Code: eb 9d 48 8b 5d d0 c6 05 ba 7c 2a 01 01 48 89 df e8 25 ae fa ff 44 89 e1 48 89 de 48 c7 c7 80 a6 20 b4 48 89 c2 e8 be 46 14 00 <0f> 0b e9 77 ff ff ff 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756559] RSP: 0018:ffffae574017ce38 EFLAGS: 00010282
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756562] RAX: 0000000000000000 RBX: ffff9ead25d40000 RCX: 0000000000000006
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756564] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff9ead3f65c8c0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756566] RBP: ffffae574017ce70 R08: 000000000000094a R09: 0000000000000004
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756567] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000004
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756569] R13: ffff9ead25d4dbc0 R14: 000000000000004a R15: ffff9ead25d40480
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756572] FS: 0000000000000000(0000) GS:ffff9ead3f640000(0000) knlGS:0000000000000000
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756574] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756576] CR2: 00007f311800b3c0 CR3: 0000003f1c522004 CR4: 0000000000762ee0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756578] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756580] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756581] PKRU: 55555554
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756583] Call Trace:
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756586] <IRQ>
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756596] ? show_regs.cold+0x1a/0x1f
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756603] ? __warn+0x98/0xe0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756607] ? dev_watchdog+0x270/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756613] ? report_bug+0xd1/0x100
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756621] ? do_error_trap+0x9b/0xc0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756624] ? do_invalid_op+0x3c/0x50
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756628] ? dev_watchdog+0x270/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756634] ? invalid_op+0x1e/0x30
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756638] ? dev_watchdog+0x270/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756641] ? dev_watchdog+0x270/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756645] ? pfifo_fast_enqueue+0x150/0x150
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756652] call_timer_fn+0x32/0x130
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756658] __run_timers.part.0+0x180/0x280
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756663] ? timerqueue_add+0x9b/0xb0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756668] ? enqueue_hrtimer+0x43/0xa0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756671] ? ktime_get+0x3e/0xa0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756676] run_timer_softirq+0x2a/0x50
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756682] __do_softirq+0xd1/0x2c1
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756687] irq_exit+0xae/0xb0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756692] smp_apic_timer_interrupt+0x7b/0x140
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756697] apic_timer_interrupt+0xf/0x20
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756699] </IRQ>
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756706] RIP: 0010:cpuidle_enter_state+0xc5/0x450
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756710] Code: ff e8 cf 06 83 ff 80 7d c7 00 74 17 9c 58 0f 1f 44 00 00 f6 c4 02 0f 85 65 03 00 00 31 ff e8 f2 1e 89 ff fb 66 0f 1f 44 00 00 <45> 85 ed 0f 88 8f 02 00 00 49 63 cd 4c 8b 7d d0 4c 2b 7d c8 48 8d
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756712] RSP: 0018:ffffae5740397e38 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756715] RAX: ffff9ead3f66ff00 RBX: ffffffffb4969be0 RCX: 000000000000001f
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756717] RDX: 0000000000000000 RSI: 000000002dd27b80 RDI: 0000000000000000
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756718] RBP: ffffae5740397e78 R08: 00000eb2b824f134 R09: 000000007fffffff
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756720] R10: ffff9ead3f66ebc0 R11: ffff9ead3f66eba0 R12: ffff9ead33291800
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756722] R13: 0000000000000002 R14: 0000000000000002 R15: ffff9ead33291800
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756728] ? cpuidle_enter_state+0xa1/0x450
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756733] cpuidle_enter+0x2e/0x40
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756739] call_cpuidle+0x23/0x40
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756742] do_idle+0x1dd/0x270
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756747] cpu_startup_entry+0x20/0x30
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756754] start_secondary+0x178/0x1d0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756760] secondary_startup_64+0xa4/0xb0
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756764] ---[ end trace 73ce74318a7baae1 ]---
May 31 03:53:35 onf-hk-comp006 kernel: [16160.756771] bnxt_en 0000:31:00.0 eno12399np0: TX timeout detected, starting reset task!
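
To judge whether the timeouts are confined to one queue or one interface, it can help to tally the watchdog lines across a longer stretch of the kernel log. Below is a minimal Python sketch that matches the message format shown in the log above. The names `find_tx_timeouts` and `count_timeouts` are hypothetical, and the regular expression assumes the message text is exactly as it appears in this report.

```python
import re
from collections import Counter

# Matches the watchdog message as it appears in the log above, e.g.
# "NETDEV WATCHDOG: eno12399np0 (bnxt_en): transmit queue 4 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines):
    """Return a list of (interface, driver, queue) tuples, one per watchdog line."""
    events = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        if match:
            events.append(
                (match["iface"], match["driver"], int(match["queue"]))
            )
    return events


def count_timeouts(lines):
    """Count watchdog events per (interface, driver, queue) combination."""
    return Counter(find_tx_timeouts(lines))
```

The functions accept any iterable of log lines, so an open file handle for a saved kernel log can be passed directly to `count_timeouts`.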

Kleber Sacilotto de Souza (kleber-souza) wrote :

Targeting bug report to Focal for now, as the reported logs indicate a 5.4.0-182-generic running kernel.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Focal):
status: New → Confirmed
Changed in linux (Ubuntu):
status: New → Confirmed
Bob Gibson (rjg-at-work) wrote :

We encountered similar errors after applying updates to Ubuntu 20.04 LTS in mid-April 2024:

```
bnxt_en 0000:c8:00.0 eno1np0: TX timeout detected, starting reset task!
```

The kernel image package was upgraded from linux-image-5.4.0-169-generic to linux-image-5.4.0-176-generic, so the problem appears to have been introduced somewhere in that range of kernel versions.

We worked around the problem by configuring grub to boot the "Linux 5.4.0-169-generic" kernel. The affected servers have been rock solid running that kernel but were unusable running the 5.4.0-176-generic kernel.

/rjg
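
For anyone checking a fleet of hosts against the range described in the comment above, here is a minimal Python sketch that classifies an Ubuntu kernel release string relative to 5.4.0-169 (reported stable) and 5.4.0-176 (reported affected). The function names and category labels are hypothetical, and the sketch assumes release strings of the form `5.4.0-182-generic`. It only restates what the comments in this report say and does not identify which kernel change caused the timeouts.

```python
import re

# Taken from the comment above; these are reports, not a confirmed bisection.
LAST_REPORTED_GOOD_ABI = 169   # 5.4.0-169-generic: reported stable
FIRST_REPORTED_BAD_ABI = 176   # 5.4.0-176-generic: reported affected

# Assumed release format: <major>.<minor>.<patch>-<abi>-<flavour>
RELEASE_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)-(\S+)$")


def parse_release(release):
    """Split a release string into ((major, minor, patch), abi, flavour)."""
    match = RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch, abi = (int(part) for part in match.groups()[:4])
    return (major, minor, patch), abi, match.group(5)


def classify(release):
    """Place a release relative to the range reported in this bug."""
    base, abi, _flavour = parse_release(release)
    if base != (5, 4, 0):
        return "other-series"
    if abi <= LAST_REPORTED_GOOD_ABI:
        return "at-or-before-last-reported-good"
    if abi < FIRST_REPORTED_BAD_ABI:
        return "between-good-and-bad"
    return "at-or-after-first-reported-bad"
```

On a running host, `classify(platform.release())` (with `import platform`) gives the category for the booted kernel. The kernel from the original report, 5.4.0-182-generic, falls in the last category.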

Philipp Hossner (philipp.hossner) wrote :

This also affects Jammy with the HWE kernel, version 6.5.0-35-generic.
