Intel E810-XXV - NETDEV WATCHDOG: (ice): transmit queue timed out
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Status tracked in Noble | | |
Jammy | Fix Released | Medium | Robert Malz |
Mantic | Fix Released | Medium | Robert Malz |
Noble | In Progress | Medium | Robert Malz |
Bug Description
[Impact]
* The issue causes a transmit hang on E810 ports with bonding enabled.
* Based on the provided logs, the TX hang can last as long as a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
* Originally, the issue was observed during Tempest tests on a newly created OpenStack cluster, which prevented the cluster from being certified.
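When checking whether a node is affected, it can help to scan the kernel log for the watchdog message this bug is named after. The following is a minimal sketch, not part of the fix: it assumes the message format quoted in the original description further down ("NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out"), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   [33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
# search() is used, so any prefix (timestamp) or suffix is tolerated.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_lines, driver="ice"):
    """Return (interface, queue) tuples for watchdog timeouts logged by `driver`."""
    hits = []
    for line in log_lines:
        match = WATCHDOG_RE.search(line)
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits


def timeouts_per_interface(log_lines, driver="ice"):
    """Count watchdog timeouts per interface (hypothetical helper)."""
    return Counter(iface for iface, _ in find_tx_timeouts(log_lines, driver))
```

The functions accept any iterable of lines, so they can be fed the output of `dmesg` or `journalctl -k` captured to a file.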
[Fix]
* Initially, Intel engineers proposed a workaround that disables LAG initialization [1].
This change was tested in an environment where the issue is easily reproduced.
After multiple iterations, no reproduction was observed.
* Shortly after, Intel proposed a patch [2] to disable LAG initialization if NVM does not expose proper capabilities.
[Test Plan]
* To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem could sometimes manifest while deploying the cluster, or after the cluster was already deployed, during the Tempest test run.
* The issue could appear on a random node, making reproduction hard to achieve.
* Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.
[Where problems could occur]
* All ice drivers with ice_lag_
* CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that at some point an NVM with this capability will be released.
Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.
[1] - https:/
[2] - https:/
[Other Info]
* The issue could be reproduced on a custom 6.2 jammy-hwe kernel with the ice driver backported from a mainline kernel that predates patch [2].
* Original description of the case below:
I'm having issues with an Intel E810-XXV card on a Dell server under Ubuntu Jammy.
Details:
- hardware --> a1:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-XXV for SFP (rev 02)
- tested with both GA and HWE kernels (`5.15.0-83-generic #92` and `6.2.0-32-generic #32~22.
- using a bond over the two ports of the same card, at 25Gbps, connected to two different switches; the bond uses LACP with hash layer3+4 and fast timeout. However, I believe the bug is not directly related to bonding, as the problem seems to be in the interface.
- machine installed by MAAS. No issues during installation, but at that time the bond is not formed yet; later, when Linux is booted, the bond is formed and works without issues for a while
- it works fine for about 2 to 3 hours, then the issue starts (may or may not be related to network load, but it seems to be triggered by some tests that I run after OpenStack finishes installing)
- one of the legs of the bond freezes and everything that would go to that lag is discarded, in and out; pings to random external hosts start losing every second packet
- after some time, messages about "NETDEV WATCHDOG: enp161s0f0 (ice): transmit queue 166 timed out" appear in the kernel log, along with a stack trace
- the switch does log that the bond is flapping
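To confirm that a test host matches the bond setup described above (LACP, layer3+4 hash, fast timeout), the bonding driver's sysfs attributes can be read back. This is only a sketch: the bond name `bond0` is a placeholder, the function names are hypothetical, and it assumes the usual `/sys/class/net/<bond>/bonding/` files, where values are printed as a name followed by a number (for example `802.3ad 4`).

```python
from pathlib import Path

# Settings taken from the description: LACP (802.3ad), layer3+4 hash, fast timeout.
EXPECTED = {
    "mode": "802.3ad",
    "xmit_hash_policy": "layer3+4",
    "lacp_rate": "fast",
}


def read_bond_settings(bond, sysfs_root="/sys/class/net"):
    """Read the raw bonding attributes of `bond` from sysfs as stripped strings."""
    base = Path(sysfs_root) / bond / "bonding"
    names = list(EXPECTED) + ["slaves"]
    return {name: (base / name).read_text().strip() for name in names}


def bond_mismatches(settings):
    """Return {attribute: actual value} for attributes that differ from EXPECTED."""
    mismatches = {}
    for key, wanted in EXPECTED.items():
        # sysfs prints "<name> <number>", e.g. "layer3+4 1"; compare the name only.
        fields = settings.get(key, "").split()
        actual = fields[0] if fields else ""
        if actual != wanted:
            mismatches[key] = actual
    return mismatches
```

For example, `bond_mismatches(read_bond_settings("bond0"))` returns an empty dict when the bond is configured as in this report; the `slaves` entry lists the member ports so it can be checked that both ports of the E810 card are enslaved.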
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Sep 12 20:05 seq
crw-rw---- 1 root audio 116, 33 Sep 12 20:05 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckR
CloudArchitecture: x86_64
CloudID: none
CloudName: none
CloudPlatform: none
CloudSubPlatform: config
DistroRelease: Ubuntu 22.04
InstallationDate: Installed on 2023-08-22 (24 days ago)
InstallationMedia: Ubuntu-Server 22.04.3 LTS "Jammy Jellyfish" - Release amd64 (20230810)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R7515
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware 20220329.
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy uec-images
Uname: Linux 5.15.0-83-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 07/27/2023
dmi.bios.release: 2.12
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.12.4
dmi.board.name: 0J91V2
dmi.board.vendor: Dell Inc.
dmi.board.version: A01
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R7515
dmi.product.sku: SKU=08FD;
dmi.sys.vendor: Dell Inc.
---
ProblemType: Bug
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Sep 15 03:13 seq
crw-rw---- 1 root audio 116, 33 Sep 15 03:13 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.5
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse:
Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/215602/fd/10: Permission denied
Cannot stat file /proc/323635/fd/10: Permission denied
CRDA: N/A
CasperMD5CheckR
CloudArchitecture: x86_64
CloudID: maas
CloudName: maas
CloudPlatform: maas
CloudSubPlatform: seed-dir (http://
DistroRelease: Ubuntu 22.04
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Dell Inc. PowerEdge R7525
NonfreeKernelMo
Package: linux (not installed)
PciMultimedia:
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RebootRequiredPkgs: Error: path contained symlinks.
RelatedPackageV
linux-
linux-
linux-firmware 20220329.
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
Tags: jammy uec-images
Uname: Linux 6.2.0-32-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 07/26/2023
dmi.bios.release: 2.12
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.12.4
dmi.board.name: 03WYW4
dmi.board.vendor: Dell Inc.
dmi.board.version: A02
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.
dmi.product.family: PowerEdge
dmi.product.name: PowerEdge R7525
dmi.product.sku: SKU=08FF;
dmi.sys.vendor: Dell Inc.
mtime.conffile.
description: | updated |
Changed in linux (Ubuntu Jammy): | |
importance: | Undecided → Medium |
status: | New → In Progress |
Changed in linux (Ubuntu Mantic): | |
importance: | Undecided → Medium |
status: | New → In Progress |
Changed in linux (Ubuntu): | |
status: | Confirmed → Invalid |
Changed in linux (Ubuntu): | |
status: | Invalid → Confirmed |
status: | Confirmed → In Progress |
importance: | Undecided → Medium |
assignee: | nobody → Robert Malz (rmalz) |
Changed in linux (Ubuntu Jammy): | |
assignee: | nobody → Robert Malz (rmalz) |
Changed in linux (Ubuntu Mantic): | |
assignee: | nobody → Robert Malz (rmalz) |
Changed in linux (Ubuntu Jammy): | |
status: | In Progress → Fix Committed |
Changed in linux (Ubuntu Mantic): | |
status: | In Progress → Fix Committed |
tags: |
added: verification-done-mantic-linux removed: verification-needed-mantic-linux |
This is the log from the HWE kernel:
[33219.508873] ------------[ cut here ]------------
[33219.508877] NETDEV WATCHDOG: enp161s0f1 (ice): transmit queue 35 timed out
[33219.508932] WARNING: CPU: 48 PID: 0 at net/sched/sch_generic.c:525 dev_watchdog+0x21f/0x230
[33219.508940] Modules linked in: sch_ingress nf_conntrack_netlink geneve ip6_udp_tunnel udp_tunnel xt_CT dm_crypt scsi_transport_iscsi veth nfnetlink_cttimeout openvswitch nsh nf_conncount unix_diag nft_masq zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink bridge sunrpc nvme_fabrics 8021q garp mrp stp llc bonding tls binfmt_misc ipmi_ssif intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd dell_wmi kvm_amd video ledtrig_audio nls_iso8859_1 irdma sparse_keymap kvm i40e irqbypass dell_smbios dcdbas ib_uverbs rapl dell_wmi_descriptor wmi_bmof ib_core ccp ptdma k10temp acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops
[33219.509051] reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear cdc_ether usbnet mii mgag200 i2c_algo_bit drm_shmem_helper drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect sysimgblt crc32_pclmul bcache polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ahci xhci_pci cryptd ice tg3 libahci drm megaraid_sas i2c_piix4 xhci_pci_renesas nvme_common wmi
[33219.509114] CPU: 48 PID: 0 Comm: swapper/48 Tainted: P O 6.2.0-32-generic #32~22.04.1-Ubuntu
[33219.509116] Hardware name: Dell Inc. PowerEdge R7525/03WYW4, BIOS 2.12.4 07/26/2023
[33219.509118] RIP: 0010:dev_watchdog+0x21f/0x230
[33219.509122] Code: 00 e9 31 ff ff ff 4c 89 e7 c6 05 66 83 78 01 01 e8 56 00 f8 ff 44 89 f1 4c 89 e6 48 c7 c7 08 4f e4 b7 48 89 c2 e8 61 df 2b ff <0f> 0b e9 22 ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90 90
[33219.509123] RSP: 0018:ffffb42719fd0e70 EFLAGS: 00010246
[33219.509125] RAX: 0000000000000000 RBX: ffff9bd91b3e74c8 RCX: 0000000000000000
[33219.509126] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[33219.509127] RBP: ffffb42719fd0e98 R08: 0000000000000000 R09: 0000000000000000
[33219.509128] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9bd91b3e7000
[33219.509129] R13: ffff9bd91b3e741c R14: 0000000000000023 R15: 0000000000000000
[33219.509130] FS: 0000000000000000(0000) GS:ffff9b573de00000(0000) knlGS:0000000000000000
[33219.509132] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[33219.509133] CR2: 000055fd64034000 CR3: 0000010273ae2004 CR4: 0000000000770ee0
[33219.509135] PKRU: 55555554
[33219.509135] Call Trace:
[33219.509137] <IRQ>
[33219.509140] ? show_regs+0x72/0x90
[33219.509145] ? dev_watchdog+0x21f/0x230
[33219.509147] ? __warn+0x8...