ice driver RTNL assertion failed warning on shutdown/reboot
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| linux (Ubuntu) | Invalid | Undecided | Unassigned | |
| Noble | Fix Released | Undecided | Jacob Martin | |
| linux-nvidia (Ubuntu) | Invalid | Undecided | Unassigned | |
| Noble | Fix Released | Undecided | Jacob Martin | |
Bug Description
This appears to be a regression in 6.8.0-50-generic.
The following warning from the Intel ice driver is reliably triggered on reboot or shutdown on DGXH100:
[ 97.538724] ------------[ cut here ]------------
[ 97.543943] RTNL: assertion failed at net/core/dev.c (6434)
[ 97.550255] WARNING: CPU: 45 PID: 1 at net/core/dev.c:6434 netif_queue_
[ 97.559676] Modules linked in: qrtr intel_rapl_msr intel_rapl_common intel_uncore_
[ 97.559783] ib_core mlx5_core crct10dif_pclmul crc32_pclmul polyval_clmulni ixgbe mlxfw polyval_generic psample ghash_clmulni_intel nvme xfrm_algo ice sha256_ssse3 tls xhci_pci sha1_ssse3 dca gnss nvme_core pci_hyperv_intf xhci_pci_renesas mdio nvme_auth wmi pinctrl_emmitsburg aesni_intel crypto_simd cryptd
[ 97.691627] CPU: 45 PID: 1 Comm: shutdown Not tainted 6.8.0-50-generic #51-Ubuntu
[ 97.700056] Hardware name: NVIDIA DGXH100/DGXH100, BIOS 1.1.3 10/30/2023
[ 97.707606] RIP: 0010:netif_
[ 97.713399] Code: 00 41 83 e7 01 0f 85 39 ff ff ff ba 22 19 00 00 48 c7 c6 86 10 24 86 48 c7 c7 98 0c 28 86 c6 05 7d 13 90 01 01 e8 83 93 20 ff <0f> 0b e9 13 ff ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 90 90 90
[ 97.734504] RSP: 0018:ff4fe9c7c0
[ 97.740392] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 97.748429] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
[ 97.756464] RBP: ff4fe9c7c0073c38 R08: 0000000000000000 R09: 0000000000000000
[ 97.764494] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[ 97.772529] R13: 0000000000000001 R14: ff39b1cf1a6f9000 R15: 0000000000000000
[ 97.780561] FS: 00007c4b1229f44
[ 97.789671] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 97.796144] CR2: 00007c4b12ef94c0 CR3: 000000013094a006 CR4: 0000000000f71ef0
[ 97.804177] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 97.812210] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 97.820245] PKRU: 55555554
[ 97.823304] Call Trace:
[ 97.826073] <TASK>
[ 97.828455] ? show_regs+0x6d/0x80
[ 97.832305] ? __warn+0x89/0x160
[ 97.835955] ? netif_queue_
[ 97.841062] ? report_
[ 97.845203] ? handle_
[ 97.849146] ? exc_invalid_
[ 97.853478] ? asm_exc_
[ 97.858204] ? netif_queue_
[ 97.863317] ice_vsi_
[ 97.869192] ice_vsi_
[ 97.873864] ice_deinit_
[ 97.878619] ice_remove+
[ 97.883176] ice_shutdown+
[ 97.887736] pci_device_
[ 97.892370] device_
[ 97.896805] kernel_
[ 97.900945] __do_sys_
[ 97.905363] __x64_sys_
[ 97.909694] x64_sys_
[ 97.914029] do_syscall_
[ 97.918162] ? irqentry_
[ 97.923759] ? irqentry_
[ 97.927993] ? exc_page_
[ 97.932422] entry_SYSCALL_
[ 97.938118] RIP: 0033:0x7c4b12e1ba07
[ 97.942226] Code: c7 c0 ff ff ff ff eb be 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 89 fa be 69 19 12 28 bf ad de e1 fe b8 a9 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 e1 c3 0d 00 f7 d8 64 89 02 b8
[ 97.963323] RSP: 002b:00007ffd36
[ 97.971846] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007c4b12e1ba07
[ 97.979879] RDX: 0000000001234567 RSI: 0000000028121969 RDI: 00000000fee1dead
[ 97.987909] RBP: 00007ffd36a493e0 R08: 0000000000000069 R09: 0000000000000000
[ 97.995941] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 98.003973] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000001234567
[ 98.012004] </TASK>
[ 98.014480] ---[ end trace 0000000000000000 ]---
| Changed in linux (Ubuntu): | |
| status: | New → Invalid |
| Changed in linux (Ubuntu Noble): | |
| status: | New → In Progress |
| assignee: | nobody → Jacob Martin (jacobmartin) |
| description: | updated |
| Changed in linux-nvidia (Ubuntu): | |
| status: | New → Invalid |
| Changed in linux-nvidia (Ubuntu Noble): | |
| assignee: | nobody → Jacob Martin (jacobmartin) |
| status: | New → Fix Committed |
| Changed in linux (Ubuntu Noble): | |
| status: | In Progress → Fix Committed |
| tags: | added: kernel-daily-bug |

Patch submitted to kernel team mailing list: https://lists.ubuntu.com/archives/kernel-team/2024-December/155769.html
SRU Justification
[Impact]
An RTNL assertion failed warning is triggered from the Intel ice driver when a
PCIe device using the driver is removed or the system is shutdown/rebooted.
This was caused by commit "ice: move netif_queue_set_napi to rtnl-protected
sections" which was brought in through stable updates in 6.8.0-50-generic. The
commit adds a call to "ice_vsi_clear_napi_queues()" in "ice_vsi_close()", and
in K6.8 "ice_vsi_close()" is called via "ice_remove()". Function "ice_remove()"
is used as the PCI remove callback, and is also called via "ice_shutdown()".
[Fix]
This is resolved by cherry-picking commit "ice: Remove and readd netdev during
devlink reload" from upstream. This commit refactors "ice_remove()" to not call
"ice_vsi_close()". Upstream was not affected because in mainline "ice: Remove
and readd netdev during devlink reload" is a parent to "ice: move
netif_queue_set_napi to rtnl-protected sections".
Only noble:linux and its derivative kernels are affected.
[Test Plan]
(1) Apply patch "ice: Remove and readd netdev during devlink reload" and reboot
into the patched kernel on a system that uses the Intel ice driver.
(2) Reboot or shutdown the system. Observe in the system's console that the
warning splat from LP#2091107 no longer appears.
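
Step (2) can also be checked from a saved console or kernel log rather than by eye. The helper below is a minimal sketch: the function name find_rtnl_splats is hypothetical, and the pattern is assumed from the first line of the splat quoted in the bug description, so it may need adjusting if the message format differs on another kernel.

```python
import re

# Pattern taken from the splat line quoted in this report, for example:
#   RTNL: assertion failed at net/core/dev.c (6434)
RTNL_SPLAT = re.compile(r"RTNL: assertion failed at (\S+) \((\d+)\)")


def find_rtnl_splats(log_text):
    """Return (source_file, line_number) for each RTNL assertion in log_text."""
    return [(m.group(1), int(m.group(2))) for m in RTNL_SPLAT.finditer(log_text)]


if __name__ == "__main__":
    import sys

    # Read a saved log from stdin and report whether the warning is present.
    hits = find_rtnl_splats(sys.stdin.read())
    for source_file, line_number in hits:
        print(f"RTNL assertion at {source_file}:{line_number}")
    sys.exit(1 if hits else 0)
```

One possible way to feed it, assuming the script is saved under a name of your choice and the system keeps a persistent journal, is to pipe in the kernel messages of the previous boot from `journalctl -k -b -1`. An empty result on the patched kernel is consistent with the fix, provided the log actually covers the shutdown sequence.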
[Where problems could occur]
This change affects the Intel ice kernel module specifically. Issues with this
fix would manifest as misbehavior of that driver, which would be used with
select Intel NICs, including "Intel Corporation Ethernet Controller E810-C for
QSFP".