5.15.0-69 ice driver deadlocks with bonded e810 NICs

Bug #2015414 reported by Zev Weiss
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

The ice driver in the 5.15.0-69 kernel deadlocks on rtnl_lock() when adding e810 NICs to a bond interface. Booting with `sysctl.hung_task_panic=1` and `sysctl.hung_task_all_cpu_backtrace=1` added to the kernel command line shows (among lots of other output):

```
[ 244.980100] INFO: task kworker/6:1:182 blocked for more than 120 seconds.
[ 244.988431] Not tainted 5.15.0-69-generic #76-Ubuntu
[ 244.995279] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 245.004826] task:kworker/6:1 state:D stack: 0 pid: 182 ppid: 2 flags:0x00004000
[ 245.015017] Workqueue: events linkwatch_event
[ 245.020734] Call Trace:
[ 245.024144] <TASK>
[ 245.027137] __schedule+0x24e/0x590
[ 245.031848] schedule+0x69/0x110
[ 245.036228] schedule_preempt_disabled+0xe/0x20
[ 245.042066] __mutex_lock.constprop.0+0x267/0x490
[ 245.047993] __mutex_lock_slowpath+0x13/0x20
[ 245.053432] mutex_lock+0x38/0x50
[ 245.057714] rtnl_lock+0x15/0x20
[ 245.061901] linkwatch_event+0xe/0x30
[ 245.066571] process_one_work+0x228/0x3d0
[ 245.071607] worker_thread+0x53/0x420
[ 245.076260] ? process_one_work+0x3d0/0x3d0
[ 245.081493] kthread+0x127/0x150
[ 245.085592] ? set_kthread_struct+0x50/0x50
[ 245.090769] ret_from_fork+0x1f/0x30
[ 245.095266] </TASK>
```

and

```
[ 245.530629] INFO: task ifenslave:849 blocked for more than 121 seconds.
[ 245.540433] Not tainted 5.15.0-69-generic #76-Ubuntu
[ 245.549050] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 245.558960] task:ifenslave state:D stack: 0 pid: 849 ppid: 847 flags:0x00004002
[ 245.570930] Call Trace:
[ 245.576175] <TASK>
[ 245.581018] __schedule+0x24e/0x590
[ 245.587445] schedule+0x69/0x110
[ 245.593631] schedule_timeout+0x103/0x140
[ 245.600573] __wait_for_common+0xab/0x150
[ 245.607526] ? usleep_range_state+0x90/0x90
[ 245.614743] wait_for_completion+0x24/0x30
[ 245.621903] flush_workqueue+0x133/0x3e0
[ 245.628887] ib_cache_cleanup_one+0x21/0xf0 [ib_core]
[ 245.637083] __ib_unregister_device+0x79/0xc0 [ib_core]
[ 245.645398] ib_unregister_device+0x27/0x40 [ib_core]
[ 245.653541] irdma_ib_unregister_device+0x4b/0x70 [irdma]
[ 245.662105] irdma_remove+0x1f/0x70 [irdma]
[ 245.669446] auxiliary_bus_remove+0x1d/0x40
[ 245.676688] __device_release_driver+0x1a8/0x2a0
[ 245.684241] device_release_driver+0x29/0x40
[ 245.691416] bus_remove_device+0xde/0x150
[ 245.698396] device_del+0x19c/0x400
[ 245.712178] ice_lag_link.isra.0+0xdd/0xf0 [ice]
[ 245.720683] ice_lag_changeupper_event+0xe1/0x130 [ice]
[ 245.729739] ice_lag_event_handler+0x5b/0x150 [ice]
[ 245.738525] raw_notifier_call_chain+0x46/0x60
[ 245.746006] call_netdevice_notifiers_info+0x52/0xa0
[ 245.754123] __netdev_upper_dev_link+0x1b7/0x310
[ 245.761658] netdev_master_upper_dev_link+0x3e/0x60
[ 245.769627] bond_enslave+0xc3a/0x1720 [bonding]
[ 245.777398] ? sscanf+0x4e/0x70
[ 245.783375] bond_option_slaves_set+0xca/0x170 [bonding]
[ 245.791738] __bond_opt_set+0xbd/0x1a0 [bonding]
[ 245.799505] __bond_opt_set_notify+0x30/0xb0 [bonding]
[ 245.807860] bond_opt_tryset_rtnl+0x56/0xa0 [bonding]
[ 245.816062] bonding_sysfs_store_option+0x52/0xa0 [bonding]
[ 245.824750] dev_attr_store+0x14/0x30
[ 245.831443] sysfs_kf_write+0x3b/0x50
[ 245.837979] kernfs_fop_write_iter+0x138/0x1c0
[ 245.845469] new_sync_write+0x111/0x1a0
[ 245.852210] vfs_write+0x1d5/0x270
[ 245.858429] ksys_write+0x67/0xf0
[ 245.864624] __x64_sys_write+0x19/0x20
[ 245.871288] do_syscall_64+0x59/0xc0
[ 245.877715] ? handle_mm_fault+0xd8/0x2c0
[ 245.884566] ? do_user_addr_fault+0x1e7/0x670
[ 245.891990] ? filp_close+0x60/0x70
[ 245.898452] ? exit_to_user_mode_prepare+0x37/0xb0
[ 245.906272] ? irqentry_exit_to_user_mode+0x9/0x20
[ 245.914042] ? irqentry_exit+0x1d/0x30
[ 245.920703] ? exc_page_fault+0x89/0x170
[ 245.927555] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 245.935763] RIP: 0033:0x7f1e86855a37
[ 245.942153] RSP: 002b:00007fff8da477a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 245.953034] RAX: ffffffffffffffda RBX: 000000000000000a RCX: 00007f1e86855a37
[ 245.963554] RDX: 000000000000000a RSI: 0000556eff580510 RDI: 0000000000000001
[ 245.972468] RBP: 0000556eff580510 R08: 0000556eff582c5a R09: 0000000000000000
[ 245.983048] R10: 0000556eff582c59 R11: 0000000000000246 R12: 0000000000000001
[ 245.993402] R13: 000000000000000a R14: 0000000000000000 R15: 0000000000000000
[ 246.001700] </TASK>
```
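When the hung-task reports are buried in a long console log, it can help to pull out just the blocked tasks and their call chains. The following is a minimal sketch, not part of the original report: `parse_hung_tasks` and the two regular expressions are hypothetical helpers written for this illustration, and they assume the log lines look like the excerpts above. Frames marked with `?` (unreliable stack entries) are skipped.

```python
import re

# Matches the hung-task header, e.g.
# "[ 244.980100] INFO: task kworker/6:1:182 blocked for more than 120 seconds."
HEADER = re.compile(
    r"INFO: task (?P<name>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)
# Matches one stack frame, e.g. "[ 245.061901] rtnl_lock+0x15/0x20".
# A leading "? " marks a frame the unwinder considers unreliable.
FRAME = re.compile(r"\]\s+(?P<guess>\? )?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_hung_tasks(log_text):
    """Return one dict per blocked task: name, pid, secs, frames."""
    tasks = []
    current = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            current = {
                "name": header.group("name"),
                "pid": int(header.group("pid")),
                "secs": int(header.group("secs")),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME.search(line)
        # Keep only reliable frames that belong to a task we have seen.
        if frame and current is not None and not frame.group("guess"):
            current["frames"].append(frame.group("func"))
    return tasks


# A shortened excerpt of the first trace, used as sample input.
sample = """\
[ 244.980100] INFO: task kworker/6:1:182 blocked for more than 120 seconds.
[ 245.053432] mutex_lock+0x38/0x50
[ 245.057714] rtnl_lock+0x15/0x20
[ 245.061901] linkwatch_event+0xe/0x30
[ 245.076260] ? process_one_work+0x3d0/0x3d0
"""

for task in parse_hung_tasks(sample):
    print(task["name"], task["pid"], task["frames"])
```

Run against the full log, this makes it easier to see at a glance that one task is waiting in `rtnl_lock` while another is waiting in `flush_workqueue`.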

This appears consistent with the underlying cause being the bug fixed by mainline commit 248401cb2c4612d83eb0c352ee8103b78b8eb365 (commit 87b9ac7bd301f53b122224fc8eddb1f4045e3f2c in the 5.15.y stable tree).
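The two traces suggest a familiar pattern: one task waits for queued work to finish while holding a lock (ifenslave reaches `flush_workqueue` from inside the bonding sysfs path), while queued work needs that same lock (`rtnl_lock`). The sketch below is a user-space illustration of that general pattern only, assuming this reading of the traces is right. It is not the driver's code: `rtnl`, `work_items`, `flush`, and the two `flush_*` functions are hypothetical stand-ins, and a timeout replaces the kernel's indefinite wait so the demonstration terminates.

```python
import queue
import threading

rtnl = threading.Lock()      # stand-in for the kernel's RTNL mutex
work_items = queue.Queue()   # stand-in for a workqueue


def worker():
    """Run queued work items; each one needs the lock before it can finish."""
    while True:
        done = work_items.get()
        if done is None:     # sentinel: stop the worker
            return
        with rtnl:           # blocks while another thread holds the lock
            pass
        done.set()


def flush(timeout):
    """Queue a work item and wait for it; True if it finished in time."""
    done = threading.Event()
    work_items.put(done)
    return done.wait(timeout)


def flush_while_holding_lock(timeout=0.5):
    """Problem pattern: wait for lock-needing work while holding the lock."""
    with rtnl:
        return flush(timeout)


def flush_after_releasing_lock(timeout=0.5):
    """Alternative ordering: note the request under the lock, flush afterwards."""
    with rtnl:
        pending = True
    return flush(timeout) if pending else True


thread = threading.Thread(target=worker, daemon=True)
thread.start()
stuck = flush_while_holding_lock()
fine = flush_after_releasing_lock()
work_items.put(None)

print("flush under lock finished:", stuck)
print("flush after unlock finished:", fine)
```

In the first call the wait can only end by timing out, because the worker cannot take the lock until the waiter releases it. In the kernel there is no such timeout, which matches the hung-task reports. Deferring the wait until the lock is dropped is one general way to break the cycle; whether the referenced commit takes this approach should be checked against the commit itself.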

The 5.15.0-67 kernel does not exhibit the problem. Given that the 5.15.0-68 kernel apparently included the "RDMA/irdma: Report the correct link speed" patch listed in one of the "Fixes" tags in the above commit, I suspect that patch is the culprit and that importing the above commit should resolve the problem.

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-5.15.0-67-generic 5.15.0-67.74
ProcVersionSignature: Ubuntu 5.15.0-67.74-generic 5.15.85
Uname: Linux 5.15.0-67-generic x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Apr 5 22:47 seq
 crw-rw---- 1 root audio 116, 33 Apr 5 22:47 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.3
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
CasperMD5CheckResult: unknown
Date: Wed Apr 5 22:48:03 2023
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 004: ID 0b1f:03ee Insyde Software Corp. RNDIS/Ethernet Gadget
 Bus 001 Device 003: ID 0557:9241 ATEN International Co., Ltd SMCI HID KM
 Bus 001 Device 002: ID 1d6b:0107 Linux Foundation USB Virtual Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro SYS-510T-MR-EI018
PciMultimedia:

ProcEnviron:
 TERM=vt220
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-67-generic root=UUID=0b21ae48-6315-4193-8c24-fc224a18170f ro console=tty0 console=ttyS1,115200n8 modprobe.blacklist=igb modprobe.blacklist=rndis_host
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-67-generic N/A
 linux-backports-modules-5.15.0-67-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.9
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 06/23/2022
dmi.bios.release: 5.22
dmi.bios.vendor: American Megatrends International, LLC.
dmi.bios.version: 1.2
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X12STH-SYS
dmi.board.vendor: Supermicro
dmi.board.version: 1.01
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 1
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInternational,LLC.:bvr1.2:bd06/23/2022:br5.22:svnSupermicro:pnSYS-510T-MR-EI018:pvr0123456789:rvnSupermicro:rnX12STH-SYS:rvr1.01:cvnSupermicro:ct1:cvr0123456789:skuTobefilledbyO.E.M.:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: SYS-510T-MR-EI018
dmi.product.sku: To be filled by O.E.M.
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Gema Gomez (gema) wrote :

Happy to provide either hardware or help testing a solution if needed!
