Kernel "BUG: soft lockup" with 5.4 kernels on arm64 node appleton node (dmesg spammed with "mlx5_core 0005:01:00.0: mlx5_eq_comp_int:159:(pid 1180): Completion event for bogus CQ 0x5a5aa9")
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
ubuntu-kernel-tests |
New
|
Undecided
|
Unassigned | ||
linux (Ubuntu) |
Fix Released
|
Undecided
|
Unassigned | ||
Focal |
In Progress
|
Undecided
|
Ike Panhc |
Bug Description
The regression boot test running with bionic:
Dec 2 12:17:12 appleton-kernel kernel: [ 64.281447] watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [swapper/16:0]
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288573] Modules linked in: ipmi_ssif nls_iso8859_1 joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288629] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.4.0-91-generic #102~18.04.1-Ubuntu
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288630] Hardware name: Hisilicon D05/BC11SPCD, BIOS 1.50 06/01/2018
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288632] pstate: 40400005 (nZcv daif +PAN -UAO)
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288640] pc : __do_softirq+
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288644] lr : irq_exit+0xc0/0xc8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288645] sp : ffff800011ee3ef0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288646] x29: ffff800011ee3ef0 x28: ffff002fb71a2d00
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288649] x27: 0000000000000000 x26: ffff800011ee4000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288650] x25: ffff800011ee0000 x24: ffff001fba073600
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288652] x23: ffff80001234bdb0 x22: 0000000000000000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288654] x21: 0000000000000282 x20: 0000000000000002
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288656] x19: ffff8000116b3000 x18: ffff800011267510
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288658] x17: 0000000000000000 x16: 0000000000000000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288659] x15: 0000000000000001 x14: ffff002fbb9f21c8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288661] x13: 0000000000000004 x12: 0000000000000002
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288663] x11: 0000000000000000 x10: 0000000000000040
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288665] x9 : ffff800011bbf228 x8 : ffff800011bbf220
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288666] x7 : ffff001fb9002270 x6 : 00000002c07fa07f
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288668] x5 : 00000000ffff00c1 x4 : ffff802faa352000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288670] x3 : ffff8000116b3780 x2 : ffff802faa352000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288672] x1 : 00000000000000e0 x0 : ffff8000116b3780
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288675] Call trace:
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288677] __do_softirq+
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288679] irq_exit+0xc0/0xc8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288683] __handle_
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288685] gic_handle_
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288687] el1_irq+0x104/0x1c0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288690] arch_cpu_
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288694] default_
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288696] do_idle+0x1d8/0x2b8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288699] cpu_startup_
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288702] secondary_
Dec 2 12:17:46 appleton-kernel kernel: [ 98.829315] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Dec 2 12:17:46 appleton-kernel kernel: [ 98.835229] rcu: 16-....: (17 GPs behind) idle=3b2/0/0x1 softirq=877/877 fqs=7317
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842875] (detected by 35, t=15005 jiffies, g=3737, q=2560)
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842877] Task dump for CPU 16:
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842880] swapper/16 R running task 0 0 1 0x0000002a
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842885] Call trace:
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842897] __switch_
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842902] 0x0
Dec 2 12:20:46 appleton-kernel kernel: [ 278.848845] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Dec 2 12:20:46 appleton-kernel kernel: [ 278.854762] rcu: 16-....: (17 GPs behind) idle=3b2/0/0x1 softirq=877/877 fqs=29680
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862495] (detected by 48, t=60007 jiffies, g=3737, q=14867)
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862498] Task dump for CPU 16:
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862500] swapper/16 R running task 0 0 1 0x0000002a
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862506] Call trace:
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862520] __switch_
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862526] 0x0
This is not so easily reproducible with the focal:linux kernel though, however at one time I was able to get this stack trace where it hung on some mlx5_core code:
[ 356.542512] watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [swapper/23:0]
[ 356.549633] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif input_leds joydev ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ses enclosure hid_generic usbhid hid ib_uverbs marvell ib_core hibmc_drm drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea sysfillrect ghash_ce sha2_ce sysimgblt mlx5_core fb_sys_fops hisi_sas_v2_hw sha256_arm64 ixgbe sha1_ce nvme hisi_sas_main tls xfrm_algo drm mdio libsas megaraid_sas nvme_core mlxfw ehci_platform hns_dsaf scsi_transport_sas hns_enet_drv hns_mdio hnae aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[ 356.620860] CPU: 23 PID: 0 Comm: swapper/23 Not tainted 5.4.0-91-generic #102-Ubuntu
[ 356.628588] Hardware name: Hisilicon D05/BC11SPCD, BIOS 1.50 06/01/2018
[ 356.635188] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 356.640007] pc : mlx5e_poll_
[ 356.645338] lr : mlx5e_napi_
[ 356.650635] sp : ffff800011efbcc0
[ 356.653935] x29: ffff800011efbcc0 x28: 0000000000000001
[ 356.659233] x27: ffff009fa0ed8000 x26: ffff009fa0ed8000
[ 356.664531] x25: ffff009fa0eda480 x24: ffff009fa0ed7ec8
[ 356.669829] x23: 0000000000000040 x22: 0000000000000000
[ 356.675126] x21: ffff002fbbaeddc0 x20: 0000000000000040
[ 356.680424] x19: ffff009fa0ed8100 x18: 0000000000000010
[ 356.685721] x17: 0000000000000000 x16: 0000000000000000
[ 356.691019] x15: ffff002fb71adf28 x14: 6220726f6620746e
[ 356.696316] x13: ffffffffffffc138 x12: ffffffffffffe238
[ 356.701614] x11: 0000000000003138 x10: 3935313a746e695f
[ 356.706911] x9 : ffff800011db5000 x8 : 00000000000006d9
[ 356.712209] x7 : 0000000000000017 x6 : ffff002f8d1c2200
[ 356.717506] x5 : 000000000000000a x4 : 0000000000000006
[ 356.722804] x3 : 0000000000000000 x2 : 0000000000000040
[ 356.728101] x1 : 0000000000000040 x0 : ffff8000094f70a8
[ 356.733400] Call trace:
[ 356.735867] mlx5e_poll_
[ 356.740850] mlx5e_napi_
[ 356.745805] net_rx_
[ 356.749543] __do_softirq+
[ 356.753194] irq_exit+0xa4/0xc8
[ 356.756323] __handle_
[ 356.760406] gic_handle_
[ 356.764228] el1_irq+0x104/0x1c0
[ 356.767443] arch_cpu_
[ 356.771093] default_
[ 356.775003] do_idle+0x214/0x298
[ 356.778218] cpu_startup_
[ 356.782128] secondary_
[ 390.270862] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 390.276771] rcu: 23-....: (364 GPs behind) idle=966/0/0x3 softirq=1461/1461 fqs=7130
[ 390.284676] (detected by 10, t=15005 jiffies, g=15605, q=284)
[ 390.290499] Call trace:
[ 390.292935] __switch_
[ 390.296498] 0xffff002fb71ada00
[ 392.554895] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 23-... } 15478 jiffies s: 645 root: 0x2/.
[ 392.565495] rcu: blocking rcu_node structures: l=1:16-31:0x80/.
[ 392.571412] Call trace:
[ 392.573848] __switch_
[ 392.577413] 0xffff002fb71ada00
[ 424.543333] watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [swapper/23:0]
[ 424.550454] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif input_leds joydev ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ses enclosure hid_generic usbhid hid ib_uverbs marvell ib_core hibmc_drm drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea sysfillrect ghash_ce sha2_ce sysimgblt mlx5_core fb_sys_fops hisi_sas_v2_hw sha256_arm64 ixgbe sha1_ce nvme hisi_sas_main tls xfrm_algo drm mdio libsas megaraid_sas nvme_core mlxfw ehci_platform hns_dsaf scsi_transport_sas hns_enet_drv hns_mdio hnae aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[ 424.621659] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G L 5.4.0-91-generic #102-Ubuntu
[ 424.630775] Hardware name: Hisilicon D05/BC11SPCD, BIOS 1.50 06/01/2018
[ 424.637374] pstate: 20400005 (nzCv daif +PAN -UAO)
[ 424.642152] pc : tasklet_
[ 424.647450] lr : tasklet_
[ 424.651444] sp : ffff800011efbe50
[ 424.654745] x29: ffff800011efbe50 x28: ffff002fbbad8fc0
[ 424.660043] x27: 0000000000000003 x26: ffff002fb71ada00
[ 424.665340] x25: 0000000000000100 x24: 0000000000000006
[ 424.670638] x23: 0000000000000060 x22: 00000000000000e0
[ 424.675935] x21: 0000000000000004 x20: ffff002fa94150a0
[ 424.681233] x19: ffff800011b750f0 x18: 0000000000000010
[ 424.686530] x17: 0000000000000000 x16: 0000000000000000
[ 424.691828] x15: ffff002fb71adf28 x14: 6220726f6620746e
[ 424.697125] x13: ffffffffffffc138 x12: ffffffffffffe238
[ 424.702423] x11: 0000000000003138 x10: 3935313a746e695f
[ 424.707720] x9 : 0000000000003c80 x8 : 00000000000006d9
[ 424.713018] x7 : 0000000000000000 x6 : 00000006e8cc2da3
[ 424.718315] x5 : 00ffffffffffffff x4 : ffff002fad862680
[ 424.723613] x3 : 00000000000004af x2 : ffff802faa45b000
[ 424.728910] x1 : 0000000000000000 x0 : ffff002fb71adf28
[ 424.734208] Call trace:
[ 424.736642] tasklet_
[ 424.741593] tasklet_
[ 424.745241] __do_softirq+
[ 424.748890] irq_exit+0xa4/0xc8
[ 424.752018] __handle_
[ 424.756100] gic_handle_
[ 424.759922] el1_irq+0x104/0x1c0
[ 424.763137] arch_cpu_
[ 424.766785] default_
[ 424.770694] do_idle+0x214/0x298
[ 424.773909] cpu_startup_
[ 424.777818] secondary_
[ 484.716312] INFO: task shutdown:1 blocked for more than 120 seconds.
[ 484.722658] Tainted: G L 5.4.0-91-generic #102-Ubuntu
[ 484.729521] "echo 0 > /proc/sys/
[ 484.737341] Call trace:
[ 484.739776] __switch_
[ 484.743341] __schedule+
[ 484.746818] schedule+0x40/0xb8
[ 484.749950] synchronize_
[ 484.754730] synchronize_
[ 484.758469] netif_napi_
[ 484.762155] mlx5e_close_
[ 484.767315] mlx5e_close_
[ 484.772562] mlx5e_close_
[ 484.777634] mlx5e_close+
[ 484.782101] mlx5e_nic_
[ 484.787087] mlx5e_detach_
[ 484.792246] mlx5e_detach+
[ 484.796796] mlx5_detach_
[ 484.801954] mlx5_unload_
[ 484.806852] shutdown+
[ 484.811202] pci_device_
[ 484.815288] device_
[ 484.819201] kernel_
[ 484.823026] __do_sys_
[ 484.826937] __arm64_
[ 484.830936] el0_svc_
[ 484.835715] el0_svc_
[ 484.839454] el0_svc+0x10/0x2c8
Not sure if the issue with mlx5 is the symptom or the root cause.
This is not a regression as I was able to reproduce it with hwe version 5.4.0-89-generic. I have not tried older kernels but it's possible that this issue is present for a long time.
tags: | added: 5.4 arm64 focal hwe-5.4 sru-20211129 |
tags: | added: appleton-kernel |
Changed in linux (Ubuntu Focal): | |
status: | New → Confirmed |
Changed in linux (Ubuntu): | |
status: | New → Fix Released |
summary: |
- Kernel "BUG: soft lockup" with 5.4 kernels on appleton node (arm64) + Kernel "BUG: soft lockup" with 5.4 kernels on arm64 node appleton node + (dmesg spammed with "mlx5_core 0005:01:00.0: mlx5_eq_comp_int:159:(pid + 1180): Completion event for bogus CQ 0x5a5aa9") |
tags: | added: sru-20220808 |
@cypressyew,
I use another machine with mlx5 NIC but can not reproduce. I might need to borrow appleton for testing.