Kernel "BUG: soft lockup" with 5.4 kernels on arm64 appleton node (dmesg spammed with "mlx5_core 0005:01:00.0: mlx5_eq_comp_int:159:(pid 1180): Completion event for bogus CQ 0x5a5aa9")

Bug #1953058 reported by Kleber Sacilotto de Souza
This bug affects 2 people
Affects               Status        Importance  Assigned to  Milestone
ubuntu-kernel-tests   New           Undecided   Unassigned
linux (Ubuntu)        Fix Released  Undecided   Unassigned
linux (Ubuntu) Focal  In Progress   Undecided   Ike Panhc

Bug Description

The regression boot test running with bionic:linux-hwe-5.4 5.4.0-92.103~18.04.2 failed because of the following soft lockup:

Dec 2 12:17:12 appleton-kernel kernel: [ 64.281447] watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [swapper/16:0]
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288573] Modules linked in: ipmi_ssif nls_iso8859_1 joydev input_leds ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ib_iser rdma_cm iw_cm ib_cm iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib hid_generic ses usbhid enclosure hid ib_uverbs ib_core marvell hibmc_drm drm_vram_helper ttm drm_kms_helper crct10dif_ce ghash_ce syscopyarea sysfillrect sha2_ce mlx5_core sysimgblt sha256_arm64 ixgbe hisi_sas_v2_hw fb_sys_fops nvme sha1_ce hisi_sas_main tls xfrm_algo drm megaraid_sas nvme_core mdio mlxfw libsas ehci_platform scsi_transport_sas hns_dsaf hns_enet_drv hns_mdio hnae aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288629] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 5.4.0-91-generic #102~18.04.1-Ubuntu
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288630] Hardware name: Hisilicon D05/BC11SPCD, BIOS 1.50 06/01/2018
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288632] pstate: 40400005 (nZcv daif +PAN -UAO)
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288640] pc : __do_softirq+0x98/0x350
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288644] lr : irq_exit+0xc0/0xc8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288645] sp : ffff800011ee3ef0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288646] x29: ffff800011ee3ef0 x28: ffff002fb71a2d00
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288649] x27: 0000000000000000 x26: ffff800011ee4000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288650] x25: ffff800011ee0000 x24: ffff001fba073600
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288652] x23: ffff80001234bdb0 x22: 0000000000000000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288654] x21: 0000000000000282 x20: 0000000000000002
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288656] x19: ffff8000116b3000 x18: ffff800011267510
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288658] x17: 0000000000000000 x16: 0000000000000000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288659] x15: 0000000000000001 x14: ffff002fbb9f21c8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288661] x13: 0000000000000004 x12: 0000000000000002
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288663] x11: 0000000000000000 x10: 0000000000000040
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288665] x9 : ffff800011bbf228 x8 : ffff800011bbf220
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288666] x7 : ffff001fb9002270 x6 : 00000002c07fa07f
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288668] x5 : 00000000ffff00c1 x4 : ffff802faa352000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288670] x3 : ffff8000116b3780 x2 : ffff802faa352000
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288672] x1 : 00000000000000e0 x0 : ffff8000116b3780
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288675] Call trace:
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288677] __do_softirq+0x98/0x350
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288679] irq_exit+0xc0/0xc8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288683] __handle_domain_irq+0x6c/0xc0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288685] gic_handle_irq+0x84/0x2c0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288687] el1_irq+0x104/0x1c0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288690] arch_cpu_idle+0x34/0x1c0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288694] default_idle_call+0x24/0x60
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288696] do_idle+0x1d8/0x2b8
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288699] cpu_startup_entry+0x28/0xb0
Dec 2 12:17:12 appleton-kernel kernel: [ 64.288702] secondary_start_kernel+0x198/0x288
Dec 2 12:17:46 appleton-kernel kernel: [ 98.829315] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Dec 2 12:17:46 appleton-kernel kernel: [ 98.835229] rcu: 16-....: (17 GPs behind) idle=3b2/0/0x1 softirq=877/877 fqs=7317
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842875] (detected by 35, t=15005 jiffies, g=3737, q=2560)
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842877] Task dump for CPU 16:
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842880] swapper/16 R running task 0 0 1 0x0000002a
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842885] Call trace:
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842897] __switch_to+0x108/0x248
Dec 2 12:17:46 appleton-kernel kernel: [ 98.842902] 0x0
Dec 2 12:20:46 appleton-kernel kernel: [ 278.848845] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Dec 2 12:20:46 appleton-kernel kernel: [ 278.854762] rcu: 16-....: (17 GPs behind) idle=3b2/0/0x1 softirq=877/877 fqs=29680
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862495] (detected by 48, t=60007 jiffies, g=3737, q=14867)
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862498] Task dump for CPU 16:
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862500] swapper/16 R running task 0 0 1 0x0000002a
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862506] Call trace:
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862520] __switch_to+0x108/0x248
Dec 2 12:20:46 appleton-kernel kernel: [ 278.862526] 0x0
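The soft lockup and the follow-on RCU stall reports above are the two signatures to look for when triaging runs of this test. As a minimal sketch (the sample log below is built from abbreviated lines in this report; on the node itself the input would be /var/log/kern.log or `dmesg` output), the events can be tallied like this:

```shell
# Tally the two failure signatures from a saved kernel log.
# sample-kern.log is a stand-in built from abbreviated lines in this
# report; a real run would scan /var/log/kern.log or dmesg output.
cat > sample-kern.log <<'EOF'
watchdog: BUG: soft lockup - CPU#16 stuck for 22s! [swapper/16:0]
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [swapper/23:0]
EOF
lockups=$(grep -c 'BUG: soft lockup' sample-kern.log)
stalls=$(grep -c 'rcu_sched detected stalls' sample-kern.log)
echo "soft lockups: $lockups, rcu stalls: $stalls"
```

Counting both separately matters here: a soft lockup with no RCU stall points at a short-lived stuck CPU, while the pairing seen in this report means CPU 16 never recovered.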

This is not as easily reproducible with the focal:linux kernel. However, on one occasion I was able to capture this stack trace, where the CPU was stuck in mlx5_core code:

[ 356.542512] watchdog: BUG: soft lockup - CPU#23 stuck for 23s! [swapper/23:0]
[ 356.549633] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif input_leds joydev ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ses enclosure hid_generic usbhid hid ib_uverbs marvell ib_core hibmc_drm drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea sysfillrect ghash_ce sha2_ce sysimgblt mlx5_core fb_sys_fops hisi_sas_v2_hw sha256_arm64 ixgbe sha1_ce nvme hisi_sas_main tls xfrm_algo drm mdio libsas megaraid_sas nvme_core mlxfw ehci_platform hns_dsaf scsi_transport_sas hns_enet_drv hns_mdio hnae aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[ 356.620860] CPU: 23 PID: 0 Comm: swapper/23 Not tainted 5.4.0-91-generic #102-Ubuntu
[ 356.628588] Hardware name: Hisilicon D05/BC11SPCD, BIOS 1.50 06/01/2018
[ 356.635188] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 356.640007] pc : mlx5e_poll_rx_cq+0x30/0x8a0 [mlx5_core]
[ 356.645338] lr : mlx5e_napi_poll+0x100/0x660 [mlx5_core]
[ 356.650635] sp : ffff800011efbcc0
[ 356.653935] x29: ffff800011efbcc0 x28: 0000000000000001
[ 356.659233] x27: ffff009fa0ed8000 x26: ffff009fa0ed8000
[ 356.664531] x25: ffff009fa0eda480 x24: ffff009fa0ed7ec8
[ 356.669829] x23: 0000000000000040 x22: 0000000000000000
[ 356.675126] x21: ffff002fbbaeddc0 x20: 0000000000000040
[ 356.680424] x19: ffff009fa0ed8100 x18: 0000000000000010
[ 356.685721] x17: 0000000000000000 x16: 0000000000000000
[ 356.691019] x15: ffff002fb71adf28 x14: 6220726f6620746e
[ 356.696316] x13: ffffffffffffc138 x12: ffffffffffffe238
[ 356.701614] x11: 0000000000003138 x10: 3935313a746e695f
[ 356.706911] x9 : ffff800011db5000 x8 : 00000000000006d9
[ 356.712209] x7 : 0000000000000017 x6 : ffff002f8d1c2200
[ 356.717506] x5 : 000000000000000a x4 : 0000000000000006
[ 356.722804] x3 : 0000000000000000 x2 : 0000000000000040
[ 356.728101] x1 : 0000000000000040 x0 : ffff8000094f70a8
[ 356.733400] Call trace:
[ 356.735867] mlx5e_poll_rx_cq+0x30/0x8a0 [mlx5_core]
[ 356.740850] mlx5e_napi_poll+0x100/0x660 [mlx5_core]
[ 356.745805] net_rx_action+0x180/0x488
[ 356.749543] __do_softirq+0x130/0x34c
[ 356.753194] irq_exit+0xa4/0xc8
[ 356.756323] __handle_domain_irq+0x74/0xc8
[ 356.760406] gic_handle_irq+0x10c/0x2cc
[ 356.764228] el1_irq+0x104/0x1c0
[ 356.767443] arch_cpu_idle+0x3c/0x1c8
[ 356.771093] default_idle_call+0x24/0x60
[ 356.775003] do_idle+0x214/0x298
[ 356.778218] cpu_startup_entry+0x30/0xb8
[ 356.782128] secondary_start_kernel+0x16c/0x1e0
[ 390.270862] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 390.276771] rcu: 23-....: (364 GPs behind) idle=966/0/0x3 softirq=1461/1461 fqs=7130
[ 390.284676] (detected by 10, t=15005 jiffies, g=15605, q=284)
[ 390.290499] Call trace:
[ 390.292935] __switch_to+0x134/0x190
[ 390.296498] 0xffff002fb71ada00
[ 392.554895] rcu: INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 23-... } 15478 jiffies s: 645 root: 0x2/.
[ 392.565495] rcu: blocking rcu_node structures: l=1:16-31:0x80/.
[ 392.571412] Call trace:
[ 392.573848] __switch_to+0x134/0x190
[ 392.577413] 0xffff002fb71ada00
[ 424.543333] watchdog: BUG: soft lockup - CPU#23 stuck for 22s! [swapper/23:0]
[ 424.550454] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_ssif input_leds joydev ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear mlx5_ib ses enclosure hid_generic usbhid hid ib_uverbs marvell ib_core hibmc_drm drm_vram_helper ttm drm_kms_helper crct10dif_ce syscopyarea sysfillrect ghash_ce sha2_ce sysimgblt mlx5_core fb_sys_fops hisi_sas_v2_hw sha256_arm64 ixgbe sha1_ce nvme hisi_sas_main tls xfrm_algo drm mdio libsas megaraid_sas nvme_core mlxfw ehci_platform hns_dsaf scsi_transport_sas hns_enet_drv hns_mdio hnae aes_neon_bs aes_neon_blk aes_ce_blk crypto_simd cryptd aes_ce_cipher
[ 424.621659] CPU: 23 PID: 0 Comm: swapper/23 Tainted: G L 5.4.0-91-generic #102-Ubuntu
[ 424.630775] Hardware name: Hisilicon D05/BC11SPCD, BIOS 1.50 06/01/2018
[ 424.637374] pstate: 20400005 (nzCv daif +PAN -UAO)
[ 424.642152] pc : tasklet_action_common.isra.0+0x5c/0x1a0
[ 424.647450] lr : tasklet_action+0x30/0x38
[ 424.651444] sp : ffff800011efbe50
[ 424.654745] x29: ffff800011efbe50 x28: ffff002fbbad8fc0
[ 424.660043] x27: 0000000000000003 x26: ffff002fb71ada00
[ 424.665340] x25: 0000000000000100 x24: 0000000000000006
[ 424.670638] x23: 0000000000000060 x22: 00000000000000e0
[ 424.675935] x21: 0000000000000004 x20: ffff002fa94150a0
[ 424.681233] x19: ffff800011b750f0 x18: 0000000000000010
[ 424.686530] x17: 0000000000000000 x16: 0000000000000000
[ 424.691828] x15: ffff002fb71adf28 x14: 6220726f6620746e
[ 424.697125] x13: ffffffffffffc138 x12: ffffffffffffe238
[ 424.702423] x11: 0000000000003138 x10: 3935313a746e695f
[ 424.707720] x9 : 0000000000003c80 x8 : 00000000000006d9
[ 424.713018] x7 : 0000000000000000 x6 : 00000006e8cc2da3
[ 424.718315] x5 : 00ffffffffffffff x4 : ffff002fad862680
[ 424.723613] x3 : 00000000000004af x2 : ffff802faa45b000
[ 424.728910] x1 : 0000000000000000 x0 : ffff002fb71adf28
[ 424.734208] Call trace:
[ 424.736642] tasklet_action_common.isra.0+0x5c/0x1a0
[ 424.741593] tasklet_action+0x30/0x38
[ 424.745241] __do_softirq+0x130/0x34c
[ 424.748890] irq_exit+0xa4/0xc8
[ 424.752018] __handle_domain_irq+0x74/0xc8
[ 424.756100] gic_handle_irq+0x10c/0x2cc
[ 424.759922] el1_irq+0x104/0x1c0
[ 424.763137] arch_cpu_idle+0x3c/0x1c8
[ 424.766785] default_idle_call+0x24/0x60
[ 424.770694] do_idle+0x214/0x298
[ 424.773909] cpu_startup_entry+0x30/0xb8
[ 424.777818] secondary_start_kernel+0x16c/0x1e0
[ 484.716312] INFO: task shutdown:1 blocked for more than 120 seconds.
[ 484.722658] Tainted: G L 5.4.0-91-generic #102-Ubuntu
[ 484.729521] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.737341] Call trace:
[ 484.739776] __switch_to+0x134/0x190
[ 484.743341] __schedule+0x31c/0x7e0
[ 484.746818] schedule+0x40/0xb8
[ 484.749950] synchronize_rcu_expedited+0x1e0/0x3a8
[ 484.754730] synchronize_net+0x24/0x38
[ 484.758469] netif_napi_del+0xac/0xb0
[ 484.762155] mlx5e_close_channel+0x38/0x58 [mlx5_core]
[ 484.767315] mlx5e_close_channels+0x38/0x60 [mlx5_core]
[ 484.772562] mlx5e_close_locked+0x70/0x90 [mlx5_core]
[ 484.777634] mlx5e_close+0x58/0x80 [mlx5_core]
[ 484.782101] mlx5e_nic_disable+0x9c/0xc0 [mlx5_core]
[ 484.787087] mlx5e_detach_netdev+0x54/0x90 [mlx5_core]
[ 484.792246] mlx5e_detach+0x60/0x78 [mlx5_core]
[ 484.796796] mlx5_detach_device+0xb4/0x138 [mlx5_core]
[ 484.801954] mlx5_unload_one+0xc4/0x170 [mlx5_core]
[ 484.806852] shutdown+0x1c4/0x1e0 [mlx5_core]
[ 484.811202] pci_device_shutdown+0x48/0x88
[ 484.815288] device_shutdown+0x14c/0x298
[ 484.819201] kernel_power_off+0x40/0x78
[ 484.823026] __do_sys_reboot+0x158/0x228
[ 484.826937] __arm64_sys_reboot+0x30/0x40
[ 484.830936] el0_svc_common.constprop.0+0xf4/0x200
[ 484.835715] el0_svc_handler+0x38/0xa8
[ 484.839454] el0_svc+0x10/0x2c8

It is not clear whether the mlx5 messages are a symptom or the root cause.
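One way to probe the symptom-versus-cause question is to look at the CQ numbers in the spammed messages: if the "bogus CQ" value is always the same (0x5a5aa9 in the summary), the driver is likely polling one stale completion queue rather than hitting random corruption. A hedged sketch of extracting the distinct CQ IDs (the input lines are copied from this bug's summary; a real run would pipe in `dmesg` instead):

```shell
# Extract the CQ number from mlx5 "bogus CQ" messages and list the
# distinct values. The sample lines are copied from this bug's summary;
# on the affected node the input would come from dmesg.
printf '%s\n' \
  'mlx5_core 0005:01:00.0: mlx5_eq_comp_int:159:(pid 1180): Completion event for bogus CQ 0x5a5aa9' \
  'mlx5_core 0005:01:00.0: mlx5_eq_comp_int:159:(pid 1180): Completion event for bogus CQ 0x5a5aa9' \
  > bogus-cq.log
grep -o 'bogus CQ 0x[0-9a-f]*' bogus-cq.log | sort -u
```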

This is not a recent regression, as I was able to reproduce it with HWE kernel 5.4.0-89-generic. I have not tried older kernels, but the issue may have been present for a long time.

tags: added: 5.4 arm64 focal hwe-5.4 sru-20211129
tags: added: appleton-kernel
Changed in linux (Ubuntu Focal):
status: New → Confirmed
Changed in linux (Ubuntu):
status: New → Fix Released
Po-Hsu Lin (cypressyew)
summary: - Kernel "BUG: soft lockup" with 5.4 kernels on appleton node (arm64)
+ Kernel "BUG: soft lockup" with 5.4 kernels on arm64 node appleton node
+ (dmesg spammed with "mlx5_core 0005:01:00.0: mlx5_eq_comp_int:159:(pid
+ 1180): Completion event for bogus CQ 0x5a5aa9")
tags: added: sru-20220808
Revision history for this message
Ike Panhc (ikepanhc) wrote :

@cypressyew,

I used another machine with an mlx5 NIC but could not reproduce the issue. I may need to borrow appleton for testing.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

I found two systems with the same Mellanox NIC and put both into an overnight reboot test.

0005:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
0005:01:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
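The bracketed [vendor:device] pair, [15b3:1015], is what identifies candidate reproducer machines regardless of how the device is named. A small sketch of matching on it (the lspci line below is copied from above; on a live machine the input would be `lspci -nn` output):

```shell
# Match machines by PCI vendor:device ID rather than device name.
# The sample line is copied from this comment; a live check would
# run `lspci -nn` instead of reading a saved file.
cat > lspci.out <<'EOF'
0005:01:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
EOF
grep -o '\[15b3:1015\]' lspci.out
```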

Revision history for this message
Ike Panhc (ikepanhc) wrote :

appleton with 5.4.0-135.152~18.04.2-generic passed 1000 reboots without any soft lockup. I will try 5.4.0-92.103~18.04.2 next.

d05-3 with 5.4.0-135.152~18.04.2-generic passed 669 reboots without any error.

Changed in linux (Ubuntu Focal):
assignee: nobody → Ike Panhc (ikepanhc)
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Checking the logs again, it looks like the failure happens on the reboot out of deployment. I will run the deploy/release cycle again and see how it goes.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

The MAAS deploy/release loop with focal[1] on d05-3 has deployed 82 times without failure.
The MAAS deploy/release loop with bionic-hwe on appleton ran 100 times, and 10 of those runs failed.

It looks like this issue only happens on appleton.

--
[1] For some reason I cannot deploy bionic-hwe on d05-3. Working on it.
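The counts above can be read side by side as failure rates; the percentages below are computed from the numbers in this comment only:

```shell
# Summarize the deploy/release results above as failure rates.
# Counts are taken from this comment; the comparison is what matters.
awk 'BEGIN {
    printf "d05-3 (focal):         0/82   = %.1f%%\n", 0 * 100 / 82
    printf "appleton (bionic-hwe): 10/100 = %.1f%%\n", 10 * 100 / 100
}'
```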

Revision history for this message
Ike Panhc (ikepanhc) wrote :

For next steps,

1) Find out why I cannot use maas-cli to deploy bionic-hwe on d05-3
2) Collect failure logs on appleton
3) Identify the hardware differences between appleton and d05-3

Revision history for this message
Ike Panhc (ikepanhc) wrote :

I tried deploying and then waiting 100 minutes to see whether the soft lockup shows up.

Deploying focal, I can reproduce the issue 5 times out of 8 deployments. Deploying jammy, all 20 deployments passed and everything looks good. This looks more and more like a focal kernel issue to me.

Revision history for this message
Ike Panhc (ikepanhc) wrote :

I ran the same test as in comment #7 with bionic; so far all 13 deployments look good, with no soft lockup. This appears to be a focal kernel issue, so I will try rebooting into different focal kernels.

Changed in linux (Ubuntu Focal):
status: Confirmed → In Progress
Revision history for this message
Ike Panhc (ikepanhc) wrote :

Testing from 5.4.0-26.30 onward, it looks like this issue starts with 5.4.0-31.35. I will do more testing to confirm this.
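Narrowing the window between the last-good and first-bad releases means booting the interim focal kernels in order; `sort -V` orders Ubuntu kernel version strings correctly. A sketch (the interim versions listed here are illustrative, not a verified list; the real set comes from the focal archive):

```shell
# Order candidate kernel versions for the bisection between the
# last-good (5.4.0-26.30) and first-bad (5.4.0-31.35) releases.
# The interim version strings are illustrative examples only.
printf '%s\n' 5.4.0-31.35 5.4.0-26.30 5.4.0-29.33 5.4.0-28.32 | sort -V
```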
