Bug #2058557 “Kernel panic during checkbox stress_ng_test on Gra...” : Noble (24.04) : Bugs : linux package : Ubuntu

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-20:

#1

dmesg and ubuntu-bug outputs Edit (125.3 KiB, application/x-tar)

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-20:

#2

Download full text (8.8 KiB)

This is also reproducible on the latest mainline version (https://kernel.ubuntu.com/mainline/v6.8/arm64/, retrieved 20 Mar 2024 @ 5 PM):

20 Mar 22:54: Running stress-ng aiol stressor for 240 seconds...
[ 354.451450] Unable to handle kernel paging request at virtual address 17be9b4aa3e187be
[ 354.459580] Mem abort info:
[ 354.462439] ESR = 0x0000000096000021
[ 354.466274] EC = 0x25: DABT (current EL), IL = 32 bits
[ 354.471703] SET = 0, FnV = 0
[ 354.474819] EA = 0, S1PTW = 0
[ 354.478024] FSC = 0x21: alignment fault
[ 354.482118] Data abort info:
[ 354.485056] ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
[ 354.490662] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 354.495823] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 354.501251] [17be9b4aa3e187be] address between user and kernel address ranges
[ 354.508548] Internal error: Oops: 0000000096000021 [#1] SMP
[ 354.514245] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1 input_leds dax_hmem cxl_acpi acpi_ipmi onboard_usb_hub nvidia_cspmu ipmi_ssif cxl_co
re ipmi_devintf arm_cspmu_module arm_smmuv3_pmu ipmi_msghandler uio_pdrv_genirq uio spi_nor cppc_cpufreq joydev mtd acpi_power_meter dm_multipath nvme_fabrics
efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
xor_neon raid6_pq libcrc32c raid1 raid0 hid_generic rndis_host usbhid cdc_ether hid usbnet uas usb_storage crct10dif_ce polyval_ce polyval_generic ghash_ce s
m4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce i2c_smbus ixgbe sha2_ce nvme_core ast sha256_arm64 xhci_pci sha1_ce xfrm_algo xhci_pci_r
enesas i2c_algo_bit nvme_auth mdio spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[ 354.594676] CPU: 61 PID: 0 Comm: swapper/61 Kdump: loaded Not tainted 6.8.0-060800-generic-64k #202403131158
[ 354.604728] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c 12/28/2023
[ 354.611844] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 354.618962] pc : _raw_spin_lock_irqsave+0x44/0x100
[ 354.623863] lr : try_to_wake_up+0x68/0x758
[ 354.628053] sp : ffff8000807afaf0
[ 354.631436] x29: ffff8000807afaf0 x28: 0000000000040000 x27: 0000000000000000
[ 354.638731] x26: ffffa06103dc8a98 x25: ffff8000807afd98 x24: 0000000000000002
[ 354.646027] x23: ffff0000f8156840 x22: 17be9b4aa3e187be x21: 0000000000000000
[ 354.653323] x20: 0000000000000003 x19: 00000000000000c0 x18: ffff8000819a0098
[ 354.660619] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe97dca18
[ 354.667914] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[ 354.675208] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa06100ba6810
[ 354.682504] x8 : 0000000000000000 x7 : 0000004000000000 x6 : 0000000000009080
[ 354.689800] x5 : 0000c2fb0dc488b0 x4 : 0000000000000000 x3 : ffff0000894178c0
[ 354.697096] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 17be9b4aa3e187be
[ 354.704391] Call trace:
[ 354.706886] _raw_spin_lock_irqsave+0x44/0x100
[ 354.711426] try_to_wake_up+0x68/0x758
[ 354.715254] wake_up_process+0x24/0x50
[ 354.719082] aio...

This is also reproducible on the latest mainline version (https://kernel.ubuntu.com/mainline/v6.8/arm64/, retrieved 20 Mar 2024 @ 5 PM):

20 Mar 22:54: Running stress-ng aiol stressor for 240 seconds...
[  354.451450] Unable to handle kernel paging request at virtual address 17be9b4aa3e187be
[  354.459580] Mem abort info:
[  354.462439]   ESR = 0x0000000096000021
[  354.466274]   EC = 0x25: DABT (current EL), IL = 32 bits
[  354.471703]   SET = 0, FnV = 0
[  354.474819]   EA = 0, S1PTW = 0
[  354.478024]   FSC = 0x21: alignment fault
[  354.482118] Data abort info:
[  354.485056]   ISV = 0, ISS = 0x00000021, ISS2 = 0x00000000
[  354.490662]   CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[  354.495823]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  354.501251] [17be9b4aa3e187be] address between user and kernel address ranges
[  354.508548] Internal error: Oops: 0000000096000021 [#1] SMP
[  354.514245] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1 input_leds dax_hmem cxl_acpi acpi_ipmi onboard_usb_hub nvidia_cspmu ipmi_ssif cxl_co
re ipmi_devintf arm_cspmu_module arm_smmuv3_pmu ipmi_msghandler uio_pdrv_genirq uio spi_nor cppc_cpufreq joydev mtd acpi_power_meter dm_multipath nvme_fabrics
 efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
 xor_neon raid6_pq libcrc32c raid1 raid0 hid_generic rndis_host usbhid cdc_ether hid usbnet uas usb_storage crct10dif_ce polyval_ce polyval_generic ghash_ce s
m4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce i2c_smbus ixgbe sha2_ce nvme_core ast sha256_arm64 xhci_pci sha1_ce xfrm_algo xhci_pci_r
enesas i2c_algo_bit nvme_auth mdio spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  354.594676] CPU: 61 PID: 0 Comm: swapper/61 Kdump: loaded Not tainted 6.8.0-060800-generic-64k #202403131158
[  354.604728] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c 12/28/2023
[  354.611844] pstate: 034000c9 (nzcv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  354.618962] pc : _raw_spin_lock_irqsave+0x44/0x100
[  354.623863] lr : try_to_wake_up+0x68/0x758
[  354.628053] sp : ffff8000807afaf0
[  354.631436] x29: ffff8000807afaf0 x28: 0000000000040000 x27: 0000000000000000
[  354.638731] x26: ffffa06103dc8a98 x25: ffff8000807afd98 x24: 0000000000000002
[  354.646027] x23: ffff0000f8156840 x22: 17be9b4aa3e187be x21: 0000000000000000
[  354.653323] x20: 0000000000000003 x19: 00000000000000c0 x18: ffff8000819a0098
[  354.660619] x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffffe97dca18
[  354.667914] x14: 0000000000000000 x13: 0000000000000000 x12: 0000000000000000
[  354.675208] x11: 0000000000000000 x10: 0000000000000000 x9 : ffffa06100ba6810
[  354.682504] x8 : 0000000000000000 x7 : 0000004000000000 x6 : 0000000000009080
[  354.689800] x5 : 0000c2fb0dc488b0 x4 : 0000000000000000 x3 : ffff0000894178c0
[  354.697096] x2 : 0000000000000001 x1 : 0000000000000000 x0 : 17be9b4aa3e187be
[  354.704391] Call trace:
[  354.706886]  _raw_spin_lock_irqsave+0x44/0x100
[  354.711426]  try_to_wake_up+0x68/0x758
[  354.715254]  wake_up_process+0x24/0x50
[  354.719082]  aio_complete+0x1c4/0x2b8
[  354.722825]  aio_complete_rw+0x11c/0x2c8
[  354.726831]  iomap_dio_bio_end_io+0x1f0/0x248
[  354.731282]  bio_endio+0x170/0x270
[  354.734758]  __dm_io_complete+0x180/0x200
[  354.738855]  clone_endio+0xc8/0x288
[  354.742416]  bio_endio+0x170/0x270
[  354.745889]  blk_mq_end_request_batch+0x2e0/0x558
[  354.750696]  nvme_pci_complete_batch+0x94/0x118 [nvme]
[  354.755958]  nvme_irq+0x9c/0xb0 [nvme]
[  354.759788]  __handle_irq_event_percpu+0x68/0x2c0
[  354.764595]  handle_irq_event+0x58/0xe8
[  354.768511]  handle_fasteoi_irq+0xb0/0x218
[  354.772695]  generic_handle_domain_irq+0x38/0x70
[  354.777411]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[  354.783195]  gic_handle_irq+0x2c/0xa0
[  354.786935]  call_on_irq_stack+0x3c/0x50
[  354.790941]  do_interrupt_handler+0xb0/0xc8
[  354.795214]  el1_interrupt+0x48/0xf0
[  354.798866]  el1h_64_irq_handler+0x1c/0x40
[  354.803050]  el1h_64_irq+0x7c/0x80
[  354.806523]  cpuidle_enter_state+0xd8/0x790
[  354.810795]  cpuidle_enter+0x44/0x78
[  354.814446]  cpuidle_idle_call+0x15c/0x210
[  354.818631]  do_idle+0xb0/0x130
[  354.821837]  cpu_startup_entry+0x44/0x50
[  354.825845]  secondary_start_kernel+0xec/0x130
[  354.830386]  __secondary_switched+0xc0/0xc8
[  354.834661] Code: b9001041 d503201f 52800001 52800022 (88e17c02) 
[  354.840893] SMP: stopping secondary CPUs
[  355.897569] SMP: failed to stop secondary CPUs 0-60,62-143
[  355.904206] Starting crashdump kernel...
[  355.908214] ------------[ cut here ]------------
[  355.912930] Some CPUs may be stale, kdump will be unreliable.
[  355.918807] WARNING: CPU: 61 PID: 0 at arch/arm64/kernel/machine_kexec.c:174 machine_kexec+0x48/0x1f0
[  355.928236] Modules linked in: qrtr cfg80211 binfmt_misc nls_iso8859_1 input_leds dax_hmem cxl_acpi acpi_ipmi onboard_usb_hub nvidia_cspmu ipmi_ssif cxl_co
re ipmi_devintf arm_cspmu_module arm_smmuv3_pmu ipmi_msghandler uio_pdrv_genirq uio spi_nor cppc_cpufreq joydev mtd acpi_power_meter dm_multipath nvme_fabrics
 efi_pstore nfnetlink dmi_sysfs ip_tables x_tables autofs4 btrfs blake2b_generic raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor
 xor_neon raid6_pq libcrc32c raid1 raid0 hid_generic rndis_host usbhid cdc_ether hid usbnet uas usb_storage crct10dif_ce polyval_ce polyval_generic ghash_ce s
m4_ce_gcm sm4_ce_ccm sm4_ce sm4_ce_cipher sm4 sm3_ce sm3 nvme sha3_ce i2c_smbus ixgbe sha2_ce nvme_core ast sha256_arm64 xhci_pci sha1_ce xfrm_algo xhci_pci_r
enesas i2c_algo_bit nvme_auth mdio spi_tegra210_quad i2c_tegra aes_neon_bs aes_neon_blk aes_ce_blk aes_ce_cipher
[  356.008649] CPU: 61 PID: 0 Comm: swapper/61 Kdump: loaded Not tainted 6.8.0-060800-generic-64k #202403131158
[  356.018699] Hardware name: Supermicro MBD-G1SMH/G1SMH, BIOS 1.0c 12/28/2023
[  356.025815] pstate: 634000c9 (nZCv daIF +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[  356.032932] pc : machine_kexec+0x48/0x1f0
[  356.037027] lr : machine_kexec+0x48/0x1f0
[  356.041121] sp : ffff8000807af620
[  356.044504] x29: ffff8000807af620 x28: ffff0000894178c0 x27: 0000000000000000
[  356.051800] x26: ffffa06102735fd8 x25: 00000000000000c0 x24: 0000000000000000
[  356.059096] x23: ffffa0610273afc0 x22: ffffa061043200f8 x21: ffffa0610437a000
[  356.066393] x20: ffff0000d13db400 x19: ffff0000d13db400 x18: ffff8000819a00e8
[  356.073688] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000040000
[  356.080983] x14: 0000000000000000 x13: 2e656c6261696c65 x12: 726e75206562206c
[  356.088279] x11: 6c697720706d7564 x10: 0000000000000000 x9 : 0000000000000000
[  356.095574] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[  356.102871] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[  356.110166] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000000
[  356.117463] Call trace:
[  356.119957]  machine_kexec+0x48/0x1f0
[  356.123698]  __crash_kexec+0x94/0x128
[  356.127440]  crash_kexec+0x4c/0xb8
[  356.130913]  die+0x27c/0x2c8
[  356.133853]  die_kernel_fault+0x110/0x1d0
[  356.137948]  __do_kernel_fault+0x1e4/0x200
[  356.142133]  do_alignment_fault+0x90/0xc8
[  356.146228]  do_mem_abort+0x50/0xd0
[  356.149789]  el1_abort+0x50/0xd8
[  356.153086]  el1h_64_sync_handler+0x114/0x1c0
[  356.157536]  el1h_64_sync+0x7c/0x80
[  356.161098]  _raw_spin_lock_irqsave+0x44/0x100
[  356.165636]  try_to_wake_up+0x68/0x758
[  356.169466]  wake_up_process+0x24/0x50
[  356.173295]  aio_complete+0x1c4/0x2b8
[  356.177037]  aio_complete_rw+0x11c/0x2c8
[  356.181042]  iomap_dio_bio_end_io+0x1f0/0x248
[  356.185494]  bio_endio+0x170/0x270
[  356.188969]  __dm_io_complete+0x180/0x200
[  356.193066]  clone_endio+0xc8/0x288
[  356.196627]  bio_endio+0x170/0x270
[  356.200101]  blk_mq_end_request_batch+0x2e0/0x558
[  356.204909]  nvme_pci_complete_batch+0x94/0x118 [nvme]
[  356.210164]  nvme_irq+0x9c/0xb0 [nvme]
[  356.213995]  __handle_irq_event_percpu+0x68/0x2c0
[  356.218802]  handle_irq_event+0x58/0xe8
[  356.222718]  handle_fasteoi_irq+0xb0/0x218
[  356.226903]  generic_handle_domain_irq+0x38/0x70
[  356.231619]  __gic_handle_irq_from_irqson.isra.0+0x180/0x310
[  356.237405]  gic_handle_irq+0x2c/0xa0
[  356.241144]  call_on_irq_stack+0x3c/0x50
[  356.245150]  do_interrupt_handler+0xb0/0xc8
[  356.249424]  el1_interrupt+0x48/0xf0
[  356.253074]  el1h_64_irq_handler+0x1c/0x40
[  356.257258]  el1h_64_irq+0x7c/0x80
[  356.260731]  cpuidle_enter_state+0xd8/0x790
[  356.265003]  cpuidle_enter+0x44/0x78
[  356.268655]  cpuidle_idle_call+0x15c/0x210
[  356.272841]  do_idle+0xb0/0x130
[  356.276048]  cpu_startup_entry+0x44/0x50
[  356.280053]  secondary_start_kernel+0xec/0x130
[  356.284594]  __secondary_switched+0xc0/0xc8
[  356.288868] ---[ end trace 0000000000000000 ]---
[  356.293585] Bye!

Brett Grandbois (brettgrand) on 2024-03-21

Changed in linux (Ubuntu):
assignee:	nobody → Jose Ogando Justo (joseogando)

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-21:

#3

I have observed that this panic does not seem to happen when stressing non-device-mapper devices (ex: it panics when running /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device dm-0 --base-time 240, but completes successfully when running /usr/lib/checkbox-provider-base/bin/stress_ng_test.py disk --device nvme0n1 --base-time 240).

I'm going to investigate this further to confirm.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-22:

#4

Upon further investigation, the device mapper observation does not seem to be a hard line, as I was able to observe panics when stressing both dm-0 and nvme0n1 under different circumstances.

At the moment, it also seems like the specific part of stress_ng_test that is the culprit is the "stress-ng aiol stressor". When running only the "aiol" stressor in isolation on linux-image-6.8.0-11-generic-64k, the panic reliably happens in under 5 minutes.

Currently investigating to see if any other stress_ng tests cause the same issue on this kernel version, or if it is only aiol.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-22:

#5

I did not observe this issue with any other stress_ng disk tests on linux-image-6.8.0-11-generic-64k after 1 full run of the suite with the "aiol" test disabled.

(When running the "aiol" test alone, it panicked reliably each time.)

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-22:

#6

Earlier, I said that the device mapper observation did not seem to be a hard line - however, further testing now indicates that the situations where I observed panics when stressing nvme0n1 were due to an unrelated bug that is present in the latest 6.5 mainline tree, but *not* the latest 6.5 Ubuntu kernel tree (6.5.0-26-generic-64k).

Therefore, from the perspective of *this* bug report, it once again *does* appear that this issue is only present when stressing dm-0 and not present when stressing a non-device-mapper device.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-25:

#7

I did some more version testing, and I have not been able to reproduce this bug with the "aiol" stressor on either Upstream 6.5 or Ubuntu 6.5.0-26-generic-64k, so it was evidently introduced after that version.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-27:

#8

It turns out that this issue does not appear with *every* run of the aiol test on affected kernels, so multiple runs of that test may be necessary for the panic to occur.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-28:

#9

I have isolated the cause of this bug to this commit: https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/noble/commit/?h=Ubuntu-6.8.0-20.20&id=71eb6b6b0ba93b1467bccff57b5de746b09113d2

All versions that I tested before this commit during my bisect passed the aiol test at least 15 times in a row, and all versions after this commit panic during at least one test. To confirm, I reverted this patch on the latest 6.8 Ubuntu kernel (which was previously panicking reliably within 5 tests) and verified that, with that change, it passes the test at least 15x in a row without any panics.

The contents of the patch also support this conclusion, as the patch is a change to the Linux AIO interface that introduces new calls to spin_lock_irqsave() and wake_up_process() inside aio_complete(), which corresponds with the content of the traces I have observed.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-03-28:

#10

This issue is still present upstream, so I reported it to the original committer of the patch.

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-04-01:

#11

A fix has been applied to vfs.fixes upstream and should land soon. I have tested this patch and verified that the panic no longer occurs.

Changed in linux (Ubuntu):
status:	New → Fix Committed

Revision history for this message

Mitchell Augustin (mitchellaugustin) wrote on 2024-04-09:

#12

Fix has landed upstream: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/aio.c?h=v6.9-rc3&id=caeb4b0a11b3393e43f7fa8e0a5a18462acc66bd

dann frazier (dannf) on 2024-04-09

Changed in linux (Ubuntu):
assignee:	Jose Ogando Justo (joseogando) → Mitchell Augustin (mitchellaugustin)
status:	Fix Committed → In Progress

Roxana Nicolescu (roxanan) on 2024-04-26

Changed in linux (Ubuntu Noble):
status:	In Progress → Fix Committed

Revision history for this message

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2024-05-06:

#13

This bug is awaiting verification that the linux/6.8.0-32.32 kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-noble-linux' to 'verification-done-noble-linux'. If the problem still exists, change the tag 'verification-needed-noble-linux' to 'verification-failed-noble-linux'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags:

added: kernel-spammed-noble-linux-v2 verification-needed-noble-linux

Mitchell Augustin (mitchellaugustin) 9 hours ago

tags:

added: verification-done-noble-linux
removed: verification-needed-noble-linux

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Fix Committed	Undecided	Mitchell Augustin
	Noble	Fix Committed	Undecided	Mitchell Augustin

Ubuntu
linux package

Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntulinux package

Kernel panic during checkbox stress_ng_test on Grace running noble 6.8 (arm64+largemem) kernel

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
linux package