[Potential Regression] cpuhotplug related tests triggers kernel bug (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge

Bug #2029917 reported by Po-Hsu Lin
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
ubuntu-kernel-tests
New
Undecided
Unassigned
linux-aws (Ubuntu)
Invalid
Undecided
Unassigned
Bionic
Confirmed
Undecided
Unassigned
Focal
Confirmed
Undecided
Unassigned

Bug Description

Issue found with 5.4.0-1107.115~18.04.1 Bionic AWS and 5.4.0-1107.115 Focal AWS kernel, on c3.xlarge instance only.

cpu-hotplug related tests will crash the instance, they are:
* cpuset_hotplug in ubuntu_ltp_controllers
* cpuhotplug:cpuhotplug02 in ubuntu_ltp (comment #7 in this bug)
* cpu-hotplug:cpu-on-off-test.sh in ubuntu_kernel_selftests (comment #8 in this bug)

Take cpuset_hotplug in ubuntu_ltp_controllers for example.

There is no output from the test itself (looks like it has crashed):
 START ubuntu_ltp_controllers.cpuset_hotplug ubuntu_ltp_controllers.cpuset_hotplug timestamp=1689920544 timeout=4500 localtime=Jul 21 06:22:24
 Persistent state client._record_indent now set to 2
 Persistent state client.unexpected_reboot now set to ('ubuntu_ltp_controllers.cpuset_hotplug', 'ubuntu_ltp_controllers.cpuset_hotplug')
 Waiting for pid 925631 for 4500 seconds
 System python is too old, crash handling disabled
(nothing after this point)

But from the console log you will see a kernel BUG and kernel panic:
[ 3451.829941] kernel BUG at /build/linux-aws-5.4-I38rpz/linux-aws-5.4-5.4.0/arch/x86/xen/spinlock.c:62!
[ 3451.833383] invalid opcode: 0000 [#1] SMP PTI
[ 3451.835146] CPU: 1 PID: 14 Comm: cpuhp/1 Tainted: G C 5.4.0-1107-aws #115~18.04.1-Ubuntu
[ 3451.838679] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
[ 3451.840965] RIP: 0010:dummy_handler+0x4/0x10
[ 3451.842675] Code: 8b 75 e4 74 d6 44 89 e7 e8 39 89 61 00 eb d6 44 89 e7 e8 af ab 61 00 eb cc 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 80 3d 69 d0 9f 01 00 75 02 f3
[ 3451.849042] RSP: 0000:ffffb54b0000ee38 EFLAGS: 00010046
[ 3451.851021] RAX: ffffffff92c2e3d0 RBX: 000000000000003b RCX: 0000000000000000
[ 3451.853509] RDX: 0000000000400e00 RSI: 0000000000000000 RDI: 000000000000003b
[ 3451.855996] RBP: ffffb54b0000ee38 R08: ffff8a9de6c01240 R09: ffff8a9de6c01440
[ 3451.858435] R10: 0000000000000000 R11: ffffffff94664da8 R12: 0000000000000000
[ 3451.860896] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8a9de6583200
[ 3451.863313] FS: 0000000000000000(0000) GS:ffff8a9de8040000(0000) knlGS:0000000000000000
[ 3451.899246] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3451.901338] CR2: 0000000000000000 CR3: 000000002040a001 CR4: 00000000001606e0
[ 3451.903757] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3451.906184] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3451.908623] Call Trace:
[ 3451.909869] <IRQ>
[ 3451.911014] __handle_irq_event_percpu+0x44/0x1a0
[ 3451.912818] handle_irq_event_percpu+0x32/0x80
[ 3451.914578] handle_percpu_irq+0x3d/0x60
[ 3451.916198] generic_handle_irq+0x28/0x40
[ 3451.917834] handle_irq_for_port+0x8f/0xe0
[ 3451.919493] evtchn_2l_handle_events+0x157/0x270
[ 3451.921298] __xen_evtchn_do_upcall+0x76/0xe0
[ 3451.923046] xen_evtchn_do_upcall+0x2b/0x40
[ 3451.924742] xen_hvm_callback_vector+0xf/0x20
[ 3451.926484] </IRQ>
[ 3451.927632] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 3451.929674] Code: e8 a0 3d 64 ff 4c 29 e0 4c 39 f0 76 cf 80 0b 08 eb 8a 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 d6 ad 66 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 07
[ 3451.935996] RSP: 0000:ffffb54b000fbcf8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 3451.939023] RAX: 0000000000000001 RBX: ffff8a9de6583200 RCX: 000000000002cc00
[ 3451.941475] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[ 3451.943948] RBP: ffffb54b000fbcf8 R08: ffff8a9de6c01240 R09: ffff8a9de6c01440
[ 3451.946382] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000003b
[ 3451.948849] R13: 0000000000000000 R14: ffff8a9d8e75c600 R15: ffff8a9d8e75c6a4
[ 3451.951297] __setup_irq+0x456/0x760
[ 3451.952850] ? kmem_cache_alloc_trace+0x170/0x230
[ 3451.954661] request_threaded_irq+0xfb/0x160
[ 3451.956376] bind_ipi_to_irqhandler+0xba/0x1c0
[ 3451.958113] ? xen_qlock_wait+0x90/0x90
[ 3451.959723] ? snr_uncore_mmio_init+0x20/0x20
[ 3451.961445] xen_init_lock_cpu+0x78/0xd0
[ 3451.963057] ? snr_uncore_mmio_init+0x20/0x20
[ 3451.964810] xen_cpu_up_online+0xe/0x20
[ 3451.966415] cpuhp_invoke_callback+0x8a/0x580
[ 3451.968144] cpuhp_thread_fun+0xb8/0x120
[ 3451.969760] smpboot_thread_fn+0xfc/0x170
[ 3451.971400] kthread+0x121/0x140
[ 3451.972855] ? sort_range+0x30/0x30
[ 3451.974378] ? kthread_park+0x90/0x90
[ 3451.975929] ret_from_fork+0x35/0x40
[ 3451.977454] Modules linked in: exfat(C) ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs nfsd auth_rpcgss nfs_acl lockd grace sunrpc nls_iso8859_1 binfmt_misc serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ixgbevf
[ 3451.992926] ---[ end trace 4433bc23c8979a4c ]---
[ 3451.994720] RIP: 0010:dummy_handler+0x4/0x10
[ 3451.996427] Code: 8b 75 e4 74 d6 44 89 e7 e8 39 89 61 00 eb d6 44 89 e7 e8 af ab 61 00 eb cc 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 80 3d 69 d0 9f 01 00 75 02 f3
[ 3452.002753] RSP: 0000:ffffb54b0000ee38 EFLAGS: 00010046
[ 3452.004708] RAX: ffffffff92c2e3d0 RBX: 000000000000003b RCX: 0000000000000000
[ 3452.007130] RDX: 0000000000400e00 RSI: 0000000000000000 RDI: 000000000000003b
[ 3452.009569] RBP: ffffb54b0000ee38 R08: ffff8a9de6c01240 R09: ffff8a9de6c01440
[ 3452.011998] R10: 0000000000000000 R11: ffffffff94664da8 R12: 0000000000000000
[ 3452.014449] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8a9de6583200
[ 3452.016893] FS: 0000000000000000(0000) GS:ffff8a9de8040000(0000) knlGS:0000000000000000
[ 3452.020028] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3452.022109] CR2: 0000000000000000 CR3: 000000002040a001 CR4: 00000000001606e0
[ 3452.024568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3452.027003] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3452.029446] Kernel panic - not syncing: Fatal exception in interrupt
[ 3452.031753] Kernel Offset: 0x11c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Po-Hsu Lin (cypressyew)
tags: added: sru-20230710
Po-Hsu Lin (cypressyew)
summary: cpuset_hotplug in ubuntu_ltp_controllers triggers kernel bug
- (arch/x86/xen/spinlock.c:62) on AWS cloud c3.xlarge
+ (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge
Revision history for this message
Po-Hsu Lin (cypressyew) wrote : Re: cpuset_hotplug in ubuntu_ltp_controllers triggers kernel bug (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge

Tested with older version of LTP (commit ac1a3e40c5b0) with 5.4.0-1107.115~18.04.1 Bionic AWS, on AWS cloud c3.xlarge. It's triggering a system crash as well:
[20570.648998] kernel BUG at /build/linux-aws-5.4-I38rpz/linux-aws-5.4-5.4.0/arch/x86/xen/spinlock.c:62!

Tested with 5.4.0-1106.114~18.04.1 on the very same system this test can finish without any issue.
<<<test_start>>>
tag=cpuset_hotplug stime=1691141679
cmdline="cpuset_hotplug_test.sh"
contacts=""
analysis=exit
<<<test_output>>>
incrementing stop
cpuset_hotplug 1 TINFO: CPUs are numbered continuously starting at 0 (0-3)
cpuset_hotplug 1 TINFO: Nodes are numbered continuously starting at 0 (0)
cpuset_hotplug 1 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 3 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 5 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 7 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 9 TPASS: Cpuset vs CPU hotplug test succeeded.
cpuset_hotplug 11 TPASS: Cpuset vs CPU hotplug test succeeded.
<<<execution_status>>>
initiation_status="ok"
duration=8 termination_type=exited termination_id=0 corefile=no
cutime=80 cstime=656
<<<test_end>>>
INFO: ltp-pan reported all tests PASS
LTP Version: 20220527

       ###############################################################

            Done executing testcases.
            LTP Version: 20220527
       ###############################################################

Po-Hsu Lin (cypressyew)
summary: - cpuset_hotplug in ubuntu_ltp_controllers triggers kernel bug
- (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge
+ [Potential Regression] cpuset_hotplug in ubuntu_ltp_controllers triggers
+ kernel bug (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud
+ c3.xlarge
Revision history for this message
Po-Hsu Lin (cypressyew) wrote : Re: [Potential Regression] cpuset_hotplug in ubuntu_ltp_controllers triggers kernel bug (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge
Download full text (4.1 KiB)

This issue can be reproduced with X-aws-hwe (4.15.0-1160-aws), passed with X-aws-hwe 4.15.0-1158-aws
I was unable to find X-aws-hwe 4.15.0-1159 to test.

[ 1121.855862] kernel BUG at /build/linux-aws-hwe-dFjJIX/linux-aws-hwe-4.15.0/arch/x86/xen/spinlock.c:69!
[ 1121.857747] invalid opcode: 0000 [#1] SMP PTI
[ 1121.858746] Modules linked in: sb_edac i2c_piix4 intel_rapl_perf serio_raw nfsd auth_rpcgss nfs_acl lockd grace sunrpc ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc cirrus ttm drm_kms_helper syscopyarea aesni_intel sysfillrect sysimgblt fb_sys_fops aes_x86_64 crypto_simd glue_helper drm cryptd i2c_core ixgbevf pata_acpi
[ 1121.868747] CPU: 1 PID: 13 Comm: cpuhp/1 Not tainted 4.15.0-1160-aws #173~16.04.1-Ubuntu
[ 1121.870378] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
[ 1121.871665] RIP: 0010:dummy_handler+0x4/0x10
[ 1121.872535] RSP: 0000:ffff92f327a43e38 EFLAGS: 00010046
[ 1121.873589] RAX: ffffffffa6e2ac40 RBX: ffff92f320d8ee80 RCX: 0000000000000000
[ 1121.875039] RDX: 0000000000400e00 RSI: 0000000000000000 RDI: 000000000000003b
[ 1121.876472] RBP: ffff92f327a43e38 R08: ffff92f32161e400 R09: ffff92f327002480
[ 1121.877908] R10: 0000000000000000 R11: 0000000000000040 R12: 000000000000003b
[ 1121.879346] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[ 1121.880781] FS: 0000000000000000(0000) GS:ffff92f327a40000(0000) knlGS:0000000000000000
[ 1121.882404] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1121.883574] CR2: 0000000000000000 CR3: 00000001bf00a001 CR4: 00000000001606e0
[ 1121.885023] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1121.886472] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1121.887918] Call Trace:
[ 1121.888436] <IRQ>
[ 1121.888871] __handle_irq_event_percpu+0x84/0x1a0
[ 1121.889836] handle_irq_event_percpu+0x32/0x80
[ 1121.890756] handle_percpu_irq+0x3d/0x60
[ 1121.891571] generic_handle_irq+0x28/0x40
[ 1121.892399] handle_irq_for_port+0x82/0xf0
[ 1121.893246] evtchn_2l_handle_events+0x1a7/0x270
[ 1121.894200] __xen_evtchn_do_upcall+0x76/0xe0
[ 1121.895111] xen_evtchn_do_upcall+0x2b/0x50
[ 1121.895974] xen_hvm_callback_vector+0x90/0xa0
[ 1121.896887] </IRQ>
[ 1121.897339] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 1121.898482] RSP: 0000:ffffa2abc0d37d08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 1121.900019] RAX: 0000000000000001 RBX: ffff92f32161e400 RCX: 000000000002cc00
[ 1121.901461] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[ 1121.902912] RBP: ffffa2abc0d37d08 R08: ffff92f32161e470 R09: ffff92f327002480
[ 1121.904363] R10: ffff92f32161e4a4 R11: 0000000000000246 R12: 0000000000000000
[ 1121.905816] R13: ffff92f320d8ee80 R14: 000000000000003b R15: ffff92f32161e560
[ 1121.907274] __setup_irq+0x449/0x740
[ 1121.908018] request_threaded_irq+0x101/0x1b0
[ 1121.908913] bind_ipi_to_irqhandler+0xcc/0x1f0
[ 1121.909829] ? xen_qlock_wait+0x80...

Read more...

Changed in linux-aws (Ubuntu):
status: New → Invalid
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-aws (Ubuntu Bionic):
status: New → Confirmed
Changed in linux-aws (Ubuntu Focal):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Seen on b/aws-5.4 with version 5.4.0-1109 during cycle 2023.08.07.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

I got this hammered again manually this morning by just running the cpuset_hotplug test with ltp
$ cat /home/ubuntu/short
cpuset_hotplug cpuset_hotplug_test.sh
$ sudo /opt/ltp/runltp -f /home/ubuntu/short ; sudo reboot

* with -1160 the fail rate is 5 out of 25 attempts
* with -1159 deb downloaded from the link you provided, the fail rate is 2 out of 25 attempts
* with -1158 the fail rate is 4 out of 25 attempts

I am not sure why this seems failing constantly when being tested on our jenkins. Probably because of other tests started before this cpuset_hotplug test.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

The cpuhotplug02 test in ubuntu_ltp can trigger the same error message with AWS 5.4.0-1108-aws

[ 366.764348] ------------[ cut here ]------------
[ 366.764350] kernel BUG at arch/x86/xen/spinlock.c:62!
[ 366.765592] invalid opcode: 0000 [#1] SMP PTI
(system rebooted here)

The console output is useless in this case because the system got rebooted.

Po-Hsu Lin (cypressyew)
summary: - [Potential Regression] cpuset_hotplug in ubuntu_ltp_controllers triggers
- kernel bug (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud
- c3.xlarge
+ [Potential Regression] ubuntu_ltp_controllers/cpuset_hotplug and
+ ubuntu_ltp/cpuhotplug:cpuhotplug02 triggers kernel bug
+ (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge
Po-Hsu Lin (cypressyew)
summary: - [Potential Regression] ubuntu_ltp_controllers/cpuset_hotplug and
- ubuntu_ltp/cpuhotplug:cpuhotplug02 triggers kernel bug
+ [Potential Regression] cpuhotplug related tests triggers kernel bug
(arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge
Revision history for this message
Po-Hsu Lin (cypressyew) wrote (last edit ):

This also applies to cpu-hotplug:cpu-on-off-test.sh in ubuntu_kernel_selftests

It can trigger the same error with AWS 5.4.0-1108-aws on c3.xlarge as well
[ 549.176719] IRQ 75: no longer affine to CPU3
[ 549.176735] IRQ 76: no longer affine to CPU3
[ 549.176750] IRQ 77: no longer affine to CPU3
[ 549.178081] smpboot: CPU 3 is now offline
[ 549.234896] installing Xen timer for CPU 3
[ 549.235323] smpboot: Booting Node 0 Processor 3 APIC 0x3
[ 549.236789] ------------[ cut here ]------------
[ 549.236791] kernel BUG at arch/x86/xen/spinlock.c:62!
[ 549.237807] invalid opcode: 0000 [#1] SMP PTI
[ 549.238692] CPU: 3 PID: 24 Comm: cpuhp/3 Not tainted 5.4.0-1108-aws #116-Ubuntu

Bug title / content updated.

description: updated
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

ubuntu_ltp/cpuhotplug:cpuhotplug04 can be found failing with system crash on AWS c3.xlarge with 5.15.0-1046.51~20.04.1 as well.

Revision history for this message
Magali Lemes do Sacramento (magalilemes) wrote :

Similarly, the following tests also fail (crash the instance) on j/gcp-fips with version 5.15.0-1048.56+fips1 during development cycle d2024.01.02 on the n2d-standard-4.sev_snp instance:
 - cpuhotplug:cpuhotplug02 in ubuntu_ltp and
 - cpu-hotplug:cpu-on-off-test.sh in ubuntu_kernel_selftests

If necessary, another bug could be filed targeting GCP specifically.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.