function_graph tracer in ftrace related tests triggers kernel bug (arch/x86/xen/spinlock.c:62) and kernel panic on AWS cloud c3.xlarge

Bug #2034057 reported by Po-Hsu Lin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
ubuntu-kernel-tests
New
Undecided
Unassigned
linux-aws (Ubuntu)
New
Undecided
Unassigned
Bionic
New
Undecided
Unassigned
Focal
New
Undecided
Unassigned
Jammy
New
Undecided
Unassigned
Lunar
Won't Fix
Undecided
Unassigned

Bug Description

Test:
  * ftrace:test.d--00basic--basic2.tc from ubuntu_kselftests_ftrace
  * test_enable_all_tracers() and test_enable_all_tracers() from ubuntu_ftrace_smoke_test

Will crash AWS instance c3.xlarge when testing the "function_graph" tracer.

We have a similar issue filed against Azure (bug 1882669). Filing a new bug report because on AWS this is affecting 5.4 ~ 6.2 AWS kernel. However on Azure this is not affecting newer kernels.

Take B-aws-5.4-1108 for example, with the ubuntu_kselftests_ftrace test the system will crash with the following error message and reboot (similar to bug 2029917):
[ 211.675624] kernel BUG at /build/linux-aws-5.4-VHOhmt/linux-aws-5.4-5.4.0/arch/x86/xen/spinlock.c:62!
[ 211.678258] invalid opcode: 0000 [#1] SMP PTI
[ 211.679596] CPU: 1 PID: 14 Comm: cpuhp/1 Not tainted 5.4.0-1108-aws #116~18.04.1-Ubuntu
[ 211.681825] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
[ 211.683728] RIP: 0010:dummy_handler+0x4/0x10
[ 211.685042] Code: 8b 75 e4 74 d6 44 89 e7 e8 f9 88 61 00 eb d6 44 89 e7 e8 6f ab 61 00 eb cc 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 80 3d 59 d0 9f 01 00 75 02 f3
[ 211.690314] RSP: 0000:ffffaecd4000ee38 EFLAGS: 00010046
[ 211.691934] RAX: ffffffffb462e3e0 RBX: 000000000000003b RCX: 0000000000000000
[ 211.694036] RDX: 0000000000400e00 RSI: 0000000000000000 RDI: 000000000000003b
[ 211.696159] RBP: ffffaecd4000ee38 R08: ffff8aefa6c036c0 R09: ffff8aefa6c038c0
[ 211.698232] R10: 0000000000000000 R11: ffffffffb6064da8 R12: 0000000000000000
[ 211.700365] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8aefa34b8700
[ 211.702482] FS: 0000000000000000(0000) GS:ffff8aefa8040000(0000) knlGS:0000000000000000
[ 211.704894] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 211.706598] CR2: 0000000000000000 CR3: 00000001b400a001 CR4: 00000000001606e0
[ 211.708731] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 211.710835] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 211.713448] Call Trace:
[ 211.714796] <IRQ>
[ 211.715996] __handle_irq_event_percpu+0x44/0x1a0
[ 211.717907] handle_irq_event_percpu+0x32/0x80
[ 211.719759] handle_percpu_irq+0x3d/0x60
[ 211.721474] generic_handle_irq+0x28/0x40
[ 211.723184] handle_irq_for_port+0x8f/0xe0
[ 211.724988] evtchn_2l_handle_events+0x157/0x270
[ 211.726913] __xen_evtchn_do_upcall+0x76/0xe0
[ 211.728749] xen_evtchn_do_upcall+0x2b/0x40
[ 211.730520] xen_hvm_callback_vector+0xf/0x20
[ 211.732271] </IRQ>
[ 211.733425] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 211.735470] Code: e8 70 3d 64 ff 4c 29 e0 4c 39 f0 76 cf 80 0b 08 eb 8a 90 90 90 0f 1f 44 00 00 55 48 89 e5 e8 a6 ad 66 ff 66 90 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 0f 1f 40 00 0f 1f 44 00 00 55 48 89 e5 c6 07
[ 211.741920] RSP: 0000:ffffaecd400fbcf8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 211.744980] RAX: 0000000000000001 RBX: ffff8aefa34b8700 RCX: 000000000002cc00
[ 211.747420] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[ 211.749882] RBP: ffffaecd400fbcf8 R08: ffff8aefa6c036c0 R09: ffff8aefa6c038c0
[ 211.752340] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000003b
[ 211.754785] R13: 0000000000000000 R14: ffff8aef95bdfa00 R15: ffff8aef95bdfaa4
[ 211.757249] __setup_irq+0x456/0x760
[ 211.758779] ? kmem_cache_alloc_trace+0x170/0x230
[ 211.760639] request_threaded_irq+0xfb/0x160
[ 211.762358] bind_ipi_to_irqhandler+0xba/0x1c0
[ 211.764124] ? xen_qlock_wait+0x90/0x90
[ 211.765734] ? snr_uncore_mmio_init+0x20/0x20
[ 211.767496] xen_init_lock_cpu+0x78/0xd0
[ 211.769135] ? snr_uncore_mmio_init+0x20/0x20
[ 211.770864] xen_cpu_up_online+0xe/0x20
[ 211.772500] cpuhp_invoke_callback+0x8a/0x580
[ 211.774233] cpuhp_thread_fun+0xb8/0x120
[ 211.775866] smpboot_thread_fn+0xfc/0x170
[ 211.777524] kthread+0x121/0x140
[ 211.778968] ? sort_range+0x30/0x30
[ 211.780508] ? kthread_park+0x90/0x90
[ 211.782073] ret_from_fork+0x35/0x40
[ 211.783622] Modules linked in: nls_iso8859_1 binfmt_misc serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd glue_helper ixgbevf
[ 211.796497] ---[ end trace bb7d4e9bb7f852cb ]---
[ 211.798303] RIP: 0010:dummy_handler+0x4/0x10
[ 211.800022] Code: 8b 75 e4 74 d6 44 89 e7 e8 f9 88 61 00 eb d6 44 89 e7 e8 6f ab 61 00 eb cc 0f 1f 00 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 <0f> 0b 66 2e 0f 1f 84 00 00 00 00 00 80 3d 59 d0 9f 01 00 75 02 f3
[ 211.806379] RSP: 0000:ffffaecd4000ee38 EFLAGS: 00010046
[ 211.808356] RAX: ffffffffb462e3e0 RBX: 000000000000003b RCX: 0000000000000000
[ 211.810792] RDX: 0000000000400e00 RSI: 0000000000000000 RDI: 000000000000003b
[ 211.813263] RBP: ffffaecd4000ee38 R08: ffff8aefa6c036c0 R09: ffff8aefa6c038c0
[ 211.815713] R10: 0000000000000000 R11: ffffffffb6064da8 R12: 0000000000000000
[ 211.818162] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8aefa34b8700
[ 211.820637] FS: 0000000000000000(0000) GS:ffff8aefa8040000(0000) knlGS:0000000000000000
[ 211.823821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 211.825904] CR2: 0000000000000000 CR3: 00000001b400a001 CR4: 00000000001606e0
[ 211.828357] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 211.830829] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 211.833284] Kernel panic - not syncing: Fatal exception in interrupt
[ 211.835575] Kernel Offset: 0x33600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)

Test output:
 + echo 1
 + . /home/ubuntu/autotest/client/tmp/ubuntu_kselftests_ftrace/src/linux/tools/testing/selftests/ftrace/test.d/00basic/basic2.tc
 + test -f available_tracers
 + cat available_tracers
 + echo hwlat
 + echo blk
 + echo mmiotrace
 + echo function_graph
(Test interrupted here)

To verify this, you can hack linux/tools/testing/selftests/ftrace/test.d/00basic/basic2.tc and add "read -p" to check what tracer we're testing:
  #!/bin/sh
  # SPDX-License-Identifier: GPL-2.0
  # description: Basic test for tracers
  # flags: instance
  test -f available_tracers
  for t in `cat available_tracers`; do
    read -p "testing $t" foo
    echo $t > current_tracer
  done
  echo nop > current_tracer

And run this manually with sudo ./ftracetest -vvv test.d/00basic/basic2.tc, you will see the system crashes right away with "function_graph".

For ubuntu_ftrace_smoke_test, the "function_graph" tracer will be tested with test_function_graph_tracer(). And since "function_graph" tracer is in /sys/kernel/debug/tracing/available_tracers, it will be tested with test_enable_all_tracers() as well.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
Download full text (4.2 KiB)

And this is affecting 4.15 AWS as well. Here is the output from b-aws-4.15-2099

Test output:
+ echo blk
+ echo mmiotrace
+ echo function_graph
 (Test interrupted here)

dmesg:
[ 216.814155] kernel BUG at /build/linux-aws-fips-hcslsq/linux-aws-fips-4.15.0/arch/x86/xen/spinlock.c:69!
[ 216.816072] invalid opcode: 0000 [#1] SMP PTI
[ 216.816991] Modules linked in: nls_iso8859_1 binfmt_misc sb_edac intel_rapl_perf serio_raw sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc cirrus ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops aesni_intel drm aes_x86_64 crypto_simd i2c_piix4 glue_helper cryptd i2c_core ixgbevf pata_acpi
[ 216.827515] CPU: 1 PID: 13 Comm: cpuhp/1 Not tainted 4.15.0-2099-aws-fips #105-Ubuntu
[ 216.829132] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
[ 216.830471] RIP: 0010:dummy_handler+0x4/0x10
[ 216.831362] RSP: 0000:ffffa0c827a43e38 EFLAGS: 00010046
[ 216.832477] RAX: ffffffff8302a9c0 RBX: 000000000000003b RCX: 000000000000003b
[ 216.833951] RDX: 0000000000400e00 RSI: 0000000000000000 RDI: 000000000000003b
[ 216.835425] RBP: ffffa0c827a43e38 R08: ffffa0c827000db0 R09: ffffa0c81b14ea00
[ 216.836902] R10: 0000000000000040 R11: 0000000000000000 R12: 0000000000000000
[ 216.838378] R13: 0000000000000000 R14: 0000000000000000 R15: ffffa0c82444f400
[ 216.839854] FS: 0000000000000000(0000) GS:ffffa0c827a40000(0000) knlGS:0000000000000000
[ 216.841533] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 216.842708] CR2: 0000000000000000 CR3: 000000003b80a001 CR4: 00000000001606e0
[ 216.844141] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 216.845631] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 216.847095] Call Trace:
[ 216.847640] <IRQ>
[ 216.848079] __handle_irq_event_percpu+0x44/0x1a0
[ 216.849070] handle_irq_event_percpu+0x32/0x80
[ 216.850009] handle_percpu_irq+0x3d/0x60
[ 216.850829] generic_handle_irq+0x28/0x40
[ 216.851687] handle_irq_for_port+0x8f/0xe0
[ 216.852556] evtchn_2l_handle_events+0x157/0x270
[ 216.853537] __xen_evtchn_do_upcall+0x76/0xe0
[ 216.854453] xen_evtchn_do_upcall+0x2b/0x50
[ 216.855327] xen_hvm_callback_vector+0x90/0xa0
[ 216.856266] </IRQ>
[ 216.856724] RIP: 0010:_raw_spin_unlock_irqrestore+0x15/0x20
[ 216.857890] RSP: 0000:ffffb95780d37d00 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c
[ 216.859472] RAX: 0000000000000001 RBX: ffffa0c82444f400 RCX: 000000000002cc00
[ 216.860907] RDX: 0000000000000001 RSI: 0000000000000246 RDI: 0000000000000246
[ 216.862353] RBP: ffffb95780d37d00 R08: ffffa0c827000db0 R09: ffffa0c81b14ea00
[ 216.863782] R10: 0000000000000040 R11: 0000000000000246 R12: 000000000000003b
[ 216.865233] R13: 0000000000000000 R14: ffffa0c81b14ea00 R15: ffffa0c81b14eaa4
[ 216.866662] __setup_irq+0x424/0x6e0
[ 216.867392] request_threaded_irq+0xf9/0x170
[ 216.868281] bind_ipi_to_irqhandler+0xc6/0x1f0
[ ...

Read more...

tags: added: 4.15 bionic
Revision history for this message
Roxana Nicolescu (roxanan) wrote (last edit ):
Download full text (5.3 KiB)

It seems the crash happens in the dummy handler for an interrupt that is used for the unlock of a paravirtualized spinlock. It is set when a vcpu is initialized but then the interrupt is disabled. So I assume an interrupt happens between request_irq and disable_irq.
I tried 2 things on jammy (5.15.0 kernel on top of the latest in proposed:

1. disabled the paravirtualized spinlock (used xen_nopvspin as boot param) and as expected the crash did not occur
2. Tried the irq flag: IRQ_NOAUTOEN, basically this is used to not enable the irq when requested.
Unfortunately crash happened again:
[ 434.016304] smpboot: Booting Node 0 Processor 2 APIC 0x1
[ 434.020165] ------------[ cut here ]------------
[ 434.020166] kernel BUG at arch/x86/xen/spinlock.c:62!
[ 434.021628] invalid opcode: 0000 [#1] SMP PTI
[ 434.022901] CPU: 2 PID: 25 Comm: cpuhp/2 Kdump: loaded Not tainted 5.15.0-88-generic #98-Ubuntu
[ 434.025279] Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
[ 434.027056] RIP: 0010:dummy_handler+0x0/0x10
[ 434.028343] Code: e4 e8 54 e1 7a 00 8b 75 e4 84 c0 74 d2 44 89 ef e8 35 a5 7a 00 eb d2 44 89 ef e8 cb e1 7a 00 eb c8 66 0f 1f 84 00 00 00 00 00 <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 41 57
[ 434.033297] RSP: 0000:ffffa06a40180e28 EFLAGS: 00010046
[ 434.034830] RAX: ffffffffa5637ef0 RBX: ffff8e41c88b9b80 RCX: 0000000000000000
[ 434.036839] RDX: 0000000000000018 RSI: 0000000000000000 RDI: 0000000000000041
[ 434.038839] RBP: ffffa06a40180e68 R08: 0000000000000000 R09: ffffffffa757bf68
[ 434.040833] R10: 0000000000000000 R11: ffffa06a40180ff8 R12: 0000000000000000
[ 434.042818] R13: ffffa06a40180e7c R14: 0000000000000041 R15: ffffa06a40180f98
[ 434.044855] FS: 0000000000000000(0000) GS:ffff8e42a8280000(0000) knlGS:0000000000000000
[ 434.047083] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 434.048762] CR2: 0000000000000000 CR3: 0000000149610001 CR4: 00000000001706e0
[ 434.050778] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 434.052767] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 434.054757] Call Trace:
[ 434.056028] <IRQ>
[ 434.057196] ? show_trace_log_lvl+0x1d6/0x2ea
[ 434.058984] ? show_trace_log_lvl+0x1d6/0x2ea
[ 434.060765] ? handle_irq_event_percpu+0x33/0x80
[ 434.062656] ? show_regs.part.0+0x23/0x29
[ 434.064357] ? __die_body.cold+0x8/0xd
[ 434.065977] ? __die+0x2b/0x37
[ 434.067413] ? die+0x30/0x60
[ 434.068818] ? do_trap+0xbe/0x100
[ 434.070345] ? do_error_trap+0x6f/0xb0
[ 434.071947] ? xen_qlock_wait+0x90/0x90
[ 434.073581] ? exc_invalid_op+0x53/0x70
[ 434.075205] ? xen_qlock_wait+0x90/0x90
[ 434.076832] ? asm_exc_invalid_op+0x1b/0x20
[ 434.078546] ? xen_qlock_wait+0x90/0x90
[ 434.080164] ? xen_qlock_wait+0x90/0x90
[ 434.081788] ? xen_qlock_wait+0x90/0x90
[ 434.083411] ? __handle_irq_event_percpu+0x42/0x170
[ 434.085302] handle_irq_event_percpu+0x33/0x80
[ 434.087097] handle_percpu_irq+0x3d/0x60
[ 434.088754] handle_irq_desc+0x3e/0x50
[ 434.090383] generic_handle_irq+0x1f/0x30
[ 434.092056] handle_irq_for_port+0x8d/0x160
[ 434.093769] ? irq_get_irq_data+0xe/0x20
[ 434.095422] evtchn_2l_ha...

Read more...

Revision history for this message
Brian Murray (brian-murray) wrote :

Ubuntu 23.04 (Lunar Lobster) has reached end of life, so this bug will not be fixed for that specific release.

Changed in linux-aws (Ubuntu Lunar):
status: New → Won't Fix
Revision history for this message
Magali Lemes do Sacramento (magalilemes) wrote (last edit ):

This is also been seen on j/gcp-fips with version 5.15.0-1048.56+fips1 during development cycle d2024.01.02 on the n2d-standard-4.sev_snp instance.
Both ftrace-smoke-test from ubuntu_ftrace_smoke_test and ftrace:test.d--00basic--basic2.tc from ubuntu_kselftests_ftrace fail similarly.

If necessary, another bug could be filed targeting GCP specifically.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

The bug for GCP n2d-standard-4.sev_snp can be tracked in bug 2051378

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.