ftracetest from selftests in linux ADT test failure with jammy/linux-intel-iotg (kernel NULL pointer dereference)

Bug #2020607 reported by Jian Hui Lee
This bug affects 1 person
Affects                    Status   Importance  Assigned to  Milestone
ubuntu-kernel-tests        New      Undecided   Unassigned
linux-intel-iotg (Ubuntu)  Invalid  Undecided   Unassigned
Jammy                      New      Undecided   Unassigned

Bug Description

The failure is seen only on the machine rizzo.

How to reproduce:
1. Run the net selftests from the kernel tree.
2. Run ftracetest from the kernel tree. There is a high chance that this triggers a kernel oops.
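
The two steps above can be sketched as a small driver script. This is only an illustration under stated assumptions: the report does not give the exact command lines used by the ADT run, so the sketch assumes a built kernel source tree at a hypothetical path, the standard kselftest make target for the net tests, and the ftracetest script in tools/testing/selftests/ftrace. The function names are made up for this example. ftracetest needs root, and by default the script only prints what it would run.

```python
import shlex
import subprocess
from pathlib import Path


def build_repro_commands(kernel_tree):
    """Return (cwd, argv) pairs for the two reproduction steps, in order.

    kernel_tree is an assumed path to a kernel source tree; the report
    does not say where the selftests were run from.
    """
    selftests = Path(kernel_tree) / "tools" / "testing" / "selftests"
    return [
        # Step 1: net selftests through the kselftest make target.
        (None, ["make", "-C", str(selftests), "TARGETS=net", "run_tests"]),
        # Step 2: ftrace selftests, started from their own directory.
        (str(selftests / "ftrace"), ["./ftracetest"]),
    ]


def run_repro(kernel_tree, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cwd, argv in build_repro_commands(kernel_tree):
        print("+", shlex.join(argv), f"(cwd={cwd})" if cwd else "")
        if not dry_run:
            # check=False: a failing selftest should not stop the sequence.
            subprocess.run(argv, cwd=cwd, check=False)
```

Calling run_repro("/path/to/linux") lists the commands without touching the machine; passing dry_run=False as root on an affected kernel may panic the host, as the trace below shows.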

The issue can be reproduced on kernels 5.15.112-0515112 (mainline), 5.15.0-1030-intel-iotg, and 5.15.0-74-generic on the same machine.
The issue could not be reproduced on kernels 5.19.0-42-generic, 5.17.15-051715 (mainline), and 5.16.20-051620 (mainline).

[13279.176639] BUG: kernel NULL pointer dereference, address: 0000000000000000
[13279.183712] #PF: supervisor instruction fetch in kernel mode
[13279.189446] #PF: error_code(0x0010) - not-present page
[13279.194654] PGD 0 P4D 0
[13279.197230] Oops: 0010 [#1] SMP PTI
[13279.200767] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-74-generic #81-Ubuntu
[13279.208431] Hardware name: Dell Inc. PowerEdge R310/05XKKK, BIOS 1.12.0 09/06/2013
[13279.216100] RIP: 0010:0x0
[13279.218767] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[13279.225721] RSP: 0018:ffff9a1f80003e90 EFLAGS: 00010097
[13279.231013] RAX: 0000000000000000 RBX: 00000000000231f0 RCX: 0000000000000004
[13279.238229] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffff8bfdc0074280
[13279.245449] RBP: ffff9a1f80003eb8 R08: ffff8bfdc0074280 R09: 0000000000000001
[13279.252673] R10: 0000000000000020 R11: ffffffffffffffff R12: ffff8bfdc0074280
[13279.259900] R13: 00000c13a1cab100 R14: 0000000000000004 R15: 0000000000000000
[13279.267124] FS: 0000000000000000(0000) GS:ffff8bfef7600000(0000) knlGS:0000000000000000
[13279.275317] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13279.281140] CR2: ffffffffffffffd6 CR3: 000000004a010000 CR4: 00000000000006f0
[13279.288366] Call Trace:
[13279.290851] <IRQ>
[13279.292902] tick_do_broadcast+0xa1/0xd0
[13279.296894] tick_handle_oneshot_broadcast+0x14d/0x200
[13279.302107] timer_interrupt+0x18/0x30
[13279.305914] __handle_irq_event_percpu+0x42/0x170
[13279.310689] ? timekeeping_advance+0x32a/0x470
[13279.315194] handle_irq_event+0x59/0xb0
[13279.319086] handle_edge_irq+0x8c/0x230
[13279.322976] __common_interrupt+0x52/0xe0
[13279.327045] common_interrupt+0x89/0xa0
[13279.330941] </IRQ>
[13279.333079] <TASK>
[13279.335211] asm_common_interrupt+0x27/0x40
[13279.339461] RIP: 0010:cpuidle_enter_state+0xd9/0x620
[13279.344501] Code: 3d e4 e1 d8 4a e8 77 cb 67 ff 49 89 c7 0f 1f 44 00 00 31 ff e8 b8 d8 67 ff 80 7d d0 00 0f 85 61 01 00 00 fb 66 0f 1f 44 00 00 <45> 85 f6 0f 88 6d 01 00 00 4d 63 ee 49 83 fd 09 0f 87 e7 03 00 00
[13279.363476] RSP: 0018:ffffffffb6603db8 EFLAGS: 00000246
[13279.368768] RAX: 0000000000000000 RBX: ffff8bfef763b900 RCX: 0000000000000000
[13279.375984] RDX: ffff8bfdc01863c0 RSI: 0000000000000002 RDI: 0000000000000000
[13279.383203] RBP: ffffffffb6603e08 R08: 00000c13cc9b2925 R09: 00000000000c3500
[13279.390423] R10: 0000000000000005 R11: 071c71c71c71c71c R12: ffffffffb68d4b20
[13279.397646] R13: 0000000000000004 R14: 0000000000000004 R15: 00000c13cc9b2925
[13279.404883] ? cpuidle_enter_state+0x24a/0x620
[13279.409389] cpuidle_enter+0x2e/0x50
[13279.413019] cpuidle_idle_call+0x142/0x1e0
[13279.417176] do_idle+0x83/0xf0
[13279.422843] cpu_startup_entry+0x20/0x30
[13279.429360] rest_init+0xd3/0x100
[13279.435396] ? acpi_enable_subsystem+0x21d/0x229
[13279.442594] arch_call_rest_init+0xe/0x23
[13279.449114] start_kernel+0x4a9/0x4ca
[13279.455389] x86_64_start_reservations+0x24/0x2a
[13279.462619] x86_64_start_kernel+0xfb/0x106
[13279.469328] secondary_startup_64_no_verify+0xc2/0xcb
[13279.476844] </TASK>
[13279.481553] Modules linked in: br_netfilter tls act_mirred cls_matchall ip6_gre gre ip6_tunnel tunnel6 sch_ingress dummy ip6t_rpfilter mpls_gso mpls_iptunnel mpls_router ip_tunnel esp6 esp4 xfrm_user xfrm_algo l2tp_ip6 l2tp_eth l2tp_ip l2tp_netlink l2tp_core 8021q garp mrp ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_tcpudp sch_etf sch_fq dccp_ipv6 dccp_ipv4 dccp vxlan ip6_udp_tunnel udp_tunnel bridge stp llc vrf nft_counter nft_chain_nat xt_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink algif_hash af_alg veth ipmi_ssif intel_powerclamp coretemp kvm_intel kvm ipmi_si ipmi_devintf binfmt_misc intel_cstate ipmi_msghandler dcdbas acpi_power_meter i7core_edac mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ramoops reed_solomon pstore_blk pstore_zone efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c
[13279.481743] raid1 raid0 multipath linear mgag200 i2c_algo_bit drm_kms_helper gpio_ich syscopyarea sysfillrect sysimgblt fb_sys_fops mpt3sas cec rc_core raid_class drm bnx2 lpc_ich pata_acpi scsi_transport_sas wmi [last unloaded: br_netfilter]
[13279.619551] CR2: 0000000000000000
[13279.626035] ---[ end trace f8201db10668ab38 ]---
[13279.665976] RIP: 0010:0x0
[13279.671856] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[13279.681978] RSP: 0018:ffff9a1f80003e90 EFLAGS: 00010097
[13279.690380] RAX: 0000000000000000 RBX: 00000000000231f0 RCX: 0000000000000004
[13279.700720] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffff8bfdc0074280
[13279.711050] RBP: ffff9a1f80003eb8 R08: ffff8bfdc0074280 R09: 0000000000000001
[13279.721394] R10: 0000000000000020 R11: ffffffffffffffff R12: ffff8bfdc0074280
[13279.731930] R13: 00000c13a1cab100 R14: 0000000000000004 R15: 0000000000000000
[13279.742503] FS: 0000000000000000(0000) GS:ffff8bfef7600000(0000) knlGS:0000000000000000
[13279.753902] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[13279.763030] CR2: ffffffffffffffd6 CR3: 000000004a010000 CR4: 00000000000006f0
[13279.773701] Kernel panic - not syncing: Fatal exception in interrupt
[13279.783722] Kernel Offset: 0x33800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[13279.819653] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---

no longer affects: ubuntu
Changed in linux-intel-iotg (Ubuntu):
status: New → Invalid
description: updated
Po-Hsu Lin (cypressyew)
tags: added: 5.15 jammy sru-20230417 ubuntu-kernel-selftests
Po-Hsu Lin (cypressyew) wrote :

This issue can be found on J-ibm 5.15.0-1032.35 as well.

tags: added: sru-20230515
Po-Hsu Lin (cypressyew) wrote :

Found on J-realtime 5.15.0-1040.45 with node rizzo

Po-Hsu Lin (cypressyew) wrote :

After splitting ubuntu_kselftests_ftrace out and running the test cases one by one, we can see that it fails on the second test case, ftrace:test.d--00basic--basic2.tc, on J-intel-iotg-5.15.0-1048.54 with node rizzo.

However, I was unable to reproduce this manually on rizzo. All of the following passed:
  * Running only ftrace:test.d--00basic--basic2.tc, with "./ftracetest -vvv test.d/00basic/basic2.tc".
  * Running basic2.tc multiple times.
  * Running the first test case followed by the offending basic2.tc test case.
  * Running the whole test suite.

But if you run this remotely from our build server:
  SRU_CYCLE="2024.01.08-1" INSTANCE_TYPE="rizzo" timeout 180m $KT/sut-test --nc --region kernel $DEBUG metal $SUT jammy ubuntu_kselftests_ftrace $HOME

The kernel panics as soon as it reaches the second test case. It looks like it has something to do with CPU hotplug:
[ 5990.967618] mmiotrace: Disabling non-boot CPUs...
[ 5991.032796] smpboot: CPU 1 is now offline
[ 5991.052877] mmiotrace: CPU1 is down.
[ 5991.124833] smpboot: CPU 2 is now offline
[ 5991.140709] mmiotrace: CPU2 is down.
[ 5991.196717] smpboot: CPU 3 is now offline
[ 5991.216486] mmiotrace: CPU3 is down.
[ 5991.233400] smpboot: CPU 4 is now offline
[ 5991.272507] mmiotrace: CPU4 is down.
[ 5991.313356] smpboot: CPU 5 is now offline
[ 5991.328204] mmiotrace: CPU5 is down.
[ 5991.353591] smpboot: CPU 6 is now offline
[ 5991.376155] mmiotrace: CPU6 is down.
[ 5991.393484] smpboot: CPU 7 is now offline
[ 5991.394580] mmiotrace: CPU7 is down.
[ 5991.394586] mmiotrace: enabled.
[ 5991.394693] mmiotrace: Re-enabling CPUs...
[ 5991.394761] x86: Booting SMP configuration:
[ 5991.394763] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 5991.432595] mmiotrace: enabled CPU1.
[ 5991.479537] smpboot: Booting Node 0 Processor 2 APIC 0x4
[ 5991.508524] mmiotrace: enabled CPU2.
[ 5991.547586] smpboot: Booting Node 0 Processor 3 APIC 0x6
[ 5991.576690] mmiotrace: enabled CPU3.
[ 5991.619582] smpboot: Booting Node 0 Processor 4 APIC 0x1
[ 5991.639516] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 5991.646618] #PF: supervisor instruction fetch in kernel mode
[ 5991.652336] #PF: error_code(0x0010) - not-present page
[ 5991.657530] PGD 0 P4D 0
[ 5991.660096] Oops: 0010 [#1] SMP PTI
[ 5991.663626] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.0-1048-intel-iotg #54-Ubuntu
[ 5991.671709] Hardware name: Dell Inc. PowerEdge R310/05XKKK, BIOS 1.12.0 09/06/2013
[ 5991.679350] RIP: 0010:0x0
[ 5991.682010] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
[ 5991.688955] RSP: 0018:ffffb92a40003e90 EFLAGS: 00010097
[ 5991.694233] RAX: 0000000000000000 RBX: 00000000000231f0 RCX: 0000000000000004
[ 5991.701435] RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffff9c20c007b990
[ 5991.708639] RBP: ffffb92a40003eb8 R08: ffff9c20c007b990 R09: 0000000000000001
[ 5991.715842] R10: 0000000000000020 R11: ffffffffffffffff R12: ffff9c20c007b990
[ 5991.723050] R13: 00000572e77e8500 R14: 0000000000000004 R15: 0000000000000000
[ 5991.730252] FS: 0000000000000000(0000) GS:ffff9c21f7600000(0000) knlGS:0000000000000000
[ 5991.738417] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
...
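
To check whether the oops always lands at the same point in the hotplug sequence, the log can be scanned for the CPU that was being brought back online when the first BUG line appeared. The following is a rough sketch that assumes the smpboot and mmiotrace message formats shown above; the function name is hypothetical.

```python
import re

BOOT_RE = re.compile(r"smpboot: Booting Node \d+ Processor (\d+) APIC")
ENABLED_RE = re.compile(r"mmiotrace: enabled CPU(\d+)\.")


def cpu_in_flight_at_oops(dmesg_text):
    """Return the CPU whose bring-up had started but was not yet confirmed
    by mmiotrace when the first "BUG:" line was logged, or None."""
    pending = None
    for line in dmesg_text.splitlines():
        match = BOOT_RE.search(line)
        if match:
            pending = int(match.group(1))
            continue
        match = ENABLED_RE.search(line)
        if match and pending == int(match.group(1)):
            pending = None  # this CPU finished coming online
            continue
        if "BUG:" in line:
            return pending
    return None
```

For the log above this returns 4: CPUs 1 to 3 were re-enabled, CPU 4 had just started booting, and the oops itself was reported on CPU 0.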
