read_all_sys test in ubuntu_ltp triggers "BUG: kernel NULL pointer dereference" with 5.15.0-1058-nvidia on node hidon

Bug #2069147 reported by Po-Hsu Lin
Affects: ubuntu-kernel-tests
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Issue found with Jammy 5.15.0-1058-nvidia on node hidon (DGXH100); other NVIDIA nodes are fine.

Steps:
sudo apt install -y automake bison build-essential byacc flex git keyutils libacl1-dev libaio-dev libcap-dev libmm-dev libnuma-dev libsctp-dev libselinux1-dev libssl-dev libtirpc-dev pkg-config quota xfslibs-dev xfsprogs
git clone https://github.com/linux-test-project/ltp.git
cd ltp
git reset --hard 998df1a5aa5026c5c9b91b0caa3b1188146aa678
make autotools
./configure
make
sudo make install
# Start watching dmesg output here
sudo /opt/ltp/testcases/bin/read_all -d /sys/devices/pci0000\:e7/0000\:e7\:01.0/dsa2/
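
One possible way to watch dmesg for the step above is to follow the kernel log in a second terminal and keep only the lines that identify the oops. This is a minimal sketch, not part of the original reproduction; the function name oops_summary is hypothetical, and the patterns are taken from the dmesg output in this report.

```shell
# Hypothetical helper: read a kernel log on stdin and print only the
# NULL-dereference banner and the kernel-mode RIP line (code segment 0010).
oops_summary() {
    grep -E 'BUG: kernel NULL pointer dereference|RIP: 0010:'
}

# Example use in a second terminal while read_all runs:
#   sudo dmesg --follow | oops_summary
```

Filtering on "RIP: 0010:" skips the later user-space RIP line (0033), so the output names only the faulting kernel function.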

dmesg output:
[ 206.893706] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 206.901552] #PF: supervisor read access in kernel mode
[ 206.907341] #PF: error_code(0x0000) - not-present page
[ 206.913128] PGD 1660ef067 P4D 0
[ 206.916775] Oops: 0000 [#1] SMP NOPTI
[ 206.920909] CPU: 31 PID: 4238 Comm: read_all Tainted: G OE 5.15.0-1058-nvidia #59-Ubuntu
[ 206.931379] Hardware name: NVIDIA DGXH100/DGXH100, BIOS 1.1.3 10/30/2023
[ 206.938925] RIP: 0010:op_cap_show_common+0x33/0x110 [idxd]
[ 206.945114] Code: 41 57 49 89 f7 41 56 4c 8d 72 10 41 55 49 89 d5 41 54 53 31 db 48 83 ec 18 48 89 7d c0 65 48 8b 04 25 28 00 00 00 48 89 45 d0 <48> 8b 42 18 48 89 45 c8 89 de 4c 8d 45 c8 b9 40 00 00 00 4c 89 ff
[ 206.966209] RSP: 0018:ff3645a473cbfc50 EFLAGS: 00010286
[ 206.972099] RAX: bf9049f9dfedf700 RBX: 0000000000000000 RCX: 0000000000000000
[ 206.980130] RDX: 0000000000000000 RSI: ff3332a81111e000 RDI: ff3333a809ae6040
[ 206.988158] RBP: ff3645a473cbfc90 R08: ff3333a809ae6040 R09: ff3332a81111e000
[ 206.996183] R10: 000000000000000b R11: 0000000000000000 R12: ffffffffa3cd1fe0
[ 207.004215] R13: 0000000000000000 R14: 0000000000000010 R15: ff3332a81111e000
[ 207.012248] FS: 00007fb56afe3740(0000) GS:ff3333a37ebc0000(0000) knlGS:0000000000000000
[ 207.021356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 207.027826] CR2: 0000000000000018 CR3: 000000011086a006 CR4: 0000000000771ee0
[ 207.035858] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 207.043890] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 207.051921] PKRU: 55555554
[ 207.054978] Call Trace:
[ 207.057748] <TASK>
[ 207.060125] ? show_trace_log_lvl+0x1d6/0x2ea
[ 207.065043] ? show_trace_log_lvl+0x1d6/0x2ea
[ 207.069952] ? wq_op_config_show+0x18/0x20 [idxd]
[ 207.075253] ? show_regs.part.0+0x23/0x29
[ 207.079774] ? __die_body.cold+0x8/0xd
[ 207.084004] ? __die+0x2b/0x37
[ 207.087454] ? page_fault_oops+0x13b/0x170
[ 207.092085] ? memcg_slab_post_alloc_hook+0x19e/0x210
[ 207.097783] ? __wake_up+0x13/0x20
[ 207.101632] ? do_user_addr_fault+0x321/0x670
[ 207.106543] ? exc_page_fault+0x77/0x170
[ 207.110977] ? asm_exc_page_fault+0x27/0x30
[ 207.115700] ? op_cap_show_common+0x33/0x110 [idxd]
[ 207.121197] wq_op_config_show+0x18/0x20 [idxd]
[ 207.126303] dev_attr_show+0x1a/0x50
[ 207.130333] sysfs_kf_seq_show+0xa2/0x100
[ 207.134857] kernfs_seq_show+0x24/0x30
[ 207.139087] seq_read_iter+0x121/0x4b0
[ 207.143321] kernfs_fop_read_iter+0x30/0x40
[ 207.148035] new_sync_read+0x10a/0x190
[ 207.152269] vfs_read+0x103/0x1a0
[ 207.156008] ksys_read+0x67/0xf0
[ 207.159653] __x64_sys_read+0x19/0x20
[ 207.163781] x64_sys_call+0x1dba/0x1fa0
[ 207.168110] do_syscall_64+0x56/0xb0
[ 207.172142] ? exit_to_user_mode_prepare+0x37/0xb0
[ 207.177545] ? syscall_exit_to_user_mode+0x35/0x50
[ 207.182941] ? x64_sys_call+0x1dba/0x1fa0
[ 207.187449] ? do_syscall_64+0x63/0xb0
[ 207.191676] ? exit_to_user_mode_prepare+0x96/0xb0
[ 207.197077] ? syscall_exit_to_user_mode+0x35/0x50
[ 207.202477] ? x64_sys_call+0x1a55/0x1fa0
[ 207.206998] ? do_syscall_64+0x63/0xb0
[ 207.211226] ? do_syscall_64+0x63/0xb0
[ 207.215455] entry_SYSCALL_64_after_hwframe+0x67/0xd1
[ 207.221142] RIP: 0033:0x7fb56b0fa7e2
[ 207.225176] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 8a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 207.246269] RSP: 002b:00007ffdb852f498 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 207.254787] RAX: ffffffffffffffda RBX: 00007fb56afab028 RCX: 00007fb56b0fa7e2
[ 207.262819] RDX: 00000000000003ff RSI: 00007ffdb852f570 RDI: 0000000000000003
[ 207.270850] RBP: 0000563101392168 R08: 0000000000000000 R09: 00007ffdb852ec30
[ 207.278881] R10: 00007ffdb85ca170 R11: 0000000000000246 R12: 000056310137f012
[ 207.286912] R13: 000056310137f06f R14: 000056310222bab0 R15: 00007fb56afa7000
[ 207.294944] </TASK>
[ 207.297412] Modules linked in: nvidia_uvm(O) nvidia_drm(O) nvidia_modeset(O) intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm binfmt_misc nls_iso8859_1 ipmi_ssif rapl mlx5_ib(OE) isst_if_mbox_pci nvidia(O) intel_th_gth qat_4xxx ib_uverbs(OE) joydev input_leds intel_qat idxd pmt_crashlog pmt_telemetry isst_if_mmio mei_me intel_th_pci pmt_class isst_if_common authenc idxd_bus ib_core(OE) mei intel_th switchtec acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_lenovo hid_generic usbhid hid mlx5_core(OE) ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul mlxdevm(OE) crc32_pclmul sysfillrect ghash_clmulni_intel
[ 207.297467] sha256_ssse3 mlxfw(OE) sha1_ssse3 sysimgblt psample fb_sys_fops aesni_intel cec tls ixgbe crypto_simd nvme xhci_pci cryptd i2c_i801 rc_core mlx_compat(OE) xfrm_algo dca intel_pmt i2c_ismt i2c_smbus pci_hyperv_intf drm xhci_pci_renesas mdio nvme_core wmi pinctrl_emmitsburg
[ 207.423040] CR2: 0000000000000018
[ 207.426783] ---[ end trace 7e35f51fec2ac5d9 ]---

5.15.0-1054-nvidia looks OK on this system.
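
The call trace goes through wq_op_config_show in the idxd module, which suggests the crash comes from reading a work queue's op_config sysfs attribute. The sketch below might help narrow the reproducer to that attribute instead of running the full read_all test. It assumes that files named op_config exist somewhere below the dsa2 directory, which is inferred from the trace and not confirmed in this report; the function name read_op_configs is hypothetical.

```shell
# Hypothetical helper: read every op_config attribute below a sysfs directory.
# Each path is printed before it is read, so if a read oopses the kernel,
# the last "reading" line identifies the attribute that triggered it.
read_op_configs() {
    find "$1" -name op_config -type f 2>/dev/null | sort |
    while IFS= read -r attr; do
        printf 'reading %s\n' "$attr"
        # A non-zero status here means the read failed or cat was killed.
        cat "$attr" >/dev/null 2>&1 || printf 'read failed: %s\n' "$attr"
    done
}

# Example use as root, with the directory from the steps above:
#   read_op_configs /sys/devices/pci0000:e7/0000:e7:01.0/dsa2/
```

If no op_config file is found, the function prints nothing, which would indicate that the assumption about the attribute's location does not hold on that system.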
