Comment 0 for bug 2069147

Po-Hsu Lin (cypressyew) wrote :

Issue found with Jammy 5.15.0-1058-nvidia on node hidon (DGXH100); other NVIDIA nodes are unaffected.

Steps:
sudo apt install -y automake bison build-essential byacc flex git keyutils libacl1-dev libaio-dev libcap-dev libmm-dev libnuma-dev libsctp-dev libselinux1-dev libssl-dev libtirpc-dev pkg-config quota xfslibs-dev xfsprogs
git clone https://github.com/linux-test-project/ltp.git
cd ltp
git reset --hard 998df1a5aa5026c5c9b91b0caa3b1188146aa678
make autotools
./configure
make
sudo make install
# Start watching dmesg output here (e.g. "sudo dmesg -w" in a second terminal)
sudo /opt/ltp/testcases/bin/read_all -d /sys/devices/pci0000\:e7/
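
A narrower reproducer may help here. The call trace in this report ends in wq_op_config_show from the idxd module, which in upstream idxd backs a work-queue sysfs attribute named op_config. Assuming the attribute has the same name on this kernel, the sketch below lists only those files so they can be read one at a time instead of walking the whole tree with read_all. The helper name list_op_config is illustrative and not part of LTP.

```shell
# List candidate sysfs attributes under a root directory.
# Assumption: the attribute served by wq_op_config_show is named "op_config".
list_op_config() {
    find "$1" -type f -name op_config 2>/dev/null | sort
}

# On an affected kernel, reading any of these files is expected to trigger the oops:
#   list_op_config /sys/devices/pci0000:e7/ | while read -r f; do sudo cat "$f"; done
```

If no op_config file is found under that root, the assumption about the attribute name does not hold for this kernel and read_all remains the reproducer.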

dmesg output:
[ 206.893706] BUG: kernel NULL pointer dereference, address: 0000000000000018
[ 206.901552] #PF: supervisor read access in kernel mode
[ 206.907341] #PF: error_code(0x0000) - not-present page
[ 206.913128] PGD 1660ef067 P4D 0
[ 206.916775] Oops: 0000 [#1] SMP NOPTI
[ 206.920909] CPU: 31 PID: 4238 Comm: read_all Tainted: G OE 5.15.0-1058-nvidia #59-Ubuntu
[ 206.931379] Hardware name: NVIDIA DGXH100/DGXH100, BIOS 1.1.3 10/30/2023
[ 206.938925] RIP: 0010:op_cap_show_common+0x33/0x110 [idxd]
[ 206.945114] Code: 41 57 49 89 f7 41 56 4c 8d 72 10 41 55 49 89 d5 41 54 53 31 db 48 83 ec 18 48 89 7d c0 65 48 8b 04 25 28 00 00 00 48 89 45 d0 <48> 8b 42 18 48 89 45 c8 89 de 4c 8d 45 c8 b9 40 00 00 00 4c 89 ff
[ 206.966209] RSP: 0018:ff3645a473cbfc50 EFLAGS: 00010286
[ 206.972099] RAX: bf9049f9dfedf700 RBX: 0000000000000000 RCX: 0000000000000000
[ 206.980130] RDX: 0000000000000000 RSI: ff3332a81111e000 RDI: ff3333a809ae6040
[ 206.988158] RBP: ff3645a473cbfc90 R08: ff3333a809ae6040 R09: ff3332a81111e000
[ 206.996183] R10: 000000000000000b R11: 0000000000000000 R12: ffffffffa3cd1fe0
[ 207.004215] R13: 0000000000000000 R14: 0000000000000010 R15: ff3332a81111e000
[ 207.012248] FS: 00007fb56afe3740(0000) GS:ff3333a37ebc0000(0000) knlGS:0000000000000000
[ 207.021356] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 207.027826] CR2: 0000000000000018 CR3: 000000011086a006 CR4: 0000000000771ee0
[ 207.035858] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 207.043890] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 207.051921] PKRU: 55555554
[ 207.054978] Call Trace:
[ 207.057748] <TASK>
[ 207.060125] ? show_trace_log_lvl+0x1d6/0x2ea
[ 207.065043] ? show_trace_log_lvl+0x1d6/0x2ea
[ 207.069952] ? wq_op_config_show+0x18/0x20 [idxd]
[ 207.075253] ? show_regs.part.0+0x23/0x29
[ 207.079774] ? __die_body.cold+0x8/0xd
[ 207.084004] ? __die+0x2b/0x37
[ 207.087454] ? page_fault_oops+0x13b/0x170
[ 207.092085] ? memcg_slab_post_alloc_hook+0x19e/0x210
[ 207.097783] ? __wake_up+0x13/0x20
[ 207.101632] ? do_user_addr_fault+0x321/0x670
[ 207.106543] ? exc_page_fault+0x77/0x170
[ 207.110977] ? asm_exc_page_fault+0x27/0x30
[ 207.115700] ? op_cap_show_common+0x33/0x110 [idxd]
[ 207.121197] wq_op_config_show+0x18/0x20 [idxd]
[ 207.126303] dev_attr_show+0x1a/0x50
[ 207.130333] sysfs_kf_seq_show+0xa2/0x100
[ 207.134857] kernfs_seq_show+0x24/0x30
[ 207.139087] seq_read_iter+0x121/0x4b0
[ 207.143321] kernfs_fop_read_iter+0x30/0x40
[ 207.148035] new_sync_read+0x10a/0x190
[ 207.152269] vfs_read+0x103/0x1a0
[ 207.156008] ksys_read+0x67/0xf0
[ 207.159653] __x64_sys_read+0x19/0x20
[ 207.163781] x64_sys_call+0x1dba/0x1fa0
[ 207.168110] do_syscall_64+0x56/0xb0
[ 207.172142] ? exit_to_user_mode_prepare+0x37/0xb0
[ 207.177545] ? syscall_exit_to_user_mode+0x35/0x50
[ 207.182941] ? x64_sys_call+0x1dba/0x1fa0
[ 207.187449] ? do_syscall_64+0x63/0xb0
[ 207.191676] ? exit_to_user_mode_prepare+0x96/0xb0
[ 207.197077] ? syscall_exit_to_user_mode+0x35/0x50
[ 207.202477] ? x64_sys_call+0x1a55/0x1fa0
[ 207.206998] ? do_syscall_64+0x63/0xb0
[ 207.211226] ? do_syscall_64+0x63/0xb0
[ 207.215455] entry_SYSCALL_64_after_hwframe+0x67/0xd1
[ 207.221142] RIP: 0033:0x7fb56b0fa7e2
[ 207.225176] Code: c0 e9 b2 fe ff ff 50 48 8d 3d 8a b4 0c 00 e8 a5 1d 02 00 0f 1f 44 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 0f 05 <48> 3d 00 f0 ff ff 77 56 c3 0f 1f 44 00 00 48 83 ec 28 48 89 54 24
[ 207.246269] RSP: 002b:00007ffdb852f498 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 207.254787] RAX: ffffffffffffffda RBX: 00007fb56afab028 RCX: 00007fb56b0fa7e2
[ 207.262819] RDX: 00000000000003ff RSI: 00007ffdb852f570 RDI: 0000000000000003
[ 207.270850] RBP: 0000563101392168 R08: 0000000000000000 R09: 00007ffdb852ec30
[ 207.278881] R10: 00007ffdb85ca170 R11: 0000000000000246 R12: 000056310137f012
[ 207.286912] R13: 000056310137f06f R14: 000056310222bab0 R15: 00007fb56afa7000
[ 207.294944] </TASK>
[ 207.297412] Modules linked in: nvidia_uvm(O) nvidia_drm(O) nvidia_modeset(O) intel_rapl_msr intel_rapl_common i10nm_edac nfit x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm binfmt_misc nls_iso8859_1 ipmi_ssif rapl mlx5_ib(OE) isst_if_mbox_pci nvidia(O) intel_th_gth qat_4xxx ib_uverbs(OE) joydev input_leds intel_qat idxd pmt_crashlog pmt_telemetry isst_if_mmio mei_me intel_th_pci pmt_class isst_if_common authenc idxd_bus ib_core(OE) mei intel_th switchtec acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_lenovo hid_generic usbhid hid mlx5_core(OE) ast i2c_algo_bit drm_vram_helper drm_ttm_helper ttm drm_kms_helper syscopyarea crct10dif_pclmul mlxdevm(OE) crc32_pclmul sysfillrect ghash_clmulni_intel
[ 207.297467] sha256_ssse3 mlxfw(OE) sha1_ssse3 sysimgblt psample fb_sys_fops aesni_intel cec tls ixgbe crypto_simd nvme xhci_pci cryptd i2c_i801 rc_core mlx_compat(OE) xfrm_algo dca intel_pmt i2c_ismt i2c_smbus pci_hyperv_intf drm xhci_pci_renesas mdio nvme_core wmi pinctrl_emmitsburg
[ 207.423040] CR2: 0000000000000018
[ 207.426783] ---[ end trace 7e35f51fec2ac5d9 ]---
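
For comparing runs across kernels, the faulting symbol can be pulled out of the log mechanically. This is a minimal sketch, assuming the oops uses the x86_64 format shown above, where the kernel-mode RIP line carries code segment 0010 and a bracketed module name. The helper name oops_rip is illustrative.

```shell
# Print "symbol+offset/size module" for the first kernel-mode RIP line on stdin.
# Assumption: kernel-mode RIP lines look like "RIP: 0010:<symbol> [<module>]".
oops_rip() {
    sed -n 's/.*RIP: 0010:\([^ ]*\) \[\([^]]*\)\].*/\1 \2/p' | head -n 1
}

# Usage:
#   sudo dmesg | oops_rip
```

The user-space RIP line (segment 0033) is deliberately not matched, so only the kernel fault location is reported.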

5.15.0-1054-nvidia looks OK on this system.