dev test in ubuntu_stress_smoke_test hang on some nodes with 4.15 (unable to handle kernel NULL pointer dereference)

Bug #1929187 reported by Po-Hsu Lin
This bug affects 1 person
Affects               Status        Importance  Assigned to  Milestone
Stress-ng             Invalid       Undecided   Unassigned
ubuntu-kernel-tests   Fix Released  Undecided   Unassigned

Bug Description

Issue found on 4.15.0-144-generic with node "akis"

09:24:25 DEBUG| [stdout] dccp RETURNED 0
09:24:25 DEBUG| [stdout] dccp PASSED
09:24:25 DEBUG| [stdout] dentry STARTING

09:24:27 DEBUG| [stdout] dentry RETURNED 0
09:24:27 DEBUG| [stdout] dentry PASSED
09:24:27 DEBUG| [stdout] dev STARTING
^ The test hangs here.

The dev test hangs, and dmesg shows an Oops:
May 21 09:24:29 akis kernel: [ 366.747371] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
May 21 09:24:29 akis kernel: [ 366.752640] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:29 akis kernel: [ 366.755208] IP: knem_miscdev_poll+0x16/0x50 [knem]
May 21 09:24:29 akis kernel: [ 366.764818] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:29 akis kernel: [ 366.767987] PGD 0 P4D 0
May 21 09:24:29 akis kernel: [ 366.767989] Oops: 0000 [#1] SMP PTI
May 21 09:24:29 akis kernel: [ 366.767990] Modules linked in: cuse snd_seq snd_seq_device snd_timer snd soundcore dccp_ipv4 dccp ipx p8023 psnap atm p8022 llc algif_rng algif_aead anubis fcrypt khazad seed tea cmac md4 michael_mic poly1305_x86_64 poly1305_generic rmd128 rmd160 rmd256 rmd320 sha3_generic sm3_generic tgr192 wp512 algif_hash chacha20_x86_64 chacha20_generic blowfish_generic blowfish_x86_64 blowfish_common cast5_avx_x86_64 cast5_generic des3_ede_x86_64 des_generic salsa20_generic camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 cast6_avx_x86_64 cast6_generic cast_common serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic twofish_generic twofish_avx_x86_64 ablk_helper twofish_x86_64_3way lrw twofish_x86_64 twofish_common algif_skcipher af_alg ipmi_ssif nls_iso8859_1 intel_rapl
May 21 09:24:29 akis kernel: [ 366.779064] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:29 akis kernel: [ 366.782008] skx_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass intel_cstate intel_rapl_perf mei_me input_leds joydev lpc_ich wmi mei shpchp ioatdma ipmi_si ipmi_devintf ipmi_msghandler acpi_pad acpi_power_meter mac_hid sch_fq_codel ib_iser(OE) rdma_cm(OE) iw_cm(OE) ib_cm(OE) iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi knem(OE) ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 multipath linear mlx5_ib(OE) ib_uverbs(OE) raid0 ib_core(OE) mlx5_core(OE) mlxfw(OE) hid_generic devlink crct10dif_pclmul crc32_pclmul ast ghash_clmulni_intel vfio_iommu_type1 pcbc usbhid ttm igb vfio drm_kms_helper aesni_intel syscopyarea aes_x86_64 sysfillrect hid mdev(OE) dca sysimgblt
May 21 09:24:29 akis kernel: [ 366.858437] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:29 akis kernel: [ 366.861005] crypto_simd i2c_algo_bit fb_sys_fops mlx_compat(OE) glue_helper cryptd ptp nvme drm pps_core nvme_core uas usb_storage
May 21 09:24:30 akis kernel: [ 366.940077] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 366.951200] CPU: 12 PID: 52783 Comm: stress-ng Tainted: G OE 4.15.0-144-generic #148-Ubuntu
May 21 09:24:30 akis kernel: [ 366.951201] Hardware name: NVIDIA NVIDIA DGX-2/NVIDIA DGX-2, BIOS 0.17 10/11/2018
May 21 09:24:30 akis kernel: [ 366.951203] RIP: 0010:knem_miscdev_poll+0x16/0x50 [knem]
May 21 09:24:30 akis kernel: [ 366.951204] RSP: 0018:ffffbb0e1c22fac8 EFLAGS: 00010202
May 21 09:24:30 akis kernel: [ 366.961357] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 366.968664] RAX: 0000000000000008 RBX: 0000000000000000 RCX: 0000000000000002
May 21 09:24:30 akis kernel: [ 366.968665] RDX: 0000000000000019 RSI: ffffbb0e1c22fc30 RDI: ffff8ed103c1cd00
May 21 09:24:30 akis kernel: [ 366.968665] RBP: ffffbb0e1c22fad0 R08: ffff8ed103c1cd01 R09: 0000000000000016
May 21 09:24:30 akis kernel: [ 366.968666] R10: ffff8ed103c1cd38 R11: 0000000000000000 R12: 0000000000000000
May 21 09:24:30 akis kernel: [ 366.968666] R13: 0000000000000000 R14: ffffbb0e1c22fb3c R15: ffff8ed103c1cd00
May 21 09:24:30 akis kernel: [ 366.968667] FS: 00007f8b4fea60c0(0000) GS:ffff8ed200900000(0000) knlGS:0000000000000000
May 21 09:24:30 akis kernel: [ 366.968669] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 21 09:24:30 akis kernel: [ 366.976781] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 366.981447] CR2: 0000000000000010 CR3: 000000bc44536002 CR4: 00000000007606e0
May 21 09:24:30 akis kernel: [ 366.981448] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 21 09:24:30 akis kernel: [ 366.981449] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 21 09:24:30 akis kernel: [ 366.981449] PKRU: 55555554
May 21 09:24:30 akis kernel: [ 366.981450] Call Trace:
May 21 09:24:30 akis kernel: [ 366.981456] do_sys_poll+0x26f/0x580
May 21 09:24:30 akis kernel: [ 366.988574] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 366.994669] ? misc_open+0x12a/0x170
May 21 09:24:30 akis kernel: [ 366.994673] ? chrdev_open+0xc4/0x1b0
May 21 09:24:30 akis kernel: [ 367.017883] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 367.023171] ? mntput+0x24/0x40
May 21 09:24:30 akis kernel: [ 367.023173] ? terminate_walk+0x90/0x100
May 21 09:24:30 akis kernel: [ 367.023174] ? path_openat+0x64e/0x18b0
May 21 09:24:30 akis kernel: [ 367.023179] ? timerqueue_add+0x57/0x80
May 21 09:24:30 akis kernel: [ 367.031643] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 367.038388] ? do_filp_open+0xaf/0x110
May 21 09:24:30 akis kernel: [ 367.038390] ? _copy_to_user+0x26/0x30
May 21 09:24:30 akis kernel: [ 367.038393] ? cp_new_stat+0x152/0x180
May 21 09:24:30 akis kernel: [ 367.045982] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 367.052125] ? do_vfs_ioctl+0xa8/0x630
May 21 09:24:30 akis kernel: [ 367.052127] ? SYSC_newfstat+0x44/0x70
May 21 09:24:30 akis kernel: [ 367.052128] SyS_poll+0x9b/0x140
May 21 09:24:30 akis kernel: [ 367.052128] ? SyS_poll+0x9b/0x140
May 21 09:24:30 akis kernel: [ 367.052131] do_syscall_64+0x73/0x130
May 21 09:24:30 akis kernel: [ 367.052137] entry_SYSCALL_64_after_hwframe+0x41/0xa6
May 21 09:24:30 akis kernel: [ 367.060366] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.
May 21 09:24:30 akis kernel: [ 367.066389] RIP: 0033:0x7f8b4e98acb9
May 21 09:24:30 akis kernel: [ 367.066389] RSP: 002b:00007ffd6617b8a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000007
May 21 09:24:30 akis kernel: [ 367.066390] RAX: ffffffffffffffda RBX: 00007ffd6617b938 RCX: 00007f8b4e98acb9
May 21 09:24:30 akis kernel: [ 367.066391] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 00007ffd6617b938
May 21 09:24:30 akis kernel: [ 367.066391] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000036
May 21 09:24:30 akis kernel: [ 367.066391] R10: 00007ffd6617b8a0 R11: 0000000000000293 R12: 0000000000000000
May 21 09:24:30 akis kernel: [ 367.066394] R13: 0000000000000000 R14: 0000000000000016 R15: 00007ffd6617fe60
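Because the sr0 "CDROM not ready" messages are interleaved with the Oops, the faulting function is easy to miss when skimming. A minimal Python sketch for pulling the faulting symbol and module out of a log like the one above follows. The helper name `find_fault` and the regular expression are illustrative assumptions and not part of any existing tool; the pattern only targets the `IP:` / `RIP:` line shapes seen in this trace.

```python
import re

# Matches the kernel-side fault location lines of an x86 Oops, for example:
#   "IP: knem_miscdev_poll+0x16/0x50 [knem]"
#   "RIP: 0010:knem_miscdev_poll+0x16/0x50 [knem]"
# The trailing "[module]" part is absent for symbols built into the kernel.
FAULT_RE = re.compile(
    r"\bR?IP: (?:[0-9a-f]{4}:)?"
    r"(?P<symbol>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?"
)


def find_fault(log_lines):
    """Return (symbol, module) for the first fault location line, or None.

    Lines that do not look like a fault location (such as the interleaved
    sr0 messages or the userspace "RIP: 0033:0x..." line) are skipped.
    """
    for line in log_lines:
        match = FAULT_RE.search(line)
        if match:
            return match.group("symbol"), match.group("module")
    return None


if __name__ == "__main__":
    sample = [
        "May 21 09:24:29 akis kernel: [ 366.752640] sr 0:0:0:0: [sr0] CDROM not ready. Make sure there is a disc in the drive.",
        "May 21 09:24:29 akis kernel: [ 366.755208] IP: knem_miscdev_poll+0x16/0x50 [knem]",
    ]
    print(find_fault(sample))
```

For the trace in this report, the first matching line is the `IP:` line, which points at `knem_miscdev_poll` in the `knem` module.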


Po-Hsu Lin (cypressyew)
description: updated
tags: added: 4.15 bionic kqa-blocker sru-20210510 ubuntu-stress-smoke-test
dann frazier (dannf) wrote :

The crash here is in knem, a module provided by the Mellanox OFED (MOFED) stack. We've seen crashes in this module before and have not been able to reproduce them without MOFED, so this is unlikely to be a bug in the Ubuntu kernel.
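As a rough aid for this kind of triage, the "Modules linked in:" list in the Oops marks modules with taint flags in parentheses, where `O` means out-of-tree and `E` means unsigned. The Python sketch below lists the modules flagged as out-of-tree. The function name `out_of_tree_modules` is a hypothetical helper, and the parsing is a heuristic that assumes the `name(FLAGS)` form seen in this trace.

```python
import re

# A module entry with taint flags looks like "knem(OE)".
# Entries without parentheses (for example "ip_tables") carry no flags.
MODULE_FLAG_RE = re.compile(r"\b(?P<name>\w+)\((?P<flags>[A-Z]+)\)")


def out_of_tree_modules(module_list):
    """Return the names of modules whose flags include 'O' (out-of-tree)."""
    return [
        match.group("name")
        for match in MODULE_FLAG_RE.finditer(module_list)
        if "O" in match.group("flags")
    ]


if __name__ == "__main__":
    excerpt = "iscsi_tcp libiscsi scsi_transport_iscsi knem(OE) ip_tables mlx5_ib(OE)"
    print(out_of_tree_modules(excerpt))
```

Applied to the module list in the description, this would report knem alongside the other `(OE)` modules such as the mlx5 and ib_* ones.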

Po-Hsu Lin (cypressyew) wrote :

Thanks for the info, Dann!

Kleber Sacilotto de Souza (kleber-souza) wrote :

Found on other nodes as well (e.g. exotic-skunk). Adjusting bug title.

tags: removed: kqa-blocker
summary: - dev test in ubuntu_stress_smoke_test hang on node akis with 4.15 (unable
- to handle kernel NULL pointer dereference)
+ dev test in ubuntu_stress_smoke_test hang on some nodes with 4.15
+ (unable to handle kernel NULL pointer dereference)
Francis Ginther (fginther) wrote :

Deployment of these nvidia systems has been updated to not include the knem module.
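One way to confirm on a given node that knem is no longer present before running the dev stressor is to look for it in `/proc/modules`, where the first field of each line is the module name. The following Python sketch does that; the function names are hypothetical and this is not part of the ubuntu-kernel-tests suite.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text.

    Each non-empty line starts with the module name, followed by its size,
    reference count, dependencies, state and load address.
    """
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def is_module_loaded(name, proc_modules_path="/proc/modules"):
    """Check whether a module with the given name is currently loaded."""
    with open(proc_modules_path) as handle:
        return name in loaded_modules(handle.read())


if __name__ == "__main__":
    if is_module_loaded("knem"):
        print("knem is loaded; the dev test may hit the Oops described above")
    else:
        print("knem is not loaded")
```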

Changed in stress-ng:
status: New → Invalid
Changed in ubuntu-kernel-tests:
status: New → Fix Released