GPU device disable/enable test failure

Bug #1853014 reported by Joseph Salisbury
This bug affects 1 person
Affects: linux-azure (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We found a GPU device disable/enable test failure that is related to the call trace below. The call trace is logged at the device disable step, when the GPU device is disabled.

The system does not panic, but the driver is not loaded again afterward.

% echo 1 > /sys/bus/pci/devices/c09d:00:00.0/remove

Note: after this command, the PCI bus is not removed; only the ‘remove’ file disappears, and the call trace below is logged. All other PCI devices are removed successfully.
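For anyone trying to reproduce this, the disable/enable cycle can be sketched as a few shell helpers. This is only a sketch under assumptions: the report shows the disable command but not the enable step, so the write to the PCI `rescan` file is an assumed enable step, and the helper names and the `SYSFS_ROOT` override (defaults to /sys) are hypothetical, added so the helpers can be tried against a fake directory tree.

```shell
#!/bin/sh
# Sketch of a PCI device disable/enable cycle through sysfs.
# SYSFS_ROOT is a hypothetical override (default: /sys) so these helpers
# can be exercised against a fake tree instead of real hardware.

# Return 0 if the sysfs directory for the given PCI address exists.
device_present() {
    [ -d "${SYSFS_ROOT:-/sys}/bus/pci/devices/$1" ]
}

# Disable step from the report: write 1 to the device's 'remove' file.
device_remove() {
    echo 1 > "${SYSFS_ROOT:-/sys}/bus/pci/devices/$1/remove"
}

# Enable step (assumed, not shown in the report): rescan the PCI buses.
bus_rescan() {
    echo 1 > "${SYSFS_ROOT:-/sys}/bus/pci/rescan"
}

# Run one cycle. Returns 2 if the device directory is still there after
# the remove, which matches the symptom described in this report.
disable_enable_cycle() {
    dev="$1"
    device_present "$dev" || { echo "device $dev not present"; return 1; }
    device_remove "$dev" || { echo "write to remove failed"; return 1; }
    if device_present "$dev"; then
        echo "device $dev still present after remove"
        return 2
    fi
    bus_rescan || { echo "rescan failed"; return 1; }
    if device_present "$dev"; then
        echo "device $dev back after rescan"
    else
        echo "device $dev missing after rescan"
        return 3
    fi
}
```

Run as root, for example `disable_enable_cycle c09d:00:00.0`, then check the kernel log for the warning shown below.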

<Call trace>
[ 56.649648] hv_balloon: Max. dynamic memory size: 57344 MB
[ 457.438303] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[ 457.438305] ------------[ cut here ]------------
[ 457.438465] WARNING: CPU: 4 PID: 5026 at /var/lib/dkms/nvidia/430.50/build/nvidia/nv.c:4068 nvidia_remove+0x39d/0x3b0 [nvidia]
[ 457.438466] Modules linked in: xt_owner xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_security bpfilter nvidia_uvm(OE) nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops nls_iso8859_1 drm drm_panel_orientation_quirks ipmi_devintf ipmi_msghandler i2c_core pci_hyperv hv_balloon serio_raw sch_fq_codel joydev ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hid_generic aesni_intel aes_x86_64 hyperv_fb crypto_simd cryptd glue_helper hid_hyperv cfbfillrect cfbimgblt hyperv_keyboard cfbcopyarea pata_acpi hid hv_netvsc hv_utils
[ 457.438493] CPU: 4 PID: 5026 Comm: bash Tainted: P OE 5.0.0-1025-azure #27~18.04.1-Ubuntu
[ 457.438494] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017
[ 457.438564] RIP: 0010:nvidia_remove+0x39d/0x3b0 [nvidia]
[ 457.438565] Code: ff e8 17 c5 9a f3 41 8b 95 68 04 00 00 48 c7 c6 f8 97 8e c1 bf 04 00 00 00 e8 cf 9c 00 00 48 c7 c7 b0 82 8e c1 e8 b6 8b a1 f3 <0f> 0b e8 cc a2 00 00 eb f9 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
[ 457.438566] RSP: 0018:ffffb1578bcfbcf8 EFLAGS: 00010282
[ 457.438567] RAX: 0000000000000024 RBX: ffff8ec43bdf0000 RCX: 0000000000000006
[ 457.438568] RDX: 0000000000000000 RSI: 0000000000000086 RDI: ffff8ec445d15580
[ 457.438568] RBP: ffffb1578bcfbd40 R08: 0000000000000001 R09: 000000000000023c
[ 457.438569] R10: ffffb1578bcfba38 R11: 0000000000000000 R12: ffff8ec43d3b2000
[ 457.438569] R13: ffff8ec4388b3000 R14: ffffffffc19411b0 R15: 0000000000000060
[ 457.438570] FS: 00007f92d7263740(0000) GS:ffff8ec445d00000(0000) knlGS:0000000000000000
[ 457.438573] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 457.438573] CR2: 0000560d3d973f60 CR3: 0000000e4aeca004 CR4: 00000000001606e0
[ 457.438574] Call Trace:
[ 457.438579] pci_device_remove+0x3e/0xc0
[ 457.438582] device_release_driver_internal+0x18d/0x260
[ 457.438583] device_release_driver+0x12/0x20
[ 457.438585] pci_stop_bus_device+0x68/0x90
[ 457.438586] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 457.438588] remove_store+0x7c/0x90
[ 457.438590] dev_attr_store+0x1b/0x30
[ 457.438592] sysfs_kf_write+0x3c/0x50
[ 457.438593] kernfs_fop_write+0x125/0x1a0
[ 457.438596] __vfs_write+0x1b/0x40
[ 457.438598] vfs_write+0xb1/0x1a0
[ 457.438599] ksys_write+0x5c/0xe0
[ 457.438601] __x64_sys_write+0x1a/0x20
[ 457.438603] do_syscall_64+0x64/0x1b0
[ 457.438607] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 457.438608] RIP: 0033:0x7f92d6947154
[ 457.438609] Code: 89 02 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 8d 05 b1 07 2e 00 8b 00 85 c0 75 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 f3 c3 66 90 41 54 55 49 89 d4 53 48 89 f5
[ 457.438610] RSP: 002b:00007ffe5a69f208 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 457.438611] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f92d6947154
[ 457.438612] RDX: 0000000000000002 RSI: 0000560d3d7bd8c0 RDI: 0000000000000001
[ 457.438612] RBP: 0000560d3d7bd8c0 R08: 000000000000000a R09: 0000000000000001
[ 457.438613] R10: 000000000000000a R11: 0000000000000246 R12: 00007f92d6c23760
[ 457.438613] R13: 0000000000000002 R14: 00007f92d6c1f2a0 R15: 00007f92d6c1e760
[ 457.438615] ---[ end trace 64ddc7a9a2dd8bd8 ]---
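To flag this condition in an automated test run, the kernel log can be searched for the NVRM message quoted in the trace above. A minimal sketch; the function name is hypothetical and it simply does a fixed-string match on its standard input.

```shell
#!/bin/sh
# Return 0 if the kernel log text on stdin contains the NVRM warning
# quoted in this report (fixed-string match, no output).
has_nvrm_remove_warning() {
    grep -Fq 'NVRM: Attempting to remove minor device'
}
```

For example: `dmesg | has_nvrm_remove_warning && echo "GPU remove hit non-zero usage count"`.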

Kernel: 5.0.0-1025-azure

This issue occurs on Ubuntu 18.04 but not on 16.04.
