nvidia-390 causes kernel hang

Bug #1767932 reported by md_5 on 2018-04-30
This bug affects 4 people
Affects: nvidia-graphics-drivers-390 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Here is the hung task:

Apr 30 15:21:50 michael-desktop-ubuntu kernel: INFO: task nvidia-modeset:243 blocked for more than 120 seconds.
Apr 30 15:21:50 michael-desktop-ubuntu kernel: Tainted: P IOE 4.15.0-20-generic #21-Ubuntu
Apr 30 15:21:50 michael-desktop-ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 30 15:21:50 michael-desktop-ubuntu kernel: nvidia-modeset D 0 243 2 0x80000000
Apr 30 15:21:50 michael-desktop-ubuntu kernel: Call Trace:
Apr 30 15:21:50 michael-desktop-ubuntu kernel: __schedule+0x297/0x8b0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: schedule+0x2c/0x80
Apr 30 15:21:50 michael-desktop-ubuntu kernel: schedule_timeout+0x1cf/0x350
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? schedule_timeout+0x1cf/0x350
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? __slab_free+0x14d/0x2c0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: __down+0x91/0xe0
Apr 30 15:21:50 michael-desktop-ubuntu kernel: down+0x41/0x50
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? down+0x41/0x50
Apr 30 15:21:50 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
Apr 30 15:21:50 michael-desktop-ubuntu kernel: _main_loop+0x76/0x140 [nvidia]
Apr 30 15:21:50 michael-desktop-ubuntu kernel: kthread+0x121/0x140
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? _raw_q_schedule+0x80/0x80 [nvidia]
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ? kthread_create_worker_on_cpu+0x70/0x70
Apr 30 15:21:50 michael-desktop-ubuntu kernel: ret_from_fork+0x35/0x40

uname:
Linux michael-desktop-ubuntu 4.15.0-20-generic #21-Ubuntu SMP Tue Apr 24 06:16:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Drivers:
ii libnvidia-cfg1-390:amd64 390.48-0ubuntu3 amd64 NVIDIA binary OpenGL/GLX configuration library
ii libnvidia-common-390 390.48-0ubuntu3 all Shared files used by the NVIDIA libraries
ii libnvidia-compute-390:amd64 390.48-0ubuntu3 amd64 NVIDIA libcompute package
ii libnvidia-compute-390:i386 390.48-0ubuntu3 i386 NVIDIA libcompute package
ii libnvidia-decode-390:amd64 390.48-0ubuntu3 amd64 NVIDIA Video Decoding runtime libraries
ii libnvidia-decode-390:i386 390.48-0ubuntu3 i386 NVIDIA Video Decoding runtime libraries
ii libnvidia-encode-390:amd64 390.48-0ubuntu3 amd64 NVENC Video Encoding runtime library
ii libnvidia-encode-390:i386 390.48-0ubuntu3 i386 NVENC Video Encoding runtime library
ii libnvidia-fbc1-390:amd64 390.48-0ubuntu3 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-fbc1-390:i386 390.48-0ubuntu3 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
ii libnvidia-gl-390:amd64 390.48-0ubuntu3 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-gl-390:i386 390.48-0ubuntu3 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
ii libnvidia-ifr1-390:amd64 390.48-0ubuntu3 amd64 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii libnvidia-ifr1-390:i386 390.48-0ubuntu3 i386 NVIDIA OpenGL-based Inband Frame Readback runtime library
ii nvidia-compute-utils-390 390.48-0ubuntu3 amd64 NVIDIA compute utilities
ii nvidia-dkms-390 390.48-0ubuntu3 amd64 NVIDIA DKMS package
ii nvidia-driver-390 390.48-0ubuntu3 amd64 NVIDIA driver metapackage
ii nvidia-headless-no-dkms-390 390.48-0ubuntu3 amd64 NVIDIA headless metapackage - no DKMS
ii nvidia-kernel-common-390 390.48-0ubuntu3 amd64 Shared files used with the kernel module
ii nvidia-kernel-source-390 390.48-0ubuntu3 amd64 NVIDIA kernel source package
ii nvidia-prime 0.8.8 all Tools to enable NVIDIA's Prime
ii nvidia-settings 390.42-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
ii nvidia-utils-390 390.48-0ubuntu3 amd64 NVIDIA driver support binaries
ii xserver-xorg-video-nvidia-390 390.48-0ubuntu3 amd64 NVIDIA binary Xorg driver

I can't reliably reproduce it, but it happens fairly often, roughly 2-5 minutes after a reboot.
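Since the hang is intermittent, it may help to scan saved kernel logs for the hung-task message instead of watching for it by hand. The following is a minimal sketch, not part of the original report: the function name `find_hung_tasks` is made up for illustration, and it assumes the log text comes from something like `journalctl -k` or `dmesg` and uses the same "INFO: task ... blocked for more than N seconds" wording shown above.

```python
import re

# Matches the kernel hung-task detector line, for example:
#   "INFO: task nvidia-modeset:243 blocked for more than 120 seconds."
# The task name may itself contain colons (e.g. "kworker/4:2"), so the
# greedy name group backtracks to the last ":<pid>" before " blocked".
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return a list of (task name, pid, seconds) for each hung-task line."""
    results = []
    for line in log_text.splitlines():
        match = HUNG_TASK_RE.search(line)
        if match:
            results.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return results


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): journalctl -k | python3 find_hung_tasks.py
    for comm, pid, secs in find_hung_tasks(sys.stdin.read()):
        print(f"{comm} (pid {pid}) blocked for more than {secs}s")
```

This only reports that a task was flagged; it does not say why the task is blocked, so the call trace that follows each message is still needed.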

md_5 (md-5) wrote :

Apport report attached

md_5 (md-5) wrote :

Card is a GTX770, doesn't seem to be reported anywhere.

md_5 (md-5) wrote :

Nvidia bug report

md_5 (md-5) wrote :

Hang on nvidia-driver-396 396.18-0ubuntu0~gpu18.04.9 as well.

May 04 08:44:06 michael-desktop-ubuntu kernel: INFO: task nvidia-modeset:244 blocked for more than 120 seconds.
May 04 08:44:06 michael-desktop-ubuntu kernel: Tainted: P IOE 4.15.0-20-generic #21-Ubuntu
May 04 08:44:06 michael-desktop-ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 04 08:44:06 michael-desktop-ubuntu kernel: nvidia-modeset D 0 244 2 0x80000000
May 04 08:44:06 michael-desktop-ubuntu kernel: Call Trace:
May 04 08:44:06 michael-desktop-ubuntu kernel: __schedule+0x297/0x8b0
May 04 08:44:06 michael-desktop-ubuntu kernel: schedule+0x2c/0x80
May 04 08:44:06 michael-desktop-ubuntu kernel: schedule_timeout+0x1cf/0x350
May 04 08:44:06 michael-desktop-ubuntu kernel: ? schedule_timeout+0x1cf/0x350
May 04 08:44:06 michael-desktop-ubuntu kernel: ? __slab_free+0x14d/0x2c0
May 04 08:44:06 michael-desktop-ubuntu kernel: ? ttwu_do_activate+0x7a/0x90
May 04 08:44:06 michael-desktop-ubuntu kernel: __down+0x91/0xe0
May 04 08:44:06 michael-desktop-ubuntu kernel: down+0x41/0x50
May 04 08:44:06 michael-desktop-ubuntu kernel: ? down+0x41/0x50
May 04 08:44:06 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
May 04 08:44:06 michael-desktop-ubuntu kernel: _main_loop+0x76/0x140 [nvidia]
May 04 08:44:06 michael-desktop-ubuntu kernel: kthread+0x121/0x140
May 04 08:44:06 michael-desktop-ubuntu kernel: ? _raw_q_schedule+0x80/0x80 [nvidia]
May 04 08:44:06 michael-desktop-ubuntu kernel: ? kthread_create_worker_on_cpu+0x70/0x70
May 04 08:44:06 michael-desktop-ubuntu kernel: ret_from_fork+0x35/0x40

md_5 (md-5) wrote :

Same on 396.24-0ubuntu0~gpu18.04.1

May 04 22:17:05 michael-desktop-ubuntu kernel: INFO: task nvidia-modeset:245 blocked for more than 120 seconds.
May 04 22:17:05 michael-desktop-ubuntu kernel: Tainted: P IOE 4.15.0-20-generic #21-Ubuntu
May 04 22:17:05 michael-desktop-ubuntu kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 04 22:17:05 michael-desktop-ubuntu kernel: nvidia-modeset D 0 245 2 0x80000000
May 04 22:17:05 michael-desktop-ubuntu kernel: Call Trace:
May 04 22:17:05 michael-desktop-ubuntu kernel: __schedule+0x297/0x8b0
May 04 22:17:05 michael-desktop-ubuntu kernel: schedule+0x2c/0x80
May 04 22:17:05 michael-desktop-ubuntu kernel: schedule_timeout+0x1cf/0x350
May 04 22:17:05 michael-desktop-ubuntu kernel: ? schedule_timeout+0x1cf/0x350
May 04 22:17:05 michael-desktop-ubuntu kernel: ? __slab_free+0x14d/0x2c0
May 04 22:17:05 michael-desktop-ubuntu kernel: ? ttwu_do_activate+0x7a/0x90
May 04 22:17:05 michael-desktop-ubuntu kernel: __down+0x91/0xe0
May 04 22:17:05 michael-desktop-ubuntu kernel: down+0x41/0x50
May 04 22:17:05 michael-desktop-ubuntu kernel: ? down+0x41/0x50
May 04 22:17:05 michael-desktop-ubuntu kernel: nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]
May 04 22:17:05 michael-desktop-ubuntu kernel: _main_loop+0x76/0x140 [nvidia]
May 04 22:17:05 michael-desktop-ubuntu kernel: kthread+0x121/0x140
May 04 22:17:05 michael-desktop-ubuntu kernel: ? _raw_q_schedule+0x80/0x80 [nvidia]
May 04 22:17:05 michael-desktop-ubuntu kernel: ? kthread_create_worker_on_cpu+0x70/0x70
May 04 22:17:05 michael-desktop-ubuntu kernel: ret_from_fork+0x35/0x40

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-graphics-drivers-390 (Ubuntu):
status: New → Confirmed

Aaahh Ahh (woohoomoo2u) wrote :

Cannot confirm, but when using the NVIDIA GPU through nvidia-prime, I get kernel hangs at seemingly random times. This does not occur with the Intel GPU or without the NVIDIA drivers.

Kirill Romanov (djaler1) wrote :

Same shit on GTX 1050 Ti

Apr 26 09:19:47 juno kernel: [ 75.364483] nouveau 0000:01:00.0: DRM: failed to idle channel 0 [DRM]
Apr 26 09:19:47 juno kernel: [ 75.365464] BUG: unable to handle kernel paging request at ffff975f5b029100
Apr 26 09:19:47 juno kernel: [ 75.366130] IP: evo_wait+0x5d/0x130 [nouveau]
Apr 26 09:19:47 juno kernel: [ 75.366780] PGD 1333e067 P4D 1333e067 PUD 0
Apr 26 09:19:47 juno kernel: [ 75.367423] Oops: 0002 [#1] SMP PTI
Apr 26 09:19:47 juno kernel: [ 75.368067] Modules linked in: ccm cmac bnep nouveau ttm binfmt_misc nls_iso8859_1 arc4 hid_multitouch dell_wmi dell_smbios_wmi wmi_bmof mxm_wmi dell_wmi_descriptor snd_hda_codec_realtek snd_hda_codec_generic intel_rapl dell_laptop dell_smbios_smm dell_smbios x86_pkg_temp_thermal dcdbas intel_powerclamp coretemp iwlmvm dell_smm_hwmon kvm_intel mac80211 kvm uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep irqbypass crct10dif_pclmul snd_pcm crc32_pclmul ghash_clmulni_intel pcbc snd_seq_midi snd_seq_midi_event snd_rawmidi aesni_intel aes_x86_64 crypto_simd glue_helper cryptd intel_cstate intel_rapl_perf snd_seq iwlwifi snd_seq_device snd_timer joydev idma64 btusb input_leds virt_dma btrtl btbcm serio_raw btintel
Apr 26 09:19:47 juno kernel: [ 75.371468] snd cfg80211 bluetooth soundcore mei_me intel_lpss_pci processor_thermal_device intel_soc_dts_iosf mei ecdh_generic shpchp intel_pch_thermal intel_lpss int3403_thermal wmi int3402_thermal int340x_thermal_zone intel_hid tpm_crb sparse_keymap acpi_pad int3400_thermal mac_hid acpi_thermal_rel sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs xor zstd_compress raid6_pq hid_generic usbhid i915 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops psmouse r8169 ahci drm mii libahci i2c_hid hid video
Apr 26 09:19:47 juno kernel: [ 75.374295] CPU: 7 PID: 74 Comm: kworker/7:1 Tainted: G W 4.15.0-20-generic #21-Ubuntu
Apr 26 09:19:47 juno kernel: [ 75.375203] Hardware name: Dell Inc. Inspiron 15 7000 Gaming/065C71, BIOS 1.5.3 01/25/2018
Apr 26 09:19:47 juno kernel: [ 75.376127] Workqueue: pm pm_runtime_work
Apr 26 09:19:47 juno kernel: [ 75.377076] RIP: 0010:evo_wait+0x5d/0x130 [nouveau]
Apr 26 09:19:47 juno kernel: [ 75.378006] RSP: 0018:ffffaec001b8fc10 EFLAGS: 00010216
Apr 26 09:19:47 juno kernel: [ 75.378934] RAX: ffff975ea0329000 RBX: 000000002eb40060 RCX: 0000000000000000
Apr 26 09:19:47 juno kernel: [ 75.380028] RDX: 000000002eb40040 RSI: 0000000000000007 RDI: ffff975ebf5e2880
Apr 26 09:19:47 juno kernel: [ 75.381160] RBP: ffffaec001b8fc38 R08: 0000000000000067 R09: 0000000000000000
Apr 26 09:19:47 juno kernel: [ 75.382332] R10: ffffaec00205fd10 R11: 0000000000000065 R12: ffff975eac7ee308
Apr 26 09:19:47 juno kernel: [ 75.383281] R13: ffff975ea77b82b0 R14: 0000000000000020 R15: ffff975eac7ee3a8
Apr 26 09:19:47 juno kernel: [ 75.384222] FS: 0000000000000000(0000) GS:ffff975ebf5c0000(0000) knlGS:0000000000000000
Apr 26 09:19:47 juno kernel: [ 75.385171] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 26 09:19:47 juno kernel: [ 75.386117] ...

md_5 (md-5) wrote :

Kirill, looks to me like you are using the open source nouveau driver.
This is for the proprietary binary nvidia driver.
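One quick way to tell which driver a trace belongs to is to count the module names that the kernel prints in square brackets after each stack frame. A rough sketch follows; the helper name `modules_in_trace` is hypothetical, and it assumes frames are formatted as `symbol+0xOFF/0xLEN [module]`, as in the logs in this report.

```python
import re
from collections import Counter

# A stack frame owned by a loadable module ends with the module name in
# square brackets, for example:
#   "nvkms_kthread_q_callback+0x65/0xe0 [nvidia_modeset]"
# Requiring the "+0x.../0x..." offset avoids matching other bracketed
# text such as "[DRM]" in ordinary driver messages.
FRAME_MODULE_RE = re.compile(r"\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def modules_in_trace(log_text):
    """Count how many stack frames in log_text belong to each kernel module."""
    return Counter(FRAME_MODULE_RE.findall(log_text))
```

On the trace in the comment above this reports only `nouveau` frames, while the traces in the bug description contain `nvidia` and `nvidia_modeset` frames.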

Daniel Cox (danielpcox) wrote :

I had this exact problem today (same message in `dmesg`) which I found while investigating a hang of anything CUDA-related, appearing out of nowhere after my setup had been working for a while.

I was able to fix it by adding `acpi=ht` (or `acpi=off`) to my GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, running `sudo update-grub`, and rebooting.

I've got two nVidia 1080tis in this box, and I'm using the nvidia-396 driver.
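For anyone who wants to try the same workaround, the edit to /etc/default/grub can be sketched as a small text transformation. This is only an illustration under stated assumptions: `add_kernel_param` is a made-up helper, it works on the file's text rather than writing to /etc/default/grub, and it assumes the value is double-quoted on a single line. The parameter itself is taken from the comment above; whether it helps on other systems is not established here.

```python
import re

# Matches a line such as: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_text, param):
    """Return grub_text with param appended to GRUB_CMDLINE_LINUX_DEFAULT.

    The parameter is added only if it is not already present, so calling
    this twice gives the same result as calling it once.
    """

    def append_param(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return CMDLINE_RE.sub(append_param, grub_text)


if __name__ == "__main__":
    # Print the modified text for review instead of overwriting the file.
    with open("/etc/default/grub") as f:
        print(add_kernel_param(f.read(), "acpi=ht"), end="")
```

After reviewing the output and saving it to /etc/default/grub (keeping a backup of the original), `sudo update-grub` and a reboot are still needed, as described above.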

Jason Priest (justaperson) wrote :

Getting this "task kworker blocked for more than 120 seconds" message with nvidia-drivers-390 on Ubuntu 18.04 (kernel 4.15.0-36-generic). I have a GTX 1070 Ti and a GTX 770 installed.

John Stowers (nzjrs) wrote :

I get this every couple of days on our CI, from processes which access the GPU. nvidia-driver 396.24.02. Here are some dmesg warnings from various failures:

[Mon Nov 5 18:43:29 2018] INFO: task kworker/4:2:25281 blocked for more than 120 seconds.
[Mon Nov 5 18:43:29 2018] Tainted: P OE 4.4.0-127-generic #153~14.04.1-Ubuntu
[Mon Nov 5 18:43:29 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Nov 5 18:43:29 2018] kworker/4:2 D ffff88026a5cbb68 0 25281 2 0x00000000
[Mon Nov 5 18:43:29 2018] Workqueue: events os_execute_work_item [nvidia]
[Mon Nov 5 18:43:29 2018] ffff88026a5cbb68 0000000000000036 ffff880827175a00 ffff88026a5cc000
[Mon Nov 5 18:43:29 2018] ffff88082370a768 0000000000000002 0000000000000000 ffff880827175a00
[Mon Nov 5 18:43:29 2018] ffff88026a5cbb80 ffffffff81818105 7fffffffffffffff ffff88026a5cbc28
[Mon Nov 5 18:43:29 2018] Call Trace:
[Mon Nov 5 18:43:29 2018] [<ffffffff81818105>] schedule+0x35/0x80
[Mon Nov 5 18:43:29 2018] [<ffffffff8181aafb>] schedule_timeout+0x23b/0x2d0
[Mon Nov 5 18:43:29 2018] [<ffffffff810b73bf>] ? enqueue_entity+0x3af/0xbe0
[Mon Nov 5 18:43:29 2018] [<ffffffff81819d85>] __down_common+0xa6/0xf9
[Mon Nov 5 18:43:29 2018] [<ffffffff81819df5>] __down+0x1d/0x1f
[Mon Nov 5 18:43:29 2018] [<ffffffff810c88e1>] down+0x41/0x50
[Mon Nov 5 18:43:29 2018] [<ffffffffc0724d97>] os_acquire_mutex+0x37/0x40 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffffc0cdb9fc>] _nv031564rm+0x5c/0x120 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffffc0b33978>] ? _nv007828rm+0x38/0x120 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffffc0d62ad4>] ? _nv001065rm+0x84/0xe0 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffffc0d663f9>] ? rm_execute_work_item+0x49/0xc0 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffff811e3701>] ? kmem_cache_alloc+0x191/0x200
[Mon Nov 5 18:43:29 2018] [<ffffffffc0725101>] ? os_execute_work_item+0x1/0x70 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffffc0725146>] ? os_execute_work_item+0x46/0x70 [nvidia]
[Mon Nov 5 18:43:29 2018] [<ffffffff81099716>] ? process_one_work+0x156/0x400
[Mon Nov 5 18:43:29 2018] [<ffffffff8109a0fa>] ? worker_thread+0x11a/0x480
[Mon Nov 5 18:43:29 2018] [<ffffffff81099fe0>] ? rescuer_thread+0x310/0x310
[Mon Nov 5 18:43:29 2018] [<ffffffff8109f5d8>] ? kthread+0xd8/0xf0
[Mon Nov 5 18:43:29 2018] [<ffffffff81817b52>] ? __schedule+0x2a2/0x820
[Mon Nov 5 18:43:29 2018] [<ffffffff8109f500>] ? kthread_park+0x60/0x60
[Mon Nov 5 18:43:29 2018] [<ffffffff8181be75>] ? ret_from_fork+0x55/0x80
[Mon Nov 5 18:43:29 2018] [<ffffffff8109f500>] ? kthread_park+0x60/0x60
[Mon Nov 5 18:43:29 2018] INFO: task kworker/4:1:3562 blocked for more than 120 seconds.
[Mon Nov 5 18:43:29 2018] Tainted: P OE 4.4.0-127-generic #153~14.04.1-Ubuntu
[Mon Nov 5 18:43:29 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Mon Nov 5 18:43:29 2018] kworker/4:1 D ffff880104db3b68 0 3562 2 0x00000000
[Mon Nov 5 18:43:29 2018] Workqueue: events os_execute_work_item [nvidia]
[Mon Nov 5 18:43:29 2018] ffff880104db3b68 ffffffff81817b46 ffff88044e1f8f00 ffff880104db400...

John Stowers (nzjrs) wrote :

BTW: Linux lb-santi 4.4.0-127-generic #153~14.04.1-Ubuntu SMP Sat May 19 14:00:03 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
