10de:1c8c [MSI] Module nouveau fails to manage GeForce GTX 1050 Ti Mobile

Bug #1729736 reported by Etienne URBAH on 2017-11-03
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
Linux           Confirmed  Medium
linux (Ubuntu)             Medium      Unassigned

Bug Description

Trying to use 'Nvidia GeForce GTX 1050 Ti Mobile' with 'Ubuntu 17.10 (Artful)' :

- Linux kernel 4.13.0-16 from 'Ubuntu 17.10 (Artful)' systematically fails (1 CPU stuck).

- Linux kernel 4.14.0-041400rc7 from http://kernel.ubuntu.com/~kernel-ppa/mainline seems to succeed.

See attached kern.log

After reboot on Linux kernel 4.13.0-16 with nouveau blacklisted :

$ lspci -nn -v -s 1:0.0

01:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] [10de:1c8c] (rev a1)
 Subsystem: Micro-Star International Co., Ltd. [MSI] GP107M [GeForce GTX 1050 Ti Mobile] [1462:11c8]
 Flags: bus master, fast devsel, latency 0, IRQ 255
 Memory at de000000 (32-bit, non-prefetchable) [size=16M]
 Memory at c0000000 (64-bit, prefetchable) [size=256M]
 Memory at d0000000 (64-bit, prefetchable) [size=32M]
 I/O ports at e000 [disabled] [size=128]
 Expansion ROM at df000000 [disabled] [size=512K]
 Capabilities: <access denied>
 Kernel modules: nvidiafb, nouveau

Hello,

I have a Dell XPS 15 9560 with a GTX 1050 Mobile (GP107M) and an Intel Kabylake CPU. Using Kernel 4.11.3 on the Arch installer image, whenever I run 'lspci', it blocks indefinitely and never prints any output.

After running lspci, dmesg contains:

[ 54.819264] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[ 54.879968] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[ 54.879973] nouveau 0000:01:00.0: Refused to change power state, currently in D3
[ 54.879974] nouveau 0000:01:00.0: DRM: resuming object tree...

Eventually the scheduler gets cranky about the hung process and starts spewing
[ 245.385522] INFO: task lspci:576 blocked for more than 120 seconds.
and a backtrace every 2 minutes.
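
To pull these symptoms out of a saved kernel log, a minimal sketch along the following lines may help. The helper name `nouveau_pm_lines` and the file name `dmesg.txt` are illustrative assumptions; the two patterns are taken from the messages quoted above.

```shell
# Print the kernel log lines that match the two symptoms quoted above.
# nouveau_pm_lines is a hypothetical helper name; pass it a saved log file.
nouveau_pm_lines() {
    grep -E 'Refused to change power state|blocked for more than [0-9]+ seconds' "$1"
}

# Example usage (file name is an assumption):
#   dmesg > dmesg.txt && nouveau_pm_lines dmesg.txt
```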

Blacklisting nouveau makes lspci work.

I believe including a full dmesg (from boot) would be helpful to show what may be going wrong.

Note that running with nouveau.runpm=0 will prevent the suspend from happening. However that will, of course, cause the GPU to remain on. [I believe it will remain on without nouveau loading as well, but with this new PCIe PM stuff, I'm not sure anymore.]

With the "new PCIe PM stuff", if nouveau is not loaded and something else enabled automatic runtime PM (via powertop, via TLP or manually by writing "auto" to /sys/bus/pci/devices/.../power/control) for the Nvidia PCI devices, then indeed the problematic ACPI methods could be triggered.
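
To see whether automatic runtime PM is enabled for the device, the sysfs attributes can be read directly. This is only a sketch: the helper name `show_runtime_pm` is hypothetical, and the device address 0000:01:00.0 is assumed from the dmesg lines above and may differ on other machines.

```shell
# Print the runtime PM policy ("auto" or "on") and the current runtime status
# of a PCI device, given its sysfs directory.
# show_runtime_pm is a hypothetical helper name.
show_runtime_pm() {
    dev_dir="$1"
    for attr in control runtime_status; do
        f="$dev_dir/power/$attr"
        if [ -r "$f" ]; then
            printf '%s: %s\n' "$attr" "$(cat "$f")"
        else
            printf '%s: unavailable\n' "$attr"
        fi
    done
}

# Example usage (device address is an assumption taken from the dmesg output):
#   show_runtime_pm /sys/bus/pci/devices/0000:01:00.0
```

Reading these two attributes should not touch the device's config space, so it is less likely to trigger the wake-up than running lspci.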

Kenneth, can you upload your acpidump?
sudo pacman -S acpidump && sudo acpidump > acpidump.txt

Most likely you are affected by
https://bugzilla.kernel.org/show_bug.cgi?id=156341

This is because lspci reads the config file, which then triggers a full GPU wake-up, which is a silly thing to do in the first place.

What we need is something like this in the kernel: https://github.com/karolherbst/linux/commit/cb918e4c926990dfcfce92e1ecd905e0896de605 and then make use of those in userspace, so that we don't need to read config every time anymore.

Etienne URBAH (eurbah) wrote :
affects: ubuntu-meta (Ubuntu) → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1729736

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Etienne URBAH (eurbah) wrote :

With nouveau enabled in Linux kernel 4.13.0-16 on GP107M [GeForce GTX 1050 Ti Mobile], 'lspci' systematically makes nouveau fail (1 CPU stuck), and returns nothing.

This is the reason why I could NOT submit this issue using 'apport-bug', and why 'apport-collect' returns nothing.

But I can try other commands that would provide relevant information.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.14 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.14-rc7

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Etienne URBAH (eurbah) wrote :

Concerning the support of Nvidia GP107M [GeForce GTX 1050 Ti Mobile] by the 'nouveau' module of Linux kernels :

- Before kernel v4.12, it was NOT supported at all.

- Beginning with kernel v4.12, I could NOT install Ubuntu 17.10 Beta at all, so I suppose that the current issue was present.

- After reading https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1723619/comments/24
   I succeeded in installing Ubuntu 17.10 Artful with the following addition at the end of the boot line :
   modprobe.blacklist=nouveau -- modprobe.blacklist=nouveau
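
After booting with that option, one way to confirm that the blacklist took effect is to check the loaded-module list. A minimal sketch, assuming the usual /proc/modules format (module name first on each line); the helper name `module_loaded` is hypothetical.

```shell
# Succeed if the named module appears in a /proc/modules-style listing.
# module_loaded is a hypothetical helper; the second argument defaults
# to /proc/modules.
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded nouveau; then
    echo "nouveau is loaded"
else
    echo "nouveau is not loaded"
fi
```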

With Linux kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline :

- I already wrote that Linux kernel 4.14.0-041400rc7 seems to succeed.

- I just tested Linux kernel 4.14.0-041400rc8 : Systematically, as soon as I try 'lspci -nn -v -s 1:0.0', the whole machine immediately freezes. That is a regression compared to v4.14-rc7.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

I am trying to use 'nouveau' with GP107M [GeForce GTX 1050 Ti Mobile].

With Linux kernel 4.13.0-16 from Ubuntu 17.10 Artful, 'lspci' systematically and immediately makes 1 CPU freeze.

Therefore, I am testing Linux kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline

With Linux kernel 4.14.0-rc7, this issue does NOT show up :

$ lspci -nn -v -s 1:0
01:00.0 3D controller [0302]: NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] [10de:1c8c] (rev a1)
 Subsystem: Micro-Star International Co., Ltd. [MSI] GP107M [GeForce GTX 1050 Ti Mobile] [1462:11c8]
 Flags: bus master, fast devsel, latency 0, IRQ 134
 Memory at de000000 (32-bit, non-prefetchable) [size=16M]
 Memory at c0000000 (64-bit, prefetchable) [size=256M]
 Memory at d0000000 (64-bit, prefetchable) [size=32M]
 I/O ports at e000 [size=128]
 Expansion ROM at df000000 [disabled] [size=512K]
 Capabilities: <access denied>
 Kernel driver in use: nouveau
 Kernel modules: nvidiafb, nouveau

With Linux kernel 4.14.0-rc8, 'lspci' systematically and immediately makes the whole computer freeze.

With Linux kernel 4.14.0 (released yesterday), 'lspci' systematically fails to answer, and makes the whole computer freeze after some time.

So, there is probably a regression.

I have also reported this issue at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1729736

Etienne URBAH (eurbah) wrote :

With Linux kernel 4.14.0 from http://kernel.ubuntu.com/~kernel-ppa/mainline (released yesterday), 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

That is a regression compared to v4.14-rc7.

This issue is also reported at https://bugs.freedesktop.org/show_bug.cgi?id=101665

Etienne URBAH (eurbah) wrote :

With the following Linux kernels, 'lspci' systematically fails to answer, and makes the whole machine immediately freeze :
- 4.13.0-17 from Ubuntu 17.10 (Artful)
- 4.14.1 from http://kernel.ubuntu.com/~kernel-ppa/mainline

Etienne URBAH (eurbah) on 2017-11-27
tags: added: kernel-bug-exists-upstream kernel-bug-exists-upstream-4.15
removed: kernel-fixed-upstream kernel-fixed-upstream-4.14
Etienne URBAH (eurbah) wrote :

With Linux kernel 4.15.0-041500rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

Kai-Heng Feng (kaihengfeng) wrote :

Please file an upstream bug at https://bugs.freedesktop.org/
Product: DRI
Component: DRM/nouveau

Etienne URBAH (eurbah) wrote :

At https://bugs.freedesktop.org/ :

- Product 'DRI' has NO component 'DRM/nouveau'.

- In fact, this issue is already reported at https://bugs.freedesktop.org/show_bug.cgi?id=101665 inside product 'xorg', component 'Driver/nouveau'.

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed

With Linux kernel 4.15.0-041500rc2 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

With Linux kernel 4.15.0-041500rc3 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

Besides, inside 'kern.log', I have found the following messages :
nouveau 0000:01:00.0: DRM: BIT table 'A' not found
nouveau 0000:01:00.0: DRM: BIT table 'L' not found
nouveau 0000:01:00.0: DRM: Pointer to TMDS table invalid

@Étienne Could you please provide the information that was asked for in comment #1 and comment #2 of this bug report? Adding `nouveau.runpm=0` to the kernel command line should avoid the freeze but will prevent the NVIDIA card from being suspended.
Looking at the bug report mentioned in comment #2, could you try booting with `acpi_osi=! acpi_osi="Windows 2009"` and/or `acpi_rev_override=5` on the kernel command line (without `nouveau.runpm=0`)?

Created attachment 136105
lspci for GP107M [GeForce GTX 1050 Ti Mobile]

Many thanks to Pierre Moreau for suggesting options for the kernel command line :

- Adding just 'nouveau.runpm=0' prevents the whole machine from freezing, but 'nouveau' FAILS to manage an external display with resolution 3840 x 2160 at 60Hz through DisplayPort.

- Adding just 'acpi_rev_override=5' does NOT prevent the whole machine from freezing.

- Adding just 'acpi_osi=! acpi_osi="Windows 2009"' allows 'lspci' to succeed, and 'nouveau' to successfully manage an external display with resolution 3840 x 2160 at 60Hz through DisplayPort.
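
When comparing these options, it can help to confirm which ones the running kernel was actually booted with. The following is a sketch only: the helper name `cmdline_has` is hypothetical, and because the way the quoted "Windows 2009" value appears in /proc/cmdline may depend on the boot loader, the example checks only options that contain no spaces.

```shell
# Succeed if every given option appears as a whole word in a
# /proc/cmdline-style file.
# cmdline_has is a hypothetical helper; the first argument is the file to inspect.
cmdline_has() {
    file="$1"; shift
    for opt in "$@"; do
        # Pad both sides with spaces so that only whole options match.
        case " $(cat "$file" 2>/dev/null) " in
            *" $opt "*) ;;
            *) return 1 ;;
        esac
    done
    return 0
}

if cmdline_has /proc/cmdline nouveau.runpm=0; then
    echo "booted with nouveau.runpm=0"
else
    echo "nouveau.runpm=0 is not on the kernel command line"
fi
```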

Etienne URBAH (eurbah) wrote :

With Linux kernel 4.15.0-041500rc3 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

The following workaround given by Pierre Moreau at https://bugs.freedesktop.org/show_bug.cgi?id=101665 succeeds:
Inside the kernel command line, add options 'acpi_osi=! acpi_osi="Windows 2009"'

Etienne URBAH (eurbah) wrote :

With Linux kernel 4.16.0-041600rc5 from http://kernel.ubuntu.com/~kernel-ppa/mainline, 'lspci' systematically fails to answer, and makes the whole machine freeze after some time.

tags: added: bionic kernel-bug-exists-upstream-4.16
removed: kernel-bug-exists-upstream-4.15
Etienne URBAH (eurbah) wrote :

With Linux kernel 4.17.0-041700rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, graphical login fails.

tags: added: kernel-bug-exists-upstream-4.17
removed: kernel-bug-exists-upstream-4.16

With Linux kernel 4.17.0-041700rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, graphical login fails, and the machine is frozen.

With Linux kernel 4.17.0-041700rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, I confirm that systematically :

- Inside a Linux console, 'lspci' fails to answer, and makes the whole machine immediately freeze.

- Graphical login fails, and makes the whole machine immediately freeze.

I'm experiencing the same issue on an XPS 9560 with Ubuntu 18.04 (same symptoms, same dmesg output).
Any new information required?

Etienne URBAH (eurbah) wrote :

With Linux kernel 4.18.0-041800rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline, graphical login fails.

The following workaround given by Pierre Moreau at https://bugs.freedesktop.org/show_bug.cgi?id=101665 succeeds:
Inside the kernel command line, add options 'acpi_osi=! acpi_osi="Windows 2009"'

tags: added: kernel-bug-exists-upstream-4.18
removed: kernel-bug-exists-upstream-4.17

With Linux kernel 4.18.0-041800rc1 from http://kernel.ubuntu.com/~kernel-ppa/mainline :

I tried to open a Linux console with Ctrl+Alt+F2, but did NOT succeed.

Systematically, graphical login fails, and makes the whole machine immediately freeze.

Created attachment 143031
dmesg from thinkpad x1 extreme / GTX 1050 Ti on kernel 4.20 with hybrid graphics

I may be blind but I haven't seen anyone attaching full dmesg output, sorry if I missed it.

The laptop in question is Thinkpad X1 Extreme with NVIDIA Corporation GP107M [GeForce GTX 1050 Ti Mobile] (rev a1). It has a BIOS setting to switch between "Hybrid" and "Discrete only" graphics.

This dmesg output is from a boot with "Hybrid" graphics, where running 'lspci' hangs and causes the system fans to spin. The X server hangs on start too (using the modesetting DDX).

Created attachment 143032
dmesg from thinkpad x1 extreme / GTX 1050 Ti on kernel 4.20 with discrete only graphics

Not sure if that helps with the diagnostics, but on the same Thinkpad X1 Extreme laptop with "Discrete only" BIOS setting, lspci works fine, X starts and works, but there's a timeout logged by nouveau.

Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc