[cosmic+] error booting with prime-select intel: prime-select does not update initramfs to blacklist nvidia modules

Bug #1848326 reported by makk50
This bug affects 5 people
Affects                                 Status        Importance  Assigned to
nvidia-prime (Ubuntu)                   Fix Released  High        Alberto Milone
nvidia-prime (Ubuntu Bionic)            Triaged       Undecided   Alberto Milone
nvidia-prime (Ubuntu Eoan)              Won't Fix     Undecided   Alberto Milone
ubuntu-drivers-common (Ubuntu)          Fix Released  High        Alberto Milone
ubuntu-drivers-common (Ubuntu Bionic)   Triaged       Undecided   Alberto Milone
ubuntu-drivers-common (Ubuntu Eoan)     Won't Fix     Undecided   Alberto Milone

Bug Description

When I try to boot with the iGPU selected, the DE won't start; with nvidia selected, everything is fine.
I tried uninstalling the nvidia driver, which let me log in without any problems, and intel works fine.

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: nvidia-prime 0.8.13
ProcVersionSignature: Ubuntu 5.3.0-18.19-generic 5.3.1
Uname: Linux 5.3.0-18-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.11-0ubuntu8
Architecture: amd64
CurrentDesktop: KDE
Date: Wed Oct 16 12:41:26 2019
Dependencies:

InstallationDate: Installed on 2019-09-25 (20 days ago)
InstallationMedia: Kubuntu 19.10 "Eoan Ermine" - Beta amd64 (20190925)
PackageArchitecture: all
SourcePackage: nvidia-prime
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
makk50 (makk50) wrote :
tags: added: kubuntu
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nvidia-prime (Ubuntu):
status: New → Confirmed
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I had the same problem (the system got upgraded from 19.04 to 19.10, nvidia-430 driver).

I noticed that nvidia drivers were loaded in the rescue mode - so they were likely not blacklisted properly.

When I did not enter the rescue mode I had the following message displayed:
"A start job is running for udev Wait for Complete Device Initialization (1min 3s/ 3min)

/var/lib/gdm3/.local/share/xorg/Xorg.0.log contained the following (see the full file attached):

[ 47.771] (==) ModulePath set to "/usr/lib/xorg/modules"
[ 47.771] (II) The server relies on udev to provide the list of input devices.
 If no devices become available, reconfigure udev or disable AutoAddDevices.
# ...
[ 47.778] loading driver: nvidia
[ 47.882] (==) Matched nvidia as autoconfigured driver 0
[ 47.882] (==) Matched nouveau as autoconfigured driver 1
[ 47.882] (==) Matched modesetting as autoconfigured driver 2
[ 47.882] (==) Matched fbdev as autoconfigured driver 3
[ 47.882] (==) Matched vesa as autoconfigured driver 4
[ 47.882] (==) Assigned the driver to the xf86ConfigLayout
[ 47.882] (II) LoadModule: "nvidia"
# ...
[ 48.142] (II) Unloading vesa
[ 48.142] (EE) modeset(G0): drmSetMaster failed: Permission denied
[ 48.142] (EE)
Fatal server error:
[ 48.142] (EE) AddScreen/ScreenInit failed for gpu driver 0 -1

I removed the nvidia driver to get the system back into a working state, and then installed it from scratch.

Then I set up a trace point in prime-select here:
https://github.com/tseliot/nvidia-prime/blob/cf757cc9585dfc032930379fc81effb3a3d59606/prime-select#L126-L138

Tracing showed that /lib/modprobe.d/blacklist-nvidia.conf and /lib/udev/rules.d/80-pm-nvidia.rules were created correctly this time:

https://paste.ubuntu.com/p/2Dp5jgFQty/

cat /lib/modprobe.d/blacklist-nvidia.conf
# Do not modify
# This file was generated by nvidia-prime
blacklist nvidia
blacklist nvidia-drm
blacklist nvidia-modeset
alias nvidia off
alias nvidia-drm off
alias nvidia-modeset off

I could then successfully reboot without hacking.
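
As a quick sanity check after switching (not part of the original trace), the generated artifacts can be verified directly:

ls -l /lib/modprobe.d/blacklist-nvidia.conf /lib/udev/rules.d/80-pm-nvidia.rules
prime-select query   # should report "intel"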

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

If anybody still has a reproducer, it would be useful to set a trace point like this:

sudo vim /usr/bin/prime-select

# ...
        import pdb
        pdb.set_trace()

        if profile == 'nvidia':
            # Always allow enabling nvidia
            # (No need to check if nvidia is available)
            self._enable_nvidia()
        elif profile == "on-demand":
            self._disable_nvidia(keep_nvidia_modules=True)
        else:
            # Make sure that the installed packages support PRIME
            #if not self._supports_prime():
            # sys.stderr.write('Error: the installed packages do not support PRIME\n')
            # return False
            self._disable_nvidia()

# ...

and see if _disable_nvidia is entered when `prime-select intel` is invoked while `prime-select query` shows "nvidia" or "on-demand".

Before I removed the driver and the nvidia-prime package, I could switch the profile back and forth using `prime-select intel` and `prime-select nvidia` without any positive effect, so it would be useful to find out why.

The code only depends on the target profile specified via command line and the current profile (read from /etc/prime-discrete) so I don't have any guesses yet:

https://github.com/tseliot/nvidia-prime/blob/cf757cc9585dfc032930379fc81effb3a3d59606/prime-select#L114-L143
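
For completeness, both inputs to that decision can be inspected from a shell (paths as referenced in the code above):

prime-select query        # the profile prime-select believes is active
cat /etc/prime-discrete   # the raw state file the code reads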

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Managed to reproduce it:

1) switched to nvidia;
2) worked with it for a while on 5.3.0-19-generic;
3) got a kernel update to 5.3.0-23-generic;
4) switched to intel;
5) reproduced the problem.

What I found is:

1) /lib/modprobe.d/blacklist-nvidia.conf gets created if you switch to "intel";

2) blacklist-nvidia.conf wasn't in initramfs for the new kernel:

lsinitramfs initrd.img-5.3.0-19-generic | grep blacklist-nvidia.conf ; echo $?
usr/lib/modprobe.d/blacklist-nvidia.conf
0

lsinitramfs initrd.img-5.3.0-23-generic | grep blacklist-nvidia.conf ; echo $?
1

3) running `update-initramfs -u` fixes it.

From what I can see:

* update-initramfs works correctly, and the blacklist file gets created correctly as well, as long as "intel" is selected;

* If a kernel is upgraded while the profile is set to "nvidia", though, /lib/modprobe.d/blacklist-nvidia.conf is (correctly) absent, so any initramfs generation or update does not include it;

* So when `prime-select intel` is run on a system whose kernel was upgraded while `prime-select query` reported "nvidia", nothing updates the initramfs.
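
A minimal check for this condition across installed kernels (a sketch, assuming stock initramfs-tools image names under /boot):

# Report which initramfs images carry the nvidia module blacklist
for img in /boot/initrd.img-*; do
    if lsinitramfs "$img" | grep -q blacklist-nvidia.conf; then
        echo "$img: blacklist present"
    else
        echo "$img: blacklist MISSING"
    fi
done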

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

update-initramfs used to be run by prime-select; however, this was removed at some point (the removal entered the distro in Cosmic/18.10):

https://github.com/tseliot/nvidia-prime/commit/7595f47b84f713dc969440e31d0e53708fddd71f

https://git.launchpad.net/~usd-import-team/ubuntu/+source/nvidia-prime/commit/?id=1180051b7a3f59cafc2c56a02fe9b0a8b8991273

I think this functionality needs to be brought back to fix this bug.
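
For reference, bringing it back would amount to running something along these lines after the profile switch (a sketch; the exact call in the removed code may have differed):

sudo update-initramfs -u          # regenerate the initramfs for the running kernel
sudo update-initramfs -u -k all   # or, to cover every installed kernel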

summary: - error booting with prime-select intel
+ [cosmic+] error booting with prime-select intel: prime-select does not
+ update initramfs to blacklist nvidia modules
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Based on a discussion with ~albertomilone, powering down the NVIDIA GPU while keeping the modules loaded is the way to go long-term as opposed to blacklisting the modules.

The power management feature is described here (requires Turing GPUs and above):
http://us.download.nvidia.com/XFree86/Linux-x86_64/440.44/README/dynamicpowermanagement.html

My GPU is pre-Turing (Pascal, 1060m); however, powering off is not where the problem is.

Running `prime-select intel` creates /lib/udev/rules.d/80-pm-nvidia.rules, which contains the following line to unbind an NVIDIA GPU device from its driver:

https://github.com/tseliot/nvidia-prime/blob/cf757cc9585dfc032930379fc81effb3a3d59606/prime-select#L164-L165
ACTION=="add", SUBSYSTEM=="pci", ATTR{vendor}=="0x10de", ATTR{class}=="0x030000", ATTR{remove}="1"

If I comment it out, I can boot just fine with my iGPU after running `prime-select intel`. The resulting 80-pm-nvidia.rules file looks like this: https://paste.ubuntu.com/p/HX6t9y8BPg/

Just commenting out the power management lines while leaving the unbinding in place results in the same issue (80-pm-nvidia.rules: https://paste.ubuntu.com/p/mTdXbZZk8H/).
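
For anyone testing this outside of udev, the remove action in that rule can be reproduced by hand through sysfs (PCI address 0000:01:00.0 taken from the udev log further down; adjust for your system):

# Unbind and remove the discrete GPU, as ATTR{remove}="1" does
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove
# Make the device visible again without a reboot
echo 1 | sudo tee /sys/bus/pci/rescan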

The unbinding operation hangs, which results in something like this even before X11 or gdm3 attempt to start:

[ 15.683190] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 15.824882] NVRM: Attempting to remove minor device 0 with non-zero usage count!
[ 15.824903] ------------[ cut here ]------------
[ 15.825082] WARNING: CPU: 0 PID: 759 at /var/lib/dkms/nvidia/440.59/build/nvidia/nv-pci.c:577 nv_pci_remove+0x338/0x360 [nvidia]
# ...
[ 15.825330] ---[ end trace 353e142c2126a8a0 ]---
# ...
[ 242.649248] INFO: task nvidia-persiste:1876 blocked for more than 120 seconds.
[ 242.649931] Tainted: P W O 5.4.0-12-generic #15-Ubuntu
[ 242.650618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 242.651319] nvidia-persiste D 0 1876 1 0x00000004

Eventually it fails with a timeout:
systemd[1]: nvidia-persistenced.service: start operation timed out. Terminating.
systemd[1]: nvidia-persistenced.service: Failed with result 'timeout'.
systemd[1]: Failed to start NVIDIA Persistence Daemon.

Masking nvidia-persistenced via `sudo systemctl mask nvidia-persistenced` and rebooting shows that systemd-udevd and rmmod hang as well:

Feb 9 17:18:43 blade systemd-udevd[717]: 0000:01:00.0: Worker [756] processing SEQNUM=4430 is taking a long time
Feb 9 17:18:43 blade systemd-udevd[717]: 0000:01:00.1: Worker [746] processing SEQNUM=4440 is taking a long time
Feb 9 17:20:43 blade systemd-udevd[717]: 0000:01:00.1: Worker [746] processing SEQNUM=4440 killed
Feb 9 17:20:43 blade systemd-udevd[717]: 0000:01:00.0: Worker [756] processing SEQNUM=4430 killed
Feb 9 17:21:31 blade kernel: [ 242.818665] INFO: task systemd-udevd:746 blocked for more than 120 seconds.
Feb 9 17:21:31 blade kernel: [ 242.819381] Tainted: P W O 5.4.0-12-generic #15-Ubuntu
Feb 9 17:21:31 blade kernel: [ 242.820075] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 9 17:21:31 blade kernel: [ 242.820797] systemd-ud...


tags: added: champagne
Revision history for this message
Alberto Milone (albertomilone) wrote :

I am working on a few fixes to make sure that the nvidia devices are removed before the nvidia modules are loaded (I think this is what is going on in your case), and that booting in intel ("Power Saving") mode doesn't mislead GDM into using PRIME when the NVIDIA GPU is actually disabled (this causes booting or logging into a black screen).

None of this will require blacklisting the nvidia module in the initramfs. We have been there in the past, and it is not needed.

I am going to provide a PPA for Ubuntu 19.10, so that you can test the fixes.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Thanks, will help you with testing for sure.

Changed in nvidia-prime (Ubuntu):
status: Confirmed → In Progress
Changed in ubuntu-drivers-common (Ubuntu):
status: New → In Progress
Changed in nvidia-prime (Ubuntu):
importance: Undecided → High
Changed in ubuntu-drivers-common (Ubuntu):
importance: Undecided → High
Changed in nvidia-prime (Ubuntu):
assignee: nobody → Alberto Milone (albertomilone)
Changed in ubuntu-drivers-common (Ubuntu):
assignee: nobody → Alberto Milone (albertomilone)
Changed in nvidia-prime (Ubuntu Eoan):
assignee: nobody → Alberto Milone (albertomilone)
Changed in nvidia-prime (Ubuntu Bionic):
assignee: nobody → Alberto Milone (albertomilone)
Changed in ubuntu-drivers-common (Ubuntu Bionic):
assignee: nobody → Alberto Milone (albertomilone)
Changed in ubuntu-drivers-common (Ubuntu Eoan):
assignee: nobody → Alberto Milone (albertomilone)
Changed in nvidia-prime (Ubuntu Bionic):
status: New → Triaged
Changed in nvidia-prime (Ubuntu Eoan):
status: New → Triaged
Changed in ubuntu-drivers-common (Ubuntu Bionic):
status: New → Triaged
Changed in ubuntu-drivers-common (Ubuntu Eoan):
status: New → Triaged
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nvidia-prime - 0.8.14

---------------
nvidia-prime (0.8.14) focal; urgency=medium

  * prime-offload:
    - Detect nvidia modules more accurately before using prime.
      We don't want to catch i2c_nvidia_gpu or any other module
      that is not relevant.
  * prime-select:
    - Use udev rules to run early (LP: #1848326).
    - Remove the 380 pci class.

 -- Alberto Milone <email address hidden> Wed, 19 Feb 2020 13:24:26 +0100

Changed in nvidia-prime (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Alberto Milone (albertomilone) wrote :

@Dmitrii: if you would like to test, please add the following PPA:

https://launchpad.net/~oem-solutions-group/+archive/ubuntu/nvidia-driver-staging

and update nvidia-prime (0.8.14~0.19.10.1) and ubuntu-drivers-common (1:0.7.8.1~0.19.10.1)

Also, make sure that the nvidia driver is not blacklisted in the initramfs. Then restart your system.
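
A quick way to verify that (assuming stock initramfs-tools paths):

# Should print nothing if the blacklist is absent from the current initramfs
lsinitramfs /boot/initrd.img-$(uname -r) | grep blacklist-nvidia
# If it is still present, regenerate the initramfs before testing
sudo update-initramfs -u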

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

I'm on focal now, just got the updated packages.

I will give it a try soon.

Revision history for this message
Alberto Milone (albertomilone) wrote :

Ok, that would be nvidia-prime (0.8.14) and ubuntu-drivers-common (1:0.7.8.1) then. Thank you.

Revision history for this message
Alberto Milone (albertomilone) wrote :

Also, please make sure to do a cycle of "sudo prime-select nvidia" and "sudo prime-select intel" so that the new udev rules are in place.
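
Spelled out, ending on the profile to be tested:

sudo prime-select nvidia
sudo prime-select intel   # regenerates the udev rules with the updated package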

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

albertomilone, looks good to me.

I tried switching from nvidia to intel and back, rebooting on each attempt, and haven't managed to reproduce the issue. It looks like reordering the udev rules and/or adding another device class in your change helped (thanks a lot!).

I also no longer see an nvidia device in the lspci output after rebooting with `prime-select intel` applied:

➜ ~ lspci | grep -i nvidia
➜ ~
➜ ~ lspci | grep -i VGA
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 630 (rev 04)

cat /lib/udev/rules.d/50-pm-nvidia.rules
https://paste.ubuntu.com/p/2WScDVDQPq/

cat /lib/udev/rules.d/61-gdm.rules
https://paste.ubuntu.com/p/5K6hgKmrG2/

# that's for nvidiafb which is normal and comes from the kmod package
➜ ~ grep -RiP nvidia /etc/modprobe.d
/etc/modprobe.d/blacklist-framebuffer.conf:blacklist nvidiafb

➜ ~ uname -r
5.4.0-14-generic

➜ ~ dpkg -l | grep -P 'nvidia-prime|nvidia-driver|Xorg|gdm3'
ii gdm3 3.34.1-1ubuntu1 amd64 GNOME Display Manager
ii nvidia-driver-440 440.59-0ubuntu2 amd64 NVIDIA driver metapackage
ii nvidia-prime 0.8.14 all Tools to enable NVIDIA's Prime
ii xserver-xorg-core 2:1.20.7-2ubuntu1 amd64 Xorg X server - core server
ii xserver-xorg-legacy 2:1.20.7-2ubuntu1 amd64 setuid root Xorg server wrapper
ii xserver-xorg-video-nvidia-440 440.59-0ubuntu2 amd64 NVIDIA binary Xorg driver

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ubuntu-drivers-common - 1:0.7.8.1

---------------
ubuntu-drivers-common (1:0.7.8.1) focal; urgency=medium

  * tests/gpu-manager.py:
    - Do not test against renamed intel modules.
      We don't rename intel modules (such as i915-brw) any more.
      This fixes a test failure now that we have switched to exact
      module matching.

ubuntu-drivers-common (1:0.7.8) focal; urgency=medium

  * 71-u-d-c-gpu-detection.rules:
    - Look only for the actual nvidia module.
  * gpu-manager.c:
    - Stricter module name matching in is_module_loaded()
      (LP: #1848326).

 -- Alberto Milone <email address hidden> Thu, 20 Feb 2020 11:31:12 +0100

Changed in ubuntu-drivers-common (Ubuntu):
status: In Progress → Fix Released
Revision history for this message
Alberto Milone (albertomilone) wrote :

Great, thanks for testing!

Revision history for this message
J. Snow (jon.snow) wrote :

I'm facing the exact same problem on Ubuntu 20.04. Is there any workaround?

Revision history for this message
demensd (demensdeum) wrote :

Ubuntu 20.04
sudo prime-select intel
restart
stuck on HP/Kubuntu logo

Had to switch back in recovery mode:
sudo prime-select nvidia

Laptop:
https://support.hp.com/ru-ru/document/c06074117

Revision history for this message
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release.

Changed in nvidia-prime (Ubuntu Eoan):
status: Triaged → Won't Fix
Changed in ubuntu-drivers-common (Ubuntu Eoan):
status: Triaged → Won't Fix
Revision history for this message
Alonso (patax87) wrote :

I'm facing the same problem with Ubuntu 20.04.1 LTS and a GeForce 310M, using an old Samsung QX310.

Without these drivers, the preinstalled intel driver works fine.
But with "prime-select intel" I boot into a black screen.

To go back into the DE, first I have to press Ctrl+Alt+F4 and then input my login and password. Sometimes it takes a few tries. I hope this helps others, so that booting into recovery mode is not needed.

Revision history for this message
Alonso (patax87) wrote :

I also meant to say that after logging in at this black screen, you can run "prime-select nvidia" and reboot again.
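
Summarizing the workaround from the last two comments (the console number may vary):

# At the black screen, press Ctrl+Alt+F4 to reach a text console, log in, then:
sudo prime-select nvidia
sudo reboot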
