wrong nvidia kernel module (7185 instead of 100.14.11) loads at boot time (manual install)

Reported by Christoph Lechleitner on 2007-09-02
Affects: linux-restricted-modules-2.6.22 (Ubuntu)
Importance: Undecided
Assigned to: Christoph Lechleitner

Bug Description

Binary package hint: linux-restricted-modules-2.6.22-10-generic

I have a Lenovo Thinkpad T61p with an nVidia Quadro FX 570M card (PCI ID: 10de:040c) that needs the up-to-date driver 100.14.11 to work at all.
(It still has the problem of slowing down after suspend/resume, but X works with NV-GLX and all.)

Unfortunately, with up-to-the-second Gutsy something loads an old nvidia kernel module from somewhere ...

$ dmesg |grep -i nvidia
[ 17.384000] nvidia: module license 'NVIDIA' taints kernel.
[ 17.488000] NVRM: loading NVIDIA Linux x86 Kernel Module 1.0-7185 Mon Apr 2 18:29:54 PDT 2007

... which prevents X from starting.
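
To compare the loaded version against the expected one without reading dmesg by eye, the version string can be pulled out of the NVRM line. This is a minimal sketch that assumes the line format quoted above; the function name nvrm_version is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the version
# token that follows "Kernel Module" on the NVRM line (nothing if absent).
nvrm_version() {
    sed -n 's/.*NVRM: loading NVIDIA .*Kernel Module  *\([^ ]*\).*/\1/p'
}
```

Used as `dmesg | nvrm_version`, this would print 1.0-7185 for the log line quoted in this report.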

There is a workaround ...

rmmod nvidia
insmod /lib/modules/$(uname -r)/kernel/drivers/video/nvidia.ko

... to be called before X starts (e.g. in some /etc/init.d/boot.local script), but that's too difficult for the average user.
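
As a rough sketch of what such a boot-time script could contain: the function name reload_nvidia and its dry-run argument are illustrative only and not part of any Ubuntu package; the module path is the one from the workaround above.

```shell
#!/bin/sh
# Hypothetical boot-time helper: unload whatever nvidia module was loaded
# and insert the manually installed one for the running kernel.
# Pass "echo" as the first argument for a dry run that only prints commands.
reload_nvidia() {
    run=${1:-}
    ko="/lib/modules/$(uname -r)/kernel/drivers/video/nvidia.ko"
    $run rmmod nvidia
    $run insmod "$ko"
}
```

Calling `reload_nvidia echo` prints the two commands without touching the running system; calling `reload_nvidia` as root before X starts performs the swap.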

I believe this problem is in the area of the linux-restricted-modules-2.6.22-10-generic, but I am not sure ;-))

Here is what I tried in order to solve the problem without the above module hack:

I changed /etc/default/nvidia-kernel to NVIDIA_CARDS=0.
I had only nvidia-glx-new 100.14.11+2.6.22.3-10.1 installed (no other nvidia-glx-* package).
I even removed xserver-xorg-video-nv.
I even regenerated the initrd images after the manual module change.

After all this, the kernel still starts with the old 7185 module.

I know it's Nvidia's fault for needing kernel modules at all, and the existence of 4 driver branches, with some overlap in the lists of supported cards, is a pain in the ass for all of us, but we need to avoid such a version mess if Ubuntu is to become usable for the next guy.

I see two options:
1. /etc/default/nvidia could get an option to choose the driver version.
or
2. Divide the linux-restricted-modules package into packages with specific versions (generations) of specific drivers for specific devices.
E.g.:
linux-restricted-modules-nvidia-7185-$(uname -r)
linux-restricted-modules-nvidia-100.14.11-$(uname -r)

But I think the

Sitsofe Wheeler (sitsofe) wrote :

Christoph:
This is sounding familiar. Can you post the output of the following command?
ls -al /lib/linux-restricted-modules
Hopefully you are still seeing the problem, which will allow a diagnosis to take place...

Christoph Lechleitner (lech) wrote :

While searching for a solution to the X-slow-after-resume problem, I found, say, another workaround:

A thread in NVidia's forum,
http://www.nvnews.net/vbulletin/showthread.php?t=72490
gives some hints.

As it is impossible to remove linux-restricted-modules-* without collateral damage,
the crucial step seems to be to edit /etc/default/linux-restricted-modules-common and add the nvidia modules to the DISABLED_MODULES list.
Although NVidia recommends ...
DISABLED_MODULES="nv nvidia_new"
... I'd rather use ...
DISABLED_MODULES="nv nvidia nvidia_new"
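
To check what is currently disabled before and after editing, something like the following could be used. It is a sketch that assumes the defaults file is a plain shell fragment assigning DISABLED_MODULES, as in the lines above; disabled_modules and is_disabled are hypothetical helper names.

```shell
#!/bin/sh
# Print the DISABLED_MODULES value found in the defaults file given as $1.
disabled_modules() {
    # Source the file in a subshell so its variables do not leak out.
    ( DISABLED_MODULES=""; . "$1"; printf '%s\n' "$DISABLED_MODULES" )
}

# Succeed if module $2 is listed in DISABLED_MODULES of file $1.
is_disabled() {
    for m in $(disabled_modules "$1"); do
        [ "$m" = "$2" ] && return 0
    done
    return 1
}
```

For example, `is_disabled /etc/default/linux-restricted-modules-common nvidia && echo "nvidia is disabled"`.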

I also cleared all NVidia traces from the Ubuntu binary packages, as suggested (but did not remove linux-restricted-modules-* of course).

Still, this is not a real solution for ordinary Ubuntu users.

But I see a third solution now:
Perhaps there is or should be a list of PCI IDs that maps NVidia cards to NVidia driver generations.
Probably there are already some 100 lists of vendor and/or device IDs for PCI[e], USB, IEEE1394 and Bluetooth, but what's one more ;->>
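
Such a mapping could be as simple as a lookup keyed on the vendor:device pair. The sketch below is purely illustrative: driver_generation is a hypothetical name, and the only entry is the card from this report, assumed to belong to the nvidia_new generation because it needs the 100.14.11 driver.

```shell
#!/bin/sh
# Hypothetical lookup: map a PCI "vendor:device" ID to a driver generation.
driver_generation() {
    # Normalise to lower case so that 10DE:040C matches as well.
    id=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    case "$id" in
        10de:040c) echo "nvidia_new" ;;  # Quadro FX 570M (from this report)
        *)         echo "unknown" ;;     # no other IDs are known here
    esac
}
```

The ID itself can be read from `lspci -n`, which prints numeric vendor:device pairs.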

Christoph Lechleitner (lech) wrote :

After hours of googling, I found the corresponding OpenSuSE 10.3 bug:
https://bugzilla.novell.com/show_bug.cgi?id=290385

Christoph Lechleitner (lech) wrote :

Voila:

 # ls -al /lib/linux-restricted-modules
total 24
drwxr-xr-x 4 root root 4096 2007-09-02 22:27 .
drwxr-xr-x 18 root root 12288 2007-09-01 19:52 ..
drwxr-xr-x 17 root root 4096 2007-08-23 20:39 2.6.22-10-generic
drwxr-xr-x 17 root root 4096 2007-08-23 20:29 2.6.22-9-generic

I don't think modules from 2.6.22-9-generic could be loaded in 2.6.22-10-generic?

I am going to try purging the 2.6.22-9-generic stuff and boot again.

Christoph Lechleitner (lech) wrote :

Sorry for the SuSE link, that one concerns the X-slow-after-resume problem, not this module related one.

Christoph Lechleitner (lech) wrote :

Purging all 2.6.22-9-generic stuff did not prevent the loading of the 7185 driver.

Sitsofe Wheeler (sitsofe) wrote :

Christoph:
Can you indicate whether you manually installed the NVIDIA .pkg or whether you are still using the Ubuntu provided driver (don't change drivers though! Just indicate which you are using)?

Christoph Lechleitner (lech) wrote :

One addition:
I am currently on gutsy i386 most of the time, but the 100 GB drive allows me to have installations of gutsy amd64, feisty i386 and feisty amd64 on the same machine.
Due to the far-too-new hardware I don't start feisty often; I usually switch between gutsy i386 (provides Opera, Skype, RealPlayer) and gutsy amd64 (uses the full 4 GB, but VMware does not work and some apps are missing).
What I want to say: The problem is the same in my i386 and amd64 installations.

Christoph Lechleitner (lech) wrote :

Initially I used the binary package nvidia-glx-new with NVIDIA*.run installed over it (due to a .so file missing in the binary .deb).
Now I am using _only_ a manually installed NVidia driver.
All of these variants share the same problem; only adding nvidia to DISABLED_MODULES in /etc/default/linux-restricted-modules-common did the trick "within the rules" (the manual module change kind of does not count).

Sitsofe Wheeler (sitsofe) wrote :

If you've pulled up the Ubuntu NVIDIA package (as I believe the NVIDIA instructions tell you to) then you need to take care and use DISABLED_MODULES (or some more extreme measure) as you have found (see https://help.ubuntu.com/community/NvidiaManual for details about this and other potential issues). As you are using a manual install I will stop looking at this particular bug as I believe this changes where you can go for support. If you don't see useful follow ups here you _may_ have better luck on the NVIDIA web forums (http://www.nvnews.net/vbulletin/forumdisplay.php?f=14 ) or one of the methods described on http://www.ubuntu.com/support/communitysupport .

Using the deb and putting the .pkg over the top (rather than extracting just the bit you need) is a definite way to cause trouble, though, and I would recommend against doing that if possible. Go with one or the other, not both at the same time, because if the deb is updated you will run into issues and the .pkg will no longer know which files are supposed to be managed by it.

Christoph Lechleitner (lech) wrote :

I didn't go both ways at the same time.
But I could not go the .deb-only way, because the nvidia-glx-new 100.14.11 package is still incomplete (#98641), and I can see this one kind of depends on #98641 and perhaps other nvidia-glx-* bugs.
And there is of course the possibility that the final nvidia-glx-new package might fix this module problem, too.

Is there (in Launchpad) a way to let this issue depend on e.g. #98641 (and put this "on hold"), or do you prefer to close this bug and have me open a new bug in case an otherwise clean nvidia-glx-new/restricted-modules combination does not solve this old-module problem?

Sitsofe Wheeler (sitsofe) wrote :

I've just looked, and dependencies sound like an unimplemented Launchpad feature - see Bug #95419 .

Sitsofe Wheeler (sitsofe) wrote :

I think the "module problem" was due to the manual install needing extra steps to avoid conflicting with the Ubuntu provided drivers (see https://help.ubuntu.com/community/NvidiaManual ). I won't close this bug, but if I were you I would probably resolve this bug as invalid and subscribe myself to #98641 so I could tell if it had been fixed and then choose what I wanted to do (bearing in mind the information I'd read in NvidiaManual).

Christoph Lechleitner (lech) wrote :

For now this problem only exists with (obviously unsupported) "foreign" software (NVidia driver installed manually).
From a pure Ubuntu point of view it cannot be reproduced because of other problems of the current nvidia-glx-new package (see Bug #98641).
Therefore I set this bug to invalid and assign it to myself.
Once nvidia-glx-new is fixed, I will retest the situation with a pure Ubuntu configuration and then decide either to close or to re-open this bug.

Changed in linux-restricted-modules-2.6.22:
assignee: nobody → lech
status: New → Invalid

I have a T61, and I also experienced this problem, and solving it required multiple steps. This problem is really hard to debug because it is almost impossible to even FIND the incorrect module that is being loaded, since it seems to exist only in the initramfs! It ends up being mounted on /lib/modules/<version>/volatile .

1. As of current gutsy (Sep 6) the nvidia_new module is 100.14.11 and should work. (I actually installed the nvidia-new kernel module using m-a, but I think the nvidia_new module from the ramdisk is the one that is getting loaded.)

2. Edit /etc/default/linux-restricted-modules-common to contain the line
DISABLED_MODULES="nvidia nvidia_legacy"

Note that nvidia_new should NOT be disabled.

3. Run sudo lrm-manager.
If you don't do this, "modprobe nvidia_new" dies because the "install" line doesn't work.

4. Run sudo update-initramfs -u
I think that this is necessary also.

I *think* that this should result in the correct nvidia module getting loaded. (lsmod shows the nvidia_new module as just 'nvidia' after being loaded, but it is correct. My version has size 8115544)

5. The nvidia-glx-new package is MISSING the file /usr/lib/xorg/modules/libwfb.so
    You can get this file by extracting it from the NVIDIA driver archive and copying it to the correct place. If you don't do this, the X server will get further in startup, but still won't start. The Debian version of the packages, but not the Ubuntu version, contains this file. This has been known for a long time, so I don't know why it hasn't been fixed.
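
Steps 2 to 4 above could be scripted roughly as follows. This is only a sketch: set_disabled_modules is a hypothetical helper that takes the defaults file as an argument, so it makes no assumption about where that file lives on a given system.

```shell
#!/bin/sh
# Hypothetical helper: make file $1 contain exactly one DISABLED_MODULES
# line, set to the space-separated module list in $2.
set_disabled_modules() {
    file=$1
    modules=$2
    tmp=$(mktemp) || return 1
    # Drop any existing assignment (grep exits non-zero if nothing is left).
    grep -v '^DISABLED_MODULES=' "$file" > "$tmp" || true
    printf 'DISABLED_MODULES="%s"\n' "$modules" >> "$tmp"
    # Write back through cat so the file keeps its owner and permissions.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run as root, this would be called with the defaults file and "nvidia nvidia_legacy", followed by lrm-manager and update-initramfs -u as in steps 3 and 4.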

Christoph Lechleitner (lech) wrote :

Finally, the new 2.6.22-11 kernel packages and other updates from the last 3 days break the restricted stuff completely.

There are no loadable nvidia kernel modules!!??
The restricted manager tells us we wouldn't need restricted drivers!!!???
The damn self-repair function of gdm (or is it X) simply renames our nice working X11 config and produces only things that are much worse than our initial X experiences in the early 1990s.

Re-running NVidia's installer seems to solve this and (together with some of the tricks above) leads to a working X11 again.

Christoph Lechleitner (lech) wrote :

Confirming solution by package updates, and how to undo the workarounds ...

With updates from today this problem does not exist any more.

If a machine has traces of the workarounds described above, one needs to:
+ uninstall any manual NVIDIA*.run installation (run the installer with --uninstall)
+ remove nv* from DISABLED_MODULES (in /etc/default/linux-restricted-modules-common)
+ possibly add nvidia / nvidia_new to /etc/modules (not sure this is needed)
+ make sure there is no real stuff commented out in /etc/modprobe.d/lrm-video
+ reinstall linux-restricted-modules-2.6.22-11-generic and possibly nvidia-glx-new

Christoph Lechleitner (lech) wrote :

Whether or not this bug was "real" (or only the siblings were), I set it to "Fix Released" so other people can find the knowledge collected here.

Changed in linux-restricted-modules-2.6.22:
status: Invalid → Fix Released

Sebastian (sebastian-voitzsch) wrote :

Hello,

I don't know whether to reopen this bug or file a new one. The problem persists, and even with the workarounds posted here I wasn't able to get my 8200 chipset running.

I'm using 2.6.24-21-generic, previously with a GeForce4 chip that needed the legacy driver. On upgrade, Ubuntu told me it wanted to use new proprietary drivers. I agreed and got 169.xx installed.

However, the 8200 chipset needs the 177.xx version instead. So I downloaded the package from nvidia and manually installed the current driver version.

Now I can manually do an "insmod /path/to/module/nvidia.ko" and then start X. This is the only way it works.

However, automatic module loading fails. First, when using "modprobe nvidia" the 169.xx module was used. Funny thing: within the /lib/modules/2.6.24-21-generic path there is no module other than my own compiled nvidia.ko. Hmm, maybe initramfs?

Then I read something about restricted-modules-manager. So I added nvidia, nvidia_new to /etc/default/linux-restricted-modules-common and ran update-initramfs.

But this doesn't lead to a working system. Now "modprobe nvidia" fails with "error installing module nvidia" with no other comment or error message. modprobe -l nvidia, however, points me to the correct module?!
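
One thing that may be worth checking in this situation is whether an "install" rule for the module is still present in the modprobe configuration, since such a rule replaces the normal loading of the .ko file. A sketch, with install_rules as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: list "install <module> ..." rules below directory $1
# for module $2, printed as file:line:content. Prints nothing if none exist.
install_rules() {
    dir=$1
    module=$2
    # -r recurses, -n adds line numbers, -E enables extended regexps.
    grep -rnE "^[[:space:]]*install[[:space:]]+${module}([[:space:]]|\$)" "$dir" || true
}
```

For example, `install_rules /etc/modprobe.d nvidia` lists active rules for nvidia only; commented-out lines and rules for other modules such as nvidia_new are not matched.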

I think some of the autoloader or restricted-modules-manager settings were left behind, but I can't find them. Meanwhile I removed linux-restricted-modules, but no luck.

Sebastian
