Ubuntu 3.2 kernels vs. nvidia-current vs GTX 560 (Mainline works)

Bug #901386 reported by 3vi1
This bug affects 12 people
Affects                            Status         Importance   Assigned to   Milestone
NVIDIA Drivers Ubuntu              In Progress    Undecided    Unassigned
nvidia-graphics-drivers (Ubuntu)   Fix Released   High         Unassigned

Bug Description

Not a single one of the packaged 3.2 kernels in the Precise repos has worked for me on my main (AMD64 - i980x) system, when using the nvidia-current (or nvidia-current-updates) drivers.

However, the stock 3.2 mainline kernels do appear to work fine. For instance, http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2-rc4-oneiric/linux-image-3.2.0-030200rc4-generic_3.2.0-030200rc4.201112012035_amd64.deb works with no issues whatsoever. Also, the standard 3.1 packages from the normal repos have always worked fine on the same system with the same hardware.

What works:
--------------
3.1 with nvidia-current or nouveau drivers
3.2.0-2 with nouveau drivers.
3.2.0-3 with nouveau drivers.
3.2rc4 (mainline) with nvidia-current or nouveau drivers

What does not work:
------------------------
3.2.0-2 with nvidia-current or nvidia-current-updates
3.2.0-3 with nvidia-current or nvidia-current-updates

Relevant errors:
------------------
When attempting to start 3.2.0-2 or -3 with nvidia-current*, X fails to start and the following is reported in Xorg.0.log:

[ 6.002] (II) NVIDIA(0): Creating default Display subsection in Screen section
[ 6.002] (==) NVIDIA(0): Depth 24, (==) framebuffer bpp 32
[ 6.002] (==) NVIDIA(0): RGB weight 888
[ 6.002] (==) NVIDIA(0): Default visual is TrueColor
[ 6.002] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[ 6.002] (**) NVIDIA(0): Option "NoLogo" "True"
[ 14.524] (EE) NVIDIA(0): Failed to initialize the NVIDIA GPU at PCI:5:0:0. Please
[ 14.524] (EE) NVIDIA(0): check your system's kernel log for additional error
[ 14.524] (EE) NVIDIA(0): messages and refer to Chapter 8: Common Problems in the
[ 14.524] (EE) NVIDIA(0): README for additional information.
[ 14.524] (EE) NVIDIA(0): Failed to initialize the NVIDIA graphics device!

When referring to the kernel log, this is the only NVRM error message:

Dec 3 15:20:58 saturn kernel: [ 4.614087] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 285.05.09 Fri Sep 23 17:31:57 PDT 2011
Dec 3 15:21:09 saturn kernel: [ 16.267342] NVRM: RmInitAdapter failed! (0x27:0x38:1193)
Dec 3 15:21:09 saturn kernel: [ 16.267348] NVRM: rm_init_adapter(0) failed
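For anyone who wants to check their own kernel log for the same signature, here is a minimal Python sketch that picks out NVRM failure lines. The function name find_nvrm_failures and the regular expression are assumptions based only on the excerpt above, not anything the driver documents, so treat a match as a hint rather than a diagnosis.

```python
import re

# Assumed pattern: any kernel log line where the NVRM prefix is followed
# by the word "failed", as in the two error lines quoted in this report.
NVRM_FAILURE = re.compile(r"NVRM: .*failed")


def find_nvrm_failures(log_text):
    """Return the log lines that look like NVRM initialization failures."""
    return [line for line in log_text.splitlines() if NVRM_FAILURE.search(line)]


# Sample input copied from the kernel log excerpt in this report.
sample_log = """\
Dec 3 15:20:58 saturn kernel: [ 4.614087] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 285.05.09 Fri Sep 23 17:31:57 PDT 2011
Dec 3 15:21:09 saturn kernel: [ 16.267342] NVRM: RmInitAdapter failed! (0x27:0x38:1193)
Dec 3 15:21:09 saturn kernel: [ 16.267348] NVRM: rm_init_adapter(0) failed
"""

for failure in find_nvrm_failures(sample_log):
    print(failure)
```

The same function could be fed the contents of a real log file (for example kern.log) instead of the sample string.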

Here are the important parts of my glxinfo output (taken while running the mainline kernel):

direct rendering: Yes
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: GeForce GTX 560 Ti/PCI/SSE2
OpenGL version string: 4.2.0 NVIDIA 285.05.09
OpenGL shading language version string: 4.20 NVIDIA via Cg compiler

What I've tried:
-----------------
I've completely reformatted/reinstalled from the current (Precise) daily live-cd image to make sure this was not some artifact of upgrading the system from 11.10. But, the symptoms remained exactly the same.

Before reinstalling, I also tried the newer packages from the Edgers repository (including the nvidia 290.10 drivers there) - but saw no change in results.

I've looked around, but can't find an existing report that seems to exactly match my issue, and I know others are using the standard repo versions of the Precise 3.2 kernels with nvidia-current without issue, so I'm wondering if it's not the result of some kernel patch that's only affecting some higher-end cards.

Other info:
-------------
evil@saturn:~$ lsb_release -rd
Description: Ubuntu precise (development branch)
Release: 12.04

evil@saturn:~$ apt-cache policy nvidia-current
nvidia-current:
  Installed: (none)
  Candidate: 285.05.09-0ubuntu1
  Version table:
     285.05.09-0ubuntu1 0
        500 http://us.archive.ubuntu.com/ubuntu/ precise/restricted amd64 Packages
        100 /var/lib/dpkg/status

The card is a GTX 560 with 1GB RAM.

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

I just tested with the linux-image (3.2.0-3.9) packages that hit the repo last night. Unfortunately, the problem is still there:

kern.log:
Dec 8 07:51:51 saturn kernel: [ 6.132265] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 285.05.09 Fri Sep 23 17:31:57 PDT 2011
Dec 8 07:52:04 saturn kernel: [ 19.727681] NVRM: RmInitAdapter failed! (0x27:0x38:1193)
Dec 8 07:52:04 saturn kernel: [ 19.727688] NVRM: rm_init_adapter(0) failed

syslog:
Dec 8 07:51:51 saturn kernel: [ 6.064390] nvidia: module license 'NVIDIA' taints kernel.
Dec 8 07:51:51 saturn kernel: [ 6.132165] nvidia 0000:05:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
Dec 8 07:51:51 saturn kernel: [ 6.132173] nvidia 0000:05:00.0: setting latency timer to 64

Xorg.0.log:
[ 19.451] (EE) NVIDIA(0): Failed to initialize the NVIDIA GPU at PCI:5:0:0. Please
[ 19.451] (EE) NVIDIA(0): check your system's kernel log for additional error
[ 19.451] (EE) NVIDIA(0): messages and refer to Chapter 8: Common Problems in the
[ 19.451] (EE) NVIDIA(0): README for additional information.
[ 19.451] (EE) NVIDIA(0): Failed to initialize the NVIDIA graphics device!
[ 19.451] (II) UnloadModule: "nvidia"
[ 19.451] (II) Unloading nvidia
[ 19.451] (II) UnloadModule: "wfb"
[ 19.451] (II) Unloading wfb
[ 19.451] (II) UnloadModule: "fb"
[ 19.451] (II) Unloading fb
[ 19.451] (EE) Screen(s) found, but none have a usable configuration.
[ 19.451]
Fatal server error:
[ 19.451] no screens found

Also, since I didn't explicitly say this before: When the nvidia driver fails to load, the system does *not* fall back and use nouveau or nv. It is just left at the TTY console. The only way to get the system to boot with the nouveau drivers appears to be to remove nvidia-current* and reboot.

Other things I tried:
-----------------------
Forgot to mention this in the original message too, but one thing I did try was temporarily blacklisting the nouveau drivers in order to make sure they weren't a factor. However, it did not appear to have any effect.

affects: linux-meta (Ubuntu) → linux (Ubuntu)
tags: added: precise
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 901386

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
status: Incomplete → New
status: New → Incomplete
Revision history for this message
Brad Figg (brad-figg) wrote : Test with newer development kernel (3.2.0-3.9)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

Thank you for your help, we really do appreciate it.

tags: added: kernel-request-3.2.0-3.9
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This may be a duplicate of bug 897102

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
importance: Undecided → High
Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

@Joseph re:#4 - Possibly, but there's not enough info in that bug and the description of "hung" is too vague to be sure. To be clear, my system doesn't "hang" per se - lightdm just can't start. I can still alt-f# over to a tty once the nvidia driver fails to start.

@Brad re:#3 - I already tested with 3.2.0-3.9, as mentioned in comment #1. I will add the tag bot-stop-nagging.

@Brad re:#2 - I don't believe I can run that command properly, as it invokes a gui and browser and the bug prevents me from starting X. I could run it while running nouveau - but I'm not sure that would provide any useful information. Let me know if you would like me to do that anyway.

I see that this has already been marked as confirmed, but would appreciate if anyone else reading and experiencing this bug could chime in and /really/ confirm that it's affecting multiple users/systems. From the symptoms, my testing, and past lack of problems with nvidia-current, my confidence is high that it's not a unique problem to my individual hardware, but everyone could use a sanity check now and then. :)

tags: added: bot-stop-nagging
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Unfortunately, there isn't much more we can do on the kernel side because of the proprietary nature of this package, since we can't look at the code to figure out what's wrong and fix it. I'll re-assign to the nvidia-current package.

affects: linux (Ubuntu) → nvidia-graphics-drivers (Ubuntu)
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Also, Ubuntu 12.04 (Precise) will pick up this fix automatically if you confirm it's fixed upstream.

Revision history for this message
Daniel Fletcher (xyzzyman) wrote :

This is happening to me with a 9600M GS 512MB. I tried a fresh daily ISO install of Precise, and it also happens on my 11.10 install with the newer 3.2 kernel packages applied. I have also purged nvidia-current and tried the 290.10, 290.06 and 285.05.09 binaries from nvidia. If I download RC5 from kernel.org and compile it, whether with my custom .conf file or with the defaults, any of the nvidia drivers work just fine. I don't like using kernels from kernel.org, though, as ureadahead then doesn't work.

Revision history for this message
Daniel Fletcher (xyzzyman) wrote :

I forgot to mention that my system doesn't drop to a TTY. It just goes to a black screen, then the backlight eventually turns off. This happens no matter which nvidia driver I use, in both 12.04 and 11.10. I know it's the video driver, as I disabled lightdm to boot to a command prompt, then ran startx manually and it still locked up. I can't switch to TTY1-6.

Revision history for this message
Daniel Fletcher (xyzzyman) wrote :

Further troubleshooting led me to acpi=off being a solution. Just not a good one. I'll take a look at the .config file that's default with 3.2 and compare it to the latest Ubuntu .config later. Or pull the config out of a working 3.2 deb and compare it to the new ones. Something must have changed in Ubuntu's .config to cause this.

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

Just tested and found that acpi=off lets me boot 3.2.0-6 too, and it works fine with the nvidia-current drivers. Too bad it means my i980x is treated as single-core.

I'm not sure you'll find the trouble in a diff of the more recent mainline configs: neither mainline rc5 nor rc6 works for me (same symptoms as the normal Ubuntu kernels). I guess now at least we know we're looking for some 3.2 ACPI change vs. nvidia blob issue.

Revision history for this message
Haitao Li (lht) wrote :

Same issue here, Quadro FX 770M 512M. (mainline rc6 doesn't work either)

The acpi=off trick works for me too.

Revision history for this message
Daniel Fletcher (xyzzyman) wrote :

I did a custom compile of 3.2.0-6 with no nouveau driver at all, then purged all nvidia debs, ran the nvidia binary with the --uninstall option in single mode, and deleted my xorg.conf file. I rebooted, installed the nvidia binary in single mode, then rebooted again, and now have the 3.2 kernel along with nvidia acceleration. This is just on my 11.10 install. I will do a fresh install of 12.04 from a daily ISO and try installing the nvidia binary without touching nvidia-current.

Revision history for this message
Haitao Li (lht) wrote :

I found the blank screen is actually caused by a busy xorg process, and there are a huge number of errors like the ones below in kern.log. After adding intel_iommu=off to the kernel boot options, it now works perfectly.

Dec 24 13:28:54 ainur kernel: [ 420.863224] DMAR:[DMA Read] Request device [01:00.0] fault addr 122886000
Dec 24 13:28:54 ainur kernel: [ 420.863225] DMAR:[fault reason 01] Present bit in root entry is clear
Dec 24 13:28:54 ainur kernel: [ 420.866458] DRHD: handling fault status reg 2
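To see which device is generating the faults, the DMAR lines can be tallied per PCI address. The following is a rough Python sketch; the function name count_dmar_faults and the regular expression are assumptions derived only from the three log lines above, so other kernels may word these messages differently.

```python
import re
from collections import Counter

# Assumed pattern, based on the "DMAR:[DMA Read] Request device [01:00.0]"
# line quoted above; DMA Write faults are assumed to use the same layout.
DMAR_FAULT = re.compile(
    r"DMAR:\[DMA (?:Read|Write)\] Request device \[([0-9a-fA-F:.]+)\]"
)


def count_dmar_faults(log_text):
    """Count DMAR fault lines per PCI device address."""
    counts = Counter()
    for line in log_text.splitlines():
        match = DMAR_FAULT.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the kern.log excerpt in this comment.
sample_log = """\
Dec 24 13:28:54 ainur kernel: [ 420.863224] DMAR:[DMA Read] Request device [01:00.0] fault addr 122886000
Dec 24 13:28:54 ainur kernel: [ 420.863225] DMAR:[fault reason 01] Present bit in root entry is clear
Dec 24 13:28:54 ainur kernel: [ 420.866458] DRHD: handling fault status reg 2
"""

for device, count in count_dmar_faults(sample_log).items():
    print(device, count)
```

If nearly all faults point at the address of the graphics card, that would be consistent with the IOMMU interfering with the nvidia driver's DMA.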

Revision history for this message
Sander van Hulst (sanmaz31) wrote :

I have this problem too.
My videocard is Nvidia Quadro FX 3800.

Adding intel_iommu=off to grub parameters indeed solves this problem temporarily...
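When comparing results, it helps to confirm that the parameter actually reached the running kernel. Below is a small Python sketch that looks up a name=value parameter in a kernel command line string. The helper name kernel_param and the sample command line are hypothetical; returning the last occurrence is a simplification, not a statement about how the kernel resolves duplicates.

```python
def kernel_param(cmdline, name):
    """Return the value of a name=value kernel parameter, or None if absent.

    If the parameter appears more than once, the last value is returned
    (a simplifying assumption made for this sketch).
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


# Hypothetical command line for illustration; on a running Linux system
# the real one can be read from /proc/cmdline.
sample_cmdline = "BOOT_IMAGE=/vmlinuz-3.2.0-6-generic ro quiet splash intel_iommu=off"

print(kernel_param(sample_cmdline, "intel_iommu"))
```

A result of None means the parameter was not passed at all, which is worth ruling out before concluding that the workaround has no effect.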

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

I can confirm Haitao's workaround as well. I added intel_iommu=off and was then able to boot 3.2.0-6 without issue.

Thanks Haitao!

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

Looks like I have to back off the "without issue" in my previous statement. I just noticed that the USB3.0 devices attached to my Marvell 88SE91xx adapter are not being seen when I use this workaround. Of course, at this point I'm assuming it's caused by the workaround and not an entirely separate bug in 3.2.0-6... so take that for what it's worth.

I had to completely power-down/power-on and boot back to the mainline rc4 kernel (the only 3.2 kernel that's not had this nvidia issue for me) in order to get the USB3 devices seen again (that is to say: just warm-rebooting and choosing the old kernel from grub didn't work - had to completely power down the motherboard before they could again be seen).

I also tried using intel_iommu=igfx_off, but the kernel still fails to load the nVidia drivers in the manner described in the original bug when doing that.

Revision history for this message
Sander van Hulst (sanmaz31) wrote :

I just tried 3.2.0.9 and it still does not work.
Has anyone had any luck with newer kernels than 3.2rc4?

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

> Has anyone had any luck with newer kernels than 3.2rc4?

No luck here. As of 3.2.0.9 I still have to disable intel_iommu in grub to get it to work, even with the new 290.10 nvidia-current packages.

Revision history for this message
Daniel Fletcher (xyzzyman) wrote :

It was my mistake earlier. I used my own compile of the 3.2 kernel with IOMMU all disabled, as I only have a Core2 Duo that doesn't support IOMMU. I have to use intel_iommu=off with any build of the 3.2 or 3.2.1 kernel, whether from Ubuntu sources or kernel.org, on both 12.04 and 11.10, using any nvidia binary.

Revision history for this message
Thomi Richards (thomir-deactivatedaccount) wrote :

Hi,

This affects me, although with slightly different symptoms to other reporters.

The current (18th Jan 2012) kernel & nvidia-current driver in precise don't work - after the kernel boots I'm stuck with a black screen. I can't switch to any of the terminals, or do anything other than hold down the power button to switch the machine off. If I start the same kernel in recovery mode then I can get to a single user shell, but when I try to start X I end up in the same place - locked out of the system entirely.

Adding acpi=off to the kernel parameters does not make any difference. Adding intel_iommi=off to the kernel parameters doesn't make any difference. Adding both makes no difference either.

Revision history for this message
entonjackson (aj-mysc) wrote :

Same for me with Quadro FX 770M, Precise 64bit, and Linux 3.2.0.2

Revision history for this message
Daniel Fletcher (xyzzyman) wrote :

For those that need intel_iommu=off, it looks like it's been fixed with this change, "[Config] Disable CONFIG_INTEL_IOMMU_DEFAULT_ON":

---------------
linux (3.2.0-10.17) precise; urgency=low

  [ Andy Whitcroft ]

  * Revert "SAUCE: overlayfs -- fs: limit filesystem stacking depth"
  * Revert "SAUCE: overlayfs -- overlay: overlay filesystem documentation"
  * Revert "SAUCE: overlayfs -- overlayfs: implement show_options"
  * Revert "SAUCE: overlayfs -- overlayfs: add statfs support"
  * Revert "SAUCE: overlayfs -- overlay filesystem"
  * Revert "SAUCE: overlayfs -- vfs: introduce clone_private_mount()"
  * Revert "SAUCE: overlayfs -- vfs: export do_splice_direct() to modules"
  * Revert "SAUCE: overlayfs -- vfs: add i_op->open()"
  * ensure debian/ is not excluded from git by default
  * add new scripting to handle buglinks in rebases
  * ubuntu: overlayfs -- overlayfs: add statfs support
  * ubuntu: overlayfs -- overlayfs: apply device cgroup and security
    permissions to overlay files
    - LP: #915941, #918212
    - CVE-2012-0055

  [ Erez Zadok ]

  * ubuntu: overlayfs -- overlayfs: implement show_options

  [ Leann Ogasawara ]

  * Revert "SAUCE: dmar: disable if ricoh multifunction detected"
  * [Config] Disable CONFIG_INTEL_IOMMU_DEFAULT_ON
    - LP: #907377, #911236
  * [Config] Enable CONFIG_IRQ_REMAP

  [ Miklos Szeredi ]

  * ubuntu: overlayfs -- vfs: pass struct path to __dentry_open()
  * ubuntu: overlayfs -- vfs: add i_op->open()
  * ubuntu: overlayfs -- vfs: export do_splice_direct() to modules
  * ubuntu: overlayfs -- vfs: introduce clone_private_mount()
  * ubuntu: overlayfs -- overlay filesystem
  * ubuntu: overlayfs -- fs: limit filesystem stacking depth

  [ Neil Brown ]

  * ubuntu: overlayfs -- overlay: overlay filesystem documentation

  [ Upstream Kernel Changes ]

  * (pre-stable) x86/PCI: amd: factor out MMCONFIG discovery
    - LP: #647043
  * (pre-stable) PNP: work around Dell 1536/1546 BIOS MMCONFIG bug that
    breaks USB
    - LP: #647043
 -- Leann Ogasawara <email address hidden> Mon, 16 Jan 2012 07:10:08 -0800
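To check whether a given kernel build carries this config change, the kernel config file can be inspected. Here is a minimal Python sketch that reads an option from config text in the usual "CONFIG_X=y" / "# CONFIG_X is not set" format. The helper name config_option and the sample config fragment are hypothetical; the fragment only mirrors what the changelog above describes.

```python
def config_option(config_text, option):
    """Return the value of a kernel config option, or None if it is not set."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        # Disabled options are recorded as comments in kernel config files.
        if line == f"# {option} is not set":
            return None
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None


# Hypothetical fragment for illustration; on Ubuntu the real file is
# typically /boot/config-<kernel version>.
sample_config = """\
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_IRQ_REMAP=y
"""

for name in ("CONFIG_INTEL_IOMMU", "CONFIG_INTEL_IOMMU_DEFAULT_ON", "CONFIG_IRQ_REMAP"):
    print(name, config_option(sample_config, name))
```

With CONFIG_INTEL_IOMMU_DEFAULT_ON unset, the IOMMU support is still built in but stays off unless it is requested on the kernel command line, which is why the intel_iommu=off workaround should no longer be needed.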

Revision history for this message
Bryce Harrington (bryce) wrote :

Based on the last comment, and the original reporter's comment #16 it sounds like this might now be worked around in the distro. Can someone confirm the original reported problem (failure to boot) is resolved?

Comment #17 suggests there may be secondary problems (which may or may not be a consequence of the fix), so I'll leave the bug open for now. It would be helpful if someone can pin down what those issues are and if they are indeed a result of disabling IOMMU.

Even though this issue sounds like it's been worked around in the kernel, I'd like to see this escalated to nvidia. Alberto, can you mention it to them in the next call?

Changed in nvidia-graphics-drivers (Ubuntu):
assignee: nobody → Alberto Milone (albertomilone)
Revision history for this message
Sander van Hulst (sanmaz31) wrote :

With 3.2.0-10 I can boot normally without adding iommu=off.
(But it was necessary to reinstall the nvidia drivers, before that I couldn't boot)

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

I can confirm that I too am able to boot 3.2.0-10 without adding intel_iommu=off.

I did *not* have to reinstall any drivers like Sander though.

Revision history for this message
Chris Van Hoof (vanhoof) wrote : Re: [Bug 901386] Re: Ubuntu 3.2 kernels vs. nvidia-current vs GTX 560 (Mainline works)

On 01/20/2012 09:29 PM, Bryce Harrington wrote:
> Based on the last comment, and the original reporter's comment #16 it
> sounds like this might now be worked around in the distro. Can someone
> confirm the original reported problem (failure to boot) is resolved?

@Bryce -- Confirmed here, after installing 3.2.0-10 and removing
intel_iommu=off I'm now able to boot with the nvidia driver installed.

--chris

Revision history for this message
Alberto Milone (albertomilone) wrote :

@Bryce:
Sure, I'll bring this up in the next call.

Bryce Harrington (bryce)
Changed in nvidia-graphics-drivers (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Bryce Harrington (bryce) wrote :

Thanks everyone for confirming the fix in Ubuntu. I'm opening an upstream task for Alberto to track the work with NVIDIA. Meanwhile I'm closing out the Ubuntu task.

If there are any remaining issues left as a consequence of this change please file new bug reports.

Changed in nvidia-drivers-ubuntu:
assignee: nobody → Alberto Milone (albertomilone)
assignee: Alberto Milone (albertomilone) → nobody
status: New → In Progress
Changed in nvidia-graphics-drivers (Ubuntu):
assignee: Alberto Milone (albertomilone) → nobody
status: In Progress → Fix Released
Revision history for this message
sandipt (sandipt) wrote :

3vi1, Can you provide system information and attach file by running nvidia-bug-report.sh as root?

Revision history for this message
3vi1 (launchpad-net-eternaldusk) wrote :

Sure thing. Attached!
