10de:0426 GPU loads unreliably, possible kernel timeout

Bug #1009312 reported by Kyle Auble on 2012-06-06
This bug affects 2 people
Affects: linux (Ubuntu)
Importance: Medium
Assigned to: Unassigned

Bug Description

Reverse upstream kernel commit bisecting revealed a fix via commit d34883d4e35c0a994e91dd847a82b4c9e0c31d83 by Xiao Guangrong.

WORKAROUND: If I boot my computer from battery power alone without AC, my GPU & the Ubuntu splash screen load on startup.

I've been running Ubuntu 12.04 for a few weeks now and I really like it, but from the beginning I've had an issue where the proprietary nvidia driver installs but fails to load (confirmed from the command line, jockey, and the nvidia-dashboard). Over time, I've noticed that sometimes when I power on, the driver does load and I can enter a full Unity session without problems, but other times I fall back onto the VESA driver and a Unity 2D session. On a whim, I finally copied logs from both successful and unsuccessful boots, cut out the timestamps, ran a diff on them, and noticed a pattern in the kernel messages.
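As a rough illustration of the comparison described above (strip the timestamps, then diff a good boot against a bad one), here is a minimal Python sketch. The sample log lines are made up, and the helper names `strip_timestamps` and `diff_boots` are hypothetical; it only assumes that each dmesg line begins with a bracketed timestamp.

```python
import difflib
import re

# dmesg lines normally start with a bracketed timestamp such as "[    0.412345] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s?")


def strip_timestamps(lines):
    """Return the lines with any leading dmesg timestamp removed."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_boots(good_lines, bad_lines):
    """Unified diff of two dmesg logs after their timestamps are stripped."""
    return list(difflib.unified_diff(
        strip_timestamps(good_lines),
        strip_timestamps(bad_lines),
        fromfile="good-boot",
        tofile="bad-boot",
        lineterm="",
    ))


# Made-up sample lines, only to show the shape of the output.
good = [
    "[    0.500000] pci 0000:01:00.0: example message A",
    "[    0.506000] pci 0000:01:00.0: Boot video device",
]
bad = [
    "[    0.700000] pci 0000:01:00.0: example message A",
]

for line in diff_boots(good, bad):
    print(line)
```

Without the timestamps, lines that differ only in timing drop out of the diff, so what remains are messages that are actually missing or changed.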

I'm filing this bug after a successful boot, so I've also attached copies of dmesg, Xorg, & jockey logs from an unsuccessful boot. The first thing I saw in the logs was a timing discrepancy between the two boots, most of which is due to GPE storms. I've checked other logs and there's no clear relation: I've had successful boots with them and unsuccessful ones without them. I do still wonder if they may be involved, because it seems I'm a little luckier if I turn off and unplug any peripherals before booting.

But around line 325 in my dmesg logs, at the last step that mentions my GPU (pci device 0000:01:00.0), there is consistently at most a 6 ms delay for successful boots, but a 30 ms one for unsuccessful ones. Also, on all dmesg logs from successful boots, around line 610, the message "Boot video device" is recorded for the PCI number of my GPU, but for every fallback, the message never appears. That's why I'm thinking it's a kernel issue because the earliest mention of a specific driver module doesn't occur until later in the log.
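The two checks described above (the delay following a line that names the GPU, and whether "Boot video device" is logged for it) could be automated along these lines. This is only a sketch: the sample lines are invented, the function names are hypothetical, and it assumes the usual `[seconds.micros] message` dmesg layout.

```python
import re

LINE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse(lines):
    """Yield (seconds, message) for each line that has a dmesg timestamp."""
    for line in lines:
        m = LINE.match(line.rstrip("\n"))
        if m:
            yield float(m.group(1)), m.group(2)


def gaps_after_device_lines(lines, device):
    """Return [(gap_ms, message)] for each line naming `device`.

    The gap is the delay until the next logged line, in milliseconds.
    """
    entries = list(parse(lines))
    gaps = []
    for (t, msg), (t_next, _) in zip(entries, entries[1:]):
        if device in msg:
            gaps.append((round((t_next - t) * 1000, 3), msg))
    return gaps


def has_boot_video(lines, device):
    """True if a "Boot video device" message is logged for `device`."""
    return any(device in msg and "Boot video device" in msg
               for _, msg in parse(lines))


# Invented sample resembling a good boot with a slow step.
sample = [
    "[    0.400000] pci 0000:01:00.0: example setup step",
    "[    0.430000] pci 0000:00:1c.0: example next step",
    "[    0.900000] pci 0000:01:00.0: Boot video device",
    "[    0.901000] example following line",
]

for gap_ms, msg in gaps_after_device_lines(sample, "0000:01:00.0"):
    print(f"{gap_ms:8.3f} ms after: {msg}")
print("boot video device logged:", has_boot_video(sample, "0000:01:00.0"))
```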

I'm currently using fully updated versions of nvidia driver 295.49.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-24-generic-pae 3.2.0-24.39
ProcVersionSignature: Ubuntu 3.2.0-24.39-generic-pae 3.2.16
Uname: Linux 3.2.0-24-generic-pae i686
NonfreeKernelModules: nvidia
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu8
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: STAC92xx Analog [STAC92xx Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: kyle 1790 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xfc400000 irq 48'
   Mixer name : 'SigmaTel STAC9872AK'
   Components : 'HDA:83847662,104d1c00,00100201 HDA:14f12c06,104d1700,00100000'
   Controls : 18
   Simple ctrls : 9
Date: Tue Jun 5 22:44:22 2012
EcryptfsInUse: Yes
HibernationDevice: RESUME=UUID=1b676222-44c7-453c-a522-06b6fd5d66f4
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release i386 (20120423)
MachineType: Sony Corporation VGN-FZ260E
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.2.0-24-generic-pae root=UUID=e330e46a-b426-439f-8037-c1069cc693ce ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-24-generic-pae N/A
 linux-backports-modules-3.2.0-24-generic-pae N/A
 linux-firmware 1.79
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/04/2007
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: R1120J7
dmi.board.asset.tag: N/A
dmi.board.name: VAIO
dmi.board.vendor: Sony Corporation
dmi.board.version: N/A
dmi.chassis.asset.tag: N/A
dmi.chassis.type: 10
dmi.chassis.vendor: Sony Corporation
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrR1120J7:bd07/04/2007:svnSonyCorporation:pnVGN-FZ260E:pvrFC000001:rvnSonyCorporation:rnVAIO:rvrN/A:cvnSonyCorporation:ct10:cvrN/A:
dmi.product.name: VGN-FZ260E
dmi.product.version: FC000001
dmi.sys.vendor: Sony Corporation
---
AcpiTables: Error: command ['pkexec', '/usr/share/apport/dump_acpi_tables.py'] failed with exit code 127: Error executing /usr/share/apport/dump_acpi_tables.py: Permission denied
ApportVersion: 2.5.1-0ubuntu4
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 3344 F.... pulseaudio
CasperVersion: 1.321
DistroRelease: Ubuntu 12.10
LiveMediaBuild: Ubuntu 12.10 "Quantal Quetzal" - Alpha i386 (20120831)
MachineType: Sony Corporation VGN-FZ260E
Package: linux (not installed)
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: noprompt cdrom-detect/try-usb=true file=/cdrom/preseed/username.seed boot=casper initrd=/casper/initrd.lz quiet splash -- maybe-ubiquity
ProcVersionSignature: Ubuntu 3.5.0-13.14-generic 3.5.3
RelatedPackageVersions:
 linux-restricted-modules-3.5.0-13-generic N/A
 linux-backports-modules-3.5.0-13-generic N/A
 linux-firmware 1.91
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: yes
Tags: quantal running-unity
Uname: Linux 3.5.0-13-generic i686
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
dmi.bios.date: 07/04/2007
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: R1120J7
dmi.board.asset.tag: N/A
dmi.board.name: VAIO
dmi.board.vendor: Sony Corporation
dmi.board.version: N/A
dmi.chassis.asset.tag: N/A
dmi.chassis.type: 10
dmi.chassis.vendor: Sony Corporation
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrR1120J7:bd07/04/2007:svnSonyCorporation:pnVGN-FZ260E:pvrFC000001:rvnSonyCorporation:rnVAIO:rvrN/A:cvnSonyCorporation:ct10:cvrN/A:
dmi.product.name: VGN-FZ260E
dmi.product.version: FC000001
dmi.sys.vendor: Sony Corporation
---
ApportVersion: 2.10.2-0ubuntu1
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 4176 F.... pulseaudio
                      ubuntu 6045 F.... pulseaudio
CasperVersion: 1.333
DistroRelease: Ubuntu 13.10
LiveMediaBuild: Ubuntu 13.10 "Saucy Salamander" - Alpha i386 (20130529)
MachineType: Sony Corporation VGN-FZ260E
MarkForUpload: True
Package: linux (not installed)
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcEnviron:
 LANGUAGE=en_US
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: noprompt cdrom-detect/try-usb=true persistent file=/cdrom/preseed/hostname.seed boot=casper initrd=/casper/initrd.lz quiet splash -- maybe-ubiquity
ProcVersionSignature: Ubuntu 3.9.0-3.8-generic 3.9.4
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.9.0-3-generic N/A
 linux-backports-modules-3.9.0-3-generic N/A
 linux-firmware 1.109
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
Tags: saucy
Uname: Linux 3.9.0-3-generic i686
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 07/04/2007
dmi.bios.vendor: Phoenix Technologies LTD
dmi.bios.version: R1120J7
dmi.board.asset.tag: N/A
dmi.board.name: VAIO
dmi.board.vendor: Sony Corporation
dmi.board.version: N/A
dmi.chassis.asset.tag: N/A
dmi.chassis.type: 10
dmi.chassis.vendor: Sony Corporation
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvrR1120J7:bd07/04/2007:svnSonyCorporation:pnVGN-FZ260E:pvrFC000001:rvnSonyCorporation:rnVAIO:rvrN/A:cvnSonyCorporation:ct10:cvrN/A:
dmi.product.name: VGN-FZ260E
dmi.product.version: FC000001
dmi.sys.vendor: Sony Corporation

Brad Figg (brad-figg) on 2012-06-06
Changed in linux (Ubuntu):
status: New → Confirmed

I can now also confirm that suspend works perfectly with the nvidia driver loaded

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.5 kernel[0] (not a kernel in the daily directory) and install both the linux-image and linux-image-extra .deb packages.

Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc1-quantal/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-unable-to-test-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

I just tried booting into the newest 3.5 release candidate (dated from June 2nd), and it choked somewhere between starting up my wireless card and reaching a login screen. While installing the mainline kernel, dpkg actually printed an error about not being able to build the nvidia module against the kernel, but I guess not even the VESA driver worked. Do you want me to try a bisect on different kernel versions, or would I need different hardware for a proper test?

I've gone ahead and attached the 3.5 kernel dmesg log. I'm not sure if you wanted me to remove the 'needs-upstream-testing' tag and change to confirmed because I was unable to test the bug specifically. If you don't consider the tags mutually exclusive, I can add back the incomplete status and upstream testing tag.

I also have to report that it turns out suspend doesn't work perfectly even with the GPU. I was able to unlock the computer fine, but the fan just started running full blast, and when I tried logging out of the session, my monitor just turned off. The following restart didn't reboot the GPU so I'm back onto the VESA driver. Since it happens both with and without the GPU though, I guess the suspend issue is separate, and I think I already saw a similar bug report.

tags: removed: needs-upstream-testing

After paying attention to how my computer boots, I've finally found a consistent relationship. If I boot my computer from battery power alone without AC, my GPU & the Ubuntu splash screen load on startup. However, if I start my laptop on AC power, with or without the battery connected, it seems to be random luck if the GPU does load, and most of the time it doesn't.

Whether peripherals or wireless are active doesn't seem to matter. I've gone ahead and added a kernel-acpi tag because power management seems to be involved somehow.

tags: added: kernel-acpi
tags: added: kernel-graphics
removed: nvidia
tags: added: kernel-unable-to-test-upstream-v3.5-rc1-quantal
description: updated

Kyle Auble, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so:
+ Did this issue not occur in prior Ubuntu releases?
+ As well, could you please test upstream kernel http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc7-quantal/ following https://wiki.ubuntu.com/KernelMainlineBuilds ? Once you've tested the upstream kernel, please comment on which kernel version specifically you tested and remove the tag:
needs-upstream-testing

This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the text:
needs-upstream-testing

If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

Please let us know your results. Thank you for your understanding.

Helpful Bug Reporting Links:
https://help.ubuntu.com/community/ReportingBugs#Bug_Reporting_Etiquette
https://help.ubuntu.com/community/ReportingBugs#A3._Make_sure_the_bug_hasn.27t_already_been_reported
https://help.ubuntu.com/community/ReportingBugs#Adding_Apport_Debug_Information_to_an_Existing_Launchpad_Bug
https://help.ubuntu.com/community/ReportingBugs#Adding_Additional_Attachments_to_an_Existing_Launchpad_Bug

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: needs-upstream-testing

Before using Ubuntu 12.04, I was running Xubuntu 11.10. Although
I didn't use Unity, the computer would detect my GPU, load the
nvidia driver module, and I was able to use higher resolutions.
However, I didn't have the newest version of the kernel because
starting around version 3.2.0-15 my computer would freeze during
bootup until upgrading to 3.2.0-23 (which came with Ubuntu 12.04).

I just tested the recent mainline build of the kernel (version 3.5-rc7),
and I was able to boot into a session, unlike with rc1. However, the
GPU still isn't loading and I can only enter a Unity2D session. Also,
the boot from battery workaround didn't work with the 3.5 kernel.

 tag -needs-upstream-testing
 tag kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.5-rc7-quantal

Sorry about the tags, I guess I did something wrong or don't have access to the email interface

tags: added: kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.5-rc7-quantal
removed: kernel-unable-to-test-upstream needs-upstream-testing

Kyle Auble, thank you for testing the newest mainline.

Could you please confirm this issue exists with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/daily/current/ . If the issue remains, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Thank you for your understanding.

I've been busy recently, and probably won't be able to test the Ubuntu build right away. I only have spare CD-Rs right now, but the download page says that the daily build of Quetzal is currently too large to burn to CD.

I'll keep checking, and once I have a spare USB drive or some DVD-Rs, or the build can fit on a CD-R, I'll run the test. Just to check though, will booting into a Live session in RAM be valid, or do I need to install the build to hard disk to reduce the variables involved? Although I won't have the proprietary NVIDIA driver in the Live session, I should still be able to tell if the GPU is being used. If I have to install, it just may take a little longer. I'll have to shuffle some partitions around.

Anyways, thanks for the suggestions, and I'll try to run that test ASAP.

apport information

tags: added: apport-collected quantal running-unity
description: updated

I finally picked up a spare USB thumb drive and set it up as a Live USB with the Aug 31 build of Quantal. Just to be sure, I went back and checked with my old Precise Live CD, and the bug appears for both versions in a live boot too. While both live-boots also respond to the "boot-from-battery" workaround, I think there's a bug in the Quantal version of nouveau or something because although the splash screen appears for a USB boot of Quantal on battery, the screen then freezes with static (but I can hear sounds in the background).

One other thing I'm going to try is setting up a Live USB of the AMD64 version to check. While I'm not expecting it to, I'll let you know if that one behaves differently.

What do you know? I tried booting the 64bit version of Quantal from a Live USB multiple times, and the splash screen consistently appeared. Now I wasn't able to check the logs because the screen would then freeze (as I mentioned in my last post, that may be a bug in Quantal's graphics).

However, I then downloaded the 64 bit version of Precise, and sure enough, the splash screen appears every time, and both nouveau & Unity 3d load, with the AC plugged in. dmesg shows the 6ms (instead of 30ms) gap near line 300, and the GPU is being booted around line 600. I suppose I should reinstall with the 64 bit version (which I was planning to do at some point anyway) to be sure, but it looks like that may resolve the bug for me.

I guess it may be something to do specifically with the PAE version of the kernel. If you want me to keep helping narrow down what's going on, I wouldn't mind, but if you want to close the bug as resolved, I can do that and make a note about the fix somewhere if anyone else is having problems (here/the wiki/the forums?). I'll still let you know how the fresh install turns out.

I'm now running the AMD64 version of Precise from the hard-disk, and over several reboots, the GPU has been loading without any problems. It looks like dropping the PAE version of the kernel fixed things for me. I don't know if whatever was happening was due to the mismatch between the 32bit OS and 64bit chip, or if it's something in the 32 bit kernel regardless of the hardware. All I can say is that I've managed to resolve it on my machine. Thanks for your help talking me through different ideas and narrowing this down.

Arrrgh, spoke too soon. After several reboots that went fine, I just booted up the computer, the splash screen failed to show, and I'm now in Unity 2D without my GPU. I'm really out of ideas as to what could be going on.

I diffed the dmesg logs again, and after the times have been stripped, the only difference before the video device fails to boot is that the chip is detected as being at slightly different frequencies (fraction of a MHz) and the boot time is obviously different.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: amd64

Kyle Auble, thank you for testing for this in Quantal.

Could you please provide the information following https://wiki.ubuntu.com/DebuggingACPI ?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete

It's strange, but the bug hasn't appeared again yet. I've tried a few more random things, but nothing seems to make it come up again. I suppose it's always possible that one bad session was a complete outlier, but since I can't consistently replicate the bug now in 64bit Ubuntu, I can't really test the ACPI flags from my installed system, although I do still have my 32bit live CD. I'll try fiddling with the ACPI flags when booting from the live disk and see if anything happens.

As for the ACPI debugging attachments, did you want those specifically from a session where the bug appears, or any session? If you only wanted logs from a bad session, I'll remember to copy the logs and submit them next time the bug shows up.

Kyle Auble, please provide the attachments requested in https://wiki.ubuntu.com/DebuggingACPI#Filing_a_Bug_Report without any of the suggested kernel parameters in place and from a session where the bug does not appear.

I wasn't able to follow the exact instructions on the wiki page because my system is just appending different sessions to the main kernel log, instead of archiving old log files. I've cut out all of the earlier sessions so it should just be of the most recent one, including a test of suspend and awake.

I finally had another bad reboot and remembered to run the ACPI debugging steps. However, a diff on the output of `sudo dmidecode` showed no differences from the one in a good session, and the only difference in the /proc/acpi files was that in battery/BAT0/info, my battery's design voltage was listed as 123340 mV instead of 123600 mV (but I don't think this is related because I'm in a good session again and my battery info says 123340 mV this time too).

I did a suspend and restart though, and like before when the GPU wasn't active, the computer powers up, but the screen never comes on. I've copied the kernel log from that session to include. First, I have updated the kernel so I'll include the new `uname -a`.

Also, there were several differences in the output of `sudo lspci -vvnn` between the good and the bad sessions. For several of the devices, there seems to be a consistent change in the MAbort flag between the good and bad session, although PresDet switched for the Root Port, the Audio Controller had a different data value on line 104, and the SATA controller's address changed. However, there are several different and missing values for the GPU (starting at line 457). After the GPU though, there are a couple switched flags for the Wireless Network system (and MAbort isn't one of them), but nothing else.
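A per-device comparison like the one described above can be scripted. The sketch below assumes the usual `lspci -vvnn` layout (one block per device, blocks separated by blank lines, the slot address as the first token of each block); the sample text is invented and heavily abbreviated, and the function names are hypothetical.

```python
def split_devices(text):
    """Map each PCI slot (first token of a block's header line) to that block's lines."""
    devices = {}
    for block in text.strip().split("\n\n"):
        lines = [l.rstrip() for l in block.splitlines() if l.strip()]
        if lines:
            slot = lines[0].split()[0]
            devices[slot] = lines
    return devices


def changed_devices(good_text, bad_text):
    """Return {slot: (lines only in good, lines only in bad)} for devices that differ."""
    good, bad = split_devices(good_text), split_devices(bad_text)
    changes = {}
    for slot in sorted(set(good) | set(bad)):
        g, b = good.get(slot, []), bad.get(slot, [])
        only_good = [l for l in g if l not in b]
        only_bad = [l for l in b if l not in g]
        if only_good or only_bad:
            changes[slot] = (only_good, only_bad)
    return changes


# Invented, abbreviated samples; real output has many more lines per device.
good_sample = """00:1b.0 Audio device [0403]: example
\tStatus: Cap+ 66MHz-

01:00.0 VGA compatible controller [0300]: example
\tControl: I/O+ Mem+ BusMaster+
"""
bad_sample = """00:1b.0 Audio device [0403]: example
\tStatus: Cap+ 66MHz-

01:00.0 VGA compatible controller [0300]: example
\tControl: I/O+ Mem+ BusMaster-
"""

for slot, (only_good, only_bad) in changed_devices(good_sample, bad_sample).items():
    print(slot)
    for l in only_good:
        print("  good:", l.strip())
    for l in only_bad:
        print("  bad: ", l.strip())
```

Grouping the differences by slot makes it easier to see whether a flag change is confined to the GPU or shared by several devices.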

I have been busy lately, but I've searched Launchpad a little and found one other expired bug that I felt confident enough to label a duplicate. I also came across Bug #940564, which while I don't think it's an exact duplicate, seems to have a lot of similarities to this one. That one is currently just marked against the nvidia-graphics-drivers though. Unfortunately, I still haven't noticed any pattern in when a good or bad session occurs, but I'll leave a comment if I do find one. Let me know if there's anything else I can do.

tags: added: regression-release

Kyle Auble, thank you for providing the requested attachments.

Regarding your comments https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1009312/comments/41 :
Kernel log of a failed suspend attempt (123.2 KiB, text/plain)

Let us not introduce suspend problems, and accompanying attachments, into this report. Suspend is typically treated as a separate problem. For more on this, please see the Ubuntu Bug Control and Ubuntu Bug Squad article:
https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue

and Ubuntu Community article:
https://help.ubuntu.com/community/ReportingBugs#Bug_Reporting_Etiquette

Thank you for your understanding.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
description: updated

Hello, I've been away for a while, but I just wanted to see if there was anything I could still do to help with this bug. I haven't had a problem with it for a few weeks, but I did have one weird startup this morning when my monitor simply didn't come on.

I moved to one of the beta Nvidia drivers in December to use Steam temporarily, but I can drop back to the stable driver now. I never knew how to consistently provoke the bug, but if someone has a better idea now as to what causes it, I can try replicating it and seeing if one of the recent kernel versions actually did fix it.

I know it's been a while, but I thought I should report that I've still been seeing this bug. I recently installed a fresh mainline kernel (3.9.0-999-generic, built on 4/21), and it runs fine except I'm still seeing the bug. I want to try some of the ACPI options again when booting, but my problem is I still have no clue about how to consistently replicate the bug.

After looking through my PCI info, I have a rough hypothesis of where the problem's happening. Although the GPU loses PCI features like bus-mastering & ASPM on bad sessions, my gut feeling is these are side-effects of an underlying PCI/ACPI issue (since multiple devices raise error flags on bad sessions). I'm wondering if it has something to do with how space is being allocated for DMA (since the 32-bit memory region for the GPU is always treated as virtual in bad sessions). This might explain why the bug was so common in the PAE kernel.

The one other thing I realized is that my GPU is the only device to use a PCI-to-PCI bridge straight off the root port. Every other device on my system either routes directly to the root port (like the audio device, 00:1b.0) or uses a bridge off of a secondary PCI Express port (prefix 00:1c). Especially since forming the GPU's PCI-to-PCI bridge is exactly where the timing discrepancy occurs, I wonder if this is why the GPU is the one device that fails. I've gone ahead and attached the output from `lspci -t` to help see my PCI arrangement.

I'm not a kernel hacker so these are just hunches based on the data, but if anyone that's comfortable with the kernel's PCI system could suggest a test that might consistently reveal the bug, I'd be happy to keep testing things out.

Kyle Auble, could you please confirm this issue exists with the latest development release of Ubuntu Saucy Salamander? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ . If the issue remains, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available (v3.10-rc3-saucy) following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the kernel in the mainline kernels archive directory daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.10-rc3

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags: added: kernel-bug-exists-upstream-v3.5-rc7 latest-bios-r1120j7 needs-upstream-testing
removed: kernel-bug-exists-upstream kernel-bug-exists-upstream-v3.5-rc7-quantal kernel-unable-to-test-upstream-v3.5-rc1-quantal
Changed in linux (Ubuntu):
status: Confirmed → Incomplete

Sure, can do.

Because the bug seems to lie dormant most of the time in the 64-bit build, I've tried both the 32 & 64 bit versions of Saucy Salamander. I tried the 64 bit version 4 times without seeing the bug, but it popped right up with the 32 bit version. I'll run apport from within the 32 bit version later today (need to go somewhere with wifi).

I've installed v3.10-rc3 (64-bit) of the kernel too, and so far, the PCI system is initializing the GPU fine. Since the bug is still appearing in the 32 bit ISO though and a mainline build from earlier this month, I'm thinking it's lurking there. After some googling, it looks like there is a workaround for updating a Live USB kernel. I'll use that to try the newest 32bit mainline kernel too.

On a side note, I tried all of the kernel parameters from the DebuggingACPI page (with a mainline daily build from earlier this month). While I did see one bad session, it was actually from a reboot where I forgot to change the parameters. I tried the same sequence of reboots (with ACPI modified, then with normal parameters) to see if the following session changed, but the bug didn't reappear.

apport information

tags: added: saucy
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-fixed-upstream kernel-fixed-upstream-v3.10-rc4
removed: needs-upstream-testing

Huzzah, I think you all got it! I finally found the time to do a separate install of i386 Ubuntu on the end of my harddisk for testing the 32bit version of the mainline kernel.

When I boot the 32bit version with the default kernel (v3.5), the GPU still doesn't load, but with kernel v3.10-rc4, everything checks out fine. The output from `lspci -vvnn` looks right, "boot video device" appears in dmesg, and the first PCI bridge takes under 10 ms to initialize. Although Unity 3D still won't load, that's just because the nvidia module couldn't build against the mainline kernel.

I'll keep the test partition installed a little while longer. Just let me know what else I need to do to help with closing the bug out. Thanks again for all your help.

Kyle Auble, thank you for performing the requested tests. The next step would be to perform a reverse mainline kernel commit bisect in order to find the earliest mainline kernel commit that fixed your problem. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection ?

tags: removed: kernel-bug-exists-upstream-v3.5-rc7
Changed in linux (Ubuntu):
status: Confirmed → Incomplete

Just finished the reverse bisection (managed to remember that "bisect bad" is good and vice versa). It looks like the commit that fixes the problem is something to do with memory mapping:

d34883d4e35c0a994e91dd847a82b4c9e0c31d83 by Xiao Guangrong

To be safe, I tested the kernel built at that commit 7 times in 32 bit mode, and the GPU loaded fine all 7 times. I did skip one commit towards the end because it was purely a documentation update, and it wound up being cut out in the next step anyway. I'm attaching the bisection log too, and I'll keep an eye on my emails for any updates.
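For anyone unfamiliar with a reverse bisect: the search looks for the first commit where the bug is fixed rather than the first where it breaks, which is why kernels with the fix get marked "bad" and kernels without it "good". A minimal sketch of the underlying search follows. The commit names are placeholders, and it assumes a simple linear history in which the fix, once present, stays present (real kernel history has merges, which `git bisect` handles itself).

```python
def first_fixed(commits, is_fixed):
    """Binary search for the earliest commit where is_fixed(commit) is True.

    Assumes commits are ordered oldest to newest, the oldest is broken,
    the newest is fixed, and the fix is never reverted in between.
    Returns (first fixed commit, list of commits that had to be tested).
    """
    lo, hi = 0, len(commits) - 1   # lo: known broken, hi: known fixed
    tested = []
    while hi - lo > 1:
        mid = (lo + hi) // 2
        tested.append(commits[mid])
        if is_fixed(commits[mid]):
            hi = mid               # fix is at mid or earlier
        else:
            lo = mid               # fix is after mid
    return commits[hi], tested


# Placeholder history: ten commits, with the fix landing in "c6".
history = [f"c{i}" for i in range(10)]
commit, tested = first_fixed(history, lambda c: int(c[1:]) >= 6)
print("first fixed commit:", commit)
print("kernels built and booted:", tested)
```

Each step halves the remaining range, so even a range of thousands of commits needs only a dozen or so kernel builds.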

I'm not sure what the next step in the process is, but I really appreciate everyone's help, and I'm glad it looks like we've finally pinned this thing down.

tags: added: bisect-done
tags: added: cherry-pick
description: updated

Hello again, I just thought I would make a quick update. I noticed that the commit that fixed the mainline build has been backported into recent stable kernels. Unfortunately, I'm still seeing the bug on my 32-bit kernel (v3.5.0-37), and while it hasn't popped up for the current 64-bit kernel yet (v3.2.0-51), I saw it a couple of times even after the commit had been backported (v3.2.0-49).

The booting from battery work-around still seems to work though. It's just my guess, but I'm thinking there may be an even earlier patch in the mainline kernel to the power management system. Through some very indirect logic, that patch works in tandem with the memory mapping one to resolve the problem. If anyone has any suggestions about another logging option I could try or how the power source might be affecting things, I could try another bisect on the mainline kernel.

So I've been busy, but I have some more useful info. I've confirmed the bug's still in both version 3.5.0-40 (32 bit) and 3.2.0-53 (64 bit) of the Ubuntu kernel.

However, while I've been busy, I had the idea of just splicing the changes from Xiao's patch (which we found in the last bisect) into the mainline kernel before doing a reverse bisect on commits before that one. After a couple of false starts, I was able to isolate a prior patch clearly (I tested it 10 times in various boot-up situations, and it always worked). Apparently the magic patch was a merge by Linus:

99c6bcf46d2233d33e441834e958ed0bc22b190a by Linus Torvalds

I honestly have no clue why this patch would be the earlier necessary one, and my gut feeling is that it means this bug is very tangled and subtle. I'm both busy and a little out of my league to contact the kernel mailing-list directly, but while running the bisection, I came across the name Rafael J. Wysocki a couple of times. My 2nd reverse bisection actually uses a patch by him as the earliest commit because I originally zeroed in on his commit as the next critical one. It was only after testing over 6 or 7 boots that I confirmed a bad session.

Anyways, when searching for his commits in the git log, a recent one (60f75b8e97daf4a39790a20d962cb861b9220af5) jumped out since it sounded particularly relevant. It specifically handles interaction problems between PCI bridges and ACPI, then mentions graphics adapter detection as a major justification. I'm guessing you may already be in touch with him, but if not, it sounds like he might be a good person to talk to.

summary: - GPU loads unreliably, possible kernel timeout
+ 10de:0426 GPU loads unreliably, possible kernel timeout
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We are approaching release and would like to confirm if this bug is still present. Please test again with the latest development kernel and indicate in the bug if this issue still exists or not.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get dist-upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.11.0-7.14

I was a little confused about exactly which version of the kernel you wanted to test, but it's a moot point because all of the ones I tried had the bug still.

I'm still using Ubuntu 12.04, so `sudo apt-get dist-upgrade` just keeps me on v3.5.0-40, which definitely has the bug. I also tested v3.11-rc1 (built July 14) off of the Ubuntu Mainline PPA, v3.11-rc5 (it had the patch by Rafael Wysocki I mentioned previously), and v3.12-rc1 (the latest version). Every single one showed the bug, which I confirmed by checking the dmesg logs after booting up with a stable kernel. Actually, none of those three kernels even made it to the login screen.

It's a little unnerving that whatever changes fixed the bug around May this year have been canceled out since then. Seeing the glass half-full though, once I have some free time, I can do a standard bisection to see where the fix was knocked out. That might give us a little more data to work with.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

So after another couple of months, I've managed to do more testing, and I may have found something useful. First off, two fresh, stable versions of the Ubuntu kernel have shown the bug: v3.2.0-56 (64-bit) and v3.5.0-44 (32-bit). Also, the most recent package from the Ubuntu mainline kernel PPA, v3.13.0-rc3 (built Dec. 6), showed the bug and failed to boot. I confirmed the bug by looking at the old dmesg log after rebooting into a working kernel.

On the positive side, after a little more free time and thinking about the problem, I can give you a commit that may be canceling out the effect from Xiao Guangrong's earlier patch. Instead of using git-bisect, I narrowed down the problem to a small range from previous tests, then manually checked the merges in the mainline kernel's history. After tracing the regression to a simple merge, I rebased the short side branch leading to it onto the preceding, bug-free commit. I don't know if this method would give false results, but I figured since the merge itself involved no extra changes and the rebase didn't cause any conflicts, it should be useful.

The patch where the bug reappeared for me was:
ee8209fd026b074bb8eb75bece516a338a281b1b by Andy Shevchenko

Hope this helps some, and let me know if there's anything else I could try.

tags: added: bot-stop-nagging
removed: kernel-request-3.11.0-7.14
Changed in linux (Ubuntu):
status: Confirmed → Triaged

Kyle Auble, thank you for your commit bisection work. One thing that would be helpful is to revert the noted commit in the latest mainline and see whether the problem still occurs. You can do that by running the following in a terminal, rebooting, and testing the new kernel:
    git config --global user.email "<email address hidden>" &&
    git config --global user.name "Your Name" &&
    cd $HOME &&
    git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git &&
    cd linux &&
    git revert ee8209fd026b074bb8eb75bece516a338a281b1b &&
    git add . &&
    git commit &&
    cp /boot/config-`uname -r` .config &&
    yes '' | make oldconfig &&
    make clean &&
    make -j `getconf _NPROCESSORS_ONLN` deb-pkg LOCALVERSION=-customrevert &&
    cd .. &&
    sudo dpkg -i *.deb &&
    git fetch origin
    git fetch origin master
    git reset --hard FETCH_HEAD

Changed in linux (Ubuntu):
status: Triaged → Incomplete

Hmm... so I've just finished a first set of tests with reverting that commit, and I definitely have results, though they aren't as cut-and-dried as I hoped. When I reverted the commit, there were merge conflicts in:
drivers/acpi/scan.c
drivers/dma/acpi-dma.c

Since I really have no clue how these files work, I used git mergetool to try simple ways of resolving the conflicts. I tried completely reverting both files to the older version and leaving them as they are at the tip of the master branch. In both of those cases, the kernel failed to build, with make throwing an error when it reached the corresponding file, then completely stopping soon after with a [deb-pkg] error. What's interesting is that when I kept drivers/acpi/scan.c in its up-to-date form but entirely reverted drivers/dma/acpi-dma.c, the kernel built successfully.

Unfortunately, when I tried testing it, that kernel build froze during boot, and when I logged in with a stable kernel to check the dmesg logs, the bug was still there. I would need to actually take the plunge and spend a while learning how the code works before I could resolve the conflict more precisely. However, I noticed the build process created a debug package this time; is there some debug setting that I could enable that would shed light on anything?

Changed in linux (Ubuntu):
status: Incomplete → Triaged

It's been a while, but I've found the time to dig much deeper into this and familiarize myself with the kernel code some. Actually, I feel comfortable with the idea of directly contacting the appropriate mailing list now, so this is more to keep the record up-to-date than a request for more triage.

Anyways, after just walking through the kernel code, I realized that the first sign of the bug (the 30ms gap) was occurring somewhere within the function pci_scan_child_bus (in drivers/pci/probe.c), between where it invokes the function pci_scan_slot (also in drivers/pci/probe.c) and where it invokes the function pcibios_fixup_bus (in my case, under arch/x86/pci/common.c).

From there, I began adding dev_info statements around the function calls that would be executed in between, then looked at which 2 messages the gap fell between to narrow down the problem further. After a few rounds of this, I found the delay consistently appearing within the function pcie_aspm_configure_common_clock (in drivers/pci/pcie/aspm.c). After a little research into what the PCIe common clock does, I found it actually explains several aspects of this bug. Booting the computer from battery power would influence the power state of the device, which is what ASPM is all about. And it turns out the discrepancy of 24ms between a good boot (at most 6ms) and a bad boot (30ms) is precisely the length of time the PCIe standard defines as a timeout for link training.
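The "look for the gap between two messages" step can be automated with a small filter over a saved dmesg log. This is a rough sketch rather than the exact method used here: the function name `dmesg_gaps` is made up, and it assumes the log lines start with the usual `[  seconds.microseconds]` timestamp.

```shell
# Print every dmesg line that follows a gap of at least THRESHOLD_MS
# since the previous timestamped line.
# Usage: dmesg_gaps THRESHOLD_MS < dmesg.log
# (hypothetical helper; assumes "[    0.123456] message" style lines)
dmesg_gaps() {
    awk -v limit_ms="$1" '
        # Only lines that begin with a "[ seconds.fraction]" timestamp count.
        match($0, /^\[ *[0-9]+\.[0-9]+\]/) {
            # Strip the brackets and convert the timestamp to a number.
            ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
            gap_ms = (ts - prev) * 1000
            if (seen && gap_ms >= limit_ms)
                printf "%.1f ms before: %s\n", gap_ms, $0
            prev = ts
            seen = 1
        }
    '
}
```

With the numbers from this report (at most 6ms on a good boot, 30ms on a bad one), a threshold somewhere in between, say `dmesg_gaps 20 < dmesg.log`, should flag the bad boots without tripping on the good ones. The threshold of 20 is just a guess at a sensible midpoint.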

Unfortunately, I don't know how, or even if, the two commits I found earlier directly tie into this. It seems there's a really weird race condition or resource fight going on. I'm not exactly sure how to fix the problem cleanly either, because just adding the overhead of dev_info statements to the function makes the bug go away (so I can technically "fix" the bug, but that's just a total hack). The one other little clue I found was that the delay went away completely when I put dev_info statements in every possible branch of the function's logic. When I only added dev_info to the ifs corresponding to a problem though, a slight delay appeared (bumping the total time in the function to around 10ms), but still not enough for link training to time out (so my GPU always loaded).

I plan on mailing the list for the PCI subsystem of the kernel soon, but I'm stumped about how exactly to proceed, so if you have any debugging suggestions, I'd be happy to hear them. Thanks again.

Just wanted to add here that I think I've found an even simpler workaround. It looks like passing "pci=bios" as a kernel parameter consistently allows the GPU to load, regardless of kernel version or power source. I haven't tested it a whole lot, but so far it has worked 100%.
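To make the parameter stick across reboots on Ubuntu, it can be added to the GRUB defaults file. The helper below is only a sketch: the name `add_kernel_param` is made up, it assumes GNU sed (as shipped with Ubuntu), and it assumes the file has a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, as the stock /etc/default/grub does.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT unless it is
# already there.
# Usage: add_kernel_param PARAM [GRUB_DEFAULTS_FILE]
# (hypothetical helper; PARAM must not contain "/" or regex characters)
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Nothing to do if the parameter is already on the default command line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]$param[\" ]" "$file"; then
        return 0
    fi
    # Insert the parameter just before the closing quote (GNU sed -i).
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\".*\)\"/\1 $param\"/" "$file"
}
```

Run from a root shell, that would be `add_kernel_param pci=bios` followed by `update-grub` and a reboot; `cat /proc/cmdline` afterwards shows whether the parameter actually reached the kernel.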
