Comment 44 for bug 1009312

Kyle Auble (auble48-deactivatedaccount) wrote : Re: GPU loads unreliably, possible kernel timeout

I know it's been a while, but I thought I should report that I'm still seeing this bug. I recently installed a fresh mainline kernel (3.9.0-999-generic, built on 4/21), and it runs fine otherwise, but the bug still occurs. I want to try some of the ACPI options again at boot, but the problem remains that I have no reliable way to reproduce the bug.
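For what it's worth, here's a sketch of how I plan to retest boot options. The specific parameters shown (`pcie_aspm=off`, `pci=noacpi`) are just candidates I intend to try one at a time, not known fixes:

```shell
# Candidate kernel parameters (assumptions to test one at a time, not fixes):
#   pcie_aspm=off   - disable PCIe Active State Power Management
#   pci=noacpi      - don't use ACPI for PCI configuration/routing
#
# One-off test: press `e` at the GRUB menu and append the parameter to the
# `linux` line. Persistent test on Ubuntu: edit /etc/default/grub, e.g.:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pcie_aspm=off"
#   then run: sudo update-grub && sudo reboot
```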

After looking through my PCI info, I have a rough hypothesis about where the problem is happening. Although the GPU loses PCI features like bus-mastering & ASPM on bad sessions, my gut feeling is that these are side effects of an underlying PCI/ACPI issue, since multiple devices raise error flags on bad sessions. I'm wondering if it has something to do with how space is being allocated for DMA, since the 32-bit memory region for the GPU is always treated as virtual in bad sessions. That might also explain why the bug was so common under the PAE kernel.
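In case anyone wants to compare good and bad sessions, here's a rough helper I've sketched for snapshotting those two flags from `lspci -vv` output. The function name is my own invention, and `01:00.0` below is just a placeholder for the GPU's address (check `lspci | grep VGA` for the real one):

```shell
#!/bin/sh
# Hypothetical helper (not part of any tool): reads `lspci -vv` text on
# stdin and summarizes the bus-mastering and ASPM state it finds there.
pci_flags() {
    input=$(cat)
    # The device "Control:" line shows BusMaster+ (enabled) or BusMaster- .
    case $input in
        *BusMaster+*) echo "BusMaster: on" ;;
        *)            echo "BusMaster: off" ;;
    esac
    # The "LnkCtl:" line reports "ASPM Disabled" when ASPM is off.
    case $input in
        *"ASPM Disabled"*) echo "ASPM: disabled" ;;
        *)                 echo "ASPM: not reported as disabled" ;;
    esac
}
# Usage on a live system (root needed for the full capability dump):
#   sudo lspci -vv -s 01:00.0 | pci_flags
```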

The one other thing I realized is that my GPU is the only device that sits behind a PCI-to-PCI bridge straight off the root port. Every other device on my system either routes directly to the root port (like the audio device, 00:1b.0) or hangs off a bridge on a secondary PCI Express port (prefix 00:1c). Since the timing discrepancy shows up exactly when the GPU's PCI-to-PCI bridge is set up, I wonder if this is why the GPU is the one device that fails. I've attached the output of `lspci -t` so you can see my PCI arrangement.
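If it helps anyone compare, the same topology can also be read from sysfs, where each device's resolved path encodes its bridge chain (root complex -> bridge(s) -> device). This is just a sketch; the optional base-path argument is only there so it can be tried on a dummy directory:

```shell
#!/bin/sh
# Sketch: print each PCI device alongside its resolved sysfs path. A device
# behind a PCI-to-PCI bridge shows an extra path component compared with one
# attached directly to a root port.
list_pci_topology() {
    base=${1:-/sys/bus/pci/devices}    # overridable, e.g. for testing
    for dev in "$base"/*; do
        [ -e "$dev" ] || continue
        printf '%s -> %s\n' "$(basename "$dev")" "$(readlink -f "$dev")"
    done
}
# Usage: list_pci_topology | sort
```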

I'm not a kernel hacker, so these are just hunches based on the data, but if anyone who's comfortable with the kernel's PCI subsystem could suggest a test that might consistently reveal the bug, I'd be happy to keep testing things out.