Comment 71 for bug 1009312

Kyle Auble (auble48-deactivatedaccount) wrote :

It's been a while, but I've found the time to dig much deeper into this and familiarize myself with the kernel code. I now feel comfortable contacting the appropriate mailing list directly, so this comment is more to keep the record up to date than a request for more triage.

Anyway, after walking through the kernel code, I realized that the first sign of the bug (the 30ms gap) was occurring somewhere within the function pci_scan_child_bus (in drivers/pci/probe.c), between its call to pci_scan_slot (also in drivers/pci/probe.c) and its call to pcibios_fixup_bus (in my case, under arch/x86/pci/common.c).

From there, I began adding dev_info statements around the function calls executed in between, then checked which two messages the gap fell between to narrow down the problem further. After a few rounds of this, I found the delay consistently appearing within the function pcie_aspm_configure_common_clock (in drivers/pci/pcie/aspm.c). After a little research into what the PCIe common clock does, I found it actually explains several aspects of this bug. Booting the computer on battery power would influence the power state of the device, which is what ASPM (Active State Power Management) is all about. And it turns out the 24ms discrepancy between a good boot and a bad boot is precisely the length of time the PCIe standard defines as a timeout for link training.
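For anyone who wants to repeat the "which two messages does the gap fall between" step without eyeballing the log, here is a minimal sketch in Python. It assumes the default dmesg timestamp prefix ("[ seconds.microseconds]"). The sample log lines and their marker text are hypothetical placeholders, not output from my machine, and the function names are just ones I made up for illustration.

```python
import re

# Matches the "[  seconds.micros] message" prefix that dmesg prints by default.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")


def parse_dmesg(text):
    """Return (timestamp_seconds, message) pairs for every timestamped line."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    return entries


def largest_gap(entries):
    """Return (gap_seconds, message_before, message_after) for the widest gap.

    Returns None when there are fewer than two timestamped lines.
    """
    if len(entries) < 2:
        return None
    best = None
    for (t0, m0), (t1, m1) in zip(entries, entries[1:]):
        gap = t1 - t0
        if best is None or gap > best[0]:
            best = (gap, m0, m1)
    return best


# Hypothetical sample: the marker messages stand in for added dev_info calls.
SAMPLE_LOG = """\
[    0.500000] pci 0000:00:01.0: hypothetical marker: before step A
[    0.501000] pci 0000:00:01.0: hypothetical marker: after step A
[    0.531000] pci 0000:00:01.0: hypothetical marker: after step B
[    0.532000] pci 0000:00:01.0: hypothetical marker: after step C
"""

if __name__ == "__main__":
    gap, before, after = largest_gap(parse_dmesg(SAMPLE_LOG))
    print(f"{gap * 1000:.1f} ms between:\n  {before}\n  {after}")
```

To use it on a real boot, the text of a saved dmesg dump would replace SAMPLE_LOG. Each round of added dev_info statements then only needs one run of the script to show where the delay sits.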

Unfortunately, I don't know how, or even if, the two commits I found earlier tie directly into this. It seems there's a really weird race condition or resource fight going on. I'm also not sure how to fix the problem cleanly, because just adding the overhead of dev_info statements to the function makes the bug go away (so I can technically "fix" the bug, but that's a total hack). The one other little clue I found was that the delay went away completely when I put dev_info statements in every possible branch of the function's logic. When I only added dev_info to the if branches corresponding to a problem, though, a slight delay appeared (bumping the total time in the function to around 10ms), but still not enough for link training to time out (so my GPU always loaded).

I plan on mailing the list for the PCI subsystem of the kernel soon, but I'm stumped about how exactly to proceed, so if you have any debugging suggestions, I'd be happy to hear them. Thanks again.