Comment 45 for bug 1751268

Ahabig (ahabig) wrote:

Sorry for the long silence. Progress has been slow: it takes a while to do a tweak/test/reboot/hang cycle, especially since the disk controller gets munged when the bug hits, and that even persists across a power cycle (!) about 50% of the time (I have no idea how such a thing is even possible). And this device is my "get work done" laptop, running the 4.14 kernel, which is perfectly happy.

Anyway: tried a different approach. Threw in a spare disk, and tried a clean install of F29 and Kubuntu 18.04.1, in case cruft was to blame.

F29 installed from dvd... but the kernel installed to disk didn't boot once the first install pass finished.
Kubuntu's install dvd wouldn't even boot past the spinny "it's starting to boot" graphics screen.

In both cases, since the boots are graphical and "quiet", there's no feedback on exactly what went wrong, and the system hangs too hard to switch to a different vtty, but it felt the same as the hangs described above.

Trying the latest/last F27 kernel (yes, it's EOL now), 4.18.19-100, I went through all permutations of the integrated, discrete, and optimus bios settings, of excluding the nouveau driver, and of the nvidia proprietary blob. Sometimes it gets a clean boot - but then the sata controller goes out to lunch as soon as a write cache flush happens. Which makes me think the kms problem that started this thread is a symptom rather than the root problem, just the one that usually triggers first. That's also consistent with the trouble getting log traces of the hangs described above, because a lunched sata controller can't log errors.
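(For anyone retracing this: by "excluding nouveau" I just mean the usual blacklist route, something along these lines:)

    # /etc/modprobe.d/blacklist-nouveau.conf
    blacklist nouveau
    options nouveau modeset=0

    # rebuild the initramfs so it sticks at boot (Fedora)
    sudo dracut -f

    # or, equivalently, on the kernel command line:
    #   modprobe.blacklist=nouveau nouveau.modeset=0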

Went back to drm-tip. See that it's kernel 5.0 now, cool.

This minimally configured kernel continues to work. I've enabled the extra features needed to run the laptop, no problems.

So: the bug is, at the very least, triggered by one of the (myriad) enabled kernel options in the distro stock kernels. It feels to me like the old days of unprotected flat memory space, where you could POKE random values into random addresses and watch the system fall apart: the initial kms call is the most sensitive to it, and things unravel from there into thrashing the disk controller.

Parameter space is too vast for me to find the culprit with intermittent effort and a logging system that's often the first victim of the bug. So I'm ready to punt; documenting this here, in case someone else with more clues googles it, is the only remaining thing I can do.
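(For whoever picks this up: the kernel tree ships scripts/diffconfig, which would at least enumerate the suspects. A rough sketch, with both config paths being placeholders for wherever your copies live:)

    # run from a checked-out kernel tree; compares my working minimal config
    # against the distro config for the currently running kernel
    scripts/diffconfig /path/to/minimal-drm-tip.config /boot/config-$(uname -r)
    # prints one line per option that differs, e.g. "+FOO y" or "FOO y -> n"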

Time to just return to the 1990s and compile my own kernel :( At least git now makes tracking updates easier than it used to be in the Bad Old Days.
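(For completeness, the cycle I'm settling into is roughly the following; the job count and sudo usage are just how I happen to do it:)

    # in the already-cloned drm-tip tree, starting from the working minimal .config
    git pull
    make olddefconfig              # fold any newly-added options into the existing .config
    make -j$(nproc)
    sudo make modules_install install
    # on Fedora, "make install" hands off to the distro installkernel hook,
    # which takes care of the initramfs and boot entry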