Comment 622 for bug 1690085

ison (ison-linux-kernel-bugs) wrote:

I have tested kernel 5.0 and still experienced the soft lockups just as frequently as with 4.20.
However, I am currently testing 5.0 with PAGE_TABLE_ISOLATION disabled, and that seems like it may be a fix.

It may be a bit premature to say that, though, since my experience with this issue (on a 2700X) has been extremely inconsistent. I get the soft lockups, but in addition to the lockups (and usually preceding them) my whole system seems to become unstable: applications start segfaulting or turning into zombies that can't be killed, and if I just let it go, the lockup follows.
What's strange is that once this sort of "instability" occurs it seems to persist even across reboots, and I can only keep my machine running for 20-30 minutes before things start segfaulting or locking up again. It usually takes me 5 or 6 restarts before it seems to hit some sweet spot and become "stable". Once it's stable it stays stable for days, even if I restart the machine.

Given the above, I can see that narrowing down the problem could be very difficult, since a change might seem to work for a while until it doesn't.

At any rate, the solution I'm currently testing was proposed in this Gentoo thread:
https://forums.gentoo.org/viewtopic-t-1074860-start-0.html
which I stumbled upon after getting a kernel panic and noticing this error:
>Unexpected reschedule of offline CPU#0!
From what I can tell "offline CPU" sounds very relevant to our issue.

Their solution was to disable PAGE_TABLE_ISOLATION in the kernel. This also intrigues me, as I remember seeing other errors in dmesg relating to "page" writing.
Apparently that kernel option isn't even necessary for AMD CPUs anyway. It was meant to mitigate a vulnerability (Meltdown) in Intel CPUs, so it can be safely disabled on AMD.
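
For anyone who wants to check whether page table isolation is actually active on their machine before rebuilding anything, here is a rough sketch. It assumes a kernel that exposes /sys/devices/system/cpu/vulnerabilities/meltdown, and it only looks for the `nopti` and `pti=off` boot parameters (a boot-time alternative to turning off the config option). The helper name `pti_summary` is made up for illustration, and this is not the method from the Gentoo thread.

```python
from pathlib import Path

# Standard locations on recent kernels; older kernels may lack the sysfs file.
MELTDOWN_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/meltdown")
CMDLINE = Path("/proc/cmdline")


def pti_summary(meltdown_text, cmdline_text):
    """Summarize PTI state from the sysfs status line and the kernel command line.

    Hypothetical helper: takes plain strings so it can be checked without
    touching the real system files.
    """
    status = meltdown_text.strip()
    params = cmdline_text.split()
    return {
        # Typically "Not affected", "Mitigation: PTI" or "Vulnerable".
        "meltdown_status": status,
        "pti_active": status == "Mitigation: PTI",
        # True if PTI was explicitly turned off on the boot command line.
        "disabled_at_boot": "nopti" in params or "pti=off" in params,
    }


if __name__ == "__main__":
    meltdown = MELTDOWN_SYSFS.read_text() if MELTDOWN_SYSFS.exists() else "unknown"
    cmdline = CMDLINE.read_text() if CMDLINE.exists() else ""
    print(pti_summary(meltdown, cmdline))
```

If `pti_active` is already false on your AMD system, then disabling the option probably changes nothing for you, which would be worth mentioning when reporting results here.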

NOTE: I am testing this with some of the BIOS modifications recommended by others here (such as disabling the C6 state, "Typical Current Idle", etc.).
If this solution works for another week or so I'll try restoring the BIOS settings to their defaults and testing again.