Comment 26 for bug 1798961

KH (uy) wrote :

My latest post on this bug in the AMD community forum, copied here FYI:

https://community.amd.com/thread/225795?start=90&tstart=0

102. Re: Ryzen linux kernel bug 196683 - Random Soft Lockup
uncle yap Jan 4, 2019 3:29 AM (in response to imshalla)

Dear All,

Some good news and a discovery.

My situation has greatly improved: the system has now run for its first 5 hours without a lockup. Essentially all I did was change my Linux kernel from 4.18.0-11-generic to 4.15.0-43-generic.

I had previously also tried 4.18.0-13-generic and found it equally bad.

My strongest suspicion is that the thread scheduler in the 4.18.0-X kernels is buggy, with one and the same bug that freezes some threads at random for up to 12 hours and later unfreezes them just as randomly. I call it random because I cannot find any consistent pattern in how it freezes and unfreezes. These freezes rarely require a hard reset unless the system is left frozen for a very long time. If I notice soon enough and do a soft reset over SSH with the command sudo systemctl restart sddm, the system recovers. The service would be gdm instead of sddm if you are on Ubuntu rather than Kubuntu.

My guess for this difference (needing the motherboard reset switch versus a soft reset command) is that too many threads end up frozen when the system is left unattended for a long time. It is only a guess because I cannot afford the time to test and prove it. My reasoning is this: the kernel thread scheduler bug freezes more threads than it unfreezes as unattended time goes on, so critical kernel or driver module threads, or ssh or bash itself, may end up frozen, at which point there is no longer any chance of a soft reset or recovery.

I have confirmed that when only 1 or 2 threads are frozen, servers, ssh, bash, and even ksysguard (CPU usage / load percentage graphs) keep running, and I have never found any single CPU core or logical CPU (hyperthread) completely stuck at zero percent usage.
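
To look for a pattern in which CPUs the kernel itself reports as stuck, one option is to count its soft lockup messages per CPU. This is only a minimal sketch: it assumes the freezes are logged in the usual "BUG: soft lockup - CPU#N stuck for ..." form, which may not hold for every freeze described here, and count_soft_lockups is a hypothetical helper name.

```shell
# Sketch: count kernel soft lockup reports per CPU from log text on stdin.
# count_soft_lockups is a hypothetical helper, not an existing tool.
count_soft_lockups() {
    # Keep only the "soft lockup - CPU#N" part of each matching line,
    # then tally how often each CPU number appears.
    grep -o 'soft lockup - CPU#[0-9][0-9]*' |
        awk '{n[$NF]++} END {for (c in n) print c, n[c]}' |
        LC_ALL=C sort
}

# Possible usage (reading the kernel log may require sudo):
#   dmesg | count_soft_lockups
#   journalctl -k -b -1 | count_soft_lockups   # kernel log of the previous boot
```

If the counts are spread evenly over all CPUs, that would fit the "random" behaviour described above; a single CPU dominating would point elsewhere.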

[Image: Ksysguard1.png (ksysguard screenshot)]

When my X.org console freezes, the mouse freezes and the CPU usage graphs all freeze, but there is usually still a good chance of recovery if I quickly run my favorite reset command, sudo systemctl restart sddm, over ssh. If I was not checking and left it frozen for a long time, there was a high chance that it could no longer be recovered via ssh at all, and the reset switch became the only way to get the system rebooted.
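
Since the service name differs between Kubuntu (sddm) and Ubuntu (gdm), the soft reset can be written so that it looks the name up instead of hard-coding it. A minimal sketch, assuming a Debian-family system where /etc/X11/default-display-manager holds the path of the default display manager; dm_unit is a hypothetical helper name.

```shell
# Sketch: print the display manager name to restart after an X freeze.
# Assumption: the first line of the given file is the display manager's
# path, e.g. /usr/bin/sddm on Kubuntu or /usr/sbin/gdm3 on Ubuntu.
# dm_unit is a hypothetical helper, not an existing tool.
dm_unit() {
    dm_file="${1:-/etc/X11/default-display-manager}"
    # Fail if the file is missing so the caller does not restart a wrong unit.
    [ -r "$dm_file" ] || return 1
    basename "$(head -n 1 "$dm_file")"
}

# Possible usage, typed in an ssh session on the frozen machine:
#   sudo systemctl restart "$(dm_unit)"
```

Restarting the display manager ends the graphical session, so unsaved work in it is lost, the same as with the manual command.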

Today, when I checked my CPU idle states (C-states) via the kernel, none of them was C6. In my BIOS settings I have neither disabled C6 nor selected Typical Current Idle, nor am I booting the kernel with idle=nomwait, but I think the F4E BIOS version for my Gigabyte X470 board has disabled the C6 power state and forced Typical Current Idle:

    ~$ cat /sys/devices/system/cpu/cpu*/cpuidle/state*/name
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
    C1
    C2
    POLL
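
The same listing is easier to read when collapsed to one line per state name. A minimal sketch, assuming the cpuidle sysfs layout used in the command above; summarize_idle_states is a hypothetical helper name, and its optional argument exists only so the function can be pointed at a different directory.

```shell
# Sketch: count how many logical CPUs expose each idle state name.
# summarize_idle_states is a hypothetical helper, not an existing tool.
summarize_idle_states() {
    # Default to the real sysfs tree; another base directory may be passed in.
    base="${1:-/sys/devices/system/cpu}"
    cat "$base"/cpu*/cpuidle/state*/name 2>/dev/null |
        awk '{n[$0]++} END {for (s in n) print s, n[s]}' |
        LC_ALL=C sort
}

# Possible usage:
#   summarize_idle_states
```

Each output line is a state name followed by the number of logical CPUs that list it, so a state that is missing or present on only some CPUs stands out.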

Given the current stability, I am optimistic that no further debugging of my system will be needed for now.

My suggestion for Kubuntu/Ubuntu users is to check whether you are running a 4.18.0-X kernel and, if so, to try the older 4.15.x first, and newer versions as they are released. If your stability improves with an alternative kernel, stay with it, wait for improved kernels, and try them when they become available.
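
The version check can be done with uname -r. A minimal sketch that flags only the 4.18.0-X series reported as unstable in this comment; is_affected_kernel is a hypothetical helper name, and the pattern is an assumption based on the two 4.18.0 builds tried above.

```shell
# Sketch: flag the kernel series reported as unstable in this comment.
# is_affected_kernel is a hypothetical helper, not an existing tool.
is_affected_kernel() {
    # Use the given release string, or the running kernel's by default.
    release="${1:-$(uname -r)}"
    case "$release" in
        4.18.0-*) return 0 ;;   # the series that locked up here
        *)        return 1 ;;
    esac
}

if is_affected_kernel; then
    echo "Running $(uname -r): consider trying a 4.15.x kernel"
else
    echo "Running $(uname -r): not in the 4.18.0-X series"
fi
```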

Thanks & regards

uy