Comment 478 for bug 1690085

In , willemdreyer (willemdreyer-linux-kernel-bugs) wrote :

(In reply to ValdikSS from comment #391)
> These lockups are probably not related to this bug. I've updated by Intel
> Sandy Bridge laptop to 4.17.5 from Fedora 28 repository and now I have
> random CPU lockups, too.
> 4.17.3 worked fine.

Please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1598989 as that report matches your description. The user who reported it also has an Intel CPU (an Acer Aspire V3-771, judging by his attached screenshot). I have experienced this issue once on one of my Ryzen systems running 4.17.6, so could it be an unrelated problem? Another similar report: https://bugzilla.redhat.com/show_bug.cgi?id=1598462

With regard to the original Ryzen random soft lockup issue: I am seeing it on every kernel that I have tested so far, from 4.10 through 4.17.6. In my experience the lockups have become more frequent since kernel 4.15, but that could just be chance, as I do not know the cause of this problem.

I am running my machines at stock clocks with the latest stable BIOS updates installed as of today: a Ryzen 1700 w/ MSI PRIME X370-PRO, a Ryzen 1800X w/ ASRock Fatal1ty X370 Professional Gaming, and my personal desktop, a Ryzen 1800X w/ X470 Taichi Ultimate and a Seasonic 1000W Platinum PSU (Haswell ready). All of my CPUs are running microcode patch level 0x8001137.

I can't find any correlation with workload: the issue occurs on web servers, during VFIO gaming, and even in live USB sessions. I used the following command to try to provide some insight into which tasks were reported as stuck:
journalctl -t kernel --no-pager | grep "soft lockup" | awk -F"!" '{print $2}' | sort -u
 [Compositor:4167]
 [kworker/10:1:18249]
 [kworker/1:3:418]
 [libvirtd:1226]
 [systemd:1]
 [Web Content:3505]
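
A variant of the pipeline above that also counts how often each task shows up might look like the sketch below. The function name count_lockups is made up for illustration, and it assumes the usual "soft lockup - CPU#N stuck for Ns! [task:pid]" message format. Because the PID is part of the key, the same task name with different PIDs is counted separately.

```shell
# Hypothetical helper: read kernel log lines on stdin and print each
# task named in a soft lockup message together with its occurrence count.
# Steps: keep only lockup lines, take the " [task:pid]" part after the "!",
# strip the leading space, then count duplicates, most frequent first.
count_lockups() {
    grep "soft lockup" |
        awk -F'!' '{ print $2 }' |
        sed 's/^ *//' |
        sort | uniq -c | sort -rn
}

# Usage on a live system:
# journalctl -t kernel --no-pager | count_lockups
```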

I have tried CPU pinning and disabling ASLR completely. I have also tried isolating workloads in virtual machines that do not cross CCX units, with hugepages (THP off). I am not an expert on CCX, NUMA, Infinity Fabric, etc., so I could have made a mistake in my tests. That said, I am currently testing the "idle=nomwait" kernel parameter in the hope of getting better results, perhaps even a stable system.
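
For anyone wanting to confirm that settings like these actually took effect after a reboot, something along the lines of the sketch below could be used. The helper name has_param and its optional file argument are made up for illustration; the /proc and /sys paths are the standard locations for the kernel command line, the ASLR setting, and the THP mode.

```shell
# Hypothetical helper: succeed if PARAM appears as a whole word on the
# kernel command line. The optional second argument (default /proc/cmdline)
# exists only so the check can be tried against a sample file.
has_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Usage on a live system:
# has_param idle=nomwait && echo "idle=nomwait is active"
# cat /proc/sys/kernel/randomize_va_space          # 0 means ASLR is disabled
# cat /sys/kernel/mm/transparent_hugepage/enabled  # brackets mark the active mode
```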