Comment 813 for bug 1690085

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

This thing can happen due to multiple factors.
I was running a 1800X. Freezes ocurred in 24-48h of uptime. Disabling Global C-State or enabling typical power idle in UEFI stopped those freezes. The latter option disables PC6 on top of using 0.85V idle voltage. To disable PC6 you need a cold boot, otherwise only the voltage change is applied and PC6 remains enabled, crashing the system as usual.

I upgraded to a 2700X (UEFI cleared) and apparently the issue disappeared. But nope, it's just less frequent. Much less. Currently I have an uptime of 23d. I also had another 2700X running for 17d before testing again a 1800X which crashed within 24h, afterwards I inserted the current 2700X which crashed within 48h. The following boot is the one with an uptime of 23d.

All in all, this might be related to either a PSU thing or to a non-existent but required kernel workaround for a bug in the processor, as detailed in the AMD Revision Guide [1]. The 1800X would crash more because it has more bugs and workarounds needed, while the 2700X has fewer, specially related to the PCIe controller. Ironically, the 2700X consumes less power at idle than the 1800X because it requires lower voltages: 12nm+ vs 14nm, and the voltages specified at every power level are also lower. I don't see much logic in saying that the PSU is the culprit of system instability.

All the people trying "idle=nomwait", "idle=halt" or "processor.max_cstate=5" should be warned those options are useless. There are only 2 Cstates available in Ryzen systems, so if you want to limit Cstate you have to set it to 1 at most -> "processor.max_cstate=1". And the use of the MWAIT instruction is disabled by the UEFI if you insert a Ryzen 1800X processor. The 2700X and the rest of 2nd gen Ryzen are not affected by any MWAIT bug. Idle option is thus useless too.

It's better to try with "pcie_aspm=off" or "pcie_aspm=force" pcie_aspm.policy=performance" and or "nvme_core.default_ps_max_latency_us=0". Maybe the PCIe root or anything related does not wakeup and the processor stalls waiting for an interrupt to be served.

[1] https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf