Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent
This bug affects 39 people
Affects         Status     Importance  Assigned to  Milestone
Linux           Expired    Medium
linux (Ubuntu)  Confirmed  High        Unassigned

Bug Description

Hi,

We are getting various kernel crashes on a fairly new configuration.
We're using a Ryzen 1800X CPU with an X370 Gaming Pro Carbon motherboard (7A32V1) running the latest available BIOS (1.52).

We are running Ubuntu 17.04 (amd64). We've tried different kernel versions: the native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel versions:

native 17.04 kernel
4.10.15

The issue is the same on all of them: the machine freezes at random.

Here is the kern.log entry from when it happens:

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14
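
For anyone comparing their own kern.log against the entries above, here is a minimal sketch of how the stall reports could be pulled out of the log. The function names and the regular expressions are my own assumptions based only on the line layout shown in this report; other kernel versions may format these lines differently.

```python
import re

# Patterns modelled on the kern.log lines quoted in this report (an assumption,
# not a stable kernel interface).
TIMESTAMP = re.compile(r"\[\s*(\d+\.\d+)\]")
STALLED_CPU = re.compile(r"\]\s+(\d+)-\S*: \(")
DETECTED_BY = re.compile(r"\(detected by (\d+), t=(\d+) jiffies")


def parse_rcu_stalls(log_text):
    """Return one dict per rcu_sched stall report found in the text."""
    reports, stalled = [], []
    for line in log_text.splitlines():
        if "detected stalls on CPUs/tasks" in line:
            stalled = []  # a new report starts here
            continue
        cpu = STALLED_CPU.search(line)
        if cpu:
            stalled.append(int(cpu.group(1)))
            continue
        det = DETECTED_BY.search(line)
        if det:
            ts = TIMESTAMP.search(line)
            reports.append({
                "time": float(ts.group(1)) if ts else None,
                "stalled_cpus": stalled,
                "detected_by": int(det.group(1)),
                "jiffies": int(det.group(2)),
            })
            stalled = []
    return reports


def estimate_tick_rate(first, second):
    """Jiffies per second, estimated from two reports of the same stall."""
    return (second["jiffies"] - first["jiffies"]) / (second["time"] - first["time"])


SAMPLE = """\
May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
"""

if __name__ == "__main__":
    for report in parse_rcu_stalls(SAMPLE):
        print(report)
```

Applied to the two reports above, the jiffies counter advances by 45005 over about 180 seconds, which suggests a tick rate of roughly 250 per second. Under that assumption, t=1860207 jiffies means CPU 0 had already been stalled for about two hours when the first report was logged.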

Depending on the kernel version, we've also got NMI watchdog errors about a stuck CPU (the reported core ID varies at random).
The crash happens at random, but generally after a few hours (3-4h).

We installed kernel 4.11.0-041100-generic #201705041534 this morning and are now waiting for a crash...
For now, the machine is not in use; at least, it is not under CPU stress...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Revision history for this message
Vincent (hvincent13) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Vincent (hvincent13) wrote : JournalErrors.txt

apport information

tags: added: apport-collected zesty
description: updated
Revision history for this message
Vincent (hvincent13) wrote : ProcCpuinfoMinimal.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Lipido (lipido) wrote :

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Revision history for this message
Vincent (hvincent13) wrote :

Hi Lipido,

Yes, I'm still experiencing crashes, even with 4.11.

Please find the full kern.log attached.

Regards,

Revision history for this message
Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear after 24-48h of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in BIOS (from 12 threads to 6 threads)
- Boot with clocksource=tsc iommu=soft
- Disable IOMMU in BIOS

None of these worked. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that it will not help either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Revision history for this message
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Revision history for this message
Vincent (hvincent13) wrote :

Just got a kernel panic...

How do I add the tag kernel-bug-exists-upstream?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to file this bug report on the mailing list.

Thanks,

Revision history for this message
Vincent (hvincent13) wrote :

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I will wait 24-48h and post again.

Regards,

Revision history for this message
camparijet (iichikolamp) wrote :

Hi Vincent,

I also hit this problem, and your suggestion of removing "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting, but it has now been running without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Revision history for this message
Vincent (hvincent13) wrote :

Hi,

5 days of uptime now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

Revision history for this message
camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you install the NVIDIA driver instead?

No. I don't need the card for my purposes, so I simply disabled it.

Revision history for this message
Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Revision history for this message
Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Revision history for this message
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
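
Before changing this setting it may help to check what it currently is. The sketch below only reads the value; the helper names are my own, and the meaning of the three modes is taken from the kernel's sysctl documentation as I understand it (0 off, 1 partial, 2 full), so treat the descriptions as a summary rather than a definition.

```python
from pathlib import Path

# Same sysctl file that the echo command above writes to.
ASLR_PATH = Path("/proc/sys/kernel/randomize_va_space")

ASLR_MODES = {
    0: "disabled",
    1: "conservative (stack, mmap base and VDSO randomized)",
    2: "full (conservative plus heap/brk randomized)",
}


def describe_aslr(raw_value):
    """Translate the raw sysctl content (e.g. '2\\n') into a description."""
    mode = int(raw_value.strip())
    return ASLR_MODES.get(mode, f"unknown mode {mode}")


def current_aslr():
    """Read the live setting; returns None if it is unavailable or unparsable."""
    try:
        return describe_aslr(ASLR_PATH.read_text())
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    print("ASLR:", current_aslr())
```

Note that the echo command only changes the running system; the setting returns to its default after a reboot unless it is applied again at boot.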

Changed in linux (Ubuntu):
status: Invalid → New
Revision history for this message
Joseph Salisbury (jsalisbury) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Revision history for this message
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
In that case, disabling ASLR is not sufficient!
People suggest increasing SOC voltages, using specific kernel versions, and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Revision history for this message
Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Revision history for this message
Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Revision history for this message
Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (eg. 17), setting them to the next even number (eg. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR is still necessary though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide a link showing how DragonFlyBSD fixed this issue?

Revision history for this message
Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.

Revision history for this message
Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Revision history for this message
Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Revision history for this message
Torge (cyslider) wrote :

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an Asrock X370 Gaming professional

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen: once every few minutes it seemed to respond for a short time, which let me switch to a TTY. There I always saw some processes stuck at 100% CPU that I could not kill, or whose kill was delayed for a long time. I always see the errors from the initial post during that time. The TTY itself seems to work fine once entered, though.

Revision history for this message
David Wilson (plottt) wrote :

I've run into this problem several times. After disabling C-states in the motherboard BIOS, my system has been running stable for ~5 weeks of uptime so far.

Revision history for this message
Torge (cyslider) wrote :

OK, this trick did not help, and my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed oibaf a while back because my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf again and now everything seems fine. No more lock screen freezing for me.

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257955
Example Log

Full description of the problem/report:

Ryzen 1700x (8 cores) system randomly hard freezes with a soft lockup bug and a hard reset is needed. The system doesn't respond anymore to mouse or keyboard input. The problem typically occurs when the load is low while web surfing - stress testing with e.g. mprime is no problem. Example Log: Please see attachment.

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DOM Worker:1364]

Kernel version:
Linux version 4.12.4-2-MANJARO (builduser@ste74-i5) (gcc version 7.1.1 20170630 (GCC) ) #1 SMP PREEMPT Sun Jul 30 17:04:04 UTC 2017

Environment - Software:
Manjaro Linux - Stable Update 2017-08-10; KDE Plasma; Graphics driver: linux-nvidia 1:375.82-1-any

Environment - Additional information:
Attachments follow.

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257957
cpuinfo

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257959
iomem

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257961
ioports

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257963
lspci -vvv

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257965
modules

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Created attachment 257967
scsi

Revision history for this message
Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

Revision history for this message
In , goric (goric-linux-kernel-bugs) wrote :

I have the exact same problem with the same setup (Ryzen 1700X).

I have experienced the problem with kernels 4.4, 4.10 and now 4.12.8. The system also freezes without X and without the proprietary NVIDIA drivers.

Revision history for this message
xb5i7o (xb5i7o) wrote :

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced, for Global C-States, that I just disabled today, and I am waiting to see what happens.

On BIOS v3.00 the PC would freeze after 6 hours of idling. I would come back to find my keyboard, mouse and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Does anyone know where the SoC voltage setting is in the BIOS? Is it the VDD SoC?
Does anyone know how to update my kernel to 4.10 if it tells me I have the latest?

Revision history for this message
In , wim (wim-linux-kernel-bugs) wrote :

Bug confirmed. Will revert patch 1fccb73011ea8a5fa0c6d357c33fa29c695139ea.

Revision history for this message
Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Wim Van Sebroeck from comment #8)
> Bug confirmed. Will revert patch 1fccb73011ea8a5fa0c6d357c33fa29c695139ea.

Are you sure? That commit concerns iTCO_wdt, which is an Intel driver?

I have seen similar freezes many times. I captured the last time with netconsole and admittedly the call trace was different but I've seen many reports of Ryzen freezes on a variety of motherboards and disabling C6 or enabling CONFIG_RCU_NOCB_CPU is always said to work around it. Both of these certainly work here.

I must stress that this issue must not be confused with the segfault issue commonly reported with early Ryzens. I was facing that issue too and had my CPU replaced, but while the segfaults have gone, both my original and the replacement exhibit these freezes.

I believe we have not seen so many reports of this issue, partly because most users will not know how to get kernel output in this situation, and partly because Fedora's kernel already has CONFIG_RCU_NOCB_CPU enabled. I am using Gentoo but I have confirmed that this affects Debian.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on latest mainline kernel?

Revision history for this message
In , wim (wim-linux-kernel-bugs) wrote :

Oops. Sorry. That comment was for bug 196509.

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

Thanks to James Le Cuirot (2017-10-05 20:23:40 UTC). My Ryzen 5 1600 build hung at random with the AMD drm-next-4.15-wip kernel, mainline 4.14-rc1 and rc2, and 4.13.5 custom non-debug kernels with a 1000 Hz timer. It took 3 weeks to find this solution.

X froze at random when scrolling web content in the Chrome Beta browser. I thought it was a USB 3.0 problem at first and moved my mouse to a USB 2.0 port. Then I found help for a Chrome bug:
https://askubuntu.com/questions/765974/chrome-freeze-very-frequently-with-ubuntu-16-04
It did not help.

Now uptime is 7 hours and 19 minutes and the system looks stable. I enabled GPU support in Chrome again and the system is not freezing while surfing.

xfce@ryzen5pc:~$ screenfetch
xfce@ryzen5pc
OS: Debian unstable sid
Kernel: x86_64 Linux 4.14.0-rc2
Uptime: 7h 19m
Packages: 1942
Shell: bash 4.4.12
Resolution: 1920x1080
DE: XFCE
WM: Xfwm4
WM Theme: Default
GTK Theme: Xfce [GTK2]
Icon Theme: Tango
Font: Sans 10
CPU: AMD Ryzen 5 1600 Six-Core @ 12x 3.194GHz [39.6°C]
GPU: AMD/ATI Baffin [Polaris11]
RAM: 1150MiB / 7989MiB

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

With RCU_NOCB_CPU you need to have the following in the kernel command line:
rcu_nocbs=0-11

Ryzen 5 1600 has 12 threads so change the upper limit according to your cpu thread count.
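
To avoid counting threads by hand, the range can be derived from the number of logical CPUs. This is only a sketch: the helper name is made up, and it assumes the CPUs are numbered contiguously from 0, which matches the 0-11 example above.

```python
import os


def rcu_nocbs_param(thread_count):
    """Build an rcu_nocbs= argument covering CPUs 0..thread_count-1."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    if thread_count == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{thread_count - 1}"


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, i.e. hardware threads.
    print(rcu_nocbs_param(os.cpu_count() or 1))
```

On a Ryzen 5 1600 with SMT enabled this gives rcu_nocbs=0-11; on a 16-thread part such as the 1700X it gives rcu_nocbs=0-15.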

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #12)
> With RCU_NOCB_CPU you need to have the following in the kernel command line:
> rcu_nocbs=0-11

Correct, there used to be CONFIG_RCU_NOCB_CPU_ALL until very recently. I don't know why it was removed. I still see this as a workaround though as I gather it effectively avoids C6.

ptheis, could you please confirm whether these workarounds work for you and reassign the bug to someone more appropriate? This isn't a watchdog bug, the watchdog just gets triggered by the problem.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

I wonder if this is going to start affecting Fedora 27 now that CONFIG_RCU_NOCB_CPU_ALL has gone. I grabbed a rawhide kernel RPM and that entry is indeed missing from the config. It seems unlikely they would add rcu_nocbs to everyone's command line.

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #13)
> (In reply to fin4478 from comment #12)
> > With RCU_NOCB_CPU you need to have the following in the kernel command
> line:
> > rcu_nocbs=0-11
>
> Correct, there used to be CONFIG_RCU_NOCB_CPU_ALL until very recently. I
> don't know why it was removed. I still see this as a workaround though as I
> gather it effectively avoids C6.
>

RCU_NOCB_CPU_ALL was removed 2017-06-08 18:52:43 -0700. "The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing and
are redundant with the rcu_nocbs= boot parameter. This commit therefore
removes these three Kconfig options and adjusts the rcutorture scripts
to use the boot parameter instead."

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

This is a typical solution from tired programmers that is not end-user friendly. Other companies have a conspiracy against AMD: they want to make AMD look bad and write code like this. Another example: the Ryzen k10temp and Zen CPU select patches have been ready for several months but are not in the mainline kernel. Coffee Lake and Q CPUs are coming.

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

Alright, many thanks for the workarounds. Manjaro has seen a few kernel updates recently. The last time I saw the error was on Kernel 4.13.4-1 (current stable is 4.13.4-3). This is what I changed on my system (not sure both are needed at the same time):

- added rcu_nocbs=0-15 to kernel command line (1700x has 16 threads)
- disabled C6 power management in BIOS

I also have an eye on HW acceleration in Firefox which was enabled all the time. If the workarounds help, I will need 1-3 weeks to confirm as the system doesn't run very long on weekdays.

@James Le Cuirot: I also changed the category of this bug (I hope this is correct - this is the first time I filed a Kernel bug).

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to ptheis from comment #16)

> - added rcu_nocbs=0-15 to kernel command line (1700x has 16 threads)
> - disabled C6 power management in BIOS

You need to build a custom kernel and enable RCU_NOCB_CPU in the kernel configuration. Otherwise the rcu_nocbs parameter has no effect.

You want to save power when your CPU cores are not in use, so keep the C6 option enabled in the BIOS.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to ptheis from comment #16)
> This is what I changed on my system (not sure
> both are needed at the same time):
Only one is needed, I am not enabling these kernel options at the moment.

> @James Le Cuirot: I also changed the category of this bug (I hope this is
> correct - this is the first time I filed a Kernel bug).
The category/component you've chosen is better but the bug is still assigned to the watchdog team. Unfortunately the default assignee for this component works for Intel and I doubt they'd be able to spend time on this. Maybe you should change it to "x86-64|Platform Specific/Hardware" instead. Make sure you tick the "Reset Assignee to default" box when it appears.

(In reply to fin4478 from comment #17)
> You need to build a custom kernel and enable RCU_NOCB_CPU in the kernel
> configuration. Otherwise the rcu_nocbs parameter has no effect.
Yes, Manjaro's kernel config doesn't appear to have this enabled.

> You want to save power when your CPU cores are not used so keep the C6
> option enabled in the Bios.
Building your own kernel can be daunting for some so the BIOS option is far easier. I don't think it makes any difference power-wise as I gather RCU_NOCB_CPU effectively prevents the CPU from entering C6 anyway.

(In reply to fin4478 from comment #15)
> This is a typical solution from tired programmers that is not end-user
> friendly. Other companies have a conspiracy against AMD: they want to make
> AMD look bad and write code like this.
Let's just stick to the facts. I think the change was justified and it shouldn't be necessary to do this for Ryzen. I doubt the guy was even aware of this Ryzen issue as I discovered the RCU_NOCB_CPU workaround myself after spending several days flipping kernel options.

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #18)

> Let's just stick to the facts. I think the change was justified and it
> shouldn't be necessary to do this for Ryzen.

IBM does not have the money to build one Ryzen pc to test their code;-)
Think of all the newbies who have a new Ryzen PC and for whom even using the BIOS is difficult. The Ryzen and C6 documentation is hidden in here. Thousands of PCs with stock kernels are freezing at random now. Great for AMD;-)

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #18)
> easier. I don't think it makes any difference power-wise as I gather
> RCU_NOCB_CPU effectively prevents the CPU from entering C6 anyway.

"
Depending on the CPU / APU model, the highest boosted frequency PState usually has a C6-state requirement.
For example on FX-8370 which has 4.0GHz (P0, base), 4.1GHz (Pb1, boost) and 4.3GHz (Pb0, boost) states, the C6-state requirement is set to four. This means that the highest PState (4.3GHz, Pb0) will not activate unless half of the cores are currently in C6 active state.

If you disable C6-state this condition can obviously never be met, and the highest boosted state will never activate.
In this case the CPU will operate at the highest frequency PState which doesn't have the C6-state requirement (4.1GHz Pb1).
"

http://www.overclock.net/t/1328938/what-is-core-c6-state-exactly

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

C6 activated in BIOS ... testing custom kernel with rcu_nocbs=0-15 ...

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

Thanks for changing the category/component again but it still needs reassigning. Please set this to <email address hidden>.

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

There is no option for setting ...

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to ptheis from comment #23)
> There is no option for setting ...

If you mean the kernel configuration, you need to enable the RCU_EXPERT option to see the RCU_NOCB_CPU setting.

https://cateee.net/lkddb/web-lkddb/RCU_NOCB_CPU.html
"
depends on: ( CONFIG_TREE_RCU || CONFIG_PREEMPT_RCU ) && ( CONFIG_RCU_EXPERT || CONFIG_NO_HZ_FULL )

"
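
The dependency expression quoted above can be read as a small boolean check. The sketch below simply restates that expression for a set of enabled option names; the function name is hypothetical, and the expression is the one quoted from the LKDDb page, which may differ for other kernel versions.

```python
def rcu_nocb_cpu_selectable(enabled):
    """True if the quoted Kconfig dependencies of RCU_NOCB_CPU are satisfied.

    `enabled` is a set of option names such as {"CONFIG_TREE_RCU"}.
    """
    rcu_flavor = "CONFIG_TREE_RCU" in enabled or "CONFIG_PREEMPT_RCU" in enabled
    unlocked = "CONFIG_RCU_EXPERT" in enabled or "CONFIG_NO_HZ_FULL" in enabled
    return rcu_flavor and unlocked


if __name__ == "__main__":
    # Without RCU_EXPERT or NO_HZ_FULL the setting stays hidden.
    print(rcu_nocb_cpu_selectable({"CONFIG_TREE_RCU"}))
    print(rcu_nocb_cpu_selectable({"CONFIG_TREE_RCU", "CONFIG_RCU_EXPERT"}))
```

This is why the setting does not show up in the configuration menu until RCU_EXPERT (or NO_HZ_FULL) is turned on.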

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #24)
> (In reply to ptheis from comment #23)
> > There is no option for setting ...
>
> If you mean the kernel configuration, you need to enable the RCU_EXPERT
> option to see the RCU_NOCB_CPU setting.
>
> https://cateee.net/lkddb/web-lkddb/RCU_NOCB_CPU.html
> "
> depends on: ( CONFIG_TREE_RCU || CONFIG_PREEMPT_RCU ) && ( CONFIG_RCU_EXPERT
> || CONFIG_NO_HZ_FULL )
>
> "

Sorry, my comment was on changing the assignee. I found no option to do so.

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

I got hit by this bug on Fedora. My Ryzen 1600X system would randomly hang a short while after boot after upgrading to kernel 4.13.4. I looked at various things that could be the cause and thought it was fixed but then I rebooted and it happened again. And again. I quickly figured out that there's no problem with kernel 4.12.3 on Fedora. Tried kernel 4.13.5, same problem, went back to 4.12.3 until I wasted too much time looking into this today.

config-4.12.14-300.fc26.x86_64 on Fedora has
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

and there's no problem. config-4.13.5-300.fc27.x86_64 only has
CONFIG_RCU_NOCB_CPU=y

and with that kernel there's a problem _unless_ I add rcu_nocbs=0-11 to the kernel command line - which I only figured out after looking at this bug.
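
The comparison above can be turned into a quick check of a given kernel config and command line. This is a sketch under the assumptions made in this thread: that callback offloading is active either with CONFIG_RCU_NOCB_CPU_ALL=y or with an rcu_nocbs= boot parameter, and that the parameter does nothing without CONFIG_RCU_NOCB_CPU=y. The function names and the returned strings are my own.

```python
def enabled_options(config_text):
    """Collect the CONFIG_* names that are set to 'y' in kernel config text."""
    options = set()
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and line.endswith("=y"):
            options.add(line[:-2])
    return options


def rcu_offload_status(config_text, cmdline):
    """Classify a kernel config plus command line as described in this thread."""
    opts = enabled_options(config_text)
    has_param = any(arg.startswith("rcu_nocbs=") for arg in cmdline.split())
    if "CONFIG_RCU_NOCB_CPU" not in opts:
        return "not built in"
    if "CONFIG_RCU_NOCB_CPU_ALL" in opts or has_param:
        return "active"
    return "needs rcu_nocbs="


if __name__ == "__main__":
    fc26 = "CONFIG_RCU_NOCB_CPU=y\nCONFIG_RCU_NOCB_CPU_ALL=y\n"
    fc27 = "CONFIG_RCU_NOCB_CPU=y\n"
    print(rcu_offload_status(fc26, "ro quiet"))
    print(rcu_offload_status(fc27, "ro quiet"))
    print(rcu_offload_status(fc27, "ro quiet rcu_nocbs=0-11"))
```

On a running system the two inputs would typically come from the config file shipped in /boot for the running kernel and from /proc/cmdline, though where the config file lives depends on the distribution.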

Thank you James Le Cuirot.

This bug is listed as "Regression: No". It should be Yes in the case of Fedora; 4.12.x kernels work, 4.13.x do not work without a kernel boot parameter fix.

The commit that removed CONFIG_RCU_NOCB_CPU_ALL should please be reverted. The statement "The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing" is clearly false since these options are/were used by distributions like Fedora and removing CONFIG_RCU_NOCB_CPU_ALL *breaks* kernel 4.13.5. You can't really expect distributions to ship with/add the rcu_nocbs= parameter as an alternative.

Just to repeat this point: My first conclusion was simply that 4.13.x kernels are broken and this made me simply stick with 4.12.x which doesn't have this new problem with Ryzen CPUs until I wasted time looking into this.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to oyvinds from comment #26)
> I got hit by this bug on Fedora. My Ryzen 1600X system would randomly hang a
> short while after boot after upgrading to kernel 4.13.4.

Welcome to the club. Hopefully now that Fedora is affected, this will get more attention.

> This bug is listed as "Regression: No". It should be Yes in the case of
> Fedora; 4.12.x kernels work, 4.13.x do not work without a kernel boot
> parameter fix.

I wouldn't call it a regression, as the kernel options involved are merely a workaround for the underlying issue. Fedora just happened to be enabling them anyway until one of them went away.

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #27)
> I wouldn't call it as a regression as the kernel options involved are merely
> a workaround for the underlying issue.

That's true and we shouldn't have to apply this workaround. Still, on Fedora it's a regression in my opinion, because kernel 4.12.x (happens to) work fine and 4.13.x doesn't.

As an end user my conclusion was simply that 4.13.x is broken, and I kept using 4.12.x until I had time to look into why 4.13.x is broken. And it still is for many: my 68-year-old mother isn't going to be adding a rcu_nocbs= kernel command line option on her laptop, or do much beyond using it to check the news and weather. I am fairly sure I wouldn't be able to guide her through doing so even if I spent 10 hours on the phone; I'd have to go there. Luckily it's got some old Intel CPU.

Yes, these are kernel options that some enable and some don't, but in the case of Fedora it's: you update, you get a new kernel, you're screwed and now your machine hangs randomly, which it previously did not. Isn't that the definition of a regression?

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

My results after some days of testing: there was one freeze without errors in the journal. The second was also not visible in the logs, but the system rebooted and threw an MCE hardware error. The system is not stable yet.

Okt 16 19:32:22 pc1 kernel: x86: Booting SMP configuration:
Okt 16 19:32:22 pc1 kernel: .... node #0, CPUs: #1 #2
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: Machine check events logged
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: b2a00020003f0000
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: TSC 0 IPID 300b000000000
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1508175138 SOCKET 0 APIC 2 microcode 8001126
Okt 16 19:32:22 pc1 kernel: #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
Okt 16 19:32:22 pc1 kernel: smp: Brought up 1 node, 16 CPUs
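For reading the status word in the "Bank 3" line, a rough Python sketch follows. It assumes the usual x86 MCA status layout for the top bits (VAL=63, OVER=62, UC=61, EN=60, MISCV=59, ADDRV=58, PCC=57); that layout is my assumption and should be checked against the AMD manual, and decode_mca_status is a hypothetical helper, not a kernel or mcelog function. It does not try to interpret the bank-specific error code.

```python
# Assumed bit positions of the generic flags in an x86 MCA status register.
MCA_STATUS_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled for this error
    "MISCV": 59,  # MISC register holds extra information
    "ADDRV": 58,  # ADDR register holds a valid address
    "PCC": 57,    # processor context may be corrupt
}


def decode_mca_status(status):
    """Return {flag_name: bool} for the generic flags of a 64-bit status word."""
    return {name: bool((status >> bit) & 1) for name, bit in MCA_STATUS_BITS.items()}


# Status value taken from the log line above.
flags = decode_mca_status(0xB2A00020003F0000)
for name, is_set in flags.items():
    print(f"{name:5s} {int(is_set)}")
```

Under that assumed layout the value above has VAL, UC, EN and PCC set, which would mean a valid, uncorrected error.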

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to ptheis from comment #29)
> My results, testing some days: There was one freeze without errors in
> journal. The second was also not visible in the logs, but the system
> rebooted and threw a mce hardware error. The system is not stable yet.

How old is your Ryzen? The older ones with the random segfault issue often emitted MCE messages. I don't recall seeing any with my newly replaced Ryzen. Although segfaults are the most common symptom, I wouldn't be surprised if the fault could also trigger a freeze in some way.

Revision history for this message
Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

Revision history for this message
tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple of days to a week or so. It seems to happen after the machine has been relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Revision history for this message
Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzen CPUs manufactured after week 30. All my Ryzens are pre-week-24 CPUs and they all have this silent crash issue. Does it also happen with the post-week-30 Ryzens? (@tgui, what is yours?)

Revision history for this message
AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

Revision history for this message
tgui (eric-c-morgan) wrote :

More details.

- latest agesa
- disabled nouveau
- disabled randomized mem
- upped SOC voltage
- disabled C-states / cool-n-quiet
- using kernel 4.13.4 with RCU changes made, though only CONFIG_RCU_NOCB_CPU was available in 4.13.4

I'm open to trying other kernels if requested.

Revision history for this message
Franck Charras (franckc) wrote :

It seems that this problem is widespread, and is not related to https://bugzilla.kernel.org/show_bug.cgi?id=196481 which is linked in this thread.

Revision history for this message
tgui (eric-c-morgan) wrote :

Franck: I very much agree that it's not related to what you linked. The bug you refer to deals with segfaults during heavy compilation load, which is addressed by RMAing the CPU.

This thread seems to be mostly centered around system idle / low usage complete system crash. No logs, blank screen.

I have a movie playing on my server on loop to hopefully keep the machine from becoming too idle.... which I hate doing.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Week 33 Ryzen 1700 checking in here. I compiled my 4.13.4 kernel with CONFIG_RCU_NOCB_CPU=y, and still have crashes as described.

I'm hoping CONFIG_RCU_NOCB_CPU_ALL will be added back as an option as kernel command params seem like a hack.

Please do know I appreciate those that offer their time to the linux kernel.

Revision history for this message
tgui (eric-c-morgan) wrote :

PLEASE post your concerns to this kernel.org thread with this issue!

https://bugzilla.kernel.org/show_bug.cgi?id=196683

The CONFIG_RCU_NOCB_CPU_ALL kernel option was removed and would likely fix this issue if put back! Supposedly there is a kernel command line parameter you can pass to alleviate some of this, but that's a hack in my mind.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Update, added kernel boot param and enabled C6 and cool and quiet in the BIOS. I'll report back... I guess in a few days or a few weeks.

Revision history for this message
In , hoper (hoper-linux-kernel-bugs) wrote :

Just a small message to say "me too". I spent the last 4 months putting a lot of time and money into changing the motherboard, the CPU, the power supply... before doing some research and finding that these freezes are software related :(

I tried (and managed) to compile my own kernel with CONFIG_RCU_NOCB_CPU=y and so on. Before that, my server always crashed after 2 or 3 days. With this custom kernel, yes, it's better... The freezes only appear after 8 or 9 days. But the freezes are still here. And I guess I will just sell all this stuff and go back to Intel :(

I can't understand why this information, "LINUX + RYZEN = NOT STABLE", is not spread everywhere. Lots of people out there are losing lots of time and money, I'm sure of that. Of course I'm grateful to all open source developers (I'm also sharing what I can :) and I really hope that the root cause of this bug will be found and corrected in the next few months... We need to be able to use Linux on RYZEN! (perfect CPU for servers).

If someone manages to make this bug disappear (and gets an uptime > 30 days), please share how you did that, with enough detail for a beginner like me :)

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Hoper,

If you're still running your compiled kernel add a kernel param "rcu_nocbs=0-15", where 0-15 is for a 16 thread CPU, 0-11 for a 12 thread for example.

Here is a good link on how to add it.

https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter

Remember that you're still using a new architecture. It takes time to make Linux play nice. We've been in an intel dominated world for a very long time. ;-)
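If you are unsure which range to use, here is a small Python sketch that builds the parameter from the number of logical CPUs. The function name rcu_nocbs_param is made up for illustration; the only real API used is os.cpu_count(), which reports logical CPUs (threads), which is what the range above is based on.

```python
import os


def rcu_nocbs_param(n_cpus):
    """Build the rcu_nocbs= boot parameter covering CPUs 0..n_cpus-1."""
    if n_cpus < 1:
        raise ValueError("need at least one CPU")
    if n_cpus == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{n_cpus - 1}"


if __name__ == "__main__":
    # os.cpu_count() may return None, so fall back to a single CPU.
    print(rcu_nocbs_param(os.cpu_count() or 1))
```

On a 16 thread CPU this prints rcu_nocbs=0-15, and on a 12 thread CPU rcu_nocbs=0-11.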

Revision history for this message
In , hoper (hoper-linux-kernel-bugs) wrote :

Thanks. I did it Sunday... Waiting for the freeze :o)

Revision history for this message
Franck Charras (franckc) wrote :

I've added "rcu_nocbs=0-15" to the kernel boot parameters this morning (along with "processor.max_cstate=1" last week), will update if a new freeze occurs.

Revision history for this message
In , jon780 (jon780-linux-kernel-bugs) wrote :

I'm experiencing this bug as well, locking up every 24-48 hours roughly. I've tried adding "rcu_nocbs=0-15" to the grub boot to see if that resolves it.

Fedora 26 (kernel 4.13.9-200.fc26.x86_64)
Ryzen 1700X

Nov 03 19:54:30 pc003 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [kworker/14:1:217]
Nov 03 19:54:30 pc003 kernel: Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_co
Nov 03 19:54:30 pc003 kernel: crc32_pclmul ccp snd_timer drm ghash_clmulni_intel snd soundcore tpm_tis sp5100_tco tpm_tis_core wmi_bmof i2c_piix4 shpchp tpm parport_pc parport acp
Nov 03 19:54:30 pc003 kernel: CPU: 14 PID: 217 Comm: kworker/14:1 Tainted: P OEL 4.13.9-200.fc26.x86_64 #1
Nov 03 19:54:30 pc003 kernel: Hardware name: Micro-Star International Co., Ltd MS-7A34/B350 TOMAHAWK ARCTIC (MS-7A34), BIOS H.50 06/22/2017
Nov 03 19:54:30 pc003 kernel: Workqueue: events netstamp_clear
Nov 03 19:54:30 pc003 kernel: task: ffff8ac634d62640 task.stack: ffffaea343fdc000
Nov 03 19:54:30 pc003 kernel: RIP: 0010:smp_call_function_many+0x24a/0x270
Nov 03 19:54:30 pc003 kernel: RSP: 0018:ffffaea343fdfce0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Nov 03 19:54:30 pc003 kernel: RAX: ffff8ac63e61f398 RBX: ffff8ac63e99b580 RCX: 0000000000000000
Nov 03 19:54:30 pc003 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ac63e022b88
Nov 03 19:54:30 pc003 kernel: RBP: ffffaea343fdfd18 R08: ffffffffffffffff R09: 000000000000bfff
Nov 03 19:54:30 pc003 kernel: R10: fffffa649fe76140 R11: ffff8ac63e007c00 R12: 0000000000000010
Nov 03 19:54:30 pc003 kernel: R13: 0000000000000010 R14: ffffffff8b02d6a0 R15: 0000000000000000
Nov 03 19:54:30 pc003 kernel: FS: 0000000000000000(0000) GS:ffff8ac63e980000(0000) knlGS:0000000000000000
Nov 03 19:54:30 pc003 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 03 19:54:30 pc003 kernel: CR2: 00007fc490f23000 CR3: 000000011fe09000 CR4: 00000000003406e0
Nov 03 19:54:30 pc003 kernel: Call Trace:
Nov 03 19:54:30 pc003 kernel: ? netif_receive_skb_internal+0x28/0x410
Nov 03 19:54:30 pc003 kernel: ? setup_data_read+0xa0/0xa0
Nov 03 19:54:30 pc003 kernel: ? netif_receive_skb_internal+0x29/0x410
Nov 03 19:54:30 pc003 kernel: on_each_cpu+0x2d/0x60
Nov 03 19:54:30 pc003 kernel: ? netif_receive_skb_internal+0x28/0x410
Nov 03 19:54:30 pc003 kernel: text_poke_bp+0x6a/0xf0
Nov 03 19:54:30 pc003 kernel: __jump_label_transform.isra.0+0x10b/0x120
Nov 03 19:54:30 pc003 kernel: arch_jump_label_transform+0x32/0x50
Nov 03 19:54:30 pc003 kernel: __jump_label_update+0x68/0x80
Nov 03 19:54:30 pc003 kernel: jump_label_update+0xae/0xc0
Nov 03 19:54:30 pc003 kernel: static_key_slow_inc+0x95/0xa0
Nov 03 19:54:30 pc003 kernel: static_key_enable+0x1d/0x30
Nov 03 19:54:30 pc003 kernel: netstamp_clear+0x2d/0x40
Nov 03 19:54:30 pc003 kernel: process_one_work+0x193/0x3c0
Nov 03 19:54:30 pc003 kernel: worker_thread+0x4a/0x3a0
Nov 03 19:54:30 pc003 kernel: kthread+0x125/0x140
Nov 03 19:54:30 pc003 kernel: ? process_one_work+0x3c0/0x3c0
Nov 03 19:54:30 pc003 kernel: ? kthread_park+0x60/0x60
Nov 03 19:54:30 pc003 kernel: ret_from_f...
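To collect these events from a long journal, the soft lockup line can be picked apart with a regular expression. This is only a sketch in Python: the pattern is written against the single watchdog line shown above and may need adjusting for other kernel versions, and parse_soft_lockup is a hypothetical helper name.

```python
import re

# Pattern derived from the one "soft lockup" line in the log above.
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def parse_soft_lockup(line):
    """Return (cpu, seconds, task) for a soft lockup line, or None if no match."""
    match = SOFT_LOCKUP.search(line)
    if match is None:
        return None
    return int(match["cpu"]), int(match["secs"]), match["task"]


line = (
    "Nov 03 19:54:30 pc003 kernel: watchdog: BUG: soft lockup - "
    "CPU#14 stuck for 23s! [kworker/14:1:217]"
)
print(parse_soft_lockup(line))  # (14, 23, 'kworker/14:1:217')
```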


Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Another hang on 4.13.5 with Ryzen 1600X using Fedora 26. Works fine on 4.12.9.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

I'm at 11 days uptime with the custom 4.13.4 kernel and kernel boot params.

Jon, did you also compile the RCU settings or only apply the boot params?

Seth, I'm on 4.13.4 FWIW and now doing well. How long was your uptime?

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Eric, Fedora has CONFIG_RCU_NOCB_CPU=y for 4.13 so I just set rcu_nocbs=0-11 on the kernel command line. I'll report how it goes. Without the params, I would hang within a few hours.

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Eric, been running for a few hours now, longer than the hang normally takes to surface, and it hasn't. Seems that the rcu_nocbs=0-11 masks the issue.

I sent an email off to Paul McKenney, RCU maintainer, to see if he couldn't take a look. I'm out of my depth on this one. RCU is voodoo of the highest order!

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Seth, fantastic! Agreed on RCU complexity. I started reading up on it.. and then went back to funny cat pictures.

Revision history for this message
In , johnpaul7 (johnpaul7-linux-kernel-bugs) wrote :

Eric and Seth, I'm debugging and testing myself now after having very similar issues. Hardware no longer an issue after early-week RMA and confirmed stability with 48hr stressapptest run. Curious if you wouldn't mind sharing what distro and DE/WM you are running as well? (Seth looks like Fedora 26 for you).

I also assume that at this point the only Ryzen specific tweaks you are running are `CONFIG_RCU_NOCB_CPU` and `rcu_nocbs`? No more Ryzen-specific bios tweaks?

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

John,

No custom kernel or BIOS settings. Running stock clocks and voltages for memory and CPU. Just added rcu_nocbs to cmdline to (re)mask whatever is going on.

Also, I'm not sure that stress tests are the best way to recreate this issue. In my experience, it happens when the system is mostly idle.

Revision history for this message
In , jon780 (jon780-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #38)
> I'm at 11 days uptime with the custom 4.13.4 kernel and kernel boot params.
>
> Jon, did you also compile the RCU settings or only apply the boot params?
>
> Seth, I'm on 4.13.4 FWIW and now doing well. How long was your uptime?

I've only specified the boot parameter. Too soon to say it's resolved for sure, but uptime is over 2 days at this point, which seems to be an improvement.

Revision history for this message
In , jon780 (jon780-linux-kernel-bugs) wrote :

(In reply to John Paul Herold from comment #42)
> Eric and Seth, I'm debugging and testing myself now after having very
> similar issues. Hardware no longer an issue after early-week RMA and
> confirmed stability with 48hr stressapptest run. Curious if you wouldn't
> mind sharing what distro and DE/WM you are running as well? (Seth looks like
> Fedora 26 for you).
>
> I also assume that at this point the only Ryzen specific tweaks you are
> running are `CONFIG_RCU_NOCB_CPU` and `rcu_nocbs`? No more Ryzen-specific
> bios tweaks?

I'm also running Fedora 26, DE/WM is dwm.

I also want to say that my CPU is confirmed to be affected by the Ryzen bug. I'm currently discussing RMA with AMD but I do not believe these issues are related.

Revision history for this message
In , johnpaul7 (johnpaul7-linux-kernel-bugs) wrote :

(In reply to Seth Jennings from comment #43)
> Also, I'm not sure that stress tests are the best way to recreate this
> issue. In my experience, it happens when the system is mostly idle.

Seth, correct on the stress test, just wanted to clarify my hardware is stable, no bad OC, etc. I too get the issue when I return to work the next day after system had been idle overnight.

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Just got the hang even with rcu_nocbs=0-11. Back to 4.12 then.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

John,

- week 33 R7 1700 CPU no segfault issues
- Ubuntu 16.04 with 4.13.4 kernel pulled from git, copied config, enabled CONFIG_RCU_NOCB_CPU; the CONFIG_RCU_NOCB_CPU_ALL option was NOT available.
- XFCE/Xubuntu to be more exact
- memtest 4 passes OK
- all c states, cool and quiet enabled in BIOS, no changes really
- boot params used as discussed

Seth,

My system crashes/d under very low load, almost idle. Before I tried any of these fixes I would run a video on loop to keep one processor more active to stave off crashes. It seemed to help. Maybe this would help you until a fix is official?

Jon,

Segfault bug is indeed a different issue. I had to RMA to get my week 33 CPU. It takes some time. Hang in there!

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Seth Jennings from comment #47)
> Just got the hang even with rcu_nocbs=0-11. Back to 4.12 then.

Are you able to disable C6 in the BIOS? I still don't know whether the RCU workaround avoids the issue entirely or just makes it less likely. I haven't had a single freeze since disabling C6.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Seth/James,

Also consider disabling ASLR.

# disable in current session
echo 0 | tee /proc/sys/kernel/randomize_va_space

# make change permanent (across reboots)
echo "kernel.randomize_va_space = 0" > /etc/sysctl.d/01-disable-aslr.conf
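To confirm which setting is actually in effect, the same file can be read back. A minimal Python sketch, assuming the usual meaning of the three values (0 = off, 1 = stack/mmap/VDSO randomized, 2 = heap randomized as well); describe_aslr and current_aslr are made-up helper names.

```python
# Assumed meaning of the values of kernel.randomize_va_space.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and VDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}


def describe_aslr(raw):
    """Translate the raw file contents (e.g. '2\\n') into a description."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown value {value}")


def current_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Read the live setting; only works on a Linux system with /proc mounted."""
    with open(path) as f:
        return describe_aslr(f.read())


print(describe_aslr("0\n"))  # disabled
```

Calling current_aslr() after the commands above should report "disabled" if the change took effect.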

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #50)
> Also consider disabling ASLR.

That only helps with the segfault issue. I now have a new Ryzen and no longer encounter that.

I discovered the RCU workaround after receiving the new Ryzen and trying to work out why Gentoo froze but Fedora (at the time) didn't. I compared the kernel configurations and manually converged one towards the other, focusing on the things that seemed most likely to make a difference. It took several days. I was certain by that point that ASLR was unrelated here.

Revision history for this message
In , johnpaul7 (johnpaul7-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #51)
> That only helps with the segfault issue. I now have a new Ryzen and no
> longer encounter that.

James, regarding ASLR, that is helpful as I've seen it recommended as a general Ryzen stability tweak. Which we definitely want to keep separate from this bug.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

I want to confirm this issue on my system (AMD Ryzen 1700X, segfault free chip after RMA) and offer some more info:

- First of all, let's define what CONFIG_RCU_NOCB_CPU does. It enables support for the rcuo kernel threads, which handle RCU callback processing. The option is automatically enabled if you select CONFIG_NO_HZ_FULL=y, and this is the reason why it is enabled in Fedora kernels. Fedora also chose to set CONFIG_RCU_NOCB_CPU_ALL=y in their 4.12 series kernels, while the default is CONFIG_RCU_NOCB_CPU_NONE=y, which only enables RCU callback offloading to rcuo kernel threads for those CPUs that are listed in the rcu_nocbs boot parameter. This is also the behaviour of 4.13 kernels, where the CONFIG_RCU_NOCB_CPU_* options have been dropped. So, in order to get the behaviour of CONFIG_RCU_NOCB_CPU_ALL=y in a 4.13 kernel, you need to supply the boot parameter rcu_nocbs=0-XX, where XX is your number of CPUs minus 1. So, by setting CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y (or rcu_nocbs=0-XX), we are telling the kernel to offload RCU callbacks to separate kernel threads called rcuo (rcu offload), one for each CPU. The affinity of those kernel threads is set to ffff, so they can run on any CPU.

- C6 is not effectively disabled by using rcuo kernel threads for RCU callback processing. When C6 is disabled, CPU voltage never drops under the voltage defined for the P2 state (0.9V for my CPU). With C6 enabled and rcuo kernel threads enabled, my voltage frequently drops to 0.4V. Also, disabling C6 prevents a single core from hitting XFR turbo speeds. The max CPU frequency for my 1700X is 3500 MHz when C6 is disabled. With C6 enabled, I can run single-threaded processes at 3900 MHz (the XFR turbo speed of the 1700X). Enabling rcuo kernel threads for RCU callback processing does not disable XFR turbo speeds. So the claim that this feature effectively disables C6 is not true.

- C6 can be disabled manually with zenstates.py script, even if your BIOS does not offer this option. It can be found at https://github.com/r4m0n/ZenStates-Linux

- In my case, idle freezes are prevented with either C6 disabled or with rcuo kernel threads for RCU callback processing enabled and C6 enabled. I had 14 days uptime with rcuo kernel threads enabled, while my system will usually freeze overnight without it.

- And finally, I have no idle freezes in Windows 10 with C6 enabled.

My opinion is that this idle freeze is a hardware issue of the AMD Zen processor, and the rcuo kernel threads for RCU callbacks just hide this hardware issue or make it less likely to trigger. Probably something similar happens in the Windows 10 kernel.

I think AMD should be contacted.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

Thank you very much for this extra detail, Panagiotis. I'll turn C6 back on! I wanted to contact AMD but I wasn't sure how best to do so.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

I think that an experienced kernel developer that understands how RCU callback could affect the triggering of a probable hardware issue should contact AMD. There are AMD kernel developers so they should get the attention of this bug report.

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #55)
> I think that an experienced kernel developer that understands how RCU
> callback could affect the triggering of a probable hardware issue should
> contact AMD. There are AMD kernel developers so they should get the
> attention of this bug report.

How can we contact them? For me, the situation is still unclear... I haven't had problems for some weeks, so I have been using my PC heavily while configuring a new firewall. After a custom kernel I used standard Manjaro kernels from 4.13.9 to 4.13.11, keeping the RCU_NOCB_CPU option in my standard kernel command line. It is not visible in the Manjaro kernel config. Maybe it's triggered by something else?

https://github.com/manjaro/packages-core/blob/master/linux413/config.x86_64

Revision history for this message
In , jon780 (jon780-linux-kernel-bugs) wrote :

I've now got over 4 days of uptime with ryzen 1700x on fedora 26 (4.13.9-200) using the rcu_nocbs=0-XX boot option. I feel pretty confident in saying this resolved my issue. Should we be filing a bug report with Fedora as well? Clearly the current kernel configuration isn't compatible with (at least some) Ryzen CPUs.

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

I'm running 4.13.11 now on F26. I enabled kdump and kernel.softlockup_panic = 1 so maybe I can get a kernel core on the next hang if/when it occurs.

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

Using Fedora 27 I've had zero problems with random hangs since I first commented on this on 2017-10-13, I've had rcu_nocbs=0-11 as part of GRUB_CMDLINE_LINUX= in /etc/sysconfig/grub since then. Currently using 4.13.11-300.fc27.x86_64, been using various 4.13.x kernels.
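To verify that the parameter actually reached the running kernel, the contents of /proc/cmdline can be checked. Below is a minimal Python sketch; offloaded_cpus and expand_cpu_list are hypothetical helper names, the sample command line is invented apart from the rcu_nocbs=0-11 token, and the parser only understands plain numbers and simple ranges, not every CPU list syntax the kernel accepts.

```python
def expand_cpu_list(spec):
    """Expand a simple CPU list such as '0-3,8' into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-", 1)
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def offloaded_cpus(cmdline):
    """Return the CPUs named by rcu_nocbs= on a kernel command line, or []."""
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            return expand_cpu_list(token.split("=", 1)[1])
    return []


# Invented example line; on a real system read /proc/cmdline instead.
sample = "BOOT_IMAGE=/vmlinuz ro rcu_nocbs=0-11 quiet"
print(offloaded_cpus(sample))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```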

Like already mentioned, this was something I noticed very quickly after going from kernel 4.12.x to 4.13.x, everything was working fine and then Fedora "upgraded" and the system started randomly hanging (mostly right after boot if I didn't start stuff immediately). My first reaction (before finding this bug) was to simply go back to 4.12.x.

I have not filed a Fedora bug but I'm not sure how useful it would be. Their kernels used to ship with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y and they probably still would if CONFIG_RCU_NOCB_CPU_ALL wasn't removed.

The kernel developers could fix this (for Fedora at minimum, do not know what other distributions kernels are configured to do) by reverting this,

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

I'm not sure why this is not happening. Perhaps it will in the far future when Ubuntu and RHEL/CentOS finally switch to a 4.13+ kernel, AMD seems to only care about Ubuntu 16 and CentOS 7 for some reason (look at their binary blob GPU drivers, for example). It would be excellent if someone with access could please revert that git commit before 2020. I don't really see any compelling arguments as to why that change was done and a whole lot of Ryzen systems randomly hanging is a good argument as to why it should be reverted.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Reverting the CONFIG_RCU_NOCB_CPU_{ALL|ZERO|NONE} options deletion is not a solution. This option is just a workaround for what I think is a hardware issue.

There is at least one Ryzen 1700X processor (a friend of mine has it) that doesn't freeze on idle with stock kernel settings. Also, my original CPU, which had the segfault issue, didn't freeze on idle either. Actually, it seems most RMA replacement CPUs that don't have the segfault issue are more affected by the idle freeze issue than others.

AMD should be informed about the freeze on idle issue, and an experienced kernel developer should work with them to find out why a kernel with the rcuo kernel threads for RCU callbacks enabled hides the issue (or makes it less likely to happen).

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #60)
> Reverting the CONFIG_RCU_NOCB_CPU_{ALL|ZERO|NONE} options deletion is not a
> solution. This option is just a workaround for what I think is a hardware
> issue.
Very much agreed.

> There is at least one Ryzen 1700X processor (friend of mine has it) that
> doesn't freeze on idle with stock kernel settings. Also, my original CPU
> that had segfault issues, didn't freeze on idle too. Actually, it seems most
> CPUs from RMA that don't have the segfault issue are more affected with the
> idle freeze issue than others.
My original CPU definitely had both issues and I wouldn't say that one froze more than the other. Perhaps there's another factor involved.

Revision history for this message
In , b-o-s-s (b-o-s-s-linux-kernel-bugs) wrote :

Another affected user here. Got my RMA week 33 segfault-free Ryzen 7 1700 a few weeks ago, running Gentoo on a custom vanilla kernel 4.13.10 atm. After 10 days of uptime, I now consider it stable. I did have freezes before, also with my pre-RMA CPU, but rcu_nocbs boot parameter seems to have fixed it, together with CONFIG_RCU_NOCB_CPU of course.

What I don't understand is why this should be another hardware bug? Ok, we don't know, maybe it is, but even then, this time there should be some software workaround. And I don't necessarily mean this RCU offloading-thing, which probably just masks the issue, but more in the terms of some underlying RCU (?) kernel issue triggering on this platform...?

Also consider that there are no freeze issues on Windows as far as I know. OK, AMD quickly put out their AMD Ryzen Balanced power plan, but I think it's mostly about SMT scheduling, although it does disable C6 (I think, not sure). But anyhow, even on default Windows power plans, there are no freezes. So... either Microsoft quickly and silently incorporated some patch fixing this issue a long time ago, or it never was an issue for Windows in the first place. It doesn't matter, however; the point is that something can be done in software to keep this "bug" from triggering.

Forgive my ignorance, but why is it so difficult to assign this bug to the correct kernel dev? I'm just a user really and wondering... thanks.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

I'm at 13 days uptime with kernel and boot params applied. This ties my system stability days counted from before the boot params were added. I'll keep reporting back.

I am used to months of uptime with my previous Intel i5 2400 server. I hope to reach the same with my Ryzen.

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Created attachment 260577
softlock-dmesg.log

Was finally able to get a BUG on the soft lockup after attaching a serial console. Hoping to get a few more to create a good sampling of cases in which this happens.

tl;dr, two cores seem to be soft locked in native_flush_tlb_others().

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Created attachment 260579
panic-dmesg.log

Here is a soft lock followed by watchdog timer panic.

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Created attachment 260581
panic2-dmesg.log

Last one for good measure.

smp_call_function_many() is the common denominator for all traces in this report.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

@Seth: How did you capture these logs? Are you logging the serial console to another computer? Is enabling the serial console enough, or is something else needed too? I would also like to capture my idle freeze logs.

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

@Panagiotis, yes, I attached a serial port header to my motherboard and am using a DB9 null modem cable and minicom to get the dmesg. Added console=ttyS0,38400n8 to the kernel boot line.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

I was able to capture logs using netconsole, which may be easier for some people.

Revision history for this message
In , derrick.aguren (derrick.aguren-linux-kernel-bugs) wrote :

Disabling C6 with zenstates.py has worked so far for me (6 hours without a freeze vs a freeze every one to two hours).

Ryzen 1800x (I haven't checked what week)
Ubuntu 16.04 LTS 4.11.0-kfd-compute-rocm-rel-1.6-180
Gigabyte AX370 - Gaming 5 BIOS version F8 (default params)
MSI Radeon RX 580
amdgpu-pro version 17.40-492261

Revision history for this message
In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Doing some code investigation

http://elixir.free-electrons.com/linux/v4.13.12/source/kernel/smp.c#L401

The two RIP addresses in my 3 dmesg logs go here

smp_call_function_many
  csd_lock_wait
    smp_cond_load_acquire
      cpu_relax (called in loop from smp_cond_load_acquire)
        rep_nop
          asm volatile("rep; nop" ::: "memory") <-- smp_call_function_many+0x248

smp_call_function_many
  csd_lock_wait
    smp_cond_load_acquire
      READ_ONCE(x) (called in loop from smp_cond_load_acquire)
        __READ_ONCE(x, 1)
          __read_once_size
            __READ_ONCE_SIZE <-- smp_call_function_many+0x24a

Both are within the tight loop in smp_cond_load_acquire waiting on the per-cpu csd locks

Basically, smp_call_function_many() executes a function on each cpu via IPI.
When wait=true, it runs synchronously, with the cpu that runs smp_call_function_many() waiting for each of the other cpus to report they have run the function as indicated by releasing their per-cpu csd lock.

The CSD_FLAG_SYNCHRONOUS flag determines the order of the func and unlock in flush_smp_call_function_queue()

http://elixir.free-electrons.com/linux/v4.13.11/source/kernel/smp.c#L242

This leads me to believe the issue here deals with IPIs being dropped and thus the CPU calling smp_call_function_many() deadlocks waiting on a cpu that has dropped the IPI and therefore will not unlock the csd lock.
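
The dropped-IPI hypothesis can be illustrated with a toy model. This is only a sketch in Python, not kernel code: threads stand in for CPUs, a `threading.Event` stands in for each per-cpu csd lock, and the name `call_function_many` and its `dropped` parameter are made up for the illustration. Unlike the kernel's wait loop, the model uses a timeout so that it can report the stall instead of hanging.

```python
import threading


def call_function_many(n_cpus, dropped=(), timeout=0.2):
    """Toy model of a synchronous cross-CPU call.

    Returns the list of CPUs whose "csd lock" was never released.
    """
    # One event per target CPU; set() stands for "csd lock released".
    released = [threading.Event() for _ in range(n_cpus)]

    def ipi_handler(cpu):
        # The target runs the requested function, then unlocks its csd.
        released[cpu].set()

    workers = []
    for cpu in range(n_cpus):
        if cpu in dropped:
            continue  # Simulate a lost IPI: the handler never runs.
        worker = threading.Thread(target=ipi_handler, args=(cpu,))
        worker.start()
        workers.append(worker)

    # The caller waits on every target. The real kernel loop has no timeout,
    # which is why a single lost IPI would show up as a soft lockup.
    stalled = [cpu for cpu in range(n_cpus) if not released[cpu].wait(timeout)]
    for worker in workers:
        worker.join()
    return stalled
```

With no dropped IPIs the call returns an empty list; with `dropped={2}` the caller is left waiting on CPU 2, which is the situation the traces seem to show.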

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

I found this commit that went into 4.12-rc2

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3fc5b3b6a80b2e08a0fec0056208c5dff757e547

While the reasoning in the commit message makes sense, it does modify the exact code area where this is happening. It would be interesting to revert this and see if the problem goes away. There could be some edge case where an enqueued csd gets dropped.

In , jon780 (jon780-linux-kernel-bugs) wrote :

This will be my last post unless the behavior changes, but I've gone from 1-2 lock ups per day to an uptime of: 7 days, 11:57 so I'm considering this resolved for me.

By adding the following boot option: rcu_nocbs=0-15

And disabling ASLR. I don't know which one solved the problem, but I'd start with the boot option then try disabling ASLR if you still have issues.
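
The range in `rcu_nocbs=0-15` matches the 16 logical CPUs of an 8-core/16-thread part; other posters in this thread use `0-31` and `0-11` for their chips. A small sketch for deriving the option from the CPU count follows. The helper name `rcu_nocbs_param` is hypothetical, and it assumes CPUs are numbered contiguously from 0.

```python
import os


def rcu_nocbs_param(n_cpus=None):
    """Build an rcu_nocbs= boot option covering CPUs 0..n_cpus-1."""
    if n_cpus is None:
        # Fall back to the number of logical CPUs the OS reports.
        n_cpus = os.cpu_count() or 1
    if n_cpus < 1:
        raise ValueError("need at least one CPU")
    if n_cpus == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{n_cpus - 1}"
```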

In , b-o-s-s (b-o-s-s-linux-kernel-bugs) wrote :

(In reply to Seth Jennings from comment #72)
> I found this commit that went into 4.12-rc2
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=3fc5b3b6a80b2e08a0fec0056208c5dff757e547

Great find, thanks!
I'm going to test 4.13.12 with this reverted. C6 is enabled in UEFI, no special boot params, just booted. Now, let's wait... :D

In , thunderbird2k (thunderbird2k-linux-kernel-bugs) wrote :

Created attachment 260619
Kernel softlockup Fedora 27 serial log

Hi,

I'm seeing the same issue as others reported here. If my memory serves me well I started seeing it on some 4.13.x kernel on fedora 26 and have had it ever since even now on fedora 27.

The issue usually happens every few hours and my system is fairly idle, just a few gnome-terminals and a browser open, no compilation or heavy 3d content.

I'm about to try the workarounds suggested on this ticket, but I want to capture a few more logs just in case.

Thanks,
Roderick

In , malakudi (malakudi-linux-kernel-bugs) wrote :

@Seth: I tried with the commit you mentioned reverted and got idle freeze again. If I remember well I have had idle freeze with Debian 9.0 4.9 kernel as well.

I don't know if it helps, but running without gnome/X, I didn't have idle freeze. System was running only some network services (ssh, nfs, samba etc).

In , chewi (chewi-linux-kernel-bugs) wrote :

Yeah, I really want to believe it was that commit but while I haven't tried reverting it, I'm pretty sure I started with a kernel older than 4.12.

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

@Panagiotis yes, I also built a kernel with a revert for that commit and it still locked up so... not that.

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

@James, I am disabling C6 now (w/o the kernel options). See how this goes. If that alone can fix it, that is a pretty strong indicator in my mind that this is a hardware issue :-/

In , chewi (chewi-linux-kernel-bugs) wrote :

Derrick, evidently you work for AMD, is there anybody you can speak to about this?

Franck Charras (franckc) wrote :

Almost 1 month of uptime, it looks solved.

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

I'm stable with C6 disabled (both package and core) and no other modifications.

In my mind, this proves that this is a hardware issue. Gotta find some way to get AMD's attention on this...

In , tom (tom-linux-kernel-bugs) wrote :

We have a machine (with Ryzen Threadripper 1950X) where this is happening even with rcu_nocbs=0-31, and I'd love to test the disabling-C6 hypothesis, except that as far as I can tell there's no such option in the BIOS (ASUS PRIME X399-A motherboard).

In fact it looks to me like there is only C1 and C2 and hence no C6 to disable - that's going both by powertop and looking at /sys/devices/system/cpu/cpu0/cpuidle/state*/name to see what states exist.
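
The sysfs check described above can be scripted. The following is a minimal sketch in Python; the function name `idle_state_names` and its `sysfs_root` parameter are illustrative. Keep in mind that these are the names the cpuidle driver exposes, which may not map one-to-one onto the hardware's internal C6 state.

```python
import glob
import os


def idle_state_names(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Map cpuidle state directories (state0, state1, ...) to their names."""
    pattern = os.path.join(sysfs_root, f"cpu{cpu}", "cpuidle", "state*", "name")
    names = {}
    for path in glob.glob(pattern):
        state_dir = os.path.basename(os.path.dirname(path))
        with open(path) as handle:
            names[state_dir] = handle.read().strip()
    # Sort numerically so that state10 comes after state9.
    ordered = sorted(names.items(), key=lambda item: int(item[0][len("state"):]))
    return dict(ordered)
```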

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

@Seth I crashed with C states disabled.

I haven't crashed in 20 days with C states ENABLED and RCU kernel and boot params.

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

@Tom, did you compile your kernel with the RCU option/s set?

In , tom (tom-linux-kernel-bugs) wrote :

No this is a stock Fedora kernel, but my understanding was that rcu_nocbs=0-31 was equivalent to recompiling with the RCU configuration change?

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

@Tom, unfortunately not. You'll need to verify CONFIG_RCU_NOCB_CPU=y and, if possible, CONFIG_RCU_NOCB_CPU_ALL=y. Newer kernels don't have CONFIG_RCU_NOCB_CPU_ALL, so the boot param you reference does the same job there.

Here is an article I contributed to that discusses this. While Ubuntu focused it can still guide you.

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Like I stated before, with these options set I'm at 20 days crash free.

In , tom (tom-linux-kernel-bugs) wrote :

Turns out it is actually set:

rover [~] % fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

@Tom that is correct. You do NOT have to recompile your kernel to use the rcu_nocbs option. The fedora kernel already has CONFIG_RCU_NOCB_CPU=y.
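
The same check can be done programmatically, which also avoids matching longer option names such as CONFIG_RCU_NOCB_CPU_ALL by accident. A minimal sketch in Python, with the helper name `kernel_config_option` chosen only for illustration:

```python
import re


def kernel_config_option(config_text, option):
    """Return the option's value (e.g. 'y' or 'm'), or None if it is not set."""
    # Require '=' directly after the name so that a longer option sharing
    # the same prefix is not mistaken for the one being looked up.
    assign = re.compile(rf"^{re.escape(option)}=(.*)$")
    for line in config_text.splitlines():
        match = assign.match(line.strip())
        if match:
            return match.group(1)
    return None
```

The text to pass in would typically come from `/boot/config-$(uname -r)`, as in the fgrep command above, or from `/proc/config.gz` on kernels that provide it.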

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

@Tom, then life is easier for you! ;-)

Perhaps consider disabling ASLR? http://blog.programster.org/how-to-disable-aslr

Maybe up the SOC voltage a tad?

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

I found this project a while ago that claims to set C states for ryzen chips on some motherboards.

https://github.com/r4m0n/ZenStates-Linux

In , tom (tom-linux-kernel-bugs) wrote :

So that does show it as enabled at code level:

rover [~/ZenStates-Linux] % sudo ./zenstates.py -l
P0 - Enabled - FID = 88 - DID = 8 - VID = 44 - Ratio = 34.00 - vCore = 1.12500
P1 - Enabled - FID = 8C - DID = A - VID = 5A - Ratio = 28.00 - vCore = 0.98750
P2 - Enabled - FID = 84 - DID = C - VID = 6A - Ratio = 22.00 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Enabled

I've turned it off now so we'll see what happens with that...
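
For readers puzzling over the P-state lines: the Ratio and vCore columns appear to follow from the raw hex fields as Ratio = 2 * FID / DID and vCore = 1.55 V - 0.00625 V * VID. These relations are inferred from the listings posted in this thread, not taken from AMD documentation, so treat the sketch below as an assumption; the function name `decode_pstate` is made up.

```python
def decode_pstate(fid_hex, did_hex, vid_hex):
    """Recompute (ratio, vcore) from the hex FID/DID/VID fields of a P-state line."""
    fid = int(fid_hex, 16)
    did = int(did_hex, 16)
    vid = int(vid_hex, 16)
    # Inferred from the listings in this thread: multiplier = 2 * FID / DID.
    ratio = 2.0 * fid / did
    # Inferred from the listings in this thread: 6.25 mV per VID step below 1.55 V.
    vcore = 1.55 - 0.00625 * vid
    return round(ratio, 2), round(vcore, 5)
```

For the P0 line above, `decode_pstate("88", "8", "44")` reproduces the listed Ratio of 34.00 and vCore of 1.12500.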

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

@Tom, Good luck!

As dumb as it sounds, at one point I had a video running on loop on my server to keep it from going too idle. If your C state attempt is fruitless then maybe that would be a bandaid until a resolution for all this is found.

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

@Tom the BIOS option on Gigabyte boards that disabled C6 is called "Global C-state Control". I think ASUS calls it the same thing.

@Eric interesting. IIRC I hung with just the rcu_nocbs kernel options set and C6 enabled, similar to Tom's experience.

None of this changes the fact that one should not have to resort to hacks to keep the hardware from going "too idle".

In , derrick.aguren (derrick.aguren-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #80)
> Derrick, evidently you work for AMD, is there anybody you can speak to about
> this?

Hi James. Unfortunately this sort of thing is outside of my wheelhouse. Let's let it progress through normal channels.

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

@Derrick these "normal channels" of which you speak... please, do tell! :D What are they?

How do users report issues that will likely require an AGESA update to fix?

In , derrick.aguren (derrick.aguren-linux-kernel-bugs) wrote :

@Seth I would start here, which mentions this site, but also other paths: https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html

In , lucio (lucio-linux-kernel-bugs) wrote :

(In reply to Derrick Aguren from comment #96)
> @Seth I would start here, which mentions this site, but also other paths:
> https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html

Maybe there's a shorter path, maybe Bridgman@Phoronix knows how to get in touch with the right AMD person. I've just asked him to have a look just in case he knows:

https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/990202-linux-4-15-is-a-huge-update-for-both-amd-cpu-radeon-gpu-owners?p=990258#post990258

In , chewi (chewi-linux-kernel-bugs) wrote :

I already tried to contact Bridgman through a Phoronix PM but his inbox was full so I figured that he's probably pestered about random AMD issues enough already.

I then sent a message to AMD Tech Support. I haven't heard anything back yet.

In , lorenz.bona (lorenz.bona-linux-kernel-bugs) wrote :

Hi guys.
I've been running a Ryzen build since late May/early June from Linus git and I never faced this lockup.
I've just checked and in my configuration that option is disabled.

In attachment you can find my config file.

May I help you testing something?

R5-1600 with a Gigabyte Gaming 3.

In , malakudi (malakudi-linux-kernel-bugs) wrote :

@Lorenzo: Do you have C6 states enabled? You can check with the zenstates.py script mentioned earlier in this thread.

In , lorenz.bona (lorenz.bona-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #100)
> @Lorenzo: Do you have C6 states enabled? You can check with the zenstates.py
> script mentioned earlier in this thread.

Yes, C6 enabled.

./zenstates.py -l
P0 - Enabled - FID = 94 - DID = 8 - VID = 32 - Ratio = 37.00 - vCore = 1.23750
P1 - Enabled - FID = B9 - DID = A - VID = 32 - Ratio = 37.00 - vCore = 1.23750
P2 - Enabled - FID = 7C - DID = 10 - VID = 68 - Ratio = 15.50 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
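
Since several people are comparing these listings by eye, a small parser for the two C6 lines may help when collecting reports. This is a sketch in Python that assumes the `zenstates.py -l` output format shown in this thread; the function name `parse_c6_states` is hypothetical.

```python
def parse_c6_states(listing):
    """Extract e.g. {'Package': True, 'Core': True} from `zenstates.py -l` output."""
    states = {}
    for line in listing.splitlines():
        parts = [part.strip() for part in line.split(" - ")]
        # C6 lines look like "C6 State - Package - Enabled"; P-state lines
        # have a different first field and are skipped.
        if len(parts) == 3 and parts[0] == "C6 State":
            states[parts[1]] = parts[2] == "Enabled"
    return states
```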

In , lorenz.bona (lorenz.bona-linux-kernel-bugs) wrote :

Created attachment 260745
ryzen kernel config

In , kmueller (kmueller-linux-kernel-bugs) wrote :

I've been facing those system hangs, too, until I switched on "daily computing" optimization in bios (ASUS PRIME X370-PRO BIOS 0902 09/08/2017) running 4.13.x.

Switching to "daily computing" increases CPU speed to max MHz 3600 (instead of 3400) using Ryzen 7 1700X. Maybe it changes some more things I don't know of. But C states are definitely enabled:

# zenstates.py -l
P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P1 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

In , johannes.hirte (johannes.hirte-linux-kernel-bugs) wrote :

I don't think Bridgman can help here. You should ask Borislav Petkov <email address hidden> with cc-ing <email address hidden>. He doesn't work for AMD anymore, but he is still involved with kernel development. So if he can't say what's going on, he can point to the right developer.

In , sfearghail (sfearghail-linux-kernel-bugs) wrote :

Just to add more, as I've been following along ever since this first happened to me about 6 days ago. To preface, I am running Debian 9 with the 4.13 kernel from backports. I went several days after the upgrade from 4.12 with a very idle Ryzen 1700 and didn't have any issues. I did have 2 qemu-kvm VMs running, and they are also mostly idle. I encountered the soft lock CPU bug out of the blue on Nov 15th while not doing anything interactive with the system.

I'm now running 6 VM's on this system as it was a replacement for an old Intel system. I simply cannot have this thing crashing on me unattended, especially while I am remote.

Here are the measures I took, and frankly I don't really know if it's going to prevent this as the system simply hasn't been up long enough.

Board: Gigabyte GA-AB350-GAMING 3
Bios version: F7

Disabled Global C-State Control in bios. The manual also says there should be a C6 Mode in bios, however it simply doesn't exist.

Disabled AMD Cool&Quiet in bios.

Additionally disabled C6 via the zenstates.py script at boot from systemd, as I wasn't certain what the difference is between Package and Core.

# zenstates.py -l
P0 - Enabled - FID = 78 - DID = 8 - VID = 3A - Ratio = 30.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Disabled

Question for clarification. Does CONFIG_RCU_NOCB_CPU need to be set to yes in kernel to be able to use the boot parameter? I get the impression that the boot parameter doesn't actually do anything unless kernel option is configured.

In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Scott Farrell from comment #105)
> Disabled Global C-State Control in bios. The manual also says there should
> be a C6 Mode in bios, however it simply doesn't exist.
These are the same thing.

> Disabled AMD Cool&Quiet in bios.
You don't need to disable this. I would leave it on.

> Question for clarification. Does CONFIG_RCU_NOCB_CPU need to be set to yes
> in kernel to be able to use the boot parameter? I get the impression that
> the boot parameter doesn't actually do anything unless kernel option is
> configured.
Yes, the boot parameter alone won't do anything. You only need to do this or disable C6, not both. At this point, the overall consensus seems to be that the RCU workaround is better as it still allows some power saving and doesn't prevent boost.
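
Putting the two conditions together, a quick sanity check could look like the sketch below. It is written in Python with made-up helper names (`parse_cpu_list`, `rcu_workaround_active`) and only understands plain comma-separated CPU numbers and ranges such as `0-15`, not every list syntax the kernel accepts.

```python
def parse_cpu_list(spec):
    """Expand a simple CPU list such as '0-3,8' into a set of integers."""
    cpus = set()
    for chunk in spec.split(","):
        if "-" in chunk:
            first, last = chunk.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(chunk))
    return cpus


def rcu_workaround_active(cmdline, config_has_nocb, n_cpus):
    """True only if the kernel option is built in AND rcu_nocbs covers every CPU."""
    if not config_has_nocb:
        return False  # Without CONFIG_RCU_NOCB_CPU the boot parameter does nothing.
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            covered = parse_cpu_list(token[len("rcu_nocbs="):])
            return covered >= set(range(n_cpus))
    return False
```

The `cmdline` argument would be the contents of `/proc/cmdline`, and `config_has_nocb` the result of checking the kernel config for CONFIG_RCU_NOCB_CPU=y.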

In , tom (tom-linux-kernel-bugs) wrote :

Well our machine has now gone about nine days without any soft lockups with C6 disabled - previously (even with rcu_nocbs=0-31) it was lucky to go more than a day or two.

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

I'm at 30 days with C states ENABLED, Cool & Quiet ENABLED, boot param set, and kernel param set. I consider my server stable now.

Now if only the out of the box linux kernel would be this stable.

In , chewi (chewi-linux-kernel-bugs) wrote :

I have had a response from AMD Tech Support. They told me to disable C6 even though I'd already told them that this has been found to work around the issue. Not very helpful but they also added that the issue "should be addressed in an up and coming BIOS release." This would imply that they already know about the issue.

I didn't say so but I am slightly dubious about the suggestion of a BIOS fix as they said the same thing about the segfault issue and that never materialised.

However I did implore them to be more transparent with their customers. Bad press is bad but how they handle the situation is what really counts. While they never fully admitted that the segfault issue was a large scale hardware fault in the face of overwhelming evidence, they did at least honour all RMA requests with little resistance. I get the impression that most customers were grateful for this. I certainly was.

They then told me my feedback had been passed onto the Ryzen team.

In , tom (tom-linux-kernel-bugs) wrote :

Don't forget that a BIOS update can include new CPU microcode.

In , master.homer (master.homer-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #109)
> I have had a response from AMD Tech Support. They told me to disable C6 even
> though I'd already told them that this has been found to work around the
> issue. Not very helpful but they also added that the issue "should be
> addressed in an up and coming BIOS release." This would imply that they
> already know about the issue.
>
> I didn't say so but I am slightly dubious about the suggestion of a BIOS fix
> as they said the same thing about the segfault issue and that never
> materialised.
>
> However I did implore them to be more transparent with their customers. Bad
> press is bad but how they handle the situation is what really counts. While
> they never fully admitted that the segfault issue was a large scale hardware
> fault in the face of overwhelming evidence, they did at least honour all RMA
> requests with little resistance. I get the impression that most customers
> were grateful for this. I certainly was.
>
> They then told me my feedback had been passed onto the Ryzen team.

Good to know that they are working on it!

I've had this problem myself, with multiple random lockups per day when it was mostly idle. But like others a custom compiled kernel (latest 4.15) with the CONFIG_RCU_NOCB_CPU option set and the rcu kernel command line seems to have fixed it.

In , b-o-s-s (b-o-s-s-linux-kernel-bugs) wrote :

As a test, I've been running like below with an uptime of 12 days without any freeze or strange syslog entries. Note that this is with the patch Seth mentioned far above _reverted_, so at least for me this seems to have fixed the issue. I've read that for some it hadn't, strange. I have to countercheck another few weeks.

# uname -a
Linux donald 4.13.12-x64 #14 SMP Sun Nov 12 17:23:57 CET 2017 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux
# zcat /proc/config.gz |grep RCU_NO
# CONFIG_RCU_NOCB_CPU is not set
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.13.12 root=/dev/nvme0n1p3 ro
# ./zenstates.py -l
P0 - Enabled - FID = 8C - DID = 8 - VID = 3A - Ratio = 35.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

Some relevant dmidecode output:

Base Board Information
        Manufacturer: ASRock
        Product Name: AB350 Gaming-ITX/ac

BIOS Information
        Version: P3.10
        Release Date: 08/28/2017

Memory Device (x2)
        Data Width: 64 bits
        Size: 8192 MB
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 3200 MT/s
        Part Number: G.Skill F4-3200C14-8GFX
        Rank: 1
        Configured Clock Speed: 1600 MT/s

I'm running RAM on its specified XMP profile 3200-14-14-14-34 1.35V, which is quite good actually. UEFI settings are mostly default, only P-state overclocking is used and two memory related options turned off: GearDown mode and Bank Swapping. Voltages are default except CPU offset -100 mV. All turbo and power saving stuff is working as intended with the ondemand governor.

So... are you guys sure your freezes are not a UEFI setting and/or kernel config and/or hardware (most probably RAM) issue? RAM especially is really critical on this platform, and maybe Memtest cannot detect it...?
(But as noted, I have to countercheck at least 2 weeks.)

In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Tyler from comment #112)
> As a test, I've been running like below with an uptime of 12 days without
> any freeze or strange syslog entries. Note that this is with the patch Seth
> mentioned far above _reverted_, so at least for me this seems to have fixed
> the issue.

I'll have to try it myself then.

> So... are you guys sure your freezes are no UEFI setting and/or kernel
> config and/or hardware, most probably RAM, issue? Especially RAM is really
> critical on this platform and maybe Memtest cannot detect it...?

I am booting in legacy mode so I doubt it's UEFI related. As someone who actually does have faulty RAM but has worked around it (for now), I can say that while it might have caused the odd freeze earlier, it was more apparent in other ways like random disk corruption (particularly when using git) and the browser constantly crashing until I rebooted.

In , master.homer (master.homer-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #113)
> (In reply to Tyler from comment #112)
> > As a test, I've been running like below with an uptime of 12 days without
> > any freeze or strange syslog entries. Note that this is with the patch Seth
> > mentioned far above _reverted_, so at least for me this seems to have fixed
> > the issue.
>
> I'll have to try it myself then.
I tried this myself over the weekend - I compiled a kernel with CONFIG_RCU_NOCB_CPU set to n and that commit reverted, unfortunately, after a few hours of being left idle it had locked up. After that I also did a test with that commit applied again, and CONFIG_RCU_NOCB_CPU set to y (with the kernel command line also being set) and it ran well for 12+ hours. So it definitely feels like that config option fixes it.

In , tomm (tomm-linux-kernel-bugs) wrote :

We have been in "indirect" contact with an engineer at AMD who offers this explanation:

> We have been investigating the issue where systems are reportedly locking up
> when idling or running small workloads.
>
> This issue is related to the power supply. Most PC power supplies (PSUs) are
> designed to handle a wide range of power consumption from your PC components,
> but not all PSUs are created equal.
>
> Because of this, there are some rare conditions where the power draw of an
> efficient PC does not meet the minimum power consumption requirements for one
> or more circuits inside some PSUs.
>
> This scenario (called "minimal loading supply") can cause such PSUs to output
> poor quality power, or shut off entirely.
>
> To prevent this issue from happening, it is important to ensure that the
> power supply supports 0A minimum load on the +12V circuit. These PSUs became
> commonplace starting in 2013 for the Intel "Haswell" platform.
>
> This specification can be found printed on the sticker affixed to most PSUs,
> or it may be available on the manufacturer’s website.
>
> However, AMD understands that not everyone is in a position to replace their
> PSU with a contemporary 0A-rated unit. To help with that, AMD is also
> developing a firmware workaround for these power supplies, and will make it
> available through motherboard partners as a BIOS update in the future.

You guys buy this?

In , spartacus06 (spartacus06-linux-kernel-bugs) wrote :

Nope, my PSU is this:
https://www.newegg.com/Product/Product.aspx?Item=N82E16817139146

Advertised "Haswell Ready" (0A min load). I _do_ believe this is a problem with power delivery, just not from the PSU. I think it is the on-chip power management.

Sounds like they might try to spin it. I don't really care though as long as the firmware update fixes the issue :)

In , chewi (chewi-linux-kernel-bugs) wrote :

It sounds plausible but my Seasonic G-550, bought last year, was also declared to be Haswell ready.

https://www.pcper.com/news/General-Tech/Seasonic-Releases-Information-Its-Haswell-Ready-Power-Supplies

Maybe Ryzens use even less power in this state!?

In , malakudi (malakudi-linux-kernel-bugs) wrote :

I don't "buy" this explanation. My PSU (Coolermaster G650M) specifically states: Haswell C6/C7 support & zero load operation

@Klaus Mueller: What this option changes is that it disables XFR and turbo (single core is 3600, same as all core) and it sets the P1 state to 3600, 1.35V, as can be seen in the zenstates.py output. This configuration makes the CPU spend more time at 1.35V and less time at 0.9V or below. It does seem to make the problem less frequent.

@Seth Jennings: I agree with you, I believe there is an issue with the on-chip power management. Let's hope it can be fixed in firmware.

In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

>> This issue is related to the power supply.
> You guys buy this?

No. This is hogwash.

EVGA SuperNOVA 750 G3 is a rather new PSU. Corsair RMi Series RM650i is also fairly new. Both of these were released after Haswell came out in 2013. These are the ones I (ab)use for my Ryzen 1600X's. AMD's "engineer" can forget about selling me some story about how both of these PSUs are somehow flawed in a way that makes these CPUs hang shortly after boot if I leave the system idle (when not using the right kernel options/parameter).

If everyone with this problem was re-using an ancient PSU from an earlier build then there would have been a chance that this would hold water. That's just not the case, and that makes it obvious that AMD's story is hogwash.

It is still the case that kernel configuration CONFIG_RCU_NOCB_CPU=y and the kernel option rcu_nocbs=0-11 fixes this issue - completely on both my Ryzen systems. This tells me that a kernel patch can fix this (or permanently mask it) because it's plainly clear that software does fix it.

In , master.homer (master.homer-linux-kernel-bugs) wrote :

Yeah, I don't buy it either. My PSU is a Fractal Design R3 1000W and is rated for a minimum current of 0A on the 12V output.
Plus, Fractal Design's support page states that:
"All power supplies
that use the DC-DC method are able to output their full 3.3V/5V ratings even with no load on the 12V
rail, so Tesla R2 and Newton R3 power supplies will easily support the new sleep states introduced with
Intel’s Haswell platform."

In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #118)
> @Klaus Mueller: What this option changes is that it disables XFR and turbo
> (single core is 3600 same as all core) and it sets P1 state to 3600, 1.35V
> as can be seen in zenstates.py output. This configuration makes the CPU to
> spend longer time in 1.35V and less time to 0.9V and less. It does seem to
> make the problem less frequent.

Well, happily I couldn't measure any increased power consumption. It's unchanged compared to standard configuration.

I've been running this config for about 3 weeks now and I haven't faced any problem so far - there would have been a lot of chances to hang.

In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #121)
>
> Well, happily I couldn't measure any increased power consumption. It's
> unchanged compared to standard configuration.
>
> I'm running this config now since about 3 weeks and I didn't face any
> problem so far - there would have been a lot of chances to hang.

I have tried it in my system and although it didn't freeze in the first 30-60 minutes of idle like it usually does if I run it without rcu threads, it did freeze when idle all night - found it frozen in the morning. But even if it worked, I wouldn't like to use it since such setting limits single core potential performance. 1700X can go up to 3900 MHz for single core performance.

In , mitsarionas (mitsarionas-linux-kernel-bugs) wrote :

Just to chime in with my case... My Ryzen 1600 (w/ Gigabyte AB350M Gaming 3) also used to lock up on stock Ubuntu 17.10 (4.13), installed on a hdd, C6 enabled in bios. The PSU is a 750W one from 2011-2012, I doubt it's 0A min load.
As in everyone's (?) case here, this was fixed with the CONFIG_RCU_NOCB_CPU workaround.

A few days ago though, I did a clean install on an ssd, and thought I'd check again. It didn't lock up for the 3-4 days that I left it on. Could one say that it has some relation to the ssd not using 12V? (It seems opposite to what one would expect? Perhaps it has something to do with the 12V power draw alternating between zero and non-zero? Are the sata 12V rails even related to this?)

In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #122)
> I have tried it in my system and although it didn't freeze in the first
> 30-60 minutes of idle like it usually does if I run it without rcu threads,
> it did freeze when idle all night - found it frozen in the morning.

Thanks for testing. It obviously isn't a workaround suitable for everybody.

> But even
> if it worked, I wouldn't like to use it since such setting limits single
> core potential performance. 1700X can go up to 3900 MHz for single core
> performance.

It depends on the characteristics of the workload.

In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

Another me too.

Originally I had a Ryzen 1700 which was suffering from both the segfault and the idle lockup issues. I sent it back via RMA, got a replacement which does not segfault anymore but which still exhibits the idle lockup.

For me, messing around with RCU kernel settings/boot params does not fix the idle lockup behaviour. Disabling C6/Cool'n'Quiet at the BIOS level does not fix it either (ASRock X370 Gaming K4 motherboard). What DOES fix the problem is disabling C6/P states via the zenstates.py script: with it, the system has been rock solid for weeks.

At this point I am waiting for a BIOS update before possibly going through RMA again.

In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Francesco Biscani from comment #125)
> At this point I am waiting for a BIOS update before possibly going through
> RMA again.
I wouldn't go through RMA again as there's no indicator that this has been fixed in hardware.

In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #126)
> (In reply to Francesco Biscani from comment #125)
> > At this point I am waiting for a BIOS update before possibly going through
> > RMA again.
> I wouldn't go through RMA again as there's no indicator that this has been
> fixed in hardware.

Yeah you are probably right. Let's hope they fix this in the next AGESA update.

Running with C6 disabled is not that much of a problem: thermally I did not notice any difference, and while the power savings would be nice, it's not a deal breaker for me personally.

Still, according to some quick tests I ran, it seems like there's a ~10% performance decrease in single-threaded workloads due to the fact that, without C6, turbo mode is disabled. Would be nice to get that back.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

I've been interested in getting a new AMD CPU, but being new to Linux, this bug has scared me off so far. Compiling my own kernel is still too complex for me. So I created an account to follow this bug here and see if it gets fixed, maybe with the new Ryzens coming next February, as rumour has it.

However, I would like to ask if anyone here knows whether this bug has been happening on Threadrippers as well. AMD says the TRs are the sorted top 5 Zeppelin dies. Therefore, if this bug is a general design issue, it should happen on those high-end workstation CPUs as well, but I could not find any complaints. If it does not happen on these TRs, then is it a manufacturing error?

Revision history for this message
In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to Jonathan from comment #128)
> I've been interested in getting a new AMD CPU, but being new to Linux this
> bug has scared me off sofar. Too complex for me to compile my own kernel
> yet. So I created an account to follow this bug here, and see if it gets
> fixed, maybe with the new Ryzen's coming next February, so rumours have it.
>
> However, I would like to ask if anyone here knows if this bug has been
> happening on Threadrippers as well. AMD says the TR's are the sorted top 5
> Zeppelin dies. Therefor if this bug is a general design issue, it should
> happen on those high end workstation CPU's as well, but I could not find any
> complaints. If it does not happen on these TR's, then it's a manufacturing
> error?

It's difficult to say, as AMD has been horribly tight-lipped. Most information on these issues comes from various online resources (forums, bugzilla, etc.), and it's often anecdotal and sometimes contradictory.

My understanding is that the segfault problem (the only one publicly admitted by AMD so far) was a manufacturing issue, which was solved somewhere around the introduction of the TR. See also here:

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

The other issue, the hang/reboot when idle, seems to be persisting also on newer processors, and, going by memory (again, anecdotal) I do remember at one point someone reporting this very problem on a TR.

The current speculation/wishful thinking is that the idle bug may be solved by a future AGESA/microcode update, which should be imminent:

https://www.vortez.net/news_story/its_looking_like_november_for_amds_agesa_1_7.html

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to Francesco Biscani from comment #129)
> My understanding is that the segfault problem (the only one publicly
> admitted by AMD so far) was a manufacturing issue, which was solved
> somewhere around the introduction of the TR. See also here:
>
Ryzen CPUs manufactured before week 20 or so have the segfault problem under high-stress compilation, which average users do not do. You get a new CPU from AMD after submitting an RMA.

>
> The other issue, the hang/reboot when idle, seems to be persisting also on
> newer processors, and, going by memory (again, anecdotal) I do remember at
> one point someone reporting this very problem on a TR.
>
> The current speculation/wishful thinking is that the idle bug may be solved
> by a future AGESA/microcode update, which should be imminent:

This bug was introduced by the Intel developers, who do not test their code on other hardware; older Intel hardware is affected by this problem too. An easy workaround is found in this bug report. For those who cannot follow the discussion, here it is again:
To prevent random kernel lock ups, enable RCU_NOCB_CPU and boot the kernel with the rcu_nocbs=0-X command line parameter. X is the cpu thread count -1.
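
As a rough illustration of the "X is the thread count minus one" rule above, here is a minimal Python sketch that derives the parameter string from the number of logical CPUs. The function name `rcu_nocbs_param` is made up for this example, and it assumes all CPU threads are numbered contiguously from 0.

```python
import os


def rcu_nocbs_param(thread_count: int) -> str:
    """Build the rcu_nocbs boot parameter covering every CPU thread.

    Hypothetical helper for illustration; assumes CPUs are numbered 0..N-1.
    """
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    if thread_count == 1:
        # A single CPU is just index 0, no range needed.
        return "rcu_nocbs=0"
    # CPUs are numbered from 0, so the last index is thread_count - 1.
    return f"rcu_nocbs=0-{thread_count - 1}"


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs (threads) and may return None.
    print(rcu_nocbs_param(os.cpu_count() or 1))
```

On a 16-thread Ryzen 7 this yields the `rcu_nocbs=0-15` value used elsewhere in this thread; on a 12-thread Ryzen 5 1600 it yields `rcu_nocbs=0-11`.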

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #130)
> Ryzen cpus manufactured before week 20 or so have the segfault problem with
> high stress compilation that average users are not doing. You get a new cpu
> Amd after submitting RMA.
>

This is not true; processors from week 28 also have the segfault problem.

> This bug is made by the intel developers who does not test their code with
> other hardware and older intel hardware is affected this problem too. Easy
> workaround is found in this bug report, put for those who can not follow
> discussions, here it is again:

This is not true either. Show which change Intel submitted to the kernel that made AMD CPUs freeze on idle. The previously mentioned commit from Seth (comment 72) has been reverted and doesn't fix the issue. And there are people for whom only disabling C6 fixes the issue; rcuo threads don't fix it. For me, rcuo fixes the issue, although I would prefer to say "it hides" the issue. To me, as a Computer Engineer, this is a hardware issue, much like the segfault issue. It doesn't seem to occur under Windows, and it hides itself with rcuo threads enabled.

Also, there is at least one Threadripper user who has this problem and whom rcuo threads didn't help. Setting C6 to disabled helped him (comments 82, 91 and 107).

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

I made it 38 days uptime until system crash. I had RCU fixes applied, C6 enabled, cool and quiet enabled. RCU changes appear to help but not fix my issues.

I just turned off C6.

Frustrating. I want to rely on my server.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Why doesn't this issue manifest in Windows?

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #133)
> Why doesn't this issue manifest in Windows?

I had previously said half jokingly that Windows never gets that idle. I have since heard that AMD have said the same thing themselves.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #134)
> (In reply to eric.c.morgan from comment #133)
> > Why doesn't this issue manifest in Windows?
>
> I had previously said half jokingly that Windows never gets that idle. I
> have since heard that AMD have said the same thing themselves.

That's funny ;-)

While it sucks overall for us, creating a whole new CPU must be maddening to validate. You'll never hit all corner cases the first try.

I'll probably pay for a Zen 2 or whatever is next.

Revision history for this message
Mike R (mi9e) wrote :

@franckc did you have to rebuild your kernel? I tried the kernel options without that and still had failures. I've just rebuilt with CONFIG_RCU_NOCB_CPU to see whether rcu_nocbs=0-15 now resolves my lockups.

Revision history for this message
Franck Charras (franckc) wrote :

Yes, the boot option alone has no effect; you must also compile the kernel with CONFIG_RCU_NOCB_CPU. You should be fine now (I have this confirmed on several different configs). Here's a tutorial: http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen
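
Since the workaround needs both the build option and the boot parameter, a quick check along the following lines may help confirm that both are in place. This is only a sketch: the function names are invented for the example, and it assumes the distribution installs the kernel config as `/boot/config-<release>` (the usual Ubuntu/Debian layout), which may not hold elsewhere.

```python
import platform


def has_nocb_support(config_text: str) -> bool:
    """True if the kernel config text enables CONFIG_RCU_NOCB_CPU."""
    return any(line.strip() == "CONFIG_RCU_NOCB_CPU=y"
               for line in config_text.splitlines())


def has_rcu_nocbs(cmdline: str) -> bool:
    """True if the kernel command line carries an rcu_nocbs= parameter."""
    return any(tok.startswith("rcu_nocbs=") for tok in cmdline.split())


def workaround_active(config_text: str, cmdline: str) -> bool:
    """Both halves are required: the build option and the boot parameter."""
    return has_nocb_support(config_text) and has_rcu_nocbs(cmdline)


if __name__ == "__main__":
    # Assumed location of the running kernel's config (Ubuntu/Debian style).
    config_path = f"/boot/config-{platform.release()}"
    with open(config_path) as f:
        config_text = f.read()
    with open("/proc/cmdline") as f:
        cmdline = f.read()
    print("CONFIG_RCU_NOCB_CPU=y:", has_nocb_support(config_text))
    print("rcu_nocbs on cmdline: ", has_rcu_nocbs(cmdline))
    print("workaround active:    ", workaround_active(config_text, cmdline))
```

A stock kernel whose config reads `# CONFIG_RCU_NOCB_CPU is not set` would report the workaround as inactive even with `rcu_nocbs=0-15` on the command line.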

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #131)
> (In reply to fin4478 from comment #130)
> > Ryzen cpus manufactured before week 20 or so have the segfault problem with
> > high stress compilation that average users are not doing. You get a new cpu
> > Amd after submitting RMA.
> >
>
> This is not true, processors from week 28 have also segfault problem.
>

I did not check what phoronix wrote. You can have a new Ryzen cpu anyway if you are compiling Mesa and having the segfault problem.

> To me, as a Computer Engineer, this is a hardware issue, much
> like the segfault issue. It doesn't seem to occur under Windows, and it
> hides itself with rcuo threads enabled.

I am an MSc software engineer, and when you say win virus hoover does not have random lockups, then it is a software problem made by wintel kernel developers. Intel develops the Linux core and tests only with their latest hardware.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #136)
> I did not check what phoronix wrote. You can have a new Ryzen cpu anyway if
> you are compiling Mesa and having the segfault problem.

When has phoronix become a respectable place for info? When they published the "performance marginality problem exclusive to certain workloads in Linux"? phoronix wrote whatever AMD told them in order to get some Threadripper and Epyc systems for review. I don't trust phoronix. They first escalated the segfault issue, then they bought the "marginality" excuse from AMD.

It has been reported in the original thread in AMD forums (https://community.amd.com/thread/215773) that CPUs from week 28 bought from stores still exhibit segfault issues. CPUs from week 33 and after seem to have this fixed, but maybe still not 100%.

> I am Msc software engineer and when you say win virus hoover does not have
> random lockups, then it is a software problem made by wintel kernel
> developers. Intel develops Linux core and test only with their latest
> hardware.

This "Intel develops Linux core" is a statement of no value. Unless you present some specific commits from Intel that specifically affect Ryzen (AMD's Opteron, for example, is not affected by this idle freeze issue), you are just talking nonsense. We had better stay with facts. Fact is that even enabling rcuo threads doesn't fix the issue for everyone; some CPUs require C6 disabled completely. Fact is that some CPUs don't idle freeze even with rcuo threads disabled and C6 enabled. To me this is a clear indication of a hardware issue and not a software issue.

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #137)

> have this fixed, but maybe still not 100%.

Less complex devices than Ryzen cpus do have hardware errors and you can have money back or a new product.
>
> > I am Msc software engineer and when you say win virus hoover does not have
> > random lockups, then it is a software problem made by wintel kernel
> > developers. Intel develops Linux core and test only with their latest
> > hardware.
>
> This "Intel develops Linux core" is a statement of no value. Unless you
> present some specific commits from Intel that specifically affect Ryzen

https://bugzilla.kernel.org/show_bug.cgi?id=197177

Revision history for this message
Mario Emmenlauer (emmenlau) wrote :

I have compiled the current Ubuntu 17.10 kernel 4.13.0-17-generic with CONFIG_RCU_NOCB_CPU
and the boot option "rcu_nocbs=0-15" and just had a system freeze again. My system is a Ryzen 7 1700X
on an ASRock AB350 Pro4 mainboard with 32GB RAM @3200.
I do not use the kernel option "processor.max_cstate=1". Is it suggested for this problem?

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #138)

> https://bugzilla.kernel.org/show_bug.cgi?id=197177

And this proves what? This commit doesn't affect Ryzen; it affects some specific motherboard with some specific BIOS/ACPI implementation. It is just some false ACPI interpretation, and more probably it is what you have been told: an incorrect ACPI table compilation based on older tools.

Revision history for this message
In , hi-angel (hi-angel-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #138)
> (In reply to Panagiotis Malakoudis from comment #137)
> > This "Intel develops Linux core" is a statement of no value. Unless you
> > present some specific commits from Intel that specifically affect Ryzen
>
> https://bugzilla.kernel.org/show_bug.cgi?id=197177

Besides being irrelevant, you've been told your firmware is buggy, for that matter up to the point of not even working on Windows. You've also been told that inhibiting firmware errors is a work-in-progress; and I bet many users do want to know if their firmware is buggy — I do.

You've straight ignored everything, and started cussing the dev. Please stop being an asshole.

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #139)
> (In reply to fin4478 from comment #138)
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=197177
>
> And this proves that? This commit doesn't affect Ryzen, it affects some
> specific motherboard with some specific BIOS/ACPI implementation. It is just
> some false ACPI interpretation - and more probably it is what you have been
> told - an incorrect ACPI table compilation based on older tools.

Asus sure knows better than intel what tools to use when creating a motherboard. Nobody buys intel motherboards. Intel cpus have had and will have hardware errors too, so let us stop here.

https://en.wikipedia.org/wiki/Pentium_FDIV_bug

Revision history for this message
Franck Charras (franckc) wrote :

No, "processor.max_cstate=1" is not necessary. I'd also suggest double-checking that your motherboard BIOS is up to date. If it persists, I'd try with a more recent kernel (I'm using 13.6) because IIRC the rcu options were a little bit different in earlier versions of 4.13.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #141)
>
> Asus sure knows better than intel what tools to use when creating a
> motherboard. Nobody buys intel motherboards. Intel cpus have had and will
> have hardware errors too, so let us stop here.
>
> https://en.wikipedia.org/wiki/Pentium_FDIV_bug

I don't understand what you are trying to say. Sure, Intel has bugs, and more recent ones than the FDIV bug: there was a hyperthreading bug in Skylake and Kaby Lake CPUs, which was fixed by a microcode update.

The problem discussed here is probably hardware related, and we hope it can be fixed with a future microcode (AGESA) update. Working around it in software (the rcuo threads fix) would be nice, but since that doesn't fix it in 100% of cases, it should be investigated further. The Intel Skylake hyperthreading issue took around 18 months to be fixed in microcode, so I can wait for a fix from AMD. But we need confirmation that (a) they have reproduced the issue and are looking at it, and (b) it can be fixed with a microcode update.

Revision history for this message
In , lucio (lucio-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #142)
> I don't understand what you are trying to say,

fin4478 (a.k.a. debianxfce@phoronix) is trolling as usual, please don't feed it.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

Is there a reliable way to test and be 100% sure a Ryzen/TR is OK or has the problem? Like there was a script to test for the segfault issue IIRC?

Revision history for this message
Mike R (mi9e) wrote :

Cool, thanks for the confirmation, Franck. I've rebuilt 4.13.0-17 with instructions from here: https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel (probably similarly to Mario). I'm going to test with this first (since I don't really want to reboot again), but I might get things ready with 4.14.3 (which is the current latest).

Revision history for this message
Franck Charras (franckc) wrote :

If you rely on the boot option "rcu_nocbs=0-15" you also need to reboot to enable it.

Revision history for this message
Mike R (mi9e) wrote :

Yep I did :) thanks

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to Jonathan from comment #144)
> Is there a reliable way to test and be 100% sure a Ryzen/TR is OK or has the
> problem? Like there was a script to test for the segfault issue IIRC?

Leaving your computer idle overnight for a couple of nights will either trigger the issue, and you will find your computer frozen, or it will not. The idle freeze usually happens anywhere from 15-30 minutes up to some hours into idling.

It should be noted here that there are two kinds of idle freeze. In the first one (discussed here) the computer locks up, but if you press the reset button it reboots fine. The second one is related to memory/CPU overclocking: the computer freezes completely (on overclock.net it is called "black screen freeze on idle") and pressing the reset button does nothing, so you have to power off completely. This second one also happens under Windows. So, in order to test reliably, don't overclock the CPU and RAM, meaning that RAM should run at max 2400 MHz.

Revision history for this message
In , tstraus13 (tstraus13-linux-kernel-bugs) wrote :

I just wanted to chime in as well. I have an AMD Ryzen 1600 on an ASRock X370 Taichi acting as an NFS file server. It is running Ubuntu 17.10 server with, I believe, kernel 4.13 or whatever the latest kernel is after updates. Recently I have been experiencing the exact same issue as you all. Within 24 hours of the system being idle or under low load, the system just freezes. I am unable to access it. It actually happened today as I am typing this: I am unable to remote into it, but I have it hooked up to a keyboard, mouse, and monitor. When I return home I want to see if there are errors of any kind on screen. I initially thought it may have been due to some hardware changes, but that just didn't make sense, and then I found this thread.

I am going to try disabling the C6 State and see if it helps at all and will report back. Hopefully this gets fixed with a BIOS update or something. This is a major inconvenience for server use with frequent idle loads. Thanks.

Revision history for this message
Tom Reynolds (tomreyn) wrote :

The workaround discussed at http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen fixes the recurring (roughly once or twice in 6 hours of operation, during low or idle CPU load periods) system freezes for this Ryzen 7 1800X (early weeks), AMD microcode 0x8001129, ASRock X370 Taichi (Firmware P3.20, BIOS 5.12, Release Date: 09/08/2017) system running XUbuntu 16.04 LTS (with latest updates)

All the mainline kernels up to linux-image-4.13.10-041310-generic, and official Ubuntu kernels (default 16.04 kernel, hwe) all kept freezing - although I had only tested them without "rcu_nocbs=0-15". I did test hwe-edge (4.10.0.20.13), both with and without "rcu_nocbs=0-15". Both variants gave me freezes.

Revision history for this message
Mike R (mi9e) wrote :

Well I've hit 48 hours uptime since booting to my custom build kernel (Linux ganymede 4.13.0-17-generic #20+ryzen SMP Sun Dec 3 20:18:51 MST 2017 x86_64 x86_64 x86_64 GNU/Linux) with rcu_nocbs=0-15 in my boot command and disabled ASLR. That's about double my previous uptime record, I'll report back again if I hit 1 week

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

I have two identical Ryzen systems, running Debian 9 "stretch" as servers.

Specs:
Ryzen 5 1600
ASRock AB350 Pro4
Crucial memory with ECC

These servers have been running fine since early August. I had one random Apache crash (segfault) on kernel 4.9 (too old for Ryzen really), but both had been running kernel 4.12 (from stretch-backports) just fine for about 3.5 months now. Suddenly, in the last week, I have seen 3 crashes across both machines. Similar symptoms as described above. Total system freeze in all cases, sometimes with some gibberish on screen:
https://i.imgur.com/J8M1zUS.jpg
https://i.imgur.com/Qb7Z8YW.jpg
In all cases, crashes happened between 03.00 and 06.00, when no users are on and backup jobs have finished.

Since yesterday I have been running stress-ng on both servers, which keeps 3 cores at 100% 24/7, to see if that stabilises the systems for about a week. This could add to the evidence that prolonged full idle is the trigger for this issue.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

What I forgot to mention is that I find it rather peculiar that both machines were rock solid for 3.5 months and suddenly started crashing a number of times in a week. Perhaps that is worth investigating too.

Revision history for this message
Mario Emmenlauer (emmenlau) wrote :

It's great that the frequency of freezes is reduced. But I guess many people here expect a stable
system and can work with nothing less. Ryzen processors are attractive for CI systems, and I cannot
have regular freezes in our CI. I think this is an all-or-nothing issue, and I still had a
freeze with the latest Ubuntu kernel and fixes applied (after 8 hrs idle time).

Revision history for this message
Franck Charras (franckc) wrote :

It could be something else, then? After applying the fix on several machines, I have no freezes to report after more than 1 month of uptime; it looks rock solid. Did you also disable ASLR? Also, I compiled a kernel from git.kernel.org as suggested in the tutorial, not the Ubuntu kernel.

Revision history for this message
Mario Emmenlauer (emmenlau) wrote :

I think it's very, very unlikely to have a different underlying cause, even though there is no guarantee. I assume that I am affected by the same problem because:
(1) The problem manifests identical to other reports here
(2) The freeze happens particularly during idle times, not under heavy load
(3) This is the only freeze we encountered in years, on many different desktop
    and server machines with partially identical or overlapping software setup,
    and partially overlapping hardware (graphics boards, hard disks, peripherals).
(4) The frequency of freezes was drastically reduced after I performed the
    suggested steps: compiling a custom kernel with CONFIG_RCU_NOCB_CPU,
    setting rcu_nocbs=0-15, and disabling ASLR.
It would be a very, very rare coincidence if my issue were unrelated, although it's never impossible. I have also updated to the latest BIOS but with no improvement. I will report if there are further freezes.
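
Several reports in this thread depend on whether ASLR was actually disabled, so here is a small sketch for checking its current state. It reads the standard `/proc/sys/kernel/randomize_va_space` sysctl file; the helper name `describe_aslr` and the wording of the labels are my own.

```python
# Meaning of kernel.randomize_va_space values (labels are informal).
ASLR_MODES = {
    0: "disabled",
    1: "partial randomization (stack, mmap base, VDSO)",
    2: "full randomization (also heap)",
}


def describe_aslr(raw: str) -> str:
    """Translate the sysctl file content into a readable label."""
    value = int(raw.strip())
    return ASLR_MODES.get(value, f"unknown value {value}")


if __name__ == "__main__":
    with open("/proc/sys/kernel/randomize_va_space") as f:
        print("ASLR:", describe_aslr(f.read()))
```

A value of 0 corresponds to the "ASLR disabled" configuration people describe here; anything else means the setting did not take effect or was reset on reboot.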

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to kernel from comment #147)
> in the last week, I have seen 3 crashes across
> both machines. Similar symptoms as described above. Total system freeze in
> all cases, sometimes with some jibberish on screen:
> https://i.imgur.com/J8M1zUS.jpg
> https://i.imgur.com/Qb7Z8YW.jpg
> In all cases, crashes happened between 03.00 and 06.00, when no users are on
> and backup jobs have finished.

Your first screen shows an sda disk error. Your second screen does indeed show the problem discussed here.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #149)
> Your first screen shows sda disk error. Your second screen indeed shows the
> problem discussed here.

Agreed, but the machine completely locked up in the process. That disk is part of a SSD raid1 array for root, and one disk going down or not responding should not crash the entire machine; I find it hard to believe it is unrelated. Both crashes are from the same machine, by the way.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to kernel from comment #147)
> I have two identical Ryzen systems, running Debian 9 "stretch" as servers.
>
> Specs:
> Ryzen 5 1600
> ASRock AB350 Pro4
> Crucial memory with ECC
>
> These servers have been running fine since early August. I had one random
> Apache crash (segfault) on kernel 4.9 (too old for Ryzen really), but both
> had been running kernel 4.12 (from stretch-backports) just fine for about
> 3.5 months now. Suddenly, in the last week, I have seen 3 crashes across
> both machines. Similar symptoms as described above. Total system freeze in
> all cases, sometimes with some jibberish on screen:
> https://i.imgur.com/J8M1zUS.jpg
> https://i.imgur.com/Qb7Z8YW.jpg
> In all cases, crashes happened between 03.00 and 06.00, when no users are on
> and backup jobs have finished.
>
> I am since yesterday running stress-ng on both servers which keeps 3 cores
> at 100% 24/7, to see if that stabilises the systems for about a week. This
> could add to the evidence that prolonged full idle is the trigger for this
> issue.

FWIW before my RCU and C6 changes I would run a video on loop and that improved things for me. I'm not sure I would keep 50% of my CPU on full tilt like you!

Revision history for this message
Tom Reynolds (tomreyn) wrote :

~emmenlau: Try setting up a serial console to, or netconsole from, this system to get a better idea of why / how it's freezing.
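
For the netconsole route, the kernel accepts a boot parameter of the form `netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]`. The sketch below only assembles that string; the helper name and every address, port and interface in the example are placeholders that would need to be replaced with values from the actual network.

```python
def netconsole_param(src_port: int, src_ip: str, dev: str,
                     tgt_port: int, tgt_ip: str, tgt_mac: str = "") -> str:
    """Assemble a netconsole= boot parameter (hypothetical helper).

    The freezing machine is the source; the log receiver is the target.
    """
    return (f"netconsole={src_port}@{src_ip}/{dev},"
            f"{tgt_port}@{tgt_ip}/{tgt_mac}")


if __name__ == "__main__":
    # All values below are placeholders for illustration only.
    print(netconsole_param(6665, "192.168.1.10", "eth0",
                           6666, "192.168.1.20", "00:11:22:33:44:55"))
```

The receiving machine then needs some UDP listener on the target port to capture the kernel messages that never make it to disk when the system locks up.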

Revision history for this message
Mike R (mi9e) wrote :

After 5 days uptime with my custom ubuntu kernel build I got a different crash with a reboot vs the watchdog error. I'm now running a mainline kernel 4.14.3 with ASLR disabled and rcu_nocbs=0-15 and have restarted my no crash counter.

Revision history for this message
Huygens (huygens-25) wrote :

We are experiencing the same behaviour and error messages with a Ryzen5 1600 CPU and an ASUS PRIME A320M-K motherboard.

What we can add is:

1. When no VMs are running, even if there are long idle periods, the machines can be stable for several days on Ubuntu 16.04.3 using the HWE Kernel 4.10.
2. When we run a single VM on the machine, but both the VM and the host are idle, then we get a crash. The machine is unresponsive, we see CPU soft lockups on the screen, no logs, etc. We tested it with Ubuntu 16.04.3 using the stable and edge HWE kernels (so 4.10 and 4.13), and we also tested with Ubuntu 17.10 and kernel 4.13. The latest test was with Ubuntu 16.04.3 with a custom-built kernel based on HWE edge (so 4.13) with the config CONFIG_RCU_NOCB_CPU; we also got a crash within the following night on one of the 2 machines (note that we did not have ASLR deactivated).

For our latest test we did:
- Ubuntu 16.04.3
- HWE Edge custom built kernel with CONFIG_RCU_NOCB_CPU
- ASLR deactivated
- boot parameters: rcu_nocbs=0-11 processor.max_cstate=1

We set this up last Friday; on Monday morning both machines were down, so it did not help.
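
For anyone reproducing the boot-parameter part of that setup, the following sketch merges extra parameters into an existing kernel command line without leaving duplicates behind. The function name is invented, and the assumption that the result goes into `GRUB_CMDLINE_LINUX_DEFAULT` in `/etc/default/grub` reflects the usual Ubuntu arrangement rather than anything stated in this report.

```python
def merge_cmdline(existing: str, extra) -> str:
    """Add boot parameters, replacing any existing one with the same key."""
    tokens = existing.split()
    for param in extra:
        key = param.split("=", 1)[0]
        # Drop any older setting of the same parameter before appending.
        tokens = [t for t in tokens if t.split("=", 1)[0] != key]
        tokens.append(param)
    return " ".join(tokens)


if __name__ == "__main__":
    # "quiet splash" is only an example of a pre-existing value.
    merged = merge_cmdline("quiet splash",
                           ["rcu_nocbs=0-11", "processor.max_cstate=1"])
    print(f'GRUB_CMDLINE_LINUX_DEFAULT="{merged}"')
```

On Ubuntu the GRUB configuration has to be regenerated (for example with `update-grub`) and the machine rebooted before the new parameters show up in `/proc/cmdline`.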

PS: our motherboard had the latest "BIOS/UEFI" update, which includes AMD AGESA 1.0.0.6B; we even upgraded to a newer firmware (which ASUS has since removed) which stated that it contains AMD AGESA 1.0.0.7A. This morning we saw yet another firmware update from ASUS. We are going to apply it on top of our other changes, but if it does not work, we will try to return the machine (and that would be a pity, but stability is the number 1 feature, more than the number of cores or throughput).

Revision history for this message
Mike R (mi9e) wrote :

Well another watchdog failure with 4.14.5

I also updated to AGESA 1.0.0.6B when I rebooted to 4.14.5.

Just building a 4.14.6 to try again.

Revision history for this message
In , sfearghail (sfearghail-linux-kernel-bugs) wrote :

Specs:

AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
Ryzen 1700 week 33
G.SKILL DDR4-2400 32GB 15-15-15-39 1.2v
4.13.13-custom #1 SMP Mon Nov 27 20:17:00 CST 2017 x86_64 GNU/Linux

Debian 9; the kernel is compiled with CONFIG_RCU_NOCB_CPU and boots with the parameter set. ASLR disabled. C-states enabled in BIOS.

It runs samba and 6 active VMs. The VMs are fairly idle most of the time. The crash occurred at 6:40AM after 10 days of uptime.

"watchdog: BUG: soft lockup - CPU#x stuck for 23s! [worker:1659]"

https://photos.app.goo.gl/3ZIsi9aWXltZcixn1
https://photos.app.goo.gl/wfJFlVChftMHyqYQ2

So is the general consensus now that we simply must disable C-State completely in bios if kernel work-around is not working? I just need this system to not crash on me at this point.

Revision history for this message
In , tstraus13 (tstraus13-linux-kernel-bugs) wrote :

Hello all,

Since I have posted my last comment, about ten days ago, I had disabled C6 state in the BIOS and have not had a single lockup. It has been stable. Just wanted to report back in with my results. Thanks.

(In reply to Tom from comment #146)
> I just wanted to chime in as well. I have an AMD Ryzen 1600, ASRock X370
> Taichi acting as a NFS File Server. It is running Ubuntu 17.10 server with I
> believe kernel 4.13 or whatever the latest kernel is after updates. Recently
> I have been experiencing the same exact issue as you all. Within 24 of the
> system being idle or low load the system just freezes. I am unable to access
> it. It actually happened today as I am typing this, I am unable to remote
> into it but I have it hooked up to a keyboard, mouse, and monitor. When I
> return home I want to see if there are errors of anykind on screen. I
> initally thought it may have been due to some hardware changes but it just
> didnt make sense and then I found this thread.
>
> I am going to try disabling the C6 State and see if it helps at all and will
> report back. Hopefully this gets fixed with a BIOS update or something. This
> is a major inconvenience for server use with frequent idle loads. Thanks.

Revision history for this message
In , sfearghail (sfearghail-linux-kernel-bugs) wrote :

I forgot to mention the PSU is Corsair CX 550M.

(In reply to Scott Farrell from comment #152)
> Specs:
>
> AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
> Ryzen 1700 week 33
> G.SKILL DDR4-2400 32GB 15-15-15-39 1.2v
> 4.13.13-custom #1 SMP Mon Nov 27 20:17:00 CST 2017 x86_64 GNU/Linux
>
> Debian 9 and kernel is compiled with CONFIG_RCU_NOCB_CPU and booting with
> parameter set. ASLR disabled. C-State enabled bios.
>
> It runs samba and 6 active VMs. The VMs are fairly idle most of the time.
> The crash occurred at 6:40AM after 10 days of uptime.
>
> "watchdog: BUG: soft lockup - CPU#x stuck for 23s! [worker:1659]"
>
> https://photos.app.goo.gl/3ZIsi9aWXltZcixn1
> https://photos.app.goo.gl/wfJFlVChftMHyqYQ2
>
> So is the general consensus now that we simply must disable C-State
> completely in bios if kernel work-around is not working? I just need this
> system to not crash on me at this point.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

I built a new Ryzen system with MSI B350 motherboard, Ryzen 1700 CPU 1733SUS (segfault free direct from retail). Running same Debian 9 system with kernel 4.13 from backports (also tried 4.14.5 self built). This system does not freeze on idle. Believing it is a CPU issue, I installed this CPU to my first system (Asus Prime X370, 1700X 1725SUS from RMA).

Guess what? My first system still freezes on idle with the new CPU.

Differences: the first system has a GTX1060, the second has a GT1030. They also have different PSUs and different disks. The RAM is different too, but I have already swapped it between them and it didn't make any difference.

Will continue trying hardware combinations to find the root cause, at least for my case.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #155)
> Believing it is a CPU issue, I installed this CPU to my
> first system (Asus Prime X370, 1700X 1725SUS from RMA).
>
> Guess what? My first system still freezes on idle with the new CPU.

Could you please tell us which BIOS version you are using with the Asus Prime X370?

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

BIOS is 3401, but this was happening with previous BIOS versions as well.

Revision history for this message
Mike R (mi9e) wrote :

Well, another lockup. I've now disabled C6 in the BIOS and added processor.max_cstate=1 to the boot parameters.

Revision history for this message
Nagy Ákos (e-contact-pecska-ro) wrote :

I experienced the same issue with Ryzen 5 1600 and Ryzen 7 1700, with AsRock AB350M PRO4 and Asus Prime B350M-A.
I tried Ubuntu 16.04 with the normal and HWE kernels, and Ubuntu 17.10.

On the system with Asus motherboard, I found a detailed log:
Dec 19 06:25:04 am02 kernel: [227622.995187] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:115]
Dec 19 06:25:04 am02 kernel: [227622.996201] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables snd_hda_codec_hdmi nls_iso8859_1 eeepc_wmi asus_wmi sparse_keymap edac_mce_amd edac_core snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer i2c_piix4 snd ccp soundcore mac_hid 8250_dw i2c_designware_platform i2c_designware_core shpchp kvm_amd kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Dec 19 06:25:04 am02 kernel: [227622.996242] raid0 multipath linear nouveau crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc mxm_wmi video i2c_algo_bit ttm aesni_intel drm_kms_helper aes_x86_64 crypto_simd syscopyarea glue_helper sysfillrect cryptd sysimgblt fb_sys_fops drm r8169 ahci mii libahci wmi fjes gpio_amdpt gpio_generic
Dec 19 06:25:04 am02 kernel: [227622.996263] CPU: 0 PID: 115 Comm: kworker/0:1 Tainted: G L 4.10.0-28-generic #32~16.04.2-Ubuntu
Dec 19 06:25:04 am02 kernel: [227622.996264] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 0611 05/05/2017
Dec 19 06:25:04 am02 kernel: [227622.996271] Workqueue: events netstamp_clear
Dec 19 06:25:04 am02 kernel: [227622.996274] task: ffff9d12fa288000 task.stack: ffffba804361c000
Dec 19 06:25:04 am02 kernel: [227622.996278] RIP: 0010:smp_call_function_many+0x1f1/0x250
Dec 19 06:25:04 am02 kernel: [227622.996280] RSP: 0018:ffffba804361fce8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Dec 19 06:25:04 am02 kernel: [227622.996283] RAX: 0000000000000003 RBX: 0000000000000010 RCX: 000000000000000c
Dec 19 06:25:04 am02 kernel: [227622.996285] RDX: ffff9d12fe91d0f8 RSI: 0000000000000010 RDI: ffff9d12fe021d30
Dec 19 06:25:04 am02 kernel: [227622.996286] RBP: ffffba804361fd20 R08: 0000000000000000 R09: 000000000000fffe
Dec 19 06:25:04 am02 kernel: [227622.996287] R10: ffffeceadfd69100 R11: ffff9d12fe021d30 R12: ffffffffbb634c60
Dec 19 06:25:04 am02 kernel: [227622.996288] R13: 0000000000000000 R14: ffff9d12fe61a340 R15: 000000000001a300
Dec 19 06:25:04 am02 kernel: [227622.996291] FS: 0000000000000000(0000) GS:ffff9d12fe600000(0000) knlGS:0000000000000000
Dec 19 06:25:04 am02 kernel: [227622.996292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 19 06:25:04 am02 kernel: [227622.996294] CR2: 00007f5780200a58 CR3: 00000007f67ca000 CR4: 00000000003406f0
Dec 19 06:25:04 am02 kernel: [227622.996296] Call Trace:
Dec 19 06:25:04 am02 kernel: [227622.996301] ? netif_receiv...
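For anyone collecting these reports from their own logs, here is a minimal sketch of pulling the soft-lockup lines out of a kernel log. The regular expression is written against the message format shown in the log above; the function name is only illustrative, and other kernel versions may word the message differently.

```python
import re

# Matches lines such as:
#   "... NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:115]"
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[([^\]]+)\]")


def find_soft_lockups(lines):
    """Return a (cpu, seconds, task) tuple for every soft-lockup line."""
    hits = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return hits
```

Feeding it the lines of a saved kern.log (for example via open(path).readlines()) gives one tuple per lockup message, which makes it easier to see whether the same CPU or task keeps showing up.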


Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

Did you do further tests with your hardware combinations? Any news? It's frustrating not knowing what is the cause and AMD not interested in recognising let alone fixing the issue.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

C6 disabled and lockup after about a week. This is getting infuriating.

Revision history for this message
In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #159)
> C6 disabled and lockup after about a week. This is getting infuriating.

How did you disable C6? From BIOS or using the python script?

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

(In reply to Jonathan from comment #158)
> Did you do further tests with your hardware combinations? Any news? It's
> frustrating not knowing what is the cause and AMD not interested in
> recognising let alone fixing the issue.

Yes I did.
I changed the graphics card, system still froze on idle.
Now I changed the disk. Original system was running from a USB 3.0 adapter with an mSATA SSD in it. Now running with a SATA SSD (the one from the second system that didn't freeze).
If it freezes again, last thing to change is the PSU. But I can't accept it is the PSU. Most probably it is the power delivery system on the 2nd motherboard that made the 2nd system not freeze - at least not as quickly as my 1st system.

@eric.c.morgan: Freezing on idle with C6 disabled should be something completely different. Are you running with memory overclocked? I keep testing my different setups with memory always running at 2400 or 2133.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

C6 disabled via bios, I'll verify with that python script I posted earlier.

Mem is stock, memtested for 24 hours too.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Crap, the .py said C6 was still enabled! I disabled via said script. Fingers crossed.

Revision history for this message
In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #163)
> Crap, the .py said C6 was still enabled! I disabled via said script. Fingers
> crossed.

I had the same experience: disabling C6 via BIOS did not actually disable C6 (at least according to the script), and it would not fix the idle bug. Since I started disabling C6 via the script, I haven't experienced the bug once so far (in 2 months or so of usage - in normal circumstances the bug happens once every 1/2 days for me).

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

As expected, system froze on idle overnight again.
My last test is with the PSU from the 2nd system.

To summarize:
1st system Asus Prime X370, Ryzen 1700X, Flare X 2*8GB DDR4, mSATA SSD on USB 3.0, nvme disk (Windows install), GTX 1060 6GB, Coolermaster G650M PSU freezes on idle. It doesn't freeze with rcuo threads or C6 disabled.
2nd system MSI B350M Bazooka, Ryzen 1700, Corsair Vengeance 2*8GB DDR4, SATA SSD, GT 1030, Antec Earthwatts EA-380D PSU did not freeze on idle for 3 days, running same kernel with C6 enabled without rcuo threads. I didn't test much longer than 3 days because 1st system usually freezes overnight.

Installed on 1st system CPU, RAM, Graphics Card and disk from 2nd system and still freezes on idle. Now testing with the PSU. The only thing after that left is the motherboard. There is also a small difference on peripheral devices (different USB keyboard, a USB webcam) and the other difference is the water cooling system on 1st system.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

System froze with the PSU from 2nd system. As it is I am out of ideas.

Revision history for this message
In , klausman (klausman-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #166)
> System froze with the PSU from 2nd system. As it is I am out of ideas.

FWIW, at GHz speeds all circuits are analog. Propensity to crash on low load may well require (or be hastened by) a particular combination of PSU, CPU and mainboard, possibly more components. Plus, in light of Win10 being unaffected, the system being exercised in a particular way.

In this case, there are at least three variables (PSU/MB/CPU). Combined with the elusive nature of an easy way to reproduce, it may well be a fool's errand to try and figure out which component is at fault.

This may also be the reason why AMD is so mum about this whole matter: there are no clear steps to make a crashing system better that aren't papering over the problem (C6 disabling, RCUO tweaking). If I had to make a decision at AMD, I'd be *very* careful to not claim a source of the error (and possibly solution) too quickly. I would, however, be more transparent about the whole ordeal.

Relatedly, I'm on an Asus Prime B350 Plus with a 1700X and 2x16GB Kingston RAM. I used to have an AMD GPU (ancient, 6450, IIRC) but now have a GTX1050, but crashiness didn't really vary between the two GPUs.

My first CPU had both the lockup and the segv problem, so I got it RMA'd. The new CPU has no segv problem, but is still crashy, usually within a day of light computing, or at the very least over night.

I haven't tried the RCUO approach yet, but with C6 disabled via script, I've gotten to two days of uptime, with very long stretches of complete idle (sitting at login prompt, not doing anything in the background).

As others have said, while this is far from perfect (I don't particularly care about the power saving or turbo, but I understand those who do), it at least reduces the risk of data loss.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #166)
> System froze with the PSU from 2nd system. As it is I am out of ideas.

For the hell of it and completion's sake, could you try switching the USB devices as well (or remove them from system 1)?

If that still makes no difference, then perhaps we could start listing which mainboards have issues with what bios versions and see if we find a pattern there.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

@Jonathan: Friend of mine has Asus Prime X370 as I do and he doesn't have the idle freeze issue. And I don't think a USB webcam is to blame.

I think the problem is a complex PSU/Motherboard/CPU + specific workload puzzle that I am not willing to solve any more. I already lost many days trying combinations. After all I didn't pay AMD to debug their problems, I paid to use my CPU.

Since my issue is resolved with rcuo threads, I will use it like that and test every now and then if a BIOS update solves this. And will certainly replace my Motherboard/CPU combination as soon as X470 and new Ryzen CPUs are out.
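A rough way to confirm that the "rcuo threads" workaround is actually in effect is to look for the offload kernel threads in /proc. This is only a sketch: it assumes the threads show up with names beginning with "rcuo" (such as rcuos/0 or rcuop/0 on kernels of this era), and the function name is made up for illustration.

```python
from pathlib import Path


def rcuo_threads(proc_root="/proc"):
    """Return the sorted names of all tasks whose comm starts with 'rcuo'."""
    names = []
    for entry in Path(proc_root).iterdir():
        if not entry.name.isdigit():
            continue  # only numeric entries are processes
        try:
            comm = (entry / "comm").read_text().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if comm.startswith("rcuo"):
            names.append(comm)
    return sorted(names)
```

An empty result after booting with rcu_nocbs=... would suggest the parameter was ignored, for example because the kernel was built without CONFIG_RCU_NOCB_CPU.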

Revision history for this message
In , arup.chowdhury (arup.chowdhury-linux-kernel-bugs) wrote :

I have the same issue with my ASUS 350M and Ryzen 1700, latest BIOS and GSkill RAM. All the freezes happen during idle and never under load. Sometimes it goes days without a freeze; other days it's random. I am on Arch with GNOME and using the Nvidia drivers for my 1050Ti.

Revision history for this message
In , lyyb46 (lyyb46-linux-kernel-bugs) wrote :

Just would like to report that I have almost exactly the same problem (log shows "soft lockup - CPU#0 stuck for 22s!"), with a TP-Link TL-WDN4800 WiFi adapter. Apparently someone has exactly the same issue as me at https://forum.level1techs.com/t/solved-linux-is-unstable-ever-since-i-upgraded-to-ryzen/117541/152 .

Not sure if I have other components in common with the OP in the level1techs forum. But for me the soft lockup definitely comes from the WiFi driver process (it mentions ath9k_hw_wait, which is the WiFi driver in the kernel, I think).

It seems to me that the TP-Link TL-WDN4800, with a recent kernel (I tried 4.10, 4.11, 4.13), can reliably reproduce the soft lockup issue within an hour in Ubuntu 16.04. (Of course you will need to turn on WiFi.)

I have given my Ryzen CPU to a friend who is a windows user. But I kept a copy of the log - let me know if you guys need me to upload it.

Revision history for this message
In , lyyb46 (lyyb46-linux-kernel-bugs) wrote :

(In reply to Yibin Lin from comment #171)
> Just would like to report that I have almost exactly the same problem (log
> shows "soft lockup - CPU#0 stuck for 22s!"), with TP-Link TL-WDN4800 WiFi
> adapter. Apparently someone has exactly the same issue with me at
> https://forum.level1techs.com/t/solved-linux-is-unstable-ever-since-i-
> upgraded-to-ryzen/117541/152 .
>
> Not sure if I have other same components with the OP in the level1techs
> forum. But for me the soft lockup definitely comes from the WiFi driver
> process (it mentions about ath9k_hw_wait, which is the WiFi driver in kernel
> I think).
>
> It seems to me that TP-Link TL-WDN4800, with a recent kernel (I tried 4.10,
> 4.11, 4.13) can reliably re-produce the soft lockup issue within an hour in
> Ubuntu 16.04. (Of course you will need to turn on WiFi)
>
> I have given my Ryzen CPU to a friend who is a windows user. But I kept a
> copy of the log - let me know if you guys need me to upload it.

Some additional information: I tried disabling C6 and with the WiFi card it still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA 550 G3 PSU, Asus ROG Strix B350-f motherboard.

Revision history for this message
In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to Yibin Lin from comment #172)
> Some additional information: I tried disabling C6 and with the WiFi card it
> still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA
> 550 G3 PSU, Asus ROG Strix B350-f motherboard.

How did you disable C6? Via BIOS or the Python script posted earlier?

Revision history for this message
In , lyyb46 (lyyb46-linux-kernel-bugs) wrote :

(In reply to Francesco Biscani from comment #173)
> (In reply to Yibin Lin from comment #172)
> > Some additional information: I tried disabling C6 and with the WiFi card it
> > still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA
> > 550 G3 PSU, Asus ROG Strix B350-f motherboard.
>
> How did you disable C6? Via BIOS or the Python script posted earlier?

I think I did both..(?) After I disabled C6 in the BIOS, I ran the Python script and it shows two things about C6 states, one disabled (I guess that's the effect of BIOS setup) and the other enabled. I then wrote a simple systemd script to disable the other C6 state at the Ubuntu startup time. But normally within half an hour my computer would freeze.

Revision history for this message
In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to Yibin Lin from comment #174)
> I think I did both..(?) After I disabled C6 in the BIOS, I ran the Python
> script and it shows two things about C6 states, one disabled (I guess that's
> the effect of BIOS setup) and the other enabled. I then wrote a simple
> systemd script to disable the other C6 state at the Ubuntu startup time. But
> normally within half an hour my computer would freeze.

Thanks for the info.

That's a bummer, I was becoming convinced that C6 was at the core of the issue.

For reference, for 2 months I have been executing these commands at every startup:

python zenstates.py --c6-disable
python zenstates.py --disable -p0
python zenstates.py --disable -p1
python zenstates.py --disable -p2

And I haven't had a single issue so far. I will start experimenting leaving C6 on and disabling only the P states, just to see what happens. Still waiting for the next BIOS update for my motherboard, and then it's almost time for the release of Ryzen 2...

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

I've seen something new on one of my two affected machines. Note I have C6 *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The machines have been rock-solid since I started doing that, about 3-4 weeks uptime now.

Anyway, this message came up on one of them:
[1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
[1905495.738523] Do you have a strange power saving mode enabled?
[1905495.738524] Dazed and confused, but trying to continue

I have no idea what to make of this, perhaps it can help.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to kernel from comment #176)
> I've seen something new on one of my two affected machines. Note I have C6
> *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The
> machines have been rock-solid since I started doing that, about 3-4 weeks
> uptime now.
>
> Anyway, this message came up on one of them:
> [1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
> [1905495.738523] Do you have a strange power saving mode enabled?
> [1905495.738524] Dazed and confused, but trying to continue
>
> I have no idea what to make of this, perhaps it can help.

I can't even with those errors. Wow.

Regarding the loaded CPU: I did the same with about 15% CPU load 24/7 and had similar results. I'm now testing with just C6 turned off via the Python script.

This whole mess is getting really annoying though. I want to support AMD, but man, I might have to go back to intel for a totally reliable machine. Segfault first, RMA pains, and now random crashes with low load? FFS...

Revision history for this message
Mike R (mi9e) wrote :

Well I've now hit 3 weeks of uptime with C6 disabled in the BIOS and
GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-15 processor.max_cstate=1"

and 4.14.6 built from the mainline kernel repo. At this point I'm satisfied that for me at least my Ryzen is now stable.

Revision history for this message
Mike R (mi9e) wrote :

Just to be clear my mainline kernel is configured with CONFIG_RCU_NOCB_CPU to yes.
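Since the boot parameter only helps when the kernel was built with this option, it is worth checking the running kernel's config. A minimal sketch, assuming the distribution installs the config as /boot/config-<release> (as Debian and Ubuntu do); the helper names are illustrative only.

```python
import platform
import re


def running_kernel_config_path():
    """Assumed location of the running kernel's config (Debian/Ubuntu layout)."""
    return f"/boot/config-{platform.release()}"


def rcu_nocb_enabled(config_text):
    """True if the kernel config text contains CONFIG_RCU_NOCB_CPU=y."""
    return re.search(r"^CONFIG_RCU_NOCB_CPU=y$", config_text, re.MULTILINE) is not None
```

Usage would be along the lines of rcu_nocb_enabled(open(running_kernel_config_path()).read()). A line reading "# CONFIG_RCU_NOCB_CPU is not set" means the option is missing and the kernel needs rebuilding.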

Revision history for this message
Mario Emmenlauer (emmenlau) wrote :

Ok, after updating to the latest BIOS the issue seems resolved for me. I'm now also at an uptime of over 3 weeks, and before I could never get past 1 week. I've updated the BIOS of my ASRock AB350 Pro4 AMD B350 to 3.20 with AGESA 1.0.0.6b.

I am currently running kernel 4.14.5 with CONFIG_RCU_NOCB_CPU and boot option "rcu_nocbs=0-15". My system is a Ryzen 7 1700X on an ASRock AB350 Pro4 with 32GB RAM @3200.

Revision history for this message
In , sergio (sergio-linux-kernel-bugs) wrote :

(In reply to kernel from comment #176)
> I've seen something new on one of my two affected machines. Note I have C6
> *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The
> machines have been rock-solid since I started doing that, about 3-4 weeks
> uptime now.
>
> Anyway, this message came up on one of them:
> [1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
> [1905495.738523] Do you have a strange power saving mode enabled?
> [1905495.738524] Dazed and confused, but trying to continue
>
> I have no idea what to make of this, perhaps it can help.

As per this thread, this message could be due to your GPU, for instance: https://bbs.archlinux.org/viewtopic.php?id=121291

Revision history for this message
Tom Reynolds (tomreyn) wrote :

So, to sum up, we have several reports here that Ubuntu's default kernels freeze on low load situations with Ryzen CPUs (or a subset of Ryzen CPUs?) up to and including the latest mainline releases. This bug differs from bugzilla.kernel.org #196481 (user space segfaults only on high CPU load).

For low load freezes we have a reliable workaround for Xenial (probably also for newer releases) across different hardware components (differences in exact Ryzen model, mainboards, memory models and configurations) and kernel versions:

* Update to latest UEFI
* Kernel >= 4.10.0 (4.14.5 mainline tested successfully) built with CONFIG_RCU_NOCB_CPU
* The result of `echo rcu_nocbs=0-$(($(nproc)-1))` appended to GRUB_CMDLINE_LINUX
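The shell one-liner in the last step can also be sketched in Python, which may be handy inside a provisioning script. This assumes that covering every logical CPU is what is wanted; note that os.cpu_count() reports all logical CPUs, whereas nproc can report fewer when CPU affinity is restricted. The function name is illustrative.

```python
import os


def rcu_nocbs_param(cpu_count=None):
    """Build the rcu_nocbs= boot parameter covering CPUs 0..N-1."""
    if cpu_count is None:
        cpu_count = os.cpu_count()  # logical CPUs (threads), may be None
    if not cpu_count or cpu_count < 1:
        raise ValueError("could not determine the number of CPUs")
    return f"rcu_nocbs=0-{cpu_count - 1}"
```

For a 16-thread Ryzen 7, rcu_nocbs_param(16) gives "rcu_nocbs=0-15", matching the values people have posted in this thread.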

In my experience, with the above changes applied, the following BIOS changes are NOT necessary:

* Disabling SMT
* Disabling ASLR

It is unclear whether it is strictly necessary to disable C6 in BIOS, comments welcome.

I would think CONFIG_RCU_NOCB_CPU is an acceptable compromise (for this CPU type) if it provides the stability AMD has, for exactly one year now, been unable to provide to Linux users on their current platform.

Revision history for this message
Tom Reynolds (tomreyn) wrote :

Link to correct upstream bug report

Revision history for this message
In , bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote :

Several Ubuntu users seem to be able to work around this bug as discussed in
https://bugs.launchpad.net/linux/+bug/1690085 , more specifically in https://bugs.launchpad.net/linux/+bug/1690085/comments/69

Revision history for this message
In , darkbasic (darkbasic-linux-kernel-bugs) wrote :

Didn't know about it, that's really bad! I'm subscribing to make sure to not buy a Ryzen until this gets fixed.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Moritz Naumann from comment #179)
> Several Ubuntu users seem to be able to work around this bug as discussed in
> https://bugs.launchpad.net/linux/+bug/1690085 , more specifically in
> https://bugs.launchpad.net/linux/+bug/1690085/comments/69

That's just the same workaround we've been discussing the whole time. It works but it's not 100% effective. I've maybe had one or two freezes with it over the course of a few months.

Revision history for this message
In , bugspam (bugspam-linux-kernel-bugs) wrote :

Created attachment 273573
Workaround init script (derived from zenstates.py)

I'm using the attached init script to work around the problem. I can be fairly sure of it hanging (or even rebooting) within a day or two without this; no issues with it in place.

CONFIG_RCU_NOCB_CPU is not set here.

Revision history for this message
Mike R (mi9e) wrote :

@tomreyn

My current config to get stability:

C6 disabled in BIOS
ASLR disabled
kernel >= 4.14.4 with CONFIG_RCU_NOCB_CPU
rcu_nocbs=0-15 processor.max_cstate=1 in GRUB_CMDLINE_LINUX

I still got lockups with ASLR disabled and rcu_nocbs on the custom-built kernel. Once I added C6 disabled in the BIOS and processor.max_cstate=1, I got 6 weeks of uptime.

Revision history for this message
Kevin Wong (stonecracker) wrote :

CPU: Ryzen Threadripper 1920X
Ubuntu 16.04 LTS

Identical problems to the above. Agonizing, irritating, random, low-load idle-state freezes. No issues while under heavy load.

My fixes:

Disabled c-states in the BIOS
Disabled ASLR in the BIOS
Disabled AMD Cool 'n Quiet in the BIOS.

Compiled a new kernel from source, (4.13.16) but with "Offload RCU_callback processing" aka CONFIG_RCU_NOCB_CPU, chosen in the kernel config.

I added "rcu_nocbs=0-23" to the "GRUB_CMDLINE_LINUX" parameter in /etc/default/grub (and then ran "sudo update-grub"). "0-23", to cover all 24 threads on the 12 core Ryzen.

Rebooted.

Only with all of those fixes (yep, all of them, tested every one of them separately -- that took an eon) have I finally gotten a stable system.

Zero issues now after 8 weeks. Let's hope that holds.

Haven't tried any kernel updates, though.
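The GRUB edit described above can be sketched as a small text transformation. This is only an illustration, assuming GRUB_CMDLINE_LINUX is set on a single double-quoted line as in a stock /etc/default/grub; the function name is made up, and update-grub still has to be run afterwards.

```python
import re


def add_kernel_param(grub_text, param, variable="GRUB_CMDLINE_LINUX"):
    """Append param to the quoted value of `variable` unless already present."""
    pattern = re.compile(
        rf'^({re.escape(variable)}=")([^"]*)(")[ \t]*$', re.MULTILINE
    )

    def patch(match):
        params = match.group(2).split()
        if param not in params:  # keep the edit idempotent
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = pattern.subn(patch, grub_text)
    if count == 0:
        raise ValueError(f"no {variable}=\"...\" line found")
    return new_text
```

Running it twice with the same parameter leaves the file unchanged, so it is safe to call from a setup script. It deliberately does not touch GRUB_CMDLINE_LINUX_DEFAULT unless that name is passed as `variable`.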

In addition to the above, I also consulted these references:

http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

http://blog.programster.org/how-to-disable-aslr

Revision history for this message
In , 05497894 (05497894-linux-kernel-bugs) wrote :

Apologies if this has been mentioned or is completely obvious, but if you're using zenstates or ryzen-stabilizator or some other software means to disable C6, make sure you run it after resuming from S3. At least on my system, C6 is re-enabled after I resume from sleep.
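One way to automate this on systemd-based distributions is a sleep hook: systemd-sleep runs executables placed in its system-sleep directory with "pre" or "post" as the first argument. Below is a minimal sketch of such a hook. The install path of zenstates.py is an assumption; the --c6-disable flag is the one quoted earlier in this thread.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Hypothetical install location of zenstates.py; adjust to wherever it lives.
ZENSTATES = "/usr/local/sbin/zenstates.py"


def command_for(phase):
    """Return the command to run for a sleep phase, or None to do nothing."""
    if phase == "post":  # systemd-sleep passes "post" after resume
        return ["python", ZENSTATES, "--c6-disable"]
    return None


def main(argv):
    phase = argv[1] if len(argv) > 1 else ""
    command = command_for(phase)
    if command is None:
        return 0  # nothing to do before suspend or when called by hand
    return subprocess.run(command).returncode


if __name__ == "__main__":
    main(sys.argv)
```

The file would need to be executable and placed in the system-sleep directory (typically under /usr/lib/systemd or /lib/systemd, depending on the distribution). Hooks there run as root, which zenstates.py needs for MSR access.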

Revision history for this message
In , cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote :

I'm also experiencing what appears to be this same issue, on an ASUS
Prime X370-PRO motherboard and a Ryzen 1800X running Fedora 27 (with no
overclocking and Kingston server RAM with ECC). I've been able to capture
kernel logs through netconsole and, as with other people, they're reported
to run through smp_call_function_many and/or smp_call_function_single
(usually called from TLB flushing code).

Booting my system with 'rcu_nocbs=0-15 processor.max_cstate=5' appears
to have made it stable (again, as with other people).

Revision history for this message
Peridot (peridot) wrote :

This is still an issue even with the bionic dailies. The easiest way (read: without recompiling kernels) I have found to get it to work is to disable IOMMU in your BIOS and add "iommu=soft" to the kernel booting options in grub.

Linux can then detect everything properly (all cores) and I've had zero crashes. The only issue is that it's using software IOMMU.

Without these options you will either get crashes or hangs relating to the ACPI table (booting with acpi=off will only show a single core), along with lots of AMD-Vi logged events and IRQ crashes.

I believe this is related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360

I'm on a Ryzen 1800X and Biostar B350GT5.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote : Re: [Bug 1690085] Re: Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

> On 5 Feb 2018, at 7:21 PM, Peridot <email address hidden> wrote:
>
> This is still an issue even with the bionic dailies. The easiest way
> (read: without recompiling kernels) I have found to get it to work is to
> disable IOMMU in your BIOS and add "iommu=soft" to the kernel booting
> options in grub.
>
> linux can then detect everything properly (all cores) and I've had zero
> crashes. The only issue is that it's using software IOMMU.
>
> Without these options you will either get crashes or hangs relating to
> the ACPI table (booting with acpi=off will only show a single core).
> Lots of AMD-Vi Logged events. and irq crashes.
>
> I believe this is related to
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360 <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360>

This particular bug is about interrupt storm from AMD GPIO driver.
Can you file a separate bug instead?

>
> I'm on a Ryzen 1800X and Biostar B350GT5.

Revision history for this message
Peridot (peridot) wrote :
Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

I have Ryzen 7 1800X, ASUS Prime X370-PRO, with CPU replaced by AMD.

The damn thing has not worked properly since I bought it.

I have: kernel 4.14.18-300.fc27
        CONFIG_RCU_NOCB_CPU=y
        rcu-nocbs=0-15

I have seen streams of "watchdog: BUG: soft lockup", but mostly the system just stops and I can find no logging that tells me why.

Currently I find the machine has stopped over night (when it is idle), every two or three days.

I have just installed kernel 4.15.3-300.fc27. But I can find nothing to suggest that this fault is understood, let alone fixed.

I have just added processor.max_cstate=5 -- though I have failed to find any way to check what difference that has made.
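One possible way to see what the kernel is doing with idle states is the cpuidle directory in sysfs, which lists each state the idle driver exposes together with a usage counter. The sketch below reads it for one CPU. It assumes the standard /sys/devices/system/cpu/cpuN/cpuidle/stateM layout; how the listed states map onto the CPU's C6 is not something this can show.

```python
from pathlib import Path


def cpuidle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return (name, disabled, usage) for each cpuidle state of one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        disabled = (state_dir / "disable").read_text().strip() == "1"
        usage = int((state_dir / "usage").read_text())  # times the state was entered
        states.append((name, disabled, usage))
    return states
```

Comparing the output before and after changing processor.max_cstate (number of states listed, and which usage counters keep growing) gives at least a rough indication of whether the parameter had any effect.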

So: is the "work-around" still the only available solution ?

[FWIW, while the AMD CPU went through the RMA process, I bought an Intel Coffee Lake i7-8700K machine. Will I buy another AMD ? I doubt it.]

Revision history for this message
In , johnpaul7 (johnpaul7-linux-kernel-bugs) wrote :

There really has to be something else at play here. With my system everything has been stable since finding this thread and setting rcu-nocbs. But my co-worker who built an almost identical system (only differences being 1800X and non QVL memory) is experiencing the problem and close to giving up. My system specs:

Fedora 26
kernel 4.14.18
CONFIG_RCU_NOCB_CPU=y
rcu-nocbs=0-15

Ryzen 7 1700x (week 37)
ASrock Fatal1ty AB350 Gaming-ITX/ac (4.40 bios)
32gb QVL memory running at rated 2933 via profile
Other stuff: AMD GPU, Corsair PSU, Samsung SSD

Bios settings are pretty much stock apart from memory profile, enabling virtualization stuff, and tweaking fan profiles. Nothing tweaked related to C6 (bios or zenstates script).

To help aggregate all our experiences and try to find some patterns, what are everyone's thoughts on starting a Google form spreadsheet with columns for:
- CPU
- CPU week: yes, not related really but good to know in general
- Mobo: helpful for finding others in "your" situation
- Mobo Chipset: curious if 350 vs 370 shows anything significant. Mobo name would probably contain this, but having a dedicated column to sort is helpful
- Mobo BIOS version
- (boolean) QVL memory: we know Ryzen is sensitive to memory, so would be good data point to collect
- (boolean) CONFIG_RCU_NOCB_CPU
- (boolean) rcu-nocbs: bool since value is based on processor and not something I think we misalign with number of threads
- C6 tweaks: "false" or what you did (e.g. bios, zenstates.py, etc)
- Distro
- Kernel
- (boolean) Passes stress test: to help eliminate hardware instability from OC'ing or faulty memory sticks, we can define what qualifies as "true" here
- (linear scale) Stable: from "my life sucks and I can't use the system" to "my life on Ryzen is awesome and I forgot the last time I rebooted" (while still conscious that bugs and crashes exist outside of Ryzen-related issues).

I want to try and keep it as simple/straightforward as possible so it complements this thread, and helps us see the data in a more structured format. Hopefully we can find a pattern! I also can't help but wonder why more people are not reporting issues. Would be curious if any people in the Level1Techs community can get Wendell & Co. to draw more attention to this. AMD is making great strides for open source and Linux-friendly hardware, but this problem really hurts that effort.

**EDIT** v1 of form is live: https://goo.gl/forms/oGCTPnNK0vJtNntj2

If you want to be a collaborator on the form, message me directly. Also share feedback on how to make the form better suited for our goal!

Revision history for this message
John Paul Herold (dailyherold) wrote :

Been following this thread along with https://bugzilla.kernel.org/show_bug.cgi?id=196683. To help us aggregate data and look for patterns, I created a google form: https://goo.gl/forms/oGCTPnNK0vJtNntj2

Message me if you want to be a collaborator on it, and also please share feedback. Goal is to keep it simple and straightforward so that it complements this thread (and kernel bugzilla thread). I could have had questions for every workaround listed across both threads, but tried to distill it to the ones most commonly stated/used/discussed.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #185)
> I have Ryzen 7 1800X, ASUS Prime X370-PRO, with CPU replaced by AMD.
>
> The damn thing has not worked properly since I bought it.
>
> I have: kernel 4.14.18-300.fc27
> CONFIG_RCU_NOCB_CPU=y
> rcu-nocbs=0-15
>
> I have seen streams of "watchdog: BUG: soft lockup", but mostly the system
> just stops and I can find no logging that tells me why.
>
> Currently I find the machine has stopped over night (when it is idle), every
> two or three days.
>
> I have just installed kernel 4.15.3-300.fc27. But I can find nothing to
> suggest that this fault is understood, let alone fixed.
>
> I have just added processor.max_cstate=5 -- though I have failed to find any
> way to check what difference that has made.
>
> So: is the "work-around" still the only available solution ?
>
> [FWIW, while the AMD CPU went through the RMA process, I bought an Intel
> Coffee Lake i7-8700K machine. Will I buy another AMD ? I doubt it.]

Run the python script mentioned earlier.

For me new kernel/params with RCU setting enabled & python script to turn off C states has had my machine running for 60 days now.

Revision history for this message
In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #187)
> Run the python script mentioned earlier.

+1, disabling C6 has resulted in a rock-solid experience over here, no idle reboots in months of constant use of the machine.

Just wanted to mention a couple of points for those still following this thread.

I recently updated my motherboard's BIOS to the latest version, which contains a CPU microcode update from AMD (I am now on version "0x08001136", whereas I was stuck on version "0x08001129" for - I believe - 6 months or so). I had high hopes a microcode update would fix the idle reboots, but, alas, they still persist with the new microcode and C6 enabled.

The other interesting tidbit is this article from 1 year ago I found:

https://www.gamersnexus.net/news-pc/2870-ryzen-power-plan-update-min-frequency-90-pct

Apparently, soon after the Ryzen launch last year, AMD issued an updated version of the Windows "Balanced" preset power plan which disables "core parking". Doing a bit of research, it seems like core parking is synonymous with (or at least strongly related to) the C6 power state. Might this explain why people on Windows are not experiencing random reboots? I am not 100% sure how these Windows preset power plans work, but the page linked above seems to indicate that:

1) after the update, the "Balanced" power plan disables core parking,
2) core parking was always disabled anyway in the "Performance" power plan.

See also this thread on the AMD community forum:

https://community.amd.com/thread/220563

Maybe AMD knew all along there are issues with the C6 power state...

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

Nice work John-Paul. I've just put in my entry. I'm curious to see whether *anyone* has experienced these freezes when C6 is disabled. If not then I think we can be fairly sure the culprit has been found.

My own stress-ng approach is also still holding strong. Both servers have been rock solid since early December, with C6 *on* and no kernel options. This adds to the evidence that prolonged full idle is the trigger for this issue.

This was brought up before, but can we get onto an AMD engineer with regards to this? It'd be particularly interesting to ask about Francesco's hypothesis.

Revision history for this message
In , johnpaul7 (johnpaul7-linux-kernel-bugs) wrote :

I enabled the setting so you can see charts and summary of data after submission, but for those who missed that or just want a direct link to the spreadsheet, here it is: https://docs.google.com/spreadsheets/d/1PuIkhxbxdE2H7fjbmfWYKBJI_iw8A6W1h-SZnQUXBSI/edit?usp=sharing

Pretty interesting stuff, only 7 responses total so far so sample size way too small to draw conclusions. However, initial notes I find interesting:

- 2 of 7 still have stability issues, one has a >1725 chip (1744). Again, this issue is not related to the segfault stuff, but it's a good data point that newer steppings still show the behavior.
- The only two responses that aren't using QVL listed memory are the two still having issues. Both happen to be ECC memory as well. May or may not be relevant because these two do not have any RCU tweaks in place.
- Of the 5 who have stable systems, the CPU models vary, the chipsets vary, BIOS and AGESA vary from new to old.
- The only constant from 5 "stable" systems is "CONFIG_RCU_NOCB_CPU=y". Whether C-States was tweaked, or rcu-nocbs, or ASLR, there are mixed responses.
- Good on everyone for testing system/hardware stability!

Thank you to those who shared! Let's get more data!

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

John-Paul, I'd like to note that I entered that I still have issues because with stress-ng not running to create artificial load, the problem persists. As in my previous comment, with a single thread at 100% 24/7 I do *not* have issues.

I did this originally to provide more data; not letting the system enter full idle will also solve the problem, without BIOS tweaks or custom kernel options. n=2 for this result.
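
For anyone who wants to try the "never fully idle" approach without stress-ng, here is a minimal sketch of the same idea in Python. It is not the stress-ng setup described above, just a single-threaded busy loop; `keep_busy` is a hypothetical helper name.

```python
import time


def keep_busy(duration_s: float) -> int:
    """Spin on one core for duration_s seconds and return the loop count."""
    deadline = time.monotonic() + duration_s
    iterations = 0
    # A tight loop with no sleep keeps one hardware thread at ~100% load,
    # so the package as a whole never becomes fully idle.
    while time.monotonic() < deadline:
        iterations += 1
    return iterations
```

Calling `keep_busy(60)` inside a `while True:` loop would hold one thread busy indefinitely. Whether a Python loop prevents the freeze as reliably as stress-ng is untested here.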

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

OK...

  []# cat /proc/cmdline
  BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 root=... ro rhgb quiet rcu_nocbs=0-15 processor.max_cstate=5

  []# python zenstates.py -l
  P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
  P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
  P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
  P3 - Disabled
  P4 - Disabled
  P5 - Disabled
  P6 - Disabled
  P7 - Disabled
  C6 State - Package - Enabled
  C6 State - Core - Enabled

So... kernel command line option "processor.max_cstate=5" does not appear to disable C6. [And removing the option did not seem to change anything, either.]
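
To confirm which options the running kernel actually received, something like the following sketch could be used instead of eyeballing /proc/cmdline. `parse_cmdline` and `read_cmdline` are hypothetical helper names, and the parser assumes no quoted values containing spaces. Note that it only shows the option was passed; as seen above, that says nothing about whether C6 is really off.

```python
def parse_cmdline(text: str) -> dict:
    """Map each kernel parameter to its value (None for bare flags like 'ro')."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def read_cmdline(path: str = "/proc/cmdline") -> dict:
    """Parse the command line of the running kernel."""
    with open(path) as f:
        return parse_cmdline(f.read())
```

For example, `read_cmdline().get("rcu_nocbs")` returns the CPU list that was passed, or None if the option is missing.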

But...

  []# python zenstates.py --c6-disable
  Disabling C6 state

  []# python zenstates.py -l
  P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
  P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
  P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
  P3 - Disabled
  P4 - Disabled
  P5 - Disabled
  P6 - Disabled
  P7 - Disabled
  C6 State - Package - Disabled
  C6 State - Core - Disabled

So... having added "python zenstates.py --c6-disable" to rc.local (eventually), I now appear to have C6 disabled when the system is rebooted.
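
A small sketch for checking this automatically after boot: it parses the text printed by "zenstates.py -l" (in the format shown above) and reports whether C6 is enabled. `c6_status` is a hypothetical helper, and it assumes the output format does not change between zenstates versions.

```python
def c6_status(listing: str) -> dict:
    """Return e.g. {'Package': False, 'Core': False}; True means C6 is enabled."""
    status = {}
    for line in listing.splitlines():
        parts = [p.strip() for p in line.split(" - ")]
        # Only the two "C6 State - <scope> - <Enabled|Disabled>" lines match;
        # P-state lines have a different first field or field count.
        if len(parts) == 3 and parts[0] == "C6 State":
            status[parts[1]] = parts[2] == "Enabled"
    return status
```

The listing text could come from `subprocess.run(..., capture_output=True, text=True).stdout`; if any value in the result is True, C6 is still on for that scope.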

Do I have to do anything else to make sure that C6 *stays* disabled ?

Chris

Revision history for this message
In , 05497894 (05497894-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #192)
> OK...
>
> []# cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 root=... ro rhgb quiet
> rcu_nocbs=0-15 processor.max_cstate=5
>
> []# python zenstates.py -l
> P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
>
> So... kernel command line option "processor.max_cstate=5" does not appear to
> disable C6. [And removing the option did not seem to change anything,
> either.]
>
> But...
>
> []# python zenstates.py --c6-disable
> Disabling C6 state
>
> []# python zenstates.py -l
> P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Disabled
> C6 State - Core - Disabled
>
> So... having added "python zenstates.py --c6-disable" to rc.local
> (eventually), I now appear to have C6 disabled when the system is rebooted.
>
> Do I have to do anything else to make sure that C6 *stays* disabled ?
>
> Chris

If you suspend your system, make sure you run zenstates again upon resume. Otherwise you should be ok.
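
One possible way to do that on a systemd distro is a sleep hook: systemd-sleep runs every executable in /usr/lib/systemd/system-sleep/ with "pre" or "post" as the first argument. A rough sketch, assuming zenstates.py lives at the path below (adjust it, the location is a guess):

```python
#!/usr/bin/env python3
import subprocess
import sys

# Assumption: change this to wherever zenstates.py is installed on your system.
ZENSTATES = "/usr/local/sbin/zenstates.py"


def should_reapply(argv: list) -> bool:
    """systemd-sleep calls hooks as: <hook> pre|post <action>; act on 'post' only."""
    return len(argv) >= 2 and argv[1] == "post"


def main(argv: list) -> int:
    if not should_reapply(argv):
        return 0
    # Re-run the same command that rc.local runs at boot.
    result = subprocess.run([sys.executable, ZENSTATES, "--c6-disable"], check=False)
    return result.returncode


if __name__ == "__main__":
    main(sys.argv)
```

The file needs to be executable and run as root (which systemd-sleep hooks do). It does nothing on "pre", so suspend itself is not delayed.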

Revision history for this message
In , it (it-linux-kernel-bugs) wrote :

I want to share my findings. Today I googled for "Ryzen C6 state" and found this reddit thread: https://www.reddit.com/r/Amd/comments/7ita4h/why_its_suggested_to_disable_c6_and_cstate_global/

The last comment regarding DRAM "Power Down Enable" and instabilities combined with C6 raised my interest. So I searched the UEFI of my ASRock AB350 Pro4 and found that option, but it was already disabled - but perhaps it's worth looking into if you experience idle freezes.

But I found another interesting option called "Power Supply Idle Control". Didn't find much documentation about it except: https://lime-technology.com/forums/topic/61767-amd-ryzen-update/?tab=comments#comment-607258

As was mentioned before, AMD claims that the idle freezes are related to the PSU. I don't think my PSU is unable to support that low power mode, but here http://forum.asrock.com/forum_posts.asp?TID=6832&title=ramocvoltagecantlowerthanxmpvoltageb350pro4-420
someone states that the "Typical current Idle" setting changes the idle voltage of the cores. Perhaps that might make a difference?
Anyone tried that?

I've set it to "Typical current Idle" for now and we'll see. 3h of idle uptime so far - not bad :-)

Oh, and I was a little bit astonished to see my AGESA version shown in the UEFI to be 1.0.0.1a. If anyone wonders like me, it seems, that AMD has reset the versioning for whatever reason.

Revision history for this message
Jamie Davis (sichemist) wrote :

I had the same issues this bug describes: Freeze on low load; Runs fine on heavy load. I even ran stress -c 16 when I had to keep the system up for long periods of low load.

I tried pretty much everything in this list including compiling a mainline kernel with CONFIG_RCU_NOCB_CPU enabled and the respective GRUB changes without success.

I noticed a problem in /var/log/syslog involving the ath9k driver.
I disabled the driver (rmmod ath9k) and the problems went away.

I had a PCIe wireless card I swapped in from an earlier Intel Core i5 build that used that driver. After disabling ath9k, all MY problems went away. I replaced it with a different PCIe wireless adapter that has an Intel chipset and my system has been stable ever since. I changed back all the settings that I had altered (except C6) and everything works.

The old Ath9k PCIe card is stable in the Intel system I moved it back to.

I'm only posting here because my bug had the exact symptoms as the ones in this thread but a completely different cause/solution.

Revision history for this message
In , c.buddeweg (c.buddeweg-linux-kernel-bugs) wrote :

(In reply to John-Paul Herold from comment #186)

Linux version 4.15.5-gentoo (gcc version 7.3.0 (Gentoo 7.3.0 p1.0)) #1 SMP Sat Feb 24 08:35:43 CET 2018
Command line: BOOT_IMAGE=/kernel-4.15.5-prod rcu_nocbs=0-15 root=/dev/sdb5 ro

Boot Mode is legacy - no tweaks, C6 is on

Hardware R7 1700 (RMA'ed because of segfault problem)
GA-AX370-Gaming K7 Bios F22b
32 GB RAM no QVL

The lockup is reproducible after a long time idle/sleep (more than 30 min). I didn't find anything inside the logs.

With CONFIG_RCU_NOCB_CPU=n I had the lockups. Now I use CONFIG_RCU_NOCB_CPU=y and the command line above and it has been stable for 3 days.

I use the CPU "ondemand" governor and the frequency range is from 1400 MHz up to 3100 MHz. It's wider than the "AMD balanced power plan for Ryzen" that you can see for Win10.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

Question, I'm curious what the impact is: how much power would say a Ryzen 1800x use more in idle or low use (like desktop or browsing internet) with these 'fixes' enabled?

Revision history for this message
In , it (it-linux-kernel-bugs) wrote :

FYI: I just passed 1 week uptime without any freeze. I never before managed to have such a long uptime. So the "Power Supply Idle Control" option in BIOS seems to really make a difference.

Revision history for this message
In , kernel.org (kernel.org-linux-kernel-bugs) wrote :

Having an uptime of 5 days now. Seems to be stable.
Was down to ~4-6h between lockups.

Running successfully with:
 rcu_nocbs=0-11
 zenstates.py --c6-disable
On
 4.13.0-36-generic (HWE kernel)
 Ubuntu 16.04.4 LTS
 AMD Ryzen 5 1600X
 Asus PRIME X370-PRO (Bios 3803)

I'm not sure if I would have needed the RCU. C6 disable was a must.
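
Since the rcu_nocbs range depends on the thread count (0-11 on this 1600X, 0-15 on the 8-core parts mentioned elsewhere in the thread), a tiny sketch for generating the right value; `rcu_nocbs_param` is a hypothetical helper name:

```python
import os


def rcu_nocbs_param(cpu_count: int) -> str:
    """Build the boot parameter covering logical CPUs 0..cpu_count-1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return f"rcu_nocbs=0-{cpu_count - 1}"


def rcu_nocbs_for_this_machine() -> str:
    # os.cpu_count() counts logical CPUs (threads), which is what the range needs.
    return rcu_nocbs_param(os.cpu_count() or 1)
```

The resulting string goes on the kernel command line (e.g. via GRUB), and only has an effect on kernels built with CONFIG_RCU_NOCB_CPU=y.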

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to rtux from comment #198)
> Having an uptime of 5 days now. Seems to be stable.
> Was down to ~4-6h between lockups.
>
> Running successfully with:
> rcu_nocbs=0-11
> zenstates.py --c6-disable
> On
> 4.13.0-36-generic (HWE kernel)
> Ubuntu 16.04.4 LTS
> AMD Ryzen 5 1600X
> Asus PRIME X370-PRO (Bios 3803)
>
> I'm not sure if I would have needed the RCU. C6 disable was a must.

RCU did not fix my issue, C6 disabling script did. I'm at 75 days uptime.

Revision history for this message
In , excieve (excieve-linux-kernel-bugs) wrote :

I think it might be solved for me by the latest ASRock X370 Gaming K4 firmware.

So, I have an RMA'd Ryzen 1700X, which had this freezing problem since receiving it. The workaround I applied was disabling the C6 state in BIOS, which stopped this. After installing BIOS 4.50 (currently latest) I noticed there's no such option any more so I started using the zenstates script on boot to do this. However, a week ago I was reading more on this issue and the "Power Supply Idle Control" option in the BIOS. So I decided to stop disabling C6 while keeping the option at default (where it should be adaptive) and see how it goes. During this week I tried leaving idle over night, whole day, using as usual, causing sustained high load, etc. and there's been no freezing so far.

Might be just luck of course, but before, with C6 enabled, it would freeze quite reliably even with normal daily usage. Or perhaps it's just a lot less common now.

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

Good evening,

I was facing the same issues most people are facing here: the so-called AMD Ryzen soft-lock bug, happening when the CPU was idling. Digging on the Internet, I found some interesting information on avoiding having to disable C-state, which is very good for CPU life-expectancy and energy saving.
It is all about tricking the component voltage and frequency. For interested people, this is depicted here: http://www.silence.host/node/1

To Artem Hluvchynskyi, please let us know more about this. As guessed by some, this definitely looks to be a power issue, and not a kernel one.

Revision history for this message
In , excieve (excieve-linux-kernel-bugs) wrote :

Unfortunately just got a freeze again after 1.5 weeks. So I ended up switching "Power Supply Idle Control" to "Typical" and apparently what it actually does is disable the C6 state on the CPU package. At least that's what the MSR shows (through the zenstates script) and I can't see Vcore going lower than 0.8V so it does indeed look like C6 is not being reached. This should make it stable but it really sucks. Will later check if the freshest firmware (released these days) changes anything and how the RCU offload workaround affects stability.
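
For those who want to look at the MSRs directly rather than through zenstates, here is a rough sketch. The register addresses and bit positions are assumptions taken from my reading of the zenstates.py source, not from AMD documentation, so verify them against the script before relying on this. Reading needs root and the msr kernel module loaded.

```python
import os
import struct

# Assumptions (from zenstates.py as I understand it, not from AMD docs):
PKG_C6_MSR = 0xC0010292                           # bit 32: package C6 enable
CORE_C6_MSR = 0xC0010296                          # bits 22, 14, 6: core C6 enable
PKG_C6_BIT = 1 << 32
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def read_msr(msr: int, cpu: int = 0) -> int:
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR address; each read returns 8 bytes.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def decode_c6(pkg_value: int, core_value: int) -> dict:
    """Interpret raw MSR values; True means C6 is enabled for that scope."""
    return {
        "Package": bool(pkg_value & PKG_C6_BIT),
        "Core": (core_value & CORE_C6_BITS) == CORE_C6_BITS,
    }
```

Usage would be `decode_c6(read_msr(PKG_C6_MSR), read_msr(CORE_C6_MSR))`. With "Power Supply Idle Control" set to "Typical", the observation above suggests the package entry would come back False.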

Revision history for this message
In , kallisti5 (kallisti5-linux-kernel-bugs) wrote :

Same issue here for a while on Fedora 27. System seems to hard-lock when idle almost daily. (I've never had a lockup while "in-use")

Ryzen 1800X
PRIME X370-PRO + Latest BIOS
32GB Ram
amd_iommu=on
I noticed the issue completely stopped ~4.15.3, but definitely has returned on 4.15.6

I'm trying the "rcu_nocbs=0-15" fix since it has been driving me nuts.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

Is anyone here planning on getting a Ryzen 2 a month or so from now? With or without a new x470/b450 mainboard? I'm curious if AMD, even if they haven't really acknowledged the issue, has actually worked on the problem and fixed it in the new CPU or mainboards.

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

> I'm trying the "rcu_nocbs=0-15" fix since it has been driving me nuts.
This may not solve it for you. I use rcu_nocbs=0-15 and I made systemd services that run on boot and on return from suspend, which run the script attached to this bug https://bugzilla.kernel.org/attachment.cgi?id=273573 that disables C6. A combination of those two solves it on both my boxes; the one with an ASUS board hung regularly with just rcu_nocbs=0-15 but works fine with both.

Unrelated, it's interesting that AMD is very silent about this total scandal of a bug in their system. I won't be buying a new Ryzen 2 when it's released (not that I would anyway, current one is fast enough). Intel does not have this problem. It is further a total outrage that AMD has been trying to blame this on PSUs. If two brand new high-end PSUs from two different brands result in the exact same problem then it's clearly AMD's fault.

Revision history for this message
In , kallisti5 (kallisti5-linux-kernel-bugs) wrote :

Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in this thread.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Alexander von Gluck from comment #206)
> Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in
> this thread.

I don't think the rest of us without ASUS boards are imagining this issue.

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

(In reply to Alexander von Gluck from comment #206)
> Are we sure this isn't an Asus BIOS bug?
Yes. I have two 1600X's, one's on an Asus board and the other a Gigabyte board. Switching PSU/RAM/GPU doesn't change anything, both have the idle bug in every combination. This total scandal is all AMD's fault.

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

Asus is AMD's main partner for Ryzen chipsets. AMD works
first with them and, when they decide a fix is stable, propagates the
BIOS updates to other vendors, i.e. MSI or Gigabyte.

The issue _is_ a power issue (not a PSU issue). Giving a bit more power
to the chipset, CPU and memory generally solves the bug. The bug also
existed in a different form on Windows, and exists with the exact same
form on BSD.

Please, consider this (and this is something different from disabling
C-state). Google, or consider reading this: http://www.silence.host/node/1

On 12/03/2018 at 12:43, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #207 from James Le Cuirot (<email address hidden>) ---
> (In reply to Alexander von Gluck from comment #206)
>> Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in
>> this thread.
> I don't think the rest of us without ASUS boards are imagining this issue.
>

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Is anyone from AMD aware of this at all?

I don't have a Ryzen system but I support someone who does and this appears to be going on for them too.

Anyone here, or anyone who someone knows here, got contacts within AMD to make them aware of the issue?

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #210)
> Is anyone from AMD aware of this at all?

They know, they're just not admitting it publicly.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #210)
> Is anyone from AMD aware of this at all?

I opened a ticket with them on 25-Feb-2018.

I received a response on 28-Feb-2018, to which I replied as follows:

[start_______________________________________________________________________
On 28/02/18 10:51, <email address hidden> wrote:
> Dear Chris,
>
> Your service request : SR #{ticketno:[8200794428]} has been reviewed and
> updated.
>
> Response and Service Request History:
>
> Thank you for the response.
>
> I understand you are experiencing an issue on your PC with Ryzen processor
> when C6 state is enabled on the BIOS.

Yes, my PC freezes at random when it is idle. Typically it will freeze when left overnight, roughly every 2 or 3 days.

Having disabled C6 -- using the 'zenstates.py' script, from https://github.com/r4m0n/ZenStates-Linux -- my machine has not frozen for 7 days.

> This issue has been fixed with the latest BIOS updates, but the option to fix
> it may not be available in all BIOS.

What is the root cause of the issue ?

In what way has it been fixed ?

> I request you to update to the latest BIOS and see if you have the Power
> Supply Control option in the MB BIOS. Try toggling this option between
> the different settings to see if it fixes it. If the specific option is
> not available I would suggest you keep C6 off for now.

I have the latest available "PRIME X370-PRO BIOS 3803" from ASUS. That apparently includes:

  2.Update to AGESA 1000a for new upcoming processors

I understand that means AGESA 1.0.0.0a (?) -- I have no idea what that means, since AMD seem to keep the release notes for AGESA as a deep, dark secret. Previous BIOSes had (according to ASUS) "AGESA 1071", and before that were "AGESA 1.0.0.6B" and "AGESA 1.0.0.6a"... so I admit to being baffled.

What does this new BIOS "Power Supply Option" do ?

Are you telling me that this is a problem with my power supply ?

If so, does this mean I need a better power supply ?

Disabling C6 is not really a long term solution... since that disables both (a) the maximum single core performance, and (b) the minimum power consumption state. While these are arguably marginal, I have wasted a lot of time and energy trying to get my machine to work reliably.

I am seriously disappointed that the only information available is buried in kernel bug report(s) and in various support forums.

Having (eventually) found Linux Kernel Bug 196683, I have been hoping that AMD would leap into action to: (a) inject some clarity into the discussion, and (b) provide a proper solution.

<sigh>

For completeness, let me repeat my questions:

  1) What is the root cause of the issue ?

  2) In what way has it been fixed ?

  3) What does this new BIOS "Power Supply Option" do ?

  4) Are you telling me that this is a problem with my power supply ?

  5) If so, does this mean I need a better power supply ?
__________________________________________________________________________end]

I prompted them on 12-Mar-2018, and today (14-Mar-2018) have received:

[start_______________________________________________________________________

Your service request : SR #{ticketno:[8200794428]} has bee...

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #212)
> I admit that the PSU I was using was (a) old and (b) cheap. I have now
> replaced that by a brand-new, more efficient PSU which supports 0A loads
> across all voltages. I am running that with C6 enabled, and await
> developments.

Thanks for putting the pressure on them. I also contacted support but didn't press them any further. Your new PSU probably won't help though; we discussed the various PSUs that people have further up this page and most, like mine, are good quality and not that old.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #212)
> I admit that the PSU I was using was (a) old and (b) cheap. I have now
> replaced that by a brand-new, more efficient PSU which supports 0A loads
> across all voltages. I am running that with C6 enabled, and await
> developments.

Could you please tell which PSU you're now testing?

Revision history for this message
In , cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote :

My Ryzen 1800X / Asus Prime X370-Pro based system that experiences this
has an EVGA SuperNova G3 550 watt PSU. Measured power usage is well
under its maximum rating, and while I can't find an explicit statement
in its documentation about 0A loads, I would be stunned if it doesn't
fully support that, since it's a quite modern and well regarded model
(and other G3 models seem to be reported as Haswell-ready in this way).

(More broadly, if AMD Ryzen CPUs and motherboards don't work reliably
with modern high quality PSUs, there is a problem in practice for people
who want to have reliable Ryzen-based Linux systems.)

Revision history for this message
In , kallisti5 (kallisti5-linux-kernel-bugs) wrote :

There's a lot of AMD bashing in this thread. I'm still inclined to believe that it's just bugs to work out with a new major CPU changeup. The same kind of bugs bubbled up when Athlon 64 + Opteron came out.

Beyond the idle bug (which does indeed seem fixed so far by adding the rcu_nocbs boot flag), performance seems fine under Fedora 27.

Let me ping my AMD graphics contacts and see if they're aware of this ticket.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #214)
> (In reply to Chris Hall from comment #212)
> > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > across all voltages. I am running that with C6 enabled, and await
> > developments.

> Could you please tell which PSU you're now testing?

It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #217)
> (In reply to Klaus Mueller from comment #214)
> > (In reply to Chris Hall from comment #212)
> > > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > > across all voltages. I am running that with C6 enabled, and await
> > > developments.
>
> > Could you please tell which PSU you're now testing?
>
> It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011. I'm facing problems, which can be worked around here by using the optimized mode for daily computing (Asus Prime X370 pro) and nothing else. I'm curious about your test result. According to be quiet!, your PSU should be 0A - stable (compliant to Intel C6/C7).

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #218)
>
> Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011.
> I'm facing problems, which can be worked around here by using the optimized
> mode for daily computing (Asus Prime X370 pro) and nothing else more. I'm
> curious about your test result. According be quiet!, your PSU should be 0A -
> stable (compliant to intel C6/C7).

FWIW, I have looked at a number of reputable PSU suppliers, and found absolutely nothing else which explicitly covers the minimum current.

However, I find that ATX12V v2.4 is included in Revision 1.31 of Intel's 'Design Guide for Desktop Platform Form Factors', published in April 2013. [https://www.intel.com/content/dam/www/public/us/en/documents/guides/power-supply-design-guide.pdf] This appears to be the latest revision, and I note:

  * revision history says Revision 1.31 includes:

     * Changed 3.2.10 12 V2DC Minimum Loading to REQUIRED

     * Updated ... ATX12V Specific Guidelines to version 2.4

  * section 3.2.10 seems to:

     * REQUIRE +12 V2DC to operate down to 0.05A

     * and RECOMMEND that it operates down to 0A

So... I guess that a PSU which claims ATX12V v2.4 compliance should be OK ?
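
Put as a trivial check, using only the two thresholds quoted above from section 3.2.10 (the function name and the wording of the returned labels are mine, and it says nothing about whether a given PSU actually behaves at those loads):

```python
REQUIRED_MIN_LOAD_A = 0.05     # +12 V2DC must operate down to this (REQUIRED)
RECOMMENDED_MIN_LOAD_A = 0.0   # ... and should operate down to this (RECOMMENDED)


def classify_12v2_min_load(min_load_a: float) -> str:
    """Classify a PSU's documented minimum +12 V2DC load against rev 1.31."""
    if min_load_a < 0:
        raise ValueError("minimum load cannot be negative")
    if min_load_a <= RECOMMENDED_MIN_LOAD_A:
        return "meets recommendation"
    if min_load_a <= REQUIRED_MIN_LOAD_A:
        return "meets requirement"
    return "does not meet requirement"
```

So a PSU documented as "0A on all rails" lands in the first bucket, while one that only claims ATX12V v2.4 compliance is guaranteed just the second.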

Revision history for this message
In , 2rb0alex (2rb0alex-linux-kernel-bugs) wrote :

Hi,
ASRock AB350 Pro4 + Be quiet! Pure Power 10 CM owner here.
Still having issues (got hit by this bug during "pacman -Syu" update last week).
The only thing that works for me is disabling C6.
Could AMD provide a reference PC spec or PSU QVL or something?
I've found no PSU recommendations from AMD except this:
https://support.amd.com/en-us/recommended/power-supplies

Revision history for this message
In , kernel.org (kernel.org-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #218)
> Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011.
> I'm facing problems, which can be worked around here by using the optimized
> mode for daily computing (Asus Prime X370 pro) and nothing else more. I'm
> curious about your test result. According be quiet!, your PSU should be 0A -
> stable (compliant to intel C6/C7).

I'm running a new beQuiet! System Power 8 with 400W (Jan 2018) which supports latest processor generations from Intel and AMD.

I had to disable C6 to get it stable - see comment #198 for details.
Current uptime is 17 days now.

Still waiting for an Asus BIOS fix, so that I can re-enable C6.

Revision history for this message
In , johnpaul7 (johnpaul7-linux-kernel-bugs) wrote :

I'm running a [Corsair SF450](https://www.corsair.com/us/en/Power/Plug-Type/sf-series-psu-config/p/CP-9020104-NA), which under "Tech Specs" lists the ATX12V version as v2.4. I had the idle freeze issue until I added `rcu_nocbs=0-15`. Haven't had to do anything else specific in BIOS or other tweaks.

I have modified the google form to include a question for validating the ATX12V version, just to see if anything trends from that.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

AMD blaming the PSU is wrong, since idle freezes have been documented with various PSUs, new and old. If you read my older posts, I did a complete replacement of parts between two systems: one exhibiting the freeze on idle issue and one that didn't. The system not exhibiting the freeze was using a really old (pre Haswell) PSU. Putting this PSU on the other system didn't avoid idle freeze.

The problem is related to power and actually I have "hidden" the problem by overclocking my CPU. Hitting certain voltages/frequencies avoids the issue completely (at least in my case with an ASUS X370 Prime). But even these voltages/frequencies are different from one case to another. Another user here reported he "hides" the problem by using the automatic overclocking of the motherboard ("optimized mode for daily computing" option) which actually overclocks to 3600 if I remember correctly. For me this didn't work, still had idle freezes. But idle freezes disappeared when I overclocked to 3900.

The rcu option doesn't fix the problem for all. It changes the way cores get idle so some see the problem as fixed, some don't. It is more than obvious that this is not a kernel problem. So I would close this report as invalid. It is not a kernel bug. It is a hardware issue that AMD needs to fix and it seems it has been fixed with this new option "Power Supply Option". Unfortunately my motherboard doesn't have this option yet.

Revision history for this message
In , kallisti5 (kallisti5-linux-kernel-bugs) wrote :

I opened a support case with AMD, and buried among several basic steps (what GPU, etc.) was "disable C6 in BIOS".

So it sounds like AMD is definitely aware of the problem, since disabling C6 was in their first support communication.

I have a 1000 Watt Corsair HX1000i and see the idle hang-ups, so it is definitely *NOT* a power supply issue.

I wonder if Windows users see this issue?

(In reply to Panagiotis Malakoudis from comment #223)
> rcu option doesn't fix the problem for all.

For the RCU fix to work, your distro also has to have kernel support for it (CONFIG_RCU_NOCB_CPU). See this article:

http://textandhubris.com/2017/12/09/ryzen-issues-on-fedora-27/
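
To illustrate the kernel-support check, here is a minimal sketch that looks for `CONFIG_RCU_NOCB_CPU=y` in the text of a kernel config file. The function names are made up for this example, and it assumes your distro ships the config as `/boot/config-<release>`, which is not the case everywhere.

```python
import re


def config_value(config_text, option):
    """Return the value assigned to a kernel config option, or None if unset."""
    # Anchor on the full option name followed by '=' so that
    # CONFIG_RCU_NOCB_CPU does not match CONFIG_RCU_NOCB_CPU_NONE,
    # and '# ... is not set' comment lines never match.
    pattern = re.compile(r"^%s=(.*)$" % re.escape(option), re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


def rcu_nocbs_supported(config_text):
    """True if the kernel was built with callback offloading support."""
    return config_value(config_text, "CONFIG_RCU_NOCB_CPU") == "y"


# Example with an inline sample instead of a real /boot/config-* file.
sample = "CONFIG_RCU_NOCB_CPU=y\n# CONFIG_RCU_NOCB_CPU_ALL is not set\n"
print("rcu_nocbs usable:", rcu_nocbs_supported(sample))
```

On a real system you would pass in the contents of the config file for the running kernel instead of the inline sample.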

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Also does anyone know if the Epyc series has this issue?

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #223)
> Blaming the PSU from AMD's part is wrong, since idle freezes have been
> documented with various PSUs, new and old. If you read my older posts, I did
> a complete replacement of parts between two systems: one excibiting the
> freeze on idle issue and one that didn't. The system not excibiting the
> freeze was using a really old (pre Hasswell) PSU. Putting this PSU on the
> other system didn't avoid idle freeze.

To have one more documented check by Chris Hall can't be bad. But I fear I know how this test will end, even though I wish him that it works.

> The problem is related to power and actually I have "hide" the problem by
> overclocking my CPU.

This is interesting new information.

> Hitting certain voltages/frequencies avoids completely
> the issue (at least on my case with an ASUS X370 Prime). But even these
> voltages/frequencies are different from one case to another. Another user

Me!

> here reported he "hides" the problem by using the automatic overclocking of
> motherboard ("optimized mode for daily computing" option) which actually
> overclocks to 3600 if I remember well.

You do! 3600 is probably enough for me, as I'm always running 4 VMs. They are mostly idle too, but they add some minimal load, which could keep the system from often entering very low current levels. Moderate overclocking additionally helps to circumvent the problem.
The next point is that I seldom run this machine for more than 15 hours, so I can't say whether the problem has disappeared completely. But before the overclocking, I could see freezes after an hour or even earlier.

> For me this didn't work, still had
> idle freezes. But idle freezes disappeared when I overclocked at 3900.

This is good to know! There are a lot of posts elsewhere about overclocking Ryzen (and RAM, which seems to be much more difficult), mostly by Windows users. Overclocking not only doesn't cause difficulties, it could even resolve problems! That's a cool new idea!
Could it be related to the ratio between RAM and CPU clock?

[...]

> It is a hardware issue that AMD needs to fix and it
> seems it has been fixed in with this new option "Power Supply Option".

As I read elsewhere [1], this option would disable C6 states, among others. So using zenstates.py to switch off C6 states would be the same "solution".

[1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

>
> As I could read elsewhere [1], this option would disable C6 states among
> others. so, using zenstates.py to switch off c6 states would be the same
> "solution".
>
> [1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714

This does not seem to be true for all motherboards. I have a B350 Tomahawk from MSI and I updated the BIOS today. Its latest BIOS has this new Power Supply Option, which can be set to Auto, Low current idle, or Typical current idle. The default is Auto.

I have changed it to both Low and Typical current and checked the possible states using zenstates.py -l and in both cases I get:

P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

So the C6 states look enabled. I am now running it with Typical current idle to see if it improves stability. Next Monday, when I get back to work, I will see whether the machine locked up over the weekend, as usual, or not.
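
For anyone who wants to check this automatically after changing the BIOS option, here is a rough sketch that pulls the two C6 lines out of a `zenstates.py -l` listing. It assumes the output format shown above; `parse_c6_states` is a name made up for this example and is not part of zenstates.py.

```python
def parse_c6_states(listing):
    """Map 'Package' and 'Core' to True (enabled) or False (disabled)."""
    states = {}
    for line in listing.splitlines():
        parts = [part.strip() for part in line.split(" - ")]
        # Only lines of the form 'C6 State - <scope> - <Enabled|Disabled>'
        # have exactly three fields starting with 'C6 State'.
        if len(parts) == 3 and parts[0] == "C6 State":
            states[parts[1]] = parts[2] == "Enabled"
    return states


# Sample listing copied from the output format shown in this thread.
listing = """\
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
"""
print(parse_c6_states(listing))
```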

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to Paulo J. S. Silva from comment #228)
> >
> > As I could read elsewhere [1], this option would disable C6 states among
> > others. so, using zenstates.py to switch off c6 states would be the same
> > "solution".
> >
> > [1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714
>
> This does not seem to be true for all motherboards. I have a B350 Tomahawk
> from MSI and I updated the BIOS today. In its last BIOS it has this new
> Power Supply Option that can be set to Auto, Low current idle, Typical
> current idle. The default is auto.
>
> I have changed it to both Low and Typical current and checked the possible
> states using zenstates.py -l and in both cases I get:
>
>
> P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
>
> So the C6 states looks enabled. I am now running it with Typical current
> idle to see if it improves stability. Next monday, when I get back to work I
> will see if the machine locked over the weekend, as usual, or not.

I look forward to your results! I've resorted to a new kernel, kernel parameters, and the .py script to get a stable system. I'd like to get back to stock distro kernels, and if your setting works I'll be there!

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

Created attachment 274783
attachment-6763-0.html

> This does not seem to be true for all motherboards. I have a B350 Tomahawk
> from
> MSI and I updated the BIOS today. In its last BIOS it has this new Power
> Supply
> Option that can be set to Auto, Low current idle, Typical current idle. The
> default is auto.
>
> So the C6 states looks enabled. I am now running it with Typical current idle
> to see if it improves stability. Next monday, when I get back to work I will
> see if the machine locked over the weekend, as usual, or not.
>

Good to hear that the new BIOS is starting to spread to other manufacturers.
For my motherboard (MSI X370 Gaming Plus), there is an explicit comment
for the new BIOS version:

1. Do not update this BIOS if you`re currently using windows7

This suggests they are focusing on these Unix soft lockups, even
if nothing official is written anywhere about it. But with that kind of
comment we can expect a big mess on the BIOS side in a few months.
For my part, I won't test it, since tweaking the voltages has made my computer
very stable (20 days so far, 14 days last time until a reboot for
a new kernel installation).

Revision history for this message
In , bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote :

For me, Ryzen 7 1800X (acquired in late April 2017) on ASRock X370 Taichi with 2x 16GB (non-QVL) Kingston 9965669-019.A00G unbuffered ECC RAM (banks 2+4) on be quiet! Dark Power Pro 11 80plus 650W, is stable out of the box since BIOS 4.60 (2018/3/6) with default BIOS settings (except for SVM = enabled) on Ubuntu 16.04 and linux-image-hwe-edge 4.13.0.36.37. The previous BIOS version 4.40 (2018/2/9) also worked with a default Linux kernel image and without kernel parameters after setting the new BIOS 'current' option to the non-default 'full current' value.
So it just took a year and the follow-up generation to be released, yeay.

Revision history for this message
In , bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote :

It turns out my previous statement was premature. The system still freezes unless "Power Supply Idle Control" is set to the non-default value of "Typical Current".

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Moritz Naumann from comment #232)
> It turns out my previous statement was premature. The system still freezes
> unless "Power Supply Idle Control" is set to the non-default value of
> "Typical Current".

http://download.gigabyte.cn/FileList/Manual/mb_manual_ga-ax370m-ds3h_e.pdf

Power Supply Idle Control
Enables or disables Package C6 State.

Typical Current Idle
Disables this function.

Low Current Idle
Enables this function.

Auto
The BIOS automatically configures this setting (Default)

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

>
> Power Supply Idle Control
> Enables or disables Package C6 State.

Interesting. According to zenstates.py, on my motherboard (an MSI B350 Tomahawk) the Package C6 State is enabled in both cases.

Actually, I have good news. Setting this option to Typical current idle kept my machine stable over the weekend (it almost always freezes during weekends if left idle). It is still too soon to assert anything, but it is a good sign.

I may also bring my Kill A Watt to compare the idle consumption at the outlet with this option enabled and disabled, to see if I can find any difference. But I have to find the Kill A Watt at my place first.

Revision history for this message
In , sineways (sineways-linux-kernel-bugs) wrote :

Ryzen 3 1200 Quad-Core (Made: 17/28) with an ASRock AB350 Pro4 running BIOS 3.20 (AGESA 1.0.0.6b)

System gets hardlocks (anywhere from 1-3, 10-30 days) that cause even SysRq to not work. The chip is in a server that varies between idle and active, but the hang seems to happen regardless of CPU usage state.

Only once did I get a "BUG: soft lockup". And in that situation I was able to REISUB sysrq my way to a clean reboot.

I've been following this bug for a while, but I'm not sure of what a proper solution might be (if any) for those that have found one at this point.

Below is some possibly helpful information:
Kernel: 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1 (2018-01-14) x86_64 GNU/Linux
Grub: rcu_nocbs=0-3

-
python zenstates.py --list
P0 - Enabled - FID = 7C - DID = 8 - VID = 3A - Ratio = 31.00 - vCore = 1.18750
P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I thought that disabled C6 states, and that wasn't desirable, but perhaps I'm mistaken.

Appreciate any guidance
Doug

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to Douglas S from comment #235)
> Ryzen 3 1200 Quad-Core (Made: 17/28) with an ASRock AB350 Pro4 running BIOS
> 3.20 (AGESA 1.0.0.6b)
>
> System gets hardlocks (anywhere from 1-3, 10-30 days) that cause even SysRq
> to not work. The chip is in a server that varies between idle and active,
> but the hang seems to happen regardless of CPU usage state.
>
> Only once did I get a "BUG: soft lockup". And in that situation I was able
> to REISUB sysrq my way to a clean reboot.
>
> I've been following this bug for a while, but I'm not sure of what a proper
> solution might be (if any) for those that have found one at this point.
>
> Below is some possibly helpful information:
> Kernel: 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1 (2018-01-14)
> x86_64 GNU/Linux
> Grub: rcu_nocbs=0-3
>
> -
> python zenstates.py --list
> P0 - Enabled - FID = 7C - DID = 8 - VID = 3A - Ratio = 31.00 - vCore =
> 1.18750
> P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore =
> 1.05000
> P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore =
> 0.88750
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
>
> At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I
> thought that disabled C6 states, and that wasn't desirable, but perhaps I'm
> mistaken.
>
> Appreciate any guidance
> Doug

Run the mentioned python script and disable C6. I have 90 days uptime on my server after doing this.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Douglas S from comment #235)
> At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I
> thought that disabled C6 states, and that wasn't desirable, but perhaps I'm
> mistaken.

I thought that at one point but that's not true. It's an effective workaround for most (but evidently not all) people with fewer downsides.

On my side, I have updated to my latest BIOS and tried setting the new "Power Supply Idle Control" option to "Typical Current" while removing the rcu_nocbs parameter. So far, so good. zenstates.py reports the following, which makes sense given the option's description.

C6 State - Package - Disabled
C6 State - Core - Enabled

If you don't know what this means, I've read that "core" just powers down the CPU cores while "package" powers down most of the rest of the CPU after the cores. Sounds like an appropriate fix. I'd like to measure the power usage but haven't gotten round to it yet.

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Still no response from AMD? Look forward to the brave soul who tests this on the new 2xxx series in a month or two.

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

Created attachment 274853
disable c6

So maybe disable package C6 within the kernel?

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #239)
> So maybe disable package c6 within kernel?

Good idea, and fine for testing. But if it goes to production, please make it a kernel command-line option, because not everybody has this problem, and some have a better solution, such as overclocking the CPU to a greater or lesser degree.

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #240)
> Good idea and for testing ok - but please as kernel commandline option if it
> should go to production because not everybody does have this problem
> respectively does have better solution like more or less CPU overclocking.

It's not necessary because we can already write the MSR from userspace.

Is the RMA'd CPU also stepping 1?

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Created attachment 274863
zenstates.py small patch to allow disabling only C6 package

I made a small patch to zenstates.py to allow disabling only C6 package for those that don't have the new power idle control options in their BIOS. Test and report.

Revision history for this message
In , sergio (sergio-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #241)
> (In reply to Klaus Mueller from comment #240)
> > Good idea and for testing ok - but please as kernel commandline option if
> it
> > should go to production because not everybody does have this problem
> > respectively does have better solution like more or less CPU overclocking.
>
> It's not necessary because we can already write the MSR from userspace.
>
> Is the RMA'd CPU also stepping 1?

Mine is stepping 1.

For people using the "Power Supply Idle Control" setting in the BIOS/UEFI, does it also get reset after resuming from suspend/sleep?

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Sergio C. from comment #243)
> (In reply to Kai-Heng Feng from comment #241)
> > Is the RMA'd CPU also stepping 1?
>
> Mine is stepping 1.

So is mine, which has been RMA'd.

> For people using the "Power Supply Idle Control" setting in the BIOS/UEFI,
> does it also get reset after resuming from suspend/sleep?

Just tested, C6 Package stays disabled.

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

If we are sure the issue is gone when package C6 is disabled, I'll send a patch which disables package C6 on model == 1 && stepping == 1.
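
To see whether a given machine would match that condition, here is a small sketch that reads the first processor block of `/proc/cpuinfo`-style text. The function names are invented for this example, and the extra checks for `AuthenticAMD` and CPU family 23 (Zen) are my own assumption; the comment above only mentions model and stepping.

```python
def parse_first_cpu(cpuinfo_text):
    """Return the key/value fields of the first processor block."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def matches_model1_stepping1(fields):
    """True for AMD family 23, model 1, stepping 1 (family check is assumed)."""
    return (
        fields.get("vendor_id") == "AuthenticAMD"
        and fields.get("cpu family") == "23"
        and fields.get("model") == "1"
        and fields.get("stepping") == "1"
    )


# Inline sample in /proc/cpuinfo format; on Linux, read /proc/cpuinfo instead.
sample = (
    "vendor_id\t: AuthenticAMD\n"
    "cpu family\t: 23\n"
    "model\t\t: 1\n"
    "model name\t: AMD Ryzen 7 1700X Eight-Core Processor\n"
    "stepping\t: 1\n"
)
print(matches_model1_stepping1(parse_first_cpu(sample)))
```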

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #245)
> If we are sure the issue is gone when package C6 is disabled, I'll send a
> patch which disables package on model == 1 && stepping == 1.

Doesn't this have consequences for power consumption and/or performance? I am not sure this is the best approach now that the new BIOS option is in place. As I said, on my system (MSI B350 motherboard) setting the option to "Typical" solves the problem. My machine has been running for a week now without any freeze.

Moreover, in my case the BIOS option does not seem to disable the C6 Package, as far as zenstates.py can tell:

pjssilva@quorra:~/local/administracao/ZenStates-Linux$ sudo ./zenstates.py -l
[sudo] password for pjssilva:
P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
pjssilva@quorra:~/local/administracao/ZenStates-Linux$

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

(In reply to Paulo J. S. Silva from comment #246)
> (In reply to Kai-Heng Feng from comment #245)
> > If we are sure the issue is gone when package C6 is disabled, I'll send a
> > patch which disables package on model == 1 && stepping == 1.
>
> Does this does not have consequences on power consumption and/or
> performance?

I don't know. We need someone who has a power meter to do the test.

Revision history for this message
In , klausman (klausman-linux-kernel-bugs) wrote :

I for one have a stable system *without* disabling C6, but just rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it wouldn't help me, but likely make the power consumption a lot worse.

Hardware:
AMD Ryzen 7 1700X (model 1, stepping 1) that was RMA'd for segfaults
ASUS PRIME B350-Plus, BIOS 3401
2x Kingston 9965669-017.A00G

No overclocking and core boost disabled.

I regularly get uptimes well over two weeks and haven't had a lockup since I added the RCU blacklist.
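
The `rcu_nocbs=` values quoted in this thread (0-15, 0-3, 0-11, 0-63) all cover every logical CPU of the machine in question. A minimal sketch of deriving the parameter from the CPU count follows; `rcu_nocbs_param` is a helper name made up for illustration, and covering all CPUs is simply what people here have been doing, not an official recommendation.

```python
import os


def rcu_nocbs_param(cpu_count):
    """Build an rcu_nocbs= kernel parameter covering CPUs 0..cpu_count-1."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    if cpu_count == 1:
        return "rcu_nocbs=0"
    return "rcu_nocbs=0-%d" % (cpu_count - 1)


# os.cpu_count() reports logical CPUs (threads), e.g. 16 on a Ryzen 7 1700X.
print(rcu_nocbs_param(os.cpu_count() or 1))
```

The resulting string goes on the kernel command line, for example via `GRUB_CMDLINE_LINUX` in the GRUB defaults file.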

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

I haven't measured it yet but I wouldn't panic about power usage prematurely. I think just disabling C6 package is quite different to disabling C6 entirely.

It also seems like BIOS settings aren't always proving reliable and you need to check zenstates.py to see whether it has changed what you expect or even anything at all.

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

I guess rcu_nocbs only papers over the bug by making it harder for the CPU to enter Package C6.

Regarding power consumption, we need real-world test results before making any judgment.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

This option must be switchable, of course. I don't need it either! It should be handled like the BIOS switch: the default is no change, and machines which need it enable it with, e.g., c6package.disable, or even better, additionally via sysctl (always opt-in).

About power consumption: I can test it on an already overclocked system (+200 MHz). I can do it once I can switch off the system to plug in the power meter. The precision isn't very high, but a difference of 1 or 2 watts should be noticeable.

Revision history for this message
In , excieve (excieve-linux-kernel-bugs) wrote :

I think both the RCU workaround and disabling C6 (either through the MSR or using the BIOS option) just mask the problem, which is probably at the hardware level. It is likely also masked on Windows by the way its power management works. I strongly believe that AMD should properly investigate this issue instead of blaming it on PSUs. This happens with all kinds of PSUs, including those that definitely support C6/C7.

One thing I noticed is that with C6 disabled through the script or with the "Typical Current" power supply option in the BIOS, the MB sensors in my case never report a Vcore below 0.8V, which indicates that it indeed doesn't reach the lower idle states even if only "package C6" is disabled. With default settings they frequently report the 0.4-0.5V range when idle.

Since Zen would only apply boosting and especially XFR when certain power and thermal conditions are met, disabling the deep idle states may affect this adversely. So IMO all that should be well tested before applying a workaround like this on the kernel level.

Revision history for this message
In , it (it-linux-kernel-bugs) wrote :

Since I've enabled "Typical Current" my system is absolutely stable. On my system that disables package C6.

I did a short test: I ran stress-ng -c1, pinned the process to the first core, and watched the output of cpufreq-aperf. I didn't see a difference between package C6 enabled and disabled.

I didn't run a very long test, nor do I know whether the cpufreq-aperf values are really meaningful in this scenario (sometimes it showed frequencies above the 3.9GHz my 1700X should be able to reach), but at least there was no difference.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Artem Hluvchynskyi from comment #252)
> I think both RCU workaround and disabling C6 (either through MSR or using
> the BIOS option) just masks the problem, which is probably on hardware
> level.

Maybe true.

[...]

> One thing I noticed is that with C6 disabled through the script or using the
> "Typical Current" power supply option in BIOS in my case MB sensors never
> report Vcore less than 0.8V, which indicates that indeed it doesn't reach
> lower idle states even if only "package C6" is disabled. On default settings
> it will frequently report 0.4-0.5V range when idle.

I'm reaching exactly this 0.4-0.5V range with overclocking (+200 MHz) and C6 completely enabled, and I haven't seen the problem for many weeks now (before, it showed up within an hour). My PSU is definitely not officially C6/C7 capable because it's too old.

> Since Zen would only apply boosting and especially XFR when certain power
> and thermal conditions are met, disabling the deep idle states may affect
> this adversely. So IMO all that should be well tested before applying a
> workaround like this on the kernel level.

Fine with me if it's switchable (opt-in).

Nevertheless, it would indisputably be much better to have a real fix based on the root cause instead of a workaround, but I fear it won't come anytime soon.

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

> One thing I noticed is that with C6 disabled through the script or using the
> "Typical Current" power supply option in BIOS in my case MB sensors never
> report Vcore less than 0.8V, which indicates that indeed it doesn't reach
> lower idle states even if only "package C6" is disabled. On default settings
> it will frequently report 0.4-0.5V range when idle.

The same here (Tomahawk B350 from MSI). With the "Typical current" option in BIOS the voltage has a minimum of 0.82V (that's what sensors report). If I change the setting to "Auto" or "Low current" it goes down to 0.4-0.5V.

With "Typical" my system is stable, with "Auto" it freezes, usually during the night. I haven't tried "Low" yet. Maybe next week.

Revision history for this message
In , laforge (laforge-linux-kernel-bugs) wrote :

FYI: I'm experiencing a similar or the same issue on a Ryzen 1700X

vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1700X Eight-Core Processor
stepping : 1
microcode : 0x8001129

Base Board Information
        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: PRIME B350M-A
        Version: Rev X.0x
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 3803
        Release Date: 01/22/2018

where at least once per day the machine would hang with
<pre>
Message from syslogd@host2 at Mar 23 21:35:56 ...
 kernel:[31606.963222] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [(md-udevd):2623]
</pre>

being the last message. C6 was enabled for both the package and the core by the BIOS.

The machine would hang quite reliably at least once every 24 hours, typically during the idle night times. The load at that point was 0.01.

OS/Kernel is Debian 9 / 4.9.0-6-amd64, which has "CONFIG_RCU_EXPERT is not set" and hence none of the [old or new] more detailed RCU settings are enabled.

Have decided not to upgrade the kernel but simply disable C6 and wait...

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

I tested power consumption with C6 package enabled (default) and disabled. I disabled it via the patch for zenstates.py.

I ran two tests over 2 hours with the VMs shut down and no KDE session (only the login screen was started, and text console 0 was active).

Within the measuring tolerance of the power meter, I couldn't find any difference between enabled and disabled C6 package.

The voltages I saw in sensors were the same (0.39 V most of the time), with no difference. The load of the machine was 0 0 0.

Maybe the patch for zenstates.py to disable C6 package doesn't work? It would have been nice to be able to re-enable it.

Another nice finding: the consumption of my 4 idle(!) VMs is a constant 5 watts or so!

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Those of you that have the new power idle options, can you report the values of msr registers 0xC0010292 and 0xC0010296 after booting?
It can be done with:
modprobe msr
rdmsr -x 0xC0010292
rdmsr -x 0xC0010296
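
If msr-tools is not installed, the same read can be sketched in Python through the `/dev/cpu/N/msr` device that the msr module provides, where the file offset is the register number. `read_msr` and its `path_template` parameter are names chosen for this example. It needs root and `modprobe msr` first, and I have only reasoned through it, not tested it on Ryzen hardware.

```python
import os
import struct


def read_msr(address, cpu=0, path_template="/dev/cpu/%d/msr"):
    """Read one 64-bit MSR by seeking to its address in the msr device."""
    fd = os.open(path_template % cpu, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, address)  # offset = MSR number, length = 8 bytes
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from MSR 0x%X" % address)
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    for msr in (0xC0010292, 0xC0010296):
        try:
            print("0x%X = %x" % (msr, read_msr(msr)))
        except OSError as err:
            print("cannot read 0x%X: %s" % (msr, err))
```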

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Disabling only C6 package does not fix the issue for me. So whatever "Power Supply Idle Control" does, it is not just disabling C6 package.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #258)
> rdmsr -x 0xC0010292

52

> rdmsr -x 0xC0010296

484848
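
For reference, here is a sketch of how these two values can be decoded. The bit positions (bit 32 of 0xC0010292 for package C6; bits 22, 14 and 6 of 0xC0010296 for core C6) are what zenstates.py tests as far as I recall; treat them as an assumption, not as AMD documentation. The function names are made up for this example.

```python
# Bit masks as recalled from zenstates.py (assumption, not AMD documentation).
PACKAGE_C6_BIT = 1 << 32                          # in MSR 0xC0010292
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)   # in MSR 0xC0010296


def package_c6_enabled(msr_c0010292):
    """True if the package C6 enable bit is set."""
    return bool(msr_c0010292 & PACKAGE_C6_BIT)


def core_c6_enabled(msr_c0010296):
    """True only if all three core C6 enable bits are set."""
    return (msr_c0010296 & CORE_C6_BITS) == CORE_C6_BITS


# The two values reported just above.
print("package C6 enabled:", package_c6_enabled(0x52))
print("core C6 enabled:", core_c6_enabled(0x484848))
```

Under that assumption, 52 and 484848 decode to package C6 disabled and core C6 enabled.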

I've taken some basic power measurements. Not particularly scientific or accurate, just watching the numbers fluctuate on my power plug meter. I've not looked into this before so I really didn't know whether the difference would be tens of watts or milliwatts. I booted with just C6 package disabled and it hovered around 64.5W. With C6 entirely disabled, it went up to 70W. With C6 entirely enabled, it was around 63.5W. So C6 Package hardly makes any difference. I made these changes with zenstates.py rather than the BIOS for this test.
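
To put those readings in relative terms, here is a quick calculation using only the three wattages above, with the fully enabled case as the baseline. `relative_increase` is a helper name made up for this sketch.

```python
def relative_increase(baseline_watts, measured_watts):
    """Percentage increase of a measurement over the baseline."""
    return (measured_watts - baseline_watts) / baseline_watts * 100.0


baseline = 63.5  # C6 entirely enabled
readings = [
    ("C6 package disabled", 64.5),
    ("C6 entirely disabled", 70.0),
]
for label, watts in readings:
    print("%s: %.1f W (+%.1f%%)" % (label, watts, relative_increase(baseline, watts)))
```

That works out to roughly 1.6% for disabling only the package state and roughly 10% for disabling C6 entirely, at least on this one machine and with these rough meter readings.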

Revision history for this message
In , robert (robert-linux-kernel-bugs) wrote :

Hello guys, (new here, but not to unix/linux)

I have at the moment a new Epyc-based server, and also a new Ryzen PC.
No soft lockups on Epyc, only a very evil error under heavy load that I had been hunting down for a while (https://forums.fedoraforum.org/showthread.php?317537-first-server-error-reboot-what-is-this-UUID) until I decided to move away from FC27 (not certified on this server) and load CentOS 7. So far no error, but I have yet to put more load on it.

Specs, Epyc:
Kernel: 3.10.0-693.21.1.el7.x86_64 x86_64 bits: 64 gcc: 4.8.5
           Desktop: Openbox Distro: CentOS Linux release 7.4.1708 (Core)
Machine: Device: kvm System: Supermicro product: AS -2023US-TR4 v: 0123456789 serial: <filter>
           Mobo: Supermicro model: H11DSU-iN v: 1.02A serial: <filter>
           UEFI [Legacy]: American Megatrends v: 1.1 date: 02/07/2018
CPU(s): 2 16 core AMD EPYC 7351s (-MCP-SMP-) arch: Zen rev.2 cache: 16384 KB

Ryzen:
Kernel: 4.15.10-300.fc27.x86_64 x86_64 bits: 64 gcc: 7.3.1 Console: tty 0
           Distro: Fedora release 27 (Twenty Seven)
Machine: Device: desktop Mobo: ASUSTeK model: PRIME B350M-A v: Rev X.0x serial: <filter>
           UEFI [Legacy]: American Megatrends v: 3402 date: 12/11/2017
CPU: 6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1 cache: 3072 KB

I can say that I have seen the soft lockup on the Ryzen several times in the past 3 months, maybe once / week or so. (but then again the machine is always doing stuff in my home, i.e. downloading backups from servers)

The boot options on both are now:
Epyc: GRUB_CMDLINE_LINUX="rhgb selinux=0 rcu_nocbs=0-63"
Ryzen: GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 rhgb quiet selinux=0 nmi_watchdog=0 nohpet pci=biosirq rcu_nocbs=0-11"

I have not been home yet to upgrade the BIOS on the B350 mobo, but I will be there in 7 days, do it, and post my findings. In the meantime, I will leave the Ryzen completely idle, waiting for a crash.
I have never touched the C6 settings, although I do remember setting my mobo BIOS to "Standard" (not Performance, and not Power Saving).

Revision history for this message
In , robert (robert-linux-kernel-bugs) wrote :

Oh, I forgot:

Epyc (Centos7):
fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_NONE=y
# CONFIG_RCU_NOCB_CPU_ZERO is not set
# CONFIG_RCU_NOCB_CPU_ALL is not set

Ryzen (FC27):
fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

If there's no noticeable power consumption increase when Package C6 gets disabled, and we are sure that it really works around the hard lockup, I'll send the patch upstream.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

It doesn't work around the hard lock, as I reported a while back (comment 259).

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #263)
> If there's no noticeable power consumption increase when Package C6 gets
> disabled, and we are sure that it really workarounds the hard lockup, I'll
> send the patch to upstream.

What exactly would this mean (Linux newbie here)? Would this mean this would be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because electricity is bloody expensive here and the 5-10% power consumption difference would matter to me.

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #264)
> It doesn't workarounds the hard lock, as I reported a while back (comment
> 259)

So does disabling Core C6 do the trick?

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #217)
> (In reply to Klaus Mueller from comment #214)
> > (In reply to Chris Hall from comment #212)
> > > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > > across all voltages. I am running that with C6 enabled, and await
> > > developments.

> > Could you please tell which PSU you're now testing?

> It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

I have been running with the new PSU with C6 enabled (package and core), and with CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15, for 10 days.

This morning the machine was frozen. So, new PSU has not cured the problem.

With the old PSU the machine was freezing every 2-3 days. So, things have improved... which could be the PSU, but the kernel (now 4.15.10) and other software have been updated, and the idle load on the machine depends on how frequently somebody tries to ssh in :-(

The BIOS for my ASUS Prime X370-PRO does not have the magic "Power Supply Idle Control". I will now try "zenstates.py --c6-package-disable".

FWIW: I last wrote to AMD "TECH.SUPPORT" on 14-Mar... so far, no reply.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Disabling both C6 package and C6 core prevents the hard lock, but also disables single-core turbo/XFR speeds. With C6 disabled, my 1700X can only go up to 3500 MHz, while a single core can go up to 3900 MHz with C6 enabled.

Disabling both C6 package and C6 core should not be considered a fix.

Revision history for this message
In , laforge (laforge-linux-kernel-bugs) wrote :

I can report that on my Ryzen 1700X machine disabling core+package C6 seems to
be working around the problem. Where previously the machine would fail quite
reliably at least once per 24 hours, it's now running longer without any trouble
so far.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Jonathan from comment #265)
> (In reply to Kai-Heng Feng from comment #263)
> > If there's no noticeable power consumption increase when Package C6 gets
> > disabled, and we are sure that it really workarounds the hard lockup, I'll
> > send the patch to upstream.
>
> What exactly would this mean (Linux newbie here)? Would this mean this would
> be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because
> electricity is bloody expensive here and the 5-10% power consumption
> difference would matter to me.

Kai-Heng Feng seems to mulishly ignore the fact that not *all people* need this workaround. I repeatedly wrote that this *must be switchable*, e.g. as opt-in.

But before anything is done, it must be proven in general that the fix is correct at all, meaning that it really addresses C6 package and nothing else.

My tests here have shown that there isn't any difference in voltages between enabled and disabled C6 package using zenstates.py. Did it really address the C6 package? But I'm not sure whether it does the same as the kernel patch.

Revision history for this message
In , tomm (tomm-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #240)
> (In reply to Kai-Heng Feng from comment #239)
> > So maybe disable package c6 within kernel?
>
> Good idea and for testing ok - but please as kernel commandline option if it
> should go to production because not everybody does have this problem
> respectively does have better solution like more or less CPU overclocking.

Good luck getting Linus to accept a kernel change for a defect that AMD won't even acknowledge.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #267)
> (In reply to Chris Hall from comment #217)
> > (In reply to Klaus Mueller from comment #214)
> > > (In reply to Chris Hall from comment #212)
> > > > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > > > replaced that by a brand-new, more efficient PSU which supports 0A
> > > > loads across all voltages. I am running that with C6 enabled, and
> > > > await developments.
>
> > > Could you please tell which PSU you're now testing?
>
> > It's a "be quiet! Straight Power 11", Model E11-450W (BN280).
>
> I have been running with the new PSU with C6 enabled (package and core), and
> with CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15, for 10 days.
>
> This morning the machine was frozen. So, new PSU has not cured the problem.

Thanks for your reply!

> With the old PSU the machine was freezing every 2-3 days. So, things have
> improved... which could be the PSU, but the kernel (now 4.15.10) and other
> software has been updated and the idle load on the machine depends on how
> frequently somebody tries to ssh in :-(
>
> The BIOS for my ASUS Prime X370-PRO does not have the magic "Power Supply
> Idle Control". I will now try zenstates.py --c6-package-disable".

I've got the same board. My workaround is to overclock the CPU a bit: I'm using the PC scenario "Daily computing" (+200 MHz); all other settings are default, with no RCU or C6 workaround. Panagiotis Malakoudis had to use even higher overclocking (see comment 223) to work around the problem, but he has a different CPU if I remember correctly. I have an AMD Ryzen 7 1700X and my DRAM clock is 2400 MHz (the default, unchanged). Overclocking doesn't affect power consumption here in idle mode (I tested it)!

Maybe you want to try it?

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #270)
> (In reply to Jonathan from comment #265)
> > (In reply to Kai-Heng Feng from comment #263)
> > > If there's no noticeable power consumption increase when Package C6 gets
> > > disabled, and we are sure that it really workarounds the hard lockup,
> > > I'll send the patch to upstream.
> >
> > What exactly would this mean (Linux newbie here)? Would this mean this
> > would be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because
> > electricity is bloody expensive here and the 5-10% power consumption
> > difference would matter to me.
>
> Kai-Heng Feng seems to mulishly ignore the fact, that not *all people* need
> this workaround. I repeatedly wrote, that this *must be switchable*, as opt
> in e.g.

But lots of systems are affected by this, no? A functional kernel should work out of the box.
So a fix like this needs to be opt-out, like how all other workarounds are handled in the kernel. You can still easily opt out via MSR.

But apparently people are not happy about this fix, so let's just wait for AMD's fix, if there is one.

Revision history for this message
In , robert (robert-linux-kernel-bugs) wrote :

Well, judging by what I have read and what I have seen, this is obviously a fault on AMD's side or in the Ryzen processor, either:
1. by allowing the power to the CPU to get too low, so it starves and stutters.
2. by specifying the wrong CPU power needs to the motherboard manufacturers.

I have read above, there is a site where they "play with the voltages". That seems to be the most obvious real fix to me. However, it's not for "granny", and it should just not happen to a company like AMD, or a hyped product like Ryzen.

All said and done, when I leave my comp at home transcoding 8TB of videos to mobile format, the beast just rocks so good, and it never crashes then ...

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #273)
> But lots of systems are affected by this, no? A functional kernel should
> work out of the box.
> So fix like this needs to be opted out, like how all other workarounds get
> handled in the kernel. You can still easily opt out via MSR.

Opt out is fine for me, too. This is ok. It's important to be switchable. But zenstates.py can't switch it off as of now. Therefore, it must be done via kernel.

> But apparently people are not happy about this fix, let's just wait for
> AMD's fix, if there's one.

Because it wouldn't be a fix, but a workaround. That's the problem. AMD should fix it, but they seem to be lazy and just ignore it!

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

I find it hard to believe this can't be fixed with a BIOS / microcode update. It really is crazy.

Someone earlier here was going to contact AMD's graphics division to see if they could shed some light on it; I'm going to assume that went nowhere.

There is an AMD staff member who regularly posts on reddit. Perhaps if this were posted on the AMD subreddit, succinctly and with evidence, it might get upvotes (but some AMD fans tend to be extremely defensive).

It's certainly frustrating; this bug has now been open for quite a while.

Revision history for this message
In , sergio (sergio-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #276)
> I find it hard to believe this can't be fixed with a bios / microcode
> update. It really is crazy.
>
> Someone earlier here was going to contact AMD graphics division to see if
> they could shed some light on it, I'm going to assume that went nowhere.
>
>

They probably got the same "I apologise for the delay in response, unfortunately the information you are requesting is private and confidential and not publicly available at this time." that I got when I asked about the power supply idle control setting.

On a sidenote, I am currently trying to debug an issue with a wireless card (Intel 7260, mobo ASRock AB350 Pro4, Ryzen 5 1600). Today I went back to an older BIOS that does not have the power supply workaround and simply disabling C6 package did not help. I got a reboot with the infamous "mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108" message; hadn't seen that one in months :)
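
The status word in that MCE line can be partly decoded by hand. A minimal sketch, assuming only the architectural x86 machine-check status bits (63 down to 57) and the low 16-bit MCA error code; vendor-specific fields in between are deliberately left undecoded, and the function name is illustrative:

```python
# Architectural x86 MCi_STATUS flag bits (highest bit first).
MCA_STATUS_FLAGS = {
    63: "VAL",    # register contains valid error information
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}


def decode_mca_status(status):
    """Split a 64-bit machine-check status word into flag names and error code."""
    flags = [name
             for bit, name in sorted(MCA_STATUS_FLAGS.items(), reverse=True)
             if (status >> bit) & 1]
    return {"flags": flags, "error_code": status & 0xFFFF}


info = decode_mca_status(0xBEA0000000000108)
print(info["flags"], hex(info["error_code"]))
```

For the value quoted above, this reports an uncorrected error with processor context corrupt (UC and PCC both set), which is consistent with the machine rebooting rather than logging a corrected event.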

Revision history for this message
In , it (it-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #258)
> rdmsr -x 0xC0010292

52

> rdmsr -x 0xC0010296

484848
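
A minimal sketch for reading these two values. It assumes the bit layout used by the zenstates.py script mentioned in this thread (bit 32 of MSR 0xC0010292 for package C6; bits 22, 14 and 6 of MSR 0xC0010296 for core C6). That layout is an assumption taken from the script, not from AMD documentation, and the function names are illustrative.

```python
# Assumed layout (from zenstates.py, not from an official AMD reference).
PKG_C6_MSR = 0xC0010292
CORE_C6_MSR = 0xC0010296
PKG_C6_BIT = 1 << 32
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def package_c6_enabled(msr_value):
    """True if the assumed package C6 enable bit is set."""
    return bool(msr_value & PKG_C6_BIT)


def core_c6_enabled(msr_value):
    """True if all three assumed core C6 enable bits are set."""
    return (msr_value & CORE_C6_BITS) == CORE_C6_BITS


# Values as printed by "rdmsr -x" above (hexadecimal).
print("package C6:", package_c6_enabled(0x52))
print("core C6:", core_c6_enabled(0x484848))
```

Under that assumption, the values above would mean package C6 is disabled and core C6 is enabled on this machine.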

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

It seems my motherboard, the ASUS Prime X370, will be getting the new "Power Supply Idle Control" option in a new BIOS, version 3907 with AGESA 1002, as reported in https://www.hardwareluxx.de/community/f219/asus-prime-x370-pro-am4-1156996-289.html#post26231965

So I will be able to test this workaround soon.

Revision history for this message
In , simon (simon-linux-kernel-bugs) wrote :

Hi there,

I have been reading this list for about 3 months,
trying to get 4 computers with identical hardware/software
running stably:

AMD Ryzen 7 1700
ASRock AB350M
openSUSE Tumbleweed
32 GB memory (max for this board)
no overclocking, memory (except the color) correct for this board
(there are different part numbers for the memory, but they are related to
the color of the cover, as my computer shop told me)

1) I changed all processors twice; now the high load bug is gone.
   UA 1714 PGT (the first processors, with this bug)
   UA 1733 PGS (the second processors, also with this bug)
   Now I have some others (no idea which), but they run fine
   with (a modified) "kill ryzen script" and BIOS LOWER than 3.2!!!!
   -> so there ARE processors out there with week > 30 which still
      have the high load bug!!

2) Still with those BIOS versions, at low load the systems crashed
   randomly about every 5 days. The systems run about 12 hours/day,
   are then shut down, and are started again the next morning.
   In this situation there is a crash about once every 5 days.
   -> the crashes did not appear if a system ran continuously for
      one week without anyone touching it!!!
   It crashes when starting a browser, or changing windows in KDE, or
   typing in LibreOffice after a while of not doing anything, so it seems
   there has to be a small load coming from nearly no load to cause the problem.

3) I updated to BIOS 4.4, later to 4.5.
   Crashes were still present.
   But in this BIOS there is this feature:
   /advanced/amdcbs/zen-common-options/"power supply idle control"
   I changed this from "auto" to "typical" (as I have seen here in this list).
   Up to now, for about 3 weeks, no low-load crashes have appeared.
   BUT:

4) With BIOS 4.4 and 4.5 I ran into a problem under high load:
   with the modified "kill ryzen script", out of 16 loops I wrote 4 loops
   to nvme0 (because 32 GB of memory is too small for 16 loops)
   and 12 loops to the ramdrive,
   so there is also a heavy load on the bus.
   kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
   kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
   These errors occur every couple of seconds.
   With BIOS versions lower than 3.2 this was all fine; this error did not show up.

5) To solve problem 4), I added "pcie_aspm=off" to the kernel command line.
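
To see whether a change such as pcie_aspm=off actually reduces the AER flood from point 4, the "PCIe Bus Error" lines can be counted per device. A minimal sketch; the regular expression is written against the two log lines quoted above and is an assumption about the format, not a complete AER parser:

```python
import re
from collections import Counter

# Matches lines such as:
# pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, ...
AER_RE = re.compile(
    r"pcieport (?P<dev>[0-9a-f:.]+): PCIe Bus Error: "
    r"severity=(?P<sev>[^,]+), type=(?P<type>[^,]+)"
)


def summarize_aer(log_text):
    """Count PCIe Bus Error lines per (device, severity, type)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = AER_RE.search(line)
        if match:
            counts[(match["dev"], match["sev"], match["type"])] += 1
    return counts
```

Feeding it the kernel log from before and after the change (for example the text of kern.log) gives two counters that can be compared directly.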

Now it SEEMS that these systems are stable under low and high load :-)))
... it cost me nearly half a year to get to this point.

Thanks to all here who have reported and given me some hints.

simoN

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

FWIW, after 14 days of silence, I have heard back from "<email address hidden>".

When I last wrote I sent them the URL for this discussion.

"<email address hidden>" tells me:
__________________________________________________________________________
  Thank you for your patience, I understand that the idle hang freezing
  issue is frustrating.

  Thanks for sharing the Bugzilla link, I have glanced through the
  discussion thread and saw that users that have the Power Supply
  Control option available in the bios confirmed that changing the
  setting has resolved their problem.

  This option is the AMD solution and has been provided to motherboard
  vendor for validation and inclusion in a future BIOS release.

  According to comment 279, a poster with the same motherboard as you,
  has pointed to a discussion where it states that an updated BIOS will
  be made available for your motherboard which contains the Power Supply
  Control Option.
  https://www.hardwareluxx.de/community/f219/asus-prime-x370-pro-am4-1156996-289.html#post26231965

  Although I cannot confirm the accuracy of this post, you may want to
  check with ASUS directly.
__________________________________________________________________________

I did ask them what the "Power Supply Control" option actually does, and why.

I have asked again.

Given that the "Power Supply Control" option is "the AMD solution" I assume that AMD are the right people to ask ?

FWIW: ASUS PRIME X370-PRO BIOS 3805 has appeared, with the release note "Improve system performance" -- but that does not have a "Power Supply Control" setting (or at least not in the "Advanced/AMD CBS" menu).

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

For ASUS X370 Prime, beta BIOS 3907 has this option available. 3805 does not have it.
I think I will test 3907 in the next days, although I would prefer not to use beta BIOS.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've seen some reports from users that they no longer have the issue with the 4.15 kernel. I updated to 4.15 and have already survived two nights of idle usage at stock settings (no CPU overclock etc.). Of course this has sometimes happened in the past as well, so I have to wait a few more days to draw conclusions, but I wonder if anyone else has tested with 4.15.

Revision history for this message
In , excieve (excieve-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #283)
> Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've
> seen some reports from users that they no longer have the issue with 4.15
> kernel. I updated to 4.15 and already survived two nights of idle usage at
> stock settings (no cpu overclock etc). Of course this has happened in the
> past as well some times so I have to wait a few more days to make
> conclusions, but I wonder if anyone else tested with 4.15

For me it froze in a bit more than a week of mixed usage on all defaults on 4.15.x.

Revision history for this message
In , simon (simon-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #283)
> Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've

Hi, I am using openSUSE's 4.15.10-1-default kernel; see my comment 280 for the system configuration (stable).
How can I check whether rcuo threads are enabled or not?

Revision history for this message
In , simon (simon-linux-kernel-bugs) wrote :

(In reply to Simon from comment #285)
> (In reply to Panagiotis Malakoudis from comment #283)
> > Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've
>
> Hi, i using opensuse's 4.15.10-1-default kernel see my comment 280 for
> system configuration (stable).
> how to check if rcuo treads are enabled or not?

maybe this is the info you need?:

zcat /proc/config.gz | grep RCU

# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_HAVE_RCU_TABLE_FREE=y
# RCU Debugging
# CONFIG_PROVE_RCU is not set
CONFIG_RCU_PERF_TEST=m
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
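
Independently of the config listing, one way to answer the question is to look at the kernel threads that are actually running. A minimal sketch, assuming that RCU callback-offload threads have names beginning with "rcuo" (for example "rcuop/0" or "rcuos/3"); the helper names are illustrative:

```python
import os


def rcu_offload_threads(names):
    """Return the thread names that look like RCU callback-offload kthreads."""
    return sorted(name for name in names if name.startswith("rcuo"))


def running_thread_names(proc_root="/proc"):
    """Collect the command names of all processes visible under /proc."""
    names = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as handle:
                names.append(handle.read().strip())
        except OSError:
            # The process exited while we were scanning; skip it.
            continue
    return names
```

Calling rcu_offload_threads(running_thread_names()) on the machine in question returns an empty list when no offload threads exist, which would suggest that rcu_nocbs is not in effect there.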

Revision history for this message
In , ptheis (ptheis-linux-kernel-bugs) wrote :

I don't get it ... my Manjaro system was stable for 3 months, lately with 4.14 LTS kernels and constantly upgraded NVIDIA drivers. Now the system is unstable again, even when decoding HD video streams from the web ... I will give kernel 4.16 a try.

Revision history for this message
In , cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote :

We've now seen this idle lockup with a Ryzen Pro 1700X based system running current Fedora 27. As with my Fedora 27 system, it is avoided by using 'rcu_nocbs=0-15 processor.max_cstate=5'. The hardware is an Asus Prime Z370-Pro motherboard, EVGA SuperNova G3 650W PSU, 2x 8GB Kingston ECC RAM, and an nVidia graphics card.

(It's disappointing but not surprising that the Ryzen Pro doesn't seem to fix this.)

Revision history for this message
In , cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote :

Whoops, I need to correct my comment slightly, because Asus has so many motherboards: the Ryzen Pro is in an Asus Prime X370-Pro motherboard.

(There is no Asus 'Prime Z370-Pro', but there is an Asus 'Prime Z370-A' for current Intel CPUs so you can easily be confused just by changing a few letters.)

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

See also https://bugzilla.redhat.com/show_bug.cgi?id=1562530

I was thinking this was an amdgpu driver bug, but not sure now.

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Does the AMD Epyc embedded series have this bug?

Revision history for this message
In , robert (robert-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #291)
> Does the AMD Epyc embedded series have this bug?

I am running a dual standard Epyc (7351s - Zen Rev.2) server and I can say that it has not manifested itself in any way. The server was completely idle for one week waiting for software installations, and there were no errors (CentOS 7.4).
I cannot say anything about the Embedded version though. But I guess in reality it may depend on a "successful" marriage with its motherboard.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Hi.
I am testing it under Windows.
First, you should know that on Windows only the economy (power saver) mode makes use of core parking. Normal mode doesn't, nor does the Ryzen mode delivered with the chipset driver package, from what I see via the ParkControl app.

I'm now using core parking to try to reproduce it under this other OS. It has now been running for 4 hours in economy mode with core parking enabled, and still no crash.

I use several OSes, and I had a crash (freeze) just after I installed the chipset driver package, which touches the scheduler settings.
Since then, I have never had another freeze. I had to use the reset button to get out of the freeze.

From what I read in this thread, if I want to reproduce it on Linux, the best option is to try an updated Fedora 27.

I'll be back with other tests.

Revision history for this message
In , brauliobo (brauliobo-linux-kernel-bugs) wrote :

(In reply to JerryD from comment #290)
> See also https://bugzilla.redhat.com/show_bug.cgi?id=1562530
>
> I was thinking this was an amdgpu driver bug, but not sure now.

See my replies there, I think with Raven Ridge this is really related to amdgpu driver.

Revision history for this message
In , devurandom (devurandom-linux-kernel-bugs) wrote :

I am using an AMD Ryzen 5 2400G on an Asus ROG Strix B350-F (firmware version 3805 and 3803 before that) and was also experiencing freezes, e.g. after about one hour into `rsync -a /home ...`, with Gentoo Linux (Linux 4.15.10), Arch Linux 2018.03.1, Fedora 27, Fedora 28 nightly, Fedora 28 beta. Setting rcu_nocbs=0-7, processor.max_cstate=5 or intel_idle.max_cstate=5 did not seem to have an effect -- the freezes still occurred. The hardware itself (mainboard, CPU, RAM) was cross-checked by the supplier, who found no fault in it. Since I disabled CPU C-states in the mainboard firmware about a day ago, I am no longer able to reproduce the freeze.

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

(In reply to Dennis Schridde from comment #295)
> I am using an AMD Ryzen 5 2400G on an Asus ROG Strix B350-F (firmware
> version 3805 and 3803 before that) and was also experiencing freezes, e.g.
> after about one hour into `rsync -a /home ...`, with Gentoo Linux (Linux
> 4.15.10), Arch Linux 2018.03.1, Fedora 27, Fedora 28 nightly, Fedora 28
> beta. Setting rcu_nocbs=0-7, processor.max_cstate=5 or
> intel_idle.max_cstate=5 did not seem to have an effect -- the freezes still
> occurred. The hardware itself (mainboard, CPU, RAM) was cross-checked by
> the supplier, who found no fault in it. Since I disabled CPU C-states in
> the mainboard firmware about a day ago, I am no longer able to reproduce the
> freeze.

Ryzen 2xxx are almost the same as Ryzen 1xxx but with lower TDP, meaning lower CPU frequency and thus lower power.
And since all available solutions and workarounds are about not letting the chipset give less power to the system (CPU, memory...) or about explicitly giving more power to the system, it is not surprising (in my opinion) that the bug is still here.
What will be more interesting is to see if the bug will still be here for the next generation of Ryzen...

Also, since it is now crystal clear that this bug is not a software bug (thus not a kernel bug), but clearly a bug from AMD (most probably in the full design of Ryzen, both CPU and chipsets), let me suggest that you all consider again tweaking your motherboard voltage. Just give it a little more and you'll have a fully working system. The CPU will still be able to lower its power, to scale its frequency... This is totally a win-win.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

I flashed my motherboard (ASUS Prime X370 Pro) with the new 4008 BIOS that finally offers the "Power Supply Control" option. I set it to typical idle, disabled my CPU overclock that made the problem less frequent, and am now waiting to see if it will freeze.

What is already obvious though is that the CPU voltage never goes below 0.8 V. Without the Power Supply Control option set to Typical, the voltage goes down to 0.39 V.

XFR and single core turbo work fine even if the voltage is not dropping to 0.39 V.

Will report back in a few days.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

I do the same on my side.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #297)
> I flashed my motherboard (ASUS Prime X370 Pro) with the new 4008 BIOS that
> finally offers the "Power Supply Control" option. I set it to typical idle,
> disabled my CPU overclock that made the problem less frequent and now
> waiting to see if it will freeze.
>
> What is already obvious though is that CPU power never goes under 0.8V.
> Without the Power Supply Control option set to Typical, voltage goes down to
> 0.39V
>
> XFR and single core turbo works fine even if voltage is not dropping to 0.39V
>
> Will report back in a few days.

How do you monitor XFR/single core turbo under Linux?
Thank you

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

I open a terminal and run as root:
watch -n 0.5 cpupower monitor

cpupower is in package linux-cpupower in Debian 9 that I am using.
apt install linux-cpupower

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

And if you want to test a single core benchmark pinned to a specific core to check what frequency can be reached, it is easy with following command:
taskset -c 10 openssl speed rsa2048

On my 1700X, cpupower monitor shows frequency up to 3880 for core 10 in the above benchmark.

If you want to stress two cores, you can either run two commands, or:
taskset -c 6,10 openssl speed -multi 2 rsa2048

Two cores go up to 3800 on my 1700X. Three and above go only to 3500. Two cores+multithread (taskset -c 6,7,10,11 for example) go up to 3650. This is the point where the new ryzen 2XXX are better, they can use higher turbo when using >2 cores. It seems they can do 4000 MHz up to 4 cores/8 threads with XFR2.
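
As a rough cross-check of the per-core clocks while such a benchmark runs, the "cpu MHz" fields of /proc/cpuinfo can be read as well. A minimal sketch; the function name is illustrative, and the values may differ from what cpupower monitor reports, since the two are not necessarily derived from the same counters:

```python
def core_frequencies(cpuinfo_text):
    """Map logical CPU number to the "cpu MHz" value from /proc/cpuinfo text."""
    freqs = {}
    cpu = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        if key == "processor":
            cpu = int(value)
        elif key == "cpu MHz" and cpu is not None:
            freqs[cpu] = float(value)
    return freqs


def busiest(freqs, count=2):
    """Return the `count` logical CPUs with the highest reported clock."""
    return sorted(freqs, key=freqs.get, reverse=True)[:count]
```

Passing it the contents of /proc/cpuinfo while "taskset -c 10 openssl speed rsa2048" is running should show logical CPU 10 at the top of busiest().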

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #301)
> And if you want to test a single core benchmark pinned to a specific core to
> check what frequency can be reached, it is easy with following command:
> taskset -c 10 openssl speed rsa2048
>
> On my 1700X, cpupower monitor shows frequency up to 3880 for core 10 in the
> above benchmark.
>
> If you want to stress two cores, you can either run two commands, or:
> taskset -c 6,10 openssl speed -multi 2 rsa2048
>
> Two cores go up to 3800 on my 1700X. Three and above go only to 3500. Two
> cores+multithread (taskset -c 6,7,10,11 for example) go up to 3650. This is
> the point where the new ryzen 2XXX are better, they can use higher turbo
> when using >2 cores. It seems they can do 4000 MHz up to 4 cores/8 threads
> with XFR2.
No problem with that; I use -mmt1, 2, 3, 4, etc. for it with 7za b.
And as you said, XFR and turbo are working.
Now I see that my performance under Windows is better; I think it's because the timing for changing frequencies is shorter on Windows. I'm going to tune it a bit to see; on Windows, better timing improves performance by ~10% IIRC.

Yes, they can:
between 4.05 and 4.075 GHz over eight cores.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

So far,
[philectro@linux ~]$ uptime
 21:32:34 up 7:14

The "Typical Current" setting seems to fix the problem.
The zenstates script says my C6 is enabled, so I don't know what this parameter does.

Revision history for this message
In , phillips (phillips-linux-kernel-bugs) wrote :

For my Ryzen 1700 system, setting bios "Power Supply Idle Control" to "Typical Current" increases power consumption at idle by approximately 4 watts, from 39 watts to 43. Still impressively low, but I hope that AMD will give me my 4 watts back eventually. Uptime is currently 2 days, but has been as high as 6 weeks in the past before locking up, so I cannot yet report that my stability issue is resolved.

As others have reported, the "Typical Current" option disables package C6 state, but not core C6 state. It is high time for AMD to make an official statement on this issue.

Motherboard: GA-AB360-GAMING, stock voltages and clocks
Processor: Ryzen 7 1700, week 38
Power supply: EVGA 650 GQ
Video: Radeon 6450, radeon driver
Kernel: various, including 4.15.0
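
For scale, the 39 W and 43 W idle figures above can be turned into a relative increase and a yearly energy figure. A minimal sketch of the arithmetic, assuming (hypothetically) that the machine sits idle 24 hours a day all year:

```python
def idle_power_cost(before_w, after_w, hours=24 * 365):
    """Extra idle power as watts, percent, and kWh over the given hours."""
    delta = after_w - before_w
    return {
        "delta_w": delta,
        "percent": 100 * delta / before_w,
        "kwh_per_year": delta * hours / 1000,
    }


result = idle_power_cost(39, 43)
print(result)
```

With these inputs the increase is 4 W, a little over 10% of the idle draw, or roughly 35 kWh per year under the always-idle assumption.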

Revision history for this message
In , phillips (phillips-linux-kernel-bugs) wrote :

Correction: GA-AB350-GAMING

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Package C6 state is not disabled when "Power Supply Idle Control" is set to "Typical Current" on the Asus Prime X370 with the 4008 BIOS. Also, disabling package C6 state with zenstates.py didn't have any effect on previous BIOSes.

I don't care about the 4 watts, but I would definitely like a technical explanation of the issue.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Same for me.
I confirm that on my X370-Pro "typical current" doesn't disable C6 at all.

As for the script, it disabled both C6 states on my motherboard with UEFI version 3805.
What do you mean by "didn't have any effect"?
My system was stable when I disabled C6.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

If I only disabled package C6, I still had idle freezes. Completely disabling C6 was not an option, since you lose XFR and turbo speeds above 3500 (on my 1700X).

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

I contacted AMD support about my Ryzen 1600 and asked for an explanation. I got an answer but so far no explanation.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

Today, I tested bios 4008 (Prime X370-PRO / previous was 3404). The update wasn't any problem. I measured power consumption with enabled "typical current" and couldn't find any difference to the standard adjustment (and to my current overclocking solution w/ bios 3404).

Sensors showed with enabled "typical current", that minimal voltage is about 0.8 V - whereas standard uses between 0.4 V and 0.8 V. C6 (package) states are enabled.

Nevertheless I went back to 3404 (and overclocking) via dos/afudos because of the completely broken fan regulation (CPU fan min duty cycle is 60% - this is completely unusable) - I didn't want to use the modded bios, because I'm not sure how to install it.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Now that my freezes are solved, I got this kind of flood on 4.15.17 today (and maybe yesterday, as my SMB shares were oddly slow):

pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000b(Transmitter ID)
I rebooted several times to try different kernel versions,
4.16.4 and 4.15.16.
The same error appeared.

I shut down the computer, cut the PSU completely with the switch, and waited.
I pushed the Radeon 7970 a bit into the slot in case it had moved.
After powering it back on, the error doesn't appear anymore.
I tried to reproduce it with 4.15.17 but I can't.

I have seen several users on level1 with Threadripper, and some with Ryzen, who had this error.

Revision history for this message
In , malakudi (malakudi-linux-kernel-bugs) wrote :

Unfortunately I have to report that the idle freeze issue hasn't been fixed for me with new BIOS and the "Typical Power idle" option. My system froze 2 times since 19/4 when I installed the new BIOS.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #312)
> Unfortunately I have to report that the idle freeze issue hasn't been fixed
> for me with new BIOS and the "Typical Power idle" option.

What's your kernel version?

I'm testing at the moment w/ Bios 4008 / default configuration - means w/o using "Typical Power idle" option and Linux 4.16.6. I didn't see any problem so far - but this isn't a final result until now - it may change anytime.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

Forgot to mention, that I applied this patch(1) to 4.16.6, which is now part of 4.16.7. See comments(2).

BTW, I couldn't measure any power consumption change with this patch.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86-urgent-for-linus&id=da6fa7ef67f07108a1b0cb9fd9e7fcaabd39c051&utm_source=anz

(2) https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

Has anyone tried a Ryzen 2000 series / x470 mainboard yet?
no reason to wait for a TR 2000 series apparently, since it won't be using Zen+ cores.

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

(In reply to Jonathan from comment #315)
> Has anyone tried a Ryzen 2000 series / x470 mainboard yet?
> no reason to wait for a TR 2000 series apparently, since it won't be using
> Zen+ cores.

I literally just came here to ask the same question, curious if they've actually fixed this properly.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Panagiotis Malakoudis from comment #312)
> Unfortunately I have to report that the idle freeze issue hasn't been fixed
> for me with new BIOS and the "Typical Power idle" option. My system froze 2
> times since 19/4 when I installed the new BIOS.

Hmm, time to talk with AMD and RMA, because it does fix it for me.
But I still consider it a dirty workaround to hide the fact that the CPUs are defective.

Revision history for this message
In , nethershaw (nethershaw-linux-kernel-bugs) wrote :

To answer the question re: the new Ryzen CPUs:

I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board, and I can positively confirm that the random soft-lockups are very much still a thing. Nothing appears in dmesg, and nothing appears in the system journal.

My kernel is 4.16.6 (gentoo-sources patchset) and so far, the only troubleshooting I've done before landing on this bug report is to set 'processor.max_cstate=5' in my kernel boot options. The machine soft-locked about two hours later whilst I was using it (a few minutes ago).
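
For anyone who wants to confirm that a boot option such as `processor.max_cstate=5` or `rcu_nocbs=0-7` actually reached the kernel, here is a minimal sketch. The helper name `cmdline_param` is made up for illustration; the only system detail assumed is that the running kernel's command line is readable from `/proc/cmdline`.

```shell
#!/bin/sh
# Hypothetical helper: print the value of kernel parameter $2 as found in
# the command-line string $1. Prints nothing and returns 1 if it is absent.
cmdline_param() {
    for word in $1; do
        case "$word" in
            "$2"=*) printf '%s\n' "${word#*=}"; return 0 ;;
        esac
    done
    return 1
}

# Usage on a live system:
#   cmdline_param "$(cat /proc/cmdline)" processor.max_cstate
```

If the function prints nothing, the bootloader configuration was probably not regenerated or the option was misspelled, so the workaround was never in effect.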

I'll go back through the comments and suggestions to attempt the deeper workarounds, including kernel 4.16.7, but I thought this information might be valuable to someone first.

If anyone has anything they'd like me to test in particular, this is a Gentoo system and I am more than happy to subject it to trials in service of finding a solution to this bug.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Haven't you tested the "typical current" option in the UEFI?
It's the official workaround.

If you still have freezes with the B2 stepping, ouch.
I thought they had at least fixed it there...

You mean it freezes completely and you need to press reset?

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

(In reply to Matthew Vaughn from comment #318)
> To answer the question re: the new Ryzen CPUs:
>
> I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board,
> and I can positively confirm that the random soft-lockups are very much
> still a thing.

This is truly appalling. Thanks for the information.

Revision history for this message
In , nethershaw (nethershaw-linux-kernel-bugs) wrote :

(In reply to Michaël Colignon from comment #319)
> Haven't you tested the "typical current" option in the UEFI?
> It's the official workaround.

I only found this bug report in the last hour, so I hadn't yet switched on the 'Typical Current Idle' option in UEFI. I just switched that on during my upgrade to kernel 4.16.7 a moment ago. I'll return here and report my results.

> You mean it freezes completely and you need to press reset?

That's correct; up until now, it would freeze completely and I needed to press reset to reboot the machine. This had happened several times in the last 24 hours before I started looking around for reports of the issue and landed here. It's a brand-new installation.

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01. This is over ONE YEAR since the first release of Ryzen CPUs.

Can ya'all verify for me that this actually works - long term - and that I can remove both the rcu_nocbs=0-7 kernel boot option and my custom disable-c6-on-boot.service & disable-c6-on-suspend.service (which disables both C6 package and core)?

Or do I need to keep either rcu_nocbs or zenstates --c6-disable?

I would very much like to know if Typical Idle would be enough. I realize I could simply test this but I don't want to experiment and have random hangs. I'm on kernel 4.17rc3 btw.

If Typical Idle (I also see an "auto" option) actually fixes this then it's still a TOTAL SCANDAL since it's been a year now. But hey, better late than never.

Revision history for this message
In , nethershaw (nethershaw-linux-kernel-bugs) wrote :

I'm testing with just the 'Typical Current Idle' option and no other countermeasures enabled right now. I'll give it 24 hours before I'm convinced, given the frequency of previously observed lockups. So far, there have been none since my last comment, which is longer than any previous uptime interval I've had on this hardware since installing it.

For what it's worth, the 'auto' option for the current idle setting is no good. That's the default, and was the setting in effect when I was observing lockups.

Revision history for this message
In , it (it-linux-kernel-bugs) wrote :

(In reply to oyvinds from comment #322)
> Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option
> in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01.
> This is over ONE YEAR since the first release of Ryzen CPUs.
>
> Can ya'all verify for me that this actually works - long term - and that I
> can remove both the rcu_nocbs=0-7 kernel boot option and my custom
> disable-c6-on-boot.service & disable-c6-on-suspend.service (which disables
> both C6 package and core)?
>
> Or do I need to keep either rcu_nocbs or zenstates --c6-disable?
>
> I would very much like to know if Typical Idle would be enough. I realize I
> could simply test this but I don't want to experiment and have random hangs.
> I'm on kernel 4.17rc3 btw.

I haven't experienced any lockups for more than 2 months. Only Typical Idle enabled, nothing else changed.
Running Ubuntu 17.10 stock kernel 4.13.0-39-generic

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to oyvinds from comment #322)
> Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option
> in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01.
> This is over ONE YEAR since the first release of Ryzen CPUs.

I have this board and it was available in F22 over a month ago. Still not great but I honestly don't think AMD were even aware of this issue before last August or so. The other segfault issue was getting far more attention and many wrongly assumed the freezes were related.

> I would very much like to know if Typical Idle would be enough.

It is enough. I have disabled all other workarounds and I've maybe had one freeze since though I can't remember for sure and that may well have been down to something else.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

(In reply to Matthew Vaughn from comment #318)
> I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board,
> and I can positively confirm that the random soft-lockups are very much
> still a thing. Nothing appears in dmesg, and nothing appears in the system
> journal.

Very disappointing. AMD really isn't taking this seriously. Since power consumption is really important for me, and my PC is often idling (so either live with freezes or accept even higher power consumption), this really means no AMD. Especially since their Threadripper next gen is supposed to use an older Epyc core, not even Zen+. And since they're not taking this seriously, my guess is Zen 2 next year will still be buggy. Way to go AMD.

And with Intel just getting hit with another Spectre round, I guess their fixes will make their CPUs even slower than my old unpatched Sandy Bridge.

*grumble*

Oh well, this is one way to get upgraditis cured I guess.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

I lost interest in my Ryzen 7 1800X machine... but this morning I got round to upgrading the BIOS on the ASUS Prime X370-Pro (to 4011, hot off the press: "Update AGESA 1.0.0.2a + SMU 43.18" whatever that means).

I was running with rcu_nocbs and zenstates --c6-package-disable. That ran for 11 days and then froze.

I last heard from "<email address hidden>":

  Thank you for the update and confirming that your BeQuite Straight Power 11
  supports 0A minimum load.

  Because your system still freezes, it could be due to cross loading problems
  which can result in the power supply turning off when a load changes or
  result in voltages becoming out of specification causing system crashes and
  hangs.

  What the Power Supply Idle Control option does is disables the lowest power
  state for the CPU. This is where ALL cores are not in operation and the
  entire CCX or Core complex is taken down.

  There are many levels of power states that a core can be in from C1 to C6,
  CC6 and finally PC6. The Power Supply Idle Control option is designed to
  keep enough current on the rail so that power supply does not go out of
  regulation.

  The Power Supply Idle Control option is part of an AGESA update from AMD
  provided to the motherboard vendors for validation and implementation in
  their BIOS updates. However, it is motherboard vendors decision as this
  which BIOS version will contain the Power Supply Idle Control option.

So now I have options: "low current idle", "typical current idle" and "auto". Neither AMD nor ASUS seem to think it necessary to document what those mean.

I have set "typical current idle". I note that zenstates shows that both C6 States Package and Core are Enabled.

I guess I am back to waiting and seeing.

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #327)
--- snip ---
> So now I have options: "low current idle", "typical current idle" and
> "auto". Neither AMD nor ASUS seem to think it necessary to document what
> those mean.
>
> I have set "typical current idle". I note that zenstates shows that both C6
> States Package and Core are Enabled.
>
> I guess I am back to waiting and seeing.

So have you tried the "auto" mode, or "low current idle"? It can't hurt to try them and just let the machine run doing something.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Auto must make use of low current idle, the setting which causes the freezes.
I can say that auto gives freezes for me, as it does for Matthew.

Revision history for this message
In , nethershaw (nethershaw-linux-kernel-bugs) wrote :

24 hours of uptime without a lockup after only setting "typical current idle." Considering my rig had been locking up after very short uptimes without this setting, I'd say that's a significant difference.

Revision history for this message
In , tcl_de (tclde-linux-kernel-bugs) wrote :

(In reply to Chris Hall from comment #327)

--snip--
> I last heard from "<email address hidden>":
>
> Thank you for the update and confirming that your BeQuite Straight Power 11
> supports 0A minimum load.
>
> Because your system still freezes, it could be due to cross loading
> problems
> which can result in the power supply turning off when a load changes or
> result in voltages becoming out of specification causing system crashes and
> hangs.
>
> What the Power Supply Idle Control option does is disables the lowest power
> state for the CPU. This is where ALL cores are not in operation and the
> entire CCX or Core complex is taken down.
>
> There are many levels of power states that a core can be in from C1 to C6,
> CC6 and finally PC6. The Power Supply Idle Control option is designed to
> keep enough current on the rail so that power supply does not go out of
> regulation.
>
> The Power Supply Idle Control option is part of an AGESA update from AMD
> provided to the motherboard vendors for validation and implementation in
> their BIOS updates. However, it is motherboard vendors decision as this
> which BIOS version will contain the Power Supply Idle Control option.
>
--snip--
Actually there should not be any cross loading problems by design here as this PSU uses DC/DC converters to regulate +5V and +3.3V independently of +12V.

I suspect that the problem might rather be related to the voltage regulation on the mainboard.
It's good to know however that the Power Supply Idle Control option is part of an AGESA update.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

I suggest we all fill in a spreadsheet on the web with our hardware, PSU included, to get the final word on this story.
I'll let you decide the right tool for it. I can host it if needed.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #313)

> I'm testing at the moment w/ Bios 4008 / default configuration - means w/o
> using "Typical Power idle" option and Linux 4.16.6. I didn't see any problem
> so far - but this isn't a final result until now - it may change anytime.

=> Got a hang yesterday.
watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [worker:11105]

Went back again to BIOS 3404 / Linux 4.14.x and overclocking which is known to be stable for me.
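
To check after a forced reset whether the previous boot logged such a message at all, something like the following could be used. It is only a sketch: `count_stalls` is a made-up helper name, and the two patterns are simply the messages quoted in this report (the soft lockup line above and the `rcu_sched detected stalls` line from the original description). Reading the previous boot with `journalctl -k -b -1` assumes the journal is stored persistently.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines on stdin that report a soft
# lockup or an RCU stall. Always exits 0 so that "0 matches" is not an error.
count_stalls() {
    grep -c -E 'soft lockup|rcu_sched detected stalls' || true
}

# Usage on a live system (kernel messages of the previous boot):
#   journalctl -k -b -1 | count_stalls
```

A count of 0 does not prove the freeze was unrelated; several reporters here saw nothing in dmesg or the journal at all.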

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Someone needs to make Tomshardware, HardOCP, Anandtech aware of this thread.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

I think that is a great idea. I have time to waste at airports soon, so can draft an email message.

Recipients:
Tomshardware, HardOCP, Anandtech, Ars Technica, Tweakers.net, Phoronix, ... anyone else?

Contents:
- Total system freezes on full idle
- Seemingly/definitely(?) a hardware bug
- AMD has been contacted multiple times but refuses to acknowledge the issue
- AMD first blamed PSUs
- AMD then pushed mobo manufacturers to add an undocumented BIOS option, which in its default setting does not solve the problem and with other settings doesn't always solve the problem
- Cause is still unknown
- Issue is also present in Ryzen 2xxx CPUs

Other suggestions welcome before I start writing it up!

Revision history for this message
In , simon (simon-linux-kernel-bugs) wrote :

I would suggest changing one of your lines:

> - AMD then pushed mobo manufacturers to add an undocumented BIOS option,
> which in its default setting does not solve the problem and with other
> settings MOST TIMES (NOT ALWAYS) solve the problem

and adding one sentence:
Here is the setting: /advanced/amdcbs/zen-common-options/"power supply idle control"
Change this from "auto" to "typical".

simoN

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

(In reply to kernel from comment #335)
> I think that is a great idea. I have time to waste at airports soon, so can
> draft an email message.
>

Just to tell part of my story: I bought the well-known Asus laptop with a Ryzen CPU. It exhibited the segfault bug. But before I could be sure about that, it burnt out just two weeks after I bought it... I sent it to Asus, hoping for a new laptop. Asus officially told me that they replaced the motherboard and the keyboard (they did not say whether the CPU was replaced, and I was not able to check this).
The repaired laptop no longer exhibited the segfault bug (same kernel version). But it exhibited the soft-lock bug. I didn't know it was this bug at that time. So I returned the laptop to Asus again, asking for a refund. Asus repaired it again by changing the motherboard again. The bug was still present. (For those interested: I finally did get a refund.)

Then I decided to do as usual: build my own desktop. And I faced the soft-lock bug until I tweaked the voltages...

To go back to the story: if we can trust Asus, a motherboard change made the segfault bug disappear (AMD officially 'replaced' such 'faulty' CPUs) but made the soft-lock bug appear. A second change of the motherboard did not improve anything.

What I guess (since this cannot really be verified) is that AMD does not know where all these bugs come from. And AMD does not know how to resolve them. AMD does not even know if this can be solved.

What we do know is that the person responsible for the Ryzen architecture was hired back by AMD (after having moved to Intel or Apple) but he left AMD 3 years ago (so before Ryzen came out). He moved to Tesla and is now working at Intel again.
Link is here: https://en.wikipedia.org/wiki/Jim_Keller_(engineer)

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

And before my last paragraph gets misinterpreted: I just meant that this is the person to contact if we want some concrete information.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

I've sent the aforementioned outlets an email. For completeness, here's the entire thing:

Recipients:
Ars Technica: https://arstechnica.wufoo.com/forms/z7p8x7/
Tomshardware: http://www.purch.com/about/#contact-general
HardOCP: <email address hidden>
Anandtech: http://www.purch.com/about/#contact-general
Tweakers.net: <email address hidden>
Phoronix: https://www.phoronix-media.com/?

Subject: Publication of an AMD Ryzen hardware issue

Dear editor,

Since their introduction the AMD Ryzen processors have been plagued by several issues, most notably the segfault issue that occurred under high (compilation) loads. To that particular issue AMD has responded by replacing affected chips.

However, there is another significant issue that affects both Ryzen 1xxx and Threadripper CPUs, as well as the newer Ryzen 2xxx processors. It appears Epyc is not affected (although sample size is one in this case).

The most complete storyline on this issue can be found in the link below [1], however for an overview the rest of my email attempts to summarise the issue.

This issue results in a complete system freeze, occurring under full idle conditions, and requires a hard reset. Evidence suggests this is a hardware problem, since several workarounds have been found that mitigate/solve the issue, such as disabling C6 entirely (inefficient), overclocking and tweaking voltages (not for everyone), or running processes that keep the CPU active at all times (again, inefficient and pointless).

AMD has been contacted multiple times but refuses to acknowledge the issue. At some point in one reply AMD blamed users' PSUs; this is obviously nonsensical as the issue occurs on a wide variety of PSUs including brand new models, the only constant is the Ryzen platform.

In response, AMD has pushed motherboard manufacturers to add an otherwise undocumented BIOS option, "Power Supply Idle Control", as part of their AGESA update. However, in its default value this setting does *not* solve the problem, and with other settings doesn't *always* solve the problem. To be precise, this setting needs to be changed from "auto" to "typical":
/advanced/amdcbs/zen-common-options/"power supply idle control"
However it is unknown what this option actually does.

The issue is also present in the laptop platforms. In one user's case, he sent his laptop back to Asus repeatedly for motherboard replacements, but the issue remains.

The root cause of this issue is still unknown, and moreover, the issue is still present in the latest Ryzen 2xxx CPUs. AMD's refusal to acknowledge, let alone resolve, the issue has led to this email, in the hope that media attention to this problem will inspire AMD to take action. Hence, I propose that your outlet publish an article about the issue.

Kind regards,
<email address hidden>

[1] https://bugzilla.kernel.org/show_bug.cgi?id=196683

Revision history for this message
Mario Emmenlauer (emmenlau) wrote :

Is it known if 18.04 has the patches active to support ryzen out of the box? I have upgraded to 18.04 recently and the system has just hung for the first time in many months. I wonder if this is by chance or if other people experienced the same?

Revision history for this message
Antoine Pitrou (pitrou) wrote :

According at least to http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen, 18.04 is supposed to be ok. I haven't upgraded yet myself.

Revision history for this message
Wes Deviers (wdeviers) wrote :

These symptoms definitely still happen to me on 18.04 w/ 4.15.0-22-generic

Revision history for this message
Tom Reynolds (tomreyn) wrote :

What this very bug report is about is a little blurry since the focus changed into different directions over time.

It is my understanding that https://bugzilla.kernel.org/show_bug.cgi?id=196683 is a hardware (design) bug, which is not going to be fixed (and could not be fixed or properly mitigated) in Linux. The only way to mitigate it in Linux would probably be to constantly burn CPU cycles (as Windows does, which appears to be why it does not get to trigger there). And kernel developers would rightfully reject this approach to work around hardware bugs.

Revision history for this message
Wes Deviers (wdeviers) wrote :

I compiled (from a make oldconfig) 4.13.16 w/ CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15 as a boot parameter. I've been up almost 60 hours, which is a first since buying this machine.

Revision history for this message
Sean McG (gseanmcg) wrote :

I haven't noticed any of the issues identified in this ticket since I upgraded to 18.04 LTS.

The stock 4.15.x kernel doesn't have RCU_NOCBS enabled from what I can tell.
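
One way to check this rather than guess is to look the option up in the installed kernel config; the option's full name is CONFIG_RCU_NOCB_CPU, as used in the comment above. The following is a minimal sketch: `config_enabled` is a made-up helper name, and it assumes the distribution installs the config as `/boot/config-$(uname -r)`, as Ubuntu does.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if kernel config file $1 sets option $2 to "y".
# Disabled options appear as "# CONFIG_FOO is not set" and do not match.
config_enabled() {
    grep -q "^$2=y\$" "$1"
}

# Usage on a live system:
#   config_enabled "/boot/config-$(uname -r)" CONFIG_RCU_NOCB_CPU && echo enabled
```

Without that option built in, the rcu_nocbs= boot parameter has no effect, so the check is worth doing before relying on it.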

Revision history for this message
Tom Reynolds (tomreyn) wrote :

The microcode updates which AMD had announced (https://www.amd.com/en/corporate/security-updates) would be available to AMD customers "by downloading BIOS updates provided by PC and server manufacturers and motherboard providers" are now actually available as separate microcode packages which can be loaded by the OS during early boot.

For now, the microcode has not undergone sufficient testing on Ubuntu's end so it is only made available on the ubuntu-security-proposed PPA (so far, this will change if it looks fine).

A changelog for the upstream (Debian) package is available at https://launchpad.net/debian/+source/amd64-microcode/+changelog - according to a member of the Ubuntu security team it primarily (not sure if exclusively?) targets family 17h (Zen, so Ryzen, Epyc, Threadripper) processors.

Note that Ubuntu 14.04 systems running Linux 3.13 require a kernel patch for this microcode to work properly.

I do not know whether or not this microcode update provides mitigations for the bug discussed here (it might have helped for AMD to state so), but it is probably worth a try if your mainboard vendor has not provided a recent enough BIOS upgrade (and your system remains unstable) or the latest BIOS update you have just doesn't mitigate Spectre v2 in hardware, yet.
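
To see whether a newly installed microcode package was actually applied, the revision reported by the kernel can be compared before and after the update. A minimal sketch, assuming the usual `microcode` field in `/proc/cpuinfo` on x86; `microcode_rev` is a made-up helper name:

```shell
#!/bin/sh
# Hypothetical helper: print the microcode revision of the first CPU listed
# in a cpuinfo-style file (defaults to /proc/cpuinfo).
microcode_rev() {
    awk -F': *' '/^microcode/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Usage on a live system: note the value, install the package, reboot, compare.
#   microcode_rev
```

If the value is unchanged after a reboot, either the package carries no newer revision for this CPU or the BIOS already ships the same one.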

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

Sadly I haven't received anything back apart from the following from HardOCP:

"We touched on this issue in one of our recent reviews and found the issue to be with the Power settings implemented by Microsoft. The bug has been reported to MS. If you cycle through one of two of the other power settings, the issue will go away from our experience."

To which I pointed out that this issue occurs under Linux in particular (due to full idle). Haven't heard back yet.

Does anyone here have direct contacts with any of the major outlets?

Revision history for this message
In , devurandom (devurandom-linux-kernel-bugs) wrote :

(In reply to kernel from comment #340)
> Sadly I haven't received anything back apart from the following from HardOCP:

I got a reply from c't / Heise. They say that during their experiments with Linux on Ryzen they could not find any evidence of this bug so far.

Revision history for this message
In , KuroNeko (kuroneko-linux-kernel-bugs) wrote :

So basically, the press is not interested. Big surprise. Anyway, c't reported today on the new 4.17 kernel, with the following about low power states on AMD Ryzen. Anyone care to see if this has an effect on this bug?

"Auch Systeme mit AMDs aktuellen Prozessoren dürften mit 4.17 etwas sparsamer laufen. Das ist einer kleinen Änderung am Code zu verdanken, der den Prozessor im Leerlauf schlafen schickt. Er nutzte auf AMD-CPUs bislang den MWAIT-Aufruf, durch den AMDs aktuelle Prozessoren aber lediglich in den Schlafmodus C1 wechseln; jetzt verwendet der Kernel CPUIDLE oder HALT, durch die Ryzen, Epyc & Co. auch in tiefere und daher effizientere Schlafzustände wechseln."

Translation:
Systems with AMD's current processors should also run somewhat more economically with 4.17. This is thanks to a small change to the code that puts the processor to sleep when idle. On AMD CPUs it previously used the MWAIT call, which only puts AMD's current processors into the C1 sleep state; now the kernel uses CPUIDLE or HALT, through which Ryzen, Epyc & Co. also enter deeper and therefore more efficient sleep states.

Revision history for this message
In , dev (dev-linux-kernel-bugs) wrote :

(In reply to Jonathan from comment #342)
> Anyone care to see if this has an effect on this bug?

Better save your time: on my Ryzen 2500U, neither kernel 4.15 (Ubuntu's), nor 4.16 (Fedora's), nor 4.17rc6+7 (mainline) had an effect on this.

Ubuntu & mainline kernels needed "pci=noacpi" or "acpi=noirq" to even boot.
Fedora's just needed "nomodeset", if you wanted to boot with XFCE, and nothing at all with Gnome.

Doing a backup with fsarchiver
```
fsarchiver savefs -o -j8 -A -a -v -Z3 $BackupfileDestination $Partition
```
__always led to a freeze within 1 - 5 runs__. Fsarchiver causes a lot of up- and downclocking (and boosting above base clock), while it produces only partial CPU load.
Full idle never caused a freeze, and neither did full load (handbrake, kill-ryzen.sh).

On kernel 4.13, the fsarchiver runs never caused a freeze (max CPU clock 2 GHz).
Also, setting scaling_governor to powersave on all the other tested kernels prevented the freezes (the CPU clock was locked to 1.6 GHz in that case).
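
For reference, switching the governor on every core can be done through the standard cpufreq sysfs files. This is only a sketch: `set_governor` is a made-up helper name, the optional second argument exists purely so the function can be tried against a scratch directory, and on a real system it has to run as root.

```shell
#!/bin/sh
# Hypothetical helper: write governor $1 to every scaling_governor file below
# the sysfs CPU directory $2 (defaults to /sys/devices/system/cpu).
set_governor() {
    for f in "${2:-/sys/devices/system/cpu}"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue   # skip when the glob matched nothing
        printf '%s\n' "$1" > "$f"
    done
}

# Usage on a live system (as root):
#   set_governor powersave
```

The setting does not survive a reboot, so it would have to be reapplied at boot if it is used as a workaround.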

Things I tried unsuccessfully to get rid of the freezes:
*blacklisting amdgpu
*kernel @ CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-7
*rmmod ath9k (rmmod ath10k_pci ath10k_core)
*30.5.2018 Ubuntu's microcode update for amd64
*disabled ASLR
*new mesa (18.0rc5 > 18.1.1. > 18.2 dev)
*c6-disabled via service
via grub:
*acpi=strict
*libata.noacpi
*amdgpu.dpm=0
*.dc=0
*.audio=0
*.bapm=0
*noibrs noibpb nopti
*libata.force=noncq
*pcie_aspm=off & the others
*amd_iommu=off & the other options (including iommu + options)
*pci=nocrs
*nomodeset
*all acpi= parameters
*processor.max_cstate=5

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Freihut from comment #343)
>
> Better save your time: on my Ryzen 2500U, neither kernel 4.15 (Ubuntu's),
> nor 4.16 (Fedora's), nor 4.17rc6+7 (mainline) had an effect on this.

My earlier posts above relate to my desktop Ryzen 5 1600X but I've also been seeing daily freezes with a 2700U-powered laptop. I don't believe this is the same issue though as ZenStates.py did not help. I think it's more likely to be graphics-related, especially given what another user with the same hardware reported in bug #199653. Maybe your situation is different though.

Revision history for this message
In , zao (zao-linux-kernel-bugs) wrote :

(In reply to Freihut from comment #343)
> Doing a backup with fsarchiver
> ```fsarchiver savefs -o -j8 -A -a -v -Z3 $BackupfileDestination
> $Partition```
> __always led to a freeze within 1 - 5 runs__. Fsarchiver causes a lot of
> up- and downclocking (and boosting above base clock), while it produces only
> partial CPU load.
> Full idle never caused a freeze, and neither did full load (handbrake,
> kill-ryzen.sh).

This is starting to sound very similar to my total system freeze with ZFS on FreeBSD on all three of my Ryzen 1700 chips on two different ASUS motherboards (B350M-A and X370-PRIME).

The underlying storage I had there was a pile of four HDDs, or a SSD, which I made a snapshot of and then:

  while true; do zfs send -R stuff@foo | pipemeter | cat >/dev/null; done

This would lock the system up after about 2-7 terabytes of HDD data, if memory serves me right. It would not happen on USB HDDs, but happened regardless of whether the disks were attached onboard or to a disk controller. I've also managed to reproduce it by receiving a stream over the network and writing it out to said disks, but this one is easier to run as one doesn't need another machine to feed it.

Doing the same under Linux at the time could not produce any similar hangs, but it may be due to a different implementation of ZFS and not using pipemeter.

I discussed it with AMD support over months (Aug-Dec 2017) and they gave me all sorts of power twiddling suggestions for firmware, but the only thing that ever really worked there was to disable hyperthreading (SMT), bringing the machine down to 8C/8T.

Have you tried disabling SMT in your firmware and seen if it changes anything?
Please note that if you do it and upgrade your firmware, you may need to do a full wipe of settings to even get the option back. Support (rightly) recommended that I do full clears of firmware settings between upgrades; it seems the migration path isn't quite solid.

Revision history for this message
In , dev (dev-linux-kernel-bugs) wrote :

(In reply to Lars Viklund from comment #345)
> Have you tried disabling SMT in your firmware and seen if it changes
> anything?

I can't do that, because it's a notebook with only 2 BIOS options (something virtualization something). I tried disabling both, but that had no effect at all.
As far as I can remember, I tried disabling SMT via Grub, but only to test whether it would boot (it didn't).

But I wouldn't say it's SMT-related, because the CPU runs utterly fine while it's below its base clock or at full load.
The freezes also occurred during single-threaded activity: while playing back random YouTube videos (autoplay, no user interaction) with Pale Moon (Firefox fork). That usually took 30 - 120 mins to freeze and was also tested with amdgpu disabled, so I'm pretty sure it's CPU-related.

I have to send it back to the vendor now, because I ran out of time (for the right to return).

Revision history for this message
In , devurandom (devurandom-linux-kernel-bugs) wrote :

(In reply to Dennis Schridde from comment #295)

As an update from my earlier post, here is what I wrote to AMD support recently:

I can confirm the effectiveness of the "Power Supply Idle Control = Typical Current Idle" non-default firmware-option introduced with AGESA 1.0.0.2a only partially. After setting this option it often still takes several attempts until the machine boots -- during the failing attempts I either get 3 long beeps from the mainboard followed by an automatic reboot, or I see varying Linux kernel stack-traces during the early boot, seemingly related to firmware / EFI and CPU / idle. Sometimes the system can be restarted from this state using soft-reboot (ctrl+alt+del). But sometimes the situation requires a hard-reset, e.g. because the system completely froze (similar to the original problem reported here), or because the init process dies. I have the feeling that in such situations the system, even if it does not freeze completely, works with corrupt data and then writes this to the hard disk [3,4]. I destroyed my installation several times in the last month due to these problems, e.g. because the system had destroyed all (?!) superblocks of the file system, or lately the LVM cache. Surprisingly, once the system has booted properly, it will run stably for days without any further issues.

I could not reproduce the problem synthetically using either PassMark memtest86 [5], TU Dresden Firestarter [6] or Google Stressful Application Test [7], the latter including CPU, RAM, hard drive and file-system tests. My hardware supplier also tested all components (CPU, RAM, mainboard) again using Windows with Prime95 and Furmark, as well as with Memtest, and assured me that they could not detect an issue either. During regular use I can reliably reproduce it on Gentoo (with Linux 4.16), Fedora 27 (Linux 4.14), Fedora 28 Beta (Linux 4.15), Fedora 28 (Linux 4.16) and Arch 2018.03 (Linux 4.15). Before the introduction of the "Power Supply Idle Control = Typical Current Idle" firmware-option, I first noticed the issue while compiling large amounts of software on Gentoo, but was later able to reliably reproduce the freeze with simple `rsync` operations on all other operating systems, too. This even happened when no X-server was running and the GPU was not utilised in any other way (as far as I know), so I do not see a connection to possibly incomplete Vega-GPU-support in Linux. After setting "Power Supply Idle Control = Typical Current Idle" these freezes stopped happening and I only see the problem during early boot.

[3]: https://www.redhat.com/archives/linux-lvm/2018-May/msg00006.html
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1585670
[5]: https://www.memtest86.com/
[6]: https://tu-dresden.de/zih/forschung/projekte/firestarter
[7]: https://github.com/stressapptest/stressapptest

Revision history for this message
In , mail (mail-linux-kernel-bugs) wrote :

Hi Dennis, from my point of view your case is not the same as the idle freezes reported in this bug. As the owner of a Ryzen 7 1700 that was replaced because of the other bug (segfaults during compilation), I know the bug discussed here, but my feeling is that it has become quite rare. Comparing it with my Ryzen 3 2200G would be unfair, as that is a completely different beast. That said, I can assure you I have trouble with that system as well. This is known and Phoronix has posted about it; they say the 2400G might be stable now, but not the 2200G. If you have issues with this APU, which I would expect from reading your post, then from my own experience you need the latest kernel, latest UEFI, latest firmware, latest Mesa... This should give you a working system; my 2200G still often fails to boot, but once it has booted it works OK. Because an APU is far more complex, you may be seeing GPU issues, and that is why I think your post doesn't belong to the original bug report. From my personal experience, I once got a corrupted btrfs file system on my Ryzen 7 1700, but in the end one of my 16GB RAM modules was defective (detected by Memtest86+) and it corrupted the file system during a crash. Again, I have the feeling we shouldn't mix too many different issues in this bug report, and I think my issues with the 2200G do not belong here.

Revision history for this message
In , phillips (phillips-linux-kernel-bugs) wrote :

I am happy to report 45 days of continuous uptime with "typical current idle" selected in bios, and otherwise vanilla everything including 4.15.0 kernel. My system configuration is as posted above.

I find myself willing to believe the AMD "indirectly" statement: this is related to the power supply. My power supply is an EVGA 650GQ, with the eco switch set on to reduce fan noise. I suspect that if I swapped it for some other power supply, perhaps one not quite so efficient, or flipped the eco switch off, I would find myself among those lucky users able to run with default BIOS settings and 4 watts less power consumption at idle. Unless anybody can refute this theory, I would consider it advisable for anybody still hitting problems to post their power supply details.

I suspect that the truth is, some power supplies just do not expect such low full system power draw, and freak out about it. Without "typical current", mine is 38 watts at the wall including 32GB of memory, but unfortunately, only lives a few days at that setting.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to Daniel Phillips from comment #349)
> I am happy to report 45 days of continuous uptime with "typical current
> idle" selected in bios, and otherwise vanilla everything including 4.15.0
> kernel. My system configuration is as posted above.

With my Ryzen 7 1800X and Asus X370 Pro: I can report that with the magic "typical" BIOS setting I have today 30 days uptime. This is more than twice the previous record.

FWIW: I am so fed up with this machine that I haven't used it since I updated the BIOS and applied the setting. It is running 4.16.5 (Fedora 27), with CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15. I don't know if the rcu_nocbs=0-15 is still required.
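
For readers who want to confirm that their kernel was built with CONFIG_RCU_NOCB_CPU before relying on rcu_nocbs=, here is a minimal Python sketch. It assumes the distribution installs the build config at /boot/config-<release>, which is a common convention but not something confirmed in this thread; the function names are made up for illustration.

```python
import platform


def config_enabled(config_text: str, option: str) -> bool:
    """Return True if `option` is set to y or m in the kernel config text."""
    for line in config_text.splitlines():
        if line.strip() in (f"{option}=y", f"{option}=m"):
            return True
    return False


def read_running_kernel_config() -> str:
    """Read /boot/config-<release> for the running kernel.

    Assumption: the distro ships the config at this path; some only
    expose /proc/config.gz or nothing at all.
    """
    path = f"/boot/config-{platform.release()}"
    with open(path, encoding="utf-8") as fh:
        return fh.read()


if __name__ == "__main__":
    try:
        text = read_running_kernel_config()
    except OSError as exc:
        print(f"could not read kernel config: {exc}")
    else:
        enabled = config_enabled(text, "CONFIG_RCU_NOCB_CPU")
        print("CONFIG_RCU_NOCB_CPU enabled:", enabled)
```

If the option is reported as not enabled, the rcu_nocbs= boot parameter is unlikely to do anything on that kernel.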

Also FWIW: zenstates.py -l tells me that C6 Package is Disabled, but C6 Core is Enabled. Before the BIOS update I used zenstates.py to set C6 the same way, but the machine froze after some 12 days. After the BIOS update I no longer use zenstates.py to set anything. So I guess the BIOS "typical" option disables C6 Package, but also does some other magic.

> I suspect that the truth is, some power supplies just do not expect such low
> full system power draw, and freak out about it. Without "typical current",
> mine is 38 watts at the wall including 32GB of memory, but unfortunately,
> only lives a few days at that setting.

Mr BeQuiet! are adamant that the Straight Power 11 I have is perfectly happy to supply 0A at all voltages.

Of course, there's a lot of stuff between the PSU and the CPU... so it could be a motherboard issue. Who can tell ?

Possibly, some day, I will go back to using my AMD machine, but I doubt I shall come to be fond of it :-( Certainly I am livid with AMD's abject failure to address the issue promptly, and their continuing inability to discuss or document the issue. Bugs happen. It's how they are dealt with that separates the sheep from the goats. <sigh>

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

Since the problem is not (and it seems will not) be solved, may I ask people here to send me full specs and BIOS settings of working configuration so that I can publish them on a web page (with the hope to be useful for other people) ?
So basically I am looking for the original spec list (HW, BIOS version and settings, boot options) and what you did to make your system stable (let's say stability starts with at least 14 days of uptime).

Just email me if you are interested, and if I have enough lines to put in that DB, I will post the link of the web page here later on.

Since the power supply (and harddrives) seem still to be an important piece of the puzzle, send me that information too.
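
The 14-day stability threshold proposed above can be checked mechanically. The following is a small sketch, assuming the usual Linux /proc/uptime format (uptime in seconds as the first field); the helper names and the default threshold argument are illustrative only.

```python
SECONDS_PER_DAY = 86400


def uptime_days(uptime_text: str) -> float:
    """Convert the contents of /proc/uptime into days of uptime."""
    return float(uptime_text.split()[0]) / SECONDS_PER_DAY


def is_stable(uptime_text: str, threshold_days: float = 14.0) -> bool:
    """Apply the 'at least 14 days of uptime' rule from the comment above."""
    return uptime_days(uptime_text) >= threshold_days


if __name__ == "__main__":
    try:
        with open("/proc/uptime", encoding="ascii") as fh:
            text = fh.read()
    except OSError as exc:
        print(f"could not read /proc/uptime: {exc}")
    else:
        print(f"uptime: {uptime_days(text):.1f} days, stable: {is_stable(text)}")
```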

Revision history for this message
In , kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote :

Is there anyone not affected by this when Package C6 is enabled?
Otherwise I intend to send the patch, which has the same effect as "typical current".

Revision history for this message
In , brauliobo (brauliobo-linux-kernel-bugs) wrote :

(In reply to Kai-Heng Feng from comment #352)
> Is there anyone not affected by this when Package C6 is enabled?
> Otherwise I intend to send the patch, which has the same effect as "typical
> current".

That would be perfect! Can you send us that patch for testing?

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
In , daniel (daniel-linux-kernel-bugs) wrote :

On 2018-06-09 07:33 AM, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #352 from Kai-Heng Feng (<email address hidden>) ---
> Is there anyone not affected by this when Package C6 is enabled?
> Otherwise I intend to send the patch, which has the same effect as "typical
> current".
>
There are multiple reports above that disabling package C6 is not the
same as setting typical current in Bios. ISTR there is even a report
that setting typical current did not disable C6 for that particular
bios. The theory that this power bug is related to motherboard is
gaining prominence, because different motherboards seem to have
different tweaks for "typical current". If only AMD would just tell us
what is going on.

Meanwhile, I am more than content with my Ryzen setup now that it is
apparently stable, but discontented with AMD's lack of disclosure. Not to
the point of swearing off AMD, but it got close.

Revision history for this message
In , hoper (hoper-linux-kernel-bugs) wrote :

How did I make my system ultra stable after months of despair?
Easy, I never let it idle anymore :p
Mining crypto money on at least some of the cores and I never
had trouble again :)

On 09/06/2018 at 16:28, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #351 from AMD Linux User (<email address hidden>) ---
> Since the problem is not (and it seems will not) be solved, may I ask people
> here to send me full specs and BIOS settings of working configuration so that
> I
> can publish them on a web page (with the hope to be useful for other people)
> ?
> So basically I am looking for the original spec list (HW, BIOS version and
> settings, boot options) and what you did to make your system stable (let say
> stability starts with at least 14 days of uptime).
>
> Just email me if you are interested, and if I have enough lines to put in
> that
> DB, I will post the link of the web page here later on.
>
> Since the power supply (and harddrives) seem still to be an important piece
> of
> the puzzle, send me that information too.
>

Revision history for this message
In , oyvinds (oyvinds-linux-kernel-bugs) wrote :

> I find myself willing to believe the AMD "indirectly" statement: this is
> related to power supply. My power supply is an EVGA 650GQ, with eco switch
> set on to reduce fan noise. I suspect that if I swapped it for some other
Some want to believe, and that's fine, but AMD's story is and always was bullshit, and here's why: my EVGA SuperNOVA 750 G3 & Corsair RM650i are modern, C6/"Haswell"-compliant PSUs. There are plenty of other people with a variety of new and old, cheap and expensive PSUs who are affected by this TOTAL SCANDAL bug.

The new Power Supply option in the BIOS, set to Typical, does fix it on all combinations of Ryzen motherboards/CPUs/PSUs I've tried.

The root cause could be a combination of motherboard/CPU/PSU, but my guess is that it has more to do with motherboard voltage regulators and less to do with PSUs. Buying a new PSU isn't a viable option anyway. What are you going to do? Buy every single PSU on the market and hope you stumble upon one that works with the "low current" power supply option in the BIOS after trying 20 that don't? If you sell a computer monitor that does not work properly on 19 out of 20 graphics cards, then it's not the graphics cards that are to blame; you're selling a defective product.

The Typical Idle setting is a fine solution if you know about it, so there's that. Works for me. I do feel sorry for the poor noobs who don't know and pull their hair out for hours before discovering the shocking truth, though.

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

I would just like to add some evidence that this affects Threadrippers. I have two identical systems with the 1950X on an Asus X399-A Prime motherboard. This motherboard does not have the Power Supply option in the BIOS. Both freeze from time to time.

But more interestingly, both systems freeze when running code developed by one of our students that uses Matlab and computes the SVD of a large matrix, followed by some nearly idle time writing to an NFS mount, many times in a loop. The code takes around an hour to run. I made a small script that loops between running the code and calling sleep for one extra hour to put the machine in an idle state. It always makes the machines freeze in less than one day. In some cases the freeze happens even while the code is running, not only when idle. This seems odd. It is very reproducible.
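
The original reproduction script was not posted. As a rough sketch of the pattern described (a workload followed by an hour of idle time, repeated), something like the following Python could be used. The function name, the injectable `sleep` parameter and the `./workload` command in the usage note are all hypothetical.

```python
import time
from typing import Callable


def load_idle_cycle(run_load: Callable[[], None],
                    idle_seconds: float,
                    cycles: int,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Alternate a workload with an idle period; return completed cycles."""
    completed = 0
    for _ in range(cycles):
        run_load()           # run the long computation to completion
        sleep(idle_seconds)  # then leave the machine idle for a while
        completed += 1
    return completed
```

A call such as `load_idle_cycle(lambda: subprocess.run(["./workload"], check=True), 3600, 24)` would approximate one day of the test described above, with `./workload` standing in for the actual computation.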

The solution is to disable C6 fully using zenstates.py. Disabling only C6 Package is not enough. Another possibility, which I am inclined to try in order to recover some of the performance I lost by losing turbo boost and XFR, is a light overclock. I am testing one of the machines now @3.75GHz.

Has anyone seen the "power supply" option on an X399 motherboard?

Revision history for this message
In , ryan (ryan-linux-kernel-bugs) wrote :

@Paulo: Are you running the latest bios? There is a newer bios from ~April or so, but I do not have the motherboard to try it out.

To add some more information to this thread, I have the following machine:

 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to On)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0601 04/19/2018, Typical Current set in the BIOS)

This setup is currently working better with the Typical Current setting, though I have experienced a soft lockup with these settings after about ~6 hours of uptime. I am going to run the same setup with the eco switch to off on the PSU to see if anything changes.

Revision history for this message
In , ryan (ryan-linux-kernel-bugs) wrote :

Eco mode supposedly only controls the PSU fan. I'm guessing that will not change anything on the CPU soft lock side.

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

@Ryan, yes. According to dmidecode I am using the 0601 BIOS, which is the latest one available on the Asus website.

Interesting that you are having soft lockups even with Typical Current. It completely solved the problems in my old 1700X system and it seems to have worked well for others. Are you overclocking? If yes, try first with default settings. Good luck!

Revision history for this message
In , eduardoreyesdpm (eduardoreyesdpm-linux-kernel-bugs) wrote :

@Paulo:
I downloaded the manual for your motherboard and found this

http://dlcdnet.asus.com/pub/ASUS/mb/socketTR4/PRIME_X399-A/E13557_PRIME_X399-A_BIOS_EM_WEB_20171030.pdf

EPU Power Saving Mode
The ASUS EPU (Energy Processing Unit) sets the CPU in its minimum power consumption
settings. Enabling this item will apply lower CPU Core/Cache Voltage and help save energy
consumption. Set this item to disabled if you are over clocking the system. Configuration
options: [Disabled] [Enabled]

Have you tried Disabling this?

Revision history for this message
In , daniel (daniel-linux-kernel-bugs) wrote :

On 2018-06-14 09:10 AM, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #359 from Ryan Phillips (<email address hidden>) ---
> Eco mode supposedly only controls the PSU fan. I'm guessing that will not
> change anything on the CPU soft lock side.
>
How about this theory: With Eco mode on, the power supply prevents the
current from increasing until the fan speeds up, causing the voltage to
drop ever so slightly, and pandemonium ensues. Note that EVGA's site does
not explicitly say that Eco mode only controls the fan. I am not
pointing the finger at EVGA, I am just exploring a theory, please refute
if possible.

I am willing to believe that this system is stable now that I am at 51
days uptime, so I will try some experiments. The first one will be, run
with both Typical Power and Eco Mode off, and all stock bios and kernel
settings.

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

@Eduardo. Yes, I found that option before. It is disabled by default (so the default is not to try the power saving). I even tried enabling it, hoping that the power saving system would be smart and avoid the lockup, but it didn't work.

What I am trying right now is to set the option "Overclocking Enhancement", which is described as follows: "It enables SenseMi Skew (artificially reading lower temperatures in order to trick XFR to boost longer and higher), Performance Bias and tweaks VRM settings to improve overclockability". I am especially interested in the tweaked VRM settings. (https://rog.asus.com/forum/showthread.php?97680-Overclocking-Enhancement)

Interestingly enough, if I only set this to Enable (no overclock, just default settings + this option enabled) it seems to make things better. As I said, I have code that consistently triggers a lockup if I run it in a loop followed by one hour of inactivity. I have been running this test on one of my systems with "Overclocking Enhancement" set to Enable and it has not triggered the bug in the last 24 hours. I have just changed the other machine to test this as well. I'll keep the test running today and the whole weekend on both machines and report back.

If anyone else is having the problem with an Asus X399 motherboard it may be worth a try and, please, report back too.

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

Oh, and I forgot to mention. Both machines are using the "Overclocking Enhancement" options and I have explicitly set c6 state to enabled using zenstates.py.

Revision history for this message
In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

Hi, a follow-up. My two Threadripper machines have been working flawlessly since Friday. I tried both running a test that alternates between high load and idle, which would normally trigger the bug, and leaving the machine idle for many hours.

It seems that a possible workaround for someone with an ASUS X399-A Prime motherboard is to set the BIOS option "Overclocking Enhancement". Beware that this option has a weird side effect of skewing the temperature to report a lower value. Therefore you should have a good cooling solution in place.

Revision history for this message
In , excieve (excieve-linux-kernel-bugs) wrote :

So AMD has finally published the revision guide for 17h family, which includes errata:
https://developer.amd.com/wp-content/resources/55449_1.12.pdf

Would be cool if someone with more knowledge about that kind of stuff could have a quick look through and see if this issue could be related to anything in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

Revision history for this message
In , brauliobo (brauliobo-linux-kernel-bugs) wrote :

(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

Wow, a lot of errata to be worked around...

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

(In reply to Artem Hluvchynskyi from comment #366)
> So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

I remember seeing the kernel fix/workaround for that issue when I was trying to diagnose our issue. I don't think they are related.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to James Le Cuirot from comment #368)
> I remember seeing the kernel fix/workaround for that issue when I was trying
> to diagnose our issue. I don't think they are related.

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix

But don't know if it's the same MWAIT problem as mentioned in the errata pdf above.

Revision history for this message
In , chewi (chewi-linux-kernel-bugs) wrote :

I was thinking of this but with fresh eyes, maybe that is yet another issue. I'll shut up now and leave this to the experts. :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88d879d29f9cc0de2d930b584285638cdada6625

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

@James Le Cuirot:
Linux 4.14 already came with the mentioned patch from the beginning, and it doesn't prevent freezing on idle at all - same as the MWAIT patch mentioned at Phoronix (see link above).

Revision history for this message
In , dagecko (dagecko-linux-kernel-bugs) wrote :

(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

"1033 A Lock Operation May Cause the System to Hang" also seems related, since some logs were referring to locking issues.

Revision history for this message
In , xiyou.wangcong (xiyou.wangcong-linux-kernel-bugs) wrote :

Hey, guys

We have been seeing this smp function call soft lockup since 3.x kernel. Unfortunately we don't know how to reproduce it either.

I suspect there is something on another CPU (not shown without sysrq-l) blocking the smp call function execution; we have to figure out what it is. So next time you see a live case of this bug, please collect the stack traces on all the CPUs at that time using sysrq-l.
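
When no keyboard is attached or the desktop no longer reacts to key combinations, the same request can usually be issued from a root shell (for example over SSH) by writing the letter `l` to /proc/sysrq-trigger. A minimal Python sketch follows; the function name is made up, it needs root, and the backtraces appear in the kernel log (dmesg), not on stdout.

```python
def request_all_cpu_backtraces(trigger_path: str = "/proc/sysrq-trigger") -> None:
    """Ask the kernel to log a backtrace of all active CPUs (SysRq 'l').

    Requires root. The result is written to the kernel log, so follow up
    with dmesg or journalctl -k to capture it.
    """
    with open(trigger_path, "w") as fh:
        fh.write("l")
```

Calling `request_all_cpu_backtraces()` with no arguments targets the real trigger file; the path parameter exists only so the function can be exercised against an ordinary file.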

Thanks!

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Thread still ongoing with no response from AMD.

Really shouldn't need to fiddle with custom bios options.

What is big business / big iron or whatever it's called doing with Epyc and linux servers? Are those totally fine?

Revision history for this message
Emilio (emilio-moretti) wrote :

I have a Ryzen 1800X + Gigabyte GA-AX370-Gaming K7.

I was suffering from these hangs until I changed a few settings in my BIOS. The change that made the difference was the Power Supply Idle Control option that was added in the latest BIOS update. All other settings are left at their defaults:
M.I.T->
  Advanced Frequency Settings ->
    Core Settings ->
      *Power Supply Idle Control = Typical
      *SVM Mode = On
Peripherals ->
  *Above 4G Decoding = On
Chipset ->
  IOMMU = Auto (Disabled)

BIOS version: F23f AGESA 1.0.0.2a + SMU FW 43.18

I'm running Ubuntu x64 on stock kernel 4.15.0-23-generic with the following options
GRUB_CMDLINE_LINUX_DEFAULT="radeon.cik_support=1 amdgpu.cik_support=0 iommu=soft"

As an extra measure I completely blacklisted AMDGPU from my system because it was also a source of many of the kernel freezes (AMD 390x GPU), and I haven't seen a system freeze in almost two months. (they were _very_ frequent before, almost daily)

Good luck.

Revision history for this message
In , robert (robert-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #375)
> Thread still ongoing with no response from AMD.
>
> Really shouldn't need to fiddle with custom bios options.
>
> What is big business / big iron or whatever it's called doing with Epyc and
> linux servers? Are those totally fine?

I am running an Epyc-based server, and as posted in this thread here:
https://forums.fedoraforum.org/showthread.php?317537-first-server-error-reboot-what-is-this-UUID
(worth a full read)
everything in there was solved by setting the correct Virtualization flags in BIOS, as specified by a Supermicro Tech as follows:
========
It looks like you want to enable the virtualization feature.
Please go into BIOS,
Advanced -> NB Configuration -> IOMMU (change to Enabled).
Advanced -> PCIe/PCI/PnP Configuration -> SR-IOV Support (change to Enabled).
========

After this, the server worked without a hitch, and I have now upgraded it from Centos7, to Fedora 28, and it runs like a charm.

Of course the server is not IDLE at all, and maybe that saves the server from the low power issue discussed in this thread. However, I have had it at least 7 days just sitting there not doing anything, and it never had any problem.

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

I'm praying intensely that the new AMD Epyc 3000 Embedded series (low power e3000) don't have the same issues. Hoping to buy one in the next few months when they finally put the damn things on some ITX boards.

Revision history for this message
In , ryan (ryan-linux-kernel-bugs) wrote :

I was able to capture a soft-lock with sysrq-l [1]. Networking and mouse were still functional, but focusing windows and window activity in Plasma were locked.

https://gist.github.com/rphillips/fae9540a8e7a3a83731c9a9809a98df4

Notes:
 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to Off)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0601 04/19/2018, Typical Current set in the BIOS)

Revision history for this message
In , alexandre9099 (alexandre9099-linux-kernel-bugs) wrote :

Is there any workaround/solution for this bug? With this problem I can't reliably have the PC running as a server :(

The bug is almost 1 year old

Revision history for this message
In , cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote :

There are two solutions or workarounds that have been reliable for me, one of which I now trust more than the other.

First, if your BIOS has the option for 'Power Supply Idle Control' hiding somewhere in its settings (generally in a sub-menu off an 'advanced' menu), setting it to 'Typical current idle' seems to work. I trust this workaround more than the next one, but it requires a BIOS that has been updated to include AGESA 1.0.0.2a (AGESA is apparently a magic blob that AMD supplies to vendors).

Before I had the BIOS option available, I also had a stable system by using the kernel command line parameters 'rcu_nocbs=1-15 processor.max_cstate=5' (some people use 1 as the maximum cstate). This requires a kernel that supports rcu_nocbs, which not all kernels are built to do, and is more magical than the BIOS setting; it's clear that these settings are stabilizing the system through some side effects, not their direct operation.

(I experimentally determined that on my hardware and setup, merely using 'processor.max_cstate=5' wasn't enough; my machine still locked up.)
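
To double-check that such parameters actually reached the running kernel, the contents of /proc/cmdline can be parsed. The sketch below is illustrative only: the helper names are invented, and `expand_cpu_list` handles just plain comma-separated numbers and ranges such as `1-15`, not every form the kernel's CPU-list syntax allows.

```python
from typing import Dict, List, Optional


def parse_cmdline(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def expand_cpu_list(spec: str) -> List[int]:
    """Expand a simple CPU list such as '0,2-3' into [0, 2, 3]."""
    cpus: List[int] = []
    for part in spec.split(","):
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


if __name__ == "__main__":
    try:
        with open("/proc/cmdline", encoding="ascii") as fh:
            params = parse_cmdline(fh.read())
    except OSError as exc:
        print(f"could not read /proc/cmdline: {exc}")
    else:
        print("rcu_nocbs:", params.get("rcu_nocbs"))
        print("processor.max_cstate:", params.get("processor.max_cstate"))
```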

My machine runs Fedora 27, using Fedora 4.16.x and 4.17.x kernels on a Ryzen 1800X on an ASUS Prime X370-PRO motherboard with ECC RAM, currently using BIOS 4011.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Yep, typical current parameter in your UEFI
If not there, update it

________________________________
From: <email address hidden> <email address hidden>
Sent: Monday, July 2, 2018 11:12:26 AM
To: <email address hidden>
Subject: [Bug 196683] Random Soft Lockup on new Ryzen build

https://bugzilla.kernel.org/show_bug.cgi?id=196683

--- Comment #379 from Alexandre Badalo (<email address hidden>) ---
Is there any workaround/solution for this bug? With this problem i can't
reliably have the PC running as a server :(

The bug is almost 1 year old

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

(In reply to Alexandre Badalo from comment #379)
> Is there any workaround/solution for this bug? With this problem i can't
> reliably have the PC running as a server :(
>
> The bug is almost 1 year old

People need to contact AMD, it's ridiculous this is still ongoing.

Revision history for this message
In , ryan (ryan-linux-kernel-bugs) wrote :

I have finally gotten to a stable system (no soft lockups). Asus released the 0702 BIOS, which includes AGESA update 1.0.0.2c. Typical Current is set in the BIOS. No other special kernel command-line options are enabled.

Machine:

 * Kernel 4.7.13
 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to Off)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0702 06/22/2018, Typical Current set in the BIOS)

Revision history for this message
In , toni.spets (toni.spets-linux-kernel-bugs) wrote :

Adding to the confusion here:

I'm running a Ryzen 1700 on an Asus PRIME B350-PLUS motherboard. On BIOS version 4011 I have the "Power Supply Idle Control" option; I changed it from "Auto" to "Low Current Idle" a week or so ago, and since then I have had zero lockups.

According to this bug report, disabling C6 states *or* setting that option to "Typical Current Idle" should work, and "Auto" should mean "Low Current Idle", hence the change should not have made any difference. zenstates.py shows C6 enabled for both package and core.

Unless the Ubuntu 18.04 kernel 4.15.0-*-generic got an off-tree patch applied somewhere in-between my reboots I'm confused and relieved at the same time.

Revision history for this message
In , dev (dev-linux-kernel-bugs) wrote :

Did the people with the new BIOSes double-check the 'Power Supply Idle Control' option by disabling it temporarily? Because the problem could also have been fixed by the AGESA update some of you mentioned (also mentioned in [1]).

tl;dr:
Had soft lockup at full load, BIOS-update seems to help, but it doesn't have the 'Power Supply Idle Control'-option.

More details:
My HP 17-ca0202ng (Ryzen 2500U) was stable for 2~3 weeks when started with the "idle=nomwait" kernel parameter, until *something*[2] happened and from one day to the next it was hard-locking again. On another test run it threw the soft-lockup error (NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!) while it was transcoding x264 material to x265 (more or less full load; 95% on all 8 threads). It was booted with "nomodeset" at the time the errors occurred, so I'm quite sure it can't be amdgpu-related.
When the soft lockup occurred, the system was still somewhat responsive: it froze for a number of seconds, then unfroze after another few seconds. I was able to run dmesg and saw the NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! message for all threads, before further freezes appeared at shorter intervals and the system froze completely. Strange!

But: HP provided a BIOS update (F4 > F10) some days ago. I was able to install it yesterday and have done 8 hours of full-load testing, 8 hours of idle testing and 7.5 hours of partial-load testing. No lockup so far and no kernel parameters needed (running Ubuntu's mainline build kernel 4.17.2, the packages from the Padoka stable PPA and the ravenridge firmware from [3] to prevent amdgpu freezes). As lockups previously always occurred within 3 hours, the laptop seems to be stable (or at least "more stable" than before).
I can't say anything about the changes HP made in that update, because they don't provide a changelog. But it's a laptop, so it has 3 or 4 BIOS options (something virtualisation something, something secure boot something and some boot order options) and for sure no 'Power Supply Idle Control'. Allow me to point out that laptops usually have power adapters specially designed for the device. So it can't be power supply related as AMD said (IMO).

[1] https://community.amd.com/thread/225795 , https://community.amd.com/thread/224000
[2] May be microcode-update related provided by Ubuntu, but I can't remember for sure.
[3] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu

Revision history for this message
In , robert (robert-linux-kernel-bugs) wrote :

Interestingly,

I had a nicely running system, uptime about 30 days, and I just decided to upgrade the Kernel to the latest (Fedora 28), and "what the heck" I will just upgrade the motherboard BIOS too.

Mobo: ASUSTeK model: PRIME B350M-A
6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1

So, now I have BIOS version 4104 (previous was 3801 I think)

And after like 10 minutes of doing stuff (i.e. not idle), LOCKED UP !!!

So I went into the BIOS and changed the Global C-State to Disabled, and right under that is the power profile, changed it to Typical.

We shall see how long it lasts !

I guess "don't fix what ain't broken" applied to my activity today lol.

In , alexandre9099 (alexandre9099-linux-kernel-bugs) wrote :

(In reply to Robert Hoffmann from comment #386)
> Interestingly,
>
> I had a nicely running system, uptime about 30 days, and I just decided to
> upgrade the Kernel to the latest (Fedora 28), and "what the heck" I will
> just upgrade the motherboard BIOS too.
>
> Mobo: ASUSTeK model: PRIME B350M-A
> 6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1
>
> So, now I have BIOS version 4104 (previous was 3801 I think)
>
> And after like 10 minutes of doing stuff (i.e. not idle), LOCKED UP !!!
>
> So I went into the BIOS and changed the Global C-State to Disabled, and
> right under that is the power profile, changed it to Typical.
>
> We shall see how long it lasts !
>
> I guess "don't fix what ain't broken" applied to my activity today lol.

My case is similar, but I don't recall updating the BIOS/UEFI firmware. It was after a kernel upgrade that the system started locking up :/

In , arup.chowdhury (arup.chowdhury-linux-kernel-bugs) wrote :

I have the ASUS B350M board and updated to the latest BIOS, 4014, but my lockups started with the latest kernel, 4.17, in Arch, where the lockup bug reappeared. I have set the idle current setting to typical and so far there has been no lockup in two days. Let's see what happens in a few days. Keeping my fingers crossed.

In , brauliobo (brauliobo-linux-kernel-bugs) wrote :

idle=nomwait fixed all hangs (from https://community.amd.com/thread/224000)

In , c.buddeweg (c.buddeweg-linux-kernel-bugs) wrote :

I've been watching this thread since February. My first stable workaround with legacy BIOS mode was the kernel parameter rcu_nocbs=0-15, because the BIOS didn't have the "Typical Current Idle" option.

Systeminfo
R7 1700, GA AX370-Gaming K7 with BIOS F23f in legacy mode, PSU Corsair HX750, OS Gentoo Linux Kernel 4.17.4, desktop kde-plasma-5.13.2

No overclocking or tweaking, only using the "Typical Current Idle" option (it's possible now with F23f) instead of "Auto".

So I no longer need rcu_nocbs=0-15 and everything works fine - no lockups.

In , iam (iam-linux-kernel-bugs) wrote :

These lockups are probably not related to this bug. I've updated my Intel Sandy Bridge laptop to 4.17.5 from the Fedora 28 repository and now I have random CPU lockups, too.
4.17.3 worked fine.

In , willemdreyer (willemdreyer-linux-kernel-bugs) wrote :

(In reply to ValdikSS from comment #391)
> These lockups are probably not related to this bug. I've updated by Intel
> Sandy Bridge laptop to 4.17.5 from Fedora 28 repository and now I have
> random CPU lockups, too.
> 4.17.3 worked fine.

Please take a look at this report, it matches your description https://bugzilla.redhat.com/show_bug.cgi?id=1598989 The user that reported it also has an Intel CPU (Acer Aspire V3-771 from his attached screenshot). I have experienced this issue once on one of my Ryzen systems running 4.17.6, could it be an unrelated problem? Another one https://bugzilla.redhat.com/show_bug.cgi?id=1598462

With regards to the original Ryzen Random Soft Lockup issue: I am witnessing it on every kernel that I have tested so far, ranging from kernel 4.10 through to 4.17.6. The frequency of the Ryzen soft lockup has increased since kernel >= 4.15 in my experience. It could just be a random effect as I am unaware of the cause of this problem.

I am running my machines at stock clocks and have the latest stable BIOS updates installed as of today on a Ryzen 1700 w/ MSI PRIME X370-PRO, Ryzen 1800X w/ ASRock Fatal1ty X370 Professional Gaming, and finally my personal desktop Ryzen 1800X w/ X470 Taichi Ultimate w/ Seasonic 1000W Platinum PSU (Haswell ready). All of my CPUs are running microcode patch level 0x8001137.

I can't find any correlation in workload; the issue occurs on web servers, VFIO gaming, even live USB sessions. I used the following to try to provide some insight:
journalctl -t kernel --no-pager | grep "soft lockup" | awk -F"!" '{print $2}' | sort -u
 [Compositor:4167]
 [kworker/10:1:18249]
 [kworker/1:3:418]
 [libvirtd:1226]
 [systemd:1]
 [Web Content:3505]
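
For anyone who wants per-task counts rather than just the unique names, the pipeline above can be approximated in Python. This is only a rough sketch: the function name `summarize_soft_lockups` is made up for illustration, and it assumes the message ends in the `stuck for NNs! [task:pid]` form quoted in this thread.

```python
import re
from collections import Counter

# Matches the tail of a soft-lockup line as quoted in this thread, e.g.
#   "... BUG: soft lockup - CPU#12 stuck for 23s! [kworker/10:1:18249]"
# Task names may contain ':' or spaces, so the PID is taken to be the
# last colon-separated field inside the brackets.
SOFT_LOCKUP = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+):(\d+)\]")


def summarize_soft_lockups(log_text):
    """Count soft-lockup reports per task name in kernel log text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[match.group(3)] += 1
    return counts
```

It could be fed with something like `journalctl -t kernel --no-pager` piped into a small script that calls `summarize_soft_lockups(sys.stdin.read())`.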

I have attempted CPU pinning, disabling ASLR completely. I have also tried isolating workloads in virtual machines that do not cross CCX units with hugepages (THP off). I am not an expert in knowledge of CCX, NUMA, Infinity Fabric, etc... I could have made a mistake in my test. That said, I am currently testing the "idle=nomwait" parameter in hope of getting better results, who knows perhaps even a stable system.

In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

There still does not appear to be a single consistent solution to this problem and it exists in Ryzen 1xxx and 2xxx series.

Apparently not in the Epyc series, is that correct?

This thread is utterly embarrassing for AMD, truly embarrassing.

In , kallisti5 (kallisti5-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #393)
> There still does not appear to be a single consistent solution to this
> problem and it exists in Ryzen 1xxx and 2xxx series.

Actually, there are several workarounds known to work listed in this thread.

I've been running a stable 1800X using the rcu_nocbs=0-15 fixes noted above for almost a year now.

> This thread is utterly embarrassing for AMD, truly embarrassing.

It's a nuanced, hard-to-troubleshoot bug in a processor + support chipsets with a known workaround, which only seems to happen on Linux... not really sure how that is "embarrassing". It sucks, it's new silicon. Welcome to being an early adopter.

Look at Intel's heart-bleed and all the other major security catastrophes they've had in recent history. Stuff happens. I do wish AMD would address it, but they're likely reluctant to bring it up until they have a reliable fix (if it even is a fix on their end)

In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

> It's a nuanced hard to troubleshoot bug in a processor + support chipsets
> with a known workaround which only seems to happen on Linux... not really
> sure how that is "embarrassing". It sucks, it's new silicon. Welcome to
> being an early adopter.

Thread Reported: 2017-08-16 19:05 UTC

https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700
16 month old silicon.

Nah this is too long.
Firstly, nothing should need to be done in Linux in any capacity; there should be a BIOS flag that fixes it.
Second, the BIOS flag should be on at all times.
Third, based on my second point, as long as someone has the latest firmware, it should 100% solve the problem.

Users should not need to mess around for stability, that is not acceptable.

In , kernel (kernel-linux-kernel-bugs) wrote :

I have Ryzen 7 1800X on Asus Prime X370-Pro. I upgraded the BIOS to v4011(Update AGESA 1.0.0.2a + SMU 43.18) and:

  1) turned on the "Typical Current Idle" option.

  2) stopped using zenstates.py -- which I had been using to enable "C6 Core"
     but disable "C6 Package" (to no avail).

  3) did *not* change Linux -- which was 4.16.5 -- Fedora 27.

  4) continued to use CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15

After 67 days uptime (leaving the system completely idle and changing nothing), I became convinced that the "Typical Current Idle" option has dealt with the "freezing when idle" problem.

When I say "freezing when idle", what I mean is: if the machine is left idle (typically over night) it simply stops responding. Nothing at all is logged -- no application, driver or kernel errors or warnings are logged -- the machine is still powered up, but frozen solid. The only way to restart the machine is to power down and up again.

Reviewing this thread, it seems to be mostly concerned with the "freezing while idle" issue.

The symptoms of the original "Random Soft Lockup" include log messages of the form:

     NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s!

is that related to "freezing while idle", or is it a separate issue ?

I get the impression that CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15 may be related to the "Random Soft Lockup"... but not to "freezing while idle" ???

It seems that other crashes/lockups are trying to attach themselves to this thread.

I note that this bug is assigned to <email address hidden>. This bug is very nearly 1 year old. Is this a good moment for the assignee to address this thread and say:

  * what, if any, Kernel issues have been identified

  * what, if any, Kernel fixes have been applied

related to this thread.

If the root cause of (some or all of) the issues in this thread is fixed or worked around by the "Typical Current Idle" BIOS option, does the assignee think that this "bug" can now be closed, or are there actual Kernel issues that remain, waiting to be fixed ?

Is it significant that W*nd*rs does not seem to suffer ?

In , phillips (phillips-linux-kernel-bugs) wrote :

This bug should be closed as "FIXED" in my humble opinion. "NEW" is grossly inaccurate.

My workstation uptime is now 109 days, continuous operation without suspend, mainly idle, mixed with episodes of heavy load on all cores and everything in between. The technical term for this is "spectacularly stable". Credit belongs to all involved, including AMD, kernel devs, motherboard maker, Debian maintainers, and many other groups who made this possible. Again in my humble opinion, the only issue remaining is AMD's failure to explain what went wrong and exactly how they worked around it. That is a separate bug, this one is done.

I suspect that Windows users did get this bug, but they didn't notice it because, well, use your imagination. Then it was quietly fixed in a Windows update supplied by AMD that went out around the same time as the Typical Power BIOS updates.

Antoine Pitrou (pitrou) wrote :

This bug still occurs for me with 4.15.0-32-generic (Ubuntu 18.04.1) if I don't disable C6 states.

In , pitrou (pitrou-linux-kernel-bugs) wrote :

@Daniel what makes you think this bug is fixed? It certainly isn't for me. I have an ASRock Fatal1ty X370 Gaming X motherboard and a Ryzen 7 1700 CPU, have updated the motherboard BIOS to its latest version (4.80 with PinnaclePI-AM4_1.0.0.2 Patch A) and am still getting idle lockups with kernel 4.15.0-32-generic (Ubuntu 18.04.1 kernel). The workaround for me is to disable C6 states using the Zenstates.py script (which has the unfortunate side effect of lowering the max CPU frequency in single-threaded mode).

(I also don't see any "Typical Current Idle" option in my Setup options, though I might have overlooked it)

In , nelsoneci (nelsoneci-linux-kernel-bugs) wrote :

Hello there. I read most of the thread a week ago.

My machine was freezing when idle and when not idle. I read most of this thread and thanks to it got a stable setup.

@Daniel: When you say that you think the problem is fixed, do you know what change/Linux version fixes the issue for most people?

-------------

Now my report.

I have the following setup:

Motherboard: AB350M-DS3H (F23d BIOS).
CPU: Ryzen 2700X.
No overclock.

I still wonder whether my CPU is doing too much extra work (I don't know what cpufreq would report if a C6 state was reached).

cpufreq stats:

CPU 0: 3.70 GHz:6.64%, 3.20 GHz:1.15%, 2.20 GHz:92.21%
CPU15: 3.70 GHz:4.86%, 3.20 GHz:0.80%, 2.20 GHz:94.34%

Anyway, to make things work I had to use three tweaks together. I tried the individual tweaks to no avail.

- Select in BIOS: "Typical Current Idle"
- Start Linux with: idle=nomwait
- Disable C6 states (both core and package) with Zenstates.py

Before doing the last step, Zenstates.py reported Core enabled and Package disabled.

I'm using Ubuntu 18.04.1 LTS. I didn't compile Linux with CONFIG_RCU_NOCB_CPU / CONFIG_RCU_NOCB_CPU_ALL.
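
Since several different boot parameters are being combined in this thread, it may help to check which of them the running kernel was actually booted with. Below is a minimal sketch, assuming the command line is read from /proc/cmdline; the helper names `parse_cmdline` and `active_workarounds` are hypothetical, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        # Split only on the first '=' so values like root=UUID=... stay intact.
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def active_workarounds(cmdline):
    """Report which of the boot-time workarounds from this thread are present."""
    params = parse_cmdline(cmdline)
    return {
        "idle=nomwait": params.get("idle") == "nomwait",
        # CPU list given to rcu_nocbs, or None if the parameter is absent.
        "rcu_nocbs": params.get("rcu_nocbs"),
    }
```

On a running system, `active_workarounds(open("/proc/cmdline").read())` would show what is in effect. Note that, going by earlier comments, rcu_nocbs only matters on kernels built with CONFIG_RCU_NOCB_CPU.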

So, things are working for me. But if you think I should test a new Linux version that is supposed to fix the issue please let me know.

In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Antoine Pitrou from comment #398)
> @Daniel what makes you think this bug is fixed? It certainly isn't for me.
> I have a ASRock Fatal1ty X370 Gaming X motherboard, a Ryzen 7 1700 CPU, have
> updated the motherboard BIOS to its latest version (4.80 with
> PinnaclePI-AM4_1.0.0.2 Patch A) and am still getting idle lockups with
> kernel 4.15.0-32-generic (Ubuntu 18.04.1 kernel). The workaround for me is
> to disable C6 states using the Zenstates.py script (which has the
> unfortunate side effect of lowering the max CPU frequency in single-threaded
> mode).
>
> (I also don't see any "Typical Current Idle" option in my Setup options,
> though I might have overlooked it)

Hi Antoine, I quickly checked the manual of your ASRock motherboard.
On your machine the "typical idle current" option is named "c6 mode".
Disable it and you get "typical idle current", I think,
while keeping the single-thread boost.

In , kernel (kernel-linux-kernel-bugs) wrote :

The "freezing while idle" problem does appear to be (at least) worked around by the "Typical Current Idle" option. If that counts as a fix, then the bug is fixed -- or rather, was never a Kernel Bug.

The "Random Soft Lockup" -- the nominal subject of the bug -- on the other hand... who can tell ?

It would be churlish not to be happy that my machine is now reliable and I am a believer in credit where credit is due.

So, as you say, all credit goes to AMD for their total silence up to and after the release of the work around (or fix) -- except for the advice that some (older) power supplies might not work. I don't really need to know why or how the new option does the trick, but I remain curious whether it has a material effect on power consumption.

But I also give credit to this bug's assignee for their total silence -- in particular:

  * it's not clear whether the "freezing while idle" and "Random Soft Lockup"
    problems are separate, related or the same.

  * hence, the CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15 voodoo may or may
    not still be recommended ?

    For all I know, with the "Typical Current Idle" option, I may be better
    off returning to the default configuration ?

  * and there is talk of kernel changes which reduce power consumption
    (significantly ?) -- which may or may not be related to these issues.

Sadly, the absence of solid information allows half-truths and ill-informed speculation to take its place and to live on in the undamped echo chamber of the interweb :-( [I acknowledge my small part in that noise :-(]

In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Daniel Phillips from comment #397)
> This bug should be closed as "FIXED" in my humble opinion. "NEW" is grossly
> inaccurate.
>
> My workstation uptime is now 109 days, continuous operation without suspend,
> mainly idle, mixed with episodes of heavy load on all cores and everything
> in between. The technical term for this is "spectacularly stable". Credit
> belongs to all involved, including AMD, kernel devs, motherboard maker,
> Debian maintainers, and many other groups who made this possible. Again in
> my humble opinion, the only issue remaining is AMD's failure to explain what
> went wrong and exactly how they worked around it. That is a separate bug,
> this one is done.
>
> I suspect that Windows users did get this bug, but they didn't notice it
> because, well use your imagination. Then it was quietly fixed in a Windows
> update supplied by AMD that went out around the same time as the Typical
> Power bios updates.

It seems that the normal power mode under this OS disables core parking (C6).
I say this because I checked myself, and core parking doesn't work by default under W10.
So that could explain why it's not a Windows thing.

In , chewi (chewi-linux-kernel-bugs) wrote :

To anyone applying the BIOS/CONFIG_RCU_NOCB_CPU/ZenStates solution together with the idle=nomwait solution, I believe this is unnecessary as these are different problems relating to different CPUs. The former worked for my Ryzen 5 1600X, while only the latter worked for my much newer Ryzen 7 2700U (Raven Ridge). I've heard the idle=nomwait issue applies to all Raven Ridge CPUs. I don't know if that's true but I'm fairly certain it doesn't apply to any earlier hardware.

In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

The title of the bug should be changed to include Ryzen 7 1700X, and the reporter should follow up, if the problem still occurs with Linux 4.18.

For the other issues, separate bug reports should be submitted. Though the mailing lists might be the better forum for this, as the Linux kernel developers for the subsystem do not seem to use the Kernel.org Bugzilla.

In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Antoine Pitrou from comment #398)
> @Daniel what makes you think this bug is fixed? It certainly isn't for me.
> I have a ASRock Fatal1ty X370 Gaming X motherboard, a Ryzen 7 1700 CPU, have
> updated the motherboard BIOS to its latest version (4.80 with
> PinnaclePI-AM4_1.0.0.2 Patch A) and am still getting idle lockups with
> kernel 4.15.0-32-generic (Ubuntu 18.04.1 kernel). The workaround for me is
> to disable C6 states using the Zenstates.py script (which has the
> unfortunate side effect of lowering the max CPU frequency in single-threaded
> mode).

Please remember, this is the upstream Linux kernel bug tracker. So, reports should only be for the latest stable releases, which currently is 4.18.3 [1]. Some distributions make it easy to install those. Packages are available for Ubuntu [2].

Please report issues with the heavily patched Ubuntu default Linux kernel to the Ubuntu bug tracker [3].

[1]: https://www.kernel.org/
[2]: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.18.3/
[3]: https://bugs.launchpad.net/ubuntu/+source/linux/+bugs

In , pitrou (pitrou-linux-kernel-bugs) wrote :

@Paul I understand this. I'm also monitoring the corresponding Ubuntu issue. Still, I was asking whether there was a specific reason to believe this bug was fixed upstream (as opposed to the fact that a single person doesn't experience the issue anymore).

In , pitrou (pitrou-linux-kernel-bugs) wrote :

(In reply to Michaël Colignon from comment #400)
>
> Hi Antoine, i checked the manual of your Asrock motherboard quickly.
> For your machine the "typical idle current" is named "c6 mode"
> Disable it and you get the "typical idle current", i think.
> While keeping the boost monothread.

Hi Michaël, that option was removed in the latest BIOS updates. Probably because it didn't work at all.

In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Antoine Pitrou from comment #407)
> (In reply to Michaël Colignon from comment #400)
> >
> > Hi Antoine, i checked the manual of your Asrock motherboard quickly.
> > For your machine the "typical idle current" is named "c6 mode"
> > Disable it and you get the "typical idle current", i think.
> > While keeping the boost monothread.
>
> Hi Michaël, that option was removed in the latest BIOS updates. Probably
> because it didn't work at all.

Oh yes, it works. On other boards it fixes the crashes at idle.
Contact ASRock support about the needed "typical idle current" option.

In , pitrou (pitrou-linux-kernel-bugs) wrote :

I don't think there is any reason to think that the "c6 mode" option had anything to do with "typical idle current". Its name implied that it disabled C6 states, which it actually didn't (according to the Zenstates script).

In , zankuro (zankuro-linux-kernel-bugs) wrote :

I have an Asus ROG Strix GL702ZC notebook with Ryzen 7 1700 + Radeon RX580, and I have soft lockups too. The computer completely freezes when idle. My system is Ubuntu 18.04.1. Kernel parameters like "idle=nomwait" and "rcu_nocbs=0-15" don't fix my problem. Now I'm using my computer "without lockups" with the CPU governor set to "performance". My BIOS, updated to 305, does not have the option "Typical Idle Current" or similar.

In , pitrou (pitrou-linux-kernel-bugs) wrote :

Unfortunately, setting the governor to performance doesn't fix the issue here (Ryzen 7 1700, ASRock Fatal1ty X370 Gaming X).

In , kernel (kernel-linux-kernel-bugs) wrote :

It appears that selecting the "Typical Current Idle" BIOS option (where available) eliminates the "freeze when idle" problem (somehow).

That said, it also appears that this is *not* the default.

The need for the option appears to be Linux specific, or at least it is not needed for W*nd*rs.

That said, I am damned if I can find any documentation for the option, or any authoritative recommendation for its use with Linux.

So, anyone bitten by this problem may struggle to find their way to selecting "Typical Current Idle". Particularly as the symptoms have all the appearance of a hardware issue -- possibly an overclocking step too far.

Would it be possible for the Kernel to set whatever the "Typical Current Idle" BIOS option sets ? Or invoke some BIOS function to select the option ? That way, at least an up to date version of Linux would avoid this problem "out of the box".

I have been clapping my hands vigorously for some time, but I am coming to the conclusion that the <email address hidden> fairy does not, in fact, exist (or perhaps not in this universe).

While I am practising believing the regulation six impossible things before breakfast, I note that the <email address hidden> fairy would have to be able to communicate with the AMD documentation fairy... perhaps using a 56k ouija board ?

Finally and FWIW, I tried turning off "rcu_nocbs=0-15" and "Typical Current Idle", and my Ryzen 7 1800X/ASUS X370-Pro/Linux 4.18/Fedora froze within hours. Before the "Typical Current Idle" option became available, I was running "rcu_nocbs=0-15" (following the advice in this thread), and my machine froze every week or so. I have now been running for some days with "Typical Current Idle" turned back on, but without the "rcu_nocbs=0-15" voodoo -- so far, so good (as the man said as he fell past the 7th floor).

In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

So still no concrete 100% sure fire solution?

Incredible.

In , kernel (kernel-linux-kernel-bugs) wrote :

Who knows ?

When I say that "Typical Current Idle" appears to eliminate the "freeze when idle" problem, I mean that (a) that is my experience (to date), and (b) I have seen others reporting similar experience.

I'm not aware of any public, definitive information from AMD, any motherboard vendor or the relevant kernel folk -- certainly nothing which says what the "Typical Current Idle" BIOS option actually does, and whether or how it solves the problem, in part or completely.

Sadly, this long thread has almost nothing to do with identifying any actual Kernel bug or bugs. It's more a sort of support group. [Hello, I'm Chris and I'm an AMD Ryzen user... since I found the "Typical Current Idle" BIOS option, I haven't had a "freeze while idle" for two weeks -- praise be.]

If you are looking for useful information, look away now: angry, bewildered person about to howl at the moon...

--------------

...do I find the situation "Incredible" ? hmmm...

[Final warning: reduced signal/noise ratio ahead.]

I bought my Ryzen machine the moment I could order it.

When I came to build it, I found there was some confusion about the "standard" fitting for the CPU cooler -- the plate glued to the motherboard and the cooler fitting were incompatible. I found myself in a three-way stand-off between AMD, the motherboard vendor and the cooler vendor. If there was a specification for the standard AM4 socket fitting for the cooler, it was Top Secret. In the end I voided all warranties, ripped the plate from the motherboard and replaced it by the one supplied with the cooler.

When I set up my new machine, I found the standard configuration for the memory was not as fast as I expected, noting <https://www.amd.com/en/products/cpu/amd-ryzen-7-1800x>. But, I stumbled across this <https://community.amd.com/community/gaming/blog/2017/03/14/tips-for-building-a-better-amd-ryzen-system> where I discovered that the 2667 MT/s I was expecting was for 2 DIMMs; with my 4 DIMMs, 2133 MT/s is what I should have expected. Silly Me ! I have seen BIOSes since then which are supposed to improve memory support... but I think that's more options for overclocking... who knows ? Call me old fashioned, but I am disappointed (but no longer surprised) that I cannot find an AMD data sheet that specifies memory support (or much of anything else, for that matter).

Can I find any documentation for AGESA ? No. Its various versions (and changes in version numbering) ? No.

Does my motherboard vendor provide release notes for each BIOS version ? No. How did I discover which BIOS version had the "Typical Current Idle" option ? By experiment.

If the <email address hidden> fairy has died because I stopped clapping, then I'm sorry.

... so, Incredible ? Nah. Infuriating ? You bet.

In , arup.chowdhury (arup.chowdhury-linux-kernel-bugs) wrote :

Haven't had a single lockup after updating to kernel 4.15 and setting the BIOS to the typical idle current setting. I have run the PC for a few days with no issues and no lockup under heavy CPU use. This is an ASUS B350M board with the latest BIOS on a Ryzen 1700.

In , daniel (daniel-linux-kernel-bugs) wrote :

I erred earlier in suggesting that this bug should be closed with
"fixed". Rather, this bug should be closed with "notabug" because
there appears to be no evidence to suggest that it is actually a kernel
bug, compared to considerable evidence suggesting it is not.

Meanwhile, I personally appreciate the slow trickle of comments along
the lines of confirming that the typical current bios fix is effective,
compared to the absence of comments that the bug persists even with this
option enabled. My current uptime with typical current enabled is 136
days, compared to only a few days without it.

In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

The situation is incredible, in that there doesn't (seem) to be a concrete sure fire solution to this.

If there's a BIOS option, which can fix it, it should damn well be put MANDATORY in all BIOS - not an option, locked in. System stability should not be this unreliable. Especially now they're on 2xxx series processors.

This huge job log is nearly a year old now, it's crazy. It should be fixed on the processor itself or the BIOS.

In , Michael (michael-linux-kernel-bugs) wrote :

Sorry guys, Typical Current Idle doesn't fix it for me.

I can't believe this is not yet nailed down. I've kinda lost belief in the AMD comeback.

I do have this issue.

My setup is a 2700X on a Gigabyte B450 AORUS PRO.

I did hunt this freeze issue down to the behaviour described here: the system is suddenly stuck during normal office use. Everything stays where it is. No restart, oops, panic, or log entries of any type.

Only reset-button helps.

I did try to disable Global C-State Control in UEFI - didn't help.
After finding this thread today I set Power Supply Idle Control to "Typical Current Idle" and set Global C-State back to Auto (since I thought that doesn't help). Sadly, my system froze after 3h.

That is with the first and stock UEFI version F1 (AGESA 1.0.0.4).
I've just seen that there is an updated UEFI version F2. I've just updated and set global c-states to disabled AND Power Supply Idle Control to "Typical Current Idle".

"sudo ./zenstates.py -l" is now:
C6 State - Package - Disabled
C6 State - Core - Disabled
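
For scripting a check like this (for example, to log the C6 state after every boot), the two lines above can be parsed with a few lines of Python. This is a sketch only: `parse_c6_states` is a hypothetical helper, and it assumes the listing contains lines in exactly the "C6 State - NAME - Enabled/Disabled" form quoted here.

```python
def parse_c6_states(output):
    """Map each C6 state name (e.g. 'Package', 'Core') to True if enabled."""
    states = {}
    for line in output.splitlines():
        parts = [part.strip() for part in line.split(" - ")]
        # Only lines of the form "C6 State - <name> - <Enabled|Disabled>" count;
        # anything else in the listing is ignored.
        if len(parts) == 3 and parts[0] == "C6 State":
            states[parts[1]] = parts[2] == "Enabled"
    return states
```

With the output quoted above, both "Package" and "Core" would map to False.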

I'll come back with the results.

This makes the whole AMD 1xxx and 2xxx seem unstable.

Questions:
- Why does AMD not care? Is this maybe NOT a general issue and only affecting a small percentage of users?

- If Windows doesn't suffer from that issue - can we find out why and do a "if amd then powersave like windows"?
Or is it just the lack of information/support by vendor so that we don't know what exactly the issue is, what is causing it and how to work around it?

In , nelsoneci (nelsoneci-linux-kernel-bugs) wrote :

Michael, with all that enabled, also try the "idle=nomwait" kernel option if the upgrade to F2 doesn't work.

I have a similar BIOS and the same CPU. I had to enable everything you did, but it wouldn't work without idle=nomwait. See my post above.

In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

Hi, another Michael here.
The typical idle current setting fixed the trouble. If you still have freezes under low load after enabling it, it's either:
a bug in the UEFI that doesn't really apply "typical idle current" (partially disabling core parking), or
a different problem from the one in this thread.

I want to add another thing, about fixing it directly from the kernel.
Typical idle current seems (I am testing it again to check) not to be needed under the other OS. That's because core parking is partially disabled by DEFAULT on that OS, as I said earlier in this thread.

So it seems possible to fix it via the kernel, without using a UEFI parameter that is not enabled by default.
It could improve the user experience under Linux for people who don't know about this trouble.

So why not simply do the same on Linux? IIRC from reading the hardware.fr article, the same applies to Intel CPUs. It's a tweak for increasing performance, AFAIK. Ryzen gains ~10% (in games, where latencies matter more) when "deep sleep" of the cores is not enabled.

My wattmeter is not a good one, but I see almost no difference whether core parking is partially disabled or not. It must save very little power.

You can all easily test it with core parking control software on this other OS.

In , nancy154 (nancy154-linux-kernel-bugs) wrote :

(In reply to Michaël Colignon from comment #420)
> ...
> I want to add another thing, about fixing it directly from the kernel.
> As typical idle current seems (i am testing it again for see) not needed
> under the other OS, that's because the core parking under this other OS is
> partially disabled by DEFAULT on this OS as i said earlier in this thread.
>
> So it seems possible to fix it via the kernel, without make use of an uefi
> parameters not enabled by default...

I can perhaps confirm that. But I have to say first, I'm not a developer and I do not understand anything about Linux.

I use an AMD Ryzen 5 1600 Six-Core @ 12x 3.2GHz with Gentoo as the operating system. So it's very easy for me to build a kernel just for my machine. (Without the whole Intel, ARM64, Android and Big-Data stuff.)

I test the kernels with the standard BIOS setup, without kernel parameters like "idle=nomwait", and without OC. My BIOS only offers Global C-states with Auto (the default), Enabled and Disabled. There is no "Idle Control" option that could be set to "Typical Current Idle" in the BIOS. Only CONFIG_RCU_NOCB_CPU=y is set in the .config.

BOOT_IMAGE=/boot/kernel-x86_64-4.19.0-rc3 root=UUID=465f9c93-09d0-441b-9440-8ec7799b557c ro real_root=/dev/sdd4 resume=/dev/sdd3 init=/usr/lib/systemd/systemd rootfstype=ext4

./zenstates.py -l
...
C6 State - Package - Enabled
C6 State - Core - Enabled
...
On this basis, I can confirm that kernel 4.17.14 works really well. But the 4.18.x series is terrible, including 4.18.7, which is currently declared "stable": it freezes or reboots every 2 - 3 hours.

To my surprise, the 4.19.0-rcX kernels work well. So far no soft lockups.

I do not know why. But these are the subjective results of my playing with the kernels.

I can also understand that the few kernel developers are working on priority issues. Please think of Spectre and Meltdown.

But some clarifying information would be very welcome.

Revision history for this message
In , Michael (michael-linux-kernel-bugs) wrote :

Hi,

update from me, as promised. First my system for reference:
2700X
B450 Chipset - Gigabyte B450 AORUS PRO with F2 BIOS
PSU from 2018 which is "Haswell compatible".

Typical Current Idle AND disabling global c-states didn't help.
Running Ubuntu 18.04 LTS with its stock kernel 4.15.0-34 right now.

I am having some success (4 days now) with all of that AND idle=nomwait kernel parameter - although I only read that to be related to resets and not freezes.

Anyway, I will keep my system on.

I would like to trade in the LEDs on GPU and mainboard for some vendor effort to get stability into their systems.

Any update from someone who really knows what is going on would be really appreciated. We're playing around with settings which are not explained anywhere.

Revision history for this message
In , Michael (michael-linux-kernel-bugs) wrote :

So, I'm running Ubuntu 18.04 stock kernel with
- Typical Current Idle
- disabling global c-states
- idle=nomwait kernel parameter

and have had no freezes for 8 days now, a new record. So it looks like I can confirm what Nelson Castillo mentioned in comment #419.

This seems ok now. Now I need to verify if all changes are needed or if only one or two are necessary.

Will come back with an update when I have the results.

Michael

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

So what are the implications of using idle=nomwait, and why isn't it the standard behaviour for all Ryzen CPUs going forward?

Revision history for this message
In , livubuntu20 (livubuntu20-linux-kernel-bugs) wrote :

I had this same bug in June 2018 and in the beginning of July 2018. It seems to be solved for me by " Look in BIOS settings for "PSU Idle Control" and set it to "Typical Current Idle" "

The bug appeared after a few days (2-7) of low usage.

I was using openSUSE Leap 15.0 before this and no bug. Then I switched to Leap 15.1 and the bug appeared for the first time (it had a newer kernel). Then I had about 3 weeks before finding this fix.

Now I have openSUSE Tumbleweed with kernel 4.18.8. It was OK since kernel 4.14 and now it keeps on updating the kernel as usual and everything is OK.

Thank you for providing this fix!

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.

The zenstates.py script fails when I try to disable C6. Oh well.

I have kernel parameters: idle=nomwait processor.max_cstate=5

I still get lockups. No BIOS settings available on this machine.

Feel pretty hopeless at the moment.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to JerryD from comment #426)
> I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.
>
> The zenstates.py script fails when I try to disable C6. Oh well.

Did you load msr kernel module before (modprobe msr)?

Do you have the possibility to slightly overclock? Isn't there a BIOS switch like "Typical Current Idle"? Is there a possibility to switch off C-states completely in the BIOS (just for testing, to be sure your hangs have the same cause as here)?

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #427)
> (In reply to JerryD from comment #426)
> > I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.
> >
> > The zenstates.py script fails when I try to disable C6. Oh well.
>
> Did you load msr kernel module before (modprobe msr)?
>
> Do you have the possibility to slightly overclock? Isn't there a Bios switch
> like "Typical Current Idle"? Is there a possibility to switch of C-states
> completely in the Bios (just for testing to be sure your hangs are the same
> reason as here)?

As follows:

[root@amdr jerry]# modprobe msr
[root@amdr jerry]# ./bin/zenstates.py --list
P0 - Enabled - FID = 64 - DID = A - VID = 35 - Ratio = 20.00 - vCore = 1.21875
P1 - Enabled - FID = 66 - DID = C - VID = 60 - Ratio = 17.00 - vCore = 0.95000
P2 - Enabled - FID = 60 - DID = C - VID = 66 - Ratio = 16.00 - vCore = 0.91250
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

[root@amdr jerry]# ./bin/zenstates.py --c6-disable
Traceback (most recent call last):
  File "./bin/zenstates.py", line 112, in <module>
    writemsr(0xC0010292, readmsr(0xC0010292) & ~(1 << 32))
  File "./bin/zenstates.py", line 23, in writemsr
    raise OSError("msr module not loaded (run modprobe msr)")
OSError: msr module not loaded (run modprobe msr)

As you can see msr is loaded and listing the states works fine but the disable option fails.

The bios on this laptop has no power related options that I can see.
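
The traceback above reports "msr module not loaded" although the module is loaded and reading works, so that message may be hiding the real error. A minimal sketch for seeing the underlying error, assuming the usual /dev/cpu/N/msr device of the msr driver. Only the register address 0xC0010292 and bit 32 come from the traceback; the function names and the suggested causes are illustrative guesses, not part of zenstates.py:

```python
import errno
import os
import struct

PACKAGE_C6_MSR = 0xC0010292  # address and bit number taken from the traceback
PACKAGE_C6_BIT = 32


def clear_package_c6(value):
    """Return the register value with the package C6 bit cleared."""
    return value & ~(1 << PACKAGE_C6_BIT)


def explain(exc):
    """Map an OSError from the msr device to a likely cause (best guess only)."""
    hints = {
        errno.ENOENT: "device node missing: msr module probably not loaded",
        errno.EACCES: "permission denied: not running as root?",
        errno.EPERM: "operation not permitted: MSR writes may be blocked",
        errno.EIO: "the CPU rejected the access to this register",
    }
    return hints.get(exc.errno, "unexpected error: " + str(exc))


def read_msr(cpu, msr):
    """Read one 64-bit MSR on the given CPU (needs root)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def write_msr(cpu, msr, value):
    """Write one 64-bit MSR on the given CPU (needs root)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), msr)
    finally:
        os.close(fd)


def disable_package_c6(cpu=0):
    """Try to clear the package C6 bit on one CPU and report the real error."""
    try:
        current = read_msr(cpu, PACKAGE_C6_MSR)
        write_msr(cpu, PACKAGE_C6_MSR, clear_package_c6(current))
        return "ok"
    except OSError as exc:
        return explain(exc)
```

Calling disable_package_c6() as root returns either "ok" or a hint that distinguishes a missing module from a rejected write, which is the distinction the original error message does not make.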

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

The root cause of the "freeze when idle" appears to be a fault in the mechanism which wakes the CPU up once it has gone into the deepest of deep sleeps.

Mr AMD has pointed the finger at PSUs which fail to maintain the correct voltage when the current draw approaches 0A. Between the PSU and the CPU there is circuitry on the motherboard, which I guess could also be involved ?

Disabling C-States, in particular C6, obviously addresses the apparent root cause.

However, for my machine, setting C6 Package Disabled (using zenstates.py) was not enough. Nor was it enough to set both C6 Package and C6 Core Disabled.

After setting "Typical Current Idle", I have had ~5 months without a "freeze when idle" -- with various kernels up to and including 4.18.9, and (latterly) with no other settings.

So I assume that "Typical Current Idle" does something more than disabling C6... something I have yet to discover.

------------

What the "Typical Current Idle" option does is secret :-(

On my machine (Ryzen 7 1800X, Asus X370-Pro, BIOS 4012), zenstates.py tells me:

   Low Current Idle : C6 Package Enabled : C6 Core Enabled
   Typical Current Idle : C6 Package Disabled : C6 Core Enabled
   Auto : C6 Package Enabled : C6 Core Enabled

and all three have the same three P-States:

   P0: FID=90 DID=8 VID=20 Ratio=36.00 vCore=1.35000
   P1: FID=80 DID=8 VID=2C Ratio=32.00 vCore=1.27500
   P2: FID=84 DID=C VID=68 Ratio=22.00 vCore=0.90000

[I thought that "Typical Current Idle" might be fiddling with P2, but that does not seem to be the case.]

I imagine there are many more parameters I could look at, if only I knew more. For completeness, I leave all other BIOS options in their default state.
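
For anyone comparing P-state listings, the Ratio and vCore columns can be recomputed from the raw FID/DID/VID fields. A small sketch: the two formulas are an assumption inferred from the listings in this thread (they reproduce every P-state line posted here) rather than something taken from AMD documentation, and the function names are made up:

```python
import re

# Matches both "FID=90 DID=8 VID=20" and "FID = 64 - DID = A - VID = 35".
FIELDS = re.compile(
    r"FID\s*=\s*([0-9A-Fa-f]+).*?DID\s*=\s*([0-9A-Fa-f]+).*?VID\s*=\s*([0-9A-Fa-f]+)"
)


def decode_pstate(fid, did, vid):
    """Return (ratio, vcore) for one P-state; inputs are plain integers."""
    ratio = 2.0 * fid / did        # assumed: multiplier = 2 * FID / DID
    vcore = 1.55 - 0.00625 * vid   # assumed: 6.25 mV per VID step below 1.55 V
    return ratio, vcore


def decode_line(line):
    """Parse the hexadecimal FID/DID/VID fields of one listing line."""
    match = FIELDS.search(line)
    if match is None:
        raise ValueError("no FID/DID/VID fields found")
    fid, did, vid = (int(field, 16) for field in match.groups())
    return decode_pstate(fid, did, vid)
```

Feeding it the P0 line above (FID=90, DID=8, VID=20) gives a ratio of 36.0 and a vCore of about 1.35 V, matching the listing, so a BIOS option that changed voltages or multipliers should show up as a difference in these fields.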

While I was checking the effect of the "Current Idle" options, I noticed something peculiar about setting "Typical Current Idle". I started with:

 0) "Typical Current Idle" was: C6 Package Disabled

and then:

 1) reboot into BIOS, set "Low Current Idle", gave: C6 Package Enabled
 2) reboot into BIOS, set "Typical Current Idle", gave: C6 Package Enabled !
 3) shutdown and restart, gave: C6 Package Disabled !

To get the (full) effect of "Typical Current Idle" I have to do a cold boot, apparently !! [I tried this three times, just to be sure.]

------------

I'm curious about "idle=nomwait". I find this will "disable mwait for CPU C-states". AFAIKS, the MWAIT instruction halts the current thread and sets a given C-State to drop into. So not using MWAIT looks like another way of disabling C6 for the core ? Or is there something else going on here ?

Given that I found that disabling C6 is not enough to eliminate "freeze when idle", I would not expect "idle=nomwait" to be enough. Unless "idle=nomwait" does something more... avoiding a Kernel bug, for example ?
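
When comparing results it helps to be sure the parameter really reached the kernel. A minimal sketch that parses a kernel command line such as the contents of /proc/cmdline; the helper names are hypothetical, and quoted values containing spaces are not handled:

```python
def parse_cmdline(text):
    """Split a kernel command line into a dict; bare flags map to None."""
    params = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def uses_nomwait(text):
    """True only if the command line contains exactly idle=nomwait."""
    return parse_cmdline(text).get("idle") == "nomwait"


def current_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as handle:
        return handle.read()
```

On a running system, uses_nomwait(current_cmdline()) reports whether the option is active; a misspelled value such as idle=nowait does not match.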

Revision history for this message
In , Michael (michael-linux-kernel-bugs) wrote :

(In reply to Michael from comment #423)
> So, I'm running Ubuntu 18.04 stock kernel with
> - Typical Current Idle
> - disabling global c-states
> - idle=nomwait kernel parameter
>
> and have no freezes for 8 Days now - new record. So it looks like I can
> confirm what Nelson Castillo mentioned in comment #419
>
> This seems ok now. Now I need to verify if all changes are needed or if only
> one or two are necessary.
>
> Will come back with an update when I have the results.
>
> Michael

Hi!

First, I've changed my mainboard (see below). AGESA-wise this was a downgrade from 1.0.0.4 to 1.0.0.2.
This does change the chipset from B450 to X470.

Second, I did enable the global c-states in UEFI and am still running without freezes for over 14 days now. I consider that long enough to call it stable.

That means only the following changes are needed:

- typical current idle UEFI setting
- the idle=nomwait kernel parameter

For reference:
Asrock X470 Master SLI with BIOS 1.10
2700x
no overclocking, but ram on XMP profile.

# ./zenstates.py -l
P0 - Enabled - FID = 94 - DID = 8 - VID = 36 - Ratio = 37.00 - vCore = 1.21250
P1 - Enabled - FID = 80 - DID = 8 - VID = 59 - Ratio = 32.00 - vCore = 0.99375
P2 - Enabled - FID = 84 - DID = C - VID = 76 - Ratio = 22.00 - vCore = 0.81250
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Enabled

I'll stop playing with parameters since I think I have reached the best I can. I'll only post an update if something changes, such as the freezes returning.

Regards,
Michael

Revision history for this message
In , hexawolfie (hexawolfie-linux-kernel-bugs) wrote :

(In reply to JerryD from comment #426)
> I am on Ryzen 2500U Laptop, HP. I am using kernel 4.18.9-200.fc28.x86_64.
>
> The zenstates.py script fails when I try to disable C6. Oh well.
>
> I have kernel parameters: idle=nomwait processor.max_cstate=5
>
> I still get lockup. No BIOS settings available on this machine.
>
> Feel pretty hopeless at the moment.

I'm on a Ryzen 2500U laptop as well, an HP 15-db0229ur with the latest BIOS (F.11 Rev.1 in my case) and the latest kernel (something that used to be called 4.19-rc7 on kernel.org). There is also no way to enter the BIOS settings on this machine. I tried setting both kernel parameters AND *successfully* disabling C6 with zenstates.py. I have already had massive data loss at least twice. I really, really-really hope this issue won't turn into one of those 10-year-old issues.

AMD did publish an errata report, indicating that this may be fixed at the software level (page 63).
https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf

Revision history for this message
In , simonmcquire (simonmcquire-linux-kernel-bugs) wrote :

Want to share this in case it helps anyone else. I stumbled across this issue after putting together a new Ryzen 2200G HTPC with AsRock AB350 M-ITX mainboard a few weeks ago. Would encounter random freezes every couple of hours.

After updating UEFI to latest version, updating kernel to latest 4.19-rc8, and Mesa to 18.2.2, I was still having stability problems (freezing with error messages in dmesg output). Made the changes in BIOS to typical current IDLE etc. and still was having issues.

But then realised I had not updated the amdgpu firmware, so grabbed the latest version from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git and everything seems to have been stable since then *fingers crossed*.

Revision history for this message
In , hexawolfie (hexawolfie-linux-kernel-bugs) wrote :

(In reply to simonmcquire from comment #432)
> But then realised I had not updated the amdgpu firmware...

Does that mean that the issue was already resolved and it's just distros being outdated? I haven't tried this solution yet, but you're not the only one saying that the problem can be solved by grabbing the latest AMDGPU drivers.

Revision history for this message
In , simonmcquire (simonmcquire-linux-kernel-bugs) wrote :

(In reply to Vladyslav Yamkovyi from comment #433)
> (In reply to simonmcquire from comment #432)
> > But then realised I had not updated the amdgpu firmware...
>
> Does that means that the issue was already resolved and it's just distros
> being outdated? I haven't tried this solution yet but you're not the only
> one who is saying that the problem can be solved by grabbing latest AMDGPU
> drivers.

No idea, but I would say if you have outdated firmware, it is definitely worth trying updating to latest version.

Revision history for this message
In , eduardoreyesdpm (eduardoreyesdpm-linux-kernel-bugs) wrote :

@simonmcquire

Can you please clarify what this amdgpu firmware you are using is and what for?
Are you flashing the video card?
Are you flashing the motherboard?

Thanks in Advance

Revision history for this message
In , simonmcquire (simonmcquire-linux-kernel-bugs) wrote :

(In reply to Eduardo Reyes from comment #435)
> @simonmcquire
>
> Can you please clarify what this amdgpu firmware you are using is and what
> for?
> Are you flashing the video card?
> Are you flashing the motherboard?
>
> Thanks in Advance

No flashing of video card or motherboard here. I'm not using a separate GPU - I'm using the GPU built into the Ryzen 2200G APU. This is just firmware loaded by the amdgpu kernel driver.

The firmware is loaded and used by the amdgpu driver as far as I understand. I have just replaced all of the files in /lib/firmware/amdgpu with the latest version.

Revision history for this message
In , eduardoreyesdpm (eduardoreyesdpm-linux-kernel-bugs) wrote :

(In reply to simonmcquire from comment #436)
> (In reply to Eduardo Reyes from comment #435)
> > @simonmcquire
> >
> > Can you please clarify what this amdgpu firmware you are using is and what
> > for?
> > Are you flashing the video card?
> > Are you flashing the motherboard?
> >
> > Thanks in Advance
>
> No flashing of video card or motherboard here. I'm not using a separate GPU
> - I'm using the GPU built into the Ryzen 2200G APU. This is just firmware
> loaded by the amdgpu kernel driver.
>
> The firmware is loaded and used by the amdgpu driver as far as I understand.
> I have just replaced all of the files in /lib/firmware/amdgpu with the
> latest version.

I'm using the RX 580 with Mesa stable and Kubuntu 18.04.1. I don't have this issue anymore after a motherboard firmware update on another ASRock board, but I'm just curious: would this firmware help with performance or bug fixes on all AMD GPUs?

Revision history for this message
In , simonmcquire (simonmcquire-linux-kernel-bugs) wrote :

(In reply to Eduardo Reyes from comment #437)
> (In reply to simonmcquire from comment #436)
> > (In reply to Eduardo Reyes from comment #435)
> > > @simonmcquire
> > >
> > > Can you please clarify what this amdgpu firmware you are using is and what
> > > for?
> > > Are you flashing the video card?
> > > Are you flashing the motherboard?
> > >
> > > Thanks in Advance
> >
> > No flashing of video card or motherboard here. I'm not using a separate GPU
> > - I'm using the GPU built into the Ryzen 2200G APU. This is just firmware
> > loaded by the amdgpu kernel driver.
> >
> > The firmware is loaded and used by the amdgpu driver as far as I understand.
> > I have just replaced all of the files in /lib/firmware/amdgpu with the
> > latest version.
>
> I'm using the rx580 with mesa stable and kubuntu 18.04.1.... I dont have
> this issue anymore after motherboard firmware update on another asrock board
> but just curious would this firmware help with performance or bug fixes on
> all AMD gpu?

I don't know, but maybe if it's stable now you should stick with your current configuration.

Since the 2200G is an APU with built-in graphics, I think I may have experienced both amdgpu-related and CPU-related issues (this bug seems to be more related to the latter), so I needed a combination (new UEFI and amdgpu firmware) to get it stable.

Revision history for this message
In , owenswerk (owenswerk-linux-kernel-bugs) wrote :

I'm one of the users James mentioned early on; I don't know how to capture kernel output when I'm greeted with a completely-locked-up or already-rebooted machine. If it's hardware, I don't see what the kernel could do anyway.

I've tried all the tricks mentioned here, when applicable. (My BIOS doesn't expose anything like "typical current idle", etc.)

The only way I've been able to keep this system stable is to keep it busy. I run BOINC and donate some of this shark-style swim-or-die CPU to medical research. There are some nice settings for how the cores get used, and according to htop, tasks get shuffled around to all cores, if that matters. I went with using at most 50% of 25% of the cores (figures I pulled out of ...thin air). That keeps my load average around 2 when I'm doing nothing, but whatever - it's for a good cause.

My CPU is an AMD Ryzen 7 1800X with 8 or 16 cores depending on how one counts, and keeping them from getting bored has, so far, let me get back to rebooting on my terms (upgrades) rather than at random.

This comes up as a Linux problem, but it sure smells like a hardware defect to me. In my mind, all the self-monitoring and user-spying and background-indexing and auto-downloading that Windows does happens to mask this problem for most people. (Plus who would think twice about a randomly-hanging or -rebooting Windows box?) Has anyone run *BSD or FreeDOS or something else which would allow a Ryzen to get bored for hours/days?

Revision history for this message
In , hi-angel (hi-angel-linux-kernel-bugs) wrote :

> Has anyone run *BSD or FreeDOS or something else which would allow a Ryzen to
> get bored for hours/days?

Yes, FreeBSD is affected too https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-BSD-Lock-Ups-2018

Revision history for this message
In , hexawolfie (hexawolfie-linux-kernel-bugs) wrote :

(In reply to Owen Swerkstrom from comment #439)
> This comes up as a Linux problem, but it sure smells like a hardware defect
> to me.
Any system that does not provide a software workaround must be affected according to the published errata, no question about it. I linked one of their revision guides in my previous comments. They have no plans to fix this and suggest using software workarounds. We're left on our own.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

(In reply to Vladyslav Yamkovyi from comment #441)
> (In reply to Owen Swerkstrom from comment #439)
> > This comes up as a Linux problem, but it sure smells like a hardware defect
> > to me.
> Any system that does not provides a software workaround must be affected
> according to published errata, not even a question. I've published one of
> their revision guides in my previous comments. They have no plans on fixing
> this and suggest using software workarounds. We're left on our own.

What is the "this" that AMD have no plans to fix ?

I had a look at <https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf> which you referenced earlier. I found 3 MWAIT issues:

  1057 MWAIT or MWAITX Instructions May Fail to Correctly Exit From
       the Monitor Event Pending State

  1059 In Real Mode or Virtual-8086 Mode MWAIT or MWAITX Instructions May
       Fail to Correctly Exit From the Monitor Event Pending State

  1109 MWAIT Instruction May Hang a Thread

but I could not find anything else that might be related to the "freeze when idle" problem.

FWIW, here's the full text for the Erratum 1109:

  1109 MWAIT Instruction May Hang a Thread

       Description: Under a highly specific and detailed set of internal timing
                    conditions, the MWAIT instruction may cause a thread to
                    hang in SMT (Simultaneous Multithreading) Mode.

       Potential Effect on System: The system may hang or reset.

       Suggested Workaround: System software may contain the workaround for
                             this erratum.

       Fix Planned: No fix planned

so there ! I guess "idle=nomwait" is "the workaround" ?
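
To check whether a given machine even falls under this revision guide, the CPU family and model can be read from /proc/cpuinfo. A rough sketch: the range (family 17h, models 00h-0Fh) is taken from the guide's file name only, the helper names are invented, and whether a particular erratum applies to a particular part still has to be looked up in the guide itself:

```python
def cpu_identity(cpuinfo_text):
    """Return (vendor, family, model) of the first CPU in /proc/cpuinfo text."""
    fields = {}
    for line in cpuinfo_text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor entry
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields.get("vendor_id"), int(fields["cpu family"]), int(fields["model"])


def covered_by_guide(cpuinfo_text):
    """True for AMD family 17h, models 00h-0Fh (range from the guide's file name)."""
    vendor, family, model = cpu_identity(cpuinfo_text)
    return vendor == "AuthenticAMD" and family == 0x17 and 0x00 <= model <= 0x0F


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the CPU description of the running system."""
    with open(path) as handle:
        return handle.read()
```

On Linux, /proc/cpuinfo prints the family and model in decimal, so family 17h shows up as "cpu family : 23"; covered_by_guide(read_cpuinfo()) does the comparison.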

"Typical Current Idle" appears to work for some (including me) but not for everyone. If one or more of these MWAIT errata is the root cause of the "freeze when idle" problem, I wonder why AMD introduced "Typical Current Idle" and how that relates to these MWAIT issues ??

Revision history for this message
In , owenswerk (owenswerk-linux-kernel-bugs) wrote :

...
> so there ! I guess "idle=nomwait" is "the workaround" ?
>
> "Typical Current Idle" appears to work for some (including me) but not for
> everyone. If one or more of these MWAIT errata is the root cause of the
> "freeze when idle" problem, I wonder why AMD introduced "Typical Current
> Idle" and how that relates to these MWAIT issues ??

Good questions, and beyond my ken. "idle=nomwait" did not prevent my system from rebooting while I was away. (I don't know if reboots and freezes are separate problems?) My BIOS doesn't expose anything like "typical current idle", and I have no way to upgrade it since the upgrade tools require Windows.

Is there a guide out there for dummies like me to be able to collect kernel output from these reboots? (Does the kernel even get a chance to do/report anything? Would a report even be helpful to kernel hackers in figuring out more workarounds?) If I can contribute anything plausibly-constructive, I'd love to.

Otherwise, I use and need this machine, so I'll probably just keep treating the symptoms by keeping it busy. (Which I suppose is another software workaround. ;^)

Revision history for this message
In , zankuro (zankuro-linux-kernel-bugs) wrote :

I have an Asus ROG Strix GL702ZC notebook with Ryzen 7 1700 + Radeon RX580 on Ubuntu 18.04. Kernel parameter "idle=nomwait" made no difference for me. Idle freezes persist. The only thing that solved my problem was to disable the CPU C6 state through zenstates.py.

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to Owen Swerkstrom from comment #443)
> I have no way to upgrade it since the upgrade tools require Windows.

Win10 is free and runs freely forever. Download the iso file from ms site. Mount the iso file virtually and copy files to a gparted fat32 formatted and bootable usb memory stick. Make space for win10 about 30GB with gparted in your main drive. Boot from the usb memory and install win10, skip registration key dialog etc.

Revision history for this message
In , owenswerk (owenswerk-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #445)
> (In reply to Owen Swerkstrom from comment #443)
> I have no way to upgrade it since the upgrade tools require Windows.
>
> Win10 is free and runs freely forever. Download the iso file from ms site.
> Mount the iso file virtually and copy files to a gparted fat32 formatted and
> bootable usb memory stick. Make space for win10 about 30GB with gparted in
> your main drive. Boot from the usb memory and install win10, skip
> registration key dialog etc.

Thanks, I didn't know that. But, since figuring out zenstates.py, I appear to be set. Fingers crossed! (Anyway, I'd rather keep the cpu busy than install windows. ;^)

<soapbox>
BIOS updates should have their own boot images; how on Earth they can get away with requiring a specific OS is all kinds of backwards.
</soapbox>

Revision history for this message
In , fin4478 (fin4478-linux-kernel-bugs) wrote :

(In reply to Owen Swerkstrom from comment #446)
> <soapbox>
> BIOS updates should have their own boot images; how on Earth they can get
> away with requiring a specific OS is all kinds of backwards.
> </soapbox>

Asus motherboards are the best, for Linux users especially. You can update the bios via the Ethernet connection or via a small fat32 partition with the Ez flash tool in the bios.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to fin4478 from comment #447)
> (In reply to Owen Swerkstrom from comment #446)
> <soapbox>
> > BIOS updates should have their own boot images; how on Earth they can get
> > away with requiring a specific OS is all kinds of backwards.
> > </soapbox>
>
> Asus motherboards are the best, for Linux users especially. You can update
> the bios via the Ethernet connection or via a small fat32 partition with the
> Ez flash tool in the bios.

My Asrock has all those features as well I believe.

Revision history for this message
In , spleefer90 (spleefer90-linux-kernel-bugs) wrote :

*** Bug 201719 has been marked as a duplicate of this bug. ***

Revision history for this message
In , spleefer90 (spleefer90-linux-kernel-bugs) wrote :

From the dupe bug report -

Gigabyte GA-AX370-Gaming K5 (latest UEFI - 2018/08/08)
AMD Ryzen 5 2400G

Running Arch Linux with 2018-10-26 linux-firmware and I still get soft lockups and one other issue capable of freezing the system.

I'll try the 'Power Supply Idle Control' setting and see if that does anything (if my mobo even has that...)

Revision history for this message
In , hvtaifwkbgefbaei (hvtaifwkbgefbaei-linux-kernel-bugs) wrote :

With Ryzen 1600X + Radeon RX 550 + ASRock Taichi X370 I didn't have this bug until 4.18.20. 4.18.18 had 12 days uptime and 4.18.20 (and 4.19.6) maybe 6 hours. 4.18.19 now has 56 hours uptime.

X just freezes (keyboard+mouse dead) and I have to press reset button. Likewise, if I am in console, freeze happens the same way; cursor stops blinking and I don't get any messages.

I am booting with nosmt=force rcu_nocbs=0-5 mem_encrypt=off (and CONFIG_RCU_NOCB_CPU=y is set).

Now, I don't feel like doing git-bisect (commits v4.18.19..v4.18.20),...
Does someone have ideas as to what to try next? Anything suspicious in v4.18.20 commits?

Some differences in dmesg 4.18.19..4.18.20:

-smpboot: Allowing 16 CPUs, 10 hotplug CPUs
+smpboot: Allowing 16 CPUs, 4 hotplug CPUs
-.... node #0, CPUs: #1 #2 #3 #4 #5
+.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11
 smp: Brought up 1 node, 6 CPUs
 smpboot: Max logical packages: 3

-ACPI: (supports S0 S5)
+ACPI: (supports S0 S3 S5)

Revision history for this message
In , hvtaifwkbgefbaei (hvtaifwkbgefbaei-linux-kernel-bugs) wrote :

Well, 4.18.19 froze after four days.

4.19.7 froze in 3½ hours.

rcu_nocb_poll mem_encrypt=off nosmt=force
CONFIG_RCU_NOCB_CPU=y
CONFIG_PREEMPT_RCU=y

[12572.931476] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [(journald):18688]
[12572.931509] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [amdgpu_cs:0:6535]
[12600.931702] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [(journald):18688]
[12600.931736] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [amdgpu_cs:0:6535]
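
Lines like the ones above can be pulled out of a saved dmesg or kern.log to see which CPUs and tasks keep getting stuck. A small sketch, assuming the message format shown in this comment; the function names are hypothetical:

```python
import re
from collections import Counter

SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<task>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Return one dict per soft lockup message found in the given lines."""
    events = []
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            events.append({
                "time": float(match.group("time")),
                "cpu": int(match.group("cpu")),
                "seconds": int(match.group("secs")),
                "task": match.group("task"),  # may itself contain ':'
                "pid": int(match.group("pid")),
            })
    return events


def lockups_per_cpu(events):
    """Count how often each CPU number appears in the parsed events."""
    return Counter(event["cpu"] for event in events)
```

Running parse_soft_lockups over the lines of a log file and passing the result to lockups_per_cpu shows whether the same CPUs are hit every time. Note that this only helps when the kernel still manages to log something before the machine hangs.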

Revision history for this message
In , hexawolfie (hexawolfie-linux-kernel-bugs) wrote :

Let's face it: I really, really don't think measuring time before a freeze makes any sense. It just occurs under certain circumstances or even randomly due to some hardware bug - 4.19 will work 4 days for the first time, 3 hours for the second time and 23 hours for the third. That's for sure - it's not related to a specific version, it just persists across all kernel versions.

Revision history for this message
In , edigitalb (edigitalb-linux-kernel-bugs) wrote :

(In reply to Vladyslav Yamkovyi from comment #453)
> Let's face it: I really, really don't think measuring time before a freeze
> makes any sense. It just occurs under certain circumstances or even randomly
> due to some hardware bug - 4.19 will work 4 days for the first time, 3 hours
> for the second time and 23 hours for the third. That's for sure - it's not
> related to a specific version, it just persists across all kernel versions.

I confirm that, tested on kernel 3.x, 4.12, 4.15, 4.19

Revision history for this message
In , edigitalb (edigitalb-linux-kernel-bugs) wrote :

That's not an AMD-specific problem; I'm getting the same issue on an Intel i7 8700.

Revision history for this message
In , glueckspilz85 (glueckspilz85-linux-kernel-bugs) wrote :

You can test your Ryzen with this program (it is no longer being developed): https://github.com/suaefar/ryzen-test
My 2200G shows no segfault, but "TIME TO FAIL: 885 s".

Under Windows 10 one Ryzen 2200G is freezing often under load for 20-30 seconds.
Another Ryzen 2200G often freezes under Debian Sid and crashes KDE plasma all the time, while Gnome Wayland runs more stable.

C-states, C6, Cool'n'Quiet, Low current idle, Typical current idle, nothing changed the fact that the CPU freezes.

Now I don't have the two systems anymore. It was too crazy with the Ryzen 2200G for me and I have had enough of it.

Revision history for this message
In , pitrou (pitrou-linux-kernel-bugs) wrote :

@Weasel that's a different problem, that doesn't necessarily occur with the same CPUs. This specific bug entry is about lockups that (to quote the original message) "typically occur when the load is low".

For example, I've had several low-load lockups on my Ryzen 1700, but I haven't had a single high-load lockup. Please let's not conflate the two.

Revision history for this message
In , cevelnet (cevelnet-linux-kernel-bugs) wrote :

Hi,

I have an ASUS A320M-K with latest BIOS 4027 (AGESA 1.0.0.6), Ryzen 7 2700, Patriot Viper RGB 16GB RAM 3200CL16, Samsung 830 SSD 128GB, Corsair AX860 860W PSU (80+ Platinum) and ASUS ROG Strix RX480 8GB. Running Arch Linux w/ Kernel 4.19.12 and GNOME 3.30 on Wayland. BIOS configured to Defaults + XMP Profile activated + CSM disabled (-> UEFI-only mode).

I have experienced 2 hangs with this system, both within 1 hour uptime. However, I couldn't spot anything in the logs. There were no watchdog entries, unlike previously posted in this bug report.

The "system lockup" expresses itself by blanking my screen to dark and showing a blinking cursor in the top left corner. No other symbols / information. I can't interact in any way with my system - it ignores all inputs; can't switch console, doesn't react to single power button press, nothing. Only way to shutdown is holding power button for 4s.

I entered the BIOS after the lockups and noticed *VERY* slow performance within BIOS. Keypresses took very long to register (like 5+ seconds) until BIOS eventually hung. After another powercycle, all was good again. I wonder, if rebooting using the reset button could give us more clues, if the system continues to run "buggy" after the reset.

After setting "Typical Current Idle", I haven't had a single lockup since, even after 48h of uptime, mostly idling. The system runs perfectly well. This fix is fine for desktops, but laptops suffer from higher power draw, if they are able to set this at all.

Revision history for this message
In , Michael (michael-linux-kernel-bugs) wrote :

I think this bug is mostly about the official AMD Erratum 1109 "MWAIT Instruction May Hang a Thread".

This is described here ( https://developer.amd.com/wp-content/resources/55449_1.12.pdf ) as:

"Under a highly specific and detailed set of internal timing conditions, the MWAIT instruction may cause a
thread to hang in SMT (Simultaneous Multithreading) Mode."

Since this is a nofix for AMD, this must be fixed by the OS. The AMD document states:

"Suggested Workaround
System software may contain the workaround for this erratum."

So, unfortunately AMD will never solve that hardware error. I suggest they change the first two letters of their company name ;)

This in fact leaves every Linux system with a Ryzen CPU unstable, so this should be considered serious.

I didn't read a comment from any kernel developer in the history of this bug report. Did I overlook something?

Question: Will the Linux kernel ever implement a workaround (as suggested by AMD)?

In , brauliobo (brauliobo-linux-kernel-bugs) wrote :

(In reply to Michael from comment #459)
> I think this bug is mostly about the official AMD Errata 1109 "MWAIT
> Instruction May Hang a Thread".
>
> This is described here (
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf ) as:
>
> "Under a highly specific and detailed set of internal timing conditions, the
> MWAIT instruction may cause a
> thread to hang in SMT (Simultaneous Multithreading) Mode."
>
> Since this is a nofix for AMD, this must be fixed by the OS. The AMD
> document states:
>
> "Suggested Workaround
> System software may contain the workaround for this erratum."
>
> So, unfortunately AMD will never solve that hardware error. I suggest they
> change the first two letters of their company name ;)
>
> This in fact leaves every linux system with a Ryzen CPU instable, so this
> should be considered serious.
>
> I didn't read a comment from any kernel developer in the history of this bug
> report. Did I overlook something?
>
> Question: Will the Linux kernel ever implement a workaround (as suggested by
> AMD??

idle=nomwait kernel option fixed all hangs for me

In , Michael (michael-linux-kernel-bugs) wrote :

(In reply to Bráulio Bhavamitra from comment #460)
> (In reply to Michael from comment #459)
> > I think this bug is mostly about the official AMD Errata 1109 "MWAIT
> > (...)
> >
> > Question: Will the Linux kernel ever implement a workaround (as suggested
> by
> > AMD??
>
> idle=nomwait kernel option fixed all hangs for me

True, same for me.
But, is every future ZEN user doomed to go through a valley of pain and possibly find this workaround themselves or is it going to be default at any time?

In , bluescarni (bluescarni-linux-kernel-bugs) wrote :

My understanding is that idle=nomwait disables the use of the MWAIT instruction in the kernel implementation of the CPU idle state.

I also *think* that, under the hood, disabling C6 and the "Typical Current Idle" option have a similar (or perhaps the same?) effect of reducing the use of MWAIT. E.g., since I enabled the "Typical Current Idle" option in my BIOS, the following message shows up in the kernel log at every boot:

[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

However, because the MWAIT instruction is part of SSE3 [1], it might end up being executed by any software which was compiled with support for SSE3. So I believe that disabling MWAIT for idle states greatly reduces the risk of getting these MWAIT hangs (simply because the instruction will end up being executed much less often), but it will not completely eliminate them. Thus, it seems like the only bulletproof solution is either to upgrade to Ryzen+/Ryzen2, or to disable hyperthreading.

[1] https://en.wikipedia.org/wiki/SSE3
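
To confirm which idle= option a kernel was actually booted with, one can look at the kernel command line (the contents of /proc/cmdline, or the "Command line:" entry in the boot log). A minimal Python sketch; the helper name idle_option is made up for illustration, and returning the last idle= token when several are present is an assumption of this sketch, not a statement about kernel behaviour:

```python
from typing import Optional


def idle_option(cmdline: str) -> Optional[str]:
    """Return the value of the last idle= token on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("idle="):
            value = token[len("idle="):]
    return value


# Hypothetical command line in the usual shape; the root= value is a placeholder.
sample = "BOOT_IMAGE=/vmlinuz-linux root=/dev/sda1 rw quiet idle=nomwait"
print(idle_option(sample))
```

On a live system the input would simply be the text read from /proc/cmdline.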

In , to_load (toload-linux-kernel-bugs) wrote :

I have random freezes with a Ryzen 2200U laptop. The display image just freezes and the system stops responding. The only way out of this state is holding the power button (even the magic SysRq key does not work). No clues in the logs.

With Ubuntu 18.04 it happened nearly once per week (not intense use). After installing 18.10 things became more stable: I got the first lockup after 6 weeks and another after 3. Hangs are completely random and occur at different CPU loads; once I got one while gaming (and the ambient noise continued playing...)

I've tried idle=nomwait but this did not help on 18.04 - and line "monitor-mwait will be used for idle loop" was still present in dmesg with this parameter!
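
Before drawing conclusions from a boot parameter, it may help to check saved dmesg output for the message quoted above. A small Python sketch that does only that; the function name is hypothetical and the message text is copied from this comment, so other kernel versions may word it differently:

```python
MWAIT_IDLE_MESSAGE = "monitor-mwait will be used for idle loop"


def mwait_idle_reported(dmesg_text: str) -> bool:
    """True if any line of the given dmesg output contains the mwait idle message."""
    return any(MWAIT_IDLE_MESSAGE in line for line in dmesg_text.splitlines())


# Usage idea: feed it the text saved from `dmesg` after booting with idle=nomwait;
# a True result would suggest the parameter did not take effect.
```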

The idle=halt parameter completely disables the mwait instruction in the kernel and also forces use of only the C1 state. On my laptop it does not cause higher battery drain or any suspend-resume issues. But I have not tested it thoroughly and have only used it for a few days... So this may not even be the solution for my problem.

In , brauliobo (brauliobo-linux-kernel-bugs) wrote :

(In reply to Another User from comment #463)
> I have random freezes with Ryzen 2200u laptop. Display image just freezes
> and system stops responding. Only way to get out from this state is to
> holding power button (even magic SysRq key does not work). No any clues in
> logs.
>
> With Ubuntu 18.04 it was happen nearly once per week (not intense use).
> After installing 18.10 things become more stable - I've got first lockup
> after 6 weeks and another after 3. Hangs are completely random and on
> different cpu load - once I've got this while gaming (and ambient noise
> continue playing...)
>
> I've tried idle=nomwait but this did not help on 18.04 - and line
> "monitor-mwait will be used for idle loop" was still present in dmesg with
> this parameter!
>
> idle=halt parameter completely disables mwait instruction in kernel and also
> forces use of only C1 state. In my laptop it does not cause higher battery
> drain or any suspend-resume issues. But I've not tested it completely an
> used it for few days... So this may be even not the solution for my problem.

Your freeze seems GPU-related; try a newer kernel/firmware.

In , to_load (toload-linux-kernel-bugs) wrote :

(In reply to Bráulio Bhavamitra from comment #464)
> You freeze seems GPU related, try a new kernel/firmware

I have also had an amdgpu crash with the same screen freeze, but the magic key worked that time and an error was left in the kernel log. I was also able to reproduce that with a certain game in the Dolphin emulator.

My random freezes happened without any special program running, sometimes on a clean gnome-shell desktop. But this never happens just after power-on, only after some working time.

In , ricki (ricki-linux-kernel-bugs) wrote :

Have these lockups with Fedora 29 default kernel and AMD Ryzen 7 1700X Eight-Core Processor.

Kernel parameter "idle=nomwait" does not help, but the disable C6 script does...

In , kernel (kernel-linux-kernel-bugs) wrote :

So it would appear that for me, too, the issue is not completely resolved by setting "Typical Current Idle". One of my two (identical) servers reported this morning that a bunch of processes had been blocked for more than 120s:

(I cut out a lot of other info like call traces, also these are just 4 examples but there's way more):
> [1094614.681444] INFO: task md1_raid1:301 blocked for more than 120 seconds.
> [1094614.681902] INFO: task fail2ban-server:1095 blocked for more than 120
> seconds.
> [1094614.682226] INFO: task mysqld:1319 blocked for more than 120 seconds.
> [1094614.682447] INFO: task ntpd:1653 blocked for more than 120 seconds.
All the output occurred in the same 3 ms.
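
For longer logs, the blocked tasks and the time window they fall into can be pulled out mechanically. A rough Python sketch, assuming each message sits on a single line in the format shown above (a wrapped line, like the fail2ban-server one, would have to be rejoined first); the names HUNG_TASK_RE and hung_tasks are made up here:

```python
import re

# Matches e.g. "[1094614.681444] INFO: task md1_raid1:301 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def hung_tasks(log_text):
    """Return (timestamp, task name, pid) for every hung-task message found."""
    found = []
    for line in log_text.splitlines():
        m = HUNG_TASK_RE.search(line)
        if m:
            found.append((float(m["ts"]), m["comm"], int(m["pid"])))
    return found


sample = """\
[1094614.681444] INFO: task md1_raid1:301 blocked for more than 120 seconds.
[1094614.682226] INFO: task mysqld:1319 blocked for more than 120 seconds.
[1094614.682447] INFO: task ntpd:1653 blocked for more than 120 seconds.
"""
tasks = hung_tasks(sample)
# Width of the window (in seconds) in which the messages were logged.
window = max(t[0] for t in tasks) - min(t[0] for t in tasks)
```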

Interestingly, the server had not crashed because of this. I did reboot it though, just in case. Meanwhile, still no word from AMD..

In , dl9px (dl9px-linux-kernel-bugs) wrote :

Just asked the helpdesk what we can do to get help and more attention. Disabling c6 still is a workaround for me, but that's not how I intended to run my system.

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to dl9px from comment #468)
> Just asked the helpdesk what we can do to get help and more attention.
> Disabling c6 still is a workaround for me, but that's not how I intended to
> run my system.

Thanks.

I'm in the same boat with C6. I know I'm a small fish but my company won't be using amd hardware until this is worked out.

In , nucrap (nucrap-linux-kernel-bugs) wrote :

Hey guys, so finally I found this bug report, which I guess explains those soft lockups on my Ryzen 7 1700 based system. I had temporarily switched to Win10 since I hoped this bug would be resolved after some time, but now, 1.5 years later, it is still there on the latest kernel. At least I can finally use my system again knowing that I need to disable the C6 state.

Can we finally bring some attention to this bug; at least set it to confirmed?

In , russianneuromancer (russianneuromancer-linux-kernel-bugs) wrote :

Maybe we need to contact AMD support and ask them to implement the workaround they suggested (see the document in comment 459) in Linux?

In , bp (bp-linux-kernel-bugs) wrote :

Sorry but this bugzilla entry is humongous with 110 people on CC and all reporting some different aspects of what they're seeing and what they're trying and and...

In order to debug this properly, I'd like for someone to test the latest upstream kernel 4.20 and try to reproduce the issue there. Then to explain how exactly one has reproduced it. Once that happens we'll take it from there with further questions.

Thx.

In , aaron (aaron-linux-kernel-bugs) wrote :

I just experienced this issue with a Threadripper 2950X on kernel 4.20.3

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Aaron Muir Hamilton from comment #474)
> I just experienced this issue with a Threadripper 2950X on kernel 4.20.3

Anything in dmesg?

Please upload full dmesg, /proc/cpuinfo and kernel .config.

Anything particular you did to reproduce it? Any correlation between what the box does and occurrence of the lockup?

Thx.

In , aaron (aaron-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #475)
> (In reply to Aaron Muir Hamilton from comment #474)
> > I just experienced this issue with a Threadripper 2950X on kernel 4.20.3
>
> Anything in dmesg?
>
> Please upload full dmesg, /proc/cpuinfo and kernel .config.
There is no header soldered on the board's UART, I do not have a normal serial port, and the logs would not write to disk (so I can't copy the dmesg from the incident). It amounted to:

BUG: soft lockup - CPU#13 stuck for 22s!

The config is here https://git.archlinux.org/svntogit/packages.git/tree/trunk/config?h=packages/linux

>
> Anything particular you did to reproduce it? Any correlation between what
> the box does and occurrence of the lockup?
Sadly no. The core happened to be running pulseaudio and upower, so my audio buffer started looping and eventually the watchdog timer timed out. I don't think any of that is particularly special though.

>
> Thx.

No, thank you! I don't know what I'm doing, and maybe you do. :- )

For the time being, I've now added idle=nomwait to my kernel params, as it seems at least one of these issues is related to MWAIT (according to AMD's errata).

In , aaron (aaron-linux-kernel-bugs) wrote :

Created attachment 280601
My /proc/cpuinfo

This is my ThreadRipper 2950X /proc/cpuinfo, as requested.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Aaron Muir Hamilton from comment #476)
> For the time being, I've now added idle=nomwait to my kernel params, as it
> seems at least one of these issues is related to MWAIT (according to AMD's
> errata).

Ok, that's good. Please give your box hell to check whether that really
is fixing the issue.

Thx.

Maskim Bakulin (ruspartisan) wrote :

We decided to get 3 Ryzen 2700x machines for testing at work, and we are bitterly disappointed that they randomly freeze in Linux.

The only message I get in journalctl is "rcu_sched detected stalls blablabla", and sometimes I don't get even that. I test the system by compiling QEMU with -j16 in a loop (so it has some time of load and some time of idle). Freezes usually occur after several hours. Sometimes you can use pc for a while (like, open a terminal, or browser, move a mouse), but then it freezes completely.
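
The load/idle test described above could be scripted roughly as follows. This is only a sketch of the idea, not the exact procedure used here: the function name, the optional clean step and the idle pause are all assumptions, and the runner and sleep function are injectable so the loop can be tried without starting a real build.

```python
import subprocess
import time


def load_idle_loop(build_cmd, rounds, idle_seconds, clean_cmd=None,
                   run=subprocess.run, sleep=time.sleep):
    """Alternate a build (load) with a pause (idle); return each build's exit code."""
    exit_codes = []
    for _ in range(rounds):
        if clean_cmd is not None:
            run(clean_cmd)  # force a full rebuild so every round produces load
        exit_codes.append(run(build_cmd).returncode)
        sleep(idle_seconds)  # idle phase between builds
    return exit_codes


# Hypothetical usage inside an already configured QEMU build tree:
# load_idle_loop(["make", "-j16"], rounds=100, idle_seconds=300,
#                clean_cmd=["make", "clean"])
```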

Configuration:
2700x
4*16gb Ram. It is rated at 3000MHz, but I set them to 2133, just to be sure
ASRock B450 Pro4 at one pc and ASUS X470 Pro in two others.
Samsung 860evo 250Gb
Aerocool KCAS 650 80+ Gold PSU
GT1030

I've tried using one machine on B450 Pro4 in Windows, and it's already at 48 hours of uptime, while Linux will freeze after 1-5 hours.

dmesg, cpuinfo and .config are in an attachment.

Things I've tried:
- Typical Current Idle
- idle=nomwait kernel parameter
- recompile kernel and add rcu_nocbs=0-15, kernel versions 4.15.18 and 4.20
- zenstates --disable-c6
- processor.max_cstate=3 (not even 5, just to be sure)
- set SoC at 1.1v instead of default 0.8-0.9
- another PSU (800w 80+ platinum)
- update bios to latest version (December 2018)

In , aaron (aaron-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #478)
> (In reply to Aaron Muir Hamilton from comment #476)
> > For the time being, I've now added idle=nomwait to my kernel params, as it
> > seems at least one of these issues is related to MWAIT (according to AMD's
> > errata).
>
> Ok, that's good. Please give your box hell to check whether that really
> is fixing the issue.
>
> Thx.

So my box was mostly idle for the last day or so, and had locked up while the monitor was asleep, so I had to reset it.

In , to_load (toload-linux-kernel-bugs) wrote :

(In reply to Aaron Muir Hamilton from comment #480)
> So my box was mostly idle for the last day or so, and had locked up while
> the monitor was asleep, so I had to reset it.

Can you test your system with idle=halt parameter instead of idle=nomwait?

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Aaron Muir Hamilton from comment #480)
> So my box was mostly idle for the last day or so, and had locked up while
> the monitor was asleep, so I had to reset it.

Let me make sure I understand it correctly: you had "idle=nomwait" on
the kernel command line and it became unresponsive?

If so, it would be really helpful if you could connect that box over
serial to another one and collect logs in realtime with the hope of
maybe catching something relevant.

The other thing I could suggest is to disable SMT: I know of another
threadripper machine where disabling SMT helped with the lockups. (Not
saying this is the final solution but it would be a good data point to
know).

Thx.

In , aaron (aaron-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #482)
> (In reply to Aaron Muir Hamilton from comment #480)
> > So my box was mostly idle for the last day or so, and had locked up while
> > the monitor was asleep, so I had to reset it.
>
> Let me make sure I understand it correctly: you had "idle=nomwait" on
> the kernel command line and it became unresponsive?

Yes

> If so, it would be really helpful if you could connect that box over
> serial to another one and collect logs in realtime with the hope of
> maybe catching something relevant.
Yeah, I'm looking at that, I don't want to fry the (~500 CAD) board with the wrong serial voltage though (RS-232 v. other serial), so I'm being a bit careful.

> The other thing I could suggest is to disable SMT: I know of another
> threadripper machine where disabling SMT helped with the lockups. (Not
> saying this is the final solution but it would be a good data point to
> know).

That is maybe an option, for now I disabled C6 from my board firmware (in ASRock's case, by selecting "Typical current idle" [sometimes called "Common current idle" I think, on some Ryzen boards] in Advanced > AMD CBS > Zen Common Options > Power Supply Idle Control); and I'll see if that does the trick.

I've heard that this resolved some soft lockup issues on Ryzen.

>
> Thx.

Thank you.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Aaron Muir Hamilton from comment #483)
> That is maybe an option, for now I disabled C6 from my board firmware (in
> ASRock's case, by selecting "Typical current idle" [sometimes called "Common
> current idle" I think, on some Ryzen boards] in Advanced > AMD CBS > Zen
> Common Options > Power Supply Idle Control); and I'll see if that does the
> trick.

Yes, that's also worth a try.

Thx.

In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

Hi, we just set up a brand new Ryzen 5 2600 on an ASRock AB350 Pro4 with ECC RAM and Fedora 29, and the box froze after only 3 hours of uptime, right after we went home for the night, so it was fairly idle (< 0.2% CPU 99% of the time).

Luckily I had read this thread before buying, so I wasn't too shocked.

I wanted to say "me too" because some were wondering if the 2nd gen Ryzens were affected: they are. Also, we were able to get a stack trace / panic output that was on the frozen screen in a phone capture jpg. If anyone wants that, I can attach it. It looks similar to other ones I've seen for this bug elsewhere. No NMI errors logged on our box though.

We did the "idle: typical" bios tweak and the idle=nomwait tweak and the system has been 100% stable for the 2 days since.

Not sure if it's relevant, but the box also has a NVMe M.2 SSD, a rust drive, a high-quality Japanese-cap 620W PS (but states min 0.6A on +12V, but the drives and mobo are drawing at least that), and a very cheap PCIe video card for server use.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Trevor Cordes from comment #485)
> Also, we were able to get a stack trace / panic output that was on
> the frozen screen in a phone capture jpg. If anyone wants that, I can
> attach it.

Please do.

> We did the "idle: typical" bios tweak

What is that?

> and the idle=nomwait tweak and the system has been 100% stable for the
> 2 days since.

Ok, it would be good to test those settings separately:

1. once with the "idle: typical" BIOS thing
2. idle=nomwait

and say which fixes the issue for ya. Or both?

Thx.

In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

Created attachment 280661
picture of panic on screen before reboot

In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

Sorry the pic is so horrible in attachment 280661, but at least you can see most of the stack traces. I have a wider pic of it I can use to transcribe the missing right-hand-side bits if needed.

By "idle: typical" bios tweak I meant the <<selecting "Typical current idle">> tweak everyone else has done.

As for which fix (1 or 2) really fixed it, I'll get back to you in about a month because this box is now in production and we can't afford downtime until our next scheduled period. I have a feeling just the bios tweak is required (from all I've read in dozens of forums), but I did both for good measure.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Trevor Cordes from comment #488)
> Sorry the pic is so horrible in attachment 280661 [details], but at least
> you can see most of the stack traces. I have a wider pic of it I can use to
> transcribe the missing right-hand-side bits if needed.

Thanks, that's good enough.

Btw, is there any particular reason why you're running a 32-bit kernel?
If not, I'd consider switching to a 64-bit kernel which is a lot more
and widely tested.

> By "idle: typical" bios tweak I meant the <<selecting "Typical current
> idle">> tweak everyone else has done.

I need to figure out what's behind that setting. What kind of BIOS do
you have?

dmesg | grep DMI:

should have it.
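
For anyone collecting this from several machines, the same information can be pulled out of saved dmesg output with a few lines of Python. A sketch, assuming the DMI line has the usual "DMI: <board>, BIOS <version> <date>" shape; the sample line is constructed from the "Hardware name:" entry posted elsewhere in this report, and the names DMI_RE and parse_dmi are made up:

```python
import re

DMI_RE = re.compile(
    r"DMI:\s*(?P<board>.+),\s*BIOS\s+(?P<version>\S+)\s+(?P<date>\d{2}/\d{2}/\d{4})"
)


def parse_dmi(dmesg_text):
    """Return board, BIOS version and BIOS date from the first DMI line, or None."""
    for line in dmesg_text.splitlines():
        m = DMI_RE.search(line)
        if m:
            return m.groupdict()
    return None


# Constructed example line (an assumption about the exact layout).
sample = ("[    0.000000] DMI: To Be Filled By O.E.M. To Be Filled By O.E.M./"
          "AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018")
info = parse_dmi(sample)
```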

> As for which fix (1 or 2) really fixed it, I'll get back to you in about a
> month because this box is now in production and we can't afford downtime
> until our next scheduled period. I have a feeling just the bios tweak is
> required (from all I've read in dozens of forums), but I did both for good
> measure.

Ok, whenever you can.

Thx.

In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

From my experience (a Ryzen 1700X on an MSI B350 motherboard), the freezes on idle, like the ones described by Trevor Cordes, went away once I changed the "Idle: typical" setting in the BIOS.

I had the bug before the BIOS setting appeared, and tried many things with more or less success. But the best solution by far is just to change the BIOS setting. I have many months of uptime without any lock up since then. *It should be the first thing to test if your computer freezes by itself on idle*.

On the other hand, I have a Threadripper that freezes when doing hard computations (which alternate between high and low CPU usage, because they need to write temporary results to disk). This cannot be solved using the "Idle: typical" BIOS option. This other freeze happens especially when using Matlab. I am starting to think that the problem is that Matlab (or the MKL BLAS library it uses) might make heavy use of SSE3 instructions, which include the MWAIT instruction, as noted by Francesco Buscani above. I should try the kernel option "idle=nomwait" to see if I can get rid of these freezes. Unfortunately I am travelling until the end of the month and away from the Threadrippers, so I cannot test it yet.

In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #486)
> (In reply to Trevor Cordes from comment #485)
> > Also, we were able to get a stack trace / panic output that was on
> > the frozen screen in a phone capture jpg. If anyone wants that, I can
> > attach it.
>
> Please do.
>
> > We did the "idle: typical" bios tweak
>
> What is that?

You may take a look here to get some more information about this problem and what AMD says: https://community.amd.com/thread/225795 - start reading at 7th post.

Short: AMD says, that this would be a problem of the PSU - but that seems not to be true, because it happens even w/ PSUs supporting 0A at 12V.

The "Typical current idle" switch seems to disable C6 State - Package - which is normally enabled (see post 14 at the above given URL). Used tool is zenstates.py https://github.com/r4m0n/ZenStates-Linux/blob/master/zenstates.py
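
To make the C6 part concrete: as far as I recall, zenstates.py toggles C6 through two MSRs, 0xC0010292 (bit 32, "C6 State - Package") and 0xC0010296 (bits 22, 14 and 6, "C6 State - Core"). The Python sketch below only decodes and masks values that have already been read; the register numbers and bit positions are quoted from memory and should be checked against the script itself, and the function names are made up.

```python
# Assumed layout (from memory of zenstates.py; verify before relying on it).
MSR_C6_PACKAGE = 0xC0010292
C6_PACKAGE_BIT = 1 << 32
MSR_C6_CORE = 0xC0010296
C6_CORE_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def c6_package_enabled(msr_value: int) -> bool:
    """True if the package C6 bit is set in a value read from MSR_C6_PACKAGE."""
    return bool(msr_value & C6_PACKAGE_BIT)


def c6_core_enabled(msr_value: int) -> bool:
    """True if all three core C6 bits are set in a value read from MSR_C6_CORE."""
    return (msr_value & C6_CORE_BITS) == C6_CORE_BITS


def with_c6_disabled(package_value: int, core_value: int):
    """Return both register values with the C6 bits cleared (nothing is written)."""
    return package_value & ~C6_PACKAGE_BIT, core_value & ~C6_CORE_BITS
```

Actually reading or writing the registers needs root and the msr kernel module, which is what the script takes care of.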

In , bakulinm (bakulinm-linux-kernel-bugs) wrote :

Created attachment 280669
dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied

some older info here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566

I have three machines with new 2700X CPUs, and all three of them experience freezes in Xubuntu 18.04 after some time of work. I compile QEMU with make -j16 in a loop to test for stability.

I'm not sure it is the same bug, because I observe different behaviour: one machine that was compiling QEMU froze during the night, while the one left idle worked for 1 day. Another thing that SEEMS to help is disabling SMT: the machine with SMT disabled is currently at 7 hours of uptime, which is much more than the usual 1-3 hours before a freeze. Windows 10 seems to work fine: I used WSL with Ubuntu 18.04 for the same test compilations for 3 days, with no freezes and the same BIOS settings.

Things I've tried that didn't help:
idle=nomwait
rcu_nocbs=0-15 with new kernel (4.20.3)
disabled cool n quiet and c6 states
Typical Current Idle in uefi
Set SoC voltage to 1.1v
Set DRAM voltage to 1.3v
Update to latest BIOS
High-end PSU

In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #489)
> Btw, is there any particular reason why you're running a 32-bit kernel?
> If not, I'd consider switching to a 64-bit kernel which is a lot more
> and widely tested.

Yes, we know. This box has been running forever, always upgraded to the latest Fedora, and we recently upgraded the hardware to the Ryzen from a non-64-capable P4. That's why it's 32b. We will upgrade it to 64-bit during next scheduled downtime in a month.

Maxim: I think your problem is different from this bug. From everything I've read everywhere, the mwait and/or idle:typical tweaks always solve this bug. You've done even the esoteric stuff like nocbs and voltage override and your problem remains. Might (must?) be something else.

In , aaron (aaron-linux-kernel-bugs) wrote :

(In reply to Maxim Bakulin from comment #492)
> Created attachment 280669 [details]
> dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
>
> some older info here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566
>
> I have three machines with new 2700x CPUs, and all three of them experience
> freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with
> make -j16 in a loop to test for stability.
>
> I'm not sure it is the same bug, because I observe different behaviour: one
> machine, that was compiling QEMU, froze during the night, and the one left
> idle worked for 1 day. Another thing that SEEM to help is disabling SMT:
> machine with disabled SMT is currently at 7 hours of uptime, which is much
> more, than usual 1-3 hours before freeze. Windows 10 seem to work fine: I
> used WSL with ubuntu 18.04 for the same test compilations for 3 days, and no
> freezes with the same bios settings.
>
> Things I've tried that didn't help:
> idle=nomwait
> rcu_nocbs=0-15 with new kernel (4.20.3)
> disabled cool n quiet and c6 states
> Typical Current Idle in uefi
> Set SoC voltage to 1.1v
> Set DRAM voltage to 1.3v
> Update to latest BIOS
> High-end PSU

With my Threadripper 2950X I can confirm that none of these were sufficient. I have an excellent PSU which I have confirmed is operating well better than spec, the voltage is definitely sufficient, the power states people are saying cause this are disabled, mwait is disabled, AGESA is current, etc.

In , aaron (aaron-linux-kernel-bugs) wrote :

(In reply to Another User from comment #481)
> (In reply to Aaron Muir Hamilton from comment #480)
> > So my box was mostly idle for the last day or so, and had locked up while
> > the monitor was asleep, so I had to reset it.
>
> Can you test your system with idle=halt parameter instead of idle=nomwait?

Now trying idle=halt, as you suggested. We'll see.

In , dolohow (dolohow-linux-kernel-bugs) wrote :

I did power consumption tests with the idle=nomwait option on my Huawei Matebook D 14 (Ryzen 2500U) using a power meter.

With idle=nomwait, idle power consumption is about 4W less!! I did a few reboots to be sure about it.

In , bp (bp-linux-kernel-bugs) wrote :

Folks, just a quick thing: please check whether you have the latest BIOS and if not, do upgrade it and check if it makes any difference.

Thx.

In , thangalin (thangalin-linux-kernel-bugs) wrote :

OS: Antergos Linux 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux.
CPU: AMD Ryzen 7 1700 Eight-Core Processor.
CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
MB: ASRock AB350 Gaming-ITX/ac; Version: P4.60; Release Date: 04/19/2018
RAM: F4-2400C15D-16GFXR, 32 GB DDR4.

Attempt #1:
a) Set BIOS / Advanced / AMD CBS / Zen Common Options / Power Supply Idle Control to "Typical Current Idle".

Result: Hangs

Attempt #2:
a) Set /etc/default/grub to: GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=nomwait resume=UUID=9d2d2002-406a-4eb5-bd21-fdb889831991"
b) Ran update-grub
c) Rebooted
d) BIOS remains at "Typical Current Idle"

Result: Hangs

Journal Log:

Jan 24 17:21:00 kernel: BUG: unable to handle kernel paging request at ffffffffffffffff
Jan 24 17:21:00 kernel: PGD 3a0a0e067 P4D 3a0a0e067 PUD 3a0a10067 PMD 0
Jan 24 17:21:00 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 24 17:21:00 kernel: CPU: 1 PID: 994 Comm: xfwm4 Not tainted 4.20.3-arch1-1-ARCH #1
Jan 24 17:21:00 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018
Jan 24 17:55:01 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 24 17:55:01 kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
Jan 24 17:55:27 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
Jan 24 17:59:39 kernel: ACPI BIOS Warning (bug): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20181003/tbfadt-624)
Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
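
When comparing journals from several hangs, it can help to count which kinds of messages show up. A small Python sketch that buckets lines by substrings seen in this report; the category names and the choice of patterns are my own and certainly not exhaustive:

```python
import re
from collections import Counter

PATTERNS = {
    "oops": re.compile(r"BUG: unable to handle kernel paging request"),
    "rcu_stall": re.compile(r"detected (?:expedited )?stalls on CPUs/tasks"),
    "hard_lockup": re.compile(r"hard LOCKUP"),
    "soft_lockup": re.compile(r"soft lockup"),
    "firmware_bug": re.compile(r"\[Firmware Bug\]"),
}


def classify(log_text):
    """Count how many lines of the log match each known pattern."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


sample = """\
Jan 24 17:21:00 kernel: BUG: unable to handle kernel paging request at ffffffffffffffff
Jan 24 17:55:01 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 24 17:55:01 kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
Jan 24 17:55:27 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
"""
counts = classify(sample)
```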

$ sudo cat /boot/grub/grub.cfg | grep -i idle
 linux /vmlinuz-linux root=UUID=16a8e892-ec00-4316-b62d-5b60da84a7fb rw quiet idle=nomwait resume=UUID=9d2d2002-406a-4eb5-bd21-fdb889831991

Have not tried:

1) idle=halt
2) Disable C6 state with zenstates.py
3) BIOS upgrade

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to T X from comment #498)
> Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported
> by HW (0x0)

So this looks like a broken BIOS.

> Jan 24 17:21:00 kernel: Hardware name: To Be Filled By O.E.M. To Be
> Filled By O.E.M./AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018

Looking at Asrock's website, there's newer BIOS for your board, AFAICT,
so can you update it and rerun the same test? Or what you were doing to
cause the splat.

Also, please upload full dmesg after you've updated the BIOS.

Thx.

In , bakulinm (bakulinm-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #497)
> Folks, just a quick thing: please check whether you have the latest BIOS and
> if not, do upgrade it and check if it makes any difference.
>
> Thx.

Already running latest BIOS (2018/12/19 for ASRock B450 Pro4 and 2018/12/17 for ASUS X470 Pro).

I found out that openSUSE Tumbleweed with stock 4.20.0 kernel doesn't appear to freeze. I tried copying suse's kernel and /lib/modules to Xubuntu 18.04LTS, but it didn't seem to help, xubuntu still freezes. Maybe, xubuntu's firmware packages are old? But I could've done something wrong with the transplant procedure :) I'm going to test Xubuntu 18.10 during the weekends, maybe it has newer packages and does not freeze.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Maxim Bakulin from comment #500)
> I found out that openSUSE Tumbleweed with stock 4.20.0 kernel doesn't appear
> to freeze. I tried copying suse's kernel and /lib/modules to Xubuntu
> 18.04LTS, but it didn't seem to help, xubuntu still freezes. Maybe,
> xubuntu's firmware packages are old? But I could've done something wrong
> with the transplant procedure :)

Yeah, I wouldn't do that.

What's stopping you from building a kernel on your machine and
installing it?

The web is full of tutorials like this one:

https://www.linux.com/learn/how-build-latest-linux-kernel-debian-linus-git-repo

for example, which is for debian-based systems.

Revision history for this message
In , thangalin (thangalin-linux-kernel-bugs) wrote :

Upgraded BIOS as follows:

1. https://www.asrock.com/mb/AMD/Fatal1ty%20AB350%20Gaming-ITXac/index.asp
2. Click Support > BIOS
3. Download Version 5.30 through Global site (ignore "AMD all in 1 VGA driver")
4. Formatted USB as FAT32
5. Copied and unzipped AB350 Gaming-ITXac(5.30)ROM.zip to USB
6. Flashed BIOS as per https://www.asrock.com/support/BIOSIG.asp?cat=BIOS8

$ dmidecode -t bios -q
  Version: P5.30
  Release Date: 12/18/2018

BIOS installed correctly.

$ journalctl -b

Jan 25 09:38:33 kernel: Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=16a8e892-ec00-4316-b62d-5b60da84a7fb rw quiet idle=nomwait resume=...
Jan 25 09:38:33 kernel: ACPI BIOS Warning (bug): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1>
Jan 25 09:38:33 kernel: mtrr: your CPUs had inconsistent variable MTRR settings
Jan 25 09:38:33 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
Jan 25 09:38:33 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Jan 25 09:38:33 kernel: sp5100-tco sp5100-tco: Watchdog hardware is disabled
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: ccp enabled
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: psp initialization failed
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: enabled
Jan 25 09:38:38 kernel: kauditd_printk_skb: 23 callbacks suppressed

Since the BIOS was flashed, I assume that "Typical Current Idle" was not set.

Splat was caused by idling for ~20 minutes. Will try again with "Typical Current Idle" and, if it hangs again, will provide the dmesg log.

Additional hardware:

* Chassis: https://www.hdplex.com/hdplex-h5-fanless-computer-case.html
* PSU: https://www.hdplex.com/hdplex-internal-400w-ac-dc-adapter-with-active-pfc-and-19vdc-output.html and https://www.hdplex.com/hdplex-400w-hi-fi-dc-atx-power-supply-16v-24v-wide-range-voltage-input.html

Revision history for this message
In , bakulinm (bakulinm-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #501)
> What's stopping you from building a kernel on your machine and
> installing it?

I tried and it didn't help. After freezes started happening at Xubuntu 18.04LTS with stock 4.15, I compiled latest 4.20.3 kernel and added CONFIG_RCU_NOCB_CPU=y, as it was suggested here, but freezes didn't stop. There's my dmesg at comment 492 with latest kernel.

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Maxim Bakulin from comment #503)
> I tried and it didn't help. After freezes started happening at Xubuntu
> 18.04LTS with stock 4.15, I compiled latest 4.20.3 kernel and added
> CONFIG_RCU_NOCB_CPU=y, as it was suggested here, but freezes didn't stop.
> There's my dmesg at comment 492 with latest kernel.

Lemme get this straight:

opensuse 4.20.0 kernel doesn't freeze the box but 4.20.3 stable kernel does?!

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to T X from comment #502)
> Jan 25 09:38:33 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> Jan 25 09:38:33 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported
> by HW (0x0)

Yah, even with the new BIOS that's still there. Doesn't look like it has been
fixed.

If I send you a debugging patch, would you be able to apply it, build a
kernel, boot into it and get me dmesg from the box?

Thx.

Revision history for this message
In , aaron (aaron-linux-kernel-bugs) wrote :

I'm running with idle=halt, and I have not experienced a freeze for a couple days now, FWIW.

Revision history for this message
In , thangalin (thangalin-linux-kernel-bugs) wrote :

Using:

* BIOS Version: P5.30, Release Date: 12/18/2018
* idle=nomwait
* Power Supply Idle Control set to "Typical Current Idle"

The ~20-minute idle lock-up issue appears fixed, despite:

Jan 25 10:21:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not support...

If it locks up overnight, I'll report back with the dmesg.

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

Can people confirm that idle=halt fixes the issue, like Aaron says in comment #506?
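
For anyone who wants to try this on a GRUB-based system, here is a minimal sketch of adding the option to the default command line. It assumes the usual double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line in /etc/default/grub with nothing after the closing quote. `add_idle_halt` is a hypothetical helper name, and it only writes to stdout so the result can be reviewed before anything is changed.

```shell
# add_idle_halt: read /etc/default/grub-style text on stdin and print it
# with idle=halt appended to GRUB_CMDLINE_LINUX_DEFAULT. The line is left
# alone if it already carries an idle= option (e.g. idle=nomwait).
# Hypothetical helper for illustration only.
add_idle_halt() {
    sed '/^GRUB_CMDLINE_LINUX_DEFAULT=/{
        /[" ]idle=/!s/"$/ idle=halt"/
    }'
}

# Demo on a sample line rather than the real file.
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' | add_idle_halt
```

After reviewing the output and updating /etc/default/grub by hand, the GRUB configuration still has to be regenerated (`update-grub` on Ubuntu, `grub-mkconfig -o /boot/grub/grub.cfg` on Arch-style setups) and the machine rebooted.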

Revision history for this message
In , thangalin (thangalin-linux-kernel-bugs) wrote :

> If I send you a debugging patch, would you be able to apply it, build a
> kernel, boot into it and get me dmesg from the box?

I haven't built a kernel in about a decade, but if you provide step-by-step instructions for how to do so, I'll attempt to build the patched kernel and then post the dmesg.

$ uname -a
Linux 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux

$ cat /etc/issue
Antergos Linux

Revision history for this message
In , hvtaifwkbgefbaei (hvtaifwkbgefbaei-linux-kernel-bugs) wrote :

(In reply to Maxim Bakulin from comment #492)
> Created attachment 280669 [details]
> dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
>
> some older info here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566
>
> I have three machines with new 2700x CPUs, and all three of them experience
> freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with
> make -j16 in a loop to test for stability.
>
> I'm not sure it is the same bug, because I observe different behaviour: one
> machine, that was compiling QEMU, froze during the night, and the one left
> idle worked for 1 day. Another thing that SEEM to help is disabling SMT:

The Processor errata lists two bugs (SMT-related) for Ryzen 1 and Ryzen 2 (1095 and 1109) with status "no fix planned". If you disable MWAIT but enable SMT, you are left with bug 1095: "Potential Violation of Read Ordering In Lock Operation In SMT (Simultaneous Multithreading) Mode". This can cause crashes. Not necessarily the cause of your crashes :-D

# lsmsr -r 0xc0011020
warning: unknown MSR c0011020
unknown = 0x0006800000000010

On my Ryzen 1600X bit 57 (no idea what it does) is 0. (But I have nosmt=force.) Linux kernel doesn't seem to touch that bit.

Also, if you get "ACPI MWAIT C-state 0x0 not supported by HW (0x0)", mwait is not used by kernel.
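
The bit-57 reading above can be double-checked with plain shell arithmetic. This is only a sketch: `msr_bit` is a hypothetical helper name (not part of msr-tools), and it assumes a shell with 64-bit arithmetic that accepts hex constants.

```shell
# msr_bit VALUE BIT: print 1 if bit BIT of VALUE is set, otherwise 0.
# VALUE is the hex string as printed by lsmsr. Hypothetical helper.
msr_bit() {
    echo $(( ($1 >> $2) & 1 ))
}

# Value reported above for MSR 0xc0011020 on the Ryzen 1600X.
msr_bit 0x0006800000000010 57   # prints 0
```

For that value the set bits are 4, 47, 49 and 50, so bit 57 is indeed clear.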

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to T X from comment #509)
> I haven't built a kernel in about a decade, but if you provide step-by-step
> instructions for how to do so, I'll attempt to build the patched kernel and
> then post the dmesg.

Ok, here we go:

1. Clone stable kernel from here if you haven't done so yet:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

2. checkout 4.20 branch

$ git checkout -b 4.20-stable origin/linux-4.20.y

3. apply the attached patch

$ patch -p1 -i acpi-dump-cstates.diff

4. copy your 4.20 config

$ cp /boot/config-4.20.... .config

5. do

$ make oldconfig

6. build

$ make -j16

7. install kernel, needs root

# make modules_install install

# reboot

Then, select this new kernel in grub, boot it and upload full dmesg.

Thx.

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

Created attachment 280825
acpi-dump-cstates.diff

Revision history for this message
In , thangalin (thangalin-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #511)
> Ok, here we go:
> 1. Clone stable kernel from here if you haven't done so yet:
> $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
> 2. checkout 4.20 branch
> $ git checkout -b 4.20-stable origin/linux-4.20.y
> 3. apply the attached patch
> $ patch -p1 -i acpi-dump-cstates.diff
> 4. copy your 4.20 config
> $ cp /boot/config-4.20.... .config

    $ find /boot | grep -i config
    /boot/grub/i386-pc/configfile.mod

I don't think that that's the right file.

    $ sudo updatedb
    $ locate -i config-4
    $ uname -a
    Linux 4.20.4-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 23 00:12:22 UTC 2019 x86_64 GNU/Linux

Where is the config file?

Revision history for this message
In , celti (celti-linux-kernel-bugs) wrote :

(In reply to T X from comment #513)
> $ sudo updatedb
> $ locate -i config-4
> $ uname -a
> Linux 4.20.4-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 23 00:12:22 UTC 2019
> x86_64 GNU/Linux
>
> Where is the config file?

Arch Linux doesn't package the kernel's config with the kernel. For the current 4.20-arch1 config, see: https://git.archlinux.org/svntogit/packages.git/plain/trunk/config?h=packages/linux — or just use /proc/config.gz.
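
A rough sketch of step 4 that covers both cases: use a packaged /boot/config-* file when there is one, otherwise fall back to /proc/config.gz. Note that /proc/config.gz only exists when the running kernel was built with CONFIG_IKCONFIG_PROC. `fetch_kernel_config` and its optional path arguments are hypothetical, added so the logic can be tried against test files.

```shell
# fetch_kernel_config DEST [BOOT_CFG] [PROC_CFG]
# Copy the running kernel's config to DEST. BOOT_CFG and PROC_CFG default
# to the usual locations; they are parameters only to make testing easy.
# Hypothetical helper for illustration.
fetch_kernel_config() {
    dest=$1
    boot_cfg=${2:-/boot/config-$(uname -r)}
    proc_cfg=${3:-/proc/config.gz}
    if [ -r "$boot_cfg" ]; then
        cp "$boot_cfg" "$dest"
    elif [ -r "$proc_cfg" ]; then
        # /proc/config.gz is gzip-compressed; decompress it into DEST.
        gzip -dc "$proc_cfg" > "$dest"
    else
        echo "no kernel config found" >&2
        return 1
    fi
}
```

From the top of the kernel tree this would be called as `fetch_kernel_config .config`, followed by `make oldconfig` as in step 5.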

Revision history for this message
In , thangalin (thangalin-linux-kernel-bugs) wrote :

Created attachment 280867
dmesg log after rebooting with acpi-dump-cstates patch

Revision history for this message
In , thangalin (thangalin-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #511)
> 7. install kernel, needs root
> # make modules_install install

    $ sudo make modules_install install
      ...
      INSTALL virt/lib/irqbypass.ko
      DEPMOD 4.20.5-dirty
    sh ./arch/x86/boot/install.sh 4.20.5-dirty arch/x86/boot/bzImage \
        System.map "/boot"
    Cannot find LILO.

That's a misleading error message as LILO isn't installed on the system, but the new kernel was copied:

    $ ll /boot
    -rw-r--r-- 1 root root 5851008 Jan 29 22:40 vmlinuz

GRUB seems to have loaded it fine.

See attached dmesg-acpi-dump-cstates.txt for details.

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen notebook more stable than anything else I've tried. I have not experienced freezes for some days now.

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Philip Rosvall from comment #518)
> I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen
> notebook more stable than anything else I've tried. I have not experienced
> freezes for some days now.

Can you upload dmesg from the working and non-working kernels pls?

Thx.

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

Created attachment 280915
dmesg w/o boot parameter idle=halt

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

Created attachment 280917
dmesg with boot parameter idle=halt

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #519)
> (In reply to Philip Rosvall from comment #518)
> > I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen
> > notebook more stable than anything else I've tried. I have not experienced
> > freezes for some days now.
>
> Can you upload dmesg from the working and non-working kernels pls?
>
> Thx.

I have only used the workaround since kernel 4.20.4, so I just booted the kernel now without and with the parameter. As can be seen in line 704 in the log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)". This error does not occur when "idle=halt" is set.
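
To compare two boots the way it is done above, the message can be searched for directly instead of by line number. A minimal sketch, where `has_mwait_fw_bug` is a hypothetical helper that reads a kernel log on stdin:

```shell
# has_mwait_fw_bug: succeed if the kernel log on stdin contains the
# "[Firmware Bug]: ACPI MWAIT C-state" message. Hypothetical helper.
has_mwait_fw_bug() {
    grep -qF '[Firmware Bug]: ACPI MWAIT C-state'
}

# Demo on the line quoted in this thread.
printf '%s\n' 'kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)' \
    | has_mwait_fw_bug && echo "MWAIT firmware bug reported"
```

On a live system the input would come from `dmesg` or `journalctl -k -b`, for example `dmesg | has_mwait_fw_bug`.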

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Philip Rosvall from comment #522)
> I have only used the workaround since kernel 4.20.4, so I just booted the
> kernel now without and with the parameter. As can be seen in line 704 in the
> log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT
> C-state 0x0 not supported by HW (0x0)". This error does not occur when
> "idle=halt" is set.

Yeah, this is starting to look more and more like a BIOS issue on
certain mobos.

Since your machine is issuing that ACPI MWAIT error, can you apply the
debugging patch from comment #512, build a kernel, boot into it
without "idle=halt" and upload dmesg again?

Thx.

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

Created attachment 280921
dmesg w/o boot parameter idle=halt, patched acpi-dump-cstates

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #523)
> (In reply to Philip Rosvall from comment #522)
> > I have only used the workaround since kernel 4.20.4, so I just booted the
> > kernel now without and with the parameter. As can be seen in line 704 in
> the
> > log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT
> > C-state 0x0 not supported by HW (0x0)". This error does not occur when
> > "idle=halt" is set.
>
> Yeah, this is starting to look more and more like a BIOS issue on
> certain mobos.
>
> Since your machine is issuing that ACPI MWAIT error, can you apply the
> debugging patch from comment #512, build a kernel, boot into it
> without "idle=halt" and upload dmesg again?
>
> Thx.

Here you go!

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

Created attachment 280961
Don't do mwait on B1 and earlier

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

Ok, here's a test patch on top of 4.20-stable.

It should practically make idle=halt the default on revisions before B2. You can check which revision you have by doing

$ grep stepping /proc/cpuinfo

The number must be < 2.

For the folks with B2 machines we need to keep debugging.

Thx.
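
The stepping check can also be scripted. This is a sketch under the assumption stated above, namely that a stepping below 2 means a pre-B2 revision; `cpu_stepping` and `before_b2` are hypothetical helper names.

```shell
# cpu_stepping [FILE]: print the stepping of the first CPU listed in FILE
# (defaults to /proc/cpuinfo; "-" reads stdin). Hypothetical helper.
cpu_stepping() {
    awk -F': *' '/^stepping/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# before_b2 STEPPING: succeed if STEPPING is below 2.
before_b2() {
    [ "$1" -lt 2 ]
}

# Demo on sample cpuinfo text rather than the live system.
stepping=$(printf 'processor\t: 0\nstepping\t: 1\n' | cpu_stepping -)
if before_b2 "$stepping"; then
    echo "stepping $stepping: revision before B2"
else
    echo "stepping $stepping: B2 or later"
fi
```

Calling `cpu_stepping` with no argument reads the real /proc/cpuinfo.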

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #526)
> Created attachment 280961 [details]
> Don't do mwait on B1 and earlier

What's the downside of generally disabling mwait? I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/ 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI firmware bug messages disappear. That's all.

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #528)
> What's the downside of generally disabling mwait?

So in your case, you can't do MWAIT to enter C1 anyway because your
revision doesn't support it. This is why you're seeing those firmware
messages.

> I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/
> 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI
> firmware bug messages disappear. That's all.

That's an "interesting" way to fix it but if it works ... :)

The intention of the fix is to make idle=halt the default for obvious reasons.

HTH.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #529)
> (In reply to Klaus Mueller from comment #528)
> > What's the downside of generally disabling mwait?
>
> So in your case, you can't do MWAIT to enter C1 anyway because your
> revision doesn't support it. This is why you're seeing those firmware
> messages.

Hmm, if rev. 1 doesn't support MWAIT - why can it be a problem anyway at the same time which must be fixed by disabling the usage of MWAIT? I seem to miss something?

>
> > I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/
> > 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI
> > firmware bug messages disappear. That's all.
>
> That's an "interesting" way to fix it but if it works ... :)
>
> The intention of the fix is to make idle=halt the default for obvious
> reasons.

Isn't this already done automatically if MWAIT isn't supported?

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Klaus Mueller from comment #530)
> Hmm, if rev. 1 doesn't support MWAIT - why can it be a problem anyway
> at the same time which must be fixed by disabling the usage of MWAIT?
> I seem to miss something?

That's a good question but, frankly, I don't have a very exact answer to
it right now.

In order to understand what's *really* going on in the
cstate detection code one would need to instrument
at least acpi_processor_get_power_info_cst() and
acpi_processor_ffh_cstate_probe() to figure out what exactly does the
kernel parse from those CST objects and what it uses to try to enter
idle.

And do all that instrumentation on an affected system.

My current suspicion is that it tries to enter idle with
misconfigured states and under certain conditions, it misses the wakeup,
leading to the stall.

This is all a conjecture anyway.

Now my patch simply falls back to the good old idle entry on AMD where
we simply do HLT and we won't even attempt to enter idle the ACPI way.

Makes sense?

Revision history for this message
In , bakulinm (bakulinm-linux-kernel-bugs) wrote :

Created attachment 281445
Testcase in docker

I have created a Dockerfile and run script that allow me to reproduce the issue. It still usually takes quite a lot of time: this container freezes the system after anywhere from 30 minutes to 5-6 hours. I've been able to reproduce it with docker in Windows 10, xubuntu 18.04, xubuntu 18.10, opensuse tumbleweed and centos 7.

I know, that docker can have bugs too, and reproducing in docker is not exactly reproducing natively, but since these freezes also happen on 4 of my machines in native 18.04 and in VirtualBoxed 18.04 too, it's possible that the bug is not in the docker or VirtualBox.

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

Input interpretation:
Wednesday, August 16, 2017
Date formats:
16/08/2017 (day/month/year)
Time difference from today (Saturday, March 2, 2019):
1 year 6 months 17 days ago
18 months 17 days ago
80 weeks 3 days ago
563 days ago
402 weekdays ago
1.54 years ago

Embarrassing AMD, just a disgrace.

Revision history for this message
In , henriquesarmentogoulartaguiar (henriquesarmentogoulartaguiar-linux-kernel-bugs) wrote :

Does anyone know whether the kernel version 5.0 solves the issue? From the changelog, it seems to me like it might.

Revision history for this message
In , ison (ison-linux-kernel-bugs) wrote :

I have tested it on 5.0 and still experienced the soft lockups just as frequently as with 4.20.
However, I am currently testing kernel 5.0 with PAGE_TABLE_ISOLATION disabled and it seems that it may be a fix.

Although maybe it's a bit premature to say that, since my experience with this issue (on 2700X) has been extremely inconsistent. I get the soft lockups, but in addition to the lockups (and usually preceding them) it seems like my whole system becomes unstable. Applications start segfaulting, or turning into zombies that can't be killed, then the lockup happens if I just let it go.
What's strange is that once this sort of "instability" occurs it seems to stay unstable even across reboots, and I can only have my machine running 20-30min before things start segfaulting or locking up again. Usually it takes me about 5 or 6 restarts before it just seems to hit some sweet spot and become "stable". Once it's stable it stays stable for days, even if I restart the machine.

So based on the above description I can see that narrowing down the problem could be very difficult since something might seem to be working for a while, until it doesn't.

At any rate, the solution I'm currently testing was proposed in this Gentoo thread:
https://forums.gentoo.org/viewtopic-t-1074860-start-0.html
which I stumbled upon after receiving a kernel panic and noticing this error
>Unexpected reschedule of offline CPU#0!
From what I can tell "offline CPU" sounds very relevant to our issue.

Their solution was to disable PAGE_TABLE_ISOLATION in the kernel. This also intrigues me as I remember seeing other errors in dmesg relating to "page" writing.
Apparently that kernel option isn't even necessary for AMD CPUs anyway. It was meant to mitigate a security vulnerability in Intel CPUs, so it can be safely disabled.

NOTE: I am testing this with some of the BIOS modifications recommended by others here (such as disabling c6 state and "typical current idle", etc..)
If this solution works for another week or so I'll try restoring the BIOS settings to their defaults and testing again.
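
Before rebuilding without PAGE_TABLE_ISOLATION it may be worth checking whether page table isolation is active at all on the machine. A minimal sketch, assuming a kernel recent enough to expose /sys/devices/system/cpu/vulnerabilities/meltdown; `pti_status` and its labels are hypothetical names chosen for illustration.

```shell
# pti_status [FILE]: classify the contents of the meltdown sysfs file.
# Prints "pti-active" if the kernel reports the PTI mitigation,
# "not-affected" if the CPU is reported as not affected, else "unknown".
# Hypothetical helper; FILE is a parameter only to make testing easy.
pti_status() {
    file=${1:-/sys/devices/system/cpu/vulnerabilities/meltdown}
    case "$(cat "$file" 2>/dev/null)" in
        *PTI*)           echo "pti-active" ;;
        "Not affected"*) echo "not-affected" ;;
        *)               echo "unknown" ;;
    esac
}
```

If this prints `not-affected`, PTI is presumably not in use and disabling the config option is unlikely to change behaviour. The `nopti` or `pti=off` boot parameters are a way to test the same thing without rebuilding the kernel.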

Revision history for this message
In , henriquesarmentogoulartaguiar (henriquesarmentogoulartaguiar-linux-kernel-bugs) wrote :

Well, it's a pity to know 5.0 might not fix it. I was really hoping it would. On the other hand, it's good to know so many other people have this issue, which means I'm not alone in hoping it'll be at some point fixed.

I'll just throw my two cents in though, to hopefully help.

Firstly, technical info. My computer is (I'm pretty sure) a HP 15-DB0125, with Ryzen 5 2500U and Vega 8 graphics. I've applied a bunch of fixed I've found online (mainly the Zen script and grub options), and while I still get freezes, they happen much more rarely, particularly if I shut down my computer when I'm not using it and after heavy activities. I haven't tried idle=halt yet, though

Anyhow. To the part I don't think has been mentioned.

I've noticed games running on Wine/Proton make my computer very much more likely to freeze. Neverwinter (which should be the lightest, I assume), for example, will crash after a few minutes. Warframe will crash after maybe half an hour. Counterstrike: Global Offensive? That game, which runs natively, can be played for a comfortable amount of time (maybe 1-2+ hours?) without risking any freezes. Which is a shame, really, because I'm being locked out of playing Wine-enabled games until Linux and Ryzen fix their dysfunctional relationship.

Revision history for this message
In , henriquesarmentogoulartaguiar (henriquesarmentogoulartaguiar-linux-kernel-bugs) wrote :

fixes*

To be clear, I'm not sure whether CS:GO does freeze my computer after a given amount of time. I don't think I've tested it thoroughly enough to be sure. I do know I can play it comfortably without worrying.

Revision history for this message
In , simon (simon-linux-kernel-bugs) wrote :

Hi,
to me, comments 535 and 536 sound more or less as if
the instability does not come from idle, because of:
"... Applications start segfaulting,..."
"running wine/... (with games)"
For segfaulting you have to replace the processor. Search the internet
for the "kill ryzen" script, run it and see what happens.
If it segfaults, AMD will normally replace your processor with one
that does not segfault.

For me this was the first thing I did to get all my Ryzen
systems rock stable. All processors were exchanged by AMD support (some twice)
(my (small) computer hardware store did this for me).
After you get a new processor, check again!
Also (this happened to me after BIOS updates to get the "typical idle control" settings), check whether under heavy load (I used the
"kill ryzen" script again)
get:
kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
(or something similar)
For this, use the "pcie_aspm=off" option on the grub/kernel command line.

... What I am saying is: until you are sure that the "kill ryzen" script
runs fine, do not search for other things; these processors have
bugs. FIRST exchange the processor with AMD. Do not trust the production
date of the processor; I have had processors which, according to all the info
I found on the internet, should have been OK, but they were NOT OK.

As I understand it, this random soft lockups bug report is about soft lockups
when idle, or when waking up from idle.

simoN
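
The AER check described above can be done by counting matching lines in the kernel log. A minimal sketch; `count_aer_corrected` is a hypothetical helper, and the pattern is taken from the first log line quoted in that comment, so other AER message formats would need a different pattern.

```shell
# count_aer_corrected: count "AER: ... Corrected error received" lines in
# the kernel log read from stdin. Hypothetical helper for illustration.
count_aer_corrected() {
    # grep -c exits non-zero when the count is 0; keep the function quiet.
    grep -c 'AER: .*Corrected error received' || true
}

# Demo on the two lines quoted above; only the first is an AER summary line.
printf '%s\n' \
    'kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000' \
    'kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)' \
    | count_aer_corrected   # prints 1
```

On a live system the input would be `dmesg` or `journalctl -k`, run before and after a load test.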

Revision history for this message
In , onox (onox-linux-kernel-bugs) wrote :

Hi ! I haven’t read all the comments here because there are too many by now, but I’m puzzled that many consider this to be a Linux-only issue, because I can tell you that it definitely happens on Windows (10) too, and I found other people with the same freeze/reboot problem on Windows. For me the crashes are really the same on both Linux and Windows. I have a Ryzen 7 1700X and an Asrock X370 Taichi mainboard, and the "Power Supply Idle Control" > "Typical Current Idle" setting fixes the problem. However, lately I’ve noticed a couple of times, but still rarely, that the computer would crash and reboot when waking up from sleep, then it seems to enter a boot loop for a few times, and then starts normally but it resets the BIOS settings… I haven’t overclocked anything. My CPU also seems to have the "Ryzen segfault bug", which seems to be a different issue from what I’ve understood.

I should probably RMA the CPU (and the mainboard ?), but I haven’t because disassembling everything would be difficult for me. In my case those crashes (without "Typical Current Idle") happen fairly rarely, like every two-three weeks, or even more rarely than that.

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

Try idle=halt! It is the workaround that fixes the random freezes! You don't need anything else (at least not with Ryzen 1*** and 2***U processors, as it works wonders on my desktop with a 1600 and my notebook with a 2700U).

IDLE=HALT IS THE WORKAROUND THAT FIXES THIS BULLSHIT!
I REPEAT:
IDLE=HALT!

It will probably work for most of you too!

Revision history for this message
In , alfie (alfie-linux-kernel-bugs) wrote :

Here (Asus CrossHair VI, 1600, gentoo with any kernel around), no suggested workaround works. The system freezes when in very light use and I was never able to obtain a single line of debug via serial (I put a small linux mini itx box near this computer just to receive the serial log).

It seems like after a workaround is applied (BIOS settings, idle=halt, rcu_nocbs=0-11 and so on), the BAI (Bug AI) finds a way to freeze the system again and the dances restart.

Ryzen is not a CPU to run Linux onto.

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

(In reply to alfie from comment #542)
> Here (Asus CrossHair VI, 1600, gentoo with any kernel around), no suggested
> workaround works. The system freezes when in very light use and I was never
> able to obtain a single line of debug via serial (I put a small linux mini
> itx box near this computer just to receive the serial log).
>
> It seems like after a workaround is applied (BIOS settings, idle=halt,
> rcu_nocbs=0-11 and so on), the BAI (Bug AI) finds a way to freeze the system
> again and the dances restart.
>
> Ryzen is not a CPU to run Linux onto.

Strange, since idle=halt makes both my machines rock stable. Do you have AMD-V enabled? If so, iommu=pt fixes some problems that can occur with that.

Revision history for this message
In , ison (ison-linux-kernel-bugs) wrote :

(In reply to Philip Rosvall from comment #541)
> Try idle=halt! It is the workaround that fixes the random freezes! You don't
> need anything else (at least not with Ryzen 1*** and 2***U processors, as it
> works wonders on my desktop with a 1600 and my notebook with a 2700U).
>
> IDLE=HALT IS THE WORKAROUND THAT FIXES THIS BULLSHIT!
> I REPEAT:
> IDLE=HALT!
>
> It will probably work for most of you too!

In light of the last few comments here I can say that I may have been confusing 2 separate issues: The random segfaults and the lockups.
I tested idle=halt a while back and I may have inadvertently concluded that it didn't work due to my machine still receiving segfaults, which I was assuming would eventually lead to a lockup.

Likewise, the solution I mentioned earlier (disabling PAGE_TABLE_ISOLATION) seems to have fixed my crashes, but my machine just locked up.

I've set idle=halt again and this time I'll try to keep the two issues separate and only post back here if I still receive lockups.
Thanks for clarifying the issue.

Revision history for this message
In , henriquesarmentogoulartaguiar (henriquesarmentogoulartaguiar-linux-kernel-bugs) wrote :

Hey out of curiosity, since I can't seem to find a clear answer, what exactly does idle=halt do?

Revision history for this message
In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

(In reply to alfie from comment #542)
>
> Ryzen is not a CPU to run Linux onto.

YMMV but our Ryzen 2600 has been 100% rock solid stable (on 24/7) since my last comment #493 Jan 23, 2019. All we did was the mwait and idle:typical bios tweaks. That has worked for many/most people hitting this bug, as proven by the fact many people comment, say thanks for the tweaks, and never come back.

I strongly suspect that anyone having problems after doing mwait and idle:typical (and maybe rcu_nocbs too) is having a separate problem. Certainly the segfault people are hitting a different bug which should not get conflated with this bug thread. Our 2600 never exhibited the segfault problem (thankfully!): seems to be more a 1xxx issue.

Revision history for this message
In , devurandom (devurandom-linux-kernel-bugs) wrote :

I was severely affected a year ago. My system would not run through one rebuild of Gentoo, i.e. it would lock up / hard freeze after only a few hours. I suspect the heavy disk corruption I experienced every time it happened was a direct result of this, and not caused by unrelated hardware defects. The situation was dire and at the time I had to set "typical power idle" in the firmware in order to make the problem go away and the system usable.

Things seem to have improved a lot since then. I now have default firmware settings (i.e. I have NOT set "typical power idle"), I have default Linux idle settings (i.e. I have NOT set idle=halt, idle=mwait or similar), and I am not using any other countermeasure anymore. The system is as stable as it gets: No lockups and no unexplainable segfaults. This is since before Linux 5.0 and I strongly suspect the recent firmware update to version 4207 to have fixed the issue.

CPU: AMD Ryzen 5 2400G
Motherboard: Asus ROG STRIX B350-F Gaming
Firmware version: 4207
Linux: 5.0.1 (with Gentoo patches)

Revision history for this message
In , cevelnet (cevelnet-linux-kernel-bugs) wrote :

AGESA 0070 has been released for some mainboards (e.g. ASUS A320M-K, ASUS X470 Pro and a couple of MSI boards). For anyone experiencing crashes, I think 0070 is worth a try. The official changelog says something along the lines of "added support for new processors", but more changes under the hood aren't unlikely.

Regarding the segfault issue: it has already been confirmed by AMD to be a hardware issue on early batches of Ryzen 1xxx CPUs and you can RMA affected CPUs under warranty. I don't have the landing page at hand right now, but I'd advise anyone with the segfault bug to replace their CPUs; especially those experiencing crashes under Windows.

Like Trevor, I also think we have inconsistencies, because we're most likely dealing with multiple bugs. From what I've seen so far, Vega users have a different form of crash caused by Vega (including Ryzen Mobile, Vega 56/64 dGPU, 2200G/2400G), users experiencing crashes under Windows and Linux have a different form of instability issue, crashes caused by segfaults are another category and issues fixed by idle=halt / "Typical Load Idle" are another category.

From my own tests, I fall into the category fixed by "Typical Load Idle". Later today, I will update to AGESA 0070, test against latest kernel with various settings and report back.

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

(In reply to CodingEagle02 from comment #545)
> Hey out of curiosity, since I can't seem to find a clear answer, what
> exactly does idle=halt do?

idle=halt disables ACPI MWAIT completely, and issues HLT instead of entering different C-states for idle.
Without idle=halt, both my machines are issuing errors upon boot ([Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)), which disappears when idle=halt is set.
Borislav Petkov has a suspicion about what happens (comment 531):
"In order to understand what's *really* going on in the
cstate detection code one would need to instrument
at least acpi_processor_get_power_info_cst() and
acpi_processor_ffh_cstate_probe() to figure out what exactly does the
kernel parse from those CST objects and what it uses to try to enter
idle.

And do all that instrumentation on an affected system.

My current suspicion is that it tries to enter idle with
misconfigured states and under certain conditions, it misses the wakeup,
leading to the stall."

I have successfully used idle=halt for over a month now, without a single lockup!
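
To check both things on a given machine (whether an idle= option is actually on the kernel command line, and whether the firmware-bug line shows up in the log), something like the following Python sketch could be used. The function names are made up for illustration, and the regular expression assumes only the message format quoted above; the input text would come from /proc/cmdline and from saved dmesg output.

```python
import re

# Matches the firmware-bug line quoted in this thread and captures the
# C-state address it reports (for example "0x0").
FIRMWARE_BUG_RE = re.compile(
    r"\[Firmware Bug\]: ACPI MWAIT C-state (0x[0-9a-fA-F]+) not supported by HW"
)


def idle_parameter(cmdline):
    """Return the value of idle= from a kernel command line, or None.

    If the option appears more than once, this sketch keeps the last one.
    """
    value = None
    for token in cmdline.split():
        if token.startswith("idle="):
            value = token[len("idle="):]
    return value


def firmware_bug_cstates(dmesg_text):
    """Return the C-state addresses named in MWAIT firmware-bug lines."""
    return FIRMWARE_BUG_RE.findall(dmesg_text)
```

For example, idle_parameter(open("/proc/cmdline").read()) gives the active setting, and firmware_bug_cstates() can be fed the text of a saved dmesg dump.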

In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

(In reply to ison from comment #544)
> (In reply to Philip Rosvall from comment #541)
> > Try idle=halt! It is the workaround that fixes the random freezes! You
> don't
> > need anything else (at least not with Ryzen 1*** and 2***U processors, as
> it
> > works wonders on my desktop with a 1600 and my notebook with a 2700U).
> >
> > IDLE=HALT IS THE WORKAROUND THAT FIXES THIS BULLSHIT!
> > I REPEAT:
> > IDLE=HALT!
> >
> > It will probably work for most of you too!
>
> In light of the last few comments here I can say that I may have been
> confusing 2 separate issues: The random segfaults and the lockups.
> I tested idle=halt a while back and I may have inadvertently concluded that
> it didn't work due to my machine still receiving segfaults, which I was
> assuming would eventually lead to a lockup.
>
> Likewise, the solution I mentioned earlier (disabling PAGE_TABLE_ISOLATION)
> seems to have fixed my crashes, but my machine just locked up.
>
> I've set idle=halt again and this time I'll try to keep the two issues
> separate and only post back here if I still receive lockups.
> Thanks for clarifying the issue.

These seem like separate issues. The lockups seem, to me, to be related to MWAIT and misconfigured C-states. Have you looked at your dmesg after a boot without idle=halt to see whether you get the error "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)"?

In , alfie (alfie-linux-kernel-bugs) wrote :

Kernel 4.19.27 with rcu_nocbs=0-11 idle=halt nopti: yesterday at around 11:30 p.m. I had a "usual" total lockup of the system during light usage (I just had an ssh session open). With idle=halt, I get no "ACPI MWAIT C-state bug" line in dmesg.

The problem with this freeze is that it is difficult to replicate. The segfault bug was very easy: 3-4 mesa compilations triggered it for sure. This bug, instead, can hide for days and days and then reappear all of a sudden.

In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to alfie from comment #551)
> kernel 4.19.27 with rcu_nocbs=0-11 idle=halt nopti, yesterday at around
> 11:30 p.m. I had a "usual" total lockup of the system during light usage (I
> just had an ssh session opened). With idle=halt, I get no "ACPI MWAIT C-state
> bug" line in dmesg.
>
> The problem with this freeze problem is that it is difficult to replicate. The
> segfault bug was very easy: 3-4 mesa compilations triggered it for sure.
> This bug, instead, can hide for days and days and it reappears all of a
> sudden.

The segfault compilation issue is well known. I had the same with my first 1700. AMD will RMA. My replacement chip has been fine.

As for the other idle lockup issues, a combination of custom rcu kernel and python script to disable C6 has had my machine up for months.

In , bakulinm (bakulinm-linux-kernel-bugs) wrote :

As I have said previously, I'm not sure that my problem was the same one others have (I didn't have freezes on idle, but rather freezes when compiling), but what helped me was updating the BIOS to the latest version. ASUS has released a new BIOS with AGESA 0070 for the X470 Pro, and I can't trigger the bug anymore. As for ASRock, they will probably release an update in a week or so; all vendors usually release a new AGESA within a month, as far as I remember.

In , tommyvct (tommyvct-linux-kernel-bugs) wrote :

Same here, I'm using opensuse Leap 15 and Ubuntu 18.10, both 4.12 and 4.18 kernel suffer the same issue.
I'm using AMD Ryzen 5 1600 with Asrock Fatal1ty AB350 Gaming-ITX/ac board.
I'll try updating the latest bios to see if things got fixed.

In , ison (ison-linux-kernel-bugs) wrote :

I am still receiving lockups even with the latest BIOS, all the BIOS setting recommendations here, and with the idle=halt kernel parameter.

Although I am now fairly certain that I have a faulty motherboard (x470) which seems to be getting worse, and sometimes doesn't even get to the BIOS at all (I should have suspected something was up when I first bought it and the PS/2 port didn't work).
I'm going to try replacing it and hopefully that'll help.

In , zankuro (zankuro-linux-kernel-bugs) wrote :

Hi guys. I have an Asus ROG Strix notebook with a Ryzen 7 1700. My CPU was manufactured in week 46 of 2017 (UA 1746PGS), so it does not have the segfault problem. My notebook freezes completely when it is idle; not always, but more often when freshly powered on (cool) or after heavy activity. The freezes never occurred during more intense use, only when idle. Like all notebook BIOSes, mine has quite limited options and no "Typical current idle" or similar setting. I tried recompiling the kernel with the "RCU_NOCB_CPU" option and also tried all possible combinations of the parameters "rcu_nocbs", "iommu=soft/pt", "idle=nomwait/halt", "pti=off" / "nopti", etc. The only alternative that actually solved the freeze problem was using the program "zenstates.py" to disable the core and package "c6state". But this workaround makes my CPU run about 5°C warmer. I used Windows on this same machine for weeks without any crashes. This leads me to conclude the following: it does not crash on Windows (ok, it's a software problem, "Linux"); Intel CPUs do not hang in Linux (oh no, ok, it's a hardware problem, "Ryzen"). I do not know what AMD thinks about this, but I was forced to change my notebook. Now I have one with an Intel processor. Thank you AMD.

In , tommyvct (tommyvct-linux-kernel-bugs) wrote :

(In reply to Tommy Vercetti from comment #554)
> Same here, I'm using opensuse Leap 15 and Ubuntu 18.10, both 4.12 and 4.18
> kernel suffer the same issue.
> I'm using AMD Ryzen 5 1600 with Asrock Fatal1ty AB350 Gaming-ITX/ac board.
> I'll try updating the latest bios to see if things got fixed.

I updated the BIOS to P5.30; it has only locked up once so far.
How do I turn the log on for this?

In , to_load (toload-linux-kernel-bugs) wrote :

I assume there are several (at least two) similar problems that cause spontaneous system hangs.

One of them is the mwait bug listed in the AMD errata. It looks like idle=halt is a partial workaround for this. But, as was said in the AMD community forum, a guest OS in a virtual machine may execute the mwait instruction and provoke this bug. Luckily, mwait is not common in user-space applications (still not sure about that).

Another is a power supply problem. This may be caused by an unsupported PSU (no 0A 12V) or an unsuitable power subsystem on the motherboard. This may be partially (again!) solved by the BIOS "typical current idle" option or by disabling C6 states with the .py script.
IMO this BIOS option may not be implemented properly in some mobo firmware. It looks like it only tells the OS not to use C-states, but does not prevent deep sleep at the hardware level. In that case we get [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0) in dmesg and the problem persists.

I have read forum posts reporting completely stable Ryzen systems under Linux without any tweaks, so probably not that many systems are affected by the mwait and PSU lockups. I think some combination of factors may provoke this behavior: mobo + PSU, memory latency (why not?) or some vendor-provided BIOS config.

Also in some cases these two problems may appear together.
Obviously, segfaults and amdgpu crashes are not related to this bug.

PS:Sorry for my bad English.

In , alfie (alfie-linux-kernel-bugs) wrote :

The PSU problem sounds like a weak excuse from AMD, or maybe they are talking about very, very bad 10-dollar units.

The BIOS "typical/low current" options don't seem to change the C-states set up by the kernel routines here. With either selection, I always end up with
c0 POLL
c1 ACPI HLT
c2 ACPI IOPORT 0x414

as you can see with cpufreq cpupower idle-info or /usr/src/linux/tools/power/x86/turbostat/turbostat -n1
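
The same list can also be read from sysfs without extra tools. A minimal Python sketch, assuming the usual cpuidle layout under /sys/devices/system/cpu/cpuN/cpuidle/stateM/ with name and desc files (the function name is only illustrative):

```python
from pathlib import Path


def read_idle_states(cpu_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return [(state_dir, name, desc), ...] for one CPU, ordered by index.

    Returns an empty list if the directory does not exist, for example
    when no cpuidle driver is active.
    """
    base = Path(cpu_dir)
    if not base.is_dir():
        return []
    states = []
    # Directories are named state0, state1, ...; sort numerically.
    for state in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state / "name").read_text().strip()
        desc = (state / "desc").read_text().strip()
        states.append((state.name, name, desc))
    return states
```

Comparing the output for each BIOS setting shows directly whether the option changes what the kernel ends up using.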

That happens because the firmware already informed the kernel mwait/monitor cannot be used for c-states.

If idle=halt is used, there is no C-state management and the hlt instruction is used to put a core in idle mode.

I think it is the hyper-threading handling that is somehow bugged, somewhere.

Also note that many Ryzen users overclock their machines, and if you overclock, you have no C-state problems at all.

In , to_load (toload-linux-kernel-bugs) wrote :

So if you get this message with the "typical current idle" option set:
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
it means the probe of this C-state failed and the kernel must avoid it.
Is there any difference between the C-state list in the OS with and without this option? I only have a laptop with a very limited BIOS and can't check.

And there are only C0, C1 and C2. What about C6? What does zenstates.py say about that? Frankly, I have not tried this script; I do not want to disable power-saving features on a laptop.

Also, some overclocking may solve this problem (see comment 103). Turn on "performance bias" or a similar option in the BIOS for testing.

In , alfie (alfie-linux-kernel-bugs) wrote :

This is pretty strange, could someone explain it to me? Wasn't mwait bugged in ryzen?

Tired of the random freezes, I tried this:

--- ./arch/x86/kernel/acpi/cstate.c.orig 2018-10-22 08:37:37.000000000 +0200
+++ ./arch/x86/kernel/acpi/cstate.c 2019-03-20 19:26:45.261101857 +0100
@@ -86,6 +86,7 @@ static long acpi_processor_ffh_cstate_pr
        num_cstate_subtype = edx_part & MWAIT_SUBSTATE_MASK;

        retval = 0;
+#if 0
        /* If the HW does not support any sub-states in this C-state */
        if (num_cstate_subtype == 0) {
                pr_warn(FW_BUG "ACPI MWAIT C-state 0x%x not supported by HW (0x%x)\n",
@@ -93,6 +94,7 @@ static long acpi_processor_ffh_cstate_pr
                retval = -1;
                goto out;
        }
+#endif

        /* mwait ecx extensions INTERRUPT_BREAK should be supported for C2/C3 */
        if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||

And I haven't had any freezes for 3 days...

In , cevelnet (cevelnet-linux-kernel-bugs) wrote :

I have revisited the errata. Errata 1033 "A Lock Operation May Cause the System to Hang" and 1109 "MWAIT Instruction May Hang a Thread" are the top contenders. According to page 12 and 13 of that document, Pinnacle Ridge processors are not affected. However, page 16 and 17 suggest 2nd Gen Ryzen are affected. This makes me conclude: either there is an error in the document, or Raven Ridge aka improved Zen 1 (Desktop and Mobile) are affected, while Pinnacle Ridge is not.

During my latest tests on kernel 5.0.1, on AGESA 1006 and after upgrading to AGESA 0070, I wasn't able to reproduce any freezes / hangs on a Ryzen 7 2700, ASUS A320M-K (BIOS defaults, no OC, no Typical Current Idle setting), 16GB Patriot Viper RGB 3200CL16, ASUS RX480 8GB and Corsair AX860 (Haswell C6/C7 support, 80 PLUS Platinum).

I installed a fresh, bare-minimum, default Arch Linux w/o any special configuration, let the system sit idle for half a day and checked the log for hangs. Then I enabled all power management features using powertop and let the system sit overnight: still no hangs. I repeated the same for AGESA 0070: still no hangs.

I think it's too early to conclude anything, as the hangs / freezes used to be very random. I will repeat my tests (on AGESA 0070) with a slightly more bloated system by installing GNOME.

In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

I am running Fedora on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx with Gnome.

I use kernel parameters idle=nomwait iommu=pt processor.max_cstate=1 set via grubby.

I am not getting any hangs unless I suspend with lid close or suspend on power button. It would probably be better to not execute an MWAIT per the errata. The method in comment 561 would work for those who have the gumption to build their own kernel.

The iommu=pt I read about somewhere as useful to do, but I don't know if it helps. Some suggest set idle=halt which also avoids the MWAIT instruction.

I am also told that the DRI driver has an issue with loading at boot which will hang the kernel. https://bugs.freedesktop.org/show_bug.cgi?id=109206

In , to_load (toload-linux-kernel-bugs) wrote :

(In reply to alfie from comment #561)
> This is pretty strange, could someone explain it to me? Wasn't mwait bugged
> in ryzen?

It looks like this patch makes the kernel ignore the BIOS messages about an unsupported C-state.
The patched function is used in "acpi_processor_ffh_cstate_probe", which in turn is used in the "processor_idle.c" file. The patch causes the "cx.entry_method" value to change from ACPI_CSTATE_SYSTEMIO to ACPI_CSTATE_FFH.

It is really strange that this patch works but idle=halt does not. Maybe it somehow silently disables the C6 state.
Can your system reach turbo frequencies with this patch?
Maybe I understand something wrongly...
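
To make the effect of the #if 0 easier to follow, here is a rough Python model of the sub-state check that the patch disables. It is a sketch under assumptions, not the kernel code: the shift and mask constants are what I recall from the kernel's mwait.h, the function names are invented, and the separate ECX extension check that follows in the real function is left out.

```python
# Assumed values of the kernel's MWAIT_* constants (not taken from this thread).
MWAIT_SUBSTATE_SIZE = 4
MWAIT_SUBSTATE_MASK = 0xF
MWAIT_CSTATE_MASK = 0xF


def mwait_substates(cpuid5_edx, cst_address):
    """Number of MWAIT sub-states CPUID leaf 5 (EDX) advertises for the
    C-state that the MWAIT hint in cst_address refers to."""
    cstate_type = ((cst_address >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1
    edx_part = cpuid5_edx >> (cstate_type * MWAIT_SUBSTATE_SIZE)
    return edx_part & MWAIT_SUBSTATE_MASK


def probe(cpuid5_edx, cst_address, skip_substate_check=False):
    """Return "FFH" if the MWAIT entry is accepted, else "fallback".

    skip_substate_check=True models the patched kernel, where the
    "not supported by HW" branch is compiled out.
    """
    if not skip_substate_check and mwait_substates(cpuid5_edx, cst_address) == 0:
        return "fallback"
    return "FFH"
```

With cst_address 0x0 and an EDX value that advertises no sub-states (which is what the 0x0 / (0x0) pair in the dmesg line suggests), probe() returns "fallback" normally and "FFH" when the check is skipped.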

In , bp (bp-linux-kernel-bugs) wrote :

There's a patch in comment #526 for people to test before we include it so that there's at least *some* fix in the kernel, going forward...

In , alfie (alfie-linux-kernel-bugs) wrote :

(In reply to Another User from comment #564)
> Can your system reach turbo frequencies with this patch?
> Maybe I understand something wrongly...

Yes, I have a ryzen 1600 and I can see 1/2 cores running at ~3.6 GHz under some peculiar stress case.

(In reply to Borislav Petkov from comment #565)
> There's a patch in comment #526 for people to test before we include it so
> that there's at least *some* fix in the kernel, going forward...

But I really want C6/P6 and ignoring the BIOS seems to get me there...

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to alfie from comment #566)
> But I really want C6/P6 and ignoring the BIOS seems to get me there...

Why do you think the patch I pointed to won't give you C6?

In , cevelnet (cevelnet-linux-kernel-bugs) wrote :

@Borislav has the fix for erratum 1033 "A Lock Operation May Cause the System to Hang" been applied so far? The suggested workaround was "Program MSRC001_1020[4] to 1b", but I couldn't find anything about it in master branch. According to the document, 1033 only affects B1.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Tolga Cakir from comment #568)
> @Borislav has the fix for erratum 1033 "A Lock Operation May Cause the
> System to Hang" been applied so far? The suggested workaround was "Program
> MSRC001_1020[4] to 1b", but I couldn't find anything about it in master
> branch. According to the document, 1033 only affects B1.

Such a "fix" does not exist. Possibly because it is unlikely this is causing it and BIOS might be applying the fix already. People with B1s (model 1, stepping 1) could test though by doing as root:

# modprobe msr
# rdmsr -a 0xc0011020

and looking at bit 4 in the result.
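
For anyone unsure how to read bit 4 out of the hex value, a small Python sketch follows. It assumes rdmsr prints one hexadecimal value per CPU without a 0x prefix; the helper names are made up for illustration.

```python
def parse_rdmsr(output):
    """Turn `rdmsr -a` output (one hex value per CPU) into integers."""
    return [int(token, 16) for token in output.split()]


def bit_is_set(value, bit):
    """True if the given bit (0 = least significant) is set in value."""
    return bool((value >> bit) & 1)


def all_cpus_have_bit(output, bit=4):
    """True if every CPU listed in the rdmsr output has the bit set."""
    values = parse_rdmsr(output)
    return bool(values) and all(bit_is_set(v, bit) for v in values)
```

A value ending in ...10 (hex) has bit 4 set, since 0x10 is 1 << 4; a value ending in ...00 does not.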

In , zao (zao-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #569)
> (In reply to Tolga Cakir from comment #568)
> Such a "fix" does not exist. Possibly because it is unlikely this is causing
> it and BIOS might be applying the fix already. People with B1s (model 1,
> stepping 1) could test though by doing as root:
>
> # modprobe msr
> # rdmsr -a 0xc0011020
>
> and looking at bit 4 in the result.

I've got an ASUS PRIME X370-PRO with the B1 CPU mentioned on firmware 4024
(from 2018/09/28).

rdmsr yields 6800000000010, which has bit 4 set.

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Lars Viklund from comment #570)
> rdmsr yields 6800000000010, which has bit 4 set.

Looks like your BIOS applies the fix. Now, does the patch in comment #526 fix your freezes?

In , zao (zao-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
>
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

I'm sorry, I don't currently suffer from any significant hangs on my machines when running stock Ubuntu 18.04.02 LTS (4.15.0-43-generic).

I mostly meant #570 to demonstrate that some firmware does indeed implement the errata.

While I've had core hangs and hard freezes in the past, the machine has enough workarounds applied to make it sufficiently stable. Current uptime is something like 81 days since the last incident and I can honestly not remember what it did then, but it's been used heavily as a build machine.

From my memory: sufficiently RMA:d CPUs, slight overclocking, change the current idle setting in firmware, boot idle=nomwait.

Back when I had significant problems, the best way to mitigate them was to disable SMT.

It's possible that my increased stability has come from firmware implementing this errata, or it might be coincidental, I don't have enough data to tell.

(You people should be happy you're not running FreeBSD, there I can reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )

In , devurandom (devurandom-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #569)
> (In reply to Tolga Cakir from comment #568)
> > @Borislav has the fix for erratum 1033 "A Lock Operation May Cause the
> > System to Hang" been applied so far? The suggested workaround was "Program
> > MSRC001_1020[4] to 1b", but I couldn't find anything about it in master
> > branch. According to the document, 1033 only affects B1.
>
> Such a "fix" does not exist. Possibly because it is unlikely this is causing
> it and BIOS might be applying the fix already. People with B1s (model 1,
> stepping 1) could test though by doing as root:
>
> # modprobe msr
> # rdmsr -a 0xc0011020
>
> and looking at bit 4 in the result.

206800000000100

CPU: AMD Ryzen 5 2400G
Motherboard: Asus ROG STRIX B350-F Gaming
Firmware version: 4207
Linux: 4.18.16-300.fc29 (Fedora 29 live system)

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Dennis Schridde from comment #573)
> 206800000000100

If this is correct:

http://www.cpu-world.com/CPUs/Zen/AMD-Ryzen%205%202400G.html

then you should have in /proc/cpuinfo something like this:

cpu family: 23
model: 17
stepping: 0

and if so, not affected.
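
A quick way to do this comparison is sketched below in Python. It assumes the usual "key : value" layout of /proc/cpuinfo and treats family 23, model 1, stepping 1 as the B1 part, as described earlier in this thread; the function names are hypothetical.

```python
def cpu_signature(cpuinfo_text):
    """Return (family, model, stepping) for the first CPU in the text.

    Fields that are missing come back as None.
    """
    found = {"cpu family": None, "model": None, "stepping": None}
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        key = key.strip()
        # "model name" is a different key, so it is skipped here.
        if sep and key in found and found[key] is None:
            found[key] = int(value.strip())
    return (found["cpu family"], found["model"], found["stepping"])


def is_zen_b1(signature):
    """True for family 23 (0x17), model 1, stepping 1."""
    return signature == (23, 1, 1)
```

Usage would be along the lines of is_zen_b1(cpu_signature(open("/proc/cpuinfo").read())).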

In , devurandom (devurandom-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #574)
> (In reply to Dennis Schridde from comment #573)
> > 206800000000100
>
> If this is correct:
>
> http://www.cpu-world.com/CPUs/Zen/AMD-Ryzen%205%202400G.html
>
> then you should have in /proc/cpuinfo something like this:
>
> cpu family: 23
> model: 17
> stepping: 0
>
> and if so, not affected.

A year ago I was affected by something (comment #295), but the issue appears to have vanished within the last months (comment #547 and comment #573).

In , cevelnet (cevelnet-linux-kernel-bugs) wrote :

(In reply to Lars Viklund from comment #572)
> (You people should be happy you're not running FreeBSD, there I can
> reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )

Looks like they've addressed Ryzen errata issues around August 2018:

https://github.com/freebsd/freebsd/blob/75ee4f08d3acd4bf70f24b3203fa440255873973/sys/amd64/amd64/initcpu.c#L133

Also, they've made machdep.idle=hlt and machdep.idle_mwait=0 default for Ryzen processors.

I've additionally checked the registers mentioned in FreeBSD's source. The latest BIOS 4406 (w/ AGESA 0070) for my ASUS A320M-K seems to apply fixes for all affected errata.

Even though I can't reliably reproduce the stability issues anymore, I think it's still good practice to have Boris' patch applied, since it addresses a known erratum. I'll test the patch and check for regressions.

In , hexawolfie (hexawolfie-linux-kernel-bugs) wrote :

I'm actually desperate about this issue. I'm looking forward to see results of Boris' patch, though I really doubt it will solve this issue. Needs explicit check against self-built kernel really.

Not all manufacturers are willing to fix this issue (especially when it comes to laptops), as Windows seems to have fixed this inside its kernel since 1709, I believe. I've made a post on the HP forum about this, specifically for my laptop:

https://h30434.www3.hp.com/t5/Notebook-Boot-and-Lockup/Random-Soft-Lock-up-on-Ryzen/m-p/7062823

In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
>
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

Hi Boris, on my laptop with its latest BIOS I get:

$ sudo rdmsr -a 0xc0011020
6800000000000

This is on:

cpu family : 23
model : 17
model name : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
stepping : 0
microcode : 0x8101007

Booting with:

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.20.16-200.fc29.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet LANG=en_US.UTF-8 idle=nomwait iommu=pt processor.max_cstate=1

Has your patch been applied to kernel yet?

In , bp (bp-linux-kernel-bugs) wrote :

(In reply to JerryD from comment #578)
> $ sudo rdmsr -a 0xc0011020
> 6800000000000
>
> This is on:
>
> cpu family : 23
> model : 17
> model name : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
> stepping : 0
> microcode : 0x8101007

You're showing me the MSR output for erratum 1033 but your CPU is
not affected.

> Has your patch been applied to kernel yet?

No it hasn't - I'm still waiting for someone to test it and confirm that it fixes the issue on their boxes. And by issue I don't mean erratum 1033 - that is a red herring anyway - but the lockups people are reporting on some broken BIOSes.

In , alfie (alfie-linux-kernel-bugs) wrote :

Created attachment 281999
Add rifw kernel parameter to test a couple of patch to workaround ryzen freezes

Add rifw kernel parameter.

rifw=none - don't do anything
rifw=alfie - ignore BIOS
rifw=boris - use Borislav Petkov patch

It works with kernels 4.19, 4.20 and 5.0.2

In , alfie (alfie-linux-kernel-bugs) wrote :

> No it hasn't - I'm still waiting for someone to test it and confirm that it
> fixes the issue on their boxes. And by issue I don't mean erratum 1033 -
> that is a red herring anyway - but the lockups people are reporting on some
> broken BIOSes.

Yes, your patch "seems" to work in my case.

"Seems" because this bug is difficult to replicate...

In , brody.m.dyck (brody.m.dyck-linux-kernel-bugs) wrote :

I've spent the last few days reading through this thread.

I have Fedora 28, ryzen 1700x, and ASRock Steel Legend B450M. It's a server mostly used for files at the moment. Bone stock Fedora 28, samba, ZFS.

With GPU plugged in, hangs within 6-10 hours. Replicating this too often since March 27th.

With GPU unplugged, hangs happen within 10-20 minutes. Replicated 3 times.

Does not hang while SSH'd in.

Ran Memtest and tested for the segfault issue. Hardware is fine. Bought 'open box' on Amazon; this was in case they were re-selling an affected CPU without knowing.

Have not tried alfie's patch. Do have kernel 4.20 compiled with alfie's newest rifw patch as per Comment 580.

With Borislav's patch:
- Modifications to the OS are Samba, ZFS, Borislav's patch (comment #526 on kernel 5.0.1)
- booted March 30th at ~22:00
- Hung March 31st at ~09:40
- This is slightly longer than other tests
- I used the instructions he provided in Comment 511 without checking out 4.20. It compiles, but doesn't fix the issue, at least with linux 5.0.1-rc2 (apologies for not following steps and using 4.20)

'Typical Current Idle' set:
- Modifications to the OS are only Samba and ZFS
- Booted at 20:45 EST
- Still on now at time of this post 11:40 EST(I'll check before I hit send)
- This is the record up-time for this server
- Haven't tested anything else because I needed to have reliable access to my shares for my days off. This was 1 seemingly sure-fire way of making that happen.

Will update with more results once linux 4.20 with borislav and Alfie's patches tested. Could be a couple days. Less if 'Typical Current Idle' doesn't work, as I'll likely be RMAing my mobo and CPU.

In , dyson8 (dyson8-linux-kernel-bugs) wrote :

I used the rifw=alfie with a 1700x on an MSI X370 Gaming Plus BIOS 7A33v5H with no freezing for 1 week. Computer was rock solid on older BIOS (7A33v55) for a year+ until BIOS update to latest.

I'm using low power idle option with kernel 4.14.111 opensuse 15.0.

PC had kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0) - After BIOS update and gone now after patch in Comment 580 with rifw=alfie boot option.

In , z.wood23 (z.wood23-linux-kernel-bugs) wrote :

I haven't had the chance to get through the whole thread yet and not sure if I'll have enough time today. However, I wanted to quickly ask if anyone has faced this issue with an Intel processor?

We are seeing kernel panics due to the soft lockup issue with the following configuration:

CPU Architecture: Intel Skylake or Higher
vCPUs: 8
RAM: 52 GB
Kernel: 4.15.0
OS: Ubuntu 16.04.1

For some additional information, this setup does leverage Nested VMX to run Docker containers with the --privileged option enabled.

The soft lockups seem to occur at a rate varying from 1-5% of our VMs being impacted in a 24-hour period when running in this configuration.

The load on the cluster does not seem to impact the frequency of the issue.

One thing I did notice is that most, if not all, of the kernel panics have involved a call to smp_call_function_single.

Since I have not made it far enough through this thread, I'm not sure if the focus on this is exclusively on AMD, or if anyone else had an Intel example?

Is there a fix in the pipe for this issue and if so, a target kernel version?

In , i (i-linux-kernel-bugs) wrote :

Created attachment 282319
Threadripper 2920x soft lockup

I've gone through this thread and tried disabling the C6 state and adding idle=halt, with no effect. I followed the excellent repro steps in https://www.reddit.com/r/Amd/comments/apw8im/ryzen_freezes_in_linux_even_if_linux_is_in_vm/ and the hang is pretty consistent. However, switching to gcc-8 seems to have resolved it for these particular repro steps. With dmesg -w, I was able to capture and verify that this is indeed a soft lockup. The dmesg output is attached.

It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest BIOS on Ubuntu 18.04.2 LTS.

Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

In , kmueller (kmueller-linux-kernel-bugs) wrote :

Hello Liu

You're hitting another bug: it happens under high load (which can easily be triggered by a lot of parallel compilation threads). The only solution for your kind of problem is to RMA your CPU. See: https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response (but the problem does not seem to be limited to Ryzen 7 as stated at Phoronix, and it's not possible to say that CPUs after a particular build date are ok). I was affected, too (Ryzen 7), and went through the RMA process. Luckily, the first RMA'd CPU was ok; there have been people who needed more than one RMA. Take a look there:
https://community.amd.com/message/2822339
When I hit the problem back then (gcc segfault), I had to reboot my machine quickly because it usually hung completely some time later.

This thread covers the problem of Ryzen CPUs hanging when the machine has been completely idle for some time (it's the opposite of the problem you're facing, which I saw before my CPU RMA). I "fixed" the hang-on-idle problem by slightly overclocking the CPU (+200 MHz - ASUS X370-PRO / BIOS 4011 04/19/2018 - BIOS option: Optimization for daily computing). None of the other suggested workarounds worked for me. You can't say for sure that overclocking fixes the problem (the same goes for any other suggested workaround), but you could try it.

In , i (i-linux-kernel-bugs) wrote :

Hey, Klaus,

My understanding is that that issue only affects the Ryzen 1700 / 1700x and doesn't affect the Epyc / TR line. I am on a Threadripper 2920x, which should have been clear of the marginality problem for over a year now. Also, it doesn't segfault at any point, it just hangs. I do believe it is a similar problem to the one in this thread, but since it is rare, it could of course be just one faulty CPU. It just doesn't align well with the marginality problem timeline.

In , pjssilva (pjssilva-linux-kernel-bugs) wrote :

I have a couple of Threadrippers (1950X) that also hang under very specific workloads with a Asus X-399-A motherboard. In my case when the freeze happens the only way to get the machine back is to unplug it from the wall and plug it back. Not even reset works.

After many tweaks, I found out that I can avoid that by either turning on the "overclocking enhancement" in the BIOS (this is specific to Asus boards) or manually setting the Vcore to something like 1.315V (which should work on all motherboards). My guess is that when idle, the motherboard drops the voltage too low and the CPU hangs.

In , self (self-linux-kernel-bugs) wrote :

I ran into similar problems to this with a Threadripper 1950X and none of the proposed workarounds did anything. Disabling C6 in the BIOS made it fail to boot for some reason. Setting "Typical Current Idle" in the BIOS, processor.max_cstate=1 in the kernel command line, or disabling C6 with ZenStates.py all had no effect. I also tried overclocking and underclocking.

I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just works (for 10+ hours, I don't leave my machine on overnight). Once it started working, I set everything back to the defaults except for "Typical Current Idle" (since I have a PSU from 2010 and I doubt it supports any fancy features from 2013).

Unfortunately I can't easily narrow down all of the changes after the re-install, but the ones I'm aware of are:

 - Using the Nouveau driver (previously proprietary nVidia drivers)
 - Booting in EFI mode now (previously legacy mode)
 - Using ext4 (previously btrfs)

I strongly suspect that the graphics driver was the problem since my lockups would cause the screen to become completely unresponsive, but sound continued working, and in one case I had a lockup during a video call and the other person could still see and hear me.

I plan to try the proprietary drivers again sometime and I'll make a note if that brings the problem back for me.

Revision history for this message
In , kmueller (kmueller-linux-kernel-bugs) wrote :

(In reply to Brendan Long from comment #589)
> I strongly suspect that the graphics driver was the problem since my lockups
> would cause the screen to become completely unresponsive, but sound
> continued working, and in one case I had a lockup during a video call and
> the other person could still see and hear me.

What you're describing here is a new "feature" introduced between kernel 4.19.16 and 4.19.17. I can see exactly the same here with radeon hardware. The system is completely working (even VMs on the host are running well) except for graphics - even tty terminals are working sometimes. When ssh'ing into the machine, I can always see log entries like these:

radeon 0000:0a:00.0: ring 0 stalled for more than 14084msec
radeon 0000:0a:00.0: GPU lockup (current fence id 0x0000000000053ed7 last fence id 0x0000000000053f0f on ring 0)
...
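Log lines of this shape can be pulled out of a saved kernel log mechanically. A rough sketch, assuming the message format is exactly as quoted above; the pattern and the name `find_ring_stalls` are illustrative, not part of any driver tooling:

```python
import re

# Matches lines such as:
#   radeon 0000:0a:00.0: ring 0 stalled for more than 14084msec
STALL_RE = re.compile(r"radeon (\S+): ring (\d+) stalled for more than (\d+)msec")


def find_ring_stalls(log_text):
    """Return (pci_address, ring, milliseconds) for each stall line found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in STALL_RE.finditer(log_text)
    ]
```

Feeding it the text of kern.log or the output of dmesg gives one tuple per stall, which makes it easier to see whether the stalls line up with a particular workload.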

I'm currently trying to narrow it down using git bisect. The suspicious changes left at the moment are:

2019-01-22 arm64: Don't trap host pointer auth use to EL2 Mark Rutland bad
2019-01-22 arm64/kvm: consistently handle host HCR_EL2 flags Mark Rutland
2019-01-22 scsi: target: iscsi: cxgbit: fix csk leak Varun Prakash
2019-01-22 scsi: target: iscsi: cxgbit: fix csk leak Varun Prakash
2019-01-22 Revert "scsi: target: iscsi: cxgbit: fix csk leak" Sasha Levin
2019-01-22 mmc: sdhci-msm: Disable CDR function on TX Loic Poulain
2019-01-22 netfilter: nf_conncount: fix argument order to find_next_bit
2019-01-22 netfilter: nf_conncount: speculative garbage collection on empty lists Pablo Neira Ayuso
2019-01-22 netfilter: nf_conncount: move all list iterations under spinlock Pablo Neira Ayuso
2019-01-22 netfilter: nf_conncount: merge lookup and add functions Florian Westphal

2019-01-22 netfilter: nf_conncount: restart search when nodes have been erased Florian Westphal ?
2019-01-22 netfilter: nf_conncount: split gc in two phases Florian Westphal
2019-01-22 netfilter: nf_conncount: don't skip eviction when age is negative Florian Westphal
2019-01-22 netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS Shawn Bohrer
2019-01-22 can: gw: ensure DLC boundaries after CAN frame modification Oliver Hartkopp
2019-01-22 tty: Don't hold ldisc lock in tty_reopen() if ldisc present Dmitry Safonov
2019-01-22 tty: Simplify tty->count math in tty_reopen() Dmitry Safonov
2019-01-22 tty: Hold tty_ldisc_lock() during tty_reopen() Dmitry Safonov
2019-01-22 tty/ldsem: Wake up readers after timed out down_write() Dmitry Safonov
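For readers unfamiliar with bisecting, the idea behind narrowing a list like this is a binary search over the commit history. A simplified sketch of that search, assuming a linear history ordered oldest to newest, a known-bad last commit, and a test that stays bad once it turns bad; `first_bad` and `is_bad` are hypothetical names, and real git bisect also handles branching history:

```python
def first_bad(commits, is_bad):
    """Return the index of the first bad commit.

    Assumes commits are ordered oldest to newest, that the last one is bad,
    and that every commit after the first bad one is bad as well.
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # the culprit is mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return lo
```

Each test halves the remaining range, so a list of about twenty candidates needs roughly five builds and boots rather than twenty.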

As you're describing correctly, the problem seems to be network related. I'm getting this error too when watching videos from the internet. I'm...

Revision history for this message
In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

(In reply to Brendan Long from comment #589)
>
> I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just
> works (for 10+ hours, I don't leave my machine on overnight). Once it
> started working, I set everything back to the defaults except for "Typical
> Current Idle" (since I have a PSU from 2010 and I doubt it supports any
> fancy features from 2013).

a) Fedora 29 has very new kernels. In the last few weeks they rebased to 5.0 and 5.0 is supposed to have a lot of Ryzen fixes in it.

b) For me "Typical Current Idle" (plus mwait) was all that was required to fix the problem. I'm pretty sure kernel 5.0 makes it so mwait override is not needed.

> - Using the Nouveau driver (previously proprietary nVidia drivers)
> - Booting in EFI mode now (previously legacy mode)
> - Using ext4 (previously btrfs)

We never used proprietary video driver and we both had the problem and fixed the problem. Same with EFI: we use just legacy mode. FS should not matter at all.

Revision history for this message
In , bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote :

The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power Supply Idle Control" option off the configuration UI; downgrading is not supported (but effectively possible at least from Windows). It is still possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g.
  /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
during boot.
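Before writing to a register it can be useful to look at its current value. A read-only sketch, assuming the msr module is loaded, the script runs as root, and the register is exposed through the usual /dev/cpu/N/msr device; it does not interpret any bits, since the layout of this register is not documented in this thread, and the helper names are made up for illustration:

```python
import os
import struct

MSR_ADDRESS = 0xC0010292  # the register named in the comment above


def decode_msr(raw):
    """Decode the 8 bytes returned by the msr device into an integer."""
    (value,) = struct.unpack("<Q", raw)
    return value


def read_msr(cpu, address):
    """Read one MSR on one CPU (needs root and the msr kernel module)."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        # The msr device uses the file offset as the register number.
        return decode_msr(os.pread(fd, 8, address))
    finally:
        os.close(fd)
```

Something like `hex(read_msr(0, MSR_ADDRESS))` would print the raw value for CPU 0, which can then be compared before and after changing the firmware option.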

Luckily, this workaround may no longer be needed. While, with default 'BIOS' (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during idle for me, it no longer does so on 5.1.2 (apparently - needs more testing - not on 5.0 either).

Revision history for this message
In , philip.rosvall (philip.rosvall-linux-kernel-bugs) wrote :

(In reply to Moritz Naumann from comment #592)
> The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power
> Supply Idle Control" option off the configuration UI; downgrading is not
> supported (but effectively possible at least from Windows). It is still
> possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g.
> /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
> during boot.
>
> Luckily, this workaround may no longer be needed. While, with default 'BIOS'
> (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during
> idle for me, it no longer does so on 5.1.2 (apparently - needs more testing
> - not on 5.0 either).

Since 5.0.10, and this commit ...:

(https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.10)
"commit 205c53cbe553c9e5a9fe93f63e398da7e59124b6
Author: Thomas Gleixner <email address hidden>
Date: Sun Apr 14 19:51:06 2019 +0200

    x86/speculation: Prevent deadlock on ssb_state::lock

    commit 2f5fb19341883bb6e37da351bc3700489d8506a7 upstream.

    Mikhail reported a lockdep splat related to the AMD specific ssb_state
    lock:

      CPU0 CPU1
      lock(&st->lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&st->lock);
      <Interrupt>
         lock(&(&sighand->siglock)->rlock);

      *** DEADLOCK ***

    The connection between sighand->siglock and st->lock comes through seccomp,
    which takes st->lock while holding sighand->siglock.

    Make sure interrupts are disabled when __speculation_ctrl_update() is
    invoked via prctl() -> speculation_ctrl_update(). Add a lockdep assert to
    catch future offenders.

    Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD")"

... the problem seems to be fixed! I no longer need to use idle=halt and the systems don't freeze anymore!

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

(In reply to Philip Rosvall from comment #593)
> Since 5.0.10, and this commit ...:
>
> (https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.10)
> "commit 205c53cbe553c9e5a9fe93f63e398da7e59124b6
> Author: Thomas Gleixner <email address hidden>
> Date: Sun Apr 14 19:51:06 2019 +0200
>
> x86/speculation: Prevent deadlock on ssb_state::lock

If that commit really fixes the "issue" on your machine then you
shouldn't have been experiencing any softlockups etc, i.e., what this
bugzilla is about, but rather a software deadlock.

Which would mean that your machine is not necessarily affected by the
entering into a C-state with MWAIT and not waking up after, issue.

Revision history for this message
In , exander77 (exander77-linux-kernel-bugs) wrote :

This bug bugs me so much that I registered on the kernel bugzilla.

I tried many workarounds, nothing solved it.

Machine: ThinkPad A485 with AMD 2500U.
System:
Distributor ID: Ubuntu
Description: Ubuntu 19.04
Release: 19.04
Codename: disco
Kernel: Linux kni 5.0.0-13-generic #14-Ubuntu SMP Mon Apr 15 14:59:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

And the problem is still there.

I have the idle=nomwait kernel parameter set, and have tried many others: disabling C6 states, everything.

I can reproduce this with 100% success by running my accounting software Flexibee (https://www.flexibee.eu/podpora/stazeni-flexibee/) which has a PostgreSQL backend. When it starts PostgreSQL, soft lockups start to appear, I get freezes, and finally the system completely freezes after 30 or so seconds. I can even reproduce it in VirtualBox: starting the software in VirtualBox crashes the whole system (the host system!).

I can offer help with debugging this problem. This problem needs to go away. It is not present on Windows as far as I know.

Revision history for this message
In , i (i-linux-kernel-bugs) wrote :

(In reply to Liu Liu from comment #585)
> Created attachment 282319 [details]
> Threadripper 2920x soft lockup
>
> I've went through this thread, and tried to disable c6 state, add idle=halt
> with no effect. Followed this excellent reprod steps in
> https://www.reddit.com/r/Amd/comments/apw8im/
> ryzen_freezes_in_linux_even_if_linux_is_in_vm/ It is pretty consistent.
> However, switching to gcc-8 seems resolved this for this particular reprod
> steps. With dmesg -w, I was able to capture and verify this is indeed a soft
> lockup. Attached the dmesg output.
>
> It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest
> BIOS on Ubuntu 18.04.2 LTS.
>
> Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux

Some updates since I last posted. I've updated to gcc-8 and enabled idle=halt. idle=halt + gcc-7 with the original reprod steps can still cause a lockup, but by defaulting to gcc-8 and idle=halt, in day-to-day use, I haven't encountered any system lockup in the past 2 months. I conclude that idle=halt should mitigate this problem for normal use.

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to Liu Liu from comment #597)
--- snip --
> Some updates since I last posted. I've updated to gcc-8 and enabled
> idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> uses, I haven't encountered any system lockup in the past 2 months. I
> concluded that idle=halt should mitigate this problem for normal uses.

Are you saying that you compiled kernel with gcc-8 or that you use gcc-8 in your day to day work?

On my system I get a lockup if I do not use idle=nomwait. Like everyone else I too have tried various combinations to be stable (Ryzen 2500U laptop). As far as I can tell this bug is not really fixed except on Microsoft Windows (and who knows what they are doing). Regardless, things seem stable now, but I cannot use suspend (a different bug).

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

This embarrassing problem persists, will it be fixed on the new 3000 series, due in the next few months? Does anyone out there have access to this hardware?

Very close to the 2 year anniversary of this issue and there's still no 100% clear solution outlining how to fix the problem.

Revision history for this message
In , exander77 (exander77-linux-kernel-bugs) wrote :

It may be originally a hardware issue, but I am sure it does not affect Windows, so it should be fixable in kernel code. I am not even sure if the problem was present on Windows at all.

I can reproduce it with 100% success anytime I want.

Revision history for this message
In , i (i-linux-kernel-bugs) wrote :

(In reply to JerryD from comment #598)
> (In reply to Liu Liu from comment #597)
> --- snip --
> > Some updates since I last posted. I've updated to gcc-8 and enabled
> > idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> > still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> > uses, I haven't encountered any system lockup in the past 2 months. I
> > concluded that idle=halt should mitigate this problem for normal uses.
>
> Are you saying that you compiled kernel with gcc-8 or that you use gcc-8 in
> your day to day work?
>
> On my system I get a lockup if I do not use idle=nomwait. Like everyone else
> I too have tried various combinations to be stable. (Ryzen 2500U Laptop) As
> far as I can tell this bug is not really fixed except on Microsoft Windows
> (and who knows what they are doing). Regardless things seem stable now but I
> can not use suspend (a different bug)

gcc-8 for my day-to-day work.

Revision history for this message
In , Baka.Kafka (baka.kafka-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #599)
> This embarrassing problem persists, will it be fixed on the new 3000 series,
> due in the next few months? Does anyone out there have access to this
> hardware?
>
>
> Very close to the 2 year anniversary of this issue and there's still no 100%
> clear solution outlining how to fix the problem.

No, it won't be fixed, I GUESS, because I have mailed this issue and the link to this thread to many hardware review sites to get attention, and to AMD, and not a single one even bothered to reply, let alone have a look at it.

If they don't even want to recognize this as a bug, why fix it?

Revision history for this message
In , itanium_de (itaniumde-linux-kernel-bugs) wrote :

(In reply to Liu Liu from comment #597)
> Some updates since I last posted. I've updated to gcc-8 and enabled
> idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> uses, I haven't encountered any system lockup in the past 2 months. I
> concluded that idle=halt should mitigate this problem for normal uses.

I have a Ryzen Threadripper 2990WX setup and experience similar random lockups to those you describe! Setting idle=halt improved but did not entirely fix the problem for me. When I do a "btrfs scrub" on my NVMe SSD with 250 GB of data I can get reproducible freezes even with idle=halt set. On average every second btrfs scrub over the 250 GB of data would cause a freeze, so this was a very effective way to reproduce the error condition. (Scrubbing twice through 250 GB of data takes only about 3 minutes on my system.)

The "good" news is that since I changed the idle parameter to idle=poll (which obviously burns off electricity like crazy) I can now do many (20+) btrfs scrub runs in a row without provoking any lockups and so far the system runs stable.
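A repeated scrub like this can be scripted so the run count is recorded. A minimal sketch, assuming btrfs-progs is installed and the script runs with enough privileges; the mount point and the `run_scrubs` helper are hypothetical, and the `runner` parameter exists only so the loop can be exercised without touching a real filesystem:

```python
import subprocess


def scrub_command(mount_point):
    # -B keeps "btrfs scrub start" in the foreground until the scrub finishes.
    return ["btrfs", "scrub", "start", "-B", mount_point]


def run_scrubs(mount_point, runs, runner=subprocess.run):
    """Run the scrub up to `runs` times; return how many completed cleanly."""
    completed = 0
    for i in range(runs):
        result = runner(scrub_command(mount_point), check=False)
        if result.returncode != 0:
            break
        completed += 1
        # Flush so the last finished run is visible even if the box freezes.
        print("scrub %d/%d finished" % (i + 1, runs), flush=True)
    return completed
```

If the machine locks up mid-run the function never returns, so the last "finished" line on the console (or in a log captured over ssh) shows how many scrubs it survived.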

Maybe someone here who also has a btrfs on a SSD drive can try if they can reproduce the freezes with it in the same way. The btrfs scrub may be the more "evil" workload compared to parallel kernel compilations with gcc-7/8.

A final thought is that this thread may report two independent issues. There is the "completely unstable system issue" that initially froze up during two out of three boot cycles, which for me can be solved by setting idle=nomwait and disabling C6 states in the BIOS.

And then there is this random freeze once-in-a-while issue that only is resolved so far by going to idle=poll.

Revision history for this message
In , arup.chowdhury (arup.chowdhury-linux-kernel-bugs) wrote :

AMD issued a new ucode a few days back. Is anyone using it? It updated automatically on my Arch system. I haven't had the lockup issue on my ASUS B350 board with the latest BIOS and the system idle current option set.

Revision history for this message
In , itanium_de (itaniumde-linux-kernel-bugs) wrote :

(In reply to Arup from comment #604)
> AMD issued a new ucode few days back. Anyone using that, it updated
> automatically on my Arch system.
For me this update that came yesterday on Arch did not change the microcode version at all! Seems like there is no update for my processor.
Before (and still) running:
model name : AMD Ryzen Threadripper 2990WX 32-Core Processor
microcode : 0x800820b
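Checking the loaded microcode revision before and after an update can be done by reading /proc/cpuinfo. A small sketch, assuming the "microcode" field format shown above; the function name is made up for illustration:

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions reported in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "microcode":
            # Values look like 0x800820b; int() accepts the 0x prefix in base 16.
            revisions.add(int(value.strip(), 16))
    return revisions
```

Calling it on `open("/proc/cpuinfo").read()` normally yields a single value, since all cores should be running the same revision; more than one value would itself be worth reporting.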

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

So Ryzen 3000 is coming, who will be the first guinea pig?

Curious to see if this problem is mysteriously, entirely fixed on the 3000 series.

(I damn well hope so)

Revision history for this message
In , onox (onox-linux-kernel-bugs) wrote :

(In reply to Moritz Naumann from comment #592)
> The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power
> Supply Idle Control" option off the configuration UI; downgrading is not
> supported (but effectively possible at least from Windows). It is still
> possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g.
> /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
> during boot.
>
> Luckily, this workaround may no longer be needed. While, with default 'BIOS'
> (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during
> idle for me, it no longer does so on 5.1.2 (apparently - needs more testing
> - not on 5.0 either).

Hi, have you been able to do more testing? Does it still not freeze for you with v5.50? You're the second person I have found saying that v5.50 removes the freezes with the default settings. I also use the ASRock X370 Taichi and the "Power Supply Idle Control" option, but I'm still on v4.70.

However, since my computer also freezes on Windows when idle without changing that option, I now think that this is probably a different problem than the bug in this discussion. I actually found other people on forums who also have experienced those freezes on Windows, so I’m not an isolated case. They don’t all have ASRock boards, but there might also be something specific to ASRock. I have not really tested this yet, but some time ago I was told that at least on the ASRock boards, the default SOC voltage of 0.9 V is too low, and that raising it to at least 1.0 V would prevent the freezes (but don't go higher than 1.1 V).

I'm probably going to raise that voltage anyway because I would like to increase the frequency of my RAM and lower the timings, if the BIOS would let me do so properly and keep the settings… The v4.x BIOSes for this board are known to have serious bugs in this respect [http://forum.asrock.com/forum_posts.asp?TID=9371&KW=ram+worse&title=psa-stay-away-from-480-bios-x370-taichi], so I will probably take the risk of upgrading to v5.50 soon, too.

Meanwhile, AMD replaced my faulty 1700X CPU, and the replacement I received (same CPU model, but produced much later) seems not to have the segfault bug, although I should test it more in order to be completely sure. I must say that AMD was very nice in the RMA process, I really recommend doing the RMA directly with them if you have the segfault bug.

Revision history for this message
In , bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to onox from comment #607)

To prevent misunderstandings, let me repeat what I previously wrote, from a different angle: ASRock X370 Taichi firmware version v5.50 is worse than previous versions in that it removes the "Power Supply Idle Control" option from the configuration screen. As a result, with Linux 4.18 (Ubuntu kernel 4.18.0-20-generic) and below the system *will* lock up unless different countermeasures are taken (namely setting the MSR).

That said, I have been running this single system 24/7 without issues, despite extended low load periods, on Ubuntu 18.04 LTS kernel 5.0.0-15-generic with microcode level 0x08001137 since May 15 (and from May 11 to May 15 on mainline Linux 5.1.2 with the same microcode) with ASRock's firmware configuration defaults and without setting the MSR.

Summing up, the latest ASRock firmware for this mainboard takes a step in the wrong direction for Linux users, however newer Linux versions prevent this system from locking up.

MCExtractor is aware of a newer microcode, 0x08001138 (seen on 2019-02-04) for CPUID 0x00800f11, which ASRock decided not to include in their firmware.

Revision history for this message
In , weirdconstructor (weirdconstructor-linux-kernel-bugs) wrote :

I think I experience the same problem as described here with my recently built Ryzen computer. After installing a fresh Linux distribution I started out with Linux kernel 4.19.45. The computer ran for two days continuously; after a short pause I came back and saw it had completely frozen. The screen showed the last visible image, but no input worked. Neither ping/ssh nor mouse or keyboard worked, not even the Num Lock LED. There were no log entries written, either on screen or on the hard disk. The last entry was some cron job cleaning up temporary directories. The workload was just vim, a VNC viewer and Firefox. The next night, the same freeze happened. That finally got me to dig deeper.

As a first measure I updated the kernel to 5.1.4. After running for 2 hours, without much load, it also froze.

I dug deeper, and as a next measure I changed the BIOS setting from "Auto" to "Typical current Idle". Uptime is now 1 day, without any workload except a chat client.

This is my System:

# CPU:
    cpu family : 23
    model : 1
    model name : AMD Ryzen 5 1600 Six-Core Processor
    stepping : 1
    microcode : 0x8001137
    cpu MHz : 1646.140

# Mainboard:
    Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F5 01/25/2019

# GPU:
    NVIDIA GP107 [GeForce GTX 1050 Ti]
    Driver/OpenGL: 4.6.0 NVIDIA 418.74

# OS:
    Distro: Manjaro
    Kernel: 5.1.4-1-MANJARO x86_64

I also see this in my bootup log:
12 times: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
12 times: microcode: CPU0: patch_level=0x0800113

As I said, my computer has now been running for 1 day without any load with the BIOS option "Typical current Idle" set. This is so far the same "workaround" that brodyck wrote about at 2019-04-01 15:40:26 UTC. Unfortunately, or fortunately for him, he has not written anything since then. That might mean that the problems stopped for him at that point.
I will further observe this issue with my new workstation. If you don't hear back from me, the BIOS setting very likely worked.

Revision history for this message
In , kernelbugs (kernelbugs-linux-kernel-bugs) wrote :

As I mention around Comment 485 and later (starting Jan 22), for us we have nearly the same computer and after setting BIOS to current idle as you did, all of our problems disappeared. Haven't messed with it since. That's 4 months of uptime, zero hangs, zero crashes.

I'd bet everyone who was here and said they'd try the BIOS tweak and has not reported back falls into the same camp. Everyone else is dealing with more or other problems, probably.

As for Ryzen 3000 series, AMD better have fixed this !@%!^# problem! Since it looks like 3000 will do ECC nicely, I'll be building another box with it in early July.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

My two (server) machines running 1600's are also not experiencing problems any more using new BIOSes + 4.19 kernels + "Typical current bullshit". Hence I tend to agree that for most people the issue is solved.

However Trevor I'm mostly posting to let you know ECC works fine, at least with ASRock boards (of which the spec actually features ECC support, e.g. mine https://www.asrock.com/MB/AMD/AB350M%20Pro4/index.asp#Specification), even under Ryzen 1xxx CPU's:

# dmesg | grep -i edac
[ 0.057169] EDAC MC: Ver: 3.0.0
[ 9.926945] EDAC amd64: Node 0: DRAM ECC enabled.
[ 9.926946] EDAC amd64: F17h detected (node 0).
[ 9.926985] EDAC MC: UMC0 chip selects:
[ 9.926986] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.926987] EDAC amd64: MC: 2: 4096MB 3: 0MB
[ 9.926988] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.926989] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.926991] EDAC MC: UMC1 chip selects:
[ 9.926991] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.926992] EDAC amd64: MC: 2: 4096MB 3: 0MB
[ 9.926992] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.926993] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.926993] EDAC amd64: using x8 syndromes.
[ 9.926994] EDAC amd64: MCT channel count: 2
[ 9.927095] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
[ 9.927104] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 9.927105] AMD64 EDAC driver v3.5.0

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
edac-util: No errors to report.
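The same counters that edac-util summarises are also exposed through sysfs, which is handy for periodic monitoring. A short sketch, assuming the EDAC driver is loaded and the per-controller `ce_count` and `ue_count` files are present under the usual sysfs path; `edac_counts` is an illustrative name, and the `root` argument is only there so the function can be pointed at a test directory:

```python
from pathlib import Path


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC sysfs tree."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        corrected = int((mc / "ce_count").read_text())
        uncorrected = int((mc / "ue_count").read_text())
        counts[mc.name] = (corrected, uncorrected)
    return counts
```

A non-zero corrected count on an otherwise stable machine is the kind of evidence that ECC is doing real work; an empty result suggests the EDAC driver did not bind.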

See also:
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive.html

Revision history for this message
In , grz3chu (grz3chu-linux-kernel-bugs) wrote :

AMD has released a new version of AGESA Combo-AM4 1.0.0.3 for my motherboard MSI B450 Gaming Plus and others. There were also updates of the CPU microcode.

The first (#1) on the list for example is Ryzen 2600, old and new bios, comparison of the microcode version:

OLD bios
╔═════════════════════════════════════════════════════════════════╗
║ AMD ║
╟────┬──────────┬──────────┬────────────┬───────┬──────────┬──────╢
║ # │ CPUID │ Revision │ Date │ Size │ Offset │ Last ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 1 │ 00800F82 │ 0800820B │ 2018-06-20 │ 0xC80 │ 0x4DD000 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 2 │ 00800F12 │ 08001230 │ 2018-08-04 │ 0xC80 │ 0x4DDD00 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 3 │ 00800F11 │ 08001137 │ 2018-02-14 │ 0xC80 │ 0x4DEA00 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 4 │ 00800F10 │ 0800100C │ 2017-01-31 │ 0xC80 │ 0x4DF700 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 5 │ 00800F00 │ 0800002A │ 2016-10-06 │ 0xC80 │ 0x4E0400 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 6 │ 00810F10 │ 0810100B │ 2018-02-12 │ 0xC80 │ 0x65A500 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 7 │ 00820F00 │ 08200002 │ 2018-02-14 │ 0xC80 │ 0x65B200 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 8 │ 00810F00 │ 08100004 │ 2016-11-20 │ 0xC80 │ 0x65BF00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 9 │ 00810F80 │ 08108002 │ 2018-06-05 │ 0xC80 │ 0x65CC00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 10 │ 00810F81 │ 08108102 │ 2018-08-13 │ 0xC80 │ 0x65D900 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 11 │ 00810F11 │ 08101102 │ 2018-11-06 │ 0xC80 │ 0x65E600 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 12 │ 00660F00 │ 06006012 │ 2014-10-14 │ 0xA20 │ 0xD97F50 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 13 │ 00660F01 │ 0600611A │ 2018-01-26 │ 0xA20 │ 0xD98970 │ Yes ║
╚════╧══════════╧══════════╧════════════╧═══════╧══════════╧══════╝

NEW bios

╔═════════════════════════════════════════════════════════════════╗
║ AMD ║
╟────┬──────────┬──────────┬────────────┬───────┬──────────┬──────╢
║ # │ CPUID │ Revision │ Date │ Size │ Offset │ Last ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 1 │ 00800F82 │ 0800820D │ 2019-04-16 │ 0xC80 │ 0x50D000 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 2 │ 00800F12 │ 08001250 │ 2019-04-16 │ 0xC80 │ 0x50DD00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 3 │ 00800F11 │ 08001138 │ 2019-02-04 │ 0xC80 │ 0x50EA00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 4 │ 00800F10 │ 0800100C │ 2017-01-31 │ 0xC80 │ 0x50F7...

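Comparing two tables like the ones above by eye is error-prone, so the rows can be parsed and diffed instead. A rough sketch, assuming the box-drawn layout shown in this comment with "│" as the column separator; `parse_rows` and `changed` are hypothetical helpers, not part of MCExtractor:

```python
def parse_rows(table_text):
    """Map CPUID -> microcode revision from box-drawn table rows."""
    revisions = {}
    for line in table_text.splitlines():
        cells = [c.strip() for c in line.strip("║ ").split("│")]
        # Data rows start with a row number; header and border rows do not.
        if len(cells) >= 3 and cells[0].isdigit():
            revisions[cells[1]] = int(cells[2], 16)
    return revisions


def changed(old, new):
    """Return {cpuid: (old_revision, new_revision)} for revisions that differ."""
    return {
        cpuid: (old[cpuid], new[cpuid])
        for cpuid in old.keys() & new.keys()
        if old[cpuid] != new[cpuid]
    }
```

Applied to the rows visible above, this flags CPUIDs 00800F82, 00800F12 and 00800F11 as updated and leaves 00800F10 out, since its revision is the same in both tables.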

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

So when do we get our first post, with the same problems, from X570 users, with Ryzen 3600 / 3600x / 3800 etc?

I'm very curious to know if this issue magically is gone on the next series.

Revision history for this message
In , spleefer90 (spleefer90-linux-kernel-bugs) wrote :

No workarounds fixed my 2400G lockups.
I RMAd the CPU (got a new one) and motherboard (because the firmware killed itself) and got it repaired.

Since then I haven't had a single lockup with all the workarounds disabled, so I'd say that CPUs manufactured before a certain date possibly had defects, or my unit was just defective.

Revision history for this message
In , kernel (kernel-linux-kernel-bugs) wrote :

FWIW: I have an ASUS Prime 370-Pro, Ryzen 7 1800X, currently running Kernel v5.0.17.

With "Typical Current Idle", and no kernel tweaks, that has run without "freezing while idle" for about a year.

Yesterday I upgraded the BIOS, which for some reason reset the "Power Supply Idle Control" to "Auto". That prompted me to run a small experiment to see if kernel or other upgrades have improved things. They have not: the machine "froze while idle" (fairly promptly). I have now restored "Typical Current Idle", in the hope of another untroubled year.

I note, however, that my cpu (0x00800f11) is running with microcode 0x08001137, not the more recent 0x08001138 -- if I am reading the table in Comment 612, above, correctly.

Revision history for this message
In , kernelorg (kernelorg-linux-kernel-bugs) wrote :

No 3000 owners yet? Is it safe to buy?

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #616)
> No 3000 owners yet? Is it safe to buy?

Google for linux benchmarks, etc. You will find people are running these things. There is one bios update push I read about regarding some windows related thing. I plan to buy a 3000 series setup soon. No fear here. I noticed HP just issued a new bios for my 2500U laptop (F.21), it has been running fine on F.20 and likewise on the upgraded, so I suspect it was a 3000 related update. They are also selling complete systems as well.

Revision history for this message
In , cameron.g.rodgers (cameron.g.rodgers-linux-kernel-bugs) wrote :

I have to kind of wonder, doesn't AMD like, want to sell these chips (in Epyc) to server farms? Ones that will run Linux? Or do Epyc/TR just not have this problem? My experience with this bug has been a bit bizarre, especially since AMD's GPUs have really great Linux support these days.

My report:
- Ryzen 1600
- CORSAIR Vengeance LPX 16GB (2 x 8GB) 288-Pin DDR4 SDRAM DDR4 3000 (PC4 24000)
- Gigabyte AB-350 Gaming 3
- RX 460 Sapphire 2GB
- TP-Link TL-WN722N USB Wifi -- This is its own bag of worms
- misc: SATA SSDs, HDDs
I've had this crash problem ever since I built this PC, HOWEVER for a very long time, the LTS kernel 4.14 didn't have the crashing/hanging/random reboot. Usually, if I installed a new kernel, that kernel would be stable for a week or so (including lots of suspend/unsuspend) and then develop the crashing problem. Meaning I mainly just stayed on 4.14. But today was a special day, because 4.14 developed the problem as well now!
IDK if this has been discussed further up already, but I'd like to highlight two of those points:
- freshly installed kernels don't have the problem at first? what could cause this?
- 4.14 didn't have it either? Again, what did they change to break it?
Moreover, the crashing is WORSE THAN EVER now. Like hourly.

So I came back to this thread and I'm trying disabling Global C-States, as my older motherboard firmware doesn't have the other options that have been mentioned ITT. I'll wait and see if that solves it, and if not, I'll try the python script, and then maybe updating my motherboard so I can access more settings.

Revision history for this message
In , angelforest (angelforest-linux-kernel-bugs) wrote :

Currently running ASRock AB350M Pro4 + Ryzen 1600
Ever since I updated BIOS to 5.50 and set "Typical Current Idle" setting (couple months ago) I haven't had a single hang.

Brad Figg (brad-figg)
tags: added: cscc
Revision history for this message
In , albertogomezmarin (albertogomezmarin-linux-kernel-bugs) wrote :

(In reply to OptionalRealName from comment #616)
> No 3000 owners yet? Is it safe to buy?

I have one, but I think I am going to return the laptop. It works as expected, but in terms of battery life Ryzen on Linux sucks.

PS: Windows draws 5 W at idle.
Linux draws 12 W at idle on the same laptop with kernels 5.0, 5.1, 5.2, 5.3... That is a BIG problem for laptops. I don't think I am going to buy another laptop with Ryzen. I don't know if I am the only person who has noticed this power consumption with a Ryzen mobile...
Another thing: Linux only recognizes 3 idle states:
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 1270
Duration: 2635
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 60347
Duration: 17870643
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 400
Usage: 22640
Duration: 59696920
I don't know whether these are integrated GPU problems or CPU problems, but I think they are CPU ones for many reasons. One of them is that lscpu and cpupower tell me the hardware limits are 1.4 GHz-2.1 GHz, IN A LAPTOP, YEAH; Windows reports 400 MHz as the minimum.
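For anyone who wants to compare idle-state numbers between machines, here is a minimal Python sketch that parses `cpupower idle-info` text like the output above and reports what share of the recorded idle time each state accounts for. The function names `parse_idle_info` and `residency_share` are made up for illustration, and the parser assumes the layout shown in this comment (a bare `NAME:` line followed by `Latency:`, `Usage:` and `Duration:` lines); other cpupower versions may format things differently.

```python
import re

# Sample taken from the cpupower idle-info output quoted in this comment.
SAMPLE = """\
CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 1270
Duration: 2635
C1:
Flags/Description: ACPI FFH MWAIT 0x0
Latency: 1
Usage: 60347
Duration: 17870643
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 400
Usage: 22640
Duration: 59696920
"""


def parse_idle_info(text):
    """Return {state: {"latency": int, "usage": int, "duration": int}}."""
    states = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # A state block starts with a bare "NAME:" line (no spaces, nothing after the colon).
        header = re.fullmatch(r"([A-Za-z0-9_-]+):", line)
        if header:
            current = header.group(1)
            states[current] = {}
            continue
        field = re.fullmatch(r"(Latency|Usage|Duration):\s*(\d+)", line)
        if field and current is not None:
            states[current][field.group(1).lower()] = int(field.group(2))
    return states


def residency_share(states):
    """Fraction of the summed Duration values that each state accounts for."""
    total = sum(s.get("duration", 0) for s in states.values())
    if total == 0:
        return {name: 0.0 for name in states}
    return {name: s.get("duration", 0) / total for name, s in states.items()}


if __name__ == "__main__":
    for name, share in residency_share(parse_idle_info(SAMPLE)).items():
        print(f"{name}: {share:.1%}")
```

This only summarizes the counters; it says nothing about whether deeper package states exist on the hardware but are hidden from the OS.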

Revision history for this message
In , spleefer90 (spleefer90-linux-kernel-bugs) wrote :

My experience -
Bought a 2400G and a 3600. Both CPUs had lockups, and neither UEFI settings nor messing with C-states helped.
What helped was an RMA in both cases; the RMA'd CPUs do not have lockups.

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

Asrock Fatal1ty X370 Gaming-ITX/ac
Ryzen 1700

After the most recent Asrock BIOS update I haven't had any more crashing. I do not disable C6 anymore. This was a BIOS update to support the new Ryzen 3xxx chips as I was eyeing the 3900x and wanted the motherboard to support said chip if I had a weak moment and actually bought a 3900x.

Revision history for this message
In , zheilbron (zheilbron-linux-kernel-bugs) wrote :

As @C0rn3j mentioned, I'm in a similar boat. With a Ryzen 3700x, I continue to get the following errors:

```
rcu: INFO: rcu_sched self-detected stall on CPU
...
watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:1:94]
```

I've tried just about every workaround that I've read about:
* Disable C5/C6 states (from BIOS, from zenstates.py)
* Set "Typical Current Idle" in BIOS
* Add a combination of kernel parameters (idle=nomwait, noapci, acpi=off, rcu_nocbs=0-15, processor.max_cstate=1)

Nothing works.
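When juggling this many boot flags it is easy to lose track of which ones actually made it onto the running kernel's command line. Below is a minimal Python sketch, assuming the flags are compared as exact whitespace-separated tokens (quoted values containing spaces are not handled). The helper name `active_workarounds` is hypothetical.

```python
def active_workarounds(cmdline, wanted):
    """Map each wanted boot flag to True/False depending on whether it
    appears as an exact token in the given kernel command line."""
    tokens = set(cmdline.split())
    return {flag: (flag in tokens) for flag in wanted}


# Example command line; on a live system the text would come from /proc/cmdline.
EXAMPLE_CMDLINE = (
    "BOOT_IMAGE=/vmlinuz-4.19.0-5-amd64 ro quiet "
    "idle=nomwait processor.max_cstate=1"
)

if __name__ == "__main__":
    wanted = ["idle=nomwait", "rcu_nocbs=0-15", "processor.max_cstate=1"]
    for flag, present in active_workarounds(EXAMPLE_CMDLINE, wanted).items():
        print(f"{flag}: {'active' if present else 'missing'}")
```

On a real machine, pass `open("/proc/cmdline").read()` as the first argument instead of the example string.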

System specs:
Linux ... 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64
MSI B450 Gaming Pro Carbon AC (BIOS version 7B85v18)
Ryzen 7 3700X

Revision history for this message
In , zheilbron (zheilbron-linux-kernel-bugs) wrote :

(In reply to zheilbron from comment #624)
> As @C0rn3j mentioned, I'm in a similar boat. With a Ryzen 3700x, I continue
> to get the following errors:
>
> ```
> rcu: INFO: rcu_sched self-detected stall on CPU
> ...
> watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [kworker/1:1:94]
> ```
>
> I've tried just about every workaround that I've read about:
> * Disable C5/C6 states (from BIOS, from zenstates.py)
> * Set "Typical Current Idle" in BIOS
> * Add a combination of kernel parameters (idle=nomwait, noapci, acpi=off,
> rcu_nocbs=0-15, processor.max_cstate=1)
>
> Nothing works.
>
> System specs:
> Linux ... 4.19.0-5-amd64 #1 SMP Debian 4.19.37-5+deb10u2 (2019-08-08) x86_64
> MSI B450 Gaming Pro Carbon AC (BIOS version 7B85v18)
> Ryzen 7 3700X

It seems that leaving the machine idle does not produce the issue. However, connecting over Wireguard + SSH (which is how I had been accessing the machine) seems to cause the issue to manifest. After following the advice here (https://bbs.archlinux.org/viewtopic.php?id=247900), I've added the `nordrand` boot flag. This seems to be keeping the issue at bay so far. Perhaps I was chasing the wrong issue!

Time will tell... I'll report back with any new findings or if it remains stable for some time.
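For others trying the same flag, a quick way to see whether the kernel still advertises RDRAND is to look at the `flags` line in `/proc/cpuinfo`. The Python sketch below only parses that text; the function name `cpu_flags` is made up for illustration. My understanding is that `nordrand` clears the feature bit so the flag should disappear from the list, but treat that as an assumption and verify it on your own kernel.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line
    of /proc/cpuinfo-style text (empty set if there is none)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Shortened, made-up example of /proc/cpuinfo content for illustration only.
EXAMPLE_CPUINFO = "processor\t: 0\nflags\t\t: fpu vme aes rdrand rdseed\n"

if __name__ == "__main__":
    flags = cpu_flags(EXAMPLE_CPUINFO)
    print("rdrand advertised:", "rdrand" in flags)
```

On a live system, feed it `open("/proc/cpuinfo").read()` before and after adding the boot flag and compare.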

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

--- snip ---
> It seems that leaving the machine idle does not produce the issue. However,
> connecting over Wireguard + SSH (which is how I had been accessing the
> machine) seems to cause the issue to manifest. After following the advice
> here (https://bbs.archlinux.org/viewtopic.php?id=247900), I've added the
> `nordrand` boot flag. This seems to be keeping the issue at bay so far.
> Perhaps I was chasing the wrong issue!
>
> Time will tell... I'll report back with any new findings or if it remains
> stable for some time.

I have just last week installed Fedora 30 on a new Ryzen 3600X. The system would not boot due to a bug in systemd which has been fixed, but the FedoraLive image used to install does not have this fix. Fortunately, this is fixed in a respin of the live image that one can use. It has been running to perfection for my needs now for over a week without any kernel boot parameters needed. Clean as a whistle. So, your fix ought to be good.

Revision history for this message
In , to_load (toload-linux-kernel-bugs) wrote :

The 3rd generation of Zen has been released and this strange thing still persists. So "No fix planned" was not a lie...

Kernel parameter idle=halt fixed my problem for R3 2200u (no lockups for 7 months). You may give it a try.
...just to remind.

Revision history for this message
In , pmenzel (pmenzel-linux-kernel-bugs) wrote :

Dear Madam or Sir,

Thank you for your message, which I will read after my return on August 27th, 2019.

Kind regards,

Paul Menzel

Revision history for this message
In , zheilbron (zheilbron-linux-kernel-bugs) wrote :

Reporting back as I said I would: with `nordrand` set as a kernel boot parameter, the system is stable.

Revision history for this message
In , schweinefilet (schweinefilet-linux-kernel-bugs) wrote :

(In reply to zheilbron from comment #630)
> Reporting back as I said I would: with `nordrand` set as a kernel boot
> parameter, the system is stable.

Are you using nordrand in combination with other workarounds (BIOS settings, ZenStates)?
My hardware specs are identical and my system still freezes using nordrand.

Ryzen 3700X
MSI B450 Gaming Pro Carbon AC (BIOS version 7B85v17, not v18!)
Linux 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 13:51:02 UTC 2019 x86_64

On the other hand, MSI could have fixed these problems with the new BIOS.

Revision history for this message
In , zheilbron (zheilbron-linux-kernel-bugs) wrote :

I'm not using any other workarounds in conjunction with `nordrand`. Perhaps try upgrading to the new BIOS. Note also that my kernel version differs from yours, but it seems unlikely (though possible) that there's a regression across those versions.

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

My system reboots randomly, mostly at idle. I have tried the 'powersupply idle' UEFI option, but it didn't work. I have changed motherboard, RAM and PSU, but it still reboots/crashes, doesn't matter if it's overclocked or not. The system is not reliable.

Currently an ASUS CROSSHAIR VI HERO, BIOS 7201 07/12/2019.
The previous motherboard was a Gigabyte Aorus Ultra Gaming X470 with BIOS F40.

Now I'm waiting for the next reboot (if it happens) after disabling C6 global states in UEFI (core C6 as per zenstates.py). The next step would be to flash a new BIOS file containing AGESA 1.0.0.3ABB. Comment https://bugzilla.kernel.org/show_bug.cgi?id=196683#c623 reports no more stability problems with a BIOS providing that AGESA version.

Has anyone used the xHCI debug capability to at least know what happens just before the crash/reboot?
https://www.kernel.org/doc/html/latest/driver-api/usb/usb3-debug-port.html

AFAIK, the chipset should support DbC, as the file "dbc" is listed under the xHCI device node in sysfs and "enable" can be echoed to it.
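To find which controllers expose that file, something like the following Python sketch could be used. It assumes the `dbc` attribute sits directly under each PCI device directory in `/sys/bus/pci/devices`, which is where the kernel documentation linked above shows it; the helper name `find_dbc_nodes` is hypothetical. It only reads the current state and does not enable anything.

```python
from pathlib import Path


def find_dbc_nodes(pci_root="/sys/bus/pci/devices"):
    """Return {pci_address: current dbc state} for every PCI device
    directory under pci_root that contains a 'dbc' attribute file."""
    nodes = {}
    for dbc in sorted(Path(pci_root).glob("*/dbc")):
        try:
            nodes[dbc.parent.name] = dbc.read_text().strip()
        except OSError as exc:
            # Reading may fail without sufficient privileges.
            nodes[dbc.parent.name] = f"unreadable: {exc}"
    return nodes


if __name__ == "__main__":
    found = find_dbc_nodes()
    if not found:
        print("no xHCI device with a dbc attribute found")
    for address, state in found.items():
        print(f"{address}: {state}")
```

Enabling DbC is then a matter of writing "enable" to the file for the chosen controller as root, as described in the linked documentation.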

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

(In reply to raulvior.bcn from comment #634)

Ah, yes. I forgot to add kernel version:
Linux 5.0.0-25-generic #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

> My system reboots randomly, mostly at idle. I have tried the 'powersupply
> idle' UEFI option, but it didn't work. I have changed motherboard, RAM and
> PSU, but it still reboots/crashes, doesn't matter if it's overclocked or
> not. The system is not reliable.
>
> Currently an ASUS CROSSHAIR VI HERO, BIOS 7201 07/12/2019.
> The previous motherboard was a Gigabyte Aorus Ultra Gaming X470 with BIOS
> F40.
>
> Now I'm waiting for the next reboot (if it happens) after disabling C6
> global states in UEFI (core C6 as per zenstates.py). The next step would be
> to flash a new BIOS file containing AGESA 1.0.0.3ABB. Comment
> https://bugzilla.kernel.org/show_bug.cgi?id=196683#c623 reports no more
> stability problems with a BIOS providing that AGESA version.
>
> Has anyone used the xHCI debug capability to at least know what happens just
> before the crash/reboot?
> https://www.kernel.org/doc/html/latest/driver-api/usb/usb3-debug-port.html
>
> AFAIK, the chipset should support DbC, as the file "dbc" is listed under the
> xHCI device node in sysfs and echo "enable" to it.

Revision history for this message
In , cameron.g.rodgers (cameron.g.rodgers-linux-kernel-bugs) wrote :

(In reply to linuxannoyance from comment #618)
After I posted this comment, I have not had any more random crashes. I disabled Global C-States in my motherboard's firmware, as that was the only option relating to C-states (C-6 states not being available). Based on my experience and other comments in this issue, I believe disabling Global C-States is what has caused my system to stabilize. I recommend this as a viable option for people with older motherboard firmware who don't want to flash it.

I still think this is an embarrassment for AMD. I do not know how they intend to take the server market when they allow an issue like this to persist. Maybe Epyc gets better support. Even so, I am a desktop and laptop Linux user, and I will mention this issue to everyone I meet who is looking for personal hardware to run Linux on.

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

(In reply to raulvior.bcn from comment #635)
ASUSTeK COMPUTER INC. CROSSHAIR VI HERO 7403 08/20/2019
AMD Ryzen 7 1800X Eight-Core Processor
16410MB
2560x1440 pixels
Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-25-generic, LLVM 8.0.0)
Linux 5.0.0-25-generic (x86_64) #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019
GNU C library / (Ubuntu GLIBC 2.29-0ubuntu2) 2.29
Ubuntu 19.04
BOOT_IMAGE=/vmlinuz-5.0.0-25-generic root=UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ro quiet splash vt.handoff=1

With the latest BIOS (which provides AGESA 1.0.0.3ABB), the Power Supply Idle option did work. This option increases core voltage. Knowing that increasing core voltage stabilized the system for more than 48 hours, I suspected the "performance" governor could be problematic, because even though it made the CPU operate at higher frequencies, the voltage was not increased. Values of 0.35, 0.5 or 0.85 V were still showing up despite the minimum frequency being 3 GHz instead of 2 GHz.

I have disabled "power supply idle" UEFI option and started to use the "ondemand" governor, which has a lower minimum frequency of 2 GHz.

The system continues to be stable. This is the longest time I have had the system without reboots.
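To double-check which governor each core is really using, a small Python sketch like the one below can read the standard cpufreq sysfs files. It assumes the usual `cpuN/cpufreq/scaling_governor` layout under `/sys/devices/system/cpu`; the function name `governors` is made up for illustration.

```python
from collections import Counter
from pathlib import Path


def governors(cpu_root="/sys/devices/system/cpu"):
    """Count how many CPUs use each cpufreq governor, based on the
    cpuN/cpufreq/scaling_governor files found under cpu_root."""
    counts = Counter()
    # "cpu[0-9]*" skips sibling directories such as cpufreq/ and cpuidle/.
    for path in Path(cpu_root).glob("cpu[0-9]*/cpufreq/scaling_governor"):
        counts[path.read_text().strip()] += 1
    return dict(counts)


if __name__ == "__main__":
    for name, count in governors().items():
        print(f"{name}: {count} CPU(s)")
```

If more than one governor shows up in the result, the setting was not applied to every core.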

Linux reports the following firmware bugs:
[ 0.065489] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[ 0.586638] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.586749] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.586830] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.586920] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.586983] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587042] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587110] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587163] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587235] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587309] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587378] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587448] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587522] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587590] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587646] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.587699] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

Which effectively turns off MWAIT. cpupower idle-info output:

CPUidle driver: acpi_idle
CPUidle governor: menu
analyzing CPU 0:

Number of idle states: 3
Available idle states: POLL C1 C2
POLL:
Flags/Description: CPUIDLE CORE POLL IDLE
Latency: 0
Usage: 42208
Duration: 98245
C1:
Flags/Description: ACPI HLT
Latency: 1
Usage: 2147693
Duration: 605512928
C2:
Flags/Description: ACPI IOPORT 0x414
Latency: 400
Usage: 4338245
Duration: 71750833391
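Since the firmware-bug lines quoted earlier in this comment repeat many times, a tally is easier to compare across BIOS versions than the raw log. Here is a minimal Python sketch that groups kernel log lines by the text following the `[Firmware Bug]:` marker. The function name `tally_firmware_bugs` is hypothetical, and the sample is a shortened copy of the lines above.

```python
from collections import Counter

MARKER = "[Firmware Bug]:"

# Shortened sample of the kernel log lines quoted in this comment.
SAMPLE_LOG = """\
[ 0.065489] ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
[ 0.586638] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.586749] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
[ 0.586830] [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
"""


def tally_firmware_bugs(log_text):
    """Count kernel log lines per distinct message following '[Firmware Bug]:'."""
    counts = Counter()
    for line in log_text.splitlines():
        _, found, message = line.partition(MARKER)
        if found:
            counts[message.strip()] += 1
    return counts


if __name__ == "__main__":
    for message, count in tally_firmware_bugs(SAMPLE_LOG).most_common():
        print(f"{count:3d} x {message}")
```

Feeding it the full `dmesg` output before and after a BIOS update would show whether the MWAIT complaint went away.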

For my problem, which might not be the same as for the rest of the users here, not having MWAIT did not stop the system from rebooting randomly. And despi...

Revision history for this message
In , onlylys (onlylys-linux-kernel-bugs) wrote :

So since October of last year I was having problems with my Arch + KDE build freezing at idle on my 2700X build. Usually freezes occur while a YouTube or Twitch video is playing. When it freezes, it would often also cause whatever audio that is playing to loop over and over. Journalctl would have no logs of anything at all after the system freezes.

To fix this problem, I had to make sure that both Power Supply Idle was set to 'Typical' and that my VCore undervolt wasn't too aggressive. Both of those had to be satisfied, because for a time I only set Power Supply Idle to 'Typical' while leaving my undervolt at -0.75V, which even though it gave me no problems when the system is stressed (such as running Prime95 for close to 24 hours), it would cause the system to freeze at idle. All freezing stopped once I reduced my undervolt to -0.6875V, and the system has been running for close to 3 months now for 16 hours a day without a single hitch.

Just to be doubly sure that a fix has to require Power Supply Idle being set to 'Typical', I disabled it yesterday (while leaving my undervolt at -0.6875v), and it froze within a day.

TLDR: Setting Power Supply Idle to 'Typical' and making sure that I wasn't running an overly aggressive VCore undervolt was what solved my freezing issues.

My Hardware: Ryzen 2700X, ASUS X470 CROSSHAIR VII with 2304 BIOS, Seasonic Focus+ 80+ Gold 650W Power Supply.

Revision history for this message
In , onlylys (onlylys-linux-kernel-bugs) wrote :

Sorry, just to clarify my above comment (since there doesn't appear to be any way to edit my comment).

1. Freezes usually occur while the system is being left alone, but I noticed that it would also often occur while a YouTube video, Twitch stream or even music (via Clementine) was left playing in the background.

2. My motherboard is actually a ASUS X470 CROSSHAIR VII WI-FI edition with the 2304 BIOS.

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

Anyone here seeing this one: Bug 1738650 - Kernel 5.2.5 graphics unstable

https://bugzilla.redhat.com/show_bug.cgi?id=1738650

Everything is working great as long as I stay on kernel-5.1.20-300.fc30.x86_64

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

(In reply to raulvior.bcn from comment #637)

I got a reboot with the ondemand governor at 1d 14h of uptime.
I have reactivated the Typical Power Supply idle switch. Currently 4 days and 2 hours of uptime. The longest I've ever seen.

The minimum voltage reported by the UEFI is 0.83 V. Using the performance governor the minimum voltage reported is 1.26 V. This corresponds with P2 and P1 P-states voltages.

> (In reply to raulvior.bcn from comment #635)
> ASUSTeK COMPUTER INC. CROSSHAIR VI HERO 7403 08/20/2019
> AMD Ryzen 7 1800X Eight-Core Processor
> 16410MB
> 2560x1440 pixels
> Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-25-generic, LLVM 8.0.0)
> Linux 5.0.0-25-generic (x86_64) #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019
> GNU C library / (Ubuntu GLIBC 2.29-0ubuntu2) 2.29
> Ubuntu 19.04
> BOOT_IMAGE=/vmlinuz-5.0.0-25-generic
> root=UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ro quiet splash vt.handoff=1

Revision history for this message
In , nelsoneci (nelsoneci-linux-kernel-bugs) wrote :

After one year, I am finally using my 2700X without any issues on the latest Ubuntu 18.04, Linux 5.0.0-27-generic. No patches, no kernel options. I put the system to sleep every day and it always comes back when I press a key the next day.

I updated the BIOS of the AB350M-DS3H to F31, the most recent update before 3rd Gen AMD support was added.

I also installed a PSU that I had around, a TT-450NL2NK. The previous one had a lower wattage. I left the Typical Current Idle option enabled, but I don't know whether it has any effect.

I also have a GeForce GTX 1050 Ti now.

I changed quite a few things, so I cannot really say what improved my setup. But I want to report that the system is finally stable.

Regards.

(In reply to Nelson Castillo from comment #399)
> Hello there. I read most of the thread a week ago.
>
> My machine was freezing when idle and when not idle. I read most of this
> thread and thanks to it got a stable setup.
>
> @Daniel: When you say that you think the problem is fixed, do you know what
> change/Linux version fixes the issue for most people?
>
> -------------
>
> Now my report.
>
> I have the following setup:
>
> Motherboard: AB350M-DS3H (F23d BIOS).
> CPU: Ryzen 2700X.
> No overclock.
>
> I still wonder whether my CPU is doing too much extra work (I don't know
> what cpufreq would report if a C6 state was reached).
>
> cpufreq stats:
>
> CPU 0: 3.70 GHz:6.64%, 3.20 GHz:1.15%, 2.20 GHz:92.21%
> CPU15: 3.70 GHz:4.86%, 3.20 GHz:0.80%, 2.20 GHz:94.34%
>
> Anyway, to make the things work I had to use three tweaks. I tried
> individual tweaks to no avail.
>
> - Select in BIOS: "Typical Current Idle"
> - Start Linux with: idle=nomwait
> - Disable C6 states (both core and package) with Zenstates.py
>
> Before doing the last step Zenstates.py reports Core enabled and Package
> disabled.
>
> I'm using Ubuntu 18.04.01 LTS. I didn't compile Linux with
> CONFIG_RCU_NOCB_CPU / CONFIG_RCU_NOCB_CPU_ALL.
>
> So, things are working for me. But if you think I should test a new Linux
> version that is supposed to fix the issue please let me know.

Revision history for this message
In , henriquesarmentogoulartaguiar (henriquesarmentogoulartaguiar-linux-kernel-bugs) wrote :

I've been lightly following this thread, but it is still a seemingly unresolved issue that impedes me from using Linux. Has there been any real progress on this? Or is the resolution still a vague 'we more or less know where the issue is but we're still waiting for someone upstream to take action to fix it'?

Revision history for this message
In , lvargas (lvargas-linux-kernel-bugs) wrote :

I have a 2700X on an MSI X470 Gaming Plus. Updating the BIOS to support my new 3800X also fixed this issue for me. Both CPUs work well on Debian Buster without any extra kernel parameters and without disabling C6.

Revision history for this message
In , malikovicboris85 (malikovicboris85-linux-kernel-bugs) wrote :

Hi, I bought an Acer Aspire 3 (A315-41G, Ryzen 5 2500U) in January, and since then I have had CPU lockups. I managed to install Arch, but I got a CPU lockup on every second boot; on every other distribution I got a CPU lockup in the installer. I never used the C6 disable script, but I tried all possible kernel parameters and none of them ever worked...

Three days ago I found a new BIOS patch, "Update PI code v1.1.0.8" from 2019/08/12, on the Acer website. After installing it everything works: no lockups, every distribution I try works with no problems, and there is no need for any GRUB or kernel parameters...

Sorry for my bad English,

Regards

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

Sep 10 21:13:56 BelliashPC kernel: BUG: kernel NULL pointer dereference, address: 000000000000000d
Sep 10 21:13:56 BelliashPC kernel: #PF: supervisor write access in kernel mode
Sep 10 21:13:56 BelliashPC kernel: #PF: error_code(0x0002) - not-present page
Sep 10 21:13:56 BelliashPC kernel: PGD 0 P4D 0
Sep 10 21:13:56 BelliashPC kernel: Oops: 0002 [#1] SMP NOPTI
Sep 10 21:13:56 BelliashPC kernel: CPU: 11 PID: 30804 Comm: kworker/11:2 Tainted: P O T 5.2.13-gentoo #1
Sep 10 21:13:56 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep 10 21:13:56 BelliashPC kernel: Workqueue: 0x0 (events)
Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39 c5 74
Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS: 00010002
Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX: ffff8d78aecdfa80 RCX: ffff8d789fa1b400
Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI: ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08: ffff8d78aece0180 R09: 0000000000001800
Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8d78a4112b68
Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14: ffff8d78a4112b40 R15: ffff8d78a4112b40
Sep 10 21:13:56 BelliashPC kernel: FS: 0000000000000000(0000) GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
Sep 10 21:13:56 BelliashPC kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3: 000000069063a000 CR4: 0000000000340ee0
Sep 10 21:13:56 BelliashPC kernel: Call Trace:
Sep 10 21:13:56 BelliashPC kernel: kthread+0xf8/0x130
Sep 10 21:13:56 BelliashPC kernel: ? set_worker_desc+0xb0/0xb0
Sep 10 21:13:56 BelliashPC kernel: ? kthread_park+0x80/0x80
Sep 10 21:13:56 BelliashPC kernel: ret_from_fork+0x22/0x40
Sep 10 21:13:56 BelliashPC kernel: Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
Sep 10 21:13:56 BelliashPC kernel: CR2: 000000000000000d
Sep 10 21:13:56 BelliashPC kernel: ---[ end trace 23be40ec0c6dea2a ]---
Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39 c5 74
Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS: 00010002
Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX: ffff8d78aecdfa80 RCX: ffff8d789fa1b400
Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI: ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08: ffff8d78aece0180 R09: 0000000000001800
Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11: ffffffffffffffff R12...

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

And another example of a call trace:

Sep 5 21:25:32 BelliashPC kernel: NMI backtrace for cpu 1
Sep 5 21:25:32 BelliashPC kernel: CPU: 1 PID: 13564 Comm: DOM File Tainted: P D O T 5.2.11-gentoo #1
Sep 5 21:25:32 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep 5 21:25:32 BelliashPC kernel: Call Trace:
Sep 5 21:25:32 BelliashPC kernel: <IRQ>
Sep 5 21:25:32 BelliashPC kernel: dump_stack+0x46/0x60
Sep 5 21:25:32 BelliashPC kernel: nmi_cpu_backtrace.cold+0x14/0x53
Sep 5 21:25:32 BelliashPC kernel: ? lapic_can_unplug_cpu.cold+0x42/0x42
Sep 5 21:25:32 BelliashPC kernel: nmi_trigger_cpumask_backtrace+0x89/0x8b
Sep 5 21:25:32 BelliashPC kernel: rcu_dump_cpu_stacks+0x7b/0xa9
Sep 5 21:25:32 BelliashPC kernel: rcu_sched_clock_irq.cold+0x1a2/0x38d
Sep 5 21:25:32 BelliashPC kernel: ? tick_sched_do_timer+0x50/0x50
Sep 5 21:25:32 BelliashPC kernel: update_process_times+0x24/0x60
Sep 5 21:25:32 BelliashPC kernel: tick_sched_timer+0x33/0x70
Sep 5 21:25:32 BelliashPC kernel: __hrtimer_run_queues+0xe7/0x180
Sep 5 21:25:32 BelliashPC kernel: hrtimer_interrupt+0x100/0x220
Sep 5 21:25:32 BelliashPC kernel: smp_apic_timer_interrupt+0x56/0x90
Sep 5 21:25:32 BelliashPC kernel: apic_timer_interrupt+0xf/0x20
Sep 5 21:25:32 BelliashPC kernel: </IRQ>
Sep 5 21:25:32 BelliashPC kernel: RIP: 0010:queued_spin_lock_slowpath+0x3f/0x1a0
Sep 5 21:25:32 BelliashPC kernel: Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 18 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 c3 f6 c4 01 75 04 c6 47 01 00 48 c7 c0
Sep 5 21:25:32 BelliashPC kernel: RSP: 0018:ffffa299085afc48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Sep 5 21:25:32 BelliashPC kernel: RAX: 00000000003c0101 RBX: ffffa299085afd10 RCX: 000000008a2ee8ff
Sep 5 21:25:32 BelliashPC kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa299034fdb44
Sep 5 21:25:32 BelliashPC kernel: RBP: ffffa299085afcc0 R08: ffffa299085afcc0 R09: 0000000000000000
Sep 5 21:25:32 BelliashPC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00007fa4d675aa38
Sep 5 21:25:32 BelliashPC kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffffa299034fdb40
Sep 5 21:25:32 BelliashPC kernel: futex_wait_setup+0x77/0x110
Sep 5 21:25:32 BelliashPC kernel: futex_wait+0xcf/0x230
Sep 5 21:25:32 BelliashPC kernel: ? ep_poll_callback+0x256/0x280
Sep 5 21:25:32 BelliashPC kernel: do_futex+0x15e/0xc60
Sep 5 21:25:32 BelliashPC kernel: ? pipe_write+0x382/0x410
Sep 5 21:25:32 BelliashPC kernel: ? new_sync_write+0x12c/0x1d0
Sep 5 21:25:32 BelliashPC kernel: __x64_sys_futex+0x137/0x170
Sep 5 21:25:32 BelliashPC kernel: do_syscall_64+0x4a/0x110
Sep 5 21:25:32 BelliashPC kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 5 21:25:32 BelliashPC kernel: RIP: 0033:0x7fa4e76c7067
Sep 5 21:25:32 BelliashPC kernel: Code: 5c 24 68 48 89 44 24 50 e8 96 38 00 00 e8 c1 3d 00 00 89 de 41 89 c1 40 80 f6 80 45 31 d2 31 d2 4c 89 ff b8 ca 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 44 89 cf e8 f9 3d 00 00 31 f6 4c 89 f7 e8
Sep 5 21:25:32 BelliashPC kernel: RSP: 002b...

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

> Sep 10 21:13:56 BelliashPC kernel: Workqueue: 0x0 (events)
> Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
> Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49
> 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47
> 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39
> c5 74
> Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS:
> 00010002
> Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX:
> ffff8d78aecdfa80 RCX: ffff8d789fa1b400
> Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI:
> ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
> Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08:
> ffff8d78aece0180 R09: 0000000000001800
> Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11:
> ffffffffffffffff R12: ffff8d78a4112b68
> Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14:
> ffff8d78a4112b40 R15: ffff8d78a4112b40
> Sep 10 21:13:56 BelliashPC kernel: FS: 0000000000000000(0000)
> GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
> Sep 10 21:13:56 BelliashPC kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3:
> 000000069063a000 CR4: 0000000000340ee0
> Sep 10 21:13:56 BelliashPC kernel: Call Trace:
> Sep 10 21:13:56 BelliashPC kernel: kthread+0xf8/0x130
> Sep 10 21:13:56 BelliashPC kernel: ? set_worker_desc+0xb0/0xb0
> Sep 10 21:13:56 BelliashPC kernel: ? kthread_park+0x80/0x80
> Sep 10 21:13:56 BelliashPC kernel: ret_from_fork+0x22/0x40
> Sep 10 21:13:56 BelliashPC kernel: Modules linked in: nvidia_drm(PO)
> nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O)
> vboxdrv(O)

You have proprietary crap loaded.

Try reproducing it with the latest upstream kernel and *without* that
nvidia* and vbox* gunk.

If you can, please open a separate bug and add me to CC.

Thx.
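For reporters who want to check their own oops for the same thing before posting, here is a minimal Python sketch that pulls the flagged modules out of a "Modules linked in:" line. It assumes the usual kernel convention that `P` marks a proprietary module and `O` an out-of-tree one; the function name `flagged_modules` is made up for illustration.

```python
import re

# Meaning of the per-module taint letters handled here (other letters are passed through).
TAINT_MEANING = {"P": "proprietary", "O": "out-of-tree"}

# The module list from the oops quoted above, joined onto one line.
SAMPLE_LINE = (
    "Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) "
    "vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)"
)


def flagged_modules(line):
    """Return {module: [taint meanings]} for modules carrying flags
    in a 'Modules linked in:' line; unflagged modules are skipped."""
    _, found, rest = line.partition("Modules linked in:")
    if not found:
        return {}
    result = {}
    for name, flags in re.findall(r"(\S+?)\(([A-Z]+)\)", rest):
        result[name] = [TAINT_MEANING.get(flag, flag) for flag in flags]
    return result


if __name__ == "__main__":
    for module, reasons in flagged_modules(SAMPLE_LINE).items():
        print(f"{module}: {', '.join(reasons)}")
```

An empty result suggests the trace came from a kernel without such modules loaded at the time, which is what the request above asks for.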

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

This also happened without vbox (I installed it a few days ago) and without the nvidia proprietary drivers. I used to use nouveau with the same results. I will open a new ticket.

Revision history for this message
In , albertogomezmarin (albertogomezmarin-linux-kernel-bugs) wrote :

I get freezes on a Ryzen 7 2700U in the Acer Swift SF315-41. Kernel tested: 5.2

Revision history for this message
In , schweinefilet (schweinefilet-linux-kernel-bugs) wrote :

I was able to update my BIOS to version 18, but my system still locks up.
I tried the following with the new BIOS:
 - use factory defaults
 - disable SMT
 - disable SMT with Typical Current Idle
 - all of the above with SVM disabled/enabled
Right now I set the power supply idle control to "Low ..." and will report back.

On a positive note, I got an error message:

Sep 14 13:33:04 user-MS-7B85 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000160
Sep 14 13:33:04 user-MS-7B85 kernel: #PF error: [normal kernel read fault]
Sep 14 13:33:04 user-MS-7B85 kernel: PGD 0 P4D 0
Sep 14 13:33:04 user-MS-7B85 kernel: Oops: 0000 [#1] SMP NOPTI
Sep 14 13:33:04 user-MS-7B85 kernel: CPU: 6 PID: 1620 Comm: QXcbEventReader Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
Sep 14 13:33:04 user-MS-7B85 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B85/B450 GAMING PRO CARBON AC (MS-7B85), BIOS 1.80 07/22/2019
Sep 14 13:33:04 user-MS-7B85 kernel: RIP: 0010:pick_next_task_fair+0x225/0x6f0
Sep 14 13:33:04 user-MS-7B85 kernel: Code: 40 78 48 3d 80 39 e2 b9 75 6c 4c 8b 65 b0 eb 2a 4c 89 e7 45 31 ed e8 aa c6 ff ff 84 c0 75 41 4c 89 e7 4c 89 ee e8 4b 3b ff ff <4c> 8b a0 60 01 00 00 4d 85 e4 0f 84 e9 00 00 00 4d 8b 6c 24 40 4d
Sep 14 13:33:04 user-MS-7B85 kernel: RSP: 0018:ffffaa51c34ff938 EFLAGS: 00010046
Sep 14 13:33:04 user-MS-7B85 kernel: RAX: 0000000000000000 RBX: ffff8e210e7a2d80 RCX: 000000000000000a
Sep 14 13:33:04 user-MS-7B85 kernel: RDX: ffffaa51c34ff9c0 RSI: 0000000000000000 RDI: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: RBP: ffffaa51c34ff998 R08: 00000000000f48d3 R09: ffff8e2109ca8a00
Sep 14 13:33:04 user-MS-7B85 kernel: R10: ffffaa51c34ff898 R11: 0000000000000001 R12: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: R13: 0000000000000000 R14: ffff8e21089820c0 R15: ffffaa51c34ff9c0
Sep 14 13:33:04 user-MS-7B85 kernel: FS: 00007fa560ab6700(0000) GS:ffff8e210e780000(0000) knlGS:0000000000000000
Sep 14 13:33:04 user-MS-7B85 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 13:33:04 user-MS-7B85 kernel: CR2: 0000000000000160 CR3: 00000004081dc000 CR4: 0000000000340ee0
Sep 14 13:33:04 user-MS-7B85 kernel: Call Trace:
Sep 14 13:33:04 user-MS-7B85 kernel: __schedule+0x173/0x850
Sep 14 13:33:04 user-MS-7B85 kernel: ? ep_poll_callback+0x93/0x2a0
Sep 14 13:33:04 user-MS-7B85 kernel: schedule+0x2c/0x70
Sep 14 13:33:04 user-MS-7B85 kernel: schedule_hrtimeout_range_clock+0x181/0x190
Sep 14 13:33:04 user-MS-7B85 kernel: ? __wake_up_common+0x73/0x130
Sep 14 13:33:04 user-MS-7B85 kernel: ? add_wait_queue+0x44/0x50
Sep 14 13:33:04 user-MS-7B85 kernel: ? __pollwait+0xaf/0xe0
Sep 14 13:33:04 user-MS-7B85 kernel: schedule_hrtimeout_range+0x13/0x20
Sep 14 13:33:04 user-MS-7B85 kernel: poll_schedule_timeout.constprop.11+0x46/0x70
Sep 14 13:33:04 user-MS-7B85 kernel: do_sys_poll+0x3d6/0x590
Sep 14 13:33:04 user-MS-7B85 kernel: ? kmem_cache_free+0x1a7/0x1d0
Sep 14 13:33:04 user-MS-7B85 kernel: ? kmem_cache_free+0x1a7/0x1d0
Sep 14 13:33:04 user-MS-7B85 kernel: ? update_load_avg+0x8b/0x5a0
Sep 14 13:33:04 user-MS-7B85 kernel: ? update_load_avg+0x8b/0x5a0
Sep 14 13:33:04 user-MS-7B85 kernel: ? __...

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

(In reply to txrx from comment #651)
> I was able to update my BIOS to version 18, but my system still locks up.
> I tried the following with the new BIOS:
> - use factory defaults
> - disable SMT
> - disable SMT with Typical Current Idle
> - all of the above with SVM disabled/enabled
> Right now I set the power supply idle control to "Low ..." and will report
> back.
>

You got the same error as me. There is also bug 204811 for that.

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

(In reply to txrx from comment #651)

Typical Current Idle might not be working. Read the sensor output: if the voltage is not higher than it was before you enabled the option, try increasing the core voltage.

My Ryzen 7 1800X seems to have stopped hanging since I upgraded to 1003ABB on an ASUS Crosshair VI Hero and enabled Typical Current Idle.
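
For comparing the voltage with and without the option, here is a minimal sketch (not part of the original comment) that lists every voltage sensor the kernel exposes. It assumes the standard hwmon sysfs layout, where `in*_input` files hold millivolts; which chip and label correspond to the core voltage depends on the board and driver, so the names in the output are whatever your system reports.

```python
from pathlib import Path


def read_hwmon_voltages(root="/sys/class/hwmon"):
    """Return {(chip_name, label): volts} for every in*_input file under root."""
    readings = {}
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.exists() else chip.name
        for value_file in sorted(chip.glob("in*_input")):
            # in0_input may have a matching in0_label with a readable name.
            prefix = value_file.name[: -len("_input")]
            label_file = chip / (prefix + "_label")
            label = label_file.read_text().strip() if label_file.exists() else prefix
            try:
                millivolts = int(value_file.read_text().strip())
            except (OSError, ValueError):
                continue  # sensor not readable right now, skip it
            readings[(chip_name, label)] = millivolts / 1000.0
    return readings
```

Run it once at idle with the BIOS option off and once with it on, for example `for key, volts in read_hwmon_voltages().items(): print(key, volts)`, and compare the readings.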

> I was able to update my BIOS to version 18, but my system still locks up.
> I tried the following with the new BIOS:
> - use factory defaults
> - disable SMT
> - disable SMT with Typical Current Idle
> - all of the above with SVM disabled/enabled
> Right now I set the power supply idle control to "Low ..." and will report
> back.
>

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

https://cdn.arstechnica.net/wp-content/uploads/2019/10/rdrand-test.zip - download, extract, run ./test-rdrand, and check whether it produces anything other than 0xffffffff.

Revision history for this message
In , jaap (jaap-linux-kernel-bugs) wrote :

Could this be related?

[620533.980406] RBP: 0000000000000000 R08: 0000000000000001 R09: 00007f9014000080
[620533.981792] R10: 00007f9014de0270 R11: 0000000000000206 R12: 00007f90477fd1fe
[620533.983039] R13: 00007f90477fd1ff R14: 00007f901ffff700 R15: 00007f901fffe540
[620541.767819] watchdog: BUG: soft lockup - CPU#4 stuck for 22s! [nc4:Eantoor:5552]
[620541.769193] Modules linked in: tun ueth nf_conntrack_netlink nfnetlink xfrm_user xfrn_algo xt_addrtype iptable filter xt c, ntrack br_petfilter bridge stp 11c ouerlay xt_nat xt_tcpudp iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defraliipu6 -n id.nfjlefrag_ipu4 nouueau snd_bda_codec_realtek snd_hda_codec_generic ledtrig_audio mxn_wmi i2c_algo_bit ttm edac_mCe_amd snd a_intel drmitms_helper snd_hda_codec eeepc_wmi kum_and snd_hda_core sparse_keymap drm rfkill snd_hudep joydeu inputied s snd, cm r8169 wml_trmof kum irgbypass realtek snd_timer libphy agpgart ccp snd syscopyarea sysfillrect sysimgblt sp5100_tco sou ndcore fb_sys_fops rng_core i2cpiix4 crctlOdif_pclmul klOtemp crc32_pclmul ghash_clmulni_inte1 aesni_intel aes_x86_64 crypto_si md crypt& glue helper pcspkr wmi pinctrl_and pcc_cpufreq eudeu gpio_andpt mac_hid acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd gr ace sunrpc ip_tables x tables ext4 crc16 mbcache jbd2 raid456 libcrc32c crc32c_generic async_raid6_recou
[620541.769221] async memcpy async_pq async_xor xor async_tx raid6_pq raidl md_mod hid_generic sd_mod usbhid hid ahci libahci 1 ibata crc32c_intel xhcipci xhci_bcd scsi_mod
[620541.763831] CPU: 4 PID: 5552 Comm: nc4:Eantoor Tainted: G U L 5.2.11-arch1-1-ARCH #1
[620541.765314] Hardware name: System manufacturer System Product Name/PRIME B450-PLUS, BIOS 1804 07/29/2019
[620541.766760] RIP: 0010:smp_call_function_many+0x20b/0x270
[620541.766216] Code: e6 8a ae 78 00 3b 05 f8 0d 21 01 89 c7 0f 83 7a fe ff ff 48 63 c7 48 8b 0b 48 03 0c c5 60 ba d6 a1 8b 41 1 6 a8 01 74 0a f3 90 <ft> 51 16 83 e2 01 75 f6 eb c9 48 c7 c2 c0 14 f4 a1 4c 89 fe 89 df
[620541.791144] RSP: 0016:ffffaefd63e2bcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[620541.792501] RAX: 0000000000000003 RBX: ffff8fe6be72b440 RCX: ffff8fe6be7f11e0
[620541.793697] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000007
[620541.795357] RBP: ffffffffa0c7c7e0 R08: ffff8fe6be72b448 R09: ffff8fe6be72b470
[620541.796751] R10: ffff8fe6be72b448 R11: 0000000000000005 R12: ffff8febbe729d40
[620541.798198] R13: ffff8fe6be72b448 R14: 0000000000000001 R15: 0000000000000140
[620541.799600] FS: 00007f9044d5c700(0000) GS:ffff8fe6be700000(0000) knlGS:0000000000000000
[620541.800986] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[620541.802414] CR2: 00007f9062e8 f3b0 CR3: 00000003ae65a000 CR4: 00000000003406e0
[620541.803847] Call Trace:
[629541.80-r 8] flush_tlb_mm_range+0xe1/0x140
[620541.806550] tlb_flush_mmu+0x72/0x140
[620541.807929] tlb_finish_mmu+0x3d/0x70
t62e541me92?el zap_page_range+0x14b/0x1a0
[620541.810589] __se_sys_madvise+0x613/0x890
[620541.811932] ? do_syscall_64+0x5f/0x1d0
[620541.813277] do_syscall_64+0x5f/0x1d0
[620541.814648] ? page_fault+0x8/0x30
[620541.815992] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[620541.817395] RIP: 0833:0x7f9062dd7597 russdriAluff...

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to Jaap Crezee from comment #655)
> Could this be related?
>
> [620533.9804061 RBP: 0000000000000000 R08: 0000000000000001 R09:
> 00007f9014000080 [620533.981792] R10: 00007f9014de0270 R11: 0000000000000206
> R12: 00007f90477fdlfe [620533.983039] R13: 00007f90477fdlff R14:
> 00007f901ffff700 R15: 00007f901fffe540 [620541.767819] watchdog: BUG: soft
> lockup - CPU#4 stuck for 220 Enc4:Eantoor:5552.1 [620541.769193] Modules
> linked in: tun ueth nf_conntrack_netlink nfnetlink xfrm_user xfrn_algo

See also bug report 205017. It is difficult to capture the traces when it first happens. May or may not be the same bug.

Revision history for this message
In , jaap (jaap-linux-kernel-bugs) wrote :

Since I set up remote console logging and upgraded my BIOS (again), I have not seen the hang anymore.
Although I like this, I still have no more clues about the initial trace/problem.

FYI

I am running on a

        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: PRIME B450-PLUS
        Version: 1820
        Release Date: 09/12/2019
model name : AMD Ryzen 5 1600 Six-Core Processor

Revision history for this message
In , pghjvanblokland (pghjvanblokland-linux-kernel-bugs) wrote :

Ever since upgrading to a Ryzen (1700X), I have experienced frequent system freezes, which may be related to the problems discussed here.

The freeze mostly happens during a certain heavily threaded task with disk io.

Symptoms:

* Screen completely freezes, including mouse pointer,
* Existing SSH connections die, no new connection can be established,
* System can no longer switch to text console,
* LEDs indicate **unceasing disk activity**,
* System still responds to pings,
* Alt-SysRq keys remain active, but cannot output to screen even if already in text console.

I've succeeded in capturing kernel logging after a freeze using netconsole:

This timeout message appears:

    [35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds.
    [35042.581259] Not tainted 4.15.0-62-generic #69-Ubuntu
    [35042.581262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [35042.581273] jbd2/dm-2-8 D 0 610 2 0x80000000
    [35042.581278] Call Trace:
    [35042.581290] __schedule+0x24e/0x880
    [35042.581295] ? bit_wait+0x60/0x60
    [35042.581300] schedule+0x2c/0x80
    [35042.581304] io_schedule+0x16/0x40
    [35042.581308] bit_wait_io+0x11/0x60
    [35042.581313] __wait_on_bit+0x4c/0x90
    [35042.581317] out_of_line_wait_on_bit+0x90/0xb0
    [35042.581323] ? bit_waitqueue+0x40/0x40
    [35042.581328] __wait_on_buffer+0x32/0x40
    [35042.581333] jbd2_journal_commit_transaction+0xdac/0x1730
    [35042.581337] ? __switch_to_asm+0x41/0x70
    [35042.581343] kjournald2+0xc8/0x270
    [35042.581347] ? kjournald2+0xc8/0x270
    [35042.581351] ? wait_woken+0x80/0x80
    [35042.581355] kthread+0x121/0x140
    [35042.581359] ? commit_timeout+0x20/0x20
    [35042.581363] ? kthread_create_worker_on_cpu+0x70/0x70
    [35042.581366] ret_from_fork+0x22/0x40
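
For anyone who wants to capture a trace like this on a second machine, here is a minimal receiver sketch (not from the original comment). It assumes the frozen box is already configured to send netconsole output over UDP to this host; 6666 is netconsole's default target port, and the function names are made up for illustration.

```python
import socket


def open_listener(port=6666, bind_addr="0.0.0.0"):
    """Bind a UDP socket on the port the netconsole sender targets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    return sock


def read_messages(sock, max_messages=None):
    """Yield (sender_ip, text) for each datagram received on sock."""
    received = 0
    while max_messages is None or received < max_messages:
        try:
            data, (sender, _port) = sock.recvfrom(65535)
        except socket.timeout:
            return  # only reached if the caller set a timeout on the socket
        received += 1
        yield sender, data.decode("utf-8", errors="replace").rstrip("\n")


# Example (blocks until interrupted):
#     for sender, line in read_messages(open_listener()):
#         print(sender, line, flush=True)
```

Writing each line to disk as it arrives matters here, because the sending machine may freeze at any moment and the last few messages are usually the interesting ones.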

Also, I have thousands of lines of output for blocked tasks. Most traces look more or less like this:

    [34274.346748] sysrq: SysRq : Show Blocked State
    [34274.346766] ...
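
With thousands of lines of output, a quick summary helps to see which tasks the hung-task detector keeps reporting. The following is a rough sketch (not from the original comment) that counts the "blocked for more than N seconds" messages per task. The regular expression is an assumption based on the message format shown above, and it does not parse the separate SysRq "Show Blocked State" dump.

```python
import re
from collections import Counter

# Matches e.g. "INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds."
HUNG_TASK_RE = re.compile(
    r"INFO: task (?P<comm>.+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)


def summarize_hung_tasks(lines):
    """Count hung-task reports per (task name, pid) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            counts[(match.group("comm"), int(match.group("pid")))] += 1
    return counts
```

Feeding it the saved capture, for example `summarize_hung_tasks(open("netconsole.log"))` with a hypothetical file name, returns a `Counter` whose `most_common()` shows the tasks that were blocked most often.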

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

I applied the latest ASRock BIOS with the new option "amd cbs global c-state control" to disable voltage lowering when idle.

Even after all the BIOS settings, custom kernels and params, disabling C states, power supplies and so forth, I'm done. 2 years of this BS.

I picked up a 65 watt Intel i9 9900.

Good luck to you all and thanks for all the ideas and help. I hope to revisit AMD Ryzen based Linux servers in a few years.

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to eric.c.morgan from comment #659)
> I applied the latest asrock BIOS with new options "amd cbs global c-state
> control" to disable voltage lowering when idle.
>
> Even with all BIOS settings, custom kernels and params, disabling C states,
> power supplies and so forth I'm done. 2 years of this BS.
>
> I picked up a 65 watt Intel i9 9900.
>
> Good luck to you all and thanks for all the ideas and help. I hope to
> revisit AMD Ryzen based linux servers in a few years.

I think you had faulty hardware. Normally the Typical Current Idle setting alone fixes the trouble.
That happens.

On the other hand, the Windows kernel has no problem with the default parameters, so I think something in Linux is not right.

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

(In reply to Michaël Colignon from comment #660)
> (In reply to eric.c.morgan from comment #659)
> > I applied the latest asrock BIOS with new options "amd cbs global c-state
> > control" to disable voltage lowering when idle.
> >
> > Even with all BIOS settings, custom kernels and params, disabling C states,
> > power supplies and so forth I'm done. 2 years of this BS.
> >
> > I picked up a 65 watt Intel i9 9900.
> >
> > Good luck to you all and thanks for all the ideas and help. I hope to
> > revisit AMD Ryzen based linux servers in a few years.
>
> I think you had a faulty hardware. With just the typical idle current it fix
> the trouble.
> That happens.
>
> On the other side, Windows kernel has no problem with default parameters, so
> i think something in Linux is not good.

Agree, see https://bugzilla.kernel.org/show_bug.cgi?id=205017

Revision history for this message
In , eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote :

(In reply to Michaël Colignon from comment #660)
> (In reply to eric.c.morgan from comment #659)
> > I applied the latest asrock BIOS with new options "amd cbs global c-state
> > control" to disable voltage lowering when idle.
> >
> > Even with all BIOS settings, custom kernels and params, disabling C states,
> > power supplies and so forth I'm done. 2 years of this BS.
> >
> > I picked up a 65 watt Intel i9 9900.
> >
> > Good luck to you all and thanks for all the ideas and help. I hope to
> > revisit AMD Ryzen based linux servers in a few years.
>
> I think you had a faulty hardware. With just the typical idle current it fix
> the trouble.
> That happens.
>
> On the other side, Windows kernel has no problem with default parameters, so
> i think something in Linux is not good.

Likely not bad hardware. My 1700 was RMAd for the segfault issue (another pain point), and has passed all testing. Memory has been tested as well, many times over, with different profiles and settings.

Tiny i9-9900 itx ready for transplant. I feel dirty. https://i.imgur.com/fTvWrWd.png

Revision history for this message
In , jan.public (jan.public-linux-kernel-bugs) wrote :

> Likely not bad hardware. My 1700 was RMAd for the segfault issue (another
> pain point), and has passed all testing. Memory has been tested as well,
> many times over, with different profiles and settings.
>

I also had my 1600 RMAd, but still I experience hangs now and then. I have the impression that the hangs are worse lately. But I have no hard statistics.

Revision history for this message
In , bp (bp-linux-kernel-bugs) wrote :

Folks,

this bugzilla entry, with the amount of different issues reported and the amount of FUD all collected in one place, is prohibitively hard for a debugger to handle. So, I'd suggest if you still would like your issue looked at, to open a separate bug.

And please refrain from commenting on a bug with your own issue - it is much easier to open a separate one first and then when it turns out that it is a known issue, to merge the two bugzilla entries than to keep them apart in a single report.

Thx.

Revision history for this message
In , dl9px (dl9px-linux-kernel-bugs) wrote :

I as the thread starter fully agree with Borislav and set this bug report to status resolved. Thank you all very much for the interesting discussion!

Having started with massive stability problems on my new Ryzen build, I reported this issue to address a potential problem in the Linux kernel. In the hundreds of comments, which surely could be very interesting for everybody affected, I found a workaround: disabling C6 states with the Python script attached to a comment above. Since then the system has been perfectly stable. Tests with the "kill Ryzen" script also showed me that the CPU was not affected by the other huge problem early adopters had.

A few days ago I finally found the time to update the AB350 BIOS to the latest version and set the "typical idle current" in the options. I had only 2 issues since then: The bootloader entry for my OS was broken after the update and network manager did not bring up the LAN interface anymore. But after resolving these issues: No more stability problems. Of course I deactivated my systemd service for c6 states. I am still on the stable branch of Manjaro and use the latest 5.5 series kernel.

See you!

Changed in linux:
status: Confirmed → Expired
Revision history for this message
In , stellicht (stellicht-linux-kernel-bugs) wrote :

Hi folks!

For those who are looking for a solution or already found a solution, there is a new update of AGESA rolling out. The new version 1.0.0.4 claims:

* Improved system stability when switching through ACPI power states.

It arrived for my Asus PRIME B350M a few days ago; I have upgraded and kept the default optimised BIOS settings. When I used ZenState to check the CPU status, the C6 state was now reported as:

C6 State - Package - Enabled
C6 State - Core - Disabled

Previously they were all enabled, so I guess the new AGESA release finally solved this problem at the source? I will report back in a few days on whether the system still hangs.
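
For reference, a read-only sketch (not part of the original comment) of how those two status lines can be derived. The MSR addresses and bit positions are an assumption taken from the community ZenStates-Linux script for first-generation Ryzen as I remember it, not from AMD documentation, so verify them before relying on the result. Reading `/dev/cpu/N/msr` needs root and the `msr` kernel module.

```python
import os
import struct

# Assumed values (from the community ZenStates-Linux script, unverified):
MSR_PKG_C6 = 0xC0010292
PKG_C6_BIT = 1 << 32
MSR_CORE_C6 = 0xC0010296
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def read_msr(msr, cpu=0):
    """Read one 64-bit MSR through the msr driver's character device."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The file offset selects the MSR number; each read returns 8 bytes.
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def decode_c6(pkg_value, core_value):
    """Interpret raw MSR values as package/core C6 enabled flags."""
    return {
        "package": bool(pkg_value & PKG_C6_BIT),
        "core": (core_value & CORE_C6_BITS) == CORE_C6_BITS,
    }
```

On a live system this would be called as `decode_c6(read_msr(MSR_PKG_C6), read_msr(MSR_CORE_C6))`. The sketch only reads the registers and never writes to them.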

Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Charles Lim from comment #673)

> For those who are looking for a solution or already found a solution, there
> is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
>
> * Improved system stability when switching through ACPI power states.
>
> It has arrived on my Asus PRIME B350M days ago, I have upgraded and kept
> default optimised BIOS settings. When I used ZenState to check cpu status,
> the C6 state is now:
>
> C6 State - Package - Enabled
> C6 State - Core - Disabled
>
> Previously they were all enabled, I guessed the new AGESA release finally
> solved this problem from source? I will report back days later to see if the
> system still hangs.

Are you sure about the AGESA version string? For the Dell OptiPlex 5055, I took firmware version 1.1.20 [1], ran it through *Dell PFS BIOS Extractor* [2], and then grepped for `AGESA!` in the strings/hexdump; it says it includes AGESA version 1.0.0.7a.

    $ strings 1\ --\ 1\ OptiPlex\ System\ BIOS\ v1.1.20.bin | grep -A1 AGESA!
    %pAGESA!V9
    SummitPI-AM4 1.0.0.7a

[1]: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=w6mw5&oscode=wt64a&productcode=optiplex-5055-ryzen-cpu
[2]: https://github.com/platomav/BIOSUtilities
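
The same lookup can be scripted. Below is a minimal sketch (not from the original comment) that searches an already extracted firmware image for the `AGESA!` marker and returns the printable string that follows it. The layout it assumes (marker, a short tag such as `V9`, one or more NUL bytes, then the version string) is inferred only from the output above and may differ in other images.

```python
import re

# "AGESA!" + optional printable tag (e.g. "V9") + NUL padding + printable version string
AGESA_RE = re.compile(rb"AGESA![\x20-\x7e]*\x00+([\x20-\x7e]+)")


def find_agesa_versions(image: bytes):
    """Return every version string that follows an AGESA! marker in the image."""
    return [m.group(1).decode("ascii") for m in AGESA_RE.finditer(image)]
```

For a file on disk, read it in binary mode first, for example `find_agesa_versions(open("extracted.bin", "rb").read())`, where `extracted.bin` is a hypothetical name for the unpacked image.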

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

(In reply to Paul Menzel from comment #674)
> (In reply to Charles Lim from comment #673)
>
> > For those who are looking for a solution or already found a solution, there
> > is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> >
> > * Improved system stability when switching through ACPI power states.
> >
> > It has arrived on my Asus PRIME B350M days ago, I have upgraded and kept
> > default optimised BIOS settings. When I used ZenState to check cpu status,
> > the C6 state is now:
> >
> > C6 State - Package - Enabled
> > C6 State - Core - Disabled
> >
> > Previously they were all enabled, I guessed the new AGESA release finally
> > solved this problem from source? I will report back days later to see if
> the
> > system still hangs.
>
> Are you sure about the AGESA version string? For the Dell OptiPlex 5055,
> firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*,
> and then grepping for `AGESA!` in the strings/hexdump, it says it includes
> AGESA version 1.0.0.7a.
>
> $ strings 1\ --\ 1\ OptiPlex\ System\ BIOS\ v1.1.20.bin | grep -A1 AGESA!
> %pAGESA!V9
> SummitPI-AM4 1.0.0.7a
>
> [1]:
> https://www.dell.com/support/home/de-de/drivers/
> driversdetails?driverid=w6mw5&oscode=wt64a&productcode=optiplex-5055-ryzen-
> cpu
> [2]: https://github.com/platomav/BIOSUtilities

1.0.0.7a is for Zen+
1.0.0.4 is for Zen2.

AMD resets versioning every new CPU generation. 1.0.0.4 is newer than 1.0.0.7

Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Rafal Kupiec from comment #675)
> (In reply to Paul Menzel from comment #674)
> > (In reply to Charles Lim from comment #673)
> >
> > > For those who are looking for a solution or already found a solution,
> there
> > > is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> > >
> > > * Improved system stability when switching through ACPI power states.

[…]

> > Are you sure about the AGESA version string? For the Dell OptiPlex 5055,
> > firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*,
> > and then grepping for `AGESA!` in the strings/hexdump, it says it includes
> > AGESA version 1.0.0.7a.

[…]

> 1.0.0.7a is for Zen+
> 1.0.0.4 is for Zen2.
>
> AMD resets versioning every new CPU generation. 1.0.0.4 is newer than 1.0.0.7

Thank you for the clarification. Though I am confused now, as I thought you could use Zen2 devices in “Zen+ boards” (boards originally made for Zen+). So, does AGESA 1.0.0.4 for Zen2 also support the predecessor generation?

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

There is no such thing as a Zen+ or Zen2 board... There are AM4 motherboards based on different chipsets. All of them support Zen, Zen+ and Zen2. If you want to install a Zen2 CPU in such a board, you must first ensure it has been flashed with a BIOS that supports the new CPU. The piece of software that includes the CPU microcode is called AGESA. It is responsible for initializing at least the CPU and memory, and it is redistributed as a BIOS upgrade by motherboard manufacturers.

So if you want to install a Zen2 CPU on an AM4 motherboard, you need to make sure it has been flashed with a BIOS containing AGESA 1.0.0.1, which brings support for Zen2 CPUs. And that AGESA 1.0.0.1 is newer than 1.0.0.7. It still supports Zen and Zen+, but since it is dedicated to Zen2, the versioning restarts from 1.0.0.0, which confuses people.

Revision history for this message
In , stellicht (stellicht-linux-kernel-bugs) wrote :

Hi folks again!

I have been testing the new AGESA 1.0.0.4 these days. Currently my uptime is over 3 days, which was impossible before: the system always hung within 2 days. Hence I consider that the new firmware has indeed solved this soft lockup issue!

I'm testing on Ubuntu Focal, with all ACPI-related settings left at their defaults. On my Asus PRIME B350M, the default optimised settings are being used.

In addition, the CPU C6 State seems also normal, during idle, CPU enters its idle state with very low power consumption (measured externally). k10temp reports Vcore 850mV(approx.).

>Thank you for the clarification. Though I am confused now, as I thought you
>could use Zen2 devices in “Zen+ boards” (boards original for Zen+). So, AGESA
>1.0.0.4 for Zen2 also support the predecessor generation?

It seems that, to address this versioning mess, AMD claims that this 1.0.0.4 release reunited all architectures into one codebase. [1] I guess they also "reunited" the version number. The AGESA on my motherboard was also 1.0.0.7 before.

---REFERENCES---
[1]: https://www.reddit.com/r/Amd/comments/dtgutp/an_update_on_the_am4_platform_agesa_1004/

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I've been facing the same issue since I've bought this machine and setting my BIOS power settings to "Typical current idle" has only partially fixed things (the system still crashes every 1-2 weeks during idle: I leave my desk, screensaver fires up, then screens go to sleep, and by the time I get back to my desk, the system no longer responds to any input and the only option is a hard reset.).

At the time of writing, I can't tell if the AGESA version has made it into my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS

But I'd be willing to give this a try. I also just yesterday requested an RMA from AMD though: I spoke to their support and they asked me to try setting the power settings in the BIOS settings. Once I reported back that the power settings only reduced the frequency of these crashes, their immediate response was to open an RMA request. It seemed like they were aware of the issue and they seemed quite confident that an RMA will fix this (I sure hope so since the RMA means at least 2 weeks of downtime for me).

Revision history for this message
In , belliash (belliash-linux-kernel-bugs) wrote :

(In reply to Ashesh Ambasta from comment #680)
> I've been facing the same issue since I've bought this machine and setting
> my BIOS power settings to "Typical current idle" has only partially fixed
> things (the system still crashes every 1-2 weeks during idle: I leave my
> desk, screensaver fires up, then screens go to sleep, and by the time I get
> back to my desk, the system no longer responds to any input and the only
> option is a hard reset.).
>
> At the time of writing, I can't tell if the AGESA version has made it into
> my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS
>
> But I'd be willing to give this a try. I also just yesterday requested an
> RMA from AMD though: I spoke to their support and they asked me to try
> setting the power settings in the BIOS settings. Once I reported back that
> the power settings only reduced the frequency of these crashes, their
> immediate response was to open an RMA request. It seemed like they were
> aware of the issue and they seemed quite confident that an RMA will fix this
> (I sure hope so since the RMA means at least 2 weeks of downtime for me).

I wish you luck. I think this is all I can tell you.
I had the exact same problem with a B450 board. I opened an RMA request and got my money back. I bought an X570-based motherboard and this solved the issue for me.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Created attachment 289445
attachment-1715-0.html

Does this mean the issue could be motherboard related? I actually received a reply from AMD and they’re asking me for pictures of the processor installed on the motherboard for some reason.

Best,

Ashesh Ambasta

> On 1 Jun 2020, at 12:46, <email address hidden> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #681 from Rafal Kupiec (<email address hidden>) ---
> (In reply to Ashesh Ambasta from comment #680)
>> I've been facing the same issue since I've bought this machine and setting
>> my BIOS power settings to "Typical current idle" has only partially fixed
>> things (the system still crashes every 1-2 weeks during idle: I leave my
>> desk, screensaver fires up, then screens go to sleep, and by the time I get
>> back to my desk, the system no longer responds to any input and the only
>> option is a hard reset.).
>>
>> At the time of writing, I can't tell if the AGESA version has made it into
>> my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS
>>
>> But I'd be willing to give this a try. I also just yesterday requested an
>> RMA from AMD though: I spoke to their support and they asked me to try
>> setting the power settings in the BIOS settings. Once I reported back that
>> the power settings only reduced the frequency of these crashes, their
>> immediate response was to open an RMA request. It seemed like they were
>> aware of the issue and they seemed quite confident that an RMA will fix this
>> (I sure hope so since the RMA means at least 2 weeks of downtime for me).
>
> I wish you luck. I think this is all I can tell you.
> I had exact problem with B450. I have opened RMA request and got money back.
> I
> bought X570 based motherboard and this solved issue for me.
>

Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Ashesh Ambasta from comment #680)
> I've been facing the same issue since I've bought this machine and setting
> my BIOS power settings to "Typical current idle" has only partially fixed
> things (the system still crashes every 1-2 weeks during idle: I leave my
> desk, screensaver fires up, then screens go to sleep, and by the time I get
> back to my desk, the system no longer responds to any input and the only
> option is a hard reset.).
>
> At the time of writing, I can't tell if the AGESA version has made it into
> my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS

[I have an MSI board, but checked the firmware update files for you.]

The description for version 3.60 includes:

> Update AMD AGESA ThreadRipperPI-SP3r2 1.1.0.2

Reading the ASRock forum thread *What happened to Fatality X399 Bios 1.60?* [1], it’s enough to run the firmware update file through a hexeditor. I searched for *GES* and was lucky for version 3.80.

    $ hexdump -C X399MT3.80 | less
    […]
    00e34ce0 00 00 00 00 00 00 00 00 00 00 00 00 be 2c 63 17 |.............,c.|
    00e34cf0 80 6c b6 49 82 07 12 b5 3d 9b 25 70 41 47 45 53 |.l.I....=.%pAGES|
    00e34d00 41 21 56 39 00 54 68 72 65 61 64 52 69 70 70 65 |A!V9.ThreadRippe|
    00e34d10 72 50 49 2d 53 50 33 72 32 2d 31 2e 31 2e 30 2e |rPI-SP3r2-1.1.0.|
    00e34d20 32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |2...............|
    […]

So, it still has AGESA version 1.1.0.2.

[…]

[1]: http://forum.asrock.com/forum_posts.asp?TID=6165&PN=4&title=what-happened-to-fatality-x399-bios-160

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I see. But that is assuming that the 1.0.0.4 version “fixes” this issue. At this point it’s only a guess but it seems likely given the previous replies I’ve seen.
My god what a nightmare this has been. :-)

Best,

Ashesh Ambasta

> On 1 Jun 2020, at 17:59, <email address hidden> wrote:
>
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #683 from Paul Menzel (<email address hidden>)
> ---
> (In reply to Ashesh Ambasta from comment #680)
>> I've been facing the same issue since I've bought this machine and setting
>> my BIOS power settings to "Typical current idle" has only partially fixed
>> things (the system still crashes every 1-2 weeks during idle: I leave my
>> desk, screensaver fires up, then screens go to sleep, and by the time I get
>> back to my desk, the system no longer responds to any input and the only
>> option is a hard reset.).
>>
>> At the time of writing, I can't tell if the AGESA version has made it into
>> my motherboard: https://www.asrock.com/mb/AMD/X399M%20Taichi/index.asp#BIOS
>
> [I have an MSI board, but checked the firmware update files for you.]
>
> The description for version 3.60 includes:
>
>> Update AMD AGESA ThreadRipperPI-SP3r2 1.1.0.2
>
> Reading the ASRock forum thread *What happened to Fatality X399 Bios 1.60?*
> [1], it’s enough to run the firmware update file through a hexeditor. I
> searched for *GES* and was lucky for version 3.80.
>
> $ hexdump -C X399MT3.80 | less
> […]
> 00e34ce0 00 00 00 00 00 00 00 00 00 00 00 00 be 2c 63 17
> |.............,c.|
> 00e34cf0 80 6c b6 49 82 07 12 b5 3d 9b 25 70 41 47 45 53
> |.l.I....=.%pAGES|
> 00e34d00 41 21 56 39 00 54 68 72 65 61 64 52 69 70 70 65
> |A!V9.ThreadRippe|
> 00e34d10 72 50 49 2d 53 50 33 72 32 2d 31 2e 31 2e 30 2e
> |rPI-SP3r2-1.1.0.|
> 00e34d20 32 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> |2...............|
> […]
> ```
>
> So, it still has AGESA version 1.1.0.2.
>
> […]
>
>
> [1]:
>
> http://forum.asrock.com/forum_posts.asp?TID=6165&PN=4&title=what-happened-to-fatality-x399-bios-160
>

Revision history for this message
In , c.buddeweg (c.buddeweg-linux-kernel-bugs) wrote :

I ran several experiments with a Ryzen 1700 and a Gigabyte X370 Gaming K7 board. The best choice was "typical current idle", but the latest BIOS didn't have that option. So I bought an R5 3600, put the latest BIOS on the board, and all the problems went away. Rock solid now. I think that the first generation of Ryzen causes some problems with Linux; mine was RMA'ed because of the compiler error. See Comment 195.
Meanwhile gcc-9.3.0, Kernel-5.7.0, Bios F50a

Revision history for this message
In , ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote :

(In reply to Paul Menzel from comment #683)

Hi,
AGESA versioning for TR (Threadripper) is not the same as for AM4.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I’ve tried that option of “Typical current idle” to no avail.

Best,

Ashesh Ambasta


Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I've also found https://www.reddit.com/r/Amd/comments/8yzvxz/ryzen_c6_state_sleep_power_supply_common_current/e33lc6s/ which seems promising. It is, however, appalling to see this kind of response from AMD and I'm quite sure this is my last AMD system.

I'm currently testing my system with "Deep sleep" enabled and will report whether it's stable. The "Typical current idle" fix certainly didn't fix things for me, and in any case, the best I can say about that "fix" is that it's an ugly hack and not really a fix from AMD.

I'm bewildered to see so many reports on this thread and no word from AMD. I'm also convinced they are aware of this issue and are doing nothing but proposing cheap hacks.

AMD has also approved my RMA request and I'll send this processor back if the "deep sleep" "fix" doesn't fix things for me. If the RMA'd processor doesn't work, I'm going to sell.

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

(In reply to raulvior.bcn from comment #653)
> (In reply to txrx from comment #651)
>
> Typical Current Idle might not be working. Read the sensor output. If
> voltage is not higher than without enabling it, try to increase the core
> voltage.
>
> My Ryzen 7 1800X seems to not produce hangs since I upgraded to 1003ABB with
> an ASUS Crosshair VI Hero and enabled Typical current idle.
>
>
> > I was able to update my BIOS to version 18, but my system still locks up.
> > I tried the following with the new BIOS:
> > - use factory defaults
> > - disable SMT
> > - disable SMT with Typical Current Idle
> > - all of the above with SVM disabled/enabled
> > Right now I set the power supply idle control to "Low ..." and will report
> > back.
> >

The motherboard kept hanging. I had to remove the Vitals GNOME Extension. It seems that polling voltage values hangs the motherboard... Still, there are times that the computer does not come back from suspend. There's something wrong with the BIOS/UEFI.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

As a last resort, I've tried `idle=halt` on this machine. And yet my
system just crashed after 3 weeks of uptime.

I'm done with AMD. I will RMA this processor to try things out, but
overall, if that doesn't work, this thing is headed to the junkyard and
I'm going to live with Intel.

At least in the 13 or so Intel systems I've tried, I've not had
exasperating issues like these where the company is positively trying
to ignore this ongoing issue.

This is disgusting from AMD.


Revision history for this message
In , arup.chowdhury (arup.chowdhury-linux-kernel-bugs) wrote :

Before you discard the CPU: if your BIOS has a PBO (Precision Boost Overdrive) option, turn it off. With the latest BIOS (version 5406) on my B350 motherboard, PBO is on by default, and apart from causing high temperatures it led to this kind of crash and reboot with my Ryzen 7. Turning off PBO fixed it, and I am now running a Ryzen 9 on the same board; all works fine with PBO off.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I’ll try that, but I doubt if at this stage it will have any effect.

Best,

Ashesh Ambasta


Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Moreover, from a bit of reading I see that PBO is related to how fast the processor clock speeds are boosted. How is this related to crashes when idle?

Best,

Ashesh Ambasta


Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

Did you ask for an RMA? Did it work?
(In reply to Ashesh Ambasta from comment #692)
> As a last resort, I've tried `idle=halt` on this machine. And yet my
> system just crashed after 3 weeks of uptime.
>
> I'm done with AMD. I will RMA this processor to try things out, but
> overall, if that doesn't work, this thing is headed to the junkyard and
> I'm going to live with Intel.
>
> At least in the 13 or so odd Intel systems I've tried, I've not had
> exhasperating issues like these where the company is positively trying
> to ignore this ongoing issue.
>
> This is disgusting from AMD.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Created attachment 290243
attachment-20700-0.html

I haven't; and to be honest, I've been procrastinating on this issue.

As a very ugly hack/workaround; I've disabled screen power management in
xscreensaver: so the CPU keeps drawing graphics on my screen instead of
my displays going to sleep.

That way, my CPU never really enters the idle states for the crashes to
occur.

I understand that this is /far/ from a satisfactory solution; but I
didn't want to try my luck with the RMA anymore. As long as my system
doesn't crash, I can live with this CPU (albeit this continues to
frustrate me). I may lose patience in the coming months and go for an
RMA anyway. But I'm deterred by the mixed reports for the RMA as well:
some people claim that an RMA fixes their issues; some people say it
makes no difference. I've even read reports of the RMA'd CPU actually
turning out to be worse.

I don't think I'm prepared for the gamble. I've been burnt pretty bad
with AMD at the moment. For now, I'm just making this work. The next
time I'm buying a CPU, I'll do my research more thoroughly and stay away
from AMD.

AMD did publish an erratum in which they claim an issue like this exists;
but a solution is ruled out. Which is further bad news. There was some
discussion on a fix being at the kernel level, but that isn't anywhere
in sight either. I believe these CPUs are plagued by several issues,
which probably makes a fix for this at the kernel level hard. However,
Windows seems to have managed to fix it.

Anyway; rants aside, this is my current take on the CPU.


Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Ashesh Ambasta from comment #699)

> Created attachment 290243 [details]
> attachment-20700-0.html

I am getting:

> Sorry, you are not authorized to access attachment #290243.

[…]

> AMD did publish an errata in which they claim an issue like this exists;
> but a solution is ruled out. Which is further bad news. There was some
> discussion on a fix being at the kernel level, but that isn't anywhere
> in sight either. I believe these CPU's are plagued by several issues,
> which probably makes a fix for this at the kernel level hard. However,
> Windows seems to have managed to fix it.

That’s interesting. Could you please share the URL to these statements?

[…]

raulvior.bcn, Ashesh: Please always remove the cited parts, as otherwise the Web issue page gets even more convoluted. Thanks.


Revision history for this message
In , victorcqueirozg (victorcqueirozg-linux-kernel-bugs) wrote :

I'm also experiencing severe and constant crashes with a Ryzen 5 3600X. I also had problems with my Ryzen 7 1700, and I was confident that I would never have problems again with the Ryzen 5 3600X, but I was wrong.

For the first months everything was fine, but after reinstalling the system, everything went wrong: crashes daily and constantly. I had a critical Git repository corrupted (luckily I could recover things in time). It completely destroyed my experience with my PC and actually became a financial loss, as I use the computer for work.

Not sure what to do. I've decided to keep trying until the end of this year to get rid of the crashes, but as a Linux user, I'm unsure I will ever think about buying AMD again. I will always remember the hassle and the lost years I had with this.

Sometimes all you want to do is to have a great experience with your Linux computer and BOOM, a hard freeze that requires manually turning the computer off. It's ridiculous. To be honest, I expected much more from AMD. I hope they'll fix it ASAP, we can't stand this anymore. Intel becomes our only option with this problem.

I also opened an issue where I'm posting updates about my problems: https://bugzilla.kernel.org/show_bug.cgi?id=208615

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Have you tried the suggested fixes in this thread? And what changed
between that reinstall and earlier?


Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Even with screensavers enabled and displays never going to sleep; I just
experienced a crash on idle.

Here are the logs (https://pastebin.com/D7SEJn9J). I did change the
display configuration just before leaving my desk (with xrandr), which
explains some of these logs. But I can see nothing from the kernel etc.

The system just dies.


Revision history for this message
In , ewerton.urias (ewerton.urias-linux-kernel-bugs) wrote :

Hello everyone.

I apologize for my English, I'll try to communicate.

I did a hardware upgrade in November 2019 (from Intel to AMD). My current hardware is:

------------------------------------------------
ASUS TUF B450-PRO GAMING
Ryzen 5 1600 Six-Core (BIOS always updated)
GeForce GTX 960 4 GB 128 Bits
Corsair 650w
Corsair LPX 16 GB 2666
------------------------------------------------

During the first few weeks, I noticed reboots and freezes, and after a few months of research I found a workaround: adding "processor.max_cstate=1" to the kernel command line in GRUB.

After I did this, my computer went 6 months without rebooting and/or freezing.

Yesterday I removed this parameter from GRUB to see what would happen, and the problem came back: I had a reboot and a freeze. So "processor.max_cstate=1" is a working fix for me.
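
For readers who want to try the same workaround, here is a minimal sketch of adding the parameter on a Debian/Ubuntu-style system. It assumes the kernel command line is set in a double-quoted `GRUB_CMDLINE_LINUX_DEFAULT="..."` line in `/etc/default/grub`, that `sed` supports `-i` (GNU sed), and that `update-grub` regenerates the boot configuration; other distributions differ. The helper name `add_kernel_param` is made up for this example.

```shell
# add_kernel_param FILE PARAM
# Appends PARAM to the GRUB_CMDLINE_LINUX_DEFAULT="..." line in FILE
# unless that line already contains it.
add_kernel_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0    # already present, nothing to do
    fi
    # Insert the parameter just before the closing double quote of the value.
    sed -i "s|^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"|\1 $param\"|" "$file"
}

# Typical use (as root), followed by regenerating grub.cfg and rebooting:
#   add_kernel_param /etc/default/grub processor.max_cstate=1
#   update-grub
```

After the reboot, `cat /proc/cmdline` shows whether the parameter actually reached the kernel.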

I'm here to understand the root of the problem so that I can fix it properly. I'll soon test "Power Supply Idle Control" and "Global C-State Control" in the BIOS, but I have seen that these didn't work for some users.

I'm trying to read all of your comments (there are many), but skipping to the last comments, it seems that there is still no solution to this problem, and this makes me very sad.

I don't know anything about hardware, could someone explain to me, in a layman's way, the difference between using "processor.max_cstate=1", "processor.max_cstate=5", "Power Supply Idle Control" and "Global C-State Control"?
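
The kernel-side half of this question can at least be observed directly. `processor.max_cstate` limits how deep a C-state the kernel's ACPI idle driver will use, so booting with and without it should change the list of idle states the kernel reports. A minimal sketch, assuming the cpuidle sysfs interface is present under `/sys/devices/system/cpu/cpu0/cpuidle`; the function name `list_idle_states` is made up for this example.

```shell
# list_idle_states [CPUIDLE_DIR]
# Prints one line per idle state: directory name, state name, and how
# often the state has been entered since boot.
# Assumption: the cpuidle sysfs interface is available for cpu0.
list_idle_states() {
    base=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$base"/state*; do
        # Skip entries without a readable name (also covers a glob that matched nothing).
        [ -r "$state/name" ] || continue
        printf '%s: %s (entered %s times)\n' \
            "$(basename "$state")" "$(cat "$state/name")" "$(cat "$state/usage")"
    done
}
```

Running `list_idle_states` once with the parameter and once without gives a direct comparison; it does not show what the BIOS options change.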

I thank you for your patience.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I can confirm that in my case, all the suggested alternatives in this
thread didn't work (the ones that were applicable to my use-case anyway).

In the end, I threw my hands up and did an RMA. And that was over a
month ago. It seems to have solved the issue. The processor I had
previously and its replacement were different batches; the former being
from 2019 and the latter from Feb. 2020. It seems to me that AMD ironed
out some issues with processors and I got lucky with the replacement. Or
it could just be some tiny variations in fabrication. I can never be sure.

But my machine has been up since the RMA without any crashes.


Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Ashesh Ambasta from comment #720)
> I can confirm that in my case, all the suggested alternatives in this
> thread didn't work (the ones that were applicable to my use-case anyway).
>
> In the end, I threw my hands up and did an RMA. And that was over a
> month ago. It seems to have solved the issue. The processor I had
> previously and its replacement were different batches; the former being
> from 2019 and the latter from Feb. 2020. It seems to me that AMD ironed
> out some issues with processors and I got lucky with the replacement. Or
> it could just be some tiny variations in fabrication. I can never be sure.
>
> But my machine has been up since the RMA without any crashes.

As you disassembled the processor, were you able to copy the serial numbers from the old and new one, and could post them here please? Could you pleas

PS: Please always remove the full quote from messages, when replying by email, as it just clutters the Web interface.

Revision history for this message
In , qdeqde (qh-u) wrote :

my R5 3600 freezing was solved by RMA, running 3 weeks without problem

old serial with freezing : Y939866S90122
current serial running OK: 9JC2491R00354

Revision history for this message
Nigel Reed (nelgin) wrote :

Another Ryzen 7 1800X user here with similar random crashes.

[165716.089703] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[165716.095949] rcu: 1-...!: (0 ticks this GP) idle=354/0/0x0 softirq=2154363/2154363 fqs=0
[165716.104512] rcu: 3-...!: (0 ticks this GP) idle=29c/0/0x0 softirq=883832/883832 fqs=0
[165716.112873] rcu: 4-...!: (8 GPs behind) idle=ad8/0/0x0 softirq=2165586/2165586 fqs=0
[165716.121179] rcu: 9-...!: (9 GPs behind) idle=acc/0/0x0 softirq=1340600/1340600 fqs=0
[165716.129467] rcu: 11-...!: (2 GPs behind) idle=a18/0/0x0 softirq=4538536/4538537 fqs=0
[165716.137828] rcu: 12-...!: (0 ticks this GP) idle=870/0/0x0 softirq=2158040/2158040 fqs=0
[165775.697763] rcu: rcu_sched kthread starved for 29898 jiffies! g36134941 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=3
[165775.709013] rcu: RCU grace-period kthread stack dump:
[165837.494623] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [(resolved):52315]
[165865.494840] watchdog: BUG: soft lockup - CPU#6 stuck for 23s! [(resolved):52315]
[165893.495058] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [(resolved):52315]
[165921.495276] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [(resolved):52315]
[165949.495494] watchdog: BUG: soft lockup - CPU#6 stuck for 22s! [(resolved):52315]

Unfortunately I didn't have sysrq configured, so I can't provide more information, other than that I also have an X370 motherboard.

The system is not, nor has it ever been, overclocked. I've owned all the components since new, so I'm not going to be able to RMA them like some others have.
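
For anyone in the same position, magic SysRq can be enabled ahead of the next hang. A minimal sketch, assuming a distribution that reads drop-in files from `/etc/sysctl.d` at boot; the file name `90-sysrq.conf` and the helper name `enable_sysrq` are arbitrary choices for this example. Whether SysRq still responds during one of these freezes is not guaranteed.

```shell
# enable_sysrq [DIR]
# Writes a sysctl drop-in that enables all magic SysRq functions at boot.
# Assumption: DIR (default /etc/sysctl.d) is read by the distribution at boot.
enable_sysrq() {
    dir=${1:-/etc/sysctl.d}
    printf 'kernel.sysrq = 1\n' > "$dir/90-sysrq.conf"
}

# To apply the setting immediately without rebooting (as root):
#   sysctl -w kernel.sysrq=1
```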

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Apologies for the late reply.
> As you disassembled the processor, were you able to copy the serial numbers
I did copy and send the older processor's serial numbers to the dealer,
for the RMA. That was a processor from a batch in late 2018 IIRC.
> from the old and new one, and could post them here please? Could you pleas
The new one comes from a batch in Feb. of 2020. I'm afraid I don't have
the serial number copied somewhere (it is probably on the box but that
doesn't show the batch number). That was a rookie mistake.
> PS: Please always remove the full quote from messages, when replying by
> email,
> as it just clutters the Web interface.
Noted, sorry about that!

Revision history for this message
In , mclark (mclark-linux-kernel-bugs) wrote :

If I disable multithreading (SMT), my system, which had these crashes, is stable for months without them. E.g. run this in rc.local:

#!/bin/bash
#
# Disables SMT (hyperthreading) by leaving only the first logical CPU of
# each physical core online, which stops the system from crashing.
# Must run as root.
#
declare -A core_seen    # core IDs that already have one online thread

for CPU in /sys/devices/system/cpu/cpu[0-9]*; do
        CPUID=$(basename "$CPU")
        echo "CPU: $CPUID"

        # Bring the CPU online first so that its topology files are readable
        # (a CPU without an "online" file cannot be toggled and stays online).
        if test -e "$CPU/online"; then
                echo "1" > "$CPU/online"
        fi

        COREID=$(cat "$CPU/topology/core_id")

        if [[ -z "${core_seen[$COREID]}" ]]; then
                echo "$CPU core=$COREID -> enable"
                core_seen[$COREID]=1
        else
                echo "$CPU core=$COREID -> disable"
                echo "0" > "$CPU/online"
        fi
done
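
To check that the sibling threads really went offline after running the script, the kernel's own view can be read back from sysfs. A minimal sketch, assuming `/sys/devices/system/cpu/online` is present and treating `smt/active` as optional because older kernels do not provide it; the function name `smt_status` is made up for this example.

```shell
# smt_status [CPU_SYSFS_DIR]
# Prints whether SMT is active (if the kernel exposes that) and which
# logical CPUs are currently online.
smt_status() {
    base=${1:-/sys/devices/system/cpu}
    if [ -r "$base/smt/active" ]; then
        # 1 = sibling threads are in use, 0 = they are not
        echo "smt active: $(cat "$base/smt/active")"
    fi
    echo "online cpus: $(cat "$base/online")"
}

# Assumption: on kernels that provide smt/control, writing "off" to
# /sys/devices/system/cpu/smt/control (as root) is an alternative to
# offlining the sibling CPUs one by one.
```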

Revision history for this message
In , awwit (awwit-linux-kernel-bugs) wrote :

Hello everyone. I've been following this topic for a long time. I have the same problem on my Ryzen 1700. For a while, switching "Power Supply Idle Control" to "Typical Current Idle" helped. For a while, everything even worked fine with the value "auto". But then I changed something in my BIOS settings and the lockups began to happen even with "Typical Current Idle".

Motherboard: GA-AB350N-Gaming WIFI (rev. 1.0)
Linux Kernel Version: 5.9.12

Then I realized that the bug is due to a combination of some BIOS settings.

So far, I have managed to achieve stable operation with the following settings:

"Power Supply Idle Control" -> "Typical Current Idle"
"Power" (Tab) -> "ErP" -> "Enabled"
"Power" (Tab) -> "CEC 2019 Ready" -> "Disabled"

The settings on the "Power" tab are very important! Especially the "CEC 2019 Ready" setting: with it set to "Enabled", the lockups come back.

I will try to change different BIOS settings to achieve stable system operation when the "Power Supply Idle Control" is set to "auto".

Revision history for this message
In , ewerton.urias (ewerton.urias-linux-kernel-bugs) wrote :

I'm here again, after my first comment (Comment 719).

I have a Ryzen 5 1600 (UA 1843PGS), and I'm here to report some news about my case.

I have 2x8GB 2666 MHz RAM. In the BIOS Setup, the frequency is set to "AUTO" (2133 MHz) by default; in 2019 I had set it to 2666 MHz, and I had forgotten about it.

A friend warned me that what I had done was "overclocking of RAM". I didn't know that this counts as overclocking, so I decided to revert, applying "AUTO" again (2133 MHz).

Since then... the reboots have stopped, however... the system has continued to freeze, and the logs point only to my NVIDIA GTX 960.

In the BIOS Setup, I disabled "Global C-State Control", and since then it has been almost 60 days without any system freezes or reboots, which used to happen every day.

Summary:

1. Keeping the RAM frequency on AUTO resolves the reboots;
2. Disabling "Global C-State Control" solves my problem with system freezes (log points to NVIDIA).

These two things solved my problem, and the parameter "processor.max_cstate=1" in GRUB is no longer needed.
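
To check which idle states the kernel actually exposes after changing "Global C-State Control" or "processor.max_cstate", the cpuidle entries in sysfs can be listed. The following is a minimal sketch: the function name and the optional directory argument are made up for illustration, and the state names you see depend on the idle driver in use.

```shell
#!/bin/bash
# Print "<index> <name>" for every idle state exposed for one CPU.
# $1: cpuidle directory (hypothetical override, defaults to CPU 0's).
list_idle_states() {
    local root=${1:-/sys/devices/system/cpu/cpu0/cpuidle} dir
    for dir in "$root"/state[0-9]*; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -r "$dir/name" ] || continue
        echo "${dir##*/state} $(cat "$dir/name")"
    done
    return 0
}
```

Running `list_idle_states` before and after the BIOS change should show whether the deeper states have disappeared.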

Revision history for this message
In , nelsoneci (nelsoneci-linux-kernel-bugs) wrote :

Created attachment 294183
attachment-20815-0.html

Hi Ewerton.

This might be a side effect and somewhat unrelated to the bug. Since YMMV
and this can wreck your HW, please read this as an anecdotal experience.
When you overclock RAM you need to change other settings in order to stabilize
the system. This video from AMD people helped me get 3000 MHz with no
issues. Before this video I was wasting part of the money I paid for the
sticks by running them at 2666. This was with an Ab350m-ds3h (BIOS F31) and a Ryzen
2700X.

  https://www.youtube.com/watch?v=vZgpHTaQ10k

BTW, Memtest86 helped me a lot here. It would detect an invalid
configuration in a second. I was happy after it ran a few times and I
didn't have any issues after that. Since then I switched to a recent
motherboard and the RAM was detected automatically.

Revision history for this message
In , kallisti5 (kallisti5-linux-kernel-bugs) wrote :

Some information: I was having this issue all the time on my Asus X370 and my Ryzen 1800X.

I upgraded my mainboard to an Asus X470, and the issue disappeared overnight. Same RAM/CPU/add-on cards/OS.

So it is potentially due to the X370, or the combination of early Ryzen + X370.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

So, months after my RMA, this issue seems to have surfaced again. I
experienced a lockup on system idle just a while ago.

It seems to also somehow be related to particular boots: my system was
off (not suspend, powered down) last night. It had been stable for
months before that.

I'll be selling my 2950/mobo/cooler and switching to a more stable Intel
(something old). Life is too short to live on the "edge" with these AMD
processors.

Revision history for this message
In , raulvior.bcn (raulvior.bcn-linux-kernel-bugs) wrote :

This thing can happen due to multiple factors.
I was running an 1800X. Freezes occurred within 24-48h of uptime. Disabling Global C-State or enabling typical power idle in UEFI stopped those freezes. The latter option disables PC6 on top of using 0.85V idle voltage. To disable PC6 you need a cold boot, otherwise only the voltage change is applied and PC6 remains enabled, crashing the system as usual.

I upgraded to a 2700X (UEFI cleared) and apparently the issue disappeared. But no, it is just less frequent. Much less. Currently I have an uptime of 23d. I also had another 2700X running for 17d before testing an 1800X again, which crashed within 24h. Afterwards I inserted the current 2700X, which crashed within 48h. The following boot is the one with an uptime of 23d.

All in all, this might be related to either a PSU thing or to a non-existent but required kernel workaround for a bug in the processor, as detailed in the AMD Revision Guide [1]. The 1800X would crash more because it has more bugs and needs more workarounds, while the 2700X has fewer, especially related to the PCIe controller. Ironically, the 2700X consumes less power at idle than the 1800X because it requires lower voltages: 12nm+ vs 14nm, and the voltages specified at every power level are also lower. I don't see much logic in saying that the PSU is the culprit of system instability.

All the people trying "idle=nomwait", "idle=halt" or "processor.max_cstate=5" should be warned those options are useless. There are only 2 Cstates available in Ryzen systems, so if you want to limit Cstate you have to set it to 1 at most -> "processor.max_cstate=1". And the use of the MWAIT instruction is disabled by the UEFI if you insert a Ryzen 1800X processor. The 2700X and the rest of 2nd gen Ryzen are not affected by any MWAIT bug. Idle option is thus useless too.

It's better to try "pcie_aspm=off" or "pcie_aspm=force pcie_aspm.policy=performance" and/or "nvme_core.default_ps_max_latency_us=0". Maybe the PCIe root or something related does not wake up and the processor stalls waiting for an interrupt to be served.
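
After editing the boot loader configuration it is worth confirming that the parameters really reached the running kernel. A minimal sketch, assuming the command line is read from /proc/cmdline; the helper name `cmdline_param` is made up for illustration and it reports the first occurrence of a parameter only:

```shell
#!/bin/bash
# Print the value of a kernel command-line parameter.
# Returns 1 if the parameter is absent; prints an empty line for a bare flag.
# Usage: cmdline_param "<command line string>" <parameter name>
cmdline_param() {
    local cmdline=$1 name=$2 word
    local -a words
    # Split on whitespace without triggering filename globbing.
    read -ra words <<< "$cmdline"
    for word in "${words[@]}"; do
        case $word in
            "$name")   echo ""; return 0 ;;             # flag without a value
            "$name"=*) echo "${word#*=}"; return 0 ;;   # name=value
        esac
    done
    return 1
}
```

For example, `cmdline_param "$(cat /proc/cmdline)" processor.max_cstate` prints the configured limit, or returns a non-zero status if the parameter was not passed at boot.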

[1] https://www.amd.com/system/files/TechDocs/55449_Fam_17h_M_00h-0Fh_Rev_Guide.pdf

Revision history for this message
In , ricki (ricki-linux-kernel-bugs) wrote :

I have a Ryzen 1700X and I'm heavily affected by the problem. With default kernel parameters: lockups every 1-2 days. With "processor.max_cstate=5 rcu_nocbs=0-15": reboots instead of lockups, and less frequent. Still too cumbersome to deal with.

Does anyone know whether this problem also occurs with Ryzen 5? I'd like to give it a second chance.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

I did some research into the latest Ryzens, and I suspect the issue is
still not solved. I did manage to read some alarming posts discussing
similar lockups.

I suspect this also has to do with the motherboard. At this stage, I'm
no longer willing to try (I've done one RMA and 2.5 years with an
unreliable system).

Not to rant, but I had a similar experience with AMD a decade earlier.
Maybe it's time for me to move on. :-)

Revision history for this message
In , exander77 (exander77-linux-kernel-bugs) wrote :

As far as I can tell it was solved for me quite some time ago (on my A485), most probably by a BIOS update (together with earlier kernel updates), and I have not encountered it on the E459 with a newer Ryzen at all.

I have never encountered it on Windows.

My thinking is that it was a combination of kernel behaviour with something in early BIOSes, and it was fixed when both were updated.

If you use a desktop and you have the latest BIOS, I would suggest replacing the motherboard, as somebody pointed out above.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Created attachment 294817
attachment-13720-0.html

After some more digging around, I see the following in `journalctl -b-1` :

    Jan 23 14:48:41 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jan 23 14:48:41 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: ep deq seg = 00000000734e9686, deq ptr = 00000000f203822a

and indeed, the "crashes" I experience also seem to be accompanied by
all my USB devices (mouse, keyboard) losing power.

Revision history for this message
In , ashesh.ambasta (ashesh.ambasta-linux-kernel-bugs) wrote :

Created attachment 294819
attachment-29532-0.html

Actually, never mind.

I also see the following when looking at all similar messages:

    $ journalctl | grep 'xhci.*Mismatch'
    Jun 13 14:43:09 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jun 27 17:58:56 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jul 16 18:10:53 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jul 19 16:53:52 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Sep 10 12:24:53 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Oct 09 13:23:53 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Oct 10 12:40:37 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Nov 01 16:44:36 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Dec 21 12:24:03 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jan 13 19:09:51 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.
    Jan 23 14:48:41 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.

There were no crashes in December/November/October & neither did I lose
any USB functionality as I did now.

In any case, I'm still looking to replace this hardware.

Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

(In reply to Ashesh Ambasta from comment #736)

[…]

> I also see the following when looking at all similar messages:
>
> $ journalctl | grep 'xhci.*Mismatch'
> Jun 13 14:43:09 quasar-nixos-tr kernel: xhci_hcd 0000:09:00.3: Mismatch between completed Set TR Deq Ptr command & xHCI internal state.

[…]

Please report that to the Linux USB folks (entry for USB SUBSYSTEM in the file MAINTAINERS [1]). If you do, please also attach output of `lsusb`.

[1]: https://www.kernel.org/doc/linux/MAINTAINERS

Revision history for this message
In , jvdelisle (jvdelisle-linux-kernel-bugs) wrote :

Comment 743 is Phishing, do not click links.

Revision history for this message
In , agmondroid (agmondroid-linux-kernel-bugs) wrote :

Hello,

After a while with random AMD Ryzen 5 reboots, the final solution was to add the kernel parameter:

acpi_osi=Linux

3 days of stability with Ubuntu 18.04 5.4.0-80-generic and counting.

Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

Agmon, thank you for your report. Please open a separate issue with the full output of `dmesg` attached, and the output of `acpidump`. Please reference the issues here.

Revision history for this message
In , jnonon+kernel (jnonon+kernel-linux-kernel-bugs) wrote :

Hi, I was encountering this issue on a frequent basis. I tried several methods listed all across the internet to solve the "soft lockup", especially while using rsync to copy data from one disk to another.

At some point, I ran memtest86 to check if there was a memory problem. I have 4 DIMMs of 16GB installed. The BIOS settings were set to auto for most fields. Memtest86 failed 25% of the tests, which is really bad.

I noticed that Memtest86 reported metadata for each module giving a voltage setting of 1.350v. The Auto setting in the BIOS had set it to 1.200v instead.

After updating that value, I ran the test suite again and none of the tests failed. I then ran rsync over a large directory, and it did not trigger the soft lockup.

TL;DR: Make sure your RAM is configured with the manufacturer's recommended settings instead of the Auto setting from the BIOS. This goes for both frequency and voltage.
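
One way to spot such a mismatch from a running system, as a rough sketch: compare the rated and configured speed that `dmidecode -t memory` reports for each module. The function name is made up for illustration, and the field names ("Speed", "Configured Memory Speed", or "Configured Clock Speed" in older versions) are assumed from typical dmidecode output, which varies between versions and firmware. The same output normally also carries a "Configured Voltage" line that can be compared by eye with the module's specification.

```shell
#!/bin/bash
# Read `dmidecode -t memory` text on stdin and report every module whose
# configured speed is below its rated speed.
check_dimm_speed() {
    awk -F': *' '
        # Anchored patterns so "Bank Locator" and "Configured ... Speed"
        # are not mistaken for "Locator" and "Speed".
        /^[ \t]*Locator:/ { loc = $2 }
        /^[ \t]*Speed:/   { rated = $2 + 0 }   # "3200 MT/s" -> 3200, "Unknown" -> 0
        /^[ \t]*Configured (Memory|Clock) Speed:/ {
            conf = $2 + 0
            if (rated > 0 && conf > 0 && conf < rated)
                printf "%s: rated %d, configured %d\n", loc, rated, conf
        }
    '
}
```

A typical invocation would be `sudo dmidecode -t memory | check_dimm_speed`. No output means no module was found running below its rated speed, or that the fields were not present in the expected form.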

Revision history for this message
In , pmenzel+bugzilla.kernel.org (pmenzel+bugzilla.kernel.org-linux-kernel-bugs) wrote :

Jonathan, thank you for the feedback. Please share the mainboard model and firmware, and the used memory for the record.

Revision history for this message
In , nelsoneci (nelsoneci-linux-kernel-bugs) wrote :

For the hardware I had, this talk was life-saving. I thought it was this issue, but it turned out to be the RAM as well. I used memtest86 for the first time. Mandatory stuff!

Fortunately things seem to be less manual nowadays. See:

https://www.youtube.com/watch?v=vZgpHTaQ10k

Revision history for this message
In , jnonon+kernel (jnonon+kernel-linux-kernel-bugs) wrote :

Here is the setup:

Vendor: American Megatrends Inc.
Version: P4.80
Release Date: 02/16/2022
BIOS Revision: 5.17

Base Board Information
Manufacturer: ASRock
Product Name: X570 Taichi

Processor Information
Version: AMD Ryzen 9 3900X 12-Core Processor

Memory:
Kingston RAM KF3200C16D4/16GX 16GB DIMM DDR4 3200MT
XMP Profile: DDR4-3200 CL16-18-18 @1.35V

I could not find a way to format this information nicely. So here is a pastebin
https://pastebin.com/TsB2LVD3
