Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
208
This bug affects 35 people
Affects Status Importance Assigned to Milestone
Linux
Confirmed
Medium
linux (Ubuntu)
High
Unassigned

Bug Description

Hi,

We aregetting various kernel crash on a pretty new config.
We're using Ryzen 1800X CPU with X370 Gaming Pro Carbon MB (7A32V1) using latest BIOS available (1.52)

We are running Ubuntu 17.04 (amd64), we've tried different kernel version, native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel version:

native 17.04 kernel
4.10.15

Issues are the same, we're getting random freeze on the machine.

Here is kern.log entry when happening :

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14

Depending on the kernel version, we've got NMI watchdog errors related to CPU stuck (mentioning the CPU core id, which is random).
Crash is happening randomly, but in general after some hours (3-4h).

Now, we've installed kernel 4.11.0-041100-generic #201705041534 this morning and waiting for crash...
For now, the machine is not "used", at least, it's not CPU stressed...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vicent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, i'm still experiencing crashes, even when using 4.11

Please find full kern.log in attachment.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear from 24-48h of operation.

My hardware is:
- SSUS PRIME B350-Plus
- Amd Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in bios (from 12 threads to 6 threads)
- Boot clocksource=tsc iommu=soft
- Disable IOMMU in bios

None of these work. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that this will not work either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just go a kernel panic...

How to add the tag kernel-bug-exists-upstream ?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to achieve this bug report on the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like that the system is stable when removing "nouveau" driver.
Waiting 24/48h and will post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also met the problem, and your advice removing "nouveau" succeeded for work-around to stabilize system. Before it freeze every 3-5hr after booting up, but now working without problem for 2 days.

My enviroment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days update now and no crash.

camparijet => did you installed NVIDIA driver instead ?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't have to use the card for my purpose, so simply i disable it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest to increase SOC voltages, use specific versions of the kernel and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (eg. 17), setting them to the next even number (eg. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR is still necessary though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide link on how DragonflyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I also experience this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an Asrock X370 Gaming professional

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen, once every few minutes it seemed to response for a short time, enabling me to switch to the TTY. Then I always saw some processes hanging at 100% that I could not kill, or the kill was delayed for a long time. I always see the errors in the initial posts during that time. The TTY seemes to work fine though, once entered though.

David Wilson (plottt) wrote :

I've ran into this problem several times. After disabling C-states in the motherboard, my system has been running stable for ~5 weeks uptime so far.

Torge (cyslider) wrote :

Ok, this trick did not help and my PC did not make it through the night without freezing again.

However I finally found the real problem. I installed oibaf a while back as my Kubuntu was flickering all over the place. Now it seems the be the source of my problem. I deinstalled oibaf again and now everything seems fine. No lock screen freezing for me anymore.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

Im getting the same issues, brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling cool n quiet and c-state

However there is another option under advanced - for GLOBAL C-States that i just disabled today and i am waiting to see what will happen.

On BIOS v3.00 PC would freeze after 6 hours of idling or not touching anything. Come back to see my keyboard and mouse and everything was frozen.
With BIOS v.3.10 now i get random reboots atleast once a day.

Im running Ubuntu 16.04.3 Kernel 4.4.0-93

Anyone know to direct me where the SoC voltage would be in BIOS? is it the VDD SoC?
Anyone know how to update my kernel to 4.10 if it tells me i have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple days to a week or so. It happens it seems after being relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all ryzen manufactured after week 30. All my ryzen are pre-24 CPUs and they all have this silent crash issue. Does it also happen with the post week 30 ryzen ? (@tgui what is yours ?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

tags: removed: kernel-da-key
information type: Public → Public Security
information type: Public Security → Public
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
598 comments hidden view all 678 comments

(In reply to alfie from comment #551)
> kernel 4.19.27 with rcu_nocbs=0-11 idle=halt nopti, yesterday at around
> 11:30 p.m. I had an "usual" total lookup of the system during light usage (I
> just had an ssh session opened). With idle=halt, I get no "ACPM WAIT C-state
> bug" line in dmesg.
>
> The problem with this freeze problem is that is difficult to replicate. The
> segfault bug was very easy: 3-4 mesa compilations triggered it for sure.
> This bug, instead, can hide for days and days and it reappears all of a
> sudden.

The segfault compilation issue is well known. I had the same with my first 1700. AMD will RMA. My replacement chip has been fine.

As for the other idle lockup issues, a combination of custom rcu kernel and python script to disable C6 has had my machine up for months.

As I have said previously, I'm not sure that my problem was the same that others have (I didn't have freezes on idle, but rather freezes when compiling), but what helped me is updating BIOS to latest version. ASUS has released a new BIOS with AGESA 0070 for X470 Pro, and I can't trigger the bug anymore. As for ASRock, they will probably release update in a week or so, all vendors usually release new AGESA within a moth, as far as I remember.

Same here, I'm using opensuse Leap 15 and Ubuntu 18.10, both 4.12 and 4.18 kernel suffer the same issue.
I'm using AMD Ryzen 5 1600 with Asrock Fatal1ty AB350 Gaming-ITX/ac board.
I'll try updating the latest bios to see if things got fixed.

I am still receiving lockups even with the latest BIOS, all the BIOS setting recommendations here, and with the idle=halt kernel parameter.

Although I am now fairly certain that I have a faulty motherboard (x470) which seems to be getting worse, and sometimes doesn't even get to the BIOS at all (I should have suspected something was up when I first bought it and the PS/2 port didn't work).
I'm going to try replacing it and hopefully that'll help.

Hi guys. I have an Asus ROG Strix notebook with Ryzen 7 1700. My CPU was manufactured on week 46 of 2017 (UA 1746PGS), so it does not have the segfault problem. My notebook completely freeze when it is idle, not always, but it is more common when newly powered (cool) or after heavy activities. The freezes never occurred during more intense use, only when idle. Like all notebook BIOS, my BIOS options is quite limited and does not have a "Typical current idle" or similar option. I tried to recompile the kernel with the option "RCU_NOCB_CPU" and also tried all possible combinations with the parameters "rcu_nocbs", "iommu=soft/pt", "idle=nomwait/halt", "pti=off" / "nopti", etc, etc, etc. The only alternative that actually solved the freeze problem was to use the program "zenstates.py" disabling the "c6state" core and package. But this attitude makes my CPU run about 5°C warmer. I used Windows on this same machine for weeks without any crashes. This makes me conclude the following: does not crash on windows (ok, it's a software problem, "linux"); CPU intel does not hang in linux (oh no, ok, it's a hardware problem, "Ryzen"). I do not know what AMD thinks about this, but I was forced to change my notebook. Now I got one with an Intel processor. Thank you AMD.

(In reply to Tommy Vercetti from comment #554)
> Same here, I'm using opensuse Leap 15 and Ubuntu 18.10, both 4.12 and 4.18
> kernel suffer the same issue.
> I'm using AMD Ryzen 5 1600 with Asrock Fatal1ty AB350 Gaming-ITX/ac board.
> I'll try updating the latest bios to see if things got fixed.

I updated the bios to P5.30, it only locked up for once so far.
How do I turn the log on for this?

I assume there is several (at least two) similar problems that cause spontaneous system hangs.

One of them is mwait bug listed in AMD errata. Looks like idle=halt is partial workaround for this. But, as said in AMD community forum, guest OS in virtual machine may execute mait instruction and provoke this bug. Luckily, mwait is not common for user-space applications (still not sure about that).

Another is power supply problem. This may be caused by unsupported PSU (no 0A 12V) or unsuitable power subsystem on motherboard. This may be partially (again!) solved by BIOS "typical current idle" option or disabling C6 states by .py script.
IMO this bios option may be implemented not properly in some mobos firmware. Looks like it only "says to OS" do not use C-states, but does not prevent deep sleep on hw level. Here we got [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0) in dmesg and problem persists.

Have read some forums with assures of completely stable Ryzen systems work under Linux without any tweks. So not really too many systems affected by mwait and PSU lockups. I think some combination of factors may provoke this behavior. Mobo+PSU, memory latency (why not?) or some vendor-provided bios config.

Also in some cases these two problems may appear together.
Obviously, segfaults or amdgpu crashes is not related to this bug.

PS:Sorry for my bad English.

The PSU problem sounds like a weak excuse from AMD or may they are talking about very very bad 10 dollars units.

The BIOS options "typical/low current" doesn't seems to change the c-states set up by the kernel routines here. With any selection, I always end up with
c0 POLL
c1 ACPI HLT
c2 ACPI IOPORT 0x414

as you can see with cpufreq cpupower idle-info or /usr/src/linux/tools/power/x86/turbostat/turbostat -n1

That happens because the firmware already informed the kernel mwait/monitor cannot be used for c-states.

If idle=halt is used, there is no c-state management and the idl instruction is used to put a core in idle mode.

I think it is the hyper treading handling that is somehow bugged, somewhere.

Also note that many ryzen users overclock their machine and if you overclock, you have to c-state problems at all.

So if you got message with "typical current idle" option:
[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
it means probe of this C-state failed and it must be avoided by kernel.
Is there any difference between C-state list in OS with and without this option? I have only laptop with very limited BIOS and can't check.

And there only C0-1-2. What about C6? What says zenstates.py about that? Frankly, I have not tried this script - do not want disable power-saving features for laptop.

Also some overclocking may solve this problem (see comment 103). Turn on "performance bias" or similar option in BIOS for testing.

This is pretty strange, could someone explain it to me? Wasn't mwait bugged in ryzen?

Tired of the random freezes, I tried this:

--- ./arch/x86/kernel/acpi/cstate.c.orig 2018-10-22 08:37:37.000000000 +0200
+++ ./arch/x86/kernel/acpi/cstate.c 2019-03-20 19:26:45.261101857 +0100
@@ -86,6 +86,7 @@ static long acpi_processor_ffh_cstate_pr
        num_cstate_subtype = edx_part & MWAIT_SUBSTATE_MASK;

        retval = 0;
+#if 0
        /* If the HW does not support any sub-states in this C-state */
        if (num_cstate_subtype == 0) {
                pr_warn(FW_BUG "ACPI MWAIT C-state 0x%x not supported by HW (0x%x)\n",
@@ -93,6 +94,7 @@ static long acpi_processor_ffh_cstate_pr
                retval = -1;
                goto out;
        }
+#endif

        /* mwait ecx extensions INTERRUPT_BREAK should be supported for C2/C3 */
        if (!(ecx & CPUID5_ECX_EXTENSIONS_SUPPORTED) ||

And I don't have any freeze since 3 days...

I have revisited the errata. Errata 1033 "A Lock Operation May Cause the System to Hang" and 1109 "MWAIT Instruction May Hang a Thread" are the top contenders. According to page 12 and 13 of that document, Pinnacle Ridge processors are not affected. However, page 16 and 17 suggest 2nd Gen Ryzen are affected. This makes me conclude: either there is an error in the document, or Raven Ridge aka improved Zen 1 (Desktop and Mobile) are affected, while Pinnacle Ridge is not.

During my latest tests on kernel 5.0.1, on AGESA 1006 and after upgrading to AGESA 0070, I wasn't able to reproduce any freezes / hangs on a Ryzen 7 2700, ASUS A320M-K (BIOS defaults, no OC, no Typical Current Idle setting), 16GB Patriot Viper RGB 3200CL16, ASUS RX480 8GB and Corsair AX860 (Haswell C6/C7 support, 80 PLUS Platinum).

I installed a fresh, bare minimum, default Arch Linux install w/o any special configuration, let the system sit in idle for half a day and check log for hangs. Then, I enabled all power management features using powertop and let the system sit over night, still no hangs. I repeated the same for AGESA 0070, still no hangs.

I think it's too early to conclude anything, as the hangs / freezes used to be very random. I will repeat my tests (on AGESA 0070) with a slightly more bloated system by installing GNOME.

I am running Fedora on AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx with Gnome.

I use kernel parameters idle=nomwait iommu=pt processor.max_cstate=1 set via grubby.

I am not getting any hangs unless I suspend with lid close or suspend on power button. It would probably be better to not execute an MWAIT per the errata. The method in comment 561 would work for those who have the gumption to build their own kernel.

The iommu=pt I read about somewhere as useful to do, but I don't know if it helps. Some suggest set idle=halt which also avoids the MWAIT instruction.

I am also told that the DRI driver has an issue with loading at boot which will hang the kernel. https://bugs.freedesktop.org/show_bug.cgi?id=109206

(In reply to alfie from comment #561)
> This is pretty strange, could someone explain it to me? Wasn't mwait bugged
> in ryzen?

Looks like this patch makes kernel to ignore BIOS messages about unsupported C-state.
this function used in "acpi_processor_ffh_cstate_probe" which further used in "processor_idle.c" file. It cause "cx.entry_method" value change from ACPI_CSTATE_SYSTEMIO to ACPI_CSTATE_FFH.

It is really strange that this patch works but idle=halt is not. Maybe this is somehow silently disables C6 state.
Can your system reach turbo frequencies with this patch?
Maybe I understand something wrongly...

There's a patch in comment #526 for people to test before we include it so that there's at least *some* fix in the kernel, going forward...

(In reply to Another User from comment #564)
> Can your system reach turbo frequencies with this patch?
> Maybe I understand something wrongly...

Yes, I have a ryzen 1600 and I can see 1/2 cores running at ~3.6 GHz under some peculiar stress case.

(In reply to Borislav Petkov from comment #565)
> There's a patch in comment #526 for people to test before we include it so
> that there's at least *some* fix in the kernel, going forward...

But I really want C6/P6 and ignoring the BIOS seems to get me there...

(In reply to alfie from comment #566)
> But I really want C6/P6 and ignoring the BIOS seems to get me there...

Why do you think the patch I pointed to won't give you C6?

@Borislav has the fix for erratum 1033 "A Lock Operation May Cause the System to Hang" been applied so far? The suggested workaround was "Program MSRC001_1020[4] to 1b", but I couldn't find anything about it in master branch. According to the document, 1033 only affects B1.

(In reply to Tolga Cakir from comment #568)
> @Borislav has the fix for erratum 1033 "A Lock Operation May Cause the
> System to Hang" been applied so far? The suggested workaround was "Program
> MSRC001_1020[4] to 1b", but I couldn't find anything about it in master
> branch. According to the document, 1033 only affects B1.

Such a "fix" does not exist. Possibly because it is unlikely this is causing it and BIOS might be applying the fix already. People with B1s (model 1, stepping 1) could test though by doing as root:

# modprobe msr
# rdmsr -a 0xc0011020

and looking at bit 4 in the result.

(In reply to Borislav Petkov from comment #569)
> (In reply to Tolga Cakir from comment #568)
> Such a "fix" does not exist. Possibly because it is unlikely this is causing
> it and BIOS might be applying the fix already. People with B1s (model 1,
> stepping 1) could test though by doing as root:
>
> # modprobe msr
> # rdmsr -a 0xc0011020
>
> and looking at bit 4 in the result.

I've got an ASUS PRIME X370-PRO with the B1 CPU mentioned on firmware 4024
(from 2018/09/28).

rdmsr yields 6800000000010, which has bit 4 set.

(In reply to Lars Viklund from comment #570)
> rdmsr yields 6800000000010, which has bit 4 set.

Looks like your BIOS applies the fix. Now, does the patch in comment #526 fix your freezes?

(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
>
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

I'm sorry, I don't currently suffer from any significant hangs on my machines when running stock Ubuntu 18.04.02 LTS (4.15.0-43-generic).

I mostly meant #570 to demonstrate that some firmware does indeed implement the errata.

While I've had core hangs and hard freezes in the past, the machine has enough workarounds applied to make it sufficiently stable. Current uptime is something like 81 days since the last incident and I can honestly not remember what it did then, but it's been used heavily as a build machine.

From my memory: sufficiently RMA:d CPUs, slight overclocking, change the current idle setting in firmware, boot idle=nomwait.

Back when I had significant problems, the best way to mitigate them was to disable SMT.

It's possible that my increased stability has come from firmware implementing this errata, or it might be coincidental, I don't have enough data to tell.

(You people should be happy you're not running FreeBSD, there I can reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )

(In reply to Borislav Petkov from comment #569)
> (In reply to Tolga Cakir from comment #568)
> > @Borislav has the fix for erratum 1033 "A Lock Operation May Cause the
> > System to Hang" been applied so far? The suggested workaround was "Program
> > MSRC001_1020[4] to 1b", but I couldn't find anything about it in master
> > branch. According to the document, 1033 only affects B1.
>
> Such a "fix" does not exist. Possibly because it is unlikely this is causing
> it and BIOS might be applying the fix already. People with B1s (model 1,
> stepping 1) could test though by doing as root:
>
> # modprobe msr
> # rdmsr -a 0xc0011020
>
> and looking at bit 4 in the result.

206800000000100

CPU: AMD Ryzen 5 2400G
Motherboard: Asus ROG STRIX B350-F Gaming
Firmware version: 4207
Linux: 4.18.16-300.fc29 (Fedora 29 live system)

(In reply to Dennis Schridde from comment #573)
> 206800000000100

If this is correct:

http://www.cpu-world.com/CPUs/Zen/AMD-Ryzen%205%202400G.html

then you should have in /proc/cpuninfo something like this:

cpu family: 23
model: 17
stepping: 0

and if so, not affected.

(In reply to Borislav Petkov from comment #574)
> (In reply to Dennis Schridde from comment #573)
> > 206800000000100
>
> If this is correct:
>
> http://www.cpu-world.com/CPUs/Zen/AMD-Ryzen%205%202400G.html
>
> then you should have in /proc/cpuninfo something like this:
>
> cpu family: 23
> model: 17
> stepping: 0
>
> and if so, not affected.

A year ago I was affected by something (comment #295), but the issue appears to have vanished within the last months (comment #547 and comment #573).

(In reply to Lars Viklund from comment #572)
> (You people should be happy you're not running FreeBSD, there I can
> reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )

Looks like they've addressed Ryzen errata issues around August 2018:

https://github.com/freebsd/freebsd/blob/75ee4f08d3acd4bf70f24b3203fa440255873973/sys/amd64/amd64/initcpu.c#L133

Also, they've made machdep.idle=hlt and machdep.idle_mwait=0 default for Ryzen processors.

I've additionally checked the registers mentioned in FresBSD's source. Latest BIOS 4406 (w/ AGESA 0070) for my ASUS A320M-K seems to apply fixes for all affected errata.

Eventhough I can't reliably reproduce the stability issues anymore, I think it's still good practice to have Boris' patch applied, since it's addresses a known erratum. I'll test the patch and check for regressions.

I'm actually desperate about this issue. I'm looking forward to see results of Boris' patch, though I really doubt it will solve this issue. Needs explicit check against self-built kernel really.

Not all manufacturers are willing to fix this issue (especially when it comes to laptops), as Windows seems to fix this inside it's kernel since 1709 I believe. I've made a post on HP forum about this, specifically for my laptop:

https://h30434.www3.hp.com/t5/Notebook-Boot-and-Lockup/Random-Soft-Lock-up-on-Ryzen/m-p/7062823

(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
>
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

Hi Boris, on my laptop with its latest BIOS I get:

$ sudo rdmsr -a 0xc0011020
6800000000000

This is on:

cpu family : 23
model : 17
model name : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
stepping : 0
microcode : 0x8101007

Booting with:

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.20.16-200.fc29.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet LANG=en_US.UTF-8 idle=nomwait iommu=pt processor.max_cstate=1

Has your patch been applied to kernel yet?

(In reply to JerryD from comment #578)
> $ sudo rdmsr -a 0xc0011020
> 6800000000000
>
> This is on:
>
> cpu family : 23
> model : 17
> model name : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
> stepping : 0
> microcode : 0x8101007

You're showing me the MSR output for erratum 1033 but your CPU is
not affected.

> Has your patch been applied to kernel yet?

No it hasn't - I'm still waiting for someone to test it and confirm that it fixes the issue on their boxes. And by issue I don't mean erratum 1033 - that is a red herring anyway - but the lockups people are reporting on some broken BIOSes.

Created attachment 281999
Add rifw kernel parameter to test a couple of patch to workaround ryzen freezes

Add rifw kernel parameter.

rifw=none - don't do anything
rifw=alfie - ignore BIOS
rifw=boris - use Borislav Petkov patch

I works with kernels 4.19, 4.20 and 5.0.2

> No it hasn't - I'm still waiting for someone to test it and confirm that it
> fixes the issue on their boxes. And by issue I don't mean erratum 1033 -
> that is a red herring anyway - but the lockups people are reporting on some
> broken BIOSes.

Yes, your patch "seems" to work in my case.

"Seems" because this bug is difficult to replicate...

I've spent the last few days reading through this thread.

I have Fedora 28, ryzen 1700x, and ASRock Steel Legend B450M. It's a server mostly used for files at the moment. Bone stock Fedora 28, samba, ZFS.

With GPU plugged in, hangs within 6-10 hours. Replicating this too often since March 27th.

With GPU unplugged, hangs happen within 10-20 minutes. Replicated 3 times.

Does not hang while SSH'd in.

Ran Memtest and tested for the segfault issue. Hardware is fine. Bought 'open box' on Amazon; this was in case they were re-selling an affected CPU without knowing.

Have not tried alfie's patch. Do have kernel 4.20 compiled with alfie's newest rifw patch as per Comment 580.

With Borislav's patch:
- Modifications to the OS are Samba, ZFS, Borislav's patch (comment #526 on kernel 5.0.1)
- booted March 30th at ~22:00
- Hung March 31st at ~09:40
- This is slightly longer than other tests
- I used the instructions he provided in Comment 511 without checking out 4.20. It compiles, but doesn't fix the issue, at least with linux 5.0.1-rc2 (apologies for not following steps and using 4.20)

'Typical Current Idle' set:
- Modifications to the OS are only Samba and ZFS
- Booted at 20:45 EST
- Still on now at time of this post 11:40 EST(I'll check before I hit send)
- This is the record up-time for this server
- Haven't tested anything else because I needed to have reliable access to my shares for my days off. This was 1 seemingly sure-fire way of making that happen.

Will update with more results once linux 4.20 with borislav and Alfie's patches tested. Could be a couple days. Less if 'Typical Current Idle' doesn't work, as I'll likely be RMAing my mobo and CPU.

I used the rifw=alfie with a 1700x on an MSI X370 Gaming Plus BIOS 7A33v5H with no freezing for 1 week. Computer was rock solid on older BIOS (7A33v55) for a year+ until BIOS update to latest.

I'm using low power idle option with kernel 4.14.111 opensuse 15.0.

PC had kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0) - After BIOS update and gone now after patch in Comment 580 with rifw=alfie boot option.

I haven't had the chance to get through the whole thread yet and not sure if I'll have enough time today. However, I wanted to quickly ask if anyone has faced this issue with an Intel processor?

We are seeing kernel panics due to the soft lockup issue with the following configuration:

CPU Architecture: Intel Skylake or Higher
vCPUs: 8
RAM: 52 GB
Kernel: 4.15.0
OS: Ubuntu 16.04.1

For some additional information, this setup does leverage Nested VMX to run Docker containers with the --privileged option enabled.

The soft lockups seems to occur at a rate of varying from 1-5% of our VMs getting impacted in a 24-hour period running in this configuration.

The load on the cluster does not seem to impact the frequency of the issue.

One thing I did notice is that it seems most/all of the kernel panics have had been involved with a call to smp_call_function_single.

Since I have not made it far enough through this thread, I'm not sure if the focus on this is exclusively on AMD, or if anyone else had an Intel example?

Is there a fix in the pipe for this issue and if so, a target kernel version?

Created attachment 282319
Threadripper 2920x soft lockup

I've went through this thread, and tried to disable c6 state, add idle=halt with no effect. Followed this excellent reprod steps in https://www.reddit.com/r/Amd/comments/apw8im/ryzen_freezes_in_linux_even_if_linux_is_in_vm/ It is pretty consistent. However, switching to gcc-8 seems resolved this for this particular reprod steps. With dmesg -w, I was able to capture and verify this is indeed a soft lockup. Attached the dmesg output.

It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest BIOS on Ubuntu 18.04.2 LTS.

Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Hello Liu

You're hitting another bug: it happens on high load (which can be easily triggered by a lot of parallel compilation threads). The only solution for your kind of problem is to RMA you CPU. See: https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response (but the problem seems not to be related to Ryzen 7 as stated at Phoronix and it's not possible to say that CPU's after a special build time would be ok). I have been affected, too (Ryzen 7) and went through RMA process. Luckily, the first RMA'd CPU was ok - there have been people who needed more than one RMA. Take a look there:
https://community.amd.com/message/2822339
If I hit the problem those days (gcc segfault), I had to rapidly reboot my machine because it mostly hang completely some time later.

This thread here covers the problem of hanging Ryzen CPU's when the machine is completely on idle over some time (it's the opposite of the problem you're facing and I saw before CPU RMA). I "fixed" the hang on idle problem by slightly overclocking the CPU (+200 MHz - ASUS X370-PRO / BIOS 4011 04/19/2018 - Bios option: Optimization for daily computing). All other suggested workarounds don't work for me. But you can't say, overclocking would fix the problem for sure (as any other suggested workaround). But you could try it.

Hey, Klaus,

My understanding of that issue is it only affects Ryzen 1700 / 1700x and doesn't affect Epyc / TR line. I am on ThreadRipper 2920x, which should be out of marginality problem for over a year now. Also, it doesn't SegFault at any point, just hangs. I do believe it is the similar problem as this thread, but since it is rare, so it could be just one faulty CPU for sure. Just doesn't align well with marginality problem timeline.

I have a couple of Threadrippers (1950X) that also hang under very specific workloads with a Asus X-399-A motherboard. In my case when the freeze happens the only way to get the machine back is to unplug it from the wall and plug it back. Not even reset works.

After many tweaks, I found out that I can avoid that but setting either the "overclocking enhancement" on in the BIOS (this is specific to Asus boards) our manually setting the Vcore to something like 1.315V (which should work on all motherboards). My guess is than when in idle the motherboard drops the voltage too low and the CPU hangs.

I ran into similar problems to this with a Threadripper 1950X and none of the proposed workarounds did anything. Disabling C6 in the BIOS made it fail to boot for some reason. Setting "Typical Current Idle" in the BIOS, processor.max_cstate=1 in the kernel command line, or disabling C6 with ZenStates.py all had no effect. I also tried overclocking and underclocking.

I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just works (for 10+ hours, I don't leave my machine on overnight). Once it started working, I set everything back to the defaults except for "Typical Current Idle" (since I have a PSU from 2010 and I doubt it supports any fancy features from 2013).

Unfortunately I can't easily narrow down all of the changes after the re-install, but the ones I'm aware of are:

 - Using the Nouvea driver (previously proprietary nVidia drivers)
 - Booting in EFI mode now (previously legacy mode)
 - Using ext4 (previously btrfs)

I strongly suspect that the graphics driver was the problem since my lockups would cause the screen to become completely unresponsive, but sound continued working, and in one case I had a lockup during a video call and the other person could still see and hear me.

I plan to try the proprietary drivers again sometime and I'll make a note if that brings the problem back for me.

Download full text (3.3 KiB)

(In reply to Brendan Long from comment #589)
> I strongly suspect that the graphics driver was the problem since my lockups
> would cause the screen to become completely unresponsive, but sound
> continued working, and in one case I had a lockup during a video call and
> the other person could still see and hear me.

What you're describing here is a new "feature" introduced between kernel 4.19.16 and 17 e.g. I can see exactly the same here with radeon hardware. The system is completely working (even VMs on the host are running well) except of graphics - even tty terminals are working sometimes. When ssh'ing the machine, I can always see log entries like these:

radeon 0000:0a:00.0: ring 0 stalled for more than 14084msec
radeon 0000:0a:00.0: GPU lockup (current fence id 0x0000000000053ed7 last fence id 0x0000000000053f0f on ring 0)
...

I'm trying to near it down currently using git bisect. The suspicious changes left are at the moment:

2019-01-22 arm64: Don't trap host pointer auth use to EL2 Mark Rutland bad
2019-01-22 arm64/kvm: consistently handle host HCR_EL2 flags Mark Rutland
2019-01-22 scsi: target: iscsi: cxgbit: fix csk leak Varun Prakash
2019-01-22 scsi: target: iscsi: cxgbit: fix csk leak Varun Prakash
2019-01-22 Revert "scsi: target: iscsi: cxgbit: fix csk leak" Sasha Levin
2019-01-22 mmc: sdhci-msm: Disable CDR function on TX Loic Poulain
2019-01-22 netfilter: nf_conncount: fix argument order to find_next_bit
2019-01-22 netfilter: nf_conncount: speculative garbage collection on empty lists Pablo Neira Ayuso
2019-01-22 netfilter: nf_conncount: move all list iterations under spinlock Pablo Neira Ayuso
2019-01-22 netfilter: nf_conncount: merge lookup and add functions Florian Westphal

2019-01-22 netfilter: nf_conncount: restart search when nodes have been erased Florian Westphal ?
2019-01-22 netfilter: nf_conncount: split gc in two phases Florian Westphal
2019-01-22 netfilter: nf_conncount: don't skip eviction when age is negative Florian Westphal
2019-01-22 netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS Shawn Bohrer
2019-01-22 can: gw: ensure DLC boundaries after CAN frame modification Oliver Hartkopp
2019-01-22 tty: Don't hold ldisc lock in tty_reopen() if ldisc present Dmitry Safonov
2019-01-22 tty: Simplify tty->count math in tty_reopen() Dmitry Safonov
2019-01-22 tty: Hold tty_ldisc_lock() during tty_reopen() Dmitry Safonov
2019-01-22 tty/ldsem: Wake up readers after timed out down_write() Dmitry Safonov

As you're describing correctly, the problem seems to be network related. I'm getting this error two when watching videos from internet. I'm...

Read more...

(In reply to Brendan Long from comment #589)
>
> I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just
> works (for 10+ hours, I don't leave my machine on overnight). Once it
> started working, I set everything back to the defaults except for "Typical
> Current Idle" (since I have a PSU from 2010 and I doubt it supports any
> fancy features from 2013).

a) Fedora 29 has very new kernels. In the last few weeks they rebased to 5.0 and 5.0 is supposed to have a lot of Ryzen fixes in it.

b) For me "Typical Current Idle" (plus mwait) was all that was required to fix the problem. I'm pretty sure kernel 5.0 makes it so mwait override is not needed.

> - Using the Nouvea driver (previously proprietary nVidia drivers)
> - Booting in EFI mode now (previously legacy mode)
> - Using ext4 (previously btrfs)

We never used proprietary video driver and we both had the problem and fixed the problem. Same with EFI: we use just legacy mode. FS should not matter at all.

Displaying first 40 and last 40 comments. View all 678 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers