Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
This bug affects 33 people
Affects          Status      Importance   Assigned to   Milestone
Linux            Confirmed   Medium
linux (Ubuntu)               High         Unassigned

Bug Description

Hi,

We are getting various kernel crashes on a fairly new configuration.
We're using a Ryzen 1800X CPU with an X370 Gaming Pro Carbon motherboard (7A32V1) running the latest BIOS available (1.52).

We are running Ubuntu 17.04 (amd64). We've tried different kernel versions: the stock one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel versions:

native 17.04 kernel
4.10.15

The issue is the same on each: the machine freezes at random.

Here are the kern.log entries from when it happens:

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14
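
The two stall reports above carry enough information to estimate how long CPU 0 had been stalled. The following is a minimal sketch, not part of the original report: it assumes that the `t=` field counts jiffies since the start of the stalled grace period (as the kernel's RCU stall-warning documentation describes it) and derives the tick rate from the two reports themselves rather than assuming a value for HZ. The function names are made up for illustration.

```python
import re

# Matches the "detected by" line of an rcu_sched stall report. The pattern is
# based only on the format seen in the log excerpt above.
DETECTED = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\(detected by (?P<cpu>\d+), t=(?P<jiffies>\d+) jiffies"
)

def parse_stalls(log_text):
    """Return (kernel_timestamp, detecting_cpu, stall_jiffies) per report."""
    return [
        (float(m.group("ts")), int(m.group("cpu")), int(m.group("jiffies")))
        for m in DETECTED.finditer(log_text)
    ]

def estimate_hz(reports):
    """Estimate jiffies per second from the first and last report of one stall."""
    (ts0, _, j0), (ts1, _, j1) = reports[0], reports[-1]
    return (j1 - j0) / (ts1 - ts0)

# The two "detected by" lines copied from the kern.log excerpt.
SAMPLE = """\
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
"""

reports = parse_stalls(SAMPLE)
hz = estimate_hz(reports)
stall_seconds = reports[-1][2] / hz
print(f"HZ ~ {hz:.0f}, stall has lasted ~ {stall_seconds / 3600:.1f} h")
```

Under those assumptions the two reports are 45005 jiffies and about 180 s apart, so the tick rate works out to roughly 250 Hz and the stall had already lasted about two hours at the time of the second report.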

Depending on the kernel version, we also get NMI watchdog errors about a stuck CPU (mentioning the CPU core id, which is random).
The crashes happen at random, but generally after a few hours (3-4 h).

We installed kernel 4.11.0-041100-generic #201705041534 this morning and are now waiting for a crash...
For now, the machine is not really "used"; at least, it is not under CPU stress.

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, I'm still experiencing crashes, even when using 4.11.

Please find the full kern.log attached.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear after 24-48 h of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in bios (from 12 threads to 6 threads)
- Boot with clocksource=tsc iommu=soft
- Disable IOMMU in bios

None of these work. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that this will not work either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just got a kernel panic...

How do I add the tag kernel-bug-exists-upstream?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to file this bug report on the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I will wait 24-48 h and post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also ran into this problem, and your advice to remove "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting; now it has been working without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days of uptime now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't need the card for my purposes, so I simply disabled it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
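
Before and after applying this workaround it helps to confirm what the kernel currently reports. A minimal sketch follows, assuming the documented meanings of the `kernel.randomize_va_space` sysctl values; the helper name `describe_aslr` is made up for illustration and is not part of any tool mentioned in this thread.

```python
# Documented values of the kernel.randomize_va_space sysctl.
ASLR_MODES = {
    0: "disabled",
    1: "partial (stack, mmap base and VDSO randomized)",
    2: "full (as 1, plus heap randomization)",
}

def describe_aslr(raw):
    """Interpret the raw contents of /proc/sys/kernel/randomize_va_space."""
    value = int(raw.strip())
    return value, ASLR_MODES.get(value, "unknown")

# Demo with literal strings, so the sketch runs on any system.
for raw in ("0\n", "2\n"):
    value, description = describe_aslr(raw)
    print(f"randomize_va_space = {value}: {description}")
```

On a real machine the input would come from `open("/proc/sys/kernel/randomize_va_space").read()`. Note that the `echo` above needs root and does not survive a reboot on its own.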

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest increasing SoC voltages, using specific kernel versions and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (eg. 17), setting them to the next even number (eg. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR is still necessary though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide a link showing how DragonFlyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.
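
To see whether a given kernel was built with these options, its config file can be scanned. The sketch below is only an illustration: the function name and the sample config text are hypothetical, and it assumes the usual kernel `.config` syntax (`CONFIG_FOO=y`, or `# CONFIG_FOO is not set` for disabled options).

```python
def enabled_options(config_text, wanted):
    """Map each wanted option to True if the config text sets it to 'y'."""
    set_options = {}
    for line in config_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skips comments such as "# CONFIG_X is not set"
        name, value = line.split("=", 1)
        set_options[name] = value
    return {name: set_options.get(name) == "y" for name in wanted}

WANTED = ["CONFIG_RCU_NOCB_CPU", "CONFIG_RCU_NOCB_CPU_ALL"]

# Hypothetical excerpt of a kernel config, for demonstration only.
SAMPLE_CONFIG = """\
CONFIG_TREE_RCU=y
# CONFIG_RCU_NOCB_CPU is not set
"""

result = enabled_options(SAMPLE_CONFIG, WANTED)
print(result)
```

On Ubuntu the config of the running kernel is normally installed as `/boot/config-<release>`, where `<release>` is the output of `uname -r`; passing that file's contents instead of `SAMPLE_CONFIG` checks the real kernel.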

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an ASRock X370 Gaming Professional.

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen: once every few minutes it seemed to respond for a short time, which let me switch to a TTY. There I always saw some processes stuck at 100% that I could not kill, or whose kill was delayed for a long time. I always see the errors from the initial post during that time. The TTY seems to work fine once entered, though.

David Wilson (plottt) wrote :

I've run into this problem several times. After disabling C-states in the motherboard BIOS, my system has been running stably for ~5 weeks of uptime so far.
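
For anyone who wants to see which idle states Linux itself exposes, the cpuidle sysfs interface lists them per CPU. This is a minimal sketch assuming the standard `cpuidle/stateN/name` and `cpuidle/stateN/disable` files; the function name is made up, the demo runs against a fake directory tree with hypothetical state names, and whether these entries reflect a BIOS-level C-state switch is not something this thread establishes.

```python
import tempfile
from pathlib import Path

def idle_states(cpu_dir):
    """Return [(state_dir_name, state_name, disabled)] for one CPU directory."""
    states = []
    for state in sorted(Path(cpu_dir, "cpuidle").glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        disabled = (state / "disable").read_text().strip() != "0"
        states.append((state.name, name, disabled))
    return states

# Build a fake sysfs-like tree so the sketch runs without touching /sys.
with tempfile.TemporaryDirectory() as tmp:
    fake_states = [("POLL", "0"), ("C1", "0"), ("C2", "1")]  # hypothetical
    for index, (name, disable) in enumerate(fake_states):
        state_dir = Path(tmp, "cpu0", "cpuidle", f"state{index}")
        state_dir.mkdir(parents=True)
        (state_dir / "name").write_text(name + "\n")
        (state_dir / "disable").write_text(disable + "\n")
    demo_states = idle_states(Path(tmp, "cpu0"))

print(demo_states)
```

On a real system the argument would be `/sys/devices/system/cpu/cpu0`, repeated for each CPU of interest.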

Torge (cyslider) wrote :

Ok, this trick did not help and my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed oibaf a while back because my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf again and now everything seems fine. No more lock screen freezes for me.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced, Global C-States, that I disabled only today, and I am waiting to see what happens.

On BIOS v3.00 the PC would freeze after 6 hours of idling. I would come back to find my keyboard, mouse and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Does anyone know where the SoC voltage is in the BIOS? Is it the VDD SoC?
Does anyone know how to update my kernel to 4.10 if it tells me I already have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple of days to a week or so. It seems to happen after the machine has been relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzen CPUs manufactured after week 30. All my Ryzen CPUs are pre-week-24 parts and they all have this silent crash issue. Does it also happen with post-week-30 Ryzen CPUs? (@tgui, which is yours?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

tags: removed: kernel-da-key
information type: Public → Public Security
information type: Public Security → Public

(In reply to kernel from comment #335)
> I think that is a great idea. I have time to waste at airports soon, so can
> draft an email message.
>

Just to tell part of my story: I bought the well-known Asus laptop with a Ryzen CPU. It exhibited the segfault bug. But before I could be sure about that, it burnt out just two weeks after I bought it... I sent it to Asus, hoping for a new laptop. Asus officially told me that they replaced the motherboard and the keyboard (they did not say whether the CPU was replaced, and I was not able to check this).
The repaired laptop no longer exhibited the segfault bug (same kernel version), but it did exhibit the soft-lock bug. I didn't know it was this bug at the time. So I returned the laptop to Asus again, asking to be reimbursed. Asus repaired it again by changing the motherboard again. The bug was still present. (For those interested, I did finally get reimbursed.)

Then I decided to do as usual: build my own desktop. And I was facing the soft-lock bug until I tweaked the voltages...

To go back to the story: if we can trust Asus, a motherboard change made the segfault bug disappear (AMD officially 'replaced' such 'faulty' CPUs) but made the soft-lock bug appear. A second change of the motherboard did not improve anything.

What I guess (since this cannot be really ensured) is that AMD does not know where all these bugs come from. And AMD does not know how to resolve them. AMD does not even know if this can be solved.

What we do know is that the guy responsible for the Ryzen architecture was hired back by AMD (after having moved to Intel or Apple) but left AMD 3 years ago (so before Ryzen came out). He moved to Tesla and is now working for Intel again.
Link is here: https://en.wikipedia.org/wiki/Jim_Keller_(engineer)

And before my last paragraph could be misinterpreted, I was just meaning that this guy is the one to contact if we want some concrete information.

I've sent the aforementioned outlets an email. For completeness, here's the entire thing:

Recipients:
Ars Technica: https://arstechnica.wufoo.com/forms/z7p8x7/
Tomshardware: http://www.purch.com/about/#contact-general
HardOCP: <email address hidden>
Anandtech: http://www.purch.com/about/#contact-general
Tweakers.net: <email address hidden>
Phoronix: https://www.phoronix-media.com/?

Subject: Publication of an AMD Ryzen hardware issue

Dear editor,

Since their introduction the AMD Ryzen processors have been plagued by several issues, most notably the segfault issue that occurred under high (compilation) loads. To that particular issue AMD has responded by replacing affected chips.

However, there is another significant issue that affects both Ryzen 1xxx and Threadripper CPUs, as well as the newer Ryzen 2xxx processors. It appears Epyc is not affected (although sample size is one in this case).

The most complete storyline on this issue can be found in the link below [1], however for an overview the rest of my email attempts to summarise the issue.

This issue results in a complete system freeze, occurring under full idle conditions, and requires a hard reset. Evidence suggests this is a hardware problem, since several workarounds have been found that mitigate/solve the issue, such as disabling C6 entirely (inefficient), overclocking and tweaking voltages (not for everyone), or running processes that keep the CPU active at all times (again, inefficient and pointless).

AMD has been contacted multiple times but refuses to acknowledge the issue. At some point in one reply AMD blamed users' PSUs; this is obviously nonsensical as the issue occurs on a wide variety of PSUs including brand new models, the only constant is the Ryzen platform.

In response, AMD has pushed motherboard manufacturers to add an otherwise undocumented BIOS option, "Power Supply Idle Control", as part of their AGESA update. However, in its default value this setting does *not* solve the problem, and with other settings doesn't *always* solve the problem. To be precise, this setting needs to be changed from "auto" to "typical":
/advanced/amdcbs/zen-common-options/"power supply idle control"
However it is unknown what this option actually does.

The issue is also present on laptop platforms. For example, one user sent his laptop back to Asus repeatedly for motherboard replacements, but the issue remained.

The root cause of this issue is still unknown, and moreover, the issue is still present in the latest Ryzen 2xxx CPUs. AMD's refusal to acknowledge, let alone resolve, the issue has led to this email, in the hope that media attention to this problem will inspire AMD to take action. Hence, I propose that your outlet publish an article about the issue.

Kind regards,
<email address hidden>

[1] https://bugzilla.kernel.org/show_bug.cgi?id=196683

Sadly I haven't heard back, apart from the following from HardOCP:

"We touched on this issue in one of our recent reviews and found the issue to be with the Power settings implemented by Microsoft. The bug has been reported to MS. If you cycle through one of two of the other power settings, the issue will go away from our experience."

To which I pointed out that this issue occurs under Linux in particular (due to full idle). Haven't heard back yet.

Does anyone here have direct contacts with any of the major outlets?

(In reply to kernel from comment #340)
> Sadly I haven't received back apart from the following from HardOCP:

I got a reply from c't / Heise. They say that during their experiments with Linux on Ryzen they could not find any evidence of this bug so far.

So basically, the press is not interested. Big surprise. Anyway, c't reported today on the new 4.17 kernel, including the following about low-power states for AMD Ryzen. Anyone care to see if this has an effect on this bug?

(Translated from the German:)
"Systems with AMD's current processors should also run somewhat more economically with 4.17. This is thanks to a small change to the code that puts the processor to sleep when idle. On AMD CPUs it previously used the MWAIT call, which only puts AMD's current processors into the C1 sleep state; the kernel now uses CPUIDLE or HALT, which let Ryzen, Epyc and the like enter deeper and therefore more efficient sleep states."

(In reply to Jonathan from comment #342)
> Anyone care to see if this has an effect on this bug?

Better save your time: on my Ryzen 2500U, none of kernel 4.15 (Ubuntu's), 4.16 (Fedora's) or 4.17-rc6/rc7 (mainline) had any effect on this.

The Ubuntu and mainline kernels needed "pci=noacpi" or "acpi=noirq" to boot at all.
Fedora's needed only "nomodeset" to boot with XFCE, and nothing at all with GNOME.

Doing a backup with fsarchiver
```fsarchiver savefs -o -j8 -A -a -v -Z3 $BackupfileDestination $Partition```
__always led to a freeze within 1-5 runs__. Fsarchiver causes a lot of up- and down-clocking (and boosting above the base clock) while producing only partial CPU load.
Full idle never caused a freeze, and neither did full load (handbrake, kill-ryzen.sh).

On kernel 4.13 the fsarchiver runs never caused a freeze (max CPU clock 2 GHz).
Also, setting scaling_governor to powersave on all the other tested kernels prevented the freezes (the CPU clock was locked to 1.6 GHz in that case).
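
When comparing kernels like this, it is worth confirming which governor each core is actually using. The following is a minimal sketch, assuming the standard cpufreq sysfs layout (`cpuN/cpufreq/scaling_governor`); the function name is made up and the demo uses a fake directory tree so that it runs anywhere.

```python
import tempfile
from pathlib import Path

def governors(cpu_root):
    """Map each CPU directory name to its current cpufreq scaling governor."""
    result = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for gov_file in sorted(Path(cpu_root).glob(pattern)):
        cpu_name = gov_file.parent.parent.name
        result[cpu_name] = gov_file.read_text().strip()
    return result

# Fake tree with two CPUs, standing in for /sys/devices/system/cpu.
with tempfile.TemporaryDirectory() as tmp:
    for cpu, governor in (("cpu0", "powersave"), ("cpu1", "ondemand")):
        cpufreq_dir = Path(tmp, cpu, "cpufreq")
        cpufreq_dir.mkdir(parents=True)
        (cpufreq_dir / "scaling_governor").write_text(governor + "\n")
    demo_governors = governors(tmp)

print(demo_governors)
```

On a real system the call would be `governors("/sys/devices/system/cpu")`. A mix of values across cores would mean the powersave setting was not applied everywhere.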

Things I tried (unsuccessfully) to get rid of the freezes:
*blacklisting amdgpu
*kernel @ CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-7
*rmmod ath9k (rmmod ath10k_pci ath10k_core)
*30.5.2018 Ubuntu's microcode update for amd64
*disabled ASLR
*new mesa (18.0rc5 > 18.1.1. > 18.2 dev)
*c6-disabled via service
via grub:
*acpi=strict
*libata.noacpi
*amdgpu.dpm=0
*amdgpu.dc=0
*amdgpu.audio=0
*amdgpu.bapm=0
*noibrs noibpb nopti
*libata.force=noncq
*pcie_aspm=off & the others
*amd_iommu=off & the other options (including iommu + options)
*pci=nocrs
*nomodeset
*all acpi= parameters
*processor.max_cstate=5
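
With this many boot parameters in play, it is easy to test a configuration that never took effect, so checking the running kernel's command line is a useful sanity step. This is a minimal sketch with a made-up helper name and a hypothetical sample command line; it does not handle quoted values containing spaces, and a parameter given more than once keeps only its last value.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {parameter: value, or None for flags}."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

# Hypothetical command line; the image path and root device are placeholders.
SAMPLE = (
    "BOOT_IMAGE=/boot/vmlinuz root=/dev/sda1 ro "
    "pci=noacpi processor.max_cstate=5 nomodeset"
)

active = parse_cmdline(SAMPLE)
for wanted in ("pci", "processor.max_cstate", "nomodeset", "amd_iommu"):
    if wanted not in active:
        print(f"{wanted}: not set")
    elif active[wanted] is None:
        print(f"{wanted}: set (flag)")
    else:
        print(f"{wanted}: {active[wanted]}")
```

On a running system the input would be the contents of `/proc/cmdline`.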

(In reply to Freihut from comment #343)
>
> Better save your time, on my Ryzen 2500U; Kernel 4.15 (Ubuntu's), nor 4.16
> (Fedora's) nor 4.17rc6+7 (Mainline) had an effect on this.

My earlier posts above relate to my desktop Ryzen 5 1600X but I've also been seeing daily freezes with a 2700U-powered laptop. I don't believe this is the same issue though as ZenStates.py did not help. I think it's more likely to be graphics-related, especially given what another user with the same hardware reported in bug #199653. Maybe your situation is different though.

(In reply to Freihut from comment #343)
> Doing a backup with fsarchiver
> ```fsarchiver savefs -o -j8 -A -a -v -Z3 $BackupfileDestination
> $Partition```
> __always lead to a freeze within 1 - 5 runs__). Fsarchiver causes a lot of
> up- and downclocking (and boosting above baseclock), while it produces only
> partial cpu-load).
> Full idle never caused a freeze neither did full load (handbrake,
> kill-ryzen.sh).

This is starting to sound very similar to my total system freeze with ZFS on FreeBSD on all three of my Ryzen 1700 chips on two different ASUS motherboards (B350M-A and X370-PRIME).

The underlying storage I had there was a pile of four HDDs, or a SSD, which I made a snapshot of and then:

  while true; do zfs send -R stuff@foo | pipemeter | cat >/dev/null; done

This would lock the system up in about 2-7 terabytes of HDD data if memory serves me right. It would not happen on USB HDDs, but happened regardless of if it was onboard or on a disk controller. I've also managed to reproduce it by receiving a stream over the network and writing it out to said disks, but this one is easier to run as one doesn't need another machine to feed it.

Doing the same under Linux at the time could not produce any similar hangs, but it may be due to a different implementation of ZFS and not using pipemeter.

I discussed it with AMD support over months (Aug-Dec 2017) and they gave me all sorts of power twiddling suggestions for firmware, but the only thing that ever really worked there was to disable hyperthreading (SMT), bringing the machine down to 8C/8T.

Have you tried disabling SMT in your firmware and seen if it changes anything?
Please note that if you do this and then upgrade your firmware, you may need to do a full wipe of settings to even get the option back. Support (rightly) recommended that I fully clear the firmware settings between upgrades; it seems the migration path isn't quite solid.

(In reply to Lars Viklund from comment #345)
> Have you tried disabling SMT in your firmware and seen if it changes
> anything?

I can't do that, because it's a notebook with only 2 BIOS options (something virtualization something). I tried disabling both, but that had no effect at all.
As far as I can remember, I tried disabling SMT via GRUB, but only to test whether it would boot (it didn't).

But I wouldn't say it's SMT-related, because the CPU runs perfectly fine while it's below its base clock or at full load.
The freezes also occurred with single-threaded activity: during playback of random YouTube videos (autoplay, no user interaction) in Pale Moon (a Firefox fork). That usually took 30-120 minutes to freeze and was also tested with amdgpu disabled, so I'm pretty sure it's CPU-related.

I have to send it back to the vendor now, because I ran out of time (for the right to return).

(In reply to Dennis Schridde from comment #295)

As an update from my earlier post, here is what I wrote to AMD support recently:

I can confirm the effectiveness of the "Power Supply Idle Control = Typical Current Idle" non-default firmware-option introduced with AGESA 1.0.0.2a only partially. After setting this option it often still takes several attempts until the machine boots -- during the failing attempts I either get 3 long beeps from the mainboard followed by an automatic reboot, or I see varying Linux kernel stack-traces during the early boot, seemingly related to firmware / EFI and CPU / idle. Sometimes the system can be restarted from this state using soft-reboot (ctrl+alt+del). But sometimes the situation requires a hard-reset, e.g. because the system completely froze (similar to the original problem reported here), or because the init process dies. I have the feeling that in such situations the system, even if it does not freeze completely, works with corrupt data and then writes this to the hard disk [3,4]. I destroyed my installation several times in the last month due to these problems, e.g. because the system had destroyed all (?!) superblocks of the file system, or lately the LVM cache. Surprisingly, once the system has booted properly, it will run stably for days without any further issues.

I could not reproduce the problem synthetically using either PassMark memtest86 [5], TU Dresden Firestarter [6] or Google Stressful Application Test [7], the latter including CPU, RAM, hard drive and file-system tests. My hardware supplier also tested all components (CPU, RAM, mainboard) again using Windows with Prime95 and Furmark, as well as with Memtest, and assured me that they could not detect an issue either. During regular use I can reliably reproduce it on Gentoo (with Linux 4.16), Fedora 27 (Linux 4.14), Fedora 28 Beta (Linux 4.15), Fedora 28 (Linux 4.16) and Arch 2018.03 (Linux 4.15). Before the introduction of the "Power Supply Idle Control = Typical Current Idle" firmware-option, I first noticed the issue while compiling large amounts of software on Gentoo, but was later able to reliably reproduce the freeze with simple `rsync` operations on all other operating systems, too. This even happened when no X-server is running and the GPU is not utilised in any other way (as far as I know), so I do not see a connection to possibly incomplete Vega-GPU-support in Linux. After setting "Power Supply Idle Control = Typical Current Idle" these freezes stopped happening and I only see the problem during early boot.

[3]: https://www.redhat.com/archives/linux-lvm/2018-May/msg00006.html
[4]: https://bugzilla.redhat.com/show_bug.cgi?id=1585670
[5]: https://www.memtest86.com/
[6]: https://tu-dresden.de/zih/forschung/projekte/firestarter
[7]: https://github.com/stressapptest/stressapptest

Hi Dennis, from my point of view, your case is not the same as the idle freezes reported in this bug. As an owner of a Ryzen 7 1700 that was replaced because of the other bug (segfaults during compiling), I know the bug discussed here, but my feeling is that it is quite rare now. Comparing it with my Ryzen 3 2200G would be unfair, as that is a completely different beast. That said, I can assure you I have trouble with that system as well. This is known and Phoronix has posted about it; they say the 2400G might be stable now, but not the 2200G. If you have issues with this APU, which seems likely from your post, then from my own experience I can tell you that you need the latest kernel, latest UEFI, latest firmware, latest Mesa... This should give you a working system; my 2200G still often fails to boot, but once it has booted it works OK. But as this is an APU, it is way more complex and you may see GPU issues, which is why I think your post doesn't belong to the original bug report. From my personal experience, I once got a corrupted btrfs file system on my Ryzen 7 1700, but in the end one of my 16GB RAM modules turned out to be defective (detected by Memtest86+) and it corrupted the file system during a crash. Again, I have the feeling we shouldn't mix too many different issues in this bug report, and I think my issues with the 2200G do not belong here.

I am happy to report 45 days of continuous uptime with "typical current idle" selected in bios, and otherwise vanilla everything including 4.15.0 kernel. My system configuration is as posted above.

I find myself willing to believe the AMD "indirectly" statement: this is related to power supply. My power supply is an EVGA 650GQ, with eco switch set on to reduce fan noise. I suspect that if I swapped it for some other power supply, perhaps one not quite so efficient, or flipped to eco switch off, that I would find myself among those lucky users able to run with default bios settings and 4 watts less power consumption at idle. Unless anybody can refute this theory, I would consider it advisable to post power supply details for anybody still hitting problems.

I suspect that the truth is, some power supplies just do not expect such low full system power draw, and freak out about it. Without "typical current", mine is 38 watts at the wall including 32GB of memory, but unfortunately, only lives a few days at that setting.

(In reply to Daniel Phillips from comment #349)
> I am happy to report 45 days of continuous uptime with "typical current
> idle" selected in bios, and otherwise vanilla everything including 4.15.0
> kernel. My system configuration is as posted above.

With my Ryzen 7 1800X and Asus X370 Pro: I can report that with the magic "typical" BIOS setting I have today 30 days uptime. This is more than twice the previous record.

FWIW: I am so fed up with this machine that I haven't used it since I updated the BIOS and applied the setting. It is running 4.16.5 (Fedora 27), with CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15. I don't know if the rcu_nocbs=0-15 is still required.
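For anyone who wants to confirm that a boot parameter such as rcu_nocbs actually reached the running kernel, here is a minimal Python sketch. It assumes the usual Linux /proc/cmdline interface; the helper names parse_cmdline and read_cmdline are made up for illustration and are not part of any existing tool.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' map to None. Quoted values containing spaces
    are not handled in this sketch.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read and parse the command line of the running kernel."""
    with open(path) as handle:
        return parse_cmdline(handle.read())
```

Calling read_cmdline().get("rcu_nocbs") would return "0-15" on the setup described above, or None if the parameter is not set.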

Also FWIW: zenstates.py -l tells me that C6 Package is Disabled, but C6 Core is Enabled. Before the BIOS update I used zenstates.py to set C6 the same way, but the machine froze after some 12 days. After the BIOS update I no longer use zenstates.py to set anything. So I guess the BIOS "typical" option disables C6 Package, but also does some other magic.

> I suspect that the truth is, some power supplies just do not expect such low
> full system power draw, and freak out about it. Without "typical current",
> mine is 38 watts at the wall including 32GB of memory, but unfortunately,
> only lives a few days at that setting.

Mr BeQuiet! are adamant that the Straight Power 11 I have is perfectly happy to supply 0A at all voltages.

Of course, there's a lot of stuff between the PSU and the CPU... so it could be a motherboard issue. Who can tell ?

Possibly, some day, I will go back to using my AMD machine, but I doubt I shall come to be fond of it :-( Certainly I am livid with AMD's abject failure to address the issue promptly, and their continuing inability to discuss or document the issue. Bugs happen. It's how they are dealt with that separates the sheep from the goats. <sigh>

Since the problem has not been (and it seems will not be) solved, may I ask people here to send me the full specs and BIOS settings of working configurations so that I can publish them on a web page (in the hope that they will be useful to other people)?
So basically I am looking for the original spec list (HW, BIOS version and settings, boot options) and what you did to make your system stable (let's say stability starts with at least 14 days of uptime).

Just email me if you are interested, and if I have enough lines to put in that DB, I will post the link of the web page here later on.

Since the power supply (and hard drives) still seem to be an important piece of the puzzle, send me that information too.
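A minimal Python sketch of how the 14-day criterion could be checked before submitting a configuration. It assumes the standard Linux /proc/uptime format (uptime in seconds as the first field); the function names and the STABLE_DAYS constant are hypothetical.

```python
STABLE_DAYS = 14  # threshold suggested in the request above


def uptime_days(proc_uptime_text):
    """Return the uptime in days, given the contents of /proc/uptime."""
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 86400.0


def is_stable(proc_uptime_text, threshold_days=STABLE_DAYS):
    """True if the uptime meets the (arbitrary) stability threshold."""
    return uptime_days(proc_uptime_text) >= threshold_days
```

On a live system the input would come from open("/proc/uptime").read(). Uptime alone says nothing about how much of that time was spent idle, so this is only a rough filter.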

Is there anyone not affected by this when Package C6 is enabled?
Otherwise I intend to send the patch, which has the same effect as "typical current".

(In reply to Kai-Heng Feng from comment #352)
> Is there anyone not affected by this when Package C6 is enabled?
> Otherwise I intend to send the patch, which has the same effect as "typical
> current".

That would be perfect! Can you send us that patch for testing?

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed

On 2018-06-09 07:33 AM, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #352 from Kai-Heng Feng (<email address hidden>) ---
> Is there anyone not affected by this when Package C6 is enabled?
> Otherwise I intend to send the patch, which has the same effect as "typical
> current".
>
There are multiple reports above that disabling package C6 is not the
same as setting typical current in Bios. ISTR there is even a report
that setting typical current did not disable C6 for that particular
bios. The theory that this power bug is related to motherboard is
gaining prominence, because different motherboards seem to have
different tweaks for "typical current". If only AMD would just tell us
what is going on.

Meanwhile, I am more than content with my Ryzen setup now that it is
apparently stable, but discontented with AMD's lack of disclosure. Not to
the point of swearing off AMD, but it got close.

How did I make my system ultra stable after months of despair?
Easy, I never let it idle anymore :p
I am mining crypto money on at least some of the cores, and I never
had trouble again :)

On 09/06/2018 at 16:28, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #351 from AMD Linux User (<email address hidden>) ---
> Since the problem is not (and it seems will not) be solved, may I ask people
> here to send me full specs and BIOS settings of working configuration so that
> I can publish them on a web page (with the hope to be useful for other people)?
> So basically I am looking for the original spec list (HW, BIOS version and
> settings, boot options) and what you did to make your system stable (let say
> stability starts with at least 14 days of uptime).
>
> Just email me if you are interested, and if I have enough lines to put in
> that DB, I will post the link of the web page here later on.
>
> Since the power supply (and harddrives) seem still to be an important piece
> of the puzzle, send me that information too.
>

> I find myself willing to believe the AMD "indirectly" statement: this is
> related to power supply. My power supply is an EVGA 650GQ, with eco switch
> set on to reduce fan noise. I suspect that if I swapped it for some other
Some want to believe, and that's fine, but AMD's story is and always was bullshit, and here's why: my EVGA SuperNOVA 750 G3 and Corsair RM650i are modern, C6/"Haswell"-compliant PSUs. There are plenty of other people with a variety of new and old, cheap and expensive PSUs who are affected by this TOTAL SCANDAL bug.

The new Power Supply option in the BIOS, set to Typical, does fix it on all combinations of Ryzen motherboards/CPUs/PSUs I've tried.

The root cause could be a combination of motherboard/CPU/PSU, but my guess is that it's got more to do with motherboard voltage regulators and less to do with PSUs. Buying a new PSU isn't a viable option anyway. What are you going to do? Buy every single PSU on the market and hope you stumble upon one that works with the "low current" power supply option in the BIOS after trying 20 that don't? If you sell a computer monitor that does not work properly on 19 out of 20 graphics cards, then it's not the graphics cards that are to blame; you're selling a defective product.

The Typical Idle setting is a fine solution if you know about it, so there's that. Works for me. I do feel sorry for the poor noobs who don't know and pull their hair out for hours before discovering the shocking truth, though.

I would just like to add some evidence that this affects Threadrippers. I have two identical systems with the 1950X on an Asus X399-A Prime motherboard. This motherboard does not have the Power Supply option in the BIOS. Both freeze from time to time.

But more interestingly, both systems freeze when running code developed by one of our students that uses Matlab and computes the SVD of a large matrix, followed by some almost idle time writing to an NFS mount, many times in a loop. The code takes around an hour to run. I made a small script that loops between running the code and calling sleep for one extra hour to put the machine in an idle state. It always makes the machines freeze in less than one day. In some cases the freeze happens even while the code is running, not only when idle. This seems odd. It is very reproducible.
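The load/idle loop described above could be sketched in Python roughly as follows. This is not the original script, only an illustration; the function name load_idle_loop and the injectable run/sleep parameters are assumptions made so the loop can be exercised without waiting an hour.

```python
import subprocess
import time


def load_idle_loop(command, idle_seconds=3600, cycles=None,
                   run=subprocess.run, sleep=time.sleep):
    """Alternate between running `command` and idling.

    cycles=None loops until interrupted (or until the machine freezes).
    Returns the number of completed cycles.
    """
    done = 0
    while cycles is None or done < cycles:
        run(command, check=False)  # load phase (the real workload took about an hour)
        sleep(idle_seconds)        # idle phase, giving the CPU time to sit idle
        done += 1
        print(f"cycle {done} finished at {time.ctime()}", flush=True)
    return done
```

A call such as load_idle_loop(["./run_workload.sh"]) would reproduce the pattern, where run_workload.sh is a placeholder for whatever starts the actual computation. Printing a timestamp after each cycle makes it easier to see from a remote log roughly when the freeze happened.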

The solution is to disable C6 fully using zenstates.py. Only disabling C6 Package is not enough. Another possibility, which I am inclined towards in order to recover some of the performance I lost by losing turbo boost and XFR, is a light overclock. I am testing one of the machines now @3.75GHz.

Has anyone seen the "power supply" option on an X399 motherboard?

@Paulo: Are you running the latest bios? There is a newer bios from ~April or so, but I do not have the motherboard to try it out.

To add some more information to this thread, I have the following machine:

 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to On)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0601 04/19/2018, Typical Current set in the BIOS)

This setup is currently working better with the Typical Current setting, though I have experienced a soft lockup with these settings after about 6 hours of uptime. I am going to run the same setup with the eco switch set to off on the PSU to see if anything changes.

Eco mode supposedly only controls the PSU fan. I'm guessing that will not change anything on the CPU soft lock side.

@Ryan, yes. According to dmidecode I am using the 0601 BIOS, which is the latest one available on the Asus website.

Interesting that you are having soft lockups even with Typical Current. It completely solved the problems in my old 1700X system and it seems to have worked well for others. Are you overclocking? If yes, try first with default settings. Good luck!

@Paulo:
I downloaded the manual for your motherboard and found this

http://dlcdnet.asus.com/pub/ASUS/mb/socketTR4/PRIME_X399-A/E13557_PRIME_X399-A_BIOS_EM_WEB_20171030.pdf

EPU Power Saving Mode
The ASUS EPU (Energy Processing Unit) sets the CPU in its minimum power consumption
settings. Enabling this item will apply lower CPU Core/Cache Voltage and help save energy
consumption. Set this item to disabled if you are over clocking the system. Configuration
options: [Disabled] [Enabled]

Have you tried Disabling this?

On 2018-06-14 09:10 AM, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #359 from Ryan Phillips (<email address hidden>) ---
> Eco mode supposedly only controls the PSU fan. I'm guessing that will not
> change anything on the CPU soft lock side.
>
How about this theory: With Eco mode on, the power supply prevents the
current from increasing until the fan speeds up, causing the voltage to
drop ever so slightly, and pandemonium ensues. Note that EVGA's site does
not explicitly say that Eco mode only controls the fan. I am not
pointing the finger at EVGA, I am just exploring a theory, please refute
if possible.

I am willing to believe that this system is stable now that I am at 51
days uptime, so I will try some experiments. The first one will be to run
with both Typical Power and Eco Mode off, and all stock bios and kernel
settings.

@Eduardo. Yes, I found that option before. It is disabled by default (so the default is not to try the power saving). I even tried enabling it, hoping that the power saving system would be smart and avoid the lockup, but it didn't work.

What I am trying right now is the option "Overclocking Enhancement", which is described as follows: "It enables SenseMi Skew (artificially reading lower temperatures in order to trick XFR to boost longer and higher), Performance Bias and tweaks VRM settings to improve overclockability". I am especially interested in the tweaked VRM settings. (https://rog.asus.com/forum/showthread.php?97680-Overclocking-Enhancement)

Interestingly enough, if I only set this to Enable (no overclock, just default settings + this option enabled) it seems to make things better. As I said, I have code that consistently triggers a lockup if I run it in a loop followed by one hour of inactivity. I have been running this test on one of my systems with "Overclocking Enhancement" set to Enable and it has not triggered the bug in the last 24 hours. I have just changed the other machine to test this as well. I'll keep the test running today and the whole weekend on both machines and report back.

If anyone else is having the problem with an Asus X399 motherboard, it may be worth a try. Please report back, too.

Oh, and I forgot to mention. Both machines are using the "Overclocking Enhancement" options and I have explicitly set c6 state to enabled using zenstates.py.

Hi, a follow-up. My two Threadripper machines have been working flawlessly since Friday. I tried both running a test that alternates between high load and idle, which would normally trigger the bug, and leaving the machine idle for many hours.

It seems like a possible workaround for someone with an ASUS X399-A Prime motherboard is to set the BIOS option "Overclocking Enhancement". Beware that this option has a weird side-effect of skewing the temperature to report a lower value. Therefore you should have a good cooling solution in place.

So AMD has finally published the revision guide for 17h family, which includes errata:
https://developer.amd.com/wp-content/resources/55449_1.12.pdf

Would be cool if someone with more knowledge about that kind of stuff could have a quick look through and see if this issue could be related to anything in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

Wow, a lot of errata to be worked around...

(In reply to Artem Hluvchynskyi from comment #366)
> So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

I remember seeing the kernel fix/workaround for that issue when I was trying to diagnose our issue. I don't think they are related.

(In reply to James Le Cuirot from comment #368)
> I remember seeing the kernel fix/workaround for that issue when I was trying
> to diagnose our issue. I don't think they are related.

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix

But I don't know if it's the same MWAIT problem as mentioned in the errata PDF above.

I was thinking of this but with fresh eyes, maybe that is yet another issue. I'll shut up now and leave this to the experts. :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88d879d29f9cc0de2d930b584285638cdada6625

@James Le Cuirot:
Linux 4.14 already shipped with the mentioned patch from the beginning, and it doesn't prevent freezing on idle at all, same as the MWAIT patch mentioned at Phoronix (see link above).

(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

"1033 A Lock Operation May Cause the System to Hang" seems also related since some logs were refering about locking issues.

Hello James, has this issue already been fixed? If not, do we have other sources that might help to get it fixed? Thank you!

Carlo B.
https://ultimatewebtraffic.com/

Hey, guys

We have been seeing this smp function call soft lockup since 3.x kernel. Unfortunately we don't know how to reproduce it either.

I suspect there is something on another CPU (not shown without sysrq-l) blocking the execution of the smp call function; we have to figure out what it is. So next time you see a live case of this bug, please collect the stack traces on all CPUs at that time, using sysrq-l.
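If the keyboard combination is not available (for example over SSH), the same SysRq command can be issued by writing the letter to /proc/sysrq-trigger as root. Below is a minimal Python sketch under that assumption; the helper name trigger_sysrq is hypothetical, and it only helps while the machine is still responsive enough to run it.

```python
def trigger_sysrq(key="l", path="/proc/sysrq-trigger"):
    """Write one SysRq command character to the trigger file (needs root).

    'l' asks the kernel to log a stack backtrace for all active CPUs.
    """
    if len(key) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(path, "w") as handle:
        handle.write(key)
```

The backtraces end up in the kernel log, so they can be collected afterwards with dmesg or from kern.log and attached to the report.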

Thanks!


Thread still ongoing with no response from AMD.

Really shouldn't need to fiddle with custom bios options.

What is big business / big iron or whatever it's called doing with Epyc and linux servers? Are those totally fine?

Emilio (emilio-moretti) wrote :

I have a Ryzen 1800X + Gigabyte GA-AX370-Gaming K7.

I was suffering from these hangs until I changed a few settings in my BIOS. The change that made the difference was the Power Supply Idle Control option that was added in the latest BIOS update. All other settings are left at their defaults:
M.I.T->
  Advanced Frequency Settings ->
    Core Settings ->
      *Power Supply Idle Control = Typical
      *SVM Mode = On
Peripherals ->
  *Above 4G Decoding = On
Chipset ->
  IOMMU = Auto (Disabled)

BIOS version: F23f AGESA 1.0.0.2a + SMU FW 43.18

I'm running Ubuntu x64 on stock kernel 4.15.0-23-generic with the following options
GRUB_CMDLINE_LINUX_DEFAULT="radeon.cik_support=1 amdgpu.cik_support=0 iommu=soft"

As an extra measure I completely blacklisted AMDGPU from my system because it was also a source of many of the kernel freezes (AMD 390x GPU), and I haven't seen a system freeze in almost two months. (they were _very_ frequent before, almost daily)

Good luck.
