Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
This bug affects 34 people
Affects          Status      Importance  Assigned to  Milestone
Linux            Confirmed   Medium
linux (Ubuntu)               High        Unassigned

Bug Description

Hi,

We are getting various kernel crashes on a fairly new configuration.
We're using a Ryzen 1800X CPU with an X370 Gaming Pro Carbon motherboard (7A32V1) running the latest available BIOS (1.52).

We are running Ubuntu 17.04 (amd64). We've tried different kernel versions, the native one as well as releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/.
Tested kernel versions:

native 17.04 kernel
4.10.15

The issues are the same: we're getting random freezes on the machine.

Here is the kern.log entry when it happens:

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14

Depending on the kernel version, we've also seen NMI watchdog errors about a stuck CPU (the reported core ID varies).
Crashes happen randomly, but generally after a few hours (3-4 h).

We installed kernel 4.11.0-041100-generic #201705041534 this morning and are now waiting for a crash...
For now, the machine is not really "used"; at least, it is not under CPU stress...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, I'm still experiencing crashes, even with 4.11.

Please find full kern.log in attachment.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear after 24-48 h of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Updating the BIOS to the latest (v609)
- Kernel 4.10
- Disabling SMT in the BIOS (from 12 threads to 6 threads)
- Booting with clocksource=tsc iommu=soft
- Disabling IOMMU in the BIOS

None of these worked. I was waiting for 4.11 to be available as an official Ubuntu package, but it seems that will not work either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just got a kernel panic...

How do I add the tag 'kernel-bug-exists-upstream'?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I should send the mail to?
I don't really understand how to submit this bug report to the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I'll wait 24-48 h and post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also hit this problem, and your advice to remove "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting up, but now it has been working without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days update now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you install the NVIDIA driver instead?

No. I don't need the card for my purposes, so I simply disabled it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because, even though this is a known issue with the CPU, we could still implement a workaround. One workaround is to disable address space layout randomization (ASLR):

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
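
For reference, the echo above only lasts until the next reboot; a minimal sketch of making the setting persist via sysctl, assuming a distro that reads /etc/sysctl.d (the drop-in filename here is arbitrary):

echo "kernel.randomize_va_space = 0" | sudo tee /etc/sysctl.d/99-disable-aslr.conf
sudo sysctl --system    # reload all sysctl configuration so the setting applies immediately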

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest increasing SOC voltages, using specific kernel versions, and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.
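
To make that concrete, a rough sketch of the two pieces described above, assuming a distro that still executes /etc/rc.local at boot:

# kernel .config fragment for the rebuild (the two options named above)
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

# /etc/rc.local (must be executable), to disable ASLR at every boot
#!/bin/sh -e
echo 0 > /proc/sys/kernel/randomize_va_space
exit 0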

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (e.g. 17), setting them to the next even number (e.g. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL and disabling ASLR is still necessary, though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide a link to how DragonFlyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an Asrock X370 Gaming professional

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen; once every few minutes it seemed to respond for a short time, enabling me to switch to a TTY. Then I always saw some processes hanging at 100% that I could not kill, or the kill was delayed for a long time. I always saw the errors from the initial post during that time. The TTY seemed to work fine once entered, though.

David Wilson (plottt) wrote :

I've run into this problem several times. After disabling C-states in the motherboard BIOS, my system has been running stably with ~5 weeks of uptime so far.

Torge (cyslider) wrote :

Ok, this trick did not help, and my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed the oibaf PPA a while back because my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf and now everything seems fine. No lock-screen freezes for me anymore.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with the 4.10 kernel, installed on 22 August 2017. All updates applied, running as a KVM host with this hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800X
ASRock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced for Global C-States that I just disabled today, and I am waiting to see what happens.

On BIOS v3.00 the PC would freeze after 6 hours of idling or not touching anything; I'd come back to find the keyboard, mouse, and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Can anyone point me to where the SoC voltage would be in the BIOS? Is it the VDD SoC?
And does anyone know how to update my kernel to 4.10 when it tells me I already have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB enabled and ASLR disabled, as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen
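
For completeness, a rough sketch of that kind of rebuild on Ubuntu (package names and steps are from memory and may differ from the tutorial above):

sudo apt-get install build-essential libncurses5-dev libssl-dev bison flex   # typical build dependencies; list may vary
cd linux-4.x.y/                          # unpacked kernel source tree (placeholder path)
cp /boot/config-$(uname -r) .config      # start from the running kernel's configuration
make menuconfig                          # enable CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL
make -j$(nproc) deb-pkg                  # build installable .deb packages
sudo dpkg -i ../linux-image-*.deb        # install the new kernel, then disable ASLR as described earlier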

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on the latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple of days to a week or so. It seems to happen after being relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states and Cool'n'Quiet, tested the memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not get segfaults when compiling.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzens manufactured after week 30. All my Ryzens are pre-week-24 CPUs and they all have this silent crash issue. Does it also happen with post-week-30 Ryzens? (@tgui, what is yours?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on an ASUS Prime B350 Plus.
I had segfaults with my previous 1700 (week 22, IIRC). No more segfaults with the week 33 CPU, but still random crashes when idle for a long time.

tags: removed: kernel-da-key
information type: Public → Public Security
information type: Public Security → Public
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
[404 comments hidden]

@Ryan, yes. According to dmidecode I am using the 0601 BIOS, which is the latest available on the ASUS website.

Interesting that you are having soft lockups even with Typical Current. It completely solved the problems on my old 1700X system and it seems to have worked well for others. Are you overclocking? If so, try first with default settings. Good luck!

@Paulo:
I downloaded the manual for your motherboard and found this

http://dlcdnet.asus.com/pub/ASUS/mb/socketTR4/PRIME_X399-A/E13557_PRIME_X399-A_BIOS_EM_WEB_20171030.pdf

EPU Power Saving Mode
The ASUS EPU (Energy Processing Unit) sets the CPU in its minimum power consumption
settings. Enabling this item will apply lower CPU Core/Cache Voltage and help save energy
consumption. Set this item to disabled if you are over clocking the system. Configuration
options: [Disabled] [Enabled]

Have you tried disabling this?

On 2018-06-14 09:10 AM, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #359 from Ryan Phillips (<email address hidden>) ---
> Eco mode supposedly only controls the PSU fan. I'm guessing that will not
> change anything on the CPU soft lock side.
>
How about this theory: with Eco mode on, the power supply prevents the current from increasing until the fan speeds up, causing the voltage to drop ever so slightly, and pandemonium ensues. Note that EVGA's site does not explicitly say that Eco mode only controls the fan. I am not pointing the finger at EVGA, I am just exploring a theory; please refute it if possible.

I am willing to believe that this system is stable now that I am at 51 days of uptime, so I will try some experiments. The first one will be to run with both Typical Power and Eco Mode off, and all stock BIOS and kernel settings.

@Eduardo: Yes, I've found that option before. It is disabled by default (so the default is not to attempt the power saving). I even tried enabling it, hoping the power-saving system would be smart enough to avoid the lockup, but it didn't work.

What I am trying right now is the "Overclocking Enhancement" option, which is supposed to do the following: "It enables SenseMi Skew (artificially reading lower temperatures in order to trick XFR to boost longer and higher), Performance Bias and tweaks VRM settings to improve overclockability". I am especially interested in the tweaked VRM settings. (https://rog.asus.com/forum/showthread.php?97680-Overclocking-Enhancement)

Interestingly enough, just setting this to Enabled (no overclock, only the default settings plus this option) seems to make things better. As I said, I have code that consistently triggers a lockup if I run it in a loop followed by one hour of inactivity. I have been running this test on one of my systems with "Overclocking Enhancement" set to Enabled and it has not triggered the bug for the last 24 hours. I have just changed the other machine to test this as well. I'll keep the test running today and over the weekend on both machines and report back.

If anyone else is having the problem with an ASUS X399 motherboard, it may be worth a try; please report back too.

Oh, and I forgot to mention: both machines are using the "Overclocking Enhancement" option, and I have explicitly set the C6 state to enabled using zenstates.py.

Hi, a follow-up. My two Threadripper machines have been working flawlessly since Friday. I tried both running a test that alternates between high load and idle (which would normally trigger the bug) and leaving the machines idle for many hours.

It seems that a possible workaround for someone with an ASUS X399-A Prime motherboard is to set the BIOS option "Overclocking Enhancement". Beware that this option has the weird side effect of skewing the reported temperature to a lower value, so you should have a good cooling solution in place.

So AMD has finally published the revision guide for 17h family, which includes errata:
https://developer.amd.com/wp-content/resources/55449_1.12.pdf

Would be cool if someone with more knowledge about that kind of stuff could have a quick look through and see if this issue could be related to anything in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

Wow, a lot of errata to work around...

(In reply to Artem Hluvchynskyi from comment #366)
> So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat related.

I remember seeing the kernel fix/workaround for that issue when I was trying to diagnose our issue. I don't think they are related.

(In reply to James Le Cuirot from comment #368)
> I remember seeing the kernel fix/workaround for that issue when I was trying
> to diagnose our issue. I don't think they are related.

https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix

But I don't know if it's the same MWAIT problem as the one mentioned in the errata PDF above.

I was thinking of this but with fresh eyes, maybe that is yet another issue. I'll shut up now and leave this to the experts. :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=88d879d29f9cc0de2d930b584285638cdada6625

@James Le Cuirot:
Linux 4.14 already shipped with the mentioned patch from the beginning and it doesn't prevent freezing on idle at all; the same goes for the MWAIT patch mentioned on Phoronix (see the link above).

(In reply to Artem Hluvchynskyi from comment #366)
> So AMD has finally published the revision guide for 17h family, which
> includes errata:
> https://developer.amd.com/wp-content/resources/55449_1.12.pdf
>
> Would be cool if someone with more knowledge about that kind of stuff could
> have a quick look through and see if this issue could be related to anything
> in there. So far "1109 MWAIT Instruction May Hang a Thread" sounds somewhat
> related.

"1033 A Lock Operation May Cause the System to Hang" also seems related, since some logs were referring to locking issues.

Hello James, has this issue been fixed yet? If not, do we have other sources that might help get this fixed? Thank you!

Carlo B.
https://ultimatewebtraffic.com/

Hey, guys

We have been seeing this SMP function call soft lockup since the 3.x kernels. Unfortunately, we don't know how to reproduce it either.

I suspect there is something on another CPU (not shown without sysrq-l) blocking the SMP call function execution; we have to figure out what it is. So the next time you see a live case of this bug, please collect the stack traces on all the CPUs at that time, using sysrq-l.

Thanks!
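
For anyone unfamiliar with sysrq-l, a minimal sketch of capturing those per-CPU backtraces as root while the machine is (partially) hung:

echo 1 > /proc/sys/kernel/sysrq     # make sure the magic SysRq interface is enabled
echo l > /proc/sysrq-trigger        # dump a backtrace of all active CPUs to the kernel log
dmesg | tail -n 200                 # the traces land in dmesg / kern.log / journalctl -k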

[1 comment hidden]

The thread is still ongoing with no response from AMD.

We really shouldn't need to fiddle with custom BIOS options.

What is big business / big iron or whatever it's called doing with Epyc and Linux servers? Are those totally fine?

Emilio (emilio-moretti) wrote :

I have a Ryzen 1800X + Gigabyte GA-AX370-Gaming K7.

I was suffering from these hangs until I changed a few settings in my BIOS. The change that made the difference was the Power Supply Idle Control option that was added in the latest BIOS update. All other settings are left at their defaults:
M.I.T->
  Advanced Frequency Settings ->
    Core Settings ->
      *Power Supply Idle Control = Typical
      *SVM Mode = On
Peripherals ->
  *Above 4G Decoding = On
Chipset ->
  IOMMU = Auto (Disabled)

BIOS version: F23f AGESA 1.0.0.2a + SMU FW 43.18

I'm running Ubuntu x64 on stock kernel 4.15.0-23-generic with the following options
GRUB_CMDLINE_LINUX_DEFAULT="radeon.cik_support=1 amdgpu.cik_support=0 iommu=soft"

As an extra measure I completely blacklisted AMDGPU from my system because it was also a source of many of the kernel freezes (AMD 390x GPU), and I haven't seen a system freeze in almost two months. (they were _very_ frequent before, almost daily)

Good luck.

[1 comment hidden]

(In reply to OptionalRealName from comment #375)
> Thread still ongoing with no response from AMD.
>
> Really shouldn't need to fiddle with custom bios options.
>
> What is big business / big iron or whatever it's called doing with Epyc and
> linux servers? Are those totally fine?

I am running an Epyc-based server, and as posted in this thread here:
https://forums.fedoraforum.org/showthread.php?317537-first-server-error-reboot-what-is-this-UUID
(worth a full read)
everything in there was solved by setting the correct Virtualization flags in BIOS, as specified by a Supermicro Tech as follows:
========
It looks like you want to enable the virtualization feature.
Please go into BIOS,
Advanced -> NB Configuration -> IOMMU (change to Enabled).
Advanced -> PCIe/PCI/PnP Configuration -> SR-IOV Support (change to Enabled).
========

After this, the server worked without a hitch, and I have now upgraded it from CentOS 7 to Fedora 28, and it runs like a charm.

Of course the server is not IDLE at all, and maybe that saves the server from the low power issue discussed in this thread. However, I have had it at least 7 days just sitting there not doing anything, and it never had any problem.

I'm praying intensely that the new AMD Epyc 3000 Embedded series (low power e3000) don't have the same issues. Hoping to buy one in the next few months when they finally put the damn things on some ITX boards.

I was able to capture a soft lockup with sysrq-l [1]. Networking and the mouse were still functional, but focusing windows and window activity in Plasma were locked.

[1] https://gist.github.com/rphillips/fae9540a8e7a3a83731c9a9809a98df4

Notes:
 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to Off)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0601 04/19/2018, Typical Current set in the BIOS)

Is there any workaround/solution for this bug? With this problem I can't reliably have the PC running as a server :(

The bug is almost 1 year old

There are two solutions or workarounds that have been reliable for me, one of which I now trust more than the other.

First, if your BIOS has the option for 'Power Supply Idle Control' hiding somewhere in its settings (generally in a sub-menu off an 'advanced' menu), setting it to 'Typical current idle' seems to work. I trust this workaround more than the next one, but it requires a BIOS that has been updated to include AGESA 1.0.0.2a (AGESA is apparently a magic blob that AMD supplies to vendors).

Before I had the BIOS option available, I also had a stable system by using the kernel command line parameters 'rcu_nocbs=1-15 processor.max_cstate=5' (some people use 1 as the maximum cstate). This requires a kernel that supports rcu_nocbs, which not all kernels are built to do, and is more magical than the BIOS setting; it's clear that these settings are stabilizing the system through some side effects, not their direct operation.

(I experimentally determined that on my hardware and setup, merely using 'processor.max_cstate=5' wasn't enough; my machine still locked up.)
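
As an illustration of that second workaround, this is roughly how those parameters get added on a GRUB-based system (keep whatever defaults you already have; the CPU range depends on your thread count, and the regeneration command varies by distro):

# /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash rcu_nocbs=1-15 processor.max_cstate=5"

# then regenerate the GRUB config and reboot:
sudo update-grub                                  # Debian/Ubuntu
sudo grub2-mkconfig -o /boot/grub2/grub.cfg       # Fedora/CentOS (path differs on EFI systems)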

My machine runs Fedora 27, using Fedora 4.16.x and 4.17.x kernels on a Ryzen 1800X on an ASUS Prime X370-PRO motherboard with ECC RAM, currently using BIOS 4011.


Yep, the Typical Current parameter in your UEFI.
If it's not there, update it.


(In reply to Alexandre Badalo from comment #379)
> Is there any workaround/solution for this bug? With this problem i can't
> reliably have the PC running as a server :(
>
> The bug is almost 1 year old

People need to contact AMD, it's ridiculous this is still ongoing.

I have finally gotten to stable (no soft lockups). ASUS released the 0702 BIOS, which includes AGESA update 1.0.0.2c. Typical Current is set to enabled within the BIOS. No other special kernel command-line options are enabled.

Machine:

 * Kernel 4.7.13
 * AMD Ryzen 2700X
 * SuperNOVA 650 G3 650W (Eco switch set to Off)
 * Asus ROG Crosshair VII Hero Wifi (BIOS 0702 06/22/2018, Typical Current set in the BIOS)

Adding to the confusion here:

I'm running a Ryzen 1700 on an ASUS PRIME B350-PLUS motherboard. On BIOS version 4011 I have the "Power Supply Idle Control" option, and I changed it from "Auto" to "Low Current Idle" a week or so ago; since then I have had zero lockups.

According to this bug report, disabling C6 states *or* setting that option to "Typical Current Idle" should work, and "Auto" should mean "Low Current Idle", so it should not have made any difference. zenstates.py shows C6 enabled for both package and core.
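
For those who want to check their own C6 state, a quick sketch using the zenstates.py script mentioned above (flag names are from memory; verify with ./zenstates.py --help):

sudo ./zenstates.py --list          # show P-states plus C6 package/core status
sudo ./zenstates.py --c6-disable    # disable C6, the software-side alternative to the BIOS option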

Unless the Ubuntu 18.04 kernel 4.15.0-*-generic got an out-of-tree patch applied somewhere in between my reboots, I'm confused and relieved at the same time.

Did the people with the new BIOSes double-check the 'Power Supply Idle Control' option by disabling it temporarily? Because it could also have been fixed by the AGESA update some of you mentioned (also mentioned in [1]).

tl;dr:
I had soft lockups at full load; a BIOS update seems to help, even though it doesn't offer the 'Power Supply Idle Control' option.

More details:
My HP 17-ca0202ng (Ryzen 2500U) was stable for 2-3 weeks when started with the "idle=nomwait" kernel parameter, until *something*[2] happened and from one day to the next it was hard-locking again. On another test run it threw the soft lockup error (NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!) while transcoding x264 material to x265 (close to full load; 95% on all 8 threads). It was booted with "nomodeset" at the time the errors occurred, so I'm quite sure it can't be amdgpu-related.
When the soft lockup occurred, the system was somewhat responsive, freezing for a number of seconds and then unfreezing again for a while. I was able to run dmesg and saw the "NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s!" message for all threads, before the freezes came at shorter and shorter intervals and the system froze completely. Strange!

But: HP provided a BIOS update (F4 > F10) a few days ago. I was able to apply it yesterday, and I've since done 8 hours of full-load testing, 8 hours of idle testing and 7.5 hours of partial-load testing. No lockups so far and no kernel parameters needed (running Ubuntu's mainline-build kernel 4.17.2, the packages from the Padoka stable PPA, and Raven Ridge firmware from [3] to prevent amdgpu freezes). Since every previous lockup occurred within 3 hours, the laptop seems to be stable (or at least "more stable" than before).
I can't say anything about the changes HP made in that update, because they don't provide a changelog. But it's a laptop, so it has only 3 or 4 BIOS options (something about virtualisation, something about secure boot, and some boot-order options) and certainly no 'Power Supply Idle Control'. Allow me to point out that laptops usually have power adapters specially designed for the device, so in my opinion it can't be power-supply related, as AMD has claimed.

[1] https://community.amd.com/thread/225795 , https://community.amd.com/thread/224000
[2] May be microcode-update related provided by Ubuntu, but I can't remember for sure.
[3] https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu

Interestingly,

I had a nicely running system, uptime about 30 days, and I just decided to upgrade the Kernel to the latest (Fedora 28), and "what the heck" I will just upgrade the motherboard BIOS too.

Mobo: ASUSTeK model: PRIME B350M-A
6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1

So, now I have BIOS version 4104 (previous was 3801 I think)

And after like 10 minutes of doing stuff (i.e. not idle), LOCKED UP !!!

So I went into the BIOS and changed the Global C-State to Disabled, and right under that is the power profile, changed it to Typical.

We shall see how long it lasts !

I guess "don't fix what ain't broken" applied to my activity today lol.

(In reply to Robert Hoffmann from comment #386)
> Interestingly,
>
> I had a nicely running system, uptime about 30 days, and I just decided to
> upgrade the Kernel to the latest (Fedora 28), and "what the heck" I will
> just upgrade the motherboard BIOS too.
>
> Mobo: ASUSTeK model: PRIME B350M-A
> 6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1
>
> So, now I have BIOS version 4104 (previous was 3801 I think)
>
> And after like 10 minutes of doing stuff (i.e. not idle), LOCKED UP !!!
>
> So I went into the BIOS and changed the Global C-State to Disabled, and
> right under that is the power profile, changed it to Typical.
>
> We shall see how long it lasts !
>
> I guess "don't fix what ain't broken" applied to my activity today lol.

My case is similar, but I don't recall updating the BIOS/UEFI firmware; it was after a kernel upgrade that the system started locking up :/

I have the ASUS B350M board and I updated to the latest BIOS (4014), but my lockups started with the latest kernel, 4.17, in Arch, where the lockup bug reappeared. I have set the idle current setting to Typical and so far no lockups in two days; let's see what happens in a few days. Keeping my fingers crossed.

I've been watching this thread since February. My first stable workaround, with legacy BIOS mode, was the kernel parameter rcu_nocbs=0-15, because the BIOS didn't have the "Typical Current Idle" option.

System info:
R7 1700, GA AX370-Gaming K7 with BIOS F23f in legacy mode, PSU Corsair HX750, OS Gentoo Linux with kernel 4.17.4, desktop kde-plasma-5.13.2

No overclocking or tweaking; only using the "Typical Current Idle" option (possible now with F23f) instead of "Auto".

So I no longer need rcu_nocbs=0-15, and everything works fine - no lockups.

These lockups are probably not related to this bug. I've updated my Intel Sandy Bridge laptop to 4.17.5 from the Fedora 28 repository and now I have random CPU lockups, too.
4.17.3 worked fine.

(In reply to ValdikSS from comment #391)
> These lockups are probably not related to this bug. I've updated by Intel
> Sandy Bridge laptop to 4.17.5 from Fedora 28 repository and now I have
> random CPU lockups, too.
> 4.17.3 worked fine.

Please take a look at this report; it matches your description: https://bugzilla.redhat.com/show_bug.cgi?id=1598989 The user who reported it also has an Intel CPU (an Acer Aspire V3-771, judging from his attached screenshot). I have experienced this issue once on one of my Ryzen systems running 4.17.6; could it be an unrelated problem? Another one: https://bugzilla.redhat.com/show_bug.cgi?id=1598462

With regards to the original Ryzen Random Soft Lockup issue: I am witnessing it on every kernel that I have tested so far, ranging from kernel 4.10 through to 4.17.6. The frequency of the Ryzen soft lockup has increased since kernel >= 4.15 in my experience. It could just be a random effect as I am unaware of the cause of this problem.

I am running my machines at stock clocks and have the latest stable BIOS updates installed as of today on a Ryzen 1700 w/ MSI PRIME X370-PRO, a Ryzen 1800X w/ ASRock Fatal1ty X370 Professional Gaming, and finally my personal desktop, a Ryzen 1800X w/ X470 Taichi Ultimate and a Seasonic 1000W Platinum PSU (Haswell ready). All of my CPUs are running microcode patch level 0x8001137.

I can't find any correlation in workload; the issue occurs on web servers, VFIO gaming, even live USB sessions. I used the following to (try to) provide insight:
journalctl -t kernel --no-pager | grep "soft lockup" | awk -F"!" '{print $2}' | sort -u
 [Compositor:4167]
 [kworker/10:1:18249]
 [kworker/1:3:418]
 [libvirtd:1226]
 [systemd:1]
 [Web Content:3505]

I have attempted CPU pinning and disabling ASLR completely. I have also tried isolating workloads in virtual machines that do not cross CCX units, with hugepages (THP off). I am not an expert on CCX, NUMA, Infinity Fabric, etc., so I could have made a mistake in my tests. That said, I am currently testing the "idle=nomwait" parameter in the hope of getting better results; who knows, perhaps even a stable system.

There still does not appear to be a single consistent solution to this problem, and it exists in both the Ryzen 1xxx and 2xxx series.

Apparently not in the Epyc series, is that correct?

This thread is utterly embarrassing for AMD, truly embarrassing.

(In reply to OptionalRealName from comment #393)
> There still does not appear to be a single consistent solution to this
> problem and it exists in Ryzen 1xxx and 2xxx series.

Actually, there are several workarounds known to work listed in this thread.

I've been running a stable x1800 using the rcu_nocbs=0-15 fixes noted above for almost a year now.

> This thread is utterly embarrassing for AMD, truly embarrassing.

It's a nuanced, hard-to-troubleshoot bug in a processor plus support chipset, with a known workaround, which only seems to happen on Linux... not really sure how that is "embarrassing". It sucks, it's new silicon. Welcome to being an early adopter.

Look at Intel's Heartbleed and all the other major security catastrophes they've had in recent history. Stuff happens. I do wish AMD would address it, but they're likely reluctant to bring it up until they have a reliable fix (if it even is a fix on their end).

> It's a nuanced hard to troubleshoot bug in a processor + support chipsets
> with a known workaround which only seems to happen on Linux... not really
> sure how that is "embarrassing". It sucks, it's new silicon. Welcome to
> being an early adopter.

Thread Reported: 2017-08-16 19:05 UTC

https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700
16-month-old silicon.

Nah, this has gone on too long.
Firstly, nothing should need to be done in Linux in any capacity; there should be a BIOS flag that fixes it.
Second, that BIOS flag should be on at all times.
Third, based on my second point, as long as someone has the latest firmware, it should 100% solve the problem.

Users should not need to mess around for stability; that is not acceptable.

I have a Ryzen 7 1800X on an ASUS Prime X370-Pro. I upgraded the BIOS to v4011 (AGESA 1.0.0.2a + SMU 43.18) and:

  1) turned on the "Typical Current Idle" option.

  2) stopped using zenstates.py -- which I had been using to enable "C6 Core"
     but disable "C6 Package" (to no avail).

  3) did *not* change Linux -- which was 4.16.5 -- Fedora 27.

  4) continued to use CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15

After 67 days uptime (leaving the system completely idle and changing nothing), I became convinced that the "Typical Current Idle" option has dealt with the "freezing when idle" problem.

When I say "freezing when idle", what I mean is: if the machine is left idle (typically overnight) it simply stops responding. Nothing at all is logged -- no application, driver or kernel errors or warnings -- the machine is still powered up, but frozen solid. The only way to restart the machine is to power it down and up again.

Reviewing this thread, it seems to be mostly concerned with the "freezing while idle" issue.

The symptoms of the original "Random Soft Lockup" include log messages of the form:

     NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s!

is that related to "freezing while idle", or is it a separate issue?

I get the impression that CONFIG_RCU_NOCB_CPU and rcu_nocbs=0-15 may be related to the "Random Soft Lockup"... but not to "freezing while idle" ???

It seems that other crashes/lockups are trying to attach themselves to this thread.

I note that this bug is assigned to <email address hidden>. This bug is very nearly 1 year old. Is this a good moment for the assignee to address this thread and say:

  * what, if any, Kernel issues have been identified

  * what, if any, Kernel fixes have been applied

related to this thread.

If the root cause of (some or all of) the issues in this thread is fixed or worked around by the "Typical Current Idle" BIOS option, does the assignee think that this "bug" can now be closed, or are there actual Kernel issues that remain, waiting to be fixed ?

Is it significant that W*nd*ws does not seem to suffer?

This bug should be closed as "FIXED" in my humble opinion. "NEW" is grossly inaccurate.

My workstation uptime is now 109 days, continuous operation without suspend, mainly idle, mixed with episodes of heavy load on all cores and everything in between. The technical term for this is "spectacularly stable". Credit belongs to all involved, including AMD, kernel devs, motherboard maker, Debian maintainers, and many other groups who made this possible. Again in my humble opinion, the only issue remaining is AMD's failure to explain what went wrong and exactly how they worked around it. That is a separate bug, this one is done.

I suspect that Windows users did get this bug, but they didn't notice it because, well, use your imagination. Then it was quietly fixed in a Windows update supplied by AMD that went out around the same time as the Typical Power BIOS updates.

Antoine Pitrou (pitrou) wrote :

This bug still occurs for me with 4.15.0-32-generic (Ubuntu 18.04.1) if I don't disable C6 states.

[Displaying the first 40 and last 40 of 484 comments.]