Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
208
This bug affects 35 people
Affects Status Importance Assigned to Milestone
Linux
Confirmed
Medium
linux (Ubuntu)
High
Unassigned

Bug Description

Hi,

We aregetting various kernel crash on a pretty new config.
We're using Ryzen 1800X CPU with X370 Gaming Pro Carbon MB (7A32V1) using latest BIOS available (1.52)

We are running Ubuntu 17.04 (amd64), we've tried different kernel version, native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel version:

native 17.04 kernel
4.10.15

Issues are the same, we're getting random freeze on the machine.

Here is kern.log entry when happening :

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14

Depending on the kernel version, we've got NMI watchdog errors related to CPU stuck (mentioning the CPU core id, which is random).
Crash is happening randomly, but in general after some hours (3-4h).

Now, we've installed kernel 4.11.0-041100-generic #201705041534 this morning and waiting for crash...
For now, the machine is not "used", at least, it's not CPU stressed...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vicent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, i'm still experiencing crashes, even when using 4.11

Please find full kern.log in attachment.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear from 24-48h of operation.

My hardware is:
- SSUS PRIME B350-Plus
- Amd Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in bios (from 12 threads to 6 threads)
- Boot clocksource=tsc iommu=soft
- Disable IOMMU in bios

None of these work. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that this will not work either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just go a kernel panic...

How to add the tag kernel-bug-exists-upstream ?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to achieve this bug report on the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like that the system is stable when removing "nouveau" driver.
Waiting 24/48h and will post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also met the problem, and your advice removing "nouveau" succeeded for work-around to stabilize system. Before it freeze every 3-5hr after booting up, but now working without problem for 2 days.

My enviroment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days update now and no crash.

camparijet => did you installed NVIDIA driver instead ?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't have to use the card for my purpose, so simply i disable it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest to increase SOC voltages, use specific versions of the kernel and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (eg. 17), setting them to the next even number (eg. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR is still necessary though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide link on how DragonflyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I also experience this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an Asrock X370 Gaming professional

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen, once every few minutes it seemed to response for a short time, enabling me to switch to the TTY. Then I always saw some processes hanging at 100% that I could not kill, or the kill was delayed for a long time. I always see the errors in the initial posts during that time. The TTY seemes to work fine though, once entered though.

David Wilson (plottt) wrote :

I've ran into this problem several times. After disabling C-states in the motherboard, my system has been running stable for ~5 weeks uptime so far.

Torge (cyslider) wrote :

Ok, this trick did not help and my PC did not make it through the night without freezing again.

However I finally found the real problem. I installed oibaf a while back as my Kubuntu was flickering all over the place. Now it seems the be the source of my problem. I deinstalled oibaf again and now everything seems fine. No lock screen freezing for me anymore.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

Im getting the same issues, brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling cool n quiet and c-state

However there is another option under advanced - for GLOBAL C-States that i just disabled today and i am waiting to see what will happen.

On BIOS v3.00 PC would freeze after 6 hours of idling or not touching anything. Come back to see my keyboard and mouse and everything was frozen.
With BIOS v.3.10 now i get random reboots atleast once a day.

Im running Ubuntu 16.04.3 Kernel 4.4.0-93

Anyone know to direct me where the SoC voltage would be in BIOS? is it the VDD SoC?
Anyone know how to update my kernel to 4.10 if it tells me i have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple days to a week or so. It happens it seems after being relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all ryzen manufactured after week 30. All my ryzen are pre-24 CPUs and they all have this silent crash issue. Does it also happen with the post week 30 ryzen ? (@tgui what is yours ?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

tags: removed: kernel-da-key
information type: Public → Public Security
information type: Public Security → Public
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
622 comments hidden view all 702 comments

(In reply to Lars Viklund from comment #572)
> (You people should be happy you're not running FreeBSD, there I can
> reasonably reliably hang Ryzens within hours by sending ZFS snapshots :D )

Looks like they've addressed Ryzen errata issues around August 2018:

https://github.com/freebsd/freebsd/blob/75ee4f08d3acd4bf70f24b3203fa440255873973/sys/amd64/amd64/initcpu.c#L133

Also, they've made machdep.idle=hlt and machdep.idle_mwait=0 default for Ryzen processors.

I've additionally checked the registers mentioned in FresBSD's source. Latest BIOS 4406 (w/ AGESA 0070) for my ASUS A320M-K seems to apply fixes for all affected errata.

Eventhough I can't reliably reproduce the stability issues anymore, I think it's still good practice to have Boris' patch applied, since it's addresses a known erratum. I'll test the patch and check for regressions.

I'm actually desperate about this issue. I'm looking forward to see results of Boris' patch, though I really doubt it will solve this issue. Needs explicit check against self-built kernel really.

Not all manufacturers are willing to fix this issue (especially when it comes to laptops), as Windows seems to fix this inside it's kernel since 1709 I believe. I've made a post on HP forum about this, specifically for my laptop:

https://h30434.www3.hp.com/t5/Notebook-Boot-and-Lockup/Random-Soft-Lock-up-on-Ryzen/m-p/7062823

(In reply to Borislav Petkov from comment #571)
> (In reply to Lars Viklund from comment #570)
> > rdmsr yields 6800000000010, which has bit 4 set.
>
> Looks like your BIOS applies the fix. Now, does the patch in comment #526
> fix your freezes?

Hi Boris, on my laptop with its latest BIOS I get:

$ sudo rdmsr -a 0xc0011020
6800000000000

This is on:

cpu family : 23
model : 17
model name : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
stepping : 0
microcode : 0x8101007

Booting with:

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-4.20.16-200.fc29.x86_64 root=/dev/mapper/fedora-root ro resume=/dev/mapper/fedora-swap rd.lvm.lv=fedora/root rd.lvm.lv=fedora/swap rhgb quiet LANG=en_US.UTF-8 idle=nomwait iommu=pt processor.max_cstate=1

Has your patch been applied to kernel yet?

(In reply to JerryD from comment #578)
> $ sudo rdmsr -a 0xc0011020
> 6800000000000
>
> This is on:
>
> cpu family : 23
> model : 17
> model name : AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
> stepping : 0
> microcode : 0x8101007

You're showing me the MSR output for erratum 1033 but your CPU is
not affected.

> Has your patch been applied to kernel yet?

No it hasn't - I'm still waiting for someone to test it and confirm that it fixes the issue on their boxes. And by issue I don't mean erratum 1033 - that is a red herring anyway - but the lockups people are reporting on some broken BIOSes.

Created attachment 281999
Add rifw kernel parameter to test a couple of patch to workaround ryzen freezes

Add rifw kernel parameter.

rifw=none - don't do anything
rifw=alfie - ignore BIOS
rifw=boris - use Borislav Petkov patch

I works with kernels 4.19, 4.20 and 5.0.2

> No it hasn't - I'm still waiting for someone to test it and confirm that it
> fixes the issue on their boxes. And by issue I don't mean erratum 1033 -
> that is a red herring anyway - but the lockups people are reporting on some
> broken BIOSes.

Yes, your patch "seems" to work in my case.

"Seems" because this bug is difficult to replicate...

I've spent the last few days reading through this thread.

I have Fedora 28, ryzen 1700x, and ASRock Steel Legend B450M. It's a server mostly used for files at the moment. Bone stock Fedora 28, samba, ZFS.

With GPU plugged in, hangs within 6-10 hours. Replicating this too often since March 27th.

With GPU unplugged, hangs happen within 10-20 minutes. Replicated 3 times.

Does not hang while SSH'd in.

Ran Memtest and tested for the segfault issue. Hardware is fine. Bought 'open box' on Amazon; this was in case they were re-selling an affected CPU without knowing.

Have not tried alfie's patch. Do have kernel 4.20 compiled with alfie's newest rifw patch as per Comment 580.

With Borislav's patch:
- Modifications to the OS are Samba, ZFS, Borislav's patch (comment #526 on kernel 5.0.1)
- booted March 30th at ~22:00
- Hung March 31st at ~09:40
- This is slightly longer than other tests
- I used the instructions he provided in Comment 511 without checking out 4.20. It compiles, but doesn't fix the issue, at least with linux 5.0.1-rc2 (apologies for not following steps and using 4.20)

'Typical Current Idle' set:
- Modifications to the OS are only Samba and ZFS
- Booted at 20:45 EST
- Still on now at time of this post 11:40 EST(I'll check before I hit send)
- This is the record up-time for this server
- Haven't tested anything else because I needed to have reliable access to my shares for my days off. This was 1 seemingly sure-fire way of making that happen.

Will update with more results once linux 4.20 with borislav and Alfie's patches tested. Could be a couple days. Less if 'Typical Current Idle' doesn't work, as I'll likely be RMAing my mobo and CPU.

I used the rifw=alfie with a 1700x on an MSI X370 Gaming Plus BIOS 7A33v5H with no freezing for 1 week. Computer was rock solid on older BIOS (7A33v55) for a year+ until BIOS update to latest.

I'm using low power idle option with kernel 4.14.111 opensuse 15.0.

PC had kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0) - After BIOS update and gone now after patch in Comment 580 with rifw=alfie boot option.

I haven't had the chance to get through the whole thread yet and not sure if I'll have enough time today. However, I wanted to quickly ask if anyone has faced this issue with an Intel processor?

We are seeing kernel panics due to the soft lockup issue with the following configuration:

CPU Architecture: Intel Skylake or Higher
vCPUs: 8
RAM: 52 GB
Kernel: 4.15.0
OS: Ubuntu 16.04.1

For some additional information, this setup does leverage Nested VMX to run Docker containers with the --privileged option enabled.

The soft lockups seems to occur at a rate of varying from 1-5% of our VMs getting impacted in a 24-hour period running in this configuration.

The load on the cluster does not seem to impact the frequency of the issue.

One thing I did notice is that it seems most/all of the kernel panics have had been involved with a call to smp_call_function_single.

Since I have not made it far enough through this thread, I'm not sure if the focus on this is exclusively on AMD, or if anyone else had an Intel example?

Is there a fix in the pipe for this issue and if so, a target kernel version?

Created attachment 282319
Threadripper 2920x soft lockup

I've went through this thread, and tried to disable c6 state, add idle=halt with no effect. Followed this excellent reprod steps in https://www.reddit.com/r/Amd/comments/apw8im/ryzen_freezes_in_linux_even_if_linux_is_in_vm/ It is pretty consistent. However, switching to gcc-8 seems resolved this for this particular reprod steps. With dmesg -w, I was able to capture and verify this is indeed a soft lockup. Attached the dmesg output.

It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest BIOS on Ubuntu 18.04.2 LTS.

Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Hello Liu

You're hitting another bug: it happens on high load (which can be easily triggered by a lot of parallel compilation threads). The only solution for your kind of problem is to RMA you CPU. See: https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response (but the problem seems not to be related to Ryzen 7 as stated at Phoronix and it's not possible to say that CPU's after a special build time would be ok). I have been affected, too (Ryzen 7) and went through RMA process. Luckily, the first RMA'd CPU was ok - there have been people who needed more than one RMA. Take a look there:
https://community.amd.com/message/2822339
If I hit the problem those days (gcc segfault), I had to rapidly reboot my machine because it mostly hang completely some time later.

This thread here covers the problem of hanging Ryzen CPU's when the machine is completely on idle over some time (it's the opposite of the problem you're facing and I saw before CPU RMA). I "fixed" the hang on idle problem by slightly overclocking the CPU (+200 MHz - ASUS X370-PRO / BIOS 4011 04/19/2018 - Bios option: Optimization for daily computing). All other suggested workarounds don't work for me. But you can't say, overclocking would fix the problem for sure (as any other suggested workaround). But you could try it.

Hey, Klaus,

My understanding of that issue is it only affects Ryzen 1700 / 1700x and doesn't affect Epyc / TR line. I am on ThreadRipper 2920x, which should be out of marginality problem for over a year now. Also, it doesn't SegFault at any point, just hangs. I do believe it is the similar problem as this thread, but since it is rare, so it could be just one faulty CPU for sure. Just doesn't align well with marginality problem timeline.

I have a couple of Threadrippers (1950X) that also hang under very specific workloads with a Asus X-399-A motherboard. In my case when the freeze happens the only way to get the machine back is to unplug it from the wall and plug it back. Not even reset works.

After many tweaks, I found out that I can avoid that but setting either the "overclocking enhancement" on in the BIOS (this is specific to Asus boards) our manually setting the Vcore to something like 1.315V (which should work on all motherboards). My guess is than when in idle the motherboard drops the voltage too low and the CPU hangs.

I ran into similar problems to this with a Threadripper 1950X and none of the proposed workarounds did anything. Disabling C6 in the BIOS made it fail to boot for some reason. Setting "Typical Current Idle" in the BIOS, processor.max_cstate=1 in the kernel command line, or disabling C6 with ZenStates.py all had no effect. I also tried overclocking and underclocking.

I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just works (for 10+ hours, I don't leave my machine on overnight). Once it started working, I set everything back to the defaults except for "Typical Current Idle" (since I have a PSU from 2010 and I doubt it supports any fancy features from 2013).

Unfortunately I can't easily narrow down all of the changes after the re-install, but the ones I'm aware of are:

 - Using the Nouvea driver (previously proprietary nVidia drivers)
 - Booting in EFI mode now (previously legacy mode)
 - Using ext4 (previously btrfs)

I strongly suspect that the graphics driver was the problem since my lockups would cause the screen to become completely unresponsive, but sound continued working, and in one case I had a lockup during a video call and the other person could still see and hear me.

I plan to try the proprietary drivers again sometime and I'll make a note if that brings the problem back for me.

Download full text (3.3 KiB)

(In reply to Brendan Long from comment #589)
> I strongly suspect that the graphics driver was the problem since my lockups
> would cause the screen to become completely unresponsive, but sound
> continued working, and in one case I had a lockup during a video call and
> the other person could still see and hear me.

What you're describing here is a new "feature" introduced between kernel 4.19.16 and 17 e.g. I can see exactly the same here with radeon hardware. The system is completely working (even VMs on the host are running well) except of graphics - even tty terminals are working sometimes. When ssh'ing the machine, I can always see log entries like these:

radeon 0000:0a:00.0: ring 0 stalled for more than 14084msec
radeon 0000:0a:00.0: GPU lockup (current fence id 0x0000000000053ed7 last fence id 0x0000000000053f0f on ring 0)
...

I'm trying to near it down currently using git bisect. The suspicious changes left are at the moment:

2019-01-22 arm64: Don't trap host pointer auth use to EL2 Mark Rutland bad
2019-01-22 arm64/kvm: consistently handle host HCR_EL2 flags Mark Rutland
2019-01-22 scsi: target: iscsi: cxgbit: fix csk leak Varun Prakash
2019-01-22 scsi: target: iscsi: cxgbit: fix csk leak Varun Prakash
2019-01-22 Revert "scsi: target: iscsi: cxgbit: fix csk leak" Sasha Levin
2019-01-22 mmc: sdhci-msm: Disable CDR function on TX Loic Poulain
2019-01-22 netfilter: nf_conncount: fix argument order to find_next_bit
2019-01-22 netfilter: nf_conncount: speculative garbage collection on empty lists Pablo Neira Ayuso
2019-01-22 netfilter: nf_conncount: move all list iterations under spinlock Pablo Neira Ayuso
2019-01-22 netfilter: nf_conncount: merge lookup and add functions Florian Westphal

2019-01-22 netfilter: nf_conncount: restart search when nodes have been erased Florian Westphal ?
2019-01-22 netfilter: nf_conncount: split gc in two phases Florian Westphal
2019-01-22 netfilter: nf_conncount: don't skip eviction when age is negative Florian Westphal
2019-01-22 netfilter: nf_conncount: replace CONNCOUNT_LOCK_SLOTS with CONNCOUNT_SLOTS Shawn Bohrer
2019-01-22 can: gw: ensure DLC boundaries after CAN frame modification Oliver Hartkopp
2019-01-22 tty: Don't hold ldisc lock in tty_reopen() if ldisc present Dmitry Safonov
2019-01-22 tty: Simplify tty->count math in tty_reopen() Dmitry Safonov
2019-01-22 tty: Hold tty_ldisc_lock() during tty_reopen() Dmitry Safonov
2019-01-22 tty/ldsem: Wake up readers after timed out down_write() Dmitry Safonov

As you're describing correctly, the problem seems to be network related. I'm getting this error two when watching videos from internet. I'm...

Read more...

(In reply to Brendan Long from comment #589)
>
> I eventually did a from-scratch reinstall of Fedora 29 and it suddenly just
> works (for 10+ hours, I don't leave my machine on overnight). Once it
> started working, I set everything back to the defaults except for "Typical
> Current Idle" (since I have a PSU from 2010 and I doubt it supports any
> fancy features from 2013).

a) Fedora 29 has very new kernels. In the last few weeks they rebased to 5.0 and 5.0 is supposed to have a lot of Ryzen fixes in it.

b) For me "Typical Current Idle" (plus mwait) was all that was required to fix the problem. I'm pretty sure kernel 5.0 makes it so mwait override is not needed.

> - Using the Nouvea driver (previously proprietary nVidia drivers)
> - Booting in EFI mode now (previously legacy mode)
> - Using ext4 (previously btrfs)

We never used proprietary video driver and we both had the problem and fixed the problem. Same with EFI: we use just legacy mode. FS should not matter at all.

The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power Supply Idle Control" option off the configuration UI; downgrading is not supported (but effectively possible at least from Windows). It is still possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g.
  /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
during boot.

Luckily, this workaround may no longer be needed. While, with default 'BIOS' (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during idle for me, it no longer does so on 5.1.2 (apparently - needs more testing - not on 5.0 either).

(In reply to Moritz Naumann from comment #592)
> The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power
> Supply Idle Control" option off the configuration UI; downgrading is not
> supported (but effectively possible at least from Windows). It is still
> possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g.
> /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
> during boot.
>
> Luckily, this workaround may no longer be needed. While, with default 'BIOS'
> (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during
> idle for me, it no longer does so on 5.1.2 (apparently - needs more testing
> - not on 5.0 either).

Since 5.0.10, and this commit ...:

(https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.10)
"commit 205c53cbe553c9e5a9fe93f63e398da7e59124b6
Author: Thomas Gleixner <email address hidden>
Date: Sun Apr 14 19:51:06 2019 +0200

    x86/speculation: Prevent deadlock on ssb_state::lock

    commit 2f5fb19341883bb6e37da351bc3700489d8506a7 upstream.

    Mikhail reported a lockdep splat related to the AMD specific ssb_state
    lock:

      CPU0 CPU1
      lock(&st->lock);
                                 local_irq_disable();
                                 lock(&(&sighand->siglock)->rlock);
                                 lock(&st->lock);
      <Interrupt>
         lock(&(&sighand->siglock)->rlock);

      *** DEADLOCK ***

    The connection between sighand->siglock and st->lock comes through seccomp,
    which takes st->lock while holding sighand->siglock.

    Make sure interrupts are disabled when __speculation_ctrl_update() is
    invoked via prctl() -> speculation_ctrl_update(). Add a lockdep assert to
    catch future offenders.

    Fixes: 1f50ddb4f418 ("x86/speculation: Handle HT correctly on AMD")"

... the problem seems to be fixed! I no longer need to use idle=halt and the systems don't freeze anymore!

(In reply to Philip Rosvall from comment #593)
> Since 5.0.10, and this commit ...:
>
> (https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.0.10)
> "commit 205c53cbe553c9e5a9fe93f63e398da7e59124b6
> Author: Thomas Gleixner <email address hidden>
> Date: Sun Apr 14 19:51:06 2019 +0200
>
> x86/speculation: Prevent deadlock on ssb_state::lock

If that commit really fixes the "issue" on your machine then you
shouldn't have been experiencing any softlockups etc, i.e., what this
bugzilla is about, but rather a software deadlock.

Which would mean that your machine is not necessarily affected by the
entering into a C-state with MWAIT and not waking up after, issue.

This bug bugs me so much, that I registered to kernel bugzilla.

I tried many workarounds, nothing solved it.

Machine: ThinkPad A485 with AMD 2500U.
System:
Distributor ID: Ubuntu
Description: Ubuntu 19.04
Release: 19.04
Codename: disco
Kernel: Linux kni 5.0.0-13-generic #14-Ubuntu SMP Mon Apr 15 14:59:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

And the problem is still there.

Have idle=nomwait kernel parameter set, but tried many others. Disabling C6 states, everything.

I can reproduce this with 100% success by running my accounting software Flexibee (https://www.flexibee.eu/podpora/stazeni-flexibee/) which has a PostgreSQL backend. When it stars PostgreSQL, soft lockups start to appear, it gives me freezes and finally the system completely freezes after a 30 or so seconds. I can even reproduce it in VirtualBox, when starting the software in VirtualBox, it crashes the whole system (the host system!).

I can offer help with debugging this problem. This problem needs to go away. It is not present on Windows as far as I know.

1 comments hidden view all 702 comments

(In reply to Liu Liu from comment #585)
> Created attachment 282319 [details]
> Threadripper 2920x soft lockup
>
> I've went through this thread, and tried to disable c6 state, add idle=halt
> with no effect. Followed this excellent reprod steps in
> https://www.reddit.com/r/Amd/comments/apw8im/
> ryzen_freezes_in_linux_even_if_linux_is_in_vm/ It is pretty consistent.
> However, switching to gcc-8 seems resolved this for this particular reprod
> steps. With dmesg -w, I was able to capture and verify this is indeed a soft
> lockup. Attached the dmesg output.
>
> It is a Threadripper 2920x with MSI x399 Gaming Pro Carbon AC with latest
> BIOS on Ubuntu 18.04.2 LTS.
>
> Linux sz77 4.15.0-47-generic #50-Ubuntu SMP Wed Mar 13 10:44:52 UTC 2019
> x86_64 x86_64 x86_64 GNU/Linux

Some updates since I last posted. I've updated to gcc-8 and enabled idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day uses, I haven't encountered any system lockup in the past 2 months. I concluded that idle=halt should mitigate this problem for normal uses.

(In reply to Liu Liu from comment #597)
--- snip --
> Some updates since I last posted. I've updated to gcc-8 and enabled
> idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> uses, I haven't encountered any system lockup in the past 2 months. I
> concluded that idle=halt should mitigate this problem for normal uses.

Are you saying that you compiled kernel with gcc-8 or that you use gcc-8 in your day to day work?

On my system I get a lockup if I do not use idle=nomwait. Like everyone else I too have tried verious combinations to be stable. (Ryzen 2500U Laptop) As far as I can tell this bug is not really fixed except on Microsoft Windows (and who knows what they are doing). Regardless things seem stable now but I can not use suspend (a different bug)

This embarrassing problem persists, will it be fixed on the new 3000 series, due in the next few months? Does anyone out there have access to this hardware?

Very close to the 2 year anniversary of this issue and there's still no 100% clear solution outlining how to fix the problem.

It may be originally a hardware issue, but I am sure it does not affect Windows, so it should be fixable in kernel code. I am not even sure if the problem was present on Windows at all.

I can reproduce it with 100% success anytime I want.

(In reply to JerryD from comment #598)
> (In reply to Liu Liu from comment #597)
> --- snip --
> > Some updates since I last posted. I've updated to gcc-8 and enabled
> > idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> > still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> > uses, I haven't encountered any system lockup in the past 2 months. I
> > concluded that idle=halt should mitigate this problem for normal uses.
>
> Are you saying that you compiled kernel with gcc-8 or that you use gcc-8 in
> your day to day work?
>
> On my system I get a lockup if I do not use idle=nomwait. Like everyone else
> I too have tried verious combinations to be stable. (Ryzen 2500U Laptop) As
> far as I can tell this bug is not really fixed except on Microsoft Windows
> (and who knows what they are doing). Regardless things seem stable now but I
> can not use suspend (a different bug)

gcc-8 for my day-to-day work.

(In reply to OptionalRealName from comment #599)
> This embarrassing problem persists, will it be fixed on the new 3000 series,
> due in the next few months? Does anyone out there have access to this
> hardware?
>
>
> Very close to the 2 year anniversary of this issue and there's still no 100%
> clear solution outlining how to fix the problem.

No, it won't be fixed. I GUESS. Because I have mailed this issue and the link to this thread to many hardware reviews sites to get attention, and to AMD, and not a single one even bothered to reply let alone have a look at it.

If they even don't want to recognize this as a bug, why fix it?

(In reply to Liu Liu from comment #597)
> Some updates since I last posted. I've updated to gcc-8 and enabled
> idle=halt. Even though idle=halt + gcc-7 with the original reprod steps can
> still cause a lockup. By defaulting to gcc-8 and idle=halt, in day-to-day
> uses, I haven't encountered any system lockup in the past 2 months. I
> concluded that idle=halt should mitigate this problem for normal uses.

I have a Ryzen Threadripper 2990WX setup and experience the similar random lockups that you describe! Setting idle=halt however improved but did not fix the problem for me entirely. When do a "btrfs scrub" on my NVME SSD with 250 GB data I could get reproducible freezes even with idle=halt set. In average every second btrfs scrub over 250 GB data would cause a freeze so this was a very effective way to reproduce the error condition. (scrubbing twice through 250GB data takes only about 3 minutes on my system)

The "good" news is that since I changed the idle parameter to idle=poll (which obviously burns off electricity like crazy) I can now do many (20+) btrfs scrub runs in a row without provoking any lockups and so far the system runs stable.

Maybe someone here who also has a btrfs on a SSD drive can try if they can reproduce the freezes with it in the same way. The btrfs scrub may be the more "evil" workload compared to parallel kernel compilations with gcc-7/8.

A final thought is that this thread may report two independent issues. There is the "completely unstable system issue" that initially froze up during two out of three boot cycles that can be solved for me by seting idle=nomwait and disabling c6 states in bios.

And then there is this random freeze once-in-a-while issue that only is resolved so far by going to idle=poll.

AMD issued a new ucode few days back. Anyone using that, it updated automatically on my Arch system. I havent had the lock up issue on my ASUS B350 board with latest BIOS and option of system idle current checked.

(In reply to Arup from comment #604)
> AMD issued a new ucode few days back. Anyone using that, it updated
> automatically on my Arch system.
For me this update that came yesterday on Arch did not change the microcode version at all! Seems like there is no update for my processor.
Before (and still) running:
model name : AMD Ryzen Threadripper 2990WX 32-Core Processor
microcode : 0x800820b

So Ryzen 3000 is coming, who will be the first guinea pig?

Curious to see if this problem is mysteriously, entirely fixed on the 3000 series.

(I damn well hope so)

(In reply to Moritz Naumann from comment #592)
> The latest firmware for X370 Taichi, v5.50 (2019/4/24), removes the "Power
> Supply Idle Control" option off the configuration UI; downgrading is not
> supported (but effectively possible at least from Windows). It is still
> possible to set "Power Supply Idle Control" (C6 package) via MSR using e.g.
> /sbin/modprobe msr && /usr/sbin/wrmsr -a 0xC0010292 true
> during boot.
>
> Luckily, this workaround may no longer be needed. While, with default 'BIOS'
> (actually UEFI firmware) settings on v5.50, Linux 4.18 still freezes during
> idle for me, it no longer does so on 5.1.2 (apparently - needs more testing
> - not on 5.0 either).

Hi, have you been able to do more testing? Does it still not freeze anymore for you with v.5.50? You're the second person I find saying that v5.50 removes the freezes with the default settings. I also use the ASRock X370 Taichi and the "Power Supply Idle Control" option, but I'm still on v4.70. However, since my computer also freezes on Windows when idle without changing that option, I now think that this is probably a different problem than the bug in this discussion. I actually found other people on forums who also have experienced those freezes on Windows, so I’m not an isolated case. They don’t all have ASRock boards, but there might also be something specific to ASRock. I have not really tested this yet, but sometime ago, I was told that at least on the ASRock boards, the default SOC voltage of 0.9 V is too low, and that raising it to at least 1.0 V would prevent the freezes (but don't go higher than 1.1 V). I'm probably going to raise that voltage anyway because I would like to increase the frequency of my RAM and lower the timings, if the BIOS would let me do so properly and keep the settings… The v4.x BIOSes for this board are known to have serious bugs in this respect [http://forum.asrock.com/forum_posts.asp?TID=9371&KW=ram+worse&title=psa-stay-away-from-480-bios-x370-taichi], so I will probably take the risk to upgrade to v5.50 soon, too.

Meanwhile, AMD replaced my faulty 1700X CPU, and the replacement I received (same CPU model, but produced much later) seems not to have the segfault bug, although I should test it more in order to be completely sure. I must say that AMD was very nice in the RMA process, I really recommend doing the RMA directly with them if you have the segfault bug.

(In reply to onox from comment #607)

To prevent misunderstandings, let me repeat what I previously wrote, from a different angle: ASRock X370 Taichi firmware version v5.50 is worse than previous versions in that it removes the "Power Supply Idle Control" option off the configuration screen. As a result, with Linux 4.18 (Ubuntu kernel 4.18.0-20-generic) and below the system *will* lockup unless different counter measures are taken (namely set the MSR).

This said, I have been and am running this single system 24/7 without issues despite extended low load periods on Ubuntu 18.04 LTS kernel 5.0.0-15-generic with microcode level 0x08001137 since May 15 (and from May 11 to May 15 on mainline Linux 5.1.2 with the same microcode) with AsRocks' firmware configuration defaults and without setting the MSR.

Summing up, the latest ASRock firmware for this mainboard takes a step in the wrong direction for Linux users, however newer Linux versions prevent this system from locking up.

MCExtractor is aware of a newer microcode, 0x08001138 (seen on 2019-02-04) for CPUID 0x00800f11, which ASRock decided not to include in their firmware.

I think I experience the same problem as described here with my recently built Ryzen computer. After installing a fresh Linux distribution I started out with Linux kernel 4.19.45. The computer ran for two days continuously, after a short pause I came back and saw it completely froze. Screen showed the last visible image, but no input worked. Neither ping/ssh, neither mouse or keyboard worked. Not even the num lock LED. There were no log entries written, neither on screen or harddisk. Last entry was some cron job, cleaning up temporary directories. The workload was just a vim, a VNC viewer and a Firefox. The next night, the same freeze happened. That finally got my attention to dig deeper.

As first measure I updated the Kernel to 5.1.4. After 2 hours running, without much load, it also froze.

I dug deeper, and as first measure I set the BIOS from "Auto" to "Typical current Idle". Uptime is now 1 day, without any workload except a chat client.
This is

This is my System:

# CPU:
    cpu family : 23
    model : 1
    model name : AMD Ryzen 5 1600 Six-Core Processor
    stepping : 1
    microcode : 0x8001137
    cpu MHz : 1646.140

# Mainboard:
    Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F5 01/25/2019

# GPU:
    NVIDIA GP107 [GeForce GTX 1050 Ti]
    Driver/OpenGL: 4.6.0 NVIDIA 418.74

# OS:
    Distro: Manjaro
    Kernel: 5.1.4-1-MANJARO x86_64

I also see this in my bootup log:
12 times: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
12 times: microcode: CPU0: patch_level=0x0800113

Like said, my computer is running now for 1 day without any load with the BIOS option "Typical current Idle" set. This is so far the same "workaround" that
brodyck wrote at 2019-04-01 15:40:26 UTC. Unfortunately, or fortunately for him, he has not wrote anything since then. That might mean, that the problems stopped for him at that point.
I will further observe this issue with my new workstation. If you don't hear back from me, the BIOS setting very likely worked.

As I mention around Comment 485 and later (starting Jan 22), for us we have nearly the same computer and after setting BIOS to current idle as you did, all of our problems disappeared. Haven't messed with it since. That's 4 months of uptime, zero hangs, zero crashes.

I'd bet everyone who was here and said they'd try the BIOS tweak and has not reported back falls into the same camp. Everyone else is dealing with more or other problems, probably.

As for Ryzen 3000 series, AMD better have fixed this !@%!^# problem! Since it looks like 3000 will do ECC nicely, I'll be building another box with it in early July.

My two (server) machines running 1600's are also not experiencing problems any more using new BIOSes + 4.19 kernels + "Typical current bullshit". Hence I tend to agree that for most people the issue is solved.

However Trevor I'm mostly posting to let you know ECC works fine, at least with ASRock boards (of which the spec actually features ECC support, e.g. mine https://www.asrock.com/MB/AMD/AB350M%20Pro4/index.asp#Specification), even under Ryzen 1xxx CPU's:

# dmesg | grep -i edac
[ 0.057169] EDAC MC: Ver: 3.0.0
[ 9.926945] EDAC amd64: Node 0: DRAM ECC enabled.
[ 9.926946] EDAC amd64: F17h detected (node 0).
[ 9.926985] EDAC MC: UMC0 chip selects:
[ 9.926986] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.926987] EDAC amd64: MC: 2: 4096MB 3: 0MB
[ 9.926988] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.926989] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.926991] EDAC MC: UMC1 chip selects:
[ 9.926991] EDAC amd64: MC: 0: 0MB 1: 0MB
[ 9.926992] EDAC amd64: MC: 2: 4096MB 3: 0MB
[ 9.926992] EDAC amd64: MC: 4: 0MB 5: 0MB
[ 9.926993] EDAC amd64: MC: 6: 0MB 7: 0MB
[ 9.926993] EDAC amd64: using x8 syndromes.
[ 9.926994] EDAC amd64: MCT channel count: 2
[ 9.927095] EDAC MC0: Giving out device to module amd64_edac controller F17h: DEV 0000:00:18.3 (INTERRUPT)
[ 9.927104] EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV 0000:00:18.0 (POLLED)
[ 9.927105] AMD64 EDAC driver v3.5.0

# edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow2: 0 Uncorrected Errors
mc0: csrow2: mc#0csrow#2channel#0: 0 Corrected Errors
mc0: csrow2: mc#0csrow#2channel#1: 0 Corrected Errors
edac-util: No errors to report.

See also:
http://www.hardwarecanucks.com/forum/hardware-canucks-reviews/75030-ecc-memory-amds-ryzen-deep-dive.html

Download full text (4.4 KiB)

AMD has released a new version of AGESA Combo-AM4 1.0.0.3 for my motherboard MSI B450 Gaming Plus and others. There were also updates of the CPU microcode.

The first (#1) on the list for example is Ryzen 2600, old and new bios, comparison of the microcode version:

OLD bios
╔═════════════════════════════════════════════════════════════════╗
║ AMD ║
╟────┬──────────┬──────────┬────────────┬───────┬──────────┬──────╢
║ # │ CPUID │ Revision │ Date │ Size │ Offset │ Last ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 1 │ 00800F82 │ 0800820B │ 2018-06-20 │ 0xC80 │ 0x4DD000 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 2 │ 00800F12 │ 08001230 │ 2018-08-04 │ 0xC80 │ 0x4DDD00 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 3 │ 00800F11 │ 08001137 │ 2018-02-14 │ 0xC80 │ 0x4DEA00 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 4 │ 00800F10 │ 0800100C │ 2017-01-31 │ 0xC80 │ 0x4DF700 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 5 │ 00800F00 │ 0800002A │ 2016-10-06 │ 0xC80 │ 0x4E0400 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 6 │ 00810F10 │ 0810100B │ 2018-02-12 │ 0xC80 │ 0x65A500 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 7 │ 00820F00 │ 08200002 │ 2018-02-14 │ 0xC80 │ 0x65B200 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 8 │ 00810F00 │ 08100004 │ 2016-11-20 │ 0xC80 │ 0x65BF00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 9 │ 00810F80 │ 08108002 │ 2018-06-05 │ 0xC80 │ 0x65CC00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 10 │ 00810F81 │ 08108102 │ 2018-08-13 │ 0xC80 │ 0x65D900 │ No ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 11 │ 00810F11 │ 08101102 │ 2018-11-06 │ 0xC80 │ 0x65E600 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 12 │ 00660F00 │ 06006012 │ 2014-10-14 │ 0xA20 │ 0xD97F50 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 13 │ 00660F01 │ 0600611A │ 2018-01-26 │ 0xA20 │ 0xD98970 │ Yes ║
╚════╧══════════╧══════════╧════════════╧═══════╧══════════╧══════╝

NEW bios

╔═════════════════════════════════════════════════════════════════╗
║ AMD ║
╟────┬──────────┬──────────┬────────────┬───────┬──────────┬──────╢
║ # │ CPUID │ Revision │ Date │ Size │ Offset │ Last ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 1 │ 00800F82 │ 0800820D │ 2019-04-16 │ 0xC80 │ 0x50D000 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 2 │ 00800F12 │ 08001250 │ 2019-04-16 │ 0xC80 │ 0x50DD00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 3 │ 00800F11 │ 08001138 │ 2019-02-04 │ 0xC80 │ 0x50EA00 │ Yes ║
╟────┼──────────┼──────────┼────────────┼───────┼──────────┼──────╢
║ 4 │ 00800F10 │ 0800100C │ 2017-01-31 │ 0xC80 │ 0x50F7...

Read more...

So when do we get our first post, with the same problems, from X570 users, with Ryzen 3600 / 3600x / 3800 etc?

I'm very curious to know if this issue magically is gone on the next series.

No workarounds fixed my 2400G lockups.
I RMAd the CPU (got a new one) and motherboard (because the firmware killed itself) and got it repaired.

Since then I didn't have a single lockup with all the workarounds disabled, so

I'd say that CPUs before some certain time period were possibly manufactured with defects or my unit was just defective.

FWIW: I have an ASUS Prime 370-Pro, Ryzen 7 1800X, currently running Kernel v5.0.17.

With "Typical Current Idle", and no kernel tweaks, that has run without "freezing while idle" for about a year.

Yesterday I upgraded the BIOS, which for some reason reset the "Power Supply Idle Control" to "Auto". That prompted me to run a small experiment to see if kernel or other upgrades have improved things. They have not: the machine "froze while idle" (fairly promptly). I have now restored "Typical Current Idle", in the hope of another untroubled year.

I note, however, that my cpu (0x00800f11) is running with firmware 0x0800137, not the more recent 0x0800138 -- if I am reading the table in Comment 612, above, correctly.

Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
Displaying first 40 and last 40 comments. View all 702 comments or add a comment.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.