Bug #1690085 “Ryzen 1800X freeze - rcu_sched detected stalls on ...” : Bugs : linux package : Ubuntu

Revision history for this message

Vincent (hvincent13) wrote on 2017-05-11:

#1

lspci-vnvn.log (72.5 KiB, text/plain)

Brad Figg (brad-figg) wrote on 2017-05-11: Missing required logs.

#2

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Vincent (hvincent13) wrote on 2017-05-11: JournalErrors.txt

#3

JournalErrors.txt (2.2 KiB, text/plain)

apport information

tags:	added: apport-collected zesty
description:	updated

Vincent (hvincent13) wrote on 2017-05-11: ProcCpuinfoMinimal.txt

#4

ProcCpuinfoMinimal.txt (1.3 KiB, text/plain)

apport information

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Lipido (lipido) wrote on 2017-05-15:

#5

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote on 2017-05-15:

#6

kern.log (134.1 KiB, text/plain)

Hi Lipido,

Yes, I'm still experiencing crashes, even with 4.11.

Please find the full kern.log attached.

Regards,

Lipido (lipido) wrote on 2017-05-15:

#7

Ok, me too :-(

Crashes appear after 24-48 hours of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Updating the BIOS to the latest version (v 609)
- Kernel 4.10
- Disabling SMT in the BIOS (from 12 threads to 6 threads)
- Booting with clocksource=tsc iommu=soft
- Disabling IOMMU in the BIOS

None of these worked. I was waiting for 4.11 to become available for Ubuntu as an official package, but it seems that it will not help either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote on 2017-05-15:

#8

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance:	Undecided → High
status:	Confirmed → Incomplete
tags:	added: kernel-da-key

Vincent (hvincent13) wrote on 2017-05-15:

#9

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (I hope!)

Regards,

Vincent (hvincent13) wrote on 2017-05-15:

#10

Just got a kernel panic...

How do I add the tag kernel-bug-exists-upstream?

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed
tags:	added: kernel-bug-exists-upstream

Joseph Salisbury (jsalisbury) wrote on 2017-05-15:

#11

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status:	Confirmed → Triaged

Vincent (hvincent13) wrote on 2017-05-17:

#12

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to submit this bug report to the mailing list.

Thanks,

Vincent (hvincent13) wrote on 2017-05-18:

#13

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I will wait 24-48 hours and post again.

Regards,

camparijet (iichikolamp) wrote on 2017-05-22:

#14

Hi Vincent,

I also ran into this problem, and your advice to remove "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting up, but it has now been working without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote on 2017-05-24:

#15

Hi,

5 days of uptime now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

camparijet (iichikolamp) wrote on 2017-05-29:

#16

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't need the card for my purposes, so I simply disabled it.

Alex Jones (blenheimears) wrote on 2017-05-29:

#17

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote on 2017-06-09:

#18

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Alex Jones (blenheimears) on 2017-06-09

Changed in linux (Ubuntu):
status:	Triaged → Invalid

Alex Jones (blenheimears) wrote on 2017-06-14:

#19

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
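
As a rough sketch of the workaround above (not something posted in this report), the snippet below reads the current kernel.randomize_va_space value and prints what it means, then lists in comments how the setting might be applied and persisted. The helper name describe_aslr and the sysctl.d file name are assumptions made for illustration.

```shell
# Map the numeric value of kernel.randomize_va_space to a short description.
describe_aslr() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "conservative randomization" ;;
        2) echo "full randomization" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Report the current setting if the sysctl file is readable.
aslr_file=/proc/sys/kernel/randomize_va_space
if [ -r "$aslr_file" ]; then
    echo "ASLR is currently: $(describe_aslr "$(cat "$aslr_file")")"
fi

# To disable ASLR until the next reboot (requires root):
#   sysctl -w kernel.randomize_va_space=0
# To keep it disabled across reboots (hypothetical file name):
#   echo 'kernel.randomize_va_space = 0' > /etc/sysctl.d/99-ryzen-aslr.conf
```

Checking the value first makes it easy to confirm after a reboot whether the workaround is still in effect.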

Changed in linux (Ubuntu):
status:	Invalid → New

Joseph Salisbury (jsalisbury) wrote on 2017-06-14: Status changed to Confirmed

#20

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed
tags:	added: artful

Marc Rene Schädler (suaefar) wrote on 2017-07-06:

#21

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest increasing SoC voltages, using specific versions of the kernel, and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote on 2017-07-21:

#22

AMD has not publicly commented on this issue as far as I'm aware. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and to disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. A new AGESA update, 1.0.0.6a, was released about a week ago, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Alex Jones (blenheimears) wrote on 2017-07-23:

#23

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote on 2017-07-23:

#24

If any of your RAM timings are odd (e.g. 17), setting them to the next even number (e.g. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL and disabling ASLR is still necessary, though. It may also be a good idea to give the SoC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote on 2017-07-24:

#25

Alex,

Can you provide a link showing how DragonFlyBSD fixed this issue?

Alex Jones (blenheimears) wrote on 2017-07-26:

#26

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.
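
One way to check the statement about a given distribution kernel is to look at its build configuration. The following is a minimal sketch, assuming the config is installed as /boot/config-$(uname -r) (the usual location on Ubuntu and Debian, possibly different elsewhere). has_config is a hypothetical helper name.

```shell
# has_config FILE OPTION: succeed if FILE contains the line OPTION=y.
has_config() {
    grep -q "^$2=y\$" "$1"
}

# Assumed location of the running kernel's build configuration.
config_file="/boot/config-$(uname -r)"

if [ -r "$config_file" ]; then
    for opt in CONFIG_RCU_NOCB_CPU CONFIG_RCU_NOCB_CPU_ALL; do
        if has_config "$config_file" "$opt"; then
            echo "$opt is enabled"
        else
            echo "$opt is not enabled"
        fi
    done
fi
```

If neither option is reported as enabled, the idle-time freeze workaround described above is not active in that kernel build.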

Alex Jones (blenheimears) wrote on 2017-07-26:

#27

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote on 2017-07-26:

#28

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote on 2017-07-26:

#29

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote on 2017-07-26:

#30

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an ASRock X370 Gaming professional.

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen; once every few minutes it seemed to respond for a short time, enabling me to switch to the TTY. Then I always saw some processes hanging at 100% that I could not kill, or the kill was delayed for a long time. I always see the errors from the initial posts during that time. The TTY seems to work fine once entered, though.

David Wilson (plottt) wrote on 2017-07-27:

#31

I've run into this problem several times. After disabling C-states in the motherboard BIOS, my system has been running stable, with ~5 weeks of uptime so far.

Torge (cyslider) wrote on 2017-07-28:

#32

OK, this trick did not help: my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed oibaf a while back because my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf and now everything seems fine. No more lock screen freezing for me.

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#85

Created attachment 257955
Example Log

Full description of the problem/report:

Ryzen 1700x (8 cores) system randomly hard freezes with a soft lockup bug and a hard reset is needed. The system doesn't respond anymore to mouse or keyboard input. The problem typically occurs when the load is low while web surfing - stress testing with e.g. mprime is no problem. Example Log: Please see attachment.

NMI watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [DOM Worker:1364]

Kernel version:
Linux version 4.12.4-2-MANJARO (builduser@ste74-i5) (gcc version 7.1.1 20170630 (GCC) ) #1 SMP PREEMPT Sun Jul 30 17:04:04 UTC 2017

Environment - Software:
Manjaro Linux - Stable Update 2017-08-10; KDE Plasma; Graphics driver: linux-nvidia 1:375.82-1-any

Environment - Additional information:
Attachments follow.

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#86

Created attachment 257957
cpuinfo

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#87

Created attachment 257959
iomem

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#88

Created attachment 257961
ioports

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#89

Created attachment 257963
lspci -vvv

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#90

Created attachment 257965
modules

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-08-16:

#91

Created attachment 257967
scsi

Stuart Page (sdpagent) wrote on 2017-08-24:

#33

image of issues showing in console (2.0 MiB, image/png)

Just want to report that I am also experiencing this issue on Ubuntu Server 16.04.3 with the 4.10 kernel, freshly installed on 22 August 2017. All updates are applied and it is running as a KVM host on this hardware:

Ryzen 1700 (non-X)
Motherboard: ASUS PRIME B350-Plus
BIOS:
- Version 0805
- Disabled SMT
- Disabled C-states

In Linux Kernel Bug Tracker #196683, goric (goric-linux-kernel-bugs) wrote on 2017-08-29:

#92

I have the exact same problem with the same setup (Ryzen 1700X).

I have experienced the problem with kernels 4.4, 4.10 and now 4.12.8. The system also freezes without X and without the NVIDIA proprietary drivers.

xb5i7o (xb5i7o) wrote on 2017-09-08:

#34

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800X
ASRock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced, for Global C-States, that I just disabled today, and I am waiting to see what happens.

On BIOS v3.00 the PC would freeze after 6 hours of idling. I would come back to find my keyboard, mouse and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Does anyone know where the SoC voltage setting would be in the BIOS? Is it the VDD SoC?
Does anyone know how to update my kernel to 4.10 if it tells me I have the latest?

In Linux Kernel Bug Tracker #196683, wim (wim-linux-kernel-bugs) wrote on 2017-09-09:

#93

Bug confirmed. Will revert patch 1fccb73011ea8a5fa0c6d357c33fa29c695139ea.

Stuart Page (sdpagent) wrote on 2017-10-05:

#35

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-05:

#94

(In reply to Wim Van Sebroeck from comment #8)
> Bug confirmed. Will revert patch 1fccb73011ea8a5fa0c6d357c33fa29c695139ea.

Are you sure? That commit concerns iTCO_wdt, which is an Intel driver?

I have seen similar freezes many times. I captured the last time with netconsole and admittedly the call trace was different but I've seen many reports of Ryzen freezes on a variety of motherboards and disabling C6 or enabling CONFIG_RCU_NOCB_CPU is always said to work around it. Both of these certainly work here.

I must stress that this issue must not be confused with the segfault issue commonly reported with early Ryzens. I was facing that issue too and had my CPU replaced, but while the segfaults have gone, both my original and the replacement exhibit these freezes.

I believe we have not seen so many reports of this issue, partly because most users will not know how to get kernel output in this situation, and partly because Fedora's kernel already has CONFIG_RCU_NOCB_CPU enabled. I am using Gentoo but I have confirmed that this affects Debian.

Kai-Heng Feng (kaihengfeng) wrote on 2017-10-06:

#36

Stuart,

Does this issue also happen on the latest mainline kernel?

In Linux Kernel Bug Tracker #196683, wim (wim-linux-kernel-bugs) wrote on 2017-10-06:

#95

Oops. Sorry. That comment was for bug 196509.

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-06:

#96

Thanks to James Le Cuirot (comment of 2017-10-05 20:23:40 UTC). My Ryzen 5 1600 build hung randomly with the AMD drm-next-4.15-wip kernel and with mainline 4.14-rc1 and rc2 and 4.13.5 custom non-debug 1000 Hz timer kernels. It took 3 weeks to find this solution.

X froze randomly when scrolling web content in the Chrome Beta browser. At first I thought it was a USB 3.0 problem and moved my mouse to a USB 2.0 port. Then I found help for a Chrome bug:
https://askubuntu.com/questions/765974/chrome-freeze-very-frequently-with-ubuntu-16-04
That did not help.

Now uptime is 7 hours and 19 minutes and the system looks stable. I enabled GPU support in Chrome again and the system is not freezing while surfing.

xfce@ryzen5pc:~$ screenfetch
xfce@ryzen5pc
OS: Debian unstable sid
Kernel: x86_64 Linux 4.14.0-rc2
Uptime: 7h 19m
Packages: 1942
Shell: bash 4.4.12
Resolution: 1920x1080
DE: XFCE
WM: Xfwm4
WM Theme: Default
GTK Theme: Xfce [GTK2]
Icon Theme: Tango
Font: Sans 10
CPU: AMD Ryzen 5 1600 Six-Core @ 12x 3.194GHz [39.6°C]
GPU: AMD/ATI Baffin [Polaris11]
RAM: 1150MiB / 7989MiB

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-06:

#97

With RCU_NOCB_CPU you need to have the following in the kernel command line:
rcu_nocbs=0-11

Ryzen 5 1600 has 12 threads so change the upper limit according to your cpu thread count.
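
The range in that parameter can be derived from the thread count instead of typed by hand. A small sketch, assuming all threads should be covered (as in the comment above). rcu_nocbs_arg is a hypothetical helper name.

```shell
# rcu_nocbs_arg N: print the boot parameter covering CPUs 0..N-1.
rcu_nocbs_arg() {
    if [ "$1" -gt 1 ]; then
        echo "rcu_nocbs=0-$(( $1 - 1 ))"
    else
        echo "rcu_nocbs=0"
    fi
}

# Print the parameter for this machine, using the number of installed
# logical CPUs as reported by coreutils' nproc.
if command -v nproc >/dev/null 2>&1; then
    rcu_nocbs_arg "$(nproc --all)"
fi
```

For a 12-thread Ryzen 5 1600 this yields rcu_nocbs=0-11, and for a 16-thread part rcu_nocbs=0-15.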

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-06:

#98

(In reply to fin4478 from comment #12)
> With RCU_NOCB_CPU you need to have the following in the kernel command line:
> rcu_nocbs=0-11

Correct, there used to be CONFIG_RCU_NOCB_CPU_ALL until very recently. I don't know why it was removed. I still see this as a workaround though as I gather it effectively avoids C6.

ptheis, could you please confirm whether these workarounds work for you and reassign the bug to someone more appropriate? This isn't a watchdog bug, the watchdog just gets triggered by the problem.

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-06:

#99

I wonder if this is going to start affecting Fedora 27 now that CONFIG_RCU_NOCB_CPU_ALL has gone. I grabbed a rawhide kernel RPM and that entry is indeed missing from the config. It seems unlikely they would add rcu_nocbs to everyone's command line.

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-07:

#100

(In reply to James Le Cuirot from comment #13)
> (In reply to fin4478 from comment #12)
> > With RCU_NOCB_CPU you need to have the following in the kernel command
> line:
> > rcu_nocbs=0-11
>
> Correct, there used to be CONFIG_RCU_NOCB_CPU_ALL until very recently. I
> don't know why it was removed. I still see this as a workaround though as I
> gather it effectively avoids C6.
>

RCU_NOCB_CPU_ALL was removed 2017-06-08 18:52:43 -0700. "The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing and
are redundant with the rcu_nocbs= boot parameter. This commit therefore
removes these three Kconfig options and adjusts the rcutorture scripts
to use the boot parameter instead."

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

This is a typical, not end-user-friendly solution from tired programmers. Other companies have a conspiracy against AMD; they want to make AMD look bad and write code like this. Another example: the Ryzen k10temp and Zen CPU select patches have been ready for several months but are not in the mainline kernel. Coffee Lake and q cpus are coming.

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-10-07:

#101

Alright, many thanks for the workarounds. Manjaro has seen a few kernel updates recently. The last time I saw the error was on Kernel 4.13.4-1 (current stable is 4.13.4-3). This is what I changed on my system (not sure both are needed at the same time):

- added rcu_nocbs=0-15 to kernel command line (1700x has 16 threads)
- disabled C6 power management in BIOS
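
How the parameter gets onto the kernel command line is not spelled out here. One possible way, sketched under the assumption of a GRUB-based system whose defaults file contains a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line, is shown below. add_kernel_param is a hypothetical helper, and the demo deliberately edits a temporary copy rather than the real /etc/default/grub.

```shell
# add_kernel_param FILE PARAM: append PARAM to GRUB_CMDLINE_LINUX_DEFAULT
# in FILE unless that line already contains it (GNU sed -i assumed).
add_kernel_param() {
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$2" "$1"; then
        return 0
    fi
    sed -i "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"/GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $2\"/" "$1"
}

# Demonstrate on a temporary file standing in for /etc/default/grub.
demo=$(mktemp)
echo 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' > "$demo"
add_kernel_param "$demo" "rcu_nocbs=0-15"
cat "$demo"
rm -f "$demo"

# On a real system (as root) the file would be /etc/default/grub, and the
# GRUB configuration would then need regenerating, for example with
# update-grub or grub-mkconfig -o /boot/grub/grub.cfg, followed by a reboot.
```

The helper is idempotent, so running it twice does not add the parameter a second time.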

I also have an eye on HW acceleration in Firefox which was enabled all the time. If the workarounds help, I will need 1-3 weeks to confirm as the system doesn't run very long on weekdays.

@James Le Cuirot: I also changed the category of this bug (I hope this is correct - this is the first time I filed a Kernel bug).

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-07:

#102

(In reply to ptheis from comment #16)

> - added rcu_nocbs=0-15 to kernel command line (1700x has 16 threads)
> - disabled C6 power management in BIOS

You need to build a custom kernel and enable RCU_NOCB_CPU in the kernel configuration. Otherwise the rcu_nocbs parameter has no effect.

You want to save power when your CPU cores are not used, so keep the C6 option enabled in the BIOS.

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-07:

#103

(In reply to ptheis from comment #16)
> This is what I changed on my system (not sure
> both are needed at the same time):
Only one is needed, I am not enabling these kernel options at the moment.

> @James Le Cuirot: I also changed the category of this bug (I hope this is
> correct - this is the first time I filed a Kernel bug).
The category/component you've chosen is better but the bug is still assigned to the watchdog team. Unfortunately the default assignee for this component works for Intel and I doubt they'd be able to spend time on this. Maybe you should change it to "x86-64|Platform Specific/Hardware" instead. Make sure you tick the "Reset Assignee to default" box when it appears.

(In reply to fin4478 from comment #17)
> You need to make a custom kernel and enable RCU_NOCB_CPU in the kernel
> configuration. Otherwise the rcu_nocbs parameter have no effect.
Yes, Manjaro's kernel config doesn't appear to have this enabled.

> You want to save power when your CPU cores are not used so keep the C6
> option enabled in the Bios.
Building your own kernel can be daunting for some so the BIOS option is far easier. I don't think it makes any difference power-wise as I gather RCU_NOCB_CPU effectively prevents the CPU from entering C6 anyway.

(In reply to fin4478 from comment #15)
> This is typical not end user friendly solution from tired programmers. Other
> companies have a conspiracy against Amd, they do want to make Amd look bad
> and do code like this.
Let's just stick to the facts. I think the change was justified and it shouldn't be necessary to do this for Ryzen. I doubt the guy was even aware of this Ryzen issue as I discovered the RCU_NOCB_CPU workaround myself after spending several days flipping kernel options.

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-07:

#104

(In reply to James Le Cuirot from comment #18)

> Let's just stick to the facts. I think the change was justified and it
> shouldn't be necessary to do this for Ryzen.

IBM does not have the money to build one Ryzen PC to test their code ;-)
Think of all the newbies who have a new Ryzen PC and for whom even using the BIOS is difficult. The Ryzen and C6 documentation is hidden in here. Thousands of PCs with stock kernels are freezing randomly now. Great for AMD ;-)

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-07:

#105

(In reply to James Le Cuirot from comment #18)
> easier. I don't think it makes any difference power-wise as I gather
> RCU_NOCB_CPU effectively prevents the CPU from entering C6 anyway.

"
Depending on the CPU / APU model, the highest boosted frequency PState usually has a C6-state requirement.
For example on FX-8370 which has 4.0GHz (P0, base), 4.1GHz (Pb1, boost) and 4.3GHz (Pb0, boost) states, the C6-state requirement is set to four. This means that the highest PState (4.3GHz, Pb0) will not activate unless half of the cores are currently in C6 active state.

If you disable C6-state this condition can obviously never be met, and the highest boosted state will never activate.
In this case the CPU will operate at the highest frequency PState which doesn't have the C6-state requirement (4.1GHz Pb1).
"

http://www.overclock.net/t/1328938/what-is-core-c6-state-exactly

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-10-08:

#106

C6 activated in BIOS ... testing custom kernel with rcu_nocbs=0-15 ...

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-08:

#107

Thanks for changing the category/component again but it still needs reassigning. Please set this to <email address hidden>.

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-10-09:

#108

There is no option for setting ...

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-10-10:

#109

(In reply to ptheis from comment #23)
> There is no option for setting ...

If you mean the kernel configuration, you need to enable the RCU_EXPERT option to see the RCU_NOCB_CPU setting.

https://cateee.net/lkddb/web-lkddb/RCU_NOCB_CPU.html
"
depends on: ( CONFIG_TREE_RCU || CONFIG_PREEMPT_RCU ) && ( CONFIG_RCU_EXPERT || CONFIG_NO_HZ_FULL )

"
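
The quoted dependency expression can be evaluated against a kernel .config before opening menuconfig. This is only a sketch: config_on and rcu_nocb_selectable are hypothetical helper names, and the script assumes it is run from a kernel source tree (or given a config path as its first argument).

```shell
# config_on FILE OPTION: succeed if FILE contains the line OPTION=y.
config_on() {
    grep -q "^$2=y\$" "$1"
}

# Evaluate: (TREE_RCU || PREEMPT_RCU) && (RCU_EXPERT || NO_HZ_FULL)
rcu_nocb_selectable() {
    { config_on "$1" CONFIG_TREE_RCU || config_on "$1" CONFIG_PREEMPT_RCU; } &&
    { config_on "$1" CONFIG_RCU_EXPERT || config_on "$1" CONFIG_NO_HZ_FULL; }
}

cfg=${1:-.config}
if [ -r "$cfg" ]; then
    if rcu_nocb_selectable "$cfg"; then
        echo "RCU_NOCB_CPU can be selected with this configuration"
    else
        echo "RCU_NOCB_CPU is hidden; enable RCU_EXPERT (or NO_HZ_FULL) first"
    fi
fi

# In a kernel source tree the options could then be switched on with the
# bundled helper script, followed by a dependency refresh:
#   ./scripts/config --enable RCU_EXPERT --enable RCU_NOCB_CPU
#   make olddefconfig
```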

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-10-10:

#110

(In reply to fin4478 from comment #24)
> (In reply to ptheis from comment #23)
> > There is no option for setting ...
>
> If you mean kernel configuration, You need to to enable the RCU_EXPERT
> option to see the RCU_NOCB_CPU setting.
>
> https://cateee.net/lkddb/web-lkddb/RCU_NOCB_CPU.html
> "
> depends on: ( CONFIG_TREE_RCU || CONFIG_PREEMPT_RCU ) && ( CONFIG_RCU_EXPERT
> || CONFIG_NO_HZ_FULL )
>
> "

Sorry, my comment was on changing the assignee. I found no option to do so.

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2017-10-13:

#111

I got hit by this bug on Fedora. My Ryzen 1600X system would randomly hang a short while after boot after upgrading to kernel 4.13.4. I looked at various things that could be the cause and thought it was fixed but then I rebooted and it happened again. And again. I quickly figured out that there's no problem with kernel 4.12.3 on Fedora. Tried kernel 4.13.5, same problem, went back to 4.12.3 until I wasted too much time looking into this today.

config-4.12.14-300.fc26.x86_64 on Fedora has
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

and there's no problem. config-4.13.5-300.fc27.x86_64 only has
CONFIG_RCU_NOCB_CPU=y

and with that kernel there's a problem _unless_ I add rcu_nocbs=0-11 to the kernel command line - which I only figured out after looking at this bug.
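
Whether the running kernel actually booted with the parameter can be checked from /proc/cmdline. A minimal sketch follows. cmdline_has_param is a hypothetical helper that only looks for the parameter name, not for a particular CPU range.

```shell
# cmdline_has_param "CMDLINE" NAME: succeed if NAME or NAME=... appears
# as a whitespace-separated word in CMDLINE.
cmdline_has_param() {
    for word in $1; do
        case "$word" in
            "$2"|"$2"=*) return 0 ;;
        esac
    done
    return 1
}

if [ -r /proc/cmdline ]; then
    if cmdline_has_param "$(cat /proc/cmdline)" rcu_nocbs; then
        echo "rcu_nocbs is set on the running kernel's command line"
    else
        echo "rcu_nocbs is not set on the running kernel's command line"
    fi
fi
```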

Thank you James Le Cuirot.

This bug is listed as "Regression: No". It should be Yes in the case of Fedora; 4.12.x kernels work, 4.13.x do not work without a kernel boot parameter fix.

The commit that removed CONFIG_RCU_NOCB_CPU_ALL should please be reverted. The statement "The CONFIG_RCU_NOCB_CPU_ALL, CONFIG_RCU_NOCB_CPU_NONE, and
CONFIG_RCU_NOCB_CPU_ZERO Kconfig options are used only in testing" is clearly false since these options are/were used by distributions like Fedora and removing CONFIG_RCU_NOCB_CPU_ALL *breaks* kernel 4.13.5. You can't really expect distributions to ship with/add the rcu_nocbs= parameter as an alternative.

Just to repeat this point: My first conclusion was simply that 4.13.x kernels are broken and this made me simply stick with 4.12.x which doesn't have this new problem with Ryzen CPUs until I wasted time looking into this.

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-13:

#112

(In reply to oyvinds from comment #26)
> I got hit by this bug on Fedora. My Ryzen 1600X system would randomly hang a
> short while after boot after upgrading to kernel 4.13.4.

Welcome to the club. Hopefully now that Fedora is affected, this will get more attention.

> This bug is listed as "Regression: No". It should be Yes in the case of
> Fedora; 4.12.x kernels work, 4.13.x do not work without a kernel boot
> parameter fix.

I wouldn't call it a regression as the kernel options involved are merely a workaround for the underlying issue. Fedora just happened to be enabling them anyway until one of them went away.

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2017-10-14:

#113

(In reply to James Le Cuirot from comment #27)
> I wouldn't call it as a regression as the kernel options involved are merely
> a workaround for the underlying issue.

That's true and we shouldn't have to apply this workaround. Still, on Fedora it's a regression in my opinion because kernel 4.12.x (happens to) work fine and 4.13.x doesn't.

As an end user, my conclusion was simply that 4.13.x is broken, and I kept using 4.12.x until I had time to look into why 4.13.x is broken. And it still is for many: my 68-year-old mother isn't going to add an rcu_nocbs= parameter to the kernel command line on her laptop, or do much beyond using it to check the news and weather. I am fairly sure I wouldn't be able to guide her through doing so even if I spent 10 hours on the phone; I'd have to go there. Luckily it's got some old Intel CPU.

Yes, these are kernel options that some enable and some don't, but in the case of Fedora it's: you update, you get a new kernel, you're screwed and now your machine hangs randomly, which it previously did not. Isn't that the definition of a regression?

Revision history for this message

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-10-17:

#114

My results after testing for some days: There was one freeze without errors in the journal. The second was also not visible in the logs, but the system rebooted and threw an MCE hardware error. The system is not stable yet.

Okt 16 19:32:22 pc1 kernel: x86: Booting SMP configuration:
Okt 16 19:32:22 pc1 kernel: .... node #0, CPUs: #1 #2
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: Machine check events logged
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 3: b2a00020003f0000
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: TSC 0 IPID 300b000000000
Okt 16 19:32:22 pc1 kernel: mce: [Hardware Error]: PROCESSOR 2:800f11 TIME 1508175138 SOCKET 0 APIC 2 microcode 8001126
Okt 16 19:32:22 pc1 kernel: #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15
Okt 16 19:32:22 pc1 kernel: smp: Brought up 1 node, 16 CPUs

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-10-17:

#115

(In reply to ptheis from comment #29)
> My results, testing some days: There was one freeze without errors in
> journal. The second was also not visible in the logs, but the system
> rebooted and threw a mce hardware error. The system is not stable yet.

How old is your Ryzen? The older ones with the random segfault issue often emitted MCE messages. I don't recall seeing any with my newly replaced Ryzen. Although segfaults are the most common symptom, I wouldn't be surprised if the fault could also trigger a freeze in some way.

Revision history for this message

Franck Charras (franckc) wrote on 2017-10-18:

#37

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

Revision history for this message

tgui (eric-c-morgan) wrote on 2017-10-21:

#38

I too still have random computer shutdowns without logs. Uptime varies from a couple days to a week or so. It happens it seems after being relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Revision history for this message

Franck Charras (franckc) wrote on 2017-10-22:

#39

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzen CPUs manufactured after week 30. All my Ryzens are pre-week-24 CPUs and they all have this silent crash issue. Does it also happen with the post-week-30 Ryzens? (@tgui, what is yours?)

Revision history for this message

AS (as2008) wrote on 2017-10-22:

#40

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

Revision history for this message

tgui (eric-c-morgan) wrote on 2017-10-24:

#41

My 1700 is week 33 as well.

https://photos.app.goo.gl/HNwpPJqI85sZU4VU2

Revision history for this message

tgui (eric-c-morgan) wrote on 2017-10-24:

#42

More details.

- latest agesa
- disabled nouveau
- disabled randomized mem
- upped SOC voltage
- disabled C-states / cool-n-quiet
- using kernel 4.13.4 with RCU changes made, though only CONFIG_RCU_NOCB_CPU was available in 4.13.4

I'm open to trying other kernels if requested.

Revision history for this message

Franck Charras (franckc) wrote on 2017-10-24:

#43

It seems that this problem is widespread, and is not related to https://bugzilla.kernel.org/show_bug.cgi?id=196481 which is linked in this thread.

Revision history for this message

tgui (eric-c-morgan) wrote on 2017-10-24:

#44

Franck: I very much agree that it's not related to what you linked. The link you refer to deals with heavy-load compilation segfaults, which are addressed with an RMA'd CPU.

This thread seems to be mostly centered around system idle / low usage complete system crash. No logs, blank screen.

I have a movie playing on my server on loop to hopefully keep the machine from becoming too idle.... which I hate doing.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-10-26:

#116

Week 33 Ryzen 1700 checking in here. I compiled my 4.13.4 kernel with CONFIG_RCU_NOCB_CPU=y, and still have crashes as described.

I'm hoping CONFIG_RCU_NOCB_CPU_ALL will be added back as an option as kernel command params seem like a hack.

Please do know I appreciate those that offer their time to the linux kernel.

Revision history for this message

tgui (eric-c-morgan) wrote on 2017-10-26:

#45

PLEASE post your concerns to this kernel.org thread with this issue!

https://bugzilla.kernel.org/show_bug.cgi?id=196683

The CONFIG_RCU_NOCB_CPU_ALL kernel option was removed and would likely fix this issue if put back! Supposedly there is a kernel command param you can pass to alleviate some of this, but that's a hack in my mind.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-10-27:

#117

Update, added kernel boot param and enabled C6 and cool and quiet in the BIOS. I'll report back... I guess in a few days or a few weeks.

Revision history for this message

In Linux Kernel Bug Tracker #196683, hoper (hoper-linux-kernel-bugs) wrote on 2017-10-27:

#118

Just a small message to say "me too". I spent the last 4 months, and a lot of time and money, changing the motherboard, the CPU, the power supply... before doing research and finding that these freezes are software related :(

I tried (and managed) to compile my own kernel with CONFIG_RCU_NOCB_CPU=y and so on. Before that, my server always crashed after 2 or 3 days. With this custom kernel, yes, it's better... The freezes only appear after 8 or 9 days. But the freezes are still here. And I guess I will just sell all this stuff and go back to Intel :(

I can't understand why this information, "LINUX + RYZEN = NOT STABLE", is not spread everywhere. Lots of people out there are losing lots of time and money, I'm sure of that. Of course I'm grateful to all open source developers (I'm also sharing what I can :) and I really hope that the root cause of this bug will be found and corrected in the next few months... We need to be able to use Linux on RYZEN! (perfect CPU for servers).

If someone manages to make this bug disappear (and has an uptime > 30 days), please share how you did that, with enough detail for a beginner like me :)

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-10-27:

#119

Hoper,

If you're still running your compiled kernel add a kernel param "rcu_nocbs=0-15", where 0-15 is for a 16 thread CPU, 0-11 for a 12 thread for example.
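The range arithmetic above (16 threads gives 0-15, 12 threads gives 0-11) can be sketched in shell. This is only an illustration; the helper name rcu_nocbs_param is made up here, and the CPU count is taken from getconf, which is an assumption about how you want to count CPUs.

```shell
#!/bin/sh
# Hypothetical helper (not from the thread): build the rcu_nocbs= value
# covering CPUs 0..N-1 for a machine with N logical CPUs.
rcu_nocbs_param() {
    echo "rcu_nocbs=0-$(( $1 - 1 ))"
}

# Ask the system how many logical CPUs are online and print the
# matching boot parameter.
rcu_nocbs_param "$(getconf _NPROCESSORS_ONLN)"
```

Calling rcu_nocbs_param 16 prints rcu_nocbs=0-15, and rcu_nocbs_param 12 prints rcu_nocbs=0-11.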

Here is a good link on how to add it.

https://askubuntu.com/questions/19486/how-do-i-add-a-kernel-boot-parameter

Remember that you're still using a new architecture. It takes time to make Linux play nice. We've been in an intel dominated world for a very long time. ;-)

Revision history for this message

In Linux Kernel Bug Tracker #196683, hoper (hoper-linux-kernel-bugs) wrote on 2017-10-30:

#120

Thanks. I did it Sunday... Waiting for the freeze :o)

Revision history for this message

Franck Charras (franckc) wrote on 2017-10-30:

#46

I've added "rcu_nocbs=0-15" to the kernel boot parameters this morning (along with "processor.max_cstate=1" last week), will update if a new freeze occurs.

Revision history for this message

In Linux Kernel Bug Tracker #196683, jon780 (jon780-linux-kernel-bugs) wrote on 2017-11-04:

#121

I'm experiencing this bug as well, locking up every 24-48 hours roughly.  I've tried adding "rcu_nocbs=0-15" to the grub boot to see if that resolves it.

Fedora 26 (kernel 4.13.9-200.fc26.x86_64)
Ryzen 1700X

Nov 03 19:54:30 pc003 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [kworker/14:1:217]
Nov 03 19:54:30 pc003 kernel: Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_co
Nov 03 19:54:30 pc003 kernel:  crc32_pclmul ccp snd_timer drm ghash_clmulni_intel snd soundcore tpm_tis sp5100_tco tpm_tis_core wmi_bmof i2c_piix4 shpchp tpm parport_pc parport acp 
Nov 03 19:54:30 pc003 kernel: CPU: 14 PID: 217 Comm: kworker/14:1 Tainted: P           OEL  4.13.9-200.fc26.x86_64 #1
Nov 03 19:54:30 pc003 kernel: Hardware name: Micro-Star International Co., Ltd MS-7A34/B350 TOMAHAWK ARCTIC (MS-7A34), BIOS H.50 06/22/2017
Nov 03 19:54:30 pc003 kernel: Workqueue: events netstamp_clear
Nov 03 19:54:30 pc003 kernel: task: ffff8ac634d62640 task.stack: ffffaea343fdc000
Nov 03 19:54:30 pc003 kernel: RIP: 0010:smp_call_function_many+0x24a/0x270
Nov 03 19:54:30 pc003 kernel: RSP: 0018:ffffaea343fdfce0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Nov 03 19:54:30 pc003 kernel: RAX: ffff8ac63e61f398 RBX: ffff8ac63e99b580 RCX: 0000000000000000
Nov 03 19:54:30 pc003 kernel: RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8ac63e022b88
Nov 03 19:54:30 pc003 kernel: RBP: ffffaea343fdfd18 R08: ffffffffffffffff R09: 000000000000bfff
Nov 03 19:54:30 pc003 kernel: R10: fffffa649fe76140 R11: ffff8ac63e007c00 R12: 0000000000000010
Nov 03 19:54:30 pc003 kernel: R13: 0000000000000010 R14: ffffffff8b02d6a0 R15: 0000000000000000
Nov 03 19:54:30 pc003 kernel: FS:  0000000000000000(0000) GS:ffff8ac63e980000(0000) knlGS:0000000000000000
Nov 03 19:54:30 pc003 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 03 19:54:30 pc003 kernel: CR2: 00007fc490f23000 CR3: 000000011fe09000 CR4: 00000000003406e0
Nov 03 19:54:30 pc003 kernel: Call Trace:
Nov 03 19:54:30 pc003 kernel:  ? netif_receive_skb_internal+0x28/0x410
Nov 03 19:54:30 pc003 kernel:  ? setup_data_read+0xa0/0xa0
Nov 03 19:54:30 pc003 kernel:  ? netif_receive_skb_internal+0x29/0x410
Nov 03 19:54:30 pc003 kernel:  on_each_cpu+0x2d/0x60
Nov 03 19:54:30 pc003 kernel:  ? netif_receive_skb_internal+0x28/0x410
Nov 03 19:54:30 pc003 kernel:  text_poke_bp+0x6a/0xf0
Nov 03 19:54:30 pc003 kernel:  __jump_label_transform.isra.0+0x10b/0x120
Nov 03 19:54:30 pc003 kernel:  arch_jump_label_transform+0x32/0x50
Nov 03 19:54:30 pc003 kernel:  __jump_label_update+0x68/0x80
Nov 03 19:54:30 pc003 kernel:  jump_label_update+0xae/0xc0
Nov 03 19:54:30 pc003 kernel:  static_key_slow_inc+0x95/0xa0
Nov 03 19:54:30 pc003 kernel:  static_key_enable+0x1d/0x30
Nov 03 19:54:30 pc003 kernel:  netstamp_clear+0x2d/0x40
Nov 03 19:54:30 pc003 kernel:  process_one_work+0x193/0x3c0
Nov 03 19:54:30 pc003 kernel:  worker_thread+0x4a/0x3a0
Nov 03 19:54:30 pc003 kernel:  kthread+0x125/0x140
Nov 03 19:54:30 pc003 kernel:  ? process_one_work+0x3c0/0x3c0
Nov 03 19:54:30 pc003 kernel:  ? kthread_park+0x60/0x60
Nov 03 19:54:30 pc003 kernel:  ret_from_fork+0x25/0x30
Nov 03 19:54:30 pc003 kernel: Code: 35 30 00 39 05 fc 92 f1 00 89 c1 0f 8e 3d fe ff ff 48 98 48 8b 13 48 03 14 c5 e0 e3 d3 8b 48 89 d0 8b 52 18 83 e2 01 74 0a f3 90 <8b> 50 18 83 e
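
For anyone wanting to check their own logs for the same signature, here is a minimal sketch that pulls the CPU number and stall duration out of "soft lockup" lines like the first one in the trace above. The function name soft_lockups is hypothetical, and the pattern assumes the message wording shown in this log.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print
# "<cpu> <seconds>" for every watchdog soft-lockup message.
soft_lockups() {
    sed -n 's/.*soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s.*/\1 \2/p'
}

# Demo with the first line of the trace quoted above.
printf '%s\n' 'Nov 03 19:54:30 pc003 kernel: watchdog: BUG: soft lockup - CPU#14 stuck for 23s! [kworker/14:1:217]' | soft_lockups
```

On a systemd machine the kernel messages of the previous boot could be fed in with journalctl -k -b -1 | soft_lockups, provided the journal is persistent.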

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-06:

#122

Another hang on 4.13.5 with Ryzen 1600X using Fedora 26. Works fine on 4.12.9.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-06:

#123

I'm at 11 days uptime with the custom 4.13.4 kernel and kernel boot params.

Jon, did you also compile the RCU settings or only apply the boot params?

Seth, I'm on 4.13.4 FWIW and now doing well. How long was your uptime?

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-06:

#124

Eric, Fedora has CONFIG_RCU_NOCB_CPU=y for 4.13 so I just set rcu_nocbs=0-11 on the kernel command line. I'll report how it goes. Without the params, I would hang within a few hours.

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-06:

#125

Eric, been running for a few hours now, longer than the hang normally takes to surface, and it hasn't. Seems that the rcu_nocbs=0-11 masks the issue.

I sent an email off to Paul McKenney, RCU maintainer, to see if he couldn't take a look. I'm out of my depth on this one. RCU is voodoo of the highest order!

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-06:

#126

Seth, fantastic! Agreed on RCU complexity. I started reading up on it.. and then went back to funny cat pictures.

Revision history for this message

In Linux Kernel Bug Tracker #196683, johnpaul7 (johnpaul7-linux-kernel-bugs) wrote on 2017-11-06:

#127

Eric and Seth, I'm debugging and testing myself now after having very similar issues. Hardware no longer an issue after early-week RMA and confirmed stability with 48hr stressapptest run. Curious if you wouldn't mind sharing what distro and DE/WM you are running as well? (Seth looks like Fedora 26 for you).

I also assume that at this point the only Ryzen specific tweaks you are running are `CONFIG_RCU_NOCB_CPU` and `rcu_nocbs`? No more Ryzen-specific bios tweaks?

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-06:

#128

John,

No custom kernel or BIOS settings. Running stock clocks and voltages for memory and CPU. Just added rcu_nocbs to cmdline to (re)mask whatever is going on.

Also, I'm not sure that stress tests are the best way to recreate this issue. In my experience, it happens when the system is mostly idle.

Revision history for this message

In Linux Kernel Bug Tracker #196683, jon780 (jon780-linux-kernel-bugs) wrote on 2017-11-06:

#129

(In reply to eric.c.morgan from comment #38)
> I'm at 11 days uptime with the custom 4.13.4 kernel and kernel boot params.
>
> Jon, did you also compile the RCU settings or only apply the boot params?
>
> Seth, I'm on 4.13.4 FWIW and now doing well. How long was your uptime?

I've only specified the boot parameter. Too soon to say it's resolved for sure, but uptime is over 2 days at this point, which seems to be an improvement.

Revision history for this message

In Linux Kernel Bug Tracker #196683, jon780 (jon780-linux-kernel-bugs) wrote on 2017-11-06:

#130

(In reply to John Paul Herold from comment #42)
> Eric and Seth, I'm debugging and testing myself now after having very
> similar issues. Hardware no longer an issue after early-week RMA and
> confirmed stability with 48hr stressapptest run. Curious if you wouldn't
> mind sharing what distro and DE/WM you are running as well? (Seth looks like
> Fedora 26 for you).
>
> I also assume that at this point the only Ryzen specific tweaks you are
> running are `CONFIG_RCU_NOCB_CPU` and `rcu_nocbs`? No more Ryzen-specific
> bios tweaks?

I'm also running Fedora 26, DE/WM is dwm.

I also want to say that my CPU is confirmed to be affected by the Ryzen bug. I'm currently discussing RMA with AMD but I do not believe these issues are related.

Revision history for this message

In Linux Kernel Bug Tracker #196683, johnpaul7 (johnpaul7-linux-kernel-bugs) wrote on 2017-11-06:

#131

(In reply to Seth Jennings from comment #43)
> Also, I'm not sure that stress tests are the best way to recreate this
> issue. In my experience, it happens when the system is mostly idle.

Seth, correct on the stress test, just wanted to clarify my hardware is stable, no bad OC, etc. I too get the issue when I return to work the next day after system had been idle overnight.

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-06:

#132

Just got the hang even with rcu_nocbs=0-11. Back to 4.12 then.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-06:

#133

John,

- week 33 R7 1700 CPU no segfault issues
- Ubuntu 16.04 with 4.13.4 kernel pulled from git, copied config, enabled CONFIG_RCU_NOCB_CPU; the CONFIG_RCU_NOCB_CPU_ALL option was NOT available.
- XFCE/Xubuntu to be more exact
- memtest 4 passes OK
- all c states, cool and quiet enabled in BIOS, no changes really
- boot params used as discussed

Seth,

My system crashes/d under very low load, almost idle. Before I tried any of these fixes I would run a video on loop to keep one processor more active to stave off crashes. It seemed to help. Maybe this would help you until a fix is official?

Jon,

Segfault bug is indeed a different issue. I had to RMA to get my week 33 CPU. It takes some time. Hang in there!

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-06:

#134

(In reply to Seth Jennings from comment #47)
> Just got the hang even with rcu_nocbs=0-11. Back to 4.12 then.

Are you able to disable C6 in the BIOS? I still don't know whether the RCU workaround avoids the issue entirely or just makes it less likely. I haven't had a single freeze since disabling C6.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-06:

#135

Seth/James,

Also consider disabling ASLR.

# disable in current session
echo 0 | tee /proc/sys/kernel/randomize_va_space

# make change permanent (across reboots)
echo "kernel.randomize_va_space = 0" > /etc/sysctl.d/01-disable-aslr.conf

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-06:

#136

(In reply to eric.c.morgan from comment #50)
> Also consider disabling ASLR.

That only helps with the segfault issue. I now have a new Ryzen and no longer encounter that.

I discovered the RCU workaround after receiving the new Ryzen and trying to work out why Gentoo froze but Fedora (at the time) didn't. I compared the kernel configurations and manually converged one towards the other, focusing on the things that seemed most likely to make a difference. It took several days. I was certain by that point that ASLR was unrelated here.

Revision history for this message

In Linux Kernel Bug Tracker #196683, johnpaul7 (johnpaul7-linux-kernel-bugs) wrote on 2017-11-07:

#137

(In reply to James Le Cuirot from comment #51)
> That only helps with the segfault issue. I now have a new Ryzen and no
> longer encounter that.

James, regarding ASLR, that is helpful as I've seen it recommended as a general Ryzen stability tweak. Which we definitely want to keep separate from this bug.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-07:

#138

I want to confirm this issue on my system (AMD Ryzen 1700X, segfault free chip after RMA) and offer some more info:

- First of all, let's define what CONFIG_RCU_NOCB_CPU does. It enables support for the rcuo kernel threads, which handle RCU callback processing. The option is automatically enabled if you select CONFIG_NO_HZ_FULL=y, and this is the reason why it is automatically enabled in Fedora kernels. Fedora also chose to set CONFIG_RCU_NOCB_CPU_ALL=y in their 4.12 series kernels, while the default is CONFIG_RCU_NOCB_CPU_NONE=y, which only enables RCU callback offloading to rcuo kernel threads for those CPUs that are defined in the rcu_nocbs boot parameter. This is also the default behaviour for 4.13 kernels, where the CONFIG_RCU_NOCB_CPU_* options have been dropped. So, in order to get the behaviour of CONFIG_RCU_NOCB_CPU_ALL=y in a 4.13 kernel, you need to supply the boot parameter rcu_nocbs=0-XX, where XX is your number of CPUs minus 1. So, by setting CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y (or rcu_nocbs=0-XX), we are telling the kernel to offload RCU callbacks to separate kernel threads called rcuo (RCU offload), one for each CPU. The affinity of those kernel threads is set to ffff, so they can run on any CPU.

- C6 is not effectively disabled by using rcuo kernel threads for RCU callback processing. When C6 is disabled, CPU voltage never drops under the voltage defined for the P2 state (0.9V for my CPU). With C6 enabled and rcuo kernel threads enabled, my voltage drops to 0.4V frequently. Also, disabling C6 prevents a single core from hitting XFR turbo speeds. The max CPU frequency for my 1700X is 3500 MHz when C6 is disabled. With C6 enabled, I can run single-threaded processes at 3900 MHz (the XFR turbo speed of the 1700X). Enabling rcuo kernel threads for RCU callback processing does not disable XFR turbo speeds. So the claim that this feature effectively disables C6 is not true.

- C6 can be disabled manually with zenstates.py script, even if your BIOS does not offer this option. It can be found at https://github.com/r4m0n/ZenStates-Linux

- In my case, idle freezes are prevented with either C6 disabled or with rcuo kernel threads for RCU callback processing enabled and C6 enabled. I had 14 days uptime with rcuo kernel threads enabled, while my system will usually freeze overnight without it.

- And finally, I have no idle freezes in Windows 10 with C6 enabled.
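
To see which of the configurations described in the points above a running system is actually in, a rough check could look like the sketch below. The function name has_rcu_nocbs is hypothetical, and the check assumes the offload threads show up in ps with names starting with "rcuo", as described above.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given kernel command line string
# contains an rcu_nocbs= parameter.
has_rcu_nocbs() {
    case " $1 " in
        *" rcu_nocbs="*) return 0 ;;
        *) return 1 ;;
    esac
}

# Inspect the command line of the running kernel.
cmdline=$(cat /proc/cmdline 2>/dev/null)
if has_rcu_nocbs "$cmdline"; then
    echo "rcu_nocbs is set on the kernel command line"
else
    echo "rcu_nocbs is not set on the kernel command line"
fi

# Count processes whose name starts with "rcuo" (the offload kthreads).
count=$(ps -e -o comm= 2>/dev/null | grep -c '^rcuo' || true)
echo "rcuo threads: ${count:-0}"
```

A count of zero together with a missing rcu_nocbs parameter would suggest that callbacks are not being offloaded on that boot.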

My opinion is that this idle freeze is a hardware issue of the AMD Zen processor, and the rcuo kernel threads for RCU callbacks just hide this hardware issue or make it less likely to trigger. Probably something similar happens in the Windows 10 kernel.

I think AMD should be contacted.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-07:

#139

Thank you very much for this extra detail, Panagiotis. I'll turn C6 back on! I wanted to contact AMD but I wasn't sure how best to do so.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-07:

#140

I think that an experienced kernel developer that understands how RCU callback could affect the triggering of a probable hardware issue should contact AMD. There are AMD kernel developers so they should get the attention of this bug report.

Revision history for this message

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2017-11-07:

#141

(In reply to Panagiotis Malakoudis from comment #55)
> I think that an experienced kernel developer that understands how RCU
> callback could affect the triggering of a probable hardware issue should
> contact AMD. There are AMD kernel developers so they should get the
> attention of this bug report.

How can we contact them? For me, the situation is still unclear ... I didn't have problems for some weeks, even while power-using my PC and configuring a new firewall. After a custom kernel I used standard Manjaro kernels from 4.13.9 to 4.13.11, keeping the RCU_NOCB_CPU option in my standard kernel command line. It is not visible in the Manjaro kernel config. Maybe it's triggered by something else?

https://github.com/manjaro/packages-core/blob/master/linux413/config.x86_64

Revision history for this message

In Linux Kernel Bug Tracker #196683, jon780 (jon780-linux-kernel-bugs) wrote on 2017-11-08:

#142

I've now got over 4 days of uptime with ryzen 1700x on fedora 26 (4.13.9-200) using the rcu_nocbs=0-XX boot option. I feel pretty confident in saying this resolved my issue. Should we be filing a bug report with Fedora as well? Clearly the current kernel configuration isn't compatible with (at least some) Ryzen CPUs.

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-08:

#143

I'm running 4.13.11 now on F26. I enabled kdump and kernel.softlockup_panic = 1 so maybe I can get a kernel core on the next hang if/when it occurs.
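
A quick way to confirm that the panic-on-soft-lockup setting is really active before waiting for the next hang is sketched below. The helper name sysctl_value is made up for this example; the only thing it relies on is the /proc/sys/kernel/softlockup_panic file that backs the kernel.softlockup_panic sysctl.

```shell
#!/bin/sh
# Hypothetical helper: print the contents of a /proc/sys file,
# or "unavailable" if it cannot be read.
sysctl_value() {
    if [ -r "$1" ]; then
        cat "$1"
    else
        echo "unavailable"
    fi
}

# 1 means the kernel will panic (and kdump can capture a core) on a soft lockup.
echo "kernel.softlockup_panic = $(sysctl_value /proc/sys/kernel/softlockup_panic)"
```

As root, sysctl -w kernel.softlockup_panic=1 sets the value for the current boot; a file under /etc/sysctl.d/ makes it persistent.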

Revision history for this message

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2017-11-08:

#144

Using Fedora 27 I've had zero problems with random hangs since I first commented on this on 2017-10-13, I've had rcu_nocbs=0-11 as part of GRUB_CMDLINE_LINUX= in /etc/sysconfig/grub since then. Currently using 4.13.11-300.fc27.x86_64, been using various 4.13.x kernels.
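
For those who have not edited the GRUB defaults file before, the change described above can be sketched as follows. The function name append_grub_param is hypothetical, it assumes the GRUB_CMDLINE_LINUX value sits on one double-quoted line, and the demo works on a temporary file rather than the real /etc/sysconfig/grub.

```shell
#!/bin/sh
# Hypothetical helper: print FILE ($1) with PARAM ($2) appended inside the
# quotes of the GRUB_CMDLINE_LINUX="..." line. The file itself is not modified.
# Assumes PARAM contains no '|' or '&' characters.
append_grub_param() {
    sed "s|^\(GRUB_CMDLINE_LINUX=\"[^\"]*\)\"|\1 $2\"|" "$1"
}

# Demo on a scratch copy with made-up contents.
tmp=$(mktemp)
printf 'GRUB_TIMEOUT=5\nGRUB_CMDLINE_LINUX="rhgb quiet"\n' > "$tmp"
append_grub_param "$tmp" "rcu_nocbs=0-11"
rm -f "$tmp"
```

After changing the real file, the GRUB configuration still has to be regenerated (grub2-mkconfig on Fedora, update-grub on Ubuntu) and the machine rebooted before the parameter takes effect.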

Like already mentioned, this was something I noticed very quickly after going from kernel 4.12.x to 4.13.x, everything was working fine and then Fedora "upgraded" and the system started randomly hanging (mostly right after boot if I didn't start stuff immediately). My first reaction (before finding this bug) was to simply go back to 4.12.x.

I have not filed a Fedora bug but I'm not sure how useful it would be. Their kernels used to ship with CONFIG_RCU_NOCB_CPU=y and CONFIG_RCU_NOCB_CPU_ALL=y and they probably still would if CONFIG_RCU_NOCB_CPU_ALL wasn't removed.

The kernel developers could fix this (for Fedora at minimum; I do not know what other distributions' kernels are configured to do) by reverting this:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/init/Kconfig?h=v4.14-rc3&id=44c65ff2e3b0b48250a970183ab53b0602c25764

I'm not sure why this is not happening. Perhaps it will in the far future when Ubuntu and RHEL/CentOS finally switch to a 4.13+ kernel, AMD seems to only care about Ubuntu 16 and CentOS 7 for some reason (look at their binary blob GPU drivers, for example). It would be excellent if someone with access could please revert that git commit before 2020. I don't really see any compelling arguments as to why that change was done and a whole lot of Ryzen systems randomly hanging is a good argument as to why it should be reverted.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-08:

#145

Reverting the CONFIG_RCU_NOCB_CPU_{ALL|ZERO|NONE} options deletion is not a solution. This option is just a workaround for what I think is a hardware issue.

There is at least one Ryzen 1700X processor (a friend of mine has it) that doesn't freeze on idle with stock kernel settings. Also, my original CPU, which had the segfault issue, didn't freeze on idle either. Actually, it seems most CPUs from RMA that don't have the segfault issue are more affected by the idle freeze issue than others.

AMD should be informed about the freeze-on-idle issue, and an experienced kernel developer should work with them to find out why a kernel with the rcuo kernel threads for RCU callbacks enabled hides the issue (or makes it less likely to happen).

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-08:

#146

(In reply to Panagiotis Malakoudis from comment #60)
> Reverting the CONFIG_RCU_NOCB_CPU_{ALL|ZERO|NONE} options deletion is not a
> solution. This option is just a workaround for what I think is a hardware
> issue.
Very much agreed.

> There is at least one Ryzen 1700X processor (friend of mine has it) that
> doesn't freeze on idle with stock kernel settings. Also, my original CPU
> that had segfault issues, didn't freeze on idle too. Actually, it seems most
> CPUs from RMA that don't have the segfault issue are more affected with the
> idle freeze issue than others.
My original CPU definitely had both issues and I wouldn't say that one froze more than the other. Perhaps there's another factor involved.

In Linux Kernel Bug Tracker #196683, b-o-s-s (b-o-s-s-linux-kernel-bugs) wrote on 2017-11-08:

#147

Another affected user here. Got my RMA week 33 segfault-free Ryzen 7 1700 a few weeks ago, running Gentoo on a custom vanilla kernel 4.13.10 atm. After 10 days of uptime, I now consider it stable. I did have freezes before, also with my pre-RMA CPU, but rcu_nocbs boot parameter seems to have fixed it, together with CONFIG_RCU_NOCB_CPU of course.

What I don't understand is why this should be another hardware bug? Ok, we don't know, maybe it is, but even then, this time there should be some software workaround. And I don't necessarily mean this RCU offloading-thing, which probably just masks the issue, but more in the terms of some underlying RCU (?) kernel issue triggering on this platform...?

Also consider that there are no freeze issues on Windows as far as I know. Ok, AMD quickly put out their AMD Ryzen Balanced power plan, but I think it's mostly about SMT scheduling, although it does disable C6 (I think, not sure). But anyhow, even on default Windows power plans, there are no freezes. So... either Microsoft quickly and silently incorporated some patch fixing this issue a long time ago, or it never was an issue for Windows in the first place. It doesn't matter, however; the point is that something can be done in software to keep this "bug" from triggering.

Forgive my ignorance, but why is it so difficult to assign this bug to the correct kernel dev? I'm just a user really and wondering... thanks.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-08:

#148

I'm at 13 days uptime with kernel and boot params applied. This ties my system stability days counted from before the boot params were added. I'll keep reporting back.

I am used to months of uptime with my previous Intel i5 2400 server. I hope to reach the same with my Ryzen.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-09:

#149

Created attachment 260577
softlock-dmesg.log

Was finally able to get a BUG on the soft lockup after attaching a serial console. Hoping to get a few more to create a good sampling of cases in which this happens.

tl;dr, two cores seem to be soft locked in native_flush_tlb_others().

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-09:

#150

Created attachment 260579
panic-dmesg.log

Here is a soft lock followed by watchdog timer panic.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-09:

#151

Created attachment 260581
panic2-dmesg.log

Last one for good measure.

smp_call_function_many() is the common denominator for all traces in this report.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-09:

#152

@Seth: How did you capture these logs? Are you doing serial console logging to another computer? Is enabling the serial console enough, or is something else needed too? I would also like to capture my idle freeze logs.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-09:

#153

@Panagiotis, yes, I attached a serial port header to my motherboard and am using a DB9 null modem cable and minicom to get the dmesg. Added console=ttyS0,38400n8 to the kernel boot line.

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-09:

#154

I was able to capture logs using netconsole, which may be easier for some people.
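
For anyone who wants to try the netconsole route, the receiving side can be as small as a UDP listener, since netconsole ships kernel messages as UDP datagrams. The sketch below is only an illustration: the function names are hypothetical, and the default port 6666 is assumed to be the target port configured on the sending machine.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; 6666 is assumed to match the sender's target port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def receive_netconsole(sock, max_messages):
    """Read up to max_messages datagrams from sock and return them as text lines."""
    lines = []
    for _ in range(max_messages):
        data, _addr = sock.recvfrom(65535)
        # Kernel messages are normally plain text; replace anything undecodable.
        lines.append(data.decode("utf-8", errors="replace").rstrip("\n"))
    return lines
```

In practice one would loop forever and append each line to a file, so the last messages before a freeze survive on the receiving machine.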

In Linux Kernel Bug Tracker #196683, derrick.aguren (derrick.aguren-linux-kernel-bugs) wrote on 2017-11-10:

#155

Disabling C6 with zenstates.py has worked so far for me (6 hours without a freeze vs a freeze every one to two hours).

Ryzen 1800x (I haven't checked what week)
Ubuntu 16.04 LTS 4.11.0-kfd-compute-rocm-rel-1.6-180
Gigabyte AX370 - Gaming 5 BIOS version F8 (default params)
MSI Radeon RX 580
amdgpu-pro version 17.40-492261

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-11:

#156

Doing some code investigation

http://elixir.free-electrons.com/linux/v4.13.12/source/kernel/smp.c#L401

The two RIP addresses in my 3 dmesg logs go here

smp_call_function_many
csd_lock_wait
smp_cond_load_acquire
cpu_relax (called in loop from smp_cond_load_acquire)
rep_nop
asm volatile("rep; nop" ::: "memory") <-- smp_call_function_many+0x248

smp_call_function_many
csd_lock_wait
smp_cond_load_acquire
READ_ONCE(x) (called in loop from smp_cond_load_acquire)
__READ_ONCE(x, 1)
__read_once_size
__READ_ONCE_SIZE <-- smp_call_function_many+0x24a

Both are within the tight loop in smp_cond_load_acquire waiting on the per-cpu csd locks

Basically, smp_call_function_many() executes a function on each cpu via IPI.
When wait=true, it runs synchronously, with the cpu that runs smp_call_function_many() waiting for each of the other cpus to report they have run the function as indicated by releasing their per-cpu csd lock.

The CSD_FLAG_SYNCHRONOUS flag determines the order of the func and unlock in flush_smp_call_function_queue()

http://elixir.free-electrons.com/linux/v4.13.11/source/kernel/smp.c#L242

This leads me to believe the issue here deals with IPIs being dropped and thus the CPU calling smp_call_function_many() deadlocks waiting on a cpu that has dropped the IPI and therefore will not unlock the csd lock.
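
The hypothesis above can be illustrated with a toy model in Python. This is not kernel code: the names are invented, the per-CPU csd lock is reduced to a boolean, and the spin is capped at max_spins so the example terminates, whereas the real wait loop has no such bound (which is why the watchdog reports a soft lockup).

```python
def call_function_many(targets, deliver_ipi, max_spins=1000):
    """Toy model of a synchronous cross-CPU call.

    deliver_ipi(cpu) returns False to simulate a dropped IPI.
    Returns the set of CPUs whose lock was never released.
    """
    # Stand-in for taking each target's per-CPU csd lock before sending.
    locked = {cpu: True for cpu in targets}
    # Only CPUs whose IPI actually arrives will ever run the function.
    pending = [cpu for cpu in targets if deliver_ipi(cpu)]

    # Stand-in for the caller spinning until every lock is released.
    for _ in range(max_spins):
        if pending:
            cpu = pending.pop()
            locked[cpu] = False  # target ran the function and unlocked
        if not any(locked.values()):
            return set()
    return {cpu for cpu, held in locked.items() if held}
```

With every IPI delivered the call returns an empty set; if one CPU's IPI is dropped, that CPU's lock stays held and the caller would spin indefinitely in the unbounded version.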

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-11:

#157

I found this commit that went into 4.12-rc2

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3fc5b3b6a80b2e08a0fec0056208c5dff757e547

While the reasoning in the commit message makes sense, it does modify the exact code area where this is happening. It would be interesting to revert this and see if the problem goes away. There could be some edge case where the enqueued csd gets dropped.

In Linux Kernel Bug Tracker #196683, jon780 (jon780-linux-kernel-bugs) wrote on 2017-11-11:

#158

This will be my last post unless the behavior changes, but I've gone from 1-2 lock ups per day to an uptime of: 7 days, 11:57 so I'm considering this resolved for me.

By adding the following boot option: rcu_nocbs=0-15

And disabling ASLR. I don't know which one solved the problem, but I'd start with the boot option then try disabling ASLR if you still have issues.

In Linux Kernel Bug Tracker #196683, b-o-s-s (b-o-s-s-linux-kernel-bugs) wrote on 2017-11-11:

#159

(In reply to Seth Jennings from comment #72)
> I found this commit that went into 4.12-rc2
>
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> ?id=3fc5b3b6a80b2e08a0fec0056208c5dff757e547

Great find, thanks!
I'm going to test 4.13.12 with this reverted. C6 is enabled in UEFI, no special boot params, just booted. Now, let's wait... :D

In Linux Kernel Bug Tracker #196683, thunderbird2k (thunderbird2k-linux-kernel-bugs) wrote on 2017-11-11:

#160

Created attachment 260619
Kernel softlockup Fedora 27 serial log

Hi,

I'm seeing the same issue as others reported here. If my memory serves me well I started seeing it on some 4.13.x kernel on fedora 26 and have had it ever since even now on fedora 27.

The issue usually happens every few hours and my system is fairly idle, just a few gnome-terminals and a browser open, no compilation or heavy 3d content.

I'm about to try the workarounds suggested on this ticket, but I want to capture a few more logs just in case.

Thanks,
Roderick

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-12:

#161

@Seth: I tried with the commit you mentioned reverted and got idle freeze again. If I remember well I have had idle freeze with Debian 9.0 4.9 kernel as well.

I don't know if it helps, but running without gnome/X, I didn't have idle freeze. System was running only some network services (ssh, nfs, samba etc).

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-12:

#162

Yeah, I really want to believe it was that commit but while I haven't tried reverting it, I'm pretty sure I started with a kernel older than 4.12.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-13:

#163

@Panagiotis yes, I also built a kernel with a revert for that commit and it still locked up so... not that.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-14:

#164

@James, I am disabling C6 now (w/o the kernel options). See how this goes. If that alone can fix it, that is a pretty strong indicator in my mind that this is a hardware issue :-/

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-14:

#165

Derrick, evidently you work for AMD, is there anybody you can speak to about this?

Franck Charras (franckc) wrote on 2017-11-14:

#47

Almost 1 month of uptime, it looks solved.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-14:

#166

I'm stable with C6 disabled (both package and core) and no other modifications.

In my mind, this proves that this is a hardware issue. Gotta find some way to get AMD's attention on this...

In Linux Kernel Bug Tracker #196683, tom (tom-linux-kernel-bugs) wrote on 2017-11-14:

#167

We have a machine (with Ryzen Threadripper 1950X) where this is happening even with rcu_nocbs=0-31 and I'd love to test the disabling C6 hypothesis except that as far as I can tell there's no such option in the BIOS (ASUS PRIME X399-A motherboard).

In fact it looks to me like there is only C1 and C2 and hence no C6 to disable - that's going both by powertop and looking at /sys/devices/system/cpu/cpu0/cpuidle/state*/name to see what states exist.
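
The sysfs check described above can be scripted. The following Python sketch lists the idle-state names the kernel exposes for one CPU; the function names and the sysfs_root parameter are illustrative additions (the latter exists only to make the code easy to try against a test directory). Note that these are the states as the kernel's cpuidle driver names them, which may not map one-to-one onto the hardware's C6.

```python
import glob
import os


def state_index(path):
    """Extract N from .../stateN/name so that state10 sorts after state2."""
    state_dir = os.path.basename(os.path.dirname(path))
    return int(state_dir[len("state"):])


def cpuidle_state_names(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return the idle-state names exposed for one CPU, in state order."""
    pattern = os.path.join(sysfs_root, f"cpu{cpu}", "cpuidle", "state*", "name")
    names = []
    for path in sorted(glob.glob(pattern), key=state_index):
        with open(path) as f:
            names.append(f.read().strip())
    return names
```

Calling cpuidle_state_names() with no arguments reads the same files as the state*/name glob mentioned above for cpu0.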

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-14:

#168

@Seth I crashed with C states disabled.

I haven't crashed in 20 days with C states ENABLED and RCU kernel and boot params.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-14:

#169

@Tom, did you compile your kernel with the RCU option/s set?

In Linux Kernel Bug Tracker #196683, tom (tom-linux-kernel-bugs) wrote on 2017-11-14:

#170

No this is a stock Fedora kernel, but my understanding was that rcu_nocbs=0-31 was equivalent to recompiling with the RCU configuration change?

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-14:

#171

@Tom, unfortunately not. You'll need to verify CONFIG_RCU_NOCB_CPU=y and, if possible, CONFIG_RCU_NOCB_CPU_ALL=y. The newer kernels don't have CONFIG_RCU_NOCB_CPU_ALL, so the boot param you reference does the same.

Here is an article I contributed to that discusses this. While Ubuntu focused it can still guide you.

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Like I stated before, with these options set I'm at 20 days crash free.

In Linux Kernel Bug Tracker #196683, tom (tom-linux-kernel-bugs) wrote on 2017-11-14:

#172

Turns out it is actually set:

rover [~] % fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-14:

#173

@Tom that is correct. You do NOT have to recompile your kernel to use the rcu_nocbs option. The fedora kernel already has CONFIG_RCU_NOCB_CPU=y.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-14:

#174

@Tom, then life is easier for you! ;-)

Perhaps consider disabling ASLR? http://blog.programster.org/how-to-disable-aslr

Maybe up the SOC voltage a tad?

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-14:

#175

I found this project a while ago that claims to set C states for ryzen chips on some motherboards.

https://github.com/r4m0n/ZenStates-Linux

In Linux Kernel Bug Tracker #196683, tom (tom-linux-kernel-bugs) wrote on 2017-11-14:

#176

So that does show it as enabled at code level:

rover [~/ZenStates-Linux] % sudo ./zenstates.py -l
P0 - Enabled - FID = 88 - DID = 8 - VID = 44 - Ratio = 34.00 - vCore = 1.12500
P1 - Enabled - FID = 8C - DID = A - VID = 5A - Ratio = 28.00 - vCore = 0.98750
P2 - Enabled - FID = 84 - DID = C - VID = 6A - Ratio = 22.00 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Enabled

I've turned it off now so we'll see what happens with that...
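
Since several people in this thread compare `zenstates.py -l` listings, a small parser can make the C6 status easy to check from a script. This Python sketch assumes the output format is exactly as pasted above ("C6 State - Package - Disabled" and so on); the function name is hypothetical and not part of ZenStates-Linux.

```python
def parse_c6_status(listing):
    """Return {"Package": bool, "Core": bool} from `zenstates.py -l` text.

    True means the listing reports that C6 state as Enabled.
    """
    status = {}
    for line in listing.splitlines():
        fields = [field.strip() for field in line.split(" - ")]
        # C6 lines have exactly three fields; P-state lines have two or more than three.
        if len(fields) == 3 and fields[0] == "C6 State":
            status[fields[1]] = fields[2] == "Enabled"
    return status
```

For the listing above this yields Package disabled and Core enabled, which matches the observation that C6 was still active at core level.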

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-14:

#177

@Tom, Good luck!

As dumb as it sounds, at one point I had a video running on loop on my server to keep it from going too idle. If your C state attempt is fruitless then maybe that would be a bandaid until a resolution for all this is found.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-14:

#178

@Tom the BIOS option on Gigabyte boards that disabled C6 is called "Global C-state Control". I think ASUS calls it the same thing.

@Eric interesting. IIRC I hung with just the rcu_nocbs kernel options set and C6 enabled, similar to Tom's experience.

All of this does not change the fact that one should not have to do hacks to prevent the hardware from going "too idle".

In Linux Kernel Bug Tracker #196683, derrick.aguren (derrick.aguren-linux-kernel-bugs) wrote on 2017-11-14:

#179

(In reply to James Le Cuirot from comment #80)
> Derrick, evidently you work for AMD, is there anybody you can speak to about
> this?

Hi James. Unfortunately this sort of thing is outside of my wheelhouse. Let's let it progress through normal channels.

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-14:

#180

@Derrick these "normal channels" of which you speak... please, do tell! :D What are they?

How do users report issues that will likely require an AGESA update to fix?

In Linux Kernel Bug Tracker #196683, derrick.aguren (derrick.aguren-linux-kernel-bugs) wrote on 2017-11-14:

#181

@Seth I would start here, which mentions this site, but also other paths: https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html

In Linux Kernel Bug Tracker #196683, lucio (lucio-linux-kernel-bugs) wrote on 2017-11-19:

#182

(In reply to Derrick Aguren from comment #96)
> @Seth I would start here, which mentions this site, but also other paths:
> https://www.kernel.org/doc/html/v4.10/admin-guide/reporting-bugs.html

Maybe there's a shorter path, maybe Bridgman@Phoronix knows how to get in touch with the right AMD person. I've just asked him to have a look just in case he knows:

https://www.phoronix.com/forums/forum/phoronix/latest-phoronix-articles/990202-linux-4-15-is-a-huge-update-for-both-amd-cpu-radeon-gpu-owners?p=990258#post990258

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-19:

#183

I already tried to contact Bridgman through a Phoronix PM but his inbox was full so I figured that he's probably pestered about random AMD issues enough already.

I then sent a message to AMD Tech Support. I haven't heard anything back yet.

In Linux Kernel Bug Tracker #196683, lorenz.bona (lorenz.bona-linux-kernel-bugs) wrote on 2017-11-20:

#184

Hi guys.
I've been running a Ryzen build since late May/early June from Linus git and I never faced this lockup.
I've just checked and in my configuration that option is disabled.

In attachment you can find my config file.

May I help you testing something?

R5-1600 with a Gigabyte Gaming 3.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-20:

#185

@Lorenzo: Do you have C6 states enabled? You can check with the zenstates.py script mentioned earlier in this thread.

In Linux Kernel Bug Tracker #196683, lorenz.bona (lorenz.bona-linux-kernel-bugs) wrote on 2017-11-20:

#186

(In reply to Panagiotis Malakoudis from comment #100)
> @Lorenzo: Do you have C6 states enabled? You can check with the zenstates.py
> script mentioned earlier in this thread.

Yes, C6 enabled.

./zenstates.py -l
P0 - Enabled - FID = 94 - DID = 8 - VID = 32 - Ratio = 37.00 - vCore = 1.23750
P1 - Enabled - FID = B9 - DID = A - VID = 32 - Ratio = 37.00 - vCore = 1.23750
P2 - Enabled - FID = 7C - DID = 10 - VID = 68 - Ratio = 15.50 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

In Linux Kernel Bug Tracker #196683, lorenz.bona (lorenz.bona-linux-kernel-bugs) wrote on 2017-11-20:

#187

Created attachment 260745
ryzen kernel config

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2017-11-20:

#188

I've been facing those system hangs, too, until I switched on "daily computing" optimization in bios (ASUS PRIME X370-PRO BIOS 0902 09/08/2017) running 4.13.x.

Switching to "daily computing" increases the maximum CPU speed to 3600 MHz (instead of 3400) on a Ryzen 7 1700X. Maybe it changes some more things I don't know of. But C states are definitely enabled:

# zenstates.py -l
P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P1 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

In Linux Kernel Bug Tracker #196683, johannes.hirte (johannes.hirte-linux-kernel-bugs) wrote on 2017-11-20:

#189

I don't think Bridgman can help here. You should ask Borislav Petkov <email address hidden> with cc-ing <email address hidden>. He doesn't work for AMD anymore, but he is still involved with kernel development. So if he can't say what's going on, he can point to the right developer.

In Linux Kernel Bug Tracker #196683, sfearghail (sfearghail-linux-kernel-bugs) wrote on 2017-11-21:

#190

Just to add more, as I've been following along ever since this first happened to me about 6 days ago. To preface, I am running Debian 9 with the 4.13 kernel from backports. I went several days after upgrading from 4.12 with a very idle Ryzen 1700 and didn't have any issues. I did have 2 qemu-kvm VMs running, and they are also mostly idle. I encountered the soft lockup CPU bug out of the blue on Nov 15th while not doing anything interactive with the system.

I'm now running 6 VM's on this system as it was a replacement for an old Intel system. I simply cannot have this thing crashing on me unattended, especially while I am remote.

Here are the measures I took, and frankly I don't really know if it's going to prevent this as the system simply hasn't been up long enough.

Board: Gigabyte GA-AB350-GAMING 3
Bios version: F7

Disabled Global C-State Control in bios. The manual also says there should be a C6 Mode in bios, however it simply doesn't exist.

Disabled AMD Cool&Quiet in bios.

I also disable C6 via the zenstates.py script at boot from systemd, as I wasn't certain what the difference between Package and Core is.

# zenstates.py -l
P0 - Enabled - FID = 78 - DID = 8 - VID = 3A - Ratio = 30.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Disabled

Question for clarification. Does CONFIG_RCU_NOCB_CPU need to be set to yes in kernel to be able to use the boot parameter? I get the impression that the boot parameter doesn't actually do anything unless kernel option is configured.

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-21:

#191

(In reply to Scott Farrell from comment #105)
> Disabled Global C-State Control in bios. The manual also says there should
> be a C6 Mode in bios, however it simply doesn't exist.
These are the same thing.

> Disabled AMD Cool&Quiet in bios.
You don't need to disable this. I would leave it on.

> Question for clarification. Does CONFIG_RCU_NOCB_CPU need to be set to yes
> in kernel to be able to use the boot parameter? I get the impression that
> the boot parameter doesn't actually do anything unless kernel option is
> configured.
Yes, the boot parameter alone won't do anything. You only need to do this or disable C6, not both. At this point, the overall consensus seems to be that the RCU workaround is better as it still allows some power saving and doesn't prevent boost.
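
To make the dependency spelled out above concrete, here is a minimal Python sketch that reports the RCU workaround as active only when both conditions hold: CONFIG_RCU_NOCB_CPU=y in the kernel config and an rcu_nocbs= entry on the command line. The function names are invented for illustration; the inputs are plain text, for example the contents of /boot/config-$(uname -r) and /proc/cmdline as used elsewhere in this thread.

```python
def config_enabled(config_text, option):
    """True if the kernel config text contains the line "<option>=y"."""
    wanted = f"{option}=y"
    return any(line.strip() == wanted for line in config_text.splitlines())


def rcu_workaround_active(config_text, cmdline):
    """True only if callback offloading is built in AND rcu_nocbs= is set."""
    has_param = any(tok.startswith("rcu_nocbs=") for tok in cmdline.split())
    return config_enabled(config_text, "CONFIG_RCU_NOCB_CPU") and has_param
```

A kernel whose config says "# CONFIG_RCU_NOCB_CPU is not set" fails the first check, so adding rcu_nocbs= to the boot line alone would be reported as inactive.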

In Linux Kernel Bug Tracker #196683, tom (tom-linux-kernel-bugs) wrote on 2017-11-23:

#192

Well our machine has now gone about nine days without any soft lockups with C6 disabled - previously (even with rcu_nocbs=0-31) it was lucky to go more than a day or two.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-11-24:

#193

I'm at 30 days with C states ENABLED, Cool & Quiet ENABLED, boot param set, and kernel param set. I consider my server stable now.

Now if only the out of the box linux kernel would be this stable.

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-24:

#194

I have had a response from AMD Tech Support. They told me to disable C6 even though I'd already told them that this has been found to work around the issue. Not very helpful but they also added that the issue "should be addressed in an up and coming BIOS release." This would imply that they already know about the issue.

I didn't say so but I am slightly dubious about the suggestion of a BIOS fix as they said the same thing about the segfault issue and that never materialised.

However I did implore them to be more transparent with their customers. Bad press is bad but how they handle the situation is what really counts. While they never fully admitted that the segfault issue was a large scale hardware fault in the face of overwhelming evidence, they did at least honour all RMA requests with little resistance. I get the impression that most customers were grateful for this. I certainly was.

They then told me my feedback had been passed onto the Ryzen team.

In Linux Kernel Bug Tracker #196683, tom (tom-linux-kernel-bugs) wrote on 2017-11-24:

#195

Don't forget that a BIOS update can include new CPU microcode.

In Linux Kernel Bug Tracker #196683, master.homer (master.homer-linux-kernel-bugs) wrote on 2017-11-24:

#196

(In reply to James Le Cuirot from comment #109)
> I have had a response from AMD Tech Support. They told me to disable C6 even
> though I'd already told them that this has been found to work around the
> issue. Not very helpful but they also added that the issue "should be
> addressed in an up and coming BIOS release." This would imply that they
> already know about the issue.
>
> I didn't say so but I am slightly dubious about the suggestion of a BIOS fix
> as they said the same thing about the segfault issue and that never
> materialised.
>
> However I did implore them to be more transparent with their customers. Bad
> press is bad but how they handle the situation is what really counts. While
> they never fully admitted that the segfault issue was a large scale hardware
> fault in the face of overwhelming evidence, they did at least honour all RMA
> requests with little resistance. I get the impression that most customers
> were grateful for this. I certainly was.
>
> They then told me my feedback had been passed onto the Ryzen team.

Good to know that they are working on it!

I've had this problem myself, with multiple random lockups per day when it was mostly idle. But like others a custom compiled kernel (latest 4.15) with the CONFIG_RCU_NOCB_CPU option set and the rcu kernel command line seems to have fixed it.

In Linux Kernel Bug Tracker #196683, b-o-s-s (b-o-s-s-linux-kernel-bugs) wrote on 2017-11-24:

#197

As a test, I've been running like below with an uptime of 12 days without any freeze or strange syslog entries. Note that this is with the patch Seth mentioned far above _reverted_, so at least for me this seems to have fixed the issue. I've read that for some it hadn't, strange. I have to countercheck another few weeks.

# uname -a
Linux donald 4.13.12-x64 #14 SMP Sun Nov 12 17:23:57 CET 2017 x86_64 AMD Ryzen 7 1700 Eight-Core Processor AuthenticAMD GNU/Linux
# zcat /proc/config.gz |grep RCU_NO
CONFIG_RCU_NOCB_CPU is not set
# cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.13.12 root=/dev/nvme0n1p3 ro
# ./zenstates.py -l
P0 - Enabled - FID = 8C - DID = 8 - VID = 3A - Ratio = 35.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

Some relevant dmidecode output:

Base Board Information
Manufacturer: ASRock
Product Name: AB350 Gaming-ITX/ac

BIOS Information
Version: P3.10
Release Date: 08/28/2017

Memory Device (x2)
        Data Width: 64 bits
        Size: 8192 MB
        Type: DDR4
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 3200 MT/s
        Part Number: G.Skill F4-3200C14-8GFX
        Rank: 1
        Configured Clock Speed: 1600 MT/s

I'm running the RAM at its specified XMP profile, 3200-14-14-14-34 at 1.35V, which is quite good actually. UEFI settings are mostly default; only P-state overclocking is used and two memory-related options are turned off: GearDown mode and Bank Swapping. Voltages are default except for a CPU offset of -100 mV. All turbo and power-saving features are working as intended using the ondemand governor.

So... are you guys sure your freezes are not a UEFI setting and/or kernel config and/or hardware (most probably RAM) issue? RAM especially is really critical on this platform, and maybe Memtest cannot detect the problem...?
(But as noted, I have to countercheck at least 2 weeks.)
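
For readers wondering how the Ratio and vCore columns in the zenstates.py listing relate to the raw FID/DID/VID fields, here is a minimal sketch. The two formulas (ratio = 2 * FID / DID, vCore = 1.55 V - 0.00625 V * VID) are an assumption inferred from the values pasted in this comment; they reproduce all three enabled P-states, but they have not been checked against AMD documentation here. The function name `decode_pstate` is hypothetical.

```python
def decode_pstate(fid_hex, did_hex, vid_hex):
    """Return (ratio, vcore) for the raw hexadecimal FID/DID/VID fields.

    Assumed formulas, inferred from the listing in this comment:
      ratio = 2 * FID / DID
      vcore = 1.55 V - 0.00625 V * VID
    """
    fid = int(fid_hex, 16)
    did = int(did_hex, 16)
    vid = int(vid_hex, 16)
    ratio = 2.0 * fid / did
    vcore = 1.55 - 0.00625 * vid
    return ratio, vcore


# The three enabled P-states from the listing in this comment.
for name, fid, did, vid in [("P0", "8C", "8", "3A"),
                            ("P1", "87", "A", "50"),
                            ("P2", "7C", "10", "6C")]:
    ratio, vcore = decode_pstate(fid, did, vid)
    print(f"{name}: Ratio = {ratio:.2f} - vCore = {vcore:.5f}")
```

Under these assumptions, P2 (VID = 6C) works out to 0.875 V, the lowest voltage this CPU is configured to idle at.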

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-24:

#198

(In reply to Tyler from comment #112)
> As a test, I've been running like below with an uptime of 12 days without
> any freeze or strange syslog entries. Note that this is with the patch Seth
> mentioned far above _reverted_, so at least for me this seems to have fixed
> the issue.

I'll have to try it myself then.

> So... are you guys sure your freezes are no UEFI setting and/or kernel
> config and/or hardware, most probably RAM, issue? Especially RAM is really
> critical on this platform and maybe Memtest cannot detect it...?

I am booting in legacy mode so I doubt it's UEFI related. As someone who actually does have faulty RAM but has worked around it (for now), I can say that while it might have caused the odd freeze earlier, it was more apparent in other ways like random disk corruption (particularly when using git) and the browser constantly crashing until I rebooted.

Revision history for this message

In Linux Kernel Bug Tracker #196683, master.homer (master.homer-linux-kernel-bugs) wrote on 2017-11-27:

#199

(In reply to James Le Cuirot from comment #113)
> (In reply to Tyler from comment #112)
> > As a test, I've been running like below with an uptime of 12 days without
> > any freeze or strange syslog entries. Note that this is with the patch Seth
> > mentioned far above _reverted_, so at least for me this seems to have fixed
> > the issue.
>
> I'll have to try it myself then.
I tried this myself over the weekend: I compiled a kernel with CONFIG_RCU_NOCB_CPU set to n and that commit reverted. Unfortunately, after a few hours of being left idle it had locked up. After that I also ran a test with that commit applied again and CONFIG_RCU_NOCB_CPU set to y (with the kernel command line also set), and it ran well for 12+ hours. So it definitely feels like that config option fixes it.

Revision history for this message

In Linux Kernel Bug Tracker #196683, tomm (tomm-linux-kernel-bugs) wrote on 2017-11-29:

#200

We have been in "indirect" contact with an engineer at AMD who offers this explanation:

> We have been investigating the issue where systems are reportedly locking up
> when idling or running small workloads.
>
> This issue is related to the power supply. Most PC power supplies (PSUs) are
> designed to handle a wide range of power consumption from your PC components,
> but not all PSUs are created equal.
>
> Because of this, there are some rare conditions where the power draw of an
> efficient PC does not meet the minimum power consumption requirements for one
> or more circuits inside some PSUs.
>
> This scenario (called "minimal loading supply") can cause such PSUs to output
> poor quality power, or shut off entirely.
>
> To prevent this issue from happening, it is important to ensure that the
> power supply supports 0A minimum load on the +12V circuit. These PSUs became
> commonplace starting in 2013 for the Intel "Haswell" platform.
>
> This specification can be found printed on the sticker affixed to most PSUs,
> or it may be available on the manufacturer’s website.
>
> However, AMD understands that not everyone is in a position to replace their
> PSU with a contemporary 0A-rated unit. To help with that, AMD is also
> developing a firmware workaround for these power supplies, and will make it
> available through motherboard partners as a BIOS update in the future.

You guys buy this?

Revision history for this message

In Linux Kernel Bug Tracker #196683, spartacus06 (spartacus06-linux-kernel-bugs) wrote on 2017-11-29:

#201

Nope, my PSU is this:
https://www.newegg.com/Product/Product.aspx?Item=N82E16817139146

Advertised "Haswell Ready" (0A min load). I _do_ believe this is a problem with power delivery, just not from the PSU. I think it is the on-chip power management.

Sounds like they might try to spin it. I don't really care though as long as the firmware update fixes the issue :)

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-11-29:

#202

It sounds plausible but my Seasonic G-550, bought last year, was also declared to be Haswell ready.

https://www.pcper.com/news/General-Tech/Seasonic-Releases-Information-Its-Haswell-Ready-Power-Supplies

Maybe Ryzens use even less power in this state!?

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-11-29:

#203

I don't "buy" this explanation. My PSU (Coolermaster G650M) specifically states: Haswell C6/C7 support & zero load operation

@Klaus Mueller: What this option changes is that it disables XFR and turbo (single-core is 3600, the same as all-core) and sets the P1 state to 3600 at 1.35V, as can be seen in the zenstates.py output. This configuration makes the CPU spend more time at 1.35V and less time at 0.9V or below. It does seem to make the problem less frequent.

@Seth Jennings: I agree with you, I believe there is an issue with the on-chip power management. Let's hope it can be fixed in firmware.

Revision history for this message

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2017-11-30:

#204

>> This issue is related to the power supply.
> You guys buy this?

No. This is hogwash.

EVGA SuperNOVA 750 G3 is a rather new PSU. Corsair RMi Series RM650i is also fairly new. Both were released after Haswell came out in 2013. These are the ones I (ab)use for my Ryzen 1600Xs. AMD's "engineer" can forget about selling me some story about how both of these PSUs are somehow flawed in a way that makes these CPUs hang shortly after boot if I leave the system idle (when not using the right kernel options/parameter).

If everyone with this problem were re-using an ancient PSU from an earlier build, there would have been a chance that this would hold water. That's just not the case, and that makes it obvious that AMD's story is hogwash.

It is still the case that the kernel configuration CONFIG_RCU_NOCB_CPU=y and the kernel option rcu_nocbs=0-11 fix this issue completely on both of my Ryzen systems. This tells me that a kernel patch can fix this (or permanently mask it), because it's plainly clear that software does fix it.

Revision history for this message

In Linux Kernel Bug Tracker #196683, master.homer (master.homer-linux-kernel-bugs) wrote on 2017-11-30:

#205

Yeah, I don't buy it either. My PSU is a Fractal Design R3 1000W and is rated for a minimum current of 0A on the 12V output.
Plus, Fractal Design's support page states that:
"All power supplies that use the DC-DC method are able to output their full 3.3V/5V ratings even with no load on the 12V rail, so Tesla R2 and Newton R3 power supplies will easily support the new sleep states introduced with Intel’s Haswell platform."

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2017-12-01:

#206

(In reply to Panagiotis Malakoudis from comment #118)
> @Klaus Mueller: What this option changes is that it disables XFR and turbo
> (single core is 3600 same as all core) and it sets P1 state to 3600, 1.35V
> as can be seen in zenstates.py output. This configuration makes the CPU to
> spend longer time in 1.35V and less time to 0.9V and less. It does seem to
> make the problem less frequent.

Well, happily I couldn't measure any increased power consumption. It's unchanged compared to the standard configuration.

I've been running this config for about 3 weeks now and haven't faced any problem so far, and there would have been plenty of chances for it to hang.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-01:

#207

(In reply to Klaus Mueller from comment #121)
>
> Well, happily I couldn't measure any increased power consumption. It's
> unchanged compared to standard configuration.
>
> I'm running this config now since about 3 weeks and I didn't face any
> problem so far - there would have been a lot of chances to hang.

I have tried it on my system, and although it didn't freeze in the first 30-60 minutes of idle like it usually does if I run it without rcu threads, it did freeze when left idle all night; I found it frozen in the morning. But even if it worked, I wouldn't like to use it, since such a setting limits potential single-core performance. The 1700X can go up to 3900 MHz for single-core performance.

Revision history for this message

In Linux Kernel Bug Tracker #196683, mitsarionas (mitsarionas-linux-kernel-bugs) wrote on 2017-12-01:

#208

Just to chime in with my case... My Ryzen 1600 (w/ Gigabyte AB350M Gaming 3) also used to lock up on stock Ubuntu 17.10 (4.13), installed on an HDD, with C6 enabled in the BIOS. The PSU is a 750W one from 2011-2012; I doubt it's 0A min load.
As in everyone's (?) case here, this was fixed with the CONFIG_RCU_NOCB_CPU workaround.

A few days ago though, I did a clean install on an SSD and thought I'd check again. It didn't lock up during the 3-4 days that I left it on. Could one say that it has some relation to the SSD not using 12V? (It seems opposite to what one would expect? Perhaps it has something to do with the 12V power draw alternating between zero and non-zero? Are the SATA 12V rails even related to this?)

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2017-12-01:

#209

(In reply to Panagiotis Malakoudis from comment #122)
> I have tried it in my system and although it didn't freeze in the first
> 30-60 minutes of idle like it usually does if I run it without rcu threads,
> it did freeze when idle all night - found it frozen in the morning.

Thanks for testing. It obviously isn't a workaround suitable for everybody.

> But even
> if it worked, I wouldn't like to use it since such setting limits single
> core potential performance. 1700X can go up to 3900 MHz for single core
> performance.

It depends on the characteristics of the workload.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-01:

#210

Another me too.

Originally I had a Ryzen 1700 which was suffering from both the segfault and the idle lockup issues. I sent it back via RMA, got a replacement which does not segfault anymore but which still exhibits the idle lockup.

For me, messing around with RCU kernel settings/boot params does not fix the idle lockup behaviour. Disabling C6/Cool'n'Quiet at the BIOS level does not fix it either (ASRock X370 Gaming K4 motherboard). What DOES fix the problem is disabling C6/P states via the zenstates.py script: with it, the system has been rock solid for weeks.

At this point I am waiting for a BIOS update before possibly going through RMA again.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-12-01:

#211

(In reply to Francesco Biscani from comment #125)
> At this point I am waiting for a BIOS update before possibly going through
> RMA again.
I wouldn't go through RMA again as there's no indicator that this has been fixed in hardware.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-01:

#212

(In reply to James Le Cuirot from comment #126)
> (In reply to Francesco Biscani from comment #125)
> > At this point I am waiting for a BIOS update before possibly going through
> > RMA again.
> I wouldn't go through RMA again as there's no indicator that this has been
> fixed in hardware.

Yeah you are probably right. Let's hope they fix this in the next AGESA update.

Running with C6 disabled is not that much of a problem: thermally I did not notice any difference, and while the power savings would be nice, it's not a deal breaker for me personally.

Still, according to some quick tests I ran, it seems like there's a ~10% performance decrease in single-threaded workloads due to the fact that, without C6, turbo mode is disabled. Would be nice to get that back.

Revision history for this message

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2017-12-02:

#213

I've been interested in getting a new AMD CPU, but being new to Linux, this bug has scared me off so far. Compiling my own kernel is still too complex for me. So I created an account to follow this bug here and see if it gets fixed, maybe with the new Ryzens coming next February, so rumours have it.

However, I would like to ask if anyone here knows whether this bug has been happening on Threadrippers as well. AMD says the TRs are the sorted top 5% of Zeppelin dies. Therefore, if this bug is a general design issue, it should happen on those high-end workstation CPUs as well, but I could not find any complaints. If it does not happen on these TRs, then is it a manufacturing error?

Revision history for this message

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-02:

#214

(In reply to Jonathan from comment #128)
> I've been interested in getting a new AMD CPU, but being new to Linux this
> bug has scared me off sofar. Too complex for me to compile my own kernel
> yet. So I created an account to follow this bug here, and see if it gets
> fixed, maybe with the new Ryzen's coming next February, so rumours have it.
>
> However, I would like to ask if anyone here knows if this bug has been
> happening on Threadrippers as well. AMD says the TR's are the sorted top 5
> Zeppelin dies. Therefor if this bug is a general design issue, it should
> happen on those high end workstation CPU's as well, but I could not find any
> complaints. If it does not happen on these TR's, then it's a manufacturing
> error?

It's difficult to say, as AMD has been horribly tight-lipped. Most information on these issues comes from various online resources (forums, bugzilla, etc.), and it's often anecdotal and sometimes contradictory.

My understanding is that the segfault problem (the only one publicly admitted by AMD so far) was a manufacturing issue, which was solved somewhere around the introduction of the TR. See also here:

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

The other issue, the hang/reboot when idle, seems to be persisting also on newer processors, and, going by memory (again, anecdotal) I do remember at one point someone reporting this very problem on a TR.

The current speculation/wishful thinking is that the idle bug may be solved by a future AGESA/microcode update, which should be imminent:

https://www.vortez.net/news_story/its_looking_like_november_for_amds_agesa_1_7.html

Revision history for this message

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-12-03:

#215

(In reply to Francesco Biscani from comment #129)
> My understanding is that the segfault problem (the only one publicly
> admitted by AMD so far) was a manufacturing issue, which was solved
> somewhere around the introduction of the TR. See also here:
>
Ryzen CPUs manufactured before week 20 or so have the segfault problem under high-stress compilation, which average users are not doing. You get a new CPU from AMD after submitting an RMA.

>
> The other issue, the hang/reboot when idle, seems to be persisting also on
> newer processors, and, going by memory (again, anecdotal) I do remember at
> one point someone reporting this very problem on a TR.
>
> The current speculation/wishful thinking is that the idle bug may be solved
> by a future AGESA/microcode update, which should be imminent:

This bug was introduced by the Intel developers, who do not test their code with other hardware, and older Intel hardware is affected by this problem too. An easy workaround is found in this bug report, but for those who cannot follow the discussion, here it is again:
To prevent random kernel lockups, enable RCU_NOCB_CPU and boot the kernel with the rcu_nocbs=0-X command line parameter, where X is the CPU thread count minus 1.
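
As a small illustration of the "thread count minus 1" rule, the sketch below builds the boot parameter from the number of logical CPUs. The helper name `rcu_nocbs_param` is hypothetical, and the script only prints the string; adding it to the bootloader configuration is left to the reader. As reported elsewhere in this thread, the parameter appears to have no effect unless the kernel was also built with CONFIG_RCU_NOCB_CPU.

```python
import os


def rcu_nocbs_param(thread_count):
    """Build the rcu_nocbs= boot parameter covering CPUs 0..thread_count-1."""
    if thread_count < 1:
        raise ValueError("thread_count must be at least 1")
    return f"rcu_nocbs=0-{thread_count - 1}"


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs (threads); it may return None.
    threads = os.cpu_count() or 1
    print(rcu_nocbs_param(threads))
```

For the 12-thread Ryzen 5 1600/1600X this gives rcu_nocbs=0-11, and for the 16-thread Ryzen 7 parts rcu_nocbs=0-15, matching the values other commenters report using.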

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-03:

#216

(In reply to fin4478 from comment #130)
> Ryzen cpus manufactured before week 20 or so have the segfault problem with
> high stress compilation that average users are not doing. You get a new cpu
> Amd after submitting RMA.
>

This is not true; processors from week 28 also have the segfault problem.

> This bug is made by the intel developers who does not test their code with
> other hardware and older intel hardware is affected this problem too. Easy
> workaround is found in this bug report, put for those who can not follow
> discussions, here it is again:

This is not true either. Show which change Intel submitted to the kernel that made AMD CPUs freeze on idle. The previously mentioned commit from Seth (comment 72) has been reverted, and reverting it doesn't fix the issue. And there are people for whom only disabling C6 fixes the issue; rcuo threads don't fix it. For me, rcuo fixes the issue, although I would prefer to say it "hides" the issue. To me, as a Computer Engineer, this is a hardware issue, much like the segfault issue. It doesn't seem to occur under Windows, and it hides itself with rcuo threads enabled.

Also, there is at least one Threadripper user who has this problem and whom rcuo threads didn't help. Setting C6 to disabled helped him (comments 82, 91 and 107).

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-03:

#217

I made it 38 days uptime until system crash. I had RCU fixes applied, C6 enabled, cool and quiet enabled. RCU changes appear to help but not fix my issues.

I just turned off C6.

Frustrating. I want to rely on my server.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-03:

#218

Why doesn't this issue manifest in Windows?

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2017-12-03:

#219

(In reply to eric.c.morgan from comment #133)
> Why doesn't this issue manifest in Windows?

I had previously said half jokingly that Windows never gets that idle. I have since heard that AMD have said the same thing themselves.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-03:

#220

(In reply to James Le Cuirot from comment #134)
> (In reply to eric.c.morgan from comment #133)
> > Why doesn't this issue manifest in Windows?
>
> I had previously said half jokingly that Windows never gets that idle. I
> have since heard that AMD have said the same thing themselves.

That's funny ;-)

While it sucks overall for us, creating a whole new CPU must be maddening to validate. You'll never hit all corner cases the first try.

I'll probably pay for a Zen 2 or whatever is next.

Revision history for this message

Mike R (mi9e) wrote on 2017-12-04:

#48

@franckc did you have to rebuild your kernel? I tried the kernel options without that and still had failures. I've just rebuilt with CONFIG_RCU_NOCB_CPU to see whether rcu_nocbs=0-15 now resolves my lockups.

Revision history for this message

Franck Charras (franckc) wrote on 2017-12-04:

#49

Yes, the boot option alone has no effect; you must also compile the kernel with CONFIG_RCU_NOCB_CPU. You should be fine now (I have confirmed this on several different configs). Here's a tutorial: http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen
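
Since both halves of the workaround are needed, a quick check that the running kernel has each of them can save a wasted test run. The sketch below is only an illustration: the function name `nocb_workaround_state` is hypothetical, and reading the build config from /boot/config-<kernel release> is an assumption that holds on stock Ubuntu installs but not everywhere (some kernels expose /proc/config.gz instead).

```python
import platform


def nocb_workaround_state(config_text, cmdline):
    """Return (built_in, boot_param).

    built_in   -- the config text contains the line CONFIG_RCU_NOCB_CPU=y
    boot_param -- the kernel command line contains an rcu_nocbs= argument
    """
    built_in = any(line.strip() == "CONFIG_RCU_NOCB_CPU=y"
                   for line in config_text.splitlines())
    boot_param = any(arg.startswith("rcu_nocbs=") for arg in cmdline.split())
    return built_in, boot_param


if __name__ == "__main__":
    # Assumed location of the build config (Ubuntu convention).
    config_path = "/boot/config-" + platform.release()
    try:
        with open(config_path) as f:
            config_text = f.read()
        with open("/proc/cmdline") as f:
            cmdline = f.read()
    except OSError as exc:
        print("could not read kernel information:", exc)
    else:
        built_in, boot_param = nocb_workaround_state(config_text, cmdline)
        print("CONFIG_RCU_NOCB_CPU=y:", built_in)
        print("rcu_nocbs= on command line:", boot_param)
```

If the first line prints False, the stock kernel needs rebuilding as described above; if only the second is False, the boot parameter is missing or the machine has not been rebooted since it was added.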

Revision history for this message

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-12-04:

#221

(In reply to Panagiotis Malakoudis from comment #131)
> (In reply to fin4478 from comment #130)
> > Ryzen cpus manufactured before week 20 or so have the segfault problem with
> > high stress compilation that average users are not doing. You get a new cpu
> > Amd after submitting RMA.
> >
>
> This is not true, processors from week 28 have also segfault problem.
>

I did not check what phoronix wrote. You can have a new Ryzen cpu anyway if you are compiling Mesa and having the segfault problem.

> To me, as a Computer Engineer, this is a hardware issue, much
> like the segfault issue. It doesn't seem to occur under Windows, and it
> hides itself with rcuo threads enabled.

I am an MSc software engineer, and when you say the win virus hoover does not have random lockups, then it is a software problem made by wintel kernel developers. Intel develops the Linux core and tests only with their latest hardware.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-04:

#222

(In reply to fin4478 from comment #136)
> I did not check what phoronix wrote. You can have a new Ryzen cpu anyway if
> you are compiling Mesa and having the segfault problem.

When has phoronix become a respectable place for info? When they published the "performance marginality problem exclusive to certain workloads in Linux"? phoronix wrote whatever AMD told them in order to get some Threadripper and Epyc systems for review. I don't trust phoronix. They first escalated the segfault issue, then they bought the "marginality" excuse from AMD.

It has been reported in the original thread on the AMD forums (https://community.amd.com/thread/215773) that CPUs from week 28 bought from stores still exhibit segfault issues. CPUs from week 33 and later seem to have this fixed, but maybe still not 100%.

> I am Msc software engineer and when you say win virus hoover does not have
> random lockups, then it is a software problem made by wintel kernel
> developers. Intel develops Linux core and test only with their latest
> hardware.

This "Intel develops Linux core" is a statement of no value. Unless you present some specific commits from Intel that specifically affect Ryzen (AMD's Opteron for example is not affected by this idle freeze issue), you are just talking nonsense. We should better stay with facts. Fact is that even rcuo threads enabled don't fix the issue for everyone, some CPUs require C6 disabled completely. Fact is that some CPUs don't idle freeze even with rcuo threads disabled and C6 enabled. To me this is a clear indication about a hardware and not a software issue.

Revision history for this message

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-12-04:

#223

(In reply to Panagiotis Malakoudis from comment #137)

> have this fixed, but maybe still not 100%.

Less complex devices than Ryzen cpus do have hardware errors and you can have money back or a new product.
>
> > I am Msc software engineer and when you say win virus hoover does not have
> > random lockups, then it is a software problem made by wintel kernel
> > developers. Intel develops Linux core and test only with their latest
> > hardware.
>
> This "Intel develops Linux core" is a statement of no value. Unless you
> present some specific commits from Intel that specifically affect Ryzen

https://bugzilla.kernel.org/show_bug.cgi?id=197177

Revision history for this message

Mario Emmenlauer (emmenlau) wrote on 2017-12-04:

#50

I have compiled the current Ubuntu 17.10 kernel 4.13.0-17-generic with CONFIG_RCU_NOCB_CPU
and the boot option "rcu_nocbs=0-15", and just had a system freeze again. My system is a Ryzen 7 1700X
on an ASRock AB350 Pro4 mainboard with 32GB RAM @3200.
I do not use the kernel option "processor.max_cstate=1"; is it recommended for this problem?

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-04:

#224

(In reply to fin4478 from comment #138)

> https://bugzilla.kernel.org/show_bug.cgi?id=197177

And this proves that? This commit doesn't affect Ryzen, it affects some specific motherboard with some specific BIOS/ACPI implementation. It is just some false ACPI interpretation - and more probably it is what you have been told - an incorrect ACPI table compilation based on older tools.

Revision history for this message

In Linux Kernel Bug Tracker #196683, hi-angel (hi-angel-linux-kernel-bugs) wrote on 2017-12-04:

#225

(In reply to fin4478 from comment #138)
> (In reply to Panagiotis Malakoudis from comment #137)
> > This "Intel develops Linux core" is a statement of no value. Unless you
> > present some specific commits from Intel that specifically affect Ryzen
>
> https://bugzilla.kernel.org/show_bug.cgi?id=197177

Besides being irrelevant, you've been told your firmware is buggy, for that matter up to the point of not even working on Windows. You've also been told that inhibiting firmware errors is a work-in-progress; and I bet many users do want to know if their firmware is buggy — I do.

You've straight ignored everything, and started cussing the dev. Please stop being an asshole.

Revision history for this message

In Linux Kernel Bug Tracker #196683, fin4478 (fin4478-linux-kernel-bugs) wrote on 2017-12-04:

#226

(In reply to Panagiotis Malakoudis from comment #139)
> (In reply to fin4478 from comment #138)
>
> > https://bugzilla.kernel.org/show_bug.cgi?id=197177
>
> And this proves that? This commit doesn't affect Ryzen, it affects some
> specific motherboard with some specific BIOS/ACPI implementation. It is just
> some false ACPI interpretation - and more probably it is what you have been
> told - an incorrect ACPI table compilation based on older tools.

Asus sure knows better than intel what tools to use when creating a motherboard. Nobody buys intel motherboards. Intel cpus have had and will have hardware errors too, so let us stop here.

https://en.wikipedia.org/wiki/Pentium_FDIV_bug

Revision history for this message

Franck Charras (franckc) wrote on 2017-12-04:

#51

No, "processor.max_cstate=1" is not necessary. I'd also suggest double-checking that your motherboard BIOS is up to date. If the problem persists, I'd try a more recent kernel (I'm using 13.6), because IIRC the rcu options were a little bit different in earlier versions of 4.13.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-04:

#227

(In reply to fin4478 from comment #141)
>
> Asus sure knows better than intel what tools to use when creating a
> motherboard. Nobody buys intel motherboards. Intel cpus have had and will
> have hardware errors too, so let us stop here.
>
> https://en.wikipedia.org/wiki/Pentium_FDIV_bug

I don't understand what you are trying to say. Sure, Intel has bugs, including more recent ones than the FDIV bug: there was a hyperthreading bug in Skylake and Kaby Lake CPUs, which was fixed by a microcode update.

The problem discussed here is probably hardware related, and we hope it can be fixed with a future microcode (AGESA) update. Working around it in software (the rcuo threads fix) would be nice, but since that doesn't fix 100% of the cases, it should be investigated further. The Intel Skylake hyperthreading issue took around 18 months to be fixed in microcode, so I can wait for a fix from AMD. But we need confirmation that a. they have reproduced the issue and are looking at it, and b. it can be fixed with a microcode update.

Revision history for this message

In Linux Kernel Bug Tracker #196683, lucio (lucio-linux-kernel-bugs) wrote on 2017-12-04:

#228

(In reply to Panagiotis Malakoudis from comment #142)
> I don't understand what you are trying to say,

fin4478 (a.k.a. debianxfce@phoronix) is trolling as usual, please don't feed it.

Revision history for this message

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2017-12-04:

#229

Is there a reliable way to test and be 100% sure a Ryzen/TR is OK or has the problem? Like there was a script to test for the segfault issue IIRC?

Revision history for this message

Mike R (mi9e) wrote on 2017-12-04:

#52

Cool, thanks for the confirmation, Franck. I've rebuilt 4.13.0-17 with the instructions from https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel (probably similarly to Mario). I'm going to test with this first (since I don't really want to reboot again), but I might get things ready with 4.14.3 (which is the current latest).

Revision history for this message

Franck Charras (franckc) wrote on 2017-12-04:

#53

If you rely on the boot option "rcu_nocbs=0-15" you also need to reboot to enable it.

Revision history for this message

Mike R (mi9e) wrote on 2017-12-04:

#54

Yep I did :) thanks

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-04:

#230

(In reply to Jonathan from comment #144)
> Is there a reliable way to test and be 100% sure a Ryzen/TR is OK or has the
> problem? Like there was a script to test for the segfault issue IIRC?

Leaving your computer idle overnight for a couple of nights will either trigger the issue, and you will find your computer frozen, or it will not. The idle freeze usually happens after anywhere from 15-30 minutes up to some hours.

It should be noted here that there are two kinds of idle freeze. In the first one (discussed here) the computer locks up, but if you press the reset button it reboots fine. The second one is related to memory/CPU overclocking: the computer freezes completely (on overclock.net it is called "black screen freeze on idle") and pressing the reset button does nothing; you have to power off completely. This second one also happens under Windows. So, in order to test reliably, don't overclock the CPU or RAM, meaning that RAM should run at 2400 MHz at most.

In Linux Kernel Bug Tracker #196683, tstraus13 (tstraus13-linux-kernel-bugs) wrote on 2017-12-05:

#231

I just wanted to chime in as well. I have an AMD Ryzen 1600 on an ASRock X370 Taichi acting as an NFS file server. It is running Ubuntu 17.10 server with, I believe, kernel 4.13 or whatever the latest kernel is after updates. Recently I have been experiencing the exact same issue as you all. Within 24 hours of the system being idle or under low load, the system just freezes and I am unable to access it. It actually happened today as I am typing this: I am unable to remote into it, but I have it hooked up to a keyboard, mouse, and monitor. When I return home I want to see if there are errors of any kind on screen. I initially thought it may have been due to some hardware changes, but that just didn't make sense, and then I found this thread.

I am going to try disabling the C6 State and see if it helps at all and will report back. Hopefully this gets fixed with a BIOS update or something. This is a major inconvenience for server use with frequent idle loads. Thanks.

Tom Reynolds (tomreyn) wrote on 2017-12-05:

#55

The workaround discussed at http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen fixes the recurring (roughly once or twice in 6 hours of operation, during low or idle CPU load periods) system freezes for this Ryzen 7 1800X (early weeks), AMD microcode 0x8001129, ASRock X370 Taichi (Firmware P3.20, BIOS 5.12, Release Date: 09/08/2017) system running XUbuntu 16.04 LTS (with latest updates)

All the mainline kernels up to linux-image-4.13.10-041310-generic, and official Ubuntu kernels (default 16.04 kernel, hwe) all kept freezing - although I had only tested them without "rcu_nocbs=0-15". I did test hwe-edge (4.10.0.20.13), both with and without "rcu_nocbs=0-15". Both variants gave me freezes.

Joseph Salisbury (jsalisbury) on 2017-12-05

tags:

removed: kernel-da-key

Mike R (mi9e) wrote on 2017-12-06:

#56

Well, I've hit 48 hours of uptime since booting my custom-built kernel (Linux ganymede 4.13.0-17-generic #20+ryzen SMP Sun Dec 3 20:18:51 MST 2017 x86_64 x86_64 x86_64 GNU/Linux) with rcu_nocbs=0-15 in my boot command and ASLR disabled. That's about double my previous uptime record. I'll report back again if I hit 1 week.

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2017-12-06:

#232

I have two identical Ryzen systems, running Debian 9 "stretch" as servers.

Specs:
Ryzen 5 1600
ASRock AB350 Pro4
Crucial memory with ECC

These servers have been running fine since early August. I had one random Apache crash (segfault) on kernel 4.9 (too old for Ryzen really), but both had been running kernel 4.12 (from stretch-backports) just fine for about 3.5 months now. Suddenly, in the last week, I have seen 3 crashes across both machines. Similar symptoms as described above. Total system freeze in all cases, sometimes with some gibberish on screen:
https://i.imgur.com/J8M1zUS.jpg
https://i.imgur.com/Qb7Z8YW.jpg
In all cases, crashes happened between 03.00 and 06.00, when no users are on and backup jobs have finished.

Since yesterday I have been running stress-ng on both servers, keeping 3 cores at 100% 24/7 for about a week to see if that stabilises the systems. If it does, this would add to the evidence that prolonged full idle is the trigger for this issue.

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2017-12-06:

#233

What I forgot to mention is that I find it rather peculiar that both machines were rock solid for 3.5 months and then suddenly started crashing several times in a week. Perhaps that is worth investigating too.

Mario Emmenlauer (emmenlau) wrote on 2017-12-06:

#57

It's great that the frequency of freezes is reduced. But I guess many people here expect a stable
system and can work with nothing less. Ryzen processors are attractive for CI systems, and I cannot
have regular freezes in our CI. I think this is an all-or-nothing issue, and I still had a
freeze with the latest Ubuntu kernel and the fixes applied (after 8 hours of idle time).

Franck Charras (franckc) wrote on 2017-12-06:

#58

Could it be something else, then? After applying the fix on several machines, I have no freezes to report after more than 1 month of uptime; it looks rock stable. Did you also disable ASLR? Also, I compiled a kernel from git.kernel.org as suggested in the tutorial, not the Ubuntu kernel.

Mario Emmenlauer (emmenlau) wrote on 2017-12-06:

#59

I think it's very, very unlikely to have a different underlying cause, even though there is no guarantee. I assume that I am affected by the same problem because:
(1) The problem manifests identical to other reports here
(2) The freeze happens particularly during idle times, not under heavy load
(3) This is the only freeze we encountered in years, on many different desktop
    and server machines with partially identical or overlapping software setup,
    and partially overlapping hardware (graphics boards, hard disks, peripherals).
(4) The frequency of freezes was drastically reduced after I performed the
    suggested steps: compiling a custom kernel with CONFIG_RCU_NOCB_CPU,
    setting rcu_nocbs=0-15, and disabling ASLR.
It would be a very, very rare coincidence if my issue were unrelated, although it's never impossible. I have also updated to the latest BIOS, but with no improvement. I will report if there are further freezes.
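
For anyone repeating the steps in (4), the kernel config option and the ASLR setting can be checked with something like the sketch below. It assumes the usual locations: a kernel config file such as /boot/config-<version> (a Debian/Ubuntu convention) and the sysctl file /proc/sys/kernel/randomize_va_space, where 0 means ASLR is off. The function names are invented for this example.

```python
def has_rcu_nocb(config_text):
    """True if the given kernel config text enables CONFIG_RCU_NOCB_CPU."""
    return any(
        line.strip() == "CONFIG_RCU_NOCB_CPU=y"
        for line in config_text.splitlines()
    )


def aslr_disabled(sysctl_value):
    """True if the content of randomize_va_space says ASLR is off (value 0)."""
    return sysctl_value.strip() == "0"
```

Both functions take the file contents as a string, so they can be fed from whichever config file matches the running kernel.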

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-06:

#234

(In reply to kernel from comment #147)
> in the last week, I have seen 3 crashes across
> both machines. Similar symptoms as described above. Total system freeze in
> all cases, sometimes with some jibberish on screen:
> https://i.imgur.com/J8M1zUS.jpg
> https://i.imgur.com/Qb7Z8YW.jpg
> In all cases, crashes happened between 03.00 and 06.00, when no users are on
> and backup jobs have finished.

Your first screen shows an sda disk error. Your second screen does show the problem discussed here.

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2017-12-06:

#235

(In reply to Panagiotis Malakoudis from comment #149)
> Your first screen shows sda disk error. Your second screen indeed shows the
> problem discussed here.

Agreed, but the machine completely locked up in the process. That disk is part of a SSD raid1 array for root, and one disk going down or not responding should not crash the entire machine; I find it hard to believe it is unrelated. Both crashes are from the same machine, by the way.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-06:

#236

(In reply to kernel from comment #147)
> I have two identical Ryzen systems, running Debian 9 "stretch" as servers.
>
> Specs:
> Ryzen 5 1600
> ASRock AB350 Pro4
> Crucial memory with ECC
>
> These servers have been running fine since early August. I had one random
> Apache crash (segfault) on kernel 4.9 (too old for Ryzen really), but both
> had been running kernel 4.12 (from stretch-backports) just fine for about
> 3.5 months now. Suddenly, in the last week, I have seen 3 crashes across
> both machines. Similar symptoms as described above. Total system freeze in
> all cases, sometimes with some jibberish on screen:
> https://i.imgur.com/J8M1zUS.jpg
> https://i.imgur.com/Qb7Z8YW.jpg
> In all cases, crashes happened between 03.00 and 06.00, when no users are on
> and backup jobs have finished.
>
> I am since yesterday running stress-ng on both servers which keeps 3 cores
> at 100% 24/7, to see if that stabilises the systems for about a week. This
> could add to the evidence that prolonged full idle is the trigger for this
> issue.

FWIW before my RCU and C6 changes I would run a video on loop and that improved things for me. I'm not sure I would keep 50% of my CPU on full tilt like you!

Tom Reynolds (tomreyn) wrote on 2017-12-06:

#60

~emmenlau: Try setting up a serial console to, or netconsole from, this system to get a better idea of why and how it's freezing.
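
As a rough illustration of the netconsole route, the kernel accepts a boot or module parameter of the form netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]. The helper below is a hypothetical convenience for assembling that string; the default ports 6665 and 6666 are the ones given in the kernel's netconsole documentation, and the addresses in the usage note are placeholders.

```python
def netconsole_param(target_ip, target_port=6666, source_port=6665,
                     source_ip="", device="", target_mac=""):
    """Assemble a netconsole= parameter; empty fields fall back to kernel defaults."""
    return (
        f"netconsole={source_port}@{source_ip}/{device},"
        f"{target_port}@{target_ip}/{target_mac}"
    )
```

For example, netconsole_param("192.0.2.10", source_ip="192.0.2.20", device="eth0") yields a string that could be added to the boot parameters, with a UDP listener on the target machine collecting the messages.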

Kai-Heng Feng (kaihengfeng) on 2017-12-07

information type:	Public → Public Security
information type:	Public Security → Public

Mike R (mi9e) wrote on 2017-12-10:

#61

After 5 days of uptime with my custom Ubuntu kernel build I got a different crash, a reboot rather than the watchdog error. I'm now running mainline kernel 4.14.3 with ASLR disabled and rcu_nocbs=0-15, and have restarted my no-crash counter.

Huygens (huygens-25) wrote on 2017-12-11:

#62

We are experiencing the same behaviour and error messages with a Ryzen5 1600 CPU and an ASUS PRIME A320M-K motherboard.

What we can add is:

1. When no VMs are running, even if there are long idle periods, the machines can be stable for several days on Ubuntu 16.04.3 using the HWE Kernel 4.10.
2. When we run a single VM on the machine, but both the VM and the host are idle, then we get a crash. The machine is unresponsive, and we see CPU soft lockups on the screen, no logs, etc. We tested this with Ubuntu 16.04.3 using the stable and edge HWE kernels (4.10 and 4.13), and also with Ubuntu 17.10 and kernel 4.13. The latest test was with Ubuntu 16.04.3 and a custom-built kernel based on HWE edge (4.13) with CONFIG_RCU_NOCB_CPU enabled; we also got a crash the following night on one of the 2 machines (note that we did not have ASLR deactivated).

For our latest test we did:
- Ubuntu 16.04.3
- HWE Edge custom built kernel with CONFIG_RCU_NOCB_CPU
- ASLR deactivated
- boot parameters: rcu_nocbs=0-11 processor.max_cstate=1
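
The range in rcu_nocbs differs between reports because it follows the number of hardware threads: 0-11 here for a 12-thread Ryzen 5 1600, 0-15 elsewhere in this thread for 16-thread parts. A small sketch of that arithmetic, assuming CPUs are numbered contiguously from 0 (the function names are illustrative only):

```python
import os


def rcu_nocbs_param(logical_cpus):
    """Build an rcu_nocbs= value covering logical CPUs 0..N-1."""
    if logical_cpus < 1:
        raise ValueError("need at least one logical CPU")
    if logical_cpus == 1:
        return "rcu_nocbs=0"
    return f"rcu_nocbs=0-{logical_cpus - 1}"


def rcu_nocbs_for_this_machine():
    """Same, using the thread count Python reports for the current system."""
    return rcu_nocbs_param(os.cpu_count() or 1)
```

Note that disabling SMT in the BIOS changes the thread count, so the range would need to be recomputed afterwards.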

We set this up last Friday; on Monday morning both machines were down, so it did not help.

PS: our motherboard had the latest "BIOS/UEFI" update, which includes AMD AGESA 1.0.0.6B; we even upgraded to a newer firmware (which ASUS has since removed) which stated that it contains AMD AGESA 1.0.0.7A. This morning we saw yet another firmware update from ASUS. We are going to apply it on top of our other changes, but if it does not work, we will try to return the machine (that would be a pity, but stability is the number 1 feature, more than the number of cores or throughput).

Mike R (mi9e) wrote on 2017-12-15:

#63

Well, another watchdog failure with 4.14.5.

I also updated to AGESA 1.0.0.6B when I rebooted to 4.14.5.

Just building a 4.14.6 to try again.

In Linux Kernel Bug Tracker #196683, sfearghail (sfearghail-linux-kernel-bugs) wrote on 2017-12-16:

#237

Specs:

AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
Ryzen 1700 week 33
G.SKILL DDR4-2400 32GB 15-15-15-39 1.2v
4.13.13-custom #1 SMP Mon Nov 27 20:17:00 CST 2017 x86_64 GNU/Linux

Debian 9; the kernel is compiled with CONFIG_RCU_NOCB_CPU and booted with the parameter set. ASLR disabled. C-states enabled in BIOS.

It runs samba and 6 active VMs. The VMs are fairly idle most of the time. The crash occurred at 6:40AM after 10 days of uptime.

"watchdog: BUG: soft lockup - CPU#x stuck for 23s! [worker:1659]"

https://photos.app.goo.gl/3ZIsi9aWXltZcixn1
https://photos.app.goo.gl/wfJFlVChftMHyqYQ2

So is the general consensus now that we simply must disable C-State completely in bios if kernel work-around is not working? I just need this system to not crash on me at this point.

In Linux Kernel Bug Tracker #196683, tstraus13 (tstraus13-linux-kernel-bugs) wrote on 2017-12-16:

#238

Hello all,

Since my last comment, about ten days ago, I have had the C6 state disabled in the BIOS and have not had a single lockup. It has been stable. Just wanted to report back with my results. Thanks.

(In reply to Tom from comment #146)
> I just wanted to chime in as well. I have an AMD Ryzen 1600, ASRock X370
> Taichi acting as a NFS File Server. It is running Ubuntu 17.10 server with I
> believe kernel 4.13 or whatever the latest kernel is after updates. Recently
> I have been experiencing the same exact issue as you all. Within 24 of the
> system being idle or low load the system just freezes. I am unable to access
> it. It actually happened today as I am typing this, I am unable to remote
> into it but I have it hooked up to a keyboard, mouse, and monitor. When I
> return home I want to see if there are errors of anykind on screen. I
> initally thought it may have been due to some hardware changes but it just
> didnt make sense and then I found this thread.
>
> I am going to try disabling the C6 State and see if it helps at all and will
> report back. Hopefully this gets fixed with a BIOS update or something. This
> is a major inconvenience for server use with frequent idle loads. Thanks.

In Linux Kernel Bug Tracker #196683, sfearghail (sfearghail-linux-kernel-bugs) wrote on 2017-12-16:

#239

I forgot to mention the PSU is Corsair CX 550M.

(In reply to Scott Farrell from comment #152)
> Specs:
>
> AB350-Gaming 3/AB350-Gaming 3-CF, BIOS F7 06/16/2017
> Ryzen 1700 week 33
> G.SKILL DDR4-2400 32GB 15-15-15-39 1.2v
> 4.13.13-custom #1 SMP Mon Nov 27 20:17:00 CST 2017 x86_64 GNU/Linux
>
> Debian 9 and kernel is compiled with CONFIG_RCU_NOCB_CPU and booting with
> parameter set. ASLR disabled. C-State enabled bios.
>
> It runs samba and 6 active VMs. The VMs are fairly idle most of the time.
> The crash occurred at 6:40AM after 10 days of uptime.
>
> "watchdog: BUG: soft lockup - CPU#x stuck for 23s! [worker:1659]"
>
> https://photos.app.goo.gl/3ZIsi9aWXltZcixn1
> https://photos.app.goo.gl/wfJFlVChftMHyqYQ2
>
> So is the general consensus now that we simply must disable C-State
> completely in bios if kernel work-around is not working? I just need this
> system to not crash on me at this point.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-16:

#240

I built a new Ryzen system with MSI B350 motherboard, Ryzen 1700 CPU 1733SUS (segfault free direct from retail). Running same Debian 9 system with kernel 4.13 from backports (also tried 4.14.5 self built). This system does not freeze on idle. Believing it is a CPU issue, I installed this CPU to my first system (Asus Prime X370, 1700X 1725SUS from RMA).

Guess what? My first system still freezes on idle with the new CPU.

Differences: the first system has a GTX1060, the second a GT1030. They also have different PSUs and different disks. The RAM is different too, but I have already swapped it between the systems and it made no difference.

Will continue trying hardware combinations to find the root cause, at least for my case.

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2017-12-16:

#241

(In reply to Panagiotis Malakoudis from comment #155)
> Believing it is a CPU issue, I installed this CPU to my
> first system (Asus Prime X370, 1700X 1725SUS from RMA).
>
> Guess what? My first system still freezes on idle with the new CPU.

Could you please tell us which BIOS version you are using with the Asus Prime X370?

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-16:

#242

BIOS is 3401, but this was happening with previous BIOS versions as well.

Mike R (mi9e) wrote on 2017-12-18:

#64

Well, another lockup. I've now disabled C6 in the BIOS and added max_cstates=1 to the boot parameters.

Nagy Ákos (e-contact-pecska-ro) wrote on 2017-12-19:

#65

I experienced the same issue with a Ryzen 5 1600 and a Ryzen 7 1700, on an ASRock AB350M Pro4 and an Asus Prime B350M-A.
I tried Ubuntu 16.04 with the normal and HWE kernels, and Ubuntu 17.10.

On the system with Asus motherboard, I found a detailed log:
Dec 19 06:25:04 am02 kernel: [227622.995187] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:115]
Dec 19 06:25:04 am02 kernel: [227622.996201] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables snd_hda_codec_hdmi nls_iso8859_1 eeepc_wmi asus_wmi sparse_keymap edac_mce_amd edac_core snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer i2c_piix4 snd ccp soundcore mac_hid 8250_dw i2c_designware_platform i2c_designware_core shpchp kvm_amd kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Dec 19 06:25:04 am02 kernel: [227622.996242]  raid0 multipath linear nouveau crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc mxm_wmi video i2c_algo_bit ttm aesni_intel drm_kms_helper aes_x86_64 crypto_simd syscopyarea glue_helper sysfillrect cryptd sysimgblt fb_sys_fops drm r8169 ahci mii libahci wmi fjes gpio_amdpt gpio_generic
Dec 19 06:25:04 am02 kernel: [227622.996263] CPU: 0 PID: 115 Comm: kworker/0:1 Tainted: G             L  4.10.0-28-generic #32~16.04.2-Ubuntu
Dec 19 06:25:04 am02 kernel: [227622.996264] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 0611 05/05/2017
Dec 19 06:25:04 am02 kernel: [227622.996271] Workqueue: events netstamp_clear
Dec 19 06:25:04 am02 kernel: [227622.996274] task: ffff9d12fa288000 task.stack: ffffba804361c000
Dec 19 06:25:04 am02 kernel: [227622.996278] RIP: 0010:smp_call_function_many+0x1f1/0x250
Dec 19 06:25:04 am02 kernel: [227622.996280] RSP: 0018:ffffba804361fce8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Dec 19 06:25:04 am02 kernel: [227622.996283] RAX: 0000000000000003 RBX: 0000000000000010 RCX: 000000000000000c
Dec 19 06:25:04 am02 kernel: [227622.996285] RDX: ffff9d12fe91d0f8 RSI: 0000000000000010 RDI: ffff9d12fe021d30
Dec 19 06:25:04 am02 kernel: [227622.996286] RBP: ffffba804361fd20 R08: 0000000000000000 R09: 000000000000fffe
Dec 19 06:25:04 am02 kernel: [227622.996287] R10: ffffeceadfd69100 R11: ffff9d12fe021d30 R12: ffffffffbb634c60
Dec 19 06:25:04 am02 kernel: [227622.996288] R13: 0000000000000000 R14: ffff9d12fe61a340 R15: 000000000001a300
Dec 19 06:25:04 am02 kernel: [227622.996291] FS:  0000000000000000(0000) GS:ffff9d12fe600000(0000) knlGS:0000000000000000
Dec 19 06:25:04 am02 kernel: [227622.996292] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 19 06:25:04 am02 kernel: [227622.996294] CR2: 00007f5780200a58 CR3: 00000007f67ca000 CR4: 00000000003406f0
Dec 19 06:25:04 am02 kernel: [227622.996296] Call Trace:
Dec 19 06:25:04 am02 kernel: [227622.996301]  ? netif_receive_skb_internal+0x20/0xa0
Dec 19 06:25:04 am02 kernel: [227622.996304]  ? arch_unregister_cpu+0x30/0x30
Dec 19 06:25:04 am02 kernel: [227622.996307]  ? netif_receive_skb_internal+0x21/0xa0
Dec 19 06:25:04 am02 kernel: [227622.996309]  on_each_cpu+0x2d/0x60
Dec 19 06:25:04 am02 kernel: [227622.996312]  ? netif_receive_skb_internal+0x20/0xa0
Dec 19 06:25:04 am02 kernel: [227622.996314]  text_poke_bp+0x6a/0xf0
Dec 19 06:25:04 am02 kernel: [227622.996316]  ? netif_receive_skb_internal+0x20/0xa0
Dec 19 06:25:04 am02 kernel: [227622.996318]  arch_jump_label_transform+0x9b/0x120
Dec 19 06:25:04 am02 kernel: [227622.996322]  __jump_label_update+0x76/0x90
Dec 19 06:25:04 am02 kernel: [227622.996324]  jump_label_update+0x88/0x90
Dec 19 06:25:04 am02 kernel: [227622.996326]  static_key_slow_inc+0x95/0xa0
Dec 19 06:25:04 am02 kernel: [227622.996328]  static_key_enable+0x1d/0x50
Dec 19 06:25:04 am02 kernel: [227622.996330]  netstamp_clear+0x2d/0x40
Dec 19 06:25:04 am02 kernel: [227622.996333]  process_one_work+0x16b/0x4a0
Dec 19 06:25:04 am02 kernel: [227622.996335]  worker_thread+0x4b/0x500
Dec 19 06:25:04 am02 kernel: [227622.996339]  kthread+0x109/0x140
Dec 19 06:25:04 am02 kernel: [227622.996341]  ? process_one_work+0x4a0/0x4a0
Dec 19 06:25:04 am02 kernel: [227622.996344]  ? kthread_create_on_node+0x60/0x60
Dec 19 06:25:04 am02 kernel: [227622.996347]  ret_from_fork+0x2c/0x40
Dec 19 06:25:04 am02 kernel: [227622.996348] Code: 63 d2 e8 f3 da 34 00 3b 05 c1 1e e8 00 89 c1 0f 8d 9a fe ff ff 48 98 49 8b 16 48 03 14 c5 e0 73 34 bc 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bd 0f b6 4d d0 4c 89 ea 4c 89 e6 44 89
Dec 19 06:25:32 am02 kernel: [227650.996992] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:115]

If I can help, I can give SSH access for my systems.
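
When comparing logs like the one above across machines, it can help to pull out just the watchdog lines. The following is a small sketch that extracts the CPU number, stall duration and task name from "soft lockup" messages; the regular expression is written against the message format shown in this log and may need adjusting for other kernel versions.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:115]"
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+)\]"
)


def find_soft_lockups(log_text):
    """Return (cpu, seconds, task) tuples for each soft lockup line in a log."""
    hits = []
    for line in log_text.splitlines():
        match = SOFT_LOCKUP.search(line)
        if match:
            hits.append((int(match["cpu"]), int(match["secs"]), match["task"]))
    return hits
```

Run over a kern.log, this gives a quick view of which CPUs and kernel threads were stuck and how often the watchdog fired.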

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2017-12-20:

#243

Did you do further tests with your hardware combinations? Any news? It's frustrating not knowing what is the cause and AMD not interested in recognising let alone fixing the issue.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-20:

#244

C6 disabled and lockup after about a week. This is getting infuriating.

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-20:

#245

(In reply to eric.c.morgan from comment #159)
> C6 disabled and lockup after about a week. This is getting infuriating.

How did you disable C6? From BIOS or using the python script?

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-20:

#246

(In reply to Jonathan from comment #158)
> Did you do further tests with your hardware combinations? Any news? It's
> frustrating not knowing what is the cause and AMD not interested in
> recognising let alone fixing the issue.

Yes I did.
I changed the graphics card, system still froze on idle.
Now I changed the disk. Original system was running from a USB 3.0 adapter with an mSATA SSD in it. Now running with a SATA SSD (the one from the second system that didn't freeze).
If it freezes again, the last thing to change is the PSU. But I can't accept it is the PSU. Most probably it is the power delivery system on the 2nd motherboard that made the 2nd system not freeze, or at least not as quickly as my 1st system.

@eric.c.morgan: Freezing on idle with C6 disabled should be something completely different. Are you running with memory overclocked? I keep testing my different setups with memory always running at 2400 or 2133.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-20:

#247

C6 disabled via bios, I'll verify with that python script I posted earlier.

Mem is stock, memtested for 24 hours too.

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-20:

#248

Crap, the .py said C6 was still enabled! I disabled via said script. Fingers crossed.

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-20:

#249

(In reply to eric.c.morgan from comment #163)
> Crap, the .py said C6 was still enabled! I disabled via said script. Fingers
> crossed.

I had the same experience: disabling C6 via BIOS did not actually disable C6 (at least according to the script), and it would not fix the idle bug. Since I started disabling C6 via the script, I haven't experienced the bug once so far (in 2 months or so of usage - in normal circumstances the bug happens once every 1/2 days for me).
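
For readers wondering how a script can disagree with the BIOS setting: it reads the CPU's model-specific registers directly. The sketch below shows the general idea only. The register addresses and bit positions are assumptions borrowed from the community zenstates.py script for first-generation Ryzen, not values confirmed anywhere in this thread, and reading MSRs needs root plus the msr kernel module.

```python
import os
import struct

# Assumed values (from the community zenstates.py script, unverified here).
MSR_PKG_C6 = 0xC0010292
PKG_C6_BIT = 1 << 32
MSR_CORE_C6 = 0xC0010296
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def c6_status(pkg_msr_value, core_msr_value):
    """Decode package and core C6 enable state from two raw MSR values."""
    return {
        "package": bool(pkg_msr_value & PKG_C6_BIT),
        "core": (core_msr_value & CORE_C6_BITS) == CORE_C6_BITS,
    }


def read_msr(msr, cpu=0):
    """Read one 64-bit MSR through /dev/cpu/<cpu>/msr (root, `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

On an affected machine, c6_status(read_msr(MSR_PKG_C6), read_msr(MSR_CORE_C6)) would report whether C6 is still enabled regardless of what the BIOS menu shows.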

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-21:

#250

As expected, system froze on idle overnight again.
My last test is with the PSU from the 2nd system.

To summarize:
1st system Asus Prime X370, Ryzen 1700X, Flare X 2*8GB DDR4, mSATA SSD on USB 3.0, nvme disk (Windows install), GTX 1060 6GB, Coolermaster G650M PSU freezes on idle. It doesn't freeze with rcuo threads or C6 disabled.
2nd system MSI B350M Bazooka, Ryzen 1700, Corsair Vengeance 2*8GB DDR4, SATA SSD, GT 1030, Antec Earthwatts EA-380D PSU did not freeze on idle for 3 days, running same kernel with C6 enabled without rcuo threads. I didn't test much longer than 3 days because 1st system usually freezes overnight.

Installed on 1st system CPU, RAM, Graphics Card and disk from 2nd system and still freezes on idle. Now testing with the PSU. The only thing after that left is the motherboard. There is also a small difference on peripheral devices (different USB keyboard, a USB webcam) and the other difference is the water cooling system on 1st system.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-21:

#251

System froze with the PSU from 2nd system. As it is I am out of ideas.

In Linux Kernel Bug Tracker #196683, klausman (klausman-linux-kernel-bugs) wrote on 2017-12-22:

#252

(In reply to Panagiotis Malakoudis from comment #166)
> System froze with the PSU from 2nd system. As it is I am out of ideas.

FWIW, at GHz speeds all circuits are analog. Propensity to crash on low load may well require (or be hastened by) a particular combination of PSU, CPU and mainboard, possibly more components. Plus, in light of Win10 being unaffected, the system being exercised in a particular way.

In this case, there are at least three variables (PSU/MB/CPU). Combined with the elusive nature of an easy way to reproduce, it may well be a fool's errand to try and figure out which component is at fault.

This may also be the reason why AMD is so mum about this whole matter: there are no clear steps to make a crashing system better that aren't papering over the problem (C6 disabling, RCUO tweaking). If I had to make a decision at AMD, I'd be *very* careful to not claim a source of the error (and possibly solution) too quickly. I would, however, be more transparent about the whole ordeal.

Relatedly, I'm on an Asus Prime B350 Plus with a 1700X and 2x16GB Kingston RAM. I used to have an AMD GPU (ancient, 6450, IIRC) but now have a GTX1050, but crashiness didn't really vary between the two GPUs.

My first CPU had both the lockup and the segv problem, so I got it RMA'd. The new CPU has no segv problem, but is still crashy, usually within a day of light computing, or at the very least over night.

I haven't tried the RCUO approach yet, but with C6 disabled via script, I've gotten to two days of uptime, with very long stretches of complete idle (sitting at login prompt, not doing anything in the background).

As others have said, while this is far from perfect (I don't particularly care about the power saving or turbo, but I understand those who do), it at least reduces the risk of data loss.

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2017-12-22:

#253

(In reply to Panagiotis Malakoudis from comment #166)
> System froze with the PSU from 2nd system. As it is I am out of ideas.

For the hell of it and completion's sake, could you try switching the USB devices as well (or remove them from system 1)?

If that still makes no difference, then perhaps we could start listing which mainboards have issues with what bios versions and see if we find a pattern there.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2017-12-22:

#254

@Jonathan: A friend of mine has the same Asus Prime X370 as I do, and he doesn't have the idle freeze issue. And I don't think a USB webcam is to blame.

I think the problem is a complex PSU/Motherboard/CPU + specific workload puzzle that I am not willing to solve any more. I already lost many days trying combinations. After all I didn't pay AMD to debug their problems, I paid to use my CPU.

Since my issue is resolved with rcuo threads, I will use it like that and test every now and then if a BIOS update solves this. And will certainly replace my Motherboard/CPU combination as soon as X470 and new Ryzen CPUs are out.

Revision history for this message

In Linux Kernel Bug Tracker #196683, arup.chowdhury (arup.chowdhury-linux-kernel-bugs) wrote on 2017-12-24:

#255

I have the same issue with my ASUS 350M and Ryzen 1700, with the latest BIOS and G.Skill RAM. All the freezes happen during idle and never under load. Sometimes it goes days without a freeze; other days it's random. I am on Arch with GNOME, using the Nvidia drivers for my 1050 Ti.

Revision history for this message

In Linux Kernel Bug Tracker #196683, lyyb46 (lyyb46-linux-kernel-bugs) wrote on 2017-12-27:

#256

I would just like to report that I have almost exactly the same problem (the log shows "soft lockup - CPU#0 stuck for 22s!") with a TP-Link TL-WDN4800 WiFi adapter. Apparently someone has exactly the same issue as me at https://forum.level1techs.com/t/solved-linux-is-unstable-ever-since-i-upgraded-to-ryzen/117541/152 .

I'm not sure whether I share any other components with the OP in the level1techs forum, but for me the soft lockup definitely comes from the WiFi driver (the log mentions ath9k_hw_wait, which I think is part of the in-kernel WiFi driver).

It seems to me that the TP-Link TL-WDN4800 with a recent kernel (I tried 4.10, 4.11, 4.13) can reliably reproduce the soft lockup within an hour on Ubuntu 16.04. (Of course, you will need to turn on WiFi.)

I have given my Ryzen CPU to a friend who is a windows user. But I kept a copy of the log - let me know if you guys need me to upload it.

Revision history for this message

In Linux Kernel Bug Tracker #196683, lyyb46 (lyyb46-linux-kernel-bugs) wrote on 2017-12-27:

#257

(In reply to Yibin Lin from comment #171)
> Just would like to report that I have almost exactly the same problem (log
> shows "soft lockup - CPU#0 stuck for 22s!"), with TP-Link TL-WDN4800 WiFi
> adapter. Apparently someone has exactly the same issue with me at
> https://forum.level1techs.com/t/solved-linux-is-unstable-ever-since-i-
> upgraded-to-ryzen/117541/152 .
>
> Not sure if I have other same components with the OP in the level1techs
> forum. But for me the soft lockup definitely comes from the WiFi driver
> process (it mentions about ath9k_hw_wait, which is the WiFi driver in kernel
> I think).
>
> It seems to me that TP-Link TL-WDN4800, with a recent kernel (I tried 4.10,
> 4.11, 4.13) can reliably re-produce the soft lockup issue within an hour in
> Ubuntu 16.04. (Of course you will need to turn on WiFi)
>
> I have given my Ryzen CPU to a friend who is a windows user. But I kept a
> copy of the log - let me know if you guys need me to upload it.

Some additional information: I tried disabling C6 and with the WiFi card it still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA 550 G3 PSU, Asus ROG Strix B350-f motherboard.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-27:

#258

(In reply to Yibin Lin from comment #172)
> Some additional information: I tried disabling C6 and with the WiFi card it
> still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA
> 550 G3 PSU, Asus ROG Strix B350-f motherboard.

How did you disable C6? Via BIOS or the Python script posted earlier?

Revision history for this message

In Linux Kernel Bug Tracker #196683, lyyb46 (lyyb46-linux-kernel-bugs) wrote on 2017-12-28:

#259

(In reply to Francesco Biscani from comment #173)
> (In reply to Yibin Lin from comment #172)
> > Some additional information: I tried disabling C6 and with the WiFi card it
> > still freezes (that's why I gave away my Ryzen). I have Ryzen 5 1600, EVGA
> > 550 G3 PSU, Asus ROG Strix B350-f motherboard.
>
> How did you disable C6? Via BIOS or the Python script posted earlier?

I think I did both..(?) After I disabled C6 in the BIOS, I ran the Python script and it shows two things about C6 states, one disabled (I guess that's the effect of BIOS setup) and the other enabled. I then wrote a simple systemd script to disable the other C6 state at the Ubuntu startup time. But normally within half an hour my computer would freeze.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2017-12-28:

#260

(In reply to Yibin Lin from comment #174)
> I think I did both..(?) After I disabled C6 in the BIOS, I ran the Python
> script and it shows two things about C6 states, one disabled (I guess that's
> the effect of BIOS setup) and the other enabled. I then wrote a simple
> systemd script to disable the other C6 state at the Ubuntu startup time. But
> normally within half an hour my computer would freeze.

Thanks for the info.

That's a bummer, I was becoming convinced that C6 was at the core of the issue.

For reference, for the past 2 months I have been executing these commands at every startup:

python zenstates.py --c6-disable
python zenstates.py --disable -p0
python zenstates.py --disable -p1
python zenstates.py --disable -p2

And I haven't had a single issue so far. I will start experimenting leaving C6 on and disabling only the P states, just to see what happens. Still waiting for the next BIOS update for my motherboard, and then it's almost time for the release of Ryzen 2...
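
The startup commands listed in this comment could be generated and run from one small wrapper. This is only a sketch: the flags are the ones quoted above, while the install path and the names `build_commands` and `apply_all` are hypothetical and not part of zenstates.py.

```python
import subprocess

# Interpreter name as used in the comment above.
PYTHON = "python"
# Hypothetical install location; adjust to wherever zenstates.py lives.
ZENSTATES = "/usr/local/bin/zenstates.py"


def build_commands(script=ZENSTATES, pstates=(0, 1, 2)):
    """Return one argv list per command quoted in the comment."""
    commands = [[PYTHON, script, "--c6-disable"]]
    for p in pstates:
        commands.append([PYTHON, script, "--disable", f"-p{p}"])
    return commands


def apply_all(commands):
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `apply_all(build_commands())` from a boot-time unit would reproduce the four commands; the script will most likely need root privileges to change CPU state.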

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2017-12-28:

#261

I've seen something new on one of my two affected machines. Note I have C6 *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The machines have been rock-solid since I started doing that, about 3-4 weeks uptime now.

Anyway, this message came up on one of them:
[1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
[1905495.738523] Do you have a strange power saving mode enabled?
[1905495.738524] Dazed and confused, but trying to continue

I have no idea what to make of this, perhaps it can help.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2017-12-29:

#262

(In reply to kernel from comment #176)
> I've seen something new on one of my two affected machines. Note I have C6
> *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The
> machines have been rock-solid since I started doing that, about 3-4 weeks
> uptime now.
>
> Anyway, this message came up on one of them:
> [1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
> [1905495.738523] Do you have a strange power saving mode enabled?
> [1905495.738524] Dazed and confused, but trying to continue
>
> I have no idea what to make of this, perhaps it can help.

I can't even with those errors. Wow.

Regarding the loaded CPU: I did the same with about 15% CPU load 24/7 and had similar results. I'm now testing with just C6 turned off via the Python script.

This whole mess is getting really annoying though. I want to support AMD, but man, I might have to go back to intel for a totally reliable machine. Segfault first, RMA pains, and now random crashes with low load? FFS...

Revision history for this message

Mike R (mi9e) wrote on 2018-01-02:

#66

Well I've now hit 3 weeks of uptime with C6 disabled in the BIOS and
GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-15 processor.max_cstate=1"

and 4.14.6 built from the mainline kernel repo. At this point I'm satisfied that for me at least my Ryzen is now stable.

Revision history for this message

Mike R (mi9e) wrote on 2018-01-03:

#67

Just to be clear, my mainline kernel is configured with CONFIG_RCU_NOCB_CPU set to yes.
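
To see whether an installed kernel was built with this option, the config file can be inspected. A minimal sketch, assuming the distribution ships the config as /boot/config-<release> (Ubuntu does, not every distro does); the function names are illustrative.

```python
import platform
from pathlib import Path


def option_enabled(config_text, option="CONFIG_RCU_NOCB_CPU"):
    """Return True if the kernel config text contains the line `<option>=y`."""
    for line in config_text.splitlines():
        if line.strip() == f"{option}=y":
            return True
    return False


def running_kernel_config(boot_dir="/boot"):
    """Read the config of the running kernel (assumed path: /boot/config-<release>)."""
    path = Path(boot_dir) / f"config-{platform.release()}"
    return path.read_text()
```

`option_enabled(running_kernel_config())` then reports whether the rcu_nocbs boot parameter can be expected to have any effect on that kernel.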

Revision history for this message

Mario Emmenlauer (emmenlau) wrote on 2018-01-06:

#68

Ok, after updating to the latest BIOS the issue seems resolved for me. I'm now also at an uptime of over 3 weeks, and before I could never get past 1 week. I've updated the BIOS of my ASRock AB350 Pro4 AMD B350 to 3.20 with AGESA 1.0.0.6b.

I am currently running kernel 4.14.5 with CONFIG_RCU_NOCB_CPU and boot option "rcu_nocbs=0-15". My system is a Ryzen 7 1700X on an ASRock AB350 Pro4 with 32GB RAM @3200.

Revision history for this message

In Linux Kernel Bug Tracker #196683, sergio (sergio-linux-kernel-bugs) wrote on 2018-01-08:

#263

(In reply to kernel from comment #176)
> I've seen something new on one of my two affected machines. Note I have C6
> *on*, but I'm running stress-ng 24/7 which keeps one core at 100%. The
> machines have been rock-solid since I started doing that, about 3-4 weeks
> uptime now.
>
> Anyway, this message came up on one of them:
> [1905495.738522] Uhhuh. NMI received for unknown reason 3d on CPU 8.
> [1905495.738523] Do you have a strange power saving mode enabled?
> [1905495.738524] Dazed and confused, but trying to continue
>
> I have no idea what to make of this, perhaps it can help.

As per this thread, this message could be due to your GPU, for instance: https://bbs.archlinux.org/viewtopic.php?id=121291

Revision history for this message

Tom Reynolds (tomreyn) wrote on 2018-01-12:

#69

So, to sum up, we have several reports here that Ubuntu's default kernels freeze on low load situations with Ryzen CPUs (or a subset of Ryzen CPUs?) up to and including the latest mainline releases. This bug differs from bugzilla.kernel.org #196481 (user space segfaults only on high CPU load).

For low load freezes we have a reliable workaround for Xenial (probably also for newer releases) across different hardware components (differences in exact Ryzen model, mainboards, memory models and configurations) and kernel versions:

* Update to latest UEFI
* Kernel >= 4.10.0 (4.14.5 mainline tested successfully) built with CONFIG_RCU_NOCB_CPU
* The result of `echo rcu_nocbs=0-$(($(nproc)-1))` appended to GRUB_CMDLINE_LINUX
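
The last step can also be scripted. The sketch below builds the same rcu_nocbs value from the CPU count and appends it to a GRUB_CMDLINE_LINUX line; it works on the text of /etc/default/grub passed in as a string, and the helper names are hypothetical. Note that `os.cpu_count()` and `nproc` can differ when CPU affinity is restricted.

```python
import os
import re


def rcu_nocbs_param(cpu_count=None):
    """Build rcu_nocbs=0-(N-1), mirroring the echo command above."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return f"rcu_nocbs=0-{cpu_count - 1}"


def append_to_grub_cmdline(grub_text, param, key="GRUB_CMDLINE_LINUX"):
    """Append param to the quoted value of `key` unless it is already there."""
    pattern = re.compile(rf'^({re.escape(key)}=")([^"]*)(")$', re.MULTILINE)

    def add(match):
        values = match.group(2).split()
        if param not in values:
            values.append(param)
        return match.group(1) + " ".join(values) + match.group(3)

    return pattern.sub(add, grub_text)
```

After writing the result back to /etc/default/grub, the GRUB configuration still has to be regenerated (on Ubuntu with `sudo update-grub`) before the parameter takes effect on the next boot.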

In my experience, with the above changes applied, the following BIOS changes are NOT necessary:

* Disabling SMT
* Disabling ASLR

It is unclear whether it is strictly necessary to disable C6 in BIOS, comments welcome.

I would think CONFIG_RCU_NOCB_CPU is an acceptable compromise (for this CPU type) if it provides the stability AMD has, for exactly one year now, been unable to provide to Linux users on their current platform.

Revision history for this message

Tom Reynolds (tomreyn) wrote on 2018-01-12:

#70

Link to correct upstream bug report

Revision history for this message

In Linux Kernel Bug Tracker #196683, bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote on 2018-01-12:

#264

Several Ubuntu users seem to be able to work around this bug as discussed in
https://bugs.launchpad.net/linux/+bug/1690085 , more specifically in https://bugs.launchpad.net/linux/+bug/1690085/comments/69

Revision history for this message

In Linux Kernel Bug Tracker #196683, darkbasic (darkbasic-linux-kernel-bugs) wrote on 2018-01-12:

#265

Didn't know about it, that's really bad! I'm subscribing to make sure to not buy a Ryzen until this gets fixed.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-01-12:

#266

(In reply to Moritz Naumann from comment #179)
> Several Ubuntu users seem to be able to work around this bug as discussed in
> https://bugs.launchpad.net/linux/+bug/1690085 , more specifically in
> https://bugs.launchpad.net/linux/+bug/1690085/comments/69

That's just the same workaround we've been discussing the whole time. It works but it's not 100% effective. I've maybe had one or two freezes with it over the course of a few months.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bugspam (bugspam-linux-kernel-bugs) wrote on 2018-01-12:

#267

Created attachment 273573
Workaround init script (derived from zenstates.py)

I'm using the attached init script to work around the problem. I can be fairly sure of it hanging (or even rebooting) within a day or two without this; no issues with it in place.

CONFIG_RCU_NOCB_CPU is not set here.

Revision history for this message

Mike R (mi9e) wrote on 2018-01-16:

#71

@tomreyn

My current config to get stability

C6 disabled in BIOS
ASLR disabled
kernel >= 4.14.4 with CONFIG_RCU_NOCB_CPU
rcu_nocbs=0-15 processor.max_cstate=1 in GRUB_CMDLINE_LINUX

I still got lockups with ASLR disabled and rcu_nocbs on the custom-built kernel. Once I also disabled C6 in the BIOS and added processor.max_cstate=1, I've gotten 6 weeks of uptime.

Revision history for this message

Kevin Wong (stonecracker) wrote on 2018-01-17:

#72

CPU: Ryzen Threadripper 1920X
Ubuntu 16.04 LTS

Identical problems to the above. Agonizing, irritating, random, low-load idle-state freezes. No issues while under heavy load.

My fixes:

Disabled c-states in the BIOS
Disabled ASLR in the BIOS
Disabled AMD Cool 'n Quiet in the BIOS.

Compiled a new kernel from source, (4.13.16) but with "Offload RCU_callback processing" aka CONFIG_RCU_NOCB_CPU, chosen in the kernel config.

I added "rcu_nocbs=0-23" to the "GRUB_CMDLINE_LINUX" parameter in /etc/default/grub (and then ran "sudo update-grub"). "0-23", to cover all 24 threads on the 12-core Ryzen.

Rebooted.

Only with all of those fixes (yep, all of them; I tested every one of them separately, and that took an eon) have I finally gotten a stable system.

Zero issues now after 8 weeks. Let's hope that holds.

Haven't tried any kernel updates, though.

In addition to the above, I also consulted these references:

http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

http://blog.programster.org/how-to-disable-aslr

Revision history for this message

In Linux Kernel Bug Tracker #196683, 05497894 (05497894-linux-kernel-bugs) wrote on 2018-01-19:

#268

Apologies if this has been mentioned or is completely obvious, but if you're using zenstates or ryzen-stabilizator or some other software means to disable C6, make sure you run it after resuming from S3. At least on my system, C6 is re-enabled after I resume from sleep.
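
One way to re-run the tool after resume is a systemd sleep hook: systemd calls executables in its system-sleep directory with `pre` or `post` as the first argument and the sleep action as the second. The following is a sketch only; the zenstates.py path is hypothetical and the flag is the one quoted earlier in this thread.

```python
import subprocess

# Hypothetical path; the flag is the one quoted earlier in this thread.
REAPPLY_CMD = ["python", "/usr/local/bin/zenstates.py", "--c6-disable"]


def should_reapply(phase):
    """True when the hook is invoked after the system has woken up."""
    return phase == "post"


def handle(argv, run=subprocess.run):
    """Entry point for a hook invoked as: <script> pre|post <action>."""
    if len(argv) >= 3 and should_reapply(argv[1]):
        run(REAPPLY_CMD, check=True)
        return True
    return False
```

An executable script that imports `sys` and calls `handle(sys.argv)`, placed in the system-sleep directory (for example /usr/lib/systemd/system-sleep/), would then disable C6 again after each resume.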

Revision history for this message

In Linux Kernel Bug Tracker #196683, cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote on 2018-01-26:

#269

I'm also experiencing what appears to be this same issue, on a ASUS
Prime X370-PRO motherboard and a Ryzen 1800X running Fedora 27 (with no
overclocking and Kingston server RAM with ECC). I've been able to capture
kernel logs through netconsole and, as with other people, they're reported
to run through smp_call_function_many and/or smp_call_function_single
(usually called from TLB flushing code).

Booting my system with 'rcu_nocbs=0-15 processor.max_cstate=5' appears
to have made it stable (again, as with other people).

Revision history for this message

Peridot (peridot) wrote on 2018-02-05:

#73

This is still an issue even with the bionic dailies. The easiest way (read: without recompiling kernels) I have found to get it to work is to disable IOMMU in your BIOS and add "iommu=soft" to the kernel booting options in grub.

Linux can then detect everything properly (all cores) and I've had zero crashes. The only issue is that it's using software IOMMU.

Without these options you will get either crashes or hangs relating to the ACPI table (booting with acpi=off will only show a single core), along with lots of AMD-Vi logged events and IRQ crashes.

I believe this is related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360

I'm on a Ryzen 1800X and Biostar B350GT5.

Revision history for this message

Kai-Heng Feng (kaihengfeng) wrote on 2018-02-05: Re: [Bug 1690085] Re: Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

#74

> On 5 Feb 2018, at 7:21 PM, Peridot <1690085@bugs.launchpad.net> wrote:
> 
> This is still an issue even with the bionic dailies. The easiest way
> (read: without recompiling kernels) I have found to get it to work is to
> disable IOMMU in your BIOS and add "iommu=soft" to the kernel booting
> options in grub.
> 
> linux can then detect everything properly (all cores) and I've had zero
> crashes. The only issue is that it's using software IOMMU.
> 
> Without these options you will either get crashes or hangs relating to
> the ACPI table (booting with acpi=off will only show a single core).
> Lots of AMD-Vi Logged events. and irq crashes.
> 
> I believe this is related to
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360 <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1671360>

This particular bug is about interrupt storm from AMD GPIO driver.
Can you file a separate bug instead?

> 
> I'm on a Ryzen 1800X and Biostar B350GT5.
> 
> -- 
> You received this bug notification because you are subscribed to linux
> in Ubuntu.
> https://bugs.launchpad.net/bugs/1690085
> 
> Title:
>  Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks
> 
> Status in Linux:
>  Unknown
> Status in linux package in Ubuntu:
>  Confirmed
> 
> Bug description:
>  Hi,
> 
> 
>  We aregetting various kernel crash on a pretty new config.
>  We're using Ryzen 1800X CPU with X370 Gaming Pro Carbon MB (7A32V1) using latest BIOS available (1.52)
> 
>  We are running Ubuntu 17.04 (amd64), we've tried different kernel version, native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
>  Tested kernel version:
> 
>  native 17.04 kernel
>  4.10.15
> 
>  Issues are the same, we're getting random freeze on the machine.
> 
>  Here is kern.log entry when happening :
> 
>  May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
>  May 10 22:41:56 dev2 kernel: [24366.187618]     0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
>  May 10 22:41:56 dev2 kernel: [24366.188977]     (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
>  May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
>  May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0       R  running task        0     0      0 0x00000008
>  May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
>  May 10 22:41:56 dev2 kernel: [24366.190354]  ? native_safe_halt+0x6/0x10
>  May 10 22:41:56 dev2 kernel: [24366.190355]  ? default_idle+0x20/0xd0
>  May 10 22:41:56 dev2 kernel: [24366.190358]  ? arch_cpu_idle+0xf/0x20
>  May 10 22:41:56 dev2 kernel: [24366.190360]  ? default_idle_call+0x23/0x30
>  May 10 22:41:56 dev2 kernel: [24366.190362]  ? do_idle+0x16f/0x200
>  May 10 22:41:56 dev2 kernel: [24366.190364]  ? cpu_startup_entry+0x71/0x80
>  May 10 22:41:56 dev2 kernel: [24366.190366]  ? rest_init+0x77/0x80
>  May 10 22:41:56 dev2 kernel: [24366.190368]  ? start_kernel+0x464/0x485
>  May 10 22:41:56 dev2 kernel: [24366.190369]  ? early_idt_handler_array+0x120/0x120
>  May 10 22:41:56 dev2 kernel: [24366.190371]  ? x86_64_start_reservations+0x24/0x26
>  May 10 22:41:56 dev2 kernel: [24366.190372]  ? x86_64_start_kernel+0x14d/0x170
>  May 10 22:41:56 dev2 kernel: [24366.190373]  ? start_cpu+0x14/0x14
>  May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
>  May 10 22:44:56 dev2 kernel: [24546.189461]     0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
>  May 10 22:44:56 dev2 kernel: [24546.190823]     (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
>  May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
>  May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0       R  running task        0     0      0 0x00000008
>  May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
>  May 10 22:44:56 dev2 kernel: [24546.192199]  ? native_safe_halt+0x6/0x10
>  May 10 22:44:56 dev2 kernel: [24546.192201]  ? default_idle+0x20/0xd0
>  May 10 22:44:56 dev2 kernel: [24546.192203]  ? arch_cpu_idle+0xf/0x20
>  May 10 22:44:56 dev2 kernel: [24546.192204]  ? default_idle_call+0x23/0x30
>  May 10 22:44:56 dev2 kernel: [24546.192206]  ? do_idle+0x16f/0x200
>  May 10 22:44:56 dev2 kernel: [24546.192208]  ? cpu_startup_entry+0x71/0x80
>  May 10 22:44:56 dev2 kernel: [24546.192210]  ? rest_init+0x77/0x80
>  May 10 22:44:56 dev2 kernel: [24546.192211]  ? start_kernel+0x464/0x485
>  May 10 22:44:56 dev2 kernel: [24546.192213]  ? early_idt_handler_array+0x120/0x120
>  May 10 22:44:56 dev2 kernel: [24546.192214]  ? x86_64_start_reservations+0x24/0x26
>  May 10 22:44:56 dev2 kernel: [24546.192215]  ? x86_64_start_kernel+0x14d/0x170
>  May 10 22:44:56 dev2 kernel: [24546.192217]  ? start_cpu+0x14/0x14
> 
>  Depending on the kernel version, we've got NMI watchdog errors related to CPU stuck (mentioning the CPU core id, which is random).
>  Crash is happening randomly, but in general after some hours (3-4h).
> 
>  Now, we've installed kernel 4.11.0-041100-generic #201705041534 this morning and waiting for crash...
>  For now, the machine is not "used", at least, it's not CPU stressed...
> 
> 
>  Thanks
>  --- 
>  ApportVersion: 2.20.4-0ubuntu4
>  Architecture: amd64
>  DistroRelease: Ubuntu 17.04
>  InstallationDate: Installed on 2017-05-09 (1 days ago)
>  InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
>  Package: linux (not installed)
>  ProcEnviron:
>   TERM=xterm-256color
>   PATH=(custom, no user)
>   XDG_RUNTIME_DIR=<set>
>   LANG=fr_FR.UTF-8
>   SHELL=/bin/bash
>  Tags:  zesty
>  Uname: Linux 4.11.0-041100-generic x86_64
>  UnreportableReason: The running kernel is not an Ubuntu kernel
>  UpgradeStatus: No upgrade log present (probably fresh install)
>  UserGroups:
> 
>  _MarkForUpload: True
> 
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/linux/+bug/1690085/+subscriptions

Revision history for this message

Peridot (peridot) wrote on 2018-02-05:

#75

Opened the new bug at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1747463

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-02-17:

#270

I have Ryzen 7 1800X, ASUS Prime X370-PRO, with CPU replaced by AMD.

The damn thing has not worked properly since I bought it.

I have: kernel 4.14.18-300.fc27
CONFIG_RCU_NOCB_CPU=y
rcu-nocbs=0-15

I have seen streams of "watchdog: BUG: soft lockup", but mostly the system just stops and I can find no logging that tells me why.

Currently I find the machine has stopped over night (when it is idle), every two or three days.

I have just installed kernel 4.15.3-300.fc27. But I can find nothing to suggest that this fault is understood, let alone fixed.

I have just added processor.max_cstate=5 -- though I have failed to find any way to check what difference that has made.
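
One way to see which idle states the kernel exposes, and whether they are actually entered, is the cpuidle directory in sysfs. A minimal sketch, assuming the usual layout /sys/devices/system/cpu/cpuN/cpuidle/stateM/ with `name`, `usage` and `disable` files; how max_cstate maps onto these states depends on the idle driver in use, and the function name is illustrative.

```python
from pathlib import Path


def list_idle_states(cpu=0, sysfs_root="/sys/devices/system/cpu"):
    """Return (name, usage_count, disabled) for each cpuidle state of one CPU."""
    base = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    state_dirs = sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text())
        disabled = (state_dir / "disable").read_text().strip() == "1"
        states.append((name, usage, disabled))
    return states
```

Comparing the output before and after changing the boot parameter shows whether the list of states changed; a state whose usage counter never increases between two calls is presumably not being entered.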

So: is the "work-around" still the only available solution ?

[FWIW, while the AMD CPU went through the RMA process, I bought an Intel Coffee Lake i7-8700K machine. Will I buy another AMD ? I doubt it.]

Revision history for this message

In Linux Kernel Bug Tracker #196683, johnpaul7 (johnpaul7-linux-kernel-bugs) wrote on 2018-02-17:

#271

There really has to be something else at play here. With my system everything has been stable since finding this thread and setting rcu-nocbs. But my co-worker who built an almost identical system (only differences being 1800X and non QVL memory) is experiencing the problem and close to giving up. My system specs:

Fedora 26
kernel 4.14.18
CONFIG_RCU_NOCB_CPU=y
rcu-nocbs=0-15

Ryzen 7 1700x (week 37)
ASrock Fatal1ty AB350 Gaming-ITX/ac (4.40 bios)
32gb QVL memory running at rated 2933 via profile
Other stuff: AMD GPU, Corsair PSU, Samsung SSD

Bios settings are pretty much stock apart from memory profile, enabling virtualization stuff, and tweaking fan profiles. Nothing tweaked related to C6 (bios or zenstates script).

To help aggregate all our experiences and try to find some patterns, what are everyone's thoughts on starting a Google Form spreadsheet with columns for:
- CPU
- CPU week: yes, not related really but good to know in general
- Mobo: helpful for finding others in "your" situation
- Mobo Chipset: curious if 350 vs 370 shows anything significant. Mobo name would probably contain this, but having a dedicated column to sort is helpful
- Mobo BIOS version
- (boolean) QVL memory: we know Ryzen is sensitive to memory, so would be good data point to collect
- (boolean) CONFIG_RCU_NOCB_CPU
- (boolean) rcu-nocbs: a boolean, since the value depends on the processor and I don't think anyone is mismatching it with their number of threads
- C6 tweaks: "false" or what you did (e.g. bios, zenstates.py, etc)
- Distro
- Kernel
- (boolean) Passes stress test: to help eliminate hardware instability from OC'ing or faulty memory sticks, we can define what qualifies as "true" here
- (linear scale) Stable: from "my life sucks and I can't use the system" to "my life on Ryzen is awesome and I forgot the last time I rebooted" (but still conscious that bugs and crashes exist outside of Ryzen-related issues).
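
As a rough illustration of the proposed columns, one record could look like the structure below. This is a sketch, not the actual form: the field names, the types, and the 1-5 range for the stability scale are assumptions.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class RyzenReport:
    cpu: str
    cpu_week: str
    mobo: str
    mobo_chipset: str
    bios_version: str
    qvl_memory: bool
    config_rcu_nocb_cpu: bool
    rcu_nocbs: bool
    c6_tweaks: str  # "false" or what was done, e.g. "bios" or "zenstates.py"
    distro: str
    kernel: str
    passes_stress_test: bool
    stable: int  # linear scale; the 1-5 range is an assumption


def to_csv(reports):
    """Serialize reports to CSV with one column per field, for sorting in a spreadsheet."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[f.name for f in fields(RyzenReport)])
    writer.writeheader()
    for report in reports:
        writer.writerow(asdict(report))
    return out.getvalue()
```

Keeping the chipset and the boolean workaround flags as separate columns is what makes sorting and filtering for patterns straightforward.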

I want to try and keep it as simple/straightforward as possible so it complements this thread, and helps us see the data in a more structured format. Hopefully we can find a pattern! I also can't help but wonder why more people are not reporting issues. Would be curious if any people on the levelonetechs community can get Wendell & Co. to draw more attention to this. AMD is making great strides for open source and Linux-friendly hardware, but this problem really hurts that effort.

**EDIT** v1 of form is live: https://goo.gl/forms/oGCTPnNK0vJtNntj2

If you want to be a collaborator on the form, message me directly. Also share feedback on how to make the form better suited for our goal!

Revision history for this message

John Paul Herold (dailyherold) wrote on 2018-02-17:

#76

Been following this thread along with https://bugzilla.kernel.org/show_bug.cgi?id=196683. To help us aggregate data and look for patterns, I created a google form: https://goo.gl/forms/oGCTPnNK0vJtNntj2

Message me if you want to be a collaborator on it, and also please share feedback. Goal is to keep it simple and straightforward so that it complements this thread (and kernel bugzilla thread). I could have had questions for every workaround listed across both threads, but tried to distill it to the ones most commonly stated/used/discussed.

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2018-02-17:

#272

(In reply to Chris Hall from comment #185)
> I have Ryzen 7 1800X, ASUS Prime X370-PRO, with CPU replaced by AMD.
>
> The damn thing has not worked properly since I bought it.
>
> I have: kernel 4.14.18-300.fc27
> CONFIG_RCU_NOCB_CPU=y
> rcu-nocbs=0-15
>
> I have seen streams of "watchdog: BUG: soft lockup", but mostly the system
> just stops and I can find no logging that tells me why.
>
> Currently I find the machine has stopped over night (when it is idle), every
> two or three days.
>
> I have just installed kernel 4.15.3-300.fc27. But I can find nothing to
> suggest that this fault is understood, let alone fixed.
>
> I have just added processor.max_cstate=5 -- though I have failed to find any
> way to check what difference that has made.
>
> So: is the "work-around" still the only available solution ?
>
> [FWIW, while the AMD CPU went through the RMA process, I bought an Intel
> Coffee Lake i7-8700K machine. Will I buy another AMD ? I doubt it.]

Run the python script mentioned earlier.

For me, the new kernel/params with the RCU setting enabled, plus the python script to turn off C-states, have kept my machine running for 60 days now.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bluescarni (bluescarni-linux-kernel-bugs) wrote on 2018-02-18:

#273

(In reply to eric.c.morgan from comment #187)
> Run the python script mentioned earlier.

+1, disabling C6 has resulted in a rock-solid experience over here, no idle reboots in months of constant use of the machine.

Just wanted to mention a couple of points for those still following this thread.

I recently updated my motherboard's BIOS to the latest version, which contains a CPU microcode update from AMD (I am now on version "0x08001136", whereas I was stuck on version "0x08001129" for - I believe - 6 months or so). I had high hopes a microcode update would fix the idle reboots, but, alas, they still persist with the new microcode and C6 enabled.

The other interesting tidbit is this article from 1 year ago I found:

https://www.gamersnexus.net/news-pc/2870-ryzen-power-plan-update-min-frequency-90-pct

Apparently, soon after the Ryzen launch last year, AMD issued an updated version of the Windows "Balanced" preset power plan which disables "core parking". Doing a bit of research, it seems like core parking is synonymous with (or at least strongly related to) the C6 power state. Might this explain why people on Windows are not experiencing random reboots? I am not 100% sure how these Windows preset power plans work, but the page linked above seems to indicate that:

1) after the update, the "Balanced" power plan disables core parking,
2) core parking was always disabled anyway in the "Performance" power plan.

See also this thread on the AMD community forum:

https://community.amd.com/thread/220563

Maybe AMD knew all along there are issues with the C6 power state...

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-02-18:

#274

Nice work John-Paul. I've just put in my entry. I'm curious to see whether *anyone* has experienced these freezes when C6 is disabled. If not then I think we can be fairly sure the culprit has been found.

My own stress-ng approach is also still holding strong. Both servers have been rock solid since early December, with C6 *on* and no kernel options. This adds to the evidence that prolonged full idle is the trigger for this issue.

This was brought up before, but can we get onto an AMD engineer with regards to this? It'd be particularly interesting to ask about Francesco's hypothesis.

Revision history for this message

In Linux Kernel Bug Tracker #196683, johnpaul7 (johnpaul7-linux-kernel-bugs) wrote on 2018-02-19:

#275

I enabled the setting so you can see charts and summary of data after submission, but for those who missed that or just want a direct link to the spreadsheet, here it is: https://docs.google.com/spreadsheets/d/1PuIkhxbxdE2H7fjbmfWYKBJI_iw8A6W1h-SZnQUXBSI/edit?usp=sharing

Pretty interesting stuff, only 7 responses total so far so sample size way too small to draw conclusions. However, initial notes I find interesting:

- 2 of 7 still have stability issues, and one of them has a >1725 chip (1744). Again, this issue is not related to the segfault stuff, but it is a good data point that newer steppings still show the behavior.
- The only two responses that aren't using QVL-listed memory are the two still having issues. Both happen to be ECC memory as well. This may or may not be relevant, because these two do not have any RCU tweaks in place.
- Of the 5 who have stable systems, the CPU models vary, the chipsets vary, BIOS and AGESA vary from new to old.
- The only constant from 5 "stable" systems is "CONFIG_RCU_NOCB_CPU=y". Whether C-States was tweaked, or rcu-nocbs, or ASLR, there are mixed responses.
- Good on everyone for testing system/hardware stability!

Thank you to those who shared! Let's get more data!

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-02-19:

#276

John-Paul, I'd like to note that I entered that I still have issues because with stress-ng not running to create artificial load, the problem persists. As in my previous comment, with a single thread at 100% 24/7 I do *not* have issues.

I originally did this to provide more data: keeping the system out of full idle also solves the problem, without BIOS tweaks or custom kernel options. n=2 for this result.
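
For anyone who wants to try the "never fully idle" idea without installing stress-ng, here is a minimal sketch of a single-threaded busy loop. It is only a stand-in: the function name `keep_busy` is made up for illustration, and nobody in this thread has reported testing whether a Python loop like this avoids the freeze the way stress-ng does.

```python
import time


def keep_busy(seconds: float) -> int:
    """Spin on one core for roughly `seconds`; return the loop count."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1  # trivial work so this core never goes idle
    return iterations


if __name__ == "__main__":
    # Short demo run; a real workaround would keep looping while the box is up.
    print(keep_busy(0.1))
```

Note that this deliberately burns power on one core, so it trades the idle savings for stability, just like the stress-ng approach.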

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-02-19:

#277

OK...

[]# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 root=... ro rhgb quiet rcu_nocbs=0-15 processor.max_cstate=5

  []# python zenstates.py -l
  P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
  P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
  P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
  P3 - Disabled
  P4 - Disabled
  P5 - Disabled
  P6 - Disabled
  P7 - Disabled
  C6 State - Package - Enabled
  C6 State - Core - Enabled

So... kernel command line option "processor.max_cstate=5" does not appear to disable C6. [And removing the option did not seem to change anything, either.]
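
One way to see what the kernel itself exposes as idle states is the cpuidle directory in sysfs. The sketch below lists the states for one CPU; the function name is hypothetical, and the names shown there are the OS-visible idle states, which may not map one-to-one onto the hardware C6 state that zenstates.py reports.

```python
from pathlib import Path


def list_cpuidle_states(cpu_dir):
    """Return [(state_dir, name, disabled)] for one CPU's cpuidle directory."""
    root = Path(cpu_dir)
    if not root.is_dir():
        return []  # cpuidle not available on this system
    states = []
    for state in sorted(root.glob("state[0-9]*")):
        name = (state / "name").read_text().strip()
        disable_file = state / "disable"
        disabled = disable_file.exists() and disable_file.read_text().strip() == "1"
        states.append((state.name, name, disabled))
    return states


if __name__ == "__main__":
    for entry in list_cpuidle_states("/sys/devices/system/cpu/cpu0/cpuidle"):
        print(entry)
```

Comparing the output with and without processor.max_cstate on the command line would at least show whether the option changes what the kernel offers.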

But...

[]# python zenstates.py --c6-disable
Disabling C6 state

  []# python zenstates.py -l
  P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore = 1.35000
  P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore = 1.27500
  P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
  P3 - Disabled
  P4 - Disabled
  P5 - Disabled
  P6 - Disabled
  P7 - Disabled
  C6 State - Package - Disabled
  C6 State - Core - Disabled
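
If you want to check the C6 lines from a script rather than by eye, here is a rough sketch that parses the listing format shown above. It assumes the output looks exactly like these lines ("C6 State - Package - Enabled" and so on); the helper name `c6_status` is made up.

```python
def c6_status(listing: str) -> dict:
    """Map 'Package'/'Core' to True (enabled) or False (disabled)."""
    status = {}
    for line in listing.splitlines():
        parts = [p.strip() for p in line.split(" - ")]
        # Only lines of the form "C6 State - <scope> - <Enabled|Disabled>"
        if len(parts) == 3 and parts[0] == "C6 State":
            status[parts[1]] = parts[2] == "Enabled"
    return status


if __name__ == "__main__":
    sample = """\
  P3 - Disabled
  C6 State - Package - Disabled
  C6 State - Core - Disabled
"""
    print(c6_status(sample))
```

In practice the listing would come from capturing the output of "python zenstates.py -l", for example with subprocess.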

So... having added "python zenstates.py --c6-disable" to rc.local (eventually), I now appear to have C6 disabled when the system is rebooted.

Do I have to do anything else to make sure that C6 *stays* disabled ?

Chris

Revision history for this message

In Linux Kernel Bug Tracker #196683, 05497894 (05497894-linux-kernel-bugs) wrote on 2018-02-19:

#278

(In reply to Chris Hall from comment #192)
> OK...
>
> []# cat /proc/cmdline
> BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 root=... ro rhgb quiet
> rcu_nocbs=0-15 processor.max_cstate=5
>
> []# python zenstates.py -l
> P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
>
> So... kernel command line option "processor.max_cstate=5" does not appear to
> disable C6. [And removing the option did not seem to change anything,
> either.]
>
> But...
>
> []# python zenstates.py --c6-disable
> Disabling C6 state
>
> []# python zenstates.py -l
> P0 - Enabled - FID = 90 - DID = 8 - VID = 20 - Ratio = 36.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 80 - DID = 8 - VID = 2C - Ratio = 32.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Disabled
> C6 State - Core - Disabled
>
> So... having added "python zenstates.py --c6-disable" to rc.local
> (eventually), I now appear to have C6 disabled when the system is rebooted.
>
> Do I have to do anything else to make sure that C6 *stays* disabled ?
>
> Chris

If you suspend your system, make sure you run zenstates again upon resume. Otherwise you should be ok.
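
A sketch of how the "run it again on resume" part could be automated on a systemd-based distro: systemd-sleep runs executables from its system-sleep directory (commonly /usr/lib/systemd/system-sleep/) and passes "pre" or "post" as the first argument. The path to zenstates.py below is an assumption, so adjust it to wherever you keep the script.

```python
#!/usr/bin/env python3
import subprocess
import sys

# Hypothetical install location of the script; adjust as needed.
ZENSTATES = "/usr/local/bin/zenstates.py"


def resume_command(argv):
    """Return the command to run for this hook invocation, or None."""
    # systemd-sleep passes "pre" before sleeping and "post" after waking.
    if len(argv) >= 2 and argv[1] == "post":
        return ["python", ZENSTATES, "--c6-disable"]
    return None


if __name__ == "__main__":
    command = resume_command(sys.argv)
    if command is not None:
        # The hook runs as root, which zenstates.py needs for MSR access.
        subprocess.run(command, check=False)
```

The file has to be executable to be picked up. The rc.local entry (or a boot-time service) is still needed for the cold-boot case.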

Revision history for this message

In Linux Kernel Bug Tracker #196683, it (it-linux-kernel-bugs) wrote on 2018-02-23:

#279

I want to share my findings. Today I googled for "Ryzen C6 state" and found this reddit thread: https://www.reddit.com/r/Amd/comments/7ita4h/why_its_suggested_to_disable_c6_and_cstate_global/

The last comment regarding DRAM "Power Down Enable" and instabilities combined with C6 caught my interest. So I searched the UEFI of my ASRock AB350 Pro4 and found that option, but it was already disabled. Still, it is perhaps worth looking into if you experience idle freezes.

But I found another interesting option called "Power Supply Idle Control". Didn't find much documentation about it except: https://lime-technology.com/forums/topic/61767-amd-ryzen-update/?tab=comments#comment-607258

As was mentioned before, AMD claims that the idle freezes are related to the PSU. I don't think my PSU is unable to support that low power mode, but here http://forum.asrock.com/forum_posts.asp?TID=6832&title=ramocvoltagecantlowerthanxmpvoltageb350pro4-420
someone states that the "Typical current Idle" setting changes the idle voltage of the cores. Perhaps that might make a difference?
Has anyone tried that?

I've set it to "Typical current Idle" for now and we'll see. 3h of idle uptime so far - not bad :-)

Oh, and I was a little bit astonished to see my AGESA version shown in the UEFI as 1.0.0.1a. In case anyone else wonders, it seems that AMD has reset the versioning for whatever reason.

Revision history for this message

Jamie Davis (sichemist) wrote on 2018-02-24:

#77

I had the same issues this bug describes: Freeze on low load; Runs fine on heavy load. I even ran stress -c 16 when I had to keep the system up for long periods of low load.

I tried pretty much everything in this list including compiling a mainline kernel with CONFIG_RCU_NOCB_CPU enabled and the respective GRUB changes without success.
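
For anyone unsure whether both halves of the RCU workaround are actually in effect, a small sketch that checks a kernel config text for CONFIG_RCU_NOCB_CPU=y and a command line for rcu_nocbs. The /boot/config-<release> location is an assumption that holds on many distros but not all, and both helper names are made up.

```python
import platform
from pathlib import Path


def config_enabled(config_text, option):
    """True if `option=y` appears as a line in kernel config text."""
    return any(line.strip() == f"{option}=y" for line in config_text.splitlines())


def cmdline_param(cmdline, name):
    """Return the value of `name=value` on the kernel command line, or None."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key == name:
            return value
    return None


if __name__ == "__main__":
    cmdline = Path("/proc/cmdline")
    # Assumed config location; some distros only provide /proc/config.gz.
    config = Path(f"/boot/config-{platform.release()}")
    try:
        if cmdline.exists():
            print("rcu_nocbs:", cmdline_param(cmdline.read_text(), "rcu_nocbs"))
        if config.exists():
            print("CONFIG_RCU_NOCB_CPU=y:",
                  config_enabled(config.read_text(), "CONFIG_RCU_NOCB_CPU"))
    except OSError as exc:
        print("could not read:", exc)
```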

I noticed a problem in /var/log/syslog involving the ath9k driver.
I disabled the driver (rmmod ath9k) and the problems went away.

I had a PCIe wireless card I swapped in from an earlier Intel Core i5 build that used that driver. After disabling ath9k, all MY problems went away. I replaced it with a different PCIe wireless adapter that has an Intel chipset and my system has been stable ever since. I changed back all the settings that I had altered (except C6) and everything works.

The old Ath9k PCIe card is stable in the Intel system I moved it back to.

I'm only posting here because my bug had the exact symptoms as the ones in this thread but a completely different cause/solution.

Revision history for this message

In Linux Kernel Bug Tracker #196683, c.buddeweg (c.buddeweg-linux-kernel-bugs) wrote on 2018-02-25:

#280

(In reply to John-Paul Herold from comment #186)

Linux version 4.15.5-gentoo (gcc version 7.3.0 (Gentoo 7.3.0 p1.0)) #1 SMP Sat Feb 24 08:35:43 CET 2018
Command line: BOOT_IMAGE=/kernel-4.15.5-prod rcu_nocbs=0-15 root=/dev/sdb5 ro

Boot Mode is legacy - no tweaks, C6 is on

Hardware R7 1700 (RMA'ed because of segfault problem)
GA-AX370-Gaming K7 Bios F22b
32 GB RAM no QVL

The lockup is reproducible after a long idle/sleep (more than 30 min). I didn't find anything in the logs.

With CONFIG_RCU_NOCB_CPU=n I had the lockups. Now I use CONFIG_RCU_NOCB_CPU=y and the command line above, and it has been stable for 3 days.

I use the "ondemand" CPU governor and the frequency range is from 1400 MHz up to 3100 MHz. It's wider than the "AMD balanced power plan for Ryzen" that you can see for Win10.
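
To compare governor and frequency range between systems, something like the following sketch can read them from the cpufreq directory in sysfs (the values there are in kHz). The function name is hypothetical and it only looks at one CPU.

```python
from pathlib import Path


def read_cpufreq(cpufreq_dir):
    """Return (governor, min_mhz, max_mhz) from a cpufreq sysfs directory."""
    base = Path(cpufreq_dir)
    governor = (base / "scaling_governor").read_text().strip()
    # sysfs reports frequencies in kHz
    min_mhz = int((base / "scaling_min_freq").read_text()) // 1000
    max_mhz = int((base / "scaling_max_freq").read_text()) // 1000
    return governor, min_mhz, max_mhz


if __name__ == "__main__":
    cpu0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")
    if cpu0.is_dir():
        print(read_cpufreq(cpu0))
```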

Revision history for this message

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2018-02-25:

#281

Question, as I'm curious what the impact is: how much more power would, say, a Ryzen 1800X use at idle or under light use (like desktop or browsing the internet) with these 'fixes' enabled?

Revision history for this message

In Linux Kernel Bug Tracker #196683, it (it-linux-kernel-bugs) wrote on 2018-03-02:

#282

FYI: I just passed 1 week uptime without any freeze. I never before managed to have such a long uptime. So the "Power Supply Idle Control" option in BIOS seems to really make a difference.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel.org (kernel.org-linux-kernel-bugs) wrote on 2018-03-02:

#283

Having an uptime of 5 days now. Seems to be stable.
Was down to ~4-6h between lockups.

Running successfully with:
rcu_nocbs=0-11
zenstates.py --c6-disable
On
4.13.0-36-generic (HWE kernel)
Ubuntu 16.04.4 LTS
AMD Ryzen 5 1600X
Asus PRIME X370-PRO (Bios 3803)

I'm not sure if I would have needed the RCU. C6 disable was a must.
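
Since the rcu_nocbs range differs per CPU (0-11 here on a 1600X, 0-15 on the 8-core/16-thread parts mentioned elsewhere in this thread), here is a tiny sketch that builds the argument from the logical CPU count. The helper name is made up, and it assumes you want to cover every logical CPU, as the examples in this thread do.

```python
import os


def rcu_nocbs_arg(logical_cpus):
    """Build the boot argument covering CPUs 0..logical_cpus-1."""
    if logical_cpus < 1:
        raise ValueError("need at least one CPU")
    return f"rcu_nocbs=0-{logical_cpus - 1}"


if __name__ == "__main__":
    # os.cpu_count() counts logical CPUs (threads), not physical cores.
    print(rcu_nocbs_arg(os.cpu_count() or 1))
```

Keep in mind that if SMT is disabled in the BIOS the logical CPU count drops, so the range would change accordingly.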

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2018-03-02:

#284

(In reply to rtux from comment #198)
> Having an uptime of 5 days now. Seems to be stable.
> Was down to ~4-6h between lockups.
>
> Running successfully with:
> rcu_nocbs=0-11
> zenstates.py --c6-disable
> On
> 4.13.0-36-generic (HWE kernel)
> Ubuntu 16.04.4 LTS
> AMD Ryzen 5 1600X
> Asus PRIME X370-PRO (Bios 3803)
>
> I'm not sure if I would have needed the RCU. C6 disable was a must.

RCU did not fix my issue, C6 disabling script did. I'm at 75 days uptime.

Revision history for this message

In Linux Kernel Bug Tracker #196683, excieve (excieve-linux-kernel-bugs) wrote on 2018-03-03:

#285

I think it might be solved for me by the latest ASRock X370 Gaming K4 firmware.

So, I have an RMA'd Ryzen 1700X, which had this freezing problem since receiving it. The workaround I applied was disabling the C6 state in BIOS, which stopped this. After installing BIOS 4.50 (currently latest) I noticed there's no such option any more so I started using the zenstates script on boot to do this. However, a week ago I was reading more on this issue and the "Power Supply Idle Control" option in the BIOS. So I decided to stop disabling C6 while keeping the option at default (where it should be adaptive) and see how it goes. During this week I tried leaving idle over night, whole day, using as usual, causing sustained high load, etc. and there's been no freezing so far.

Might be just luck of course, but before, with C6 enabled, it would freeze quite reliably even with normal daily usage. Or perhaps it's just a lot less common now.

Revision history for this message

In Linux Kernel Bug Tracker #196683, dagecko (dagecko-linux-kernel-bugs) wrote on 2018-03-04:

#286

Good evening,

I was facing the same issues most people are facing here: the so-called AMD Ryzen soft-lock bug, happening when the CPU was idling. Digging on the Internet, I found some interesting information on how to avoid having to disable C-states, which is very good for CPU life expectancy and energy saving.
It is all about tricking the component voltage and frequency. For interested people, this is described here: http://www.silence.host/node/1

To Artem Hluvchynskyi, please let us know more about this. As guessed by some, this definitely looks to be a power issue, and not a kernel one.

Revision history for this message

In Linux Kernel Bug Tracker #196683, excieve (excieve-linux-kernel-bugs) wrote on 2018-03-07:

#287

Unfortunately just got a freeze again after 1.5 weeks. So I ended up switching "Power Supply Idle Control" to "Typical", and apparently what it actually does is disable the C6 state on the CPU package. At least that's what the MSR shows (through the zenstates script), and I can't see Vcore going lower than 0.8V, so it does indeed look like C6 is not being reached. This should make it stable but it really sucks. I will check later whether the newest firmware (released in the last few days) changes anything and how the RCU offload workaround affects stability.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kallisti5 (kallisti5-linux-kernel-bugs) wrote on 2018-03-11:

#288

Same issue here for a while on Fedora 27. System seems to hard-lock when idle almost daily. (I've never had a lockup while "in-use")

Ryzen 1800X
PRIME X370-PRO + Latest BIOS
32GB Ram
amd_iommu=on
I noticed the issue completely stopped ~4.15.3, but definitely has returned on 4.15.6

I'm trying the "rcu_nocbs=0-15" fix since it has been driving me nuts.

Revision history for this message

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2018-03-11:

#289

Is anyone here planning on getting a Ryzen 2 a month or so from now? With or without a new x470/b450 mainboard? I'm curious if AMD, even if they haven't really acknowledged the issue, has actually worked on the problem and fixed it in the new CPU or mainboards.

Revision history for this message

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2018-03-12:

#290

> I'm trying the "rcu_nocbs=0-15" fix since it has been driving me nuts.
This may not solve it for you. I use rcu_nocbs=0-15 and I made systemd services that run on boot and on return from suspend, which run the script attached to this bug https://bugzilla.kernel.org/attachment.cgi?id=273573 that disables C6. A combination of those two solves it on both my boxes; the one with an ASUS board hung regularly with just rcu_nocbs=0-15 but works fine with both.

Unrelated, it's interesting that AMD is very silent about this total scandal of a bug in their system. I won't be buying a new Ryzen 2 when it's released (not that I would anyway, the current one is fast enough). Intel does not have this problem. It is further a total outrage that AMD has been trying to blame this on PSUs. If two brand new high-end PSUs from two different brands result in the exact same problem then it's clearly AMD's fault.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kallisti5 (kallisti5-linux-kernel-bugs) wrote on 2018-03-12:

#291

Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in this thread.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-12:

#292

(In reply to Alexander von Gluck from comment #206)
> Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in
> this thread.

I don't think the rest of us without ASUS boards are imagining this issue.

Revision history for this message

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2018-03-12:

#293

(In reply to Alexander von Gluck from comment #206)
> Are we sure this isn't an Asus BIOS bug?
Yes. I have two 1600Xs, one on an Asus board and the other on a Gigabyte board. Switching PSU/RAM/GPU doesn't change anything; both have the idle bug in every combination. This total scandal is all AMD's fault.

Revision history for this message

In Linux Kernel Bug Tracker #196683, dagecko (dagecko-linux-kernel-bugs) wrote on 2018-03-13:

#294

Asus is AMD's main partner for Ryzen chipsets. AMD works
first with them, and when they decide a fix is stable, they propagate the
BIOS updates to other vendors, e.g. MSI or Gigabyte.

The issue _is_ a power issue (not a PSU issue). Giving a bit more power
to the chipset, CPU and memory generally solves the bug. The bug also
existed in a different form on Windows, and exists with the exact same
form on BSD.

Please, consider this (and this is something different from disabling
C-state). Google, or consider reading this: http://www.silence.host/node/1

On 12/03/2018 at 12:43, <email address hidden> wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=196683
>
> --- Comment #207 from James Le Cuirot (<email address hidden>) ---
> (In reply to Alexander von Gluck from comment #206)
>> Are we sure this isn't an Asus BIOS bug? I see a *lot* of Asus boards in
>> this thread.
> I don't think the rest of us without ASUS boards are imagining this issue.
>

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-03-14:

#295

Is anyone from AMD aware of this at all?

I don't have a Ryzen system but I support someone who does and this appears to be going on for them too.

Anyone here, or anyone who someone knows here, got contacts within AMD to make them aware of the issue?

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-14:

#296

(In reply to OptionalRealName from comment #210)
> Is anyone from AMD aware of this at all?

They know, they're just not admitting it publicly.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-03-14:

#297

(In reply to OptionalRealName from comment #210)
> Is anyone from AMD aware of this at all?

I opened a ticket with them on 25-Feb-2018.

I received a response on 28-Feb-2018, to which I replied as follows:

[start_______________________________________________________________________
On 28/02/18 10:51, TECH.SUPPORT@AMD.COM wrote:
> Dear Chris,
>
> Your service request : SR #{ticketno:[8200794428]} has been reviewed and 
> updated.
>
> Response and Service Request History:
>
> Thank you for the response.
>
> I understand you are experiencing an issue on your PC with Ryzen processor
> when C6 state is enabled on the BIOS.

Yes, my PC freezes at random when it is idle.  Typically it will freeze when left overnight, roughly every 2 or 3 days.

Having disabled C6 -- using the 'zenstates.py' script, from https://github.com/r4m0n/ZenStates-Linux -- my machine has not frozen for 7 days.

> This issue has been fixed with the latest BIOS updates, but the option to fix
> it may not be available in all BIOS.

What is the root cause of the issue ?

In what way has it been fixed ?

> I request you to update to the latest BIOS and see if you have the Power 
> Supply Control option in the MB BIOS. Try toggling this option between 
> the different settings to see if it fixes it. If the specific option is 
> not available I would suggest you keep C6 off for now.

I have the latest available "PRIME X370-PRO BIOS 3803" from ASUS.  That apparently includes:

2.Update to AGESA 1000a for new upcoming processors

I understand that means AGESA 1.0.0.0a (?) -- I have no idea what that means, since AMD seem to keep the release notes for AGESA as a deep, dark secret. Previous BIOSes had (according to ASUS) "AGESA 1071", and before that were "AGESA 1.0.0.6B" and "AGESA 1.0.0.6a"... so I admit to being baffled.

What does this new BIOS "Power Supply Option" do ?

Are you telling me that this is a problem with my power supply ?

If so, does this mean I need a better power supply ?

Disabling C6 is not really a long term solution... since that disables both (a) the maximum single core performance, and (b) the minimum power consumption state.  While these are arguably marginal, I have wasted a lot of time and energy trying to get my machine to work reliably.

I am seriously disappointed that the only information available is buried in kernel bug report(s) and in various support forums.

Having (eventually) found Linux Kernel Bug 196683, I have been hoping that AMD would leap into action to: (a) inject some clarity into the discussion, and (b) provide a proper solution.

<sigh>

For completeness, let me repeat my questions:

1) What is the root cause of the issue ?

2) In what way has it been fixed ?

3) What does this new BIOS "Power Supply Option" do ?

4) Are you telling me that this is a problem with my power supply ?

5) If so, does this mean I need a better power supply ?
__________________________________________________________________________end]

I prompted them on 12-Mar-2018, and today (14-Mar-2018) have received:

[start_______________________________________________________________________

Your service request : SR #{ticketno:[8200794428]} has been reviewed and updated.

Response and Service Request History:

Thanks for getting back to me, i really appreciate your patience.

This issue is related to the power supply.  Most PC power supplies (PSUs) are designed to handle a wide range of power consumption from your PC components, but not all PSUs are created equal.

Because of this, there are some rare conditions where the power draw of an efficient PC does not meet the minimum power consumption requirements for one or more circuits inside some PSUs.

This scenario (called “minimal loading supply”) can cause such PSUs to output poor quality power, or shut off entirely.

To prevent this issue from happening, it is important to ensure that the power supply supports 0A minimum load on the +12V circuit. These PSUs became commonplace starting in 2013 for the Intel “Haswell” platform.

This specification can be found printed on the sticker affixed to most PSUs, or it may be available on the manufacturer’s website. If you cannot locate this information related to your PSU, you will need to contact the manufacturer directly.

If you have not already, please download and update to the latest Motherboard bios and check to see if you have this new bios option and let me know the outcome of toggling this option on and off, and between the various preset settings. 
__________________________________________________________________________end]

I admit that the PSU I was using was (a) old and (b) cheap.  I have now replaced that by a brand-new, more efficient PSU which supports 0A loads across all voltages.  I am running that with C6 enabled, and await developments.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-14:

#298

(In reply to Chris Hall from comment #212)
> I admit that the PSU I was using was (a) old and (b) cheap. I have now
> replaced that by a brand-new, more efficient PSU which supports 0A loads
> across all voltages. I am running that with C6 enabled, and await
> developments.

Thanks for putting the pressure on them. I also contacted support but didn't press them any further. Your new PSU probably won't help though, we discussed the various PSUs that people have further up this page and most, like mine, are good quality and not that old.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-14:

#299

(In reply to Chris Hall from comment #212)
> I admit that the PSU I was using was (a) old and (b) cheap. I have now
> replaced that by a brand-new, more efficient PSU which supports 0A loads
> across all voltages. I am running that with C6 enabled, and await
> developments.

Could you please tell which PSU you're now testing?

Revision history for this message

In Linux Kernel Bug Tracker #196683, cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote on 2018-03-14:

#300

My Ryzen 1800X / Asus Prime X370-Pro based system that experiences this
has an EVGA SuperNova G3 550 watt PSU. Measured power usage is well
under its maximum rating, and while I can't find an explicit statement
in its documentation about 0A loads, I would be stunned if it doesn't
fully support that, since it's a quite modern and well regarded model
(and other G3 models seem to be reported as Haswell-ready in this way).

(More broadly, if AMD Ryzen CPUs and motherboards don't work reliably
with modern high quality PSUs, there is a problem in practice for people
who want to have reliable Ryzen-based Linux systems.)

Revision history for this message

In Linux Kernel Bug Tracker #196683, kallisti5 (kallisti5-linux-kernel-bugs) wrote on 2018-03-14:

#301

There's a lot of AMD bashing in this thread. I'm still inclined to believe that it's just bugs to work out with a new major CPU changeup. The same kind of bugs bubbled up when Athlon 64 + Opteron came out.

Beyond the idle bug (which does indeed seem fixed so far by adding the rcu_nocbs boot flag), performance seems fine under Fedora 27.

Let me ping my AMD graphics contacts and see if they're aware of this ticket.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-03-14:

#302

(In reply to Klaus Mueller from comment #214)
> (In reply to Chris Hall from comment #212)
> > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > across all voltages. I am running that with C6 enabled, and await
> > developments.

> Could you please tell which PSU you're now testing?

It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-14:

#303

(In reply to Chris Hall from comment #217)
> (In reply to Klaus Mueller from comment #214)
> > (In reply to Chris Hall from comment #212)
> > > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > > across all voltages. I am running that with C6 enabled, and await
> > > developments.
>
> > Could you please tell which PSU you're now testing?
>
> It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011. I'm facing problems, which can be worked around here by using the "optimized mode for daily computing" setting (Asus Prime X370 Pro) and nothing more. I'm curious about your test result. According to be quiet!, your PSU should be 0A-stable (compliant with Intel C6/C7).

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-03-14:

#304

(In reply to Klaus Mueller from comment #218)
>
> Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011.
> I'm facing problems, which can be worked around here by using the optimized
> mode for daily computing (Asus Prime X370 pro) and nothing else more. I'm
> curious about your test result. According be quiet!, your PSU should be 0A -
> stable (compliant to intel C6/C7).

FWIW, I have looked at a number of reputable PSU suppliers, and found absolutely nothing else which explicitly covers the minimum current.

However, I find that ATX12V v2.4 is included in Revision 1.31 of Intel's 'Design Guide for Desktop Platform Form Factors', published in April 2013. [https://www.intel.com/content/dam/www/public/us/en/documents/guides/power-supply-design-guide.pdf] This appears to be the latest revision, and I note:

* revision history says Revision 1.31 includes:

* Changed 3.2.10 12 V2DC Minimum Loading to REQUIRED

* Updated ... ATX12V Specific Guidelines to version 2.4

* section 3.2.10 seems to:

* REQUIRE +12 V2DC to operate down to 0.05A

* and RECOMMEND that it operates down to 0A

So... I guess that a PSU which claims ATX12V v2.4 compliance should be OK ?

Revision history for this message

In Linux Kernel Bug Tracker #196683, 2rb0alex (2rb0alex-linux-kernel-bugs) wrote on 2018-03-14:

#305

Hi,
ASRock AB350 Pro4 + Be quiet! Pure Power 10 CM owner here.
Still having issues (got hit by this bug during "pacman -Syu" update last week).
The only thing that works for me is disabling C6.
Could AMD provide a reference PC spec or PSU QVL or something?
I've found no PSU recommendations from AMD except this:
https://support.amd.com/en-us/recommended/power-supplies

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel.org (kernel.org-linux-kernel-bugs) wrote on 2018-03-14:

#306

(In reply to Klaus Mueller from comment #218)
> Actually, I'm running a "be quiet! Straight Power E8 400W". It's from 2011.
> I'm facing problems, which can be worked around here by using the optimized
> mode for daily computing (Asus Prime X370 pro) and nothing else more. I'm
> curious about your test result. According be quiet!, your PSU should be 0A -
> stable (compliant to intel C6/C7).

I'm running a new beQuiet! System Power 8 with 400W (Jan 2018) which supports latest processor generations from Intel and AMD.

I had to disable C6 to get it stable - see comment #198 for details.
Current uptime is 17 days now.

Still waiting for an Asus BIOS fix, so that I can re-enable C6 again.

Revision history for this message

In Linux Kernel Bug Tracker #196683, johnpaul7 (johnpaul7-linux-kernel-bugs) wrote on 2018-03-15:

#307

I'm running a [Corsair SF450](https://www.corsair.com/us/en/Power/Plug-Type/sf-series-psu-config/p/CP-9020104-NA), which under "Tech Specs" lists the ATX12V version as v2.4. I had the idle freeze issue until I added `rcu_nocbs=0-15`. Haven't had to do anything else specific in BIOS or other tweaks.

I have modified the google form to include a question for validating the ATX12V version, just to see if anything trends from that.
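For anyone checking whether the `rcu_nocbs=0-15` parameter actually reached the running kernel, here is a minimal Python sketch. It assumes the command line is read from `/proc/cmdline`; the function names are illustrative only. It handles plain lists and ranges such as `0-15` or `0-3,8`, and does not handle the other CPU-list forms the kernel accepts.

```python
def parse_cpu_list(spec):
    """Expand a simple CPU list such as "0-3,8" into a sorted list of ints."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def rcu_nocbs_cpus(cmdline):
    """Return the CPUs named by rcu_nocbs= on a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("rcu_nocbs="):
            return parse_cpu_list(token.split("=", 1)[1])
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

On a live system, `rcu_nocbs_cpus(read_cmdline())` returning `None` would mean the parameter is missing from the boot entry.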

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-15:

#308

AMD blaming the PSU is wrong, since idle freezes have been documented with various PSUs, new and old. If you read my older posts, I did a complete swap of parts between two systems: one exhibiting the freeze-on-idle issue and one that didn't. The system not exhibiting the freeze was using a really old (pre-Haswell) PSU. Putting this PSU in the other system didn't prevent the idle freeze.

The problem is related to power, and I have actually "hidden" the problem by overclocking my CPU. Hitting certain voltages/frequencies avoids the issue completely (at least in my case with an ASUS X370 Prime). But even these voltages/frequencies differ from one case to another. Another user here reported that he "hides" the problem by using the motherboard's automatic overclocking (the "optimized mode for daily computing" option), which actually overclocks to 3600 if I remember well. For me this didn't work; I still had idle freezes. But the idle freezes disappeared when I overclocked to 3900.

The rcu option doesn't fix the problem for everyone. It changes the way cores go idle, so some see the problem as fixed and some don't. It is more than obvious that this is not a kernel problem, so I would close this report as invalid. It is not a kernel bug. It is a hardware issue that AMD needs to fix, and it seems it has been fixed with this new "Power Supply Option". Unfortunately my motherboard doesn't have this option yet.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kallisti5 (kallisti5-linux-kernel-bugs) wrote on 2018-03-16:

#309

I opened a support case to AMD, and buried in several basic steps (what GPU, etc) was "disable C6 in BIOS".

So, it sounds like AMD is definitely aware of the problem, since disabling C6 was in their first support communication.

I have a 1000 Watt Corsair HX1000i and see the idle hang-ups, so definitely *NOT* a power supply issue.

I wonder if windows users see this issue?

(In reply to Panagiotis Malakoudis from comment #223)
> rcu option doesn't fix the problem for all.

For the RCU fix to work, your distro also has to have kernel support for it (CONFIG_RCU_NOCB_CPU). See this article:

http://textandhubris.com/2017/12/09/ryzen-issues-on-fedora-27/
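To check the point about kernel support, one can look for `CONFIG_RCU_NOCB_CPU` in the running kernel's build configuration. The sketch below is only an illustration: it assumes the config is available at `/boot/config-<release>` or `/proc/config.gz`, which depends on the distro, and the function names are made up for this example.

```python
import gzip
import os
import platform


def config_value(config_text, option):
    """Return the value of a kernel config option ("y", "m", ...) or None if unset."""
    prefix = option + "="
    for line in config_text.splitlines():
        line = line.strip()
        # Lines like "# CONFIG_FOO is not set" start with "#" and never match.
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def load_running_kernel_config():
    """Try the two usual locations for the running kernel's config (assumed paths)."""
    boot_path = "/boot/config-" + platform.release()
    if os.path.exists(boot_path):
        with open(boot_path) as handle:
            return handle.read()
    with gzip.open("/proc/config.gz", "rt") as handle:
        return handle.read()
```

If `config_value(load_running_kernel_config(), "CONFIG_RCU_NOCB_CPU")` is not `"y"`, the `rcu_nocbs` boot flag would not be expected to have any effect on that kernel.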

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-03-16:

#310

I suspect this solves the Windows issues.

https://www.eteknix.com/amd-releases-balanced-power-plan-patch-for-windows-10/

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-03-16:

#311

Also does anyone know if the Epyc series has this issue?

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-16:

#312

(In reply to Panagiotis Malakoudis from comment #223)
> Blaming the PSU from AMD's part is wrong, since idle freezes have been
> documented with various PSUs, new and old. If you read my older posts, I did
> a complete replacement of parts between two systems: one exhibiting the
> freeze on idle issue and one that didn't. The system not exhibiting the
> freeze was using a really old (pre Haswell) PSU. Putting this PSU on the
> other system didn't avoid idle freeze.

To have one more documented check by Chris Hall can't hurt. But I fear I know how this test will end, even though I wish it would work for him.

> The problem is related to power and actually I have "hide" the problem by
> overclocking my CPU.

This is interesting new information.

> Hitting certain voltages/frequencies avoids completely
> the issue (at least on my case with an ASUS X370 Prime). But even these
> voltages/frequencies are different from one case to another. Another user

Me!

> here reported he "hides" the problem by using the automatic overclocking of
> motherboard ("optimized mode for daily computing" option) which actually
> overclocks to 3600 if I remember well.

You do! 3600 is probably enough for me, as I'm always running 4 VMs. They are mostly idle too, but they add a minimal extra load, which could keep the system from entering very low current levels too often. Moderate overclocking additionally helps to circumvent the problem.
Also, I seldom run this machine for more than 15 hours, so I can't say whether the problem has disappeared completely. But before the overclocking, I could see freezes after an hour or even earlier.

> For me this didn't work, still had
> idle freezes. But idle freezes disappeared when I overclocked at 3900.

This is good to know! There are a lot of posts elsewhere about overclocking Ryzen (and RAM, which seems to be much more difficult), mostly by Windows users. Overclocking not only causes no difficulties, it could even resolve problems! That's a cool new idea!
Could it be related to the relationship between RAM and CPU clock?

[...]

> It is a hardware issue that AMD needs to fix and it
> seems it has been fixed in with this new option "Power Supply Option".

As I read elsewhere [1], this option disables C6 states, among other things. So using zenstates.py to switch off C6 states would be the same "solution".

[1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714

Revision history for this message

In Linux Kernel Bug Tracker #196683, pjssilva (pjssilva-linux-kernel-bugs) wrote on 2018-03-16:

#313

>
> As I could read elsewhere [1], this option would disable C6 states among
> others. so, using zenstates.py to switch off c6 states would be the same
> "solution".
>
> [1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714

This does not seem to be true for all motherboards. I have a B350 Tomahawk from MSI and I updated the BIOS today. Its latest BIOS has this new Power Supply Option, which can be set to Auto, Low current idle, or Typical current idle. The default is Auto.

I have changed it to both Low and Typical current and checked the possible states using zenstates.py -l and in both cases I get:

P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled

So the C6 states look enabled. I am now running it with Typical current idle to see if it improves stability. Next Monday, when I get back to work, I will see whether the machine locked up over the weekend, as usual, or not.
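Since several people in this thread compare `zenstates.py -l` output before and after changing a BIOS option, a small parser can make that comparison less error-prone. This is a rough sketch that assumes the output format is exactly as pasted above; the function name and the returned structure are made up for illustration.

```python
import re

PSTATE_RE = re.compile(r"^(P\d) - (Enabled|Disabled)")
C6_RE = re.compile(r"^C6 State - (Package|Core) - (Enabled|Disabled)$")


def parse_zenstates_listing(text):
    """Turn a zenstates.py listing into {"pstates": {...}, "c6": {...}} booleans."""
    pstates = {}
    c6 = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = PSTATE_RE.match(line)
        if match:
            pstates[match.group(1)] = match.group(2) == "Enabled"
            continue
        match = C6_RE.match(line)
        if match:
            c6[match.group(1).lower()] = match.group(2) == "Enabled"
    return {"pstates": pstates, "c6": c6}
```

Comparing the `"c6"` dictionaries from two runs shows directly whether a BIOS setting changed the package or core C6 state.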

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2018-03-16:

#314

(In reply to Paulo J. S. Silva from comment #228)
> >
> > As I could read elsewhere [1], this option would disable C6 states among
> > others. so, using zenstates.py to switch off c6 states would be the same
> > "solution".
> >
> > [1] https://www.gamingonlinux.com/forum/topic/3207/post_id=14714
>
> This does not seem to be true for all motherboards. I have a B350 Tomahawk
> from MSI and I updated the BIOS today. In its last BIOS it has this new
> Power Supply Option that can be set to Auto, Low current idle, Typical
> current idle. The default is auto.
>
> I have changed it to both Low and Typical current and checked the possible
> states using zenstates.py -l and in both cases I get:
>
>
> P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore =
> 1.35000
> P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore =
> 1.27500
> P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore =
> 0.90000
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
>
> So the C6 states looks enabled. I am now running it with Typical current
> idle to see if it improves stability. Next monday, when I get back to work I
> will see if the machine locked over the weekend, as usual, or not.

I look forward to your results! I've done new kernel, params, and .py to get a stable system. I'd like to get back to stock distro kernels and if your setting works I'll be there!

Revision history for this message

In Linux Kernel Bug Tracker #196683, dagecko (dagecko-linux-kernel-bugs) wrote on 2018-03-16:

#315

Created attachment 274783
attachment-6763-0.html

> This does not seem to be true for all motherboards. I have a B350 Tomahawk
> from MSI and I updated the BIOS today. In its last BIOS it has this new
> Power Supply Option that can be set to Auto, Low current idle, Typical
> current idle. The default is auto.
>
> So the C6 states looks enabled. I am now running it with Typical current idle
> to see if it improves stability. Next monday, when I get back to work I will
> see if the machine locked over the weekend, as usual, or not.
>

Good to hear that the new BIOS is starting to spread to other manufacturers.
For my motherboard (MSI X370 Gaming Plus), there is an explicit comment
for the new BIOS version:

1. Do not update this BIOS if you're currently using windows7

This suggests they are focusing on these Unix soft lockups, even
if nothing official is written anywhere about it. But with that kind of
comment we can expect a big mess on the BIOS side in a few months.
On my side, I won't test it, since tweaking the voltages made my computer
very stable (20 days for now, 14 days last time until a reboot due
to a new kernel installation).

Revision history for this message

In Linux Kernel Bug Tracker #196683, bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote on 2018-03-16:

#316

For me, Ryzen 7 1800X (acquired in late April 2017) on ASRock X370 Taichi with 2x 16GB (non-QVL) Kingston 9965669-019.A00G unbuffered ECC RAM (banks 2+4) on be quiet! Dark Power Pro 11 80plus 650W, is stable out of the box since BIOS 4.60 (2018/3/6) with default BIOS settings (except for SVM = enabled) on Ubuntu 16.04 and linux-image-hwe-edge 4.13.0.36.37. The previous BIOS version 4.40 (2018/2/9) also worked with a default Linux kernel image and without kernel parameters after setting the new BIOS 'current' option to the non-default 'full current' value.
So it just took a year and the follow-up generation to be released, yeay.

Revision history for this message

In Linux Kernel Bug Tracker #196683, bugzilla.kernel.org (bugzilla.kernel.org-linux-kernel-bugs) wrote on 2018-03-18:

#317

It turns out my previous statement was premature. The system still freezes unless "Power Supply Idle Control" is set to the non-default value of "Typical Current".

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-18:

#318

(In reply to Moritz Naumann from comment #232)
> It turns out my previous statement was premature. The system still freezes
> unless "Power Supply Idle Control" is set to the non-default value of
> "Typical Current".

http://download.gigabyte.cn/FileList/Manual/mb_manual_ga-ax370m-ds3h_e.pdf

Power Supply Idle Control
Enables or disables Package C6 State.

Typical Current Idle
Disables this function.

Low Current Idle
Enables this function.

Auto
The BIOS automatically configures this setting (Default)

Revision history for this message

In Linux Kernel Bug Tracker #196683, pjssilva (pjssilva-linux-kernel-bugs) wrote on 2018-03-19:

#319

>
> Power Supply Idle Control
> Enables or disables Package C6 State.

Interesting. According to zenstates.py, on my motherboard (an MSI B350 Tomahawk) the Package C6 State is enabled in both cases.

Actually, I have good news. Setting this option to typical current idle kept my machine stable over the weekend (it almost always freezes during weekends if left idle). It is still too soon to assert anything, but it is a good sign.

I may also bring my kill-a-watt to compare the idle consumption from the outlet with this option enabled and disabled to see if I can find any difference. But, I have to find the kill-a-watt at my place first.

Revision history for this message

In Linux Kernel Bug Tracker #196683, sineways (sineways-linux-kernel-bugs) wrote on 2018-03-20:

#320

Ryzen 3 1200 Quad-Core (Made: 17/28) with an ASRock AB350 Pro4 running BIOS 3.20 (AGESA 1.0.0.6b)

System gets hardlocks (anywhere from 1-3, 10-30 days) that cause even SysRq to not work. The chip is in a server that varies between idle and active, but the hang seems to happen regardless of CPU usage state.

Only once did I get a "BUG: soft lockup". And in that situation I was able to REISUB sysrq my way to a clean reboot.

I've been following this bug for a while, but I'm not sure of what a proper solution might be (if any) for those that have found one at this point.

Below is some possibly helpful information:
Kernel: 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1 (2018-01-14) x86_64 GNU/Linux
Grub: rcu_nocbs=0-3

-
python zenstates.py --list
P0 - Enabled - FID = 7C - DID = 8 - VID = 3A - Ratio = 31.00 - vCore = 1.18750
P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore = 0.88750
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
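As a side note on reading these listings: the Ratio and vCore columns appear to follow from the hexadecimal FID, DID and VID fields. The sketch below uses the relationships `ratio = 2 * FID / DID` and `vCore = 1.55 - 0.00625 * VID`, which are assumptions inferred from the values pasted in this thread (they reproduce every listed row), not something taken from AMD documentation.

```python
def decode_pstate(fid_hex, did_hex, vid_hex):
    """Decode hex FID/DID/VID strings into (ratio, vcore_volts).

    Formulas are assumptions inferred from zenstates.py listings in this thread.
    """
    fid = int(fid_hex, 16)
    did = int(did_hex, 16)
    vid = int(vid_hex, 16)
    ratio = 2.0 * fid / did        # multiplier, e.g. 31.0 for the P0 row above
    vcore = 1.55 - 0.00625 * vid   # volts, e.g. 1.1875 for the P0 row above
    return ratio, vcore


# The three enabled P-states from the listing above.
for name, fields in [("P0", ("7C", "8", "3A")),
                     ("P1", ("8C", "A", "50")),
                     ("P2", ("7C", "10", "6A"))]:
    ratio, vcore = decode_pstate(*fields)
    print(f"{name}: ratio={ratio:.2f} vcore={vcore:.5f}")
```

This is mainly useful as a sanity check that a listing is internally consistent before drawing conclusions from it.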

At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I thought that disabled C6 states, and that wasn't desirable, but perhaps I'm mistaken.

Appreciate any guidance
Doug

Revision history for this message

In Linux Kernel Bug Tracker #196683, eric.c.morgan (eric.c.morgan-linux-kernel-bugs) wrote on 2018-03-20:

#321

(In reply to Douglas S from comment #235)
> Ryzen 3 1200 Quad-Core (Made: 17/28) with an ASRock AB350 Pro4 running BIOS
> 3.20 (AGESA 1.0.0.6b)
>
> System gets hardlocks (anywhere from 1-3, 10-30 days) that cause even SysRq
> to not work. The chip is in a server that varies between idle and active,
> but the hang seems to happen regardless of CPU usage state.
>
> Only once did I get a "BUG: soft lockup". And in that situation I was able
> to REISUB sysrq my way to a clean reboot.
>
> I've been following this bug for a while, but I'm not sure of what a proper
> solution might be (if any) for those that have found one at this point.
>
> Below is some possibly helpful information:
> Kernel: 4.14.0-0.bpo.3-amd64 #1 SMP Debian 4.14.13-1~bpo9+1 (2018-01-14)
> x86_64 GNU/Linux
> Grub: rcu_nocbs=0-3
>
> -
> python zenstates.py --list
> P0 - Enabled - FID = 7C - DID = 8 - VID = 3A - Ratio = 31.00 - vCore =
> 1.18750
> P1 - Enabled - FID = 8C - DID = A - VID = 50 - Ratio = 28.00 - vCore =
> 1.05000
> P2 - Enabled - FID = 7C - DID = 10 - VID = 6A - Ratio = 15.50 - vCore =
> 0.88750
> P3 - Disabled
> P4 - Disabled
> P5 - Disabled
> P6 - Disabled
> P7 - Disabled
> C6 State - Package - Enabled
> C6 State - Core - Enabled
>
> At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I
> thought that disabled C6 states, and that wasn't desirable, but perhaps I'm
> mistaken.
>
> Appreciate any guidance
> Doug

Run the mentioned python script and disable C6. I have 90 days uptime on my server after doing this.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-20:

#322

(In reply to Douglas S from comment #235)
> At a loss for where to go now. I did not enable CONFIG_RCU_NOCB_CPU=y as I
> thought that disabled C6 states, and that wasn't desirable, but perhaps I'm
> mistaken.

I thought that at one point but that's not true. It's an effective workaround for most (but evidently not all) people with fewer downsides.

On my side, I have updated to my latest BIOS and tried setting the new "Power Supply Idle Control" option to "Typical Current" while removing the rcu_nocbs parameter. So far, so good. zenstates.py reports the following, which makes sense given the option's description.

C6 State - Package - Disabled
C6 State - Core - Enabled

If you don't know what this means, I've read that "core" just powers down the CPU cores while "package" powers down most of the rest of the CPU after the cores. Sounds like an appropriate fix. I'd like to measure the power usage but haven't gotten round to it yet.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-03-21:

#323

Still no response from AMD? Look forward to the brave soul who tests this on the new 2xxx series in a month or two.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-21:

#324

Created attachment 274853
disable c6

So maybe disable package c6 within kernel?

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-21:

#325

(In reply to Kai-Heng Feng from comment #239)
> So maybe disable package c6 within kernel?

Good idea, and fine for testing. But if it goes to production, please make it a kernel command-line option, because not everybody has this problem, and some have a better solution such as more or less CPU overclocking.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-22:

#326

(In reply to Klaus Mueller from comment #240)
> Good idea and for testing ok - but please as kernel commandline option if it
> should go to production because not everybody does have this problem
> respectively does have better solution like more or less CPU overclocking.

It's not necessary because we can already write the MSR from userspace.

Is the RMA'd CPU also stepping 1?
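For reference, writing an MSR from userspace on Linux goes through the `/dev/cpu/<n>/msr` device files provided by the `msr` kernel module, and it needs root. The following is a hedged sketch only: the register address `0xC0010292` and bit 32 for package C6 are assumptions based on what the zenstates.py script appears to use, and the helper names are made up here.

```python
import os
import struct

PKG_C6_MSR = 0xC0010292  # assumption: register zenstates.py uses for package C6
PKG_C6_BIT = 32          # assumption: enable bit within that register


def read_msr(address, cpu=0):
    """Read a 64-bit MSR through /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, address))[0]
    finally:
        os.close(fd)


def write_msr(address, value, cpu=0):
    """Write a 64-bit MSR through /dev/cpu/<cpu>/msr (needs root and the msr module)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), address)
    finally:
        os.close(fd)


def with_bit(value, bit, enabled):
    """Return value with the given bit set or cleared."""
    mask = 1 << bit
    return (value | mask) if enabled else (value & ~mask)


def package_c6_enabled(msr_value):
    """Interpret the assumed package C6 enable bit."""
    return bool(msr_value & (1 << PKG_C6_BIT))
```

Disabling only package C6 would then be something like `write_msr(PKG_C6_MSR, with_bit(read_msr(PKG_C6_MSR), PKG_C6_BIT, False))`, repeated for each CPU. Treat this as illustrative, not as a tested tool.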

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-22:

#327

Created attachment 274863
zenstates.py small patch to allow disabling only C6 package

I made a small patch to zenstates.py to allow disabling only C6 package for those that don't have the new power idle control options in their BIOS. Test and report.

Revision history for this message

In Linux Kernel Bug Tracker #196683, sergio (sergio-linux-kernel-bugs) wrote on 2018-03-22:

#328

(In reply to Kai-Heng Feng from comment #241)
> (In reply to Klaus Mueller from comment #240)
> > Good idea and for testing ok - but please as kernel commandline option if
> it
> > should go to production because not everybody does have this problem
> > respectively does have better solution like more or less CPU overclocking.
>
> It's not necessary because we can already write the MSR from userspace.
>
> Is the RMA'd CPU also stepping 1?

Mine is stepping 1.

For people using the "Power Supply Idle Control" setting in the BIOS/UEFI, does it also get reset after resuming from suspend/sleep?

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-22:

#329

(In reply to Sergio C. from comment #243)
> (In reply to Kai-Heng Feng from comment #241)
> > Is the RMA'd CPU also stepping 1?
>
> Mine is stepping 1.

So is mine, which has been RMA'd.

> For people using the "Power Supply Idle Control" setting in the BIOS/UEFI,
> does it also get reset after resuming from suspend/sleep?

Just tested, C6 Package stays disabled.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-23:

#330

If we are sure the issue is gone when package C6 is disabled, I'll send a patch which disables package C6 on model == 1 && stepping == 1.
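To see which machines such a patch would touch, the model and stepping can be read from `/proc/cpuinfo`. Below is a minimal sketch; the extra check for CPU family 23 (17h, Zen) is an assumption added here and is not part of the proposal above, and the function name is illustrative.

```python
def matching_processors(cpuinfo_text, family=23, model=1, stepping=1):
    """Return the processor numbers whose family/model/stepping match."""
    matches = []
    # /proc/cpuinfo separates logical processors with blank lines.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                fields[key.strip()] = value.strip()
        try:
            ident = (int(fields["cpu family"]),
                     int(fields["model"]),
                     int(fields["stepping"]))
            number = int(fields["processor"])
        except (KeyError, ValueError):
            continue  # skip blocks that lack the expected fields
        if ident == (family, model, stepping):
            matches.append(number)
    return matches
```

On a live system the input would be the contents of `/proc/cpuinfo`; an empty result would mean the proposed quirk does not apply to that CPU.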

Revision history for this message

In Linux Kernel Bug Tracker #196683, pjssilva (pjssilva-linux-kernel-bugs) wrote on 2018-03-23:

#331

(In reply to Kai-Heng Feng from comment #245)
> If we are sure the issue is gone when package C6 is disabled, I'll send a
> patch which disables package on model == 1 && stepping == 1.

Doesn't this have consequences for power consumption and/or performance? I am not sure this is the best approach now that the new BIOS option is in place. As I said, on my system (MSI B350 motherboard) setting the option to "Typical" solves the problem. My machine has been running for a week now without any freeze.

Moreover, in my case the BIOS option does not seem to disable the C6 Package as far as zenstates.py can tell:

pjssilva@quorra:~/local/administracao/ZenStates-Linux$ sudo ./zenstates.py -l
[sudo] password for pjssilva:
P0 - Enabled - FID = 88 - DID = 8 - VID = 20 - Ratio = 34.00 - vCore = 1.35000
P1 - Enabled - FID = 78 - DID = 8 - VID = 2C - Ratio = 30.00 - vCore = 1.27500
P2 - Enabled - FID = 84 - DID = C - VID = 68 - Ratio = 22.00 - vCore = 0.90000
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled
pjssilva@quorra:~/local/administracao/ZenStates-Linux$

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-23:

#332

(In reply to Paulo J. S. Silva from comment #246)
> (In reply to Kai-Heng Feng from comment #245)
> > If we are sure the issue is gone when package C6 is disabled, I'll send a
> > patch which disables package on model == 1 && stepping == 1.
>
> Doesn't this have consequences on power consumption and/or
> performance?

I don't know. We need someone who has a power meter to do the test.

Revision history for this message

In Linux Kernel Bug Tracker #196683, klausman (klausman-linux-kernel-bugs) wrote on 2018-03-23:

#333

I for one have a stable system *without* disabling C6, but just rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it wouldn't help me, but likely make the power consumption a lot worse.

Hardware:
AMD Ryzen 7 1700X (model 1, stepping 1) that was RMA'd for segfaults
ASUS PRIME B350-Plus, BIOS 3401
2x Kingston 9965669-017.A00G

No overclocking and core boost disabled.

I regularly get uptimes well over two weeks and haven't had a lockup since I added the RCU blacklist.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-23:

#334

I haven't measured it yet but I wouldn't panic about power usage prematurely. I think just disabling C6 package is quite different to disabling C6 entirely.

It also seems like BIOS settings aren't always proving reliable and you need to check zenstates.py to see whether it has changed what you expect or even anything at all.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-23:

#335

(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

I guess RCU_NOCBS only papers over the bug by making it hard for the CPU to enter Package C6.

Regarding power consumption, we need real-world test results before making any judgment.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-23:

#336

(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

This option must be switchable, of course. I don't need it either! It should be handled like the BIOS switch: the default is no change, and machines which need it enable it with e.g. c6package.disable or, even better, additionally via sysctl (always opt-in).

About power consumption: I can test it on an already overclocked system (+200 MHz). I can do it once I can switch off the system to plug in the power meter. The precision isn't very high, but 1 or 2 watts more or less should be noticeable.

Revision history for this message

In Linux Kernel Bug Tracker #196683, excieve (excieve-linux-kernel-bugs) wrote on 2018-03-23:

#337

I think both the RCU workaround and disabling C6 (either through the MSR or using the BIOS option) just mask the problem, which is probably at the hardware level. It is likely also masked on Windows by the way its power management works. I strongly believe that AMD should properly investigate this issue instead of waving it off as a PSU problem. This happens on all kinds of PSUs, including those that definitely support C6/C7.

One thing I noticed is that with C6 disabled through the script or with the "Typical Current" power supply option in the BIOS, the MB sensors in my case never report Vcore below 0.8V, which indicates that it indeed doesn't reach lower idle states even if only "package C6" is disabled. On default settings it will frequently report the 0.4-0.5V range when idle.

Since Zen would only apply boosting and especially XFR when certain power and thermal conditions are met, disabling the deep idle states may affect this adversely. So IMO all that should be well tested before applying a workaround like this on the kernel level.

Revision history for this message

In Linux Kernel Bug Tracker #196683, it (it-linux-kernel-bugs) wrote on 2018-03-23:

#338

Since I've enabled "Typical Current" my system is absolutely stable. On my system that disables package C6.

I did a short test: I ran stress-ng -c1, pinned the process to the first core and watched the output of cpufreq-aperf. I didn't see a difference between package C6 enabled and disabled.

I didn't do a very long-running test, nor do I know whether the cpufreq-aperf values are really meaningful in this scenario (sometimes it showed frequencies above the 3.9GHz my 1700x should be able to reach), but at least there was no difference.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-23:

#339

(In reply to Artem Hluvchynskyi from comment #252)
> I think both RCU workaround and disabling C6 (either through MSR or using
> the BIOS option) just masks the problem, which is probably on hardware
> level.

Maybe true.

[...]

> One thing I noticed is that with C6 disabled through the script or using the
> "Typical Current" power supply option in BIOS in my case MB sensors never
> report Vcore less than 0.8V, which indicates that indeed it doesn't reach
> lower idle states even if only "package C6" is disabled. On default settings
> it will frequently report 0.4-0.5V range when idle.

I'm reaching exactly this 0.4-0.5V range with overclocking (+200 MHz) and C6 completely enabled, and I haven't seen the problem for many weeks now (before, the system did not even last an hour). My PSU is officially definitely not C6/C7 capable because it's too old.

> Since Zen would only apply boosting and especially XFR when certain power
> and thermal conditions are met, disabling the deep idle states may affect
> this adversely. So IMO all that should be well tested before applying a
> workaround like this on the kernel level.

Fine with me if it's switchable (opt in).

Nevertheless, it would indisputably be much better to have a real fix based on the real cause instead of a workaround, but I fear it won't come anytime soon.

Revision history for this message

In Linux Kernel Bug Tracker #196683, pjssilva (pjssilva-linux-kernel-bugs) wrote on 2018-03-23:

#340

> One thing I noticed is that with C6 disabled through the script or using the
> "Typical Current" power supply option in BIOS in my case MB sensors never
> report Vcore less than 0.8V, which indicates that indeed it doesn't reach
> lower idle states even if only "package C6" is disabled. On default settings
> it will frequently report 0.4-0.5V range when idle.

The same here (Tomahawk B350 from MSI). With the "Typical current" option in BIOS the voltage has a minimum of 0.82V (that's what sensors report). If I change the setting to "Auto" or "Low current" it goes down to 0.4-0.5V.

With "Typical" my system is stable, with "Auto" it freezes, usually during the night. I haven't tried "Low" yet. Maybe next week.

Revision history for this message

In Linux Kernel Bug Tracker #196683, laforge (laforge-linux-kernel-bugs) wrote on 2018-03-24:

#341

FYI: I'm experiencing a similar or the same issue on a Ryzen 1700X

vendor_id : AuthenticAMD
cpu family : 23
model : 1
model name : AMD Ryzen 7 1700X Eight-Core Processor
stepping : 1
microcode : 0x8001129

Base Board Information
        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: PRIME B350M-A
        Version: Rev X.0x
BIOS Information
        Vendor: American Megatrends Inc.
        Version: 3803
        Release Date: 01/22/2018

where at least once per day the machine would hang with
<pre>
Message from syslogd@host2 at Mar 23 21:35:56 ...
kernel:[31606.963222] NMI watchdog: BUG: soft lockup - CPU#15 stuck for 23s! [(md-udevd):2623]
</pre>

being the last messages. C6 was enabled for both the package and the core by the bios.

The machine would hang quite reliably at least once every 24 hours, typically during the idle night times. The load at that point was 0.01.

OS/Kernel is Debian 9 / 4.9.0-6-amd64, which has "CONFIG_RCU_EXPERT is not set" and hence none of the [old or new] more detailed RCU settings are enabled.

Have decided not to upgrade the kernel but simply disable C6 and wait...

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-24:

#342

(In reply to Tobias Klausmann from comment #248)
> I for one have a stable system *without* disabling C6, but just
> rcu_nocbs=0-15 on the kernel command line. If C6 was hard-disabled, it
> wouldn't help me, but likely make the power consumption a lot worse.

I tested power consumption with C6 package enabled (default) or disabled. I disabled it via patch for zenstates.py.

I ran two tests over 2 hours with disabled VMs and no KDE session (just the login screen started, but active was text console 0).

Based on the measuring tolerance of the power meter, I couldn't find any difference between enabled or disabled C6 package.

The voltages I saw in sensors have been the same (most of the time 0.39 V) - no difference. The load of the machine was 0 0 0.

Maybe the patch for zenstates.py to disable C6 package doesn't work? It would have been nice to have the possibility to re-enable it.

Another nice finding was: the consumption of my 4 idle(!) VMs is constantly about 5 Watt!

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-24:

#343

Those of you that have the new power idle options, can you report the values of msr registers 0xC0010292 and 0xC0010296 after booting?
It can be done with:
modprobe msr
rdmsr -x 0xC0010292
rdmsr -x 0xC0010296
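
For anyone who would rather script this than install msr-tools, here is a minimal Python sketch of the same read. It assumes the msr module is loaded (modprobe msr, as above), that it runs as root, and that the kernel exposes the usual /dev/cpu/N/msr device, where the file offset selects the register. The function name read_msr and its parameters are made up for illustration.

```python
import os
import struct


def read_msr(register, cpu=0, device_template="/dev/cpu/{cpu}/msr"):
    """Return the 64-bit value of one MSR on the given CPU."""
    path = device_template.format(cpu=cpu)
    fd = os.open(path, os.O_RDONLY)
    try:
        # The msr driver interprets the file offset as the register number.
        raw = os.pread(fd, 8, register)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError(f"short read from {path}")
    # x86 is little-endian, so decode as an unsigned 64-bit little-endian value.
    return struct.unpack("<Q", raw)[0]


if __name__ == "__main__":
    for reg in (0xC0010292, 0xC0010296):
        try:
            # Print in bare hex, the same style as "rdmsr -x".
            print(f"{reg:#x}: {read_msr(reg):x}")
        except OSError as err:
            print(f"{reg:#x}: cannot read ({err})")
```

Without root or without the msr module, the script only reports that it cannot read the register.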

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-25:

#344

Disabling only C6 package does not fix the issue for me. So whatever "Power Supply Idle Control" does, it is not just disabling C6 package.

Revision history for this message

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-03-25:

#345

(In reply to Panagiotis Malakoudis from comment #258)
> rdmsr -x 0xC0010292

52

> rdmsr -x 0xC0010296

484848

I've taken some basic power measurements. Not particularly scientific or accurate, just watching the numbers fluctuate on my power plug meter. I've not looked into this before so I really didn't know whether the difference would be tens of watts or milliwatts. I booted with just C6 package disabled and it hovered around 64.5W. With C6 entirely disabled, it went up to 70W. With C6 entirely enabled, it was around 63.5W. So C6 Package hardly makes any difference. I made these changes with zenstates.py rather than the BIOS for this test.
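
To put those three readings in relative terms, here is a small arithmetic sketch. The wattages are the ones quoted above; the helper name is only for illustration, and the plug meter's tolerance is not accounted for.

```python
# Wall-plug readings quoted above, in watts.
readings_w = {
    "C6 entirely enabled": 63.5,
    "C6 package disabled": 64.5,
    "C6 entirely disabled": 70.0,
}


def increase_over_baseline(watts, baseline_watts):
    """Return (absolute increase in W, relative increase in percent)."""
    delta = watts - baseline_watts
    return delta, 100.0 * delta / baseline_watts


baseline = readings_w["C6 entirely enabled"]
for label, watts in readings_w.items():
    delta, percent = increase_over_baseline(watts, baseline)
    print(f"{label}: {watts:.1f} W ({delta:+.1f} W, {percent:+.1f}%)")
```

On these numbers, disabling only the package state costs about 1 W (under 2%), while disabling C6 entirely costs 6.5 W (roughly 10%) at idle.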

Revision history for this message

In Linux Kernel Bug Tracker #196683, robert (robert-linux-kernel-bugs) wrote on 2018-03-25:

#346

Hello guys, (new here, but not to unix/linux)

I have at the moment a new Epyc-based server, and also a new Ryzen PC.
No soft lockups on Epyc, only a very evil error under heavy load I have been hunting down for a while (https://forums.fedoraforum.org/showthread.php?317537-first-server-error-reboot-what-is-this-UUID) until I decided to move away from FC27 (not certified on this server), and load CentOS 7. So far no error, but I have yet to test more load on it.

Specs, Epyc:
Kernel: 3.10.0-693.21.1.el7.x86_64 x86_64 bits: 64 gcc: 4.8.5
           Desktop: Openbox Distro: CentOS Linux release 7.4.1708 (Core)
Machine: Device: kvm System: Supermicro product: AS -2023US-TR4 v: 0123456789 serial: <filter>
           Mobo: Supermicro model: H11DSU-iN v: 1.02A serial: <filter>
           UEFI [Legacy]: American Megatrends v: 1.1 date: 02/07/2018
CPU(s): 2 16 core AMD EPYC 7351s (-MCP-SMP-) arch: Zen rev.2 cache: 16384 KB

Ryzen:
Kernel: 4.15.10-300.fc27.x86_64 x86_64 bits: 64 gcc: 7.3.1 Console: tty 0
Distro: Fedora release 27 (Twenty Seven)
Machine: Device: desktop Mobo: ASUSTeK model: PRIME B350M-A v: Rev X.0x serial: <filter>
UEFI [Legacy]: American Megatrends v: 3402 date: 12/11/2017
CPU: 6 core AMD Ryzen 5 1600X Six-Core (-MT-MCP-) arch: Zen rev.1 cache: 3072 KB

I can say that I have seen the soft lockup on the Ryzen several times in the past 3 months, maybe once / week or so. (but then again the machine is always doing stuff in my home, i.e. downloading backups from servers)

The boot options on both are now:
Epyc: GRUB_CMDLINE_LINUX="rhgb selinux=0 rcu_nocbs=0-63"
Ryzen: GRUB_CMDLINE_LINUX="rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 rhgb quiet selinux=0 nmi_watchdog=0 nohpet pci=biosirq rcu_nocbs=0-11"

I have not been home yet to upgrade the BIOS on the B350 mobo, but I will be there in 7 days, do it, and post my findings. In the meantime, I will leave the Ryzen completely idle, waiting for a crash.
I have not (ever) touched the C6 issue, although I do remember setting my Mobo BIOS to "Standard" (not Performance, and not Power Saving).

Revision history for this message

In Linux Kernel Bug Tracker #196683, robert (robert-linux-kernel-bugs) wrote on 2018-03-25:

#347

Oh, I forgot:

Epyc (Centos7):
fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_NONE=y
# CONFIG_RCU_NOCB_CPU_ZERO is not set
# CONFIG_RCU_NOCB_CPU_ALL is not set

Ryzen (FC27):
fgrep CONFIG_RCU_NOCB_CPU /boot/config-$(uname -r)
CONFIG_RCU_NOCB_CPU=y

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-25:

#348

If there's no noticeable power consumption increase when Package C6 gets disabled, and we are sure that it really works around the hard lockup, I'll send the patch upstream.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-25:

#349

It doesn't work around the hard lock, as I reported a while back (comment 259)

Revision history for this message

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2018-03-25:

#350

(In reply to Kai-Heng Feng from comment #263)
> If there's no noticeable power consumption increase when Package C6 gets
> disabled, and we are sure that it really workarounds the hard lockup, I'll
> send the patch to upstream.

What exactly would this mean (Linux newbie here)? Would this mean this would be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because electricity is bloody expensive here and the 5-10% power consumption difference would matter to me.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-25:

#351

(In reply to Panagiotis Malakoudis from comment #264)
> It doesn't workarounds the hard lock, as I reported a while back (comment
> 259)

So does disabling Core C6 do the trick?

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-03-25:

#352

(In reply to Chris Hall from comment #217)
> (In reply to Klaus Mueller from comment #214)
> > (In reply to Chris Hall from comment #212)
> > > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > > replaced that by a brand-new, more efficient PSU which supports 0A loads
> > > across all voltages. I am running that with C6 enabled, and await
> > > developments.

> > Could you please tell which PSU you're now testing?

> It's a "be quiet! Straight Power 11", Model E11-450W (BN280).

I have been running with the new PSU with C6 enabled (package and core), and with CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15, for 10 days.

This morning the machine was frozen. So, new PSU has not cured the problem.

With the old PSU the machine was freezing every 2-3 days. So, things have improved... which could be the PSU, but the kernel (now 4.15.10) and other software has been updated and the idle load on the machine depends on how frequently somebody tries to ssh in :-(

The BIOS for my ASUS Prime X370-PRO does not have the magic "Power Supply Idle Control". I will now try "zenstates.py --c6-package-disable".

FWIW: I last wrote to AMD "TECH.SUPPORT" on 14-Mar... so far, no reply.

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-25:

#353

Disabling both C6 package and C6 core prevents the hard lock, but also disables single-core turbo XFR speeds. With C6 disabled my 1700X can only go up to 3500MHz, while a single core can go up to 3900 with C6 enabled.

Disabling both C6 package and C6 core should not be considered a fix.

Revision history for this message

In Linux Kernel Bug Tracker #196683, laforge (laforge-linux-kernel-bugs) wrote on 2018-03-25:

#354

I can report that on my Ryzen 1700X machine disabling core+package C6 seems to
be working around the problem. Where previously the machine would fail quite
reliably at least once per 24 hours, it's now running longer without any trouble
so far.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-25:

#355

(In reply to Jonathan from comment #265)
> (In reply to Kai-Heng Feng from comment #263)
> > If there's no noticeable power consumption increase when Package C6 gets
> > disabled, and we are sure that it really workarounds the hard lockup, I'll
> > send the patch to upstream.
>
> What exactly would this mean (Linux newbie here)? Would this mean this would
> be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because
> electricity is bloody expensive here and the 5-10% power consumption
> difference would matter to me.

Kai-Heng Feng seems to mulishly ignore the fact that not *all people* need this workaround. I have repeatedly written that this *must be switchable*, e.g. as opt-in.

But before anything is done, it must be proven in general that the fix is correct at all, meaning that it really addresses C6 package and nothing else.

My tests here have shown that there isn't any difference in voltages between enabled and disabled C6 package using zenstates.py. Did it really address the C6 package? I'm also not sure whether it does the same as the kernel patch.

Revision history for this message

In Linux Kernel Bug Tracker #196683, tomm (tomm-linux-kernel-bugs) wrote on 2018-03-25:

#356

(In reply to Klaus Mueller from comment #240)
> (In reply to Kai-Heng Feng from comment #239)
> > So maybe disable package c6 within kernel?
>
> Good idea and for testing ok - but please as kernel commandline option if it
> should go to production because not everybody does have this problem
> respectively does have better solution like more or less CPU overclocking.

Good luck getting Linus to accept a kernel change for a defect that AMD won't even acknowledge.

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-25:

#357

(In reply to Chris Hall from comment #267)
> (In reply to Chris Hall from comment #217)
> > (In reply to Klaus Mueller from comment #214)
> > > (In reply to Chris Hall from comment #212)
> > > > I admit that the PSU I was using was (a) old and (b) cheap. I have now
> > > > replaced that by a brand-new, more efficient PSU which supports 0A
> loads
> > > > across all voltages. I am running that with C6 enabled, and await
> > > > developments.
>
> > > Could you please tell which PSU you're now testing?
>
> > It's a "be quiet! Straight Power 11", Model E11-450W (BN280).
>
> I have been running with the new PSU with C6 enabled (package and core), and
> with CONFIG_RCU_NOCB_CPU=y and rcu_nocbs=0-15, for 10 days.
>
> This morning the machine was frozen. So, new PSU has not cured the problem.

Thanks for your reply!

> With the old PSU the machine was freezing every 2-3 days. So, things have
> improved... which could be the PSU, but the kernel (now 4.15.10) and other
> software has been updated and the idle load on the machine depends on how
> frequently somebody tries to ssh in :-(
>
> The BIOS for my ASUS Prime X370-PRO does not have the magic "Power Supply
> Idle Control". I will now try zenstates.py --c6-package-disable".

I've got the same board. My workaround is to overclock the CPU a bit: I'm using the PC scenario "Daily computing" (+200 MHz); all other settings are default, with no RCU or C6 workaround. Panagiotis Malakoudis had to use even higher overclocking (see comment 223) to work around the problem, but he has a different CPU if I remember correctly. I have an AMD Ryzen 7 1700X and my DRAM clock is 2400 MHz (the default; it isn't changed). Overclocking doesn't affect power consumption here in idle mode (I tested it)!

Maybe you want to try it?

Revision history for this message

In Linux Kernel Bug Tracker #196683, kai.heng.feng (kai.heng.feng-linux-kernel-bugs) wrote on 2018-03-25:

#358

(In reply to Klaus Mueller from comment #270)
> (In reply to Jonathan from comment #265)
> > (In reply to Kai-Heng Feng from comment #263)
> > > If there's no noticeable power consumption increase when Package C6 gets
> > > disabled, and we are sure that it really workarounds the hard lockup,
> I'll
> > > send the patch to upstream.
> >
> > What exactly would this mean (Linux newbie here)? Would this mean this
> would
> > be active by default on Linux (C6 disabled for Ryzen/TR/Epyc)? Because
> > electricity is bloody expensive here and the 5-10% power consumption
> > difference would matter to me.
>
> Kai-Heng Feng seems to mulishly ignore the fact, that not *all people* need
> this workaround. I repeatedly wrote, that this *must be switchable*, as opt
> in e.g.

But lots of systems are affected by this, no? A functional kernel should work out of the box.
So a fix like this needs to be opt-out, like how all other workarounds get handled in the kernel. You can still easily opt out via MSR.

But apparently people are not happy about this fix, so let's just wait for AMD's fix, if there is one.

Revision history for this message

In Linux Kernel Bug Tracker #196683, robert (robert-linux-kernel-bugs) wrote on 2018-03-25:

#359

Well, judging by what I have read and what I have seen, this is obviously a fault in AMD or the Ryzen processor either:
1. by allowing the power to the CPU to get too low, so it starves and stutters.
2. by specifying the wrong CPU power needs to the motherboard manufacturers.

I have read above, there is a site where they "play with the voltages". That seems to be the most obvious real fix to me. However, it's not for "granny", and it should just not happen to a company like AMD, or a hyped product like Ryzen.

All said and done, when I leave my comp at home transcoding 8TB of videos to mobile format, the beast just rocks so good, and it never crashes then ...

Revision history for this message

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-03-26:

#360

(In reply to Kai-Heng Feng from comment #273)
> But lots of systems are affected by this, no? A functional kernel should
> work out of the box.
> So fix like this needs to be opted out, like how all other workarounds get
> handled in the kernel. You can still easily opt out via MSR.

Opt out is fine for me, too. This is ok. It's important to be switchable. But zenstates.py can't switch it off as of now. Therefore, it must be done via kernel.

> But apparently people are not happy about this fix, let's just wait for
> AMD's fix, if there's one.

Because it wouldn't be a fix, but a workaround. That's the problem. AMD should fix it, but they seem to be lazy and just seem to ignore it!

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-03-26:

#361

I find it hard to believe this can't be fixed with a bios / microcode update. It really is crazy.

Someone earlier here was going to contact AMD graphics division to see if they could shed some light on it, I'm going to assume that went nowhere.

There is an AMD staff member who regularly posts on reddit. Perhaps if this were posted on the AMD subreddit, succinctly and with evidence, it might get upvotes (but some AMD fans tend to be extremely defensive).

It's certainly frustrating, this bug has been now open quite a while.

Revision history for this message

In Linux Kernel Bug Tracker #196683, sergio (sergio-linux-kernel-bugs) wrote on 2018-03-26:

#362

(In reply to OptionalRealName from comment #276)
> I find it hard to believe this can't be fixed with a bios / microcode
> update. It really is crazy.
>
> Someone earlier here was going to contact AMD graphics division to see if
> they could shed some light on it, I'm going to assume that went nowhere.
>
>

They probably got the same "I apologise for the delay in response, unfortunately the information you are requesting is private and confidential and not publicly available at this time." that I got when I asked about the power supply idle control setting.

On a sidenote, I am currently trying to debug an issue with a wireless card (Intel 7260, mobo ASRock AB350 Pro4, Ryzen 5 1600). Today I went back to an older BIOS that does not have the power supply workaround and simply disabling C6 package did not help. I got a reboot with the infamous "mce: [Hardware Error]: CPU 4: Machine Check: 0 Bank 5: bea0000000000108" message; hadn't seen that one in months :)

Revision history for this message

In Linux Kernel Bug Tracker #196683, it (it-linux-kernel-bugs) wrote on 2018-03-26:

#363

(In reply to Panagiotis Malakoudis from comment #258)
> rdmsr -x 0xC0010292

52

> rdmsr -x 0xC0010296

484848
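
For reference, here is a hedged sketch of how these two values could be decoded. The bit positions are an assumption taken from the community zenstates.py script (bit 32 of 0xC0010292 for package C6; bits 22, 14 and 6 of 0xC0010296 for core C6), not from any AMD documentation, and the function name is made up for illustration.

```python
PKG_C6_MSR = 0xC0010292
CORE_C6_MSR = 0xC0010296

# Assumption: bit layout as used by the community zenstates.py script.
PKG_C6_BIT = 1 << 32
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def decode_c6(pkg_msr_value, core_msr_value):
    """Interpret the two MSR values under the assumed bit layout."""
    return {
        "package_c6": bool(pkg_msr_value & PKG_C6_BIT),
        # Core C6 counts as enabled only if all three bits are set.
        "core_c6": (core_msr_value & CORE_C6_BITS) == CORE_C6_BITS,
    }


# The values reported above, read as hexadecimal (rdmsr -x).
print(decode_c6(0x52, 0x484848))
# prints {'package_c6': False, 'core_c6': True}
```

If that layout is right, these readings mean package C6 off and core C6 on, which would match the reports that "Typical Current" disables package C6.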

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-26:

#364

It seems my motherboard, ASUS Prime X370 will be getting the new "Power Supply Idle Control" option in new BIOS with version 3907 and AGESA 1002, as reported in https://www.hardwareluxx.de/community/f219/asus-prime-x370-pro-am4-1156996-289.html#post26231965

So I will be able to test this workaround soon.

Revision history for this message

In Linux Kernel Bug Tracker #196683, simon (simon-linux-kernel-bugs) wrote on 2018-03-26:

#365

Hi there,

I have been reading this thread for about 3 months,
trying to get 4 computers with identical hardware and software
running stably:

AMD Ryzen 7 1700
ASRock AB350M
openSUSE Tumbleweed
32 GB memory (max for this board)
no overclocking; memory (except for the color) is correct for this board
(the memory part numbers differ, but that only relates to
the color of the cover, as my computer shop told me)

1) I replaced all processors twice; now the high-load bug is gone.
   UA 1714 PGT (the first processors, with this bug)
   UA 1733 PGS (the second processors, also with this bug)
   Now I have some others (no idea which, but they run
   with a modified "kill ryzen script" and a BIOS LOWER than 3.2!!!!)
   -> so there ARE processors out there from week > 30 which still
      had the high-load bug!!

2) Still with those BIOS versions, at low load the systems crashed
   randomly about every 5 days. The systems run about 12 hours/day,
   are then shut down, and are started again the next morning.
   In this situation there is a crash about once every 5 days.
   -> The crashes did not appear if a system ran continuously
      for one week without anyone touching it!!!
   It crashes when starting a browser, changing windows in KDE, or
   typing in LibreOffice after a while of doing nothing, so it seems
   there has to be a small load after nearly no load to cause the problem.

3) I updated to BIOS 4.4, later to 4.5.
   Crashes are still present.
   But this BIOS has this feature:
   /advanced/amdcbs/zen-common-options/"power supply idle control"
   I changed this from "auto" to "typical" (as I have seen here in this thread).
   Up to now, for about 3 weeks, no low-load crashes have appeared.
   BUT:

4) With BIOS 4.4 and 4.5 under high load I run into this problem:
   With the modified "kill ryzen script", out of 16 loops, I write 4 loops
   to nvme0 (because 32 GB of memory is too small for 16 loops)
   and 12 loops to the ramdrive,
   so there is also a heavy load on the bus.
   kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000
   kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0019(Transmitter ID)
   These errors occur every couple of seconds.
   With BIOS versions lower than 3.2 this was all fine; the error did not show up.

5) To solve problem 4), I added pcie_aspm=off to the kernel command line.
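
To check whether the corrected PCIe errors from item 4 really stop after such a change, one option is to count them per port in the kernel log. This is only a sketch: the regular expression is modelled on the two log lines quoted in item 4, the function name is made up, and other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the "PCIe Bus Error: severity=Corrected" line in item 4.
AER_CORRECTED = re.compile(r"pcieport (\S+): PCIe Bus Error: severity=Corrected")


def count_corrected_errors(log_lines):
    """Count corrected AER bus errors per PCIe port address."""
    counts = Counter()
    for line in log_lines:
        match = AER_CORRECTED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "kernel: pcieport 0000:00:03.1: AER: Multiple Corrected error received: id=0000",
    "kernel: pcieport 0000:00:03.1: PCIe Bus Error: severity=Corrected, "
    "type=Data Link Layer, id=0019(Transmitter ID)",
]
print(count_corrected_errors(sample))
```

Feeding it the lines of a kernel log before and after the change gives a rough comparison. It says nothing about why the errors occur.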

Now it SEEMS that these systems are stable under low and high load :-)))
... it cost me nearly half a year to get to this point.

Thanks to all here who have reported and gave me some hints.

simoN

Revision history for this message

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-03-28:

#366

FWIW, after 14 days of silence, I have heard back from "<email address hidden>".

When I last wrote I sent them the URL for this discussion.

"<email address hidden>" tells me:
__________________________________________________________________________
Thank you for your patience, I understand that the idle hang freezing
issue is frustrating.

  Thanks for sharing the Bugzilla link, I have glanced through the
  discussion thread and saw that users that have the Power Supply
  Control option available in the bios confirmed that changing the
  setting has resolved their problem.

This option is the AMD solution and has been provided to motherboard
vendor for validation and inclusion in a future BIOS release.

  According to comment 279, a poster with the same motherboard as you,
  has pointed to a discussion where it states that an updated BIOS will
  be made available for your motherboard which contains the Power Supply
  Control Option.
  https://www.hardwareluxx.de/community/f219/asus-prime-x370-pro-am4-1156996-289.html#post26231965

Although I cannot confirm the accuracy of this post, you may want to
check with ASUS directly.
__________________________________________________________________________

I did ask them what the "Power Supply Control" option actually does, and why.

I have asked again.

Given that the "Power Supply Control" option is "the AMD solution", I assume that AMD are the right people to ask?

FWIW: ASUS PRIME X370-PRO BIOS 3805 has appeared, with the release note "Improve system performance" -- but that does not have a "Power Supply Control" setting (or at least not in the "Advanced/AMD CBS" menu).

Revision history for this message

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-28:

#367

For the ASUS X370 Prime, beta BIOS 3907 has this option available; 3805 does not.
I think I will test 3907 in the next few days, although I would prefer not to use a beta BIOS.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-03-29:

#368

Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've seen some reports from users that they no longer have the issue with the 4.15 kernel. I updated to 4.15 and have already survived two nights of idle usage at stock settings (no CPU overclock etc). Of course this has happened in the past as well sometimes, so I have to wait a few more days to draw conclusions, but I wonder if anyone else has tested with 4.15.

In Linux Kernel Bug Tracker #196683, excieve (excieve-linux-kernel-bugs) wrote on 2018-03-29:

#369

(In reply to Panagiotis Malakoudis from comment #283)
> Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've
> seen some reports from users that they no longer have the issue with 4.15
> kernel. I updated to 4.15 and already survived two nights of idle usage at
> stock settings (no cpu overclock etc). Of course this has happened in the
> past as well some times so I have to wait a few more days to make
> conclusions, but I wonder if anyone else tested with 4.15

For me it froze after a bit more than a week of mixed usage, on all defaults, on 4.15.x.

In Linux Kernel Bug Tracker #196683, simon (simon-linux-kernel-bugs) wrote on 2018-03-29:

#370

(In reply to Panagiotis Malakoudis from comment #283)
> Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've

Hi, I'm using openSUSE's 4.15.10-1-default kernel; see my comment 280 for the system configuration (stable).
How do I check whether rcuo threads are enabled or not?

In Linux Kernel Bug Tracker #196683, simon (simon-linux-kernel-bugs) wrote on 2018-03-29:

#371

(In reply to Simon from comment #285)
> (In reply to Panagiotis Malakoudis from comment #283)
> > Has anyone tested with 4.15.x kernels without rcuo threads enabled? I've
>
> Hi, i using opensuse's 4.15.10-1-default kernel see my comment 280 for
> system configuration (stable).
> how to check if rcuo treads are enabled or not?

Maybe this is the info you need:

zcat /proc/config.gz | grep RCU

# RCU Subsystem
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_HAVE_RCU_TABLE_FREE=y
# RCU Debugging
# CONFIG_PROVE_RCU is not set
CONFIG_RCU_PERF_TEST=m
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
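
As a rough sketch of another way to check: callback offloading typically needs a kernel built with CONFIG_RCU_NOCB_CPU=y plus an rcu_nocbs= entry on the kernel command line, and when it is active, kernel threads whose names start with "rcuo" show up in the process list. The helper names below are made up for illustration and are not part of any tool.

```shell
# Print the value of rcu_nocbs= from a kernel command line string (empty if absent).
rcu_nocbs_value() {
    printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^rcu_nocbs=//p'
}

# Count process names starting with "rcuo" in a list read from stdin.
count_rcuo_threads() {
    grep -c '^rcuo'
}

# Example on a live system (not executed here):
#   rcu_nocbs_value "$(cat /proc/cmdline)"
#   ps -e -o comm= | count_rcuo_threads
```

An empty rcu_nocbs value together with a count of 0 would suggest that no rcuo threads are running.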

In Linux Kernel Bug Tracker #196683, ptheis (ptheis-linux-kernel-bugs) wrote on 2018-04-01:

#372

I don't get it ... my Manjaro system was stable for 3 months, lately with 4.14 LTS kernels and constantly upgraded NVIDIA drivers. Now the system is unstable again, even when decoding HD video streams from the web ... I will give kernel 4.16 a try.

In Linux Kernel Bug Tracker #196683, cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote on 2018-04-06:

#373

We've now seen this idle lockup with a Ryzen Pro 1700X based system running current Fedora 27. As with my Fedora 27 system, it is avoided by using 'rcu_nocbs=0-15 processor.max_cstate=5'. The hardware is an Asus Prime Z370-Pro motherboard, EVGA SuperNova G3 650W PSU, 2x 8GB Kingston ECC RAM, and an nVidia graphics card.

(It's disappointing but not surprising that the Ryzen Pro doesn't seem to fix this.)
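
For anyone applying the same boot options on a CPU with a different thread count, the rcu_nocbs= range presumably has to cover all logical CPUs (0-15 on a 16-thread part). A minimal sketch that builds the argument string; the function names are hypothetical, and the grubby line is Fedora-specific and shown only as a comment.

```shell
# Build the CPU range "0-(N-1)" for N logical CPUs.
nocbs_range() {
    printf '0-%d\n' "$(( $1 - 1 ))"
}

# Assemble the two boot arguments quoted above for N logical CPUs.
ryzen_idle_args() {
    printf 'rcu_nocbs=%s processor.max_cstate=5\n' "$(nocbs_range "$1")"
}

# Example (Fedora, as root; not executed here):
#   grubby --update-kernel=ALL --args="$(ryzen_idle_args "$(nproc)")"
```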

In Linux Kernel Bug Tracker #196683, cks-kernelbugs (cks-kernelbugs-linux-kernel-bugs) wrote on 2018-04-06:

#374

Whoops, I need to correct my comment slightly, because Asus has so many motherboards: the Ryzen Pro is in an Asus Prime X370-Pro motherboard.

(There is no Asus 'Prime Z370-Pro', but there is an Asus 'Prime Z370-A' for current Intel CPUs so you can easily be confused just by changing a few letters.)

In Linux Kernel Bug Tracker #196683, jvdelisle (jvdelisle-linux-kernel-bugs) wrote on 2018-04-07:

#375

See also https://bugzilla.redhat.com/show_bug.cgi?id=1562530

I was thinking this was an amdgpu driver bug, but not sure now.

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-04-10:

#376

Does the AMD Epyc embedded series have this bug?

In Linux Kernel Bug Tracker #196683, robert (robert-linux-kernel-bugs) wrote on 2018-04-10:

#377

(In reply to OptionalRealName from comment #291)
> Does the AMD Epyc embedded series have this bug?

I am running a dual standard Epyc (7351s - Zen Rev.2) server and I can say that it has not manifested itself in any way. The server sat completely idle for a week waiting for software installations, and there were no errors (CentOS 7.4).
I cannot say anything about the Embedded version though. But I guess it may depend on the "successful" marriage with its motherboard in reality.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-12:

#378

Hi.
I am testing it under Windows.
First, you should know that on Windows only the economy (power saver) mode makes use of core parking. Normal mode doesn't, and neither does the Ryzen one delivered with the chipset driver package, from what I see in the ParkControl app.

I'm now using core parking to try to reproduce it under this other OS. It has been running for 4 hours in economy mode with core parking enabled, and still no crash.

I use several OSes, and I had a crash (freeze) just after I installed the chipset driver package, which touches the scheduler settings.
Since then I have never had another freeze. I had to use the reset button to get out of the freeze.

From what I read in this thread, if I want to reproduce it on Linux, the best option is to try an updated Fedora 27.

I'll be back with other tests.

In Linux Kernel Bug Tracker #196683, brauliobo (brauliobo-linux-kernel-bugs) wrote on 2018-04-13:

#379

(In reply to JerryD from comment #290)
> See also https://bugzilla.redhat.com/show_bug.cgi?id=1562530
>
> I was thinking this was an amdgpu driver bug, but not sure now.

See my replies there; I think with Raven Ridge this is really related to the amdgpu driver.

In Linux Kernel Bug Tracker #196683, devurandom (devurandom-linux-kernel-bugs) wrote on 2018-04-14:

#380

I am using an AMD Ryzen 5 2400G on an Asus ROG Strix B350-F (firmware version 3805 and 3803 before that) and was also experiencing freezes, e.g. after about one hour into `rsync -a /home ...`, with Gentoo Linux (Linux 4.15.10), Arch Linux 2018.03.1, Fedora 27, Fedora 28 nightly, Fedora 28 beta. Setting rcu_nocbs=0-7, processor.max_cstate=5 or intel_idle.max_cstate=5 did not seem to have an effect -- the freezes still occurred. The hardware itself (mainboard, CPU, RAM) was cross-checked by the supplier, who found no fault in it. Since I disabled CPU C-states in the mainboard firmware about a day ago, I am no longer able to reproduce the freeze.

In Linux Kernel Bug Tracker #196683, dagecko (dagecko-linux-kernel-bugs) wrote on 2018-04-16:

#381

(In reply to Dennis Schridde from comment #295)
> I am using an AMD Ryzen 5 2400G on an Asus ROG Strix B350-F (firmware
> version 3805 and 3803 before that) and was also experiencing freezes, e.g.
> after about one hour into `rsync -a /home ...`, with Gentoo Linux (Linux
> 4.15.10), Arch Linux 2018.03.1, Fedora 27, Fedora 28 nightly, Fedora 28
> beta. Setting rcu_nocbs=0-7, processor.max_cstate=5 or
> intel_idle.max_cstate=5 did not seem to have an effect -- the freezes still
> occurred. The hardware itself (mainboard, CPU, RAM) was cross-checked by
> the supplier, who found no fault in it. Since I disabled CPU C-states in
> the mainboard firmware about a day ago, I am no longer able to reproduce the
> freeze.

Ryzen 2xxx are almost Ryzen 1xxx but with lower TDP, meaning lower CPU frequency and thus lower power.
And since all available solutions and workarounds are about not letting the chipset give less power to the system (CPU, memory...) or about explicitly giving more power to the system, it is not surprising (in my opinion) that the bug is still here.
What will be more interesting is to see if the bug is still here in the next generation of Ryzen...

Also, since it is now crystal clear that this bug is not a software bug (thus not a kernel bug), but clearly a bug from AMD (most probably in the full design of Ryzen, both CPU and chipsets), let me suggest again that you all consider tweaking your motherboard voltage. Just give it slightly more and you'll have a fully working system. The CPU will still be able to lower its power, to scale its frequency... This is totally a win-win.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-04-19:

#382

I flashed my motherboard (ASUS Prime X370 Pro) with the new 4008 BIOS that finally offers the "Power Supply Control" option. I set it to typical idle, disabled my CPU overclock (which had made the problem less frequent), and am now waiting to see if it will freeze.

What is already obvious though is that CPU voltage never goes under 0.8V. Without the Power Supply Control option set to Typical, voltage goes down to 0.39V.

XFR and single core turbo work fine even if voltage is not dropping to 0.39V.

Will report back in a few days.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-19:

#383

I do the same on my side.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-19:

#384

(In reply to Panagiotis Malakoudis from comment #297)
> I flashed my motherboard (ASUS Prime X370 Pro) with the new 4008 BIOS that
> finally offers the "Power Supply Control" option. I set it to typical idle,
> disabled my CPU overclock that made the problem less frequent and now
> waiting to see if it will freeze.
>
> What is already obvious though is that CPU power never goes under 0.8V.
> Without the Power Supply Control option set to Typical, voltage goes down to
> 0.39V
>
> XFR and single core turbo works fine even if voltage is not dropping to 0.39V
>
> Will report back in a few days.

How do you monitor XFR/single core turbo under Linux?
Thank you

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-04-19:

#385

I open a terminal and run as root:
watch -n 0.5 cpupower monitor

cpupower is in package linux-cpupower in Debian 9 that I am using.
apt install linux-cpupower
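
If cpupower is not available, a rough alternative is to read the per-CPU frequency from the cpufreq sysfs files. This is only a sketch: it assumes the scaling_cur_freq files exist (values in kHz), the function name is made up, and what the cpufreq driver reports there may not match what cpupower monitor shows for boost clocks.

```shell
# Print "cpuN MHz" for each CPU found under a sysfs-style directory.
# The optional argument overrides the base directory (default: real sysfs).
print_cpu_mhz() {
    base="${1:-/sys/devices/system/cpu}"
    for f in "$base"/cpu[0-9]*/cpufreq/scaling_cur_freq; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        cpu="${f#"$base"/}"              # strip the base directory
        cpu="${cpu%%/*}"                 # keep only "cpuN"
        printf '%s %d\n' "$cpu" "$(( $(cat "$f") / 1000 ))"
    done
}

# Example (not executed here):
#   watch -n 0.5 'sh -c ". ./cpu_mhz.sh; print_cpu_mhz"'   # cpu_mhz.sh is a hypothetical file holding the function
```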

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-04-19:

#386

And if you want to test a single core benchmark pinned to a specific core to check what frequency can be reached, it is easy with the following command:
taskset -c 10 openssl speed rsa2048

On my 1700X, cpupower monitor shows frequency up to 3880 for core 10 in the above benchmark.

If you want to stress two cores, you can either run two commands, or:
taskset -c 6,10 openssl speed -multi 2 rsa2048

Two cores go up to 3800 on my 1700X. Three and above go only to 3500. Two cores+multithread (taskset -c 6,7,10,11 for example) go up to 3650. This is the point where the new ryzen 2XXX are better, they can use higher turbo when using >2 cores. It seems they can do 4000 MHz up to 4 cores/8 threads with XFR2.
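
To avoid retyping these for different core combinations, a small helper can build the command line from a comma-separated CPU list. This is just a sketch with a made-up function name; it only prints the command, and it does not understand ranges such as 0-3.

```shell
# Print the pinned benchmark command for a comma-separated CPU list, e.g. "6,10".
pinned_bench_cmd() {
    cpus="$1"
    # Count the entries in the list to pick the -multi value.
    n="$(printf '%s\n' "$cpus" | tr ',' '\n' | grep -c .)"
    if [ "$n" -gt 1 ]; then
        printf 'taskset -c %s openssl speed -multi %d rsa2048\n' "$cpus" "$n"
    else
        printf 'taskset -c %s openssl speed rsa2048\n' "$cpus"
    fi
}

# Example (not executed here): run the generated command
#   eval "$(pinned_bench_cmd 6,7,10,11)"
```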

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-19:

#387

(In reply to Panagiotis Malakoudis from comment #301)
> And if you want to test a single core benchmark pinned to a specific core to
> check what frequency can be reached, it is easy with following command:
> taskset -c 10 openssl speed rsa2048
>
> On my 1700X, cpupower monitor shows frequency up to 3880 for core 10 in the
> above benchmark.
>
> If you want to stress two cores, you can either run two commands, or:
> taskset -c 6,10 openssl speed -multi 2 rsa2048
>
> Two cores go up to 3800 on my 1700X. Three and above go only to 3500. Two
> cores+multithread (taskset -c 6,7,10,11 for example) go up to 3650. This is
> the point where the new ryzen 2XXX are better, they can use higher turbo
> when using >2 cores. It seems they can do 4000 MHz up to 4 cores/8 threads
> with XFR2.
No problem with that; I use -mmt1, -mmt2, -mmt3, -mmt4, etc. with 7za b for it.
And as you said, XFR and turbo are working.
I now see that my performance under Windows is better; I think it's because the timing for changing frequencies is shorter on Windows. I'm going to tune it a bit to see; on Windows, better timing improves performance by ~10% IIRC.

Yes they can:
between 4.05 and 4.075 GHz over eight cores.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-19:

#388

So far,
[philectro@linux ~]$ uptime
21:32:34 up 7:14

The "typical current" setting seems to fix the problem.
The zenstates script says my C6 is enabled, so I don't know what this parameter does.

In Linux Kernel Bug Tracker #196683, phillips (phillips-linux-kernel-bugs) wrote on 2018-04-22:

#389

For my Ryzen 1700 system, setting bios "Power Supply Idle Control" to "Typical Current" increases power consumption at idle by approximately 4 watts, from 39 watts to 43. Still impressively low, but I hope that AMD will give me my 4 watts back eventually. Uptime is currently 2 days, but has been as high as 6 weeks in the past before locking up, so I cannot yet report that my stability issue is resolved.

As others have reported, the "Typical Current" option disables package C6 state, but not core C6 state. It is high time for AMD to make an official statement on this issue.

Motherboard: GA-AB360-GAMING, stock voltages and clocks
Processor: Ryzen 7 1700, week 38
Power supply: EVGA 650 GQ
Video: Radeon 6450, radeon driver
Kernel: various, including 4.15.0

In Linux Kernel Bug Tracker #196683, phillips (phillips-linux-kernel-bugs) wrote on 2018-04-22:

#390

Correction: GA-AB350-GAMING

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-04-22:

#391

Package C6 state is not disabled with "Power Supply Idle Control" set to "Typical Current" on the Asus Prime X370 with the 4008 BIOS. Also, disabling package C6 state with zenstates.py didn't have any effect on previous BIOSes.

I don't care about the 4 watts, but I would definitely like a technical explanation of the issue.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-23:

#392

Same for me.
I confirm that on my X370-Pro "typical current" doesn't disable C6 at all.

As for the script, it disabled both C6 states on my motherboard with UEFI version 3805.
What do you mean by "didn't have any effect"?
My system was stable when I disabled C6.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-04-23:

#393

When I disabled only package C6, I still had idle freezes. Completely disabling C6 was not an option, since you lose XFR and turbo speeds above 3500 (on my 1700X).

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-23:

#394

I contacted AMD support about my Ryzen 1600 and asked for an explanation. I got an answer but so far no explanation.

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-04-29:

#395

Today, I tested bios 4008 (Prime X370-PRO / previous was 3404). The update wasn't any problem. I measured power consumption with enabled "typical current" and couldn't find any difference to the standard adjustment (and to my current overclocking solution w/ bios 3404).

Sensors showed with enabled "typical current", that minimal voltage is about 0.8 V - whereas standard uses between 0.4 V and 0.8 V. C6 (package) states are enabled.

Nevertheless I went back to 3404 (and overclocking) via dos/afudos because of the completely broken fan regulation (CPU fan min duty cycle is 60% - this is completely unusable) - I didn't want to use the modded bios, because I'm not sure how to install it.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-04-30:

#396

Now that my freezes are solved, I got this kind of flood on 4.15.17 today (and maybe yesterday, as my SMB shares were oddly slow):

pcieport 0000:00:01.3: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=000b(Transmitter ID)
I rebooted several times to try different kernel versions:
4.16.4 and 4.15.16
The same error appeared.

I shut down the computer, cut power to the PSU completely with the switch, and waited.
I pushed the Radeon 7970 a bit into its slot in case it had moved.
After powering it back on, the error doesn't appear anymore.
I tried to reproduce it with 4.15.17 but couldn't.

I have seen several users on level1 with Threadripper, and some with Ryzen, who had this error.

In Linux Kernel Bug Tracker #196683, malakudi (malakudi-linux-kernel-bugs) wrote on 2018-05-02:

#397

Unfortunately I have to report that the idle freeze issue hasn't been fixed for me with new BIOS and the "Typical Power idle" option. My system froze 2 times since 19/4 when I installed the new BIOS.

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-05-02:

#398

(In reply to Panagiotis Malakoudis from comment #312)
> Unfortunately I have to report that the idle freeze issue hasn't been fixed
> for me with new BIOS and the "Typical Power idle" option.

What's your kernel version?

I'm testing at the moment w/ BIOS 4008 / default configuration, i.e. w/o the "Typical Power idle" option, and Linux 4.16.6. I haven't seen any problem so far, but this isn't a final result yet; it may change anytime.

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-05-02:

#399

Forgot to mention, that I applied this patch(1) to 4.16.6, which is now part of 4.16.7. See comments(2).

BTW, I couldn't measure any power consumption change with this patch.

(1) https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86-urgent-for-linus&id=da6fa7ef67f07108a1b0cb9fd9e7fcaabd39c051&utm_source=anz

(2) https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.17-AMD-Power-Fix

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2018-05-02:

#400

Has anyone tried a Ryzen 2000 series / x470 mainboard yet?
no reason to wait for a TR 2000 series apparently, since it won't be using Zen+ cores.

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-05-02:

#401

(In reply to Jonathan from comment #315)
> Has anyone tried a Ryzen 2000 series / x470 mainboard yet?
> no reason to wait for a TR 2000 series apparently, since it won't be using
> Zen+ cores.

I literally just came here to ask the same question, curious if they've actually fixed this properly.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-05-03:

#402

(In reply to Panagiotis Malakoudis from comment #312)
> Unfortunately I have to report that the idle freeze issue hasn't been fixed
> for me with new BIOS and the "Typical Power idle" option. My system froze 2
> times since 19/4 when I installed the new BIOS.

Hmm, time to talk with AMD and RMA, because it does fix it for me.
But I still consider it a dirty workaround to hide the fact that the CPUs are defective.

In Linux Kernel Bug Tracker #196683, nethershaw (nethershaw-linux-kernel-bugs) wrote on 2018-05-03:

#403

To answer the question re: the new Ryzen CPUs:

I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board, and I can positively confirm that the random soft-lockups are very much still a thing. Nothing appears in dmesg, and nothing appears in the system journal.

My kernel is 4.16.6 (gentoo-sources patchset) and so far, the only troubleshooting I've done before landing on this bug report is to set 'processor.max_cstate=5' in my kernel boot options. The machine soft-locked about two hours later whilst I was using it (a few minutes ago).

I'll go back through the comments and suggestions to attempt the deeper workarounds, including kernel 4.16.7, but I thought this information might be valuable to someone first.

If anyone has anything they'd like me to test in particular, this is a Gentoo system and I am more than happy to subject it to trials in service of finding a solution to this bug.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-05-03:

#404

Haven't you tested the "typical current" option in the UEFI?
It's the official workaround.

If you still have freezes with the B2 stepping, ouch.
I thought they had at least fixed it there...

You mean it freezes completely and you need to press reset?

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-05-03:

#405

(In reply to Matthew Vaughn from comment #318)
> To answer the question re: the new Ryzen CPUs:
>
> I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board,
> and I can positively confirm that the random soft-lockups are very much
> still a thing.

This is truly appalling. Thanks for the information.

In Linux Kernel Bug Tracker #196683, nethershaw (nethershaw-linux-kernel-bugs) wrote on 2018-05-04:

#406

(In reply to Michaël Colignon from comment #319)
> You don't have test the "typical current" option in the uefi?
> It's the official workaround.

I only found this bug report in the last hour, so I hadn't yet switched on the 'Typical Current Idle' option in UEFI. I just switched that on during my upgrade to kernel 4.16.7 a moment ago. I'll return here and report my results.

> You means it freeze completely and you need to press reset?

That's correct; up until now, it would freeze completely and I needed to press reset to reboot the machine. This had happened several times in the last 24 hours before I started looking around for reports of the issue and landed here. It's a brand-new installation.

In Linux Kernel Bug Tracker #196683, oyvinds (oyvinds-linux-kernel-bugs) wrote on 2018-05-04:

#407

Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01. This is over ONE YEAR since the first release of Ryzen CPUs.

Can y'all verify for me that this actually works - long term - and that I can remove both the rcu_nocbs=0-7 kernel boot option and my custom disable-c6-on-boot.service & disable-c6-on-suspend.service (which disable both C6 package and core)?

Or do I need to keep either rcu_nocbs or zenstates --c6-disable?

I would very much like to know if Typical Idle would be enough. I realize I could simply test this but I don't want to experiment and have random hangs. I'm on kernel 4.17rc3 btw.

If Typical Idle (I also see an "auto" option) actually fixes this then it's still a TOTAL SCANDAL since it's been a year now. But hey, better late than never.

In Linux Kernel Bug Tracker #196683, nethershaw (nethershaw-linux-kernel-bugs) wrote on 2018-05-04:

#408

I'm testing with just the 'Typical Current Idle' option and no other countermeasures enabled right now. I'll give it 24 hours before I'm convinced, given the frequency of previously observed lockups. So far, there have been none since my last comment, which is longer than any previous uptime interval I've had on this hardware since installing it.

For what it's worth, the 'auto' option for the current idle setting is no good. That's the default, and was the setting in effect when I was observing lockups.

In Linux Kernel Bug Tracker #196683, it (it-linux-kernel-bugs) wrote on 2018-05-04:

#409

(In reply to oyvinds from comment #322)
> Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option
> in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01.
> This is over ONE YEAR since the first release of Ryzen CPUs.
>
> Can ya'all verify for me that this actually works - long term - and that I
> can remove both the rcu_nocbs=0-7 kernel boot option and my custom
> disable-c6-on-boot.service & disable-c6-on-suspend.service (which disables
> both C6 package and core)?
>
> Or do I need to keep either rcu_nocbs or zenstates --c6-disable?
>
> I would very much like to know if Typical Idle would be enough. I realize I
> could simply test this but I don't want to experiment and have random hangs.
> I'm on kernel 4.17rc3 btw.

I haven't experienced any lockups for more than 2 months. Only Typical Idle enabled, nothing else changed.
Running Ubuntu 17.10 stock kernel 4.13.0-39-generic

In Linux Kernel Bug Tracker #196683, chewi (chewi-linux-kernel-bugs) wrote on 2018-05-04:

#410

(In reply to oyvinds from comment #322)
> Gigabyte's GA-AX370-Gaming-5 motherboard finally got the Typical Idle option
> in BIOS version F23f with AGESA 1.0.0.2a + SMU FW 43.18 dated 2018/05/01.
> This is over ONE YEAR since the first release of Ryzen CPUs.

I have this board and it was available in F22 over a month ago. Still not great but I honestly don't think AMD were even aware of this issue before last August or so. The other segfault issue was getting far more attention and many wrongly assumed the freezes were related.

> I would very much like to know if Typical Idle would be enough.

It is enough. I have disabled all other workarounds and I've maybe had one freeze since though I can't remember for sure and that may well have been down to something else.

In Linux Kernel Bug Tracker #196683, KuroNeko (kuroneko-linux-kernel-bugs) wrote on 2018-05-04:

#411

(In reply to Matthew Vaughn from comment #318)
> I'm running an AMD Ryzen 7 2700X on an ASUS Crosshair VII Hero X470 board,
> and I can positively confirm that the random soft-lockups are very much
> still a thing. Nothing appears in dmesg, and nothing appears in the system
> journal.

Very disappointing. AMD really isn't taking this seriously. Since power consumption is really important for me, and my PC is often idling (so either live with freezes or accept even higher power consumption), this really means no AMD. Especially since their Threadripper next gen is supposed to use an older Epyc core, not even Zen+. And since they're not taking this seriously, my guess is Zen 2 next year will still be buggy. Way to go AMD.

And with Intel just getting hit with another Spectre round, I guess their fixes will make their CPUs even slower than my old unpatched Sandy Bridge.

*grumble*

Oh well, this is one way to get upgraditis cured I guess.

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-05-04:

#412

I lost interest in my Ryzen 7 1800X machine... but this morning I got round to upgrading the BIOS on the ASUS Prime X370-Pro (to 4011, hot off the press: "Update AGESA 1.0.0.2a + SMU 43.18" whatever that means).

I was running with rcu_nocbs and zenstates --c6-package-disable. That ran for 11 days and then froze.

I last heard from "<email address hidden>":

Thank you for the update and confirming that your BeQuite Straight Power 11
supports 0A minimum load.

  Because your system still freezes, it could be due to cross loading problems
  which can result in the power supply turning off when a load changes or
  result in voltages becoming out of specification causing system crashes and
  hangs.

  What the Power Supply Idle Control option does is disables the lowest power
  state for the CPU. This is where ALL cores are not in operation and the
  entire CCX or Core complex is taken down.

  There are many levels of power states that a core can be in from C1 to C6,
  CC6 and finally PC6. The Power Supply Idle Control option is designed to
  keep enough current on the rail so that power supply does not go out of
  regulation.

  The Power Supply Idle Control option is part of an AGESA update from AMD
  provided to the motherboard vendors for validation and implementation in
  their BIOS updates. However, it is the motherboard vendor's decision as to
  which BIOS version will contain the Power Supply Idle Control option.

So now I have options: "low current idle", "typical current idle" and "auto". Neither AMD nor ASUS seem to think it necessary to document what those mean.

I have set "typical current idle". I note that zenstates shows that both C6 States Package and Core are Enabled.

I guess I am back to waiting and seeing.

In Linux Kernel Bug Tracker #196683, jvdelisle (jvdelisle-linux-kernel-bugs) wrote on 2018-05-04:

#413

(In reply to Chris Hall from comment #327)
--- snip ---
> So now I have options: "low current idle", "typical current idle" and
> "auto". Neither AMD nor ASUS seem to think it necessary to document what
> those mean.
>
> I have set "typical current idle". I note that zenstates shows that both C6
> States Package and Core are Enabled.
>
> I guess I am back to waiting and seeing.

So have you tried the "auto" mode, or "low current idle"? It can't hurt to try them and just let the machine run doing something.

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-05-04:

#414

Auto must make use of low current idle, the setting that causes the freezes.
I can say that auto causes freezes for me, as it does for Matthew.

In Linux Kernel Bug Tracker #196683, nethershaw (nethershaw-linux-kernel-bugs) wrote on 2018-05-05:

#415

24 hours of uptime without a lockup after only setting "typical current idle." Considering my rig had been locking up after very short uptimes without this setting, I'd say that's a significant difference.

In Linux Kernel Bug Tracker #196683, tcl_de (tclde-linux-kernel-bugs) wrote on 2018-05-05:

#416

(In reply to Chris Hall from comment #327)

--snip--
> I last heard from "<email address hidden>":
>
> Thank you for the update and confirming that your BeQuite Straight Power 11
> supports 0A minimum load.
>
> Because your system still freezes, it could be due to cross loading
> problems
> which can result in the power supply turning off when a load changes or
> result in voltages becoming out of specification causing system crashes and
> hangs.
>
> What the Power Supply Idle Control option does is disables the lowest power
> state for the CPU. This is where ALL cores are not in operation and the
> entire CCX or Core complex is taken down.
>
> There are many levels of power states that a core can be in from C1 to C6,
> CC6 and finally PC6. The Power Supply Idle Control option is designed to
> keep enough current on the rail so that power supply does not go out of
> regulation.
>
> The Power Supply Idle Control option is part of an AGESA update from AMD
> provided to the motherboard vendors for validation and implementation in
> their BIOS updates. However, it is the motherboard vendor's decision as to
> which BIOS version will contain the Power Supply Idle Control option.
>
--snip--
Actually there should not be any cross loading problems by design here as this PSU uses DC/DC converters to regulate +5V and +3.3V independently of +12V.

I suspect that the problem might rather be related to the voltage regulation on the mainboard.
It's good to know however that the Power Supply Idle Control option is part of an AGESA update.
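
For anyone who wants to see which idle states the kernel itself is using on their machine, here is a minimal sketch that reads the Linux cpuidle sysfs files (`name`, `disable`, `usage`) for one CPU. Note the assumptions: the helper name `read_idle_states` and the `IdleState` type are made up for illustration, and the states listed by the kernel are not necessarily the same thing as the CC6/PC6 hardware states mentioned in the quoted email, so treat this as a way to inspect the OS side only.

```python
from pathlib import Path
from typing import List, NamedTuple


class IdleState(NamedTuple):
    index: int      # N in .../cpuidle/stateN
    name: str       # contents of the "name" file
    disabled: bool  # True if the "disable" file is not "0"
    usage: int      # number of times this state was entered


def read_idle_states(cpu_dir: str) -> List[IdleState]:
    """Read the cpuidle states below one CPU directory.

    cpu_dir is e.g. /sys/devices/system/cpu/cpu0 (hypothetical helper,
    not part of any existing tool).
    """
    cpuidle = Path(cpu_dir) / "cpuidle"
    # Sort numerically so that state10 comes after state9.
    state_dirs = sorted(cpuidle.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        states.append(IdleState(
            index=int(state_dir.name[len("state"):]),
            name=(state_dir / "name").read_text().strip(),
            disabled=(state_dir / "disable").read_text().strip() != "0",
            usage=int((state_dir / "usage").read_text().strip()),
        ))
    return states


if __name__ == "__main__":
    for state in read_idle_states("/sys/devices/system/cpu/cpu0"):
        print(state)
```

Comparing the `usage` counters before and after a long idle period gives a rough idea of how often the deeper states are actually entered.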

In Linux Kernel Bug Tracker #196683, ledesillusionniste (ledesillusionniste-linux-kernel-bugs) wrote on 2018-05-05:

#417

I suggest we all fill in a shared online spreadsheet with our hardware, PSU included, to settle this once and for all.
I'll let you decide on the right tool for it. I can host it if needed.

In Linux Kernel Bug Tracker #196683, kmueller (kmueller-linux-kernel-bugs) wrote on 2018-05-14:

#418

(In reply to Klaus Mueller from comment #313)

> I'm testing at the moment w/ Bios 4008 / default configuration - means w/o
> using "Typical Power idle" option and Linux 4.16.6. I didn't see any problem
> so far - but this isn't a final result until now - it may change anytime.

=> Got a hang yesterday.
watchdog: BUG: soft lockup - CPU#8 stuck for 23s! [worker:11105]

Went back to BIOS 3404 / Linux 4.14.x with overclocking, a combination which is known to be stable for me.
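
If it helps others collect comparable data, here is a minimal sketch for pulling soft-lockup reports like the one above out of a kernel log. The regular expression is based only on the single message quoted in this comment, and the names `SoftLockup` and `find_soft_lockups` are made up for illustration, so other kernel versions may need a different pattern.

```python
import re
from typing import Iterable, List, NamedTuple


class SoftLockup(NamedTuple):
    cpu: int      # CPU number reported by the watchdog
    seconds: int  # how long the CPU was reported stuck
    task: str     # name of the task that was running
    pid: int      # PID of that task


# Pattern derived from the message quoted above (an assumption, not
# taken from kernel documentation).
_SOFT_LOCKUP = re.compile(
    r"watchdog: BUG: soft lockup - CPU#(\d+) stuck for (\d+)s! "
    r"\[([^\]]+):(\d+)\]"
)


def find_soft_lockups(lines: Iterable[str]) -> List[SoftLockup]:
    """Return every soft-lockup report found in the given log lines."""
    hits = []
    for line in lines:
        match = _SOFT_LOCKUP.search(line)
        if match:
            cpu, seconds, task, pid = match.groups()
            hits.append(SoftLockup(int(cpu), int(seconds), task, int(pid)))
    return hits


if __name__ == "__main__":
    import sys
    # Usage: python find_lockups.py < kern.log
    for hit in find_soft_lockups(sys.stdin):
        print(hit)
```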

In Linux Kernel Bug Tracker #196683, kernelorg (kernelorg-linux-kernel-bugs) wrote on 2018-05-16:

#419

Someone needs to make Tomshardware, HardOCP, Anandtech aware of this thread.

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-05-16:

#420

I think that is a great idea. I have time to waste at airports soon, so can draft an email message.

Recipients:
Tomshardware, HardOCP, Anandtech, Ars Technica, Tweakers.net, Phoronix, ... anyone else?

Contents:
- Total system freezes on full idle
- Seemingly/definitely(?) a hardware bug
- AMD has been contacted multiple times but refuses to acknowledge the issue
- AMD first blamed PSUs
- AMD then pushed mobo manufacturers to add an undocumented BIOS option, which in its default setting does not solve the problem and with other settings doesn't always solve the problem
- Cause is still unknown
- Issue is also present in Ryzen 2xxx CPUs

Other suggestions welcome before I start writing it up!

In Linux Kernel Bug Tracker #196683, simon (simon-linux-kernel-bugs) wrote on 2018-05-16:

#421

I would suggest changing one of your lines:

> - AMD then pushed mobo manufacturers to add an undocumented BIOS option,
> which in its default setting does not solve the problem and with other
> settings MOST TIMES (NOT ALWAYS) solve the problem

and adding one sentence:
Here is the setting: /advanced/amdcbs/zen-common-options/"power supply idle control"
Change this from "auto" to "typical".

simoN

In Linux Kernel Bug Tracker #196683, dagecko (dagecko-linux-kernel-bugs) wrote on 2018-05-16:

#422

(In reply to kernel from comment #335)
> I think that is a great idea. I have time to waste at airports soon, so can
> draft an email message.
>

Just to tell part of my story: I bought the well-known Asus laptop with a Ryzen CPU. It exhibited the segfault bug. But before I could be sure about that, it burnt out just two weeks after I bought it... I sent it to Asus, hoping for a new laptop. Asus officially told me that they replaced the motherboard and the keyboard (they did not say whether the CPU was replaced, and I was not able to check this).
The repaired laptop no longer exhibited the segfault bug (same kernel version). But it did exhibit the soft-lockup bug. I didn't know it was this bug at the time. So I returned the laptop to Asus again, asking to be reimbursed. Asus repaired it again by changing the motherboard again. The bug was still present. (And for those interested, I finally did get reimbursed.)

Then I decided to do as usual: build my own desktop. And I faced the soft-lockup bug until I tweaked the voltages...

To go back to the story: if we can trust Asus, a motherboard change made the segfault bug disappear (AMD officially 'replaced' such 'faulty' CPUs) but made the soft-lockup bug appear. A second motherboard change did not improve anything.

What I guess (since this cannot really be verified) is that AMD does not know where all these bugs come from. And AMD does not know how to resolve them. AMD does not even know if this can be solved.

What we do know is that the engineer responsible for the Ryzen architecture was rehired by AMD (after having moved to Apple) but left AMD about 3 years ago (so before Ryzen was released). He moved to Tesla and is now working at Intel.
Link is here: https://en.wikipedia.org/wiki/Jim_Keller_(engineer)

In Linux Kernel Bug Tracker #196683, dagecko (dagecko-linux-kernel-bugs) wrote on 2018-05-16:

#423

And before my last paragraph is misinterpreted: I only meant that he is the person to contact if we want some concrete information.

In Linux Kernel Bug Tracker #196683, kernel (kernel-linux-kernel-bugs) wrote on 2018-05-18:

#424

I've sent the aforementioned outlets an email. For completeness, here's the entire thing:

Recipients:
Ars Technica: https://arstechnica.wufoo.com/forms/z7p8x7/
Tomshardware: http://www.purch.com/about/#contact-general
HardOCP: <email address hidden>
Anandtech: http://www.purch.com/about/#contact-general
Tweakers.net: <email address hidden>
Phoronix: https://www.phoronix-media.com/?

Subject: Publication of an AMD Ryzen hardware issue

Dear editor,

Since their introduction the AMD Ryzen processors have been plagued by several issues, most notably the segfault issue that occurred under high (compilation) loads. To that particular issue AMD has responded by replacing affected chips.

However, there is another significant issue that affects both Ryzen 1xxx and Threadripper CPUs, as well as the newer Ryzen 2xxx processors. It appears Epyc is not affected (although sample size is one in this case).

The most complete storyline on this issue can be found in the link below [1], however for an overview the rest of my email attempts to summarise the issue.

This issue results in a complete system freeze, occurring under full idle conditions, and requires a hard reset. Evidence suggests this is a hardware problem, since several workarounds have been found that mitigate/solve the issue, such as disabling C6 entirely (inefficient), overclocking and tweaking voltages (not for everyone), or running processes that keep the CPU active at all times (again, inefficient and pointless).

AMD has been contacted multiple times but refuses to acknowledge the issue. At some point in one reply AMD blamed users' PSUs; this is obviously nonsensical, as the issue occurs on a wide variety of PSUs including brand new models. The only constant is the Ryzen platform.

In response, AMD has pushed motherboard manufacturers to add an otherwise undocumented BIOS option, "Power Supply Idle Control", as part of their AGESA update. However, in its default value this setting does *not* solve the problem, and with other settings doesn't *always* solve the problem. To be precise, this setting needs to be changed from "auto" to "typical":
/advanced/amdcbs/zen-common-options/"power supply idle control"
However it is unknown what this option actually does.

The issue is also present on laptop platforms. For example, one user sent his laptop back to Asus repeatedly for motherboard replacements, but the issue remained.

The root cause of this issue is still unknown, and moreover, the issue is still present in the latest Ryzen 2xxx CPUs. AMD's refusal to acknowledge, let alone resolve, the issue has led to this email, in the hope that media attention to this problem will inspire AMD to take action. Hence, I propose that your outlet publish an article about the issue.

Kind regards,
<email address hidden>

[1] https://bugzilla.kernel.org/show_bug.cgi?id=196683
