Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
This bug affects 35 people
Affects: Linux
Status: Confirmed
Importance: Medium

Affects: linux (Ubuntu)
Importance: High
Assigned to: Unassigned

Bug Description

Hi,

We are getting various kernel crashes on a fairly new configuration.
We're using a Ryzen 1800X CPU with an X370 Gaming Pro Carbon motherboard (7A32V1) running the latest available BIOS (1.52).

We are running Ubuntu 17.04 (amd64). We've tried different kernel versions: the native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel versions:

native 17.04 kernel
4.10.15

The issue is the same on both: the machine freezes at random.

Here is the kern.log entry from when it happens:

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14
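
For anyone comparing logs, the stall reports above can be tallied with a couple of small helpers. This is only a sketch that assumes the kern.log format shown here; the function names count_rcu_stalls and stalled_cpus are made up for illustration.

```shell
# Sketch only: summarize RCU stall reports in a kernel log.
# Assumes the log line format shown in this report; names are hypothetical.

# Print how many "detected stalls" reports the log contains.
count_rcu_stalls() {
    # $1: path to a kernel log, e.g. /var/log/kern.log
    grep -c 'detected stalls on CPUs/tasks' "$1"
}

# Print each CPU named in a "Task dump for CPU N:" line, with a count.
stalled_cpus() {
    # $1: path to a kernel log
    sed -n 's/.*Task dump for CPU \([0-9][0-9]*\):.*/\1/p' "$1" | sort -n | uniq -c
}
```

Running count_rcu_stalls /var/log/kern.log followed by stalled_cpus /var/log/kern.log would show whether the stalls always name the same CPU or move around between cores.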

Depending on the kernel version, we also get NMI watchdog errors about a stuck CPU (the core ID mentioned is random).
The crash happens randomly, but generally after a few hours (3-4h).

Now, we've installed kernel 4.11.0-041100-generic #201705041534 this morning and are waiting for a crash...
For now, the machine is not "used"; at least, it's not CPU stressed...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, I'm still experiencing crashes, even when using 4.11.

Please find the full kern.log attached.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear after 24-48h of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in BIOS (from 12 threads to 6 threads)
- Boot with clocksource=tsc iommu=soft
- Disable IOMMU in BIOS

None of these worked. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that this will not work either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just got a kernel panic...

How do I add the tag kernel-bug-exists-upstream?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to file this bug report on the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I'll wait 24-48h and post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also ran into the problem, and your advice to remove "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting up, but now it has been working without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days of uptime now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't need the card for my purposes, so I simply disabled it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
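
Before and after applying the workaround it may help to confirm what the kernel currently reports. The sketch below reads the same procfs entry; the function name aslr_mode and the labels it prints are my own, and the optional path argument exists only so the function can be pointed at a test file.

```shell
# Sketch only: report the current ASLR mode.
# The function name and the printed labels are hypothetical.
aslr_mode() {
    # $1: optional path, defaults to the real procfs entry
    value=$(cat "${1:-/proc/sys/kernel/randomize_va_space}")
    case "$value" in
        0) echo "disabled" ;;
        1) echo "partial" ;;   # stack, mmap base and VDSO are randomized
        2) echo "full" ;;      # additionally randomizes the heap (brk)
        *) echo "unknown: $value" ;;
    esac
}
```

The echo above does not survive a reboot. The usual sysctl way to persist it is a line reading kernel.randomize_va_space = 0 in a file under /etc/sysctl.d/, with the same security trade-off.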

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest increasing SoC voltages, using specific kernel versions, and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue, as far as I'm aware. The issue has been seen on many different operating systems, and DragonFlyBSD includes a workaround for it. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and to disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled in the kernel, but this requires modifying the kernel source, not just the config file. A new AGESA update, 1.0.0.6a, was released about a week ago, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (e.g. 17), setting them to the next even number (e.g. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR, is still necessary though. It may also be a good idea to give the SoC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide a link showing how DragonFlyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.
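
To see whether a given kernel was built with these options, the config file shipped next to the kernel image can be checked. This is a minimal sketch; the function name config_enabled is hypothetical, and it assumes a config file in the usual OPTION=y format (on Ubuntu this is normally /boot/config-$(uname -r)).

```shell
# Sketch only: succeed if a kernel config file enables the given option.
# The function name is hypothetical.
config_enabled() {
    # $1: option name, e.g. CONFIG_RCU_NOCB_CPU
    # $2: path to a kernel config file
    grep -q "^$1=[ym]\$" "$2"
}
```

For example, config_enabled CONFIG_RCU_NOCB_CPU "/boot/config-$(uname -r)" && echo "built in" prints the message only when the option is set to y or m.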

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an Asrock X370 Gaming professional.

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen: once every few minutes it seemed to respond for a short time, which let me switch to a TTY. Each time I saw some processes stuck at 100% that I could not kill, or whose kill was delayed for a long time. I always see the errors from the initial post during that time. Once entered, though, the TTY seems to work fine.

David Wilson (plottt) wrote :

I've run into this problem several times. After disabling C-states in the motherboard firmware, my system has been running stable for ~5 weeks of uptime so far.

Torge (cyslider) wrote :

Ok, this trick did not help and my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed oibaf a while back because my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf again and now everything seems fine. No more lock screen freezing for me.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced, for Global C-States, that I just disabled today, and I am waiting to see what will happen.

On BIOS v3.00 the PC would freeze after 6 hours of idling or not touching anything. I would come back to find my keyboard, mouse, and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Does anyone know where the SoC voltage setting would be in the BIOS? Is it the VDD SoC?
Does anyone know how to update my kernel to 4.10 if it tells me I have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on the latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple of days to a week or so. It seems to happen after the machine has been relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzen CPUs manufactured after week 30. All my Ryzens are pre-week-24 CPUs and they all have this silent crash issue. Does it also happen with post-week-30 Ryzens? (@tgui, what is yours?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

tags: removed: kernel-da-key
information type: Public → Public Security
information type: Public Security → Public
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed

Created attachment 280669
dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied

some older info here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566

I have three machines with new 2700X CPUs, and all three of them experience freezes in Xubuntu 18.04 after some time of work. I compile QEMU with make -j16 in a loop to test for stability.

I'm not sure it is the same bug, because I observe different behaviour: one machine that was compiling QEMU froze during the night, and the one left idle worked for 1 day. Another thing that SEEMS to help is disabling SMT: the machine with SMT disabled is currently at 7 hours of uptime, which is much more than the usual 1-3 hours before a freeze. Windows 10 seems to work fine: I used WSL with Ubuntu 18.04 for the same test compilations for 3 days, with no freezes on the same BIOS settings.

Things I've tried that didn't help:
idle=nomwait
rcu_nocbs=0-15 with new kernel (4.20.3)
disabled cool n quiet and c6 states
Typical Current Idle in uefi
Set SoC voltage to 1.1v
Set DRAM voltage to 1.3v
Update to latest BIOS
High-end PSU

(In reply to Borislav Petkov from comment #489)
> Btw, is there any particular reason why you're running a 32-bit kernel?
> If not, I'd consider switching to a 64-bit kernel which is a lot more
> and widely tested.

Yes, we know. This box has been running forever, always upgraded to the latest Fedora, and we recently upgraded the hardware to the Ryzen from a non-64-capable P4. That's why it's 32b. We will upgrade it to 64-bit during next scheduled downtime in a month.

Maxim: I think your problem is different from this bug. From everything I've read everywhere, the mwait and/or idle:typical tweaks always solve this bug. You've done even the esoteric stuff like nocbs and voltage override and your problem remains. Might (must?) be something else.

(In reply to Maxim Bakulin from comment #492)
> Created attachment 280669 [details]
> dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
>
> some older info here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566
>
> I have three machines with new 2700x CPUs, and all three of them experience
> freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with
> make -j16 in a loop to test for stability.
>
> I'm not sure it is the same bug, because I observe different behaviour: one
> machine, that was compiling QEMU, froze during the night, and the one left
> idle worked for 1 day. Another thing that SEEM to help is disabling SMT:
> machine with disabled SMT is currently at 7 hours of uptime, which is much
> more, than usual 1-3 hours before freeze. Windows 10 seem to work fine: I
> used WSL with ubuntu 18.04 for the same test compilations for 3 days, and no
> freezes with the same bios settings.
>
> Things I've tried that didn't help:
> idle=nomwait
> rcu_nocbs=0-15 with new kernel (4.20.3)
> disabled cool n quiet and c6 states
> Typical Current Idle in uefi
> Set SoC voltage to 1.1v
> Set DRAM voltage to 1.3v
> Update to latest BIOS
> High-end PSU

With my Threadripper 2950X I can confirm that none of these were sufficient. I have an excellent PSU which I have confirmed is operating well better than spec, the voltage is definitely sufficient, the power states people are saying cause this are disabled, mwait is disabled, AGESA is current, etc.

(In reply to Another User from comment #481)
> (In reply to Aaron Muir Hamilton from comment #480)
> > So my box was mostly idle for the last day or so, and had locked up while
> > the monitor was asleep, so I had to reset it.
>
> Can you test your system with idle=halt parameter instead of idle=nomwait?

Now trying idle=halt, as you suggested. We'll see.

I did power consumption tests with idle=nomwait option on my Huawei Matebook D 14 Ryzen 2500u with power meter.

With idle=nomwait, idle power consumption is about 4W less!! I did a few reboots to be sure about it.

Folks, just a quick thing: please check whether you have the latest BIOS and if not, do upgrade it and check if it makes any difference.

Thx.

OS: Antergos Linux 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux.
CPU: AMD Ryzen 7 1700 Eight-Core Processor.
CPU flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate sme ssbd sev ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca
MB: ASRock AB350 Gaming-ITX/ac; Version: P4.60; Release Date: 04/19/2018
RAM: F4-2400C15D-16GFXR, 32 GB DDR4.

Attempt #1:
a) Set BIOS / Advanced / AMD CBS / Zen Common Options / Power Supply Idle Control to "Typical Current Idle".

Result: Hangs

Attempt #2:
a) Set /etc/default/grub to: GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=nomwait resume=UUID=9d2d2002-406a-4eb5-bd21-fdb889831991"
b) Ran update-grub
c) Rebooted
d) BIOS remains at "Typical Current Idle"

Result: Hangs

Journal Log:

Jan 24 17:21:00 kernel: BUG: unable to handle kernel paging request at ffffffffffffffff
Jan 24 17:21:00 kernel: PGD 3a0a0e067 P4D 3a0a0e067 PUD 3a0a10067 PMD 0
Jan 24 17:21:00 kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 24 17:21:00 kernel: CPU: 1 PID: 994 Comm: xfwm4 Not tainted 4.20.3-arch1-1-ARCH #1
Jan 24 17:21:00 kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018
Jan 24 17:55:01 kernel: rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
Jan 24 17:55:01 kernel: NMI watchdog: Watchdog detected hard LOCKUP on cpu 4
Jan 24 17:55:27 kernel: rcu: INFO: rcu_preempt detected expedited stalls on CPUs/tasks:
Jan 24 17:59:39 kernel: ACPI BIOS Warning (bug): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20181003/tbfadt-624)
Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)

$ sudo cat /boot/grub/grub.cfg | grep -i idle
 linux /vmlinuz-linux root=UUID=16a8e892-ec00-4316-b62d-5b60da84a7fb rw quiet idle=nomwait resume=UUID=9d2d2002-406a-4eb5-bd21-fdb889831991
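
For what it's worth, grub.cfg only shows what the next boot will use, while /proc/cmdline shows what the running kernel actually received. A minimal sketch for checking the latter follows; the function name cmdline_has is made up for illustration, and the optional file argument exists only so it can be tried against a saved command line.

```shell
# Sketch only: succeed if a boot parameter appears on the kernel command line.
# The function name is hypothetical.
cmdline_has() {
    # $1: exact parameter to look for, e.g. idle=nomwait
    # $2: optional file to read, defaults to /proc/cmdline
    # Split on spaces, then match the whole token as a fixed string.
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example, cmdline_has idle=nomwait && echo "nomwait active" prints the message only if the running kernel was booted with that exact parameter.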

Have not tried:

1) idle=halt
2) Disable C6 state with zenstates.py
3) BIOS upgrade

(In reply to T X from comment #498)
> Jan 24 17:59:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported
> by HW (0x0)

So this looks like a broken BIOS.

> Jan 24 17:21:00 kernel: Hardware name: To Be Filled By O.E.M. To Be
> Filled By O.E.M./AB350 Gaming-ITX/ac, BIOS P4.60 04/19/2018

Looking at Asrock's website, there's newer BIOS for your board, AFAICT,
so can you update it and rerun the same test? Or what you were doing to
cause the splat.

Also, please upload full dmesg after you've updated the BIOS.

Thx.

(In reply to Borislav Petkov from comment #497)
> Folks, just a quick thing: please check whether you have the latest BIOS and
> if not, do upgrade it and check if it makes any difference.
>
> Thx.

Already running latest BIOS (2018/12/19 for ASRock B450 Pro4 and 2018/12/17 for ASUS X470 Pro).

I found out that openSUSE Tumbleweed with the stock 4.20.0 kernel doesn't appear to freeze. I tried copying SUSE's kernel and /lib/modules to Xubuntu 18.04 LTS, but it didn't seem to help; Xubuntu still freezes. Maybe Xubuntu's firmware packages are old? But I could've done something wrong with the transplant procedure :) I'm going to test Xubuntu 18.10 over the weekend; maybe it has newer packages and does not freeze.

(In reply to Maxim Bakulin from comment #500)
> I found out that openSUSE Tumbleweed with stock 4.20.0 kernel doesn't appear
> to freeze. I tried copying suse's kernel and /lib/modules to Xubuntu
> 18.04LTS, but it didn't seem to help, xubuntu still freezes. Maybe,
> xubuntu's firmware packages are old? But I could've done something wrong
> with the transplant procedure :)

Yeah, I wouldn't do that.

What's stopping you from building a kernel on your machine and
installing it?

The web is full of tutorials like this one:

https://www.linux.com/learn/how-build-latest-linux-kernel-debian-linus-git-repo

for example, which is for debian-based systems.

Upgraded BIOS as follows:

1. https://www.asrock.com/mb/AMD/Fatal1ty%20AB350%20Gaming-ITXac/index.asp
2. Click Support > BIOS
3. Download Version 5.30 through Global site (ignore "AMD all in 1 VGA driver")
4. Formatted USB as FAT32
5. Copied and unzipped AB350 Gaming-ITXac(5.30)ROM.zip to USB
6. Flashed BIOS as per https://www.asrock.com/support/BIOSIG.asp?cat=BIOS8

$ dmidecode -t bios -q
  Version: P5.30
  Release Date: 12/18/2018

BIOS installed correctly.

$ journalctl -b

Jan 25 09:38:33 kernel: Command line: BOOT_IMAGE=/vmlinuz-linux root=UUID=16a8e892-ec00-4316-b62d-5b60da84a7fb rw quiet idle=nomwait resume=...
Jan 25 09:38:33 kernel: ACPI BIOS Warning (bug): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1>
Jan 25 09:38:33 kernel: mtrr: your CPUs had inconsistent variable MTRR settings
Jan 25 09:38:33 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
Jan 25 09:38:33 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)
Jan 25 09:38:33 kernel: sp5100-tco sp5100-tco: Watchdog hardware is disabled
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: ccp enabled
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: psp initialization failed
Jan 25 09:38:33 kernel: ccp 0000:27:00.2: enabled
Jan 25 09:38:38 kernel: kauditd_printk_skb: 23 callbacks suppressed

Since the BIOS was flashed, I assume that "Typical Current Idle" was not set.

Splat was caused by idling for ~20 minutes. Will try again with "Typical Current Idle" and, if it hangs again, will provide the dmesg log.

Additional hardware:

* Chassis: https://www.hdplex.com/hdplex-h5-fanless-computer-case.html
* PSU: https://www.hdplex.com/hdplex-internal-400w-ac-dc-adapter-with-active-pfc-and-19vdc-output.html and https://www.hdplex.com/hdplex-400w-hi-fi-dc-atx-power-supply-16v-24v-wide-range-voltage-input.html

(In reply to Borislav Petkov from comment #501)
> What's stopping you from building a kernel on your machine and
> installing it?

I tried and it didn't help. After freezes started happening at Xubuntu 18.04LTS with stock 4.15, I compiled latest 4.20.3 kernel and added CONFIG_RCU_NOCB_CPU=y, as it was suggested here, but freezes didn't stop. There's my dmesg at comment 492 with latest kernel.

(In reply to Maxim Bakulin from comment #503)
> I tried and it didn't help. After freezes started happening at Xubuntu
> 18.04LTS with stock 4.15, I compiled latest 4.20.3 kernel and added
> CONFIG_RCU_NOCB_CPU=y, as it was suggested here, but freezes didn't stop.
> There's my dmesg at comment 492 with latest kernel.

Lemme get this straight:

opensuse 4.20.0 kernel doesn't freeze the box but 4.20.3 stable kernel does?!

(In reply to T X from comment #502)
> Jan 25 09:38:33 kernel: ACPI: [Firmware Bug]: BIOS _OSI(Linux) query ignored
> Jan 25 09:38:33 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not supported
> by HW (0x0)

Yah, even with the new BIOS that's still there. Doesn't look like it has been
fixed.

If I send you a debugging patch, would you be able to apply it, build a
kernel, boot into it and get me dmesg from the box?

Thx.

I'm running with idle=halt, and I have not experienced a freeze for a couple days now, FWIW.

Using:

* BIOS Version: P5.30, Release Date: 12/18/2018
* idle=nomwait
* Power Supply Idle Control set to "Typical Current Idle"

The ~20-minute idle lock-up issue appears fixed, despite:

Jan 25 10:21:39 kernel: [Firmware Bug]: ACPI MWAIT C-state 0x0 not support...

If it locks up overnight, I'll report back with the dmesg.

Can people confirm that idle=halt fixes the issue, like Aaron says in comment #506?

> If I send you a debugging patch, would you be able to apply it, build a
> kernel, boot into it and get me dmesg from the box?

I haven't built a kernel in about a decade, but if you provide step-by-step instructions for how to do so, I'll attempt to build the patched kernel and then post the dmesg.

$ uname -a
Linux 4.20.3-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 16 22:38:58 UTC 2019 x86_64 GNU/Linux

$ cat /etc/issue
Antergos Linux

(In reply to Maxim Bakulin from comment #492)
> Created attachment 280669 [details]
> dmesg of freeze with 4.20.3 kernel and nomwait, rcu_nocbs, max_cstate applied
>
> some older info here:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085/comments/566
>
> I have three machines with new 2700x CPUs, and all three of them experience
> freezes in xubuntu 18.04 after some time of work. I use compiling QEMU with
> make -j16 in a loop to test for stability.
>
> I'm not sure it is the same bug, because I observe different behaviour: one
> machine, that was compiling QEMU, froze during the night, and the one left
> idle worked for 1 day. Another thing that SEEM to help is disabling SMT:

The Processor errata lists two bugs (SMT-related) for Ryzen 1 and Ryzen 2 (1095 and 1109) with status "no fix planned". If you disable MWAIT but enable SMT, you are left with bug 1095: "Potential Violation of Read Ordering In Lock Operation In SMT (Simultaneous Multithreading) Mode". This can cause crashes. Not necessarily the cause of your crashes :-D

# lsmsr -r 0xc0011020
warning: unknown MSR c0011020
unknown = 0x0006800000000010

On my Ryzen 1600X bit 57 (no idea what it does) is 0. (But I have nosmt=force.) Linux kernel doesn't seem to touch that bit.

Also, if you get "ACPI MWAIT C-state 0x0 not supported by HW (0x0)", mwait is not used by kernel.

(In reply to T X from comment #509)
> I haven't built a kernel in about decade, but if you provide step-by-step
> instructions for how to do so, I'll attempt to build the patched kernel and
> then post the dmesg.

Ok, here we go:

1. Clone stable kernel from here if you haven't done so yet:

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git

2. checkout 4.20 branch

$ git checkout -b 4.20-stable origin/linux-4.20.y

3. apply the attached patch

$ patch -p1 -i acpi-dump-cstates.diff

4. copy your 4.20 config

$ cp /boot/config-4.20.... .config

5. do

$ make oldconfig

6. build

$ make -j16

7. install kernel, needs root

# make modules_install install

# reboot

Then, select this new kernel in grub, boot it and upload full dmesg.

Thx.

Created attachment 280825
acpi-dump-cstates.diff

(In reply to Borislav Petkov from comment #511)
> Ok, here we go:
> 1. Clone stable kernel from here if you haven't done so yet:
> $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
> 2. checkout 4.20 branch
> $ git checkout -b 4.20-stable origin/linux-4.20.y
> 3. apply the attached patch
> $ patch -p1 -i acpi-dump-cstates.diff
> 4. copy your 4.20 config
> $ cp /boot/config-4.20.... .config

    $ find /boot | grep -i config
    /boot/grub/i386-pc/configfile.mod

I don't think that that's the right file.

    $ sudo updatedb
    $ locate -i config-4
    $ uname -a
    Linux 4.20.4-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 23 00:12:22 UTC 2019 x86_64 GNU/Linux

Where is the config file?

(In reply to T X from comment #513)
> $ sudo updatedb
> $ locate -i config-4
> $ uname -a
> Linux 4.20.4-arch1-1-ARCH #1 SMP PREEMPT Wed Jan 23 00:12:22 UTC 2019
> x86_64 GNU/Linux
>
> Where is the config file?

Arch Linux doesn't package the kernel's config with the kernel. For the current 4.20-arch1 config, see: https://git.archlinux.org/svntogit/packages.git/plain/trunk/config?h=packages/linux — or just use /proc/config.gz.
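
A minimal sketch of the /proc/config.gz route, as a replacement for step 4 of the instructions. It assumes the running kernel was built with CONFIG_IKCONFIG_PROC, otherwise /proc/config.gz does not exist. The function name fetch_config is hypothetical.

```shell
# fetch_config [SRC] [DEST]: decompress a gzip'd kernel config into DEST.
# Defaults: SRC=/proc/config.gz, DEST=.config in the current directory.
fetch_config() {
    src=${1:-/proc/config.gz}
    dest=${2:-.config}
    if [ ! -r "$src" ]; then
        echo "no readable config at $src" >&2
        return 1
    fi
    gzip -dc "$src" > "$dest"
}
```

Run it from the top of the kernel tree as "fetch_config" and then continue with "make oldconfig" as in step 5.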

Created attachment 280867
dmesg log after rebooting with acpi-dump-cstates patch

(In reply to Borislav Petkov from comment #511)
> 7. install kernel, needs root
> # make modules_install install

    $ sudo make modules_install install
      ...
      INSTALL virt/lib/irqbypass.ko
      DEPMOD 4.20.5-dirty
    sh ./arch/x86/boot/install.sh 4.20.5-dirty arch/x86/boot/bzImage \
        System.map "/boot"
    Cannot find LILO.

That's a misleading error message as LILO isn't installed on the system, but the new kernel was copied:

    $ ll /boot
    -rw-r--r-- 1 root root 5851008 Jan 29 22:40 vmlinuz

GRUB seems to have loaded it fine.

See attached dmesg-acpi-dump-cstates.txt for details.

I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen notebook more stable than anything else I've tried. I have not experienced freezes for some days now.

(In reply to Philip Rosvall from comment #518)
> I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen
> notebook more stable than anything else I've tried. I have not experienced
> freezes for some days now.

Can you upload dmesg from the working and non-working kernels pls?

Thx.

Created attachment 280915
dmesg w/o boot parameter idle=halt

Created attachment 280917
dmesg with boot parameter idle=halt

(In reply to Borislav Petkov from comment #519)
> (In reply to Philip Rosvall from comment #518)
> > I can confirm that "idle=halt" makes both my Ryzen desktop and Ryzen
> > notebook more stable than anything else I've tried. I have not experienced
> > freezes for some days now.
>
> Can you upload dmesg from the working and non-working kernels pls?
>
> Thx.

I have only used the workaround since kernel 4.20.4, so I just booted the kernel now without and with the parameter. As can be seen in line 704 in the log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT C-state 0x0 not supported by HW (0x0)". This error does not occur when "idle=halt" is set.
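
If you want to compare your own logs the same way, a small sketch like the following checks a saved dmesg for that firmware message. The function name has_mwait_fw_bug is made up, and the pattern is taken from the message text quoted above.

```shell
# has_mwait_fw_bug LOGFILE: succeed if the saved dmesg in LOGFILE contains
# the "ACPI MWAIT C-state ... not supported by HW" message.
has_mwait_fw_bug() {
    grep -q 'ACPI MWAIT C-state 0x[0-9a-fA-F]* not supported by HW' "$1"
}
```

Typical use would be "dmesg > dmesg.txt" followed by "has_mwait_fw_bug dmesg.txt && echo affected", once without and once with idle=halt.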

(In reply to Philip Rosvall from comment #522)
> I have only used the workaround since kernel 4.20.4, so I just booted the
> kernel now without and with the parameter. As can be seen in line 704 in the
> log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT
> C-state 0x0 not supported by HW (0x0)". This error does not occur when
> "idle=halt" is set.

Yeah, this is starting to look more and more like a BIOS issue on
certain mobos.

Since your machine is issuing that ACPI MWAIT error, can you apply the
debugging patch from comment #512, build a kernel, boot into it
without "idle=halt" and upload dmesg again?

Thx.

Created attachment 280921
dmesg w/o boot parameter idle=halt, patched acpi-dump-cstates

(In reply to Borislav Petkov from comment #523)
> (In reply to Philip Rosvall from comment #522)
> > I have only used the workaround since kernel 4.20.4, so I just booted the
> > kernel now without and with the parameter. As can be seen in line 704 in
> the
> > log without "idle=halt", we see the error "[Firmware Bug]: ACPI MWAIT
> > C-state 0x0 not supported by HW (0x0)". This error does not occur when
> > "idle=halt" is set.
>
> Yeah, this is starting to look more and more like a BIOS issue on
> certain mobos.
>
> Since your machine is issuing that ACPI MWAIT error, can you apply the
> debugging patch from comment #512, build a kernel, boot into it
> without "idle=halt" and upload dmesg again?
>
> Thx.

Here you go!

Created attachment 280961
Don't do mwait on B1 and earlier

Ok, here's a test patch on top of 4.20-stable.

It should practically make idle=halt the default on revisions before B2. You can check which revision you have by doing

$ grep stepping /proc/cpuinfo

The number must be < 2.
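
The same check can be scripted. This is only a sketch: the function names are hypothetical, and it looks at the first CPU listed and applies the "stepping < 2" rule stated above, nothing more.

```shell
# cpu_stepping [FILE]: print the stepping of the first CPU listed in FILE
# (defaults to /proc/cpuinfo).
cpu_stepping() {
    awk -F': *' '/^stepping/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# is_pre_b2 [FILE]: succeed when a stepping was found and it is below 2.
is_pre_b2() {
    s=$(cpu_stepping "$1")
    [ -n "$s" ] && [ "$s" -lt 2 ]
}
```

For example, "is_pre_b2 && echo 'test patch applies to this CPU'".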

For the folks with B2 machines we need to keep debugging.

Thx.

(In reply to Borislav Petkov from comment #526)
> Created attachment 280961 [details]
> Don't do mwait on B1 and earlier

What's the downside of generally disabling mwait? I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/ 200 MHz of CPU Overclocking). The only thing I see is that the ACPI firmware bug messages disappear. That's all.

(In reply to Klaus Mueller from comment #528)
> What's the downside of generally disabling mwait?

So in your case, you can't do MWAIT to enter C1 anyway because your
revision doesn't support it. This is why you're seeing those firmware
messages.

> I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/
> 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI
> firmware bug messages disappear. That's all.

That's an "interesting" way to fix it but if it works ... :)

The intention of the fix is to make idle=halt the default for obvious reasons.

HTH.

(In reply to Borislav Petkov from comment #529)
> (In reply to Klaus Mueller from comment #528)
> > What's the downside of generally disabling mwait?
>
> So in your case, you can't do MWAIT to enter C1 anyway because your
> revision doesn't support it. This is why you're seeing those firmware
> messages.

Hmm, if rev. 1 doesn't support MWAIT, why is it at the same time a problem that has to be fixed by disabling the usage of MWAIT? Am I missing something?

>
> > I'm using a Ryzen 7 1700X and don't have any problem ("fixed" it w/
> > 200 MHz of CPU Overclocking). The only thing I see, is, that the ACPI
> > firmware bug messages disappear. That's all.
>
> That's an "interesting" way to fix it but if it works ... :)
>
> The intention of the fix is to make idle=halt the default for obvious
> reasons.

Isn't this already done automatically if MWAIT isn't supported?

(In reply to Klaus Mueller from comment #530)
> Hmm, if rev. 1 doesn't support MWAIT - why can it be a problem anyway
> at the same time which must be fixed by disabling the usage of MWAIT?
> I seem to miss something?

That's a good question but, frankly, I don't have a very exact answer to
it right now.

In order to understand what's *really* going on in the
cstate detection code, one would need to instrument
at least acpi_processor_get_power_info_cst() and
acpi_processor_ffh_cstate_probe() to figure out what exactly the
kernel parses from those CST objects and what it uses to try to enter
idle.

And do all that instrumentation on an affected system.

My current suspicion is that it tries to enter idle with
misconfigured states and under certain conditions, it misses the wakeup,
leading to the stall.

This is all conjecture anyway.

Now my patch simply falls back to the good old idle entry on AMD where
we simply do HLT and we won't even attempt to enter idle the ACPI way.

Makes sense?
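
Short of instrumenting the kernel, one rough way to see which idle states the kernel ended up registering is to read the cpuidle sysfs directory. This is only a sketch with a made-up function name, and it shows the registered states, not what was parsed from the CST objects. On a kernel that uses the plain HLT idle path the directory may be empty or absent, in which case nothing is printed.

```shell
# list_idle_states [DIR]: print "stateN: NAME" for every idle state found
# under DIR (defaults to the cpuidle directory of CPU 0).
list_idle_states() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpuidle}
    for state in "$dir"/state*; do
        [ -r "$state/name" ] || continue
        printf '%s: %s\n' "${state##*/}" "$(cat "$state/name")"
    done
}
```

Comparing the output of "list_idle_states" on a freezing and a stable boot is a cheap first data point.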
