Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
This bug affects 24 people
Affects         Status    Importance  Assigned to  Milestone
Linux           Unknown   Unknown
linux (Ubuntu)            High        Unassigned

Bug Description

Hi,

We are getting various kernel crashes on a fairly new configuration.
We're using a Ryzen 1800X CPU with an X370 Gaming Pro Carbon motherboard (7A32V1) and the latest available BIOS (1.52).

We are running Ubuntu 17.04 (amd64). We've tried different kernel versions: the native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel versions:

native 17.04 kernel
4.10.15

The issue is the same on all of them: the machine freezes at random.

Here are the kern.log entries from when it happens:

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14
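
For anyone comparing their own logs against the excerpt above, here is a minimal sketch that counts the stall reports in a kernel log and lists which CPUs were named as stalled. The function names are made up for illustration, and the log path /var/log/kern.log is only an assumption based on the Ubuntu default; adjust both to taste.

```shell
#!/usr/bin/env bash
# Sketch: summarise "rcu_sched detected stalls" reports in a kernel log.
# count_rcu_stalls and stalled_cpus are hypothetical helper names.

count_rcu_stalls() {
    local logfile="$1"
    # grep -c prints the number of matching lines; it exits 1 when there are
    # none, so "|| true" keeps the function's exit status at 0.
    grep -c 'rcu_sched detected stalls on CPUs/tasks' "$logfile" || true
}

stalled_cpus() {
    local logfile="$1"
    # Per-CPU stall lines look like "[24366.187618] 0-...: (1 GPs behind) ...".
    # Extract the leading CPU number and print each CPU once, sorted.
    grep -oE '\][[:space:]]+[0-9]+-\.\.\.:' "$logfile" | grep -oE '[0-9]+' | sort -un
}

# Only run against the system log if it exists and is readable
# (on Ubuntu this usually needs root or membership in the adm group).
if [ -r /var/log/kern.log ]; then
    echo "stall reports: $(count_rcu_stalls /var/log/kern.log)"
    echo "stalled CPUs:  $(stalled_cpus /var/log/kern.log | tr '\n' ' ')"
fi
```

If the same CPU shows up in every report, that is worth mentioning in the bug; in the excerpt above it is always CPU 0.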

Depending on the kernel version, we also get NMI watchdog errors about a stuck CPU (the CPU core id mentioned is random).
The crash happens randomly, but generally after a few hours (3-4h).

We installed kernel 4.11.0-041100-generic #201705041534 this morning and are now waiting for a crash...
For now, the machine is not "used"; at least, it's not under CPU stress...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, I'm still experiencing crashes, even when using 4.11.

Please find full kern.log in attachment.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear after 24-48h of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in bios (from 12 threads to 6 threads)
- Boot with clocksource=tsc iommu=soft
- Disable IOMMU in bios

None of these worked. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that this will not work either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just got a kernel panic...

How do I add the tag kernel-bug-exists-upstream?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to file this bug report on the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I'll wait 24-48h and post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also ran into the problem, and your advice to remove "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting up, but now it has been working without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days of uptime now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't need the card for my purposes, so I simply disabled it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
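
Before and after applying that workaround it helps to know what the setting currently is. The sketch below reads the sysctl and translates its value; the helper name describe_aslr and the drop-in file name in the comments are made up for illustration, while the meaning of the values 0, 1 and 2 follows the kernel's sysctl documentation.

```shell
#!/usr/bin/env bash
# Sketch: report the current ASLR setting.
# describe_aslr is a hypothetical helper name.

describe_aslr() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "partial (mmap base, stack and VDSO randomized)" ;;
        2) echo "full (heap randomized as well)" ;;
        *) echo "unknown" ;;
    esac
}

aslr_file=/proc/sys/kernel/randomize_va_space
if [ -r "$aslr_file" ]; then
    echo "ASLR is currently: $(describe_aslr "$(cat "$aslr_file")")"
fi

# The echo above does not survive a reboot. One way to make the setting
# persistent (run as root; the file name is only an example):
#   echo 'kernel.randomize_va_space = 0' > /etc/sysctl.d/99-disable-aslr.conf
#   sysctl --system
```

Writing 2 back to the same file restores the usual default, so the workaround is easy to revert once it is no longer needed.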

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest increasing SoC voltages, using specific kernel versions, and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.
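
To find out whether a given kernel was built with the RCU options mentioned above, one can look at its config file. This is a minimal sketch; rcu_nocb_options is a made-up helper name, and the path /boot/config-$(uname -r) is an assumption that holds on stock Ubuntu installs but not necessarily elsewhere.

```shell
#!/usr/bin/env bash
# Sketch: show the RCU callback-offload options in a kernel config file.
# rcu_nocb_options is a hypothetical helper name.

rcu_nocb_options() {
    local config="$1"
    # Enabled options appear as "CONFIG_X=y", disabled ones as
    # "# CONFIG_X is not set"; match both spellings.
    grep -E '^(# )?CONFIG_RCU_NOCB_CPU(_ALL)?[= ]' "$config" \
        || echo "no RCU_NOCB options found in $config"
}

config="/boot/config-$(uname -r)"
if [ -r "$config" ]; then
    rcu_nocb_options "$config"
fi
```

A line reading "# CONFIG_RCU_NOCB_CPU is not set" means the running kernel would have to be rebuilt for this part of the workaround.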

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (eg. 17), setting them to the next even number (eg. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR is still necessary though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide a link showing how DragonFlyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an ASRock X370 Gaming Professional.

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen. Once every few minutes it seemed to respond for a short time, which let me switch to the TTY. There I always saw some processes hanging at 100% that I could not kill, or whose kill was delayed for a long time. I always see the errors from the initial post during that time. The TTY seems to work fine once entered, though.

David Wilson (plottt) wrote :

I've run into this problem several times. After disabling C-states in the motherboard BIOS, my system has been stable for ~5 weeks of uptime so far.
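
One way to see whether a BIOS C-state change actually reached the kernel is to list the idle states it exposes through sysfs. This is a minimal sketch under the assumption that the cpuidle sysfs interface is present; list_idle_states is a made-up helper name, and the state names differ between platforms and kernel versions.

```shell
#!/usr/bin/env bash
# Sketch: list the idle states the kernel exposes for one CPU.
# list_idle_states is a hypothetical helper name.

list_idle_states() {
    local base="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    local dir
    for dir in "$base"/state*; do
        # If the glob matched nothing, "$dir" is the literal pattern and
        # has no readable name file, so it is skipped.
        [ -r "$dir/name" ] || continue
        echo "$(basename "$dir"): $(cat "$dir/name")"
    done
}

list_idle_states
```

If nothing is printed, the kernel has no cpuidle driver active for that CPU, which by itself says nothing about what the firmware is doing.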

Torge (cyslider) wrote :

OK, this trick did not help, and my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed oibaf a while back because my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf again and now everything seems fine. No more lock screen freezing for me.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced, for Global C-States, that I just disabled today, and I am waiting to see what happens.

On BIOS v3.00 the PC would freeze after 6 hours of idling. I would come back to find my keyboard, mouse, and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Does anyone know where the SoC voltage setting is in the BIOS? Is it the VDD SoC?
Does anyone know how to update my kernel to 4.10 if it tells me I have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on the latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple of days to a week or so. It seems to happen after the machine has been relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzen CPUs manufactured after week 30. All my Ryzen CPUs are pre-week-24 parts and they all have this silent crash issue. Does it also happen with post-week-30 Ryzen CPUs? (@tgui, what is yours?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

tgui (eric-c-morgan) wrote :

More details.

- latest agesa
- disabled nouveau
- disabled randomized mem
- upped SOC voltage
- disabled C-states / cool-n-quiet
- using kernel 4.13.4 with RCU changes made, though only CONFIG_RCU_NOCB_CPU was available in 4.13.4

I'm open to trying other kernels if requested.

Franck Charras (franckc) wrote :

It seems that this problem is widespread, and is not related to https://bugzilla.kernel.org/show_bug.cgi?id=196481 which is linked in this thread.

tgui (eric-c-morgan) wrote :

Franck: I very much agree that it's not related to what you linked. The link you refer to deals with segfaults during heavy-load compilation, which are addressed by an RMA'd CPU.

This thread seems to be mostly centered around system idle / low usage complete system crash. No logs, blank screen.

I have a movie playing on my server on loop to hopefully keep the machine from becoming too idle.... which I hate doing.

tgui (eric-c-morgan) wrote :

PLEASE post your concerns to this kernel.org thread with this issue!

https://bugzilla.kernel.org/show_bug.cgi?id=196683

The CONFIG_RCU_NOCB_CPU_ALL kernel option was removed and would likely fix this issue if it were put back! Supposedly there is a kernel command-line parameter you can pass to alleviate some of this, but that's a hack in my mind.

Franck Charras (franckc) wrote :

I've added "rcu_nocbs=0-15" to the kernel boot parameters this morning (along with "processor.max_cstate=1" last week), will update if a new freeze occurs.

Franck Charras (franckc) wrote :

Almost 1 month of uptime, it looks solved.

Mike R (mi9e) wrote :

@franckc did you have to rebuild your kernel? I tried the kernel options without that and still had failures. I've just rebuilt with CONFIG_RCU_NOCB_CPU to see if now the rcu_nocbs=0-15 resolves my lockups

Franck Charras (franckc) wrote :

Yes, the boot option alone has no effect; you must also compile the kernel with CONFIG_RCU_NOCB_CPU. You should be fine now (I have this confirmed on several different configs). Here's a tutorial: http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Mario Emmenlauer (emmenlau) wrote :

I have compiled the current Ubuntu 17.10 kernel 4.13.0-17-generic with CONFIG_RCU_NOCB_CPU
and boot option "rcu_nocbs=0-15" and just had a system freeze again. My System is a Ryzen 7 1700X
on an ASrock AB350 Pro4 mainboard with 32GB Ram @3200.
I do not use the kernel option "processor.max_cstate=1", is it suggested for this problem?

Franck Charras (franckc) wrote :

No, "processor.max_cstate=1" is not necessary. I'd also suggest double-checking that your motherboard BIOS is up to date. If it persists I'd try a more recent kernel (I'm using 13.6), because IIRC the RCU options were a little bit different in earlier versions of 4.13.

Mike R (mi9e) wrote :

Cool, thanks for the confirmation, Franck. I've rebuilt 4.13.0-17 with the instructions from https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel (probably similarly to Mario). I'm going to test with this first (since I don't really want to reboot again), but I might get things ready with 4.14.3 (currently the latest).

Franck Charras (franckc) wrote :

If you rely on the boot option "rcu_nocbs=0-15" you also need to reboot to enable it.
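
After the reboot it is easy to confirm that the parameters really reached the running kernel by looking at /proc/cmdline. A minimal sketch follows; has_boot_param is a made-up helper name, and the two parameters checked are simply the ones discussed in this thread for a 16-thread CPU.

```shell
#!/usr/bin/env bash
# Sketch: check whether boot parameters are present on a kernel command line.
# has_boot_param is a hypothetical helper name.

has_boot_param() {
    local param="$1" cmdline="$2"
    # Pad with spaces so only whole, space-separated words match.
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *) return 1 ;;
    esac
}

if [ -r /proc/cmdline ]; then
    cmdline="$(cat /proc/cmdline)"
    for p in rcu_nocbs=0-15 processor.max_cstate=1; do
        if has_boot_param "$p" "$cmdline"; then
            echo "$p: active"
        else
            echo "$p: not set"
        fi
    done
fi
```

Keep in mind that, according to the reports here, rcu_nocbs only has an effect on kernels built with CONFIG_RCU_NOCB_CPU, so seeing it on the command line is necessary but not sufficient.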

Mike R (mi9e) wrote :

Yep I did :) thanks

Tom Reynolds (tomreyn) wrote :

The workaround discussed at http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen fixes the recurring (roughly once or twice in 6 hours of operation, during low or idle CPU load periods) system freezes for this Ryzen 7 1800X (early weeks), AMD microcode 0x8001129, ASRock X370 Taichi (Firmware P3.20, BIOS 5.12, Release Date: 09/08/2017) system running XUbuntu 16.04 LTS (with latest updates)

All the mainline kernels up to linux-image-4.13.10-041310-generic, and official Ubuntu kernels (default 16.04 kernel, hwe) all kept freezing - although I had only tested them without "rcu_nocbs=0-15". I did test hwe-edge (4.10.0.20.13), both with and without "rcu_nocbs=0-15". Both variants gave me freezes.

tags: removed: kernel-da-key
Mike R (mi9e) wrote :

Well I've hit 48 hours uptime since booting to my custom build kernel (Linux ganymede 4.13.0-17-generic #20+ryzen SMP Sun Dec 3 20:18:51 MST 2017 x86_64 x86_64 x86_64 GNU/Linux) with rcu_nocbs=0-15 in my boot command and disabled ASLR. That's about double my previous uptime record, I'll report back again if I hit 1 week

Mario Emmenlauer (emmenlau) wrote :

It's great that the frequency of freezes is reduced. But I guess many people here expect a stable
system and can work with nothing less. Ryzen-Processors are attractive for CI-systems, and I can
not have regular freezes in our CI. I think this is an all-or-nothing-issue, and I still had a
freeze with latest Ubuntu Kernel and fixes applied (after 8hrs idle time).

Franck Charras (franckc) wrote :

It could be something else then ? After applying the fix on several machines, I have no freezes to report, more than 1 month of uptime, it looks rock stable. Did you also disable ASLR ? Also I compiled a kernel from git.kernel.org as suggested in the tutorial, not Ubuntu Kernel.

Mario Emmenlauer (emmenlau) wrote :

I think it's very, very unlikely to have a different underlying cause, even though there is no guarantee. I assume that I am affected by the same problem because:
(1) The problem manifests identically to other reports here
(2) The freeze happens particularly during idle times, not under heavy load
(3) This is the only freeze we encountered in years, on many different desktop
    and server machines with partially identical or overlapping software setup,
    and partially overlapping hardware (graphics boards, hard disks, peripherals).
(4) The frequency of freezes was drastically reduced after I performed the
    suggested steps: compiling a custom kernel with CONFIG_RCU_NOCB_CPU,
    setting rcu_nocbs=0-15, and disabling ASLR.
It would be a very, very rare coincidence if my issue were unrelated, although it's never impossible. I have also updated to the latest BIOS, but with no improvement. I will report if there are further freezes.

Tom Reynolds (tomreyn) wrote :

~emmenlau: Try setting up a serial console to or netconsole from this system to get a better idea of why / how it's freezing.
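
For the netconsole route, the module takes a single parameter that names the local interface and the receiving host. The sketch below only assembles that string; netconsole_param is a made-up helper name, and every address, interface name and MAC in the example is a placeholder to be replaced with real values. The ports 6665 and 6666 are the defaults named in the kernel's netconsole documentation.

```shell
#!/usr/bin/env bash
# Sketch: build a netconsole module parameter.
# netconsole_param is a hypothetical helper; all addresses are placeholders.

netconsole_param() {
    local src_ip="$1" dev="$2" tgt_ip="$3" tgt_mac="$4"
    # Format: netconsole=src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
    echo "netconsole=6665@${src_ip}/${dev},6666@${tgt_ip}/${tgt_mac}"
}

netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55

# On the machine that freezes (as root), with real values filled in:
#   modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20 00:11:22:33:44:55)"
# On the receiving machine, listen for UDP on port 6666, for example:
#   nc -u -l 6666        (flag syntax varies between netcat variants)
```

Because the messages leave the machine as they are printed, the last lines before a hard freeze have a chance of surviving even when nothing reaches the local disk.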

information type: Public → Public Security
information type: Public Security → Public
Mike R (mi9e) wrote :

After 5 days uptime with my custom ubuntu kernel build I got a different crash with a reboot vs the watchdog error. I'm now running a mainline kernel 4.14.3 with ASLR disabled and rcu_nocbs=0-15 and have restarted my no crash counter.

Huygens (huygens-25) wrote :

We are experiencing the same behaviour and error messages with a Ryzen5 1600 CPU and an ASUS PRIME A320M-K motherboard.

What we can add is:

1. When no VMs are running, even if there are long idle periods, the machines can be stable for several days on Ubuntu 16.04.3 using the HWE Kernel 4.10.
2. When we run a single VM on the machine, but both the VM and the host are idle, we get a crash. The machine is unresponsive, we see CPU soft lockups on the screen, no logs, etc. We tested this with Ubuntu 16.04.3 using the stable and edge HWE kernels (4.10 and 4.13), and also with Ubuntu 17.10 and kernel 4.13. The latest test was with Ubuntu 16.04.3 and a custom-built kernel based on HWE edge (4.13) with CONFIG_RCU_NOCB_CPU enabled; one of the 2 machines still crashed the following night (note that we did not have ASLR deactivated).

For our latest test we did:
- Ubuntu 16.04.3
- HWE Edge custom built kernel with CONFIG_RCU_NOCB_CPU
- ASLR deactivated
- boot parameters: rcu_nocbs=0-11 processor.max_cstate=1

We set this up last Friday; on Monday morning both machines were down, so it did not help.

PS: our motherboard had the latest "BIOS/UEFI" update, which includes AMD AGESA 1.0.0.6B. We even upgraded to a newer firmware (which ASUS has since removed) which stated that it contains AMD AGESA 1.0.0.7A. This morning we saw yet another firmware update from ASUS. We are going to apply it on top of our other changes, but if it does not work, we will try to return the machine (and that would be a pity, but stability is the number 1 feature, more than the number of cores or throughput).

Mike R (mi9e) wrote :

Well another watchdog failure with 4.14.5

I also updated to AGESA 1.0.0.6B when I rebooted to 4.14.5.

Just building a 4.14.6 to try again.

Mike R (mi9e) wrote :

Well, another lockup. I've now disabled C6 in the BIOS and added processor.max_cstate=1 to the boot parameters.

I experienced the same issue with Ryzen 5 1600 and Ryzen 7 1700, with AsRock AB350M PRO4 and Asus Prime B350M-A.
I tried Ubuntu 16.04 with the stock and HWE kernels, and Ubuntu 17.10.

On the system with Asus motherboard, I found a detailed log:
Dec 19 06:25:04 am02 kernel: [227622.995187] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/0:1:115]
Dec 19 06:25:04 am02 kernel: [227622.996201] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables snd_hda_codec_hdmi nls_iso8859_1 eeepc_wmi asus_wmi sparse_keymap edac_mce_amd edac_core snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer i2c_piix4 snd ccp soundcore mac_hid 8250_dw i2c_designware_platform i2c_designware_core shpchp kvm_amd kvm irqbypass ib_iser rdma_cm iw_cm ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1
Dec 19 06:25:04 am02 kernel: [227622.996242] raid0 multipath linear nouveau crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc mxm_wmi video i2c_algo_bit ttm aesni_intel drm_kms_helper aes_x86_64 crypto_simd syscopyarea glue_helper sysfillrect cryptd sysimgblt fb_sys_fops drm r8169 ahci mii libahci wmi fjes gpio_amdpt gpio_generic
Dec 19 06:25:04 am02 kernel: [227622.996263] CPU: 0 PID: 115 Comm: kworker/0:1 Tainted: G L 4.10.0-28-generic #32~16.04.2-Ubuntu
Dec 19 06:25:04 am02 kernel: [227622.996264] Hardware name: System manufacturer System Product Name/PRIME B350M-A, BIOS 0611 05/05/2017
Dec 19 06:25:04 am02 kernel: [227622.996271] Workqueue: events netstamp_clear
Dec 19 06:25:04 am02 kernel: [227622.996274] task: ffff9d12fa288000 task.stack: ffffba804361c000
Dec 19 06:25:04 am02 kernel: [227622.996278] RIP: 0010:smp_call_function_many+0x1f1/0x250
Dec 19 06:25:04 am02 kernel: [227622.996280] RSP: 0018:ffffba804361fce8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff10
Dec 19 06:25:04 am02 kernel: [227622.996283] RAX: 0000000000000003 RBX: 0000000000000010 RCX: 000000000000000c
Dec 19 06:25:04 am02 kernel: [227622.996285] RDX: ffff9d12fe91d0f8 RSI: 0000000000000010 RDI: ffff9d12fe021d30
Dec 19 06:25:04 am02 kernel: [227622.996286] RBP: ffffba804361fd20 R08: 0000000000000000 R09: 000000000000fffe
Dec 19 06:25:04 am02 kernel: [227622.996287] R10: ffffeceadfd69100 R11: ffff9d12fe021d30 R12: ffffffffbb634c60
Dec 19 06:25:04 am02 kernel: [227622.996288] R13: 0000000000000000 R14: ffff9d12fe61a340 R15: 000000000001a300
Dec 19 06:25:04 am02 kernel: [227622.996291] FS: 0000000000000000(0000) GS:ffff9d12fe600000(0000) knlGS:0000000000000000
Dec 19 06:25:04 am02 kernel: [227622.996292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 19 06:25:04 am02 kernel: [227622.996294] CR2: 00007f5780200a58 CR3: 00000007f67ca000 CR4: 00000000003406f0
Dec 19 06:25:04 am02 kernel: [227622.996296] Call Trace:
Dec 19 06:25:04 am02 kernel: [227622.996301] ? netif_receiv...

Mike R (mi9e) wrote :

Well I've now hit 3 weeks of uptime with C6 disabled in the BIOS and
GRUB_CMDLINE_LINUX_DEFAULT="rcu_nocbs=0-15 processor.max_cstate=1"

and 4.14.6 built from the mainline kernel repo. At this point I'm satisfied that for me at least my Ryzen is now stable.

Mike R (mi9e) wrote :

Just to be clear, my mainline kernel is configured with CONFIG_RCU_NOCB_CPU set to yes.

Mario Emmenlauer (emmenlau) wrote :

Ok, after updating to the latest BIOS the issue seems resolved for me. I'm now also at an uptime of over 3 weeks, and before I could never get past 1 week. I've updated the BIOS of my ASRock AB350 Pro4 AMD B350 to 3.20 with AGESA 1.0.0.6b.

I am currently running kernel 4.14.5 with CONFIG_RCU_NOCB_CPU and boot option "rcu_nocbs=0-15". My System is a Ryzen 7 1700X on ASrock AB350 Pro4 with 32GB Ram @3200.

Tom Reynolds (tomreyn) wrote :

So, to sum up, we have several reports here that Ubuntu's default kernels freeze on low load situations with Ryzen CPUs (or a subset of Ryzen CPUs?) up to and including the latest mainline releases. This bug differs from bugzilla.kernel.org #196481 (user space segfaults only on high CPU load).

For low load freezes we have a reliable workaround for Xenial (probably also for newer releases) across different hardware components (differences in exact Ryzen model, mainboards, memory models and configurations) and kernel versions:

* Update to latest UEFI
* Kernel >= 4.10.0 (4.14.5 mainline tested successfully) built with CONFIG_RCU_NOCB_CPU
* The result of `echo rcu_nocbs=0-$(($(nproc)-1))` appended to GRUB_CMDLINE_LINUX
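
The last item in the list above can be wrapped in a small helper so the CPU count can also be supplied by hand. This is a minimal sketch; rcu_nocbs_param is a made-up name. Note that nproc reports the CPUs available to the current process, so on a machine with restricted CPU affinity `nproc --all` may be the better input.

```shell
#!/usr/bin/env bash
# Sketch: build the rcu_nocbs boot parameter for a given number of CPUs.
# rcu_nocbs_param is a hypothetical helper name.

rcu_nocbs_param() {
    local cpus="$1"
    # CPUs are numbered from 0, so the range ends at cpus - 1.
    echo "rcu_nocbs=0-$((cpus - 1))"
}

# For the machine this runs on:
rcu_nocbs_param "$(nproc)"
```

For the CPUs mentioned in this report, that gives 0-11 for a Ryzen 5 1600, 0-15 for a Ryzen 7, and 0-23 for a Threadripper 1920X, all with SMT enabled.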

In my experience, with the above changes applied, the following BIOS changes are NOT necessary:

* Disabling SMT
* Disabling ASLR

It is unclear whether it is strictly necessary to disable C6 in BIOS, comments welcome.

I would think CONFIG_RCU_NOCB_CPU is an acceptable compromise (for this CPU type) if it provides the stability AMD has, for exactly one year now, been unable to provide to Linux users on their current platform.

Tom Reynolds (tomreyn) wrote :

Link to correct upstream bug report

Mike R (mi9e) wrote :

@tomreyn

My current config to get stability:

C6 disabled in BIOS
ASLR disabled
kernel >= 4.14.4 with CONFIG_RCU_NOCB_CPU
rcu_nocbs=0-15 processor.max_cstate=1 in GRUB_CMDLINE_LINUX

I still got lockups with ASLR disabled and rcu_nocbs on the custom-built kernel. Once I also disabled C6 in the BIOS and added processor.max_cstate=1, I got 6 weeks of uptime.

Kevin Wong (stonecracker) wrote :

CPU: Ryzen Threadripper 1920X
Ubuntu 16.04 LTS

Identical problems to the above. Agonizing, irritating, random, low-load idle-state freezes. No issues while under heavy load.

My fixes:

Disabled c-states in the BIOS
Disabled ASLR in the BIOS
Disabled AMD Cool 'n Quiet in the BIOS.

Compiled a new kernel from source, (4.13.16) but with "Offload RCU_callback processing" aka CONFIG_RCU_NOCB_CPU, chosen in the kernel config.

I added "rcu_nocbs=0-23" to the "GRUB_CMDLINE_LINUX" parameter in /etc/default/grub (and then ran "sudo update-grub"). "0-23" covers all 24 threads on the 12-core Ryzen.
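
The GRUB edit described above can also be scripted. The following is a minimal sketch, not a hardened tool: add_grub_param is a made-up helper name, it assumes GNU sed and a single double-quoted GRUB_CMDLINE_LINUX line, and it will misbehave if the existing value contains characters that are special to sed such as `|`, `&` or backslashes. The demonstration runs on a temporary copy rather than on /etc/default/grub.

```shell
#!/usr/bin/env bash
# Sketch: append a boot parameter to GRUB_CMDLINE_LINUX unless already there.
# add_grub_param is a hypothetical helper name; requires GNU sed (-i, -E).

add_grub_param() {
    local param="$1" file="$2" current
    # Read the current value between the double quotes.
    current="$(sed -n -E 's/^GRUB_CMDLINE_LINUX="(.*)"$/\1/p' "$file")"
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    # ${current:+$current } yields "current " only when current is non-empty,
    # which avoids a leading space in the new value.
    sed -i -E "s|^GRUB_CMDLINE_LINUX=\".*\"\$|GRUB_CMDLINE_LINUX=\"${current:+$current }${param}\"|" "$file"
}

# Demonstration on a scratch file instead of the real configuration.
demo="$(mktemp)"
printf '%s\n' 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"' 'GRUB_CMDLINE_LINUX=""' > "$demo"
add_grub_param "rcu_nocbs=0-23" "$demo"
cat "$demo"
rm -f "$demo"
```

On a real system one would back up /etc/default/grub first, run the helper against it as root, and then run "sudo update-grub" and reboot, as in the comment above.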

Rebooted.

Only with all of those fixes (yep, all of them; I tested every one of them separately, and that took an eon) have I finally gotten a stable system.

Zero issues now after 8 weeks. Let's hope that holds.

Haven't tried any kernel updates, though.

In addition to the above, I also consulted these references:

http://blog.programster.org/stabilizing-ubuntu-16-04-on-ryzen

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

http://blog.programster.org/how-to-disable-aslr
