Ryzen 1800X freeze - rcu_sched detected stalls on CPUs/tasks

Bug #1690085 reported by Vincent on 2017-05-11
This bug affects 36 people
Affects          Status    Importance   Assigned to   Milestone
Linux            Expired   Medium
linux (Ubuntu)             High         Unassigned

Bug Description

Hi,

We are getting various kernel crashes on a fairly new configuration.
We're using Ryzen 1800X CPU with X370 Gaming Pro Carbon MB (7A32V1) using latest BIOS available (1.52)

We are running Ubuntu 17.04 (amd64). We've tried different kernel versions: the native one and releases from http://kernel.ubuntu.com/~kernel-ppa/mainline/ too.
Tested kernel versions:

native 17.04 kernel
4.10.15

The issue is the same on each: the machine freezes at random.

Here is the kern.log entry from when it happens:

May 10 22:41:56 dev2 kernel: [24366.186246] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:41:56 dev2 kernel: [24366.187618] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=913449
May 10 22:41:56 dev2 kernel: [24366.188977] (detected by 12, t=1860207 jiffies, g=10001, c=10000, q=4656)
May 10 22:41:56 dev2 kernel: [24366.190344] Task dump for CPU 0:
May 10 22:41:56 dev2 kernel: [24366.190345] swapper/0 R running task 0 0 0 0x00000008
May 10 22:41:56 dev2 kernel: [24366.190348] Call Trace:
May 10 22:41:56 dev2 kernel: [24366.190354] ? native_safe_halt+0x6/0x10
May 10 22:41:56 dev2 kernel: [24366.190355] ? default_idle+0x20/0xd0
May 10 22:41:56 dev2 kernel: [24366.190358] ? arch_cpu_idle+0xf/0x20
May 10 22:41:56 dev2 kernel: [24366.190360] ? default_idle_call+0x23/0x30
May 10 22:41:56 dev2 kernel: [24366.190362] ? do_idle+0x16f/0x200
May 10 22:41:56 dev2 kernel: [24366.190364] ? cpu_startup_entry+0x71/0x80
May 10 22:41:56 dev2 kernel: [24366.190366] ? rest_init+0x77/0x80
May 10 22:41:56 dev2 kernel: [24366.190368] ? start_kernel+0x464/0x485
May 10 22:41:56 dev2 kernel: [24366.190369] ? early_idt_handler_array+0x120/0x120
May 10 22:41:56 dev2 kernel: [24366.190371] ? x86_64_start_reservations+0x24/0x26
May 10 22:41:56 dev2 kernel: [24366.190372] ? x86_64_start_kernel+0x14d/0x170
May 10 22:41:56 dev2 kernel: [24366.190373] ? start_cpu+0x14/0x14
May 10 22:44:56 dev2 kernel: [24546.188093] INFO: rcu_sched detected stalls on CPUs/tasks:
May 10 22:44:56 dev2 kernel: [24546.189461] 0-...: (1 GPs behind) idle=49b/1/0 softirq=28561/28563 fqs=935027
May 10 22:44:56 dev2 kernel: [24546.190823] (detected by 14, t=1905212 jiffies, g=10001, c=10000, q=4740)
May 10 22:44:56 dev2 kernel: [24546.192191] Task dump for CPU 0:
May 10 22:44:56 dev2 kernel: [24546.192192] swapper/0 R running task 0 0 0 0x00000008
May 10 22:44:56 dev2 kernel: [24546.192195] Call Trace:
May 10 22:44:56 dev2 kernel: [24546.192199] ? native_safe_halt+0x6/0x10
May 10 22:44:56 dev2 kernel: [24546.192201] ? default_idle+0x20/0xd0
May 10 22:44:56 dev2 kernel: [24546.192203] ? arch_cpu_idle+0xf/0x20
May 10 22:44:56 dev2 kernel: [24546.192204] ? default_idle_call+0x23/0x30
May 10 22:44:56 dev2 kernel: [24546.192206] ? do_idle+0x16f/0x200
May 10 22:44:56 dev2 kernel: [24546.192208] ? cpu_startup_entry+0x71/0x80
May 10 22:44:56 dev2 kernel: [24546.192210] ? rest_init+0x77/0x80
May 10 22:44:56 dev2 kernel: [24546.192211] ? start_kernel+0x464/0x485
May 10 22:44:56 dev2 kernel: [24546.192213] ? early_idt_handler_array+0x120/0x120
May 10 22:44:56 dev2 kernel: [24546.192214] ? x86_64_start_reservations+0x24/0x26
May 10 22:44:56 dev2 kernel: [24546.192215] ? x86_64_start_kernel+0x14d/0x170
May 10 22:44:56 dev2 kernel: [24546.192217] ? start_cpu+0x14/0x14
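
For anyone comparing many such reports, the stall lines above can be summarized with a short script. This is a minimal Python sketch, not part of the original report: the function name `summarize_rcu_stalls` and the regular expressions are assumptions fitted to the two excerpts above, and other kernel versions may print these fields differently.

```python
import re

# Patterns fitted to the log excerpt in this report (an assumption, not a stable format).
STALLED = re.compile(r"\]\s+(\d+)-\S+:\s+\((\d+) GPs behind\)")
DETECTED = re.compile(r"\(detected by (\d+), t=(\d+) jiffies")


def summarize_rcu_stalls(lines):
    """Return one dict per stall report: stalled CPUs, detecting CPU, stall age in jiffies."""
    reports = []
    current = None
    for line in lines:
        if "rcu_sched detected stalls" in line:
            # A new report starts here; later lines fill it in.
            current = {"stalled_cpus": [], "detected_by": None, "jiffies": None}
            reports.append(current)
        elif current is not None:
            match = STALLED.search(line)
            if match:
                current["stalled_cpus"].append(int(match.group(1)))
                continue
            match = DETECTED.search(line)
            if match:
                current["detected_by"] = int(match.group(1))
                current["jiffies"] = int(match.group(2))
    return reports
```

In the two reports above, the stall age grows by 45005 jiffies (1860207 to 1905212) over the 180 seconds between them, which is consistent with a tick rate of roughly 250 per second. That rate is inferred from these numbers; it is not stated in the log.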

Depending on the kernel version, we also get NMI watchdog errors about a stuck CPU (the CPU core ID they mention varies).
The crash happens at random, but generally after a few hours (3-4h).

We installed kernel 4.11.0-041100-generic #201705041534 this morning and are now waiting for a crash...
For now, the machine is not really in use; at least, it is not under CPU load...

Thanks
---
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
DistroRelease: Ubuntu 17.04
InstallationDate: Installed on 2017-05-09 (1 days ago)
InstallationMedia: Ubuntu-Server 17.04 "Zesty Zapus" - Release amd64 (20170412)
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=fr_FR.UTF-8
 SHELL=/bin/bash
Tags: zesty
Uname: Linux 4.11.0-041100-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True

Vincent (hvincent13) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1690085

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected zesty
description: updated

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Lipido (lipido) wrote :

Hi Vincent,

Did you experience more crashes with kernel 4.11?

Thank you!

Vincent (hvincent13) wrote :

Hi Lipido,

Yes, I'm still experiencing crashes, even with 4.11.

Please find the full kern.log attached.

Regards,

Lipido (lipido) wrote :

Ok, me too :-(

Crashes appear after 24-48h of operation.

My hardware is:
- ASUS PRIME B350-Plus
- AMD Ryzen 5 1600

OS: Ubuntu 16.04

Tried:
- Update BIOS to the latest (v 609)
- Kernel 4.10
- Disable SMT in BIOS (from 12 threads to 6 threads)
- Boot with clocksource=tsc iommu=soft
- Disable IOMMU in BIOS

None of these worked. I was waiting for 4.11 to be available for Ubuntu as an official package, but it seems that will not help either.

Another link talking about what seems to be the same issue:

https://forum.level1techs.com/t/ryzen-vs-ubuntu/115715/22

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc1/

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
tags: added: kernel-da-key
Vincent (hvincent13) wrote :

Hi Joseph,

I've just installed 4.12-rc1.
Now waiting for a crash... or not (hope so !)

Regards,

Vincent (hvincent13) wrote :

Just got a kernel panic...

How do I add the tag kernel-bug-exists-upstream?

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report[0]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

Please follow the instructions on the wiki page[0]. The first step is to email the appropriate mailing list. If no response is received, then a bug may be opened on bugzilla.kernel.org.

Once this bug is reported upstream, please add the tag: 'kernel-bug-reported-upstream'.

[0] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Vincent (hvincent13) wrote :

Hi,

Could you please tell me which address I have to send the mail to?
I don't really understand how to file this bug report on the mailing list.

Thanks,

Vincent (hvincent13) wrote :

Hi,

It looks like the system is stable after removing the "nouveau" driver.
I will wait 24-48h and post again.

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

I also hit this problem, and your suggestion of removing "nouveau" worked as a workaround to stabilize the system. Before, it froze every 3-5 hours after booting, but it has now been running without problems for 2 days.

My environment is:
- cpu: Ryzen 1700
- motherboard: ASUS B350M-A
- graphics card: NVIDIA GK208
- kernel: 4.11.0-041100rc8-generic
- dist: 17.04

Vincent (hvincent13) wrote :

Hi,

5 days of uptime now and no crash.

camparijet => did you install the NVIDIA driver instead?

Regards,

camparijet (iichikolamp) wrote :

Hi Vincent,

> camparijet => did you installed NVIDIA driver instead ?

No. I don't need the card for my purposes, so I simply disabled it.

Alex Jones (blenheimears) wrote :

I'm seeing this crash even with the Nvidia official driver.

Alex Jones (blenheimears) wrote :

This is a hardware bug in the CPU. This ticket should be closed as invalid.

Changed in linux (Ubuntu):
status: Triaged → Invalid
Alex Jones (blenheimears) wrote :

Reopening because even though this is a known issue with the CPU we could still implement a workaround. One workaround is to disable address space layout randomization:

echo 0 >/proc/sys/kernel/randomize_va_space

However, that would be disabling a security feature.
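
Before and after applying this workaround it helps to confirm what the kernel is actually using. The following is a minimal Python sketch, not something posted in the original thread: the helper names `describe_aslr` and `read_aslr` are made up for illustration, and the descriptions paraphrase the kernel's sysctl documentation for `randomize_va_space`.

```python
# Meaning of each randomize_va_space value, paraphrased from the kernel sysctl docs.
ASLR_MODES = {
    0: "disabled",
    1: "stack, mmap base and vDSO randomized",
    2: "as 1, plus heap (brk) randomized",
}


def describe_aslr(raw):
    """Turn the raw file content (e.g. '2\\n') into (value, description)."""
    value = int(raw.strip())
    return value, ASLR_MODES.get(value, "unknown value")


def read_aslr(path="/proc/sys/kernel/randomize_va_space"):
    """Read the current setting from procfs; Linux only."""
    with open(path) as handle:
        return describe_aslr(handle.read())
```

Note that a value written to /proc/sys does not persist across reboots, so the check is worth repeating after each boot.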

Changed in linux (Ubuntu):
status: Invalid → New

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: artful
Marc Rene Schädler (suaefar) wrote :

Was this issue officially confirmed to be a hardware bug in Ryzen processors by AMD?
If so, could you provide a link to the statement?

Disabling address space layout randomization (ASLR) seems to alleviate the problem, but does it solve it?

I am investigating unstable behavior under load which could be related.
There, disabling ASLR is not sufficient!
People suggest to increase SOC voltages, use specific versions of the kernel and the like.
See https://community.amd.com/thread/215773?start=135&tstart=0 for more info.

Alex Jones (blenheimears) wrote :

AMD has not publicly commented on this issue that I'm aware of. This issue has been seen on many different operating systems. DragonFlyBSD includes a workaround for this issue. The workaround on Linux is to compile the kernel with CONFIG_RCU_NOCB_CPU, CONFIG_RCU_NOCB_CPU_ALL, and disable ASLR using "echo 0 >/proc/sys/kernel/randomize_va_space". This can be put into rc.local. It's also possible to hardcode ASLR as disabled into the kernel, but this requires modifying the kernel source, not just the config file. There is a new AGESA update released about a week ago, 1.0.0.6a, although I have not tested whether the new AGESA alone (without any kernel changes) solves the issue.

Alex Jones (blenheimears) wrote :

I just tested 1.0.0.6a AGESA, and it does not solve this issue. In some cases just one program will crash, and in other cases the entire system will crash. I will test the workaround above later.

Alex Jones (blenheimears) wrote :

If any of your RAM timings are odd (eg. 17), setting them to the next even number (eg. 18) helps a lot. Recompiling the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL, and disabling ASLR is still necessary though. It may also be a good idea to give the SOC slightly more voltage, but not more than 1.2 V.

Kai-Heng Feng (kaihengfeng) wrote :

Alex,

Can you provide a link showing how DragonFlyBSD fixed this issue?

Alex Jones (blenheimears) wrote :

This is the DragonFlyBSD commit. https://gitweb.dragonflybsd.org/dragonfly.git/commitdiff/b48dd28447fc8ef62fbc963accd301557fd9ac20

It appears that there are two different ways that the system can crash, which is why it is necessary to both disable ASLR and to compile the kernel with CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL. The ASLR-related crash usually results in a single or a few programs crashing (although if an important program crashes it can bring down the entire system) and happens under heavy load. The other crash only happens if the kernel was compiled without CONFIG_RCU_NOCB_CPU and CONFIG_RCU_NOCB_CPU_ALL (Ubuntu's kernel is compiled without these options) and happens when the system is idle or nearly idle, and results in a complete system crash.
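
To see whether a given kernel was built with these options, its config file can be inspected. The following is a minimal Python sketch under stated assumptions: the helper name `kernel_config_value` is hypothetical, and the config text is expected in the usual kernel format (`CONFIG_X=y` or `# CONFIG_X is not set`), which on Ubuntu is typically found at /boot/config-<kernel release>.

```python
def kernel_config_value(config_text, option):
    """Return the value of a kernel config option ('y', 'm', ...), or None if unset or absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return None
        # Require the '=' so CONFIG_RCU_NOCB_CPU does not match CONFIG_RCU_NOCB_CPU_ALL.
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return None
```

A caller would read the config file once and then query both `CONFIG_RCU_NOCB_CPU` and `CONFIG_RCU_NOCB_CPU_ALL` against the same text.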

Alex Jones (blenheimears) wrote :

If the motherboard allows it, disabling the OpCache will completely prevent (or at least greatly reduce the probability of) the ASLR-related crash, even if ASLR is enabled in the kernel. As far as I'm aware it has no effect on the other type of crash.

Alex Jones (blenheimears) wrote :

I've also determined that changing the CPU, memory, or SOC voltages or timings has little or no effect on either type of crash.

Kai-Heng Feng (kaihengfeng) wrote :

This is beyond my expertise - let's see what upstream can do.

Torge (cyslider) wrote :

I have also been experiencing this problem since I updated yesterday.

I am using Kubuntu 16.04 with KDE backports enabled.
I also have a Ryzen 1800X and an Asrock X370 Gaming professional

I experience this either when I boot and don't log in promptly or when I enter the lockscreen.

After reading this page I tried to run
  seq 1 | xargs -P0 -n1 md5sum /dev/zero &

as root to always keep one CPU core busy and indeed this seems to prevent the lockscreen problem...

I would also like to note that the system was not 100% frozen: once every few minutes it seemed to respond for a short time, enabling me to switch to the TTY. Then I always saw some processes hanging at 100% that I could not kill, or the kill was delayed for a long time. I always see the errors from the initial post during that time. The TTY seems to work fine once entered, though.

David Wilson (plottt) wrote :

I've run into this problem several times. After disabling C-states in the motherboard BIOS, my system has been running stably for ~5 weeks of uptime so far.

Torge (cyslider) wrote :

Ok, this trick did not help and my PC did not make it through the night without freezing again.

However, I finally found the real problem. I installed oibaf a while back as my Kubuntu was flickering all over the place. Now it seems to be the source of my problem. I uninstalled oibaf again and now everything seems fine. No lock screen freezing for me anymore.

Stuart Page (sdpagent) wrote :

Just wanting to report that I am also experiencing this issue with a freshly installed Ubuntu Server 16.04.3 with 4.10 kernel on the 22nd August 2017. All the updates applied and running as a KVM host with hardware:

Ryzen 1700 (non-x)
Motherboard: Asus prime B350-Plus
Bios: - Version 0805
 - Disabled SMT
 - Disabled c-states

xb5i7o (xb5i7o) wrote :

Hi guys,

I'm getting the same issues on a brand new build.

Ryzen 1800x
Asrock X370 Taichi
BIOS: 3.10 (latest)
I tried disabling Cool'n'Quiet and C-states.

However, there is another option under Advanced, for Global C-States, which I disabled only today; I am waiting to see what happens.

On BIOS v3.00 the PC would freeze after 6 hours of idling. I would come back to find the keyboard, mouse and everything else frozen.
With BIOS v3.10 I now get random reboots at least once a day.

I'm running Ubuntu 16.04.3 with kernel 4.4.0-93.

Does anyone know where the SoC voltage setting is in the BIOS? Is it VDD SoC?
Does anyone know how to update my kernel to 4.10 when the system tells me I already have the latest?

Stuart Page (sdpagent) wrote :

I finally managed to figure out how to compile a kernel with RCU_NOCB and disabled ASLR as Alex Jones mentioned, and it appears to have worked for me and another guy who helped me put the tutorial together:

http://blog.programster.org/ubuntu-16-04-compile-custom-kernel-for-ryzen

Kai-Heng Feng (kaihengfeng) wrote :

Stuart,

Does this issue also happen on latest mainline kernel?

Franck Charras (franckc) wrote :

Hi,
I'm getting the same issues on several identical builds, with an ASUS prime X370-PRO motherboard.
It's very hard to analyze since it happens randomly every other week, and it leaves no logs. It seems that the freeze happens at idle after very high memory load (observation after logging CPU and RAM loads).
Compiling the kernel as suggested in this thread didn't work (new freeze this morning).

tgui (eric-c-morgan) wrote :

I too still have random computer shutdowns without logs. Uptime varies from a couple of days to a week or so. It seems to happen after the machine has been relatively idle for a long period. I do have high memory usage because of the VMs I run.

I've disabled C-states, cool and quiet, tested memory, and so forth. I also compiled a new kernel as mentioned by Stuart. I do not have segfaults with compilations.

I am willing to run tests and provide info. Please let me know if anyone wants something.

Ubuntu 16.04
Asrock x370 ITX
32GB Ram
Ryzen 1700

Franck Charras (franckc) wrote :

Ryzen CPUs manufactured before week 24 of this year were known to have issues (especially the segfault issue). It was officially fixed for all Ryzen CPUs manufactured after week 30. All my Ryzen CPUs are pre-week-24 parts and they all have this silent crash issue. Does it also happen with post-week-30 Ryzen CPUs? (@tgui, which is yours?)

AS (as2008) wrote :

I can confirm the issue with a week 33 Ryzen 1700 on Asus Prime B350 Plus.
I had segfaults with my previous 1700 (week 22 iirc). No more segfaults with the week 33 CPU but still random crashes on (long time) idle.

tags: removed: kernel-da-key
information type: Public → Public Security
information type: Public Security → Public
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
[684 of 764 comments hidden]

Anyone here seeing this one: Bug 1738650 - Kernel 5.2.5 graphics unstable

https://bugzilla.redhat.com/show_bug.cgi?id=1738650

Everything works great as long as I stay on kernel-5.1.20-300.fc30.x86_64.

(In reply to raulvior.bcn from comment #637)

I got a reboot with the ondemand governor at 1d 14h of uptime.
I have reactivated the Typical Power Supply idle switch. Currently 4 days and 2 hours of uptime. The longest I've ever seen.

The minimum voltage reported by the UEFI is 0.83 V. Using the performance governor the minimum voltage reported is 1.26 V. This corresponds with P2 and P1 P-states voltages.

> (In reply to raulvior.bcn from comment #635)
> ASUSTeK COMPUTER INC. CROSSHAIR VI HERO 7403 08/20/2019
> AMD Ryzen 7 1800X Eight-Core Processor
> 16410MB
> 2560x1440 pixels
> Radeon RX 580 Series (POLARIS10, DRM 3.27.0, 5.0.0-25-generic, LLVM 8.0.0)
> Linux 5.0.0-25-generic (x86_64) #26-Ubuntu SMP Thu Aug 1 12:04:58 UTC 2019
> GNU C library / (Ubuntu GLIBC 2.29-0ubuntu2) 2.29
> Ubuntu 19.04
> BOOT_IMAGE=/vmlinuz-5.0.0-25-generic
> root=UUID=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ro quiet splash vt.handoff=1

After one year, I am finally using my 2700X without any issues, on the latest Ubuntu 18.04 with Linux 5.0.0-27-generic. No patches, no kernel options. I put the system to sleep every day and it always comes back when I press a key the next day.

I updated the BIOS of the AB350M-DS3H to F31, the most recent update before 3rd Gen AMD support was added.

I also swapped in a PSU that I had around, a TT-450NL2NK. The previous one had a lower wattage. I left the Typical Current Idle option set, but I don't know whether it has any effect.

I also have a GeForce GTX 1050 Ti now.

I changed quite a few things, so I cannot really say what improved my setup. But I want to report that the system is finally stable.

Regards.

(In reply to Nelson Castillo from comment #399)
> Hello there. I read most of the thread a week ago.
>
> My machine was freezing when idle and when not idle. I read most of this
> thread and thanks to it got a stable setup.
>
> @Daniel: When you say that you think the problem is fixed, do you know what
> change/Linux version fixes the issue for most people?
>
> -------------
>
> Now my report.
>
> I have the following setup:
>
> Motherboard: AB350M-DS3H (F23d BIOS).
> CPU: Ryzen 2700X.
> No overclock.
>
> I still wonder whether my CPU is doing too much extra work (I don't know
> what cpufreq would report if a C6 state was reached).
>
> cpufreq stats:
>
> CPU 0: 3.70 GHz:6.64%, 3.20 GHz:1.15%, 2.20 GHz:92.21%
> CPU15: 3.70 GHz:4.86%, 3.20 GHz:0.80%, 2.20 GHz:94.34%
>
> Anyway, to make the things work I had to use three tweaks. I tried
> individual tweaks to no avail.
>
> - Select in BIOS: "Typical Current Idle"
> - Start Linux with: idle=nomwait
> - Disable C6 states (both core and package) with Zenstates.py
>
> Before doing the last step Zenstates.py reports Core enabled and Package
> disabled.
>
> I'm using Ubuntu 18.04.01 LTS. I didn't compile Linux with
> CONFIG_RCU_NOCB_CPU / CONFIG_RCU_NOCB_CPU_ALL.
>
> So, things are working for me. But if you think I should test a new Linux
> version that is supposed to fix the issue please let me know.
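
Because several of the workarounds in this thread are kernel boot parameters (`idle=nomwait` here, `clocksource=tsc iommu=soft` earlier), it is worth verifying that they actually reached the running kernel. The following is a minimal Python sketch, not part of the quoted report: `parse_cmdline` is a hypothetical helper, and it assumes the command line is read from /proc/cmdline and uses shell-like quoting.

```python
import shlex


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to True."""
    params = {}
    for token in shlex.split(cmdline):
        # Split on the first '=' only, so root=UUID=... keeps its full value.
        key, sep, value = token.partition("=")
        params[key] = value if sep else True
    return params


def has_idle_nomwait(cmdline):
    """True when the command line contains idle=nomwait."""
    return parse_cmdline(cmdline).get("idle") == "nomwait"
```

On a running system the input would be the content of /proc/cmdline; a missing `idle` entry means the parameter was not applied, for example because the bootloader configuration was edited but not regenerated.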

I've been lightly following this thread, but it is still a seemingly unresolved issue that impedes me from using Linux. Has there been any real progress on this? Or is the resolution still a vague 'we more or less know where the issue is but we're still waiting for someone upstream to take action to fix it'?

I have a 2700X on an MSI X470 Gaming Plus. Updating the BIOS to support my new 3800X also fixed this issue for me. Both CPUs work well on Debian Buster without any extra kernel parameters and without disabling C6.

Hi, I bought an Acer Aspire 3 A315-41G (Ryzen 5 2500U) in January and have had CPU lockups ever since. I managed to install Arch, but I got a CPU lockup on every second boot; on every other distribution I got a CPU lockup in the installer. I never used the C6 disable script, but I tried all possible kernel parameters and none of them worked.

Three days ago I found a new BIOS update on the Acer website, "Update PI code v1.1.0.8" from 2019/08/12. After installing it everything works: no lockups, every distribution I try works without problems, and there is no need for any GRUB or kernel parameters.

Sorry for my bad English,

Regards


Sep 10 21:13:56 BelliashPC kernel: BUG: kernel NULL pointer dereference, address: 000000000000000d
Sep 10 21:13:56 BelliashPC kernel: #PF: supervisor write access in kernel mode
Sep 10 21:13:56 BelliashPC kernel: #PF: error_code(0x0002) - not-present page
Sep 10 21:13:56 BelliashPC kernel: PGD 0 P4D 0
Sep 10 21:13:56 BelliashPC kernel: Oops: 0002 [#1] SMP NOPTI
Sep 10 21:13:56 BelliashPC kernel: CPU: 11 PID: 30804 Comm: kworker/11:2 Tainted: P O T 5.2.13-gentoo #1
Sep 10 21:13:56 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep 10 21:13:56 BelliashPC kernel: Workqueue: 0x0 (events)
Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39 c5 74
Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS: 00010002
Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX: ffff8d78aecdfa80 RCX: ffff8d789fa1b400
Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI: ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08: ffff8d78aece0180 R09: 0000000000001800
Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11: ffffffffffffffff R12: ffff8d78a4112b68
Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14: ffff8d78a4112b40 R15: ffff8d78a4112b40
Sep 10 21:13:56 BelliashPC kernel: FS: 0000000000000000(0000) GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
Sep 10 21:13:56 BelliashPC kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3: 000000069063a000 CR4: 0000000000340ee0
Sep 10 21:13:56 BelliashPC kernel: Call Trace:
Sep 10 21:13:56 BelliashPC kernel: kthread+0xf8/0x130
Sep 10 21:13:56 BelliashPC kernel: ? set_worker_desc+0xb0/0xb0
Sep 10 21:13:56 BelliashPC kernel: ? kthread_park+0x80/0x80
Sep 10 21:13:56 BelliashPC kernel: ret_from_fork+0x22/0x40
Sep 10 21:13:56 BelliashPC kernel: Modules linked in: nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O)
Sep 10 21:13:56 BelliashPC kernel: CR2: 000000000000000d
Sep 10 21:13:56 BelliashPC kernel: ---[ end trace 23be40ec0c6dea2a ]---
Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39 c5 74
Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS: 00010002
Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX: ffff8d78aecdfa80 RCX: ffff8d789fa1b400
Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI: ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08: ffff8d78aece0180 R09: 0000000000001800
Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11: ffffffffffffffff R12...


And another example of calltrace:

Sep 5 21:25:32 BelliashPC kernel: NMI backtrace for cpu 1
Sep 5 21:25:32 BelliashPC kernel: CPU: 1 PID: 13564 Comm: DOM File Tainted: P D O T 5.2.11-gentoo #1
Sep 5 21:25:32 BelliashPC kernel: Hardware name: Gigabyte Technology Co., Ltd. B450 AORUS ELITE/B450 AORUS ELITE, BIOS F42a 07/31/2019
Sep 5 21:25:32 BelliashPC kernel: Call Trace:
Sep 5 21:25:32 BelliashPC kernel: <IRQ>
Sep 5 21:25:32 BelliashPC kernel: dump_stack+0x46/0x60
Sep 5 21:25:32 BelliashPC kernel: nmi_cpu_backtrace.cold+0x14/0x53
Sep 5 21:25:32 BelliashPC kernel: ? lapic_can_unplug_cpu.cold+0x42/0x42
Sep 5 21:25:32 BelliashPC kernel: nmi_trigger_cpumask_backtrace+0x89/0x8b
Sep 5 21:25:32 BelliashPC kernel: rcu_dump_cpu_stacks+0x7b/0xa9
Sep 5 21:25:32 BelliashPC kernel: rcu_sched_clock_irq.cold+0x1a2/0x38d
Sep 5 21:25:32 BelliashPC kernel: ? tick_sched_do_timer+0x50/0x50
Sep 5 21:25:32 BelliashPC kernel: update_process_times+0x24/0x60
Sep 5 21:25:32 BelliashPC kernel: tick_sched_timer+0x33/0x70
Sep 5 21:25:32 BelliashPC kernel: __hrtimer_run_queues+0xe7/0x180
Sep 5 21:25:32 BelliashPC kernel: hrtimer_interrupt+0x100/0x220
Sep 5 21:25:32 BelliashPC kernel: smp_apic_timer_interrupt+0x56/0x90
Sep 5 21:25:32 BelliashPC kernel: apic_timer_interrupt+0xf/0x20
Sep 5 21:25:32 BelliashPC kernel: </IRQ>
Sep 5 21:25:32 BelliashPC kernel: RIP: 0010:queued_spin_lock_slowpath+0x3f/0x1a0
Sep 5 21:25:32 BelliashPC kernel: Code: 0f ba 2f 08 0f 92 c0 0f b6 c0 c1 e0 08 89 c2 8b 07 30 e4 09 d0 a9 00 01 ff ff 75 18 85 c0 74 0e 8b 07 84 c0 74 08 f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 c3 f6 c4 01 75 04 c6 47 01 00 48 c7 c0
Sep 5 21:25:32 BelliashPC kernel: RSP: 0018:ffffa299085afc48 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
Sep 5 21:25:32 BelliashPC kernel: RAX: 00000000003c0101 RBX: ffffa299085afd10 RCX: 000000008a2ee8ff
Sep 5 21:25:32 BelliashPC kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffa299034fdb44
Sep 5 21:25:32 BelliashPC kernel: RBP: ffffa299085afcc0 R08: ffffa299085afcc0 R09: 0000000000000000
Sep 5 21:25:32 BelliashPC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 00007fa4d675aa38
Sep 5 21:25:32 BelliashPC kernel: R13: 0000000000000000 R14: 0000000000000000 R15: ffffa299034fdb40
Sep 5 21:25:32 BelliashPC kernel: futex_wait_setup+0x77/0x110
Sep 5 21:25:32 BelliashPC kernel: futex_wait+0xcf/0x230
Sep 5 21:25:32 BelliashPC kernel: ? ep_poll_callback+0x256/0x280
Sep 5 21:25:32 BelliashPC kernel: do_futex+0x15e/0xc60
Sep 5 21:25:32 BelliashPC kernel: ? pipe_write+0x382/0x410
Sep 5 21:25:32 BelliashPC kernel: ? new_sync_write+0x12c/0x1d0
Sep 5 21:25:32 BelliashPC kernel: __x64_sys_futex+0x137/0x170
Sep 5 21:25:32 BelliashPC kernel: do_syscall_64+0x4a/0x110
Sep 5 21:25:32 BelliashPC kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9
Sep 5 21:25:32 BelliashPC kernel: RIP: 0033:0x7fa4e76c7067
Sep 5 21:25:32 BelliashPC kernel: Code: 5c 24 68 48 89 44 24 50 e8 96 38 00 00 e8 c1 3d 00 00 89 de 41 89 c1 40 80 f6 80 45 31 d2 31 d2 4c 89 ff b8 ca 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 71 44 89 cf e8 f9 3d 00 00 31 f6 4c 89 f7 e8
Sep 5 21:25:32 BelliashPC kernel: RSP: 002b...


> Sep 10 21:13:56 BelliashPC kernel: Workqueue: 0x0 (events)
> Sep 10 21:13:56 BelliashPC kernel: RIP: 0010:worker_thread+0xfc/0x3b0
> Sep 10 21:13:56 BelliashPC kernel: Code: 0f 84 8e 02 00 00 48 8b 4c 24 18 49
> 39 4f 38 0f 85 a9 02 00 00 83 e0 fb 41 89 47 60 ff 4a 34 49 8b 17 49 8b 47
> 08 48 89 42 08 <48> 89 10 4d 89 3f 4d 89 7f 08 48 8b 45 20 4c 8d 6d 20 49 39
> c5 74
> Sep 10 21:13:56 BelliashPC kernel: RSP: 0018:ffffaec608acfec0 EFLAGS:
> 00010002
> Sep 10 21:13:56 BelliashPC kernel: RAX: 000000000000000d RBX:
> ffff8d78aecdfa80 RCX: ffff8d789fa1b400
> Sep 10 21:13:56 BelliashPC kernel: RDX: ffff8d78a4112780 RSI:
> ffff8d78a91e56c0 RDI: ffff8d78aecdfa80
> Sep 10 21:13:56 BelliashPC kernel: RBP: ffff8d78aecdfa80 R08:
> ffff8d78aece0180 R09: 0000000000001800
> Sep 10 21:13:56 BelliashPC kernel: R10: 0000000000000000 R11:
> ffffffffffffffff R12: ffff8d78a4112b68
> Sep 10 21:13:56 BelliashPC kernel: R13: ffff8d78aecdfaa0 R14:
> ffff8d78a4112b40 R15: ffff8d78a4112b40
> Sep 10 21:13:56 BelliashPC kernel: FS: 0000000000000000(0000)
> GS:ffff8d78aecc0000(0000) knlGS:0000000000000000
> Sep 10 21:13:56 BelliashPC kernel: CS: 0010 DS: 0000 ES: 0000 CR0:
> 0000000080050033
> Sep 10 21:13:56 BelliashPC kernel: CR2: 00000000000000b0 CR3:
> 000000069063a000 CR4: 0000000000340ee0
> Sep 10 21:13:56 BelliashPC kernel: Call Trace:
> Sep 10 21:13:56 BelliashPC kernel: kthread+0xf8/0x130
> Sep 10 21:13:56 BelliashPC kernel: ? set_worker_desc+0xb0/0xb0
> Sep 10 21:13:56 BelliashPC kernel: ? kthread_park+0x80/0x80
> Sep 10 21:13:56 BelliashPC kernel: ret_from_fork+0x22/0x40
> Sep 10 21:13:56 BelliashPC kernel: Modules linked in: nvidia_drm(PO)
> nvidia_modeset(PO) nvidia(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O)
> vboxdrv(O)

You have proprietary crap loaded.

Try reproducing it with the latest upstream kernel and *without* that
nvidia* and vbox* gunk.

If you can, please open a separate bug and add me to CC.

Thx.
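
The objection above comes from the `Tainted:` field of the oops header, where each letter records a reason the kernel is considered tainted. The following is a minimal Python sketch for reading that field: `taint_flags` is a hypothetical helper, the table covers only a subset of the letters described in the kernel's tainted-kernels documentation, and the regular expression assumes the space-separated form seen in the logs in this thread.

```python
import re

# Subset of taint letters; descriptions paraphrased from the kernel documentation.
TAINT_FLAGS = {
    "P": "proprietary module loaded",
    "O": "out-of-tree module loaded",
    "D": "kernel has died recently (oops or BUG)",
    "W": "kernel issued a warning",
    "L": "soft lockup occurred",
    "E": "unsigned module loaded",
    "T": "kernel built with struct randomization",
}


def taint_flags(log_line):
    """Return [(letter, meaning), ...] from a 'Tainted: P O T' field, or [] if absent."""
    match = re.search(r"Tainted: ((?:[A-Z] )+)", log_line)
    if not match:
        return []
    return [
        (flag, TAINT_FLAGS.get(flag, "not in this sketch's table"))
        for flag in match.group(1).split()
    ]
```

For the quoted report this yields P, O and T, which matches the nvidia and vbox modules in the "Modules linked in" line; a header reading "Not tainted" yields an empty list.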

This happened also w/o vbox (I installed it few days ago) and nvidia proprietary drivers. I used to use nouveau with the same results. I will open a new ticket.

I have freezes on a Ryzen 7 2700U in the Acer Swift SF315-41; kernel tested: 5.2.


I was able to update my BIOS to version 18, but my system still locks up.
I tried the following with the new BIOS:
 - use factory defaults
 - disable SMT
 - disable SMT with Typical Current Idle
 - all of the above with SVM disabled/enabled
Right now I set the power supply idle control to "Low ..." and will report back.

On a positive note, I got an error message:

Sep 14 13:33:04 user-MS-7B85 kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000160
Sep 14 13:33:04 user-MS-7B85 kernel: #PF error: [normal kernel read fault]
Sep 14 13:33:04 user-MS-7B85 kernel: PGD 0 P4D 0
Sep 14 13:33:04 user-MS-7B85 kernel: Oops: 0000 [#1] SMP NOPTI
Sep 14 13:33:04 user-MS-7B85 kernel: CPU: 6 PID: 1620 Comm: QXcbEventReader Not tainted 5.0.0-27-generic #28~18.04.1-Ubuntu
Sep 14 13:33:04 user-MS-7B85 kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B85/B450 GAMING PRO CARBON AC (MS-7B85), BIOS 1.80 07/22/2019
Sep 14 13:33:04 user-MS-7B85 kernel: RIP: 0010:pick_next_task_fair+0x225/0x6f0
Sep 14 13:33:04 user-MS-7B85 kernel: Code: 40 78 48 3d 80 39 e2 b9 75 6c 4c 8b 65 b0 eb 2a 4c 89 e7 45 31 ed e8 aa c6 ff ff 84 c0 75 41 4c 89 e7 4c 89 ee e8 4b 3b ff ff <4c> 8b a0 60 01 00 00 4d 85 e4 0f 84 e9 00 00 00 4d 8b 6c 24 40 4d
Sep 14 13:33:04 user-MS-7B85 kernel: RSP: 0018:ffffaa51c34ff938 EFLAGS: 00010046
Sep 14 13:33:04 user-MS-7B85 kernel: RAX: 0000000000000000 RBX: ffff8e210e7a2d80 RCX: 000000000000000a
Sep 14 13:33:04 user-MS-7B85 kernel: RDX: ffffaa51c34ff9c0 RSI: 0000000000000000 RDI: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: RBP: ffffaa51c34ff998 R08: 00000000000f48d3 R09: ffff8e2109ca8a00
Sep 14 13:33:04 user-MS-7B85 kernel: R10: ffffaa51c34ff898 R11: 0000000000000001 R12: ffff8e210e7a2e00
Sep 14 13:33:04 user-MS-7B85 kernel: R13: 0000000000000000 R14: ffff8e21089820c0 R15: ffffaa51c34ff9c0
Sep 14 13:33:04 user-MS-7B85 kernel: FS: 00007fa560ab6700(0000) GS:ffff8e210e780000(0000) knlGS:0000000000000000
Sep 14 13:33:04 user-MS-7B85 kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 14 13:33:04 user-MS-7B85 kernel: CR2: 0000000000000160 CR3: 00000004081dc000 CR4: 0000000000340ee0
Sep 14 13:33:04 user-MS-7B85 kernel: Call Trace:
Sep 14 13:33:04 user-MS-7B85 kernel: __schedule+0x173/0x850
Sep 14 13:33:04 user-MS-7B85 kernel: ? ep_poll_callback+0x93/0x2a0
Sep 14 13:33:04 user-MS-7B85 kernel: schedule+0x2c/0x70
Sep 14 13:33:04 user-MS-7B85 kernel: schedule_hrtimeout_range_clock+0x181/0x190
Sep 14 13:33:04 user-MS-7B85 kernel: ? __wake_up_common+0x73/0x130
Sep 14 13:33:04 user-MS-7B85 kernel: ? add_wait_queue+0x44/0x50
Sep 14 13:33:04 user-MS-7B85 kernel: ? __pollwait+0xaf/0xe0
Sep 14 13:33:04 user-MS-7B85 kernel: schedule_hrtimeout_range+0x13/0x20
Sep 14 13:33:04 user-MS-7B85 kernel: poll_schedule_timeout.constprop.11+0x46/0x70
Sep 14 13:33:04 user-MS-7B85 kernel: do_sys_poll+0x3d6/0x590
Sep 14 13:33:04 user-MS-7B85 kernel: ? kmem_cache_free+0x1a7/0x1d0
Sep 14 13:33:04 user-MS-7B85 kernel: ? kmem_cache_free+0x1a7/0x1d0
Sep 14 13:33:04 user-MS-7B85 kernel: ? update_load_avg+0x8b/0x5a0
Sep 14 13:33:04 user-MS-7B85 kernel: ? update_load_avg+0x8b/0x5a0
Sep 14 13:33:04 user-MS-7B85 kernel: ? __...


(In reply to txrx from comment #651)
> I was able to update my BIOS to version 18, but my system still locks up.
> I tried the following with the new BIOS:
> - use factory defaults
> - disable SMT
> - disable SMT with Typical Current Idle
> - all of the above with SVM disabled/enabled
> Right now I set the power supply idle control to "Low ..." and will report
> back.
>

You got the same error as me. There is also bug 204811 for that.

(In reply to txrx from comment #651)

Typical Current Idle might not be working. Read the sensor output. If voltage is not higher than without enabling it, try to increase the core voltage.

My Ryzen 7 1800X seems to not produce hangs since I upgraded to 1003ABB with an ASUS Crosshair VI Hero and enabled Typical current idle.

> I was able to update my BIOS to version 18, but my system still locks up.
> I tried the following with the new BIOS:
> - use factory defaults
> - disable SMT
> - disable SMT with Typical Current Idle
> - all of the above with SVM disabled/enabled
> Right now I set the power supply idle control to "Low ..." and will report
> back.
>

https://cdn.arstechnica.net/wp-content/uploads/2019/10/rdrand-test.zip - download, extract, run ./test-rdrand, and check whether it produces something other than 0xffffffff


Could this be related?

[620533.9804061 RBP: 0000000000000000 R08: 0000000000000001 R09: 00007f9014000080
[620533.981792] R10: 00007f9014de0270 R11: 0000000000000206 R12: 00007f90477fdlfe
[620533.983039] R13: 00007f90477fdlff R14: 00007f901ffff700 R15: 00007f901fffe540
[620541.767819] watchdog: BUG: soft lockup - CPU#4 stuck for 220 Enc4:Eantoor:5552.1
[620541.769193] Modules linked in: tun ueth nf_conntrack_netlink nfnetlink xfrm_user xfrn_algo xt_addrtype iptable filter xt c, ntrack br_petfilter bridge stp 11c ouerlay xt_nat xt_tcpudp iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defraliipu6 -n id.nfjlefrag_ipu4 nouueau snd_bda_codec_realtek snd_hda_codec_generic ledtrig_audio mxn_wmi i2c_algo_bit ttm edac_mCe_amd snd a_intel drmitms_helper snd_hda_codec eeepc_wmi kum_and snd_hda_core sparse_keymap drm rfkill snd_hudep joydeu inputied s snd, cm r8169 wml_trmof kum irgbypass realtek snd_timer libphy agpgart ccp snd syscopyarea sysfillrect sysimgblt sp5100_tco sou ndcore fb_sys_fops rng_core i2cpiix4 crctlOdif_pclmul klOtemp crc32_pclmul ghash_clmulni_inte1 aesni_intel aes_x86_64 crypto_si md crypt& glue helper pcspkr wmi pinctrl_and pcc_cpufreq eudeu gpio_andpt mac_hid acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd gr ace sunrpc ip_tables x tables ext4 crc16 mbcache jbd2 raid456 libcrc32c crc32c_generic async_raid6_recou
1620541.7692211 async memcpy async_pq async_xor xor async_tx raid6_pq raidl md_mod hid_generic sd_mod usbhid hid ahci libahci 1 ibata crc32c_intel xhcipci xhci_bcd scsi_mod
1620541.7638311 CPU: 4 PID: 5552 Comm: nc4:Eantoor Tainted: G U L 5.2.11-archl-l-ARCH #1
[620541.765314] Hardware name: System manufacturer System Product Name/PRIME B450-PLUS, BIBS 1804 07/29/2019
[620541.766760] RIP: 0010:smp_call_function_many+Ox20b/Ox270
[620541.766216] Code: e6 8a ae 78 00 3b 05 f8 Od 21 01 89 c7 Of 83 7a fe ff ff 48 63 c7 48 8b Ob 48 03 Oc c5 60 ba d6 al 8b 41 1 6 a8 01 74 Oa f3 90 <ft> 51 16 83 e2 01 75 f6 eb c9 48 c7 c2 c0 14 f4 al 4c 89 fe 89 df
[620541.791144] RSP: 0016:ffffaefd63e2bcd0 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffl3
[620541.7925011 RAX: 0000000000000003 RBX: ffff8fe6be72b440 RCX: ffff8fe6be7flle0
1620541.793697] RDX: 0000000000000001 BSI: 0000000000000000 RDI: 0000000000000007
[620541.795357] RBP: ffffffffa0c7c7e0 R08: ffff8fe6be72b448 R09: ffff8fe6be72b470
1620541.796751] R10: ffff8fe6be72b448 R11: 0000000000000005 R12: ffff8febbe729d40
1620541:798198] R13: ffff8fe6be72b448 R14: 0000000000000001 R15: 0000000000000140
[620541.799600] FS: 00007f9044d5c700(0000) G3:ffff8fe6be700000(0000) knlGS:0000000000000000
[620541.800986] CS: 0010 DS: 0000 ES: 0000 CH0: 0000000080050033
[620541.802414] CR2: 00007f9062e8 f3b0 CR3: 00000003ae65a000 CR4: 00000000003406e0
[620541.803847] Call Trace:
[629541.80-r 8] flush tlb mm_range+Oxe1/0x140
1620541.8065503 tlb_flush mmu+Ox72/0x140
[628541.807929] tlb_finish_mmu+Ox3d/Ox70
t62e541me92?el zap_page_range+Oxl4b/Oxia0
[620541.810589] se sys maduise+Ox613/0x890
[620541.811932] ? do syscall_64+0x5f/Ox1d0
[628541.813277] do syscall_64+0x5f/Ox1d0
[629541.8146481 ? page_fault+Ox8/0x30
[628541.815992] entry SYSCALL64afterbaframe+Ox44/0xa9
[620541i817395] RIP: 0833:0x7f9062dd7597 russdriAluff...


(In reply to Jaap Crezee from comment #655)
> Could this be related?
>
> [620533.9804061 RBP: 0000000000000000 R08: 0000000000000001 R09:
> 00007f9014000080 [620533.981792] R10: 00007f9014de0270 R11: 0000000000000206
> R12: 00007f90477fdlfe [620533.983039] R13: 00007f90477fdlff R14:
> 00007f901ffff700 R15: 00007f901fffe540 [620541.767819] watchdog: BUG: soft
> lockup - CPU#4 stuck for 220 Enc4:Eantoor:5552.1 [620541.769193] Modules
> linked in: tun ueth nf_conntrack_netlink nfnetlink xfrm_user xfrn_algo

See also bug report 205017. It is difficult to capture the traces when it first happens. May or may not be the same bug.

Since I set up remote console logging and upgraded my BIOS (again), I have not seen the hang anymore.
Although I like this, I still have no more clues about the initial trace/problem.

FYI

I am running on a

        Manufacturer: ASUSTeK COMPUTER INC.
        Product Name: PRIME B450-PLUS
        Version: 1820
        Release Date: 09/12/2019
model name : AMD Ryzen 5 1600 Six-Core Processor


Ever since upgrading to a Ryzen (1700X), I have experienced frequent system freezes, which may be related to the problems discussed here.

The freeze mostly happens during a certain heavily threaded task with disk I/O.

Symptoms:

* Screen completely freezes, including mouse pointer,
* Existing SSH connections die, no new connection can be established,
* System can no longer switch to text console,
* LEDs indicate **unceasing disk activity**,
* System still responds to pings,
* Alt-SysRq keys remain active, but cannot output to screen even if already in text console.

I've succeeded in capturing kernel logging after a freeze using netconsole:

This timeout message appears:

    [35042.581242] INFO: task jbd2/dm-2-8:610 blocked for more than 120 seconds.
    [35042.581259] Not tainted 4.15.0-62-generic #69-Ubuntu
    [35042.581262] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [35042.581273] jbd2/dm-2-8 D 0 610 2 0x80000000
    [35042.581278] Call Trace:
    [35042.581290] __schedule+0x24e/0x880
    [35042.581295] ? bit_wait+0x60/0x60
    [35042.581300] schedule+0x2c/0x80
    [35042.581304] io_schedule+0x16/0x40
    [35042.581308] bit_wait_io+0x11/0x60
    [35042.581313] __wait_on_bit+0x4c/0x90
    [35042.581317] out_of_line_wait_on_bit+0x90/0xb0
    [35042.581323] ? bit_waitqueue+0x40/0x40
    [35042.581328] __wait_on_buffer+0x32/0x40
    [35042.581333] jbd2_journal_commit_transaction+0xdac/0x1730
    [35042.581337] ? __switch_to_asm+0x41/0x70
    [35042.581343] kjournald2+0xc8/0x270
    [35042.581347] ? kjournald2+0xc8/0x270
    [35042.581351] ? wait_woken+0x80/0x80
    [35042.581355] kthread+0x121/0x140
    [35042.581359] ? commit_timeout+0x20/0x20
    [35042.581363] ? kthread_create_worker_on_cpu+0x70/0x70
    [35042.581366] ret_from_fork+0x22/0x40

Also, I have thousands of lines of output for blocked tasks. Most traces look more or less like this:

    [34274.346748] sysrq: SysRq : Show Blocked State
    [34274.346766] ...
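
For anyone who wants to capture logs the same way: netconsole sends kernel messages as plain UDP datagrams, so the receiving side can be very small. Below is a minimal sketch of a receiver in Python. It is not what the commenter above used; the function names `open_listener` and `drain` are made up for illustration, and 6666 is netconsole's default target port.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket for netconsole traffic (6666 is the default target port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def drain(sock, sink, max_packets=None):
    """Copy received datagrams into sink (a text stream), flushing after each one.

    Flushing matters here: the sender may freeze at any moment, and the last
    lines before the hang are the interesting ones.
    """
    count = 0
    while max_packets is None or count < max_packets:
        data, _addr = sock.recvfrom(65535)
        sink.write(data.decode("utf-8", errors="replace"))
        sink.flush()
        count += 1
    return count
```

On a second machine, something like `drain(open_listener(), open("netconsole.log", "a"))` would run until interrupted. The affected machine still needs the netconsole module pointed at the receiver's address; see the kernel's netconsole documentation for the parameter syntax.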


I applied the latest asrock BIOS with new options "amd cbs global c-state control" to disable voltage lowering when idle.

Even with all BIOS settings, custom kernels and params, disabling C states, power supplies and so forth I'm done. 2 years of this BS.

I picked up a 65 watt Intel i9 9900.

Good luck to you all and thanks for all the ideas and help. I hope to revisit AMD Ryzen based linux servers in a few years.

(In reply to eric.c.morgan from comment #659)
> I applied the latest asrock BIOS with new options "amd cbs global c-state
> control" to disable voltage lowering when idle.
>
> Even with all BIOS settings, custom kernels and params, disabling C states,
> power supplies and so forth I'm done. 2 years of this BS.
>
> I picked up a 65 watt Intel i9 9900.
>
> Good luck to you all and thanks for all the ideas and help. I hope to
> revisit AMD Ryzen based linux servers in a few years.

I think you had faulty hardware. Just setting the typical idle current fixes the trouble.
That happens.

On the other hand, the Windows kernel has no problem with default parameters, so I think something in Linux is not right.

(In reply to Michaël Colignon from comment #660)
> (In reply to eric.c.morgan from comment #659)
> > I applied the latest asrock BIOS with new options "amd cbs global c-state
> > control" to disable voltage lowering when idle.
> >
> > Even with all BIOS settings, custom kernels and params, disabling C states,
> > power supplies and so forth I'm done. 2 years of this BS.
> >
> > I picked up a 65 watt Intel i9 9900.
> >
> > Good luck to you all and thanks for all the ideas and help. I hope to
> > revisit AMD Ryzen based linux servers in a few years.
>
> I think you had a faulty hardware. With just the typical idle current it fix
> the trouble.
> That happens.
>
> On the other side, Windows kernel has no problem with default parameters, so
> i think something in Linux is not good.

Agree, see https://bugzilla.kernel.org/show_bug.cgi?id=205017

(In reply to Michaël Colignon from comment #660)
> (In reply to eric.c.morgan from comment #659)
> > I applied the latest asrock BIOS with new options "amd cbs global c-state
> > control" to disable voltage lowering when idle.
> >
> > Even with all BIOS settings, custom kernels and params, disabling C states,
> > power supplies and so forth I'm done. 2 years of this BS.
> >
> > I picked up a 65 watt Intel i9 9900.
> >
> > Good luck to you all and thanks for all the ideas and help. I hope to
> > revisit AMD Ryzen based linux servers in a few years.
>
> I think you had a faulty hardware. With just the typical idle current it fix
> the trouble.
> That happens.
>
> On the other side, Windows kernel has no problem with default parameters, so
> i think something in Linux is not good.

Likely not bad hardware. My 1700 was RMAd for the segfault issue (another pain point), and has passed all testing. Memory has been tested as well, many times over, with different profiles and settings.

Tiny i9-9900 itx ready for transplant. I feel dirty. https://i.imgur.com/fTvWrWd.png

> Likely not bad hardware. My 1700 was RMAd for the segfault issue (another
> pain point), and has passed all testing. Memory has been tested as well,
> many times over, with different profiles and settings.
>

I also had my 1600 RMAd, but still I experience hangs now and then. I have the impression that the hangs are worse lately. But I have no hard statistics.

Folks,

this bugzilla entry, with the amount of different issues reported and the amount of FUD all collected in one place, is prohibitively hard for a debugger to handle. So, I'd suggest if you still would like your issue looked at, to open a separate bug.

And please refrain from commenting on a bug with your own issue - it is much easier to open a separate one first and then when it turns out that it is a known issue, to merge the two bugzilla entries than to keep them apart in a single report.

Thx.



I as the thread starter fully agree with Borislav and set this bug report to status resolved. Thank you all very much for the interesting discussion!

Having started with massive stability problems on my new Ryzen build, I reported this issue to address a potential problem in the Linux kernel. In the hundreds of comments, which surely could be very interesting for everybody affected, I found a workaround: disabling C6 states with the Python script attached to a comment above. Since then the system has been perfectly stable. Tests with the "kill Ryzen" script also showed me that the CPU was not affected by the other huge problem early adopters had.

A few days ago I finally found the time to update the AB350 BIOS to the latest version and set the "typical idle current" in the options. I had only 2 issues since then: The bootloader entry for my OS was broken after the update and network manager did not bring up the LAN interface anymore. But after resolving these issues: No more stability problems. Of course I deactivated my systemd service for c6 states. I am still on the stable branch of Manjaro and use the latest 5.5 series kernel.

See you!

Changed in linux:
status: Confirmed → Expired



Hi folks!

For those who are looking for a solution or already found a solution, there is a new update of AGESA rolling out. The new version 1.0.0.4 claims:

* Improved system stability when switching through ACPI power states.

It arrived for my Asus PRIME B350M a few days ago. I upgraded and kept the default optimised BIOS settings. When I used ZenState to check the CPU status, the C6 state is now:

C6 State - Package - Enabled
C6 State - Core - Disabled

Previously they were all enabled, so I guess the new AGESA release finally solved this problem at the source? I will report back in a few days on whether the system still hangs.
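
For reference, a rough sketch of how the two status lines above can be derived from raw MSR values. The MSR addresses and bit positions are written from my memory of the ZenStates-Linux script and should be treated as assumptions, not as documented AMD definitions; the helper names are hypothetical. The sketch only reads, it never writes an MSR.

```python
import os
import struct

# Assumed values, recalled from the ZenStates-Linux script (not verified
# against AMD documentation).
MSR_PKG_C6 = 0xC0010292
PKG_C6_BIT = 1 << 32
MSR_CORE_C6 = 0xC0010296
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def read_msr(msr, cpu=0):
    """Read one 64-bit MSR through the msr driver (needs root and `modprobe msr`)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)


def c6_status(pkg_value, core_value):
    """Decode package/core C6 enablement from the two raw MSR values."""
    return {
        "package": bool(pkg_value & PKG_C6_BIT),
        # Core C6 counts as enabled only when all three assumed bits are set.
        "core": (core_value & CORE_C6_BITS) == CORE_C6_BITS,
    }


def format_status(status):
    """Render the status in the same shape as the output quoted above."""
    return [
        "C6 State - Package - " + ("Enabled" if status["package"] else "Disabled"),
        "C6 State - Core - " + ("Enabled" if status["core"] else "Disabled"),
    ]
```

On a live system, `format_status(c6_status(read_msr(MSR_PKG_C6), read_msr(MSR_CORE_C6)))` would produce the two lines, assuming the addresses above are right for your CPU.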

(In reply to Charles Lim from comment #673)

> For those who are looking for a solution or already found a solution, there
> is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
>
> * Improved system stability when switching through ACPI power states.
>
> It has arrived on my Asus PRIME B350M days ago, I have upgraded and kept
> default optimised BIOS settings. When I used ZenState to check cpu status,
> the C6 state is now:
>
> C6 State - Package - Enabled
> C6 State - Core - Disabled
>
> Previously they were all enabled, I guessed the new AGESA release finally
> solved this problem from source? I will report back days later to see if the
> system still hangs.

Are you sure about the AGESA version string? For the Dell OptiPlex 5055, firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*, and then grepping for `AGESA!` in the strings/hexdump, it says it includes AGESA version 1.0.0.7a.

    $ strings 1\ --\ 1\ OptiPlex\ System\ BIOS\ v1.1.20.bin | grep -A1 AGESA!
    %pAGESA!V9
    SummitPI-AM4 1.0.0.7a

[1]: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=w6mw5&oscode=wt64a&productcode=optiplex-5055-ryzen-cpu
[2]: https://github.com/platomav/BIOSUtilities
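
If `strings` and `grep` are not at hand, roughly the same lookup can be sketched in Python. This assumes, as in the output above, that the version string is the next printable run after the one containing the `AGESA!` marker; the function name and the 128-byte search window are arbitrary choices for illustration.

```python
import re

MARKER = b"AGESA!"
# Runs of at least 4 printable ASCII characters, similar to strings(1) defaults.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")
REST_OF_RUN = re.compile(rb"[\x20-\x7e]*")


def find_agesa_versions(image, window=128):
    """Return the printable string following each AGESA! marker in image (bytes)."""
    versions = []
    start = 0
    while True:
        pos = image.find(MARKER, start)
        if pos == -1:
            break
        after_marker = pos + len(MARKER)
        # Skip whatever is left of the string that contains the marker itself.
        rest = REST_OF_RUN.match(image, after_marker)
        # Then take the next printable run within the search window.
        nxt = PRINTABLE_RUN.search(image, rest.end(), rest.end() + window)
        if nxt:
            versions.append(nxt.group().decode("ascii"))
        start = after_marker
    return versions
```

Usage would be along the lines of `find_agesa_versions(open("extracted.bin", "rb").read())`, where `extracted.bin` is a placeholder for the extracted firmware image.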

(In reply to Paul Menzel from comment #674)
> (In reply to Charles Lim from comment #673)
>
> > For those who are looking for a solution or already found a solution, there
> > is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> >
> > * Improved system stability when switching through ACPI power states.
> >
> > It has arrived on my Asus PRIME B350M days ago, I have upgraded and kept
> > default optimised BIOS settings. When I used ZenState to check cpu status,
> > the C6 state is now:
> >
> > C6 State - Package - Enabled
> > C6 State - Core - Disabled
> >
> > Previously they were all enabled, I guessed the new AGESA release finally
> > solved this problem from source? I will report back days later to see if the
> > system still hangs.
>
> Are you sure about the AGESA version string? For the Dell OptiPlex 5055,
> firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*,
> and then grepping for `AGESA!` in the strings/hexdump, it says it includes
> AGESA version 1.0.0.7a.
>
> $ strings 1\ --\ 1\ OptiPlex\ System\ BIOS\ v1.1.20.bin | grep -A1 AGESA!
> %pAGESA!V9
> SummitPI-AM4 1.0.0.7a
>
> [1]: https://www.dell.com/support/home/de-de/drivers/driversdetails?driverid=w6mw5&oscode=wt64a&productcode=optiplex-5055-ryzen-cpu
> [2]: https://github.com/platomav/BIOSUtilities

1.0.0.7a is for Zen+
1.0.0.4 is for Zen2.

AMD resets versioning every new CPU generation. 1.0.0.4 is newer than 1.0.0.7

(In reply to Rafal Kupiec from comment #675)
> (In reply to Paul Menzel from comment #674)
> > (In reply to Charles Lim from comment #673)
> >
> > > For those who are looking for a solution or already found a solution, there
> > > is a new update of AGESA rolling out. The new version 1.0.0.4 claims:
> > >
> > > * Improved system stability when switching through ACPI power states.

[…]

> > Are you sure about the AGESA version string? For the Dell OptiPlex 5055,
> > firmware version 1.1.20 [1], running it through *Dell PFS BIOS Extractor*,
> > and then grepping for `AGESA!` in the strings/hexdump, it says it includes
> > AGESA version 1.0.0.7a.

[…]

> 1.0.0.7a is for Zen+
> 1.0.0.4 is for Zen2.
>
> AMD resets versioning every new CPU generation. 1.0.0.4 is newer than 1.0.0.7

Thank you for the clarification. Though I am confused now, as I thought you could use Zen2 devices in “Zen+ boards” (boards originally for Zen+). So, AGESA 1.0.0.4 for Zen2 also supports the predecessor generation?

There is no such thing as a Zen+ or Zen2 board. There are AM4 motherboards based on different chipsets, and all of them support Zen, Zen+ and Zen2. If you want to install a Zen2 CPU in such a board, you must first make sure it has been flashed with a BIOS that supports the new CPU. The piece of software that includes the CPU microcode is called AGESA. It is responsible for initializing at least the CPU and memory, and motherboard manufacturers redistribute it as part of their BIOS upgrades.

So if you want to install a Zen2 CPU on an AM4 motherboard, you need to make sure it has been flashed with a BIOS containing AGESA 1.0.0.1, which brings support for Zen2 CPUs. That AGESA 1.0.0.1 is newer than 1.0.0.7. It still supports Zen and Zen+, but since it is dedicated to Zen2, the versioning restarts from 1.0.0.0, which confuses people.

Hi folks again!

I have been testing the new AGESA 1.0.0.4 for the past few days. My uptime is currently past 3 days, which was impossible before: the system always hung within 2 days. Hence I consider that the new firmware has indeed solved this soft lockup issue!

I'm testing on Ubuntu Focal with all ACPI-related settings left at their defaults. On my Asus PRIME B350M, the default optimised settings are in use.

In addition, the CPU C6 state also seems normal: when idle, the CPU enters its idle state with very low power consumption (measured externally). k10temp reports a Vcore of approximately 850 mV.
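
For comparing voltage readings before and after a BIOS change, here is a small sketch that collects voltage channels from the kernel's hwmon sysfs tree, where `inN_input` files hold millivolts and optional `inN_label` files hold a name. Whether a given driver such as k10temp exposes voltage channels at all depends on the kernel version and CPU, so treat the "Vcore" label as an assumption; the function name is made up for illustration.

```python
from pathlib import Path


def read_voltages(hwmon_root="/sys/class/hwmon"):
    """Return {(chip, label): volts} for every readable inN_input file."""
    readings = {}
    for chip_dir in sorted(Path(hwmon_root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for value_file in sorted(chip_dir.glob("in*_input")):
            channel = value_file.name[: -len("_input")]  # e.g. "in0"
            label_file = chip_dir / f"{channel}_label"
            label = label_file.read_text().strip() if label_file.exists() else channel
            try:
                millivolts = int(value_file.read_text().strip())
            except (OSError, ValueError):
                continue  # skip channels that cannot be read or parsed
            readings[(chip, label)] = millivolts / 1000.0
    return readings
```

Calling `read_voltages()` once at idle with the default BIOS settings and once with the changed settings gives two dictionaries that can be compared directly.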

>Thank you for the clarification. Though I am confused now, as I thought you
>could use Zen2 devices in “Zen+ boards” (boards original for Zen+). So, AGESA
>1.0.0.4 for Zen2 also support the predecessor generation?

It seems that, to address this versioning mess, AMD claims that this 1.0.0.4 release reunited all architectures into one codebase.[1] I guess they also "reunited" the version number. The AGESA on my motherboard was also 1.0.0.7 before.

---REFERENCES---
[1]: https://www.reddit.com/r/Amd/comments/dtgutp/an_update_on_the_am4_platform_agesa_1004/


Displaying first 40 and last 40 comments. View all 764 comments or add a comment.