AMD Ryzen 7 1800X system (Dell Inspiron 5675) randomly freezes

Bug #1753346 reported by Pragnesh Sampat
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Incomplete
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

This is happening every day or so. I cannot see anything in kern.log or syslog that gives me a clue about where or what the issue might be. I was using Ubuntu 17.10 earlier and upgraded to Ubuntu 18.04 on 3/3/2018, since I thought kernel 4.15 might have better support for the Ryzen 7, and the amdgpu driver for the Radeon 580 might also be newer. I have removed most external devices except for networking, HDMI, and audio out.

I have attached a tarball of info, although since I used the 'ubuntu-bug linux' command, the information might be duplicated. I apologize for any extra attachments.

$ ls -l dell-5675-sysinfo/
total 168
-rw-r--r-- 1 pss pss 83139 Mar 4 19:47 dmesg
-rw-r--r-- 1 pss pss 26154 Mar 4 19:44 dmidecode
-rw-r--r-- 1 pss pss 1507 Mar 4 19:45 gpu-manager.log
-rw-r--r-- 1 pss pss 110 Mar 4 19:44 lsb-release-a
-rw-r--r-- 1 pss pss 1464 Mar 4 19:44 lscpu
-rw-r--r-- 1 pss pss 4353 Mar 4 19:46 lspci
-rw-r--r-- 1 pss pss 4105 Mar 4 20:05 lspci-vvv-radeon-580.txt
-rw-r--r-- 1 pss pss 19318 Mar 4 19:48 mprime-p95v294b8.linux64-results-2018-03-04.txt
-rw-r--r-- 1 pss pss 104 Mar 4 19:44 uname-a
-rw-r--r-- 1 pss pss 35 Mar 4 20:27 version.log

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-10-generic 4.15.0-10.11
ProcVersionSignature: Ubuntu 4.15.0-10.11-generic 4.15.3
Uname: Linux 4.15.0-10-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: pss 1801 F.... pulseaudio
 /dev/snd/controlC0: pss 1801 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Sun Mar 4 20:21:55 2018
InstallationDate: Installed on 2018-02-16 (16 days ago)
InstallationMedia: Ubuntu 17.10 "Artful Aardvark" - Release amd64 (20180105.1)
MachineType: Dell Inc. Inspiron 5675
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.utf8
 SHELL=/bin/bash
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-10-generic.efi.signed root=UUID=d6a2ae1e-5421-435f-8e4c-e377133cebc9 ro quiet splash vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-10-generic N/A
 linux-backports-modules-4.15.0-10-generic N/A
 linux-firmware 1.172
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-03-04 (1 days ago)
dmi.bios.date: 12/04/2017
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.3.6
dmi.board.name: 07PR60
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 3
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: 1.3.6
dmi.modalias: dmi:bvnDellInc.:bvr1.3.6:bd12/04/2017:svnDellInc.:pnInspiron5675:pvr1.3.6:rvnDellInc.:rn07PR60:rvrA00:cvnDellInc.:ct3:cvr1.3.6:
dmi.product.family: Inspiron
dmi.product.name: Inspiron 5675
dmi.product.version: 1.3.6
dmi.sys.vendor: Dell Inc.

Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote : Re: AMD 7 1800X system (Dell Inspiron 5675) randomly freezes

Most of the time, I have had to power-cycle to recover. Sometimes pings work, but not ssh. When this happens, I could use an 'ALT-SysReq USB' sequence to try to reboot, which appeared to work, but I eventually had to power-cycle.

summary:
- AMD 7 1800X system (Dell Inspiron 5675) randomly freezes - have to powercycle
+ AMD 7 1800X system (Dell Inspiron 5675) randomly freezes
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote : Re: AMD 7 1800X system (Dell Inspiron 5675) randomly freezes

Does this issue go away if you boot back into the prior kernel?

Changed in linux (Ubuntu):
importance: Undecided → High
status: Confirmed → Incomplete
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16-rc4

Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :

Since I bought the system, around 2/15 or so, it has only run Ubuntu 17.10 and 18.04. These are the kernels I could find in kern.log. It has frozen in all 3 of them.

$ cat /tmp/kversions.txt
kern.log.3:Feb 15 22:59:33 roke kernel: [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0-32-generic.efi.signed root=UUID=d6a2ae1e-5421-435f-8e4c-e377133cebc9 ro quiet splash vt.handoff=7
kern.log.2:Feb 19 18:27:12 roke kernel: [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.13.0-36-generic.efi.signed root=UUID=d6a2ae1e-5421-435f-8e4c-e377133cebc9 ro quiet splash vt.handoff=7
kern.log.1:Mar 3 19:10:37 roke kernel: [ 0.000000] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.15.0-10-generic.efi.signed root=UUID=d6a2ae1e-5421-435f-8e4c-e377133cebc9 ro quiet splash vt.handoff=1
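
For anyone wanting to pull the same list out of their own logs, here is a minimal Python sketch. It assumes the lines look like the ones above; the function name `booted_kernels` is made up for illustration and is not part of any tool mentioned in this report.

```python
import re

# Matches the kernel image named in a "Kernel command line:" log entry.
BOOT_IMAGE_RE = re.compile(r"Kernel command line: BOOT_IMAGE=\S*/vmlinuz-(\S+)")

# Suffix seen on the signed EFI images in the log lines above.
SIGNED_SUFFIX = ".efi.signed"


def booted_kernels(lines):
    """Return the distinct kernel versions found, in first-seen order."""
    versions = []
    for line in lines:
        match = BOOT_IMAGE_RE.search(line)
        if not match:
            continue
        version = match.group(1)
        if version.endswith(SIGNED_SUFFIX):
            version = version[: -len(SIGNED_SUFFIX)]
        if version not in versions:
            versions.append(version)
    return versions


if __name__ == "__main__":
    sample = [
        "kern.log.3:Feb 15 22:59:33 roke kernel: [ 0.000000] Kernel command line: "
        "BOOT_IMAGE=/boot/vmlinuz-4.13.0-32-generic.efi.signed ro quiet splash",
        "kern.log.1:Mar 3 19:10:37 roke kernel: [ 0.000000] Kernel command line: "
        "BOOT_IMAGE=/boot/vmlinuz-4.15.0-10-generic.efi.signed ro quiet splash",
    ]
    for version in booted_kernels(sample):
        print(version)
```

The input can be any iterable of lines, for example an open kern.log file. Rotated logs that have been compressed would need to be opened with `gzip.open` instead.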

Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :

Since the last boot, it has been up for:

$ uptime
 08:29:04 up 13:54, 1 user, load average: 0.37, 0.32, 0.16

I should be able to try the mainline builds at some point. I am wondering if I should leave it alone for a couple of days, since this specific boot has a USB backup hard drive removed. I removed it yesterday in case it might have had an impact. After the USB drive test is done, if the problem persists, I will try your suggestion of mainline kernels.

Thanks for your help.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

I guess it's the same one as [1]?

Does the system freeze when you are using it? Or does it freeze when it's idle?

[1] https://bugzilla.kernel.org/show_bug.cgi?id=196683

Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :

Most of the freezes happened when the system was idle. I caught one freeze as I was typing "cat /proc/cpuinfo" (it froze somewhere in the middle of that). But IIRC, that was one time; all other freezes were while idle.

It has not yet frozen since:

$ uptime
 09:07:18 up 1 day, 14:32, 1 user, load average: 0.18, 0.19, 0.16

That bugzilla link you showed me has lots of information. I think C6 states could be relevant and interesting, thanks for that link! Per the discussion in that thread, I have zenstates.py installed and working, but have not changed anything yet, just listed the states.

Will review the thread more closely and see where we go.

Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :

I was about to give up trying to recreate this. Everything was updated as of an hour or so earlier today, meaning that over the last 5 days, whatever updates were prompted had been applied as they showed up and I got around to doing them. The system had not been restarted since the time below; it had been up for over 5 days since the last time it froze.

00:28:01 up 5 days, 5:53, 1 user, load average: 2.58, 0.96, 0.62

$ uname -a
Linux roke 4.15.0-10-generic #11-Ubuntu SMP Tue Feb 13 18:23:35 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu Bionic Beaver (development branch)
Release: 18.04
Codename: bionic

But then this happened. This is the first time I have seen any logs that could be useful ...

Mar 10 00:32:01 roke kernel: [453428.641614] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:43:crtc-0] flip_done timed out
Mar 10 00:33:51 roke kernel: [453478.603354] INFO: rcu_sched detected stalls on CPUs/tasks:
Mar 10 00:33:51 roke kernel: [453478.603364] 8-...!: (1 GPs behind) idle=b6c/0/0 softirq=4280452/4280453 fqs=0
Mar 10 00:33:51 roke kernel: [453478.603368] 9-...!: (11 GPs behind) idle=3bc/0/0 softirq=962607/962607 fqs=0
Mar 10 00:33:51 roke kernel: [453478.603371] 10-...!: (0 ticks this GP) idle=a2c/0/0 softirq=3620248/3620248 fqs=0
Mar 10 00:33:51 roke kernel: [453478.603375] 11-...!: (40 GPs behind) idle=cd8/0/0 softirq=3434620/3434620 fqs=0
Mar 10 00:33:51 roke kernel: [453478.603378] 12-...!: (1 GPs behind) idle=ac8/0/0 softirq=5369219/5369220 fqs=0
Mar 10 00:33:51 roke kernel: [453478.603381] 13-...!: (39 GPs behind) idle=528/0/0 softirq=10045630/10045630 fqs=0
Mar 10 00:33:51 roke kernel: [453478.603383] (detected by 4, t=15002 jiffies, g=8374930, c=8374929, q=535)
Mar 10 00:33:51 roke kernel: [453478.603388] Sending NMI from CPU 4 to CPUs 8:
Mar 10 00:33:51 roke kernel: [453488.529115] Sending NMI from CPU 4 to CPUs 9:
Mar 10 00:33:51 roke kernel: [453498.453348] Sending NMI from CPU 4 to CPUs 10:
Mar 10 00:33:51 roke kernel: [453508.377571] Sending NMI from CPU 4 to CPUs 11:
Mar 10 00:33:51 roke kernel: [453518.301796] Sending NMI from CPU 4 to CPUs 12:
Mar 10 00:33:51 roke kernel: [453528.226048] Sending NMI from CPU 4 to CPUs 13:
Mar 10 00:33:51 roke kernel: [453538.150337] rcu_sched kthread starved for 26177 jiffies! g8374930 c8374929 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=10
Mar 10 00:33:51 roke kernel: [453538.150340] rcu_sched I 0 8 2 0x80000000
Mar 10 00:33:51 roke kernel: [453538.150344] Call Trace:
Mar 10 00:33:51 roke kernel: [453538.150351] __schedule+0x297/0x8a0
Mar 10 00:33:51 roke kernel: [453538.150354] schedule+0x2c/0x80
Mar 10 00:33:51 roke kernel: [453538.150356] schedule_timeout+0x15d/0x350
Mar 10 00:33:51 roke kernel: [453538.150360] ? __next_timer_interrupt+0xe0/0xe0
Mar 10 00:33:51 roke kernel: [453538.150364] rcu_gp_kthread+0x53a/0x960
Mar 10 00:33:51 roke kernel: [453538.150367] kthread+0x121/0x140
Mar 10 00:33:51 roke kernel: [453538.150370] ? rcu_note_context_switch+0x150/0x150
Mar 10 00:33:51 roke kernel: [453538.150371] ...
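
To summarize which CPUs the stall report names, a small Python sketch along these lines may help. It assumes the per-CPU lines have the same shape as in the log above (a CPU number, a dash, some flag characters, a colon, then a status in parentheses); `stalled_cpus` is a hypothetical helper name, not an existing tool.

```python
import re

# Per-CPU stall line, e.g. "] 8-...!: (1 GPs behind) idle=..."
STALL_RE = re.compile(r"\]\s+(\d+)-\S*:\s+\(([^)]*)\)")


def stalled_cpus(log_lines):
    """Map each stalled CPU number to the status text printed for it."""
    stalled = {}
    for line in log_lines:
        match = STALL_RE.search(line)
        if match:
            stalled[int(match.group(1))] = match.group(2)
    return stalled


if __name__ == "__main__":
    sample = [
        "Mar 10 00:33:51 roke kernel: [453478.603364] 8-...!: (1 GPs behind) "
        "idle=b6c/0/0 softirq=4280452/4280453 fqs=0",
        "Mar 10 00:33:51 roke kernel: [453478.603388] Sending NMI from CPU 4 to CPUs 8:",
    ]
    for cpu, status in sorted(stalled_cpus(sample).items()):
        print(cpu, status)
```

Run against the excerpt above, this would pick out the per-CPU lines for CPUs 8 through 13 and skip the "detected by" and "Sending NMI" lines, since those do not start with a CPU number followed by a dash.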


Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :

From Kai-Heng Feng's reference above, disabling C6 state looks useful to try at some point.

Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :
summary: - AMD 7 1800X system (Dell Inspiron 5675) randomly freezes
+ AMD Ryzen 7 1800X system (Dell Inspiron 5675) randomly freezes
Revision history for this message
Pragnesh Sampat (pragnesh-sampat) wrote :

I used ZenStates-Linux from Kai-Heng Feng's reference to https://bugzilla.kernel.org/show_bug.cgi?id=196683 and things have been much more stable. I don't think I have seen a freeze since then.

https://github.com/r4m0n/ZenStates-Linux

I have been updating all the while and rebooting, so it is hard to tell if something else got picked up, but I suspect the C6 disable did the trick. At some point, I might try enabling it again, maybe after the 18.04 release.
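
For readers curious what "C6 disable" amounts to, here is a rough Python sketch of the bit manipulation only. The MSR addresses and bit positions are assumptions, written down from memory of the ZenStates-Linux script, and should be checked against that script before relying on them. The function names are made up for illustration.

```python
# ASSUMED values, recalled from the ZenStates-Linux script; verify before use.
MSR_PACKAGE_C6 = 0xC0010292  # assumed: register holding the package C6 enable bit
PACKAGE_C6_BIT = 1 << 32
MSR_CORE_C6 = 0xC0010296     # assumed: register holding the core C6 enable bits
CORE_C6_BITS = (1 << 22) | (1 << 14) | (1 << 6)


def c6_disabled_values(package_val, core_val):
    """Return both register values with the (assumed) C6 enable bits cleared."""
    return package_val & ~PACKAGE_C6_BIT, core_val & ~CORE_C6_BITS


def c6_state(package_val, core_val):
    """Return (package_c6_enabled, core_c6_enabled) for the given register values."""
    package_on = bool(package_val & PACKAGE_C6_BIT)
    core_on = (core_val & CORE_C6_BITS) == CORE_C6_BITS
    return package_on, core_on


if __name__ == "__main__":
    # Made-up register contents, only to show the effect of clearing the bits.
    package_val = PACKAGE_C6_BIT | 0x5
    core_val = CORE_C6_BITS | 0x1
    print(c6_state(package_val, core_val))
    print(c6_state(*c6_disabled_values(package_val, core_val)))
```

The sketch deliberately stops short of touching hardware. Reading and writing the real registers needs root and the `msr` kernel module (the `/dev/cpu/N/msr` device files), and the change does not survive a reboot, so it has to be reapplied on each boot.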

At this point, not sure if you want to keep this bug open, whatever works for you guys.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Using a script to write MSR isn't a solution.

I wrote a patch [1] to disable C6 directly inside the kernel, but apparently users are not happy about that.

Since it's the same one as LP: #1690085, we can mark this one as a dupe.

[1] https://bugzilla.kernel.org/attachment.cgi?id=274853&action=diff
