Ryzen 1800X soft lockups

Bug #1900401 reported by ed20900
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone:

Bug Description

System freezes. Sometimes I can move the mouse, sometimes not.
Dmesg contains soft lockups, hard lockups and MCEs.
Hardware:
CPU: Ryzen 7 1800X
RAM: Crucial Ballistix 32 GiB DDR4
Motherboard: ASUS Crosshair VI Hero
PSU: Corsair TX750v2

I have also tested with a different PSU (Corsair TX750M) and a different motherboard (Gigabyte Aorus Ultra Gaming). I've also used a different memory kit, Corsair Vengeance LPX. I still had issues.

I have had the processor replaced by AMD twice. The first time because it segfaulted while executing this script: https://github.com/suaefar/ryzen-test/blob/master/kill-ryzen.sh
And the second time because the processor threw MCEs while executing that same script.

Now I'm left with MCEs collected at boot or with soft and hard lockups collected by the kernel with the NMI watchdog timer.

Having the GNOME extension Vitals poll hardware sensors through asus-wmi-sensors makes the freezes more frequent.

A bug opened in the mainline kernel's Bugzilla, https://bugzilla.kernel.org/show_bug.cgi?id=208201, provides a C program that can also trigger errors (unknown NMIs and PCIe AER). There are two versions:
* V1 https://bugzilla.kernel.org/attachment.cgi?id=292941
* V2 https://bugzilla.kernel.org/attachment.cgi?id=292949

[41717.839567] Uhhuh. NMI received for unknown reason 2d on CPU 11.
[41717.839568] Do you have a strange power saving mode enabled?
[41717.839568] Dazed and confused, but trying to continue

and

pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0
pcieport 0000:00:01.3: AER: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
pcieport 0000:00:01.3: AER: device [1022:1453] error status/mask=00001000/00006000
pcieport 0000:00:01.3: AER: [12] Timeout

This PCIe error comes from the bus that provides the USB ports. The Crosshair VI Hero motherboard has sometimes shown dead USB ports at boot, leaving the system without keyboard and mouse and making Ubuntu boot slowly.
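For reference, the "error status/mask" pair in the AER lines above can be decoded bit by bit. The following is a minimal sketch, not part of the original report: the bit table follows the PCIe AER Correctable Error Status register layout with the short names the Linux AER driver prints, and the helper names `decode_bits` and `decode_aer_line` are made up for illustration.

```python
import re

# Bit positions of the PCIe AER Correctable Error Status register,
# labelled with the short names the Linux AER driver prints.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}

# Matches the "error status/mask=XXXXXXXX/XXXXXXXX" part of an AER log line.
STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known correctable-error bits set in value."""
    return [
        f"[{bit}] {name}"
        for bit, name in sorted(CORRECTABLE_BITS.items())
        if value & (1 << bit)
    ]


def decode_aer_line(line):
    """Decode status and mask from one AER log line, or return None."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1), 16)
    mask = int(match.group(2), 16)
    return {"status": decode_bits(status), "masked": decode_bits(mask)}


# Line copied from the log excerpt in this report.
line = ("pcieport 0000:00:01.3: AER: device [1022:1453] "
        "error status/mask=00001000/00006000")
print(decode_aer_line(line))
```

For the line above, the status word has only bit 12 set, which agrees with the "[12] Timeout" line the kernel printed, and the mask covers bits 13 and 14.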

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-generic 5.4.0.51.54
ProcVersionSignature: Ubuntu 5.4.0-51.56-generic 5.4.65
Uname: Linux 5.4.0-51-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.9
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: raul 2671 F.... pulseaudio
 /dev/snd/controlC0: raul 2671 F.... pulseaudio
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
Date: Mon Oct 19 10:28:31 2020
HibernationDevice:

InstallationDate: Installed on 2014-06-05 (2327 days ago)
InstallationMedia: Ubuntu 14.04 LTS "Trusty Tahr" - Release amd64 (20140417)
MachineType: System manufacturer System Product Name
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-51-generic root=UUID=c46a90a3-0e37-4fb0-87ea-a14d4713dfc4 ro iommu=pt crashkernel=512M-:192M
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-51-generic N/A
 linux-backports-modules-5.4.0-51-generic N/A
 linux-firmware 1.187.3
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to focal on 2020-04-26 (175 days ago)
dmi.bios.date: 07/31/2020
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 7901
dmi.board.asset.tag: Default string
dmi.board.name: CROSSHAIR VI HERO
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr7901:bd07/31/2020:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnCROSSHAIRVIHERO:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

ed20900 (ed20900) wrote :
ed20900 (ed20900) wrote :

dump_acpi_tables.py crashed with a "permission denied" error while accessing /sys/firmware/acpi/tables/SSDT4.

I'm attaching the output of journalctl -k -b all, which contains MCE errors, soft and hard lockups, unknown NMIs and PCIe AER errors.
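A long kernel log like the attached one can be summarized by counting lines per error type. This is only a rough sketch: the regular expressions are assumptions about how these messages are usually worded, the category names are invented for illustration, and the sample lines are the ones quoted in this report.

```python
import re
from collections import Counter

# Assumed patterns for each error class; adjust them to the actual log wording.
PATTERNS = {
    "soft_lockup": re.compile(r"soft lockup", re.IGNORECASE),
    "hard_lockup": re.compile(r"hard lockup", re.IGNORECASE),
    "mce": re.compile(r"\bmce:|machine check", re.IGNORECASE),
    "unknown_nmi": re.compile(r"NMI received for unknown reason"),
    "pcie_aer": re.compile(r"\bAER:"),
}


def classify(line):
    """Return every category whose pattern matches the line."""
    return [name for name, pattern in PATTERNS.items() if pattern.search(line)]


def summarize(lines):
    """Count how many lines fall into each category."""
    counts = Counter()
    for line in lines:
        counts.update(classify(line))
    return counts


# Sample lines copied from the excerpts in this report.
sample = [
    "[41717.839567] Uhhuh. NMI received for unknown reason 2d on CPU 11.",
    "[41717.839568] Do you have a strange power saving mode enabled?",
    "pcieport 0000:00:01.3: AER: Corrected error received: 0000:00:00.0",
    "pcieport 0000:00:01.3: AER: [12] Timeout",
]
print(dict(summarize(sample)))
```

To run it on a real log, the saved journalctl output could be opened as a file and passed to `summarize` line by line.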

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
ed20900 (ed20900) wrote :

I have tested two ASUS Crosshair VII Hero boards and I'm now on an MSI X470 Gaming Pro MAX.
I've also used four different AMD Ryzen 7 1800X processors, three of them coming from AMD's RMA process. With the last RMA, the representative told me it was a task for my Linux vendor to fix this. Well, I'm using Ubuntu... so I'm leaving the message here.

The bug opened on Linux's bugzilla has more logs.

Kai-Heng Feng (kaihengfeng) wrote :
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
ed20900 (ed20900) wrote :

I tested with Linux 5.8.16 (https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.8.16/) and the lockups persisted. I installed a Ryzen 7 2700X and I have not had a single lockup in 17 days of uptime.

Is there any change in Linux 5.10 rc2 (or the latest, rc6) which affects this?

ed20900 (ed20900) wrote :

I have installed Linux 5.10-rc2. Currently, 30 minutes of uptime.

ed20900 (ed20900) wrote :

It crashed. Nothing appears in dmesg except a corrected AER error.

Kai-Heng Feng (kaihengfeng) wrote :

Please see if the kernel parameters "pcie_aspm=force pcie_aspm.policy=performance" help.
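Before testing, it helps to confirm that the suggested parameters actually reached the running kernel. The sketch below compares a kernel command line against a set of wanted parameters. The helpers `parse_cmdline` and `missing_params` are hypothetical names, the parsing is simplified (it ignores quoted values containing spaces), and the sample string is the ProcKernelCmdLine from this report.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'ro') map to None. Quoted values that
    contain spaces are not handled; this is a simplification.
    """
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def missing_params(cmdline, wanted):
    """Return the wanted parameters that are absent or set differently."""
    current = parse_cmdline(cmdline)
    return {k: v for k, v in wanted.items() if current.get(k) != v}


# Parameters suggested in the comment above.
wanted = {"pcie_aspm": "force", "pcie_aspm.policy": "performance"}

# ProcKernelCmdLine as recorded in this report.
reported = ("BOOT_IMAGE=/vmlinuz-5.4.0-51-generic "
            "root=UUID=c46a90a3-0e37-4fb0-87ea-a14d4713dfc4 "
            "ro iommu=pt crashkernel=512M-:192M")

print(missing_params(reported, wanted))
```

On a live system the string would come from reading /proc/cmdline; an empty result means both parameters are in effect on the command line.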

ed20900 (ed20900) wrote :

I tried "pcie_aspm=off nvme_core.default_ps_max_latency_us=0" but it did not help. I'll try with "force" and the "performance" policy.

I ended up fed up with the crashes and upgraded to a Ryzen 7 2700X, with the rest of the system unchanged. It still crashes from time to time, but it takes much longer. Instead of 48h of uptime it can last weeks.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired