Long, multi-threaded compilations, segfault in Ryzen

Bug #1708222 reported by Paulo J. S. Silva
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone:

Bug Description

I have a Ryzen 1700X on an MSI B350 motherboard with 64 GB of RAM (Corsair LPX 2400). During very intensive multi-threaded compilation sessions I sometimes get a segfault. This seems to be a problem with Ryzen itself. It may be related to the bug described in bug #1690085, but I believe it is not the same. This bug affects many Linux users with Ryzen; see, for example, this thread in the AMD forum: https://community.amd.com/thread/215773?start=0&tstart=0 or the Gentoo Wiki, which covers the problem in the Troubleshooting section of its Ryzen page: https://wiki.gentoo.org/wiki/Ryzen#Troubleshooting

It is also very easy to check whether your processor has the problem. Fortunately, some smart people have created a simple script that always reproduces the problem on my system and on the systems of the other people in the thread. The script can be found at

https://github.com/suaefar/ryzen-test

You just have to clone the repository using git, move to the ryzen-test directory and run ./kill-ryzen.sh. It is a very simple script: it downloads the gcc-7.1 source code into a RAM disk and starts as many simultaneous compilations of it as there are processors. If any compilation fails, it writes a message to the console saying how long it took to hit the failure. On my system the build fails after a few minutes unless I turn off SMT. With SMT off it can take many hours, but it still fails in less than one day.
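
For readers who want the idea behind the test without reading the script, here is a minimal Python sketch of the same approach: start one worker per processor, let each worker run a build command repeatedly, and report how long it took until a run failed. This is not the ryzen-test script itself. The function names (`run_until_failure`, `stress`) and the `max_runs` parameter are made up for illustration, and the command in the example is a harmless placeholder that you would replace with a real, heavy build command.

```python
import os
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_until_failure(command, worker_id, max_runs, started):
    """Run `command` up to `max_runs` times.

    Returns (worker_id, seconds since `started`) for the first run that
    exits with a non-zero status, or None if every run succeeded.
    """
    for _ in range(max_runs):
        result = subprocess.run(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            return worker_id, time.monotonic() - started
    return None


def stress(command, workers=None, max_runs=1):
    """Run `workers` concurrent loops of `command` and collect the failures.

    `workers` defaults to the number of logical processors, mirroring the
    "one compilation per processor" idea from the bug description.
    """
    workers = workers or os.cpu_count() or 1
    started = time.monotonic()
    failures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(run_until_failure, command, i, max_runs, started)
            for i in range(workers)
        ]
        for future in as_completed(futures):
            outcome = future.result()
            if outcome is not None:
                failures.append(outcome)
    return failures


if __name__ == "__main__":
    # Placeholder workload (an assumption for this sketch): a real test
    # would run a long compilation here instead of a no-op.
    placeholder = [sys.executable, "-c", "pass"]
    for worker_id, elapsed in stress(placeholder, max_runs=3):
        print(f"worker {worker_id} failed after {elapsed:.1f} s")
```

Threads are enough here because each worker only waits on a child process; the actual load comes from the commands being run. A failure reported by a sketch like this only means the command exited with an error, so it says something about the CPU only if the same build is known to succeed on other hardware.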

I am opening this bug report because I believe we should try to verify whether this is a widespread problem and inform potential users about it. Hopefully AMD or the kernel developers can find a workaround. I have also opened a bug report in the Linux Kernel Bugzilla (https://bugzilla.kernel.org/show_bug.cgi?id=196481), but unfortunately it has not attracted the attention of the kernel developers.
---
ApportVersion: 2.20.4-0ubuntu4.5
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-id', '/dev/snd/pcmC2D0c', '/dev/snd/controlC2', '/dev/snd/by-path', '/dev/snd/hwC1D0', '/dev/snd/pcmC1D2c', '/dev/snd/pcmC1D0c', '/dev/snd/pcmC1D0p', '/dev/snd/controlC1', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D11p', '/dev/snd/pcmC0D10p', '/dev/snd/pcmC0D9p', '/dev/snd/pcmC0D8p', '/dev/snd/pcmC0D7p', '/dev/snd/pcmC0D3p', '/dev/snd/controlC0', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 17.04
HibernationDevice: RESUME=UUID=b299e86c-e86e-4608-b025-586133a3f5a6
InstallationDate: Installed on 2017-05-30 (63 days ago)
InstallationMedia: Ubuntu 17.04 "Zesty Zapus" - Release amd64 (20170412)
IwConfig:
 enp33s0 no wireless extensions.

 lo no wireless extensions.
MachineType: Micro-Star International Co., Ltd MS-7A34
Package: linux (not installed)
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.10.0-29-generic root=UUID=ea2ac8eb-cc5f-42a3-b7db-3d6a870496a9 ro quiet iommu=soft splash rcu_nocbs=1-15 vt.handoff=7
ProcVersionSignature: Ubuntu 4.10.0-29.33-generic 4.10.17
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-29-generic N/A
 linux-backports-modules-4.10.0-29-generic N/A
 linux-firmware 1.164.1
RfKill:

Tags: zesty
Uname: Linux 4.10.0-29-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo users video
_MarkForUpload: True
dmi.bios.date: 07/06/2017
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1.71
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: B350 TOMAHAWK (MS-7A34)
dmi.board.vendor: Micro-Star International Co., Ltd
dmi.board.version: 1.0
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Micro-Star International Co., Ltd
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1.71:bd07/06/2017:svnMicro-StarInternationalCo.,Ltd:pnMS-7A34:pvr1.0:rvnMicro-StarInternationalCo.,Ltd:rnB350TOMAHAWK(MS-7A34):rvr1.0:cvnMicro-StarInternationalCo.,Ltd:ct3:cvr1.0:
dmi.product.name: MS-7A34
dmi.product.version: 1.0
dmi.sys.vendor: Micro-Star International Co., Ltd

Paulo J. S. Silva (pjssilva) wrote :
Paulo J. S. Silva (pjssilva) wrote :

The other log file.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1708222

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Paulo J. S. Silva (pjssilva) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected zesty
description: updated
Paulo J. S. Silva (pjssilva) wrote : CRDA.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : CurrentDmesg.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : JournalErrors.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : Lspci.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : Lsusb.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : ProcCpuinfo.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : ProcCpuinfoMinimal.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : ProcEnviron.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : ProcInterrupts.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : ProcModules.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : UdevDb.txt

apport information

Paulo J. S. Silva (pjssilva) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Opinion
status: Opinion → Confirmed
Paulo J. S. Silva (pjssilva) wrote :

This has just blown up on Phoronix! See

https://phoronix.com/scan.php?page=news_item&px=Ryzen-Test-Stress-Run

Michael can also reproduce the problem with his test suite by running

PTS_CONCURRENT_TEST_RUNS=4 TOTAL_LOOP_TIME=60 phoronix-test-suite stress-run build-linux-kernel build-php build-apache pgbench apache redis

I hope AMD get their act together soon.

Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.13 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc4

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
tags: added: kernel-da-key
tags: added: kernel-bug-exists-upstream
Paulo J. S. Silva (pjssilva) wrote :

It seems to be confirmed as a hardware problem. See:

https://www.phoronix.com/scan.php?page=news_item&px=Ryzen-Segv-Response

I have contacted AMD and will be doing an RMA of my CPU in the next few days. From what I could gather, the bug is common in the first batches of Ryzen, so there might be many affected CPUs in the wild. AMD is not issuing a recall; it will deal with the problem on a case-by-case basis.

Anyone can check whether their CPU has the problem by running the kill-ryzen.sh script described in the original bug report. If your CPU has the problem, contact AMD technical support using the link in the Phoronix article linked above.
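
Since the time to failure reported in the original description depended heavily on whether SMT was on, it may help to note the CPU model and SMT state together with any test result. The following is a small Python sketch for that, assuming the usual x86 Linux /proc/cpuinfo fields `model name`, `siblings` and `cpu cores`; the helper name `summarize_cpuinfo` is hypothetical. It only describes the machine and does not detect the segfault problem.

```python
def summarize_cpuinfo(text):
    """Return (model_name, smt_enabled) from /proc/cpuinfo-style text.

    Only the first processor block is inspected. `smt_enabled` is True
    when the package reports more logical siblings than physical cores,
    and None when the needed fields are missing or not numeric.
    """
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            if fields:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    model = fields.get("model name", "unknown")
    try:
        smt = int(fields["siblings"]) > int(fields["cpu cores"])
    except (KeyError, ValueError):
        smt = None
    return model, smt


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as handle:
            model, smt = summarize_cpuinfo(handle.read())
    except OSError:
        print("/proc/cpuinfo is not readable on this system")
    else:
        print(f"model: {model}")
        print(f"SMT enabled: {smt}")
```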

Maybe I should mark this bug as invalid? At least the bug report will remain here so that other affected Ubuntu users may find out how to proceed.
