/var/log/syslog and /var/log/kern.log report millions of nvme errors

Bug #1852420 reported by Detlef Brendle
This bug affects 2 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Hi,
since upgrading Ubuntu from 18.04 to 19.10 I see tons of error messages in /var/log/syslog and /var/log/kern.log saying:

Nov 13 11:18:50 detlef-Precision-5520 kernel: [ 2342.863294] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
Nov 13 11:18:50 detlef-Precision-5520 kernel: [ 2342.863305] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Nov 13 11:18:50 detlef-Precision-5520 kernel: [ 2342.863309] nvme 0000:04:00.0: AER: device [144d:a804] error status/mask=00000001/00006000
Nov 13 11:18:50 detlef-Precision-5520 kernel: [ 2342.863312] nvme 0000:04:00.0: AER: [ 0] RxErr
Nov 13 11:18:50 detlef-Precision-5520 kernel: [ 2342.895923] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
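
The "error status/mask=00000001/00006000" field in these messages can be read bit by bit. The following is a minimal sketch, assuming the bit layout of the PCIe AER Correctable Error Status register and short labels similar to the ones the kernel prints (the exact label spelling varies between kernel versions, so treat the names as approximate). The function names are hypothetical and not part of any kernel tooling.

```python
# Bit positions in the PCIe AER Correctable Error Status / Mask registers.
# The labels are assumptions modelled on the short names seen in kernel logs.
CORRECTABLE_BITS = {
    0: "RxErr",         # Receiver Error
    6: "BadTLP",        # Bad TLP
    7: "BadDLLP",       # Bad DLLP
    8: "Rollover",      # REPLAY_NUM Rollover
    12: "Timeout",      # Replay Timer Timeout
    13: "NonFatalErr",  # Advisory Non-Fatal Error
    14: "CorrIntErr",   # Corrected Internal Error
    15: "HeaderOF",     # Header Log Overflow
}


def decode_correctable(value_hex):
    """Return the labels of all bits set in a hex register value."""
    value = int(value_hex, 16)
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]


def decode_status_mask(field):
    """Split a 'status/mask' field from the log and decode both halves."""
    status_hex, mask_hex = field.split("/")
    return {
        "status": decode_correctable(status_hex),
        "mask": decode_correctable(mask_hex),
    }


print(decode_status_mask("00000001/00006000"))
```

For the value in the log above, only bit 0 (the receiver error, matching the "[ 0] RxErr" line) is set in the status half, while the mask half has bits 13 and 14 set.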

I'm running
5.3.0-22-generic #24-Ubuntu SMP Sat Nov 9 17:34:30 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I have a Samsung SSD 960 Evo in my system and it worked fine before upgrading.
The system otherwise runs fine (no crashes or other errors), but these messages are very annoying, and they once filled the disk, which led to a "no space left on device" error and a complete crash.

Thanks,
detlef
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu8.2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: detlef 1914 F.... pulseaudio
 /dev/snd/pcmC0D3p: detlef 1914 F...m pulseaudio
 /dev/snd/pcmC0D0c: detlef 1914 F...m pulseaudio
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 19.10
InstallationDate: Installed on 2019-11-05 (8 days ago)
InstallationMedia: Ubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
MachineType: Dell Inc. Precision 5520
Package: linux (not installed)
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.3.0-22-generic root=UUID=d5ce3ce0-94d8-45e4-b4c3-5446497c141b ro quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 5.3.0-22.24-generic 5.3.7
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-22-generic N/A
 linux-backports-modules-5.3.0-22-generic N/A
 linux-firmware 1.183.1
Tags: eoan
Uname: Linux 5.3.0-22-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip docker lpadmin lxd plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 05/23/2019
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.15.0
dmi.board.name: 06X96V
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.15.0:bd05/23/2019:svnDellInc.:pnPrecision5520:pvr:rvnDellInc.:rn06X96V:rvrA00:cvnDellInc.:ct10:cvr:
dmi.product.family: Precision
dmi.product.name: Precision 5520
dmi.product.sku: 07BF
dmi.sys.vendor: Dell Inc.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1852420

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: eoan
Detlef Brendle (detlef-brendle-2) wrote : apport information

Attachments: AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, Lsusb.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, RfKill.txt, UdevDb.txt, WifiSyslog.txt

tags: added: apport-collected
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Detlef Brendle (detlef-brendle-2) wrote :

Hi,
any idea what I should do?

Thanks,
detlef

Kai-Heng Feng (kaihengfeng) wrote :

Which kernel is the last working one?

Detlef Brendle (detlef-brendle-2) wrote :

Before 19.10 I used 18.04. There I did not see that error.
I'm not sure which kernel version I used before, but definitely not 5.3.0.

Kai-Heng Feng (kaihengfeng) wrote :

Would it be possible for you to do a kernel bisection?

First, find the last good -rc kernel and the first bad -rc kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/

Then,
$ sudo apt build-dep linux
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect good <the good version you found>
$ git bisect bad <the bad version you found>
$ make localmodconfig
$ make -j`nproc` deb-pkg
Install the newly built kernel, then reboot with it.
If the issue still happens,
$ git bisect bad
Otherwise,
$ git bisect good
Repeat from "make -j`nproc` deb-pkg" until you find the commit that causes the regression.
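
The build, test, and mark loop above is a binary search over the commit history. As a rough illustration only, here is a sketch that assumes a simple linear list of commits ordered oldest to newest (real git history is a graph, and git picks the midpoint itself). The function name, the commit labels, and the is_bad callback are hypothetical stand-ins for "build, boot, and check the log".

```python
def first_bad(commits, is_bad):
    """Find the first bad commit by repeatedly halving the search range.

    Assumes commits[0] is known good, commits[-1] is known bad, and that
    every commit after the regression is also bad.
    Returns the culprit and the number of commits that had to be tested.
    """
    good, bad = 0, len(commits) - 1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1                 # one kernel build plus reboot per step
        if is_bad(commits[mid]):
            bad = mid              # corresponds to "git bisect bad"
        else:
            good = mid             # corresponds to "git bisect good"
    return commits[bad], tests


# Hypothetical history of 1000 commits where the regression starts at c637.
commits = [f"c{i}" for i in range(1000)]
culprit, tests = first_bad(commits, lambda c: int(c[1:]) >= 637)
print(culprit, tests)
```

Because the range halves at every step, even a thousand candidate commits need only about ten build and reboot cycles.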

Detlef Brendle (detlef-brendle-2) wrote :

Hello,

this is a bit too much to ask of me.

Sorry,
I will wait for a new kernel and hope the problem will be gone by then.

Detlef

htrex (hantarex) wrote :

I'm seeing the same issue on a Dell XPS 9560, which is basically the same laptop as the Precision 5520 with a different GPU.

[ 335.012863] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0
[ 335.012869] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 335.012871] nvme 0000:04:00.0: AER: device [144d:a804] error status/mask=00000001/00006000
[ 335.012873] nvme 0000:04:00.0: AER: [ 0] RxErr

Linux OrionXPS 5.3.0-26-generic #28~18.04.1-Ubuntu SMP Wed Dec 18 16:40:14 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
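
To get a feel for how many of these messages a log contains, the lines can be counted per device. This is a minimal sketch, assuming the message wording shown in the excerpts above. The function name and the regular expression are illustrative and not part of any existing tool.

```python
import re
from collections import Counter

# Matches the root-port line that announces each corrected error, e.g.
# "pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0"
AER_RE = re.compile(
    r"AER: Corrected error received: "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])"
)


def count_corrected_errors(lines):
    """Count 'Corrected error received' messages per reporting device."""
    counts = Counter()
    for line in lines:
        match = AER_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "[ 335.012863] pcieport 0000:00:1d.0: AER: Corrected error received: 0000:04:00.0",
    "[ 335.012869] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)",
    "[ 335.012873] nvme 0000:04:00.0: AER: [ 0] RxErr",
]
print(count_corrected_errors(sample))
```

Any iterable of lines works as input, so an open log file such as /var/log/kern.log can be passed in directly (reading it may require membership in the adm group or root).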

htrex (hantarex) wrote :

[ 509.467168] nvme 0000:04:00.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[ 509.467172] nvme 0000:04:00.0: AER: device [144d:a804] error status/mask=00000001/00006000
[ 509.467174] nvme 0000:04:00.0: AER: [ 0] RxErr

Linux OrionXPS 5.3.0-42-generic #34~18.04.1-Ubuntu SMP Fri Feb 28 13:42:26 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

anydoby (anydoby) wrote :

Confirming this for Dell 9560. Also have 25% CPU eaten by the journal daemon.

Kai-Heng Feng (kaihengfeng) wrote :

anydoby,

Does your system use the same Samsung NVMe?

htrex (hantarex) wrote :

Confirmed with Ubuntu 20.04 kernel 5.4.0-29
Linux OrionXPS 5.4.0-29-generic #33-Ubuntu SMP Wed Apr 29 14:32:27 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Kai-Heng Feng (kaihengfeng) wrote :

hantarex, is it a Precision 5520 too?

Joshua Huh (joshua-in-boots) wrote :

I have the same problem on Pop!_OS 20.04. I temporarily disabled AER by using "sudo kernelstub -a pci=noaer".

Ubuntu uses GRUB, so there you add "pci=noaer" to the GRUB_CMDLINE_LINUX_DEFAULT line in /etc/default/grub and then run "sudo update-grub".
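
The GRUB edit amounts to appending one word inside the quoted value of a single line. Here is a minimal sketch of that text change, assuming the usual one-line double-quoted form of the variable. The function name and the sample file contents are made up for illustration. Note that this parameter only hides the messages and does not address what causes the errors.

```python
import re


def add_kernel_param(grub_config, param):
    """Append param to GRUB_CMDLINE_LINUX_DEFAULT unless already present."""
    pattern = re.compile(
        r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE
    )

    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    return pattern.sub(repl, grub_config)


# Hypothetical contents of /etc/default/grub, shortened for the example.
sample = (
    'GRUB_DEFAULT=0\n'
    'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\n'
    'GRUB_CMDLINE_LINUX=""\n'
)
print(add_kernel_param(sample, "pci=noaer"))
```

The function works on a string rather than on the file itself, so the result can be inspected before anything is written back. Applying it twice leaves the line unchanged.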

Bjorn Helgaas (bjorn-helgaas) wrote :

We should not need to use "pci=noaer". Generally we should not see reproducible PCIe Correctable Errors in significant numbers. Some have reported that "pcie_aspm=off" avoids the errors. If that's the case for you, see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2043665/comments/6 and help me investigate it!
