nvme smart error count email after upgrading to 20.04

Bug #1878264 reported by Mike Bernson
This bug affects 12 people
Affects                   Status         Importance   Assigned to        Milestone
smartmontools             Unknown        Unknown
smartmontools (Debian)    New            Unknown
smartmontools (Ubuntu)    Fix Released   Low          Unassigned
  Jammy                   Triaged        Undecided    Andreas Hasenack
  Mantic                  Triaged        Undecided    Andreas Hasenack

Bug Description

I just upgraded from 19.10 to 20.04.

I am getting email from smart about errors on the nvme ssds.

If I look in the syslog I see for all nvme devices:
Device: /dev/nvme0, number of Error Log entries increased from 485 to 48
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme0: Shutdown timeout set to 8 seconds

but running smartctl on the device shows:
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
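
The mismatch between the counter and the empty log can be checked mechanically. Below is a minimal Python sketch, assuming the plain-text layout of `smartctl -x` quoted in this report (the `Error Information Log Entries` field and the `No Errors Logged` line). The function name `summarize_nvme_errors` and the `sample` string are made up for illustration, and other smartctl versions may format the output differently.

```python
import re

def summarize_nvme_errors(smartctl_output: str) -> dict:
    """Extract the lifetime error counter and whether the error log is empty.

    Assumes the text layout printed by `smartctl -x` for NVMe devices as
    quoted in this report; this is not an official parsing interface.
    """
    match = re.search(r"^Error Information Log Entries:\s*([\d,]+)",
                      smartctl_output, re.MULTILINE)
    # The counter may be printed with thousands separators, so strip commas.
    counter = int(match.group(1).replace(",", "")) if match else None
    log_empty = "No Errors Logged" in smartctl_output
    return {"counter": counter, "log_empty": log_empty}

# Shortened excerpt of the output shown in this report.
sample = """\
Error Information Log Entries: 488
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
"""
print(summarize_nvme_errors(sample))  # {'counter': 488, 'log_empty': True}
```

A non-zero counter together with `log_empty` being true is the situation described here: the device counts entries that smartctl does not list.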

Email:
This message was generated by the smartd daemon running on:

   host name: mike-think
   DNS domain: mlb.org

The following warning/error was logged by the smartd daemon:

Device: /dev/nvme0, number of Error Log entries increased from 485 to 488

Device info:
SAMSUNG MZVLB2T0HALB-000L7, S/N:S4GCNF0N100119, FW:3M2QEXG7, 2.04 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
The original message about this issue was sent at Mon Mar 30 22:32:48 2020 EDT
Another message will be sent in 24 hours if the problem persists.

Full smartctl -x:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-29-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: SAMSUNG MZVLB2T0HALB-000L7
Serial Number: S4GCNF0N100119
Firmware Version: 3M2QEXG7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Utilization: 22,365,745,152 [22.3 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 8101b7cc7c
Local Time is: Tue May 12 14:43:46 2020 EDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 84 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.00W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 47 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 4,338,357 [2.22 TB]
Data Units Written: 725,358 [371 GB]
Host Read Commands: 8,501,335
Host Write Commands: 4,632,710
Controller Busy Time: 36
Power Cycles: 223
Power On Hours: 26
Unsafe Shutdowns: 156
Media and Data Integrity Errors: 0
Error Information Log Entries: 488
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 47 Celsius
Temperature Sensor 2: 40 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-generic 5.4.0.29.34
ProcVersionSignature: Ubuntu 5.4.0-29.33-generic 5.4.30
Uname: Linux 5.4.0-29-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp nvidia_modeset zcommon znvpair nvidia
ApportVersion: 2.20.11-0ubuntu27
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: mike 3053 F.... pulseaudio
 /dev/snd/controlC0: mike 3053 F.... pulseaudio
CasperMD5CheckResult: skip
CurrentDesktop: ubuntu:GNOME
Date: Tue May 12 14:34:09 2020
InstallationDate: Installed on 2020-03-30 (42 days ago)
InstallationMedia: Ubuntu 19.10 "Eoan Ermine" - Release amd64 (20191017)
MachineType: LENOVO 20QNCTO1WW
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.4.0-29-generic root=UUID=6973399c-724d-4607-985d-9426190dc41b ro
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-29-generic N/A
 linux-backports-modules-5.4.0-29-generic N/A
 linux-firmware 1.187
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/08/2020
dmi.bios.vendor: LENOVO
dmi.bios.version: N2NET35W (1.20 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20QNCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: SDK0R32862 WIN
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN2NET35W(1.20):bd01/08/2020:svnLENOVO:pn20QNCTO1WW:pvrThinkPadP53:rvnLENOVO:rn20QNCTO1WW:rvrSDK0R32862WIN:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad P53
dmi.product.name: 20QNCTO1WW
dmi.product.sku: LENOVO_MT_20QN_BU_Think_FM_ThinkPad P53
dmi.product.version: ThinkPad P53
dmi.sys.vendor: LENOVO

Mike Bernson (mike-mlb) wrote :
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
affects: linux (Ubuntu) → smartmontools (Ubuntu)
Christian Franke (christian-franke) wrote :

See related upstream ticket:
https://www.smartmontools.org/ticket/1222

Paride Legovini (paride) wrote :

Thanks; I linked the upstream bug to this report.

Debian bug: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900244

Both are about 1 year old, no patches have been proposed.

Changed in smartmontools (Ubuntu):
status: Confirmed → Triaged
Changed in smartmontools (Debian):
status: Unknown → New
Paride Legovini (paride)
Changed in smartmontools (Ubuntu):
importance: Undecided → Low
Christian Ehrhardt (paelzer) wrote :

Hi,
I came by while retriaging bugs that have been dormant for too long, trying to give them another chance.

There was no progress on the upstream case at all :-/
So sadly there is not much one can act on yet. Since Debian and Ubuntu are waiting for upstream's fix, or at least its position on this, I think we need to set this to Incomplete until that happens.

I have no login on that tracker, but maybe to not give up entirely someone with a login could ping and nudge the case a bit?

Changed in smartmontools (Ubuntu):
status: Triaged → Incomplete
Nick (kousu) wrote :

I'm getting this on

```
root@abbey:~# head /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
```

with this hardware:

```
root@abbey:~# lspci -vv -d 144d:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 (rev 01) (prog-if 02 [NVM Express])
        Subsystem: Samsung Electronics Co Ltd PM963 2.5" NVMe PCIe SSD
        Physical Slot: 0
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin A routed to IRQ 33
        NUMA node: 0
        Region 0: Memory at dfe00000 (64-bit, non-prefetchable) [size=16K]
        Region 2: I/O ports at 6000 [size=256]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
                Address: 0000000000000000 Data: 0000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x4 (ok)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS...

```

Andreas Hasenack (ahasenack) wrote :

I just found this bug while googling for the same error messages.

https://github.com/linux-nvme/nvme-cli/issues/1224 was an attempt at discovering exactly what command that was, but it went stale (and the troubleshooting steps didn't show anything in my case).

Nick (kousu) wrote :

A fix was released 6 months ago! https://www.smartmontools.org/ticket/1222#comment:10

It hasn't made it out to Ubuntu yet; I'm running mantic and smartmontools is at 7.3, while they explained (https://www.smartmontools.org/ticket/1222#comment:16) this needs 7.4:

root@server:~# apt-cache policy smartmontools
smartmontools:
  Installed: 7.3-1
  Candidate: 7.3-1
  Version table:
 *** 7.3-1 500
        500 http://ca.archive.ubuntu.com/ubuntu mantic/main amd64 Packages
        100 /var/lib/dpkg/status

It's gotten noisier too, since I decided to put my NVMe drive to work; now I'm getting multiple emails a day with these spurious errors.

But I see in https://code.launchpad.net/ubuntu/+source/smartmontools that noble has 7.4, so that means that in 4 months when the next Ubuntu comes out this will finally be fixed.
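
Whether an installed package is new enough can be checked with a small version comparison. The Python sketch below assumes, per the upstream comment linked above, that the fix first appears in 7.4. The helper names `upstream_version` and `has_fix` are hypothetical, the parsing is a simplification of Debian version rules, and the string `7.4-2` is only an example value. A backport of the patch into an older package would not be detected this way.

```python
def upstream_version(package_version: str) -> tuple:
    """Return the upstream part of a Debian-style version as a tuple of ints.

    Simplified: drops an optional epoch ("1:") and the Debian revision
    (everything after the last "-"), and assumes the upstream part
    consists only of dot-separated numbers.
    """
    without_epoch = package_version.split(":", 1)[-1]
    upstream = without_epoch.rsplit("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))

def has_fix(package_version: str, fixed_in: str = "7.4") -> bool:
    """True if the upstream version is at least the release carrying the fix."""
    return upstream_version(package_version) >= upstream_version(fixed_in)

print(has_fix("7.3-1"))  # False: the mantic package shown above
print(has_fix("7.4-2"))  # True: example version string for a 7.4 package
```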

Christian Ehrhardt (paelzer) wrote :

Thank you Nick. Other than waiting for noble, I think we should check whether [1] would fit as a backport to at least Jammy. Its changelog entry calls it a "feature", but AFAICS it actually just masks these false positives.

[1]: https://www.smartmontools.org/changeset/5472

tags: added: server-todo
Changed in smartmontools (Ubuntu Jammy):
status: New → Triaged
Changed in smartmontools (Ubuntu Mantic):
status: New → Triaged
Changed in smartmontools (Ubuntu):
status: Incomplete → Fix Released
Changed in smartmontools (Ubuntu Jammy):
assignee: nobody → Andreas Hasenack (ahasenack)
Changed in smartmontools (Ubuntu Mantic):
assignee: nobody → Andreas Hasenack (ahasenack)
Andreas Hasenack (ahasenack) wrote :

I have a system that gets an increase of 2 in the error count for the 0x2002 event in the error log every time I reboot. Its NVMe device is using v1.2. Another system has a 1.4 device and doesn't show the error. Same Ubuntu release on both (mantic), same kernel.

I'll use this to troubleshoot. In the end, it looks like we have two things going on:
- something issuing wrong/invalid commands (kernel/libnvme?)
- smartd being overzealous and considering those critical errors (the linked patch should fix that).
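
The second point can be illustrated with a toy model in Python. This is not smartd's code and not the linked patch; it only contrasts warning on every counter increase with warning only when a new entry is judged device-related. The `ErrorEntry` type, its `device_related` flag, and both function names are hypothetical, and how an entry would actually be classified is left out of the sketch.

```python
from dataclasses import dataclass

@dataclass
class ErrorEntry:
    # Hypothetical, simplified view of one error-log entry.
    code: int             # value shown in the error log, e.g. 0x2002
    device_related: bool  # classification is assumed to be given

def naive_should_warn(previous_count: int, current_count: int) -> bool:
    """Warn on any increase of the lifetime error counter."""
    return current_count > previous_count

def filtered_should_warn(previous_count: int, current_count: int,
                         new_entries: list) -> bool:
    """Warn only if the counter rose and some new entry looks device-related."""
    if current_count <= previous_count:
        return False
    return any(entry.device_related for entry in new_entries)

# Two new entries per reboot that come from rejected commands, not the media.
entries = [ErrorEntry(code=0x2002, device_related=False),
           ErrorEntry(code=0x2002, device_related=False)]
print(naive_should_warn(485, 487))              # True
print(filtered_should_warn(485, 487, entries))  # False
```

In this model the first point (whatever issues the invalid commands) still makes the counter grow; the filter only keeps that growth from turning into a warning email.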
