Every time the system boots I get a worrisome email like
> SMART error (ErrorCount) detected on host: abbey
>
> Device: /dev/nvme0, number of Error Log entries increased from 0 to 1
>
> Device info:
> SAMSUNG MZVPV256HDGL-000L7, S/N:S27MNYAH710579, FW:5L6QBXW7

I was going to throw the chip out, but then I ran `badblocks` over the disk and it found nothing, so I got suspicious and decided to investigate further.
If I use `nvme error-log` (from `apt install nvme-cli`) I can see the errors all look like
```
.................
Entry[ 0]
.................
error_count : 23
sqid : 0
cmdid : 0x1019
status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
```
I'm getting this on
```
root@abbey:~# head /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
```
with this hardware:
```
root@abbey:~# lspci -vv -d 144d:
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Samsung Electronics Co Ltd PM963 2.5" NVMe PCIe SSD
	Physical Slot: 0
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 32 bytes
	Interrupt: pin A routed to IRQ 33
	NUMA node: 0
	Region 0: Memory at dfe00000 (64-bit, non-prefetchable) [size=16K]
	Region 2: I/O ports at 6000 [size=256]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [50] MSI: Enable- Count=1/8 Maskable- 64bit+
		Address: 0000000000000000 Data: 0000
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta: Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR+
			10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			FRS- TPHComp- ExtTPHComp-
			AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			AtomicOpsCtl: ReqEn-
		LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+
			EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
			Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [b0] MSI-X: Enable+ Count=9 Masked-
		Vector table: BAR=0 offset=00003000
		PBA: BAR=0 offset=00002000
	Capabilities: [100 v2] Advanced Error Reporting
		UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [148 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [158 v1] Power Budgeting <?>
	Capabilities: [168 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn- PerformEqu-
		LaneErrStat: 0
	Capabilities: [188 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [190 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
			PortCommonModeRestoreTime=10us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=10us
	Kernel driver in use: nvme
	Kernel modules: nvme
```
I was able to find a clue from someone on Reddit: https://www.reddit.com/r/DataHoarder/comments/gspbur/nvme_errors_but_smart_selfassessment_passed_need/

> I'm 99.99% sure that 0x4004 is "You tried to talk to me with NVMe 1.z but i only speak NVMe 1.x-y"

and that thread suggests that a solution might be to upgrade the Samsung firmware on the drive so that it becomes compatible again, though that's a relatively difficult process.
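To compare the 0x4004 from that thread with the 0x2002 in my own error log, here is a minimal Python sketch of how I understand the NVMe status field to be laid out. It rests on two assumptions I have not verified: that the `status_field` value `nvme error-log` prints already has the phase tag shifted out (it shows `phase_tag` on its own line), and that the Reddit value still included that bit. `decode_status` is just a name made up for this illustration.

```python
# Sketch: split an NVMe completion status value into its sub-fields.
# Assumption: `value` is the 15-bit status with the phase tag (bit 0 of
# the raw 16-bit field) already shifted out.

def decode_status(value: int) -> dict:
    return {
        "sc": value & 0xFF,            # Status Code
        "sct": (value >> 8) & 0x7,     # Status Code Type (0 = generic command status)
        "crd": (value >> 11) & 0x3,    # Command Retry Delay
        "more": bool(value & 0x2000),  # More: further details are in the error log
        "dnr": bool(value & 0x4000),   # Do Not Retry
    }

# Value reported in my error log.
mine = decode_status(0x2002)
print(mine)

# Assumption: the Reddit value was the raw 16-bit field, so drop bit 0 first.
reddit = decode_status(0x4004 >> 1)
print(reddit == mine)
```

If both assumptions hold, the two values describe the same status: generic status code 0x02 (Invalid Field in Command, matching the `INVALID_FIELD` text above) with the More bit set and Do Not Retry clear.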
So would this be an incompatibility with the kernel or with smartmontools? The Debian bug makes it sound like it's with the kernel.