nvme disks fail intermittently after I/O QID timeout aborts with status 0x371, resume function after power cycle

Bug #1992106 reported by Joshua Sjoding
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We have an Ubuntu server with eight Samsung 980 Pro PCIe 4.0 NVMe SSDs (model MZ-V8P1T0BW). The NVMe drives fail sporadically, leaving messages like this in dmesg:

[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 998 QID 1 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 999 QID 1 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 627 QID 7 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 628 QID 7 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 134 QID 22 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 298 QID 42 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 299 QID 42 timeout, aborting
[Mon Oct 3 01:54:49 2022] nvme nvme7: I/O 838 QID 44 timeout, aborting
[Mon Oct 3 01:55:20 2022] nvme nvme7: I/O 998 QID 1 timeout, reset controller
[Mon Oct 3 01:55:51 2022] nvme nvme7: I/O 16 QID 0 timeout, reset controller
[Mon Oct 3 01:56:42 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:56:42 2022] nvme nvme7: Abort status: 0x371
[Mon Oct 3 01:57:03 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1
[Mon Oct 3 01:57:03 2022] nvme nvme7: Removing after probe failure status: -19
[Mon Oct 3 01:57:23 2022] nvme nvme7: Device not ready; aborting reset, CSTS=0x1
[Mon Oct 3 01:57:23 2022] nvme7n1: detected capacity change from 1953525168 to 0
[Mon Oct 3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934235440 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 0
[Mon Oct 3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934230640 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct 3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934230752 op 0x1:(WRITE) flags 0x0 phys_seg 7 prio class 0
[Mon Oct 3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934230472 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[Mon Oct 3 01:57:23 2022] blk_update_request: I/O error, dev nvme7n1, sector 934235552 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
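For anyone reading the log above, here is a minimal sketch of how the two numeric values in it can be unpacked. It assumes the Linux driver logs the NVMe completion status with the phase bit already shifted out (status code in bits 0-7, status code type in bits 8-10, CRD in bits 11-12, More in bit 13, DNR in bit 14), and that the "detected capacity change" figures are in 512-byte sectors. Under those assumptions 0x371 splits into status code type 0x3 (path-related) and status code 0x71, which as far as I can tell is the host-aborted-command status, meaning the host gave up on the command rather than the drive reporting a media error. The names `NvmeStatus`, `decode_status` and `sectors_to_bytes` are illustrative only, not part of any tool.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # assumption: the kernel reports capacity in 512-byte sectors


@dataclass(frozen=True)
class NvmeStatus:
    sc: int     # status code (bits 0-7)
    sct: int    # status code type (bits 8-10)
    crd: int    # command retry delay index (bits 11-12)
    more: bool  # more information available in the error log (bit 13)
    dnr: bool   # do not retry (bit 14)


def decode_status(status: int) -> NvmeStatus:
    """Split a status value as logged by the driver into its fields."""
    return NvmeStatus(
        sc=status & 0xFF,
        sct=(status >> 8) & 0x7,
        crd=(status >> 11) & 0x3,
        more=bool(status & 0x2000),
        dnr=bool(status & 0x4000),
    )


def sectors_to_bytes(sectors: int) -> int:
    """Convert a sector count from the capacity-change message to bytes."""
    return sectors * SECTOR_BYTES


if __name__ == "__main__":
    s = decode_status(0x371)
    print(f"SCT={s.sct:#x} SC={s.sc:#x} DNR={s.dnr}")
    print(f"capacity before removal: {sectors_to_bytes(1953525168)} bytes")
```

With these assumptions the capacity figure works out to 1,000,204,886,016 bytes, which matches a 1 TB drive dropping to zero when the controller is removed.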

The server is currently running this OS and kernel:

* Ubuntu 22.04.1 LTS (GNU/Linux 5.15.0-48-generic x86_64).

We first encountered this issue over a year ago, not long after the machine was first set up. At the time it was running this OS and kernel:

* Ubuntu 20.04.3 LTS (GNU/Linux 5.4.0-88-generic x86_64)

We have encountered this issue repeatedly over the past year. The time between failures can be hours, days or weeks. 1-3 weeks is typical.

A zip file with 13 dmesg outputs sampled over the last year is attached.

Powering down the machine, then powering it up again, always brings the disk back into working condition.

Of the eight NVMe disks present, which one fails appears to be random.
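To check the impression of randomness against saved dmesg output, a rough tally per controller could look like the sketch below. It only counts the "Removing after probe failure" line shown in the log excerpt in this report; the function name `count_removals` is hypothetical. Note that controller names such as nvme7 are not guaranteed to refer to the same physical disk across boots, so counts taken from different boots would need to be matched to serial numbers before drawing conclusions.

```python
import re
from collections import Counter

# Matches the removal message from the excerpt, e.g.
# "nvme nvme7: Removing after probe failure status: -19"
REMOVAL = re.compile(r"nvme (nvme\d+): Removing after probe failure")


def count_removals(dmesg_text: str) -> Counter:
    """Count controller removals per controller name in one dmesg dump."""
    return Counter(REMOVAL.findall(dmesg_text))


if __name__ == "__main__":
    sample = (
        "[Mon Oct 3 01:57:03 2022] nvme nvme7: Device not ready; "
        "aborting reset, CSTS=0x1\n"
        "[Mon Oct 3 01:57:03 2022] nvme nvme7: Removing after probe "
        "failure status: -19\n"
    )
    for name, hits in count_removals(sample).items():
        print(name, hits)
```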

Here are some possibly related bug reports:

* #1910866 (I first mentioned this issue there as a comment)
* #1991291

Here is some context for the server's use and configuration, in case it's useful:

* The motherboard is a Supermicro H12SSL-NT with PCIe bifurcation support
* The M.2 NVMe disks are connected through a pair of ASUS Hyper M.2 X16 PCIe 4.0 X4 Expansion Cards, with 4 disks attached to each card. These cards rely on the x4/x4/x4/x4 PCIe bifurcation feature supplied by the motherboard.
* The disks are paired into ZFS vdev mirrors, with a stripe across 4 mirrors forming a ZFS pool.
* The machine is used as a virtualization host (VDI server), running Windows guests on Linux KVM.
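Because the disks sit behind two bifurcated carrier cards, it may help to record which PCI address and serial number each controller name maps to on a given boot, so that a failing nvmeN can be tied to a physical slot. The sketch below reads the `address`, `model`, `serial` and `state` attributes that the Linux NVMe driver exposes under /sys/class/nvme, as far as I know; the helper names `read_attr` and `list_controllers` are illustrative, and the sysfs root is a parameter so the function can be pointed at a copy of the tree.

```python
from pathlib import Path


def read_attr(ctrl_dir: Path, name: str) -> str:
    """Read one sysfs attribute, returning '?' if it is missing or unreadable."""
    try:
        return (ctrl_dir / name).read_text().strip()
    except OSError:
        return "?"


def list_controllers(sysfs_root: str = "/sys/class/nvme") -> list:
    """Return one dict per NVMe controller found under sysfs_root."""
    root = Path(sysfs_root)
    if not root.is_dir():
        return []
    rows = []
    # Plain name sort; with more than ten controllers nvme10 sorts before nvme2.
    for ctrl in sorted(root.iterdir(), key=lambda p: p.name):
        rows.append(
            {
                "name": ctrl.name,
                "address": read_attr(ctrl, "address"),
                "model": read_attr(ctrl, "model"),
                "serial": read_attr(ctrl, "serial"),
                "state": read_attr(ctrl, "state"),
            }
        )
    return rows


if __name__ == "__main__":
    for row in list_controllers():
        print(row["name"], row["address"], row["model"], row["serial"], row["state"])
```

Saving this output alongside each dmesg capture would make it possible to tell whether failures cluster on one card or one slot, which the controller name alone cannot show.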

ProblemType: Bug
DistroRelease: Ubuntu 22.04
Package: linux-image-5.15.0-48-generic 5.15.0-48.54
ProcVersionSignature: Ubuntu 5.15.0-48.54-generic 5.15.53
Uname: Linux 5.15.0-48-generic x86_64
NonfreeKernelModules: nvidia zfs zunicode zavl icp zcommon znvpair
AlsaVersion: Advanced Linux Sound Architecture Driver Version k5.15.0-48-generic.
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.20.11-0ubuntu82.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC0', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D12p', '/dev/snd/pcmC0D11p', '/dev/snd/pcmC0D10p', '/dev/snd/pcmC0D9p', '/dev/snd/pcmC0D8p', '/dev/snd/pcmC0D7p', '/dev/snd/pcmC0D3p', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Card0.Amixer.info: Error: [Errno 2] No such file or directory: 'amixer'
Card0.Amixer.values: Error: [Errno 2] No such file or directory: 'amixer'
CasperMD5CheckResult: pass
Date: Thu Oct 6 15:00:04 2022
InstallationDate: Installed on 2021-06-04 (489 days ago)
InstallationMedia: Ubuntu-Server 20.04.2 LTS "Focal Fossa" - Release amd64 (20210201.2)
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
MachineType: Supermicro Super Server
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-48-generic root=UUID=3ecf491d-34aa-454f-9a15-0744851f71a8 ro crashkernel=512M-:192M
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-48-generic N/A
 linux-backports-modules-5.15.0-48-generic N/A
 linux-firmware 20220329.git681281e4-0ubuntu3.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to jammy on 2022-07-09 (89 days ago)
WifiSyslog:

dmi.bios.date: 04/14/2022
dmi.bios.release: 5.14
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 2.4
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: H12SSL-NT
dmi.board.vendor: Supermicro
dmi.board.version: 1.01
dmi.chassis.asset.tag: To be filled by O.E.M.
dmi.chassis.type: 17
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr2.4:bd04/14/2022:br5.14:svnSupermicro:pnSuperServer:pvr0123456789:rvnSupermicro:rnH12SSL-NT:rvr1.01:cvnSupermicro:ct17:cvr0123456789:skuTobefilledbyO.E.M.:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: Super Server
dmi.product.sku: To be filled by O.E.M.
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed