We encountered an instance that had an NVMe failure very early in boot today. I've updated our internal Canonical case as well as our Amazon case on this, but I'm posting the relevant details here as well for consistency:
# uname -a
Linux XXX 4.4.0-1069-aws #79-Ubuntu SMP Mon Sep 24 15:01:41 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.5 LTS"
# echo type $EC2_INSTANCE_TYPE
type m5.xlarge
# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1 259:0 0 10G 0 disk /
# ls -al /dev/nvme* /dev/xvd* /dev/sd*
ls: cannot access '/dev/xvd*': No such file or directory
crw------- 1 root root 248, 0 Oct 31 15:02 /dev/nvme0
brw-rw---- 1 root disk 259, 0 Oct 31 15:02 /dev/nvme0n1
lrwxrwxrwx 1 root root 7 Oct 31 15:02 /dev/sda1 -> nvme0n1
# dmesg | grep '63\.'
[ 63.401466] nvme 0000:00:1f.0: I/O 0 QID 0 timeout, disable controller
[ 63.505790] nvme 0000:00:1f.0: Cancelling I/O 0 QID 0
[ 63.505812] nvme 0000:00:1f.0: Identify Controller failed (-4)
[ 63.507536] nvme 0000:00:1f.0: Removing after probe failure
[ 63.507604] iounmap: bad address ffffc90001b40000
[ 63.508941] CPU: 1 PID: 351 Comm: kworker/1:3 Tainted: P O 4.4.0-1069-aws #79-Ubuntu
[ 63.508943] Hardware name: Amazon EC2 m5.xlarge/, BIOS 1.0 10/16/2017
[ 63.508948] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[ 63.508950] 0000000000000286 3501e2639044a4d2 ffff8800372bfce0 ffffffff923ffe03
[ 63.508952] ffff88040dd878f0 ffffc90001b40000 ffff8800372bfd00 ffffffff9206d3af
[ 63.508954] ffff88040dd878f0 ffff88040dd87a58 ffff8800372bfd10 ffffffff9206d3ec
[ 63.508956] Call Trace:
[ 63.508961] [<ffffffff923ffe03>] dump_stack+0x63/0x90
[ 63.508965] [<ffffffff9206d3af>] iounmap.part.1+0x7f/0x90
[ 63.508967] [<ffffffff9206d3ec>] iounmap+0x2c/0x30
[ 63.508969] [<ffffffffc039abfa>] nvme_dev_unmap.isra.35+0x1a/0x30 [nvme]
[ 63.508972] [<ffffffffc039bd1e>] nvme_remove+0xce/0xe0 [nvme]
[ 63.508976] [<ffffffff92441e0e>] pci_device_remove+0x3e/0xc0
[ 63.508980] [<ffffffff9254f654>] __device_release_driver+0xa4/0x150
[ 63.508982] [<ffffffff9254f723>] device_release_driver+0x23/0x30
[ 63.508986] [<ffffffff9243abda>] pci_stop_bus_device+0x7a/0xa0
[ 63.508988] [<ffffffff9243ad3a>] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 63.508990] [<ffffffffc039a62c>] nvme_remove_dead_ctrl_work+0x3c/0x50 [nvme]
[ 63.508994] [<ffffffff9209d86b>] process_one_work+0x16b/0x490
[ 63.508996] [<ffffffff9209dbdb>] worker_thread+0x4b/0x4d0
[ 63.508998] [<ffffffff9209db90>] ? process_one_work+0x490/0x490
[ 63.509001] [<ffffffff920a3e47>] kthread+0xe7/0x100
[ 63.509005] [<ffffffff92823301>] ? __schedule+0x301/0x7f0
[ 63.509007] [<ffffffff920a3d60>] ? kthread_create_on_node+0x1e0/0x1e0
[ 63.509009] [<ffffffff92827e35>] ret_from_fork+0x55/0x80
[ 63.509011] [<ffffffff920a3d60>] ? kthread_create_on_node+0x1e0/0x1e0
[ 63.509013] Trying to free nonexistent resource <00000000febf8000-00000000febfbfff>
# modinfo nvme
filename: /lib/modules/4.4.0-1069-aws/kernel/drivers/nvme/host/nvme.ko
version: 1.0
license: GPL
author: Matthew Wilcox <email address hidden>
srcversion: 5CF522443B009A8675C497B
alias: pci:v0000106Bd00002001sv*sd*bc*sc*i*
alias: pci:v*d*sv*sd*bc01sc08i02*
alias: pci:v0000144Dd0000A822sv*sd*bc*sc*i*
alias: pci:v0000144Dd0000A821sv*sd*bc*sc*i*
alias: pci:v00001C58d00000003sv*sd*bc*sc*i*
alias: pci:v00008086d00005845sv*sd*bc*sc*i*
alias: pci:v00008086d0000F1A5sv*sd*bc*sc*i*
alias: pci:v00008086d00000953sv*sd*bc*sc*i*
depends:
retpoline: Y
intree: Y
vermagic: 4.4.0-1069-aws SMP mod_unload modversions retpoline
parm: admin_timeout:timeout in seconds for admin commands (uint)
parm: io_timeout:timeout in seconds for I/O (uint)
parm: shutdown_timeout:timeout in seconds for controller shutdown (byte)
parm: use_threaded_interrupts:int
parm: use_cmb_sqes:use controller's memory buffer for I/O SQes (bool)
parm: nvme_major:int
parm: nvme_char_major:int
parm: default_ps_max_latency_us:max power saving latency for new devices; use PM QOS to change per device (ulong)
# systool -m nvme -va
Module = "nvme"
Attributes:
coresize = "65536"
initsize = "0"
initstate = "live"
refcnt = "1"
srcversion = "5CF522443B009A8675C497B"
taint = ""
uevent = <store method only>
version = "1.0"
Parameters:
admin_timeout = "60"
default_ps_max_latency_us = "100000"
io_timeout = "4294967295"
shutdown_timeout = "5"
use_cmb_sqes = "Y"
Sections:
.bss = "0xffffffffc03a3780"
.data = "0xffffffffc03a3000"
.data.unlikely = "0xffffffffc03a33d8"
.exit.text = "0xffffffffc03a0cea"
.gnu.linkonce.this_module = "0xffffffffc03a3400"
.init.text = "0xffffffffc03a8000"
.note.gnu.build-id = "0xffffffffc03a1000"
.parainstructions = "0xffffffffc03a1b88"
.rodata = "0xffffffffc03a1060"
.rodata.str1.1 = "0xffffffffc03a2349"
.rodata.str1.8 = "0xffffffffc03a1d78"
.smp_locks = "0xffffffffc03a1b28"
.strtab = "0xffffffffc03abb08"
.symtab = "0xffffffffc03a9000"
.text = "0xffffffffc0397000"
__bug_table = "0xffffffffc03a2be0"
__kcrctab_gpl = "0xffffffffc03a1040"
__ksymtab_gpl = "0xffffffffc03a1030"
__ksymtab_strings = "0xffffffffc03a25d3"
__mcount_loc = "0xffffffffc03a2730"
__param = "0xffffffffc03a25f0"