nvme controller is down will reset (regression in zesty on XPS laptop)

Bug #1682704 reported by Mike C. Fletcher on 2017-04-14
This bug affects 8 people
Affects: linux-signed (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

I've just upgraded a Dell XPS 15" (9550, early 2016 model) with a Samsung NVME drive. Machine was stable under Kubuntu 16.10 with the same drive. After the upgrade to Zesty I've now seen 3 hard lockups (machine loses root fs) with the following message printed:

    nvme controller is down will reset

There are also messages printed to the virtual console reporting failures to write to the underlying disk from the home-directory encfs.

Linux tass 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 17.04 (Kubuntu)

dmesg about nvme:
[ 1.748864] nvme nvme0: pci function 0000:04:00.0
[ 1.864553] nvme0n1: p1 p2 p3 p4 p5 p6
[ 2.961181] EXT4-fs (nvme0n1p6): mounted filesystem with ordered data mode. Opts: (null)
[ 4.172546] EXT4-fs (nvme0n1p6): re-mounted. Opts: errors=remount-ro

nvme-cli shows 57 entries in the error log, all of which appear to be "invalid field" or "invalid namespace" errors. I'm not sure whether the count is since boot or since the machine was built.

smartctl shows:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-19-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: PM951 NVMe SAMSUNG 512GB
Serial Number: S29PNXAH142328
Firmware Version: BXV77D0Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 365,503,283,200 [365 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Thu Apr 13 23:21:32 2017 EDT
Firmware Updates (0x06): 3 Slots
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 32 Pages

Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0 + 6.00W - - 0 0 0 0 5 5
 1 + 4.20W - - 1 1 1 1 30 30
 2 + 3.10W - - 2 2 2 2 100 100
 3 - 0.0700W - - 3 3 3 3 500 5000
 4 - 0.0050W - - 4 4 4 4 2000 22000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 50%
Percentage Used: 0%
Data Units Read: 2,724,346 [1.39 TB]
Data Units Written: 6,568,756 [3.36 TB]
Host Read Commands: 52,921,997
Host Write Commands: 157,530,880
Controller Busy Time: 1,349
Power Cycles: 831
Power On Hours: 5,358
Unsafe Shutdowns: 46
Media and Data Integrity Errors: 0
Error Information Log Entries: 57

Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
  0 57 0 0x0004 0x4016 0x000 0 1 -
  1 56 0 0x0004 0x4016 0x000 0 1 -
  2 55 0 0x0004 0x4016 0x000 0 1 -
  3 54 0 0x0004 0x4016 0x000 0 1 -
  4 53 0 0x0004 0x4016 0x000 0 1 -
  5 52 0 0x0004 0x4016 0x000 0 1 -
  6 51 0 0x0004 0x4016 0x000 0 1 -
  7 50 0 0x0004 0x4016 0x000 0 1 -
  8 49 0 0x001f 0x4004 0x000 0 0 -
  9 48 0 0x001e 0x4004 0x000 0 0 -
 10 47 0 0x001f 0x4004 0x000 0 0 -
 11 46 0 0x001e 0x4004 0x000 0 0 -
 12 45 0 0x001f 0x4004 0x000 0 0 -
 13 44 0 0x001e 0x4004 0x000 0 0 -
 14 43 0 0x0000 0x4016 0x000 0 1 -
 15 42 0 0x0004 0x4016 0x000 0 1 -
... (41 entries not shown)
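
The two Status values in the table above (0x4016 and 0x4004) can be decoded by hand. The sketch below assumes the column is the raw 16-bit NVMe status field with the phase tag in bit 0, the status code in bits 8:1, the status code type in bits 11:9, "More" in bit 14 and "Do Not Retry" in bit 15. The function name and the small lookup table are illustrative only and cover just the two generic codes seen here.

```python
# Minimal decoder for the "Status" column of the NVMe error log.
# Assumption: the value is the raw 16-bit status field, phase tag in bit 0.

# Only the two generic status codes that appear in this report.
GENERIC_STATUS_CODES = {
    0x02: "Invalid Field in Command",
    0x0B: "Invalid Namespace or Format",
}


def decode_status(raw):
    """Split a raw 16-bit status value into its sub-fields."""
    status_code = (raw >> 1) & 0xFF       # bits 8:1
    status_code_type = (raw >> 9) & 0x7   # bits 11:9 (0 = generic)
    if status_code_type == 0:
        meaning = GENERIC_STATUS_CODES.get(status_code, "unknown")
    else:
        meaning = "unknown"
    return {
        "phase_tag": raw & 0x1,               # bit 0
        "status_code": status_code,
        "status_code_type": status_code_type,
        "more": bool(raw & 0x4000),           # bit 14
        "do_not_retry": bool(raw & 0x8000),   # bit 15
        "meaning": meaning,
    }


if __name__ == "__main__":
    for value in (0x4016, 0x4004):
        print(hex(value), decode_status(value))
```

Under that assumption, 0x4016 decodes to "Invalid Namespace or Format" and 0x4004 to "Invalid Field in Command", which matches the "invalid field or invalid namespace" reading given earlier in the report.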

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed (Ubuntu):
status: New → Confirmed
Shantanu Goel (shantanu-goel) wrote :

This affects me as well on a Dell Precision 5510 (same as the XPS 15 except for a different Intel wireless card), ever since I upgraded to zesty this morning. Around 5 lockups so far in 12 hours, with the same symptoms as above.

Shantanu Goel (shantanu-goel) wrote :

This looks like a duplicate of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184 but someone else should confirm.

Arnold Greving (arnold-arnox) wrote :

I'm having the same problem with my Dell Precision 5510. As soon as the crash occurs the file system is mounted in read-only mode and a few seconds or minutes later the entire machine crashes. The problem occurs with the following kernels that I tested:
- linux-image-4.10.0-20-generic
- linux-image-4.10.0-19-generic
- linux-image-4.11.0-041100rc7-generic_4.11.0-041100rc7.201704161731

Dell Inc. Precision 5510/08R8KJ, BIOS 01.01.19 01/25/2016

I configured my laptop to send the kernel messages via syslog to another machine. Below are the messages from the crash.

*** Before the crash ***
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.110539] xhci_hcd 0000:0a:00.0: remove, state 1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.110543] usb usb4: USB disconnect, device number 1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.110544] usb 4-1: USB disconnect, device number 2
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202761] xhci_hcd 0000:0a:00.0: Host halt failed, -19
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202781] xhci_hcd 0000:0a:00.0: Host not accessible, reset failed.
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202782] xhci_hcd 0000:0a:00.0: USB bus 4 deregistered
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202787] xhci_hcd 0000:0a:00.0: remove, state 4
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202805] usb usb3: USB disconnect, device number 1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.203066] xhci_hcd 0000:0a:00.0: USB bus 3 deregistered
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.246781] pci_bus 0000:3e: busn_res: can not insert [bus 3e] under [bus 07-0a] (conflicts with (null) [bus 07-0a])
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.246812] pci 0000:3e:00.0: [8086:15b5] type 00 class 0x0c0330
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.246831] pci 0000:3e:00.0: reg 0x10: [mem 0xd9f00000-0xd9f0ffff]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247002] pci 0000:3e:00.0: supports D1 D2
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247003] pci 0000:3e:00.0: PME# supported from D0 D1 D2 D3hot D3cold
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247161] pcieport 0000:07:02.0: PCI bridge to [bus 3e]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247169] pcieport 0000:07:02.0: bridge window [mem 0xd9f00000-0xd9ffffff]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247175] pci_bus 0000:3e: [bus 3e] partially hidden behind bridge 0000:07 [bus 07-0a]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247197] pci_bus 0000:07: Allocating resources
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247577] xhci_hcd 0000:3e:00.0: xHCI Host Controller
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247582] xhci_hcd 0000:3e:00.0: new USB bus registered, assigned bus number 3
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.248805] xhci_hcd 0000:3e:00.0: hcc params 0x200077c1 hci version 0x110 quirks 0x00009810
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.249199] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.249200] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.249201] usb u...

Thim Thom (thimhh) wrote :

Dell Precision 5510

Since the upgrade from 16.04 to 17.04, this bug happens to me.

Currently I'm on the latest kernel from the standard release:
4.10.0-21-generic #23-Ubuntu SMP Fri Apr 28 16:14:22 UTC 2017

Tried it with

SSD PM951 NVMe 256 GB
SSD 960 PRO M.2 512 GB

Behaviour as already described:

The FS becomes read-only, then a full stop a few moments later.

Once I was able to take actual screenshots, because a console was still open and somehow half operational.

Mike C. Fletcher (mcfletch) wrote :

Thim Thom, you can work around the problem by editing the GRUB command line in /etc/default/grub:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nouveau.modeset=0 nvme_core.default_ps_max_latency_us=6000"
```
and then running `update-grub` and rebooting. There is a link to the duplicate bug above which includes this work-around, though you have to read the comments to see it. Apparently there's a fix in the kernel pending, but I haven't actually tried disabling the flag to see if it's fixed.
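
To see why 6000 is a plausible value for the PM951, here is a rough sketch that filters the "Supported Power States" table from the original report by that limit. It assumes the kernel only lets the drive idle into non-operational states whose entry plus exit latency (in microseconds) fits within `default_ps_max_latency_us`, and that 0 turns the feature off; the exact comparison may differ between kernel versions. The names `PowerState`, `PM951_STATES` and `apst_candidates` are made up for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerState:
    index: int
    operational: bool   # "+" in the Op column of the smartctl table
    entry_lat_us: int   # Ent_Lat column
    exit_lat_us: int    # Ex_Lat column


# Values copied from the smartctl "Supported Power States" table above.
PM951_STATES = [
    PowerState(0, True, 5, 5),
    PowerState(1, True, 30, 30),
    PowerState(2, True, 100, 100),
    PowerState(3, False, 500, 5000),
    PowerState(4, False, 2000, 22000),
]


def apst_candidates(states, max_latency_us):
    """Return indices of non-operational states allowed by the limit.

    Assumption: a state qualifies when entry + exit latency is within
    the limit, and a limit of 0 allows no state at all.
    """
    if max_latency_us <= 0:
        return []
    return [
        s.index
        for s in states
        if not s.operational
        and s.entry_lat_us + s.exit_lat_us <= max_latency_us
    ]


if __name__ == "__main__":
    for limit in (0, 6000, 25000):
        print(limit, apst_candidates(PM951_STATES, limit))
```

With these numbers, a limit of 6000 keeps state 3 (500 + 5000 us) but rules out state 4 (2000 + 22000 us), so the work-around would only keep the drive out of its deepest power state.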

Thim Thom (thimhh) wrote :

I followed the instructions from here and from the duplicate, but
nvme_core.default_ps_max_latency_us=6000
did not work for me.
I've now set it to 0 and will see whether this helps.

17.04 4.12 from mainline, Bios from May, Dell Precision 5510, Samsung 512 GB
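
When testing different values it helps to confirm after the reboot that the parameter really reached the kernel. A small sketch, assuming the running command line can be read from /proc/cmdline; `kernel_param` is a hypothetical helper, and it simply returns the last `name=value` occurrence it finds:

```python
import os

PARAM = "nvme_core.default_ps_max_latency_us"


def kernel_param(cmdline, name):
    """Return the value of name=value in a kernel command line, or None.

    If the parameter appears more than once, the last occurrence is kept.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value


if __name__ == "__main__":
    path = "/proc/cmdline"
    if os.path.exists(path):
        with open(path) as handle:
            print(PARAM, "=", kernel_param(handle.read(), PARAM))
    else:
        print(path, "not available on this system")
```

If this prints `None`, the GRUB change was not picked up (for example because `update-grub` was not run).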
