nvme controller is down will reset (regression in zesty on XPS laptop)

Bug #1682704 reported by Mike C. Fletcher
This bug affects 8 people
Affects: linux-signed (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I've just upgraded a Dell XPS 15" (9550, early 2016 model) with a Samsung NVME drive. Machine was stable under Kubuntu 16.10 with the same drive. After the upgrade to Zesty I've now seen 3 hard lockups (machine loses root fs) with the following message printed:

    nvme controller is down will reset

There are also messages printed to the virtual console reporting failures to write to the underlying disk from the home-directory encfs.

Linux tass 4.10.0-19-generic #21-Ubuntu SMP Thu Apr 6 17:04:57 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Ubuntu 17.04 (Kubuntu)

dmesg about nvme:
[ 1.748864] nvme nvme0: pci function 0000:04:00.0
[ 1.864553] nvme0n1: p1 p2 p3 p4 p5 p6
[ 2.961181] EXT4-fs (nvme0n1p6): mounted filesystem with ordered data mode. Opts: (null)
[ 4.172546] EXT4-fs (nvme0n1p6): re-mounted. Opts: errors=remount-ro

The nvme CLI shows 57 entries in the error log, all apparently "invalid field" or "invalid namespace" errors. I'm not sure whether that count is since boot or since the machine was built.

smartctl shows:
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-19-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: PM951 NVMe SAMSUNG 512GB
Serial Number: S29PNXAH142328
Firmware Version: BXV77D0Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Controller ID: 1
Number of Namespaces: 1
Namespace 1 Size/Capacity: 512,110,190,592 [512 GB]
Namespace 1 Utilization: 365,503,283,200 [365 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Thu Apr 13 23:21:32 2017 EDT
Firmware Updates (0x06): 3 Slots
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Maximum Data Transfer Size: 32 Pages

Supported Power States
St Op      Max   Active     Idle  RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.00W        -        -   0  0  0  0        5       5
 1 +     4.20W        -        -   1  1  1  1       30      30
 2 +     3.10W        -        -   2  2  2  2      100     100
 3 -   0.0700W        -        -   3  3  3  3      500    5000
 4 -   0.0050W        -        -   4  4  4  4     2000   22000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0 + 512 0 0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 35 Celsius
Available Spare: 100%
Available Spare Threshold: 50%
Percentage Used: 0%
Data Units Read: 2,724,346 [1.39 TB]
Data Units Written: 6,568,756 [3.36 TB]
Host Read Commands: 52,921,997
Host Write Commands: 157,530,880
Controller Busy Time: 1,349
Power Cycles: 831
Power On Hours: 5,358
Unsafe Shutdowns: 46
Media and Data Integrity Errors: 0
Error Information Log Entries: 57

Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId  CmdId Status PELoc LBA NSID VS
  0       57    0 0x0004 0x4016 0x000   0    1  -
  1       56    0 0x0004 0x4016 0x000   0    1  -
  2       55    0 0x0004 0x4016 0x000   0    1  -
  3       54    0 0x0004 0x4016 0x000   0    1  -
  4       53    0 0x0004 0x4016 0x000   0    1  -
  5       52    0 0x0004 0x4016 0x000   0    1  -
  6       51    0 0x0004 0x4016 0x000   0    1  -
  7       50    0 0x0004 0x4016 0x000   0    1  -
  8       49    0 0x001f 0x4004 0x000   0    0  -
  9       48    0 0x001e 0x4004 0x000   0    0  -
 10       47    0 0x001f 0x4004 0x000   0    0  -
 11       46    0 0x001e 0x4004 0x000   0    0  -
 12       45    0 0x001f 0x4004 0x000   0    0  -
 13       44    0 0x001e 0x4004 0x000   0    0  -
 14       43    0 0x0000 0x4016 0x000   0    1  -
 15       42    0 0x0004 0x4016 0x000   0    1  -
... (41 entries not shown)
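
For readers who want to check the "invalid field or invalid namespace" reading of the Status column, here is a minimal sketch that splits the two raw values seen above (0x4016 and 0x4004) into their sub-fields. The bit layout is an assumption taken from the NVMe base specification's error log entry format, not from anything in this report, and `decode_status` and `GENERIC_STATUS_NAMES` are hypothetical names used only for illustration.

```python
# Sketch: split an NVMe error-log "Status" value into its sub-fields.
# Assumed layout (NVMe base spec, Error Information Log Entry):
#   bit 0     = phase tag
#   bits 8:1  = status code (SC)
#   bits 11:9 = status code type (SCT), 0 means "generic command status"
#   bit 14    = More, bit 15 = Do Not Retry

# Only the generic status codes that appear in this report are named here.
GENERIC_STATUS_NAMES = {
    0x02: "Invalid Field in Command",
    0x0B: "Invalid Namespace or Format",
}


def decode_status(raw):
    """Return the sub-fields of a 16-bit error-log status value as a dict."""
    sc = (raw >> 1) & 0xFF
    sct = (raw >> 9) & 0x7
    if sct == 0:
        name = GENERIC_STATUS_NAMES.get(sc, "unknown generic status")
    else:
        name = "non-generic status"
    return {
        "sc": sc,
        "sct": sct,
        "more": bool(raw & 0x4000),
        "dnr": bool(raw & 0x8000),
        "name": name,
    }


for raw in (0x4016, 0x4004):
    print(hex(raw), decode_status(raw))
```

Under that layout, 0x4016 carries status code 0x0B and 0x4004 carries status code 0x02, which is consistent with the description given earlier in the report.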

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-signed (Ubuntu):
status: New → Confirmed
Shantanu Goel (shantanu-goel) wrote :

This affects me as well, on a Dell Precision 5510 (same as the XPS 15 except for a different Intel wireless card), ever since I upgraded to Zesty this morning. Around 5 lock-ups so far in 12 hours, with the same symptoms as above.

Shantanu Goel (shantanu-goel) wrote :

This looks like a duplicate of https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1678184 but someone else should confirm.

Arnold Greving (arnold-arnox) wrote :

I'm having the same problem with my Dell Precision 5510. As soon as the crash occurs the file system is mounted in read-only mode and a few seconds or minutes later the entire machine crashes. The problem occurs with the following kernels that I tested:
- linux-image-4.10.0-20-generic
- linux-image-4.10.0-19-generic
- linux-image-4.11.0-041100rc7-generic_4.11.0-041100rc7.201704161731

Dell Inc. Precision 5510/08R8KJ, BIOS 01.01.19 01/25/2016

I configured my laptop to send the kernel messages via syslog to another machine. Below are the messages from the crash.

*** Before the crash ***
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.110539] xhci_hcd 0000:0a:00.0: remove, state 1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.110543] usb usb4: USB disconnect, device number 1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.110544] usb 4-1: USB disconnect, device number 2
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202761] xhci_hcd 0000:0a:00.0: Host halt failed, -19
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202781] xhci_hcd 0000:0a:00.0: Host not accessible, reset failed.
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202782] xhci_hcd 0000:0a:00.0: USB bus 4 deregistered
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202787] xhci_hcd 0000:0a:00.0: remove, state 4
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.202805] usb usb3: USB disconnect, device number 1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.203066] xhci_hcd 0000:0a:00.0: USB bus 3 deregistered
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.246781] pci_bus 0000:3e: busn_res: can not insert [bus 3e] under [bus 07-0a] (conflicts with (null) [bus 07-0a])
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.246812] pci 0000:3e:00.0: [8086:15b5] type 00 class 0x0c0330
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.246831] pci 0000:3e:00.0: reg 0x10: [mem 0xd9f00000-0xd9f0ffff]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247002] pci 0000:3e:00.0: supports D1 D2
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247003] pci 0000:3e:00.0: PME# supported from D0 D1 D2 D3hot D3cold
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247161] pcieport 0000:07:02.0: PCI bridge to [bus 3e]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247169] pcieport 0000:07:02.0: bridge window [mem 0xd9f00000-0xd9ffffff]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247175] pci_bus 0000:3e: [bus 3e] partially hidden behind bridge 0000:07 [bus 07-0a]
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247197] pci_bus 0000:07: Allocating resources
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247577] xhci_hcd 0000:3e:00.0: xHCI Host Controller
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.247582] xhci_hcd 0000:3e:00.0: new USB bus registered, assigned bus number 3
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.248805] xhci_hcd 0000:3e:00.0: hcc params 0x200077c1 hci version 0x110 quirks 0x00009810
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.249199] usb usb3: New USB device found, idVendor=1d6b, idProduct=0002
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.249200] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1
kern.log.1:Apr 27 14:55:08 arnox kernel: [ 1118.249201] usb u...

Thim Thom (thimhh) wrote :

Dell Precision 5510

Since upgrading from 16.04 to 17.04 this bug has been happening to me.

Currently I'm on the latest kernel from the standard release:
4.10.0-21-generic #23-Ubuntu SMP Fri Apr 28 16:14:22 UTC 2017

Tried it with

SSD PMM951 NVMe 256 GB
SSD 960 PRO M.2 512 GB

Behaviour as already described:

FS becomes read only, then full stop a few moments later.

Once I was able to take actual screenshots, because a console was still up and somehow half operational.

Mike C. Fletcher (mcfletch) wrote :

Thim Thom, you can work around the problem by editing the GRUB command line in /etc/default/grub:

```
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nouveau.modeset=0 nvme_core.default_ps_max_latency_us=6000"
```
and then running `update-grub` and rebooting. There is a link to the duplicate bug above which includes this work-around, though you have to read the comments to see it. Apparently there's a fix in the kernel pending, but I haven't actually tried disabling the flag to see if it's fixed.
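
To see what a latency cap of 6000 would mean for the PM951 in the original report, here is a rough model, not a statement of what the kernel actually does. It assumes the driver only lets the drive autonomously enter non-operational power states whose combined entry and exit latency is at or below `nvme_core.default_ps_max_latency_us`; that selection rule is an assumption and is not stated anywhere in this thread. `POWER_STATES` is copied from the "Supported Power States" table above, and `apst_candidates` is a hypothetical helper.

```python
# (state, operational?, entry latency in us, exit latency in us), taken from
# the "Supported Power States" table in the original report ("+" in the Op
# column is read as operational, "-" as non-operational).
POWER_STATES = [
    (0, True, 5, 5),
    (1, True, 30, 30),
    (2, True, 100, 100),
    (3, False, 500, 5000),
    (4, False, 2000, 22000),
]


def apst_candidates(states, max_latency_us):
    """Non-operational states whose entry + exit latency fits the limit.

    Assumed rule, for illustration only.
    """
    return [
        state
        for state, operational, entry_us, exit_us in states
        if not operational and entry_us + exit_us <= max_latency_us
    ]


# 6000 is the value from the workaround; 0 and 25000 are illustrative only.
for limit in (0, 6000, 25000):
    print(limit, apst_candidates(POWER_STATES, limit))
```

Under this model, a limit of 6000 still admits state 3 (500 + 5000 = 5500) but excludes the deepest state 4 (2000 + 22000 = 24000), and a limit of 0 admits no low-power state at all.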

Thim Thom (thimhh) wrote :

Followed the instructions from here and from duplicate, but
nvme_core.default_ps_max_latency_us=6000
did not work for me.
I've now set it to 0 and will see if this helps.

17.04 4.12 from mainline, Bios from May, Dell Precision 5510, Samsung 512 GB
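
When a workaround like this appears not to help, it can be worth confirming that the parameter actually reached the running kernel, for example if `update-grub` was skipped. A small sketch, assuming the booted command line is read from `/proc/cmdline`; the helper name `nvme_latency_from_cmdline` and the sample string are illustrative only:

```python
KEY = "nvme_core.default_ps_max_latency_us="


def nvme_latency_from_cmdline(cmdline):
    """Return the latency limit found in a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith(KEY):
            # If the parameter is repeated, keep the last occurrence.
            value = int(token[len(KEY):])
    return value


# Sample string modelled on the GRUB line quoted earlier in this thread.
sample = "quiet splash nouveau.modeset=0 nvme_core.default_ps_max_latency_us=6000"
print(nvme_latency_from_cmdline(sample))
print(nvme_latency_from_cmdline("quiet splash"))
```

On a live system the same function can be fed the contents of `/proc/cmdline`; a result of `None` would suggest the edited GRUB configuration was never applied.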
