nvme drive probe failure

Bug #1626894 reported by Tommy Giesler
This bug affects 14 people
Affects         Status        Importance  Assigned to    Milestone
linux (Ubuntu)  Fix Released  Critical    Unassigned
Xenial          Fix Released  Critical    Dan Streetman
Yakkety         Fix Released  Critical    Dan Streetman

Bug Description

After upgrading from linux-image-4.4.0-38-generic to proposed update linux-image-4.4.0-39-generic, NVMe drives are no longer working. dmesg shows a probe failure.

On the previous kernel version everything is working as expected.
----------------->%-----------------
[ 1.005243] Hardware name: FUJITSU D3417-B1/D3417-B1, BIOS V5.0.0.11 R1.12.0.SR.2 for D3417-B1x 04/01/2016
[ 1.005349] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[ 1.005484] 0000000000000286 00000000b6c91251 ffff880fe6e8bce0 ffffffff813f1f83
[ 1.005800] ffff880fe02150f0 ffffc90006a7c000 ffff880fe6e8bd00 ffffffff8106bdff
[ 1.006117] ffff880fe02150f0 ffff880fe0215258 ffff880fe6e8bd10 ffffffff8106be3c
[ 1.006433] Call Trace:
[ 1.006509] [<ffffffff813f1f83>] dump_stack+0x63/0x90
[ 1.006589] [<ffffffff8106bdff>] iounmap.part.1+0x7f/0x90
[ 1.006668] [<ffffffff8106be3c>] iounmap+0x2c/0x30
[ 1.006770] [<ffffffffc007a64a>] nvme_dev_unmap.isra.35+0x1a/0x30 [nvme]
[ 1.007048] [<ffffffffc007b75e>] nvme_remove+0xce/0xe0 [nvme]
[ 1.007140] [<ffffffff81443409>] pci_device_remove+0x39/0xc0
[ 1.007220] [<ffffffff815549f1>] __device_release_driver+0xa1/0x150
[ 1.007301] [<ffffffff81554ac3>] device_release_driver+0x23/0x30
[ 1.007382] [<ffffffff8143be7a>] pci_stop_bus_device+0x8a/0xa0
[ 1.007462] [<ffffffff8143bfca>] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 1.007559] [<ffffffffc007a09c>] nvme_remove_dead_ctrl_work+0x3c/0x50 [nvme]
[ 1.007642] [<ffffffff8109a3e5>] process_one_work+0x165/0x480
[ 1.007722] [<ffffffff8109a74b>] worker_thread+0x4b/0x4c0
[ 1.007801] [<ffffffff8109a700>] ? process_one_work+0x480/0x480
[ 1.007881] [<ffffffff810a0928>] kthread+0xd8/0xf0
[ 1.007959] [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
[ 1.008041] [<ffffffff81831a8f>] ret_from_fork+0x3f/0x70
[ 1.008120] [<ffffffff810a0850>] ? kthread_create_on_node+0x1e0/0x1e0
[ 1.008222] Trying to free nonexistent resource <00000000f7100000-00000000f7103fff>
[ 1.008276] genirq: Flags mismatch irq 0. 00000080 (nvme1q0) vs. 00015a00 (timer)
[ 1.008281] Trying to free nonexistent resource <000000000000d000-000000000000d0ff>
[ 1.008282] nvme 0000:02:00.0: Removing after probe failure
[ 1.008645] Trying to free nonexistent resource <000000000000e000-000000000000e0ff>
[ 1.027213] iounmap: bad address ffffc90006ae0000
[ 1.027456] CPU: 2 PID: 86 Comm: kworker/2:1 Not tainted 4.4.0-39-generic #59-Ubuntu
-----------------%<-----------------
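
The line that identifies this failure in the log above is "Removing after probe failure". As a minimal sketch, saved dmesg output can be scanned for it in Python. The function name `find_probe_failures` and the regular expression are illustrative assumptions, not part of any tool mentioned in this report:

```python
import re

# Assumed line shape, taken from the log excerpts in this report:
#   [ 1.008282] nvme 0000:02:00.0: Removing after probe failure
# The device token is kept generic because later kernels print "nvme nvme0: ..." instead.
PROBE_FAILURE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvme (?P<dev>\S+): Removing after probe failure"
)


def find_probe_failures(log_text):
    """Return (timestamp, device) pairs for each probe-failure line found."""
    return [
        (float(m.group("ts")), m.group("dev"))
        for m in PROBE_FAILURE.finditer(log_text)
    ]


sample = """\
[ 1.008281] Trying to free nonexistent resource <000000000000d000-000000000000d0ff>
[ 1.008282] nvme 0000:02:00.0: Removing after probe failure
[ 1.027213] iounmap: bad address ffffc90006ae0000
"""

failures = find_probe_failures(sample)
print(failures)
```

An empty result only means this particular message is absent; it does not prove the controller probed successfully.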

Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed
Tim Gardner (timg-tpi) wrote :

There are 2 patches that might affect nvme:

UBUNTU: SAUCE: nvme: Don't suspend admin queue that wasn't created
NVMe: Don't unmap controller registers on reset

Changed in linux (Ubuntu):
importance: Undecided → Medium
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Yakkety):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Xenial):
status: New → In Progress
Joseph Salisbury (jsalisbury) wrote :

I built a test kernel with commit d2b59ee ("UBUNTU: SAUCE: nvme: Don't suspend admin queue that wasn't created") reverted. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1626894/d2b59ee-reverted/

Can you see if this kernel resolves this bug?

I'll also build a kernel with 30d6592 reverted.

Joseph Salisbury (jsalisbury) wrote :

The second test kernel is available at:
http://kernel.ubuntu.com/~jsalisbury/lp1626894/30d6592-reverted/

Can you see if either of these kernels resolves this bug?

tags: added: kernel-da-key xenial
Stefan Lindblad (fairglow) wrote : Re: [Bug 1626894] Re: nvme drive probe failure

The second one (30d6592-reverted) worked for me, but not the first one.

Thanks,
/Stefan


Stefan Lindblad (fairglow) wrote :

I've tested both and the second one (30d6592-reverted) worked for me! The first one (d2b59ee-reverted) got stuck.

Many thanks,
/Stefan

Tommy Giesler (guardion) wrote :

Sorry for the late response; I didn't have access to the hardware over the weekend.

I've just tested both kernels and can confirm that the second one works correctly:
----------------->%-----------------
root@Ubuntu-1604-xenial-64-minimal ~ # uname -a
Linux Ubuntu-1604-xenial-64-minimal 4.4.0-40-generic #60~lp1626894Commit30d6592Reverted SMP Fri Sep 23 17:06:33 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
root@Ubuntu-1604-xenial-64-minimal ~ # ls /dev/nvme*
/dev/nvme0 /dev/nvme0n1 /dev/nvme1 /dev/nvme1n1
-----------------%<-----------------

Let me know if you need any additional testing.

Joseph Salisbury (jsalisbury) wrote :

There is a patch that may fix this issue without a revert. I built a test kernel with this patch. It can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1626894/PATCHED/

Can you test this kernel and see if it resolves this bug? You will need to install both the linux-image and linux-image-extra .deb packages.

Changed in linux (Ubuntu Xenial):
importance: Undecided → Critical
Changed in linux (Ubuntu Yakkety):
importance: Medium → Critical
Stefan Lindblad (fairglow) wrote :

Tried it successfully. Thanks, great work!

$ cat /proc/version
Linux version 4.4.0-41-generic (root@gomeisa) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #61~lp1626894KamalPatched SMP Mon Sep 26 19:28:58 UTC 2016
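
When several test kernels are in play, it helps to confirm which build is actually running. The following is a rough sketch that pulls the release and build tag out of a /proc/version line like the one above. The helper name `parse_proc_version` and the assumed line layout are illustrative, not an official parser:

```python
import re

# Assumed layout, based on the output quoted in this report:
#   Linux version <release> (<builder>) (<compiler>) #<build tag> SMP <date>
VERSION_LINE = re.compile(r"^Linux version (?P<release>\S+) .*? (?P<build>#\S+) ")


def parse_proc_version(text):
    """Return (release, build_tag) from the contents of /proc/version."""
    m = VERSION_LINE.match(text)
    if m is None:
        raise ValueError("unrecognised /proc/version format")
    return m.group("release"), m.group("build")


sample = (
    "Linux version 4.4.0-41-generic (root@gomeisa) (gcc version 5.4.0 20160609 "
    "(Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #61~lp1626894KamalPatched SMP "
    "Mon Sep 26 19:28:58 UTC 2016"
)

release, build = parse_proc_version(sample)
# The test kernels in this report carry the bug number in their build tag.
is_test_build = "lp1626894" in build
print(release, build, is_test_build)
```

On a live system the input would come from reading /proc/version instead of the `sample` string.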

Changed in linux (Ubuntu Xenial):
assignee: nobody → Kamal Mostafa (kamalmostafa)
Kamal Mostafa (kamalmostafa) wrote :

Thanks for the quick feedback Stefan. We'll get this fix into the pipeline immediately.

https://lists.ubuntu.com/archives/kernel-team/2016-September/080123.html

Tommy Giesler (guardion) wrote :

Thanks everyone for your quick response and fix of the problem!

I can also confirm that Kamal's kernel works.

$ cat /proc/version
Linux version 4.4.0-41-generic (root@gomeisa) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #61~lp1626894KamalPatched SMP Mon Sep 26 19:28:58 UTC 2016
$ ls /dev/nvme*
/dev/nvme0 /dev/nvme0n1 /dev/nvme0n1p1 /dev/nvme0n1p2 /dev/nvme0n1p3 /dev/nvme1 /dev/nvme1n1 /dev/nvme1n1p1 /dev/nvme1n1p2 /dev/nvme1n1p3
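
A listing like this can be checked mechanically to see whether every expected controller, namespace and partition is present. The sketch below groups device node names by controller. The function name `summarize_nvme_nodes` is hypothetical, and the `nvme<C>n<N>p<P>` naming pattern is an assumption based on the node names shown in this report:

```python
import re
from collections import defaultdict

# Assumed node naming: /dev/nvme<ctrl>, /dev/nvme<ctrl>n<ns>, /dev/nvme<ctrl>n<ns>p<part>
NODE = re.compile(r"^/dev/nvme(?P<ctrl>\d+)(?:n(?P<ns>\d+)(?:p(?P<part>\d+))?)?$")


def summarize_nvme_nodes(paths):
    """Map controller index -> {"namespaces": set, "partitions": set of (ns, part)}."""
    summary = defaultdict(lambda: {"namespaces": set(), "partitions": set()})
    for path in paths:
        m = NODE.match(path)
        if m is None:
            continue  # not an NVMe node name in the assumed format
        entry = summary[int(m.group("ctrl"))]
        if m.group("ns") is not None:
            entry["namespaces"].add(int(m.group("ns")))
        if m.group("part") is not None:
            entry["partitions"].add((int(m.group("ns")), int(m.group("part"))))
    return dict(summary)


listing = (
    "/dev/nvme0 /dev/nvme0n1 /dev/nvme0n1p1 /dev/nvme0n1p2 /dev/nvme0n1p3 "
    "/dev/nvme1 /dev/nvme1n1 /dev/nvme1n1p1 /dev/nvme1n1p2 /dev/nvme1n1p3"
).split()

summary = summarize_nvme_nodes(listing)
for ctrl in sorted(summary):
    print(ctrl, sorted(summary[ctrl]["namespaces"]), sorted(summary[ctrl]["partitions"]))
```

On a running machine, `glob.glob("/dev/nvme*")` could supply the list of paths in place of the pasted listing.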

Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Kamal Mostafa (kamalmostafa) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Mario Limonciello (superm1) wrote :

In trying to confirm bug 1619756 I encountered this using an earlier version of 4.4.0-*.
I can confirm that 4.4.0-41-generic #61 fixes this bug for NVMe on my system.

tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-42.62

---------------
linux (4.4.0-42.62) xenial; urgency=low

  * Fix GRO recursion overflow for tunneling protocols (LP: #1631287)
    - tunnels: Don't apply GRO to multiple layers of encapsulation.
    - gro: Allow tunnel stacking in the case of FOU/GUE

  * CVE-2016-7039
    - SAUCE: net: add recursion limit to GRO

linux (4.4.0-41.61) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1628204

  * nvme drive probe failure (LP: #1626894)
    - (fix) NVMe: Don't unmap controller registers on reset

linux (4.4.0-40.60) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1627074

  * Permission denied in CIFS with kernel 4.4.0-38 (LP: #1626112)
    - Fix memory leaks in cifs_do_mount()
    - Compare prepaths when comparing superblocks
    - SAUCE: Fix regression which breaks DFS mounting

  * Backlight does not change when adjust it higher than 50% after S3
    (LP: #1625932)
    - SAUCE: i915_bpo: drm/i915/backlight: setup and cache pwm alternate
      increment value
    - SAUCE: i915_bpo: drm/i915/backlight: setup backlight pwm alternate
      increment on backlight enable

linux (4.4.0-39.59) xenial; urgency=low

  [ Joseph Salisbury ]

  * Release Tracking Bug
    - LP: #1625303

  * thunder: chip errata w/ multiple CQEs for a TSO packet (LP: #1624569)
    - net: thunderx: Fix for issues with multiple CQEs posted for a TSO packet

  * thunder: faulty TSO padding (LP: #1623627)
    - net: thunderx: Fix for HW issue while padding TSO packet

  * CVE-2016-6828
    - tcp: fix use after free in tcp_xmit_retransmit_queue()

  * Sennheiser Officerunner - cannot get freq at ep 0x83 (LP: #1622763)
    - SAUCE: (no-up) ALSA: usb-audio: Add quirk for sennheiser officerunner

  * Backport E3 Skylake Support in ie31200_edac to Xenial (LP: #1619766)
    - EDAC, ie31200_edac: Add Skylake support

  * Ubuntu 16.04 - Full EEH Recovery Support for NVMe devices (LP: #1602724)
    - SAUCE: nvme: Don't suspend admin queue that wasn't created

  * ISST-LTE:pNV: system ben is hung during ST (nvme) (LP: #1620317)
    - blk-mq: Allow timeouts to run while queue is freezing
    - blk-mq: improve warning for running a queue on the wrong CPU
    - blk-mq: don't overwrite rq->mq_ctx

  * lsattr 32bit does not work on 64bit kernel (Inappropriate ioctl error)
    (LP: #1619918)
    - btrfs: bugfix: handle FS_IOC32_{GETFLAGS, SETFLAGS, GETVERSION} in
      btrfs_ioctl

  * radeon: monitor connected to onboard VGA doesn't work with Xenial
    (LP: #1600092)
    - drm/radeon/dp: add back special handling for NUTMEG

  * initramfs includes qle driver, but not firmware (LP: #1623187)
    - qed: add MODULE_FIRMWARE()

  * [Hyper-V] Rebase Hyper-V to 4.7.2 (stable) (LP: #1616677)
    - hv_netvsc: Implement support for VF drivers on Hyper-V
    - hv_netvsc: Fix the list processing for network change event
    - Drivers: hv: vmbus: Introduce functions for estimating room in the ring
      buffer
    - Drivers: hv: vmbus: Use READ_ONCE() to read variables that are volatile
    - Drivers: hv: vmbus: Export the vmbus_set_event() API
    - lcoking/barriers, arch: Use smp barriers...
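
To find which upload in a changelog like the one above references a given bug, the stanzas can be scanned for the "LP: #<number>" marker. This is only a sketch: the function name `versions_mentioning` is hypothetical, and the stanza header pattern is an assumption based on the "linux (<version>) <series>; urgency=<level>" lines shown here:

```python
import re

# Assumed stanza header shape: "linux (4.4.0-41.61) xenial; urgency=low"
STANZA_HEADER = re.compile(r"^linux \((?P<version>[^)]+)\) \S+; urgency=\S+$")


def versions_mentioning(changelog_text, bug_number):
    """Return package versions whose changelog stanza references the given LP bug."""
    marker = re.compile(r"LP: #%d\b" % bug_number)
    current = None
    hits = []
    for line in changelog_text.splitlines():
        header = STANZA_HEADER.match(line.strip())
        if header:
            current = header.group("version")
        elif current is not None and marker.search(line) and current not in hits:
            hits.append(current)
    return hits


excerpt = """\
linux (4.4.0-42.62) xenial; urgency=low

  * Fix GRO recursion overflow for tunneling protocols (LP: #1631287)

linux (4.4.0-41.61) xenial; urgency=low

  * Release Tracking Bug
    - LP: #1628204

  * nvme drive probe failure (LP: #1626894)
    - (fix) NVMe: Don't unmap controller registers on reset
"""

print(versions_mentioning(excerpt, 1626894))
```

Because changelogs are cumulative, a bug number can appear under more than one version; the function returns every match in the order encountered.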

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Gabriel Klein (gabriel-klein) wrote :

Hi:

4.4.0-38: My computer was working perfectly.
4.4.0-41 and 4.4.0-42: My computer is "crashing" quite often.

When I go back to 4.4.0-38, my computer does not crash anymore.

When I see a crash, it's as if the NVMe drive had been disconnected.
My whole environment crashes. If I run "ls", the command is missing; all commands are missing, and I no longer have access to my drive. When I reboot, everything works again (of course).

Sadly, I cannot send more logs, because the drive holding the logs is not working when this problem happens. If needed, I will remap them (/var/log) in the future to see where the problem comes from.

Stéphane B (ktv) wrote :

Servers with two controllers. The second one disappears (with a kernel trace).

> cat /proc/version
Linux version 4.4.0-47-generic (buildd@lcy01-03) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.2) ) #68-Ubuntu SMP Wed Oct 26 19:39:52 UTC 2016

After upgrading kernel, my ZFS pool becomes DEGRADED:
> zpool status
  pool: zp0
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
 invalid. Sufficient replicas exist for the pool to continue
 functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: none requested
config:

 NAME                     STATE     READ WRITE CKSUM
 zp0                      DEGRADED     0     0     0
   mirror-0               DEGRADED     0     0     0
     nvme0n1              ONLINE       0     0     0
     9486952355712335023  UNAVAIL      0     0     0  was /dev/nvme1n1
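
The config section of this output can also be checked by a script, for example to alert when a mirror member drops out after a kernel upgrade. The sketch below is a simplified reading of the text format, not a use of any ZFS API. The function name `unhealthy_vdevs` is hypothetical, and it assumes the layout shown above (a "NAME STATE ..." header followed by one entry per line, ending at an "errors:" line if present):

```python
def unhealthy_vdevs(status_text):
    """Return (name, state) for every config entry whose state is not ONLINE."""
    results = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True  # entries start on the line after this header
            continue
        if fields and fields[0] == "errors:":
            break  # assumed end of the config section
        if not in_config or len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if state != "ONLINE":
            results.append((name, state))
    return results


sample = """\
config:

 NAME STATE READ WRITE CKSUM
 zp0 DEGRADED 0 0 0
   mirror-0 DEGRADED 0 0 0
     nvme0n1 ONLINE 0 0 0
     9486952355712335023 UNAVAIL 0 0 0 was /dev/nvme1n1
"""

for name, state in unhealthy_vdevs(sample):
    print(name, state)
```

The parser splits on whitespace, so it works on both the aligned and the flattened form of the table.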

Only ONE controller listed: !!

> nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     CVMD4391006B800GGN   INTEL SSDPE2ME800G4                      1.0      1         800,17 GB / 800,17 GB      512 B + 0 B      8DV10102

The bug isn't fixed for me.

[ 68.950042] nvme 0000:82:00.0: I/O 0 QID 0 timeout, disable controller
[ 69.054149] nvme 0000:82:00.0: Cancelling I/O 0 QID 0
[ 69.054182] nvme 0000:82:00.0: Identify Controller failed (-4)
[ 69.060132] nvme 0000:82:00.0: Removing after probe failure
[ 69.060284] iounmap: bad address ffffc9000cf34000
[ 69.065020] CPU: 14 PID: 247 Comm: kworker/14:1 Tainted: P OE 4.4.0-47-generic #68-Ubuntu
[ 69.065034] Hardware name: Supermicro SYS-F618R2-RC1+/X10DRFR-N, BIOS 2.0 01/27/2016
[ 69.065040] Workqueue: events nvme_remove_dead_ctrl_work [nvme]
[ 69.065050] 0000000000000286 00000000e10d6171 ffff8820340efce0 ffffffff813f5aa3
[ 69.065052] ffff88203454b4f0 ffffc9000cf34000 ffff8820340efd00 ffffffff8106bdff
[ 69.065054] ffff88203454b4f0 ffff88203454b658 ffff8820340efd10 ffffffff8106be3c
[ 69.065056] Call Trace:
[ 69.065068] [<ffffffff813f5aa3>] dump_stack+0x63/0x90
[ 69.065089] [<ffffffff8106bdff>] iounmap.part.1+0x7f/0x90
[ 69.065093] [<ffffffff8106be3c>] iounmap+0x2c/0x30
[ 69.065097] [<ffffffffc01c364a>] nvme_dev_unmap.isra.35+0x1a/0x30 [nvme]
[ 69.065099] [<ffffffffc01c475e>] nvme_remove+0xce/0xe0 [nvme]
[ 69.065108] [<ffffffff81447009>] pci_device_remove+0x39/0xc0
[ 69.065117] [<ffffffff815585e1>] __device_release_driver+0xa1/0x150
[ 69.065119] [<ffffffff815586b3>] device_release_driver+0x23/0x30
[ 69.065123] [<ffffffff8143fa7a>] pci_stop_bus_device+0x8a/0xa0
[ 69.065125] [<ffffffff8143fbca>] pci_stop_and_remove_bus_device_locked+0x1a/0x30
[ 69.065129] [<ffffffffc01c309c>] nvme_remove_dead_ctrl_work+0x3c/0x50 [nvme]
[ 69.065136] [<ffffffff8109a4a5>] process_one_work+0x165/0x480
[ 69.065138] [<ffffffff8109a80b>] worker_thread+0x4b/0x...

Stéphane B (ktv) wrote :

Can anybody change the status back to not fixed?

roots (roots) wrote :

Hi,

I second Stéphane's post: please set this bug to NOT FIXED. I'm having this issue with Ubuntu 16.04, kernels 4.4.0-53, -57, -58 and 4.9.0. The drive is a Samsung 960 Pro M.2 SSD on a PCIe-to-M.2 adapter in a PCIe x16 slot; the motherboard is a Gigabyte GA-Z77X-UD3H.

[ 0.715691] nvme0n1: p1
[ 171.849770] nvme 0000:01:00.0: Failed status: 0xffffffff, reset controller.
[ 171.863594] nvme 0000:01:00.0: Refused to change power state, currently in D3
[ 171.863688] nvme 0000:01:00.0: Removing after probe failure
[ 171.863695] nvme0n1: detected capacity change from 1024209543168 to 0

Thanks.

roots (roots) wrote :

Please mark as OPEN, see my post above. Thanks.

Guillaume Mazoyer (respawneral) wrote :

Yes, please re-open, as this bug does not seem to be fixed by the last mentioned change.
See:

[56990.782669] nvme0n1: detected capacity change from 128035676160 to 0
[56990.782652] nvme 0000:03:00.0: Removing after probe failure
[56990.531978] nvme 0000:03:00.0: Failed status: 0xffffffff, reset controller.

I'm using Intel 600P Series M.2.

Wayne Scott (wsc9tt) wrote :

Another data point...

I have 2 Intel P3700 NVMe SSDs on an x99 chipset motherboard.
For me Ubuntu 16.04 Kernel 4.4.0-53 works, but -57 and -59 both fail to find the SSDs.

Wayne Scott (wsc9tt) wrote :

Odd, but I have another machine with almost the same Intel SSD and it is running kernel -57 just fine.

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Changed in linux (Ubuntu Xenial):
status: Fix Released → Confirmed
Changed in linux (Ubuntu Yakkety):
status: Fix Released → Confirmed
Joseph Salisbury (jsalisbury) wrote :

For those that are still affected by this bug, can you please test the following kernel:
http://kernel.ubuntu.com/~jsalisbury/lp1626894/d2b59ee-reverted/

tags: added: kernel-key
removed: kernel-da-key
Dan Streetman (ddstreet) wrote :

Alternatively, as this may have been caused by my recent NVMe patches, can those affected by this please test with this kernel PPA, which is 4.4.0-59 with my patches reverted:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1626894

Wayne Scott (wsc9tt) wrote :

The -40 kernel recommended by jsalisbury faults immediately on boot. The fault scrolled off the screen and I don't know how to capture it, but it had nvme routines at the end of it.
Note: the existing kernels don't fault, the probes for these drives just fail.

Trying ddstreet's suggestion now, it appears to replace my existing -59 kernel. That is
fine because that kernel didn't work for me.

Wayne Scott (wsc9tt) wrote :

Yeah, ddstreet's -59 kernel boots just fine and sees my NVMe drives.
Let me know if I can help.

Wayne Scott (wsc9tt) wrote :

fyi:

# nvme list
Node             SN                   Model                                    Version  Namespace Usage                      Format           FW Rev
---------------- -------------------- ---------------------------------------- -------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     CVFT43320040400BGN   INTEL SSDPEDMD400G4                      1.0      1         400.09 GB / 400.09 GB      512 B + 0 B      8DV10102
/dev/nvme1n1     CVFT4332002C400BGN   INTEL SSDPEDMD400G4                      1.0      1         400.09 GB / 400.09 GB      512 B + 0 B      8DV10102

(These are P3700 drives, the ones ktv listed earlier were P3600)

Dan Streetman (ddstreet)
Changed in linux (Ubuntu Xenial):
assignee: Kamal Mostafa (kamalmostafa) → Dan Streetman (ddstreet)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Dan Streetman (ddstreet)
Dan Streetman (ddstreet) wrote :

Guillaume, roots, Stéphane, can you also test with my PPA kernel from comment 30? It would help to know if your NVMe failures are also due to the patches we're reverting, or some other problem.

roots (roots) wrote :

@Wayne Scott: I succeeded in catching some of those scrolling screen messages (for another issue) by filming the screen output with my mobile phone. However, I had to set a high recording framerate (50 or 60Hz).

roots (roots) wrote :

@All:

In the meantime, I've switched to a different motherboard, where the issue no longer occurs _within Linux_ with any of the default Ubuntu kernels (4.4.0-53, -57, -59, 4.8.0-32) or with 4.9.0 from the mainline PPA. This applies both when the NVMe SSD is attached to a PCIe slot (via an adapter) and when it runs directly in the M.2 slot. However, occasionally my motherboard/BIOS won't find the SSD at POST, so it isn't available for boot. Switching the PC off completely and turning it on again after 1 to 2 minutes usually helps. The motherboard is a Gigabyte GA-Z170X-UD3.

So, to all who are affected by the issue: What's your chipset and mobo brand / model?

Thanks.

Gerd Peter (maxprox) wrote :

With this Hardware:

a skylake Fujitsu D3417-B Mainboard with Intel C236 Chipset with 64GB DDR4 ECC RAM and a
E3-1245-v5 XEON and one ADATA SSD M.2 2280 NVMe 1.2 PCIe Gen3x4 128GB XPG SX8000
 with ext4 and thin LVM

I also get this error:

=> [45512.825928] nvme 0000:01:00.0: Failed status: 0xffffffff, reset controller.
=> [45513.276990] nvme 0000:01:00.0: Removing after probe failure
=> [45513.276997] nvme0n1: detected capacity change from 128035676160 to 0
[45513.507206] Aborting journal on device dm-0-8.
[45513.507226] Buffer I/O error on dev dm-0, logical block 3702784, lost sync page write
[45513.507248] JBD2: Error -5 detected when updating journal superblock for dm-0-8.
[45513.507555] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[45513.507585] EXT4-fs error (device dm-0): ext4_journal_check_start:56: Detected aborted journal
[45513.507619] EXT4-fs (dm-0): Remounting filesystem read-only
[45513.507643] EXT4-fs (dm-0): previous I/O error to superblock detected
[45513.507656] Buffer I/O error on dev dm-0, logical block 0, lost sync page write
[45519.236744] device-mapper: thin: 251:4: metadata operation 'dm_pool_commit_metadata' failed: error = -5
[45519.236766] device-mapper: thin: 251:4: aborting current metadata transaction
[45519.236949] device-mapper: thin: 251:4: failed to abort metadata transaction
[45519.236977] device-mapper: thin: 251:4: switching pool to failure mode
[45519.236978] device-mapper: thin metadata: couldn't read superblock
[45519.236989] device-mapper: thin: 251:4: failed to set 'needs_check' flag in metadata
[45519.237004] device-mapper: thin: 251:4: dm_pool_get_metadata_transaction_id returned -22
[46805.070494] rrdcached[2458]: segfault at c0 ip 00007fb12ab3b1ed sp 00007fb126e376b0 error 4 in libc-2.19.so[7fb12aaf5000+1a1000]

Guillaume Mazoyer (respawneral) wrote :

For the record, I'm using my SSD in an M.2 slot on an Intel NUC (http://www.intel.com/content/www/us/en/nuc/nuc-kit-nuc5i3ryk.html).

@ddstreet, I tried to use your kernel but, for whatever reason, it won't work with my LVM-based installation.

Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Luis Henriques (henrix)
Changed in linux (Ubuntu Yakkety):
status: Confirmed → Fix Committed
John Donnelly (jpdonnelly) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Dan Streetman (ddstreet) wrote :

Wayne, Guillaume, roots, Stéphane, can any/all of you test with the -proposed kernel (using directions from comment 39) to verify the bug is fixed?

Guillaume Mazoyer (respawneral) wrote :

Will do, this testing can be tricky since the issue appears randomly (can take minutes, hours, even a day for me).
Can we review the patch for the issue somewhere as well?

roots (roots) wrote :

Dan,

unfortunately I can't, as the bug won't show up with my new mainboard anymore, that's why I suspected it to be chipset-specific (see https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1626894/comments/36).

roots.

roots (roots) wrote :

Maybe some symptoms with my old mainboard will be helpful for general debugging this issue:

-The drive was *sometimes* recognised fine at boot in the first place; it showed up in "lspci", and "nvme list" gave correct output
-After boot, the drive would go "missing" after approx. 5 to 1000 or more seconds for no obvious reason. Here's what dmesg | grep nvme gave:

[ 37.805783] nvme 0000:01:00.0: Failed status: 0xffffffff, reset controller.
[ 37.825887] nvme 0000:01:00.0: Refused to change power state, currently in D3
[ 37.825934] nvme nvme0: Removing after probe failure status: -19
[ 37.825941] nvme0n1: detected capacity change from 1024209543168 to 0

Furthermore, this could be interesting:

-If I changed from default BIOS settings to overclocked ones, the NVME drive would not be seen by the _OS_ anymore
-If my HDD drive was going to sleep in Ubuntu (via hdparm), the NVME drive would get disconnected _instantly_, with dmesg showing the same output as above.

The last two effects were reproducible; they occurred with kernels 4.4.0-53, 4.4.0-57, 4.4.0-58 and 4.9.0. In particular, hdparm -Y (or -y) /dev/myhdd would cause the NVMe drive to be lost instantly.

Hope this may be of help.

tags: removed: kernel-key
tags: added: kernel-da-key
Gerd Peter (maxprox) wrote :

It does not work with the stock kernel, so I am running a patched kernel.
(https://forum.proxmox.com/threads/nvme-storage-issue.31572/page-2#post-159444)

My first results, only a few hours old, are good so far:
(a Skylake Fujitsu D3417-B mainboard with Intel C236 chipset)

Only one line in dmesg:
# dmesg | grep -i nvme
[ 0.893264] nvme0n1: p1 p2 p3

and up to now no errors

this is the part from lspci -v
...
01:00.0 Non-Volatile memory controller: Silicon Motion, Inc. Device 2260 (rev 03) (prog-if 02 [NVM Express])
 Subsystem: Silicon Motion, Inc. Device 2260
 Flags: bus master, fast devsel, latency 0, IRQ 16
 Memory at f7010000 (64-bit, non-prefetchable) [size=16K]
 Expansion ROM at f7000000 [disabled] [size=64K]
 Capabilities: [40] Power Management version 3
 Capabilities: [70] Express Endpoint, MSI 00
 Capabilities: [b0] MSI-X: Enable+ Count=16 Masked-
 Capabilities: [100] Advanced Error Reporting
 Capabilities: [158] #19
 Capabilities: [178] Latency Tolerance Reporting
 Capabilities: [180] L1 PM Substates
 Kernel driver in use: nvme

Guillaume Mazoyer (respawneral) wrote :

Seems to be working for me now.

mmlb (mmlb) wrote :

hey @ddstreet I've just tried your kernel at https://launchpad.net/~ddstreet/+archive/ubuntu/lp1626894 and it now shows both disks for me. Please let me know how we can get this moved along. Happy to test.

Dan Streetman (ddstreet) wrote :

> hey @ddstreet I've just tried your kernel at
> https://launchpad.net/~ddstreet/+archive/ubuntu/lp1626894 and it now shows both disks for me.
> Please let me know how we can get this moved along. Happy to test.

It's already in -proposed, please see https://wiki.ubuntu.com/Testing/EnableProposed for details on how to use the proposed repository.

mmlb (mmlb) wrote :

Hmm I tried that earlier today and thought it failed to boot, let me try -proposed again now.

mmlb (mmlb) wrote :

Just installed kernel 4.4.0-62-generic from -proposed and both drives are appearing with no problems in dmesg.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-62.83

---------------
linux (4.4.0-62.83) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1657430

  * Backport DP MST fixes to i915 (LP: #1657353)
    - SAUCE: i915_bpo: Fix DP link rate math
    - SAUCE: i915_bpo: Validate mode against max. link data rate for DP MST

  * Ubuntu xenial - 4.4.0-59-generic i3 I/O performance issue (LP: #1657281)
    - blk-mq: really fix plug list flushing for nomerge queues

linux (4.4.0-61.82) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1656810

  * Xen MSI setup code incorrectly re-uses cached pirq (LP: #1656381)
    - SAUCE: xen: do not re-use pirq number cached in pci device msi msg data

  * nvme drive probe failure (LP: #1626894)
    - nvme: revert NVMe: only setup MSIX once

linux (4.4.0-60.81) xenial; urgency=low

  [ John Donnelly ]

  * Release Tracking Bug
    - LP: #1656084

  * Couldn't emulate instruction 0x7813427c (LP: #1634129)
    - KVM: PPC: Book3S PR: Fix illegal opcode emulation

  * perf: 24x7: Eliminate domain name suffix in event names (LP: #1560482)
    - powerpc/perf/hv-24x7: Fix usage with chip events.
    - powerpc/perf/hv-24x7: Display change in counter values
    - powerpc/perf/hv-24x7: Display domain indices in sysfs
    - powerpc/perf/24x7: Eliminate domain suffix in event names

  * i386 ftrace tests hang on ADT testing (LP: #1655040)
    - ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps
      to it

  * VMX module autoloading if available (LP: #1651322)
    - powerpc: Add module autoloading based on CPU features
    - crypto: vmx - Convert to CPU feature based module autoloading

  * ACPI probe support for AD5592/3 configurable multi-channel converter
    (LP: #1654497)
    - SAUCE: iio: dac: ad5592r: Add ACPI support
    - SAUCE: iio: dac: ad5593r: Add ACPI support

  * Xenial update to v4.4.40 stable release (LP: #1654602)
    - btrfs: limit async_work allocation and worker func duration
    - Btrfs: fix tree search logic when replaying directory entry deletes
    - btrfs: store and load values of stripes_min/stripes_max in balance status
      item
    - Btrfs: fix qgroup rescan worker initialization
    - USB: serial: option: add support for Telit LE922A PIDs 0x1040, 0x1041
    - USB: serial: option: add dlink dwm-158
    - USB: serial: kl5kusb105: fix open error path
    - USB: cdc-acm: add device id for GW Instek AFG-125
    - usb: hub: Fix auto-remount of safely removed or ejected USB-3 devices
    - usb: gadget: f_uac2: fix error handling at afunc_bind
    - usb: gadget: composite: correctly initialize ep->maxpacket
    - USB: UHCI: report non-PME wakeup signalling for Intel hardware
    - ALSA: usb-audio: Add QuickCam Communicate Deluxe/S7500 to
      volume_control_quirks
    - ALSA: hiface: Fix M2Tech hiFace driver sampling rate change
    - ALSA: hda/ca0132 - Add quirk for Alienware 15 R2 2016
    - ALSA: hda - ignore the assoc and seq when comparing pin configurations
    - ALSA: hda - fix headset-mic problem on a Dell laptop
    - ALSA: hda - Gate the mic jack on HP Z1 Gen3 AiO
    - ALSA: hd...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-37.39

---------------
linux (4.8.0-37.39) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1659381

  * Mouse cursor invisible or does not move (LP: #1646574)
    - drm/nouveau/disp/nv50-: split chid into chid.ctrl and chid.user
    - drm/nouveau/disp/nv50-: specify ctrl/user separately when constructing
      classes
    - drm/nouveau/disp/gp102: fix cursor/overlay immediate channel indices

 -- Benjamin M Romer <email address hidden> Wed, 25 Jan 2017 16:12:02 -0200

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
Revision history for this message
André Düwel (aduewel) wrote :

Is it possible that this bug also affects zesty (linux-image-4.10.0-19-generic)?

That is how it looks to me: my system (Dell XPS15 9550, Samsung NVMe PM951 512GB) randomly hangs and I am unable to execute commands. Even dmesg cannot be run :/ ... I can reproduce this most of the time by generating heavy load on the disk: `dd if=/dev/zero of=test bs=4M `, which exits with a message like "unable to write to read only filesystem".

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

@Andre,

I believe what happens here is filed at LP: #1678184.
The kernel parameter 'nvme_core.default_ps_max_latency_us=0' can work around the issue as a temporary fix.
Please also try the other value I suggested in comment #6 in LP: #1678184.
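
One way to make such a parameter persistent is to add it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The sketch below assumes GNU sed and a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line; the helper name `add_kernel_param` is hypothetical, and it takes the file path as an argument so it can be tried on a copy first.

```shell
# Hypothetical helper: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT
# in the given file (normally /etc/default/grub) unless it is already there.
# Assumes GNU sed (-i) and that the parameter contains no '|' or '&' characters.
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter already appears on the command-line entry.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Re-emit the quoted value with the new parameter added at the end.
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" "$file"
}

# Example (not executed here):
#   sudo cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub nvme_core.default_ps_max_latency_us=0   # run as root
#   sudo update-grub && sudo reboot
```

After the reboot, `cat /proc/cmdline` should show the parameter if it was applied.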

Revision history for this message
Bryan (utefan1) wrote :

I just ran into this issue (2018-01-04) on Amazon AWS. The NVMe drive was not detected on Amazon's Ubuntu 16.04. Fixed after running
apt-get update
apt-get upgrade
apt-get dist-upgrade
reboot.

dist-upgrade seems to be the key. lsb_release -a says 16.04.03 now.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Brad Figg (brad-figg)
tags: added: cscc