bionic 4.15 nvme regression from trusty 4.4 with two identical devices

Bug #1803692 reported by James Dingwall on 2018-11-16
This bug affects 2 people
Affects: linux (Ubuntu)
Importance: Medium
Assigned to: Unassigned

Bug Description

I have a system containing two identical nvme devices. When booting a trusty PXE image with kernel 4.4.0-38-generic both devices are detected and available:

# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x8086
ssvid : 0x8086
sn : BTHH82250N1X1P0E
mn : INTEL SSDPEKKF010T8L
fr : L08P
...

# nvme id-ctrl /dev/nvme1
NVME Identify Controller:
vid : 0x8086
ssvid : 0x8086
sn : BTHH82250N261P0E
mn : INTEL SSDPEKKF010T8L
fr : L08P
...

# dmesg | grep nvme
[ 5.106516] nvme0n1: p1 p2 p3 p4
[ 5.106615] nvme1n1: p1 p2

After booting a bionic PXE image based on 4.15.0-38-generic, only the first nvme device is enabled. The second is detected but disabled because both devices report the same nqn:

nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555).
nvme nvme1: Removing after probe failure status: -22
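
For anyone checking other machines for the same symptom, here is a minimal sketch that scans kernel log text for the rejection message. The pattern is based only on the wording seen in this report and may not match other kernel versions; the function name and the sample text are illustrative (the sample is the message as it appears in the upstream-kernel dmesg output in this report).

```python
import re

# Pattern for the "duplicate subnqn" message; the wording is copied from the
# kernel log in this report and is an assumption for any other kernel version.
DUP_SUBNQN_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): ignoring ctrl due to duplicate subnqn "
    r"\((?P<nqn>[^)]+)\)"
)


def find_rejected_controllers(log_text):
    """Return (controller, nqn) pairs for controllers rejected as duplicates."""
    return [(m.group("ctrl"), m.group("nqn"))
            for m in DUP_SUBNQN_RE.finditer(log_text)]


# Sample input: two dmesg lines as they appear in this report.
sample = (
    "[ 2.072227] nvme nvme1: ignoring ctrl due to duplicate subnqn "
    "(nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555).\n"
    "[ 2.072231] nvme nvme1: Removing after probe failure status: -22\n"
)
rejected = find_rejected_controllers(sample)
for ctrl, nqn in rejected:
    print(f"{ctrl} rejected, duplicate subnqn: {nqn}")
```

On a live system the same function could be fed the output of dmesg, which may need root where kernel logs are restricted.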

The nqn string comes from the device firmware rather than being generated by Linux, and there does not seem to be an operation in nvme-cli to change it. (It is also questionable whether the firmware value is correct according to section 7.9 of https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3a-20171024_ratified.pdf. My reading of the specification is that the string should start with nqn.2014-08.org.nvmexpress:uuid: followed by a random UUID, and I assume a random UUID per device.)
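
To make that reading of the specification concrete, here is a small sketch that checks whether a string has the UUID-based form described above (the fixed nqn.2014-08.org.nvmexpress:uuid: prefix followed by a hyphenated UUID). It only covers this one form as I understand it and says nothing about uniqueness across devices; the function name is hypothetical.

```python
import re

# UUID-based NQN form as read from section 7.9 of the NVMe 1.3a specification:
# a fixed prefix followed by a UUID in 8-4-4-4-12 hexadecimal notation.
UUID_NQN_RE = re.compile(
    r"nqn\.2014-08\.org\.nvmexpress:uuid:"
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
    r"[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def is_uuid_form_nqn(nqn):
    """Return True if nqn matches the UUID-based form with the 2014-08 prefix."""
    return UUID_NQN_RE.fullmatch(nqn) is not None


# The value reported by the firmware uses a 2017-12 date, so it does not match.
firmware_nqn = ("nqn.2017-12.org.nvmexpress:uuid:"
                "11111111-2222-3333-4444-555555555555")
print(is_uuid_form_nqn(firmware_nqn))
```

Even a correctly prefixed value would still collide if the firmware shipped the same placeholder UUID on every device, so the format check alone does not address the duplicate.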

The Windows 10 installation provided on the system did not have any problems operating with both devices.

The kernel nvme driver history suggests that 4.4 did not check or validate the nqn, whereas newer kernels do, which exposes the problem.

Our typical installation is a zpool mirror across two devices, so this is preventing us from moving from trusty to bionic.

This is a report of a similar issue: https://ask.fedoraproject.org/en/question/128422/one-of-two-identical-m2-nvme-drives-disabling-due-to-same-nqn/

It may be worth noting that if the nvme device does not provide an nqn, one seems to be generated from the device serial number, so a system with two Samsung MZVLB256HAHQ devices works fine.
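
A rough sketch of why a serial-based name avoids the collision, applied to the two Intel devices from this report. The exact layout (prefix, vendor IDs, padded serial and model) is an assumption about how the driver might build such a name and is not taken from this report; only the vid, ssvid, sn and mn values come from the id-ctrl output above.

```python
def fallback_nqn(vid, ssvid, serial, model):
    """Build an illustrative NQN from identify-controller fields.

    Assumed layout (hypothetical): fixed prefix, vendor and subsystem vendor
    IDs as 4 hex digits each, then the serial padded to 20 characters and the
    model padded to 40 characters, matching the id-ctrl field widths.
    """
    sn = serial.ljust(20)[:20]
    mn = model.ljust(40)[:40]
    return f"nqn.2014.08.org.nvmexpress:{vid:04x}{ssvid:04x}{sn}{mn}"


# Values taken from the nvme id-ctrl output in this report.
nqn0 = fallback_nqn(0x8086, 0x8086, "BTHH82250N1X1P0E", "INTEL SSDPEKKF010T8L")
nqn1 = fallback_nqn(0x8086, 0x8086, "BTHH82250N261P0E", "INTEL SSDPEKKF010T8L")

# The serial numbers differ, so the generated names differ as well.
print(nqn0 != nqn1)
```

Because the serial number is part of the name, two otherwise identical devices get distinct names, which is consistent with the Samsung system working.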
---
ApportVersion: 2.14.1-0ubuntu3.21
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 3440 F.... pulseaudio
CasperVersion: 1.340.2
CurrentDmesg: [ 151.172010] init: plymouth-stop pre-start process (4137) terminated with status 1
DistroRelease: Ubuntu 14.04
IwConfig:
 lo no wireless extensions.

 eth1 no wireless extensions.

 eth0 no wireless extensions.
LiveMediaBuild: Ubuntu 14.04.5 LTS "Trusty Tahr" - Release amd64 (20160803)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 89e5:1001
 Bus 001 Device 002: ID 17ef:6099 Lenovo
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 30C8S04Y00
NonfreeKernelModules: zfs zunicode zcommon znvpair zavl
Package: linux (not installed)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: us1931.efi root=/dev/nfs boot=casper netboot=nfs nfsroot=192.168.10.150:/srv/boot/us1931 locale=en_GB.UTF-8 keyb=gb mirror/country=GB ip=dhcp zinstall= BOOTIF=01-30-9c-23-cb-2a-46 toram
ProcVersionSignature: Ubuntu 4.4.0-38.57~14.04.1-generic 4.4.19
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-38-generic N/A
 linux-backports-modules-4.4.0-38-generic N/A
 linux-firmware 1.127.22
RfKill:

Tags: trusty
Uname: Linux 4.4.0-38-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 08/17/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: M1VKT1BA
dmi.board.name: 3138
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN 3305152508085
dmi.chassis.type: 3
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrM1VKT1BA:bd08/17/2018:svnLENOVO:pn30C8S04Y00:pvrThinkStationP330:rvnLENOVO:rn3138:rvrSDK0J40697WIN3305152508085:cvnLENOVO:ct3:cvrNone:
dmi.product.name: 30C8S04Y00
dmi.product.version: ThinkStation P330
dmi.sys.vendor: LENOVO
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 1989 F.... pulseaudio
CasperVersion: 1.394
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 18.04
LiveMediaBuild: Ubuntu 18.04.1 LTS "Bionic Beaver" - Release amd64 (20180725)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 89e5:1001
 Bus 001 Device 002: ID 17ef:6099 Lenovo
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 30C8S04Y00
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: bionic-sysprep.efi root=/dev/nfs boot=casper netboot=nfs nfsroot=192.168.10.150:/srv/boot/zynstra-bionic locale=en_GB.UTF-8 keyb=gb mirror/country=GB ip=dhcp BOOTIF=01-30-9c-23-cb-2a-46 toram
ProcVersionSignature: Ubuntu 4.15.0-38.41-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-38-generic N/A
 linux-backports-modules-4.15.0-38-generic N/A
 linux-firmware 1.173.1
RfKill:

Tags: bionic
Uname: Linux 4.15.0-38-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 08/17/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: M1VKT1BA
dmi.board.name: 3138
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN 3305152508085
dmi.chassis.type: 3
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrM1VKT1BA:bd08/17/2018:svnLENOVO:pn30C8S04Y00:pvrThinkStationP330:rvnLENOVO:rn3138:rvrSDK0J40697WIN3305152508085:cvnLENOVO:ct3:cvrNone:
dmi.product.family: ThinkStation P330
dmi.product.name: 30C8S04Y00
dmi.product.version: ThinkStation P330
dmi.sys.vendor: LENOVO

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1803692

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial

apport information

tags: added: apport-collected trusty
description: updated

tags: added: bionic
removed: apport-collected trusty xenial
tags: added: apport-collected trusty

description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.20 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete

The issue exists upstream:

$ uname -r
4.20.0-042000rc3-generic
$ dmesg | grep nvme
[ 1.857861] nvme nvme0: pci function 0000:02:00.0
[ 1.857900] nvme nvme1: pci function 0000:04:00.0
[ 1.968167] nvme0n1: p1 p2 p3 p4
[ 2.072227] nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555).
[ 2.072231] nvme nvme1: Removing after probe failure status: -22
[ 4.408478] systemd[1]: Set hostname to <nvmetest>.

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

The linked thread on fedoraproject.org indicates that a firmware update for the device will be necessary to resolve this issue.

I have changed the bug to 'invalid' as I do not expect a fix in Ubuntu for the firmware bug.

https://downloadcenter.intel.com/download/28320/Known-Issue-Intel-SSD-760p-Pro-7600p-Series-SubNQN-Conflict-on-Linux

Changed in linux (Ubuntu):
status: Confirmed → Invalid

I tried to force-install the Intel firmware that contains the fix, but the utility refused to overwrite the custom Lenovo firmware. I opened a support case with Lenovo referencing the Intel fix, but the case was closed with 'we don't support Linux' :( Perhaps it would be possible to have a quirk for the faulty firmware...

A workaround in the kernel nvme driver has been tested and proposed: http://lists.infradead.org/pipermail/linux-nvme/2018-November/021366.html

The fix has been merged to the mainline kernel tree as commit 6299358d198a0635da2dd3c4b3ec37789e811e44.

Changed in linux (Ubuntu):
status: Invalid → Confirmed
tags: added: kernel-fixed-upstream
removed: kernel-bug-exists-upstream

The fix is available in at least the 4.18 branch: https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/log/drivers/nvme/host?h=Ubuntu-hwe-4.18.0-21.22_18.04.1

This is good enough for our needs, so I am closing the issue.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released