bionic 4.15 nvme regression from trusty 4.4 with two identical devices

Bug #1803692 reported by James Dingwall
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

I have a system containing two identical nvme devices. When booting a trusty PXE image with kernel 4.4.0-38-generic both devices are detected and available:

# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x8086
ssvid : 0x8086
sn : BTHH82250N1X1P0E
mn : INTEL SSDPEKKF010T8L
fr : L08P
...

# nvme id-ctrl /dev/nvme1
NVME Identify Controller:
vid : 0x8086
ssvid : 0x8086
sn : BTHH82250N261P0E
mn : INTEL SSDPEKKF010T8L
fr : L08P
...

# dmesg | grep nvme
[ 5.106516] nvme0n1: p1 p2 p3 p4
[ 5.106615] nvme1n1: p1 p2

After booting a bionic PXE image based on 4.15.0-38-generic, only the first nvme device is enabled. The second is detected but disabled because both devices report the same nqn:

nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555).
nvme nvme1: Removing after probe failure status: -22

The nqn string comes from the device firmware rather than being generated by Linux, and there does not seem to be an operation in nvme-cli to change it. (It is also questionable whether the device firmware value is correct according to section 7.9 of https://nvmexpress.org/wp-content/uploads/NVM-Express-1_3a-20171024_ratified.pdf. My reading of the specification is that the string should start with nqn.2014-08.org.nvmexpress:uuid: followed by a random UUID, and I assume a random UUID per device.)
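
The reading of the specification above can be expressed as a small check. This is a minimal sketch, assuming the UUID-based form is the prefix nqn.2014-08.org.nvmexpress:uuid: followed by a canonical UUID string; the function name is illustrative and is not part of nvme-cli or the kernel.

```python
import uuid

# Prefix for UUID-based NQNs, as read from section 7.9 of the NVMe 1.3 spec.
UUID_NQN_PREFIX = "nqn.2014-08.org.nvmexpress:uuid:"


def is_spec_uuid_nqn(nqn: str) -> bool:
    """Return True if nqn is the UUID prefix followed by a canonical UUID."""
    if not nqn.startswith(UUID_NQN_PREFIX):
        return False
    candidate = nqn[len(UUID_NQN_PREFIX):]
    try:
        # uuid.UUID accepts several spellings; require the canonical one.
        return str(uuid.UUID(candidate)) == candidate.lower()
    except ValueError:
        return False
```

Under this assumption the value reported by the firmware (nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555) is rejected because of its date prefix, independently of the fact that both devices share it.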

The Windows 10 installation provided on the system did not have any problems operating with both devices.

The kernel nvme driver history suggests that 4.4 did not check or validate the nqn, whereas newer kernels do, which is what exposes the problem.

Our typical installation is a zpool mirror across two devices and this is preventing us moving from trusty to bionic.

This is a report of a similar issue: https://ask.fedoraproject.org/en/question/128422/one-of-two-identical-m2-nvme-drives-disabling-due-to-same-nqn/

It may be worth noting that if the nvme device does not provide an nqn, one seems to be generated from the device serial number, so a system with two Samsung MZVLB256HAHQ devices works fine.
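
To illustrate why a generated nqn avoids the collision, here is a hedged sketch of a fallback that concatenates the vendor IDs, serial number and model number. The exact string layout is an assumption modelled on my reading of the driver's fallback path and may differ between kernel versions; fallback_subnqn is a hypothetical helper, not a kernel or nvme-cli function.

```python
def fallback_subnqn(vid: int, ssvid: int, sn: str, mn: str) -> str:
    """Build an illustrative fallback NQN from identify-controller fields.

    Assumed layout: fixed prefix, vid and ssvid as 4 hex digits each,
    then the serial number padded to 20 characters and the model
    number padded to 40 characters.
    """
    return "nqn.2014.08.org.nvmexpress:%04x%04x%s%s" % (
        vid,
        ssvid,
        sn.ljust(20),
        mn.ljust(40),
    )


# Values taken from the nvme id-ctrl output earlier in this report.
first = fallback_subnqn(0x8086, 0x8086, "BTHH82250N1X1P0E", "INTEL SSDPEKKF010T8L")
second = fallback_subnqn(0x8086, 0x8086, "BTHH82250N261P0E", "INTEL SSDPEKKF010T8L")
```

Because the serial numbers differ, the two strings differ, so identical models would not collide if the firmware left the nqn unset.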
---
ApportVersion: 2.14.1-0ubuntu3.21
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 3440 F.... pulseaudio
CasperVersion: 1.340.2
CurrentDmesg: [ 151.172010] init: plymouth-stop pre-start process (4137) terminated with status 1
DistroRelease: Ubuntu 14.04
IwConfig:
 lo no wireless extensions.

 eth1 no wireless extensions.

 eth0 no wireless extensions.
LiveMediaBuild: Ubuntu 14.04.5 LTS "Trusty Tahr" - Release amd64 (20160803)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 89e5:1001
 Bus 001 Device 002: ID 17ef:6099 Lenovo
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 30C8S04Y00
NonfreeKernelModules: zfs zunicode zcommon znvpair zavl
Package: linux (not installed)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: us1931.efi root=/dev/nfs boot=casper netboot=nfs nfsroot=192.168.10.150:/srv/boot/us1931 locale=en_GB.UTF-8 keyb=gb mirror/country=GB ip=dhcp zinstall= BOOTIF=01-30-9c-23-cb-2a-46 toram
ProcVersionSignature: Ubuntu 4.4.0-38.57~14.04.1-generic 4.4.19
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-38-generic N/A
 linux-backports-modules-4.4.0-38-generic N/A
 linux-firmware 1.127.22
RfKill:

Tags: trusty
Uname: Linux 4.4.0-38-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 08/17/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: M1VKT1BA
dmi.board.name: 3138
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN 3305152508085
dmi.chassis.type: 3
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrM1VKT1BA:bd08/17/2018:svnLENOVO:pn30C8S04Y00:pvrThinkStationP330:rvnLENOVO:rn3138:rvrSDK0J40697WIN3305152508085:cvnLENOVO:ct3:cvrNone:
dmi.product.name: 30C8S04Y00
dmi.product.version: ThinkStation P330
dmi.sys.vendor: LENOVO
---
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 1989 F.... pulseaudio
CasperVersion: 1.394
CurrentDesktop: ubuntu:GNOME
DistroRelease: Ubuntu 18.04
LiveMediaBuild: Ubuntu 18.04.1 LTS "Bionic Beaver" - Release amd64 (20180725)
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 003: ID 89e5:1001
 Bus 001 Device 002: ID 17ef:6099 Lenovo
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: LENOVO 30C8S04Y00
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: bionic-sysprep.efi root=/dev/nfs boot=casper netboot=nfs nfsroot=192.168.10.150:/srv/boot/zynstra-bionic locale=en_GB.UTF-8 keyb=gb mirror/country=GB ip=dhcp BOOTIF=01-30-9c-23-cb-2a-46 toram
ProcVersionSignature: Ubuntu 4.15.0-38.41-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-38-generic N/A
 linux-backports-modules-4.15.0-38-generic N/A
 linux-firmware 1.173.1
RfKill:

Tags: bionic
Uname: Linux 4.15.0-38-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 08/17/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: M1VKT1BA
dmi.board.name: 3138
dmi.board.vendor: LENOVO
dmi.board.version: SDK0J40697 WIN 3305152508085
dmi.chassis.type: 3
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrM1VKT1BA:bd08/17/2018:svnLENOVO:pn30C8S04Y00:pvrThinkStationP330:rvnLENOVO:rn3138:rvrSDK0J40697WIN3305152508085:cvnLENOVO:ct3:cvrNone:
dmi.product.family: ThinkStation P330
dmi.product.name: 30C8S04Y00
dmi.product.version: ThinkStation P330
dmi.sys.vendor: LENOVO

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1803692

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: xenial
James Dingwall (a-james-launchpad) wrote : apport information (AlsaInfo.txt, BootDmesg.txt, CRDA.txt, Lspci.txt, ProcCpuinfo.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, UdevLog.txt, WifiSyslog.txt)

tags: added: apport-collected trusty
description: updated
tags: added: bionic
removed: apport-collected trusty xenial
tags: added: apport-collected trusty
James Dingwall (a-james-launchpad) wrote : apport information (AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, PulseList.txt, UdevDb.txt, WifiSyslog.txt)

description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.20 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
James Dingwall (a-james-launchpad) wrote :

The issue exists upstream:

$ uname -r
4.20.0-042000rc3-generic
$ dmesg | grep nvme
[ 1.857861] nvme nvme0: pci function 0000:02:00.0
[ 1.857900] nvme nvme1: pci function 0000:04:00.0
[ 1.968167] nvme0n1: p1 p2 p3 p4
[ 2.072227] nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555).
[ 2.072231] nvme nvme1: Removing after probe failure status: -22
[ 4.408478] systemd[1]: Set hostname to <nvmetest>.

tags: added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
James Dingwall (a-james-launchpad) wrote :

The linked thread on fedoraproject.org indicates that a firmware update for the device will be necessary to resolve this issue.

James Dingwall (a-james-launchpad) wrote :

I have changed the bug to 'invalid' as I do not expect a fix in Ubuntu for the firmware bug.

https://downloadcenter.intel.com/download/28320/Known-Issue-Intel-SSD-760p-Pro-7600p-Series-SubNQN-Conflict-on-Linux

Changed in linux (Ubuntu):
status: Confirmed → Invalid
James Dingwall (a-james-launchpad) wrote :

I tried to force install the Intel firmware which contains the fix but that did not work as the utility refused to overwrite the custom Lenovo firmware. I opened a support case with Lenovo referencing the Intel fix but the case was closed with 'we don't support Linux' :( Perhaps it would be possible to have a quirk for the faulty firmware...

James Dingwall (a-james-launchpad) wrote :

A workaround in the kernel nvme driver has been tested and proposed: http://lists.infradead.org/pipermail/linux-nvme/2018-November/021366.html

James Dingwall (a-james-launchpad) wrote :

The fix has been merged to the mainline kernel tree as commit 6299358d198a0635da2dd3c4b3ec37789e811e44.

Changed in linux (Ubuntu):
status: Invalid → Confirmed
tags: added: kernel-fixed-upstream
removed: kernel-bug-exists-upstream
James Dingwall (a-james-launchpad) wrote :

The fix is available in at least the 4.18 branch: https://kernel.ubuntu.com/git/ubuntu/ubuntu-bionic.git/log/drivers/nvme/host?h=Ubuntu-hwe-4.18.0-21.22_18.04.1

This is good enough for our needs so closing the issue.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Brad Figg (brad-figg)
tags: added: ubuntu-certified
tags: added: cscc
panticz.de (panticz.de) wrote :

A fix for those who have this issue with Intel SSD / NVMe drives:

Firmware update with Intel SSD Data Center Tool (DCT)
https://downloadcenter.intel.com/search?keyword=SSD+Firmware+Update+Tool

# download and install the Intel SSD Data Center Tool
wget https://downloadmirror.intel.com/28999/eng/Intel_SSD_Data_Center_Tool_3.0.20_Linux.zip -O /tmp/Intel_SSD_Data_Center_Tool_3.0.20_Linux.zip
unzip -d /tmp /tmp/Intel_SSD_Data_Center_Tool_3.0.20_Linux.zip
sudo dpkg -i /tmp/isdct_3.0.20-1_amd64.deb

# show nvme data
isdct show -intelssd

# update the firmware on each drive by index, rebooting in between
isdct load -intelssd 0
reboot
isdct load -intelssd 1

tinodj (gjorgjioski) wrote :

Hi,

I am still hitting this same problem on Ubuntu 20.04.2 with two AData XPG S70 disks. I've tried to hide them behind RAID but that is not working either. I am not sure whether this is the issue:

https://superuser.com/questions/1605038/nvme-raid-on-linux-with-amd-ryzen-line-up-possible

I doubt AData/XPG will come out with firmware that resolves this any time soon.

I am not sure if this proposal here can work for me:

https://<email address hidden>/msg2217138.html

I am stuck here :( Any help is highly appreciated.

James Dingwall (a-james-launchpad) wrote :

Do you have messages like this in the kernel boot log?

[ 2.072227] nvme nvme1: ignoring ctrl due to duplicate subnqn (nqn.2017-12.org.nvmexpress:uuid:11111111-2222-3333-4444-555555555555).
[ 2.072231] nvme nvme1: Removing after probe failure status: -22

It looks like since 5.2 (1b1031ca63b2ce1d3b664b35b77ec94e458693e9) the message might have changed to be:

"Duplicate cntlid %u with %s, rejecting\n"
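
Both message forms can be searched for in one pass. The following is a minimal sketch, assuming the kernel log is available as text (for example the output of dmesg); the patterns are derived only from the two messages quoted in this thread, and find_rejected_controllers is an illustrative name.

```python
import re

# "nvme nvme1: ignoring ctrl due to duplicate subnqn (<nqn>)."
SUBNQN_RE = re.compile(
    r"nvme (nvme\d+): ignoring ctrl due to duplicate subnqn \((.+?)\)"
)
# "Duplicate cntlid %u with %s, rejecting"; the "nvme nvmeN: " prefix is
# treated as optional because it may be trimmed when the log is pasted.
CNTLID_RE = re.compile(
    r"(?:nvme (nvme\d+): )?Duplicate cntlid (\d+) with ([^,\s]+), rejecting"
)


def find_rejected_controllers(log_text):
    """Return (controller, reason, detail) tuples for rejected controllers."""
    rejected = []
    for line in log_text.splitlines():
        match = SUBNQN_RE.search(line)
        if match:
            rejected.append((match.group(1), "duplicate subnqn", match.group(2)))
            continue
        match = CNTLID_RE.search(line)
        if match:
            detail = "%s with %s" % (match.group(2), match.group(3))
            rejected.append((match.group(1), "duplicate cntlid", detail))
    return rejected
```

An empty result means neither message is present, in which case the full set of lines matching nvme is more useful than the filtered output.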

If you can't see either of these, can you include any messages matching nvme? If it is the same issue then you can try building your kernel with an NVME_QUIRK_IGNORE_DEV_SUBNQN quirk added for your device in drivers/nvme/host/pci.c. I see that in the current HEAD of the kernel tree these AData devices already have this quirk:

        { PCI_DEVICE(0x10ec, 0x5762), /* ADATA SX6000LNP */
                .driver_data = NVME_QUIRK_IGNORE_DEV_SUBNQN, },
        { PCI_DEVICE(0x1cc1, 0x8201), /* ADATA SX8200PNP 512GB */
                .driver_data = NVME_QUIRK_NO_DEEPEST_PS |
                                NVME_QUIRK_IGNORE_DEV_SUBNQN, },

so it might be a simple backport.
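
A quirk entry is keyed on the PCI vendor and device IDs, so those need to be looked up for the affected drive. Since the rejected controller is removed after the probe failure, a hedged way to find the IDs is to read them from PCI sysfs rather than from the nvme class; lspci -nn reports the same information. The sketch below assumes the usual sysfs layout (class, vendor and device files per PCI address) and the NVMe class code 0x010802; the function name is illustrative.

```python
import os

# PCI class code for an NVM Express controller (assumed constant).
NVME_PCI_CLASS = "0x010802"


def _read(path):
    """Read a small sysfs attribute and strip the trailing newline."""
    with open(path) as handle:
        return handle.read().strip()


def list_nvme_pci_ids(pci_root="/sys/bus/pci/devices"):
    """Return (pci_address, vendor_id, device_id) for each NVMe controller."""
    found = []
    for address in sorted(os.listdir(pci_root)):
        device_dir = os.path.join(pci_root, address)
        try:
            if _read(os.path.join(device_dir, "class")) != NVME_PCI_CLASS:
                continue
            vendor = _read(os.path.join(device_dir, "vendor"))
            device = _read(os.path.join(device_dir, "device"))
        except OSError:
            # Skip entries that are missing or unreadable.
            continue
        found.append((address, vendor, device))
    return found
```

The vendor and device values returned here are the two arguments that a PCI_DEVICE(...) entry like the ones above would take.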

tinodj (gjorgjioski) wrote :

Hi,

I am getting exactly that one:
 Duplicate cntlid 0 with nvme0, rejecting
 Removing after probe failure status: -22

And indeed it is an ADATA S70. So you are saying that if I add that and build the kernel, it might work? Sounds good, I will give it a try. Thank you so much for the direction!

tinodj (gjorgjioski) wrote :

Ahhh, but from what I am seeing here https://github.com/torvalds/linux/blob/master/drivers/nvme/host/core.c, this quirk wouldn't work in my case, right? :(

Kai-Heng Feng (kaihengfeng) wrote :

Please file a new bug if it's a different device.
