nvme name floated after boot with 4.15.0 kernel

Bug #1792660 reported by Zhanglei Mao on 2018-09-15
This bug affects 1 person
Affects                Importance  Assigned to
linux (Ubuntu)         Medium      Guilherme G. Piccoli
linux (Ubuntu Bionic)  Medium      Guilherme G. Piccoli

Bug Description

NVMe device names such as /dev/nvme?n?p? float after a reboot: the same name can point to a different physical SSD from one boot to the next. This happens with the 4.15.0 kernel, both as the 16.04.5 HWE kernel and as the 18.04 GA kernel. It is not seen with the 16.04.5 GA kernel (4.4.0).

Zhanglei Mao (zhanglei-mao) wrote :

It was found on the v4.15.0 kernel but not on the v4.4 kernel.

Zhanglei Mao (zhanglei-mao) wrote :

It is also found on the 16.04.5 HWE kernel.

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1792660/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
summary: - nvme name floated after boot in Ubuntu 18.04
+ nvme name floated after boot with 4.15.0 kernel
description: updated
Zhanglei Mao (zhanglei-mao) wrote :

In the 4.15.0 kernel, a mismatch was found under /sys/class/nvme/. For example, /sys/class/nvme/nvme1/ contains an nvme3n1 directory, which might be wrong. In the 4.4 kernel, it contains the matching directory nvme1n1. Please refer to the attached screenshot for more details.

affects: ubuntu → kernel-package (Ubuntu)
affects: kernel-package (Ubuntu) → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1792660

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Bionic):
status: New → Triaged
Changed in linux (Ubuntu):
status: Incomplete → Triaged
Changed in linux (Ubuntu Bionic):
importance: Undecided → Medium
tags: added: bot-stop-nagging kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.19 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc4

Joseph Salisbury (jsalisbury) wrote :

If the bug still exists in the mainline kernel, we can perform a kernel bisect to identify the commit that introduced this regression.

Zhanglei Mao (zhanglei-mao) wrote :

The log was collected via:
apport-cli --save=/tmp/apport-log linux

Changed in linux (Ubuntu):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Bionic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)

Hi Zhanglei, I faced this issue some time ago. In fact, it's not
a bug, but an annoyance caused by the introduction of multipath in the
nvme driver.

It started recently, after [0]: the introduction of NVMe multipath
changed the way a namespace's "identity" is calculated. Basically,
another level of indirection was added (we now have the "ns_head" entity),
and the driver started to use the subsystem instance instead of the ctrl
instance (even when multipath is not used), so we may get the link
"mismatch" you have observed.

Recently this was improved by [1]: if the user sets the kernel command-line
parameter "nvme_core.multipath=0", the old behavior is restored. Can
you give it a try? This fix was added in Bionic's 4.15.0-34 kernel
(same for Xenial's HWE kernel).

Also, this is the same old discussion as the SCSI sdX naming: we shouldn't
rely on the numbering. It's usually recommended to use some permanent
identifier, like the partition UUID.

Cheers,

Guilherme

[0] ed754e5dee ("nvme: track shared namespaces")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed754e5dee

[1] a785dbccd9 ("nvme/multipath: Fix multipath disabled naming collisions")
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a785dbccd9
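
As a side note on the "permanent identifier" recommendation above, here is a minimal Python sketch that lists which device node each persistent symlink currently resolves to. It assumes a directory of udev-maintained symlinks such as /dev/disk/by-uuid; the function name persistent_names and the default path are illustrative choices, not something taken from this report.

```python
import os
from pathlib import Path
from typing import Dict


def persistent_names(link_dir: str = "/dev/disk/by-uuid") -> Dict[str, str]:
    """Map each persistent symlink name to the device node it resolves to now.

    link_dir is assumed to hold udev-style symlinks; a missing directory
    simply yields an empty mapping.
    """
    mapping: Dict[str, str] = {}
    base = Path(link_dir)
    if not base.is_dir():
        return mapping
    for link in sorted(base.iterdir()):
        if link.is_symlink():
            # realpath follows the symlink to the current kernel name,
            # e.g. /dev/nvme2n1p1, which may differ between boots.
            mapping[link.name] = os.path.realpath(link)
    return mapping


if __name__ == "__main__":
    for name, target in persistent_names().items():
        print(f"{name} -> {target}")
```

Comparing the output across reboots should show the right-hand side changing while the left-hand side stays fixed, which is why mounting by the left-hand name is the safer choice.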

Zhanglei Mao (zhanglei-mao) wrote :

@Guilherme

First, thank you very much for providing this information. I asked the customer to check on 4.15.0-34 with the kernel parameter "nvme_core.multipath=0", and they reported the same behavior (the nvme device names change on every reboot).

Hi Zhanglei, thanks for the test and screenshot.

I can't say for sure, but based on the screenshot, it seems they are still running 4.15.0-29: that is what I see in the BOOT_IMAGE entry of /proc/cmdline. By itself this doesn't prove much (one can boot, say, a 4.4 kernel and add a BOOT_IMAGE of a 4.15, no harm), but it's a clue that the customer didn't boot the right kernel, and the patch that restores the old behavior of the nvme driver when the parameter is used is not present in 4.15.0-29.

So, I'd ask for a new test. This time the customer could run "uname -r", "cat /proc/cmdline" and, after that, "ls -l /sys/block/*" to check the mapping between the nvme devices.

Cheers,

Guilherme
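
The command-line part of the check requested above can be sketched in Python. This is only an illustration: the helper names cmdline_params and multipath_disabled are made up for this note, only the exact value "0" used in this thread is recognized, and quoted parameters containing spaces are not handled.

```python
import os
from typing import Dict, Optional


def cmdline_params(cmdline: str) -> Dict[str, Optional[str]]:
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params: Dict[str, Optional[str]] = {}
    for token in cmdline.split():
        # Split on the first "=" only, so root=UUID=... keeps its full value.
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def multipath_disabled(cmdline: str) -> bool:
    """True if nvme_core.multipath=0 appears on the given command line."""
    return cmdline_params(cmdline).get("nvme_core.multipath") == "0"


if __name__ == "__main__":
    if os.path.exists("/proc/cmdline"):
        with open("/proc/cmdline") as f:
            current = f.read()
        print("release:", os.uname().release)
        print("nvme_core.multipath=0 set:", multipath_disabled(current))
```

Note that the parameter alone is not sufficient: as explained above, the running kernel also has to contain the patch that honors it.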

Zhanglei Mao (zhanglei-mao) wrote :

hi Guilherme,

The partner has tested both 4.16.0 and 4.15.1 (I guess it is newer than 4.15.0-34) and reported that the issue is the same. The 4.15.1 test results are below:
zlmao@zlmao-T460s:~/tmp$ cat test_result.txt
uname -r
4.15.1-041501-generic

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.15.1-041501-generic root=UUID=3e875566-3c1b-4bda-a869-6eb59ff1624a ro nvme_core.multipath=0

ls -l /sys/block/*
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop0 -> ../devices/virtual/block/loop0
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop1 -> ../devices/virtual/block/loop1
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop2 -> ../devices/virtual/block/loop2
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop3 -> ../devices/virtual/block/loop3
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop4 -> ../devices/virtual/block/loop4
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop5 -> ../devices/virtual/block/loop5
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop6 -> ../devices/virtual/block/loop6
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/loop7 -> ../devices/virtual/block/loop7
lrwxrwxrwx 1 root root 0 Sep 28 10:13 /sys/block/nvme0n1 -> ../devices/pci0000:80/0000:80:03.2/0000:83:00.0/nvme/nvme2/nvme0n1
lrwxrwxrwx 1 root root 0 Sep 28 10:13 /sys/block/nvme1n1 -> ../devices/pci0000:80/0000:80:03.3/0000:84:00.0/nvme/nvme3/nvme1n1
lrwxrwxrwx 1 root root 0 Sep 28 10:09 /sys/block/nvme1n2 -> ../devices/pci0000:80/0000:80:03.3/0000:84:00.0/nvme/nvme3/nvme1n2
lrwxrwxrwx 1 root root 0 Sep 28 10:13 /sys/block/nvme2n1 -> ../devices/pci0000:80/0000:80:03.0/0000:81:00.0/nvme/nvme0/nvme2n1
lrwxrwxrwx 1 root root 0 Sep 28 10:13 /sys/block/nvme3n1 -> ../devices/pci0000:80/0000:80:03.1/0000:82:00.0/nvme/nvme1/nvme3n1
zlmao@zlmao-T460s:~/tmp$

thanks,
Mao
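
For reference, the mismatch visible in the listing above can be extracted mechanically. The Python sketch below parses "ls -l /sys/block/*" output and reports every nvme block device whose leading number differs from the controller directory it sits under. The regular expression and the name controller_mismatches are assumptions made for this note; they rely on the path layout shown in the listing.

```python
import re
from typing import List, Tuple

# Matches e.g. "/sys/block/nvme0n1 -> ../devices/.../nvme/nvme2/nvme0n1"
NVME_LINK = re.compile(r"/sys/block/(nvme(\d+)n\d+) -> \S*/nvme/nvme(\d+)/\1$")


def controller_mismatches(listing: str) -> List[Tuple[str, str]]:
    """Return (block_device, controller) pairs whose instance numbers differ."""
    result: List[Tuple[str, str]] = []
    for line in listing.splitlines():
        match = NVME_LINK.search(line.strip())
        if match and match.group(2) != match.group(3):
            result.append((match.group(1), "nvme" + match.group(3)))
    return result


if __name__ == "__main__":
    sample = """\
lrwxrwxrwx 1 root root 0 Sep 28 10:13 /sys/block/nvme0n1 -> ../devices/pci0000:80/0000:80:03.2/0000:83:00.0/nvme/nvme2/nvme0n1
lrwxrwxrwx 1 root root 0 Sep 28 10:13 /sys/block/nvme2n1 -> ../devices/pci0000:80/0000:80:03.0/0000:81:00.0/nvme/nvme0/nvme2n1
"""
    for device, controller in controller_mismatches(sample):
        print(f"{device} sits under controller {controller}")
```

On a kernel where the old naming is in effect, the function is expected to return an empty list, since each nvmeXnY would sit under controller nvmeX.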

Hi Mao, I see... the partner is not using a regular Ubuntu build.
This patch was introduced upstream in kernel 4.17, so it's not present in a regular 4.15 or 4.16. It is present in our Ubuntu kernel, though, because it was backported, but I can't guarantee it's present in the custom kernels the partner is using.

Some alternatives for us to resolve this issue:

a) If I could access the source tree of the partner's kernel, I could check and even add the patch myself in case it's not there;

b) The partner could test Ubuntu official kernel 4.15.0-34;

c) The partner could test upstream kernel 4.17 (or 4.18).

Alternative (b) is preferred, but (c) is OK too. Alternative (a) is the least preferred.

Cheers,

Guilherme

Guo Yaowen (guoyaowen) wrote :

hi Guilherme,
I have tested kernel 4.17.0, but the issue is the same. The 4.17.0 test results are below:
uname -r
4.17.0-041700rc1-generic

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.17.0-041700rc1-generic root=UUID=3e875566-3c1b-4bda-a869-6eb59ff1624a ro nvme_core.multipath=0

ls -l /sys/block/*
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop0 -> ../devices/virtual/block/loop0
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop1 -> ../devices/virtual/block/loop1
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop2 -> ../devices/virtual/block/loop2
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop3 -> ../devices/virtual/block/loop3
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop4 -> ../devices/virtual/block/loop4
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop5 -> ../devices/virtual/block/loop5
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop6 -> ../devices/virtual/block/loop6
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/loop7 -> ../devices/virtual/block/loop7
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/nvme0n1 -> ../devices/pci0000:80/0000:80:03.2/0000:83:00.0/nvme/nvme2/nvme0n1
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/nvme1n1 -> ../devices/pci0000:80/0000:80:03.3/0000:84:00.0/nvme/nvme3/nvme1n1
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/nvme1n2 -> ../devices/pci0000:80/0000:80:03.3/0000:84:00.0/nvme/nvme3/nvme1n2
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/nvme2n1 -> ../devices/pci0000:80/0000:80:03.1/0000:82:00.0/nvme/nvme1/nvme2n1
lrwxrwxrwx 1 root root 0 Sep 29 11:19 /sys/block/nvme3n1 -> ../devices/pci0000:80/0000:80:03.0/0000:81:00.0/nvme/nvme0/nvme3n1

Or can you provide a download link for the Ubuntu kernel? Let me do the same test.

Thanks,
Guo Yaowen

Guo Yaowen (guoyaowen) wrote :

hi Guilherme,

I upgraded to version 34 of the Ubuntu kernel with the "apt-get upgrade" command, and the problem was solved. So I think the kernels on the kernel.ubuntu.com site don't include the patch. Can we say that this is Ubuntu's own behavior?

Thanks,
Guo Yaowen

Hi Guo, thanks for your tests!

So, to confirm:

a) With Ubuntu kernel 4.15.0-34, using the kernel parameter "nvme_core.multipath=0", you _don't_ see the issue;

b) With kernel 4.17.0-041700rc1-generic, even using the parameter "nvme_core.multipath=0", you *can reproduce* the issue.

Right?

I've noticed that the fix patch is present in v4.17-rc4, but I see a "rc1" in your kernel version - it could explain why even in this 4.17 version you can reproduce the issue.
I'll build an upstream 4.17 and re-test in my environment, to double-check if it fixes for me.

Cheers,

Guilherme

I've just tested a mainline kernel version 4.17.0, and nvme names didn't float when using the kernel parameter "nvme_core.multipath=0", which reinforces that the fix patch is present in 4.17, so Guo: I guess your 4.17 version is really based on 4.17-rc1.

Let me know if there's anything else to investigate in this LP. In my understanding, we did have a bug after the nvme multipath introduction, but the kernel now has a fix, which is available upstream from kernel v4.17 on and in the Ubuntu 4.15.x kernel series from 4.15.0-34 on. To keep the nvme naming behavior as before, we need to use the kernel parameter "nvme_core.multipath=0".

Thanks,

Guilherme
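
The version boundaries stated in this thread (fix present upstream in v4.17-rc4 and later, and in Ubuntu 4.15.0-34 and later) can be turned into a rough check on the "uname -r" string. The Python sketch below is only a heuristic built on assumptions: it treats a six-digit build field such as "041700" as a mainline build from kernel.ubuntu.com, and the name has_naming_fix is made up for this note. For kernels older than 4.15, such as 4.4 where the issue was not seen, a False result only means the fix does not apply.

```python
import re

# "4.15.0-34-generic" or "4.17.0-041700rc1-generic"
RELEASE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-(\d+)(?:rc(\d+))?")


def has_naming_fix(release: str) -> bool:
    """Guess whether a kernel release carries the nvme_core.multipath=0 naming fix."""
    match = RELEASE.match(release)
    if not match:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, patch = (int(match.group(i)) for i in (1, 2, 3))
    build = match.group(4)
    rc = int(match.group(5)) if match.group(5) else None
    # Assumption: mainline builds use a six-digit field such as "041700".
    mainline_build = len(build) == 6

    if (major, minor) > (4, 17):
        return True
    if (major, minor) == (4, 17):
        # Per the thread, the patch is present from v4.17-rc4.
        return rc is None or rc >= 4
    if (major, minor, patch) == (4, 15, 0) and not mainline_build:
        # Per the thread, backported to the Ubuntu kernel in 4.15.0-34.
        return int(build) >= 34
    return False


if __name__ == "__main__":
    for rel in ("4.15.0-29-generic", "4.15.0-34-generic",
                "4.15.1-041501-generic", "4.17.0-041700rc1-generic"):
        print(rel, has_naming_fix(rel))
```

Checking the actual source tree for commit a785dbccd9 remains the reliable way to confirm; the string check is just a quick first filter.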

Guo Yaowen (guoyaowen) wrote :

hi Guilherme,
I couldn't find a normal linux-image package for 4.17 and 4.17-rac4, just packages marked unsigned. Could you please tell me if this package is available?

  linux-headers-4.17.0-041700_4.17.0-041700.201806041953_all.deb
   linux-headers-4.17.0-041700-generic_4.17.0-041700.201806041953_amd64.deb
   linux-headers-4.17.0-041700-lowlatency_4.17.0-041700.201806041953_amd64.deb
   //linux-image-unsigned-4.17.0-041700-generic_4.17.0-041700.201806041953_amd64.deb
   //linux-image-unsigned-4.17.0-041700-lowlatency_4.17.0-041700.201806041953_amd64.deb
   linux-modules-4.17.0-041700-generic_4.17.0-041700.201806041953_amd64.deb
   linux-modules-4.17.0-041700-lowlatency_4.17.0-041700.201806041953_amd64.deb

By the way, can you tell me the relation between the kernel with RAC and the kernel without RAC?

Thanks,
Guo Yaowen

Hi Guo, the Ubuntu kernels schedule can be checked here: https://wiki.ubuntu.com/Kernel/Support .

Ubuntu kernel team chooses a kernel version for a specific release, and once that release is available, there's a support schedule. Some versions have long-term support, so they're called LTS.

In particular, the ascending list of Ubuntu kernels starting at 4.4 is: 4.4(*), 4.8, 4.10, 4.13, 4.15(*) and now, to be released this month with Ubuntu 18.10, 4.18(*). The versions with asterisks are officially supported (or to be supported, in the case of 4.18).

Notice that version 4.17 was never a supported version for any Ubuntu release, so there is no official package for it. The LTS versions of Ubuntu (like 14.04, 16.04 and 18.04) receive kernel updates in form of the HWE kernel packages (https://wiki.ubuntu.com/Kernel/LTSEnablementStack), so you can for example get v4.15 in Ubuntu 16.04 (although its release version is 4.4).
Notice the HWE releases are supported for 6 months, and then superseded by the latest HWE. The exception is the final HWE kernel, which is supported until the Ubuntu release's end-of-life (4.15, for example, is supported in 16.04 until its EOL, in 2021).

That said, my suggestion is to use v4.15, which is the latest supported version, and fixes your problem. You can use v4.18 without support right now in 18.04 (from kernel team's unstable PPA) - it'll be the official release for Ubuntu 18.10, and the first HWE for Ubuntu 18.04 (in February, probably). Version 4.18 is not supported (and will never be) for 16.04.

About RAC, I don't know what it means - can you clarify?
If you still have any questions about the kernel versions, let me know.
Cheers,

Guilherme

Guo Yaowen (guoyaowen) wrote :

hi Guilherme,
Get it! Thank you very much for your support.

Guo Yaowen

You're very welcome, Guo! I'll mark this as resolved. In case you have questions,
feel free to comment here.

Cheers,

Guilherme

Changed in linux (Ubuntu):
status: Triaged → Fix Released
Changed in linux (Ubuntu Bionic):
status: Triaged → Fix Released