Create by-dname symlinks based on NVMe namespace UUIDs instead of controller serial numbers

Bug #1830913 reported by Dmitrii Shcherbakov
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
curtin
Triaged
Medium
Unassigned

Bug Description

Currently by-dname symlinks are created based on serial numbers of controllers.

We started getting hardware with multi-path support that uses a different naming scheme and can support multiple namespaces per controller.

It would be good to bind block device symlinks for namespaces to namespace GUIDs (NGUID) instead of controller serial numbers.

See https://bugs.launchpad.net/curtin/+bug/1735839/comments/16 for more details and the NVMe spec reference around NGUID uniqueness and lifetime.

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/multipath.c?id=Ubuntu-4.15.0-51.55#n17
static bool multipath = true;
module_param(multipath, bool, 0444);

https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/bionic/tree/drivers/nvme/host/multipath.c?id=Ubuntu-4.15.0-51.55#n22
/*
 * If multipathing is enabled we need to always use the subsystem instance
 * number for numbering our devices to avoid conflicts between subsystems that
 * have multiple controllers and thus use the multipath-aware subsystem node
 * and those that have a single controller and use the controller node
 * directly.
 */
void nvme_set_disk_name(char *disk_name, struct nvme_ns *ns,
   struct nvme_ctrl *ctrl, int *flags)
{
 if (!multipath) {
  sprintf(disk_name, "nvme%dn%d", ctrl->instance, ns->head->instance);
 } else if (ns->head->disk) {
  sprintf(disk_name, "nvme%dc%dn%d", ctrl->subsys->instance,
    ctrl->cntlid, ns->head->instance);
  *flags = GENHD_FL_HIDDEN;

// ...

For example, with this controller and a default namespace that we wanted to use:

81:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 172Xa/172Xb (rev 01)

uname -a
Linux node-9 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

udevadm info -q all -n /dev/nvme0n1 | grep DEVPATH
E: DEVPATH=/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1

readlink /sys/class/block/nvme0n1
../../devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1

readlink /sys/class/block/nvme0c33n1
../../devices/pci0000:80/0000:80:01.0/0000:81:00.0/nvme/nvme0/nvme0c33n1

lsblk -p
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
/dev/sda 8:0 0 1.8T 0 disk
├─/dev/sda1 8:1 0 512M 0 part /boot/efi
└─/dev/sda2 8:2 0 1.8T 0 part /
/dev/sdb 8:16 0 1.8T 0 disk
/dev/sdc 8:32 0 1.8T 0 disk
/dev/sdd 8:48 0 1.8T 0 disk
/dev/sde 8:64 0 1.8T 0 disk
/dev/sdf 8:80 0 1.8T 0 disk
/dev/sdg 8:96 0 1.8T 0 disk
/dev/sdh 8:112 0 1.8T 0 disk
/dev/sdi 8:128 0 1.8T 0 disk
/dev/sdj 8:144 0 1.8T 0 disk
/dev/sdk 8:160 0 1.8T 0 disk
/dev/sdl 8:176 0 111.3G 0 disk
/dev/sdm 8:192 0 3.7T 0 disk
/dev/sr0 11:0 1 1024M 0 rom
/dev/nvme0n1 259:1 0 1.5T 0 disk

ls -la /dev/nvme*
crw------- 1 root root 243, 0 May 29 14:10 /dev/nvme0
brw-rw---- 1 root disk 259, 1 May 29 14:10 /dev/nvme0n1

nvme0c33n1 is hidden is not shown in /dev/ (and lsblk does not report it) because GENHD_FL_HIDDEN is applied to those block device entries:

See:
https://github.com/torvalds/linux/commit/8ddcd653257c18a669fcb75ee42c37054908e0d6
https://patchwork.kernel.org/patch/10015051/

Tags: cpe-onsite
description: updated
Revision history for this message
Ryan Harper (raharper) wrote :

"nvme0c33n1 is hidden is not shown in /dev/ (and lsblk does not report it) because GENHD_FL_HIDDEN is applied to those block device entries:"

Is this related at all to the request?

Further, curtin will emit dname rules that use ID_WWN, which from your paste before appears to include the NGUID, no?

ID_WWN=nvme.8086-6e766d656330-51454d55204e564d65204374726c-00000001

So I think we're already doing what's needed here. Please correct me if I'm wrong.

I'd support a feature request against maas/curtin for rich NVME support (creating namespaces, using the nvme command to modify/tweak features, functions, extract IDs etc); MAAS would need to use the nvme tool to extract this info, curtin would need a new storage type (type:nvme) and config to control etc).

Changed in curtin:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Hmm, indeed. Looks like with commit 35fd01f05ca2932a0174c8f7c35cb7e8625c58d5 the order was changed and my older comment in the other bug is no longer valid (sorry about that):

- byid_keys = ['ID_SERIAL', 'ID_WWN_WITH_EXTENSION']
+ byid_keys = ['ID_WWN_WITH_EXTENSION', 'ID_WWN',
+ 'ID_SERIAL', 'ID_SERIAL_SHORT']

cat /etc/udev/rules.d/nvme0n1-renamed.rules
# Written by curtin
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="disk", ENV{ID_WWN}=="nvme.8086-6e766d656330-51454d55204e564d65204374726c-00000001", ENV{ID_SERIAL}=="QEMU NVMe Ctrl_nvmec0", ENV{ID_SERIAL_SHORT}=="nvmec0", SYMLINK+="disk/by-dname/nvme0n1-renamed"

udevadm info -q all -n /dev/nvme0n1

https://paste.ubuntu.com/p/x6x2vRJXJ2/

E: ID_SERIAL=QEMU NVMe Ctrl_nvmec1
E: ID_SERIAL_SHORT=nvmec1
E: ID_WWN=nvme.8086-6e766d656331-51454d55204e564d65204374726c-00000001

> Is this related at all to the request?

Yes, I think this may be the only thing for multi-path then: when we are matching a device against a set of parameters in a uevent, we currently have a requirement for ID_WWN, ID_SERIAL and ID_SERIAL_SHORT to match. The latter two seem to contain controller-related information while in the multi-path case accessing a namespace is possible through any controller.

I will ask for a full `udevadm info -q all -n /dev/nvme0n1` output tomorrow so that we can see which values are present there for a namespace in the multipath case.

Revision history for this message
Ryan Harper (raharper) wrote : Re: [Bug 1830913] Re: Create by-dname symlinks based on NVMe namespace UUIDs instead of controller serial numbers

On Wed, May 29, 2019 at 2:40 PM Dmitrii Shcherbakov <
<email address hidden>> wrote:

>
> I will ask for a full `udevadm info -q all -n /dev/nvme0n1` output
> tomorrow so that we can see which values are present there for a
> namespace in the multipath case.
>

OK. I would be surprised if the udev info contained hidden values; that
seems like it would lead to bugs where rules reference HIDDEN attributes.

Let's see what we find.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Ryan,

Got the needed output, it looks like the namespace block device created on a virtual NVMe subsystem has ID_SERIAL and ID_SERIAL_SHORT referring to a particular controller:

$ readlink /sys/block/nvme0*
../devices/pci0000:80/0000:80:01.0/0000:81:00.0/nvme/nvme0/nvme0c33n1
../devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
ubuntu@node-7:~$ udevadm info -q all -n /dev/nvme0n1
P: /devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
N: nvme0n1
S: disk/by-dname/nvme0n1
S: disk/by-id/nvme-Dell_Express_Flash_PM1725b_1.6TB_AIC_S47ANE0M300982
S: disk/by-id/nvme-eui.343741304d3009820025384500000004
E: DEVLINKS=/dev/disk/by-id/nvme-Dell_Express_Flash_PM1725b_1.6TB_AIC_S47ANE0M300982 /dev/disk/by-dname/nvme0n1 /dev/disk/by-id/nvme-eui.343741304d3009820025384500000004
E: DEVNAME=/dev/nvme0n1
E: DEVPATH=/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
E: DEVTYPE=disk
E: ID_MODEL=Dell Express Flash PM1725b 1.6TB AIC
E: ID_SERIAL=Dell Express Flash PM1725b 1.6TB AIC_S47ANE0M300982
E: ID_SERIAL_SHORT=S47ANE0M300982
E: ID_WWN=eui.343741304d3009820025384500000004
E: MAJOR=259
E: MINOR=1
E: SUBSYSTEM=block
E: TAGS=:systemd:
E: USEC_INITIALIZED=8800965

I don't have an NVMeOF setup to verify this but from what I understand a namespace can be discovered via one of multiple controllers and, therefore, may get different a ID_SERIAL or ID_SERIAL_SHORT content in a uevent. So there is a chance that the by-dname udev rule will be affected by the order of discovery of namespaces (i.e. a symlink will only be created if a namespace is discovered through the same controller that was used during commissioning).

If we matched only against ID_WWN then we would not care about the controller through which a namespace has been discovered.

> OK. I would be surprised if the udev info contained hidden values

What I meant was that both of the namespace-related entries are present in sysfs:

/sys/block/nvme0c33n1
/sys/block/nvme0n1

But only one entry is present in devtmpfs because GENHD_FL_HIDDEN flag is applied to a controller-specific gendisk in the kernel:

/dev/nvme0n1

Revision history for this message
Ryan Harper (raharper) wrote :

On Thu, May 30, 2019 at 7:50 AM Dmitrii Shcherbakov <
<email address hidden>> wrote:

> Ryan,
>
> Got the needed output, it looks like the namespace block device created
> on a virtual NVMe subsystem has ID_SERIAL and ID_SERIAL_SHORT referring
> to a particular controller:
>
> $ readlink /sys/block/nvme0*
> ../devices/pci0000:80/0000:80:01.0/0000:81:00.0/nvme/nvme0/nvme0c33n1
> ../devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
> ubuntu@node-7:~$ udevadm info -q all -n /dev/nvme0n1
> P: /devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
> N: nvme0n1
> S: disk/by-dname/nvme0n1
> S: disk/by-id/nvme-Dell_Express_Flash_PM1725b_1.6TB_AIC_S47ANE0M300982
> S: disk/by-id/nvme-eui.343741304d3009820025384500000004
> E:
> DEVLINKS=/dev/disk/by-id/nvme-Dell_Express_Flash_PM1725b_1.6TB_AIC_S47ANE0M300982
> /dev/disk/by-dname/nvme0n1
> /dev/disk/by-id/nvme-eui.343741304d3009820025384500000004
> E: DEVNAME=/dev/nvme0n1
> E: DEVPATH=/devices/virtual/nvme-subsystem/nvme-subsys0/nvme0n1
> E: DEVTYPE=disk
> E: ID_MODEL=Dell Express Flash PM1725b 1.6TB AIC
> E: ID_SERIAL=Dell Express Flash PM1725b 1.6TB AIC_S47ANE0M300982
> E: ID_SERIAL_SHORT=S47ANE0M300982
> E: ID_WWN=eui.343741304d3009820025384500000004
> E: MAJOR=259
> E: MINOR=1
> E: SUBSYSTEM=block
> E: TAGS=:systemd:
> E: USEC_INITIALIZED=8800965
>
>
> I don't have an NVMeOF setup to verify this but from what I understand a
> namespace can be discovered via one of multiple controllers and, therefore,
> may get different a ID_SERIAL or ID_SERIAL_SHORT content in a uevent. So
> there is a chance that the by-dname udev rule will be affected by the order
> of discovery of namespaces (i.e. a symlink will only be created if a
> namespace is discovered through the same controller that was used during
> commissioning).
>
> If we matched only against ID_WWN then we would not care about the
> controller through which a namespace has been discovered.
>

So, in a NVME *fabric*, separate machines which have different controller
serial numbers (ID_SERIAL, ID_SERIAL_SHORT) may point to the same WWN,
IIUC. I believe that the controller is going to be the *same* on the
commissioned node as when it is deployed. So I still think these rules
will be fine.

>
> > OK. I would be surprised if the udev info contained hidden values
>
> What I meant was that both of the namespace-related entries are present
> in sysfs:
>
> /sys/block/nvme0c33n1
> /sys/block/nvme0n1
>
> But only one entry is present in devtmpfs because GENHD_FL_HIDDEN flag
> is applied to a controller-specific gendisk in the kernel:
>
> /dev/nvme0n1
>

Yes, I get that; What I'm saying is that I would consider it a udev/kernel
bug if nvme0c33n1 were to ever show up in a value in the udev database,
specifically because nvme0c33n1 is *not* a block device (it's
hidden/ignored).

>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1830913
>
> Title:
> Create by-dname symlinks based on NVMe namespace UUIDs instead of
> controller serial numbers
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/curtin/+bug/1830913/+subscriptions
>

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

> separate machines which have different controller serial numbers
> I believe that the controller is going to be the *same* on the commissioned node as when it is deployed

Yes, for 1 controller per server.

For the 1 server/2+ controllers case each controller (say of the same model) will have different serial numbers. If the original controller through which a namespace was accessible is removed from a server the expectation with multi-path is that another controller will provide access to it.

I believe this may be problematic in the following case:

1) 2 controllers are present in a server;
2) the server is commissioned and deployed (udev rules tied to WWN and SERIAL are generated for discovered namespaces);
3) the server is shutdown and one card is removed from it (the one with the serial written to a udev rule);
4) the server is brought back up and namespaces are rediscovered through a different controller with a different serial => by-dname rules did not match based on a different serial number.

Theoretically, just rebooting the server without removing 1 controller might yield this situation as well depending on the order they are enumerated in.

Does it look like a sane scenario or am I missing something obvious?

As for the implementation, in the 1.3 NVMe spec controllers have a CMIC field to indicate whether an NVM subsystem can contain multiple controllers or not and a subsystem NQN (CNTLID) based on which is used to put controllers under a subsystem in sysfs.

https://github.com/torvalds/linux/commit/180de0070048340868c7bc841fc12e75556bb629

> consider it a udev/kernel bug if nvme0c33n1 were to ever show up in a value in the udev database

Right, agreed.

Changed in curtin:
status: Incomplete → New
Revision history for this message
Ryan Harper (raharper) wrote :

After discussion we concluded that:

1) nvme by-id symlinks include controller specific values which should *NOT* be a part of the by-di symlink name; that is, the left-hand-side of the symlink should not change depending on which controller picks up the namespace. We should confirm and file a bug against the nvme udev rules.

2) Until (1) is resolved, to be robust against (1), curtin should OMIT the by-id symlink when creating dname rules for NVME devices.

Changed in curtin:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
kai li (richardu-zt) wrote :

hi!
   I encounter a same problem, want to ask whether solve?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.