/dev/by-dname symlinks based on wwn/wwid

Bug #1735839 reported by Dmitrii Shcherbakov on 2017-12-01
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Undecided
Unassigned
curtin
High
Unassigned

Bug Description

Currently /dev/disk/by-dname symlinks are not created if a device does not have a partition table or a super block.

There needs to be a way to create auto-create by-dname symlinks based on unique wwn/eui identifiers and symlinks via udev rules installed during the deployment stage.

Old description:
Block devices can be identified by a GPT placed on them.

Each partition table contains a UUID which uniquely identifies a device.

sudo blkid -o udev -p /dev/nvme0n1
ID_PART_TABLE_UUID=8c5e98a4-0d1e-40ae-8f43-80544a2c204b
ID_PART_TABLE_TYPE=gpt

gdisk:
Command (? for help): p
Disk /dev/nvme0n1: 1000215216 sectors, 476.9 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 8C5E98A4-0D1E-40AE-8F43-80544A2C204B

We could leverage that for software that needs to create its own partitions after deployment in MAAS for example. In this case there will be no need to create a new partition table post-deployment but there will be no partitions or file systems on a block device which will be ready for any use.

Example: Ceph needs to use block devices to create partitions but "zapping" of a partition table is configurable. If a partition table is pre-zapped during the deployment and a new clean one is created ceph-disk can just create partitions on an existing partition table. This may help us workaround this: pad.lv/1604501

Serial numbers are not super reliable as they depend on a block driver while in this case we rely on a superblock written at the deployment time. Even if we have an odd device that doesn't have reliable means of identification we can rely on a persistent UUID we ourselves put onto that device.

MAAS needs to grow an ability to configure empty block devices which will get a clean partition table at will too.

I think it's worthwhile to create a udev rule for ubuntu server in general: /dev/disk/by-gpt-uuid/<uuid>

Related branches

description: updated
Ryan Harper (raharper) on 2017-12-01
Changed in curtin:
importance: Undecided → Wishlist
status: New → Triaged
Dmitrii Shcherbakov (dmitriis) wrote :

For dname links and regular disk devices it looks like ID_PART_TABLE_UUID is already used:

https://bazaar.launchpad.net/~curtin-dev/curtin/trunk/view/head:/curtin/commands/block_meta.py#L229
def make_dname(volume, storage_config):
...
    if vol.get('type') == "disk":
        rule.append(compose_udev_equality('ENV{DEVTYPE}', "disk"))
        rule.append(compose_udev_equality('ENV{ID_PART_TABLE_UUID}', ptuuid))

For other devices this is not present by default but they have other identification methods.

   elif vol.get('type') == "bcache":
        rule.append(compose_udev_equality("ENV{DEVNAME}", path))

Still, this is not usable for disks without partitions from MAAS.

Lee Trager (ltrager) on 2017-12-07
Changed in maas:
status: New → Triaged
importance: Undecided → Wishlist
milestone: none → 2.4.x
Scott Moser (smoser) wrote :

The issue with us provding /dev/disk/by-dname/ links based on content inside the disk is that the content can move. Ultimately when we added 'by-dname' links, we were intending it to be "disk name". Ie, the user would provide a name for the disk in maas and then could assume that /dev/disk/by-dname/<name> would point to that disk.

The implementation, however, ended up relying on contents of the disk.

The problem with this approach is that the contents can move.

Consider:
 a.) user boots system with disks named 'blue' and 'red' and 'boot'. Red is initially provisioned as empty.
 b.) user moves data from 'blue' to 'red'. exactly how doesn't matter, but consider 'dd if=/dev/disk/by-dname/blue of=/dev/disk/by-dname/red'. and then wipes /dev/disk/by-dname/blue.
 c.) user reboots
 d.) user formats /dev/disk/by-dname/blue to put new data on it.

If the data *on* blue was used to identify it, the user will have just destroyed their data, as /dev/disk/by-dname/blue will point to the 'red' disk.

What we decided was that to work our way out of this issue, we should/would provide a /dev/disk/by-<something> that would *only* consider serial numbers or wwid as basis for its names.

Dmitrii Shcherbakov (dmitriis) wrote :

Having dnames tied to wwids (/sys/class/block/<nvme-name>/wwid) would be a good solution it seems as currently there are no dname symlinks at all unless there is some form of a superblock present on a device.

The use-case for that is software (SDS like Ceph, databases etc.) which manages raw block devices and does not expect partition tables or any other superblocks on them and may error out or delete superblocks/ptables that are already present.

This happens with charm-based ceph deployments for example - charms error out if they detect a non-pristine device (the one that has some lower LBAs occupied by non-zero values). So pre-creating a partition table to trigger dname logic is not a universal workaround.

Juju storage also depends on /dev/disk/by-dname symlink availability as this is the only way to reliably retrieve a persistent device path from MAAS as cloud provider.

summary: - feature request: /dev/by-dname symlinks based on GUID patition table
- UUIDs
+ feature request: /dev/by-dname symlinks based on wwn/wwid
summary: - feature request: /dev/by-dname symlinks based on wwn/wwid
+ /dev/by-dname symlinks based on wwn/wwid
Dmitrii Shcherbakov (dmitriis) wrote :
Download full text (5.4 KiB)

subscribed ~field-critical

Considering https://bugs.launchpad.net/juju/+bug/1691694 and some other design issues which make Juju storage not applicable for day 2 operations when used with MAAS for now this lp has been re-qualified into a bug by FE as, besides Juju storage, there is no way to logically tie dnames in MAAS with wwn/eui identifiers (stored as idpath parameter) and use them in charm config values to identify storage devices without partition tables/superblocks in a persistent way.

There are two types of NVME devices below SSDPE21K375GA and SSDPE2KX040T7 while they all appear as /dev/nvme<something>. dnames pointing to persistent symlinks are needed for them to be logically meaningful.

ls -al /dev/disk/by-id/
total 0
drwxr-xr-x 2 root root 1040 Jul 23 09:55 .
drwxr-xr-x 8 root root 160 Jul 23 09:55 ..
lrwxrwxrwx 1 root root 9 Jul 23 10:02 ata-XF1230-1A0960_7CW009J0 -> ../../sdb
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009J0-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009J0-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009J0-part3 -> ../../sdb3
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009J0-part4 -> ../../sdb4
lrwxrwxrwx 1 root root 9 Jul 23 10:02 ata-XF1230-1A0960_7CW009NY -> ../../sda
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009NY-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009NY-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009NY-part3 -> ../../sda3
lrwxrwxrwx 1 root root 10 Jul 23 10:02 ata-XF1230-1A0960_7CW009NY-part4 -> ../../sda4
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-name-fnos-nvme08:0 -> ../../md0
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-name-fnos-nvme08:1 -> ../../md1
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-name-fnos-nvme08:2 -> ../../md2
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-name-fnos-nvme08:3 -> ../../md3
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-uuid-09b65367:fd20d811:43cd7b64:9e775cbb -> ../../md0
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-uuid-7242c36e:8e717cf2:528c60ba:2cfd1711 -> ../../md2
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-uuid-7839e948:130f8c65:0ba12c3c:6d8931d3 -> ../../md1
lrwxrwxrwx 1 root root 9 Jul 23 10:02 md-uuid-bb9db413:cd06ea5b:8d689c5a:d2fa83cd -> ../../md3
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE21K375GA_PHKE730200GL375AGN -> ../../nvme8n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802000Y84P0IGN -> ../../nvme4n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802001054P0IGN -> ../../nvme1n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802101JV4P0IGN -> ../../nvme9n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802101JX4P0IGN -> ../../nvme5n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802101KJ4P0IGN -> ../../nvme0n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802101RP4P0IGN -> ../../nvme7n1
lrwxrwxrwx 1 root root 13 Jul 23 10:02 nvme-INTEL_SSDPE2KX040T7_PHLF802101RT4P0IGN -> ../../nv...

Read more...

description: updated
Dmitrii Shcherbakov (dmitriis) wrote :

I may have found a workaround for a very specific case I am working on which may drop this to ~field-high:

1) naming a device in MAAS (e.g. nvme9n1p1 -> optane);
2) creating a whole-disk partition (which results in a partition table with a partition UUID being created);
3) deploying a node.

This results in a partition uuid-based symlink (no device symlink though) being created by curtin.

$ ls -al /dev/disk/by-dname/ | grep nvme
lrwxrwxrwx 1 root root 15 Jul 23 12:08 optane-part1 -> ../../nvme9n1p1

Using this as a workaround may only work for ceph-volume-based charm-ceph-osd deployments where it does not matter if a device special file is a partition or the whole device as LVM is used to slice it into chunks.

I will update this bug accordingly if I am able to work with that.

Dmitrii Shcherbakov (dmitriis) wrote :

I am going to set this to ~field-high now as all-disk partition approach allowed the project affected by this to be moved forward.

Dmitrii Shcherbakov (dmitriis) wrote :
Download full text (3.5 KiB)

Based on https://bugs.launchpad.net/curtin/+bug/1735839/comments/2

by-dname was intended to be "by device name" and ended up to be "by data name"

Curtin docs promise disk naming and partition naming:

https://curtin.readthedocs.io/en/latest/topics/storage.html
"If the name key is present, curtin will create a udev rule that makes a symbolic link to the disk with the given name value. This makes it easy to find disks on an installed system. The links are created in /dev/disk/by-dname/<name>. A link to each partition on the disk will also be created at /dev/disk/by-dname/<name>-part<number>..."

> "What we decided was that to work our way out of this issue, we should/would provide a /dev/disk/by-<something> that would *only* consider serial numbers or wwid as basis for its names."

I think "device name" vs "data name" usage should depend on an application and we need 2 symlink types:

* a "devicename" symlink should mean a wwn or serial number symlink for hw devices themselves, for virtual devices (partitions, bcache, md-raid etc.) - a superblock unique ID symlink;
* a "dataname" symlink should mean a GPT UUID/MBR disk identifier symlink for hw devices with a partition table, partition UUID symlink for partitions, superblock UUIDs for virtual devices.

There are several classes of block devices that I see:

1) hardware block devices with a unique factory identifier (wwn, EUI etc.);

lrwxrwxrwx 1 root root 14 Jul 25 16:28 nvme-eui.01000000010000005cd2e407a30d4f51 -> ../../nvme10n1
lrwxrwxrwx 1 root root 9 Jul 25 16:28 wwn-0x5000c500813becef -> ../../sda

2) hardware-level virtual block devices (created by RAID controllers) - WWN stability will depend on the raid controller's implementation but generally a RAID-controller superblock is written to every device in a RAID.

journalctl -k | grep -i lsi
JUN 08 16:38:57 ubuntu kernel: scsi 0:2:1:0: Direct-Access LSI MR9271-8i 3.46 PQ: 0 ANSI: 5

lspci | grep -i lsi
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208 [Thunderbolt] (rev 05)

tree /dev/disk/by-id/ | grep sda
├── scsi-3600605b006deb2f0ff00001f01ff0cdb -> ../../sda
└── wwn-0x600605b006deb2f0ff00001f01ff0cdb -> ../../sda

3) virtual block devices. Availability of persistent and unique identifiers depends on hypervisor support and persistent configuration that a hypervisor provides (libvirt domain xml files -> qemu args, VMDK file metadata etc.)

For QEMU, IDE and SCSI devices can utilize a wwn passed via qemu command line and virtio devices can utilize a serial number parameter.

https://git.qemu.org/?p=qemu.git;a=commit;h=95ebda85e09ed2b7f00deb2adbdafa5ccf5db948
https://git.qemu.org/?p=qemu.git;a=commitdiff;h=64cc22841e72d37d577416f5836155ecd0a9bfb6
https://git.qemu.org/?p=qemu.git;a=commitdiff;h=9a502563eef7d7c2c9120b237059426e229eefe9

https://libvirt.org/git/?p=libvirt.git;a=commitdiff;h=6e73850b01ee7d5816a803d684e9d669dad036f3

virtio:

#...
-drive file=/var/lib/libvirt/images/maas-1.qcow2,format=qcow2,if=none,id=drive-virtio-disk0,serial=a76f8801-92d3-4aff-9c96-a51a054517a0 # ...

cat /sys/class/block/vda/serial
a76f8801-92d3-4aff-9

(virtio)scsi:

virsh dumpxml maas | grep wwn
      <wwn>5...

Read more...

Changed in maas:
status: Triaged → Invalid
importance: Wishlist → Undecided
milestone: 2.4.x → none
Andres Rodriguez (andreserl) wrote :

Marking this as invalid for MAAS provided that:

1. MAAS doesn't create the udev rules matching to disk names.
2. MAAS already sends UUID's to curtin, as to whether curtin uses them/new ones are created, I dont know.

That said, can you please explain the problem you are actually hitting? The explanation on the bug reports explains what you are wanting to achieve and why, but doesn't highlight what problem you are currently facing.

Also, since I believe you would be using juju, and MAAS already has a UUID attached to a disk which knows which dname it has, why shouldn't juju be using such information to provide you what you need, instead of having to request a uuid being created on the disk.

And lastly, since this is a feature request, should this really be tagged as 'field-*' ? IIRC, since this is not a regression or a proper bug, and this is a feature request for functionality that doesn't yet exist, this should be tagged as such.

Dmitrii Shcherbakov (dmitriis) wrote :

Andres,

The problem we have is quite simple: we cannot rely on Juju storage as of today and can only use charm config options which require lists of devices, i.e.

osd-devices: /dev/disk/by-<some-logical-name-from-maas>/logicalname0 /dev/disk/by-<some-logical-name-from-maas>/logicalname1 /dev/disk/by-<some-logical-name-from-maas>/logicalname2
journal
osd-journal: # ... space-separated device symlinks
bluestore-wal: # ... space-separated device symlinks
bluestore-db: # ... space-separated device symlinks

There is a single config string for all units of ceph-osd for a given node type so we have to use logical naming: IDs are unique per-node so by-id links cannot be used directly. Logical names have to be used as we have an all-NVMe setup with 2 device types: "regular" P4500 NVME data devices and P4800x Optane devices. Those devices are not plugged uniformly in terms of slots across machines and /dev/nvme<devnum>n<namespace-num> entries look the same on all nodes but may point to a different device type depending on a node.

# 375G Optane device is nvme8n1 but nvme4n1 on a different node
lrwxrwxrwx 1 root root 13 Jul 25 16:28 nvme-INTEL_SSDPE21K375GA_PHKE730200JK375AGN -> ../../nvme8n1
# 4T P4500 device
lrwxrwxrwx 1 root root 13 Jul 25 16:28 nvme-INTEL_SSDPE2KX040T7_PHLF802000X84P0IGN -> ../../nvme9n1

So the following configuration would result in an optane device being used as a data device on some nodes which would silently create an invalid Ceph device configuration:

osd-devices: /dev/nvme0n1 /dev/nvme1n1 # ... until /dev/nvme7n1
osd-journal: /dev/nvme8n1

Tags and Juju storage would have solved this problem quite nicely but only if there were no other issues to address.

There have been some operational concerns when using Juju storage with MAAS as a provider that got communicated to us from the operations teams and also some current bugs:

https://bugs.launchpad.net/juju/+bug/1783084
https://bugs.launchpad.net/juju/+bug/1778033
https://bugs.launchpad.net/juju/+bug/1691694

We currently cannot use partition tags due to the lack of that feature in MAAS https://bugs.launchpad.net/maas/+bug/1713239 and we rely on NVME device partitions for allocating some NVMe storage for bcache and using some for filestore journal or bluestore WAL & DB. So we cannot use Juju storage with MAAS-provided partitions either.

Likewise, we cannot rely on pre-created filesystems in MAAS and fstab fs UUID-based entries as:

1) with ceph encryption configured by charms file systems are created on top of device-mapper block devices that MAAS is not aware of;
2) the new stable ceph backend (bluestore) does not rely on kernel-mounted file systems so file system UUIDs are not usable.

I agree that MAAS already exposes relevant information for physical devices and composable devices (md-raid, bcache etc.) to curtin.

However, having MAAS to provide serials or WWNs for virtio or virtio-scsi devices would be a MAAS feature request in my view as there is some work involved in rendering proper libvirt domain xml.

Andres Rodriguez (andreserl) wrote :

To better understand the context of this issue, I briefly discussed it with Dmitrii on IRC and we have identified 3 areas:

 A. Currently, we are relying on charm options to specify which storage devices to use for ceph data and journal or wal/db devices (/dev/disk/by-<logical-name-in-maas>/logicalname0 /dev/disk/by-<logical-name-in-maas>/logicalname1 and so on). That said:
 - The work around is to rename the block devices in MAAS (which renames the partition prefix and the dname symlink).
 - The desired fix is to use Juju storage instead (which is also supported by the ceph-osd charm), which would require solving [2] in MAAS.

 B. Curtin currently writes symlinks (by-dname) based on GPT partition UUIDs, partition UUIDs and other superblock identifiers. Curtin, however, doesn't create that for clean devices. As such, if curtin were to *also* create symlinks based on WWN or serial ID, it would help with the use cases. This means *this* bug applies for is valid for curtin, but invalid for MAAS.

 C. MAAS to preseed disk UUIDs for KVM pods VMs, so that when B is solved, disks would have proper symlinks. Requires a new bug to be filed for MAAS.

[2]: https://bugs.launchpad.net/maas/+bug/1713239

Ryan Harper (raharper) on 2018-10-31
Changed in curtin:
importance: Wishlist → High
status: Triaged → In Progress

This bug is fixed with commit 3f03b1dd to curtin on branch master.
To view that commit see the following URL:
https://git.launchpad.net/curtin/commit/?id=3f03b1dd

Changed in curtin:
status: In Progress → Fix Committed
Dmitrii Shcherbakov (dmitriis) wrote :

Manually checked on a physical node with 1 device after replacing changed curtin files on a MAAS node (the code isn't in -proposed yet at the time of writing), seems to work:

ubuntu@redacted:~$ blkid
/dev/sda1: LABEL="root" UUID="2b708114-354c-4ece-9d2e-3b7c202cbf17" TYPE="ext4" PARTUUID="18b7f148-01"

ubuntu@redacted:~$ tree /dev/disk/by-dname/
/dev/disk/by-dname/
├── sda -> ../../sda
├── sda-part1 -> ../../sda1
└── sdb -> ../../sdb

0 directories, 3 files

ubuntu@emere:~$ ls -al /dev/disk/by-id/ | grep sdb
lrwxrwxrwx 1 root root 9 Dec 3 14:14 scsi-3600508b1001c914bbd22b1f25777ef8c -> ../../sdb
lrwxrwxrwx 1 root root 9 Dec 3 14:14 wwn-0x600508b1001c914bbd22b1f25777ef8c -> ../../sdb

ubuntu@emere:~$ cat /etc/udev/rules.d/sdb.rules
# Written by curtin
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="3600508b1001c914bbd22b1f25777ef8c", ENV{ID_WWN_WITH_EXTENSION}=="0x600508b1001c914bbd22b1f25777ef8c", SYMLINK+="disk/by-dname/sdb"

SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="3600508b1001c914bbd22b1f25777ef8c", ENV{ID_WWN_WITH_EXTENSION}=="0x600508b1001c914bbd22b1f25777ef8c", SYMLINK+="disk/by-dname/sdb-part%n"

I'll try to get a node with an NVMe device (or create a virtual one) and check there as well.

This bug is believed to be fixed in curtin in version 18.2. If this is still a problem for you, please make a comment and set the state back to New

Thank you.

Changed in curtin:
status: Fix Committed → Fix Released
Dmitrii Shcherbakov (dmitriis) wrote :

With this change curtin will error out (resulting in failed deployments in MAAS) if a serial or wwn is not specified on a block device. In my case this failure happened on a VM created manually via virsh (not pods) with block devices that do not have explicit serial or wwn values:

        Running command ['blkid', '-o', 'export', '/dev/vda'] with allowed return codes [0, 2] (capture=True)
        Running command ['udevadm', 'info', '--query=property', '/dev/vda'] with allowed return codes [0] (capture=True)
        An error occured handling 'vda': RuntimeError - Cannot create disk tag udev rule for /dev/vda [id=vda], missing 'serial' or 'wwn' value
        finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: configuring disk: vda
        TIMED BLOCK_META: 4.837
        finish: cmd-install/stage-partitioning/builtin/cmd-block-meta: FAIL: curtin command block-meta
        Traceback (most recent call last):
          File "/curtin/curtin/commands/main.py", line 201, in main
            ret = args.func(args)
          File "/curtin/curtin/log.py", line 97, in wrapper
            return log_time("TIMED %s: " % msg, func, *args, **kwargs)
          File "/curtin/curtin/log.py", line 79, in log_time
            return func(*args, **kwargs)
          File "/curtin/curtin/commands/block_meta.py", line 83, in block_meta
            return meta_custom(args)
          File "/curtin/curtin/commands/block_meta.py", line 1601, in meta_custom
            handler(command, storage_config_dict)
          File "/curtin/curtin/commands/block_meta.py", line 506, in disk_handler
            make_dname(info.get('id'), storage_config)
          File "/curtin/curtin/commands/block_meta.py", line 271, in make_dname
            byid = make_dname_byid(path, error_msg="id=%s" % vol.get('id'))
          File "/curtin/curtin/commands/block_meta.py", line 250, in make_dname_byid
            "missing 'serial' or 'wwn' value" % error_msg)
        RuntimeError: Cannot create disk tag udev rule for /dev/vda [id=vda], missing 'serial' or 'wwn' value
        Cannot create disk tag udev rule for /dev/vda [id=vda], missing 'serial' or 'wwn' value

Stderr: ''

After adding serial numbers curtin still uses the old by-id value (supplied by maas from the db) to try and lookup a device by serial so a recommissioning is needed after serial values are updated:

          File "/curtin/curtin/block/__init__.py", line 686, in lookup_disk
            raise ValueError("no disk with serial '%s' found" % serial_udev)
        ValueError: no disk with serial 'drive-scsi0-0-0-1' found
        no disk with serial 'drive-scsi0-0-0-1' found

Adding serial to VMs created via virsh pods was fixed in maas for 2.5:
https://bugs.launchpad.net/maas/+bug/1785755

I think we may break previous MAAS versions though.

Andres Rodriguez (andreserl) wrote :

+1. Curtin should be backward compatible and not exclusively require this.

Dmitrii Shcherbakov (dmitriis) wrote :

Another thing I noticed is that the udev rule for NVMe namespaces is not based on a wwid of a namespace - it is tied to a serial of a controller. For NVMe devices with a single namespace this is ok but NVMe controllers can create multiple namespaces each with its own unique identifier (e.g. in the NVMeOF use-case).

https://paste.ubuntu.com/p/WJYpnWmnxp/
for i in `ls -1 /dev/vd* /dev/sd* /dev/nvme*n*` ; do echo "Info for device $i" ; cat /etc/udev/rules.d/`basename $i`.rules ; echo "Unique attribute: " ; udevadm info -a --path $(udevadm info -q path -n $i) | grep -P 'serial|wwid' ; done

Info for device /dev/nvme0n1
# Written by curtin
SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="disk", ENV{ID_SERIAL}=="QEMU NVMe Ctrl_nvmec0", SYMLINK+="disk/by-dname/nvme0n1"

SUBSYSTEM=="block", ACTION=="add|change", ENV{DEVTYPE}=="partition", ENV{ID_SERIAL}=="QEMU NVMe Ctrl_nvmec0", SYMLINK+="disk/by-dname/nvme0n1-part%n"
Unique attribute:
    ATTR{wwid}=="nvme.8086-6e766d656330-51454d55204e564d65204374726c-00000001"
    ATTRS{serial}=="nvmec0 "

nvme.8086-6e766d656330-51454d55204e564d65204374726c-00000001 is a wwid of a namespace.

So, I think we should prefer ID_WWN over ID_SERIAL in general and use ID_SERIAL as a fallback:

udevadm info --query=property --name /dev/nvme0n1
DEVLINKS=/dev/disk/by-id/nvme-QEMU_NVMe_Ctrl_nvmec0 /dev/disk/by-path/pci-0000:00:03.0-nvme-1 /dev/disk/by-id/nvme-nvme.8086-6e766d656330-51454d55204e564d65204374726c-00000001 /dev/disk/by-dname/nvme0n1
DEVNAME=/dev/nvme0n1
DEVPATH=/devices/pci0000:00/0000:00:03.0/nvme/nvme0/nvme0n1
DEVTYPE=disk
ID_MODEL=QEMU NVMe Ctrl
ID_PATH=pci-0000:00:03.0-nvme-1
ID_PATH_TAG=pci-0000_00_03_0-nvme-1
ID_SERIAL=QEMU NVMe Ctrl_nvmec0
ID_SERIAL_SHORT=nvmec0
ID_WWN=nvme.8086-6e766d656330-51454d55204e564d65204374726c-00000001
MAJOR=259
MINOR=0
SUBSYSTEM=block
TAGS=:systemd:
USEC_INITIALIZED=1792441

NVMe spec references:

[1] "5.15.2 Identify Namespace data structure
Namespace Globally Unique Identifier (NGUID): This field contains a 128-bit value that is globally unique and assigned to the namespace when the namespace is created. This field remains fixed throughout the life of the namespace and is preserved across namespace and controller operations (e.g., controller reset, namespace format, etc.)."

A namespace ID is extracted from a controller and exported via sysfs via a wwid node:
https://github.com/torvalds/linux/blob/v4.14/drivers/nvme/host/core.c#L802-L818
https://github.com/torvalds/linux/blob/v4.14/drivers/nvme/host/core.c#L2054-L2081

[2] Namespaces also have non-unique (across controllers or NVMe subsystems) numeric namespace identifiers (NSID) that are similar to SCSI LUNs (nvme<controller-num>n<namespace-identifier>), see “Section 6.1.2 Valid and Invalid NSIDs” in the NVMe specification:

“Associated with each controller namespace is a namespace ID, labeled as NSID 1 and NSID 2, that is used by the controller to reference a specific namespace.

Changed in curtin:
status: Fix Released → New
Dmitrii Shcherbakov (dmitriis) wrote :

Andres, more on comment #15:

ID_SERIAL and ID_SERIAL_SHORT are generated for SCSI devices if not provided via domain XML but not for virtio-blk.

https://bugs.launchpad.net/maas/+bug/1807505
https://bugs.launchpad.net/maas/+bug/1807505/comments/3

Ryan Harper (raharper) wrote :

I didn't think there was a MAAS released that would provide only path: to disks. This is somewhat surprising since path is just not stable.

While we can warn when we attempt to create a dname for a disk without serial/wwn and move on; I do think MAAS should require serial/wwn. No one looks at the curtin log file on successful deploys; so this leaves the node open to unreliable dname if someone modifies the partition table (which is what this change was to fix) or plugs in a USB drive. Should we inject 'unstable' in to the dname value?

/dev/disk/by-dname/UNSTABLE-sda ../../sda

In terms of rule preference; we have some conflicting values. On some scsi disks, we get:

ID_SERIAL
ID_SERIAL_SHORT
ID_WWN
ID_WWN_WITH_EXTENSION

So, I'd like to see if we can build a reasonable order of preference. My initial thought based on my experience would be to prefer values in this order:

ID_WWN_WITH_EXTENSION
ID_WWN
ID_SERIAL
ID_SERIAL_SHORT

And lastly, curtin currently will embed as many of these values as are found.

DEVTYPE=={disk}, ENV{ID_WWN_WITH_EXTENSION}=="val", ENV{ID_WWN}=="val"

I think by combining these values in the rule; we'll ensure we don't misidentify a disk.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers