Comment 9 for bug 1735839

Dmitrii Shcherbakov (dmitriis) wrote :


The problem we have is quite simple: as of today we cannot rely on Juju storage and can only use charm config options, which require lists of devices, i.e.

osd-devices: /dev/disk/by-<some-logical-name-from-maas>/logicalname0 /dev/disk/by-<some-logical-name-from-maas>/logicalname1 /dev/disk/by-<some-logical-name-from-maas>/logicalname2
osd-journal: # ... space-separated device symlinks
bluestore-wal: # ... space-separated device symlinks
bluestore-db: # ... space-separated device symlinks

There is a single config string for all ceph-osd units of a given node type, so we have to use logical naming: IDs are unique per node, so by-id links cannot be used directly. Logical names are needed because we have an all-NVMe setup with 2 device types: "regular" P4500 NVMe data devices and P4800X Optane devices. Those devices are not plugged into the same slots across machines, so /dev/nvme<devnum>n<namespace-num> entries look the same on all nodes but may point to a different device type depending on the node.

# 375G Optane device: nvme8n1 on this node, but nvme4n1 on a different node
lrwxrwxrwx 1 root root 13 Jul 25 16:28 nvme-INTEL_SSDPE21K375GA_PHKE730200JK375AGN -> ../../nvme8n1
# 4T P4500 device
lrwxrwxrwx 1 root root 13 Jul 25 16:28 nvme-INTEL_SSDPE2KX040T7_PHLF802000X84P0IGN -> ../../nvme9n1
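
To illustrate why the model string rather than the kernel name identifies the device type, here is a minimal Python sketch that groups by-id link names by model prefix. The prefix-to-role table and the function name classify are assumptions made for this example, derived only from the two listings above; this is not something the charm does today.

```python
import os

# Assumed mapping from model prefix to intended role, based only on the
# two by-id listings above (Optane P4800X vs. P4500).
ROLE_BY_MODEL_PREFIX = {
    "INTEL_SSDPE21K": "journal",  # 375G Optane device
    "INTEL_SSDPE2KX": "data",     # 4T P4500 device
}


def classify(links):
    """Group device nodes by role.

    `links` maps a /dev/disk/by-id link name to its symlink target,
    e.g. {"nvme-INTEL_..._SERIAL": "../../nvme8n1"}.
    """
    roles = {}
    for name, target in sorted(links.items()):
        # Only whole-disk NVMe links; skip partition links such as "-part1".
        if not name.startswith("nvme-") or "-part" in name:
            continue
        # The link name is "nvme-<model>_<serial>"; drop the serial.
        model, _, _serial = name[len("nvme-"):].rpartition("_")
        for prefix, role in ROLE_BY_MODEL_PREFIX.items():
            if model.startswith(prefix):
                dev = "/dev/" + os.path.basename(target)
                roles.setdefault(role, []).append(dev)
    return roles


links = {
    "nvme-INTEL_SSDPE21K375GA_PHKE730200JK375AGN": "../../nvme8n1",
    "nvme-INTEL_SSDPE2KX040T7_PHLF802000X84P0IGN": "../../nvme9n1",
}
print(classify(links))
```

On a real node the dictionary could be built with os.listdir and os.readlink over /dev/disk/by-id; the point is only that the role follows the model, whatever nvme<devnum> the kernel assigned.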

So the following configuration would result in an Optane device being used as a data device on some nodes, which would silently create an invalid Ceph device configuration:

osd-devices: /dev/nvme0n1 /dev/nvme1n1 # ... until /dev/nvme7n1
osd-journal: /dev/nvme8n1
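
The failure mode can be sketched in Python as a check of one shared config string against per-node device inventories. The node names, the inventories and the function name misassigned are hypothetical; the only facts taken from above are that the Optane device is nvme8n1 on one node and nvme4n1 on another.

```python
def misassigned(config, node_devices, optane_marker="SSDPE21K"):
    """Return (node, option, device) tuples where the device type is wrong.

    `config` holds the shared charm options; `node_devices` maps a node
    name to {kernel device path: model string}.
    """
    problems = []
    for node, models in sorted(node_devices.items()):
        # osd-devices should hold data devices, osd-journal Optane devices.
        for option, want_optane in (("osd-devices", False), ("osd-journal", True)):
            for dev in config[option].split():
                model = models.get(dev)
                if model is None:
                    continue  # device not present in this (partial) inventory
                if (optane_marker in model) != want_optane:
                    problems.append((node, option, dev))
    return problems


# The shared config from above: nvme0n1..nvme7n1 as data, nvme8n1 as journal.
config = {
    "osd-devices": " ".join(f"/dev/nvme{i}n1" for i in range(8)),
    "osd-journal": "/dev/nvme8n1",
}

# Hypothetical, partial inventories for two nodes of the same type.
node_devices = {
    "node-a": {"/dev/nvme4n1": "INTEL_SSDPE2KX040T7",
               "/dev/nvme8n1": "INTEL_SSDPE21K375GA"},
    "node-b": {"/dev/nvme4n1": "INTEL_SSDPE21K375GA",
               "/dev/nvme8n1": "INTEL_SSDPE2KX040T7"},
}

for problem in misassigned(config, node_devices):
    print(problem)
```

Under these assumptions node-a is fine, while node-b ends up with the Optane device in osd-devices and a P4500 in osd-journal, and nothing in the charm config itself would flag that.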

Tags and Juju storage would solve this problem quite nicely, but only if there were no other issues to address.

The operations teams have communicated some operational concerns to us about using Juju storage with MAAS as a provider, and there are also some current bugs:

We currently cannot use partition tags because MAAS lacks that feature, and we rely on NVMe device partitions to allocate some NVMe storage for bcache and some for the filestore journal or bluestore WAL & DB. So we cannot use Juju storage with MAAS-provided partitions either.

Likewise, we cannot rely on pre-created file systems in MAAS and UUID-based fstab entries because:

1) with Ceph encryption configured by charms, file systems are created on top of device-mapper block devices that MAAS is not aware of;
2) the new stable Ceph backend (bluestore) does not rely on kernel-mounted file systems, so file system UUIDs are not usable.

I agree that MAAS already exposes relevant information for physical devices and composable devices (md-raid, bcache etc.) to curtin.

However, having MAAS provide serials or WWNs for virtio or virtio-scsi devices would be a MAAS feature request in my view, as there is some work involved in rendering proper libvirt domain XML.