Comment 0 for bug 2039614

Jeff Lane  (bladernr) wrote :

We (Cert) just updated MAAS from 3.3.x to 3.4.0-RC1. In testflinger we have default partition definitions that, because of how MAAS identifies partitions and disks, rely heavily on the MAAS IDs for disk devices and partitions.

For example, prior to the move to 3.4.0, this was the definition for one server (these change and grow more or less complex depending on the number of disks in a machine):
default_disks:
- id: '216'
  name: nvme0n1
  parent_disk_blkid: '216'
  ptable: GPT
  type: disk
- device: '882'
  id: nvme0n1-part1
  number: '882'
  parent_disk: '216'
  parent_disk_blkid: '216'
  size: '536870912'
  type: partition
- fstype: fat32
  id: 882-format
  label: efi
  parent_disk: '216'
  parent_disk_blkid: '216'
  type: format
  volume: '882'
- device: 882-format
  id: 882-mount
  parent_disk: '216'
  parent_disk_blkid: '216'
  path: /boot/efi
  type: mount
- device: '883'
  id: nvme0n1-part2
  number: '883'
  parent_disk: '216'
  parent_disk_blkid: '216'
  size: '1599778848768'
  type: partition
- fstype: ext4
  id: 883-format
  label: root
  parent_disk: '216'
  parent_disk_blkid: '216'
  type: format
  volume: '883'
- device: 883-format
  id: 883-mount
  parent_disk: '216'
  parent_disk_blkid: '216'
  path: /
  type: mount

As you can see, this spells out the partitions on the disk with ID 216, where partition IDs 882 and 883 define the /boot/efi filesystem and the root filesystem respectively. These IDs were pulled from MAAS and reflected what one would get from 'maas <name> partitions read <system_id> <disk_id>'. This gives users a way to define their own partition scheme (e.g. set up something ceph-like, or bcache, or whatever) and then revert things to the default.

After the update, all testflinger deployments now fail, apparently because the partition IDs have changed. Looking at a dump of this machine via the MAAS CLI, the disk ID has remained the same but the partition IDs are now all in the 16,000s:

bladernr@weavile:~$ maas bladernr partitions read 8pk6f8 216
Success.
Machine-readable output follows:
[
    {
        "uuid": "b838b3db-3266-44da-bdbe-2a90b75df617",
        "size": 1599778848768,
        "bootable": false,
        "tags": [],
        "used_for": "ext4 formatted filesystem mounted at /",
        "type": "partition",
        "path": "/dev/disk/by-dname/nvme0n1-part2",
        "device_id": 216,
        "filesystem": {
            "fstype": "ext4",
            "label": "root",
            "uuid": "21aa8167-f0f7-4166-9e62-57e6504cac8d",
            "mount_point": "/",
            "mount_options": ""
        },
        "id": 16153,
        "system_id": "8pk6f8",
        "resource_uri": "/MAAS/api/2.0/nodes/8pk6f8/blockdevices/216/partition/16153"
    },
    {
        "uuid": "94256eca-c024-454b-b9f2-5c3b79b29611",
        "size": 536870912,
        "bootable": false,
        "tags": [],
        "used_for": "fat32 formatted filesystem mounted at /boot/efi",
        "type": "partition",
        "path": "/dev/disk/by-dname/nvme0n1-part1",
        "device_id": 216,
        "filesystem": {
            "fstype": "fat32",
            "label": "efi",
            "uuid": "1b93141c-af66-4594-a6d2-56e00f097108",
            "mount_point": "/boot/efi",
            "mount_options": ""
        },
        "id": 16152,
        "system_id": "8pk6f8",
        "resource_uri": "/MAAS/api/2.0/nodes/8pk6f8/blockdevices/216/partition/16152"
    }
]

I am pretty sure that testflinger is failing because it expects to see partition IDs 882 and 883 on disk 216, but those no longer exist.
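
To make the mismatch concrete, here is a minimal Python sketch that compares the partition IDs recorded in the default_disks definition with the IDs in the CLI output above. The helper name missing_partition_ids is made up for illustration and is not part of testflinger or MAAS; the JSON is a trimmed copy of the dump above.

```python
import json

# Partition IDs recorded in the testflinger default_disks definition.
EXPECTED_PARTITION_IDS = {882, 883}

# Trimmed copy of the `maas bladernr partitions read 8pk6f8 216` output.
MAAS_OUTPUT = """
[
    {"id": 16153, "device_id": 216, "path": "/dev/disk/by-dname/nvme0n1-part2"},
    {"id": 16152, "device_id": 216, "path": "/dev/disk/by-dname/nvme0n1-part1"}
]
"""


def missing_partition_ids(expected_ids, maas_json):
    """Return the expected partition IDs that MAAS no longer reports, sorted.

    Hypothetical helper for illustration only.
    """
    current_ids = {partition["id"] for partition in json.loads(maas_json)}
    return sorted(set(expected_ids) - current_ids)


# Both stored IDs are absent from the current MAAS listing.
print(missing_partition_ids(EXPECTED_PARTITION_IDS, MAAS_OUTPUT))  # [882, 883]
```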

Should we expect the partition IDs to change every time MAAS is updated, or is this a weird bug this time around? I don't think we've updated MAAS since we implemented the disk layout in testflinger, so it's possible this has always been the case and we just never had a problem with it before.
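
If the IDs are not stable across upgrades, one possible approach on our side (an assumption, not something testflinger does today) would be to look up the current ID by partition name at deploy time, since the names in the "path" field of the dump above (nvme0n1-part1, nvme0n1-part2) match the partition "id" entries in our definition. A rough sketch, with a hypothetical helper name:

```python
import posixpath


def current_ids_by_name(maas_partitions):
    """Map each partition name (basename of its by-dname path) to its MAAS ID.

    Hypothetical helper; `maas_partitions` is the parsed JSON list returned
    by `maas <name> partitions read <system_id> <disk_id>`.
    """
    return {
        posixpath.basename(partition["path"]): partition["id"]
        for partition in maas_partitions
    }


# Trimmed entries from the CLI output for disk 216.
partitions = [
    {"id": 16153, "path": "/dev/disk/by-dname/nvme0n1-part2"},
    {"id": 16152, "path": "/dev/disk/by-dname/nvme0n1-part1"},
]

ids = current_ids_by_name(partitions)
print(ids["nvme0n1-part1"], ids["nvme0n1-part2"])  # 16152 16153
```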

Note that the only thing that has changed on our end was the MAAS snap update to 3.4.0; we did not update anything in the testflinger agents from yesterday to today. So I'm reasonably certain this is the root cause, at least from what I have seen over the last 30 minutes or so of poking at this.