Update to MAAS 3.4.0 snap changed partition IDs, which broke automated deployments

Bug #2039614 reported by Jeff Lane 
This bug affects 2 people
Affects              Status    Importance  Assigned to  Milestone
MAAS (status tracked in 3.6)
  3.4                Triaged   High        Unassigned
  3.5                Triaged   High        Unassigned
  3.6                Triaged   High        Unassigned

Bug Description

We (Cert) just updated MAAS from 3.3.x to 3.4.0-RC1. In testflinger we have default partition definitions that, because of how MAAS identifies partitions and disks, rely heavily on MAAS IDs for disk devices and partitions.

For example, prior to the move to 3.4.0, this was the definition for one server (these change and grow more or less complex depending on the number of disks in a machine):
  default_disks:
  - id: '216'
    name: nvme0n1
    parent_disk_blkid: '216'
    ptable: GPT
    type: disk
  - device: '882'
    id: nvme0n1-part1
    number: '882'
    parent_disk: '216'
    parent_disk_blkid: '216'
    size: '536870912'
    type: partition
  - fstype: fat32
    id: 882-format
    label: efi
    parent_disk: '216'
    parent_disk_blkid: '216'
    type: format
    volume: '882'
  - device: 882-format
    id: 882-mount
    parent_disk: '216'
    parent_disk_blkid: '216'
    path: /boot/efi
    type: mount
  - device: '883'
    id: nvme0n1-part2
    number: '883'
    parent_disk: '216'
    parent_disk_blkid: '216'
    size: '1599778848768'
    type: partition
  - fstype: ext4
    id: 883-format
    label: root
    parent_disk: '216'
    parent_disk_blkid: '216'
    type: format
    volume: '883'
  - device: 883-format
    id: 883-mount
    parent_disk: '216'
    parent_disk_blkid: '216'
    path: /
    type: mount

As you can see, this spells out the partitions on a disk with the ID 216, where the partition IDs 882 and 883 define the /boot/efi filesystem and the root filesystem respectively. These IDs were pulled from MAAS and reflect what one would get from 'maas <profile> partitions read <system_id> <device_id>'. This gives us a way to let users define their own partition scheme (e.g. set up something ceph-like, or bcache, or whatever) and then revert things to the default.
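
To illustrate where those IDs come from, here is a minimal sketch (not the actual testflinger code) that pulls the partition list for one block device using the same MAAS CLI call shown later in this report; the profile, system_id and device id values are just the ones from this machine:

import json
import subprocess

PROFILE = "bladernr"   # a logged-in MAAS CLI profile
SYSTEM_ID = "8pk6f8"   # the machine's system_id
DEVICE_ID = "216"      # the block device id

def read_partitions(profile, system_id, device_id):
    """Return the machine-readable partition list for one block device."""
    out = subprocess.run(
        ["maas", profile, "partitions", "read", system_id, device_id],
        check=True, capture_output=True, text=True,
    ).stdout
    # When run on a terminal the CLI prints a "Success." banner before the
    # JSON, so keep only the JSON body.
    start = out.find("[")
    return json.loads(out[start:] if start >= 0 else out)

for part in read_partitions(PROFILE, SYSTEM_ID, DEVICE_ID):
    print(part["id"], part["path"], part["size"])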

After the update, all testflinger deployments now fail, apparently because the partition IDs have changed. Looking at a dump of this machine via the MAAS CLI, the disk ID has remained the same but the partition IDs are now all in the 16,000s:

bladernr@weavile:~$ maas bladernr partitions read 8pk6f8 216
Success.
Machine-readable output follows:
[
    {
        "uuid": "b838b3db-3266-44da-bdbe-2a90b75df617",
        "size": 1599778848768,
        "bootable": false,
        "tags": [],
        "used_for": "ext4 formatted filesystem mounted at /",
        "type": "partition",
        "path": "/dev/disk/by-dname/nvme0n1-part2",
        "device_id": 216,
        "filesystem": {
            "fstype": "ext4",
            "label": "root",
            "uuid": "21aa8167-f0f7-4166-9e62-57e6504cac8d",
            "mount_point": "/",
            "mount_options": ""
        },
        "id": 16153,
        "system_id": "8pk6f8",
        "resource_uri": "/MAAS/api/2.0/nodes/8pk6f8/blockdevices/216/partition/16153"
    },
    {
        "uuid": "94256eca-c024-454b-b9f2-5c3b79b29611",
        "size": 536870912,
        "bootable": false,
        "tags": [],
        "used_for": "fat32 formatted filesystem mounted at /boot/efi",
        "type": "partition",
        "path": "/dev/disk/by-dname/nvme0n1-part1",
        "device_id": 216,
        "filesystem": {
            "fstype": "fat32",
            "label": "efi",
            "uuid": "1b93141c-af66-4594-a6d2-56e00f097108",
            "mount_point": "/boot/efi",
            "mount_options": ""
        },
        "id": 16152,
        "system_id": "8pk6f8",
        "resource_uri": "/MAAS/api/2.0/nodes/8pk6f8/blockdevices/216/partition/16152"
    }
]

I am pretty sure that testflinger is failing because it expects to see partition IDs 882 and 883 on disk 216, but those no longer exist.

Should we expect the partition IDs to change every time MAAS is updated, or is this a one-off bug this time around? (I don't think we've updated MAAS since we implemented the disk layout handling in testflinger, so it's possible this has always been the case and we just never hit a problem with it before.)

Note: the only thing that has changed on our end is the MAAS snap update to 3.4.0; we did not change anything in the testflinger agents between yesterday and today, so I'm reasonably certain the update is the root cause here, at least from what I've seen over the last 30 minutes or so of poking at this.
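
For reference, a self-contained sketch (an assumption about how one could check, not existing testflinger behaviour) of detecting this kind of ID drift: compare the partition IDs baked into the stored default layout with what MAAS reports now for the same disk.

import json
import subprocess

def live_partition_ids(profile, system_id, device_id):
    """Partition IDs MAAS reports right now for one block device."""
    out = subprocess.run(
        ["maas", profile, "partitions", "read", system_id, device_id],
        check=True, capture_output=True, text=True,
    ).stdout
    start = out.find("[")  # skip any "Success." banner the CLI may print
    return {str(p["id"]) for p in json.loads(out[start:] if start >= 0 else out)}

# In the stored definition above, the partition entries carry the MAAS
# partition id in their "device"/"number" fields.
stored_ids = {"882", "883"}
live_ids = live_partition_ids("bladernr", "8pk6f8", "216")  # now {"16152", "16153"}
if stored_ids != live_ids:
    print(f"stale default layout: stored {stored_ids}, MAAS reports {live_ids}")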

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

We need to confirm that an upgrade from 3.3.x to 3.4.0-rc2 changes partition IDs, and re-triage this issue based on the outcome.

Changed in maas:
importance: Undecided → High
milestone: none → 3.4.x
status: New → Triaged
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Jeff, could you share more information on what testflinger does and how it relies on data returned by MAAS? Also, the configuration snippet shared in the description seems to have lost formatting and is hard to interpret - could you attach that as a file if relevant?

Revision history for this message
Jeff Lane  (bladernr) wrote (last edit ):

So what this came from was a need from SQA to create custom partition layouts for some of their testing (setting up ceph, bcache, etc.). Unfortunately, the only way to do that is to specify the desired layout to MAAS prior to deployment, which permanently changes the partition layout for the machine. That causes a problem: WE need the layouts to be basic and flat, with all secondary disks carrying simple single partitions, for cert testing needs, BUT if SQA runs a test it permanently rewrites the disk layout until someone rewrites it again (i.e. there's no way in MAAS to do a one-time-only partition layout for a deployment).

SO Testflinger agents (the things that issue commands to MAAS to allocate and deploy nodes using MAAS CLI commands) do one of two things:

1: If you submit a job that includes a custom partition layout, testflinger will apply your custom layout to the machine during provisioning so that if you desire a ... MD RAID layout, that's what you get.

or 2: If you submit a job with NO custom partition layout, the testflinger agent has a copy of what the default layout should be (our simple flat layouts for cert purposes) and will tell MAAS to configure the drives that way.

(EDIT to add that the "default storage definition" we store in the config for each testflinger agent (now called device connectors?) is pulled directly from MAAS using some MAAS CLI commands.)

We had to implement #2 because of the aforementioned gap where you can't specify a one-time-only partition layout while preserving the original layout during deployment.
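
Very roughly, the decision in #1 and #2 above looks something like the following hypothetical sketch; the function and field names (choose_storage_layout, "provision_data"/"disks", "default_disks") are illustrative, not the real device connector code:

def choose_storage_layout(job_data, agent_config):
    """Return the storage layout the agent should push to MAAS."""
    custom = job_data.get("provision_data", {}).get("disks")
    if custom:
        return custom                      # case 1: job-supplied layout
    return agent_config["default_disks"]   # case 2: our flat cert default

# Example: a job with no custom layout falls back to the agent's stored default.
agent_config = {"default_disks": [{"id": "216", "type": "disk", "ptable": "GPT"}]}
print(choose_storage_layout({}, agent_config))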

We briefly considered retrieving the "current layout" from MAAS on the fly, deploying with the custom layout, and then re-configuring the node using the previously retrieved "current" layout, but decided against that because it easily leads to a case where a job changes the partition scheme, then fails, and never restores the original partition layout, leaving a machine in a permanently altered state.

Attached is one of the agent config files that includes the "default" partition layout.

Additionally, all of this is defined in the testflinger source:
https://github.com/canonical/snappy-device-agents/tree/main/src/testflinger_device_connectors/devices/maas2

in maas2-storage.py and called from maas2.py

description: updated
Revision history for this message
Jeff Lane  (bladernr) wrote :

Also, weirdly, LP kills the formatting but if you click on the Edit button to change the summary, the formatting is still there :/ weird.

Revision history for this message
Jeff Lane  (bladernr) wrote :

@alanec and @pwlars did the dev work on this and can correct anything I got wrong above...

Revision history for this message
Jeff Lane  (bladernr) wrote :

Updated with a clearer attachment

Changed in maas:
milestone: 3.4.x → 3.5.x