Installer crashes when setting up software RAID and disks have duplicate WWN

Bug #2003654 reported by Patrik Lundin
This bug affects 3 people
Affects: curtin
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned

Bug Description

When setting up software RAID on a 22.04 server whose physical disks share the same WWN, the installer crashes. curtin looks up both disks using that same WWN, so follow-up operations (like running sgdisk) are applied to the same disk twice. The second sgdisk run fails, and the installer crashes.

Here is some lsblk output from such a server showing the values:
```
$ lsblk -S -d -o TRAN,NAME,TYPE,MODEL,SERIAL,SIZE,WWN
TRAN NAME TYPE MODEL SERIAL SIZE WWN
sata sda disk M.2 (S80) 3ME4 YCA12009140310160 119.2G 0x502b2a201d1c1b1a
sata sdb disk M.2 (S80) 3ME4 YCA11905030450003 119.2G 0x502b2a201d1c1b1a
usb sr0 rom Virtual CDROM0 AAAABBBBCCCC1 1024M
usb sr1 rom Virtual CDROM1 AAAABBBBCCCC1 1024M
usb sr2 rom Virtual CDROM2 AAAABBBBCCCC1 1024M
usb sr3 rom Virtual CDROM3 AAAABBBBCCCC1 1024M
```

There you can see that the serial numbers are unique, but both disks unfortunately report the same WWN.

Here is output from the installer crash report where the issue occurs (here the internal drives happen to be sdb and sdc instead of sda/sdb):
```
start: cmd-install/stage-partitioning/builtin/cmd-block-meta: configuring partition: partition-0
 get_path_to_storage_volume for volume disk-sdc({'ptable': 'gpt', 'serial': 'M.2_(S80)_3ME4_YCA11905030450003', 'wwn': '0x502b2a201d1c1b1a', 'path': '/dev/sdc', 'preserve': False, 'name': '', 'grub_device': False, 'type': 'disk', 'id': 'disk-sdc'})
 Processing serial 0x502b2a201d1c1b1a via udev to 0x502b2a201d1c1b1a
 lookup_disks found: ['wwn-0x502b2a201d1c1b1a']
 Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sdb'] with allowed return codes [0] (capture=True)
 /dev/sdb is multipath device? False

[...]

Preparing partition location on disk /dev/sdb
 Wiping 1M on /dev/sdb at offset 1048576
 Running command ['sgdisk', '--new', '1:2048:2203647', '--typecode=1:ef00', '/dev/sdb'] with allowed return codes [0] (capture=True)
 Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sdb'] with allowed return codes [0] (capture=True)
 /dev/sdb is multipath device? False
```
You can see that the 'path' field is '/dev/sdc', but even so the follow-up commands operate on /dev/sdb instead.

Then a bit later:
```
start: cmd-install/stage-partitioning/builtin/cmd-block-meta: configuring partition: partition-1
 get_path_to_storage_volume for volume disk-sdb({'ptable': 'gpt', 'serial': 'M.2_(S80)_3ME4_YCA12009140310160', 'wwn': '0x502b2a201d1c1b1a', 'path': '/dev/sdb', 'preserve': False, 'name': '', 'grub_device': False, 'type': 'disk', 'id': 'disk-sdb'})
 Processing serial 0x502b2a201d1c1b1a via udev to 0x502b2a201d1c1b1a
 lookup_disks found: ['wwn-0x502b2a201d1c1b1a', 'wwn-0x502b2a201d1c1b1a-part1']
 Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sdb'] with allowed return codes [0] (capture=True)
 /dev/sdb is multipath device? False

[...]

Preparing partition location on disk /dev/sdb
 Wiping 1M on /dev/sdb at offset 1048576
 Running command ['sgdisk', '--new', '1:2048:2203647', '--typecode=1:ef00', '/dev/sdb'] with allowed return codes [0] (capture=True)
 An error occured handling 'partition-1': ProcessExecutionError - Unexpected error while running command.
 Command: ['sgdisk', '--new', '1:2048:2203647', '--typecode=1:ef00', '/dev/sdb']
 Exit code: 4
 Reason: -
 Stdout: ''
 Stderr: Could not create partition 1 from 2048 to 2203647
         Error encountered; not saving changes.
```

Above you can see that the disk is supposed to be /dev/sdb this time, and curtin does operate on /dev/sdb. By then it is too late: the same sgdisk command has already been run against /dev/sdb for partition-0, so creating partition 1 a second time fails.
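To make the failure mode concrete, here is a toy model of the lookup. It is not curtin's code: `build_links`, `resolve`, and the `(key, value)` link table are hypothetical stand-ins for the by-id lookup, and the rule that the first disk keeps a contested name is an assumption chosen to match the logs above.

```python
# Toy model of the lookup, NOT curtin's actual code. All names are hypothetical.
LOOKUP_ORDER = ["wwn", "serial", "device_id", "path"]

DISKS = [
    {"path": "/dev/sdb", "serial": "YCA12009140310160", "wwn": "0x502b2a201d1c1b1a"},
    {"path": "/dev/sdc", "serial": "YCA11905030450003", "wwn": "0x502b2a201d1c1b1a"},
]


def build_links(disks):
    """Simulate a name -> device table where each name points at one device."""
    links = {}
    for disk in disks:
        for key in ("wwn", "serial"):
            # Assumption: the first disk seen keeps a contested name.
            links.setdefault((key, disk[key]), disk["path"])
    return links


def resolve(volume, links, order=LOOKUP_ORDER):
    """Return the first path found by trying the identifiers in order."""
    for key in order:
        value = volume.get(key)
        if not value:
            continue
        if key == "path":
            return value
        if (key, value) in links:
            return links[(key, value)]
    return None


LINKS = build_links(DISKS)
# WWN first: both volumes resolve to the same device.
with_wwn = [resolve(d, LINKS) for d in DISKS]
# WWN skipped: each volume resolves to its own device.
without_wwn = [resolve(d, LINKS, ["serial", "device_id", "path"]) for d in DISKS]
print(with_wwn)
print(without_wwn)
```

With the WWN tried first, both volumes land on /dev/sdb, which is what the crash log shows. With the WWN left out of the order, the serial distinguishes them.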

I tried building my own 22.04 installer with a patched curtin library: https://github.com/eest/curtin/commit/1052baa37b81cd348e17d17a2ef545c93ce26505

Here is the diff for future reference:
```
diff --git a/curtin/commands/block_meta.py b/curtin/commands/block_meta.py
index f3f19dc2..918acb34 100644
--- a/curtin/commands/block_meta.py
+++ b/curtin/commands/block_meta.py
@@ -455,7 +455,7 @@ def get_path_to_storage_volume(volume, storage_config):
         # Get path to block device for disk. Device_id param should refer
         # to id of device in storage config
         volume_path = None
-        for disk_key in ['wwn', 'serial', 'device_id', 'path']:
+        for disk_key in ['serial', 'device_id', 'path']:
             vol_value = vol.get(disk_key)
             try:
                 if not vol_value:
```

Basically I skip the WWN lookup and go directly to the serial, and with this patch the installation succeeded. Here is output from that successful run (here the internal drives end up as sda/sdb again):

```
start: cmd-install/stage-partitioning/builtin/cmd-block-meta: configuring partition: partition-0
get_path_to_storage_volume for volume disk-sdb({'ptable': 'gpt', 'serial': 'M.2_(S80)_3ME4_YCA11905030450003', 'wwn': '0x502b2a201d1c1b1a', 'path': '/dev/sdb', 'preserve': False, 'name': '', 'grub_device': False, 'type': 'disk', 'id': 'disk-sdb'})
Processing serial M.2_(S80)_3ME4_YCA11905030450003 via udev to M.2_(S80)_3ME4_YCA11905030450003
Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sdb'] with allowed return codes [0] (capture=True)
/dev/sdb is multipath device member? False

[...]

Preparing partition location on disk /dev/sdb
Wiping 1M on /dev/sdb at offset 1048576
Running command ['sgdisk', '--new', '1:2048:2203647', '--typecode=1:ef00', '/dev/sdb'] with allowed return codes [0] (capture=True)
Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sdb'] with allowed return codes [0] (capture=True)
/dev/sdb is multipath device? False
```

and then later, the second disk (now actually operating on the expected disk):
```
start: cmd-install/stage-partitioning/builtin/cmd-block-meta: configuring partition: partition-1
get_path_to_storage_volume for volume disk-sda({'ptable': 'gpt', 'serial': 'M.2_(S80)_3ME4_YCA12009140310160', 'wwn': '0x502b2a201d1c1b1a', 'path': '/dev/sda', 'wipe': 'superblock', 'preserve': False, 'name': '', 'grub_device': False, 'type': 'disk', 'id': 'disk-sda'})
Processing serial M.2_(S80)_3ME4_YCA12009140310160 via udev to M.2_(S80)_3ME4_YCA12009140310160
Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sda'] with allowed return codes [0] (capture=True)
/dev/sda is multipath device member? False

[...]

Preparing partition location on disk /dev/sda
Wiping 1M on /dev/sda at offset 1048576
Running command ['sgdisk', '--new', '1:2048:2203647', '--typecode=1:ef00', '/dev/sda'] with allowed return codes [0] (capture=True)
Running command ['udevadm', 'info', '--query=property', '--export', '/dev/sda'] with allowed return codes [0] (capture=True)
/dev/sda is multipath device? False
```

From what I can tell, at least the following bugs are also the result of duplicate WWNs or serials:
https://bugs.launchpad.net/curtin/+bug/1955511
https://bugs.launchpad.net/curtin/+bug/1929213

It seems to me the proper fix would be to iterate over the disks at some stage after the hardware data has been assembled and mark disks with duplicate keys, for example in an additional field. If this were known, it would be easy to do something like "if disk_key in duplicate_disk_keys, continue" for both WWNs and serials in the curtin code I modified.
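A minimal sketch of that idea, assuming the probed disks are available as a list of dicts with `wwn` and `serial` fields. The helper names `find_duplicate_keys` and `usable_keys` are hypothetical and are not part of curtin:

```python
from collections import Counter

# Values taken from the lsblk output in this report.
PROBED_DISKS = [
    {"path": "/dev/sda", "serial": "YCA12009140310160", "wwn": "0x502b2a201d1c1b1a"},
    {"path": "/dev/sdb", "serial": "YCA11905030450003", "wwn": "0x502b2a201d1c1b1a"},
]


def find_duplicate_keys(disks, keys=("wwn", "serial")):
    """Return {key: set of values that appear on more than one disk}."""
    duplicates = {}
    for key in keys:
        counts = Counter(d[key] for d in disks if d.get(key))
        duplicates[key] = {value for value, n in counts.items() if n > 1}
    return duplicates


def usable_keys(volume, duplicates, order=("wwn", "serial", "device_id", "path")):
    """List the lookup keys for a volume, skipping identifiers known to be shared."""
    return [
        key
        for key in order
        if volume.get(key) and volume[key] not in duplicates.get(key, set())
    ]


DUPLICATES = find_duplicate_keys(PROBED_DISKS)
print(DUPLICATES)
print(usable_keys(PROBED_DISKS[0], DUPLICATES))
```

Here the shared WWN is flagged, so the lookup for each disk would start at the serial. A duplicate serial would be skipped the same way, leaving `path` as the last resort.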

Patrik Lundin (eest) wrote:

Some additional information: curtin seems to look up devices in /dev/disk/by-id/ in its lookup_disk(serial) function: https://github.com/canonical/curtin/blob/b08eecd68cf5f1bccf4255b3d00a77af51c159f7/curtin/block/__init__.py#L921-L924

On a machine where the WWN is duplicated, links will only exist for one of the disks:
```
$ ls -l /dev/disk/by-id/* | grep 502b2a201d1c1b1a
lrwxrwxrwx 1 root root 9 Jan 20 12:49 /dev/disk/by-id/scsi-3502b2a201d1c1b1a -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 20 12:49 /dev/disk/by-id/scsi-3502b2a201d1c1b1a-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 20 12:49 /dev/disk/by-id/scsi-3502b2a201d1c1b1a-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 9 Jan 20 12:49 /dev/disk/by-id/wwn-0x502b2a201d1c1b1a -> ../../sdb
lrwxrwxrwx 1 root root 10 Jan 20 12:49 /dev/disk/by-id/wwn-0x502b2a201d1c1b1a-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 Jan 20 12:49 /dev/disk/by-id/wwn-0x502b2a201d1c1b1a-part2 -> ../../sdb2
```

So the by-id links cannot be used to detect the duplication.
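Per-device properties do still expose the duplicate, as the lsblk output in the description shows. Below is a rough sketch that groups disks by WWN from lsblk's JSON output. The function name `disks_sharing_wwn` is hypothetical, and the sample JSON is hand-written from the values in this report rather than captured from a machine:

```python
import json
from collections import defaultdict

# Hand-written sample in the shape of `lsblk -J -d -o NAME,TYPE,SERIAL,WWN`.
SAMPLE_LSBLK_JSON = """
{"blockdevices": [
  {"name": "sda", "type": "disk", "serial": "YCA12009140310160", "wwn": "0x502b2a201d1c1b1a"},
  {"name": "sdb", "type": "disk", "serial": "YCA11905030450003", "wwn": "0x502b2a201d1c1b1a"},
  {"name": "sr0", "type": "rom", "serial": "AAAABBBBCCCC1", "wwn": null}
]}
"""


def disks_sharing_wwn(lsblk_json):
    """Return {wwn: [device names]} for every WWN reported by more than one disk."""
    by_wwn = defaultdict(list)
    for dev in json.loads(lsblk_json)["blockdevices"]:
        # Devices without a WWN (for example the virtual CD-ROMs) are ignored.
        if dev.get("type") == "disk" and dev.get("wwn"):
            by_wwn[dev["wwn"]].append(dev["name"])
    return {wwn: names for wwn, names in by_wwn.items() if len(names) > 1}


SHARED = disks_sharing_wwn(SAMPLE_LSBLK_JSON)
print(SHARED)
```

On a live system the input would come from running lsblk with `-J`. Whether curtin's own probe data carries the same fields is not something I have checked.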

Aaron Rainbolt (arraybolt3) wrote:

It seems like this might be fixable by having the get_path_to_storage_volume() function check *all* of the IDs it's been given, not just the first one that works. If they all point to the same disk, then it's good. If one or more points to a different disk (i.e., the WWN points to /dev/sdb and the serial points to /dev/sda), then a preferred ID should be used as the correct one (serial in this case). This should be relatively simple to implement (I hope) and at least more accurate than what we have currently.

The tricky question with this solution is which ID should be preferred in each circumstance. Obviously it looks like serial number should be preferred over WWN, but is there something else that should be preferred over even the serial number?

Of course this solution then raises the question of what happens if you have two drives with the same serial number but different WWNs somehow, so perhaps this needs to be combined with a way of storing previously seen drives and using duplicate checking, or something similar.
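As a rough sketch of the cross-check idea: resolve every identifier, report whether they disagree, and return the path from the most preferred one. Everything here is hypothetical, including `resolve_with_cross_check`, the `lookup` callback, and the choice to rank serial above WWN. It is not how curtin resolves volumes today.

```python
# Assumption: serial is trusted over WWN, as suggested above.
PREFERRED_ORDER = ["serial", "wwn", "device_id", "path"]

# Stand-in for the real device lookup, using the values from this report.
LINK_TABLE = {
    ("wwn", "0x502b2a201d1c1b1a"): "/dev/sdb",
    ("serial", "YCA12009140310160"): "/dev/sda",
    ("serial", "YCA11905030450003"): "/dev/sdb",
}


def lookup(key, value):
    """Map one identifier to a device path, or None if it is unknown."""
    if key == "path":
        return value
    return LINK_TABLE.get((key, value))


def resolve_with_cross_check(volume, lookup, order=PREFERRED_ORDER):
    """Return (path, conflict), where conflict is True if the IDs disagree."""
    found = {}
    for key in order:
        value = volume.get(key)
        if value:
            path = lookup(key, value)
            if path:
                found[key] = path
    if not found:
        return None, False
    preferred = next(key for key in order if key in found)
    conflict = len(set(found.values())) > 1
    return found[preferred], conflict


VOLUME_SDA = {"serial": "YCA12009140310160", "wwn": "0x502b2a201d1c1b1a", "path": "/dev/sda"}
VOLUME_SDB = {"serial": "YCA11905030450003", "wwn": "0x502b2a201d1c1b1a", "path": "/dev/sdb"}
print(resolve_with_cross_check(VOLUME_SDA, lookup))
print(resolve_with_cross_check(VOLUME_SDB, lookup))
```

For the first volume the WWN disagrees with the serial and the path, so the conflict is reported and the serial's answer is used. This does not cover the case of two drives sharing a serial, which would still need duplicate tracking.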

Olivier Gayot (ogayot)
Changed in curtin:
status: New → Fix Committed