installing ubuntu on a former md raid volume makes system unusable

Bug #1828558 reported by lvm on 2019-05-10
This bug affects 1 person
Affects                Status        Importance  Assigned to           Milestone
partman-base (Ubuntu)  Fix Released  High        Michael Hudson-Doyle
Bionic                 Fix Released  Undecided   Michael Hudson-Doyle
Disco                  Fix Released  Undecided   Michael Hudson-Doyle

Bug Description

[impact]
Installing Ubuntu on a disk that was previously part of an md RAID volume leads to a system that does not boot (or perhaps does not boot reliably).

[test case]
Create a disk image that has a md RAID 6, metadata 0.90 device on it using the attached "mkraid6" script.

$ sudo mkraid6

Install to it in a VM:

$ kvm -m 2048 -cdrom ~/isos/ubuntu-18.04.2-desktop-amd64.iso -drive file=raid2.img,format=raw

Reboot into the installed system. Check that it boots and that there are no occurrences of linux_raid_member in the output of "sudo wipefs /dev/sda".

SRU team member request: also test other, regular installation scenarios as a sanity check for regressions (comment #10).

[regression potential]
The patch makes a change to a core part of the partitioner. A bug here could crash the installer, rendering it impossible to install. The code is adapted from battle-tested code in wipefs from util-linux and has been somewhat tested before uploading to eoan. The nature of the code makes regressions beyond crashing the installer or failing to do what it's supposed to very unlikely -- it is hard to see how this could result in data loss on a drive not selected to be formatted, for example.

[original description]
Ubuntu 18.04 was installed with the GUI installer in 'Guided - use entire volume' mode on a disk that had previously been used as an md RAID 6 volume. The installer repartitioned the disk and installed the system, and the system then rebooted any number of times without issues. After the packages were upgraded to their current versions and some new packages were installed, including mdadm (which *might* be the culprit), the system would no longer boot: it dropped to the initramfs prompt with a 'gave up waiting for root filesystem device' message. At that point blkid showed the boot disk as a single device with TYPE='linux_raid_member' (/dev/sda) rather than as two partitions for EFI and root (/dev/sda1 and /dev/sda2). I was able to fix this by zeroing the whole disk (dd if=/dev/zero of=/dev/sda bs=4096) and reinstalling. Probably the md superblock is not destroyed when the installer partitions the disk, is not overwritten by the installed files, and somehow takes precedence over the partition table (GPT) during boot.

affects: ubuntu-release-upgrader (Ubuntu) → ubiquity (Ubuntu)
lvm (lvm-royal) wrote :

Zeroing the md superblock in ubiquity, if that's what you are thinking about, will fix this issue in my particular scenario, but what if the disk was partitioned in some other way? I am afraid it is a partial workaround; the proper, albeit more complex, way of handling this issue is to make sure that a properly formatted partition table always takes precedence over leftover superblocks during boot.

tags: added: id-5cd5b7c3c1eeca1c0e6458ce
Michael Hudson-Doyle (mwhudson) wrote :

Having looked into this a bit, I'm a bit surprised (assuming the drive was previously part of a RAID array with metadata version 0.90) that your system ever booted at all. Partman creates a disk label with ped_disk_new_fresh, which arranges for ped_disk_clobber to be called on the disk. This zeroes the first and last 10 KiB of the disk, which wipes out the md RAID superblocks for all other metadata versions, but not for 0.90. So if you use a device in an md RAID array with 0.90 metadata and then install to it with ubiquity, you get a disk that has both a partition table and RAID metadata.
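
To illustrate why a 0.90 superblock survives a wipe of the first and last 10 KiB: in the Linux md on-disk format, the 0.90 superblock starts at the device size rounded down to a 64 KiB boundary, minus 64 KiB. The following is a minimal Python sketch of that arithmetic, not code from partman or parted. The function names are made up for illustration, and the 10 GiB image size is an assumption (it happens to reproduce the 0x27fff0000 offset that wipefs reports for raid2.img later in this report).

```python
MD_RESERVED_BYTES = 64 * 1024  # alignment and reserved size of the 0.90 superblock


def md090_superblock_offset(device_size: int) -> int:
    """Byte offset at which a 0.90 md superblock starts on a device of this size."""
    offset = (device_size & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES
    if offset < 0:
        raise ValueError("device too small to hold a 0.90 superblock")
    return offset


def survives_edge_wipe(device_size: int, wiped_bytes: int = 10 * 1024) -> bool:
    """True if the superblock starts outside the first and last wiped_bytes."""
    offset = md090_superblock_offset(device_size)
    return wiped_bytes <= offset < device_size - wiped_bytes


size = 10 * 1024**3  # assumed image size: 10 GiB
print(hex(md090_superblock_offset(size)))  # 0x27fff0000
print(survives_edge_wipe(size))            # True
```

Because the superblock always starts at least 64 KiB before the end of the device, zeroing only the last 10 KiB cannot reach it.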

udev gets its information about block devices from libblkid and this, in general, seems to check for raid metadata before it checks for a partition table:

mwhudson@ringil:~/images$ blkid --probe raid1.img
raid1.img: VERSION="0.90.0" UUID="c9d611d5-1d1e-839b-14d5-894fb9296617" TYPE="linux_raid_member" USAGE="raid"
mwhudson@ringil:~/images$ blkid --probe --usage filesystem raid1.img
raid1.img: PTUUID="c5e0e910" PTTYPE="dos"
mwhudson@ringil:~/images$ blkid --probe --usage raid raid1.img
raid1.img: VERSION="0.90.0" UUID="c9d611d5-1d1e-839b-14d5-894fb9296617" TYPE="linux_raid_member" USAGE="raid"

Watching udev monitor while I attach a block device like this does show device nodes for the partitions appearing very briefly, so it's possible that your first reboots somehow won the race and managed to mount the partition before the device node went away again -- but in my testing the node for the partition is only present for a few milliseconds. I guess it might be there for longer during the busy environment of early boot. (When I tried to recreate your setup in a VM, the installed system didn't boot even once)

ANYWAY, the fix for this is clearly for the install to clear the md metadata somehow. I think one could make an argument that parted should do this, but there might be some subtle reasons I don't know about that would make this a bad idea. The more unsubtle approach would be to jam a call to "mdadm --zero-superblock" in somewhere.

As for your followup comment:

> Zeroing the md superblock in ubiquity, if that's what you are thinking about, will fix this
> issue in my particular scenario, but what if the disk was partitioned in some other way?

I actually think inserting a --zero-superblock call is more or less safe, because all the other superblocks I know about will be wiped by the wiping parted already does (there are probably some obscure ones that will not be). We could insert wipefs -a instead to get all the superblocks libblkid (and hence udev!) knows about. Or of course we could zero the entire device, but that's likely to be unacceptably slow.
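
As a rough illustration of what erasing the signature means in the 0.90 case, the Python sketch below looks for the 0.90 magic number (0xa92b4efc, stored in host byte order, so both byte orders are checked) at the expected offset and zeroes just those four bytes. This is a simplified sketch, not the code of wipefs or mdadm; the function names are hypothetical, and it handles only the 0.90 format.

```python
import os
import struct

MD_RESERVED_BYTES = 64 * 1024  # alignment and reserved size of the 0.90 superblock
MD_SB_MAGIC = 0xA92B4EFC       # 0.90 superblock magic number


def _find_md090_magic(f) -> int:
    """Return the superblock offset if the 0.90 magic is present, else -1."""
    size = f.seek(0, os.SEEK_END)  # seek returns the new position, i.e. the size
    offset = (size & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES
    if offset < 0:
        return -1
    f.seek(offset)
    raw = f.read(4)
    if len(raw) != 4:
        return -1
    (little,) = struct.unpack("<I", raw)
    (big,) = struct.unpack(">I", raw)
    return offset if MD_SB_MAGIC in (little, big) else -1


def has_md090_signature(path: str) -> bool:
    """True if the file or device at path carries a 0.90 md magic number."""
    with open(path, "rb") as f:
        return _find_md090_magic(f) >= 0


def wipe_md090_signature(path: str) -> bool:
    """Zero the 0.90 magic number if present; return True if anything was wiped."""
    with open(path, "r+b") as f:
        offset = _find_md090_magic(f)
        if offset < 0:
            return False
        f.seek(offset)
        f.write(b"\x00" * 4)
    return True
```

Only try this on a scratch image such as the raid2.img from the test case; writing to a real device this way is destructive.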

> I am
> afraid it is a partial workaround; the proper, albeit more complex, way of handling this issue
> is to make sure that a properly formatted partition table always takes precedence over leftover
> superblocks during boot.

I don't think there really is a way of choosing which of the superblocks on a device you want to respect. I suppose in theory one could be added, but this gets way further into kernel uevent/udev land than I am confident even speculating about.

affects: ubiquity (Ubuntu) → parted (Ubuntu)
Changed in parted (Ubuntu):
status: New → Triaged
importance: Undecided → High
assignee: nobody → Michael Hudson-Doyle (mwhudson)
affects: parted (Ubuntu) → partman-base (Ubuntu)
Changed in partman-base (Ubuntu):
assignee: Michael Hudson-Doyle (mwhudson) → nobody
Michael Hudson-Doyle (mwhudson) wrote :

parted upstream NACKed fixing it there, so here's an attempt to do it in partman-base instead: https://code.launchpad.net/~mwhudson/ubuntu/+source/partman-base/+git/partman-base/+ref/wipey -- currently testing this.

Changed in partman-base (Ubuntu):
assignee: nobody → Michael Hudson-Doyle (mwhudson)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package partman-base - 206ubuntu4

---------------
partman-base (206ubuntu4) eoan; urgency=medium

  * parted_server.c: Wipe all known superblocks from device in
    command_new_label. (LP: #1828558)

 -- Michael Hudson-Doyle <email address hidden> Fri, 02 Aug 2019 12:31:05 +1200

Changed in partman-base (Ubuntu):
status: Triaged → Fix Released
Michael Hudson-Doyle (mwhudson) wrote :
description: updated
Timo Aaltonen (tjaalton) wrote :

I don't think disco will have a new image built anymore, so an sru there seems a bit pointless?

Dimitri John Ledkov (xnox) wrote :

But netbook installs fetch the partman-base udeb from the archive over the network from -updates on the fly.

Dimitri John Ledkov (xnox) wrote :

*netboot

Michael Hudson-Doyle (mwhudson) wrote :

Also just the usual thing of fixing a bug in all releases newer than the one we really care about (which is bionic in this case)

Changed in partman-base (Ubuntu Bionic):
status: New → In Progress
Changed in partman-base (Ubuntu Disco):
status: New → In Progress
Changed in partman-base (Ubuntu Bionic):
assignee: nobody → Michael Hudson-Doyle (mwhudson)
Changed in partman-base (Ubuntu Disco):
assignee: nobody → Michael Hudson-Doyle (mwhudson)
Łukasz Zemczak (sil2100) wrote :

Risky, but seems okay. Would like to see testing of some other, more normal scenarios as part of verification too: installing on a regular disk, possibly having like two different disks and installing on one (making sure no data loss happened on the other?). At least for bionic.

description: updated
Changed in partman-base (Ubuntu Disco):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-disco

Hello lvm, or anyone else affected,

Accepted partman-base into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/partman-base/206ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-disco to verification-done-disco. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-disco. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in partman-base (Ubuntu Bionic):
status: In Progress → Fix Committed
tags: added: verification-needed-bionic
Łukasz Zemczak (sil2100) wrote :

Hello lvm, or anyone else affected,

Accepted partman-base into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/partman-base/192ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Michael Hudson-Doyle (mwhudson) wrote :

Turns out my fix is wrong: it clears the superblocks in NEW_LABEL, but the disk should not be touched until COMMIT.
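
The distinction between the two commands can be modelled as follows. This is a hypothetical Python sketch, not the parted_server.c code: the class and method names are invented, and real disk writes are replaced by entries in a list. It shows destructive steps being queued when a new label is requested and only carried out on commit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StagedDisk:
    """Hypothetical stand-in for a partitioner's per-disk state."""

    name: str
    pending: List[str] = field(default_factory=list)  # planned, not yet applied
    written: List[str] = field(default_factory=list)  # stands in for real writes

    def new_label(self, label_type: str) -> None:
        # Record the intent only; nothing touches the disk at this point.
        self.pending = [
            f"wipe known signatures on {self.name}",
            f"write {label_type} label on {self.name}",
        ]

    def discard(self) -> None:
        # The user backed out before committing, so the disk stays untouched.
        self.pending.clear()

    def commit(self) -> None:
        # Only here are the destructive steps actually carried out.
        self.written.extend(self.pending)
        self.pending.clear()


disk = StagedDisk("sda")
disk.new_label("gpt")
print(disk.written)  # []
disk.commit()
print(disk.written)  # ['wipe known signatures on sda', 'write gpt label on sda']
```

With this ordering, a user who selects a new label and then cancels leaves the old contents of the disk intact.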

tags: added: verification-failed verification-failed-bionic verification-failed-disco
removed: verification-needed verification-needed-bionic verification-needed-disco
Changed in partman-base (Ubuntu):
status: Fix Released → In Progress
Changed in partman-base (Ubuntu):
status: In Progress → Fix Released
Brian Murray (brian-murray) wrote :

Hello lvm, or anyone else affected,

Accepted partman-base into disco-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/partman-base/206ubuntu1.2 in a few hours, and then in the -proposed repository.

tags: added: verification-needed verification-needed-disco
removed: verification-failed verification-failed-disco
tags: added: verification-needed-bionic
removed: verification-failed-bionic
Brian Murray (brian-murray) wrote :

Hello lvm, or anyone else affected,

Accepted partman-base into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/partman-base/192ubuntu1.2 in a few hours, and then in the -proposed repository.

I have verified the fix for bionic and disco by doing netinstalls with apt-setup/proposed=true on the kernel command line, using disk images on which a RAID 6 array with 0.90 metadata had been created, and watching the output of wipefs $image outside of the VM. Immediately before selecting the "write changes to disk" action in partman:

mwhudson@ringil:~/tmp/netinstall/bionic$ wipefs raid2.img
DEVICE    OFFSET      TYPE              UUID                                 LABEL
raid2.img 0x27fff0000 linux_raid_member 2ff715a9-0173-2508-14d5-894fb9296617
raid2.img 0x1fe       dos

And immediately after:

mwhudson@ringil:~/tmp/netinstall/bionic$ wipefs raid2.img
DEVICE    OFFSET TYPE UUID LABEL
raid2.img 0x1fe  dos

I also checked the installed disk booted and the wipefs output inside the VM.

And the same for disco:

mwhudson@ringil:~/tmp/netinstall/disco$ wipefs raid3.img
DEVICE    OFFSET      TYPE              UUID                                 LABEL
raid3.img 0x27fff0000 linux_raid_member 2ff715a9-0173-2508-14d5-894fb9296617
mwhudson@ringil:~/tmp/netinstall/disco$ wipefs raid3.img
DEVICE    OFFSET TYPE UUID LABEL
raid3.img 0x1fe  dos
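
The check for leftover RAID signatures can also be scripted. The Python sketch below assumes the column layout shown in the wipefs output above (a header row containing TYPE, followed by whitespace-separated rows); the function name is hypothetical.

```python
from typing import List


def signature_types(wipefs_output: str) -> List[str]:
    """Return the TYPE column of each signature row in wipefs table output."""
    lines = [line for line in wipefs_output.splitlines() if line.strip()]
    if not lines:
        return []
    type_col = lines[0].split().index("TYPE")
    types = []
    for line in lines[1:]:
        fields = line.split()
        if len(fields) > type_col:
            types.append(fields[type_col])
    return types


before = """DEVICE OFFSET TYPE UUID LABEL
raid2.img 0x27fff0000 linux_raid_member 2ff715a9-0173-2508-14d5-894fb9296617
raid2.img 0x1fe dos"""
after = """DEVICE OFFSET TYPE UUID LABEL
raid2.img 0x1fe dos"""

print("linux_raid_member" in signature_types(before))  # True
print("linux_raid_member" in signature_types(after))   # False
```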

tags: added: verification-done verification-done-bionic verification-done-disco
removed: verification-needed verification-needed-bionic verification-needed-disco

I've now done the testing asked for in comment #10, verifying that additional disks are not touched by installing to one.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package partman-base - 206ubuntu1.2

---------------
partman-base (206ubuntu1.2) disco; urgency=medium

  * Move superblock wiping code from command_new_label to command_commit, as
    the disk is not supposed to be written to until the latter is called.

partman-base (206ubuntu1.1) disco; urgency=medium

  * parted_server.c: Wipe all known superblocks from device in
    command_new_label. (LP: #1828558)

 -- Michael Hudson-Doyle <email address hidden> Tue, 06 Aug 2019 12:04:16 +1200

Changed in partman-base (Ubuntu Disco):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for partman-base has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package partman-base - 192ubuntu1.2

---------------
partman-base (192ubuntu1.2) bionic; urgency=medium

  * Move superblock wiping code from command_new_label to command_commit, as
    the disk is not supposed to be written to until the latter is called.

partman-base (192ubuntu1.1) bionic; urgency=medium

  * parted_server.c: Wipe all known superblocks from device in
    command_new_label. (LP: #1828558)

 -- Michael Hudson-Doyle <email address hidden> Tue, 06 Aug 2019 12:04:16 +1200

Changed in partman-base (Ubuntu Bionic):
status: Fix Committed → Fix Released