Non-selected RAID Array Corruption from Installer

Bug #568183 reported by Alexander Pirdy
This bug report is a duplicate of: Bug #191119: Installer corrupts raid drive.
This bug affects 3 people
Affects: Ubuntu
Status: New
Importance: Undecided
Assigned to: Unassigned
Nominated for Lucid by andrew.dunn

Bug Description

In short, this isn't a bug I found myself but one that came up on a mailing list, so anyone who wants to read the original post should go through the archives of the "Ubuntu user technical support, not for general discussions" <email address hidden> list. The poster indicated that he had no intention of filing a bug report despite having a large reason to. This needs to be addressed, or at least a warning given, before the Lucid release date, or there could be many angry people, especially those with servers!

I am really unable to fully test this, but there is a slim chance that it is a duplicate of bug #191119 or #369635, and my apologies if it is some other duplicate. Also, I have no idea whether this is present on past versions of Ubuntu (though I find it hard to believe it might be).

Regardless here is the verbatim post:

Title: DANGER!!! Problems with 10.04 installer (RAID devices *will* get corrupted)

Long story short: the only way to be safe right now is to physically
remove drives with important data during the install.

I figured out the cause of my RAID problems, and it's a problem with
ubuntu's installer. This will cost people their data if not fixed.
Sorry about the length of this post, but the problem takes a while to
explain.

The following scenario is not the only way your partitions can get
hosed. I simply use it because it's a common use case, it illustrates
what data is where on the hard drives, and it exposes the flaws in the
installer's logic. It also doesn't matter if you don't touch a
particular drive, partition, or file system during the install. The
data on it can still be corrupted.

Suppose you have a hard drive with some partitions on it. On one of
those partitions you have a linux file system which houses your data.
We'll say for the sake of this discussion that sda2 contains an EXT4
file system with your data. So far, so good.

Because this data is too important to rely on a single drive, you decide
to buy some more drives and make a RAID 5 device. You buy 3 more drives
and create similar partitions on them (say, sdb2, sdc2, and sdd2). You
copy the data currently on sda2 somewhere safe, then you use mdadm to
create a RAID5 array with sda2, sdb2, sdc2, and sdd2. The new RAID
device is md0. You create an XFS file system on md0 and move your data
to it*. This is all perfectly fine, but the stage has been set for
disaster with the ubuntu installer.
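
For reference, the scenario above corresponds to a command sequence along these lines (device names as in the post; mount points and backup paths are illustrative, not taken from the report):

    # Copy the existing data somewhere safe, then build the 4-member RAID 5 array.
    cp -a /mnt/olddata/. /mnt/backup/
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          /dev/sda2 /dev/sdb2 /dev/sdc2 /dev/sdd2
    mkfs.xfs /dev/md0                # new XFS file system on the array
    mount /dev/md0 /mnt/data
    cp -a /mnt/backup/. /mnt/data/   # move the data onto the array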

Later, you decide to do a clean install of ubuntu on sda1 (sda1 is *not*
part of the RAID array), and you get to the partitioning stage and
select manual partitioning. This is where things get really ugly really
fast.

The bug is how the installer detects existing file systems. It simply
reads the raw data in a partition to see if the bits it finds correspond
to a known file system. In the above example, the installer detects the
remnants of the original (non-RAID) file system on sda2 and thinks it's
a current EXT4 file system. Even if you use fdisk to mark sda2's
partition type as 'RAID autodetect' instead of 'linux' (which is no
longer necessary), the installer still detects the partition as having
an EXT4 file system.
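
This mixed state is easy to see with low-level probing tools; the commands below are an illustration of what is sitting on the partition, not anything the installer actually runs:

    # Both the stale ext4 signature and the md metadata can be reported,
    # because the old ext4 superblock was never wiped when the partition
    # became an md member.
    blkid -p /dev/sda2        # low-level probe of on-disk signatures
    wipefs /dev/sda2          # lists signatures; without --all it erases nothing
    mdadm --examine /dev/sda2 # the md superblock is visible even when the
                              # array is not assembled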

Once this 'ghost' file system is detected, the installer gets really
confused about what goes where and will try to write to sda2 during the
install, even if you told the installer to ignore sda2 and just install
to sda1. This corrupts the current XFS file system on md0, and you're
screwed.

The overall flaw here is in the file system detection; you can't just
assume that any sequence of bits you find sitting around on a hard drive
are still current.

A possible solution may be to first check for a RAID superblock, and if
found that trumps all file system detection. I imagine something
similar will have to be done with partitions that are part of an LVM
volume as well.
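
A rough sketch of that precedence rule, assuming mdadm and the LVM tools are available; the function name and logic are hypothetical, not anything the installer currently does:

    # Only trust a detected file system if the partition is neither an md
    # member nor an LVM physical volume.
    is_plain_filesystem() {
        local dev="$1"
        if mdadm --examine "$dev" >/dev/null 2>&1; then
            echo "$dev carries an md RAID superblock; leave it alone" >&2
            return 1
        fi
        if pvs "$dev" >/dev/null 2>&1; then
            echo "$dev is an LVM physical volume; leave it alone" >&2
            return 1
        fi
        return 0   # safe to treat whatever blkid reports as a real file system
    }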

-Alvin

* In my case, I took a shortcut and created a degraded array (missing
sda2), copied the data from sda2 to the array, added sda2 to the array,
and resynched. I don't think it makes a difference.
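
That shortcut corresponds roughly to the following (device names from the post; everything else illustrative):

    # Create the array with sda2 marked "missing", copy the data over,
    # then add sda2 and let the array resync.
    mdadm --create /dev/md0 --level=5 --raid-devices=4 \
          missing /dev/sdb2 /dev/sdc2 /dev/sdd2
    # ... mkfs.xfs /dev/md0, mount it, copy the data off sda2 ...
    mdadm --add /dev/md0 /dev/sda2
    cat /proc/mdstat                 # watch the resync progress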

Revision history for this message
Rashkae (rashkae) wrote :

This bug is not a duplicate of 191119. 191119 is a BIOS problem. This bug is all about Linux software RAID with no fake RAID involvement. How do you unmark the duplicate?

Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

Yes, this is a dup of 191119. Just because his hardware RAID is reporting the problem, it doesn't follow that the hardware caused the problem.

Revision history for this message
Karl Larsen (klarsen1) wrote :

This IS a loader-RAID5 problem. It appears to be on a server version of 10.04 and the writer wonders if earlier versions may also have this bug.

It appears it is only a bug for users who have RAID5 hard drives and perhaps only his RAID5. He mentioned that he had just gone to RAID5 and was not sure everything was working right yet.

Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

I have no idea what you mean by 'loader-RAID5'...

It likely doesn't matter what type of RAID is being used. I also tested on the live CD with the same problem. mdadm isn't present on the live CD and the array wasn't assembled. Once again, the big problem is that the installer writes directly to devices with a RAID superblock. And also once again, the broader problem is the way the installer detects file systems: the installer assumes that any bits it finds lying around on a hard drive are still valid, when in fact they could be left over from previous configurations.
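
For anyone wanting to check this from a live session, the superblocks can be listed without assembling anything; installing mdadm into the live environment first is an assumption here, not something the report describes:

    sudo apt-get install mdadm              # not on the live CD by default
    sudo mdadm --examine --scan --verbose   # lists arrays and member devices
                                            # found purely from superblocks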

description: updated
Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

Please don't remove the duplicate tag. There are 3 parts to every bug report:

1. an initial set of conditions
2. a procedure that was followed
3. actual results which differ from expected results

The reporter started with similar conditions (a RAID array that was unused during the install), followed the same procedure (installed to a device that was not part of the RAID array), and received the same results (a corrupted RAID array because the install wrote directly to a member of the RAID array). Unless you can produce any shred of evidence that can differentiate any of the 3 items above, they are the same bug report.

Revision history for this message
Jeff Lane  (bladernr) wrote :

Attempted to recreate this failure and was unable to do so.

Following Alvin's description as closely as I could, I had a system running Karmic with a root partition and a second data partition. I moved all data from the data partition to a safe location. I added two disks and, using mdadm, created a 3-disk software RAID 5 array, mounted that, and moved my data onto the newly created array.

I rebooted to verify the array was intact after a reboot and autostart, and verified that the test data was still present.
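
Checks of the kind described would look something like this (device and mount names are assumptions):

    cat /proc/mdstat                           # array came back after the reboot
    mdadm --detail /dev/md0                    # state, members, and sync status
    mount /dev/md0 /mnt/data && ls /mnt/data   # test data still readable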

I then did a Lucid install to this system, installing to the first partition on sda (again, recreating Alvin's description step by step).
Of note, partman did NOT indicate that my existing RAID partitions were ext4; in fact, it showed nothing for their type.

The Lucid install completed successfully, and afterwards I was able to rebuild the array, configure my new Lucid install to automount it, and verify that my original data (which lived on the array throughout this new install process) was still there and viable.
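
One common way to re-assemble and persist such an array on a fresh install is sketched below; the exact steps are an assumption, not a description of what was done here:

    mdadm --assemble --scan                          # find arrays from superblocks
    mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # record the array
    echo '/dev/md0 /mnt/data xfs defaults 0 2' >> /etc/fstab
    mount -a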

The only thing I saw of note was that, on rebuilding the array after the Lucid install, mdadm showed the array started in a degraded state with the first two partitions as active, and the third as a spare in a rebuilding state. Once the rebuild was complete, everything seemed just fine.

So unable to confirm, and unable to recreate on my end.

Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

Thanks for taking the time to help investigate this. Check out my reply to you in the mailing list thread for suggestions on recreating the bug. Number 3 especially is required.

Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

BTW, please post to bug #191119 instead. Subscribers to this bug will be notified if you post there, but not vice-versa.

Revision history for this message
Oliver Grawert (ogra) wrote :

possibly related to bug 569900

Revision history for this message
Colin Watson (cjwatson) wrote :

This might well be bug 542210, recently fixed. If so, then it is certainly not bug 191119 since bug 542210 was introduced much more recently than that.

Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

Yup, that bug is likely the cause of the corruption of the RAID partitions. If I had to guess, the NEW_LABEL command was being used on the incorrectly detected file systems, thus overwriting RAID data. I have no idea how that command works, so please correct me if that can't be the case.

However, an underlying issue with this bug (and bug 191119) is the installer incorrectly detecting those defunct file systems on the devices comprising the RAID array. Because of that, I'm hesitant to call it a dup, but I don't care so long as the bugs get fixed.

As for bug 191119, are you saying the NEW_LABEL command wasn't used on earlier versions of the installer?

Revision history for this message
Colin Watson (cjwatson) wrote :

NEW_LABEL was used, but the change that caused this to always zero the first 9KiB of the disk was a change in parted 2.1, committed upstream on 13 November 2009 and introduced to Ubuntu on 26 February 2010.
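
Until a fixed installer is in hand, one cheap defensive step (purely a suggestion, not anything recommended in this thread) is to snapshot the first few KiB of each array member before running an installer, so any clobbered region can at least be identified afterwards:

    for dev in /dev/sd[abcd]2; do
        dd if="$dev" of="/root/$(basename "$dev")-head.img" bs=1K count=16
    done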

Revision history for this message
Alvin Thompson (alvint-deactivatedaccount) wrote :

Thanks for the answer, but now that I think about it, I'm not sure NEW_LABEL alone would cause wholesale RAID array corruption. Correct me if I'm wrong because I'm not an expert on how parted works, but that bug with NEW_LABEL would only overwrite the RAID superblock and not the data? In which case, the RAID array would just start up in degraded mode and have to be re-synced. In my case, and in the original bug report, and in NickJ's case, the array was not degraded, but rather the data on the array was corrupted.

Even supposing the NEW_LABEL command did cause the corruption, while it may have been an upstream bug that physically wrote to the disk and caused that problem, the root of all evil is still the installer that incorrectly told parted there was a file system present on a device that's a component of the RAID array. Are you saying that parted is also responsible for detecting which file systems are present as well, and the installer only reports it? In any case it needs to be fixed, because only bad things can happen if the installer doesn't know what file systems are actually present.

In any case, there need to be basic safeguards in place that prevent the installer from writing directly to partitions that have a RAID superblock, have a "RAID autodetect" partition type, or are part of a logical volume. I guess you could argue that this would rather be an enhancement, in the same way you could argue that a car without brakes is not defective, but that brakes would be a nice-to-have enhancement.
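
A user-level version of those safeguards might look like the loop below; the device list is an assumption for illustration:

    for dev in /dev/sd[abcd]2; do
        mdadm --examine "$dev" >/dev/null 2>&1 && echo "$dev: md superblock present"
        pvs "$dev" >/dev/null 2>&1 && echo "$dev: LVM physical volume"
    done
    fdisk -l /dev/sda    # the Id column shows 'fd' for Linux RAID autodetect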
