I'm adding this message from the thread to provide as much information as possible for people trying to recreate this bug; top-posting for (my) convenience on this one...

It's nice to see someone actually trying to recreate the bug instead of just flapping their gums trying to sound smart. If you have the time, could you try recreating it again? I have some suggestions to make it more like my scenario (in fact, number 3 below is required for the problem to occur, and number 4 is likely to be required). I know these suggestions are long, but it would be appreciated. In for a penny, in for a pound, eh?

You'll have to do the first install again, but you won't have to actually go through with the second install; you can cancel after getting past the partitioning screen. I've noticed that when things go awry there are two tell-tale signs:

1. On the partitioning screen, the original (now defunct) file system(s) will be detected and show up.

2. Once you select "finish partitioning", the installer will show a list of partition tables that will be modified. One or more RAID partitions will show up on that list, regardless of the fact that you didn't select them for anything.

If those signs are present, the RAID array will be hosed. If they are not, the install will go fine and there's no need to continue. Additionally, you don't have to worry about test data or even mounting the RAID array.

When doing the second install, if you post exactly which file systems were detected on the manual partitioning screen, and which partitions were shown on the "to be modified" list once you hit "finish partitioning", I'd appreciate it.
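By the way, if you want to double-check from a running system whether the array actually got hosed (beyond those two installer signs), something like the following should do it. This is just a sketch and assumes the array is md1, as in my layout:

    cat /proc/mdstat           # overall array state and any resync progress
    mdadm --detail /dev/md1    # per-device status (clean/degraded/failed, etc.)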
Now on to the suggestions:

1. It sounds like in your setup you installed for the second time after setting up the RAID array, but before the array had finished resyncing for the first time. In my setup, the array had been around for a while and was fully resynced. In fact, I (likely) waited for the array to be fully resynced before even installing XFS on it. If you *did* wait for the drives to finish resyncing before the second install, please RSVP, because that would mean your array was indeed corrupted, but since it was only one drive the array was somehow able to resync and recover.

2. I was using the Ubuntu 10.04 beta 2 server 64-bit disk for the (second) install when things went south. Could you try that one?

3. REQUIRED. It sounds like when doing the second install, you just installed to an existing partition. In order for the problem to occur, you have to remove/create a partition (even though you're leaving the RAID partitions alone). If you recreate the partitions I used (number 6, below), this will be taken care of.

4. POSSIBLY REQUIRED. When you create the RAID array with the default options, as you did in your first test, the array is created in degraded mode, with a drive added later and resynced; this makes the initial sync faster. Since I'm a neurotic perfectionist, I always create my arrays with the much more manly and macho "--force" option to create them "properly" (see the first sketch after this list). It's very possible that doing the initial resync with a degraded array will overwrite the defunct file system, while doing the initial sync in the more "proper" way will not. Please use the "--force" option when creating the array to take this possibility into account.

5. My array has 4 drives. If possible, could you scare up a fourth drive? If not, don't worry about it, especially if you do number 7.

6. Prior to/during the infamous install, my partitions were as follows. If feasible, recreating as much of this as possible would be appreciated:

       sd[abcd]1  25GB
       sd[abcd]2  475GB

   My RAID5 array was sd[abcd]2, set up as md1, and my file systems were:

       sda1:      ext4  /
       md1:       xfs   /data
       sd[bcd]1:  (partitioned, but not used)

   Note I had no swap partition originally. On the manual partitioning screen of the ill-fated install, I left the sd[abcd]2 partitions alone (RAID array), deleted all the sd[abcd]1 partitions, and created the following partitions:

       sd[abcd]1:   22GB  RAID5  md2  /
       sd[abcd]3*:  3GB   RAID5  md3  (swap)

   * POSSIBLY REQUIRED: Note that partition 2 of each drive's partition table was already taken by the RAID, so I created the 3GB partitions as partition 3, even though the sectors of partition 3 physically resided before the sectors of partition 2. This is perfectly "legal", if not "normal" (fdisk will warn you about it). Please try to recreate this condition if you can, because it's very possible that it was the source of the problems. BTW, all partitions were primary partitions. If you don't have that much space, you can likely get away with making sd[abcd]2 as small as needed.

7. I simplified when recreating this bug, but in my original scenario I had 2 defunct file systems detected by the installer: one on sda2 and one on sdd2 (both ext4). That's why I couldn't just fail and remove the corrupted drive, even if I had known to do so at that point. I figure the more defunct file systems there are, the more chances you have of recreating the bug. So how about creating file systems on all four partitions (sd[abcd]2) before creating the RAID array? (There's a sketch of this after the list as well.)

8. My original setup left the RAID partitions' type as "Linux" instead of "RAID autodetect". It's no longer necessary to set the partition type for RAID members, as the presence of the RAID superblock is enough (see below). When recreating the problem I did set the type to "RAID autodetect", but to be thorough, try leaving the type as "Linux".

9. If you *really* have too much time on your hands, my original Ubuntu install, used for creating the original file systems, was 8.10 desktop 64-bit. I created the non-RAID file systems during the install and the RAID array after the install, after apt-getting mdadm. I seriously doubt this makes a difference, though.

10. I was using an external USB DVD-ROM drive to do the install. It's very remotely possible that, since the drive has to be re-detected during the install process, it could wind up reshuffling the device letters. If you have an external CD or DVD drive, could you try installing with it?
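Regarding number 4: here, roughly, is the difference I mean. Treat these as a sketch from memory rather than my literal command history; the device names match my layout (the four sd[abcd]2 partitions as md1):

    # Default creation: mdadm builds the RAID5 array degraded, treating the
    # last device as a spare and recovering onto it (faster initial sync):
    mdadm --create /dev/md1 --level=5 --raid-devices=4 /dev/sd[abcd]2

    # The way I create mine, forcing a full, non-degraded initial resync:
    mdadm --create /dev/md1 --level=5 --raid-devices=4 --force /dev/sd[abcd]2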
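For number 7, something along these lines would set up the extra defunct file systems before the array is created. Again, just a sketch, and obviously only on a test box, since it wipes the partitions:

    # Put a throwaway ext4 file system on each future RAID member first...
    for part in /dev/sd[abcd]2; do mkfs.ext4 "$part"; done

    # ...then create the array right on top of them (mdadm will notice the
    # old file systems and ask for confirmation before proceeding):
    mdadm --create /dev/md1 --level=5 --raid-devices=4 --force /dev/sd[abcd]2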
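On number 8: leaving the type alone just means not changing it from type 83 ("Linux") to fd ("Linux raid autodetect") in fdisk. If you want to confirm that mdadm still recognizes the members purely by their superblocks, something like this should show it (sda2 is just one example member from my layout):

    # Prints the RAID superblock (array UUID, level, member role) if present:
    mdadm --examine /dev/sda2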
If you (or anybody) can try recreating the problem with this new information, I'd very much appreciate it.

Thanks,
Alvin

On 04/23/2010 02:22 PM, J wrote:
> FWIW, this is what I just went through, step by step, to try to
> recreate a loss of data on an existing software raid array:
>
> 1: Installed a fresh Karmic system on a single disk with three partitions:
> /dev/sda1 = /
> /dev/sda2 = /data
> /dev/sda3 = swap
>
> all were primary partitions.
>
> 2: After installing 9.10, I created some test "important data" by
> copying the contents of /etc into /data.
> 3: For science, rebooted and verified that /data automounted and the
> "important data" was still there.
> 4: Shut the system down and added two disks. Rebooted the system.
> 5: Moved the contents of /data to /home/myuser/holding/
> 6: created partitions on /dev/sdb and /dev/sdc (the two new disks, one
> partition each)
> 7: installed mdadm and xfsprogs, xfsdump
> 8: created /dev/md0 with mdadm using /dev/sda2, /dev/sdb1 and
> /dev/sdc1 in a RAID5 array
> 9: formatted the new raid device as xfs
> 10: configured mdadm.conf and fstab to start and automount the new
> array to /data at boot time.
> 11: mounted /data (my new RAID5 array) and moved the contents of
> /home/myuser/holding to /data (essentially moving the "important data"
> that used to reside on /dev/sda2 to the new R5 ARRAY).
> 12: rebooted the system and verified that A: RAID started, B: /data
> (md0) mounted, and C: my data was there.
> 13: rebooted the system using Lucid
> 14: installed Lucid, choosing manual partitioning as you described.
> **Note: the partitioner showed all partitions, but did NOT show the
> RAID partitions as ext4
> 15: configured the partitioner so that / was installed to /dev/sda1
> and the original swap partition was used. DID NOT DO ANYTHING with the
> RAID partitions.
> 16: installed. Installer only showed formatting /dev/sda1 as ext4,
> just as I'd specified.
> 17: booted newly installed Lucid system.
> 18: checked with fdisk -l and saw that all RAID partitions showed as
> "Linux raid autodetect"
> 19: mdadm.conf was autoconfigured and showed md0 present.
> 20: edited fstab to add the md0 entry again so it would mount to /data
> 21: did an mdadm --assemble --scan and waited for the array to rebuild
> 22: after rebuild/re-assembly was complete, mounted /data (md0)
> 23: verified that all the "important data" was still there, in my
> array, on my newly installed Lucid system.
>
> The only thing I noticed was that when I did the assembly, it started
> degraded with sda2 and sdb1 as active and sdc1 marked as a spare with
> rebuilding in progress.
>
> Once the rebuild was done was when I mounted the array and verified my
> data was still present.
>
> So... what did I miss in recreating this failure?
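P.S. For anyone else walking through J's steps above, the verification pass at the end (steps 19-22) boils down to roughly the following; the bare "mount /data" assumes the fstab entry from step 20 is in place:

    mdadm --assemble --scan    # assemble the arrays listed in mdadm.conf
    cat /proc/mdstat           # watch for the rebuild/resync to finish
    mount /data                # then check that the "important data" survived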