Install -- RAID setup cannot see all of my RAID partitions

Bug #22301 reported by Timothy Miller
This bug affects 3 people
Affects: partman-md (Ubuntu)
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Using the daily build from 2005-09-21.

I set up two RAID partitions (physical volume) on one large drive and one RAID
partition on each of two smaller drives. Then I went to bolt them together in
the RAID setup. The first time, it only saw one of my RAID partitions. I quit
back to the partitioner and then tried again, and now it only sees three of the
four.

Earlier, I was seeing some sort of error reported where the partitioner could
not inform the kernel of the partition changes. I'm not getting that error this
time, but perhaps it's related?

Revision history for this message
Timothy Miller (theosib) wrote :

Also note that if I tell it to set up software RAID and select two of the
partitions, they don't get recorded. That is, when I go back to the
partitioner, it has no listing of any md volumes. I had done a setup before,
where I only had three RAID partitions, and I was able to set up RAID. But now
that I have four, it's gotten very confused.

Rebooting seems to help with some of these partitioning problems, but not this time.

BTW, I'm trying to set up RAID0, but not a single RAID0 across all four
partitions; rather, two RAID0s of two.

Revision history for this message
Timothy Miller (theosib) wrote :

This time, I decided that I would create the first two RAID partitions, then
join them, then the next two and join them. I created the two partitions, then
told it to set up software RAID, but it would only show me one of the
partitions. Exiting the RAID configurer and going back in made it show me both
partitions. I select them both and tell it to Continue, and then tell the next
menu to finish. When I get back to the partitioner, it doesn't show me the MD
volume.

I had done this before, and it worked. The only thing I can think of as
different is how I've decided to divide up the drives.

I have one 15.3G drive, one 6.4G, and one 6.8G.

The first time, I made a 6.4 on each drive and RAID0'd them all together. I
used the other space for swap, boot, and extra.

This time, I want to make a 6.4 and a 6.8 on the 15.3G drive, RAID each of
them with the corresponding partition on the smaller drives, and then LVM the
two RAID0s together to concatenate them (assuming I understand LVM right). But
I'm not able to get past the low-level partitioning point and get it to
actually let me RAID0 pairs of partitions.
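
For reference, what I'm after would look roughly like this from a shell, with
illustrative device names (not necessarily what the installer assigns):

  # hypothetical sketch: two RAID0 pairs, LVM concatenation on top
  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/hda1 /dev/hdc1
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/hda2 /dev/hdd1
  pvcreate /dev/md0 /dev/md1
  vgcreate vg0 /dev/md0 /dev/md1
  lvcreate -n data -l 100%FREE vg0   # or an explicit extent count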

Revision history for this message
Timothy Miller (theosib) wrote :

I don't want to be a bother, but this is preventing me from installing Ubuntu. Is there any way to prod the developers a little
more to have a look at this?

Thanks.

Revision history for this message
Colin Watson (cjwatson) wrote :

The integration between partman and RAID is hairy at the best of times, and I
confess to not knowing much about it. Could you attach /var/log/partman from the
installer? That might give me enough clues to be going on with ...

Revision history for this message
Timothy Miller (theosib) wrote :

Ok, well, I got set up to do that, but I can't figure out how to get the information off of the machine. None of the usual
networking tools exist in /bin, /usr/bin, /usr/sbin or /sbin. I honestly have no idea how I might copy the file from the installer
environment to another machine so that I could post it here.

Could you either tell me how it's generally expected that one do this, or could you suggest to the developers that they should add
such a tool?

Thanks.

Revision history for this message
Colin Watson (cjwatson) wrote :

nc (a.k.a. netcat) is available; you can 'anna-install openssh-client-udeb' to
get scp; or you can select "Save debug logs" from the main menu to get a few
other options, although you'll need a daily build from 2005-09-23 or later to
get the web server option to work.
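
For example, a minimal netcat transfer might look like this (the receiving
machine's address and port below are placeholders):

  # on another machine on the network: listen and save to a file
  nc -l -p 9999 > partman.log
  # in the installer shell: send the log
  nc 192.168.1.5 9999 < /var/log/partman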

Revision history for this message
Timothy Miller (theosib) wrote :

Created an attachment (id=4180)
partman log before attempting RAID setup

This is the log requested. At this point, the disk is partitioned from an
earlier install attempt. All I need to do now is tell the RAID program to
associate pairs of partitions as RAID0. See next attachment.

Revision history for this message
Timothy Miller (theosib) wrote :

Created an attachment (id=4181)
partman log AFTER attempting RAID setup

Next, what I did was enter the RAID program, tell it to create an MD device,
and then select two partitions. It only listed three of the existing four, but
I selected the two that I wanted. Then Tab, then Enter on Continue. I then
told it to Finish. Back in the partitioner, it does not display the RAID
device I had tried to set up. The attachment is the log after all of this.

Revision history for this message
Timothy Miller (theosib) wrote :

Any news on this? Any more research I can do to help solve this problem?

And can we bump up the priority? This isn't something that can be provided as an automatic update, since it's an install thing.
I'm sure the Ubuntu team will want to fix this before release, which isn't very far off.

Revision history for this message
Colin Watson (cjwatson) wrote :

OK, sorry for the delay; I've been flat-out on other bugs.

I've tried to reproduce this and have got nowhere, but at least I have a
slightly better understanding of what I need now. Could you get me the output of
the following commands at each of the two stages at which you provided partman
logs (i.e. after reaching the partitioning stage with previously configured
physical RAID volumes but before doing anything else, and after trying and
failing to assemble RAID devices):

  ls /dev/md
  cat /proc/mdstat
  /usr/lib/partconf/find-partitions --ignore-fstype
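
(If it helps, all three can be captured to one file in a single step, e.g.:

  (ls /dev/md; cat /proc/mdstat; \
   /usr/lib/partconf/find-partitions --ignore-fstype) > /tmp/raid-state.txt 2>&1

where /tmp/raid-state.txt is just an example path.)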

I have a suspicion that one or more of these problems may be involved: (1) we
might not be calling udevstart to get the device nodes created, (2) mdadm might
be having trouble with udev for some other reason, (3) there might be confusion
between different names for the same device node (/dev/scsi/... vs.
/dev/discs/...), (4) something I haven't thought of yet. I'll keep investigating.

Revision history for this message
Colin Watson (cjwatson) wrote :

With Fabio's help, I've reproduced this problem in the case where a RAID has
previously been created using some of the physical volumes in question, but has
not been deleted. If this is the case for you, then I can provide you with
commands to run to work around this problem and continue with the installation.

I'm inclined to believe that this case (a stale superblock left un-zeroed on
disk) can't be made smoother in Breezy with a satisfactory level of risk,
because the consequences of an excessively cavalier change could easily be
data loss.

Revision history for this message
Timothy Miller (theosib) wrote :

Created an attachment (id=4327)
Requested info from before starting RAID manager

Capture of:
ls /dev/md
cat /proc/mdstat
/usr/lib/partconf/find-partitions --ignore-fstype

After starting partman but prior to starting the RAID manager.

Revision history for this message
Timothy Miller (theosib) wrote :

Created an attachment (id=4328)
Requested info from after starting RAID manager

See "before". This is from after running the RAID manager and returning back
to partman.

Revision history for this message
Timothy Miller (theosib) wrote :

I'm not sure what data loss you're referring to, unless you're talking about
problems with reformatting RAID partitions that hold data, and you don't want
to reformat them.

I deleted all partitions and then created RAID partitions on the empty disks. Perhaps when a RAID partition is added, it needs to
be flagged with something that says "this is an empty RAID partition, so you can clobber it".

Is there something about what I'm doing that makes this thing think that I want to keep the data in the volumes?

Revision history for this message
Colin Watson (cjwatson) wrote :

(In reply to comment #14)
> I'm not sure what data loss you're referring to, unless you're talking about
> problems with reformatting RAID partitions that hold data, and you don't
> want to reformat them.

At present, we have no user interface code for saying "this partition appears to
be part of an existing RAID set; are you sure you want to use it?" or for
displaying such partitions in a different way. We should certainly have such a
UI. Without that UI, though, the RAID configuration screen can't offer
partitions that appear to be part of an existing RAID set, because it could
easily cause people to overwrite their existing data with a new RAID by accident.

> I deleted all partitions and then created RAID partitions on the empty
> disks. Perhaps when a RAID partition is added, it needs to be flagged with
> something that says "this is an empty RAID partition, so you can clobber it".

So, what happens is that, when you delete the partition, the partition manager
fails to zero the RAID superblock; all it really does is remove the entry from
the partition table. If you then create a partition starting at the same
position on the disk, the old RAID superblock will still be in place, and the
kernel will think that the partition is part of a deactivated RAID set. Your
/proc/mdstat confirms this:

  md0 : inactive hdd1[2]

To work around this, you need to do the following:

  mdadm --stop /dev/md/0
  mdadm --zero-superblock /dev/hdd1

It should then be possible (possibly after a reboot to de-confuse parted) to
proceed to use this as a physical RAID volume.
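
One way to confirm the zeroing took effect (reusing /dev/hdd1 from the example
above) is:

  mdadm --examine /dev/hdd1

which should report that no md superblock is detected on the device.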

> Is there something about what I'm doing that makes this thing think that I
> want to keep the data in the volumes?

It's definitely a partitioner bug that it doesn't zero out the superblock when
the partition is deleted, but with two days to go until the release candidate I
think I'd rather leave this as "known bug with known workaround" than as
"probable fix with unknown consequences".

Thanks for your patience!

Revision history for this message
Timothy Miller (theosib) wrote :

Ok, I did what you suggested, and voila, it worked!

I would like to offer to put an explanation of this on the wiki somewhere, but I don't know where to go or how to get authorization.
Drop me a note at <email address hidden> if you'd like me to get started on it.

Thanks.

Revision history for this message
Fabio Massimo Di Nitto (fabbione) wrote :

(In reply to comment #15)
> (In reply to comment #14)
> > I'm not sure what data loss you're referring to, unless you're talking
> > about problems with reformatting RAID partitions that hold data, and you
> > don't want to reformat them.

The point is that the installer is capable of recognizing old RAIDs, and
perhaps the user only wants to re-enable them. So yes, you don't want to
reformat them unless explicitly told to do so.

>
> At present, we have no user interface code for saying "this partition appears to
> be part of an existing RAID set; are you sure you want to use it?" or for
> displaying such partitions in a different way. We should certainly have such a
> UI. Without that UI, though, the RAID configuration screen can't offer
> partitions that appear to be part of an existing RAID set, because it could
> easily cause people to overwrite their existing data with a new RAID by accident.

We also need to remember that we must be able to undo our changes; see below.

>
> > I deleted all partitions and then created RAID partitions on the empty
> > disks. Perhaps when a RAID partition is added, it needs to be flagged with
> > something that says "this is an empty RAID partition, so you can clobber
> > it".
>
> So, what happens is that, when you delete the partition, the partition manager
> fails to zero the RAID superblock; all it really does is remove the entry from
> the partition table. If you then create a partition starting at the same
> position on the disk, the old RAID superblock will still be in place, and the
> kernel will think that the partition is part of a deactivated RAID set. Your
> /proc/mdstat confirms this:
>
> md0 : inactive hdd1[2]
>
> To work around this, you need to do the following:
>
> mdadm --stop /dev/md/0
> mdadm --zero-superblock /dev/hdd1
>
> It should then be possible (possibly after a reboot to de-confuse parted) to
> proceed to use this as a physical RAID volume.
>
> > Is there something about what I'm doing that makes this thing think that
> > I want to keep the data in the volumes?
>
> It's definitely a partitioner bug that it doesn't zero out the superblock when
> the partition is deleted, but with two days to go until the release candidate I
> think I'd rather leave this as "known bug with known workaround" than as
> "probable fix with unknown consequences".
>
> Thanks for your patience!

IIRC the partitioner also offers the option to revert changes. It is possible,
however, to restore the md superblocks: their position is always the same, as
defined by the kernel drivers, and so is their size. So ideally we could back
them up somewhere before removing the partition and restore them later if the
user undoes the changes.
These changes can't make Breezy, clearly, so just for reference:
/*
 * If x is the real device size in bytes, we return an apparent size of:
 *
 * y = (x & ~(MD_RESERVED_BYTES - 1)) - MD_RESERVED_BYTES
 *
 * and place the 4kB superblock at offset y.
 */

#define MD_RESERVED_BYTES (64 * 1024)

(from include/linux/raid/md_p.h)
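
As a sketch of that idea (hypothetical, not installer code): using the formula
above, the superblock offset can be computed and the 4kB block saved and
restored with dd, assuming $dev holds the partition's device node:

  # hypothetical sketch: back up / restore an md superblock
  size=$(blockdev --getsize64 "$dev")          # real device size in bytes (x)
  offset=$(( (size & ~(65536 - 1)) - 65536 ))  # superblock offset (y)
  dd if="$dev" of=/tmp/md-sb.bak bs=4096 count=1 skip=$(( offset / 4096 ))
  # ... and to restore it later if the user undoes the changes:
  dd if=/tmp/md-sb.bak of="$dev" bs=4096 count=1 seek=$(( offset / 4096 ))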

Fabio

Revision history for this message
Colin Watson (cjwatson) wrote :

(In reply to comment #17)
> IIRC the partitioner also offers the option to revert changes. It is
> possible, however, to restore the md superblocks: their position is always
> the same, as defined by the kernel drivers, and so is their size. So ideally
> we could back them up somewhere before removing the partition and restore
> them later if the user undoes the changes.

No need for any of that; you simply zero the superblock in a commit script,
which by definition is run at the point of no return for undoing changes.
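
As a rough sketch of that approach (a hypothetical script, not the actual
partman-md code), a commit script only needs to zero the superblock of each
partition recorded as deleted:

  #!/bin/sh
  # hypothetical commit-time hook; $deleted_devs is an assumed variable
  # holding the device nodes of partitions the user deleted
  for dev in $deleted_devs; do
      mdadm --zero-superblock "$dev" || true   # ignore devices with no superblock
  done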

Colin Watson (cjwatson)
Changed in partman:
status: Unconfirmed → Confirmed
Colin Watson (cjwatson)
Changed in partman-md:
assignee: kamion → nobody
Revision history for this message
Justin Traer (justin-traer) wrote :

This problem still exists in the 7.04 server install. Bug #33117 is related to this one.

Revision history for this message
Carl Karsten (carlfk) wrote :

Still exists in 8.04.

config and logs: http://dev.personnelware.com/carl/temp/May14/b/dhcp11/
/temp/ means it will be gone in two weeks.

Revision history for this message
C de-Avillez (hggdh2) wrote :

Setting to Triaged. Colin stated this is indeed an issue.

Changed in partman-md:
status: Confirmed → Triaged
Revision history for this message
Carl Karsten (carlfk) wrote :

(01:35:10 PM) hggdh: CarlFK: please tar/zip your config, and attach it to the bug -- this way it will survive more than 2 weeks

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Can someone test this bug against Intrepid Server and report if it still exists?

If so, could you please provide me with detailed, step-by-step instructions to reproduce it?

Thanks,
:-Dustin

Revision history for this message
Gabriele Castagneti (gcastagneti) wrote :

Actually, this bug also exists in 10.10.
Terrible!
I hope this problem will be solved quickly.
Thanks,
Gabriele
