mdadm software RAID fails to start up on reboot

Bug #188392 reported by tor-henning
Affects: mdadm (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Binary package hint: mdadm

I have two RAID 5 arrays set up through mdadm. md0 contains only IDE disks and md1 contains only SATA disks.
When I set up both RAIDs for the first time, the resync started just fine, but when I restarted the server the boot took about five minutes longer than usual and both RAIDs were gone: md0 had only 3 of its 6 disks and was therefore stopped, and md1 was non-existent.

The kern.log repeats this message some ten thousand times: "md: array md0 already has disks!"

log content:
Feb 2 15:51:52 foo kernel: [ 676.666987] sd 0:0:1:0: [sdb] 398297088 512-byte hardware sectors (203928 MB)
Feb 2 15:51:52 foo kernel: [ 676.667022] sd 0:0:1:0: [sdb] Write Protect is off
Feb 2 15:51:52 foo kernel: [ 676.667027] sd 0:0:1:0: [sdb] Mode Sense: 00 3a 00 00
Feb 2 15:51:52 foo kernel: [ 676.667058] sd 0:0:1:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 2 15:51:52 foo kernel: [ 676.667064] sdb: sdb1
Feb 2 15:51:52 foo kernel: [ 676.754604] md: bind<sdm>
Feb 2 15:51:52 foo kernel: [ 676.754824] md: bind<sda>
Feb 2 15:51:52 foo kernel: [ 676.755173] md: bind<sdd>
Feb 2 15:51:52 foo kernel: [ 676.755403] md: bind<sdl>
Feb 2 15:51:53 foo kernel: [ 676.904290] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.911736] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.919289] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.926809] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.934386] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.941893] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.949414] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.956935] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.964457] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.971934] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.979412] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.986932] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 676.994599] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 677.002147] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 677.009677] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 677.017163] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 677.024679] md: array md0 already has disks!
Feb 2 15:51:53 foo kernel: [ 677.032153] md: array md0 already has disks!

When I had only set up md1, the same message was repeated, but with md1 instead of md0.

I then got a tip about reconfiguring mdadm, so I ran:
"dpkg-reconfigure mdadm"

and restarted. Both RAIDs then came up just fine, and they stayed up through three more reboots; no problems since.
The kern.log now says:

Feb 2 18:15:49 foo kernel: [ 62.279977] md: bind<sdb1>
Feb 2 18:15:49 foo kernel: [ 62.280190] md: bind<sdc1>
Feb 2 18:15:49 foo kernel: [ 62.280393] md: bind<sdd1>
Feb 2 18:15:49 foo kernel: [ 62.280594] md: bind<sde1>
Feb 2 18:15:49 foo kernel: [ 62.280792] md: bind<sdf1>
Feb 2 18:15:49 foo kernel: [ 62.281018] md: bind<sdn1>
Feb 2 18:15:49 foo kernel: [ 62.281223] md: bind<sdo1>
Feb 2 18:15:49 foo kernel: [ 62.281426] md: bind<sda1>
Feb 2 18:15:49 foo kernel: [ 62.786545] raid5: device sda1 operational as raid disk 0
Feb 2 18:15:49 foo kernel: [ 62.786552] raid5: device sdn1 operational as raid disk 6
Feb 2 18:15:49 foo kernel: [ 62.786557] raid5: device sdf1 operational as raid disk 5
Feb 2 18:15:49 foo kernel: [ 62.786562] raid5: device sde1 operational as raid disk 4
Feb 2 18:15:49 foo kernel: [ 62.786566] raid5: device sdd1 operational as raid disk 3
Feb 2 18:15:49 foo kernel: [ 62.786571] raid5: device sdc1 operational as raid disk 2
Feb 2 18:15:49 foo kernel: [ 62.786576] raid5: device sdb1 operational as raid disk 1
Feb 2 18:15:49 foo kernel: [ 62.787622] raid5: allocated 8368kB for md1
Feb 2 18:15:49 foo kernel: [ 62.787628] raid5: raid level 5 set md1 active with 7 out of 8 devices, algorithm 2
Feb 2 18:15:49 foo kernel: [ 62.787677] RAID5 conf printout:
Feb 2 18:15:49 foo kernel: [ 62.787681] --- rd:8 wd:7
Feb 2 18:15:49 foo kernel: [ 62.787685] disk 0, o:1, dev:sda1
Feb 2 18:15:49 foo kernel: [ 62.787689] disk 1, o:1, dev:sdb1
Feb 2 18:15:49 foo kernel: [ 62.787693] disk 2, o:1, dev:sdc1
Feb 2 18:15:49 foo kernel: [ 62.787697] disk 3, o:1, dev:sdd1
Feb 2 18:15:49 foo kernel: [ 62.787701] disk 4, o:1, dev:sde1
Feb 2 18:15:49 foo kernel: [ 62.787705] disk 5, o:1, dev:sdf1
Feb 2 18:15:49 foo kernel: [ 62.787709] disk 6, o:1, dev:sdn1
Feb 2 18:15:49 foo kernel: [ 62.787768] RAID5 conf printout:
Feb 2 18:15:49 foo kernel: [ 62.787772] --- rd:8 wd:7
Feb 2 18:15:49 foo kernel: [ 62.787776] disk 0, o:1, dev:sda1
Feb 2 18:15:49 foo kernel: [ 62.787779] disk 1, o:1, dev:sdb1
Feb 2 18:15:49 foo kernel: [ 62.787783] disk 2, o:1, dev:sdc1
Feb 2 18:15:49 foo kernel: [ 62.787787] disk 3, o:1, dev:sdd1
Feb 2 18:15:49 foo kernel: [ 62.787790] disk 4, o:1, dev:sde1
Feb 2 18:15:49 foo kernel: [ 62.787794] disk 5, o:1, dev:sdf1
Feb 2 18:15:49 foo kernel: [ 62.787798] disk 6, o:1, dev:sdn1
Feb 2 18:15:49 foo kernel: [ 62.787801] disk 7, o:1, dev:sdo1

Feb 2 18:15:49 foo kernel: [ 62.787834] md: recovery of RAID array md1
Feb 2 18:15:49 foo kernel: [ 62.787839] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 2 18:15:49 foo kernel: [ 62.787845] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb 2 18:15:49 foo kernel: [ 62.787858] md: using 128k window, over a total of 390708736 blocks.
Feb 2 18:15:49 foo kernel: [ 62.847931] md: bind<sdh1>
Feb 2 18:15:49 foo kernel: [ 62.848250] md: bind<sdi1>
Feb 2 18:15:49 foo kernel: [ 62.848659] md: bind<sdj1>
Feb 2 18:15:49 foo kernel: [ 62.851114] md: bind<sdl1>
Feb 2 18:15:49 foo kernel: [ 62.851379] md: bind<sdm1>
Feb 2 18:15:49 foo kernel: [ 62.852498] md: bind<sdg1>
Feb 2 18:15:49 foo kernel: [ 63.029749] raid5: device sdg1 operational as raid disk 0
Feb 2 18:15:49 foo kernel: [ 63.029756] raid5: device sdl1 operational as raid disk 4
Feb 2 18:15:49 foo kernel: [ 63.029760] raid5: device sdj1 operational as raid disk 3
Feb 2 18:15:49 foo kernel: [ 63.029765] raid5: device sdi1 operational as raid disk 2
Feb 2 18:15:49 foo kernel: [ 63.029770] raid5: device sdh1 operational as raid disk 1
Feb 2 18:15:49 foo kernel: [ 63.031587] raid5: allocated 6286kB for md0
Feb 2 18:15:49 foo kernel: [ 63.031593] raid5: raid level 5 set md0 active with 5 out of 6 devices, algorithm 2
Feb 2 18:15:49 foo kernel: [ 63.031679] RAID5 conf printout:
Feb 2 18:15:49 foo kernel: [ 63.031682] --- rd:6 wd:5
Feb 2 18:15:49 foo kernel: [ 63.031686] disk 0, o:1, dev:sdg1
Feb 2 18:15:49 foo kernel: [ 63.031690] disk 1, o:1, dev:sdh1
Feb 2 18:15:49 foo kernel: [ 63.031693] disk 2, o:1, dev:sdi1
Feb 2 18:15:49 foo kernel: [ 63.031697] disk 3, o:1, dev:sdj1
Feb 2 18:15:49 foo kernel: [ 63.031700] disk 4, o:1, dev:sdl1
Feb 2 18:15:49 foo kernel: [ 63.031752] RAID5 conf printout:
Feb 2 18:15:49 foo kernel: [ 63.031755] --- rd:6 wd:5
Feb 2 18:15:49 foo kernel: [ 63.031758] disk 0, o:1, dev:sdg1
Feb 2 18:15:49 foo kernel: [ 63.031762] disk 1, o:1, dev:sdh1
Feb 2 18:15:49 foo kernel: [ 63.031766] disk 2, o:1, dev:sdi1
Feb 2 18:15:49 foo kernel: [ 63.031769] disk 3, o:1, dev:sdj1
Feb 2 18:15:49 foo kernel: [ 63.031773] disk 4, o:1, dev:sdl1
Feb 2 18:15:49 foo kernel: [ 63.031776] disk 5, o:1, dev:sdm1
Feb 2 18:15:49 foo kernel: [ 63.031818] md: recovery of RAID array md0
Feb 2 18:15:49 foo kernel: [ 63.031823] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 2 18:15:49 foo kernel: [ 63.031829] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for recovery.
Feb 2 18:15:49 foo kernel: [ 63.031841] md: using 128k window, over a total of 195358336 blocks.

root@foo:/var/log# uname -a
Linux foo 2.6.22-14-server #1 SMP Tue Dec 18 08:31:40 UTC 2007 i686 GNU/Linux
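
For reference, roughly what that reconfigure step amounts to by hand on Debian/Ubuntu (a sketch only, assuming the standard /etc/mdadm/mdadm.conf path and that dpkg-reconfigure also rebuilds the initramfs):

  mdadm --detail --scan >> /etc/mdadm/mdadm.conf   # regenerate ARRAY lines from the running arrays; review the file for duplicates afterwards
  update-initramfs -u                              # rebuild the initramfs so the boot-time copy of mdadm.conf matches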

Revision history for this message
Bob McElrath (bob+ubuntu) wrote :

I just experienced this problem.

With the "md: array md0 already has disks!" error spamming the console, if you wait 60 seconds (or whatever the timeout is) you are dropped to the initramfs shell. You can't see it, though, because the "already has disks" error scrolls past 100 times per second. But I typed 'mdadm --stop /dev/md0' and this stopped the spam! Surprisingly, /dev/md0 existed, and was fine! (Degraded, but I expected that because one drive had failed.)

To repeat, I ran 'mdadm --stop /dev/md0' and *after* I ran that, /dev/md0 still existed and was fine.

So, it appears that mdadm is trying to create /dev/md0 *twice* and obviously fails the second time (repeatedly).
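
For anyone else stuck at that point, a sketch of checking the array from the initramfs shell (standard mdadm/procfs commands; per the report above, the array can still be assembled after the --stop):

  mdadm --stop /dev/md0     # stops the repeated assembly attempt and the message spam
  cat /proc/mdstat          # md0 may still show up here, assembled but degraded
  mdadm --detail /dev/md0   # shows the array state and which member is missing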

Revision history for this message
Johan Jeffery (johan-jeffery) wrote :

I've also run into this, and a fair number of people on the Ubuntu forums seem to have experienced it as well.

This bug can cause Ubuntu 8.04 to hang (become completely unresponsive), presumably because the logs get filled with 'md: array md0 already has disks!' many times per second. Also, the error starts appearing in the kernel log just after the kernel states it has loaded symbols, and the log starts mid-error, so you really can't tell what happens leading up to it.

I've tried every workaround I could find that has been suggested in the forums, e.g. reinstalling mdadm, rebuilding the array from scratch, etc., but none have worked.

The only workaround is, at the earliest opportunity (either in the BusyBox shell, if you are dropped into one, or immediately after you log in), to do the following:

  1) sudo mdadm -S /dev/md0 (mdadm -S /dev/md0, if in the BusyBox shell)
  2) then check the log with dmesg | tail (dmesg | /usr/bin/tail, if you are in the BusyBox shell)
  3) if the error 'md: array md0 already has disks!' has not stopped, repeat steps 1 and 2 until it does (a rough loop automating this is sketched after this list).
       - I'm finding more and more that it takes repeating the steps 3-4 times before things settle down.
  4) once the error stops, if the array is listed in /etc/fstab, you can issue sudo mount /dev/md0 (mount /dev/md0, if in the BusyBox shell) and things will then work until the next reboot.
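
A rough sketch of automating steps 1-3 (untested; it just re-runs the stop command until the 'already has disks' spam stops showing up in recent dmesg output):

  while dmesg | tail -n 50 | grep -q 'already has disks'; do
      mdadm -S /dev/md0
      sleep 2
  done
  mount /dev/md0    # step 4, only if md0 is listed in /etc/fstab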

Revision history for this message
Johan Jeffery (johan-jeffery) wrote :

Forgot one thing: hardware and configuration. I have an IBM IntelliStation A Pro with two 146 GB SCSI drives, each on its own Adaptec controller, in a RAID 0 array.

Revision history for this message
moelles (moelles) wrote :

This annoying error just occurred to me.

I have a simple RAID 1 with two devices as my home. It worked for years; fresh Hardy install a month ago.
I can find it in the mdadm.conf.

@Johan Jeffery: your 4 steps worked! I could finally stop the dmesg spam. Thanks!

After that I had a running md0.

I did a soft reboot and everything was back to normal again!?

More details in my next post.

Revision history for this message
moelles (moelles) wrote :

This is what happened when I stopped the spam (see above):

.... 1000times
md: array md0 already has disks!
md: array md0 already has disks!
md: array md0 already has disks!
md: md0 stopped.
md: unbind<sdb2>
md: export_rdev(sdb2)
md: bind<sdc2>
md: bind<sdb2>
raid1: raid set md0 active with 2 out of 2 mirrors

After the soft reboot (I lost the log; it looked roughly like the following):

md: array md0 already has disks! (only once!?! no bind messages before this!)
md: could not bd_claim sd?2 (don't know which sd it was)
md: unbind<sd?2> (I'm unsure about the order of these two)
md: export_rdev(sd?2)
md: bind<sdc2> (this looks good...)
md: bind<sdb2>
raid1: raid set md0 active with 2 out of 2 mirrors

OK, after that I removed the ARRAY entry from mdadm.conf:

DEVICE partitions
#ARRAY /dev/md0 level=raid1 num-devices=2 UUID=.....

another soft-reboot:

[ 46.255357] md: bind<sdc2>
[ 46.255572] md: bind<sdb2>
[ 46.811324] raid1: raid set md0 active with 2 out of 2 mirrors

That's it!

Problem solved? Just let it scan for arrays rather than specifying them?

Revision history for this message
Johan Jeffery (johan-jeffery) wrote :

I suspect not, since I do not have an ARRAY definition in my mdadm.conf, yet I run into the problem on every reboot.

Revision history for this message
moelles (moelles) wrote :

OK, it's NOT solved for me either.

It happens to me every second cold boot or so. After 2-5 runs of "mdadm --stop /dev/md0" I get the array to work and can actually use the machine.

Any chance of fixing this? Anyone?

Is this specific to Ubuntu, or does it also happen on other distros? (Tempted to try sidux.)

Revision history for this message
Alan Jenkins (aj504) wrote :

At least the message spam is fixed in the latest (upstream) version of mdadm. I advise you to upgrade; there's at least one segfault bug I was hitting which has also been fixed. Strong precautions are obviously necessary for those of us with root filesystems on MD.

I suggest making sure you have at least two kernels. Copy mdadm to mdadm.bak, build from source, and overwrite mdadm. Run update-initramfs for *one* kernel, e.g. update-initramfs -u -k 2.6.24-19-generic. Then if that breaks you can recover by booting the *other* kernel and copying mdadm.bak back to mdadm. That's what I did.
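
Spelled out, that sequence might look like the following (a sketch only; /sbin/mdadm is where the binary lives on Hardy, and the kernel version is just the example above):

  cp /sbin/mdadm /sbin/mdadm.bak              # keep the distro binary around
  # build the newer mdadm from source and install it over /sbin/mdadm, then:
  update-initramfs -u -k 2.6.24-19-generic    # refresh the initramfs for ONE kernel only
  # if that kernel stops booting, boot the other kernel, restore the backup,
  # and rerun update-initramfs for the broken kernel:
  #   cp /sbin/mdadm.bak /sbin/mdadm && update-initramfs -u -k 2.6.24-19-generic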

Revision history for this message
moelles (moelles) wrote :

another (strange) solution or workaround:

It _seems_ like the problem is gone if I boot without the "quiet usplash" parameters!?
I modified my menu.lst, ran update-grub, and since then the bug hasn't occurred once.
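
On a stock Hardy grub-legacy setup that edit usually means changing the "defoptions" magic comment in menu.lst (the exact line below is an assumption; the original file wasn't quoted):

  # in /boot/grub/menu.lst, change
  #   # defoptions=quiet splash
  # to
  #   # defoptions=
  # then regenerate the kernel stanzas:
  update-grub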

@Alan Jenkins: if mdadm is buggy, shouldn't we expect an update via the repo once it gets fixed?

Revision history for this message
Alan Jenkins (aj504) wrote :

It's a reasonable request. I just don't automatically expect it. I confess I'm impatient about bugfixes for programs I rely on in the early boot process :-).

If you've found a solution that works for you, however strange it seems, then patience is probably a virtue.

Revision history for this message
fermulator (fermulator) wrote :

Hey everyone, I was having this exact same problem.

At some point an older array was defined, and there were md superblocks left on one (or more) of the partitions. Even if I zeroed them out (using sudo mdadm --zero-superblock /dev/sda1), the superblock came back after a reboot! It was very strange. Regardless of whether the partitions were of type "linux" or "linux raid autodetect", I was getting the errors and problems people have already described above. However, if the partitions were CLEARED completely, the repeating error did not occur.
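
A quick way to check whether a stale superblock really does survive a reboot (standard mdadm options; /dev/sda1 is just the device from the example above):

  mdadm --examine /dev/sda1           # prints any md superblock found on the partition
  mdadm --zero-superblock /dev/sda1   # wipe it (only on a member you really mean to clear)
  # reboot, then run the --examine again to see whether the superblock has reappeared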

SOLUTION FOR ME:
In the file: /etc/udev/rules.d/85-mdadm.rules
Comment out the only line:
SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="linux_raid*", RUN+="watershed -i udev-mdadm /sbin/mdadm -As"

It looks like, for some weird reason, udev is trying to be helpful by searching for all devices with md superblocks and assembling any array it might find? Clearly this is redundant, because mdadm itself already does this.

HENCE: It makes sense that we're getting the above error: "md: array md0 already has disks!" (although I'm not sure why it happens over and over again).
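
(For context, the "-As" at the end of that rule is mdadm's assemble-and-scan mode; the long form below is ordinary mdadm usage, nothing specific to this bug:)

  mdadm --assemble --scan    # same as "mdadm -As": assemble every array defined in mdadm.conf or found by scanning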

Anyway, this is what worked for me (thanks to RaceTM for the help!).
Cheers - hopefully my details will help the developers fix this bug.

Revision history for this message
fermulator (fermulator) wrote :

PS: Apparently an upgrade of mdadm will fix this also? (I haven't verified)
https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/139802

Revision history for this message
moelles (moelles) wrote :

Yes, I have to correct myself (comment from 2008-07-31):

It happened again, but since I updated mdadm (several months ago) the issue is gone. I did it the way Alan Jenkins described on 2008-07-29.

I never did anything to the udev rules or the RAID partitions.
