boot-time race condition initializing md

Bug #103177 reported by Jeff Balderson on 2007-04-05
Affects                    Importance   Assigned to
initramfs-tools (Ubuntu)   Critical     Scott James Remnant (Canonical)
mdadm (Ubuntu)             Undecided    Scott James Remnant (Canonical)
udev (Ubuntu)              Undecided    Unassigned

Bug Description

Binary package hint: mdadm

I originally contributed some feedback to Bug #75681. It is currently reported as being fixed, but my problem still persists. The person that closed it recommended that we open new bugs if we're still having problems. I've reproduced my feedback here to save looking it up:

==========
FWIW, I have the problem, exactly as described, with an IDE-only system.

I attempted this with both an Edgy->Feisty upgrade (mid February), and a fresh install of Herd5.

I'm running straight RAID1 for all my volumes:

md0 -> hda1/hdc1 -> /boot
md1 -> hda5/hdc5 -> /
md2 -> hda6/hdc6 -> /usr
md3 -> hda7/hdc7 -> /tmp
md4 -> hda8/hdc8 -> swap
md5 -> hda9/hdc9 -> /var
md6 -> hda10/hdc10 -> /export

EVERY time I boot, mdadm complains that it can't build my arrays. It dumps me in the busybox shell where I discover that all /dev/hda* and /dev/hdc* devices are missing. I ran across a suggestion here: http://ubuntuforums.org/showpost.php?p=2236181&postcount=5. I followed the instructions and so far, it's worked perfectly at every boot (about 5 times now).

Other than just to post a "me too", I thought my comments might help give a possible temporary workaround, as well as document the fact this isn't just a SATA problem.
============

I just updated to the latest packages as of approximately 4/5/07 00:50 EDT, and nothing has changed.

To fix it, I can follow the workaround here:

https://bugs.launchpad.net/ubuntu/+source/mdadm/+bug/75681/comments/60
(Boot with break=premount
At the initramfs prompt:
  udevd --verbose --suppress-syslog >/tmp/udev-output 2>&1 &
  udevtrigger
)

or here:

http://ubuntuforums.org/showpost.php?p=2236181&postcount=5
(add 'sleep 10' after 'log_begin_msg "Mounting root file system..."' in /usr/share/initramfs-tools/init)

Either will allow me to boot (the latter without manual intervention).

Jeff Balderson (jbalders) wrote :

Here are the versions of the related packages currently installed on my system:

dmsetup 1.02.08-1ubuntu10
libdevmapper1.02 1.02.08-1ubuntu10
libvolume-id0 108-0ubuntu3
lvm-common <none>
lvm2 <none>
mdadm 2.5.6-7ubuntu5
udev 108-0ubuntu3
volumeid 108-0ubuntu3

Dwayne Nelson (edn2) wrote :

I believe I am having this same problem.

My machine contains 4 sata drives, 3 of which are being used for linux, configured as two RAIDs:

md0: RAID1:
  /

md1: RAID5: LVM:
  /home
  /tmp
  swap

The above configuration worked well in Edgy, but I didn't have access to my wireless ethernet (WPA). The latest attempt was after installing from this morning's (4/5/07) alternate CD.

I don't know how to use either of the above workarounds (not sure how to "boot with break=premount" and I don't have access to the file-system because I haven't learned how to mount the RAID file-system manually when I boot from the live CD).

cionci (cionci-tin) wrote :

I've the same busybox problem, but with a SCSI system. Look at the picture attached.
My Feisty (upgraded from Edgy) is up to date. The issue appears randomly.
SCSI subsystem: Adaptec 19160 + 2 x Fujitsu MAS 73GB, no Raid and no LVM
I'll try to give you a more detailed report if you tell me how to trace the issue.

Jeff Balderson (jbalders) wrote :

All,

Please read this comment in another bug by Scott James Remnant.

https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/83231/comments/25

This policy isn't displayed anywhere obvious, but they apparently want us all to open separate bugs for them to investigate, and they will determine whether each one is an actual duplicate.

cionci (cionci-tin) wrote :

OK, but I don't think I have enough information to create a new bug report, so I asked here how to compile a detailed report for this kind of issue.

Changed in mdadm:
status: Unconfirmed → Confirmed
Changed in initramfs-tools:
status: Unconfirmed → Confirmed

Jeff, I don't suppose you could take a digital photograph of your screen at the point where mdadm aborts and drops you to a shell?

I think what's happening here is that the scripts/local-top/mdadm or mdrun script is causing the initramfs to abort, when the failure should be ignored since udev will run mdadm again later.

At the shell it drops you to, could you run "ps ax > /dev/ps.log 2>&1" and when you've booted (use any of your usual tricks to recover from there) attach that log to this bug.

Another check; if it drops you to the shell there, wait for a few seconds then look for /dev/hd* and /dev/sd* ... do they appear?

Can you also attach your /etc/mdadm/mdadm.conf

Thanks

Changed in udev:
status: Unconfirmed → Rejected

(This won't be a udev bug -- it's likely a bug in mdadm's scripts; keeping the initramfs bit open since that supplies the mdrun script)

Changed in initramfs-tools:
status: Confirmed → Unconfirmed

Random other check -- if you leave it for a few seconds, and your /dev/hd* and /dev/sd* devices appear - do your /dev/md* devices appear also (may be a few seconds later)

Changed in mdadm:
status: Confirmed → Needs Info

Jeffrey Knockel (jeff250) wrote :

Since my bug was marked a duplicate of this one, is there any more information that I can provide? (Gah, this might get confusing with two Jeffs!)

Please provide the same information

Jeff Balderson (jbalders) wrote :

Sorry for the delay. We went away for the weekend. Attached is the requested information.

The first one is a photo when it first drops to the busybox shell.

Jeff Balderson (jbalders) wrote :

This photo is about two minutes after the first one, with "ls -al /dev/hda* /dev/hdc* /dev/md*" and "cat /proc/mdstat"

Jeff Balderson (jbalders) wrote :

ps.log

Jeff Balderson (jbalders) wrote :

I decided to include this, since it also seems to control how and which arrays are started up at boot.

Jeffrey Knockel (jeff250) wrote :

Just to reiterate what I said in my other bug report, my /dev/sd* drives aren't ever created when I'm thrown to busybox unless I boot with break=mount. Then they magically appear without having to do anything, and /dev/md0 is created successfully too.

I've attached a picture of my display after being thrown to busybox. (It looks almost identical to Jeff Balderson's.)

Rejecting the initramfs-tools portion for now, so only the mdadm bit remains open.

There's a set of problems here that all seem related to just mdadm.

Changed in initramfs-tools:
status: Unconfirmed → Rejected
Changed in mdadm:
assignee: nobody → keybuk

Brett Johnson (linuxturtle) wrote :

FWIW, the d-i installer seems broken as well when trying to do a RAID install, and I'm guessing it's related to this.

The partitioner gets very confused looking for /dev/md/<number> devices which don't exist.

Johan Christiansen (johandc) wrote :

Same problem here:
mdadm tries to assemble /dev/md0 before /dev/sda1 and /dev/sdb1 appear.

Workaround:
1) Boot kernel with break=mount (This causes the initramfs to pause somewhere before mdadm but after udev)
2) Wait a few seconds while udev detects /dev/sda1 and /dev/sdb1
3) Press Ctrl-D to continue startup, mdadm can now assemble /dev/md0

I suppose the "sleep 10" workaround in one of the initramfs files will solve my problem, but I was hoping for something cleaner to show up.

This shouldn't be a problem since udev calls mdadm itself (check /etc/udev/rules.d/85-mdadm.rules)

The mdadm script should fail, but then when udev creates sda1 and sdb1, it should also create /dev/md0.

Can you do me a favour, do 1) and 2) as you suggest above, but also look to see whether /dev/md0 exists as well as sda1 and sdb1

Jeffrey Knockel (jeff250) wrote :

I can't speak for Johan, but when I do 1) and 2), /dev/md0 is created after a couple of seconds when the sd? devices are created and before pressing ctrl+d.

Johan Christiansen (johandc) wrote :

What Jeff is describing is exactly what happens. I just have to wait a few seconds, then the sd? devices and /dev/md0 are created automatically. I can then press ctrl+d to continue. It's as if it just needs a few extra seconds to find the drives before mdadm tries to assemble the array.

To answer your question more specifically: Yes, it does assemble md0 before I press ctrl+d.

- Tomorrow I'll have physical access to the machine again, and I can try to record a little movie of it happening. Please ask if you have further questions. This is an urgent bug that needs to be fixed before the Feisty release.

Jeff Balderson (jbalders) wrote :

My experience mirrors (pun intended) Jeff250's.

If I boot with break=mount, and remove "quiet splash", it drops to the shell, and about a second later, it discovers everything properly. At that point, I can exit the shell and the boot process will continue normally.

If I don't boot with break=mount and I let it fail and drop me to the shell, nothing is ever discovered. This might be expected behavior in this circumstance.

One of the odd things in the first case is that my ATA (not SATA) drives are discovered as /dev/sda and /dev/sdb according to /proc/mdstat, which reports that /dev/md1 consists of /dev/sda5 and /dev/sdb5. In the "let it fail" scenario, if I perform a "udevd & udevtrigger; [wait a few seconds]; pkill udev", it properly discovers them as /dev/hda and /dev/hdc and starts /dev/md1 with /dev/hda5 and /dev/hdc5. Either way, once it's finished booting, /proc/mdstat reports the correct device names (/dev/hda* and /dev/hdc*).

Jeff Balderson (jbalders) wrote :

Correction -- after booting in the "break=mount" scenario my ATA (not SATA) drives are still identified as /dev/sda and /dev/sdb according to /proc/mdstat:

root@heimdall:~# cat /proc/mdstat
Personalities : [raid1]
...
md1 : active raid1 sda5[0] sdb5[1]
      497856 blocks [2/2] [UU]

The same holds true for the "let it fail" and "sleep 10" boot scenarios. I guess this is a new feature.

I hadn't noticed this previously, since the Feisty system I've been working with most uses SATA, and SATA has always shown up as SCSI.

David Portwood (dzportwood) wrote :

Adding /sbin/udevsettle --timeout=10 to /usr/share/initramfs-tools/scripts/init-premount/udev solves this issue.

Changed in udev:
status: Rejected → Confirmed

Jeff Balderson (jbalders) wrote :

Clarification -- two comments up, in the "let it fail" scenario, "nothing is ever discovered" should read:

/dev/sda*, /dev/sdb*, /dev/hda* and /dev/hdc* are never discovered regardless of how long I wait. /dev/md1 does exist at this point.

Jeff Balderson (jbalders) wrote :

I tried adding "/sbin/udevsettle --timeout=10" to the bottom of /usr/share/initramfs-tools/scripts/init-premount/udev, rebuilt my initrd, and it does seem to work on my system just as reliably as the "sleep 10" method.

This turns out to be an initramfs-tools problem; we only loop while the device node doesn't exist, but we also need to check whether it's a useful device node.
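
To illustrate the idea of waiting for a node that is both present and useful, here is a minimal sketch in POSIX sh. It is not the patch that was uploaded for this bug; the function name wait_for_usable, its arguments, and the choice of "test -b" as the default probe are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: poll until a device node exists AND a probe command accepts it.
# wait_for_usable and its arguments are hypothetical names, not taken from
# the initramfs-tools source.
#   $1      device node to wait for
#   $2      maximum number of seconds to wait
#   $3...   probe command, run as: probe... "$1" (defaults to "test -b")
wait_for_usable() {
    dev=$1
    timeout=$2
    shift 2
    [ $# -gt 0 ] || set -- test -b

    waited=0
    while :; do
        # Existence alone is not enough: the node must also pass the probe.
        if [ -e "$dev" ] && "$@" "$dev" >/dev/null 2>&1; then
            return 0
        fi
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}
```

A call such as "wait_for_usable /dev/md1 10" would then return as soon as /dev/md1 is a block device, or fail after roughly ten seconds, instead of always sleeping for a fixed time.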

Changed in mdadm:
status: Needs Info → Rejected
Changed in udev:
status: Confirmed → Rejected
Changed in initramfs-tools:
status: Rejected → Confirmed

Tollef Fog Heen (tfheen) wrote :

Importance critical; has to be fixed for release.

Changed in initramfs-tools:
importance: Undecided → Critical
Changed in initramfs-tools:
assignee: nobody → keybuk
status: Confirmed → In Progress

Martin Pitt (pitti) wrote :

For the record, I confirm that Scott's patch works.

Fixed package uploaded

Changed in initramfs-tools:
status: In Progress → Fix Released

For me this fix is a bigger problem than the previous problem. I have a RAID 5 mounted at /media/other, with 4 partitions: /dev/hda12 (IDE), /dev/sda1, /dev/sdb1 and /dev/sdc11. Before this patch it would fail because md0 would appear after /dev/sd* but before /dev/hd*; the "sleep 10" workaround fixed that.
With this patch, I get an endless loop of "No available memory (MPOL_BIND): kill process 2445 (mdadm) score 0 or a child" which I can only get around if I boot from CD.
I am running Feisty on an AMD64.

Correction: alt-sysreq-E lets me resume booting a recovery boot. Inside it, a "cat /proc/partitions" does show sd* followed by dm-0, then hda*, then dm-[1-9].

Re-adding the "sleep" workaround lets me boot into a console, but mdadm won't assemble the array as only hda and sda are present - and it was also the case above. sdb and sdc are not detected...

I also tried reverting the patch, but it seems I no longer have some hardware driver: sdb and sdc aren't being detected anymore. Probably an update that was installed in parallel (I did an "apt-get upgrade" to get this patch) broke my system, so Ubuntu no longer detects the second drive connected to my VIA SATA interface or the one connected to the Promise interface. I get a "revalidation failed (errno=-19)" error in dmesg during the Promise SATA detection, and the same error during the detection of the second hard disk connected to the VIA SATA interface. Has udev also been updated in the last 14 hours?

Jeffrey Knockel (jeff250) wrote :

The newer initramfs-tools package has resolved this issue for me.

Jose, your bug may be related to the bug in kernel 2.6.20-14.23. If so, try updating to 2.6.20-15, which should fix it.

Jeff Balderson (jbalders) wrote :

This appears to have resolved my problem also.

Thanks, the kernel update fixed the sdb and sdc not detected problem. But the mdadm race condition is still here with the patch - hda is detected only after mdadm has loaded and assembled /dev/md0. So I always get a degraded array at boot.

deuce (azinas) wrote :

can someone tell me how to upgrade the kernel or do the break=mount command thingy? Thanks

Jeffrey Knockel (jeff250) wrote :

In the grub menu, select the kernel you want to boot with and then press 'e'. Then select the kernel line and press 'e'. Add a space and 'break=mount' to the end of the line and press enter. Press b to boot. You'll get thrown to busybox. Wait a few seconds, and then press ctrl+d.

Alternatively, the way to update the kernel would be to find an old live cd that you know works with the array (I used an old Dapper cd). Then mount your root partition to a directory with a command similar to 'sudo mount /dev/md0 /mnt'. Then 'sudo chroot /mnt'. Your shell will now behave as if your mounted partition is now the root directory until you type exit, so you can wget the new initramfs-tools or the new kernel package or whatever you need and dpkg -i it.

Jeffrey Knockel (jeff250) wrote :

I should also add that if you're trying to update the kernel because of the bug in the -14.23 kernel, the break=mount fix won't help get around that bug. It just gets around the initramfs bug that this bug report is about. To get around the -14.23 kernel bug, you're going to either have to chroot like I mentioned in my last comment, or boot using an older kernel, like a -13.

Arthur (moz-liebesgedichte) wrote :

As there haven't been any comments for two weeks, let me ask: Is there a solution to this problem? After upgrading from Edgy my RAID1 partitions only come up with one disk and I've got to manually sync them after every boot or I risk data loss when rebooting (and by chance the other disk wins the race).
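
For anyone who wants to check after each boot whether an array came up with a member missing, here is a minimal sketch. It assumes the /proc/mdstat layout quoted earlier in this report, where a line such as "497856 blocks [2/2] [UU]" gives the configured and active member counts; degraded_arrays is a hypothetical helper name, not an mdadm command.

```shell
#!/bin/sh
# Sketch: print the names of md arrays whose "[configured/active]" counters
# differ. degraded_arrays is a hypothetical helper; it reads mdstat-formatted
# text on stdin.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the array being described
        match($0, /\[[0-9]+\/[0-9]+\]/) {      # e.g. "[2/2]" or "[2/1]"
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            if (n[1] + 0 != n[2] + 0) print name
        }
    '
}
```

It would typically be run as "degraded_arrays < /proc/mdstat"; no output means every array has all of its members active.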

Rod Roark (rod) wrote :

I don't think this bug is fixed. I have an up-to-date Feisty/amd64 machine with 2 320G SATA drives supporting 5 RAID1 partitions, including md0 as root. Boot (via lilo) often hangs as depicted in the attached image. Applying the "sleep 10" fix to /usr/share/initramfs-tools/init (and running update-initramfs -k all -u) works for me.

Seems to me this is still a critical bug.

Rod Roark (rod) wrote :

Correction: root is md1, not md0.

I am also affected by this bug and it has made Feisty/Gutsy unusable for me, since all of my data is on a RAID 5 array which fails to mount.

I have since got back to Edgy which works.

This bug is still present - every reboot I have to do on my server is a nightmare. Now it's sda7 that doesn't show up in /dev/mapper (even though it shows under /dev), so it is the one I have to re-add to my RAID 5 array /dev/md0.

Mike H (mike-heden) wrote :

I wondered if this bug might be fixed in the recent kernel update that was distributed for 7.04 Feisty, but it seems not. The update restored my /boot/grub/menu.lst file to its original state with no 'break=mount', and the result was that my system again failed to boot. I've reinstated my amendments to menu.lst and can get round the problem again, but can anyone explain what the 'fixed' status of the bug on launchpad really means? In the case of this particular problem it looks as though a 'fix' has been available since 13th April?

From the evidence, I guess it means the source now contains a fix but, as far as many Ubuntu users are concerned, that's a mile away from it meaning the fix is distributed as an update to the user-base.

Is there anything in launchpad that indicates when a bug-fix has been released in the sense of being available as an updated binary that's accessible via update manager, as opposed to being available as amended source code?

Apologies if this is not the best place to ask this question, but I've asked launchpad-related questions elsewhere and had no reply....

I still have this with all latest updates on feisty. The only hint I have that I don't see here is that the partition that never gets added automatically to the raid 5 array and forces a manual "mdadm --re-add" is also never linked to /dev/mapper/

Noah (noah-noah) wrote :

I have this problem with a fresh install of Ubuntu Server 7.04 with apt update and upgrade, using two PATA disks in RAID 1. Adding "break=mount" does work, but this isn't a useful solution since this is supposed to be an unattended server. I also tested this under a VMware image and have the same problem.

Fredrik Sjögren (fsj) wrote :

I too have problems on about 10% of boots. Rebooting always fixes the issue. I have a RAID 1 software RAID (PATA).

For me, adding "break=mount" does not work. When it stops at the initramfs prompt, my machine is still missing sda* entries in /dev/mapper.

Arthur (moz-liebesgedichte) wrote :

Seconding Jose's comment #52. And I've just upgraded to the Gutsy Beta and the bug is still present there. Resetting status to Confirmed from Fix Released as the fix obviously doesn't work.

Changed in initramfs-tools:
status: Fix Released → Confirmed

Today I bit the bullet, and installed gutsy on my machine, formatting /. No more /dev/mapper, no more problems with the raid5 array being assembled properly at boot. Of course, I backed up /etc before, and /home is on another partition, so no problems.
It looks like a case of bitrot - I hadn't done a clean install on this server since dapper.

Arthur (moz-liebesgedichte) wrote :

The removal of evms (make sure that you have a valid /etc/fstab and remove it when upgrading) takes care of all the /dev/mapper problems. But the race condition still exists in Gutsy:
[ 4.684000] sd 0:0:0:0: [sda] 234441648 512-byte hardware sectors (120034 MB)
...
[ 4.688000] sda: sda1 sda2 sda3 sda4
...
[ 4.956000] md: bind<sda2>
[ 4.960000] md: md1 stopped.
[ 4.964000] md: bind<sda3>
[ 4.964000] md: md0 stopped.
[ 4.964000] md: unbind<sda2>
[ 4.964000] md: export_rdev(sda2)
[ 4.972000] md: bind<sda2>
[ 5.080000] hda: SAMSUNG SP1213N, ATA DISK drive
...
[ 7.492000] hda: hda1 hda2 hda3 hda4 < hda5 hda6 hda7 >

and my RAIDs end up without the partitions from hda.
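
The ordering problem in a log like this can be spotted mechanically. The following is a minimal sketch, assuming kernel messages in the format shown above; late_disks is a hypothetical helper name, and it only recognises partition scan lines of the form "hda: hda1 hda2 ...".

```shell
#!/bin/sh
# Sketch: list disks whose partition table shows up in the kernel log only
# after md has already started binding members. late_disks is a hypothetical
# helper name; it reads dmesg-style text on stdin.
late_disks() {
    awk '
        { sub(/^\[ *[0-9.]+\] */, "") }       # strip the "[ 4.956000] " timestamp
        /^md: / {                             # md messages: note the first bind
            if ($2 ~ /^bind</) bound = 1
            next
        }
        # Partition scan lines look like "hda: hda1 hda2 ...".
        bound && $1 ~ /^[a-z]+:$/ {
            disk = substr($1, 1, length($1) - 1)
            if (index($2, disk) == 1) print disk
        }
    '
}
```

Run as "dmesg | late_disks", it should print hda for the log above, since the hda partitions are listed only after the md bind messages.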

Arthur (moz-liebesgedichte) wrote :

I've finally solved this in my case: After detecting the second disk, udev triggers an mdadm --assemble --scan as it should. This failed to pick up the new disk. Removing the num-devices=, level= and uuid= settings from /etc/mdadm/mdadm.conf resolved this problem. My /etc/mdadm/mdadm.conf now reads:

DEVICE /dev/hda1 /dev/hda2 /dev/sda2 /dev/sda3

ARRAY /dev/md0 devices=/dev/hda1,/dev/sda2
ARRAY /dev/md1 devices=/dev/hda2,/dev/sda3

One dpkg-reconfigure mdadm later, the problem was gone. Together with the evms removal, this is probably worth putting into an upgrade guide.
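
To see whether a given mdadm.conf still has the settings I removed, something like the following minimal sketch can be used. constrained_arrays is a hypothetical helper name; it only lists the matching ARRAY lines and changes nothing, and it does not handle ARRAY entries that are continued over several lines.

```shell
#!/bin/sh
# Sketch: print ARRAY lines that still carry num-devices=, level= or uuid=.
# constrained_arrays is a hypothetical helper; pass the path to mdadm.conf
# (defaults to /etc/mdadm/mdadm.conf). Continuation lines are not handled.
constrained_arrays() {
    conf=${1:-/etc/mdadm/mdadm.conf}
    grep -iE '^[[:space:]]*ARRAY[[:space:]].*(num-devices|level|uuid)=' "$conf"
}
```

The function prints nothing and returns a non-zero status when no ARRAY line contains any of the three settings, as is the case for the configuration shown above.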

Arthur, please don't reopen bugs if you did not file them -- you may be experiencing a completely different bug.

This bug is specifically and only about the initramfs loop failing to wait for certain devices, which has been long fixed.

You seem to have other bugs, please file new ones.

Changed in initramfs-tools:
status: Confirmed → Fix Released

Hi

I have 4 hard drives and I created RAID arrays on these partitions:

/dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1

I changed the partition type to Linux raid autodetect and created the arrays like this:

mdadm -Cv /dev/md0 --level=0 --raid-devices=2 /dev/sdb1 /dev/sdc1 --chunk=128 (this array was created OK and worked)
and
mdadm -Cv /dev/md1 --level=0 --raid-devices=2 /dev/sdd1 /dev/sde1 --chunk=128 (this array was created OK and worked)

Then I created the following array. It was created and worked, but when I restart Linux, md2 is lost:

mdadm -Cv /dev/md2 --level=1 --raid-devices=2 /dev/md0 /dev/md1 --chunk=128

The full sync takes 60 minutes, and after a restart the array is lost again.
