boot-time race condition initializing md

Bug #75681 reported by Kees Cook
This bug affects 2 people
Affects                    Status        Importance  Assigned to   Milestone
initramfs-tools (Ubuntu)   Invalid       Undecided   Unassigned
  Feisty                   Invalid       High        Unassigned
lvm2 (Ubuntu)              Fix Released  Undecided   Unassigned
  Feisty                   Fix Released  Undecided   Unassigned
mdadm (Ubuntu)             Fix Released  High        Ian Jackson
  Feisty                   Fix Released  High        Ian Jackson
udev (Ubuntu)              Fix Released  Undecided   Unassigned
  Feisty                   Fix Released  Undecided   Unassigned

Bug Description

Running initramfs-tools 0.69ubuntu26, my system hangs forever at usplash; I assume it is waiting for the root device to show up (I can see the disk I/O LED polling).

Following instructions in /usr/share/doc/mdadm/README.upgrading-2.5.3.gz, I booted with break=mount. The first time, I ran ./scripts/local-top/mdadm by hand, and when I exited the shell, it tried to run it again. The second time I rebooted, I just did a break=mount, and exited from the shell immediately. mdadm started up fine, and things booted fine.

I'm not really sure how to dig into this and debug the issue. I'm running a pair of drives in md arrays, with an LVM root on top of that.

Revision history for this message
Kees Cook (kees) wrote : Re: race condition between sata and md

Tracked this down: there is a race between the /scripts/local-top/mdadm running and the SATA drives being brought online. Adding a delay seems to work, and I now look forward to
https://blueprints.launchpad.net/distros/ubuntu/+spec/udev-mdadm
working with initramfs. :)

Revision history for this message
Kees Cook (kees) wrote : Re: initramfs script: race condition between sata and md

Here is a patch that stalls the mdadm initramfs script until the desired devices are available. I set the timeout to 30 seconds.
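
(For readers without the attachment: a minimal sketch of the approach the patch takes, assuming a simple poll loop; the device list and variable names are illustrative, not the actual patch.)

  # Illustrative only: wait up to 30 seconds for the array's component
  # devices to appear before letting the mdadm local-top script continue.
  TIMEOUT=30
  DEVICES="/dev/sda1 /dev/sdb1"  # placeholder; the real patch derives these from the mdadm configuration
  slumber=$TIMEOUT
  while [ $slumber -gt 0 ]; do
      missing=0
      for dev in $DEVICES; do
          [ -b "$dev" ] || missing=1
      done
      [ $missing -eq 0 ] && break
      sleep 1
      slumber=$((slumber - 1))
  done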

Revision history for this message
Reinhard Tartler (siretart) wrote :

I can confirm this bug; ajmitch is suffering from this as well.

Changed in mdadm:
importance: Undecided → High
status: Unconfirmed → Confirmed
Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

It looks like it was fixed by recent updates, but at the expense of breaking LVM (my initramfs does not contain /scripts/local-top/lvm anymore).

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

It looks like a race condition is still here: the initramfs tries to assemble my RAID upon detection of /dev/sda. Obviously, with 3 drives missing out of 4, it fails... Then /dev/md0 is there, but inactive and not correctly assembled when the other drives are detected.

Is it possible to get a udev event when udev knows that all drives have been detected?
I think the best behaviour looks like this:
- upon disk detection, scan for RAID volumes. If a volume is available, assemble it ONLY if all of its drives are present.
- try to assemble incomplete (degraded) RAID volumes ONLY when ALL drives have been detected (i.e. if a drive is still missing, it will _not_ appear later).

This would make it faster without breaking things when a drive is just slow to answer.
Is that what the "udev-mdadm" and "udev-lvm" specs are for? They seem to only deal with sending udev events for newly created md/lvm volumes, not with detecting/assembling them.
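
(For illustration, a rough sketch of the "assemble only when every member is visible" check proposed above; this is not an existing mdadm feature, and the device names are placeholders.)

  # Sketch: compare the member count recorded in one component's superblock
  # with the number of visible block devices carrying the same array UUID.
  uuid=$(mdadm --examine /dev/sda1 | awk '/UUID/ {print $3}')
  wanted=$(mdadm --examine /dev/sda1 | awk '/Raid Devices/ {print $4}')
  found=0
  for d in /dev/sd[a-z][0-9]*; do
      mdadm --examine "$d" 2>/dev/null | grep -q "$uuid" && found=$((found + 1))
  done
  if [ "$found" -ge "$wanted" ]; then
      mdadm --assemble --scan
  fi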

Revision history for this message
Reinhard Tartler (siretart) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Aurelien Naldi <email address hidden> writes:
> It looks like it was fixed by recent updates, but at the expense of
> breaking LVM (my initramfs does not contain (/scripts/local-top/lvm
> anymore).

I can confirm. Latest updates changed the behavior, now the raid is
activated but not lvm. booting with 'break=mount' lets me run the mdadm
script by hand, but still does not activate lvm. I have to run 'lvm
vgchange -ay' by hand and press 'CTRL-D' to continue booting.

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
Reinhard Tartler (siretart) wrote : Re: initramfs script: race condition between sata and md

subscribing Ian and Scott, since they have done some uploads regarding udev and lvm2 in the past which improved the situation a bit.

Revision history for this message
Reinhard Tartler (siretart) wrote :

status update: udev does start one raid device on bootup, but not all. After starting the raid devices manually using /scripts/local-top/mdadm, the VGs still don't come up; I need to run 'lvm vgscan ; lvm vgchange -a y' manually.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

There is still a race condition here. It does not attempt to start the array before detecting devices, but it often starts it too early. My 4-device RAID5 is, most of the time, either not assembled (i.e. 1 or 2 devices out of 4 is not enough) or assembled in degraded mode. An array should be assembled in degraded mode ONLY when no more devices are expected (or at least after a timeout)!

Revision history for this message
snore (sten-bit) wrote :

I've been hit with the same problem with a fresh feisty install (10 Feb install CD);
booting often doesn't work and always results in incomplete raid arrays.

This is with 2x 80 GB SATA drives and 7 raid1 devices. The current mdadm
initramfs script is unsuitable for release.

Revision history for this message
Ian Jackson (ijackson) wrote :

I think I have fixed this in mdadm_2.5.6-7ubuntu4. Please could you install this, which I have just uploaded, and check if it works.

Changed in mdadm:
assignee: nobody → ijackson
status: Confirmed → Fix Committed
Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 83832] Re: [feisty] mounting LVM root broken

I think I have fixed the bug that causes the assembly of the RAID
arrays in degraded mode, in mdadm_2.5.6-7ubuntu4 which I have just
uploaded.

Please let me know if it works ...

Thanks,
Ian.

Revision history for this message
Sten Spans (sten-blinkenlights) wrote : Re: initramfs script: race condition between sata and md

this fixes the issue for me

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

The first boot went fine. The next one hung. And the next one also went fine.
For all of them, I get the message "no devices listed in conf file were found", but it looks harmless.
I do not know where the second one hung, so it may be unrelated, but it was before mounting /.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

From the changelog of mdadm 2.5.6-7ubuntu4:
     Specify --no-degraded argument to mdadm in initramfs; this
     can be overridden by setting MD_DEGRADED_ARGS to some nonempty value
     (eg, a single space). This ought to fix race problems where RAIDs are
     assembled in degraded mode far too much. (LP #75681 and many dupes.)
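
(For reference, the quoted override presumably boils down to something like the following in the initramfs mdadm script; this is a sketch under that assumption, not the actual Ubuntu code.)

  # Sketch only: --no-degraded is the default; setting MD_DEGRADED_ARGS to a
  # nonempty value such as a single space makes the expansion empty, which
  # allows degraded assembly again.
  MD_DEGRADED_ARGS="${MD_DEGRADED_ARGS:---no-degraded}"
  mdadm --assemble --scan $MD_DEGRADED_ARGS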

This looks much more like a workaround than a real fix! I may have misunderstood it; please ignore me if that is the case :)
Does it mean that I have to enter commands in the initramfs or do some kind of black magic to boot a system with a broken hard drive?
And it only avoids desynchronising a working RAID; it does not fix the race at boot time, since mdadm still tries to assemble the array with too few drives and thus refuses to assemble it properly later.

The array should _not_ be started _at_all_ unless all needed devices have been detected (including spare ones) OR all device detection is done.
So I ask again: is there any way to make udev send a special signal when device detection is finished?
Is there any way to make mdadm scan for arrays and assemble them ONLY if all of their devices are present, doing _nothing_ otherwise?

It may be tricky but I think that these problems can only be fixed this way.
Anyway, it did work before.

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Aurelien Naldi writes ("[Bug 75681] Re: initramfs script: race condition between sata and md"):
> The first boot went fine. The next one hanged. And the next one went
> also fine. For all of them, I get the message "no devices listed in
> cinf file were found" but it looks harmless. I do not know were the
> second one hanged so it may be unrelated, but it was before mounting
> /

When one of the boots hangs, can you please wait for it to time out (3
minutes by default, though IIRC you can adjust this by saying
ROOTWAIT=<some-number-of-seconds>) and then when you get the initramfs
prompt try

1. Check that
     cat /proc/partitions
   lists all of the components of your array. If not, then we need to
   understand why not. What are those components ? You will probably
   find that it does not list the array itself. If it does then we
   need to understand why it doesn't seem to be able to mount it.

2. Run
     /scripts/local-top/mdadm from-udev
   (NB `from-udev' is an argument you must pass to the script)
   and see if it fixes it. (If so it will show up in /proc/partitions
   and be mountable.)

3. If that doesn't fix it, see if
     mdadm -As --no-degraded
   fixes it.

Thanks,
Ian.

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien Naldi writes ("[Bug 75681] Re: initramfs script: race condition between sata and md"):
> The array should _not_ be started _at_all_ unless all needed devices
> have been detected (including spare ones) OR when all device
> detection is done.

Unfortunately, in Linux 2.6, there is no way to detect when `all
device detection is done' (and indeed depending on which kinds of bus
are available, that may not even be a meaningful concept).

Ian.

Revision history for this message
Reinhard Tartler (siretart) wrote :

Ian Jackson <email address hidden> writes:
> 1. Check that
> cat /proc/partitions
> lists all of the components of your array. If not, then we need to
> understand why not. What are those components ? You will probably
> find that it does not list the array itself. If it does then we
> need to understand why it doesn't seem to be able to mount it.

I notice that my 'real' partitions (/dev/sd[a,b][1..7]) do appear, but
only md0, not my other md[1..3] devices. Moreover, after checking
/proc/mdstat, /dev/md0 is started in degraded mode.

My setup: /dev/md0 is a mirror of /boot. md1 is a mirror of swap, md2 is
a mirrored volume group with my home (and backup of root), md3 is the
rest with a striped volume group.

> 2. Run
> /scripts/local-top/mdadm from-udev
> (NB `from-udev' is an argument you must pass to the script)
> and see if it fixes it. (If so it will show up in /proc/partitions
> and be mountable.)

Running that script makes /dev/md[1..3] come up in non-degraded mode,
md0 stays in degraded mode. LVM is not started automatically. After
typing 'lvm vgscan ; lvm vgchange -ay', my LVMs come up and I can
continue booting with CTRL-D.

Interestingly, this seems to happen most of the time. Just before, I
booted with mdadm_2.5.6-7ubuntu3 and the system came up just fine. Note
that this has happened exactly once to me so far.

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
Ian Jackson (ijackson) wrote : Re: [Bug 75681] Re: initramfs script: race condition between sata and md

Reinhard Tartler writes ("Re: [Bug 75681] Re: initramfs script: race condition between sata and md"):
> Ian Jackson <email address hidden> writes:
> > 2. Run
> > /scripts/local-top/mdadm from-udev
> > (NB `from-udev' is an argument you must pass to the script)
> > and see if it fixes it. (If so it will show up in /proc/partitions
> > and be mountable.)
>
> Running that script makes /dev/md[1..3] come up in non-degraded mode,
> md0 stays in degraded mode. LVM is not started automatically. After
> typing 'lvm vgscan ; lvm vgchange -ay', my LVMs come up and I can
> continue booting with CTRL-D.

udev is supposed to do all of these things. It's supposed to
automatically activate the LVM when the md devices appear, too. I
don't know why it isn't, yet.

Could you please boot with break=premount, and do this:
 udevd --verbose >/tmp/udev-output 2>&1 &
 udevtrigger
At this point I think you will find that your mds and lvms
are not activated properly; check with
 cat /proc/partitions

If as I assume they aren't:
 pkill udevd
 udevd --verbose >/tmp/udev-output2 2>&1 &
        /scripts/local-top/mdadm from-udev

And now your mds should all be activated but hopefully the LVM bug
will recur. So:
 pkill udevd
 lvm vgscan
 lvm vgchange -a y
 mount /dev/my-volume-group/my-volume-name /root
        cp /tmp/udev-output* /root/root/.
        exit

And then when the system boots attach /root/udev-output* and your
initramfs to this bug report.

mdadm_2.5.6-7ubuntu4 ought to fix the fact that md0 comes up in
degraded mode but please don't upgrade to it yet as it may perturb the
other symptoms out of existence.

Thanks,
Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote : Re: initramfs script: race condition between sata and md

OK, here are the results of some more tests with mdadm_2.5.6-7ubuntu4:

* my first boot went fine, allowing me to get your advice.
* the second boot hung, so I waited for a while. All my devices were there, both in /dev and in /proc/partitions.
The RAID array was assembled with 3 devices (out of 4) but not started (i.e. not degraded).
Running "/scripts/local-top/mdadm from-udev" did not work; the screen was filled with error messages:
"mdadm: SET_ARRAY_INFO failed for /dev/md0: Device or resource busy"
I could stop it only using the sysrq key, which took me back to the initramfs shell, with the raid array completely stopped. There, running the same command again assembled the array correctly.
[off topic] I could not leave the shell and continue booting; it took me back to this shell every time, saying:
"can't access tty; job control turned off"
[/off topic]
* The third boot also hung. The array was assembled with one drive only (thus it could not even have started in degraded mode). I tried running "mdadm -S /dev/md0" before doing anything else.
After this, running "/scripts/local-top/mdadm from-udev" assembled the array properly.

[off topic] But I still could not continue the boot. The next 4 boots also hung (but I did not wait). Then it accepted to boot again. [/off topic]

Revision history for this message
Reinhard Tartler (siretart) wrote :

Ian Jackson <email address hidden> writes:

> Could you please boot with break=premount, and do this:
> udevd --verbose >/tmp/udev-output 2>&1 &
> udevtrigger
> At this point I think you will find that your mds and lvms
> are not activated properly; check with
> cat /proc/partitions

The thing is that I cannot reproduce the problem this way, because my
raid and lvm devices come up as expected. The visible difference is
that I get tons of output on the screen after having typed
udevtrigger. I suspect that it slows down udev's operation, which avoids
the race condition. Is there some way to avoid this massive output on
the console while preserving the debug info in the tempfile?

Changed in mdadm:
status: Fix Committed → Confirmed
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

The problem here is clearly that mdadm is being run too early, and is trying to assemble RAIDs that are not yet complete.

Either mdadm needs to be fixed so that doing this is possible and harmless, as it is for lvm, etc., or the script that calls mdadm needs to check whether it is safe to call mdadm yet.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

With the --no-degraded option that has been added, it IS harmless for the array itself, but it can still block the boot.

If an array has been assembled too early, it should be stopped (running "mdadm -S /dev/md0" by hand worked for me) BEFORE trying to assemble it again; another race may be present here?

mdadm can tell if a device is part of an array; this could be used to check that all devices are present before assembling it, but it would slow things down or require some memory.

If a device is really missing (shit happens...), the array should then be assembled in degraded mode. This should probably happen automatically (or with a confirmation) after the timeout, or by adding a boot option.
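
(For reference, the manual recovery described in the test reports above amounts to:)

  mdadm -S /dev/md0                   # stop the half-assembled array first
  /scripts/local-top/mdadm from-udev  # then let the initramfs script retry the assembly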

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

How can it be assembled too early if --no-degraded is given?

Surely with that option, mdadm doesn't assemble the array if some devices are missing, instead of part-assembling it in degraded mode?

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

Perhaps it is another bug on my particular system...
I have written about it in previous comments; when trying to assemble /dev/md0 I have three different results:
* all devices of the array are available: /dev/md0 is created and working
* a device is missing (and --no-degraded is _not_ specified): /dev/md0 is created and working in degraded mode
* several devices are missing (or only one with --no-degraded): /dev/md0 is still created but _not_ working. I get a message like "too many missing devices, not starting the array", but it still appears in /proc/mdstat (not running but present; it may not be completely assembled but it is there!). I cannot assemble it until it has been fully stopped using "mdadm -S /dev/md0".

This looks weird, and I had not seen this before, but I had not tried to launch the array with 2 missing drives...

Revision history for this message
Jyrki Muukkonen (jvtm) wrote :

Can confirm this with a fresh installation from daily/20070301/feisty-server-amd64.iso.

Setup:
- only one disk
- md0 raid1 mounted as / (/dev/sda1 + other mirror missing, the installation ui actually permits this)
- md1 raid1 unused (/dev/sda3 + other mirror missing)

On the first boot I got to the initramfs prompt, with only md1 active:
mdadm: No devices listed in conf file were found.
stdin: error 0
Usage: modprobe ....
mount: Cannot read /etc/fstab: No such file or directory
mount: Mounting /root/dev ... failed
mount: Mounting /sys ... failed
mount: Mounting /proc ...failed
Target filesystem doesn't have /sbin/init

/bin/sh: can't access tty; job control turned off
(initramfs)

Second try gave the same prompt, but now md0 was active. However, it didn't boot:

(initramfs) cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0]
      15623104 blocks [2/1] [U_]

Finally, on the third try, it booted (well, got some warnings about missing /dev/input/mice or something like that, but that's not the point here).

Now I'm just sticking with / as a normal partition (+ others like /home as raid1). I'm hoping that the migration to raid1 goes fine after this problem has been fixed.

Revision history for this message
octothorp (shawn-leas) wrote :

I have been dealing with this for a while now, and it's a udev problem.

Check out the following:
https://launchpad.net/bugs/90657
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=403136

I have four identical SATAs, and sometimes the sd[a,b] device nodes fail to show up, sometimes they show up but are lagged, and sometimes it's sd[dc]...

I have to boot with break=premount and cobble things together manually, then exit the shell and let it boot.

Revision history for this message
octothorp (shawn-leas) wrote :

This is very much udev's fault. A fix is reportedly in debian's 105-2 version.

Revision history for this message
octothorp (shawn-leas) wrote :

Is anyone listening???

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

I would surely test a newer udev package if it fixes this problem. Using apt-get source to rebuild the debian package should be fairly easy, but I guess ubuntu maintains a set of additional patches, and merging them might be non-trivial. Are any of the ubuntu devs willing to upload a test package somewhere?

Reading the debian bug, I am not convinced it is the same problem; on my system I have SOME devices created and others lagging behind, while the debian bug seems to be about devices not created at all.

If the newer udev does not fix this problem, is it doable to detect "stalled" RAID arrays and stop them?

Revision history for this message
octothorp (shawn-leas) wrote : Re: [Bug 75681] Re: boot-time race condition initializing md

It seems it's all related to SATA initialization.

udev does not build like other packages, and I'd hate to miss something
about building it and then wreck my system totally.

On 3/13/07, Reinhard Tartler <email address hidden> wrote:
>
> ** Changed in: mdadm (Ubuntu Feisty)
> Target: None => 7.04-beta
>

Revision history for this message
pjwigan (pjwigan) wrote :

I've just tried Herd 5 x86 and hit a similar issue; only this box has no RAID capability.

Setup is:
  - 1 disk (/dev/sda1)
  - 1 DVD+RW (/dev/scd0)

Attempting to boot from the live CD consistently gives me:

  udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit

  BusyBox v1.1.3 (Debian ...

  /bin/sh: can't access tty; job control turned off
  (initramfs)

I'll try the 64 bit version and report back

Revision history for this message
octothorp (shawn-leas) wrote :

From what I've seen, I don't think this is the same. In your case it's
modprobe's fault by extension, probably caused by a module not loading
for whatever reason.

Caveat: I have not done bug hunting for your bug. I just don't think it's
this one.

On 3/13/07, pjwigan <email address hidden> wrote:
>
> I've just tried Herd 5 x86 and hit a similar issue; only this box has no
> RAID capability.
>
> Setup is:
> - 1 disk (/dev/sda1)
> - 1 DVD+RW (/dev/scd0)
>
> Attempting to boot from the live CD consistently gives me:
>
> udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit
>
> BusyBox v1.1.3 (Debian ...
>
> /bin/sh: can't access tty; job control turned off
> (initramfs)
>
>
> I'll try the 64 bit version and report back
>

Revision history for this message
pjwigan (pjwigan) wrote :

Thanks for the tip. Having dug deeper, 84964 is a perfect match.

Revision history for this message
Eamonn Sullivan (eamonn-sullivan) wrote :

I got hit with something that sounds very similar to this when upgrading to the 2.6.20-10-server kernel. My system works fine on -9. I ended up stuck in busybox with no mounted drives. I'm using a DG965 Intel motherboard with three SATA hard disks. The following are details on my RAID5 setup and LVM. My /boot partition is non-raid on the first SATA disk. What else do you need?

        Version : 00.90.03
  Creation Time : Sun Mar 4 14:26:40 2007
     Raid Level : raid5
     Array Size : 617184896 (588.59 GiB 632.00 GB)
    Device Size : 308592448 (294.30 GiB 316.00 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Wed Mar 14 07:24:03 2007
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           UUID : 9a1cfa02:4eddd96e:18354ce8:e82aff38
         Events : 0.13

    Number Major Minor RaidDevice State
       0 8 1 0 active sync /dev/sda1
       1 8 17 1 active sync /dev/sdb1
       2 8 33 2 active sync /dev/sdc1

And here's my lvm setup:

  --- Logical volume ---
  LV Name /dev/snape/root
  VG Name snape
  LV UUID 5j5Rh6-YeSD-vr3p-T1ci-RPoi-fpuI-t1k40O
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 20.00 GB
  Current LE 5120
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:0

  --- Logical volume ---
  LV Name /dev/snape/tmp
  VG Name snape
  LV UUID NP3pnf-uPAz-3jdd-fK6n-A62K-KGqU-UXQviZ
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 20.00 GB
  Current LE 5120
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:1

  --- Logical volume ---
  LV Name /dev/snape/var
  VG Name snape
  LV UUID 1sxfyk-b22f-ajmE-rtdg-Sg0h-hR2q-MjMzHi
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 250.00 GB
  Current LE 64000
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:2

  --- Logical volume ---
  LV Name /dev/snape/home
  VG Name snape
  LV UUID 52zbNh-vfpr-UKTZ-moXR-fVEN-Zqzk-061JTK
  LV Write Access read/write
  LV Status available
  # open 1
  LV Size 298.59 GB
  Current LE 76439
  Segments 1
  Allocation inherit
  Read ahead sectors 0
  Block device 254:3

Revision history for this message
Ian Jackson (ijackson) wrote :

pjwigan writes ("[Bug 75681] Re: boot-time race condition initializing md"):
> udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit

I think this is probably a separate problem. Are you using LILO ?
Can you please email me your lilo.conf ? (Don't attach it to the bug
report since I want to avoid confusing this report.)

Ian.

Revision history for this message
pjwigan (pjwigan) wrote :

The issue (which appears to be bug #84964 BTW) occurs
when trying to boot from the Herd 5 live CD, whether
32 or 64 bit. The PC has an up to date standard
install of 32 bit Edgy, so LILO is not involved.

One oddity tho': the udevd-event line only appears on
my secondary monitor. They usually display exactly
the same text until X starts.

--- Ian Jackson <email address hidden> wrote:

> pjwigan writes ("[Bug 75681] Re: boot-time race
> condition initializing md"):
> > udevd-event[2029]: run_program: '/sbin/modprobe'
> abnormal exit
>
> I think this is probably a separate problem. Are
> you using LILO ?
> Can you please email me your lilo.conf ? (Don't
> attach it to the bug
> report since I want to avoid confusing this report.)
>
> Ian.
>


Revision history for this message
octothorp (shawn-leas) wrote :

He has already discovered that his problem likely corresponds to a
different bug already in the system.

On 3/14/07, Ian Jackson <email address hidden> wrote:
>
> pjwigan writes ("[Bug 75681] Re: boot-time race condition initializing
> md"):
> > udevd-event[2029]: run_program: '/sbin/modprobe' abnormal exit
>
> I think this is probably a separate problem. Are you using LILO ?
> Can you please email me your lilo.conf ? (Don't attach it to the bug
> report since I want to avoid confusing this report.)
>
> Ian.
>

Revision history for this message
octothorp (shawn-leas) wrote :

I suggest breaking in and "sh -x"-ing the script that loads the modules,
tacking "-v" onto any modprobe lines, and capturing kernel messages
using a serial console or something.

You'll need to identify where it's failing, since there's a hard failure
at a consistent location.

Probably should move further discussion over to that bug before this one
loses its specificity.

On 3/14/07, pjwigan <email address hidden> wrote:
>
> The issue (which appears to be bug #84964 BTW) occurs
> when trying to boot from the Herd 5 live CD, whether
> 32 or 64 bit. The PC has an up to date standard
> install of 32 bit Edgy, so LILO is not involved.
>
> One oddity tho': the udevd-event line only appears on
> my secondary monitor. They usually display exactly
> the same text until X starts.
>

Revision history for this message
Eamonn Sullivan (eamonn-sullivan) wrote :

Just a note that 2.6.20-11-server appears to have solved this issue for me. The server is booting normally after today's update.

[29 comments hidden]
Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

my initramfs can be found here: http://gin.univ-mrs.fr/GINsim/download/initrd.img-2.6.20-13-generic
NOTE: it is not the one I booted with this morning, but I think I reverted all changes and built it again, so it should be pretty similar ;)

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien Naldi writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> On 3/28/07, Ian Jackson <email address hidden> wrote:
> > OK, so the main symptom in the log that I'm looking at was that you
> > got an initramfs prompt ?
>
> a normal boot would have given me a busybox, yes, but I added
> "break=premount" here, so the shell is not exactly a bug ;) The bug
> is that my array was not _correctly_ assembled after running
> udevtrigger

I see. Err, actually, I don't see. In what way was the assembly of
the raid incorrect ? You say it wasn't degraded. Was it assembled at
all ? Was it half-assembled ?

> No, I copied the output to one of my boot partitions, one that is _not_
> in the RAID/LVM. I did it to avoid running vgscan and friends by hand
> to see if the next part of the boot goes fine or not.

Right.

Ian.

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

Le mercredi 28 mars 2007 à 22:55 +0000, Ian Jackson a écrit :
> I see. Err, actually, I don't see. In what way was the assembly of
> the raid incorrect ? You say it wasn't degraded. Was it assembled at
> all ? Was it half-assembled ?

My memory does not serve me right, sorry!
In previous versions, the array was listed in /proc/mdstat with a set of
drives, but not really assembled. Before running the mdadm script I had
to stop it... This bug is now gone: the array was _not_ assembled and I
could run the mdadm script directly ;)

Revision history for this message
Reinhard Tartler (siretart) wrote :

Ian Jackson <email address hidden> writes:

> Reinhard, you may remember that on irc I asked you to try moving
> /usr/share/initramfs-tools/scripts/local-top/mdrun
> aside and rebuilding your initramfs. Did you try this in the end ?

Yes, I tried that, with the effect that reproducibly none of the raid
devices come up at all :(

Revision history for this message
Ian Jackson (ijackson) wrote :

Aurelien, sorry to ask you to do this tedious test again, but I've
been looking at this logfile and I think I should have told you to use
`>>' rather than `>' when writing the log. Also, I really need to
know more clearly what the fault was and what you did to fix it (if
anything). And while I'm at it I'd like to rule out a possible
failure mode.

So here's some more data collection instructions:

0. General: please write down everything you do. It's very difficult
   to do this debugging at a distance and accurate, detailed and
   reliable information about the sequence of events will make life
   much easier. If you have two computers available, use one for
   making notes. Otherwise make notes on paper.

1. Edit
     /usr/share/initramfs-tools/scripts/local-top/mdadm
   and insert
     echo "running local-top/mdadm $*"
   near the top, just after `set -eu'.

2. Install
     http://www.chiark.greenend.org.uk/~ian/d/udev-nosyslog/udev_103-0ubuntu14~iwj1_i386.deb
   (Sources can be found alongside, at
     http://www.chiark.greenend.org.uk/~ian/d/udev-nosyslog/)
   If you had that installed already then because of step 1 you must
   say
     update-initramfs -u

3. Boot with break=premount

4. At the initramfs prompt:
  udevd --verbose --suppress-syslog >>/tmp/udev-output 2>&1 &
  udevtrigger
   and wait for everything to settle down.

At this point we need to know whether your root filesystem is there.
If it is (/dev/VG/LV exists) then the attempt to reproduce has failed.

If the attempt to reproduce the problem has succeeded:

5. Collect information about the problem symptoms
  (cat /proc/partitions; mdadm -Q /dev/md0; mdadm -D /dev/md0) >>/tmp/extra-info 2>&1

6. Write down what you do to fix it. Preserve
   /tmp/udev-output and /tmp/extra-info eg by copying them to
   your root filesystem. Eg:
        pkill udevd
        mount /dev/my-volume-group/my-volume-name /root
        cp /tmp/udev-output* /root/root/.
        exit

7. Please comment on this bug giving the following information:
     * What you did, exactly
     * What the outcomes were including attaching these files
        udev-output
        extra-info
     * A location where I can download the /initrd.img you were
       using.
     * A description of your raid and lvm setup, if you haven't given
       that already.

Once again, I'm sorry to put you to this trouble. I think it's
essential to fix this bug for the feisty release and I have been
poring over your logs and trying various strategies to reproduce what
I suspect might be relevant failure modes, but without significant
success so far.

Thanks,
Ian.

Revision history for this message
Ian Jackson (ijackson) wrote :

Reinhard Tartler writes ("Re: [Bug 75681] Re: boot-time race condition initializing md"):
> Yes, I tried that with the effect that reproducibly none if the raid
> devices come up at all :(

I find this puzzling. I've double-checked your initramfs again
following other people's comments and the existence of the mdrun
script shouldn't matter nowadays. Anyway, if you can reliably
reproduce the problem this is good because it means I might be able to
fix it.

Can you please follow the instructions I've just given to Aurelien in
my last comment to this bug ? As I say I'm sorry to put you to all
this trouble - I know that repeatedly rebooting and messing about in
the initramfs are a PITA. But I really want to fix this bug.

Ian.

Revision history for this message
Reinhard Tartler (siretart) wrote :

octothorp <email address hidden> writes:

> I'm very interested in finally testing 105, and the missing
> udev_105.orig.tar.gz is a bit of a challenge, but at least there's a
> diff.

I'm terribly sorry, I've just uploaded the forgotten orig.tar.gz

Revision history for this message
Aurelien Naldi (aurelien.naldi) wrote :

OK, I really appreciate that you want to fix this and I really want to help, but this has been driving me nuts.
I can reproduce it *way too* reliably with a "normal" boot (i.e. it happens 8 times out of 10), but when trying to get a log it suddenly refused to happen :/
I started to think that the echo line added to the mdadm script caused a sufficient lag to avoid the trap and rebooted without this line; I could reproduce it and get a log at the first boot! Unfortunately, I messed something up and then tried to make a new one without much success.

Then I put the "echo running...." line back and tried again and again for half an hour to finally get one, but I forgot to put ">>" instead of ">" when redirecting udevd's output.
I give up for now as I have been unable to reproduce it since (except twice with a "normal" boot, but I was not willing to wait 3 minutes...)

So, here it is: I attach a tar.gz of the log I made, and I put my initramfs here: http://crfb.univ-mrs.fr/~naldi/initrd.img-2.6.20-13-generic

What I did:
boot with break=premount
launch udevd & udevtrigger, check that the bug happened, collect some info

to fix it:
/scripts/local-top/mdadm
lvm vgscan
lvm vgchange -a y

collect some more info

mount /dev/sda1 on /mnt # an ext2 partition for /boot
copy my log files here
umount /mnt
pkill udevd (sorry, I might have got nicer logs if I stopped it before, right?)
exit

watch it finish booting

Revision history for this message
Akmal Xushvaqov (uzadmin) wrote :

It doesn't upgrade the system. It shows me the following:

Failed to fetch http://uz.archive.ubuntu.com/ubuntu/dists/edgy-updates/Release.gpg  Connection broken
Failed to fetch http://uz.archive.ubuntu.com/ubuntu/dists/edgy/Release.gpg  Connection broken
Failed to fetch http://uz.archive.ubuntu.com/ubuntu/dists/edgy-backports/Release.gpg  Connection broken
Failed to fetch http://kubuntu.org/packages/kde4-3.80.3/dists/edgy/Release.gpg  Read error, the remote server closed the connection
Failed to fetch http://thomas.enix.org/pub/debian/packages/dists/edgy/main/binary-i386/Packages.gz  302 Found

Revision history for this message
Reinhard Tartler (siretart) wrote :

my setup (taken from a booted system):

siretart-@hades:~
>> cat /proc/mdstat
Personalities : [raid0] [raid1]
md1 : active raid1 sda5[0] sdb5[1]
      1951744 blocks [2/2] [UU]

md0 : active raid1 sda2[0] sdb2[1]
      489856 blocks [2/2] [UU]

md3 : active raid0 sda7[0] sdb7[1]
      395632512 blocks 64k chunks

md2 : active raid1 sda6[0] sdb6[1]
      97659008 blocks [2/2] [UU]

unused devices: <none>

siretart-@hades:~
>> sudo lvs
  LV VG Attr LSize Origin Snap% Move Log Copy%
  backup hades_mirror -wi-ao 42,00G
  home hades_mirror -wi-ao 25,00G
  ubunturoot hades_mirror -wi-ao 25,00G
  chroot_dapper hades_stripe -wi-a- 5,00G
  chroot_dapper32 hades_stripe -wi-a- 5,00G
  chroot_edgy hades_stripe -wi-a- 5,00G
  chroot_edgy32 hades_stripe -wi-a- 5,00G
  chroot_feisty hades_stripe -wi-a- 5,00G
  chroot_feisty32 hades_stripe -wi-a- 5,00G
  chroot_sarge32 hades_stripe -wi-a- 3,00G
  chroot_sid hades_stripe -wi-a- 5,00G
  chroot_sid32 hades_stripe owi-a- 5,00G
  dapper32-snap hades_stripe -wi-a- 2,00G
  mirror hades_stripe -wi-ao 89,00G
  scratch hades_stripe -wi-ao 105,00G
  sid-xine-snap hades_stripe swi-a- 3,00G chroot_sid32 26,43
  ubunturoot hades_stripe -wi-ao 25,00G
siretart-@hades:~
>> sudo pvs
  PV VG Fmt Attr PSize PFree
  /dev/md2 hades_mirror lvm2 a- 93,13G 1,13G
  /dev/md3 hades_stripe lvm2 a- 377,30G 10,30G
siretart-@hades:~
>> sudo vgs
  VG #PV #LV #SN Attr VSize VFree
  hades_mirror 1 3 0 wz--n- 93,13G 1,13G
  hades_stripe 1 14 1 wz--n- 377,30G 10,30G

The (primary) root volume is /dev/hades_stripe/ubunturoot.

I wasn't able to reproduce the problem with the instructions you
gave. However, I modified
/usr/share/initramfs-tools/scripts/init-premount/udev to look like this:

--- /usr/share/initramfs-tools/scripts/init-premount/udev 2007-03-29 20:44:30.000000000 +0200
+++ /usr/share/initramfs-tools/scripts/init-premount/udev~ 2007-03-29 20:30:21.000000000 +0200
@@ -20,9 +20,10 @@
 # It's all over netlink now
 echo "" > /proc/sys/kernel/hotplug

+sleep 3
+
 # Start the udev daemon to process events
-#/sbin/udevd --daemon
-/sbin/udevd --verbose --suppress-syslog >> /tmp/udev-output 2>&1 &
+/sbin/udevd --daemon

 # Iterate sysfs and fire off everything; if we include a rule for it then
 # it'll get handled; otherwise it'll get handled later when we do this again

This way (okay, after two boots), I was able to reproduce...


Revision history for this message
Reinhard Tartler (siretart) wrote :
Revision history for this message
Jeffrey Knockel (jeff250) wrote :

Reinhard, I tried your Ubuntu deb (105-4), and it brought the bug back for me consistently. Then I reverted to debian unstable's version (0.105-4), and now I'm booting consistently correctly again.

This is interesting. Clearly what is causing my bug must be in the Ubuntu packaging? I've already tried gouging out the Ubuntu patches in debian/patches of the Ubuntu package that you gave me (and hacking debian/rules to accommodate) and then repackaging and reinstalling it, but still no go. This is as far as I have time to play around with it tonight, but before I try again in the near future, does anyone have any ideas?

Revision history for this message
Reinhard Tartler (siretart) wrote :

Jeff250 <email address hidden> writes:

> Reinhard, I tried your Ubuntu deb (105-4), and it brought the bug back
> for me consistently. Then I reverted to debian unstable's version
> (0.105-4), and now I'm booting consistently correctly again.
>
> This is interesting. Clearly what is causing my bug must be in the
> Ubuntu packaging?

As you might have already noticed, udev is a pretty critical and central
piece of software in the boot procedure of a debian/ubuntu system. The
packaging and integration of udev is pretty different from debian's. The
change which caused this is the implementation of the UdevMdadm spec,
which was approved at the last UDS. I'm pretty sure that by reverting to
the packaging of the edgy udev package, the problem won't
appear. Unfortunately, that seems not to be an option for feisty. :(

Revision history for this message
Codink (herco) wrote :

I think the solution is in:
Bug report: 83231
https://launchpad.net/ubuntu/+source/initramfs-tools/+bug/83231

the "udevsettle --timeout 10" at the end of /usr/share/initramfs-tools/scripts/init-premount/udev works too.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

We believe that this problem has been corrected by a series of uploads today. Please update to ensure you have the following package versions:

    dmsetup, libdevmapper1.02 - 1.02.08-1ubuntu6
    lvm-common - 1.5.20ubuntu12
    lvm2 - 2.02.06-2ubuntu9
    mdadm - 2.5.6-7ubuntu5 (not applicable unless you're also using mdadm)
    udev, volumeid, libvolume-id0 - 108-0ubuntu1

The problem was caused by a number of ordering issues and race conditions relating to when lvm and mdadm were called, and how those interacted to ensure the devices were created and their contents examined.

This should work as follows:
 * an underlying block device, sda1, is detected
 * udev (through vol_id) detects that this is a RAID member
 * udev invokes mdadm, which fails to assemble because the RAID-1 is not complete
 * the creation of a new raid, md0, is detected
 * udev fails to detect this device, because it is not yet complete

meanwhile:
 * a second underlying block device, sdb1, is detected
 * udev (through vol_id) detects that this is a RAID member
 * udev invokes mdadm, which can now complete since the set is ready
 * the change of the raid array, md0, is detected
 * udev (through vol_id) detects that this is an LVM physical volume
 * lvm is called to handle the creation of the devmapper devices

then
 * various devmapper devices are detected
 * the devices are created by udev, named correctly under /dev/mapper
 * meanwhile the requesting application spins until the device exists, at which point it carries on
 * udev (through vol_id) detects that these devices contain typical filesystems
 * using vol_id it obtains the LABEL and UUID, which is used to populate /dev/disk
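
(A simplified illustration of the kind of initramfs udev rules that drive this sequence; the real Ubuntu rules differ in their exact paths and helper invocations.)

  SUBSYSTEM=="block", ACTION=="add|change", IMPORT{program}="/sbin/vol_id --export $tempnode"
  ENV{ID_FS_TYPE}=="linux_raid_member", RUN+="/scripts/local-top/mdadm from-udev"
  ENV{ID_FS_TYPE}=="LVM2_member", RUN+="/scripts/local-top/lvm"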

Note that this event-based sequence is substantially different from Debian, so any bugs filed there will not be relevant to helping solve problems in Ubuntu.

This should now work correctly. If it does not, I would ask that you do not re-open this bug, and instead file a new bug on lvm2 for your exact problem, even if someone else has already filed one, with verbose details about your setup and how you cause the error.

Changed in lvm2:
status: Unconfirmed → Fix Released
Changed in mdadm:
status: Confirmed → Fix Released
Changed in udev:
status: Unconfirmed → Fix Released
Revision history for this message
Manoj Kasichainula (manoj+launchpad-net) wrote :

I don't use LVM, yet I am seeing the same problem with software RAID. I just dist-upgraded, reran update-initramfs just to be sure, and saw the failure at boot. The package list below confirms I have the most recent versions mentioned above:

> dpkg -l dmsetup libdevmapper1.02 lvm-common lvm2 mdadm udev volumeid libvolume-id0
No packages found matching lvm-common.
No packages found matching lvm2.
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Installed/Config-files/Unpacked/Failed-config/Half-installed
|/ Err?=(none)/Hold/Reinst-required/X=both-problems (Status,Err: uppercase=bad)
||/ Name Version Description
+++-=======================-=======================-==============================================================
ii dmsetup 1.02.08-1ubuntu6 The Linux Kernel Device Mapper userspace library
ii libdevmapper1.02 1.02.08-1ubuntu6 The Linux Kernel Device Mapper userspace library
ii libvolume-id0 108-0ubuntu1 volume identification library
ii mdadm 2.5.6-7ubuntu5 tool to administer Linux MD arrays (software RAID)
ii udev 108-0ubuntu1 rule-based device node and kernel event manager
ii volumeid 108-0ubuntu1 volume identification tool

The udevsettle script from https://launchpad.net/ubuntu/+source/initramfs-tools/+bug/83231 fixed my problem.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Manoj: as noted above, please file a new bug

Revision history for this message
Sami J. Laine (sjlain) wrote :

Scott James Remnant wrote:
> We believe that this problem has been corrected by a series of uploads
> today. Please update to ensure you have the following package versions:
>
> dmsetup, libdevmapper1.02 - 1.02.08-1ubuntu6
> lvm-common - 1.5.20ubuntu12
> lvm2 - 2.02.06-2ubuntu9
> mdadm - 2.5.6-7ubuntu5 (not applicable unless you're also using mdadm)
> udev, volumeid, libvolume-id0 - 108-0ubuntu1
>
> The problem was caused by a number of ordering issues and race
> conditions relating to when lvm and mdadm were called, and how those
> interacted to ensure the devices were created and their contents
> examined.
>
> This should work as follows:
> * an underlying block device, sda1, is detected
> * udev (through vol_id) detects that this is a RAID member
> * udev invokes mdadm, which fails to assemble because the RAID-1 is not complete
> * the creation of a new raid, md0, is detected
> * udev fails to detect this device, because it is not yet complete
>
> meanwhile:
> * a second underlying block device, sdb1, is detected
> * udev (through vol_id) detects that this is a RAID member
> * udev invokes mdadm, which can now complete since the set is ready
> * the change of the raid array, md0, is detected
> * udev (through vol_id) detects that this is an LVM physical volume
> * lvm is called to handle the creation of the devmapper devices
>
> then
> * various devmapper devices are detected
> * the devices are created by udev, named correctly under /dev/mapper
> * meanwhile the requesting application spins until the device exists, at which point it carries on
> * udev (through vol_id) detects that these devices contain typical filesystems
> * using vol_id it obtains the LABEL and UUID, which is used to populate /dev/disk
>
> Note that this event-based sequence is substantially different from
> Debian, so any bugs filed there will not be relevant to helping solve
> problems in Ubuntu.
>
> This should now work correctly. If it does not, I would ask that you do
> not re-open this bug, and instead file a new bug on lvm2 for your exact
> problem, even if someone else has already filed one, with verbose
> details about your setup and how you cause the error.

The problem persists. The only solution is still to use the break=mount
option to boot.

However, I don't use LVM at all, so I don't think I should file a bug on
lvm.

--
Sami Laine @ GMail

Revision history for this message
Wilb (ubuntu-wilb) wrote :

Exactly the same problem for me here too - no LVM in sight, just md0 and md1 as /boot and / respectively; fixed by using break=mount and mounting manually.

Revision history for this message
Oliver Brakmann (obrakmann) wrote :

Did somebody already report a new bug on this?
If not, there are still two other open bugs with the same issue, one of them new: bug #83231 and bug #102410

Revision history for this message
Mathias Lustig (delaylama) wrote :

Hi everyone,

yesterday evening (here in Germany), I also got into trouble using the latest lvm2 upgrade available for feisty. I upgraded the feisty install on my desktop PC, and during the upgrade everything just ran fine. The PC is equipped with two 160 GB Samsung SATA drives on an Nforce4 SATA controller.

After a reboot, usplash just hangs forever at the point where the LVM should mount all its volumes. It hangs and hangs and hangs and nothing happens. This happens with every kernel that's installed (the 386, generic and lowlatency kernels from 2.6.20-11, 2.6.20-12 and 2.6.20-13). Booting in single-user / recovery mode and interrupting the LVM with Ctrl+C brings me to a root shell where mount -a enables me to mount all my logical volumes - obviously something the lvm init script just won't do on its own.

Is there some workaround to get my LVM working again until a new, fixed package is available?

Ah, something I forgot: at my workplace there's a feisty workstation, too, which also uses LVM on a single ATA disk. There I experienced no problems; the upgrade just worked fine. Now I'm a little afraid that upgrading my notebook will also break the LVM ...

Is there any file or any other thing that I should provide to help you find and fix the bug?

Greetings,

Mathias

Revision history for this message
Mathias Lustig (delaylama) wrote :

I forgot to mention that I do NOT use any software RAID underneath the LVM. The LVM just spans my two disks, which are joined together in one volume group.

Revision history for this message
Reinhard Tartler (siretart) wrote :

Could you try removing the evms package and regenerating your initramfs? This fixed the problem for me.
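
That is, roughly:

  sudo apt-get remove evms
  sudo update-initramfs -u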

Scott has a fix for evms in the pipeline, but feel free to file another bug to document this issue.

Revision history for this message
Mathias Lustig (delaylama) wrote :

Okay, I can try to remove the evms package and rerun update-initramfs this evening. I'm not sure whether evms is installed, but maybe it is ;)
The biggest problem is that the ubuntu installation on my desktop PC was dapper at first, where I installed a lot of extra software. Three weeks after the edgy release I upgraded from Dapper to Edgy, and about 3 weeks ago I just wanted to do something new and upgraded to feisty. I can never be sure whether a specific problem is an error in feisty or whether it comes from some strange packages that remained from an earlier, dist-upgraded ubuntu version.
Maybe there are potential sources of error under these conditions...

I'll try to follow your advice to track down my screwed-up lvm2 / evms problem. I don't know whether it's useful to file a bug report for an already known problem...

Btw - (I know that this is probably not the right place for that question) - but what are the differences between lvm2 and evms? *confused*
Just wondering why evms might be installed on my desktop even though I don't need it, since lvm2 is the logical volume manager of choice ...

Revision history for this message
cionci (cionci-tin) wrote :

My Feisty is up to date. I have the same BusyBox problem at boot-up. Can you tell me how to track down the issue? I don't have raid, but I have an Adaptec 19160 SCSI controller on which the root partition resides.

Revision history for this message
Jeff Balderson (jbalders) wrote :

This isn't fixed for me. I opened a new report (Bug # 103177) as requested by Scott James Remnant.

My problem still has the exact same symptoms that I described above.

Revision history for this message
den (den-elak) wrote :

So there is still no cure!?? The workaround with MD_DEGRADED_ARGS helps assemble the array only when all of the raid devices are found by udev. But there is still very BAD behaviour: if we unplug one of the raid disks, the system hangs during the boot process.

PS. Just installed feisty server, and updated.
I use MD, and LVM2 on top of it.
The setup has two SATA drives acting in RAID1 mode.
mdadm 2.5.6-7ubuntu5
udev 108-0ubuntu4
initramfs-tool 0.85eubuntu10
lvm2 2.02.06-2ubuntu9

I fixed the problem for me by using udevsettle and setting MD_DEGRADED_ARGS=" ", but that's an ugly method...

Can you tell me whether there is any proper plan for fixing the problem?

Revision history for this message
dan_linder (dan-linder) wrote :

I was having a similar issue, but a solution in this bug report fixed it for me:

https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/99439/comments/1

[Quote]
Try adding "/sbin/udevsettle --timeout=10" in /usr/share/initramfs-tools/init before line:
  log_begin_msg "Mounting root file system..."
and then rebuild initrd with:
  sudo update-initramfs -u -k all
[/Quote]

Does this help someone/anyone on this bug thread?

Dan

[1 comment hidden]
Revision history for this message
den (den-elak) wrote :

Thanks Dan!
That's true, I did manage to solve the problem just the same way.
But why is /sbin/udevsettle --timeout=10 not in the distribution itself? Perhaps something is wrong with it?

Revision history for this message
Reinhard Tartler (siretart) wrote :

den <email address hidden> writes:

> But why /sbin/udevsettle --timeout=10 is not in the distribution
> itself, perhaps something is wrong with that?

That's not the right fix. It is a workaround that seems to work on some
machines, though...

Scott, do you think that workaround is worth an upload to
feisty-updates? I could perhaps upload a test package to my ppa...

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
Reinhard Tartler (siretart) wrote :

Okay, I uploaded a package with the udevsettle line to my ppa. Add this to your /etc/apt/sources.list:

deb http://ppa.dogfood.launchpad.net/siretart/ubuntu/ feisty main

do a 'apt-get update && apt-get dist-upgrade'. Please give feedback if that package lets your system boot.

Changed in initramfs-tools:
importance: Undecided → High
status: New → Incomplete
Revision history for this message
den (den-elak) wrote :

Hello Reinhard!
My system boots fine with your updated initramfs-tools package!

Revision history for this message
den (den-elak) wrote :

Hello!

Not everything is OK as I expected. I use /sbin/udevsettle --timeout=10 at the end of the /etc/initramfs-tools/scripts/init-premount/udev script. Sometimes the system boots fine, but after one reboot cat /proc/mdstat gave me:
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      116712128 blocks [2/2] [UU]

md0 : active raid1 sda1[0]
      505920 blocks [2/1] [U_]

unused devices: <none>
The first raid array (md0) doesn't assemble completely!

PS!
I also set MD_DEGRADED_ARGS=" " in local-top/mdadm.

Revision history for this message
Reinhard Tartler (siretart) wrote :

den <email address hidden> writes:

> Not everything is OK as I expected. So I use /sbin/udevsettle
> --timeout=10 at the end of
> /etc/initramfs-tools/scripts/init-premount/udev script! Sometimes
> system boots fine but once I have rebooted, I get cat /proc/mdstat:
> Personalities : [raid1]

As said, this is a nasty race condition in the bootup
procedure. Honestly, I don't believe we can fix it in feisty. In gutsy,
my system now seems to boot reliably. It would be great if you could
verify it, so that we can fix it there.

--
Gruesse/greetings,
Reinhard Tartler, KeyID 945348A4

Revision history for this message
den (den-elak) wrote :

I think I can try, but I have put the system into production. If you could tell me exactly which packages are involved (kernel, initramfs-tools, ...) and how to get them from the gutsy repository without a full dist-upgrade, I can check that on that system and report back.

Revision history for this message
Peter Haight (peterh-sapros) wrote :

There is a problem with the solution outlined by Scott James Remnant above. (https://bugs.launchpad.net/ubuntu/+source/initramfs-tools/+bug/75681/comments/84)

What happens when one of the raid devices has failed and isn't getting detected? Take Scott's example of a raid with sda1 and sdb1:

> This should work as follows:
> * an underlying block device, sda1, is detected
> * udev (through vol_id) detects that this is a RAID member
> * udev invokes mdadm, which fails to assemble because the RAID-1 is not complete
> * the creation of a new raid, md0, is detected
> * udev fails to detect this device, because it is not yet complete

At this point, mdadm should assemble the RAID with just sda1 because sdb1 is down, but in the current scheme mdadm only assembles the RAID if all drives are available. This sort of defeats the point of using any of the mirrored RAID schemes.

So because the only case that I know of where this is an issue is a case with drive failure, how about trying to run mdadm again after the root mount timeout, but this time without the --no-degraded arg so that if we can assemble some of the RAID arrays without the missing drives, we do it.

I'll attach some patches to some /usr/share/initramfs-tools scripts which fix this problem for me.
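
Roughly, the idea is the following (an illustrative sketch, not the attached patches; ${ROOT} stands in for however the local script refers to the root device):

  # After the root-device wait has timed out, retry assembly and this time
  # allow degraded arrays, so a mirror with a dead member can still boot.
  if [ ! -e "${ROOT}" ]; then
      mdadm --assemble --scan --run   # no --no-degraded on this second attempt
  fi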

So then the question is how do we know that sdb1 is down and should go ahead with assembling the RAID array? I'm not sure exactly what kind of information we have this early in the bootup process, but how about something like this

Revision history for this message
Peter Haight (peterh-sapros) wrote :

So, there is something wrong with that patch. Actually it seems to be working great, but when I disconnect a drive to fail it, it boots up immediately instead of trying mdadm after the timeout. So I'm guessing that the mdadm script is getting called without the from-udev parameter somewhere else. But it is working in some sense because the machine boots nicely with one of the RAID drives disconnected, or with both of them properly setup. So there might be some race problem with this patch.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for initramfs-tools (Ubuntu Feisty) because there has been no activity for 60 days.]

Revision history for this message
xteejx (xteejx-deactivatedaccount) wrote :

LP Janitor did not change the status of the initramfs-tools task. Changing it to Invalid. If this is wrong, please change it to Fix Released, etc. Thanks.

Changed in initramfs-tools (Ubuntu):
status: New → Invalid
