boot failures as /dev is not transferred to /root (because 'udevadm exit' times out waiting for a deadlocked worker)

Bug #818177 reported by Tom Ellis on 2011-07-29
220
This bug affects 41 people
Affects Status Importance Assigned to Milestone
Release Notes for Ubuntu
Undecided
Unassigned
linux (Ubuntu)
High
Unassigned
Oneiric
High
Unassigned
Precise
Undecided
Unassigned
udev (Ubuntu)
High
Andy Whitcroft
Oneiric
High
Andy Whitcroft
Precise
High
Andy Whitcroft

Bug Description

After running a do-release-upgrade -d on a natty server, I'm unable to boot the machine properly.

It looks like it doesn't see the local disks anymore.

I think this looks quite a lot like the debbug linked from here:
https://bugs.launchpad.net/ubuntu/oneiric/+source/linux/+bug/746751

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: linux-image-3.0.0-7-server 3.0.0-7.8
ProcVersionSignature: Ubuntu 3.0.0-7.8-server 3.0.0
Uname: Linux 3.0.0-7-server x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 2011-07-29 13:34 seq
 crw-rw---- 1 root audio 116, 33 2011-07-29 13:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Date: Fri Jul 29 13:40:10 2011
HibernationDevice: RESUME=UUID=844d17fa-ff69-4250-844b-33b930236c56
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.

 eth1 no wireless extensions.
MachineType: HP ProLiant DL380 G5
PciMultimedia:

ProcEnviron:
 LANGUAGE=en_US:
 PATH=(custom, user)
 LANG=en_US
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.0.0-7-server root=/dev/mapper/User Name-root ro INIT_VERBOSE=yes
RelatedPackageVersions:
 linux-restricted-modules-3.0.0-7-server N/A
 linux-backports-modules-3.0.0-7-server N/A
 linux-firmware 1.56
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: Upgraded to oneiric on 2011-07-29 (0 days ago)
dmi.bios.date: 06/28/2007
dmi.bios.vendor: HP
dmi.bios.version: P56
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP56:bd06/28/2007:svnHP:pnProLiantDL380G5:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL380 G5
dmi.sys.vendor: HP

Tom Ellis (tellis) wrote :
Tom Ellis (tellis) wrote :

Hmm, this now looks like some sort of race condition on based on a couple of issues. I've managed to boot successfully once in 10 attempts without any problems but more often than none I either get to a login prompt and no disks visible or stuck in a udev loop with the errors:

udev[90]: timeout: killing '/sbin/modprobe -bv pci:v000014E4d0000164Csv0000103Csd00007038b
c02sc00i00' [178]

Lots and lots of the above shown, then finally:

bnx2: Can't load firmware file "bnx2/bnx2-mips-06-6.2.1.fw"
bnx2: 0000:05:00.0 PCI INT A disabled
bnx2: probe of 0000:05:00.0 failed with error -2

Brad Figg (brad-figg) on 2011-08-15
Changed in linux (Ubuntu):
status: New → Confirmed
Andy Whitcroft (apw) wrote :

The dmesg on this bug is from a rare good boot. it would be also good to get a dmesg from a bad boot as well.

Marc Cluet (lynxman) wrote :
Marc Cluet (lynxman) wrote :
Changed in linux (Ubuntu):
importance: Undecided → High
Marc Cluet (lynxman) wrote :
Marc Cluet (lynxman) wrote :
Andy Whitcroft (apw) on 2011-08-31
summary: - HP DL380G5 does not see disks after upgrade to Oneiric
+ HP DL380G5 root disk mounted read-only on boot and boot fails
Marc Cluet (lynxman) wrote :

System is able to restart

Marc Cluet (lynxman) wrote :

If we remount manually while in busybox

# mount -o rw,remount /

Then proceed to resume boot, system just shows

rc-sysinit start/running, process 1011

Andy Whitcroft (apw) wrote :

Ok. so the disk can be seen and mounts ok. I am suspicious this is a userspace/init issue.

Note that there are no /dev/disk/* directories or symlinks. Not sure
if that matters...

Marc Cluet (lynxman) wrote :
Marc Cluet (lynxman) wrote :
Adam Gandelman (gandelman-a) wrote :

Seems I can reproduce something similar and break an otherwise functioning oneiric system just by adding LVM into my setup. If I create a VG and LV using /dev/sda7 as the backing PV (the last partition of a single-disk system), init-bottom hangs until udev kills '/sbin/lvm vgscan; /sbin/lvm vgchange -a -y' due to timeout. The root fileystem ends up read only 3 of 5 boots. Will try to gather more debugging info and report here.

Adam Gandelman (gandelman-a) wrote :

Not sure if its a hack or a fix, but allowing udev to settle before stopping it in init-bottom seems to do the trick.

I've modified /usr/share/initramfs-tools/scripts/init-bottom/udev according to http://paste.ubuntu.com/679222/ and regenerated my initre.img

I believe this is something more general wrt LVM and init as my symptoms are different but workaround seems to fix both cases.

Adam Gandelman (gandelman-a) wrote :

Attached is serial console log of a failed boot from a virtual machine. I've found it easy to reproduce the race in a kvm virtual machine using the daily oneiric iso. Results match what I encounter on real hardware:

Install oneiric, using standard guided partition table.
Attach a second storage dev
pvcreate /dev/sdb
vgcreate test-vg /dev/sdb
lvcreate -L 100M -n test-lv test-vg

Reboot. Seems to hit ~1/5 boots.

Dave Walker (davewalker) on 2011-09-02
tags: added: server-o-ro
Changed in linux (Ubuntu):
assignee: nobody → Ubuntu Foundations Team (ubuntu-foundations-team)
Steve Langasek (vorlon) wrote :

I'm not sure if there's a kernel bug here (there might be?), but I'm pretty sure there is a udev bug, as indicated by the linked branch. However, just adding an extra 'udevadm settle' to the initramfs before killing udev is not sufficient to fix the race conditions here; udevadm settle + udevadm control --exit is not atomic, which means any new events that arrive in between the two calls might still be lost by whatever's going on in udevadm control.

udevadm control --exit needs to be made robust, to ensure that any events already being processed by udev at the time it's called get fully processed, and not lost in the middle as seems to be happening now.

Changed in udev (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Ubuntu Foundations Team (ubuntu-foundations-team)
Changed in linux (Ubuntu):
assignee: Ubuntu Foundations Team (ubuntu-foundations-team) → nobody
Marc Cluet (lynxman) wrote :

We did a ver throughout diagnose with apw which determined that the bug was sitting on user level rather than kernel level, I'm subscribing him to the bug as well so he can provide any further feedback about it.

Colin Watson (cjwatson) on 2011-09-07
Changed in udev (Ubuntu):
assignee: Ubuntu Foundations Team (ubuntu-foundations-team) → James Hunt (jamesodhunt)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel currently in the release pocket than the one you tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

Thank you for your help.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.0.0-7.8
Brad Figg (brad-figg) on 2011-09-12
tags: added: kernel-request-3.0.0-11.17
removed: kernel-request-3.0.0-7.8

I've now got a setup similar to Adams, but even with the 3.0.0-11.17 kernel, I'm still seeing the problem. Took 14 good boots to get a bad one though :) I'm testing with oneiric server.

Leonardo Borda (lborda) wrote :

Hi Guys,

We're still experiencing this issue on DL380G5 servers. Tested on latest daily oneiric kernel. 16/09/2011.
If you're lucky you can get a good boot 2 out of 10 times.

Leonardo

James Page (james-page) wrote :

I think this is the same issue as bug 801494

If you use the multi LVM recipe with a preseed on relatively slow IO hardware it happens 100% of the time.

Stefan Bader (smb) wrote :

This looks quite like it is likely related to /usr/share/initramfs-tools/scripts/init-bottom/udev. How things should work is:

- we got the initramfs mounted as initial /
- udev gets started and left running until the real root device node becomes available
- now udev gets stopped (used to be clubbed to death which caused even more problems) and /dev is tried to be
  moved to <realroot>/dev.
- <newroot> bcomes /, restart udev, and things should proceed from there

From the serial console log of comment #18 it looks like things went ok until the time udev should stop. There is no error message there but somehow it looks like fsck in the new / does not find the uuid link for it. Unclear whether maybe the move of /dev did not work, or did the previous step resolve the uuid to dev without the symlink? As Adam mentioned before, allowing udev to settle helps. So the race is somewhere there. Whoever can reproduce this may want to add additional echo's to the init-bottom script.

Quoting Stefan Bader (<email address hidden>):
> This looks quite like it is likely related to /usr/share/initramfs-
> tools/scripts/init-bottom/udev. How things should work is:

This is also where it appears to hang on my systems.

> - we got the initramfs mounted as initial /
> - udev gets started and left running until the real root device node becomes available
> - now udev gets stopped (used to be clubbed to death which caused even more problems) and /dev is tried to be
> moved to <realroot>/dev.
> - <newroot> bcomes /, restart udev, and things should proceed from there
>
> >From the serial console log of comment #18 it looks like things went ok
> until the time udev should stop. There is no error message there but
> somehow it looks like fsck in the new / does not find the uuid link for
> it. Unclear whether maybe the move of /dev did not work, or did the
> previous step resolve the uuid to dev without the symlink? As Adam
> mentioned before, allowing udev to settle helps. So the race is
> somewhere there. Whoever can reproduce this may want to add additional
> echo's to the init-bottom script.

(TBH I don't believe anyone can 'not reproduce it'. It sometimes takes
me 10 boots, but a standard ubuntu server kvm vm with a single primary
partition on virtio occasionally hangs there as well).

I've tried adding echos, but as near as I can tell, adding any echo
before the udevadm control --exit stops the hang. I rebooted 10 times
last night trying to do a 'ps -ef' before it, but that seemed to slow it
down just enough to stop the race.

I've tried (blindly) installing a recompiled udev which keeps handling
inotify events while waiting to exit, but that doesn't seem to be the
problem. It looks as though udevadm control --exit simply is hanging.

Perhaps some debug statements in udevd itself will be more helpful.
(will try)

On my system, things go better when I (re-)replace the 'udevadm control --exit' in /scripts/init-bottom/udev with 'pkill udevd', as was the case in natty's udev.

Then the timeout-followed-by-readonly-fs on bad boots is instead replaced by a 60 second timeout followed by a seemingly clean boot. kern.log then shows:

Sep 21 12:20:51 o-nolvm kernel: [ 2.450409] init: Handling started event
Sep 21 12:20:51 o-nolvm kernel: [ 2.450572] init: Reconnecting to system bus
Sep 21 12:20:51 o-nolvm kernel: [ 2.451993] init: mountall post-stop process
(336) exited normally
Sep 21 12:20:51 o-nolvm kernel: [ 2.452038] init: mountall state changed from
 post-stop to waiting
Sep 21 12:20:51 o-nolvm kernel: [ 2.452383] init: Handling stopped event
Sep 21 12:21:51 o-nolvm kernel: [ 61.925549] init: udevtrigger post-stop proce
ss (301) exited normally

the udevtrigger post-stop just does

post-stop exec udevadm settle

So I'm guessing that in both cases a hung handler or missing event is the problem.

tags: added: rls-mgr-o-tracking
Changed in udev (Ubuntu Oneiric):
milestone: none → ubuntu-11.10
Serge Hallyn (serge-hallyn) wrote :

(I no longer believe this is related to the lvm udev problem - i guess that one is a simple deadlock)

Unfortunately any time I try to instrument udevd to print the list of pending events and workers when it is told to exit, I manage to mask the bug so I can't reproduce.

Stefan Bader (smb) wrote :

Just for reference: it was the pkill of udev, that was causing trouble in tests I did on xen guests. In that case pkill sometimes had to be redone a few times. And even then there could be even rarer cases where udev did not seem to be killable at all. But if that was not done correctly it would leave the initrd not empty and that causes (slightly overreacting there) a panic.

Dave Walker (davewalker) on 2011-09-27
summary: - HP DL380G5 root disk mounted read-only on boot and boot fails
+ boot failures caused by udev race

I have a few machines throwing a kernel panic while netbooting oneiric with a similar error message.

I have attached the boot message log with the panic.

James Hunt (jamesodhunt) wrote :

It looks like udevd.c is rather aggressive when handling the exit scenario. I'm currently building a debug version of udevd + initramfs to try and see if and how messages are getting lost.

Steve Langasek (vorlon) wrote :

Andrew, I'm not sure you're experiencing the same issue; I would say in fact that you have some unrelated kernel bug, since there's no excuse for it taking 2 minutes to settle the kernel event queue. Getting a dump of 'udevadm monitor -e' from this initramfs (which would need to be started *before* this point to capture the actual events) should be instructive. Could you please open a new bug report for this and link it back here?

Dave Walker (davewalker) wrote :

@James, How did you get on with the debug version of udevd? When we tried to do the same, we were unable to reproduce the bug, as the debug statements seemed to slow down udev, hiding the race. Was this the same behaviour you encountered?

Thanks.

James Hunt (jamesodhunt) wrote :

@Dave: I've modded udevd to display some internal details, but cannot now make the images I have fail to boot reliably. Currently working with @jamespage who has a machine that fails to boot most times.

I've tried to force more frequent failures by installing with lots of LVs, but that doesn't seem to help either on virtual h/w (kvm) or the limited physical h/w I have access to. I've even gone as far as forcing particular LVs to respond *extremely* slowly (multi-second response time), but that hasn't helped trigger the problem either).

James Hunt (jamesodhunt) wrote :

BTW - I've also looked at how Fedora 15 stops udev with dracut and using Stefans terminology, they "club it to death" like we used to. They also pepper the code with frequent calls to "settle" and add a few sub-second sleeps here and there which feels horribly wrong IMHO.

I have an off-beat idea as to how we could establish if messages are getting lost which I may be forced to pursue unless I can overcome the Heisenberg nature of debugging udevd.

James Hunt (jamesodhunt) wrote :

@Adam: from comment #18, do you still have an image that fails to boot 1 in 5 times?

Serge Hallyn (serge-hallyn) wrote :

@JamesHunt,

you mention LVs in comment #34, but assuming this is the same bug causing my hangs and read-only rootfs on VMs, it does not require LVs. Unfortunately, like you, whenever I've instrumented grub to print out the list of pending events, I can't reproduce it :)

Perhaps kgdb is the way to go here.

Serge Hallyn (serge-hallyn) wrote :

@JamesHunt,

have you pursued trying to reproduce with a version of udevd which continues to process events when udev_exit==1?

I'm unclear as to whether (1) we need to continue to process inotify events as well (so that udev workers don't get hung), and (2) whether that just means that kobject_uevent messages from the kernel will still get lost between the exit of the initramfs' udevd and the rootfs one.

Adam Gandelman (gandelman-a) wrote :

@James: Im currently at a conference with limited wifi, but I checked yesterday and I can consistently reproduce on my thinkpad + kvm. Any chance you can publish those modded udev packages to a branch or PPA? I'm happy to test and see if I can get anything useful

James Hunt (jamesodhunt) wrote :

@Adam: Not quite yet, but will work on that when I get a chance tomorrow if possible. I'm intrigued by your setup though as I have a thinkpad + kvm but only see the problem very infrequently (and I've tried setting up images as you specify). How exactly are you invoking kvm? To clarify, are you using the server image?

@Serge: what was your storage configuration when you saw the problem without LVs? I think I've seen the problem a couple of times simply by providing 2 extra raw disks to the system but at that point my udevd debug wasn't helpful. I have tried modifying udevd to continue processing events but it then hangs. I really need to be able to recreate the problem more reliably on a system I've got control over before fathoming this one out.

Steve Langasek (vorlon) on 2011-10-05
Changed in linux (Ubuntu Oneiric):
status: Incomplete → Invalid
Steve Langasek (vorlon) on 2011-10-06
summary: - boot failures caused by udev race
+ boot failures because 'udevadm exit' does not kill udevd worker threads
Steve Langasek (vorlon) on 2011-10-06
Changed in udev (Ubuntu Oneiric):
milestone: ubuntu-11.10 → oneiric-updates
status: Confirmed → Triaged
Changed in linux (Ubuntu Precise):
status: New → Invalid
Changed in udev (Ubuntu Precise):
status: New → Triaged
importance: Undecided → High
tags: added: rls-mgr-p-tracking
removed: rls-mgr-o-tracking
Steve Langasek (vorlon) on 2011-10-06
Changed in ubuntu-release-notes:
status: New → Triaged
tags: added: rls-mgr-o-tracking
tags: added: iso-testing
49 comments hidden view all 129 comments
Steve Langasek (vorlon) wrote :

I'm able to reproduce the same behavior that Serge sees when running in a VM, and now have a clear understanding of what's happening here at the top level.

The init-bottom script, as noted earlier, is set -e; and udevadm control exits non-zero if it hits its timeout. These two factors combined mean that if udev gets stuck this way, the mount -o move *is never called*, because the script terminates immediately after the udevadm call.

This also entirely explains the behavior seen in bug #833783 (in fact, I just reproduced this problem here) - the difference between systems not finding the local disks, and systems panicing due to missing /dev/console, is down to what device nodes happen to be prepopulated on the root filesystem. So marking bug #833783 as a duplicate of this one.

I am currently investigating both the most appropriate way to ensure that udevadm control --exit doesn't have to hit its timeout (raising the timeout slightly, so that udevd times out first and exits, *should* be sufficient but doesn't seem to be), and why udev is getting stuck in seemingly ordinary boot scenarios (pristine vm).

The original problem reported, which involves a missing firmware, seems to have its own cause in the kernel since the firmware should not fail to load. I understand the kernel team is already looking into this.

summary: - boot failures because 'udevadm exit' does not kill udevd worker threads
+ boot failures because 'udevadm exit' times out while udevd waits for an
+ already-dead thread

The root error is that in worker_new(), at lines 353 ff., the worker threads process signals and udev event handling requests in the same loop. If it happens that both the signal and the event are received close enough together to be returned in the same epoll_wait(), and the udev_monitor handle is listed *first* in the epoll_event list, the worker will receive the event message (which increases the refcount on the udevd master side), then process the signal, and immediately exit without handling the event.

This should be handled by processing fd_signal first, and only if SIGTERM was not received, process fd_monitor. This ensures that we always exit ASAP after receiving the signal, instead of going off and processing another event that could take an indeterminate amount of time.

Steve Langasek (vorlon) wrote :

Tried that; doesn't work because the master process already thinks the event is dispatched as soon as it sends it, whether or not the worker process acts on it with recvmsg(). So I guess we need to try to process the event after all.

Steve Langasek (vorlon) wrote :

With the help of some debugging code, I've confirmed that the patch committed to the branch results in a correct (and speedy) exit in the case where a SIGTERM is received before a worker thread has accepted an event for processing. Will push this to the oneiric queue shortly.

Steve Langasek (vorlon) on 2011-10-08
Changed in udev (Ubuntu Oneiric):
assignee: James Hunt (jamesodhunt) → Steve Langasek (vorlon)
status: Triaged → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 173-0ubuntu4

---------------
udev (173-0ubuntu4) oneiric; urgency=low

  * Set a timeout of 61 seconds when calling udevadm control --exit, to
    ensure that udevd always reaches its *own* timeout first. If udevadm
    times out before udevd, it exits non-zero, breaking the move of the /dev
    mount; and if we ignore the udevadm failure then udevd might still be
    running, which is the original issue that the patch for bug #624469
    was trying to address. Raising the timeout doesn't make the boot any
    slower, it just makes sure we let udevd count the 60 seconds instead of
    udevadm. LP: #818177.
  * debian/patches/race-condition-on-exit: Process events before signals
    in the worker thread, so we don't sit around waiting 60 seconds for an
    already-dead process.
 -- Steve Langasek <email address hidden> Sat, 08 Oct 2011 00:48:54 -0700

Changed in udev (Ubuntu Oneiric):
status: Fix Committed → Fix Released
Martin Pitt (pitti) wrote :

Great work, Steve! So I suppose this doesn't need a release note after all.

Changed in udev (Ubuntu Precise):
status: Triaged → Fix Released
Changed in ubuntu-release-notes:
status: Triaged → Invalid

Many thanks. My LVM system now starts successfully, albeit still showing the 60s hang occasionally. Even the supposed-to-be-affected logical volume is mounted correctly. Just the "/sbin/lvm vgchange -a y" persists in the process list, but this apparently does not harm.

One additional note I want to make: I'd set the udevadm-control timeout at least to 70 seconds. The epoll_wait timeout is reset every time something happens. So if the "settle time" is bigger than 1 second, the total timeout would be more than 61 seconds of udevadm-control. Setting the timeout to 70 seconds, gives us at least 10 seconds "settle time".

My conclusion is that "udevadm control --exit" (introduced in april) is still a bit immature, and further races might wait to happen.

Serge Hallyn (serge-hallyn) wrote :

@Steve,

the new udev is working in my vms! thanks.

On Sat, Oct 08, 2011 at 11:56:50AM -0000, Eduard Hasenleithner wrote:
> Many thanks. My LVM system now starts successfully, albeit still showing
> the 60s hang occasionally. Even the supposed-to-be-affected logical
> volume is mounted correctly. Just the "/sbin/lvm vgchange -a y" persists
> in the process list, but this apparently does not harm.

Wow, I didn't expect that to be the case! I still want to keep the other
bug report open and pursue the "hung vgchange" problem, anyway; leaving
stray processes running in the initramfs after boot is untidy.

> One additional note I want to make: I'd set the udevadm-control timeout
> at least to 70 seconds. The epoll_wait timeout is reset every time
> something happens. So if the "settle time" is bigger than 1 second, the
> total timeout would be more than 61 seconds of udevadm-control. Setting
> the timeout to 70 seconds, gives us at least 10 seconds "settle time".

Ah, that's a good point. Of course, 70 seconds isn't guaranteed to be
enough time either; in the pathological case, you need 60*num_workers
seconds, because if you get one worker finishing every 59.9 seconds, the
udevd timeout will continue to be extended and we'll still have the udevadm
control problem.

But 61 seconds should be enough for the 99.99% case, and if we start to see
reports of problems in the other .01% we can adapt by raising the timeout.
(I would prefer not to just '|| true' the udevadm call, though, since I
think having to occasionally reboot the machine again is better than being
left with extra udev processes running around doing who-knows-what on the
system.)

> My conclusion is that "udevadm control --exit" (introduced in april) is
> still a bit immature, and further races might wait to happen.

It's true that there might be further races, but I don't think it's due to
immaturity of udevadm control --exit... rather, the races were always there
and we were blissfully ignoring them with pkill (and that's where the bug
report came from that led us to switch to udevadm in the first place).

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

> (I would prefer not to just '|| true' the udevadm call, though, since I
> think having to occasionally reboot the machine again is better than being
> left with extra udev processes running around doing who-knows-what on the
> system.)

Very valid point. But wouldn't it then be better in the case of udevadm-control failing to do something which gets 100% attention of the user, and with an explicit diagnosis message describing the problem? Simply not moving '/dev' to rootfs, and letting the failed boot to reach its "fate", seems to be a bit too obscure for me.

Being an "expert user", I would also like to have the busybox-shell for the failure case of course ;)

On Sat, Oct 08, 2011 at 08:08:34PM -0000, Eduard Hasenleithner wrote:
> > (I would prefer not to just '|| true' the udevadm call, though, since I
> > think having to occasionally reboot the machine again is better than being
> > left with extra udev processes running around doing who-knows-what on the
> > system.)

> Very valid point. But wouldn't it then be better in the case of udevadm-
> control failing to do something which gets 100% attention of the user,
> and with an explicit diagnosis message describing the problem? Simply
> not moving '/dev' to rootfs, and letting the failed boot to reach its
> "fate", seems to be a bit too obscure for me.

I agree. Could you file a new bug against the initramfs-tools package
requesting this?

> Being an "expert user", I would also like to have the busybox-shell for
> the failure case of course ;)

The obvious way to implement this would be to use the panic handler for
script failures.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

I'm encountering this issue on a Dell PowerEdge 2950 (pretty common server model, we have a ton of them here). Clean install of 11.10 will not boot properly, same with installing 11.04 and then upgrading. It appears to be running the latest udev (173-0ubuntu4).

[ 62.944049] bnx2: Can't load firmware file "bnx2/bnx2-mips-06-6.2.1.fw"
[ 62.944081] bnx2 0000:05:00.0: PCI INT A disabled
[ 62.944095] bnx2: probe of 0000:05:00.0 failed with error -2

Apologies if I'm missing something, and let me know if I can provide any further info.

Tom Ellis (tellis) wrote :

@John Miller
I've reopened https://bugs.launchpad.net/ubuntu/+source/update-manager/+bug/842560 which was marked as a duplicate of this udev bug (as I originally mentioned the firmware issue in the description, but it was unrelated), now that this udev problem is sorted the firmware issue can be looked at in more detail.

Let's use LP#842560 to track the firmware issue and leave this bug closed out.

dyna (ubuntu-dyna) wrote :

Same problem here, ML350 G5

Funny thing is that after the update from 11.04 to 11.10 i had no problems booting, just no network as described in bug 842560. Booted without network about 3-4 times with succes.

To try and fixed that i upgraded the firmwares using the hp firmware CD 930. (which seems to work with a 11.10 desktop boot CD).
After this i ran into the killing modprobe / read-only fs problem. And i didn't get the system to boot correctly once since then.

Since everything was backed up i decided to start with a clean 11.10 install, but ran in to the same problem.

Tried to replace the 'udevadm control --exit' in /scripts/init-bottom/udev with 'pkill udevd', but it has no effect.

Guess trying to downgrade firmware's is next, but since hp only supplies seperate windows packages that will take a while...

dyna (ubuntu-dyna) wrote :

Sorry, updates the wrong initramfs. pkill does make a difference, but still doesn't give a booting system.
It does get further and i did see some nbx2 firmware errors again, which i didn't see before using pkill.

The fix in #17 however works: http://paste.ubuntu.com/679222/

Adding the following lines to init-bottom/udev:
. /scripts/functions
wait_for_udev

And i have a booting system with network.

David Bukowski (davebb) wrote :

Sorry if this is the wrong place to ask this, but this bug currently exists on the Ubuntu-11.10-Server (Oneiric) download I just did today 2011-11-17. And cannot seem to even get the server to start.
Server is an HP Proliant DL380G5
The image is the iso available for download
I have chosen NOT to use an LVM and just do a straight disk partition instead of using the LVM. So I can say it exists even when not using the LVM setup.
I tried reading through the notes here to find the fix and says its fixed, but however still is a problem, or do I need to somehow boot off of CD and then make mods to the system.

Doug Weimer (dweimer) wrote :

To echo davebb, I just downloaded the ubuntu-11.10-server-amd64.iso (md5: f8a0112b7cb5dcd6d564dbe59f18c35f) on 11/23/2011 and had this problem on two different systems. One was a Sun x4100 (AMD 8354, nvidia chipset, no LVM) and one was a Supermicro Intel box, no LVM.

In both cases the systems printed out a large number of the 'udev[90]: timeout: killing ' messages and then froze. The console didn't come up for over 45 minutes. I have left it running to see if the host ever comes up.

I see the fix is listed for oneiric updates. These errors happened for me on the first boot after an install and the machines don't finish booting. How can I get access to the fix during the install process?

Endre Stølsvik (stolsvik) wrote :

Just wanted to chime in that I have the exact same problem with udevd killing lvm vgscan.
udevd[139]: timeout: killing 'watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y''

What does this imply? Can I just ignore it till you folks fix it ASAP?

BTW: It has been a real fun upgrade experience to oneiric: First bitten by bug#858122 on my desktop. Then now, my "KVM server" (the host of several guests) got bitten badly by the same bug (only on the server it hung hard), and then apparently this one.

KnisterPeter (knister-peter) wrote :

Seems related to #625395

KnisterPeter (knister-peter) wrote :

Hmm, just ignore my comment...

KnisterPeter (knister-peter) wrote :

I run into the timeout since 2 days. Before that my oneiric experience was quite good. After reading this comments it seems I should be happy that there is a timeout after 61s, but whats causing udev to timeout?
Its very annoying to sit and wait for the system - especially when it does boot a lot faster the day before...

Any idea how to find the root cause of the timeout?

tags: added: kernel-da-key
Andy Whitcroft (apw) wrote :

There are definatly corner cases of this issue still present in Precise. Looking at the udev init-bottom script it seems that we are not coping correctly with a timeout when requesting that udev exit for us. Thus we skip moving /dev over onto the new /root, which is almost all cases is utterly fatal leaving us on a R/O root.

summary: - boot failures because 'udevadm exit' times out while udevd waits for an
- already-dead thread
+ boot failures as /dev is not transferred to /root (because 'udevadm
+ exit' times out waiting for a deadlocked worker)
Changed in udev (Ubuntu Precise):
assignee: nobody → Andy Whitcroft (apw)
status: Fix Released → In Progress
Steve Langasek (vorlon) on 2011-12-16
Changed in udev (Ubuntu Oneiric):
status: Fix Released → In Progress
assignee: Steve Langasek (vorlon) → Andy Whitcroft (apw)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 175-0ubuntu3

---------------
udev (175-0ubuntu3) precise; urgency=low

  [ Andy Whitcroft ]
  * debian/udev.initramfs-bottom: ignore timeout errors from udevadm we
    expect (and indeed requested) in certain failure modes. If we do not we
    will bail out early and not transfer /dev into /root which is always
    fatal leading to unbootable machines. (LP: #818177)
  * avoid-exit-deadlock-for-timely-events: avoid deadlock when exiting
    by continuing to handle events with timeliness requirements.
    The timeliness requirement will be violated if we ignore them which
    is highly undesirable. Also these events are typically dependant
    events and may well block the events we are waiting on leading to
    boot delays and uninitialised devices. (LP: #842560)
  * debian/udev.initramfs-bottom: increase the client-side timeout to
    better cope with potential timeout extension issues in udev. We very
    much would prefer udev to time itself out and guarantee to have
    completed than take action ourselves. Very worst case the timeout may
    be doubled from the default of 60s so increase ours accordingly. Note,
    we should only ever trip this timeout when we are already in severe
    trouble. (LP: #818177)
 -- Steve Langasek <email address hidden> Fri, 16 Dec 2011 11:15:39 -0800

Changed in udev (Ubuntu Precise):
status: In Progress → Fix Released

Hello Tom, or anyone else affected,

Accepted udev into oneiric-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in udev (Ubuntu Oneiric):
status: In Progress → Fix Committed
tags: added: verification-needed
Peter Matulis (petermatulis) wrote :

I verified that I can now install Oneiric on a previously-plagued HP Proliant DL380 by installing with the kernel boot option

apt-setup/proposed=true

tags: removed: kernel-da-key
Steve Langasek (vorlon) on 2012-01-11
tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 173-0ubuntu4.1

---------------
udev (173-0ubuntu4.1) oneiric-proposed; urgency=low

  [ Andy Whitcroft ]
  * debian/udev.initramfs-bottom: ignore timeout errors from udevadm we
    expect (and indeed requested) in certain failure modes. If we do not we
    will bail out early and not transfer /dev into /root which is always
    fatal leading to unbootable machines. (LP: #818177)
  * avoid-exit-deadlock-for-timely-events: avoid deadlock when exiting
    by continuing to handle events with timeliness requirements.
    The timeliness requirement will be violated if we ignore them which
    is highly undesirable. Also these events are typically dependant
    events and may well block the events we are waiting on leading to
    boot delays and uninitialised devices. (LP: #842560)
  * debian/udev.initramfs-bottom: increase the client-side timeout to
    better cope with potential timeout extension issues in udev. We very
    much would prefer udev to time itself out and guarantee to have
    completed than take action ourselves. Very worst case the timeout may
    be doubled from the default of 60s so increase ours accordingly. Note,
    we should only ever trip this timeout when we are already in severe
    trouble. (LP: #818177)
 -- Steve Langasek <email address hidden> Fri, 16 Dec 2011 14:10:02 -0800

Changed in udev (Ubuntu Oneiric):
status: Fix Committed → Fix Released
Dylan Damian (dylan-damian) wrote :

Still happening here, on precise alpha 1. I see a 60+ seconds delay and corresponding time gap in the dmesg boot timeline on every boot. I've repeated the install several times and it is consistent.
udevd and udevadm version 175

Dylan Damian (dylan-damian) wrote :

Edit re: #119: Precise Alpha 2 (not 1), 2/1/2012 build upgraded to udev _175-0ubuntu4_amd64 still experiencing 60+ second boot delay.

Dylan Damian (dylan-damian) wrote :

Sorry, but I am still experiencing a 60+ second boot delay consistently with udev amd64 175-0ubuntu4. I'll assist with testing in any way I can.

Shannon Barber (sgbarber) wrote :

I am not able to install Ubuntu 11.10 x64 due to this bug.
The install option fails and spams the udevd timeout killing modprobe <blah>.
I tried disabling modeset and removing the quiet splash.
It appears to be an infinite loop.

Kevin R (kevin-riehle) wrote :

Shannon, I was at the same point you are before I was able to finally get a functioning install.

Here is what didn't work first:
- 11.10 Ubuntu - Wubi install via Windows 7 (I believe it default selected 32 bit)
-- I believe this froze at the blank purple screen
- 11.10 Ubuntu - 64-bit via bootable CD
-- This froze with an error message before I even got to the install dialog with something like ('sbin modprobe udevd terminated by signal 9' (see above for full error message))

I was then able to finally get into the install menu by doing the following:
- Boot into install disk
- Hit escape at the first purple screen that has the little man figure at the bottom next to the keyboard
- Hit F6 for other options (I believe) and select 'nomodeset', which allowed me to install.

I then installed for the first time without internet access (another issue) and after rebooting after the successful install I received the same blank purple screen as before.

The next time I installed I was able to again install using the 'nomodeset', but this time with networking. After installing, I rebooted and after 20-30s it finally brought up the splash screen.

I'm on a Dell Precision T7500
Dual Quad Core Intel® Xeon® Processor E5606, 2.13GHz, 8M L3, 4.8GT/s
12GB, DDR3 RDIMM Memory, 1333MHz, ECC (6 DIMMS)
Dual 1.0GB NVIDIA&#174; Quadro&#174; 2000, Quad MON, 4 DP & 2 DVI

I shrunk my original Windows 7 partition, formatted into NTFS, and installed onto this area (as the more you use Windows the less you can shrink the original partition).

I'm not sure yet, but this problem appears to be solved by downloading updates while installing. For whatever reason I was unable to make these updates while booting into my first install and changing the grub boot menu by removing 'quiet splash' and adding 'nomodeset'.

Good luck, and I hope this helps somehow.

1 of 4 our Intel S1690WB servers doesn't boot even with the latest oneiric udev package (udev (173-0ubuntu4.1) oneiric-proposed; urgency=low). It boots with a workaround from comment 17. 3 that boot suffer from a 60 second timeout. All 4 have LVM over software RAID1.

The timeout itself indicates a mismanaged race. Can #871528 be reopened for this specific configuration or should I file a new one?

Carli (s3734770) wrote :

The timeout thing is not fixed after a dist-upgrade. How can I workaround it? Or does "fix released" mean, the fix is in git, but not in the package system?

Igor Galić (i.galic) wrote :

Even with latest udev on Oneiric, vgchange still hangs in there:

i.galic@bacon ~ % dpkg -l | awk '/udev/{print $3}'
173-0ubuntu4.1
173-0ubuntu4.1
i.galic@bacon ~ % uname -a
Linux bacon 3.0.0-16-server #28-Ubuntu SMP Fri Jan 27 18:03:45 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
i.galic@bacon ~ % ps -cafe | grep -i vg[c]
root 337 1 TS 37 02:25 ? 00:00:00 /sbin/lvm vgchange -a y
i.galic@bacon ~ %

on the Host, while the KVMs fail to boot, stuck at init-bottom for over a day now (so, it doesn't matter how long I wait, I assume, it'll remain stuck.):

Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 1.410792] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.

I'd say this bug still exists. I'd love to help you debug it - you can find me on IRC as 'jMCg'

Igor Galić (i.galic) wrote :

Now with the latest packages installed from oneiric-proposed, I have:

[ 1.574823] md: multipath personality registered for level -4
done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 1.652371] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[ 2.833913] udevd[352]: starting version 173

I'd call it progress.

Adam Gandelman (gandelman-a) wrote :

Users running into deadlock in init-bottom around watershed's calling of vgchange, please see Bug #802626

Displaying first 40 and last 40 comments. View all 129 comments or add a comment.