boot failures as /dev is not transferred to /root (because 'udevadm exit' times out waiting for a deadlocked worker)

Bug #818177 reported by Tom Ellis
This bug affects 41 people
Affects                    Status        Importance  Assigned to      Milestone
Release Notes for Ubuntu   Invalid       Undecided   Unassigned
linux (Ubuntu)             Invalid       High        Unassigned
  Oneiric                  Invalid       High        Unassigned
  Precise                  Invalid       Undecided   Unassigned
udev (Ubuntu)              Fix Released  High        Andy Whitcroft
  Oneiric                  Fix Released  High        Andy Whitcroft
  Precise                  Fix Released  High        Andy Whitcroft

Bug Description

After running a do-release-upgrade -d on a natty server, I'm unable to boot the machine properly.

It looks like it doesn't see the local disks anymore.

I think this looks quite a lot like the Debian bug linked from here:
https://bugs.launchpad.net/ubuntu/oneiric/+source/linux/+bug/746751

ProblemType: Bug
DistroRelease: Ubuntu 11.10
Package: linux-image-3.0.0-7-server 3.0.0-7.8
ProcVersionSignature: Ubuntu 3.0.0-7.8-server 3.0.0
Uname: Linux 3.0.0-7-server x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 2011-07-29 13:34 seq
 crw-rw---- 1 root audio 116, 33 2011-07-29 13:34 timer
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Date: Fri Jul 29 13:40:10 2011
HibernationDevice: RESUME=UUID=844d17fa-ff69-4250-844b-33b930236c56
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.

 eth1 no wireless extensions.
MachineType: HP ProLiant DL380 G5
PciMultimedia:

ProcEnviron:
 LANGUAGE=en_US:
 PATH=(custom, user)
 LANG=en_US
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.0.0-7-server root=/dev/mapper/User Name-root ro INIT_VERBOSE=yes
RelatedPackageVersions:
 linux-restricted-modules-3.0.0-7-server N/A
 linux-backports-modules-3.0.0-7-server N/A
 linux-firmware 1.56
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux
UpgradeStatus: Upgraded to oneiric on 2011-07-29 (0 days ago)
dmi.bios.date: 06/28/2007
dmi.bios.vendor: HP
dmi.bios.version: P56
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP56:bd06/28/2007:svnHP:pnProLiantDL380G5:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL380 G5
dmi.sys.vendor: HP

Revision history for this message
Tom Ellis (tellis) wrote :
Revision history for this message
Tom Ellis (tellis) wrote :

Hmm, this now looks like some sort of race condition based on a couple of issues. I've managed to boot successfully once in 10 attempts without any problems, but more often than not I either get to a login prompt with no disks visible or get stuck in a udev loop with the errors:

udev[90]: timeout: killing '/sbin/modprobe -bv pci:v000014E4d0000164Csv0000103Csd00007038bc02sc00i00' [178]

Lots and lots of the above are shown, then finally:

bnx2: Can't load firmware file "bnx2/bnx2-mips-06-6.2.1.fw"
bnx2: 0000:05:00.0 PCI INT A disabled
bnx2: probe of 0000:05:00.0 failed with error -2

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Andy Whitcroft (apw) wrote :

The dmesg on this bug is from a rare good boot. It would also be good to get a dmesg from a bad boot.

Revision history for this message
Marc Cluet (lynxman) wrote :
Revision history for this message
Marc Cluet (lynxman) wrote :
Changed in linux (Ubuntu):
importance: Undecided → High
Revision history for this message
Marc Cluet (lynxman) wrote :
Revision history for this message
Marc Cluet (lynxman) wrote :
Andy Whitcroft (apw)
summary: - HP DL380G5 does not see disks after upgrade to Oneiric
+ HP DL380G5 root disk mounted read-only on boot and boot fails
Revision history for this message
Marc Cluet (lynxman) wrote : Re: HP DL380G5 root disk mounted read-only on boot and boot fails
Revision history for this message
Marc Cluet (lynxman) wrote :

System is able to restart

Revision history for this message
Marc Cluet (lynxman) wrote :

If we remount manually while in busybox

# mount -o rw,remount /

and then resume boot, the system just shows

rc-sysinit start/running, process 1011

Revision history for this message
Andy Whitcroft (apw) wrote :

OK, so the disk can be seen and mounts fine. I suspect this is a userspace/init issue.

Revision history for this message
Dustin Kirkland  (kirkland) wrote : Re: [Bug 818177] Re: HP DL380G5 root disk mounted read-only on boot and boot fails

Note that there are no /dev/disk/* directories or symlinks. Not sure
if that matters...

Revision history for this message
Marc Cluet (lynxman) wrote : Re: HP DL380G5 root disk mounted read-only on boot and boot fails
Revision history for this message
Marc Cluet (lynxman) wrote :
Revision history for this message
Marc Cluet (lynxman) wrote :
Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Seems I can reproduce something similar and break an otherwise functioning oneiric system just by adding LVM to my setup. If I create a VG and LV using /dev/sda7 as the backing PV (the last partition of a single-disk system), init-bottom hangs until udev kills '/sbin/lvm vgscan; /sbin/lvm vgchange -a -y' due to timeout. The root filesystem ends up read-only 3 out of 5 boots. Will try to gather more debugging info and report here.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Not sure if it's a hack or a fix, but allowing udev to settle before stopping it in init-bottom seems to do the trick.

I've modified /usr/share/initramfs-tools/scripts/init-bottom/udev according to http://paste.ubuntu.com/679222/ and regenerated my initrd.img.

I believe this is something more general wrt LVM and init, as my symptoms are different but the workaround seems to fix both cases.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Attached is serial console log of a failed boot from a virtual machine. I've found it easy to reproduce the race in a kvm virtual machine using the daily oneiric iso. Results match what I encounter on real hardware:

Install oneiric, using standard guided partition table.
Attach a second storage dev
pvcreate /dev/sdb
vgcreate test-vg /dev/sdb
lvcreate -L 100M -n test-lv test-vg

Reboot. Seems to hit ~1/5 boots.

Dave Walker (davewalker)
tags: added: server-o-ro
Changed in linux (Ubuntu):
assignee: nobody → Ubuntu Foundations Team (ubuntu-foundations-team)
Revision history for this message
Steve Langasek (vorlon) wrote :

I'm not sure if there's a kernel bug here (there might be?), but I'm pretty sure there is a udev bug, as indicated by the linked branch. However, just adding an extra 'udevadm settle' to the initramfs before killing udev is not sufficient to fix the race conditions here; udevadm settle + udevadm control --exit is not atomic, which means any new events that arrive in between the two calls might still be lost by whatever's going on in udevadm control.

udevadm control --exit needs to be made robust, to ensure that any events already being processed by udev at the time it's called get fully processed, and not lost in the middle as seems to be happening now.
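For illustration only, a minimal sketch of the window being described, assuming an init-bottom/udev modified to add the settle (the stock script, quoted later in this thread, has no settle before the exit; ordering here is an assumption):

  udevadm settle                         # waits for the currently queued events to drain...
  # <-- a new kernel uevent arriving right here is too late to be waited for
  udevadm control --exit                 # ...and the exit path may then drop or orphan it
  mount -n -o move /dev ${rootmnt}/dev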

Changed in udev (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Ubuntu Foundations Team (ubuntu-foundations-team)
Changed in linux (Ubuntu):
assignee: Ubuntu Foundations Team (ubuntu-foundations-team) → nobody
Revision history for this message
Marc Cluet (lynxman) wrote :

We did a very thorough diagnosis with apw, which determined that the bug sits at user level rather than kernel level. I'm subscribing him to the bug as well so he can provide any further feedback about it.

Colin Watson (cjwatson)
Changed in udev (Ubuntu):
assignee: Ubuntu Foundations Team (ubuntu-foundations-team) → James Hunt (jamesodhunt)
Revision history for this message
Brad Figg (brad-figg) wrote : Test with newer development kernel (3.0.0-11.17)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel currently in the release pocket than the one you tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

Thank you for your help.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.0.0-7.8
Brad Figg (brad-figg)
tags: added: kernel-request-3.0.0-11.17
removed: kernel-request-3.0.0-7.8
Revision history for this message
James Hunt (jamesodhunt) wrote : Re: HP DL380G5 root disk mounted read-only on boot and boot fails

I've now got a setup similar to Adam's, but even with the 3.0.0-11.17 kernel I'm still seeing the problem. Took 14 good boots to get a bad one though :) I'm testing with oneiric server.

Revision history for this message
Leonardo Borda (lborda) wrote :

Hi Guys,

We're still experiencing this issue on DL380G5 servers. Tested on the latest daily oneiric kernel, 16/09/2011.
If you're lucky you can get a good boot 2 out of 10 times.

Leonardo

Revision history for this message
James Page (james-page) wrote :

I think this is the same issue as bug 801494.

If you use the multi-LVM recipe with a preseed on relatively slow I/O hardware, it happens 100% of the time.

Revision history for this message
Stefan Bader (smb) wrote :

This looks like it is likely related to /usr/share/initramfs-tools/scripts/init-bottom/udev. How things should work is:

- we get the initramfs mounted as the initial /
- udev gets started and left running until the real root device node becomes available
- now udev gets stopped (it used to be clubbed to death, which caused even more problems) and an attempt is made to
  move /dev to <realroot>/dev
- <newroot> becomes /, udev is restarted, and things should proceed from there

From the serial console log of comment #18 it looks like things went OK until the time udev should stop. There is no error message there, but somehow it looks like fsck in the new / does not find the uuid link for it. Unclear whether the move of /dev did not work, or whether the previous step resolved the uuid to a dev without the symlink. As Adam mentioned before, allowing udev to settle helps, so the race is somewhere there. Whoever can reproduce this may want to add additional echoes to the init-bottom script. (A paraphrased sketch of the script sequence follows below.)
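For reference, a paraphrased sketch of the relevant part of scripts/init-bottom/udev (not verbatim; the actual lines are quoted later in this thread):

  # stop the initramfs udevd; a few events will be missed while init runs
  udevadm control --exit
  # move the devtmpfs /dev onto the real root before switching to it
  mount -n -o move /dev ${rootmnt}/dev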

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 818177] Re: HP DL380G5 root disk mounted read-only on boot and boot fails

Quoting Stefan Bader (<email address hidden>):
> This looks quite like it is likely related to /usr/share/initramfs-
> tools/scripts/init-bottom/udev. How things should work is:

This is also where it appears to hang on my systems.

> - we got the initramfs mounted as initial /
> - udev gets started and left running until the real root device node becomes available
> - now udev gets stopped (used to be clubbed to death which caused even more problems) and /dev is tried to be
> moved to <realroot>/dev.
> - <newroot> bcomes /, restart udev, and things should proceed from there
>
> From the serial console log of comment #18 it looks like things went ok
> until the time udev should stop. There is no error message there but
> somehow it looks like fsck in the new / does not find the uuid link for
> it. Unclear whether maybe the move of /dev did not work, or did the
> previous step resolve the uuid to dev without the symlink? As Adam
> mentioned before, allowing udev to settle helps. So the race is
> somewhere there. Whoever can reproduce this may want to add additional
> echo's to the init-bottom script.

(TBH I don't believe anyone can 'not reproduce it'. It sometimes takes
me 10 boots, but a standard ubuntu server kvm vm with a single primary
partition on virtio occasionally hangs there as well).

I've tried adding echos, but as near as I can tell, adding any echo
before the udevadm control --exit stops the hang. I rebooted 10 times
last night trying to do a 'ps -ef' before it, but that seemed to slow it
down just enough to stop the race.

I've tried (blindly) installing a recompiled udev which keeps handling
inotify events while waiting to exit, but that doesn't seem to be the
problem. It looks as though udevadm control --exit simply is hanging.

Perhaps some debug statements in udevd itself will be more helpful.
(will try)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: HP DL380G5 root disk mounted read-only on boot and boot fails

On my system, things go better when I (re-)replace the 'udevadm control --exit' in /scripts/init-bottom/udev with 'pkill udevd', as was the case in natty's udev (the change is sketched below, after the log).

Then the timeout-followed-by-readonly-fs on bad boots is replaced by a 60-second timeout followed by a seemingly clean boot. kern.log then shows:

Sep 21 12:20:51 o-nolvm kernel: [ 2.450409] init: Handling started event
Sep 21 12:20:51 o-nolvm kernel: [ 2.450572] init: Reconnecting to system bus
Sep 21 12:20:51 o-nolvm kernel: [ 2.451993] init: mountall post-stop process (336) exited normally
Sep 21 12:20:51 o-nolvm kernel: [ 2.452038] init: mountall state changed from post-stop to waiting
Sep 21 12:20:51 o-nolvm kernel: [ 2.452383] init: Handling stopped event
Sep 21 12:21:51 o-nolvm kernel: [ 61.925549] init: udevtrigger post-stop process (301) exited normally

the udevtrigger post-stop just does

post-stop exec udevadm settle

So I'm guessing that in both cases a hung handler or missing event is the problem.
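For reference, the substitution described at the top of this comment amounts to something like the following (a sketch only; the file path and the exact line being replaced are assumptions based on the init-bottom/udev snippet quoted later in this thread):

  # swap the oneiric-style exit call back to natty-style pkill, then rebuild the initramfs
  sed -i 's/^udevadm control --exit$/pkill udevd/' \
      /usr/share/initramfs-tools/scripts/init-bottom/udev
  update-initramfs -u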

tags: added: rls-mgr-o-tracking
Changed in udev (Ubuntu Oneiric):
milestone: none → ubuntu-11.10
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

(I no longer believe this is related to the LVM udev problem - I guess that one is a simple deadlock.)

Unfortunately, any time I try to instrument udevd to print the list of pending events and workers when it is told to exit, I manage to mask the bug so I can't reproduce it.

Revision history for this message
Stefan Bader (smb) wrote :

Just for reference: it was the pkill of udev that was causing trouble in tests I did on Xen guests. In that case pkill sometimes had to be redone a few times, and even then there were rarer cases where udev did not seem to be killable at all. But if that was not done correctly it would leave the initrd not empty, and that causes (slightly overreacting there) a panic.

Dave Walker (davewalker)
summary: - HP DL380G5 root disk mounted read-only on boot and boot fails
+ boot failures caused by udev race
Revision history for this message
Andrew Glen-Young (aglenyoung) wrote : Re: boot failures caused by udev race

I have a few machines throwing a kernel panic while netbooting oneiric with a similar error message.

I have attached the boot message log with the panic.

Revision history for this message
James Hunt (jamesodhunt) wrote :

It looks like udevd.c is rather aggressive when handling the exit scenario. I'm currently building a debug version of udevd + initramfs to try and see if and how messages are getting lost.

Revision history for this message
Steve Langasek (vorlon) wrote :

Andrew, I'm not sure you're experiencing the same issue; I would say in fact that you have some unrelated kernel bug, since there's no excuse for it taking 2 minutes to settle the kernel event queue. Getting a dump of 'udevadm monitor -e' from this initramfs (which would need to be started *before* this point to capture the actual events) should be instructive. Could you please open a new bug report for this and link it back here?

Revision history for this message
Dave Walker (davewalker) wrote :

@James, How did you get on with the debug version of udevd? When we tried to do the same, we were unable to reproduce the bug, as the debug statements seemed to slow down udev, hiding the race. Was this the same behaviour you encountered?

Thanks.

Revision history for this message
James Hunt (jamesodhunt) wrote :

@Dave: I've modded udevd to display some internal details, but cannot now make the images I have fail to boot reliably. Currently working with @jamespage, who has a machine that fails to boot most times.

I've tried to force more frequent failures by installing with lots of LVs, but that doesn't seem to help either on virtual h/w (kvm) or the limited physical h/w I have access to. I've even gone as far as forcing particular LVs to respond *extremely* slowly (multi-second response time), but that hasn't helped trigger the problem either.

Revision history for this message
James Hunt (jamesodhunt) wrote :

BTW - I've also looked at how Fedora 15 stops udev with dracut and, using Stefan's terminology, they "club it to death" like we used to. They also pepper the code with frequent calls to "settle" and add a few sub-second sleeps here and there, which feels horribly wrong IMHO.

I have an off-beat idea as to how we could establish if messages are getting lost which I may be forced to pursue unless I can overcome the Heisenberg nature of debugging udevd.

Revision history for this message
James Hunt (jamesodhunt) wrote :

@Adam: from comment #18, do you still have an image that fails to boot 1 in 5 times?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@JamesHunt,

you mention LVs in comment #34, but assuming this is the same bug causing my hangs and read-only rootfs on VMs, it does not require LVs. Unfortunately, like you, whenever I've instrumented udevd to print out the list of pending events, I can't reproduce it :)

Perhaps kgdb is the way to go here.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@JamesHunt,

have you pursued trying to reproduce with a version of udevd which continues to process events when udev_exit==1?

I'm unclear as to whether (1) we need to continue to process inotify events as well (so that udev workers don't get hung), and (2) whether that just means that kobject_uevent messages from the kernel will still get lost between the exit of the initramfs' udevd and the rootfs one.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

@James: I'm currently at a conference with limited wifi, but I checked yesterday and I can consistently reproduce on my thinkpad + kvm. Any chance you can publish those modded udev packages to a branch or PPA? I'm happy to test and see if I can get anything useful.

Revision history for this message
James Hunt (jamesodhunt) wrote :

@Adam: Not quite yet, but will work on that when I get a chance tomorrow if possible. I'm intrigued by your setup though as I have a thinkpad + kvm but only see the problem very infrequently (and I've tried setting up images as you specify). How exactly are you invoking kvm? To clarify, are you using the server image?

@Serge: what was your storage configuration when you saw the problem without LVs? I think I've seen the problem a couple of times simply by providing 2 extra raw disks to the system but at that point my udevd debug wasn't helpful. I have tried modifying udevd to continue processing events but it then hangs. I really need to be able to recreate the problem more reliably on a system I've got control over before fathoming this one out.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Hi James-

I'm using the server image, kvm+libvirt for the VM; here is the corresponding XML config for the VM: http://paste.ubuntu.com/702405/ I should note that the root disk is a qcow2 image; the two additional images are raw dd'd images, each 100MB. I've also been sure to provide 2 CPUs to the VM, though I'm not certain this affects reproducibility.

The corresponding KVM process that gets spawned looks like this: http://paste.ubuntu.com/702411

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 818177] Re: boot failures caused by udev race

Quoting James Hunt (<email address hidden>):
> @Serge: what was your storage configuration when you saw the problem
> without LVs? I think I've seen the problem a couple of times simply by
> providing 2 extra raw disks to the system but at that point my udevd
> debug wasn't helpful. I have tried modifying udevd to continue
> processing events but it then hangs. I really need to be able to
> recreate the problem more reliably on a system I've got control over
> before fathoming this one out.

I have simple vms, created with the vm-new from lp:ubuntu-qa-tools,
whose root disk looks like:

   Device     Boot      Start       End   Blocks  Id  System
   /dev/vda1  *          2048  15728639  7863296  83  Linux
   /dev/vda2          15730686  16775167   522241   5  Extended
   /dev/vda5          15730688  16775167   522240  82  Linux swap / Solaris

with vda1 being ext3.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting James Hunt (<email address hidden>):
> @Serge: what was your storage configuration when you saw the problem
> without LVs? I think I've seen the problem a couple of times simply by
> providing 2 extra raw disks to the system but at that point my udevd
> debug wasn't helpful. I have tried modifying udevd to continue
> processing events but it then hangs. I really need to be able to
> recreate the problem more reliably on a system I've got control over
> before fathoming this one out.

Here is the xml for that domain, for what it's worth. disk0.img is
raw format:

serge@sergelap:~/vms/kvm/c$ qemu-img info disk0.img
image: disk0.img
file format: raw
virtual size: 8.0G (8589934592 bytes)
disk size: 3.3G

serge@sergelap:~/vms/kvm/c$ virsh dumpxml c
<domain type='kvm' id='2'>
  <name>c</name>
  <uuid>15405141-d078-c266-ae6f-8ca6b72a08e8</uuid>
  <memory>524288</memory>
  <currentMemory>524288</currentMemory>
  <vcpu>1</vcpu>
  <os>
    <type arch='x86_64' machine='pc-0.14'>hvm</type>
    <boot dev='hd'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/home/serge/vms/kvm/c/disk0.img'/>
      <target dev='vda' bus='virtio'/>
      <alias name='virtio-disk0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <disk type='block' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <target dev='hdc' bus='ide'/>
      <readonly/>
      <alias name='ide0-1-0'/>
      <address type='drive' controller='0' bus='1' unit='0'/>
    </disk>
    <controller type='ide' index='0'>
      <alias name='ide0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <interface type='network'>
      <mac address='00:16:3e:26:26:74'/>
      <source network='default'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/4'/>
      <target port='0'/>
      <alias name='serial0'/>
    </serial>
    <console type='pty' tty='/dev/pts/4'>
      <source path='/dev/pts/4'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
    </console>
    <input type='mouse' bus='ps2'/>
    <graphics type='vnc' port='5900' autoport='yes'/>
    <video>
      <model type='vmvga' vram='9216' heads='1'/>
      <alias name='video0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </video>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </memballoon>
  </devices>
  <seclabel type='dynamic' model='apparmor'>
    <label>libvirt-15405141-d078-c266-ae6f-8ca6b72a08e8</label>
    <imagelabel>libvirt-15405141-d078-c266-ae6f-8ca6b72a08e8</imagelabel>
  </seclabel>
</...


Revision history for this message
Adam Gandelman (gandelman-a) wrote : Re: boot failures caused by udev race

After modifying the initramfs similar to https://bugs.launchpad.net/ubuntu/oneiric/+source/udev/+bug/833783/comments/17, I've managed to hit the bug. Attached is output from a failed boot and a successful boot.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :
Revision history for this message
Steve Langasek (vorlon) wrote :

Analysis of the two logs shows the following events missing from udev in the failed case.

UDEV [4.229209] add /devices/pci0000:00/0000:00:01.1/host2/target2:0:1/2:0:1:0/block/sdc (block)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/host2/target2:0:1/2:0:1:0/block/sdc
SUBSYSTEM=block
DEVNAME=/dev/sdc
DEVTYPE=disk
SEQNUM=770
ID_ATA=1
ID_TYPE=disk
ID_BUS=ata
ID_MODEL=QEMU_HARDDISK
ID_MODEL_ENC=QEMU\x20HARDDISK\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20\x20
ID_REVISION=0.14.0
ID_SERIAL=QEMU_HARDDISK_QM00004
ID_SERIAL_SHORT=QM00004
ID_ATA_WRITE_CACHE=1
ID_ATA_WRITE_CACHE_ENABLED=0
ID_ATA_FEATURE_SET_SMART=1
ID_ATA_FEATURE_SET_SMART_ENABLED=1
ID_SCSI_COMPAT=SATA_QEMU_HARDDISK_QM00004
ID_PATH=pci-0000:00:01.1-scsi-1:0:1:0
ID_PATH_TAG=pci-0000_00_01_1-scsi-1_0_1_0
ID_FS_UUID=4qMaMd-jD2v-vzpj-ScTe-Rsz7-CDw0-3DCuLn
ID_FS_UUID_ENC=4qMaMd-jD2v-vzpj-ScTe-Rsz7-CDw0-3DCuLn
ID_FS_VERSION=LVM2\x20001
ID_FS_TYPE=LVM2_member
ID_FS_USAGE=raid
MAJOR=8
MINOR=32
DEVLINKS=/dev/disk/by-id/ata-QEMU_HARDDISK_QM00004 /dev/disk/by-id/scsi-SATA_QEMU_HARDDISK_QM00004 /dev/disk/by-path/pci-0000:00:01.1-scsi-1:0:1:0

UDEV [4.229522] add /devices/pci0000:00/0000:00:01.1 (pci)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1
SUBSYSTEM=pci
DRIVER=ata_piix
PCI_CLASS=10180
PCI_ID=8086:7010
PCI_SUBSYS_ID=1AF4:1100
PCI_SLOT_NAME=0000:00:01.1
MODALIAS=pci:v00008086d00007010sv00001AF4sd00001100bc01sc01i80
SEQNUM=851

UDEV [4.230831] add /devices/pci0000:00/0000:00:01.1/ata1/ata_port/ata1 (ata_port)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata1/ata_port/ata1
SUBSYSTEM=ata_port
SEQNUM=852

UDEV [4.230937] add /devices/pci0000:00/0000:00:01.1/ata1/link1/ata_link/link1 (ata_link)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata1/link1/ata_link/link1
SUBSYSTEM=ata_link
SEQNUM=853

UDEV [4.229854] add /devices/pci0000:00/0000:00:01.1/ata1/link1/dev1.0/ata_device/dev1.0 (ata_device)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata1/link1/dev1.0/ata_device/dev1.0
SUBSYSTEM=ata_device
SEQNUM=854

UDEV [4.229959] add /devices/pci0000:00/0000:00:01.1/ata1/link1/dev1.1/ata_device/dev1.1 (ata_device)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata1/link1/dev1.1/ata_device/dev1.1
SUBSYSTEM=ata_device
SEQNUM=855

UDEV [4.230742] add /devices/pci0000:00/0000:00:01.1/ata2/ata_port/ata2 (ata_port)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata2/ata_port/ata2
SUBSYSTEM=ata_port
SEQNUM=856

UDEV [4.230209] add /devices/pci0000:00/0000:00:01.1/ata2/link2/ata_link/link2 (ata_link)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata2/link2/ata_link/link2
SUBSYSTEM=ata_link
SEQNUM=857

UDEV [4.230318] add /devices/pci0000:00/0000:00:01.1/ata2/link2/dev2.0/ata_device/dev2.0 (ata_device)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/pci0000:00/0000:00:01.1/ata2/link2/dev2.0/ata_device/dev2.0
SUBSYSTEM=ata_device
SEQNUM=858

UDEV [4.230428] add /devices/pci0000:00/0000:00:01.1/ata2/link2/dev2.1/ata_device/dev2.1 (ata_device)
UDEV_LOG=3
ACTION=add
DEVPATH...


Revision history for this message
Steve Langasek (vorlon) wrote :

The corresponding kernel events *are* present, so this seems to be definitively a udev bug and not a kernel bug.

Revision history for this message
Steve Langasek (vorlon) wrote :

The last udev event shown in the failure case is this one:

UDEV [3.735566] change /devices/virtual/block/dm-0 (block)
UDEV_LOG=3
ACTION=change
DEVPATH=/devices/virtual/block/dm-0
SUBSYSTEM=block
DM_COOKIE=4228816
DEVNAME=/dev/dm-0
DEVTYPE=disk
SEQNUM=1092
DM_UDEV_PRIMARY_SOURCE_FLAG=1
DM_NAME=test--vg-test--lv
DM_UUID=LVM-4iWtHXttRCwYKe6cxdfiFUSSU9pFQ9Zelj82pgIWg6GreSTESCmwv43yf6Nrf5s2
DM_SUSPENDED=0
DM_UDEV_RULES=1
DM_VG_NAME=test-vg
DM_LV_NAME=test-lv
DEVLINKS=/dev/mapper/test--vg-test--lv /dev/disk/by-id/dm-name-test--vg-test--lv /dev/disk/by-id/dm-uuid-LVM-4iWtHXttRCwYKe6cxdfiFUSSU9pFQ9Zelj82pgIWg6GreSTESCmwv43yf6Nrf5s2 /dev/test-vg/test-lv
MAJOR=252
MINOR=0

So that points to an lvm problem in this case, at least.

I also notice a difference in this event - DM_SUSPENDED=0 vs. =1:

UDEV [3.709786] add /devices/virtual/block/dm-0 (block)
UDEV_LOG=3
ACTION=add
DEVPATH=/devices/virtual/block/dm-0
SUBSYSTEM=block
DEVNAME=/dev/dm-0
DEVTYPE=disk
SEQNUM=1091
DM_SUSPENDED=0
DM_UDEV_RULES=1
MAJOR=252
MINOR=0
DEVLINKS=/dev/disk/by-id/dm-name-

Steve Langasek (vorlon)
Changed in linux (Ubuntu Oneiric):
status: Incomplete → Invalid
Revision history for this message
Steve Langasek (vorlon) wrote :

Looking at bugs on lvm2 turns up this gem in bug #802626:

> Just a wild speculation, because I haven't yet digged into the interactions between kernel
> and udevd, but the semaphore decrementation event might be lost when transitioning from
> the initrd-udevd to the rootfs-udevd.

In cases where people are seeing initrd problems and lvm is in use, are there always LVM VGs *other* than for the root fs? If there's a VG that's not a dependency of the rootfs, then indeed the vgchange call can be racing the main initramfs sequence.

Twiddling the lvm2 package to insert a delay may help with reproducing the issue.

None of this explains any issues with device nodes being absent from /dev on the root filesystem when lvm is *not* in use. The original bug report did not mention LVM at all. Can someone comment on the disk configuration used there? Maybe we need to split this into two bugs.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Steve,

see comment #42 for one disk layout that reproduces it for me in a kvm VM.

(the vm was created with 'vm-new oneiric amd64 clean' - well, technically with a customized vm-new using the mini iso which hasn't yet been merged into ubuntu-qa-tools)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Steve,

also see bug 833891 as a udev bug specifically for the LVM case. In fact, toward the end of that bug Eduard speculated that precisely this semaphore was the cause, as you just did.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures caused by udev race

On Thu, Oct 06, 2011 at 01:18:27AM -0000, Serge Hallyn wrote:

> see comment #42 for one disk layout that reproduces it for me in a kvm
> VM.

This describes a physical disk layout... it doesn't describe a partition or
LV layout, which is what is key to reproducing the LVM-related hang. Is
this VM using LVM?

Revision history for this message
Steve Langasek (vorlon) wrote :

On Thu, Oct 06, 2011 at 01:20:03AM -0000, Serge Hallyn wrote:
> also see bug 833891 as a udev bug specifically for the LVM case.

Do you mean you're already tracking the LVM case on that bug instead, and
that in the setup you're using there is *no* use of LVM? I guess no one
told Adam that this is a different bug, because it was his LVM-based setup
that I was responding to... :)

> In fact, Eduard toward the end speculated precisely the semaphore as a
> cause as you just did.

That was Eduard's speculation, copied from the *other* bug I linked to.

Revision history for this message
James Hunt (jamesodhunt) wrote : Re: boot failures caused by udev race

@Adam/@Serge: could you try the following to see if it solves / reduces the occurrence of the problem. Also, can you report back if you see any "interesting" processes as logged by the change below:

1) Change initramfs to bind mount /dev and dump devices and running processes before udevd exits:

$ sudo sed -i.bak -e '/^udevadm control --exit/i echo "XXX: ps=`ps -ef`, \/dev=`ls -l \/dev`" > /dev/kmsg' -e 's!^mount -n -o move /dev!mount -n -o bind /dev!g' /usr/share/initramfs-tools/scripts/init-bottom/udev
$ sudo mkinitramfs -o /boot/initramfs.foo

2) Reboot setting "initrd /boot/initramfs.foo" in grub.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Steve,

(re comment #52) I'm not sure what you were asking for then, but as I said in that comment, partition 1 is a simple ext3 filesystem. Partition 2 is just an extended partition. Partition 5 (the only one inside the extended partition) is swap. There is no LVM. With standard (non-instrumented) udev it hangs roughly every 5 boots.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 818177] Re: boot failures caused by udev race

Quoting Steve Langasek (<email address hidden>):
> On Thu, Oct 06, 2011 at 01:20:03AM -0000, Serge Hallyn wrote:
> > also see bug 833891 as a udev bug specifically for the LVM case.
>
> Do you mean you're already tracking the LVM case on that bug instead, and

I was, yes. I filed that one some time ago to track the hang on my own
laptop. I realize there are too many open bugs on this, and we should
mark some as dupes.

> that in the setup you're using there is *no* use of LVM? I guess no one

Yes.

> told Adam that this is a different bug, because it was his LVM-based setup
> that I was responding to... :)

Right.

> > In fact, Eduard toward the end speculated precisely the semaphore as a
> > cause as you just did.
>
> That was Eduard's speculation, copied from the *other* bug I linked to.

Right.

NOTE though:

The LVM authors have admonished us for calling vgscan/vgchange in a udev
rule, as that has its own deadlock potential. My debdiff in bug 833891
is meant to address that. It sounds like that deadlock is separate from
the speculation above. Although perhaps not - such a deadlock could
explain why a semaphore in udevd is not getting V'd (since udevd never
finishes its processing of that rule, but just gets killed on timeout)?

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting James Hunt (<email address hidden>):
> @Adam/@Serge: could you try the following to see if it solves / reduces
> the occurrence of the problem.

It certainly didn't solve it - it hung on the first try :) I'm not sure what you
would deem interesting, but here is the output from the script as obtained
from dmesg.

...

[ 0.651771] udevd[79]: starting version 173
[ 0.812351] FDC 0 is a S82078B
[ 0.815764] EXT3-fs: barriers not enabled
[ 0.818981] kjournald starting. Commit interval 5 seconds
[ 0.819002] EXT3-fs (vda1): mounted filesystem with ordered data mode
[ 0.839767] XXX: ps= PID USER VSZ STAT COMMAND
[ 0.839768] 1 0 4456 S /bin/sh /init ro quiet
[ 0.839769] 2 0 0 SW [kthreadd]
[ 0.839770] 3 0 0 SW [ksoftirqd/0]
[ 0.839771] 4 0 0 SW [kworker/0:0]
[ 0.839771] 5 0 0 SW [kworker/u:0]
[ 0.839772] 6 0 0 SW [migration/0]
[ 0.839773] 7 0 0 SW< [cpuset]
[ 0.839773] 8 0 0 SW< [khelper]
[ 0.839774] 9 0 0 SW< [netns]
[ 0.839774] 10 0 0 SW [sync_supers]
[ 0.839775] 11 0 0 SW [bdi-default]
[ 0.839776] 12 0 0 SW< [kintegrityd]
[ 0.839776] 13 0 0 SW< [kblockd]
[ 0.839777] 14 0 0 SW< [ata_sff]
[ 0.839778] 15 0 0 SW [khubd]
[ 0.839778] 16 0 0 SW< [md]
[ 0.839779] 17 0 0 SW [kworker/0:1]
[ 0.839780] 18 0 0 SW [khungtaskd]
[ 0.839780] 19 0 0 SW [kworker/u:1]
[ 0.839781] 20 0 0 SW [kswapd0]
[ 0.839781] 21 0 0 SWN [ksmd]
[ 0.839782] 22 0 0 SW [fsnotify_mark]
[ 0.839783] 23 0 0 SW [ecryptfs-kthrea]
[ 0.839783] 24 0 0 SW< [crypto]
[ 0.839784] 32 0 0 SW< [kthrotld]
[ 0.839785] 33 0 0 SW [aemon --resolve-names=never
[ 0.839807] 96 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839808] 97 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839808] 98 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839809] 99 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839810] 100 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839811] 101 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839812] 102 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839813] 103 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839816] 104 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839817] 105 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839817] 106 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839818] 107 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839819] 108 0 12752 S /sbin/udevd --daemon --resolve-names=never
[ 0.839820] 109 0 12752 S /sbin/udevd --daemon --reso...

Revision history for this message
Steve Langasek (vorlon) wrote : Re: boot failures caused by udev race

Serge,

> I'm not sure what you were asking for then, but as I said in that
> comment, partition 1 is a simple ext3 filesystem.

Oh, doh - I read the wrong comment. Sorry. :)

Revision history for this message
Andy Whitcroft (apw) wrote :

OK. I have a machine here which triggers something similar to this pretty often. For me udev finds /dev is read-only and halts boot. This is presumably because our devtmpfs /dev has not made it into / when it starts. I was getting a failed boot about every 5-6 boots, 4 in 20 overall. I then used the workaround in comment #54; with that in place I have 35 consecutive successful boots.

The conjecture is that /dev in the initramfs is busy and cannot be moved when needed. From the ps output in comment #57 it seems udev has not actually quit when requested, which might well account for this.

Steve Langasek (vorlon)
summary: - boot failures caused by udev race
+ boot failures because 'udevadm exit' does not kill udevd worker threads
Revision history for this message
Steve Langasek (vorlon) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

The output in comment #57 is pretty conclusive. The call to 'udevadm exit' is killing the parent process, but leaving all the worker threads running... and these worker threads are holding /dev open, preventing the mount -o move from working.

The use of 'udevadm exit' is a recent (oneiric-cycle) change, in response to bug #787610 / Debian bug #624469. Previously we used pkill, which would kill *all* the processes named 'udevd', ensuring that the children worker threads were also killed; but pkill is asynchronous (since all it does is send a signal), so that's also a race condition.

I see two possible fixes:

1) fix 'udevadm exit' to reap its children (but that may not be correct vis-à-vis upstream)
2) call 'udevadm exit', and then *also* call pkill udevd to reap any remaining children

James, please check the udev upstream source to see if there are any changes there already to make 'udevadm exit' handle its children more robustly. If there is, we should cherry-pick that fix. If there isn't, we should go with option 2) as a fallback. Option 2 is still racy because the child processes aren't guaranteed to die immediately either, but this will close the race as best as we can on short notice. If we do have to go with option 2, we should pursue a more robust 'udevadm exit' implementation with upstream.
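A rough sketch of what option 2 might look like in init-bottom/udev (illustrative only, not a tested patch; the initramfs scripts run with 'set -e', so the possible non-zero exits need guarding, and the placement is assumed from the script snippet quoted later in this thread):

  # ask udevd to exit; this can return non-zero if its own timeout is hit
  udevadm control --exit || true
  # then reap any workers udevd may have lost track of; pkill is non-zero if nothing matched
  pkill udevd || true
  mount -n -o move /dev ${rootmnt}/dev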

Revision history for this message
Steve Langasek (vorlon) wrote :

All of this should land in oneiric-updates now, rather than trying to squeeze it in before release. It's an annoying boot-time race condition, but doesn't affect installation itself (if the race hits on the first reboot after install, well.... reboot again). So we should aim for this to be a 0-day SRU instead.

Changed in udev (Ubuntu Oneiric):
milestone: ubuntu-11.10 → oneiric-updates
status: Confirmed → Triaged
Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote :

I've read all the comments in this bug thoroughly, and for every log file which has timestamps in it, there appears to be the 60s delay caused by the udev exit timeout on "hanging" children. If that is true, then even when adding pkill to the initrd scripts, there will still be the 60s delay on boot, right?

Changed in linux (Ubuntu Precise):
status: New → Invalid
Changed in udev (Ubuntu Precise):
status: New → Triaged
importance: Undecided → High
tags: added: rls-mgr-p-tracking
removed: rls-mgr-o-tracking
Revision history for this message
Steve Langasek (vorlon) wrote :

If we can't get this fix in for release, it definitely needs to be release-noted.

Changed in ubuntu-release-notes:
status: New → Triaged
Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

On Thu, Oct 06, 2011 at 09:12:50PM -0000, Eduard Hasenleithner wrote:
> I've read all the comments in this bug thoroughly, and for every log
> file which has timestamps in it, there appears to be the 60s delay
> caused by the udev exit timeout on "hanging" children.

In fact I only see *one* such log attached to this bug, and it's specific to
the netboot installation - which is now tracked as a separate bug, bug
#862823
, because the behavior seen there doesn't match up with anything else
shown in this report.

So I don't see any reason to think we'll be hitting timeouts. Indeed, if
udev were actually waiting for that timeout I wouldn't expect us to see any
of these races...


Revision history for this message
Steve Langasek (vorlon) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

Based on additional feedback from Daviey regarding the possible urgency of getting this in before release, I've reviewed the upstream changes since the 173 release. Of course, with kernel.org down for the past month, there's not really much of it. No indication of an upstream fix for udevadm control --exit not taking care of the worker processes.

So I think we need to go with the udevadm control --exit && pkill udevd as a short-term solution.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

With initrd's scripts/init-top/udev changed to run udev with -D and > /run/initrd/udev.log 2>&1, I get this output. I also added to udev/udevd.c the following patch. Note that IIUC there are both workers and events on the list. Every time that I've tried to print out info about the remaining events, I've failed to reproduce the hang.

(This, fwiw, is on the VMs with no LVM at all)

Index: udev-173/udev/udevd.c
===================================================================
--- udev-173.orig/udev/udevd.c  2011-10-06 22:14:07.000000000 -0500
+++ udev-173/udev/udevd.c       2011-10-06 22:34:12.000000000 -0500
@@ -1633,8 +1633,16 @@
                        worker_kill(udev, 0);
 
                        /* exit after all has cleaned up */
-                       if (udev_list_is_empty(&event_list) && udev_list_is_empty(&worker_list))
+                       if (udev_list_is_empty(&event_list) && udev_list_is_empty(&worker_list)) {
+                               info(udev, "event list and worker list are empty\n");
                                break;
+                       }
+                       if (!udev_list_is_empty(&event_list)) {
+                               info(udev, "with udev_exit, event list not empty\n");
+                       }
+                       if (!udev_list_is_empty(&worker_list)) {
+                               info(udev, "with udev_exit, worker list not empty\n");
+                       }
 
                        /* timeout at exit for workers to finish */
                        timeout = 60 * 1000;

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Steve,

well there *is* the kernel patch https://lkml.org/lkml/2011/8/22/155, which doesn't seem to be upstream yet.

Hard to believe anything is doing clone(CLONE_NEWNET) during boot (though not impossible), but I wonder if udevd stopping reading events could trigger this.

tags: added: rls-mgr-o-tracking
Revision history for this message
Steve Langasek (vorlon) wrote :

Serge,

Picking that log apart, I see one worker thread (pid 94) not exiting cleanly, whereas all the others do. No clear indication of why this worker would have failed to exit since its own log output shows it completely processing an event (seq 1072) and doesn't show it picking up another one; but we're definitely left with references to the worker, which is why we see the 'worker exit' message without the 'worker cleaned up' message (and why the worker list is never empty).

Could you do a test that would correlate this udev output with a 'ps -ef' at the end of scripts/init-bottom/udev? It doesn't look to me like the output from the two tests we have here match in the slightest - one reports all worker threads exited, the other shows dozens of processes still running...

Could you (and others) also try the udev from ppa:vorlon/ppa, and see if this makes a difference in bootability? This adds the pkill after udevadm control --exit. If it works, we know udevd is losing track of its worker threads...
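For anyone wanting to test, pulling in the PPA build looks roughly like this (a sketch; it assumes the binary package is simply named udev and that its postinst regenerates the initramfs - if not, run 'sudo update-initramfs -u' afterwards):

  sudo add-apt-repository ppa:vorlon/ppa
  sudo apt-get update
  sudo apt-get install udev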

Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote :

Hmmm, I guess the original report of this bug is not for a netbook, but a server. Still I see this in the BootDmesg.txt attached to this bug:

[ 3.521410] input: ATEN CS1308 V1.2.114 as /devices/pci0000:00/0000:00:1d.2/usb4/4-1/4-1.1/4-1.1:1.1/input/input4
[ 3.521587] generic-usb 0003:0557:2261.0004: input,hidraw3: USB HID v1.10 Mouse [ATEN CS1308 V1.2.114] on usb-0000:00:1d.2-1.1/input1
[ 63.200054] bnx2: Can't load firmware file "bnx2/bnx2-mips-06-6.2.1.fw"
[ 63.200148] bnx2 0000:03:00.0: PCI INT A disabled

Isn't this the 60s wait from "udevadm control --exit"?

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

On Fri, Oct 07, 2011 at 06:12:51AM -0000, Eduard Hasenleithner wrote:
> Hmmm, I guess the original report of this bug is not for a netbook, but
> a server. Still I see this in the BootDmesg.txt attached to this bug:

> [ 3.521410] input: ATEN CS1308 V1.2.114 as /devices/pci0000:00/0000:00:1d.2/usb4/4-1/4-1.1/4-1.1:1.1/input/input4
> [ 3.521587] generic-usb 0003:0557:2261.0004: input,hidraw3: USB HID v1.10 Mouse [ATEN CS1308 V1.2.114] on usb-0000:00:1d.2-1.1/input1
> [ 63.200054] bnx2: Can't load firmware file "bnx2/bnx2-mips-06-6.2.1.fw"
> [ 63.200148] bnx2 0000:03:00.0: PCI INT A disabled

> Isn't this the 60s wait from "udevadm control --exit"?

Ah, yes, that is. Sorry for missing that.


Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

Just for the record (I do have an LVM setup): Installing the PPA with pkill enabled still gives me a read-only root filesystem (after 60s timeout). Don't know how this will turn out for the cases without LVM (i.e. without vgscan hanging).

Revision history for this message
Steve Langasek (vorlon) wrote :

Looking through the udev code, I can't really see any reason that pkill would be more effective than udevadm control --exit (as earlier reported), except if udevd has somehow lost track of a worker, because both are supposed to send SIGTERM to all the processes. Furthermore, if udevadm is doing what it ought, I would expect the 'udevadm control --exit' command to block until either the udevd processes have all died, or the timeout is reached, thus blocking the boot. But I don't remember anyone saying the boot itself was slow when hitting this bug - so maybe one issue here is that udevadm is failing to wait for udevd to exit before continuing.

tags: added: iso-testing
Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

Quoting Eduard Hasenleithner (<email address hidden>):
> Just for the record (I do have an LVM setup): Installing the PPA with
> pkill enabled still gives me a read-only root filesystem (after 60s
> timeout). Don't know how this will turn out for the cases without LVM
> (i.e. without vgscan hanging).

I've tried (last week) swapping the --exit for pkill. It did not solve
my problem.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Quoting Steve Langasek (<email address hidden>):
> Serge,
>
> Picking that log apart, I see one worker thread (pid 94) not exiting
> cleanly, whereas all the others do. No clear indication of why this
> worker would have failed to exit since its own log output shows it
> completely processing an event (seq 1072) and doesn't show it picking up
> another one; but we're definitely left with references to the worker,
> which is why we see the 'worker exit' message without the 'worker
> cleaned up' message (and why the worker list is never empty).

Will do. Right now I'm trying one more time to print out the worker info.

> Could you do a test that would correlate this udev output with a 'ps
> -ef' at the end of scripts/init-bottom/udev? It doesn't look to me like
> the output from the two tests we have here match in the slightest - one
> reports all worker threads exited, the other shows dozens of processes
> still running...

Are you sure, though? Some of the output from my other test looked garbled;
I'm wondering whether the ps -ef output simply got cut such that the main
worker thread wasn't listed.

> Could you (and others) also try the udev from ppa:vorlon/ppa, and see if
> this makes a difference in bootability? This adds the pkill after
> udevadm control --exit. If it works, we know udevd is losing track of
> its worker threads...

(Will try)

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

@Steve,

When the hang happens, the 'ps -ef > /run/.initramfs/udevps.log' I placed at the bottom of initrd's /scripts/init-bottom/udev does not appear to run. (On a good boot, it does.)

Here is the udev log file printing out the info about the hanging worker and event.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

Another log - the hanging event is again for a tty, but for a different one.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

> Could you (and others) also try the udev from ppa:vorlon/ppa, and see if
> this makes a difference in bootability? This adds the pkill after
> udevadm control --exit. If it works, we know udevd is losing track of
> its worker threads...

The 60 second timeout still happens with udev from your ppa. However,
/ ends up rw, which is a big improvement. /run is still mounted, though, so I'm
not sure what other parts of init did not happen.

Revision history for this message
James Hunt (jamesodhunt) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

Just to be clear, the output in #57 is *before* we attempt to kill udevd in any shape or form.

Doing some experimentation, I've found that a "long-running" udev rule continues to run post-initramfs, so although the workers should be being killed, their children (vgscan et al) shouldn't be.

Also, picking over the initramfs code, I've found a bug which I don't think is directly related to this, but could cause similar behaviour (bug 870031).

Revision history for this message
James Hunt (jamesodhunt) wrote :

I've just noticed something odd: if you interrupt the initramfs before it kills udev and performs the move mount, /root/dev/ contains a ton of device entries, including console, zero and null (I was expecting /root/dev/ to be empty).

As @apw has just suggested, this could well explain why I and others have not seen the problem very frequently, since if either the move mount fails or udevd weirdness happens, /root/dev still contains a usable set of devices.

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

James Hunt [2011-10-07 16:18 -0000]:
> I've just noticed something odd: if you interrupt the initramfs before
> it kills udev and performs the move mount, /root/dev/ contains a ton of
> device entries including console, zero, null (I was expecting /root/dev/
> to be empty).

That should be the usual devtmpfs?

Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

I guess James's point is that if you happen to have a rather complete set of device nodes from a really old installation in <root>/dev, this might help you in the case when the devtmpfs cannot be moved.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

On Fri, Oct 07, 2011 at 04:18:09PM -0000, James Hunt wrote:
> I've just noticed something odd: if you interrupt the initramfs before
> it kills udev and performs the move mount, /root/dev/ contains a ton of
> device entries including console, zero, null (I was expecting /root/dev/
> to be empty).

This will vary more or less randomly with the install media. At various
points in the past, we either have or haven't pre-created a handful of
device nodes in the target /dev "just in case".

mount -o bind / /mnt && ls /mnt/dev shows nothing at all here, and I don't
see this behavior.

But it shouldn't hurt to mount -o bind / /mnt && rm -r /mnt/dev/* to rule it
out.


Revision history for this message
James Hunt (jamesodhunt) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

A fresh oneiric server install seems to generally show 88-91 device entries in /root/dev/. My oneiric system that has been upgraded from maverick contains 90 device entries.

Revision history for this message
James Hunt (jamesodhunt) wrote :

@Serge: following on from comments #54 and #57 - could you possibly add a line after the "udevadm control --exit" so we can see the ps output at that point?

  echo "XXX: POST udevd: ps=`ps -ef`, /dev=`ls -l /dev`" > /dev/kmsg

Revision history for this message
Serge Hallyn (serge-hallyn) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

Quoting James Hunt (<email address hidden>):
> @Serge: following on from comments #54 and #57 - could you possibly add
> a line after the "udevadm control --exit" so we can see the ps output at
> that point?
>
> echo "XXX: POST udevd: ps=`ps -ef`, /dev=`ls -l /dev`" > /dev/kmsg

I already had that:

!# Stop udevd, we'll miss a few events while we run init, but we catch up
!udevadm control --exit
!
!# Move /dev to the real filesystem
!mount -n -o move /dev ${rootmnt}/dev
!
!ps -ef > /run/initramfs/udevps.log 2>&1
!ps -ef > /dev/.initramfs/udevps.log 2>&1

That file however did not show up in the cases of the hang.

Is the scripts/init-bottom/udev file itself being killed?

Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

I have to withdraw my comment #71. At that time I did not yet have the new initial ramdisk installed. With it installed, I always get a read-only rootfs.

And I think I have also found the reason. The scripts in the initial ramdisk are executed with "set -e". In this case your pkill fails and aborts the execution of init-bottom/udev. And in the "normal" case, init-bottom/udev aborts because "udevadm control --exit" reports by means of its return value that, after the timeout (yes, udevadm has its own timeout of 60s!), the exit failed.
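A small illustration of that interaction (sketch only; this shows the PPA variant with the extra pkill, and the line placement is assumed from the init-bottom/udev snippet quoted earlier in this thread):

  set -e                                  # any command returning non-zero aborts the script
  udevadm control --exit                  # returns non-zero once its own ~60s timeout is hit
  pkill udevd                             # returns non-zero if no udevd process remains
  mount -n -o move /dev ${rootmnt}/dev    # never reached when either line above fails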

Revision history for this message
Steve Langasek (vorlon) wrote :

I've just pushed a new udev package to ppa:vorlon/ppa which should provide better debugging capabilities. It unconditionally runs 'udevadm monitor -e' before starting udevd, and spits out the complete list of kernel/udev events after udevd is stopped in the initramfs - as well as spitting out a process list with 'ps -ef' so we can see what's been left behind that's still holding /dev open and preventing the mount -omove.

Anyone who's reproducing these boot failures (other than the LVM vgchange race case), please test with this updated package (~ppa3) and attach the full console output to the bug. If you can also attach the full console output to the bug from a successful boot, that will be useful for comparison.

A few comments in summary about the things that have already been tried that didn't work:
 pkill after udevadm control --exit -- this was a distraction. The ps output I was looking at when I jumped to this conclusion was from running ps *before* --exit, so it doesn't really tell us anything - especially since 'pkill udevd' and 'udevadm control --exit' have nearly identical behavior. The difference is that udevadm is designed to wait for udevd to shut down before returning, while the pkill call *may* catch processes that udevd has somehow managed to lose track of (a possible, but not likely, scenario). The new test package should tell us more, by showing us only those processes still running *after* udevadm is called.

ps -ef not returning anything when run after the mount call (comment #75) - yes, because the initramfs script in question is 'set -e'; so as soon as the mount -omove call fails, the script terminates. We should grab the ps output before the mount call - this is done in the test package.

non-empty /dev on the root filesystem - this will definitely mask the problem in many cases. For ease of reproducing the issue, best to do the following from the booted system:

  sudo mount -obind / /mnt
  sudo rm -rf /mnt/dev/*
  sudo umount /mnt

thus ensuring /dev is empty and that any failure to move the /dev mount in the initramfs will be caught early.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' does not kill udevd worker threads

On Fri, Oct 07, 2011 at 09:55:52PM -0000, Eduard Hasenleithner wrote:
> And I think, I have also found the reason. The scripts in the initial
> ramdisk are executed with "set -e". In this case your pkill fails and
> aborts the execution of init-bottom/udev. And in the "normal" case,
> init-bottom/udev aborts because "udevadm control --exit" reports by
> means of its return value that, after timeout (yes, udevadm has its own
> timeout of 60s!), the exit failed.

Oh, right. pkill exits non-zero if it doesn't find anything, and udevadm
exits non-zero if the timeout is hit. And the udevadm timeout is 60 seconds
(not mentioned in the manpage), which just about ensures that the udevadm
timeout will be hit before the udevd one.

So yeah, that patch was certainly a dud.


Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote : Re: boot failures because 'udevadm exit' does not kill udevd worker threads

So, maybe we just put a

udevadm control --exit || pkill udevd || true

in the init-bottom/udev and close this bug ;)
The occasional 60s boot delay will stay, and some stray udev RUN commands might remain in memory, but Oneiric should at least boot properly.

Nevertheless, I'd be very interested to see what causes udevd to hang in a non-LVM setup.
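
For concreteness, a hedged sketch of how that fallback would sit inside the set -e script (the layout is assumed, not the shipped file; as Steve notes further down, swallowing the failure has its own downsides):

  # Hypothetical fragment of scripts/init-bottom/udev with the fallback:
  # under "set -e", the "|| ... || true" keeps a udevadm timeout (or an
  # empty pkill match) from aborting the script before /dev is moved.
  udevadm control --exit || pkill udevd || true
  mount -n -o move /dev ${rootmnt}/dev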

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :
Revision history for this message
Steve Langasek (vorlon) wrote :

I'm able to reproduce the same behavior that Serge sees when running in a VM, and now have a clear understanding of what's happening here at the top level.

The init-bottom script, as noted earlier, is set -e; and udevadm control exits non-zero if it hits its timeout. These two factors combined mean that if udev gets stuck this way, the mount -o move *is never called*, because the script terminates immediately after the udevadm call.

This also entirely explains the behavior seen in bug #833783 (in fact, I just reproduced this problem here) - the difference between systems not finding the local disks, and systems panicking due to a missing /dev/console, comes down to which device nodes happen to be prepopulated on the root filesystem. So I'm marking bug #833783 as a duplicate of this one.

I am currently investigating both the most appropriate way to ensure that udevadm control --exit doesn't have to hit its timeout (raising the timeout slightly, so that udevd times out first and exits, *should* be sufficient but doesn't seem to be), and why udev is getting stuck in seemingly ordinary boot scenarios (pristine vm).

The original problem reported, which involves a missing firmware, seems to have its own cause in the kernel since the firmware should not fail to load. I understand the kernel team is already looking into this.
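
A quick way to see which device nodes are prepopulated on the root filesystem itself, and therefore which failure mode a given machine will show, is the bind-mount trick already suggested in this thread (a hedged example for inspection only):

  # Bind-mount the root filesystem to a second mountpoint so the mounted
  # /dev does not hide the static nodes stored in the root fs's own /dev.
  sudo mount -o bind / /mnt
  ls -l /mnt/dev
  sudo umount /mnt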

summary: - boot failures because 'udevadm exit' does not kill udevd worker threads
+ boot failures because 'udevadm exit' times out while udevd waits for an
+ already-dead thread
Revision history for this message
Steve Langasek (vorlon) wrote : Re: boot failures because 'udevadm exit' times out while udevd waits for an already-dead thread

The root error is that in worker_new(), at lines 353 ff., the worker threads process signals and udev event handling requests in the same loop. If it happens that both the signal and the event are received close enough together to be returned in the same epoll_wait(), and the udev_monitor handle is listed *first* in the epoll_event list, the worker will receive the event message (which increases the refcount on the udevd master side), then process the signal, and immediately exit without handling the event.

This should be handled by processing fd_signal first, and only processing fd_monitor if SIGTERM was not received. This ensures that we always exit ASAP after receiving the signal, instead of going off to process another event that could take an indeterminate amount of time.

Revision history for this message
Steve Langasek (vorlon) wrote :

Tried that; doesn't work because the master process already thinks the event is dispatched as soon as it sends it, whether or not the worker process acts on it with recvmsg(). So I guess we need to try to process the event after all.

Revision history for this message
Steve Langasek (vorlon) wrote :

With the help of some debugging code, I've confirmed that the patch committed to the branch results in a correct (and speedy) exit in the case where a SIGTERM is received before a worker thread has accepted an event for processing. Will push this to the oneiric queue shortly.

Steve Langasek (vorlon)
Changed in udev (Ubuntu Oneiric):
assignee: James Hunt (jamesodhunt) → Steve Langasek (vorlon)
status: Triaged → Fix Committed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 173-0ubuntu4

---------------
udev (173-0ubuntu4) oneiric; urgency=low

  * Set a timeout of 61 seconds when calling udevadm control --exit, to
    ensure that udevd always reaches its *own* timeout first. If udevadm
    times out before udevd, it exits non-zero, breaking the move of the /dev
    mount; and if we ignore the udevadm failure then udevd might still be
    running, which is the original issue that the patch for bug #624469
    was trying to address. Raising the timeout doesn't make the boot any
    slower, it just makes sure we let udevd count the 60 seconds instead of
    udevadm. LP: #818177.
  * debian/patches/race-condition-on-exit: Process events before signals
    in the worker thread, so we don't sit around waiting 60 seconds for an
    already-dead process.
 -- Steve Langasek <email address hidden> Sat, 08 Oct 2011 00:48:54 -0700
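
A hedged illustration of the first change, assuming the timeout is passed through udevadm control's --timeout option as the changelog wording implies:

  # Give udevadm one second more than udevd's internal 60-second timeout,
  # so udevd always gives up (and exits) first.
  udevadm control --exit --timeout=61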

Changed in udev (Ubuntu Oneiric):
status: Fix Committed → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

Great work, Steve! So I suppose this doesn't need a release note after all.

Changed in udev (Ubuntu Precise):
status: Triaged → Fix Released
Changed in ubuntu-release-notes:
status: Triaged → Invalid
Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote :

Many thanks. My LVM system now starts successfully, albeit still showing the 60s hang occasionally. Even the supposed-to-be-affected logical volume is mounted correctly. Only the "/sbin/lvm vgchange -a y" persists in the process list, but this apparently does no harm.

One additional note: I'd set the udevadm control timeout to at least 70 seconds. The epoll_wait timeout is reset every time something happens, so if the "settle time" is longer than 1 second, the total can exceed udevadm control's 61 seconds. Setting the timeout to 70 seconds gives us at least 10 seconds of "settle time".

My conclusion is that "udevadm control --exit" (introduced in April) is still a bit immature, and further races may be waiting to happen.

Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@Steve,

The new udev is working in my VMs! Thanks.

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' times out while udevd waits for an already-dead thread

On Sat, Oct 08, 2011 at 11:56:50AM -0000, Eduard Hasenleithner wrote:
> Many thanks. My LVM system now starts successfully, albeit still showing
> the 60s hang occasionally. Even the supposed-to-be-affected logical
> volume is mounted correctly. Just the "/sbin/lvm vgchange -a y" persists
> in the process list, but this apparently does not harm.

Wow, I didn't expect that to be the case! I still want to keep the other
bug report open and pursue the "hung vgchange" problem, anyway; leaving
stray processes running in the initramfs after boot is untidy.

> One additional note I want to make: I'd set the udevadm-control timeout
> at least to 70 seconds. The epoll_wait timeout is reset every time
> something happens. So if the "settle time" is bigger than 1 second, the
> total timeout would be more than 61 seconds of udevadm-control. Setting
> the timeout to 70 seconds, gives us at least 10 seconds "settle time".

Ah, that's a good point. Of course, 70 seconds isn't guaranteed to be
enough time either; in the pathological case, you need 60*num_workers
seconds, because if you get one worker finishing every 59.9 seconds, the
udevd timeout will continue to be extended and we'll still have the udevadm
control problem.

But 61 seconds should be enough for the 99.99% case, and if we start to see
reports of problems in the other .01% we can adapt by raising the timeout.
(I would prefer not to just '|| true' the udevadm call, though, since I
think having to occasionally reboot the machine again is better than being
left with extra udev processes running around doing who-knows-what on the
system.)

> My conclusion is that "udevadm control --exit" (introduced in april) is
> still a bit immature, and further races might wait to happen.

It's true that there might be further races, but I don't think it's due to
immaturity of udevadm control --exit... rather, the races were always there
and we were blissfully ignoring them with pkill (and that's where the bug
report came from that led us to switch to udevadm in the first place).


Revision history for this message
Eduard Hasenleithner (eduard-hasenleithner) wrote : Re: boot failures because 'udevadm exit' times out while udevd waits for an already-dead thread

> (I would prefer not to just '|| true' the udevadm call, though, since I
> think having to occasionally reboot the machine again is better than being
> left with extra udev processes running around doing who-knows-what on the
> system.)

A very valid point. But in the case of udevadm control failing, wouldn't it be better to do something that gets the user's full attention, with an explicit diagnostic message describing the problem? Simply not moving '/dev' to the rootfs and letting the failed boot run to its "fate" seems a bit too obscure to me.

Being an "expert user", I would of course also like to get the busybox shell in the failure case ;)

Revision history for this message
Steve Langasek (vorlon) wrote : Re: [Bug 818177] Re: boot failures because 'udevadm exit' times out while udevd waits for an already-dead thread

On Sat, Oct 08, 2011 at 08:08:34PM -0000, Eduard Hasenleithner wrote:
> > (I would prefer not to just '|| true' the udevadm call, though, since I
> > think having to occasionally reboot the machine again is better than being
> > left with extra udev processes running around doing who-knows-what on the
> > system.)

> Very valid point. But wouldn't it then be better in the case of udevadm-
> control failing to do something which gets 100% attention of the user,
> and with an explicit diagnosis message describing the problem? Simply
> not moving '/dev' to rootfs, and letting the failed boot to reach its
> "fate", seems to be a bit too obscure for me.

I agree. Could you file a new bug against the initramfs-tools package
requesting this?

> Being an "expert user", I would also like to have the busybox-shell for
> the failure case of course ;)

The obvious way to implement this would be to use the panic handler for
script failures.


Revision history for this message
John Miller (jmiller-1) wrote : Re: boot failures because 'udevadm exit' times out while udevd waits for an already-dead thread

I'm encountering this issue on a Dell PowerEdge 2950 (a pretty common server model; we have a ton of them here). A clean install of 11.10 will not boot properly; the same happens when installing 11.04 and then upgrading. It appears to be running the latest udev (173-0ubuntu4).

[ 62.944049] bnx2: Can't load firmware file "bnx2/bnx2-mips-06-6.2.1.fw"
[ 62.944081] bnx2 0000:05:00.0: PCI INT A disabled
[ 62.944095] bnx2: probe of 0000:05:00.0 failed with error -2

Apologies if I'm missing something, and let me know if I can provide any further info.

Revision history for this message
Tom Ellis (tellis) wrote :

@John Miller
I've reopened https://bugs.launchpad.net/ubuntu/+source/update-manager/+bug/842560, which was marked as a duplicate of this udev bug (I originally mentioned the firmware issue in the description, but it was unrelated). Now that this udev problem is sorted, the firmware issue can be looked at in more detail.

Let's use LP#842560 to track the firmware issue and leave this bug closed out.

Revision history for this message
dyna (ubuntu-dyna) wrote :

Same problem here on an ML350 G5.

The funny thing is that after the upgrade from 11.04 to 11.10 I had no problems booting, just no network as described in bug 842560. I booted without network about 3-4 times with success.

To try to fix that I upgraded the firmware using the HP firmware CD 930 (which seems to work with an 11.10 desktop boot CD).
After this I ran into the killing-modprobe / read-only fs problem, and I haven't gotten the system to boot correctly since.

Since everything was backed up I decided to start with a clean 11.10 install, but ran into the same problem.

I tried replacing the 'udevadm control --exit' in /scripts/init-bottom/udev with 'pkill udevd', but it had no effect.

I guess trying to downgrade the firmware is next, but since HP only supplies separate Windows packages that will take a while...

Revision history for this message
dyna (ubuntu-dyna) wrote :

Sorry, I had updated the wrong initramfs. pkill does make a difference, but still doesn't give a booting system.
It does get further, and I did see some bnx2 firmware errors again, which I didn't see before using pkill.

The fix in #17 however works: http://paste.ubuntu.com/679222/

Adding the following lines to init-bottom/udev:
. /scripts/functions
wait_for_udev

And I have a booting system with network.
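
For anyone applying this workaround by hand, a hedged sketch of it in context (the placement inside the script, the source path, and the rebuild step are assumptions, not part of dyna's report):

  # Assumed placement inside scripts/init-bottom/udev (on an installed system
  # the source copy usually lives under /usr/share/initramfs-tools/scripts/init-bottom/):
  . /scripts/functions
  wait_for_udev                        # wait for the udev event queue to drain
  udevadm control --exit
  mount -n -o move /dev ${rootmnt}/dev

  # The copy inside the initramfs image is what runs at boot, so rebuild it:
  sudo update-initramfs -u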

Revision history for this message
David Bukowski (davebb) wrote :

Sorry if this is the wrong place to ask, but this bug is present on the Ubuntu 11.10 Server (Oneiric) image I downloaded today, 2011-11-17, and I cannot even get the server to start.
The server is an HP ProLiant DL380 G5.
The image is the ISO available for download.
I chose NOT to use LVM and did a straight disk partition instead, so I can say the bug exists even without an LVM setup.
I tried reading through the notes here to find the fix; it says the bug is fixed, but it is still a problem. Or do I need to somehow boot off the CD and then make modifications to the system?

Revision history for this message
Doug Weimer (dweimer) wrote :

To echo davebb, I just downloaded the ubuntu-11.10-server-amd64.iso (md5: f8a0112b7cb5dcd6d564dbe59f18c35f) on 11/23/2011 and had this problem on two different systems. One was a Sun x4100 (AMD 8354, nvidia chipset, no LVM) and one was a Supermicro Intel box, no LVM.

In both cases the systems printed out a large number of the 'udev[90]: timeout: killing ' messages and then froze. The console didn't come up for over 45 minutes. I have left it running to see if the host ever comes up.

I see the fix is listed for oneiric updates. These errors happened for me on the first boot after an install and the machines don't finish booting. How can I get access to the fix during the install process?

Revision history for this message
Endre Stølsvik (stolsvik) wrote :

Just wanted to chime in that I have the exact same problem with udevd killing lvm vgscan.
udevd[139]: timeout: killing 'watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y''

What does this imply? Can I just ignore it till you folks fix it ASAP?

BTW: it has been a real fun upgrade experience to oneiric: first I was bitten by bug #858122 on my desktop. Now my "KVM server" (the host of several guests) has been bitten badly by the same bug (only on the server it hung hard), and then apparently by this one.

Revision history for this message
KnisterPeter (knister-peter) wrote :

Seems related to #625395

Revision history for this message
KnisterPeter (knister-peter) wrote :

Hmm, just ignore my comment...

Revision history for this message
KnisterPeter (knister-peter) wrote :

I have been running into the timeout for the last two days. Before that my oneiric experience was quite good. After reading these comments it seems I should be happy that there is a timeout after 61s, but what is causing udev to time out?
It's very annoying to sit and wait for the system - especially when it booted a lot faster the day before...

Any idea how to find the root cause of the timeout?
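
One low-effort starting point (a hedged suggestion, not from the thread): the worker that hits the timeout is named in the kernel log, so after a slow boot the command being killed can usually be read straight out of dmesg.

  # Assumed diagnostic: list the commands udevd reported killing on timeout.
  dmesg | grep -i 'timeout: killing'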

tags: added: kernel-da-key
Revision history for this message
Andy Whitcroft (apw) wrote :

There are definitely corner cases of this issue still present in Precise. Looking at the udev init-bottom script, it seems that we are not coping correctly with a timeout when requesting that udev exit for us. Thus we skip moving /dev over onto the new /root, which in almost all cases is utterly fatal, leaving us on a read-only root.

summary: - boot failures because 'udevadm exit' times out while udevd waits for an
- already-dead thread
+ boot failures as /dev is not transferred to /root (because 'udevadm
+ exit' times out waiting for a deadlocked worker)
Changed in udev (Ubuntu Precise):
assignee: nobody → Andy Whitcroft (apw)
status: Fix Released → In Progress
Steve Langasek (vorlon)
Changed in udev (Ubuntu Oneiric):
status: Fix Released → In Progress
assignee: Steve Langasek (vorlon) → Andy Whitcroft (apw)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 175-0ubuntu3

---------------
udev (175-0ubuntu3) precise; urgency=low

  [ Andy Whitcroft ]
  * debian/udev.initramfs-bottom: ignore timeout errors from udevadm we
    expect (and indeed requested) in certain failure modes. If we do not we
    will bail out early and not transfer /dev into /root which is always
    fatal leading to unbootable machines. (LP: #818177)
  * avoid-exit-deadlock-for-timely-events: avoid deadlock when exiting
    by continuing to handle events with timeliness requirements.
    The timeliness requirement will be violated if we ignore them which
    is highly undesirable. Also these events are typically dependant
    events and may well block the events we are waiting on leading to
    boot delays and uninitialised devices. (LP: #842560)
  * debian/udev.initramfs-bottom: increase the client-side timeout to
    better cope with potential timeout extension issues in udev. We very
    much would prefer udev to time itself out and guarantee to have
    completed than take action ourselves. Very worst case the timeout may
    be doubled from the default of 60s so increase ours accordingly. Note,
    we should only ever trip this timeout when we are already in severe
    trouble. (LP: #818177)
 -- Steve Langasek <email address hidden> Fri, 16 Dec 2011 11:15:39 -0800
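
A hedged sketch of what the behaviour described in this changelog could look like in the initramfs-bottom script; the timeout value, message text, and control flow are assumptions, not the shipped code:

  # Allow for udevd's own timeout being extended/doubled, and treat a udevadm
  # timeout as expected rather than fatal so the /dev move below still runs.
  if ! udevadm control --exit --timeout=121; then
      echo "udevd did not exit in time; continuing so /dev can be moved" > /dev/kmsg
  fi
  mount -n -o move /dev ${rootmnt}/dev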

Changed in udev (Ubuntu Precise):
status: In Progress → Fix Released
Revision history for this message
Clint Byrum (clint-fewbar) wrote : Please test proposed package

Hello Tom, or anyone else affected,

Accepted udev into oneiric-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in udev (Ubuntu Oneiric):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Peter Matulis (petermatulis) wrote :

I verified that I can now install Oneiric on a previously-plagued HP Proliant DL380 by installing with the kernel boot option

apt-setup/proposed=true

tags: removed: kernel-da-key
Steve Langasek (vorlon)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 173-0ubuntu4.1

---------------
udev (173-0ubuntu4.1) oneiric-proposed; urgency=low

  [ Andy Whitcroft ]
  * debian/udev.initramfs-bottom: ignore timeout errors from udevadm we
    expect (and indeed requested) in certain failure modes. If we do not we
    will bail out early and not transfer /dev into /root which is always
    fatal leading to unbootable machines. (LP: #818177)
  * avoid-exit-deadlock-for-timely-events: avoid deadlock when exiting
    by continuing to handle events with timeliness requirements.
    The timeliness requirement will be violated if we ignore them which
    is highly undesirable. Also these events are typically dependant
    events and may well block the events we are waiting on leading to
    boot delays and uninitialised devices. (LP: #842560)
  * debian/udev.initramfs-bottom: increase the client-side timeout to
    better cope with potential timeout extension issues in udev. We very
    much would prefer udev to time itself out and guarantee to have
    completed than take action ourselves. Very worst case the timeout may
    be doubled from the default of 60s so increase ours accordingly. Note,
    we should only ever trip this timeout when we are already in severe
    trouble. (LP: #818177)
 -- Steve Langasek <email address hidden> Fri, 16 Dec 2011 14:10:02 -0800

Changed in udev (Ubuntu Oneiric):
status: Fix Committed → Fix Released
Revision history for this message
Dylan Damian (dylan-damian) wrote :

Still happening here, on Precise Alpha 1. I see a 60+ second delay and a corresponding time gap in the dmesg boot timeline on every boot. I've repeated the install several times and it is consistent.
udevd and udevadm version 175

Revision history for this message
Dylan Damian (dylan-damian) wrote :

Edit re: #119: it was Precise Alpha 2 (not 1); the 2/1/2012 build, upgraded to udev _175-0ubuntu4_amd64, is still experiencing a 60+ second boot delay.

Revision history for this message
Dylan Damian (dylan-damian) wrote :

Sorry, but I am still experiencing a 60+ second boot delay consistently with udev amd64 175-0ubuntu4. I'll assist with testing in any way I can.

Revision history for this message
Shannon Barber (sgbarber) wrote :

I am not able to install Ubuntu 11.10 x64 due to this bug.
The install option fails and spams "udevd timeout: killing modprobe <blah>" messages.
I tried disabling modeset and removing the quiet splash.
It appears to be an infinite loop.

Revision history for this message
Kevin R (kevin-riehle) wrote :

Shannon, I was at the same point you are before I was able to finally get a functioning install.

Here is what didn't work first:
- 11.10 Ubuntu - Wubi install via Windows 7 (I believe it selected 32-bit by default)
-- I believe this froze at the blank purple screen
- 11.10 Ubuntu - 64-bit via bootable CD
-- This froze with an error message before I even got to the install dialog, something like 'sbin modprobe udevd terminated by signal 9' (see above for the full error message)

I was then able to finally get into the install menu by doing the following:
- Boot into install disk
- Hit escape at the first purple screen that has the little man figure at the bottom next to the keyboard
- Hit F6 for other options (I believe) and select 'nomodeset', which allowed me to install.

I then installed for the first time without internet access (another issue), and after rebooting following the successful install I got the same blank purple screen as before.

The next time I was again able to install using 'nomodeset', but this time with networking. After installing, I rebooted, and after 20-30s it finally brought up the splash screen.

I'm on a Dell Precision T7500
Dual Quad Core Intel® Xeon® Processor E5606, 2.13GHz, 8M L3, 4.8GT/s
12GB, DDR3 RDIMM Memory, 1333MHz, ECC (6 DIMMS)
Dual 1.0GB NVIDIA® Quadro® 2000, Quad MON, 4 DP & 2 DVI

I shrunk my original Windows 7 partition, formatted into NTFS, and installed onto this area (as the more you use Windows the less you can shrink the original partition).

I'm not sure yet, but this problem appears to be solved by downloading updates while installing. For whatever reason I was unable to apply these updates by booting into my first install and changing the grub boot menu (removing 'quiet splash' and adding 'nomodeset').

Good luck, and I hope this helps somehow.

Revision history for this message
Sergei Ianovich (ynvich-gmail) wrote :

1 of our 4 Intel S1690WB servers doesn't boot even with the latest oneiric udev package (173-0ubuntu4.1 from oneiric-proposed). It does boot with the workaround from comment #17. The 3 that do boot suffer from a 60-second timeout. All 4 have LVM over software RAID1.

The timeout itself indicates a mismanaged race. Can #871528 be reopened for this specific configuration, or should I file a new bug?

Revision history for this message
Sergei Ianovich (ynvich-gmail) wrote :
Revision history for this message
Carli (s3734770) wrote :

The timeout thing is not fixed after a dist-upgrade. How can I work around it? Or does "Fix Released" mean the fix is in git but not yet in the packages?

Revision history for this message
Mina Galić (minagalic) wrote :

Even with the latest udev on Oneiric, vgchange still hangs around:

i.galic@bacon ~ % dpkg -l | awk '/udev/{print $3}'
173-0ubuntu4.1
173-0ubuntu4.1
i.galic@bacon ~ % uname -a
Linux bacon 3.0.0-16-server #28-Ubuntu SMP Fri Jan 27 18:03:45 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
i.galic@bacon ~ % ps -cafe | grep -i vg[c]
root 337 1 TS 37 02:25 ? 00:00:00 /sbin/lvm vgchange -a y
i.galic@bacon ~ %

That is on the host; meanwhile the KVM guests fail to boot, stuck at init-bottom for over a day now (so no matter how long I wait, I assume it will remain stuck):

Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 1.410792] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.

I'd say this bug still exists. I'd love to help you debug it - you can find me on IRC as 'jMCg'
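
For anyone triaging similar reports, a pair of hedged checks (assumed commands, not from this report) to run on an affected host:

  # Look for processes left behind by the initramfs, and confirm whether
  # /dev is a real devtmpfs mount rather than static nodes on the root fs.
  ps -ef | grep -E '[v]gchange|[u]devd'
  grep ' /dev ' /proc/mounts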

Revision history for this message
Mina Galić (minagalic) wrote :

Now with the latest packages installed from oneiric-proposed, I have:

[ 1.574823] md: multipath personality registered for level -4
done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 1.652371] EXT4-fs (vda): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[ 2.833913] udevd[352]: starting version 173

I'd call it progress.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Users running into the deadlock in init-bottom around watershed's call to vgchange, please see Bug #802626
