udevd segfault after "worker did not accept message" and "worker unexpectedly returned with 0"

Bug #396957 reported by Colin Watson
12
This bug affects 2 people
Affects Status Importance Assigned to Milestone
udev (Ubuntu)
Fix Released
High
Unassigned

Bug Description

Binary package hint: udev

I'm afraid I don't have very much to go on here, and I don't know whether I'll be able to reproduce this. I was attempting to resize a partition in a current Karmic server CD, and noticed that it seemed to be taking an unusually long time; on investigation this turned out to be because 'udevadm settle' was timing out, which in turn was because udevd didn't actually appear to be listening on the /org/kernel/udev/udevd socket.

Looking back in the logs, I see that udevd actually died before I got as far as partitioning - that was just the first point when I noticed. I've attached the syslog in the hope that you might be able to do something with it.

Revision history for this message
Colin Watson (cjwatson) wrote :
Revision history for this message
Colin Watson (cjwatson) wrote :
  • ps Edit (3.0 KiB, text/plain)
Revision history for this message
Colin Watson (cjwatson) wrote :
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Jul 8 09:36:41 udevd[663]: worker [5491] did not accept message, kill it
Jul 8 09:36:41 udevd[663]: worker [5491] unexpectedly returned with 0
Jul 8 09:36:41 udevd[663]: worker [5551] unexpectedly returned with 0
Jul 8 09:36:41 kernel: [ 156.832086] <6>udevd[663]: segfault at 4 ip 00959fbc sp bfbe7b78 error 6 in udevd[94f000+1c000]

That's the interesting bit, the remaining udevd processes are just workers left over after the crash.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Kay: didn't get a core dump from this, but any idea - looks like udevd segfaulted (bug 1) and left its worker processes around (bug 2, maybe)

Revision history for this message
Colin Watson (cjwatson) wrote :

"udevd[663]: segfault at 4 ip 00959fbc sp bfbe7b78 error 6 in udevd[94f000+1c000]" works out as udev_list_node_remove + 11, according to gdb.

Dump of assembler code for function udev_list_node_remove:
0x0000afb1 <udev_list_node_remove+0>: push %ebp
0x0000afb2 <udev_list_node_remove+1>: mov %esp,%ebp
0x0000afb4 <udev_list_node_remove+3>: mov 0x8(%ebp),%eax
0x0000afb7 <udev_list_node_remove+6>: mov 0x4(%eax),%edx
0x0000afba <udev_list_node_remove+9>: mov (%eax),%ecx
0x0000afbc <udev_list_node_remove+11>: mov %edx,0x4(%ecx)
0x0000afbf <udev_list_node_remove+14>: mov %ecx,(%edx)
0x0000afc1 <udev_list_node_remove+16>: movl $0x0,0x4(%eax)
0x0000afc8 <udev_list_node_remove+23>: movl $0x0,(%eax)
0x0000afce <udev_list_node_remove+29>: pop %ebp
0x0000afcf <udev_list_node_remove+30>: ret
End of assembler dump.

I'm not sure whether that's entry->prev or entry->next, but looks like one of those is NULL.

Revision history for this message
Colin Watson (cjwatson) wrote :

For the record, since the log doesn't say, this is udev 143-8.

Revision history for this message
Colin Watson (cjwatson) wrote :
Revision history for this message
Colin Watson (cjwatson) wrote :
  • core Edit (432.0 KiB, application/octet-stream)

'while :; do udevadm trigger; udevadm settle; kill -HUP $udevd_pid || break; done' (or the near equivalent) eventually tickled this into crashing under 'ulimit -c unlimited'. Here's a core dump.

Revision history for this message
Colin Watson (cjwatson) wrote :

If it helps (since gdb doesn't seem to want to produce a backtrace unassisted), the kernel segfault message this time round said that the VMA base address was 0x95e000.

Revision history for this message
Kay Sievers (kaysievers) wrote :

The list corruption I think I've found, and the fix is simple. Seems, udev tried to handle the event associated with the worker twice, if it killed the worker itself.

The interesting question is why a worker will not accept messages. We will get there when the wrong kill handling is fixed. :) Thanks for the debugging!

Revision history for this message
Colin Watson (cjwatson) wrote :

I applied http://people.ubuntu.com/~scott/d-i-udev-crash.patch, but unfortunately this doesn't work. The initial 'udevadm settle' on boot just sits there. udevd --debug shows the following after the settle (there was a preceding trigger that did lots of event work):

  [669] handle_ctrl_msg: udevd message (SETTLE) received

... and then nothing, just a hang.

Revision history for this message
Kay Sievers (kaysievers) wrote :

The settle will time out? And you can log in?

Does "udevadm settle" on the running system work?

It seems to work all fine here.

Revision history for this message
Colin Watson (cjwatson) wrote :

"Log in" is not a meaningful concept in this environment; this is happening at installer startup time. Repeated invocations of 'udevadm settle' after timeouts hang in the same way.

If you can't reproduce this, what can I do to debug this for you? --debug doesn't seem especially helpful here.

Revision history for this message
Kay Sievers (kaysievers) wrote :

Any chance to strace a "udevadm settle" call? It sounds strange, that the daemon is receiving the message but does not send the signal back.

During installation, is there anything that might kill udev processes? Like with a package installation/update script?

Revision history for this message
Colin Watson (cjwatson) wrote : Re: [Bug 396957] Re: udevd segfault after "worker did not accept message" and "worker unexpectedly returned with 0"

On Thu, Jul 09, 2009 at 12:04:48AM -0000, Kay Sievers wrote:
> Any chance to strace a "udevadm settle" call? It sounds strange, that
> the daemon is receiving the message but does not send the signal back.

I have an strace, but I don't have any particularly straightforward way
to extract it from the running image at this point, since it's so early.
I can try to describe what's going on in it though. udevadm settle
actually does get SIGUSR1 immediately after sending the settle command
to udevd; but it then sits there waiting for the event queue to become
empty. After the timeout fires, it prints:

udevadm settle - timeout of 10 seconds reached, the event queue contains:
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda (653)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda1 (654)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda2 (655)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda5 (656)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda6 (657)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda7 (658)
  /sys/devices/pci0000:00/0000:00:01.1/host1/target1:0:0/1:0:0:0/block/sr0 (666)
  /sys/devices/platform/i8042/serio0/input/input2/event2 (681)
  /sys/devices/pci0000:00/0000:00:01.1/host1/target1:0:0/1:0:0:0 (850)

There are no instances of "did not accept message" in the udevd debug
log. Since it didn't fail this dramatically early nor as reliably before
this patch, does that suggest that the change to worker_returned is
responsible?

The exact set of events remaining in the event queue does not seem to be
entirely identical in every run, but I haven't seen it be empty yet, and
the above seems fairly typical.

What else should I try to look for?

> During installation, is there anything that might kill udev processes?
> Like with a package installation/update script?

No, this is while the installer is starting up, way before any of that
stuff might happen (and even when I reproduced it earlier in a more
complex environment, nothing like that was going on). Apart from
debugging shells and other such trivia, userspace has literally just
done the following:

  mount /proc
  mount /sys
  mount -t tmpfs tmpfs /dev
  mknod -m 600 /dev/console c 5 1
  mknod -m 666 /dev/null c 1 3
  cp -a -f /lib/udev/devices/* /dev
  echo "" > /proc/sys/kernel/hotplug
  udevd --daemon --resolve-names=never
  udevadm trigger
  udevadm settle

I've put the kernel and initramfs at
http://people.ubuntu.com/~cjwatson/tmp/396957/, which you should just be
able to boot in kvm to reproduce this; no special boot options required
other than initrd= required, but BOOT_DEBUG=3 will get you a debugging
shell before any of the above commands start. You might have to attach a
hard disk and/or virtual CD to kvm as well to reproduce it, I suppose.

Revision history for this message
Steve Dodd (anarchetic) wrote :

Just booted today's (20090709) i386 live CD and seen this after booting with quiet and splash removed from the command line. This was in VirtualBox 2.2.4 with 512Mb of RAM allocated to VM, and no virtual HDD set-up.

I've got to dash off to work now, will try and do some more testing, i.e. on real hardware, when I get back, but I thought I'd mention this in case it was helpful.

If it turn out to be specific to virtual machines, that normally makes me think "race condition" first, as stuff runs that much more slowly when virtualized..

Revision history for this message
Kay Sievers (kaysievers) wrote :

I think I've found it. It likely happens only on new installations because udev writes out the network config rule, which triggers a reload of the rules. This needs to kill all old worker processes to get the new config active. There's a bug how the config-reload is handled while the workers are busy. I'll fix that tomorrow. Thanks for the data, it was really helpful to find that.

Revision history for this message
Kay Sievers (kaysievers) wrote :

Care to try this on top?
http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=adda4c682ad2c56fc091222be3bd94fa817013b9

If that works for you, I'll put out a new release to fix this issue. Thanks!

Revision history for this message
Steve Dodd (anarchetic) wrote :

Applying the two patches to udev 143-8, rebuilding and patching it into the live CD I was having problems with (20090709) eliminates the segfault and the error about not accepting messages, though I do often still get the "unexpectedly returned with status 0x0000" messages - see screenshot. I've tested this in VBox and on a couple of real machines ..

Revision history for this message
Kay Sievers (kaysievers) wrote :

Does it work fine otherwise, and you just get the message? Or does it still fail?

Revision history for this message
Steve Dodd (anarchetic) wrote :

Yes, other than the message everything seems to work fine ..

Revision history for this message
Kay Sievers (kaysievers) wrote :

Great. Thanks for the tests. I'll fix the warning, as "status 0x0000" suggests, it isn't an unexpected failure.

Revision history for this message
Kay Sievers (kaysievers) wrote :
Revision history for this message
Colin Watson (cjwatson) wrote :

On Sat, Jul 11, 2009 at 04:34:18PM -0000, Kay Sievers wrote:
> Care to try this on top?
> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=adda4c682ad2c56fc091222be3bd94fa817013b9

Ah yes, that makes sense as a cause and it seems to fix the problem for
me. Thanks!

Changed in udev (Ubuntu):
importance: Undecided → High
status: New → Triaged
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 145-1

---------------
udev (145-1) karmic; urgency=low

  * New upstream release:
    - detach event from worker if killed.
    - make sure a worker finishes its event before exiting. LP: #396957.
    - handle SIGCHLD before worker messages.
    - use qsort() to sort list rather than by hand (performance).
    - correct mangling of scsi_id return values

  * debian/rules: Update "prep" target to not wipe uncommitted files.
  * debian/libudev-dev.install: Install the restored libtool file
  * debian/libgudev-1.0-dev.install: Install the restored libtool file

 -- Scott James Remnant <email address hidden> Tue, 14 Jul 2009 17:10:38 +0100

Changed in udev (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.