Bug #396957 “udevd segfault after “worker did not accept message...” : Bugs : udev package : Ubuntu

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#1

syslog Edit (66.5 KiB, text/plain)

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#2

ps Edit (3.0 KiB, text/plain)

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#3

dmesg Edit (15.2 KiB, text/plain)

Revision history for this message

Scott James Remnant (Canonical) (canonical-scott) wrote on 2009-07-08:

#4

Jul 8 09:36:41 udevd[663]: worker [5491] did not accept message, kill it
Jul 8 09:36:41 udevd[663]: worker [5491] unexpectedly returned with 0
Jul 8 09:36:41 udevd[663]: worker [5551] unexpectedly returned with 0
Jul 8 09:36:41 kernel: [ 156.832086] <6>udevd[663]: segfault at 4 ip 00959fbc sp bfbe7b78 error 6 in udevd[94f000+1c000]

That's the interesting bit, the remaining udevd processes are just workers left over after the crash.

Revision history for this message

Scott James Remnant (Canonical) (canonical-scott) wrote on 2009-07-08:

#5

Kay: didn't get a core dump from this, but any idea - looks like udevd segfaulted (bug 1) and left its worker processes around (bug 2, maybe)

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#6

"udevd[663]: segfault at 4 ip 00959fbc sp bfbe7b78 error 6 in udevd[94f000+1c000]" works out as udev_list_node_remove + 11, according to gdb.

Dump of assembler code for function udev_list_node_remove:
0x0000afb1 <udev_list_node_remove+0>: push %ebp
0x0000afb2 <udev_list_node_remove+1>: mov %esp,%ebp
0x0000afb4 <udev_list_node_remove+3>: mov 0x8(%ebp),%eax
0x0000afb7 <udev_list_node_remove+6>: mov 0x4(%eax),%edx
0x0000afba <udev_list_node_remove+9>: mov (%eax),%ecx
0x0000afbc <udev_list_node_remove+11>: mov %edx,0x4(%ecx)
0x0000afbf <udev_list_node_remove+14>: mov %ecx,(%edx)
0x0000afc1 <udev_list_node_remove+16>: movl $0x0,0x4(%eax)
0x0000afc8 <udev_list_node_remove+23>: movl $0x0,(%eax)
0x0000afce <udev_list_node_remove+29>: pop %ebp
0x0000afcf <udev_list_node_remove+30>: ret
End of assembler dump.

I'm not sure whether that's entry->prev or entry->next, but looks like one of those is NULL.

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#7

For the record, since the log doesn't say, this is udev 143-8.

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#8

udpkg status file Edit (57.8 KiB, text/plain)

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#9

core Edit (432.0 KiB, application/octet-stream)

'while :; do udevadm trigger; udevadm settle; kill -HUP $udevd_pid || break; done' (or the near equivalent) eventually tickled this into crashing under 'ulimit -c unlimited'. Here's a core dump.

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#10

If it helps (since gdb doesn't seem to want to produce a backtrace unassisted), the kernel segfault message this time round said that the VMA base address was 0x95e000.

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-08:

#11

The list corruption I think I've found, and the fix is simple. Seems, udev tried to handle the event associated with the worker twice, if it killed the worker itself.

The interesting question is why a worker will not accept messages. We will get there when the wrong kill handling is fixed. :) Thanks for the debugging!

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#12

I applied http://people.ubuntu.com/~scott/d-i-udev-crash.patch, but unfortunately this doesn't work. The initial 'udevadm settle' on boot just sits there. udevd --debug shows the following after the settle (there was a preceding trigger that did lots of event work):

[669] handle_ctrl_msg: udevd message (SETTLE) received

... and then nothing, just a hang.

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-08:

#13

The settle will time out? And you can log in?

Does "udevadm settle" on the running system work?

It seems to work all fine here.

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-08:

#14

"Log in" is not a meaningful concept in this environment; this is happening at installer startup time. Repeated invocations of 'udevadm settle' after timeouts hang in the same way.

If you can't reproduce this, what can I do to debug this for you? --debug doesn't seem especially helpful here.

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-09:

#15

Any chance to strace a "udevadm settle" call? It sounds strange, that the daemon is receiving the message but does not send the signal back.

During installation, is there anything that might kill udev processes? Like with a package installation/update script?

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-09: Re: [Bug 396957] Re: udevd segfault after "worker did not accept message" and "worker unexpectedly returned with 0"

#16

On Thu, Jul 09, 2009 at 12:04:48AM -0000, Kay Sievers wrote:
> Any chance to strace a "udevadm settle" call? It sounds strange, that
> the daemon is receiving the message but does not send the signal back.

I have an strace, but I don't have any particularly straightforward way
to extract it from the running image at this point, since it's so early.
I can try to describe what's going on in it though. udevadm settle
actually does get SIGUSR1 immediately after sending the settle command
to udevd; but it then sits there waiting for the event queue to become
empty. After the timeout fires, it prints:

udevadm settle - timeout of 10 seconds reached, the event queue contains:
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda (653)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda1 (654)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda2 (655)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda5 (656)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda6 (657)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda7 (658)
  /sys/devices/pci0000:00/0000:00:01.1/host1/target1:0:0/1:0:0:0/block/sr0 (666)
  /sys/devices/platform/i8042/serio0/input/input2/event2 (681)
  /sys/devices/pci0000:00/0000:00:01.1/host1/target1:0:0/1:0:0:0 (850)

There are no instances of "did not accept message" in the udevd debug
log. Since it didn't fail this dramatically early nor as reliably before
this patch, does that suggest that the change to worker_returned is
responsible?

The exact set of events remaining in the event queue does not seem to be
entirely identical in every run, but I haven't seen it be empty yet, and
the above seems fairly typical.

What else should I try to look for?

> During installation, is there anything that might kill udev processes?
> Like with a package installation/update script?

No, this is while the installer is starting up, way before any of that
stuff might happen (and even when I reproduced it earlier in a more
complex environment, nothing like that was going on). Apart from
debugging shells and other such trivia, userspace has literally just
done the following:

  mount /proc
  mount /sys
  mount -t tmpfs tmpfs /dev
  mknod -m 600 /dev/console c 5 1
  mknod -m 666 /dev/null c 1 3
  cp -a -f /lib/udev/devices/* /dev
  echo "" > /proc/sys/kernel/hotplug
  udevd --daemon --resolve-names=never
  udevadm trigger
  udevadm settle

I've put the kernel and initramfs at
http://people.ubuntu.com/~cjwatson/tmp/396957/, which you should just be
able to boot in kvm to reproduce this; no special boot options required
other than initrd= required, but BOOT_DEBUG=3 will get you a debugging
shell before any of the above commands start. You might have to attach a
hard disk and/or virtual CD to kvm as well to reproduce it, I suppose.

On Thu, Jul 09, 2009 at 12:04:48AM -0000, Kay Sievers wrote:
> Any chance to strace a "udevadm settle" call? It sounds strange, that
> the daemon is receiving the message but does not send the signal back.

I have an strace, but I don't have any particularly straightforward way
to extract it from the running image at this point, since it's so early.
I can try to describe what's going on in it though. udevadm settle
actually does get SIGUSR1 immediately after sending the settle command
to udevd; but it then sits there waiting for the event queue to become
empty. After the timeout fires, it prints:

udevadm settle - timeout of 10 seconds reached, the event queue contains:
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda (653)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda1 (654)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda2 (655)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda5 (656)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda6 (657)
  /sys/devices/pci0000:00/0000:00:01.1/host0/target0:0:0/0:0:0:0/block/sda/sda7 (658)
  /sys/devices/pci0000:00/0000:00:01.1/host1/target1:0:0/1:0:0:0/block/sr0 (666)
  /sys/devices/platform/i8042/serio0/input/input2/event2 (681)
  /sys/devices/pci0000:00/0000:00:01.1/host1/target1:0:0/1:0:0:0 (850)

There are no instances of "did not accept message" in the udevd debug
log. Since it didn't fail this dramatically early nor as reliably before
this patch, does that suggest that the change to worker_returned is
responsible?

The exact set of events remaining in the event queue does not seem to be
entirely identical in every run, but I haven't seen it be empty yet, and
the above seems fairly typical.

What else should I try to look for?

> During installation, is there anything that might kill udev processes?
> Like with a package installation/update script?

No, this is while the installer is starting up, way before any of that
stuff might happen (and even when I reproduced it earlier in a more
complex environment, nothing like that was going on). Apart from
debugging shells and other such trivia, userspace has literally just
done the following:

mount /proc
  mount /sys
  mount -t tmpfs tmpfs /dev
  mknod -m 600 /dev/console c 5 1
  mknod -m 666 /dev/null c 1 3
  cp -a -f /lib/udev/devices/* /dev
  echo "" > /proc/sys/kernel/hotplug
  udevd --daemon --resolve-names=never
  udevadm trigger
  udevadm settle

I've put the kernel and initramfs at
http://people.ubuntu.com/~cjwatson/tmp/396957/, which you should just be
able to boot in kvm to reproduce this; no special boot options required
other than initrd= required, but BOOT_DEBUG=3 will get you a debugging
shell before any of the above commands start. You might have to attach a
hard disk and/or virtual CD to kvm as well to reproduce it, I suppose.

Revision history for this message

Steve Dodd (anarchetic) wrote on 2009-07-09:

#17

Just booted today's (20090709) i386 live CD and seen this after booting with quiet and splash removed from the command line. This was in VirtualBox 2.2.4 with 512Mb of RAM allocated to VM, and no virtual HDD set-up.

I've got to dash off to work now, will try and do some more testing, i.e. on real hardware, when I get back, but I thought I'd mention this in case it was helpful.

If it turn out to be specific to virtual machines, that normally makes me think "race condition" first, as stuff runs that much more slowly when virtualized..

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-11:

#18

I think I've found it. It likely happens only on new installations because udev writes out the network config rule, which triggers a reload of the rules. This needs to kill all old worker processes to get the new config active. There's a bug how the config-reload is handled while the workers are busy. I'll fix that tomorrow. Thanks for the data, it was really helpful to find that.

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-11:

#19

Care to try this on top?
http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=adda4c682ad2c56fc091222be3bd94fa817013b9

If that works for you, I'll put out a new release to fix this issue. Thanks!

Revision history for this message

Steve Dodd (anarchetic) wrote on 2009-07-12:

#20

remaining udev messages Edit (37.3 KiB, image/png)

Applying the two patches to udev 143-8, rebuilding and patching it into the live CD I was having problems with (20090709) eliminates the segfault and the error about not accepting messages, though I do often still get the "unexpectedly returned with status 0x0000" messages - see screenshot. I've tested this in VBox and on a couple of real machines ..

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-12:

#21

Does it work fine otherwise, and you just get the message? Or does it still fail?

Revision history for this message

Steve Dodd (anarchetic) wrote on 2009-07-12:

#22

Yes, other than the message everything seems to work fine ..

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-12:

#23

Great. Thanks for the tests. I'll fix the warning, as "status 0x0000" suggests, it isn't an unexpected failure.

Revision history for this message

Kay Sievers (kaysievers) wrote on 2009-07-13:

#24

Hopefully all fixed now:
http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=bc113de9a4dc1229f7533acd41310a56d60fbe7e
Thanks again.

Revision history for this message

Colin Watson (cjwatson) wrote on 2009-07-13:

#25

On Sat, Jul 11, 2009 at 04:34:18PM -0000, Kay Sievers wrote:
> Care to try this on top?
> http://git.kernel.org/?p=linux/hotplug/udev.git;a=commitdiff;h=adda4c682ad2c56fc091222be3bd94fa817013b9

Ah yes, that makes sense as a cause and it seems to fix the problem for
me. Thanks!

Scott James Remnant (Canonical) (canonical-scott) on 2009-07-14

Changed in udev (Ubuntu):
importance:	Undecided → High
status:	New → Triaged

Revision history for this message

Launchpad Janitor (janitor) wrote on 2009-07-14:

#26

This bug was fixed in the package udev - 145-1

---------------
udev (145-1) karmic; urgency=low

  * New upstream release:
    - detach event from worker if killed.
    - make sure a worker finishes its event before exiting. LP: #396957.
    - handle SIGCHLD before worker messages.
    - use qsort() to sort list rather than by hand (performance).
    - correct mangling of scsi_id return values

  * debian/rules: Update "prep" target to not wipe uncommitted files.
  * debian/libudev-dev.install: Install the restored libtool file
  * debian/libgudev-1.0-dev.install: Install the restored libtool file

-- Scott James Remnant <email address hidden> Tue, 14 Jul 2009 17:10:38 +0100

Changed in udev (Ubuntu):
status:	Triaged → Fix Released

Ubuntu
udev package

udevd segfault after "worker did not accept message" and "worker unexpectedly returned with 0"

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntuudev package

udevd segfault after "worker did not accept message" and "worker unexpectedly returned with 0"

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches

Ubuntu
udev package