upstart won't collect zombies?

Bug #89135 reported by Evan Klitzke
30
This bug affects 5 people
Affects Status Importance Assigned to Milestone
upstart
Confirmed
Medium
Unassigned

Bug Description

Hi, I was just on #ubuntu and was talking to a user who had a problem with his CD drive not unmounting. He pastebinned the output of ps -ef, which was interesting:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Feb28 ? 00:00:00 /sbin/init splash
root 2 1 0 Feb28 ? 00:00:00 [migration/0]
root 3 1 0 Feb28 ? 00:00:01 [ksoftirqd/0]
root 4 1 0 Feb28 ? 00:00:00 [watchdog/0]
root 5 1 0 Feb28 ? 00:00:00 [events/0]
root 6 1 0 Feb28 ? 00:00:00 [khelper]
root 7 1 0 Feb28 ? 00:00:00 [kthread]
root 9 7 0 Feb28 ? 00:00:00 [kblockd/0]
root 10 7 0 Feb28 ? 00:00:00 [kacpid]
root 11 7 0 Feb28 ? 00:00:00 [kacpi_notify]
root 120 7 0 Feb28 ? 00:00:00 [kseriod]
root 152 7 0 Feb28 ? 00:00:00 [pdflush]
root 153 7 0 Feb28 ? 00:00:00 [pdflush]
root 154 1 0 Feb28 ? 00:00:00 [kswapd0]
root 155 7 0 Feb28 ? 00:00:00 [aio/0]
root 1777 7 0 Feb28 ? 00:00:00 [ata/0]
root 1781 7 0 Feb28 ? 00:00:00 [scsi_eh_0]
root 1782 7 0 Feb28 ? 00:00:00 [scsi_eh_1]
root 1783 7 0 Feb28 ? 00:00:00 [scsi_eh_2]
root 1784 7 0 Feb28 ? 00:00:00 [scsi_eh_3]
root 1964 7 0 Feb28 ? 00:00:00 [khubd]
root 1990 7 0 Feb28 ? 00:00:00 [khpsbpkt]
root 2012 1 0 Feb28 ? 00:00:00 [knodemgrd_0]
root 2028 7 0 Feb28 ? 00:00:00 [kmirrord]
root 2072 7 0 Feb28 ? 00:00:00 [kjournald]
root 2153 1 0 Feb28 ? 00:00:00 //sbin/logd
root 2316 1 0 Feb28 ? 00:00:00 /sbin/udevd --daemon
root 3084 7 0 Feb28 ? 00:00:00 [shpchpd]
root 3163 7 0 Feb28 ? 00:00:00 [scsi_eh_4]
root 3164 7 0 Feb28 ? 00:00:00 [scsi_eh_5]
root 3219 7 0 Feb28 ? 00:00:00 [kgameportd]
root 3477 7 0 Feb28 ? 00:00:00 [kpsmoused]
root 3560 7 0 Feb28 ? 00:00:00 [scsi_eh_6]
root 3561 7 0 Feb28 ? 00:00:03 [usb-storage]
root 3815 7 0 Feb28 ? 00:00:00 [kjournald]
dhcp 4030 1 0 Feb28 ? 00:00:00 dhclient3 -pf /var/run/dhclient.eth0.pid -lf /va
root 4144 1 0 Feb28 tty1 00:00:00 /sbin/getty 38400 tty1
root 4145 1 0 Feb28 tty2 00:00:00 /sbin/getty 38400 tty2
root 4146 1 0 Feb28 tty3 00:00:00 /sbin/getty 38400 tty3
root 4147 1 0 Feb28 tty4 00:00:00 /sbin/getty 38400 tty4
root 4148 1 0 Feb28 tty5 00:00:00 /sbin/getty 38400 tty5
root 4149 1 0 Feb28 tty6 00:00:00 /sbin/getty 38400 tty6
root 4360 1 0 Feb28 ? 00:00:00 /usr/sbin/acpid -c /etc/acpi/events -s /var/run/
root 4533 1 0 Feb28 ? 00:00:00 /bin/dd bs 1 if /proc/kmsg of /var/run/klogd/kms
klog 4535 1 0 Feb28 ? 00:00:00 /sbin/klogd -P /var/run/klogd/kmsg
root 4607 1 0 Feb28 ? 00:00:00 /usr/sbin/gdm
root 4608 4607 0 Feb28 ? 00:00:00 /usr/sbin/gdm
root 4631 4608 0 Feb28 tty7 00:11:29 /usr/X11R6/bin/X :0 -br -audit 0 -auth /var/lib/
root 4662 1 0 Feb28 ? 00:00:00 /usr/sbin/hpiod
hplip 4665 1 0 Feb28 ? 00:00:00 python /usr/sbin/hpssd
103 4713 1 0 Feb28 ? 00:00:01 /usr/bin/dbus-daemon --system
106 4728 1 0 Feb28 ? 00:00:22 /usr/sbin/hald
root 4729 4728 0 Feb28 ? 00:00:00 hald-runner
106 4735 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-acpi
106 4751 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
106 4755 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
106 4758 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
106 4762 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
root 4775 4729 0 Feb28 ? 00:00:16 /usr/lib/hal/hald-addon-hid-ups
106 4785 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4787 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4789 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4791 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4802 4729 0 Feb28 ? 00:00:03 /usr/lib/hal/hald-addon-storage
106 4804 4729 0 Feb28 ? 00:00:03 /usr/lib/hal/hald-addon-storage
106 4806 4729 0 Feb28 ? 00:00:05 /usr/lib/hal/hald-addon-storage
106 4808 4729 0 Feb28 ? 00:00:05 /usr/lib/hal/hald-addon-storage
root 4824 1 0 Feb28 ? 00:00:00 perl /usr/share/system-tools-backends-2.0/script
con-man 4867 4608 0 Feb28 ? 00:00:00 x-session-manager
con-man 4904 4867 0 Feb28 ? 00:00:00 /usr/bin/ssh-agent /usr/bin/dbus-launch --exit-w
con-man 4907 1 0 Feb28 ? 00:00:00 /usr/bin/dbus-launch --exit-with-session x-sessi
con-man 4908 1 0 Feb28 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 8 --prin
con-man 4910 1 0 Feb28 ? 00:00:00 /usr/lib/libgconf2-4/gconfd-2 5
con-man 4913 1 0 Feb28 ? 00:00:00 /usr/bin/gnome-keyring-daemon
con-man 4916 1 0 Feb28 ? 00:00:03 /usr/lib/control-center/gnome-settings-daemon
con-man 4925 1 0 Feb28 ? 00:00:00 /bin/sh -c /usr/bin/esd -terminate -nobeeps -as
con-man 4926 4925 0 Feb28 ? 00:00:00 /usr/bin/esd -terminate -nobeeps -as 1 -spawnfd
con-man 4936 1 0 Feb28 ? 00:00:05 gnome-panel --sm-client-id default1
con-man 4938 1 0 Feb28 ? 00:00:24 nautilus --no-default-window --sm-client-id defa
con-man 4942 1 0 Feb28 ? 00:00:00 /usr/lib/bonobo-activation/bonobo-activation-ser
con-man 4943 1 0 Feb28 ? 00:00:00 gnome-volume-manager --sm-client-id default4
con-man 4950 1 0 Feb28 ? 00:00:00 /usr/lib/gnome-vfs-2.0/gnome-vfs-daemon
con-man 4963 1 0 Feb28 ? 00:00:00 update-notifier
con-man 4971 1 0 Feb28 ? 00:00:00 beryl-manager
con-man 4977 1 0 Feb28 ? 00:00:06 gkrellm
con-man 4982 1 0 Feb28 ? 00:00:00 /usr/lib/evolution/2.8/evolution-alarm-notify
con-man 4995 1 0 Feb28 ? 00:00:01 gnome-cups-icon --sm-client-id default3
con-man 5112 4971 1 Feb28 ? 00:28:07 beryl --skip-gl-yield
con-man 5113 1 0 Feb28 ? 00:00:00 gnome-power-manager
con-man 5134 1 0 Feb28 ? 00:00:00 /usr/lib/evolution/evolution-data-server-1.8 --o
109 5135 1 0 Feb28 ? 00:00:00 /usr/sbin/exim4 -bd -q30m
root 5157 1 0 Feb28 ? 00:00:00 /usr/sbin/inetutils-inetd
con-man 5220 1 0 Feb28 ? 00:00:00 /usr/lib/nautilus-cd-burner/mapping-daemon
root 5221 1 0 Feb28 ? 00:00:00 /usr/sbin/hcid -x
root 5228 1 0 Feb28 ? 00:00:00 /usr/sbin/sdpd
root 5240 1 0 Feb28 ? 00:00:00 [krfcommd]
con-man 5268 1 0 Feb28 ? 00:00:00 /usr/lib/evolution/2.8/evolution-exchange-storag
ntp 5271 1 0 Feb28 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 110:116
daemon 5308 1 0 Feb28 ? 00:00:00 /usr/sbin/atd
root 5325 1 0 Feb28 ? 00:00:00 /usr/sbin/cron
con-man 5453 1 0 Feb28 ? 00:00:00 /usr/lib/gnome-applets/mixer_applet2 --oaf-activ
con-man 5469 1 0 Feb28 ? 00:00:31 xchat-gnome
con-man 5474 1 0 Feb28 ? 00:00:17 gnome-screensaver
con-man 5502 1 0 Feb28 ? 00:00:10 xmms
dhcp 5536 1 0 Feb28 ? 00:00:00 dhclient3 -pf /var/run/dhclient.eth1.pid -lf /va
con-man 5570 1 0 Feb28 ? 00:05:20 /usr/lib/firefox/firefox-bin
con-man 5600 1 0 Feb28 ? 00:00:37 gaim
con-man 5626 1 0 Feb28 ? 00:00:06 evolution-2.8
con-man 11721 4971 0 Feb28 ? 00:00:15 emerald --replace
root 8195 1 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8199 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8200 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8201 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8202 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8203 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
cupsys 8230 1 0 07:35 ? 00:00:00 /usr/sbin/cupsd
root 8363 1 0 07:36 ? 00:00:00 /sbin/syslogd
con-man 8396 1 0 21:38 ? 00:00:11 [wxvlc] <defunct>
root 9068 4729 0 21:50 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
root 9074 9068 0 21:50 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
con-man 9075 9074 0 21:50 ? 00:00:00 su -c eject '/dev/hdb' con-man
con-man 9076 9075 0 21:50 ? 00:00:00 eject /dev/hdb
con-man 9157 1 1 21:52 ? 00:00:27 [totem] <defunct>
con-man 9311 1 0 21:55 ? 00:00:00 eject /dev/hdd
root 9623 4729 0 22:02 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
root 9629 9623 0 22:02 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
con-man 9630 9629 0 22:02 ? 00:00:00 su -c eject '/dev/hda' con-man
con-man 9631 9630 0 22:02 ? 00:00:00 [eject]
con-man 9741 1 0 22:05 ? 00:00:07 wxvlc
con-man 9919 1 0 22:06 ? 00:00:02 wxvlc
root 9966 1 0 22:07 ? 00:00:00 mount /dev/hda
root 9992 1 0 22:07 ? 00:00:00 mount /dev/hdd
root 10021 1 0 22:07 ? 00:00:00 mount /dev/hdc
con-man 10159 1 0 22:10 ? 00:00:00 eject /dev/hda
root 10193 1 0 22:10 ? 00:00:00 eject /dev/hda
con-man 10296 1 0 22:13 ? 00:00:00 wxvlc
root 10334 1 0 22:13 ? 00:00:00 mount cdrom0
root 10359 1 0 22:13 ? 00:00:00 mount /media/cdrom1
root 10385 1 0 22:14 ? 00:00:00 mount /media/cdrom3
root 10416 1 0 22:14 ? 00:00:00 mount /media/cdrom2
con-man 10459 1 0 22:15 ? 00:00:03 gnome-terminal
con-man 10461 10459 0 22:15 ? 00:00:00 gnome-pty-helper
con-man 10462 10459 0 22:15 pts/10 00:00:00 bash
con-man 10804 10462 0 22:23 pts/10 00:00:00 ps -ef

It looks like totem is a zombie, and its PID is init, but it is not dieing. My understanding is that init should always let its children die, but apparently this is not happening. This is on Edgy, so I think that here upstart is acting as init.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Could you provide the output of "ps -ely" instead please; there's no particular evidence that they've failed to be reaped, the multiple running mount/eject processes point to a different problem.

Changed in upstart:
status: Unconfirmed → Needs Info
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

It has been two weeks since more information was requested about this bug report.

If you're unable to supply the information, I would welcome anyone else who can to jump in.

Thanks

Changed in upstart:
importance: Undecided → Medium
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Moving this bug to rejected since without the requested information, it's not possible to debug further.

Changed in upstart:
status: Needs Info → Rejected
Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :
Download full text (16.1 KiB)

I'd like to revive this bug with a similar case I'm experiencing myself. It's a natty installation with upstart 0.9.7-1.
I've got a hundreds of processes owned by init in a zombie state and upstart is not responding at all. That means commands like `status`, `reload`, etc. are hanging without any response.

Here's the beginning of `ps -ely`:

S UID PID PPID C PRI NI RSS SZ WCHAN TTY TIME CMD
S 0 1 0 0 80 0 2236 22939 futex_ ? 00:00:52 init
S 0 2 0 0 80 0 0 0 kthrea ? 00:00:01 kthreadd
S 0 3 2 0 80 0 0 0 run_ks ? 00:00:10 ksoftirqd/0
S 0 6 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/0
S 0 7 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/1
S 0 9 2 0 80 0 0 0 run_ks ? 00:00:12 ksoftirqd/1
S 0 11 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/2
S 0 13 2 0 80 0 0 0 run_ks ? 01:17:28 ksoftirqd/2
S 0 14 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/3
S 0 16 2 0 80 0 0 0 run_ks ? 00:00:30 ksoftirqd/3
S 0 17 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/4
S 0 19 2 0 80 0 0 0 run_ks ? 00:00:05 ksoftirqd/4
S 0 20 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/5
S 0 22 2 0 80 0 0 0 run_ks ? 00:00:12 ksoftirqd/5
S 0 23 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/6
S 0 25 2 0 80 0 0 0 run_ks ? 00:00:15 ksoftirqd/6
S 0 26 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/7
S 0 28 2 0 80 0 0 0 run_ks ? 00:00:06 ksoftirqd/7
S 0 29 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/8
S 0 31 2 0 80 0 0 0 run_ks ? 00:00:10 ksoftirqd/8
S 0 32 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/9
S 0 33 2 0 80 0 0 0 worker ? 00:00:00 kworker/9:0
S 0 34 2 0 80 0 0 0 run_ks ? 00:00:03 ksoftirqd/9
S 0 35 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/10
S 0 36 2 0 80 0 0 0 worker ? 00:00:00 kworker/10:0
S 0 37 2 0 80 0 0 0 run_ks ? 00:00:05 ksoftirqd/10
S 0 38 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/11
S 0 40 2 0 80 0 0 0 run_ks ? 00:00:18 ksoftirqd/11
S 0 41 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/12
S 0 43 2 0 80 0 0 0 run_ks ? 00:00:12 ksoftirqd/12
S 0 44 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/13
S 0 46 2 0 80 0 0 0 run_ks ? 00:00:24 ksoftirqd/13
S 0 47 2 0 -40 - 0 0 cpu_st ? 00:00:00 migration/14
S 0 49 2 0 80 0 0 0 run_ks ? 00:00:27 ksoftirqd/14
S 0 50 2 0 -40 - 0 0 cpu_st ? 00:...

Changed in upstart:
status: Invalid → New
Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

Also this may be a side effect of gdb attaching to the process, but this is the stacktrace I see every time I connect to init on that host (did not change over ~10 attempts):

#0 0x00007fc94fc9b5ae in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fc94fc2437e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fc94fc1c67c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fc94fc53635 in fork () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007fc950bed605 in ?? ()
#5 <signal handler called>
#6 0x00007fc94fbd9d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007fc94fbddab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x00007fc94fc12d7b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x00007fc94fc1cbb6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fc94fc1fe78 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007fc94fc2231e in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fc9507b2f95 in nih_alloc () from /lib/libnih.so.1
#13 0x00007fc9507b36ed in nih_strndup () from /lib/libnih.so.1
#14 0x00007fc9507bc921 in nih_log_message () from /lib/libnih.so.1
#15 0x00007fc950bed76a in ?? ()
#16 <signal handler called>
#17 0x00007fc9507b4500 in nih_list_add_after () from /lib/libnih.so.1
#18 0x00007fc9507b64b4 in nih_io_handle_fds () from /lib/libnih.so.1
#19 0x00007fc9507b9a03 in nih_main_loop () from /lib/libnih.so.1
#20 0x00007fc950bedf2d in main ()

Strace suggests it's waiting on a lock:

$ sudo strace -p 1
Process 1 attached - interrupt to quit
futex(0x7fc94ff351c0, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>
Process 1 detached

Revision history for this message
James Hunt (jamesodhunt) wrote :

@Stanislaw - the process you list are not zombies. Added to which they are all kernel processes.

Since the problem you are seeing seems to be very different to that first reported on this bug, please raise a new bug including as many details as possible, such as:

- kernel version (cat /proc/version)
- full 'ps -ely' listing (attach it as a file)
- details of whether 'initctl list' hangs
- details of whether this issue is a "one off" or if you see it on every boot
- details of when you see it (always post boot? at what time?)
- list of jobs you have (ls /etc/init/)
- and so on...

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

@James I think you missed the rest of the list. (I should've trimmed it too) If you download the whole post, you'll see at the bottom:

Z 104 716 1 0 80 0 0 0 exit ? 00:00:00 sshd <defunct>
Z 104 730 1 0 80 0 0 0 exit ? 00:00:00 sshd <defunct>
Z 104 742 1 0 80 0 0 0 exit ? 00:00:00 sshd <defunct>
Z 0 744 1 0 80 0 0 0 exit ? 00:00:00 nscd <defunct>
Z 0 753 1 0 80 0 0 0 exit ? 00:00:00 nscd <defunct>
...

Kernel is 2.6.38-15.62 from ubuntu. It happens every couple of weeks on one of 5xx machines and I couldn't spot any specific situation that makes hosts go into that state. It's only after the host was live for some time, not right after bootup.
The host is now rebooted, so I can't answer other questions until I run into this issue again.

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

Run into another one. Here's the rest of requested information:
- this time it was kernel 2.6.38-13.56
- `initctl list` hangs on execution
- /etc/init contains standard natty 'ubuntu server' services plus: cgconfig, cgred, libvirt-bin, libvirt-cgred-wait, nova-* (host runs openstack), qemu-kvm, ufw

Revision history for this message
James Hunt (jamesodhunt) wrote :

Please can you check if you can run 'sudo initctl list' since this uses a different mechanism than running that command as a non-root user. I suspect that too will hang.

Looking back at the stacktrace in #5, it appears the problem is that some process is leaking memory such that Upstart is unable to grab any to allow it to work. The side effect of this being that init PID 1 will probably be consuming a lot of CPU as it continues to attempt to obtain memory.

To identify the cause of the low-memory scenario, you'll have to monitor your system with top / sar (sysstat) / nagios / vmstat /etc. You could even just run top in a loop redirecting to a file like this:

    top -b -d 60 >> /tmp/top.log

You'll need to tweak the -d parameter to something suitable to avoid chewing up too much disk though :-)

Also, check your system log to look for any odd entries - you might be lucky and find some 'Out of memory' messages, but possibly not if the process(es) causing the problem are high-priority root ones.

Revision history for this message
Stanislaw Pitucha (stanislaw-pitucha) wrote :

Checking another host in this condition:
- `initctl list` fails both as a normal user and root
- there doesn't seem to be a low memory issue - /proc/meminfo attached, but the main part is:

MemTotal: 99193376 kB
MemFree: 87722536 kB
Buffers: 4448 kB
Cached: 8214824 kB
SwapCached: 0 kB
Active: 2307340 kB
Inactive: 6248908 kB

also apart from `init` not responding, host is completely functional - it allows new connections, processes all jobs fine... until it runs out of pids to assign.
- all logs are cycled pretty quickly with UFW warnings, so I couldn't find any unusual entries unfortunately. I'll check a bit earlier if I find another host running into this issue.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

There hasn't been much activity in this bug lately. Any updates?

Revision history for this message
James Hunt (jamesodhunt) wrote :

Stanislaw - I don't understand your comment about "running out of pids". Do you mean that no further upstart jobs can be started or that no further processes of any type can be started? Can you start a new shell for example or just run "ls"?

Regarding the futex call you see in the strace output, this must relate to some libc call since Upstart doesn't make use of locking directly (and nor does the NIH library it uses). If you can get a strace log with more context, that might give more clues.

However, if you have only seen this issue on a single machine, I'd suggest the problem is with your hardware. Please check it and ensure you run a memory checker like memtest86 on the system to rule out bad memory (which is probably the most likely cause after a bad disk).

Revision history for this message
Blaisorblade (p-giarrusso) wrote :
Download full text (12.7 KiB)

I run into the same issue on Ubuntu 13.10, and I have lots of evidence for you to take a look, including an Apport report and a core dump (because I didn't trust apport), and a small clue about what's happening.
Apparently, init
- corrupts the memory allocation structures
- later, some code tries to allocate memory using malloc (#19 below)
- malloc takes a lock (I'm inferring this)
- then, it notices memory allocation structures are corrupted (#16 below; note "corrupted double-linked list")
- it tries to report an error about that
- the reporting code invokes malloc again, without even releasing the lock first (#2)
- malloc tries to acquire the lock (#1-#0) and gets stuck; if the lock had been released, probably malloc would fail because of the corruption.

While I'm quite rusty, I have quite some patches in the Linux kernel, so I hope looking at my analysis shouldn't be a waste of time.

Backtrace:
(gdb) bt
#0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1 0x00007fe58016ef1c in _L_lock_11850 () at malloc.c:5151
#2 0x00007fe58016c4c5 in __GI___libc_malloc (bytes=36) at malloc.c:2856
#3 0x00007fe580f32c37 in local_strdup (s=0x7fe5811264a5 "/lib/x86_64-linux-gnu/libgcc_s.so.1") at dl-load.c:162
#4 _dl_map_object (loader=loader@entry=0x7fe58112f000, name=name@entry=0x7fe58026bb26 "libgcc_s.so.1", type=type@entry=2, trace_mode=trace_mode@entry=0,
    mode=mode@entry=-1879048191, nsid=<optimized out>) at dl-load.c:2510
#5 0x00007fe580f3dd54 in dl_open_worker (a=a@entry=0x7fff350930c8) at dl-open.c:239
#6 0x00007fe580f396e6 in _dl_catch_error (objname=objname@entry=0x7fff350930b8, errstring=errstring@entry=0x7fff350930c0,
    mallocedp=mallocedp@entry=0x7fff350930b0, operate=operate@entry=0x7fe580f3dc00 <dl_open_worker>, args=args@entry=0x7fff350930c8) at dl-error.c:177
#7 0x00007fe580f3d809 in _dl_open (file=0x7fe58026bb26 "libgcc_s.so.1", mode=-2147483647, caller_dlopen=<optimized out>, nsid=-2, argc=2,
    argv=0x7fff350952b8, env=0x7fff350952d0) at dl-open.c:667
#8 0x00007fe580220da2 in do_dlopen (ptr=ptr@entry=0x7fff350932d0) at dl-libc.c:87
#9 0x00007fe580f396e6 in _dl_catch_error (objname=0x7fff350932b0, errstring=0x7fff350932c0, mallocedp=0x7fff350932a0, operate=0x7fe580220d60 <do_dlopen>,
    args=0x7fff350932d0) at dl-error.c:177
#10 0x00007fe580220e62 in dlerror_run (args=0x7fff350932d0, operate=0x7fe580220d60 <do_dlopen>) at dl-libc.c:46
#11 __GI___libc_dlopen_mode (name=name@entry=0x7fe58026bb26 "libgcc_s.so.1", mode=mode@entry=-2147483647) at dl-libc.c:163
#12 0x00007fe5801fb175 in init () at ../sysdeps/x86_64/backtrace.c:52
#13 0x00007fe57fed9370 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
#14 0x00007fe5801fb294 in __GI___backtrace (array=array@entry=0x7fff35093590, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#15 0x00007fe58015d515 in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe580271240 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:178
#16 0x00007fe580168e1d in malloc_printerr (ptr=0x7fe582600ec0, str=0x7fe58026d1e8 "corrupted double-linked list", action=<optimized out>) at malloc.c:4923
...

Changed in upstart:
status: New → Confirmed
Revision history for this message
hon (hon2048) wrote :

This bug affects me:
$ ps -o pid,ppid,cmd -ax | grep defunct
 6657 6608 grep --color=auto defunct
18979 1 [indicator-sound] <defunct>

OS: Xubuntu 14.04
Kernel version: 3.13.0-44-generic

Revision history for this message
Robert Lunnon (bobl-1) wrote :

This has bitten me badly on 3.13.0-86-generic #131-Ubuntu SMP Thu May 12 23:33:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It occurs only after a while seemingly where lots of processes are being spawned/reclaimed by init. Does not stop init but seems to slow down the reclaiming of zombies. If this IS a memory corruption due to thread-unsafe malloc usage as Blaisorblade (p-giarrusso) wrote on 2013-12-14 then it should not be ignored. Bugs in init should be taken doubly seriously. It's occurring only on live servers at the moment so I can't debug until it happens in a sandbox. I would suspect that this bug will continue come and go as the kernel changes because of memory placement/timing variations in the upstart process.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.