upstart

upstart won't collect zombies?

Bug #89135 reported by Evan Klitzke on 2007-03-02

This bug affects 5 people

Affects		Status	Importance	Assigned to	Milestone
	upstart	Confirmed	Medium	Unassigned

Bug Description

Hi, I was just on #ubuntu and was talking to a user who had a problem with his CD drive not unmounting. He pastebinned the output of ps -ef, which was interesting:

UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Feb28 ? 00:00:00 /sbin/init splash
root 2 1 0 Feb28 ? 00:00:00 [migration/0]
root 3 1 0 Feb28 ? 00:00:01 [ksoftirqd/0]
root 4 1 0 Feb28 ? 00:00:00 [watchdog/0]
root 5 1 0 Feb28 ? 00:00:00 [events/0]
root 6 1 0 Feb28 ? 00:00:00 [khelper]
root 7 1 0 Feb28 ? 00:00:00 [kthread]
root 9 7 0 Feb28 ? 00:00:00 [kblockd/0]
root 10 7 0 Feb28 ? 00:00:00 [kacpid]
root 11 7 0 Feb28 ? 00:00:00 [kacpi_notify]
root 120 7 0 Feb28 ? 00:00:00 [kseriod]
root 152 7 0 Feb28 ? 00:00:00 [pdflush]
root 153 7 0 Feb28 ? 00:00:00 [pdflush]
root 154 1 0 Feb28 ? 00:00:00 [kswapd0]
root 155 7 0 Feb28 ? 00:00:00 [aio/0]
root 1777 7 0 Feb28 ? 00:00:00 [ata/0]
root 1781 7 0 Feb28 ? 00:00:00 [scsi_eh_0]
root 1782 7 0 Feb28 ? 00:00:00 [scsi_eh_1]
root 1783 7 0 Feb28 ? 00:00:00 [scsi_eh_2]
root 1784 7 0 Feb28 ? 00:00:00 [scsi_eh_3]
root 1964 7 0 Feb28 ? 00:00:00 [khubd]
root 1990 7 0 Feb28 ? 00:00:00 [khpsbpkt]
root 2012 1 0 Feb28 ? 00:00:00 [knodemgrd_0]
root 2028 7 0 Feb28 ? 00:00:00 [kmirrord]
root 2072 7 0 Feb28 ? 00:00:00 [kjournald]
root 2153 1 0 Feb28 ? 00:00:00 //sbin/logd
root 2316 1 0 Feb28 ? 00:00:00 /sbin/udevd --daemon
root 3084 7 0 Feb28 ? 00:00:00 [shpchpd]
root 3163 7 0 Feb28 ? 00:00:00 [scsi_eh_4]
root 3164 7 0 Feb28 ? 00:00:00 [scsi_eh_5]
root 3219 7 0 Feb28 ? 00:00:00 [kgameportd]
root 3477 7 0 Feb28 ? 00:00:00 [kpsmoused]
root 3560 7 0 Feb28 ? 00:00:00 [scsi_eh_6]
root 3561 7 0 Feb28 ? 00:00:03 [usb-storage]
root 3815 7 0 Feb28 ? 00:00:00 [kjournald]
dhcp 4030 1 0 Feb28 ? 00:00:00 dhclient3 -pf /var/run/dhclient.eth0.pid -lf /va
root 4144 1 0 Feb28 tty1 00:00:00 /sbin/getty 38400 tty1
root 4145 1 0 Feb28 tty2 00:00:00 /sbin/getty 38400 tty2
root 4146 1 0 Feb28 tty3 00:00:00 /sbin/getty 38400 tty3
root 4147 1 0 Feb28 tty4 00:00:00 /sbin/getty 38400 tty4
root 4148 1 0 Feb28 tty5 00:00:00 /sbin/getty 38400 tty5
root 4149 1 0 Feb28 tty6 00:00:00 /sbin/getty 38400 tty6
root 4360 1 0 Feb28 ? 00:00:00 /usr/sbin/acpid -c /etc/acpi/events -s /var/run/
root 4533 1 0 Feb28 ? 00:00:00 /bin/dd bs 1 if /proc/kmsg of /var/run/klogd/kms
klog 4535 1 0 Feb28 ? 00:00:00 /sbin/klogd -P /var/run/klogd/kmsg
root 4607 1 0 Feb28 ? 00:00:00 /usr/sbin/gdm
root 4608 4607 0 Feb28 ? 00:00:00 /usr/sbin/gdm
root 4631 4608 0 Feb28 tty7 00:11:29 /usr/X11R6/bin/X :0 -br -audit 0 -auth /var/lib/
root 4662 1 0 Feb28 ? 00:00:00 /usr/sbin/hpiod
hplip 4665 1 0 Feb28 ? 00:00:00 python /usr/sbin/hpssd
103 4713 1 0 Feb28 ? 00:00:01 /usr/bin/dbus-daemon --system
106 4728 1 0 Feb28 ? 00:00:22 /usr/sbin/hald
root 4729 4728 0 Feb28 ? 00:00:00 hald-runner
106 4735 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-acpi
106 4751 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
106 4755 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
106 4758 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
106 4762 4729 0 Feb28 ? 00:00:00 /usr/lib/hal/hald-addon-keyboard
root 4775 4729 0 Feb28 ? 00:00:16 /usr/lib/hal/hald-addon-hid-ups
106 4785 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4787 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4789 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4791 4729 0 Feb28 ? 00:00:02 /usr/lib/hal/hald-addon-storage
106 4802 4729 0 Feb28 ? 00:00:03 /usr/lib/hal/hald-addon-storage
106 4804 4729 0 Feb28 ? 00:00:03 /usr/lib/hal/hald-addon-storage
106 4806 4729 0 Feb28 ? 00:00:05 /usr/lib/hal/hald-addon-storage
106 4808 4729 0 Feb28 ? 00:00:05 /usr/lib/hal/hald-addon-storage
root 4824 1 0 Feb28 ? 00:00:00 perl /usr/share/system-tools-backends-2.0/script
con-man 4867 4608 0 Feb28 ? 00:00:00 x-session-manager
con-man 4904 4867 0 Feb28 ? 00:00:00 /usr/bin/ssh-agent /usr/bin/dbus-launch --exit-w
con-man 4907 1 0 Feb28 ? 00:00:00 /usr/bin/dbus-launch --exit-with-session x-sessi
con-man 4908 1 0 Feb28 ? 00:00:00 /usr/bin/dbus-daemon --fork --print-pid 8 --prin
con-man 4910 1 0 Feb28 ? 00:00:00 /usr/lib/libgconf2-4/gconfd-2 5
con-man 4913 1 0 Feb28 ? 00:00:00 /usr/bin/gnome-keyring-daemon
con-man 4916 1 0 Feb28 ? 00:00:03 /usr/lib/control-center/gnome-settings-daemon
con-man 4925 1 0 Feb28 ? 00:00:00 /bin/sh -c /usr/bin/esd -terminate -nobeeps -as
con-man 4926 4925 0 Feb28 ? 00:00:00 /usr/bin/esd -terminate -nobeeps -as 1 -spawnfd
con-man 4936 1 0 Feb28 ? 00:00:05 gnome-panel --sm-client-id default1
con-man 4938 1 0 Feb28 ? 00:00:24 nautilus --no-default-window --sm-client-id defa
con-man 4942 1 0 Feb28 ? 00:00:00 /usr/lib/bonobo-activation/bonobo-activation-ser
con-man 4943 1 0 Feb28 ? 00:00:00 gnome-volume-manager --sm-client-id default4
con-man 4950 1 0 Feb28 ? 00:00:00 /usr/lib/gnome-vfs-2.0/gnome-vfs-daemon
con-man 4963 1 0 Feb28 ? 00:00:00 update-notifier
con-man 4971 1 0 Feb28 ? 00:00:00 beryl-manager
con-man 4977 1 0 Feb28 ? 00:00:06 gkrellm
con-man 4982 1 0 Feb28 ? 00:00:00 /usr/lib/evolution/2.8/evolution-alarm-notify
con-man 4995 1 0 Feb28 ? 00:00:01 gnome-cups-icon --sm-client-id default3
con-man 5112 4971 1 Feb28 ? 00:28:07 beryl --skip-gl-yield
con-man 5113 1 0 Feb28 ? 00:00:00 gnome-power-manager
con-man 5134 1 0 Feb28 ? 00:00:00 /usr/lib/evolution/evolution-data-server-1.8 --o
109 5135 1 0 Feb28 ? 00:00:00 /usr/sbin/exim4 -bd -q30m
root 5157 1 0 Feb28 ? 00:00:00 /usr/sbin/inetutils-inetd
con-man 5220 1 0 Feb28 ? 00:00:00 /usr/lib/nautilus-cd-burner/mapping-daemon
root 5221 1 0 Feb28 ? 00:00:00 /usr/sbin/hcid -x
root 5228 1 0 Feb28 ? 00:00:00 /usr/sbin/sdpd
root 5240 1 0 Feb28 ? 00:00:00 [krfcommd]
con-man 5268 1 0 Feb28 ? 00:00:00 /usr/lib/evolution/2.8/evolution-exchange-storag
ntp 5271 1 0 Feb28 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntpd.pid -u 110:116
daemon 5308 1 0 Feb28 ? 00:00:00 /usr/sbin/atd
root 5325 1 0 Feb28 ? 00:00:00 /usr/sbin/cron
con-man 5453 1 0 Feb28 ? 00:00:00 /usr/lib/gnome-applets/mixer_applet2 --oaf-activ
con-man 5469 1 0 Feb28 ? 00:00:31 xchat-gnome
con-man 5474 1 0 Feb28 ? 00:00:17 gnome-screensaver
con-man 5502 1 0 Feb28 ? 00:00:10 xmms
dhcp 5536 1 0 Feb28 ? 00:00:00 dhclient3 -pf /var/run/dhclient.eth1.pid -lf /va
con-man 5570 1 0 Feb28 ? 00:05:20 /usr/lib/firefox/firefox-bin
con-man 5600 1 0 Feb28 ? 00:00:37 gaim
con-man 5626 1 0 Feb28 ? 00:00:06 evolution-2.8
con-man 11721 4971 0 Feb28 ? 00:00:15 emerald --replace
root 8195 1 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8199 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8200 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8201 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8202 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
www-data 8203 8195 0 07:35 ? 00:00:00 /usr/sbin/apache2 -k start -DSSL
cupsys 8230 1 0 07:35 ? 00:00:00 /usr/sbin/cupsd
root 8363 1 0 07:36 ? 00:00:00 /sbin/syslogd
con-man 8396 1 0 21:38 ? 00:00:11 [wxvlc] <defunct>
root 9068 4729 0 21:50 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
root 9074 9068 0 21:50 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
con-man 9075 9074 0 21:50 ? 00:00:00 su -c eject '/dev/hdb' con-man
con-man 9076 9075 0 21:50 ? 00:00:00 eject /dev/hdb
con-man 9157 1 1 21:52 ? 00:00:27 [totem] <defunct>
con-man 9311 1 0 21:55 ? 00:00:00 eject /dev/hdd
root 9623 4729 0 22:02 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
root 9629 9623 0 22:02 ? 00:00:00 /bin/bash /usr/share/hal/scripts/hal-system-stor
con-man 9630 9629 0 22:02 ? 00:00:00 su -c eject '/dev/hda' con-man
con-man 9631 9630 0 22:02 ? 00:00:00 [eject]
con-man 9741 1 0 22:05 ? 00:00:07 wxvlc
con-man 9919 1 0 22:06 ? 00:00:02 wxvlc
root 9966 1 0 22:07 ? 00:00:00 mount /dev/hda
root 9992 1 0 22:07 ? 00:00:00 mount /dev/hdd
root 10021 1 0 22:07 ? 00:00:00 mount /dev/hdc
con-man 10159 1 0 22:10 ? 00:00:00 eject /dev/hda
root 10193 1 0 22:10 ? 00:00:00 eject /dev/hda
con-man 10296 1 0 22:13 ? 00:00:00 wxvlc
root 10334 1 0 22:13 ? 00:00:00 mount cdrom0
root 10359 1 0 22:13 ? 00:00:00 mount /media/cdrom1
root 10385 1 0 22:14 ? 00:00:00 mount /media/cdrom3
root 10416 1 0 22:14 ? 00:00:00 mount /media/cdrom2
con-man 10459 1 0 22:15 ? 00:00:03 gnome-terminal
con-man 10461 10459 0 22:15 ? 00:00:00 gnome-pty-helper
con-man 10462 10459 0 22:15 pts/10 00:00:00 bash
con-man 10804 10462 0 22:23 pts/10 00:00:00 ps -ef

It looks like totem is a zombie whose parent is init (PPID 1), but it is not being reaped. My understanding is that init should always reap its exited children, but apparently that is not happening. This is on Edgy, so I think upstart is acting as init here.
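
For reference, "reaping" means the parent calls wait() or waitpid() on an exited child so the kernel can release its process-table entry; until that happens the child is listed as `<defunct>`. The Python sketch below shows a non-blocking reap loop. It only illustrates the mechanism and is not upstart's code; the function name `reap_children` is made up for this example.

```python
import os


def reap_children():
    """Collect every child that has already exited.

    Returns a list of (pid, raw_status) pairs; never blocks.
    """
    reaped = []
    while True:
        try:
            # -1 means "any child"; WNOHANG makes the call return immediately.
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append((pid, status))
    return reaped
```

A process acting as init would typically run a loop like this each time it is notified that a child has exited (SIGCHLD), so that zombies do not accumulate.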

Scott James Remnant (Canonical) (canonical-scott) wrote on 2007-03-02:

Could you provide the output of "ps -ely" instead please; there's no particular evidence that they've failed to be reaped, the multiple running mount/eject processes point to a different problem.

Changed in upstart:
status:	Unconfirmed → Needs Info

Scott James Remnant (Canonical) (canonical-scott) wrote on 2007-03-18:

It has been two weeks since more information was requested about this bug report.

If you're unable to supply the information, I would welcome anyone else who can to jump in.

Thanks

Scott James Remnant (Canonical) (canonical-scott) on 2007-03-20

Changed in upstart:
importance:	Undecided → Medium

Scott James Remnant (Canonical) (canonical-scott) wrote on 2007-05-23:

Moving this bug to rejected since without the requested information, it's not possible to debug further.

Changed in upstart:
status:	Needs Info → Rejected

Stanislaw Pitucha (stanislaw-pitucha) wrote on 2012-08-21:

I'd like to revive this bug with a similar case I'm experiencing myself. It's a natty installation with upstart 0.9.7-1.
I've got hundreds of processes owned by init in a zombie state, and upstart is not responding at all. That means commands like `status`, `reload`, etc. hang without any response.

Here's the beginning of `ps -ely`:

S   UID   PID  PPID  C PRI  NI   RSS    SZ WCHAN  TTY          TIME CMD
S     0     1     0  0  80   0  2236 22939 futex_ ?        00:00:52 init
S     0     2     0  0  80   0     0     0 kthrea ?        00:00:01 kthreadd
S     0     3     2  0  80   0     0     0 run_ks ?        00:00:10 ksoftirqd/0
S     0     6     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/0
S     0     7     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/1
S     0     9     2  0  80   0     0     0 run_ks ?        00:00:12 ksoftirqd/1
S     0    11     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/2
S     0    13     2  0  80   0     0     0 run_ks ?        01:17:28 ksoftirqd/2
S     0    14     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/3
S     0    16     2  0  80   0     0     0 run_ks ?        00:00:30 ksoftirqd/3
S     0    17     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/4
S     0    19     2  0  80   0     0     0 run_ks ?        00:00:05 ksoftirqd/4
S     0    20     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/5
S     0    22     2  0  80   0     0     0 run_ks ?        00:00:12 ksoftirqd/5
S     0    23     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/6
S     0    25     2  0  80   0     0     0 run_ks ?        00:00:15 ksoftirqd/6
S     0    26     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/7
S     0    28     2  0  80   0     0     0 run_ks ?        00:00:06 ksoftirqd/7
S     0    29     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/8
S     0    31     2  0  80   0     0     0 run_ks ?        00:00:10 ksoftirqd/8
S     0    32     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/9
S     0    33     2  0  80   0     0     0 worker ?        00:00:00 kworker/9:0
S     0    34     2  0  80   0     0     0 run_ks ?        00:00:03 ksoftirqd/9
S     0    35     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/10
S     0    36     2  0  80   0     0     0 worker ?        00:00:00 kworker/10:0
S     0    37     2  0  80   0     0     0 run_ks ?        00:00:05 ksoftirqd/10
S     0    38     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/11
S     0    40     2  0  80   0     0     0 run_ks ?        00:00:18 ksoftirqd/11
S     0    41     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/12
S     0    43     2  0  80   0     0     0 run_ks ?        00:00:12 ksoftirqd/12
S     0    44     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/13
S     0    46     2  0  80   0     0     0 run_ks ?        00:00:24 ksoftirqd/13
S     0    47     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/14
S     0    49     2  0  80   0     0     0 run_ks ?        00:00:27 ksoftirqd/14
S     0    50     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/15
S     0    52     2  0  80   0     0     0 run_ks ?        00:00:11 ksoftirqd/15
S     0    53     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/16
S     0    54     2  0  80   0     0     0 worker ?        00:00:00 kworker/16:0
S     0    55     2  0  80   0     0     0 run_ks ?        00:00:08 ksoftirqd/16
S     0    56     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/17
S     0    58     2  0  80   0     0     0 run_ks ?        00:00:12 ksoftirqd/17
S     0    59     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/18
S     0    61     2  0  80   0     0     0 run_ks ?        00:00:14 ksoftirqd/18
S     0    62     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/19
S     0    63     2  0  80   0     0     0 worker ?        00:00:17 kworker/19:0
S     0    64     2  0  80   0     0     0 run_ks ?        00:00:07 ksoftirqd/19
S     0    65     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/20
S     0    66     2  0  80   0     0     0 worker ?        00:00:00 kworker/20:0
S     0    67     2  0  80   0     0     0 run_ks ?        00:00:12 ksoftirqd/20
S     0    68     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/21
S     0    69     2  0  80   0     0     0 worker ?        00:00:00 kworker/21:0
S     0    70     2  0  80   0     0     0 run_ks ?        00:00:07 ksoftirqd/21
S     0    71     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/22
S     0    72     2  0  80   0     0     0 worker ?        00:00:00 kworker/22:0
S     0    73     2  0  80   0     0     0 run_ks ?        00:00:05 ksoftirqd/22
S     0    74     2  0 -40   -     0     0 cpu_st ?        00:00:00 migration/23
S     0    75     2  0  80   0     0     0 worker ?        00:00:00 kworker/23:0
S     0    76     2  0  80   0     0     0 run_ks ?        00:00:22 ksoftirqd/23
S     0    77     2  0  60 -20     0     0 rescue ?        00:00:00 cpuset
S     0    78     2  0  60 -20     0     0 rescue ?        00:00:00 khelper
S     0    79     2  0  60 -20     0     0 rescue ?        00:00:00 netns
S     0    81     2  0  80   0     0     0 bdi_sy ?        00:00:12 sync_supers
S     0    82     2  0  80   0     0     0 bdi_fo ?        00:00:01 bdi-default
S     0    83     2  0  60 -20     0     0 rescue ?        00:00:00 kintegrityd
S     0    84     2  0  60 -20     0     0 rescue ?        00:00:00 kblockd
S     0    85     2  0  60 -20     0     0 rescue ?        00:00:00 kacpid
S     0    86     2  0  60 -20     0     0 rescue ?        00:00:00 kacpi_notify
S     0    87     2  0  60 -20     0     0 rescue ?        00:00:00 kacpi_hotplug
S     0    88     2  0  60 -20     0     0 rescue ?        00:00:00 ata_sff
S     0    89     2  0  80   0     0     0 hub_th ?        00:00:00 khubd
S     0    90     2  0  60 -20     0     0 rescue ?        00:00:00 md
S     0    93     2  0  80   0     0     0 watchd ?        00:00:01 khungtaskd
S     0    94     2  0  80   0     0     0 kswapd ?        00:00:00 kswapd0
S     0    95     2  0  80   0     0     0 kswapd ?        00:00:00 kswapd1
S     0    96     2  0  85   5     0     0 ksm_sc ?        00:00:00 ksmd
S     0    97     2  0  80   0     0     0 fsnoti ?        00:00:02 fsnotify_mark
S     0    98     2  0  60 -20     0     0 rescue ?        00:00:00 aio
S     0    99     2  0  80   0     0     0 ecrypt ?        00:00:00 ecryptfs-kthrea
S     0   100     2  0  60 -20     0     0 rescue ?        00:00:00 crypto
S     0   104     2  0  60 -20     0     0 rescue ?        00:00:00 kthrotld
S     0   106     2  0  80   0     0     0 scsi_e ?        00:00:00 scsi_eh_0
S     0   107     2  0  80   0     0     0 scsi_e ?        00:00:00 scsi_eh_1
S     0   108     2  0  80   0     0     0 scsi_e ?        00:00:00 scsi_eh_2
S     0   109     2  0  80   0     0     0 scsi_e ?        00:00:00 scsi_eh_3
S     0   112     2  0  60 -20     0     0 rescue ?        00:00:00 kmpathd
S     0   113     2  0  60 -20     0     0 rescue ?        00:00:00 kmpath_handlerd
S     0   114     2  0  60 -20     0     0 rescue ?        00:00:00 kondemand
S     0   115     2  0  60 -20     0     0 rescue ?        00:00:00 kconservative
S     0   265     2  0  60 -20     0     0 rescue ?        00:00:00 mlx4
S     0   269     2  0  80   0     0     0 scsi_e ?        00:00:00 scsi_eh_4
S     0   272     2  0  80   0     0     0 worker ?        00:00:29 kworker/23:1
S     0   273     2  0  80   0     0     0 worker ?        00:00:31 kworker/22:1
S     0   274     2  0  80   0     0     0 worker ?        00:00:28 kworker/21:1
S     0   276     2  0  80   0     0     0 worker ?        00:00:30 kworker/20:1
S     0   281     2  0  80   0     0     0 worker ?        00:00:35 kworker/11:1
S     0   291     2  0  80   0     0     0 worker ?        00:00:35 kworker/10:1
S     0   293     2  0  80   0     0     0 worker ?        00:00:31 kworker/18:1
S     0   294     2  0  80   0     0     0 worker ?        00:00:35 kworker/16:1
S     0   301     2  0  80   0     0     0 worker ?        00:00:35 kworker/9:1
S     0   302     2  0  80   0     0     0 worker ?        00:00:33 kworker/17:1
S     0   303     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
S     0   315     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
S     0   322     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
Z   104   325     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
S     0   329     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
Z   104   331     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
S     0   336     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
S     0   343     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
S     0   349     2  0  60 -20     0     0 rescue ?        00:00:00 kdmflush
S     0   356     2  0  60 -20     0     0 rescue ?        00:00:00 xfs_mru_cache
S     0   357     2  0  60 -20     0     0 rescue ?        00:00:00 xfslogd
S     0   358     2  0  60 -20     0     0 rescue ?        00:00:00 xfsdatad
S     0   359     2  0  60 -20     0     0 rescue ?        00:00:00 xfsconvertd
S     0   361     2  0  80   0     0     0 xfsbuf ?        00:00:52 xfsbufd/dm-0
S     0   363     2  0  80   0     0     0 xfssyn ?        00:00:10 xfssyncd/dm-0
S     0   364     2  0  80   0     0     0 xfsail ?        00:00:00 xfsaild/dm-0
Z     0   375     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   384     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   408     1  0  80   0 13572  7451 poll_s ?        00:00:08 upstart-udev-br
Z     0   410     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   413     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   437     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   442     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   448     2  0  80   0     0     0 xfsbuf ?        00:01:01 xfsbufd/dm-2
S     0   449     2  0  80   0     0     0 xfssyn ?        00:00:11 xfssyncd/dm-2
S     0   450     2  0  80   0     0     0 xfsail ?        00:00:00 xfsaild/dm-2
Z   104   453     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   454     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
S     0   455     1  0  76  -4   908  5310 poll_s ?        00:00:05 udevd
Z     0   467     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   485     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   492     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   505     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   506     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   509     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   514     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   515     2  0  80   0     0     0 worker ?        00:00:00 kworker/13:0
Z   104   526     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   527     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   536     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   539     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   540     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   542     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   557     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   558     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   570     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   574     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   577     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   582     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   585     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   587     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   596     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   606     2  0  80   0     0     0 kjourn ?        00:00:00 kjournald
Z   104   607     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
S     0   609     2  0  80   0     0     0 xfsbuf ?        00:00:21 xfsbufd/dm-3
Z     0   628     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   637     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   638     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   647     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   654     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   656     2  0  80   0     0     0 xfssyn ?        00:00:07 xfssyncd/dm-3
Z   104   659     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   662     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   670     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   673     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   678     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
S     0   679     2  0  80   0     0     0 xfsail ?        00:00:00 xfsaild/dm-3
S     0   682     2  0  60 -20     0     0 rescue ?        00:00:00 edac-poller
Z   104   684     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   689     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   693     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   696     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   700     2  0  60 -20     0     0 rescue ?        00:00:00 mlx4_en
S     0   705     2  0  60 -20     0     0 rescue ?        00:00:00 kpsmoused
S     0   709     2  0  80   0     0     0 xfsbuf ?        00:00:41 xfsbufd/dm-4
S     0   710     2  0  60 -20     0     0 rescue ?        00:00:00 ttm_swap
S     0   711     2  0  80   0     0     0 xfssyn ?        00:00:10 xfssyncd/dm-4
S     0   712     2  0  80   0     0     0 xfsail ?        00:00:00 xfsaild/dm-4
Z   104   716     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   730     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   742     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   744     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   753     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   756     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z   104   760     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   771     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   775     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   790     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   792     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   799     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z   104   810     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   833     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   839     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
Z     0   842     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   844     2  0  80   0     0     0 xfsbuf ?        00:00:10 xfsbufd/dm-5
Z   104   845     1  0  80   0     0     0 exit   ?        00:00:00 sshd <defunct>
Z     0   849     1  0  80   0     0     0 exit   ?        00:00:00 nscd <defunct>
S     0   850     2  0  80   0     0     0 xfssyn ?        00:00:04 xfssyncd/dm-5
S     0   852     2  0  80   0     0     0 xfsail ?        00:00:00 xfsaild/dm-5
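
To summarize a listing like the one above, the zombie entries can be counted per parent and command. The Python sketch below assumes the 13-column `ps -ely` layout shown in the header line (S, UID, PID, PPID, C, PRI, NI, RSS, SZ, WCHAN, TTY, TIME, CMD); `count_zombies` is a hypothetical helper name, not part of any tool mentioned in this report.

```python
from collections import Counter


def count_zombies(ps_ely_text):
    """Count state-Z processes per (PPID, command) in `ps -ely` output."""
    counts = Counter()
    for line in ps_ely_text.splitlines():
        # Split into at most 13 fields so CMD keeps its embedded spaces.
        fields = line.split(None, 12)
        if len(fields) < 13 or fields[0] != "Z":
            continue  # header, short line, or not a zombie
        ppid = int(fields[3])
        command = fields[12].replace("<defunct>", "").strip()
        counts[(ppid, command)] += 1
    return counts
```

Calling `count_zombies(open("ps.txt").read())` on a saved listing (the file name is only an example) returns a `Counter` keyed by `(ppid, command)`, which makes it easy to see whether the zombies all belong to PID 1.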

Changed in upstart:
status:	Invalid → New

Stanislaw Pitucha (stanislaw-pitucha) wrote on 2012-08-21:

Also this may be a side effect of gdb attaching to the process, but this is the stacktrace I see every time I connect to init on that host (did not change over ~10 attempts):

#0 0x00007fc94fc9b5ae in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fc94fc2437e in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fc94fc1c67c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fc94fc53635 in fork () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00007fc950bed605 in ?? ()
#5 <signal handler called>
#6 0x00007fc94fbd9d05 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#7 0x00007fc94fbddab6 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#8 0x00007fc94fc12d7b in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#9 0x00007fc94fc1cbb6 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#10 0x00007fc94fc1fe78 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00007fc94fc2231e in malloc () from /lib/x86_64-linux-gnu/libc.so.6
#12 0x00007fc9507b2f95 in nih_alloc () from /lib/libnih.so.1
#13 0x00007fc9507b36ed in nih_strndup () from /lib/libnih.so.1
#14 0x00007fc9507bc921 in nih_log_message () from /lib/libnih.so.1
#15 0x00007fc950bed76a in ?? ()
#16 <signal handler called>
#17 0x00007fc9507b4500 in nih_list_add_after () from /lib/libnih.so.1
#18 0x00007fc9507b64b4 in nih_io_handle_fds () from /lib/libnih.so.1
#19 0x00007fc9507b9a03 in nih_main_loop () from /lib/libnih.so.1
#20 0x00007fc950bedf2d in main ()

Strace suggests it's waiting on a lock:

$ sudo strace -p 1
Process 1 attached - interrupt to quit
futex(0x7fc94ff351c0, FUTEX_WAIT_PRIVATE, 2, NULL^C <unfinished ...>
Process 1 detached
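
The gdb backtrace above contains two `<signal handler called>` markers, so it can help to see which named calls sit in each nested context. The Python sketch below splits a backtrace at those markers and keeps only the resolved function names. It assumes the frame format shown in this comment and does not attempt to diagnose the cause; `split_by_signal_context` is a hypothetical helper name.

```python
import re

# Matches "#N [0xADDR in ]rest-of-line" as printed by gdb's `bt`.
FRAME = re.compile(r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(.+?)\s*$")


def split_by_signal_context(backtrace):
    """Group named frames, innermost first, starting a new group at each
    '<signal handler called>' marker. Unresolved '??' frames are skipped."""
    groups = [[]]
    for line in backtrace.splitlines():
        match = FRAME.match(line.strip())
        if not match:
            continue
        rest = match.group(2)
        if rest.startswith("<signal handler called>"):
            groups.append([])
            continue
        name = rest.split(" ", 1)[0]
        if name != "??":
            groups[-1].append(name)
    return groups
```

Applied to the trace in this comment, the first group holds the innermost calls (ending in `fork`), the second the calls made while the first signal was being handled (`nih_log_message` down to `abort` and `raise`), and the last the normal main-loop frames.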

James Hunt (jamesodhunt) wrote on 2012-08-22:

@Stanislaw - the processes you list are not zombies. In addition, they are all kernel threads.

Since the problem you are seeing seems to be very different to that first reported on this bug, please raise a new bug including as many details as possible, such as:

- kernel version (cat /proc/version)
- full 'ps -ely' listing (attach it as a file)
- details of whether 'initctl list' hangs
- details of whether this issue is a "one off" or if you see it on every boot
- details of when you see it (always post boot? at what time?)
- list of jobs you have (ls /etc/init/)
- and so on...

Stanislaw Pitucha (stanislaw-pitucha) wrote on 2012-08-22:

@James I think you missed the rest of the list. (I should've trimmed it too) If you download the whole post, you'll see at the bottom:

Z 104 716 1 0 80 0 0 0 exit ? 00:00:00 sshd <defunct>
Z 104 730 1 0 80 0 0 0 exit ? 00:00:00 sshd <defunct>
Z 104 742 1 0 80 0 0 0 exit ? 00:00:00 sshd <defunct>
Z 0 744 1 0 80 0 0 0 exit ? 00:00:00 nscd <defunct>
Z 0 753 1 0 80 0 0 0 exit ? 00:00:00 nscd <defunct>
...

Kernel is 2.6.38-15.62 from Ubuntu. It happens every couple of weeks on one of 5xx machines, and I couldn't spot any specific situation that makes hosts go into that state. It only happens after the host has been up for some time, not right after bootup.
The host is now rebooted, so I can't answer other questions until I run into this issue again.

Stanislaw Pitucha (stanislaw-pitucha) wrote on 2012-08-23:

Ran into another one. Here's the rest of the requested information:
- this time it was kernel 2.6.38-13.56
- `initctl list` hangs on execution
- /etc/init contains standard natty 'ubuntu server' services plus: cgconfig, cgred, libvirt-bin, libvirt-cgred-wait, nova-* (host runs openstack), qemu-kvm, ufw

James Hunt (jamesodhunt) wrote on 2012-08-24:

Please can you check if you can run 'sudo initctl list' since this uses a different mechanism than running that command as a non-root user. I suspect that too will hang.

Looking back at the stacktrace in #5, it appears the problem is that some process is leaking memory such that Upstart is unable to grab any to allow it to work. A side effect is that init (PID 1) will probably be consuming a lot of CPU as it keeps trying to obtain memory.

To identify the cause of the low-memory scenario, you'll have to monitor your system with top / sar (sysstat) / nagios / vmstat, etc. You could even just run top in a loop redirecting to a file like this:

top -b -d 60 >> /tmp/top.log

You'll need to tweak the -d parameter to something suitable to avoid chewing up too much disk though :-)

Also, check your system log to look for any odd entries - you might be lucky and find some 'Out of memory' messages, but possibly not if the process(es) causing the problem are high-priority root ones.

Stanislaw Pitucha (stanislaw-pitucha) wrote on 2012-08-28:

#10

contents of /proc/meminfo (1.1 KiB, text/plain)

Checking another host in this condition:
- `initctl list` fails both as a normal user and root
- there doesn't seem to be a low memory issue - /proc/meminfo attached, but the main part is:

MemTotal: 99193376 kB
MemFree: 87722536 kB
Buffers: 4448 kB
Cached: 8214824 kB
SwapCached: 0 kB
Active: 2307340 kB
Inactive: 6248908 kB

Also, apart from `init` not responding, the host is completely functional - it accepts new connections and processes all jobs fine... until it runs out of PIDs to assign.
- all logs are cycled pretty quickly with UFW warnings, so I couldn't find any unusual entries unfortunately. I'll check a bit earlier if I find another host running into this issue.
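
For a quick check of figures like the ones quoted in this comment, a small sketch along these lines computes the free fraction from /proc/meminfo-style text. The function names (parse_meminfo, free_fraction) are made up for illustration, and the sample text simply repeats four of the values quoted above.

```python
def parse_meminfo(text):
    """Parse '/proc/meminfo'-style lines ('Key: value kB') into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[key.strip()] = int(parts[0])
    return values


def free_fraction(values):
    """Fraction of MemTotal reported as MemFree (ignores buffers and page cache)."""
    return values["MemFree"] / values["MemTotal"]


# Sample values copied from the comment above.
sample = """MemTotal: 99193376 kB
MemFree: 87722536 kB
Buffers: 4448 kB
Cached: 8214824 kB"""

info = parse_meminfo(sample)
print(f"{free_fraction(info):.1%} of RAM is free")
```

With these numbers, roughly 88% of RAM is free, which supports the point that this host was not short of memory when init stopped responding.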

Clint Byrum (clint-fewbar) wrote on 2013-05-31:

#11

There hasn't been much activity in this bug lately. Any updates?

James Hunt (jamesodhunt) wrote on 2013-06-03:

#12

Stanislaw - I don't understand your comment about "running out of pids". Do you mean that no further upstart jobs can be started or that no further processes of any type can be started? Can you start a new shell for example or just run "ls"?

Regarding the futex call you see in the strace output, this must relate to some libc call since Upstart doesn't make use of locking directly (and nor does the NIH library it uses). If you can get a strace log with more context, that might give more clues.

However, if you have only seen this issue on a single machine, I'd suggest the problem is with your hardware. Please check it and ensure you run a memory checker like memtest86 on the system to rule out bad memory (which is probably the most likely cause after a bad disk).

Blaisorblade (p-giarrusso) wrote on 2013-12-14:

#13

Data about the bug (957.6 KiB, application/x-xz)

I ran into the same issue on Ubuntu 13.10, and I have lots of evidence for you to look at, including an Apport report, a core dump (because I didn't trust apport), and a small clue about what's happening.
Apparently, init
- corrupts the memory allocation structures
- later, some code tries to allocate memory using malloc (#19 below)
- malloc takes a lock (I'm inferring this)
- then, it notices memory allocation structures are corrupted (#16 below; note "corrupted double-linked list")
- it tries to report an error about that
- the reporting code invokes malloc again, without even releasing the lock first (#2)
- malloc tries to acquire the lock (#1-#0) and gets stuck; if the lock had been released, probably malloc would fail because of the corruption.
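
The sequence above can be modelled in a few lines. This is only a toy sketch of the pattern (a non-reentrant lock re-acquired from an error path by the thread that already holds it), not glibc's actual code; arena_lock, report_error and allocate are hypothetical names. A timeout replaces the blocking acquire so the sketch terminates instead of hanging the way init did.

```python
import threading

# Stand-in for the allocator's lock: a plain, non-reentrant mutex.
arena_lock = threading.Lock()


def report_error():
    """Error path that itself needs the allocator lock (like the nested malloc call)."""
    # A blocking acquire here would wait forever, because the caller still
    # holds the lock. The timeout only exists so that this sketch returns.
    acquired = arena_lock.acquire(timeout=0.1)
    if acquired:
        arena_lock.release()
    return acquired


def allocate(heap_is_corrupt):
    """Take the lock, then call the error path while still holding it."""
    with arena_lock:
        if heap_is_corrupt:
            return "reported" if report_error() else "would deadlock"
        return "ok"


print(allocate(False))
print(allocate(True))
```

The second call takes the "would deadlock" branch: the error path can never get the lock while the same thread is holding it, which matches frames #2 to #0 waiting in __lll_lock_wait_private.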

While I'm quite rusty, I have quite a few patches in the Linux kernel, so I hope looking at my analysis won't be a waste of time.

Backtrace:
(gdb) bt
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007fe58016ef1c in _L_lock_11850 () at malloc.c:5151
#2  0x00007fe58016c4c5 in __GI___libc_malloc (bytes=36) at malloc.c:2856
#3  0x00007fe580f32c37 in local_strdup (s=0x7fe5811264a5 "/lib/x86_64-linux-gnu/libgcc_s.so.1") at dl-load.c:162
#4  _dl_map_object (loader=loader@entry=0x7fe58112f000, name=name@entry=0x7fe58026bb26 "libgcc_s.so.1", type=type@entry=2, trace_mode=trace_mode@entry=0, 
    mode=mode@entry=-1879048191, nsid=<optimized out>) at dl-load.c:2510
#5  0x00007fe580f3dd54 in dl_open_worker (a=a@entry=0x7fff350930c8) at dl-open.c:239
#6  0x00007fe580f396e6 in _dl_catch_error (objname=objname@entry=0x7fff350930b8, errstring=errstring@entry=0x7fff350930c0, 
    mallocedp=mallocedp@entry=0x7fff350930b0, operate=operate@entry=0x7fe580f3dc00 <dl_open_worker>, args=args@entry=0x7fff350930c8) at dl-error.c:177
#7  0x00007fe580f3d809 in _dl_open (file=0x7fe58026bb26 "libgcc_s.so.1", mode=-2147483647, caller_dlopen=<optimized out>, nsid=-2, argc=2, 
    argv=0x7fff350952b8, env=0x7fff350952d0) at dl-open.c:667
#8  0x00007fe580220da2 in do_dlopen (ptr=ptr@entry=0x7fff350932d0) at dl-libc.c:87
#9  0x00007fe580f396e6 in _dl_catch_error (objname=0x7fff350932b0, errstring=0x7fff350932c0, mallocedp=0x7fff350932a0, operate=0x7fe580220d60 <do_dlopen>, 
    args=0x7fff350932d0) at dl-error.c:177
#10 0x00007fe580220e62 in dlerror_run (args=0x7fff350932d0, operate=0x7fe580220d60 <do_dlopen>) at dl-libc.c:46
#11 __GI___libc_dlopen_mode (name=name@entry=0x7fe58026bb26 "libgcc_s.so.1", mode=mode@entry=-2147483647) at dl-libc.c:163
#12 0x00007fe5801fb175 in init () at ../sysdeps/x86_64/backtrace.c:52
#13 0x00007fe57fed9370 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
#14 0x00007fe5801fb294 in __GI___backtrace (array=array@entry=0x7fff35093590, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#15 0x00007fe58015d515 in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe580271240 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:178
#16 0x00007fe580168e1d in malloc_printerr (ptr=0x7fe582600ec0, str=0x7fe58026d1e8 "corrupted double-linked list", action=<optimized out>) at malloc.c:4923
#17 malloc_consolidate (av=av@entry=0x7fe5804aa740 <main_arena>) at malloc.c:4094
#18 0x00007fe58016a0e1 in _int_malloc (av=0x7fe5804aa740 <main_arena>, bytes=8240) at malloc.c:3379
#19 0x00007fe58016c4d0 in __GI___libc_malloc (bytes=8240) at malloc.c:2859
#20 0x00007fe580d16e6d in nih_alloc (parent=parent@entry=0x7fe5826426d0, size=size@entry=8192) at alloc.c:158
#21 0x00007fe580d170a2 in nih_realloc (ptr=<optimized out>, parent=parent@entry=0x7fe5826426d0, size=size@entry=8192) at alloc.c:202
#22 0x00007fe580d1b82d in nih_io_buffer_resize (buffer=0x7fe5826426d0, grow=grow@entry=80) at io.c:315
#23 0x00007fe580d1cd4d in nih_io_watcher_read (watch=0x7fe582642340, io=0x7fe582642570) at io.c:1079
#24 nih_io_watcher (io=0x7fe582642570, watch=0x7fe582642340, events=NIH_IO_READ) at io.c:933
#25 0x00007fe580d1b67a in nih_io_handle_fds (readfds=readfds@entry=0x7fff35093f50, writefds=writefds@entry=0x7fff35093fd0, 
    exceptfds=exceptfds@entry=0x7fff35094050) at io.c:237
#26 0x00007fe580d1f64c in nih_main_loop () at main.c:586
#27 0x00007fe58115816a in main (argc=<optimized out>, argv=<optimized out>) at main.c:772

Scenario: while logged in at the console, I accidentally stopped dbus, so pulseaudio started respawning and failing in a loop for ~15 minutes until I restarted dbus. Nothing looked wrong on the console, so I went on for a while. I noticed something was wrong only when top was taking 20% of CPU time instead of 2%, apparently because it's not happy to deal with ~10000 processes. Apart from this, the host stayed completely functional, copying ~1TB of data across two USB 2 disks; I'm writing this bug report from the machine itself.

Analysis: Those are (mostly) pulseaudio processes hanging off init --user, which seems to be deadlocked because the malloc implementation tries to allocate memory while trying to report about a "corrupted double-linked list" through malloc_printerr. Relevant entry from gdb backtrace below:

#16 0x00007fe580168e1d in malloc_printerr (ptr=0x7fe582600ec0, str=0x7fe58026d1e8 "corrupted double-linked list", action=<optimized out>) at malloc.c:4923

Hence, without looking at the sources indicated, I seem to see:
- a deadlock in an error path of glibc (using malloc to tell me that malloc is broken, without releasing the lock, sounds no good); I'm assuming it really is a deadlock, but the process is certainly hanging on a lock
- malloc is trying to say the heap is corrupted, so there's probably some Valgrinding to do.
- why are all those processes getting started without the older ones being reaped *first*?
- at least, it's clear why they die right away: I killed dbus by mistake by running "restart networking", which dbus.conf aliases in fact to "break almost everything". (It says:

start on local-filesystems
stop on deconfiguring-networking

That is, networking itself might be fine, but dbus won't be auto-restarted. But certainly I made a mistake by trying to fix a networking problem by blindly restarting networking (though it doesn't sound *so* unreasonable, does it?), and whether the configuration is too error-prone is an interesting but separate issue.)
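
On the reaping question in the list above: a zombie stays in the process table until its parent calls one of the wait functions for it, so a parent whose main loop is blocked (as in the backtrace) never gets to collect its exited children, while children that were already spawned keep exiting. The following is a generic sketch of a non-blocking reap loop as a parent process might run it from its main loop; it is an illustration under that assumption, not Upstart's or libnih's code, and reap_children is a hypothetical name.

```python
import os


def reap_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped
```

A process stuck in futex() like PID 1275 never returns to the point where such a loop would run, which would be consistent with the ~10000 defunct pulseaudio entries hanging off it.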

Below is all the supporting evidence for my analysis, with relevant excerpts of program output - complete outputs are attached in most cases (including the answers to most or all of the questions you asked - /proc/meminfo, service lists, etc.). The commands include the redirection I used, so you also have the filename.

After capturing all info I could think of (including what you asked for), I've also captured a core and then forced an apport dump with kill -ILL (apparently, apport doesn't think I might want to include stacktraces for hangs, so I need to spoil the core with a spurious signal and then explain what happened).
However, I'm not very familiar with apport; saving a core from the running program was easier than getting apport to save it, and I found no way to attach apport information to this bug (I probably can't), or to find/edit the bug report which apport submitted for me (I see that seems to be by design).

# strace -p 1275
Process 1275 attached
futex(0x7fe5804aa740, FUTEX_WAIT_PRIVATE, 2, NULL^CProcess 1275 detached

$ sudo gdb $(which init) -p 1275 2>&1|tee gdb-1275-transcript-v3.txt
[...]
(gdb) bt
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007fe58016ef1c in _L_lock_11850 () at malloc.c:5151
#2  0x00007fe58016c4c5 in __GI___libc_malloc (bytes=36) at malloc.c:2856
#3  0x00007fe580f32c37 in local_strdup (s=0x7fe5811264a5 "/lib/x86_64-linux-gnu/libgcc_s.so.1") at dl-load.c:162
#4  _dl_map_object (loader=loader@entry=0x7fe58112f000, name=name@entry=0x7fe58026bb26 "libgcc_s.so.1", type=type@entry=2, trace_mode=trace_mode@entry=0, 
    mode=mode@entry=-1879048191, nsid=<optimized out>) at dl-load.c:2510
#5  0x00007fe580f3dd54 in dl_open_worker (a=a@entry=0x7fff350930c8) at dl-open.c:239
#6  0x00007fe580f396e6 in _dl_catch_error (objname=objname@entry=0x7fff350930b8, errstring=errstring@entry=0x7fff350930c0, 
    mallocedp=mallocedp@entry=0x7fff350930b0, operate=operate@entry=0x7fe580f3dc00 <dl_open_worker>, args=args@entry=0x7fff350930c8) at dl-error.c:177
#7  0x00007fe580f3d809 in _dl_open (file=0x7fe58026bb26 "libgcc_s.so.1", mode=-2147483647, caller_dlopen=<optimized out>, nsid=-2, argc=2, 
    argv=0x7fff350952b8, env=0x7fff350952d0) at dl-open.c:667
#8  0x00007fe580220da2 in do_dlopen (ptr=ptr@entry=0x7fff350932d0) at dl-libc.c:87
#9  0x00007fe580f396e6 in _dl_catch_error (objname=0x7fff350932b0, errstring=0x7fff350932c0, mallocedp=0x7fff350932a0, operate=0x7fe580220d60 <do_dlopen>, 
    args=0x7fff350932d0) at dl-error.c:177
#10 0x00007fe580220e62 in dlerror_run (args=0x7fff350932d0, operate=0x7fe580220d60 <do_dlopen>) at dl-libc.c:46
#11 __GI___libc_dlopen_mode (name=name@entry=0x7fe58026bb26 "libgcc_s.so.1", mode=mode@entry=-2147483647) at dl-libc.c:163
#12 0x00007fe5801fb175 in init () at ../sysdeps/x86_64/backtrace.c:52
#13 0x00007fe57fed9370 in pthread_once () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_once.S:103
#14 0x00007fe5801fb294 in __GI___backtrace (array=array@entry=0x7fff35093590, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:103
#15 0x00007fe58015d515 in __libc_message (do_abort=2, fmt=fmt@entry=0x7fe580271240 "*** Error in `%s': %s: 0x%s ***\n")
    at ../sysdeps/unix/sysv/linux/libc_fatal.c:178
#16 0x00007fe580168e1d in malloc_printerr (ptr=0x7fe582600ec0, str=0x7fe58026d1e8 "corrupted double-linked list", action=<optimized out>) at malloc.c:4923
#17 malloc_consolidate (av=av@entry=0x7fe5804aa740 <main_arena>) at malloc.c:4094
#18 0x00007fe58016a0e1 in _int_malloc (av=0x7fe5804aa740 <main_arena>, bytes=8240) at malloc.c:3379
#19 0x00007fe58016c4d0 in __GI___libc_malloc (bytes=8240) at malloc.c:2859
#20 0x00007fe580d16e6d in nih_alloc (parent=parent@entry=0x7fe5826426d0, size=size@entry=8192) at alloc.c:158
#21 0x00007fe580d170a2 in nih_realloc (ptr=<optimized out>, parent=parent@entry=0x7fe5826426d0, size=size@entry=8192) at alloc.c:202
#22 0x00007fe580d1b82d in nih_io_buffer_resize (buffer=0x7fe5826426d0, grow=grow@entry=80) at io.c:315
#23 0x00007fe580d1cd4d in nih_io_watcher_read (watch=0x7fe582642340, io=0x7fe582642570) at io.c:1079
#24 nih_io_watcher (io=0x7fe582642570, watch=0x7fe582642340, events=NIH_IO_READ) at io.c:933
#25 0x00007fe580d1b67a in nih_io_handle_fds (readfds=readfds@entry=0x7fff35093f50, writefds=writefds@entry=0x7fff35093fd0, 
    exceptfds=exceptfds@entry=0x7fff35094050) at io.c:237
#26 0x00007fe580d1f64c in nih_main_loop () at main.c:586
#27 0x00007fe58115816a in main (argc=<optimized out>, argv=<optimized out>) at main.c:772

(This is after installing all needed debugging symbols).

I attached ps -efly's output:
$ ps -efly > ps-efly.txt
$ xz ps-efly.txt

Some stats from it on the zombie pulseaudios:

$ zcat ps-ely.txt.gz |fgrep 'pulseaudio <defunct>'|wc -l
10227

$ zcat ps-ely.txt.gz |fgrep 'pulseaudio <defunct>'|head -2
Z  1000   301  1275  0  80   0     0     0 exit   ?        00:00:00 pulseaudio <defunct>
Z  1000   302  1275  0  80   0     0     0 exit   ?        00:00:00 pulseaudio <defunct>
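
To see which parent the zombies belong to without eyeballing the listing, a sketch like the following groups state-Z rows by PPID. It assumes the `ps -ely` column order S, UID, PID, PPID, ... as seen in the rows above; zombies_by_parent is a hypothetical helper name, the header line is written out by hand, and the two data rows are the ones quoted above.

```python
from collections import Counter


def zombies_by_parent(ps_ely_lines):
    """Count zombie (state Z) rows per parent PID in `ps -ely`-style output."""
    counts = Counter()
    for line in ps_ely_lines:
        fields = line.split()
        # Assumed columns: S UID PID PPID ...; the header row fails isdigit().
        if len(fields) >= 4 and fields[0] == "Z" and fields[3].isdigit():
            counts[int(fields[3])] += 1
    return counts


sample = [
    "S   UID   PID  PPID  C PRI  NI   RSS    SZ WCHAN  TTY          TIME CMD",
    "Z  1000   301  1275  0  80   0     0     0 exit   ?        00:00:00 pulseaudio <defunct>",
    "Z  1000   302  1275  0  80   0     0     0 exit   ?        00:00:00 pulseaudio <defunct>",
]

for ppid, count in zombies_by_parent(sample).most_common():
    print(ppid, count)
```

Run against the full listing, this should attribute the defunct pulseaudio entries to PPID 1275, the hung init --user.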

Some stats on the alive inits:

$ ps -efly|grep init
S root         1     0  0  80   0  1764  6806 poll_s 00:34 ?        00:00:34 /sbin/init
S paolo     1275  1081  0  80   0   432  9084 futex_ 00:35 ?        00:00:00 init --user
S paolo    10808  8289  0  80   0   980  4160 pipe_w 22:49 pts/12   00:00:00 grep --color=auto init
S paolo    28778 28673  0  80   0  1064  9091 poll_s 12:22 ?        00:00:00 init --user
S paolo    28941 28778  0  80   0   200  1110 wait   12:22 ?        00:00:00 /bin/sh /etc/xdg/xfce4/xinitrc -- /etc/X11/xinit/xserverrc

Notice the two live init --user processes (I might have created the second one by restarting the desktop session, I'm not sure). initctl list works fine (for both user and system), but I'm betting it attaches to the second, responsive one, PID 28778. To verify that, let's strace that process:

$ sudo strace -p 28778 2>&1| tee strace-28778.txt
Process 28778 attached
select(25, [3 5 6 7 8 9 10 11 14 19 20 24], [], [7 8 9 10 14 20], NULL

Then try running initctl list - note the 28778 in the socket name:
$ strace initctl list 2>&1|tee strace-initctl-list.txt
...
socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_LOCAL, sun_path=@"/com/ubuntu/upstart-session/1000/28778"}, 41) = 0
...

Meanwhile, PID 28778 gets busy answering the query:

select(25, [3 5 6 7 8 9 10 11 14 19 20 24], [], [7 8 9 10 14 20], NULL) = 1 (in [7])
accept4(7, {sa_family=AF_LOCAL, NULL}, [2], SOCK_CLOEXEC) = 13
fcntl(13, F_GETFL)                      = 0x2 (flags O_RDWR)
fcntl(13, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
getsockname(13, {sa_family=AF_LOCAL, sun_path=@"/com/ubuntu/upstart-session/1000/28778"}, [41]) = 0
[...]

Changed in upstart:
status:	New → Confirmed

hon (hon2048) wrote on 2016-01-16:

#14

This bug affects me:
$ ps -o pid,ppid,cmd -ax | grep defunct
6657 6608 grep --color=auto defunct
18979 1 [indicator-sound] <defunct>

OS: Xubuntu 14.04
Kernel version: 3.13.0-44-generic

Robert Lunnon (bobl-1) wrote on 2016-06-01:

#15

This has bitten me badly on 3.13.0-86-generic #131-Ubuntu SMP Thu May 12 23:33:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

It seems to occur only after a while, on systems where lots of processes are being spawned and reaped by init. It does not stop init, but seems to slow down the reaping of zombies. If this IS a memory corruption due to thread-unsafe malloc usage, as Blaisorblade (p-giarrusso) wrote on 2013-12-14, then it should not be ignored. Bugs in init should be taken doubly seriously. It's occurring only on live servers at the moment, so I can't debug until it happens in a sandbox. I would suspect that this bug will continue to come and go as the kernel changes, because of memory placement/timing variations in the upstart process.
