worker signal mask inherited by children

Bug #407428 reported by Michael Helmling on 2009-07-31
This bug affects 11 people
Affects Status Importance Assigned to Milestone
udev (Ubuntu)
High
Kees Cook
Karmic
High
Kees Cook

Bug Description

Binary package hint: openssh-server

Hi, since upgrading to karmic on one machine I get a strange behavior of sshd:

- After exiting an SSH session, the terminal on the client machine hangs instead of closing
- On the server there is a "sshd <defunct>" after each such session (this is especially annoying since I have a nagios server checking the SSH status every few minutes, so the number of zombie processes rapidly increases)
- Additionally, I noticed that "Ctrl-C" does not work inside the SSH session, which is a bash shortcut that I use very often ;-) (should start a new empty command line ignoring what you've entered so far)

Restarting sshd takes exceptionally long, but at least kills all the zombies.
When I'm working on a terminal directly at the machine (i.e. not over SSH), the behavior is totally normal, so I consider this an ssh (or at least ssh-related) bug.

Hi Michael,

On Fri, Jul 31, 2009 at 05:24:34PM -0000, Michael Helmling wrote:
> Public bug reported:
>
> Binary package hint: openssh-server
>
> Hi, since upgrading to karmic on one machine I get a strange behavior of
> sshd:
>
> - After exiting an SSH session, the terminal on the client machine hangs instead of closing
> - On the server there is a "sshd <defunct>" after each such session (this is especially annoying since I have a nagios server checking the SSH status every few minutes, so the number of zombie processes rapidly increases)
> - Additionally, I noticed that "Ctrl-C" does not work inside the SSH session, which is a bash shortcut that I use very often ;-) (should start a new empty command line ignoring what you've entered so far)
>
>
> Restarting sshd takes exceptionally long, but at least kills all the zombies.
> When I'm working on a terminal directly at the machine (i.e. not over SSH), the behavior is totally normal, so I consider this an ssh (or at least ssh-related) bug.
>

I've seen similar behavior when I clone virtual machines. Rebooting
the virtual instance fixes all of the issues though.

Is your system a virtual machine or a physical machine? What happens if
you reboot the system? How is the system installed (from an iso, the
network)?

  status incomplete

--
Mathias Gug
Ubuntu Developer http://www.ubuntu.com

Changed in openssh (Ubuntu):
status: New → Incomplete

Hi Mathias,
the system is a physical one. It was installed from an ISO, but with some earlier Ubuntu version, and then upgraded release by release. I've already tried 'aptitude reinstall openssh-server' without success.
Rebooting doesn't solve the problem. However, it isn't totally reproducible: after the last /etc/init.d/ssh restart, I could exit ssh sessions normally without leaving zombie processes. Now, after another reboot, the zombies have returned. So rebooting is counterproductive rather than helpful. ;-)

On Fri, Jul 31, 2009 at 06:53:30PM -0000, Michael Helmling wrote:
> Rebooting doesn't solve the problem. However, it isn't totally reproducible: After the last /etc/init.d/ssh restart, I could exit ssh sessions normally without leaving zombie processes. However, now after another reboot, the zombies return. So, rebooting is rather counterproductive than helping. ;-)
>

Right. Could you confirm that issuing *one* /etc/init.d/ssh restart after
the system has rebooted fixes the issue?

--
Mathias Gug
Ubuntu Developer http://www.ubuntu.com

I'm experiencing the same bug, karmic x64, tons of ssh zombies.
Happens with interactive and non-interactive shells, and with ssh tunnels.

If I restart the main sshd, it works for a little while, then starts leaving zombies behind.

My shell is tcsh (doesn't seem to matter though).

If I strace the main sshd while doing a "ssh server echo '$$'", I see:

28640 write(1, "28640\n", 6) = 6 <=== my echo
28640 rt_sigprocmask(SIG_SETMASK, NULL, ~[INT KILL ALRM TERM CHLD STOP RTMIN RT_1], 8) = 0
28640 rt_sigprocmask(SIG_SETMASK, ~[KILL ALRM TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
28640 close(0) = 0
28640 close(1) = 0
28640 close(2) = 0
28640 rt_sigprocmask(SIG_SETMASK, NULL, ~[KILL ALRM TERM CHLD STOP RTMIN RT_1], 8) = 0
28640 rt_sigprocmask(SIG_SETMASK, ~[INT KILL ALRM TERM CHLD STOP RTMIN RT_1], NULL, 8) = 0
28640 exit_group(0) = ? <=== the shell exits properly
28639 <... select resumed> ) = 1 (in [11])
28639 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[KILL ALRM STOP RTMIN RT_1], 8) = 0
28639 rt_sigprocmask(SIG_SETMASK, ~[KILL ALRM STOP RTMIN RT_1], NULL, 8) = 0
28639 read(11, "28640\n", 16384) = 6
28639 select(14, [3 7 11 13], [3], NULL, NULL) = 3 (in [11 13], out [3])
28639 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[KILL ALRM STOP RTMIN RT_1], 8) = 0
28639 rt_sigprocmask(SIG_SETMASK, ~[KILL ALRM STOP RTMIN RT_1], NULL, 8) = 0
28639 read(11, "", 16384) = 0
28639 close(11) = 0
28639 read(13, "", 16384) = 0
28639 close(13) = 0
28639 write(3, "\315\300.\216\6\313\\\303\360\304\324*\355\332\222\373\266\275\255\275\t\261\",\325\205\3432\25\355\26\214"..., 48) = 48
28639 select(14, [3 7], [3], NULL, NULL) = 1 (out [3])
28639 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[KILL ALRM STOP RTMIN RT_1], 8) = 0
28639 rt_sigprocmask(SIG_SETMASK, ~[KILL ALRM STOP RTMIN RT_1], NULL, 8) = 0
28639 write(3, "\202T\314[\351\244\33\22\245\361\302NI=Yv\35\375\311h\25\275\272\255fbzH\365\244\247n", 32) = 32
28639 select(14, [3 7], [], NULL, NULL <== the forked sshd expects something from fd 3 or 7 but it never happens

Changed in openssh (Ubuntu):
status: Incomplete → Confirmed
Hervé Fache (rvfh) wrote :

Most of my computers have been upgraded from earlier versions. The only one showing this behaviour is my server which was initially installed with 6.06. Issues are the same as reported:
* Ctrl-C and Ctrl-Z do not work
* login takes time
* logout hangs until I kill it with '~.'
* loads of zombie processes

About the systems:
* both systems are up-to-date karmic.
* sshd_config is almost the same as on my laptop (only authorized users differ).
* sshd and all its shared libs are identical on both systems.
* .bashrc and .bash_logout are the same on both systems
* .bash_profile was only present on the server, but removing it does not change the Ctrl-C behaviour

If I ssh to my server and then from there ssh back to my laptop, Ctrl-C works...
I'll keep looking!

Michael Helmling (supermihi) wrote :

I can confirm that one /etc/init.d/ssh restart after rebooting seems to fix the issue with the zombie processes.

Stas Sușcov (sushkov) wrote :

Same as above after updating from Jaunty.
The shell is not important; any connection hangs on logout.

This is also true when using screen: screen hangs after logging out from it.

I hope it gets solved soon; my box is a home server.

Thank you.

Stas Sușcov (sushkov) wrote :

Downgrading to openssh 1:5.1p1-5ubuntu1 (Jaunty x86) solved the bug for me, so the issue is somewhere in applied patches between 1:5.1p1-5ubuntu1 (Jaunty) and 1:5.1p1-6ubuntu1 (Karmic) builds.

Stas Sușcov (sushkov) wrote :

Hmm, actually it only partially solved the bug.
Now I can log out, but CTRL+C and screen are still not working.

The problem doesn't seem to be caused by the openssh-server/client packages.
I've downgraded to the hardy version, and the bug persists.
Something else is causing the bug, but it happens only on remote ssh connections.

I don't think the Ctrl-C handling is due to ssh, more like some tty or shell
option of some sort. Bash _does_ print "^C" when I press Ctrl-C but nothing
else happens (no interrupt signal sent).

There probably are two bugs here:
* ssh hangs/creates zombies
* bash does not propagate signals (interactive mode broken?)

It's not bash. I use tcsh (both on the client and the server) and I'm also affected.

Nicolas Valcarcel (nvalcarcel) wrote :

I checked the diff between the jaunty and the karmic versions and can only find two possibilities:
a) Colin's Debian patch is causing the breakage (using READ and WRITE respectively instead of RW)
b) it is a bug in libkrb5-3

Michael Helmling (supermihi) wrote :

I don't know if those are the same type of "signals", but maybe this bug I found is related:
https://bugs.launchpad.net/ubuntu/+bug/412972
Could those of you who are also affected by the zombie processes try the kill example?

As a workaround: when booting with kernel 2.6.28-14, logoff and Ctrl-C/Z work normally.

Bryce Harrington (bryce) wrote :

I have been able to reproduce this issue on my hardware. Definitely a regression that needs attention.

Changed in openssh (Ubuntu):
importance: Undecided → High
milestone: none → karmic-alpha-5
Wisefox (shangxiaole) wrote :

Finally, I've found a report that describes this common problem.
On my first attempt I installed karmic via the network install script and updated to the latest version. I found that my desktop refused to interrupt processes with Ctrl+C over ssh. That makes it really hard to use, because I only work over ssh. I thought debootstrap had not configured my system properly, so I tried again with an install from ISO, but that did not fix the problem. The problem did disappear with the packages from one recent day (I update daily), but then it broke again. I suspect the cause is in the kernel settings in the initrd, and the sshd problem may be related to that, but I'm not sure.

On my system:

kernel: 2.6.31-5-generic
ssh: job control does not work
VNC: job control does not work (similar to ssh)
local: job control is OK

Michael Helmling (supermihi) wrote :

I am experiencing more strange behavior: When I start aptitude directly with the ssh command, i.e. something like

ssh -t <email address hidden> aptitude -V dist-upgrade

then I always get "kicked out" after the packages are downloaded or, more precisely, I assume directly before dpkg is called. The output (translated from German):

Need to get 0B/264MB of archives. After unpacking 4571kB will be freed.
Do you want to continue? [Y/n/?]
Writing extended state information... Done
Connection to m*********.de closed.

Sometimes I found zombie dpkg processes after that happened, but not always. If I start an interactive ssh session and then run aptitude, it always works.

Jamie Strandboge (jdstrand) wrote :

FWIW, I see this as well. A /etc/init.d/ssh restart does clear everything out and ssh is ok for a while. I noticed that nagios tends to aggravate things (/usr/lib/nagios/plugins/check_ssh -4 <ip address>), but not immediately after a restart. Eg, I had 883 zombies a moment ago, then I did restart (which cleared them all out), and now I can't trigger it immediately. :(

Wisefox (shangxiaole) wrote :

I don't know whether the developers have already found the root cause, so here are some deeper clues.

When Karmic starts ssh and several other services (udevd, dhcp, etc.), they run with a wrong blocked-signal mask.
Check SigBlk for sshd:
$ cat /proc/`pidof sshd| awk '{print $1}'`/status | grep Sig
SigQ: 4/16297
SigPnd: 0000000000000000
SigBlk: fffffffe7ffb9eff # should be all zero
SigIgn: 0000000000000000
SigCgt: 0000000180006000

All children forked by sshd inherit the same SigBlk from sshd. Ctrl+C uses SIGINT (2) to interrupt the program, but that signal is blocked in the program. That is the cause of the sshd zombies and the Ctrl+C problem.
Here are two ways to get sshd back to normal:
1). On the machine itself (at the physical console), use /etc/init.d/ssh stop/start to restart ssh so that it uses the right mask.
2). Write a little wrapper program to restart ssh.

caraboides (chhennig) wrote :

How is it possible to set the SigBlk mask?

That would be a workaround, wouldn't it?

Brett Glasson (brett-glasson) wrote :

I can confirm this bug

Wisefox (shangxiaole) wrote :

It seems the latest update fixes the problem. But now I get ext4 and shutdown problems.....

Michael Helmling (supermihi) wrote :

Same here. CTRL-C handling over SSH works again and I don't see zombies, but "poweroff" hangs at "checking for unattended upgrades .. * asking all remaining processes to terminate ". Neither CTRL-ALT-DEL nor anything else helps in this state, only a hard reset. And there are more mysterious things going on with SSH:
/etc/init.d/ssh stop leaves the sshd process running, and even kill <pid> does not work; I need kill -9 <pid>. But this doesn't solve the poweroff problem either.
However, a "reboot" command worked. I've seen this on two machines so far.

Michael Helmling (supermihi) wrote :

Actually ... few reboots later the problem is back. :-(

Colin Watson (cjwatson) wrote :

Might any of you have restarted sshd from within a su session? I don't have definite proof that it's related, but I note that the quoted blocked signal mask corresponds exactly to those signals that su blocks. I wonder if it's due to some PAM session module ...

Filipi Vianna (fvianna) wrote :

I confirm this bug

Wisefox (shangxiaole) wrote :

Well, here is a piece of code as a workaround. Compile the following C code:
========================
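#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main( int argc, char ** argv )
{
        sigset_t oldset, newset;

        if ( argc < 2 )
        {
                fprintf( stderr, "usage: %s command [args...]\n", argv[0] );
                return 1;
        }

        /* Unblock every signal so the command does not inherit a blocked mask. */
        sigfillset( &newset );
        sigprocmask( SIG_UNBLOCK, &newset, &oldset );

        pid_t pid = fork();
        if ( pid == 0 )
        {
                /* Child: run the command given on the command line. */
                execvp( argv[1], &(argv[1]) );
                perror( "execvp" );     /* only reached if execvp fails */
                return 1;
        }
        else if ( pid < 0 )
        {
                perror( "fork" );
                return 1;
        }
        else
        {
                printf("father return;\n");
        }
        return 0;
}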
=========================
gcc -o sigwapper thisfile.c
=========================
Use the program as a workaround to fix the problem. It needs root privileges:
=========================
$ ./sigwapper /etc/init.d/ssh restart
=========================
Check whether Ctrl+C works again.

Brett Glasson (brett-glasson) wrote :

OK, I did that. Apart from a warning regarding printf it went OK but it did not make a difference to the ctrl+c problem

Console output is below;

brettg@jupiter:~$ gcc -o sigwapper sshfix.c
sshfix.c: In function ‘main’:
sshfix.c:17: warning: incompatible implicit declaration of built-in function ‘printf’
brettg@jupiter:~$ sudo ./sigwapper /etc/init.d/ssh restart
[sudo] password for brettg:
father return;
brettg@jupiter:~$ * Restarting OpenBSD Secure Shell server sshd [ OK ]
brettg@jupiter:~$

Wisefox (shangxiaole) wrote :

Hi Brett.
I forgot to mention that:

You need to log out of the current console, then log in to your machine again.

Sorry for that.

Michael Helmling (supermihi) wrote :

No success here either.
# ./sigwrapper /etc/init.d/ssh restart
father return;
root@menk:~# * Restarting OpenBSD Secure Shell server sshd
root@menk:~#

On another ssh session started thereafter, still no Ctrl-C.

Brett Glasson (brett-glasson) wrote :

hehe, no probs Wisefox. I tried a new session and ctrl+c does appear to work as expected now.

Bravo!

Daniel Silva (daniel-silva) wrote :

Are the Ctrl-C and screen issues listed as separate bugs, or is it really the same problem?

Wisefox (shangxiaole) wrote :

Hi Michael

Please paste the result of "cat /proc/`pidof sshd | awk '{print $1}'`/status | grep SigBlk". If the problem is not caused by the signal mask, the workaround can't work. :)

Johannes (johannes-claesson) wrote :

I can also confirm this problem. I can also confirm that "/etc/init.d/ssh restart" is a valid workaround (though time-consuming, with the time probably proportional to the number of processes). When I noticed the problem (after about 15 hours of uptime), there were around 3000 defunct ssh processes, so it would basically take a week of uptime to use up all the PIDs. I don't know for sure, but that could be a bad thing. :P

Takeshi Sone (ts1) wrote :

I am also experiencing this problem, and found that some other processes have this strange signal mask: udevd and dhclient3.
They are not descendants of sshd.
So I guess the culprit is not sshd but something in the system startup scripts.

Takeshi Sone (ts1) wrote :

The culprit is udev.
udev runs "ifup" when it finds eth0, then ifup starts dhclient3 and restarts sshd!

Temporary fix to this:
- Remove "auto eth0" from /etc/network/interfaces
- Add "/sbin/ifup eth0" to /etc/rc.local (before "exit 0", of course)

Now only udev has the strange signal mask on my machine.

Colin Watson (cjwatson) wrote :

Wow, yes, excellent catch. udevd.c:worker_new() blocks a load of signals and nothing puts them back.

Bug tennis :-)

affects: openssh (Ubuntu) → udev (Ubuntu)
Changed in udev (Ubuntu):
milestone: karmic-alpha-5 → karmic-alpha-6
Jeremy Kerr (jk-ozlabs) wrote :

The attached patch fixes the problem for me. From the patch comments:

External programs triggered by events (via RUN=) will inherit udev's
signal mask, which is set to block all but SIGALRM. For most utilities,
this is OK, but if we start daemons from RUN=, we run into trouble
(especially as SIGCHLD is blocked).

This change saves the original sigmask when udev starts, and restores it
just before we exec() the external command.

Signed-off-by: Jeremy Kerr <email address hidden>

I'm not a udev hacker, so not sure if this is the proper fix. I've sent it to linux-hotplug for comment.

Jeremy Kerr (jk-ozlabs) wrote :

Looks like this patch has been accepted to udev:

http://www.spinics.net/lists/hotplug/msg02675.html

The udev.git tree doesn't look like it has been updated though; I assume Kay hasn't pushed recent changes yet.

summary: - sshd zombie processes and strange behavior after karmic upgrade
+ worker signal mask inherited by children
Changed in udev (Ubuntu Karmic):
assignee: nobody → Scott James Remnant (scott)
Kees Cook (kees) on 2009-09-10
Changed in udev (Ubuntu Karmic):
assignee: Scott James Remnant (scott) → Kees Cook (kees)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package udev - 147~-1

---------------
udev (147~-1) karmic; urgency=low

  FFE LP: #427356.

  * Update to GIT HEAD (pre 147 release):
    - worker signal mask corrected. LP: #407428.
    - database format change to avoid path length issues. LP: #377121.
    - multiple devices may not claim the same /dev names, except with
      symlinks
    - NAME="%k" produces a warning
    - symlinks to udevadm no longer resolve to the original command
    - rules updates. LP: #281335, LP: #407940, #420015, #426647.

  * Build-depend on gawk, since build fails with mawk.

  * Replace init scripts with Upstart jobs.
  * debian/control:
    - Add missing ${misc:Depends}
    - Bump build-dependency on debhelper for Upstart-aware dh_installinit

 -- Scott James Remnant <email address hidden> Tue, 15 Sep 2009 03:22:11 +0100

Changed in udev (Ubuntu Karmic):
status: Confirmed → Fix Released
Fil (zlashdot) wrote :

It seems this bug also happens in 9.04 Jaunty Jackalope, where udev is at version 141: my system gets clogged by f**king sshd processes with nothing connected to them.
