hadoop crash: /bin/kill in ubuntu16.04 has bug in killing process group

Bug #1610499 reported by Dongdong88
This bug affects 6 people
Affects          Status     Importance  Assigned to  Milestone
procps (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

When I run Hadoop on Ubuntu 16.04, the SSH session exits and every process belonging to the hadoop user is killed. Through debugging I found that /bin/kill in Ubuntu 16.04 has a bug when it is asked to kill a process group.

Ubuntu version is:

Description: Ubuntu 16.04.1 LTS
Release: 16.04

(1) How to reproduce
It is easy to reproduce: run "/bin/kill -15 -12345", or anything of the form "/bin/kill -15 -1xxxx", on Ubuntu 16.04. It kills every process the calling user is allowed to signal.

(2)Cause analysis
The /bin/kill in Ubuntu 16.04 comes from procps-3.3.10. When I run "/bin/kill -15 -1xxxx", it actually sends signal 15 to pid -1, and a pid of -1 means every process the caller is permitted to signal.
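
To make the difference between the intended target and the actual target concrete, here is a minimal sketch of how kill(2) interprets its pid argument. The enum and function names are hypothetical and only illustrate the rules; they are not part of procps.

```c
#include <assert.h>
#include <sys/types.h>

/* Hypothetical names, for illustration only. */
enum kill_target {
    KILL_ONE_PROCESS,          /* pid > 0: exactly that process */
    KILL_CALLER_GROUP,         /* pid == 0: the caller's own process group */
    KILL_EVERYTHING_PERMITTED, /* pid == -1: every process the caller may signal */
    KILL_ONE_GROUP             /* pid < -1: the process group with id -pid */
};

/* Classify a pid argument the way kill(2) does. */
static enum kill_target classify_kill_target(pid_t pid)
{
    if (pid > 0)
        return KILL_ONE_PROCESS;
    if (pid == 0)
        return KILL_CALLER_GROUP;
    if (pid == -1)
        return KILL_EVERYTHING_PERMITTED;
    return KILL_ONE_GROUP;
}

/* The two cases from this report: what was asked for, and what was sent. */
static void check_report_cases(void)
{
    assert(classify_kill_target((pid_t)-12345) == KILL_ONE_GROUP);
    assert(classify_kill_target((pid_t)-1) == KILL_EVERYTHING_PERMITTED);
}
```

The request "-12345" should land in the last case (one process group), but the value that reaches kill(2) is -1, which is the broadcast case.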

(3) The bug is in procps-3.3.10/skill.c. I think the line "pid = (long)('0' - optopt)" is wrong: optopt holds only the first character after the dash, so for "-1xxxx" it is '1' and pid becomes -1.

static void __attribute__ ((__noreturn__)) kill_main(int argc, char **argv)
{
        /* ... */
                case '?':
                        if (!isdigit(optopt)) {
                                xwarnx(_("invalid argument %c"), optopt);
                                kill_usage(stderr);
                        } else {
                                /* Special case for signal digit negative
                                 * PIDs */
                                pid = (long)('0' - optopt);
                                if (kill((pid_t)pid, signo) != 0)
                                        exitvalue = EXIT_FAILURE;
                                exit(exitvalue);
                        }
                        loop=0;
        /* ... */
}
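
The arithmetic in the quoted branch can be reproduced in isolation without sending any signal. The sketch below is not the procps fix; both function names are hypothetical. The first mirrors the expression from skill.c, and the second shows one possible alternative that parses the whole argument with strtol.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Mirrors the expression quoted from skill.c: only one digit is used.
 * optopt_char stands in for getopt's optopt (hypothetical parameter name). */
static long pid_from_first_digit(int optopt_char)
{
    assert(isdigit((unsigned char)optopt_char));
    return (long)('0' - optopt_char);
}

/* Hypothetical alternative: parse the complete argument, e.g. "-12345".
 * Returns 0 and stores the value on success, -1 on malformed input. */
static int pid_from_whole_argument(const char *arg, long *out)
{
    char *end = NULL;
    long value;

    errno = 0;
    value = strtol(arg, &end, 10);
    if (errno != 0 || end == arg || *end != '\0')
        return -1;
    *out = value;
    return 0;
}
```

For the argument "-12345", the first function is called with '1' and yields -1, while the second yields -12345, the process group that was actually requested.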

(4) Why Hadoop triggers it
 When resources are tight, or a Hadoop container loses its connection for a while, the NodeManager kills the container by sending a signal to its JVM process. It is normal for Hadoop to kill a task and then re-execute it, but with this kill bug every process belonging to the hadoop user is killed.
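
For comparison, a program that wants to signal a whole process group does not have to go through /bin/kill argument parsing at all; it can call kill(2) with a negated group id. The sketch below is only an illustration under that assumption. It is not Hadoop's code, and signal_process_group is a hypothetical helper name.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: send signo to every process in group pgid.
 * Group ids <= 1 are rejected so that the call can never turn into
 * kill(0, ...) (caller's own group) or kill(-1, ...) (everything). */
static int signal_process_group(pid_t pgid, int signo)
{
    assert(signo >= 0);
    if (pgid <= 1) {
        errno = EINVAL;
        return -1;
    }
    return kill(-pgid, signo);
}
```

The guard on pgid is the design point: the broadcast value -1 is exactly what the buggy parsing produces, so a caller that validates the group id before negating it cannot hit that case by accident.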

(5) Workaround
 I copied /bin/kill from Ubuntu 14.04 over /bin/kill on Ubuntu 16.04, and that works. I also think it would be better to ask the procps-3.3.10 maintainers to fix the bug, but I don't know how to contact them.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in alsa-driver (Ubuntu):
status: New → Confirmed
Revision history for this message
Zak Peirce (plastikman) wrote :

This also affects an out of repository Hadoop install.

I found this question, but it did not solve the problem.

http://stackoverflow.com/questions/38419078/logouts-while-running-hadoop-under-ubuntu-16-04

Replacing /bin/kill with the binary from 14.04 LTS solved the issue for me. I am uncomfortable with the workaround, so I will have to revert these systems to 14.04.

Revision history for this message
Dongdong88 (groden) wrote :

To Zak Peirce (plastikman) :
  I also found this issue when running Hadoop. After I replaced /bin/kill with the binary from 14.04, Hadoop was fine and I have not hit any other issue, so you can keep using Ubuntu 16.04.

Dongdong88 (groden)
description: updated
description: updated
Revision history for this message
Zak Peirce (plastikman) wrote :

While I fully appreciate the work you have put into this bug report, I do not feel comfortable running with this workaround in production.

If and when this is resolved I will upgrade these nodes back to 16.04.

Dongdong88 (groden)
description: updated
Revision history for this message
xin (pursue) wrote :

I hit exactly the same problem when using Hadoop on Ubuntu 16.04, and copied /bin/kill from Ubuntu 14.04 to Ubuntu 16.04 following Dongdong88's suggestion. However, that leaves /bin/kill unusable because a shared library is missing (the shell's built-in "kill" still works, though). I created a symlink libprocps.so.3 in /lib/x86_64-linux-gnu pointing to libprocps.so.4.0.0, but I am not sure whether that will bring the issue back. If it does, we could try copying libprocps.so.3.0.0 over and using it on Ubuntu 16.04. I haven't had a chance to test it yet.

Revision history for this message
Dongdong88 (groden) wrote :

To xin (pursue) :

  You can also download the procps-3.3.10 source code and compile it on Ubuntu 16.04, then replace /bin/kill with the binary you built. That way you don't need to copy libprocps.so.3.

Revision history for this message
Johnny Klonaris (jawknee) wrote :

I've been seeing the same problem: numerous services being shut down, including cron, ssh and even a script I have running on the console to recover. The syslog shows various systemd "targets" being shut down, including: Default, Basic System, Timers, Paths, Sockets...

This has correlated to running a script on our system (that has been running for many years on many systems) that finds and sends a kill signal to a ser2net connection before opening a remote console.

This has been a nightmare on our Ubuntu 16 systems since upgrading from 14, and nearly a daily occurrence on one of them.

Dongdong88 (groden)
summary: - /bin/kill in ubuntu16.04 has bug in killing process group
+ hadoop crash: /bin/kill in ubuntu16.04 has bug in killing process group
Jochen (jradmacher)
affects: alsa-driver (Ubuntu) → procps (Ubuntu)
Revision history for this message
shanmuga (shanmuga) wrote :

Hi @groden,

I am running Hadoop 2.7.3 in pseudo-distributed mode on Ubuntu 16.04 in a virtual machine, and I am facing the same issue: Ubuntu logs me off whenever I submit a new Hadoop job. I would like to try your workaround. Can you provide a link or explain how to download the procps-3.3.10 source code and replace /bin/kill?

I am a beginner with ubuntu. Please help!

Revision history for this message
Dongdong88 (groden) wrote :

Hi shanmuga,

(1) Download the source code
sudo apt-get source procps

(2) Install the build dependencies
sudo apt-get build-dep procps

(3) Compile procps
cd procps-3.3.10
sudo dpkg-buildpackage

Then you will have a kill binary.

Revision history for this message
Jeff Schoenborn (jeffschoenborn) wrote :

Wow, this has been driving me nuts. Also new to ubuntu; how would one obtain the 14.04 kill binary? Is there a way short of downloading the full distribution?

Revision history for this message
Jacek Sroka (zephod-4) wrote :

I still get this bug on 16.04 LTS after updating my system. I've looked at https://bugs.launchpad.net/ubuntu/+source/procps/+bug/1637026 and confirmed that I have procps - 2:3.3.10-4ubuntu2.2

Revision history for this message
Volker Mischo (volker-m) wrote :

Hello, I've got the same problem with DB2. When I leave the system as the DB instance owner, all processes are killed. I guess it depends on the new systemd in 16.04 (see /var/log/syslog: "received SIGRTMIN+24 from PID xxxx (kill)"). Maybe some sysctl parameters such as kernel.shmall are involved too. It would be nice if a maintainer finally fixed this problem.
