hadoop crash: /bin/kill in ubuntu16.04 has bug in killing process group

Bug #1610499 reported by Dongdong88 on 2016-08-06
This bug affects 6 people
Affects: procps (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

When I run Hadoop on Ubuntu 16.04, my SSH session exits and every process belonging to the hadoop user is killed. After debugging, I found that /bin/kill in Ubuntu 16.04 has a bug in how it kills process groups.

Ubuntu version is:

Description: Ubuntu 16.04.1 LTS
Release: 16.04

(1) How to reproduce
The bug is easy to reproduce: on Ubuntu 16.04, run "/bin/kill -15 -12345", or any command of the form "/bin/kill -15 -1xxxx", and it kills every process the caller is allowed to signal.

(2) Cause analysis
The /bin/kill code in Ubuntu 16.04 comes from procps-3.3.10. When I run "/bin/kill -15 -1xxxx", it actually sends signal 15 to PID -1, and a PID of -1 means every process the caller is allowed to signal.

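(3) The bug is in procps-3.3.10/skill.c. I think the line "pid = (long)('0' - optopt)" is wrong: optopt holds only the single character that follows the dash, so for "-1xxxx" it is '1' and the expression evaluates to -1, whatever digits come after it.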

static void __attribute__ ((__noreturn__)) kill_main(int argc, char **argv)
{
        /* ... */
                case '?':
                        if (!isdigit(optopt)) {
                                xwarnx(_("invalid argument %c"), optopt);
                                kill_usage(stderr);
                        } else {
                                /* Special case for signal digit negative
                                 * PIDs */
                                pid = (long)('0' - optopt);

                                if (kill((pid_t)pid, signo) != 0)
                                        exitvalue = EXIT_FAILURE;
                                exit(exitvalue);
                        }
                        loop = 0;
        /* ... */
}

(4) How Hadoop triggers the bug
When resources are tight, or when a Hadoop container loses its connection for a while, the NodeManager kills that container by sending a signal to its JVM process. Killing a task and then re-executing it is normal Hadoop behavior, but with this bug the kill takes down every process belonging to the hadoop user.

(5) Workaround
I copied /bin/kill from Ubuntu 14.04 over /bin/kill in Ubuntu 16.04, and that works. I also think it would be better to ask the procps-3.3.10 maintainers to fix the bug, but I don't know how to contact them.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in alsa-driver (Ubuntu):
status: New → Confirmed
Zak Peirce (plastikman) wrote :

This also affects an out-of-repository Hadoop install.

I found this question, but it did not solve the problem:

http://stackoverflow.com/questions/38419078/logouts-while-running-hadoop-under-ubuntu-16-04

Replacing /bin/kill with the binary from 14.04 LTS solved the issue for me. I am uncomfortable with the workaround, though, so I will have to revert these systems to 14.04.

Dongdong88 (groden) wrote :

To Zak Peirce (plastikman):
  I also found this issue while running Hadoop. After I replaced /bin/kill with the binary from 14.04, Hadoop was fine and I have not hit any other issue, so you can keep using Ubuntu 16.04.

Dongdong88 (groden) on 2016-09-08
description: updated
Zak Peirce (plastikman) wrote :

While I fully appreciate the work you have put into this bug report, I do not feel comfortable running with this workaround in production.

If and when this is resolved I will upgrade these nodes back to 16.04.

xin (pursue) wrote :

I ran into exactly the same problem using Hadoop on Ubuntu 16.04 and, following Dongdong88's suggestion, copied /bin/kill from Ubuntu 14.04 to Ubuntu 16.04. However, that made /bin/kill unusable because of a missing shared library (the shell built-in "kill" still works, though). I created a symlink named libprocps.so.3 in /lib/x86_64-linux-gnu pointing to libprocps.so.4.0.0, but I am not sure whether that will bring the issue back. If it does, we may try copying libprocps.so.3.0.0 to Ubuntu 16.04 and using it instead. I haven't had a chance to test this yet.

Dongdong88 (groden) wrote :

To xin (pursue) :

  You can also download the procps-3.3.10 source code and compile it on Ubuntu 16.04, then replace /bin/kill with the binary you compiled. That way you don't need to copy libprocps.so.3.

Johnny Klonaris (jawknee) wrote :

I've been seeing the same problem: numerous services being shut down, including cron, ssh, and even a script I have running on the console to recover. The syslog shows various systemd "targets" being shut down, including Default, Basic System, Timers, Paths, Sockets...

This correlates with running a script on our system (one that has been running for many years on many systems) that finds a ser2net connection and sends it a kill signal before opening a remote console.

This has been a nightmare on our Ubuntu 16 systems since upgrading from 14, and nearly a daily occurrence on one of them.

Dongdong88 (groden) on 2016-09-27
summary: - /bin/kill in ubuntu16.04 has bug in killing process group
+ hadoop crash: /bin/kill in ubuntu16.04 has bug in killing process group
Jochen (jradmacher) on 2016-10-07
affects: alsa-driver (Ubuntu) → procps (Ubuntu)
shanmuga (shanmuga) wrote :

Hi @groden,

I am running Hadoop 2.7.3 in pseudo-distributed mode on Ubuntu 16.04 in a virtual machine, and I am facing the same issue: my Ubuntu session logs off whenever I submit a new Hadoop job. I would like to try your workaround. Can you provide a link or explain how to download the procps-3.3.10 source code and override /bin/kill with it?

I am a beginner with ubuntu. Please help!

Dongdong88 (groden) wrote :

Hi shanmuga,

(1) Download the source code
sudo apt-get source procps

(2) Install the build dependencies
sudo apt-get build-dep procps

(3) Compile procps
cd procps-3.3.10
sudo dpkg-buildpackage

Then you will have a kill binary.

Wow, this has been driving me nuts. Also new to ubuntu; how would one obtain the 14.04 kill binary? Is there a way short of downloading the full distribution?

Jacek Sroka (zephod-4) wrote :

I still get this bug on 16.04 LTS after updating my system. I've looked at https://bugs.launchpad.net/ubuntu/+source/procps/+bug/1637026 and confirmed that I have procps - 2:3.3.10-4ubuntu2.2

Volker Mischo (volker-m) wrote :

Hello, I've got the same problem with DB2. When I leave the system as the DB instance owner, all processes are killed. I guess it depends on the new systemd in 16.04 (see /var/log/syslog: "received SIGRTMIN+24 from PID xxxx (kill)"); maybe some sysctl parameters such as kernel.shmall are involved too. It would be nice if a maintainer finally fixed this problem.
