Terasort (Hadoop 2.7.1) failed on Ubuntu 16.04

Bug #1594534 reported by Simon Xiao on 2016-06-20

This bug affects 2 people

Affects          Status     Importance  Assigned to  Milestone
hadoop (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

When I run Terasort with Hadoop 2.7.1 on Ubuntu 16.04, on a cluster of 3 slaves and 1 master, with 500000000 records, some of the Ubuntu slave nodes become unreachable in the middle of the MapReduce job.
When this happens, we cannot open an SSH connection to those slave nodes (connection refused).

If we log in to an affected slave node, we find:
1. dmesg shows that systemd-journald received SIGTERM;
2. Several errors appear in /var/log/syslog. iscsid reports "semop down failed 22".
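
The two symptoms above can be searched for mechanically once the saved dmesg output and /var/log/syslog are available as text. The following is a minimal Python sketch, not part of the original report: the function name `find_symptoms` is hypothetical, and the regular expressions are assumptions about how the real log lines are worded.

```python
import re

# Patterns for the two symptoms described in the report. The exact wording
# of the real log lines may differ; these regexes are assumptions.
SYMPTOMS = {
    "journald_sigterm": re.compile(r"systemd-journald.*SIGTERM", re.IGNORECASE),
    "iscsid_semop": re.compile(r"iscsid.*semop down failed", re.IGNORECASE),
}

def find_symptoms(log_text):
    """Return {symptom name: [matching lines]} for the given log text."""
    hits = {name: [] for name in SYMPTOMS}
    for line in log_text.splitlines():
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                hits[name].append(line.strip())
    return hits
```

It could be called on the contents of /var/log/syslog or on captured dmesg output from each slave node; the timestamps of the matching lines can then be compared with the time the Terasort job stalled.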

This is the Terasort output:

16/06/10 03:39:25 INFO terasort.TeraSort: starting
16/06/10 03:39:27 INFO input.FileInputFormat: Total input paths to process : 2
Spent 336ms computing base-splits.
Spent 9ms computing TeraScheduler splits.
Computing input splits took 348ms
Sampling 10 splits of 38
Making 7 from 100000 sampled records
Computing parititions took 1396ms
Spent 1749ms computing partitions.
16/06/10 03:39:29 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.1.85:8032
16/06/10 03:39:30 INFO mapreduce.JobSubmitter: number of splits:38
16/06/10 03:39:30 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/06/10 03:39:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1465554943455_0002
16/06/10 03:39:30 INFO impl.YarnClientImpl: Submitted application application_1465554943455_0002
16/06/10 03:39:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1465554943455_0002/
16/06/10 03:39:30 INFO mapreduce.Job: Running job: job_1465554943455_0002
16/06/10 03:40:03 INFO mapreduce.Job: Job job_1465554943455_0002 running in uber mode : false
16/06/10 03:40:03 INFO mapreduce.Job: map 0% reduce 0%
16/06/10 03:44:54 INFO mapreduce.Job: map 1% reduce 0%
16/06/10 03:45:12 INFO mapreduce.Job: map 2% reduce 0%
..................
16/05/25 00:35:53 INFO mapreduce.Job: map 69% reduce 0%
16/05/25 00:35:54 INFO mapreduce.Job: map 75% reduce 0%
16/05/25 00:35:56 INFO mapreduce.Job: map 88% reduce 0%
16/05/25 00:35:57 INFO ipc.Client: Retrying connect to server: ubuntubm10/192.168.1.85:38381. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
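
To see where a run stalled, the job output above can be reduced to the last reported progress and the servers the client was retrying. This is a minimal Python sketch under the assumption that the output follows the line formats shown above; the name `summarize` and both regular expressions are illustrative and not part of Hadoop.

```python
import re

# Line formats are taken from the output quoted above; other Hadoop
# versions may word these lines differently.
PROGRESS = re.compile(r"mapreduce\.Job:\s+map (\d+)% reduce (\d+)%")
RETRY = re.compile(r"ipc\.Client: Retrying connect to server: (\S+?)/([\d.]+):(\d+)")

def summarize(output):
    """Return the last (map %, reduce %) seen and the servers being retried."""
    last_progress = None
    retried = []
    for line in output.splitlines():
        m = PROGRESS.search(line)
        if m:
            last_progress = (int(m.group(1)), int(m.group(2)))
            continue
        m = RETRY.search(line)
        if m:
            retried.append((m.group(1), m.group(2), int(m.group(3))))
    return last_progress, retried
```

Applied to the output above, this reports a last progress of map 88%, reduce 0%, and one retried server, ubuntubm10 at 192.168.1.85 port 38381. That IP address is the same one the earlier RMProxy line shows for master.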

Simon Xiao (sixiao) wrote on 2016-06-20:

1.png (205.6 KiB, image/png)

Simon Xiao (sixiao) wrote on 2016-06-20:

2.png (339.1 KiB, image/png)

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2016-06-20:

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1594534/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment

Paul White (paulw2u) on 2016-06-21

affects: ubuntu → hadoop (Ubuntu)

Launchpad Janitor (janitor) wrote on 2016-06-24:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in hadoop (Ubuntu):
status: New → Confirmed

Kevin W Monroe (kwmonroe) wrote on 2016-08-24:

Hi Simon - what does your cluster topology look like? Are your Hadoop services running in containers, VMs, or on bare metal? How much RAM do your slaves and master have, and is there any swap space on those machines?

Off the cuff, it sounds like one or more of your machines is running out of memory, but more details about your environment would help to know for sure.
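
One way to collect the memory and swap figures asked about here is to read /proc/meminfo on each node. This is a minimal Python sketch assuming a Linux host; the helper names `parse_meminfo` and `memory_summary` are hypothetical, and the choice of fields is only a suggestion.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (such as huge page counts)
    are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info

def memory_summary(info):
    """Return total/available RAM and total/free swap in MiB, where present."""
    fields = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {f: info[f] // 1024 for f in fields if f in info}
```

On a node, `memory_summary(parse_meminfo(open("/proc/meminfo").read()))` would give the figures in MiB. A `SwapTotal` of 0 means the machine has no swap configured.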