Terasort (Hadoop 2.7.1) failed on Ubuntu 16.04

Bug #1594534 reported by Simon Xiao
This bug affects 2 people
Affects: hadoop (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

When I run TeraSort with Hadoop 2.7.1 on Ubuntu 16.04, on a cluster of 3 slaves and 1 master with 500000000 records, some of the Ubuntu slave nodes become unreachable in the middle of the MapReduce job.
When this happens, we cannot open an SSH connection to those slave nodes (connection refused).

If we log in to an affected slave node, we find:
1. dmesg shows that systemd-journald received SIGTERM;
2. /var/log/syslog contains several errors, including iscsid reporting "semop down failed 22".
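
The two symptoms above can be searched for in saved log output from a slave node. This is a minimal sketch, assuming the messages appear in the logs roughly as paraphrased in this report; the exact wording on a given system may differ, and the names SYMPTOMS and find_symptoms are hypothetical, not part of any tool mentioned here.

```python
import re

# Patterns paraphrased from the report. The exact log wording is an
# assumption, so the matching is deliberately loose and case-insensitive.
SYMPTOMS = {
    "journald_sigterm": re.compile(r"systemd-journald.*SIGTERM", re.IGNORECASE),
    "iscsid_semop": re.compile(r"iscsid.*semop down failed", re.IGNORECASE),
}


def find_symptoms(lines):
    """Return {symptom_name: [matching lines]} for an iterable of log lines."""
    hits = {name: [] for name in SYMPTOMS}
    for line in lines:
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits


# Example use against a saved copy of the log (path as given in the report):
#   with open("/var/log/syslog", errors="replace") as fh:
#       for name, matched in find_symptoms(fh).items():
#           print(name, len(matched))
```

Comparing the timestamps of any matches with the time the job stalled may help show whether the node was shutting down services before it stopped accepting connections.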

This is the Terasort output:

16/06/10 03:39:25 INFO terasort.TeraSort: starting
16/06/10 03:39:27 INFO input.FileInputFormat: Total input paths to process : 2
Spent 336ms computing base-splits.
Spent 9ms computing TeraScheduler splits.
Computing input splits took 348ms
Sampling 10 splits of 38
Making 7 from 100000 sampled records
Computing parititions took 1396ms
Spent 1749ms computing partitions.
16/06/10 03:39:29 INFO client.RMProxy: Connecting to ResourceManager at master/192.168.1.85:8032
16/06/10 03:39:30 INFO mapreduce.JobSubmitter: number of splits:38
16/06/10 03:39:30 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
16/06/10 03:39:30 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1465554943455_0002
16/06/10 03:39:30 INFO impl.YarnClientImpl: Submitted application application_1465554943455_0002
16/06/10 03:39:30 INFO mapreduce.Job: The url to track the job: http://master:8088/proxy/application_1465554943455_0002/
16/06/10 03:39:30 INFO mapreduce.Job: Running job: job_1465554943455_0002
16/06/10 03:40:03 INFO mapreduce.Job: Job job_1465554943455_0002 running in uber mode : false
16/06/10 03:40:03 INFO mapreduce.Job: map 0% reduce 0%
16/06/10 03:44:54 INFO mapreduce.Job: map 1% reduce 0%
16/06/10 03:45:12 INFO mapreduce.Job: map 2% reduce 0%
..................
16/05/25 00:35:53 INFO mapreduce.Job: map 69% reduce 0%
16/05/25 00:35:54 INFO mapreduce.Job: map 75% reduce 0%
16/05/25 00:35:56 INFO mapreduce.Job: map 88% reduce 0%
16/05/25 00:35:57 INFO ipc.Client: Retrying connect to server: ubuntubm10/192.168.1.85:38381. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=3, sleepTime=1000 MILLISECONDS)
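
To see how far the job got before the client started retrying, the output above can be reduced to its last progress line and the first retry line. This is a rough sketch that assumes the line formats shown in this report; the function name summarize is hypothetical.

```python
import re

# Matches lines such as "... INFO mapreduce.Job: map 88% reduce 0%".
PROGRESS = re.compile(r"mapreduce\.Job:\s+map\s+(\d+)%\s+reduce\s+(\d+)%")
# Matches lines such as
# "... ipc.Client: Retrying connect to server: host/ip:port. Already tried 0 time(s); ..."
RETRY = re.compile(r"Retrying connect to server: (\S+)\. Already tried (\d+) time")


def summarize(lines):
    """Return the last map/reduce progress seen before the first retry line."""
    last_progress = None
    for line in lines:
        match = PROGRESS.search(line)
        if match:
            last_progress = (int(match.group(1)), int(match.group(2)))
            continue
        match = RETRY.search(line)
        if match:
            return {
                "last_progress": last_progress,
                "server": match.group(1),
                "attempt": int(match.group(2)),
            }
    # No retry line found: report only the progress reached.
    return {"last_progress": last_progress, "server": None, "attempt": None}
```

Applied to the output in this report, the last progress before the first retry is map 88%, reduce 0%, and the server being retried is ubuntubm10/192.168.1.85:38381.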

Tags: bot-comment
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1594534/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Paul White (paulw2u)
affects: ubuntu → hadoop (Ubuntu)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in hadoop (Ubuntu):
status: New → Confirmed
Kevin W Monroe (kwmonroe) wrote :

Hi Simon - what does your cluster topology look like? Are your Hadoop services running in containers, VMs, or on bare metal? How much RAM do your slaves and master have, and is there any swap space on those machines?

Off the cuff, it sounds like one or more of your machines is running out of memory, but more details about your environment would help to know for sure.
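
One way to gather the RAM and swap figures asked about above is to read /proc/meminfo on each node. This is a minimal sketch, assuming a Linux kernel recent enough to report MemAvailable; the function names parse_meminfo and memory_summary are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: integer value}.

    The fields used below (MemTotal, MemAvailable, SwapTotal) are in kB.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(info):
    """Summarize RAM and swap in GiB from a parsed meminfo dict."""
    kb_per_gib = 1024 * 1024
    return {
        "ram_total_gib": round(info.get("MemTotal", 0) / kb_per_gib, 2),
        "ram_available_gib": round(info.get("MemAvailable", 0) / kb_per_gib, 2),
        "swap_total_gib": round(info.get("SwapTotal", 0) / kb_per_gib, 2),
        "has_swap": info.get("SwapTotal", 0) > 0,
    }


# Example use on a node:
#   with open("/proc/meminfo") as fh:
#       print(memory_summary(parse_meminfo(fh.read())))
```

Running this on the master and each slave while the job is in progress would show whether available memory approaches zero and whether any swap is configured.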
