Scaling works in 1 direction

Bug #1336309 reported by Charles Butler
This bug affects 4 people
Affects:     hadoop (Juju Charms Collection)
Status:      New
Importance:  Undecided
Assigned to: Unassigned

Bug Description

hadoop reconfigures properly for scale-up operations.

When scaling down, the administrative interface still shows the maximum number of nodes registered. E.g.: scale up to 4 and it displays 4; scale back down to 2 and it still shows 4.

We need to handle the cluster reconfiguration in the -broken / -departed hooks to properly scale hadoop back down.
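
A minimal sketch of what such a hook could look like, assuming the hadoop-2.2.0 layout this charm uses (the hook name, user, and paths are assumptions, not the charm's actual code):

#!/bin/bash
# hooks/datanode-relation-departed -- illustrative sketch only
set -eux

HADOOP_HOME=/home/ubuntu/hadoop/hadoop-2.2.0

# Stop the worker daemons on the departing unit so the master can
# drop it from the cluster view.
su ubuntu -c "$HADOOP_HOME/sbin/hadoop-daemons.sh stop datanode"
su ubuntu -c "$HADOOP_HOME/sbin/yarn-daemons.sh stop nodemanager"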

Tags: orange-box


Revision history for this message
amir sanjar (asanjar) wrote :

Looking into it.

Revision history for this message
amir sanjar (asanjar) wrote :

looks like "stop" hook was never fully implemented.. even in hadoop 1.0 version of the Charm..

Revision history for this message
amir sanjar (asanjar) wrote :

implemented "stop" hook last night. As below log indicates both datanode and nodemanager have started the shutdown process, however for an unknown reason neither namenode or resourcemanager (master node) are notified. Might be either a timing issue or a bug in hadoop 2.2.0. Continue debugging..

2014-07-02 14:51:58 INFO stop + su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/hadoop-daemons.sh stop datanode'
2014-07-02 14:52:03 INFO stop localhost: stopping datanode
2014-07-02 14:52:03 INFO stop + case $mapred_role in
2014-07-02 14:52:03 INFO stop + su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/yarn-daemons.sh stop nodemanager'
2014-07-02 14:52:08 INFO stop localhost: stopping nodemanager
2014-07-02 14:52:08 INFO stop + snapshot_config
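
For reference, the xtrace lines above correspond to a stop-hook fragment roughly like this ($mapred_role and snapshot_config are the charm's own internals; only what the trace shows is reconstructed):

su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/hadoop-daemons.sh stop datanode'
case $mapred_role in
  *)  # the trace only shows this branch being taken
    su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/yarn-daemons.sh stop nodemanager'
    ;;
esac
snapshot_config   # charm-internal helper, visible in the trace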

Revision history for this message
amir sanjar (asanjar) wrote :

Verified it is not a timing issue; using "jps" I could confirm that both the nodemanager and datanode JVM processes on the slave node have terminated.
Next actions:
1) Debug the hadoop base code
2) Engage the hadoop community (already sent a note)
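
For example, a check along these lines (unit name illustrative) shows only jps itself remaining, i.e. no DataNode or NodeManager:

$ juju ssh hadoop-slavecluster/0 jps
2345 Jps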

Revision history for this message
amir sanjar (asanjar) wrote :

Graceful datanode shutdown, as implemented above, is an appropriate way to shut down the datanode and nodemanager and reduce HDFS and YARN metadata corruption. However, after a lengthy discussion with the HDFS community, it will not solve the scale-down issue reported in this bug. As of now, there are only two ways the namenode is notified of a datanode shutdown:
1) Shut down the datanode directly from the namenode (marking the node as decommissioned)
2) Heartbeat timeout for the datanode; the current default value is 10 minutes.
The workaround will be to make the value of "dfs.heartbeat.recheck.interval" configurable by Juju.
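
For context, the ~10-minute default follows from the HDFS heartbeat-expiry formula, 2 * recheck-interval + 10 * heartbeat-interval, with the upstream 2.x defaults:

recheck_ms=300000   # dfs.namenode.heartbeat.recheck-interval (5 minutes)
heartbeat_s=3       # dfs.heartbeat.interval
echo "$(( 2 * recheck_ms / 1000 + 10 * heartbeat_s ))s"   # prints 630s, ~10.5 minutes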

Revision history for this message
amir sanjar (asanjar) wrote :

Correction: the option is dfs_namenode_heartbeat_recheck_interval, not dfs.heartbeat.recheck.interval.

Revision history for this message
Charles Butler (lazypower) wrote : Re: [Bug 1336309] Re: Scaling works in 1 direction

Thanks for working on this, Amir.

So the timeout/heartbeat is the limiting factor here, and making it configurable lets nodes fall out faster for demonstrations of scale? Excellent findings.

Revision history for this message
amir sanjar (asanjar) wrote :

Changed the value of dfs_namenode_heartbeat_recheck_interval from 300000 to 3, for demo purposes only.
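
For example (service name assumed):

juju set hadoop-master dfs_namenode_heartbeat_recheck_interval=3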
