Scaling works in 1 direction

Bug #1336309 reported by Charles Butler
This bug affects 4 people
Affects:     hadoop (Juju Charms Collection)
Status:      New
Importance:  Undecided
Assigned to: Unassigned

Bug Description

hadoop reconfigures properly for scale-up operations.

When scaling down, the administrative interface still shows the maximum number of nodes registered. E.g.: scale up to 4 and it displays 4; scale back down to 2 and it still shows 4.

We need to handle the cluster reconfiguration in the -broken / -departed hooks to properly scale hadoop back down.
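
A minimal sketch of what such a hook could look like, assuming the hadoop-2.2.0 layout this charm uses (the hook name, user, and paths are assumptions, not the charm's actual code):

#!/bin/bash
# hooks/datanode-relation-departed -- illustrative sketch only
set -eux

HADOOP_HOME=/home/ubuntu/hadoop/hadoop-2.2.0

# Stop the worker daemons on the departing unit so the master can
# drop it from the cluster view.
su ubuntu -c "$HADOOP_HOME/sbin/hadoop-daemons.sh stop datanode"
su ubuntu -c "$HADOOP_HOME/sbin/yarn-daemons.sh stop nodemanager"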

Tags: orange-box


Revision history for this message
amir sanjar (asanjar) wrote :

Looking into it.

Revision history for this message
amir sanjar (asanjar) wrote :

looks like "stop" hook was never fully implemented.. even in hadoop 1.0 version of the Charm..

Revision history for this message
amir sanjar (asanjar) wrote :

implemented "stop" hook last night. As below log indicates both datanode and nodemanager have started the shutdown process, however for an unknown reason neither namenode or resourcemanager (master node) are notified. Might be either a timing issue or a bug in hadoop 2.2.0. Continue debugging..

2014-07-02 14:51:58 INFO stop + su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/hadoop-daemons.sh stop datanode'
2014-07-02 14:52:03 INFO stop localhost: stopping datanode
2014-07-02 14:52:03 INFO stop + case $mapred_role in
2014-07-02 14:52:03 INFO stop + su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/yarn-daemons.sh stop nodemanager'
2014-07-02 14:52:08 INFO stop localhost: stopping nodemanager
2014-07-02 14:52:08 INFO stop + snapshot_config
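
For reference, the xtrace lines above correspond to a stop-hook fragment roughly like this ($mapred_role and snapshot_config are the charm's own internals; only what the trace shows is reconstructed):

su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/hadoop-daemons.sh stop datanode'
case $mapred_role in
  *)  # the trace only shows this branch being taken
    su ubuntu -c '/home/ubuntu/hadoop/hadoop-2.2.0/sbin/yarn-daemons.sh stop nodemanager'
    ;;
esac
snapshot_config   # charm-internal helper, visible in the trace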

Revision history for this message
amir sanjar (asanjar) wrote :

Verified it is not a timing issue; using "jps" I could confirm that both the nodemanager and datanode JVM processes on the slave node have terminated.
Next actions:
1) Debug the hadoop base code
2) Engage the hadoop community (already sent a note)
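
For example, a check along these lines (unit name illustrative) shows only jps itself remaining, i.e. no DataNode or NodeManager:

$ juju ssh hadoop-slavecluster/0 jps
2345 Jps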

Revision history for this message
amir sanjar (asanjar) wrote :

Graceful datanode shutdown, as implemented above, is an appropriate way to shut down the datanode and nodemanager and reduce HDFS and YARN metadata corruption. However, after a lengthy discussion with the HDFS community, it will not solve the scale-down issue reported in this bug. As of now, there are only two ways the namenode is notified of a datanode shutdown:
1) Shut down the datanode directly from the namenode (marking the node as decommissioned)
2) Heartbeat timeout for the datanode; the current default value is 10 minutes.
The workaround will be to make the value of "dfs.heartbeat.recheck.interval" configurable by Juju.
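
For context, the ~10-minute default follows from the HDFS heartbeat-expiry formula, 2 * recheck-interval + 10 * heartbeat-interval, with the upstream 2.x defaults:

recheck_ms=300000   # dfs.namenode.heartbeat.recheck-interval (5 minutes)
heartbeat_s=3       # dfs.heartbeat.interval
echo "$(( 2 * recheck_ms / 1000 + 10 * heartbeat_s ))s"   # prints 630s, ~10.5 minutes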

Revision history for this message
amir sanjar (asanjar) wrote :

Correction: the option is dfs_namenode_heartbeat_recheck_interval, not dfs.heartbeat.recheck.interval.

Revision history for this message
Charles Butler (lazypower) wrote : Re: [Bug 1336309] Re: Scaling works in 1 direction

Thanks for working on this, Amir.

So the timeout/heartbeat is the limiting factor here, and making it configurable lets nodes fall out faster for demonstrations of scale? Excellent findings.

Revision history for this message
amir sanjar (asanjar) wrote :

Changed the value of dfs_namenode_heartbeat_recheck_interval from 300000 to 3, for demo purposes only.
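
For example (service name assumed):

juju set hadoop-master dfs_namenode_heartbeat_recheck_interval=3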
