Gate check timed out for core-regress-core-ahw2.2

Bug #1434179 reported by Anuradha
This bug affects 1 person
Affects: Trafodion
Status: New
Importance: High
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

http://logs.trafodion.org/26/1326/2/gate/core-regress-core-ahw2.2/7691eba/console.html

2015-03-19 08:29:48 ***INFO: Hortonworks installed will run traf_hortonworks_mods98
2015-03-19 08:29:48 ***INFO: Detected JAVA version 1.7
2015-03-19 08:29:48 ***INFO: copying hbase-trx-hdp2_2-1.1.0.jar to all nodes
2015-03-19 08:30:58 ***INFO: Restarting HBase to pick up config changes for Trafodion
2015-03-19 08:30:58 ***INFO: Stopping HBase...
2015-03-19 08:30:58 ***INFO: ...polling every 30 seconds until HBase stop is completed.
2015-03-19 08:30:58 ***DEBUG: Ambari command_id=366
2015-03-19 08:31:28 }***INFO: ...polling every 30 seconds until HBase stop is completed.
2015-03-19 08:31:58 }***INFO: ...polling every 30 seconds until HBase stop is completed.
2015-03-19 08:31:58 ***INFO: HBase stop completed
2015-03-19 08:31:58 ***INFO: Restarting HDFS to pick up config changes for Trafodion
2015-03-19 08:31:58 ***INFO: Stopping HDFS...
2015-03-19 08:31:58 ***INFO: ...polling every 30 seconds until HDFS stop is completed.
2015-03-19 08:31:58 ***DEBUG: Ambari command_id=367
2015-03-19 08:32:28 }***INFO: ...polling every 30 seconds until HDFS stop is completed.
2015-03-19 08:32:28 ***INFO: Starting HDFS...
2015-03-19 08:32:28 ***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:32:28 ***DEBUG: Ambari command_id=368
2015-03-19 08:32:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:33:28 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:33:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:34:28 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:34:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:35:28 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:35:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:36:28 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:36:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:37:28 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:37:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:38:28 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 08:38:58 }***INFO: ...polling every 30 seconds until HDFS start is completed.
....
...
2015-03-19 10:59:36 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 11:00:06 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 11:00:36 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 11:01:06 }***INFO: ...polling every 30 seconds until HDFS start is completed.
2015-03-19 11:01:33 Build timed out (after 200 minutes). Marking the build as failed.
2015-03-19 11:01:33 Build was aborted
2015-03-19 11:01:33 [PostBuildScript] - Execution post build scripts.

Tags: installer
Changed in trafodion:
importance: Undecided → High
Steve Varnau (steve-varnau) wrote :

http://logs.trafodion.org/26/1326/2/gate/core-regress-core-ahw2.2/7691eba/Install_Start.log

The log shows that the HDFS start command failed, but the installer kept polling for the command result forever. The test job eventually timed out.
Combing through the Ambari logs, I see that request 1079 failed. Ideally the installer should report that and retrieve the stdout/stderr of the command request. The Ambari logs show that it was specifically the NameNode start that failed.
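
A minimal sketch of the behavior suggested here: a polling loop that stops on a failure status and also gives up after a fixed wait, instead of polling until the build times out. It is not the installer's code. `get_status` is a hypothetical callable standing in for whatever queries Ambari for the request status, and the status names (`COMPLETED`, `FAILED`, `TIMEDOUT`, `ABORTED`) are assumptions that should be checked against the Ambari version in use.

```python
import time

# Assumed status names; verify them against the Ambari version in use.
SUCCESS_STATES = {"COMPLETED"}
FAILURE_STATES = {"FAILED", "TIMEDOUT", "ABORTED"}


class CommandFailed(Exception):
    """Raised when the polled command fails or never reaches a terminal state."""


def wait_for_command(get_status, interval=30.0, max_wait=600.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it reports a terminal state or max_wait elapses.

    get_status is a hypothetical callable that returns the current status
    string of the command request. sleep and clock are injectable for testing.
    """
    deadline = clock() + max_wait
    while True:
        status = get_status()
        if status in SUCCESS_STATES:
            return status
        if status in FAILURE_STATES:
            # Stop immediately instead of polling a command that already failed.
            raise CommandFailed(f"command ended with status {status}")
        if clock() >= deadline:
            raise CommandFailed(
                f"no terminal status after {max_wait} seconds (last: {status})")
        sleep(interval)
```

A caller would catch `CommandFailed`, log the message together with whatever stdout/stderr it can fetch for the request, and exit non-zero so the gate job fails quickly.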

Ambari is not much more helpful about the ultimate failure, but at least it times out after 10 minutes, having retried this command every 10 seconds:
2015-03-19 08:43:03,106 - Retrying after 10 seconds. Reason: Execution of 'su -s /bin/bash - hdfs -c 'export PATH=$PATH:/usr/hdp/current/hadoop-client/bin ; hdfs --config /etc/hadoop/conf dfsadmin -safemode get' | grep 'Safe mode is OFF'' returned 1.

Digging down into the HDFS logs, the NameNode could not contact the DataNode. The DataNode log shows this error:
2015-03-19 08:33:14,884 FATAL datanode.DataNode (DataNode.java:secureMain(2385)) - Exception in secureMain
java.net.BindException: Problem binding to [0.0.0.0:50010] java.net.BindException: Address already in use

So, perhaps when Ambari tells us HDFS is shut down, there may sometimes still be processes hanging around that prevent a proper start-up. That seems to be an Ambari bug.
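
One possible workaround, sketched under assumptions: before asking for an HDFS start, wait until the DataNode port from the log (50010) can actually be bound. The helper names, timeout, and interval below are hypothetical, and this only detects a lingering listener. It does not identify or stop the process that holds the port.

```python
import socket
import time


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can currently bind to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically "Address already in use": something still holds the port.
            return False
        return True


def wait_for_port_free(port, host="0.0.0.0", timeout=60.0, interval=1.0):
    """Poll until host:port can be bound; return False if timeout elapses first."""
    deadline = time.monotonic() + timeout
    while not port_is_free(port, host):
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
    return True
```

For example, the installer could call `wait_for_port_free(50010)` after the HDFS stop completes and report an error if it returns `False`, rather than issuing a start that is likely to hit the same `BindException`.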

Steve Varnau (steve-varnau) wrote :

Not sure what can be done about the port conflict in the DataNode, but polling an already-failed job looks like an installer problem that needs to be fixed.

Changing tag to installer.

tags: added: installer
removed: infrastructure
Changed in trafodion:
milestone: r1.1 → r2.0