[EDP][Vanilla] Oozie process does not start from time to time

Bug #1240511 reported by Vadim Rovachev
32
This bug affects 4 people
Affects Status Importance Assigned to Milestone
Sahara
Fix Released
Critical
Andrew Lazarev

Bug Description

!!!!!!!!! !!!!!!!!
This bug is not always reproduced
!!!!!!!!! !!!!!!!!

Step 1.
Create node group templates(flavor: 1 CRU, 1 GB RAM, 10 GB root disk, 10 GB ):
   1. namenode, jobtracker
   2. oozie, secondarynamenode
   3. tasktracker, datanode

Create cluster template

Create cluster

Step 2.
   Create data sources
   Create job binaries
   Create job

Step 3.
   Launch job and completion wait

Actual result:
   Job status: KILLED

Expected result:
   Job status: SUCCEEDED

---------------------------------------------------------------------------

stacktrase:

Pig script [selenium-pig-job-binary] content:
------------------------
A = load '$INPUT' using PigStorage(':') as (fruit: chararray);
B = foreach A generate com.hadoopbook.pig.Trim(fruit);
store B into '$OUTPUT' USING PigStorage();
------------------------

Current (local) dir = /mnt/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201310161144_0001/attempt_201310161144_0001_m_000000_0/work
Pig command arguments :
             -file
             selenium-pig-job-binary
             -param
             INPUT=swift://selenium-tests.savanna/input
             -param
             OUTPUT=swift://selenium-tests.savanna/output
             -log4jconf
             /mnt/lib/hadoop/mapred/taskTracker/hadoop/jobcache/job_201310161144_0001/attempt_201310161144_0001_m_000000_0/work/piglog4j.properties
             -logfile
             pig-job_201310161144_0001.log
=================================================================

>>> Invoking Pig command line now >>>

Run pig script using Main.main() for Pig version before 0.8

<<< Invocation of Main class completed <<<

Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], main() threw exception, org/apache/pig/Main
java.lang.NoClassDefFoundError: org/apache/pig/Main
 at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:324)
 at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:219)
 at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37)
 at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: org.apache.pig.Main
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 17 more

Oozie Launcher failed, finishing Hadoop job gracefully

Oozie Launcher ends

stderr logs
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.PigMain], main() threw exception, org/apache/pig/Main
java.lang.NoClassDefFoundError: org/apache/pig/Main
 at org.apache.oozie.action.hadoop.PigMain.runPigJob(PigMain.java:324)
 at org.apache.oozie.action.hadoop.PigMain.run(PigMain.java:219)
 at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:37)
 at org.apache.oozie.action.hadoop.PigMain.main(PigMain.java:76)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.oozie.action.hadoop.LauncherMapper.map(LauncherMapper.java:226)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
 at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:415)
 at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: java.lang.ClassNotFoundException: org.apache.pig.Main
 at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
 at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 17 more

---------------------------------------------------------------------------

Revision history for this message
Alexander Ignatov (aignatov) wrote :

I've also We need to identify why this happens from time to time. I think this is High-priority bug because it can affect all EDP related stuff

Changed in savanna:
status: New → Confirmed
summary: - Oozie process does not start
+ [EDP][Vanilla]Oozie process does not start from time to time
Changed in savanna:
importance: Undecided → High
Revision history for this message
Alexander Ignatov (aignatov) wrote : Re: [EDP][Vanilla]Oozie process does not start from time to time

Also I think if this is native Oozie issue than we should have an ability to do oozie sharelib several times.

Revision history for this message
Matthew Farrellee (mattf) wrote :

Workaround -

 0. Login to the master node of your hadoop cluster (where oozie was started)
 1. sudo -s -u hadoop
 2. hadoop fs -rmr share
 3. tar zxf /opt/oozie/oozie-sharelib-4.0.0.tar.gz
 4. hadoop fs -put share .

Changed in savanna:
assignee: nobody → Alexander Ignatov (aignatov)
summary: - [EDP][Vanilla]Oozie process does not start from time to time
+ [EDP][Vanilla] Oozie process does not start from time to time
Changed in savanna:
milestone: none → icehouse-1
Revision history for this message
Alexander Ignatov (aignatov) wrote :

I can't reproduce this bug from 10 times.

Changed in savanna:
importance: High → Low
milestone: icehouse-1 → icehouse-2
status: Confirmed → Incomplete
Revision history for this message
Andrew Lazarev (alazarev) wrote :

It is fully reproducible on demo laptop.
Vote to change priority to High.

Changed in savanna:
status: Incomplete → Confirmed
Revision history for this message
Sergey Lukjanov (slukjanov) wrote :

Andrew, please, add details, we can't repro it in our lab.

Revision history for this message
Andrew Lazarev (alazarev) wrote :

Stacktrace is exactly the same.
"share/lib" HDFS folder contains distcp lib only.
"sharelib create" reports no errors.
Workaround provided by Matthew works.
Reproducibility is about 80% (much more failures than successes).

Note, this is devstack on one laptop. Probably hardware performance matters.
What details could help to debug deeper? As I see the problem is that sharelib returns no errors but doesn't copy libs. And I have no idea why it happens.

Changed in savanna:
importance: Low → Critical
Changed in savanna:
assignee: Alexander Ignatov (aignatov) → Andrew Lazarev (alazarev)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to savanna (master)

Fix proposed to branch: master
Review: https://review.openstack.org/62511

Changed in savanna:
status: Confirmed → In Progress
Revision history for this message
Alexander Ignatov (aignatov) wrote :

As per https://wiki.openstack.org/wiki/Savanna/BugTriage#Task_7:_Review_Critical.2FHigh_bugs_.28bug_supervisors.29 not sure if this bug is critical becuase it only affects Andrew's environment and workaround works for him.

Revision history for this message
Matthew Farrellee (mattf) wrote :

This issue impacts more than just Andrew's laptop environment. According the the bug metadata it impacts 4 people who have bothered to click the button.

This issue smells like a race condition that may only occur on slower hardware configurations, which makes it highly unpredictable.

I was seeing ~66% occurrence rate in a non-laptop environment.

IMHO, this issue also render EDP, which should be zero manual user interaction w/ a cluster, effectively broken.

Revision history for this message
Andrew Lazarev (alazarev) wrote :

Investigation:
1. oozie-setup.sh

* oozie-setup.sh exists with code 0 even when error occurred.
* Filed https://issues.apache.org/jira/browse/OOZIE-1647.
* Filed https://bugs.launchpad.net/savanna/+bug/1262023 to have it fixed in image builder.

2. Switch to manual 'share' copying leads to real error:

2013-12-16 18:24:12.843 34350 TRACE savanna.service.api SubprocessException: RemoteCommandException: Error during command execution: "sudo su - -c "mkdir /tmp/oozielib && tar zxf
 /opt/oozie/oozie-sharelib-4.0.0.tar.gz -C /tmp/oozielib && hadoop fs -put /tmp/oozielib/share share && rm -rf /tmp/oozielib" hadoop"
2013-12-16 18:24:12.843 34350 TRACE savanna.service.api Return code: 255
2013-12-16 18:24:12.843 34350 TRACE savanna.service.api STDERR:
2013-12-16 18:24:12.843 34350 TRACE savanna.service.api 13/12/17 19:19:16 WARN : DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
 File /user/hadoop/share/lib/distcp/oozie-sharelib-distcp-4.0.0.jar could only be replicated to 0 nodes, instead of 1

3. HDFS
The issue itself occurs because Savanna expects that HDFS is running immediately after "/usr/sbin/hadoop-daemon.sh start datanode" execution.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/62966

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to savanna (master)

Reviewed: https://review.openstack.org/62966
Committed: https://git.openstack.org/cgit/openstack/savanna/commit/?id=9fff9fe6fef8cbc137de2d04cf38ab9c4cebe919
Submitter: Jenkins
Branch: master

commit 9fff9fe6fef8cbc137de2d04cf38ab9c4cebe919
Author: Andrew Lazarev <email address hidden>
Date: Thu Dec 19 17:40:25 2013 -0800

    Wait for HDFS readiness after datanode services start

    Bug #1240511 consists of two connected problems:
    1. oozie-setup.sh exists with status code 0 when error occurs.
    2. Savanna expects that HDFS is running immediately
       after "/usr/sbin/hadoop-daemon.sh start datanode" execution

    This patch addresses the second issue.
    The first issue is addressed by https://review.openstack.org/#/c/62511/

    Added check that all datanodes are running before manipulations with HDFS.

    Fixes bug #1240511

    Change-Id: I953512e2b430fba562ee3dceb8c1781aa8ad61bf

Changed in savanna:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in savanna:
status: Fix Committed → Fix Released
Revision history for this message
gws (gzsgwsgwl) wrote :

Andrew Lazarev (alazarev) ‘s workround doesn't work,i am using savanna 0.3.

Thierry Carrez (ttx)
Changed in sahara:
milestone: icehouse-2 → 2014.1
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.