[EDP] Exceptions in 'run_job' leave a job Pending forever

Bug #1317205 reported by Trevor McKay
This bug affects 3 people
Affects: Sahara
Status: Fix Released
Importance: High
Assigned to: Andrew Lazarev

Bug Description

We've seen various cases over time where an uncaught exception, raised after a job is created but before it is successfully submitted to Oozie, causes the job to remain "Pending" forever unless it is deleted.

Many of these cases can be handled by adding exception handlers and logging to the "run_job" method (and perhaps associated methods) in job_manager.py.

Since these can be handled simply on the Sahara side with exception handlers, we should treat this as a separate issue from https://bugs.launchpad.net/sahara/+bug/1265068, where the problem is detecting Oozie failures after a job is successfully submitted.
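
To make the proposed handling concrete, here is a minimal sketch of the pattern, not the actual Sahara code. The names JOB_EXECUTIONS, update_status, the submit callable, and the simplified run_job signature are all hypothetical stand-ins; only the idea (catch the exception, log it, move the job to 'FAILED') comes from this report and the merged fix.

```python
import logging

LOG = logging.getLogger(__name__)

# Hypothetical in-memory stand-in for the job execution records.
JOB_EXECUTIONS = {}


def update_status(job_execution_id, status):
    # Hypothetical helper; a real service would go through its DB layer.
    JOB_EXECUTIONS[job_execution_id]["status"] = status


def run_job(job_execution_id, submit):
    """Run the pre-submission steps; mark the job FAILED on any exception.

    `submit` is an assumed callable that uploads the job files, submits
    the workflow, and returns the Oozie job id.
    """
    try:
        oozie_job_id = submit(job_execution_id)
    except Exception:
        # Log the traceback and leave the job in a terminal state
        # instead of letting it sit in Pending.
        LOG.exception("Can't run job execution '%s'", job_execution_id)
        update_status(job_execution_id, "FAILED")
        return None
    # On success only the Oozie id is recorded; later status changes
    # would come from polling Oozie.
    JOB_EXECUTIONS[job_execution_id]["oozie_job_id"] = oozie_job_id
    return oozie_job_id


def failing_upload(job_execution_id):
    # Simulates the kind of failure shown in the traceback in this report.
    raise TypeError("Expected unicode or bytes, got None")


JOB_EXECUTIONS["je-1"] = {"status": "PENDING"}
run_job("je-1", failing_upload)
```

With the handler in place, the simulated upload failure leaves "je-1" in FAILED rather than PENDING, and the traceback still reaches the log.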

A few instances are reported below and in following comments, but they are all instances of essentially the same problem:

------

Saw this on a job launch, and I want to record that it happened. Two potential problems here:

1) Why did the HDFS write fail?
2) This is another case of a job stuck in Pending forever.

2014-05-07 17:20:05.125 639 TRACE sahara.context Traceback (most recent call last):
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/context.py", line 120, in _wrapper
2014-05-07 17:20:05.125 639 TRACE sahara.context func(*args, **kwargs)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/service/edp/job_manager.py", line 144, in run_job
2014-05-07 17:20:05.125 639 TRACE sahara.context upload_job_files(oozie_server, wf_dir, job, hdfs_user)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/service/edp/job_manager.py", line 187, in upload_job_files
2014-05-07 17:20:05.125 639 TRACE sahara.context h.put_file_to_hdfs(r, raw_data, main.name, job_dir, hdfs_user)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/service/edp/hdfs_helper.py", line 30, in put_file_to_hdfs
2014-05-07 17:20:05.125 639 TRACE sahara.context r.write_file_to('/tmp/%s' % file_name, file)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/utils/ssh_remote.py", line 371, in write_file_to
2014-05-07 17:20:05.125 639 TRACE sahara.context self._run_s(_write_file_to, timeout, remote_file, data, run_as_root)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/utils/ssh_remote.py", line 428, in _run_s
2014-05-07 17:20:05.125 639 TRACE sahara.context return self._run_with_log(func, timeout, *args, **kwargs)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/utils/ssh_remote.py", line 334, in _run_with_log
2014-05-07 17:20:05.125 639 TRACE sahara.context return self._run(func, *args, **kwargs)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/utils/ssh_remote.py", line 425, in _run
2014-05-07 17:20:05.125 639 TRACE sahara.context return procutils.run_in_subprocess(self.proc, func, args, kwargs)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/croberts/os1/sahara/sahara/utils/procutils.py", line 52, in run_in_subprocess
2014-05-07 17:20:05.125 639 TRACE sahara.context raise SubprocessException(result['exception'])
2014-05-07 17:20:05.125 639 TRACE sahara.context SubprocessException: TypeError: Expected unicode or bytes, got None
2014-05-07 17:20:05.125 639 TRACE sahara.context

-----

Another case, where run_job fails while trying to contact the Oozie server. This should be treated as a Sahara-side failure:

https://bugs.launchpad.net/sahara/+bug/1265068/comments/10

Changed in sahara:
milestone: none → juno-1
Trevor McKay (tmckay)
summary: - [EDP] Sahara, write to hdfs fails, job stays Pending
+ [EDP] Exceptions in 'run_job' leave a job Pending forever
Trevor McKay (tmckay)
description: updated
Changed in sahara:
importance: Undecided → High
status: New → Triaged
assignee: nobody → Trevor McKay (tmckay)
Changed in sahara:
assignee: Trevor McKay (tmckay) → Andrew Lazarev (alazarev)
Changed in sahara:
status: Triaged → In Progress
Changed in sahara:
milestone: juno-1 → juno-2
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara (master)

Reviewed: https://review.openstack.org/97637
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=53e803bb8b539545c1e9c4a6618a1995f7e5bc68
Submitter: Jenkins
Branch: master

commit 53e803bb8b539545c1e9c4a6618a1995f7e5bc68
Author: Andrew Lazarev <email address hidden>
Date: Tue Jun 3 15:12:56 2014 -0700

    Changing job excecution status to 'FAILED' in case of exception

    Change-Id: Iabf3cffbfda7fe9fd2215fbdaf9a596b08d5bbde
    Closes-Bug: #1306720
    Closes-Bug: #1317205

Changed in sahara:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to sahara (master)

Fix proposed to branch: master
Review: https://review.openstack.org/99790

Changed in sahara:
milestone: juno-2 → juno-1
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to sahara (master)

Reviewed: https://review.openstack.org/99790
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=f092140d18a1c43ba096fed00cc1c622033f0fcf
Submitter: Jenkins
Branch: master

commit f092140d18a1c43ba096fed00cc1c622033f0fcf
Author: Andrew Lazarev <email address hidden>
Date: Thu Jun 12 15:50:49 2014 -0700

    Fixed status update for job execution

    Fixed wrong status update for job execution introduced at
    https://review.openstack.org/#/c/97637/

    Change-Id: I4067667c19c8fd2e3eea72b2c1d64bba6016c19f
    Closes-Bug: #1306720
    Closes-Bug: #1317205

Thierry Carrez (ttx)
Changed in sahara:
milestone: juno-1 → 2014.2