[EDP] Exceptions in 'run_job' leave a job Pending forever
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Sahara | Fix Released | High | Andrew Lazarev | | |
Bug Description
We've seen various cases over time where an uncaught exception, raised after a job is created but before it is successfully submitted to Oozie, causes the job to remain "Pending" forever unless it is deleted.
Many of these cases can be handled by adding exception handlers and logging to the "run_job" method (and perhaps associated methods) in job_manager.py.
Since these can be handled simply on the Sahara side with exception handlers,
we should treat it as a separate issue from https:/
is successfully submitted.
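To illustrate the kind of handler meant here, the following is a minimal sketch, not the actual job_manager.py code. The `JobExecution` class, the `_submit` helper, the `upload` and `start` callables, and the "RUNNING" status name are all hypothetical stand-ins for illustration; only the idea of catching the exception, logging it, and moving the job out of "Pending" comes from this report.

```python
import logging

LOG = logging.getLogger(__name__)


class JobExecution:
    """Hypothetical stand-in for a job execution record (not Sahara's model)."""

    def __init__(self, job_id):
        self.id = job_id
        self.status = "PENDING"
        self.info = {}


def _submit(job_execution, upload, start):
    # Hypothetical helper: upload the job files, then hand the job to Oozie.
    # Either step may raise (HDFS write failure, Oozie server unreachable).
    upload(job_execution)
    job_execution.info["oozie_job_id"] = start(job_execution)
    job_execution.status = "RUNNING"  # assumed status name


def run_job(job_execution, upload, start):
    """Submit the job; on any exception, log it and mark the job failed."""
    try:
        _submit(job_execution, upload, start)
    except Exception as ex:
        # Without this handler the record would stay PENDING forever.
        LOG.exception("Can't run job execution %s", job_execution.id)
        job_execution.status = "FAILED"
        job_execution.info["error"] = str(ex)
    return job_execution
```

With a handler like this, a failed HDFS upload or an unreachable Oozie server would leave the job in a terminal state with the error recorded, rather than Pending.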
A few instances are reported below and in following comments, but they are
all instances of essentially the same problem:
------
Saw this on a job launch, and I want to record that it happened. Two potential problems here:
1) Why did the hdfs write fail?
2) This is another case of a job stuck in Pending forever
2014-05-07 17:20:05.125 639 TRACE sahara.context Traceback (most recent call last):
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context func(*args, **kwargs)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context upload_
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context h.put_file_
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context r.write_
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context self._run_
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context return self._run_
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context return self._run(func, *args, **kwargs)
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context return procutils.
2014-05-07 17:20:05.125 639 TRACE sahara.context File "/home/
2014-05-07 17:20:05.125 639 TRACE sahara.context raise SubprocessExcep
2014-05-07 17:20:05.125 639 TRACE sahara.context SubprocessExcep
2014-05-07 17:20:05.125 639 TRACE sahara.context
-----
Another case where run_job fails while trying to contact the Oozie server. This should be treated as a Sahara-side failure.
Changed in sahara: | |
milestone: | none → juno-1 |
summary: |
- [EDP] Sahara, write to hdfs fails, job stays Pending
+ [EDP] Exceptions in 'run_job' leave a job Pending forever |
description: | updated |
Changed in sahara: | |
importance: | Undecided → High |
status: | New → Triaged |
assignee: | nobody → Trevor McKay (tmckay) |
Changed in sahara: | |
assignee: | Trevor McKay (tmckay) → Andrew Lazarev (alazarev) |
Changed in sahara: | |
status: | Triaged → In Progress |
Changed in sahara: | |
milestone: | juno-1 → juno-2 |
Changed in sahara: | |
milestone: | juno-2 → juno-1 |
status: | Fix Committed → Fix Released |
Changed in sahara: | |
milestone: | juno-1 → 2014.2 |
Reviewed: https://review.openstack.org/97637
Committed: https://git.openstack.org/cgit/openstack/sahara/commit/?id=53e803bb8b539545c1e9c4a6618a1995f7e5bc68
Submitter: Jenkins
Branch: master
commit 53e803bb8b539545c1e9c4a6618a1995f7e5bc68
Author: Andrew Lazarev <email address hidden>
Date: Tue Jun 3 15:12:56 2014 -0700
Changing job excecution status to 'FAILED' in case of exception
Change-Id: Iabf3cffbfda7fe9fd2215fbdaf9a596b08d5bbde
Closes-Bug: #1306720
Closes-Bug: #1317205