[EDP] Execution stays 'Pending' after failure on oozie side

Bug #1265068 reported by Andrew Lazarev
This bug affects 2 people
Affects: Sahara
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: juno-2

Bug Description

Steps to reproduce:
1. Misconfigure Oozie (there are many ways to do this)
2. Run a job

Observed behavior:
The execution remains in 'Pending' forever, with no errors on the Savanna side.

Expected behavior:
The execution moves to the Error state.

Tags: edp
Revision history for this message
Matthew Farrellee (mattf) wrote :

Do we have a way to poll for a functional oozie without running a job that should take a known amount of time?

Revision history for this message
Andrew Lazarev (alazarev) wrote :

Matthew, I don't know. I filed the bug from a user's perspective.

Revision history for this message
Matthew Farrellee (mattf) wrote :

It seems like the options are to (0) find a way to detect a functional Oozie or (1) decide on a timeout for pending.

I'd rather do (0), but a functioning oozie may still fail if the cluster itself is not functional. The only way to detect a functioning cluster is to run a simple job through oozie. However, a functioning cluster may only run one job before crashing, which means ultimately (1) is necessary.

If urgent, I'd proceed w/ (1), after careful thought about what's too long of a PENDING wait.

If not urgent, or as a long-term approach, we should look at doing an Oozie status test and potentially running a simple job.
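
For reference, here is a rough sketch of what an Oozie status test for option (0) could look like, polling Oozie's standard /v1/admin/status web service until it reports NORMAL mode. The helper name, timeout, and polling interval are made up for illustration; only the endpoint style and port 11000 match what Savanna already talks to for job submission.

import time
import requests

def wait_for_oozie(oozie_url, timeout=300, interval=10):
    # Poll Oozie's admin status endpoint (e.g. oozie_url =
    # 'http://10.0.3.2:11000/oozie') until it reports systemMode NORMAL.
    # Returns True if Oozie came up within the timeout, False otherwise.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(oozie_url + "/v1/admin/status", timeout=5)
            if resp.ok and resp.json().get("systemMode") == "NORMAL":
                return True
        except requests.RequestException:
            pass  # Oozie not listening yet (or misconfigured); keep polling
        time.sleep(interval)
    return False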

Revision history for this message
Jonathan Maron (jmaron) wrote : Re: [Bug 1265068] Re: [EDP] Execution stays 'Pending' after failure on oozie side

Note that some of these failures can occur prior to the submission to Oozie (configuration issues, etc.), in which case the problem may be more an issue of propagating exceptions to the appropriate layer.

— Jon

Revision history for this message
Matthew Farrellee (mattf) wrote :

I agree, there are (too?) many things that can go wrong that Savanna has
no control over. That just strengthens the case that Savanna should have
a reasonable policy for giving up on instance/cluster startup.

Changed in savanna:
milestone: none → icehouse-3
tags: added: edp
Changed in savanna:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Trevor McKay (tmckay) wrote :

Just starting to take a look at this.

I'm postulating that there may be a few subcases we can fix here through use of the Oozie client.

Revision history for this message
Trevor McKay (tmckay) wrote :

Note, I'm distinguishing between "misconfigure Oozie" meaning "Oozie itself is broken" and "misconfigure Oozie" meaning "create a bogus workflow and submit it".

Which one are we talking about, or both?

Revision history for this message
Andrew Lazarev (alazarev) wrote :

@Trevor
In my case there was a wrong "pig.jar" file in the Oozie /share/lib (case #1). But the problem is general, so this bug is about both cases.

Revision history for this message
Andrew Lazarev (alazarev) wrote :

Checked with a broken pig.jar: the execution moved to "KILLED" as expected. So it seems the problem applies to case #2 only. I don't remember exactly what was wrong with Oozie in my case.

Changed in savanna:
milestone: icehouse-3 → next
Revision history for this message
Trevor McKay (tmckay) wrote :

Here is a failure case, FYI. It seems that if you submit a job right after the cluster becomes "Active", it can fail; apparently Oozie is not quite ready yet. It's an edge case, but part of the general problem:

2014-02-25 11:11:17.339 10952 ERROR savanna.context [-] Thread 'Starting Job Execution 428a39c1-817b-4c93-be1b-f1cc3124e378' fails with exception: 'HTTPConnectionPool(host='10.0.3.2', port=11000): Max retries exceeded with url: /oozie//v1/jobs (Caused by <class 'httplib.BadStatusLine'>: '')'
2014-02-25 11:11:17.339 10952 TRACE savanna.context Traceback (most recent call last):
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/savanna/context.py", line 124, in _wrapper
2014-02-25 11:11:17.339 10952 TRACE savanna.context func(*args, **kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/savanna/service/edp/job_manager.py", line 167, in run_job
2014-02-25 11:11:17.339 10952 TRACE savanna.context job_execution)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/savanna/service/edp/oozie.py", line 37, in add_job
2014-02-25 11:11:17.339 10952 TRACE savanna.context "Content-Type": "application/xml;charset=UTF-8"
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/sessions.py", line 425, in post
2014-02-25 11:11:17.339 10952 TRACE savanna.context return self.request('POST', url, data=data, **kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/sessions.py", line 383, in request
2014-02-25 11:11:17.339 10952 TRACE savanna.context resp = self.send(prep, **send_kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/sessions.py", line 486, in send
2014-02-25 11:11:17.339 10952 TRACE savanna.context r = adapter.send(request, **kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/adapters.py", line 378, in send
2014-02-25 11:11:17.339 10952 TRACE savanna.context raise ConnectionError(e)
2014-02-25 11:11:17.339 10952 TRACE savanna.context ConnectionError: HTTPConnectionPool(host='10.0.3.2', port=11000): Max retries exceeded with url: /oozie//v1/jobs (Caused by <class 'httplib.BadStatusLine'>: '')
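
One way to tolerate this particular edge case would be to retry the submission briefly before giving up. A rough sketch only; the retry count, delay, and the 'submit' callable standing in for the client's add_job call are illustrative, not the actual job_manager code:

import time
import requests

def submit_with_retries(submit, retries=5, delay=10):
    # Call the Oozie submission function, retrying a few times in case
    # the Oozie server is not accepting connections yet (e.g. right
    # after the cluster became 'Active').
    for attempt in range(retries):
        try:
            return submit()
        except requests.ConnectionError:
            if attempt == retries - 1:
                raise  # give up; the caller should mark the job as failed
            time.sleep(delay)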

Changed in sahara:
milestone: next → juno-2
Revision history for this message
Trevor McKay (tmckay) wrote :

Note, the problem in comment 10 can be handled with exception handlers in Sahara and isn't really a problem on the Oozie side. It is better handled as part of https://bugs.launchpad.net/sahara/+bug/1317205

https://bugs.launchpad.net/sahara/+bug/1265068/comments/10
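
For what it's worth, a rough sketch of the kind of exception handling meant here, so a submission failure moves the execution to a terminal state instead of leaving it 'Pending'. The 'run_job' and 'conductor' parameters stand in for Sahara's job manager and conductor API, and the 'FAILED' status string is illustrative; the real fix is tracked in bug 1317205.

import logging

LOG = logging.getLogger(__name__)

def run_job_safely(run_job, conductor, ctx, job_execution):
    # Run a job execution, recording the failure if submission blows up
    # instead of leaving the execution stuck in 'Pending'.
    try:
        run_job(ctx, job_execution)
    except Exception:
        LOG.exception("Error running job execution")
        conductor.job_execution_update(ctx, job_execution,
                                       {"info": {"status": "FAILED"}})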

Changed in sahara:
status: Confirmed → Triaged
Revision history for this message
Andrew Lazarev (alazarev) wrote :

It looks like a failure on the Oozie side moves the job to the FAILED state in Oozie (and we update the status). Failures caused by Sahara code are handled by #1317205. So, marking this as a duplicate.
