[EDP] Execution stays 'Pending' after failure on oozie side

Bug #1265068 reported by Andrew Lazarev
This bug affects 2 people
Affects: Sahara
Status: Triaged
Importance: Medium
Assigned to: Unassigned
Milestone: juno-2

Bug Description

Steps to reproduce:
1. Misconfigure Oozie (there are many ways to do this)
2. Run a job

Observed behavior:
The execution remains in 'Pending' forever, with no errors on the Savanna side.

Expected behavior:
The execution moves to the Error state.

Tags: edp
Revision history for this message
Matthew Farrellee (mattf) wrote :

Do we have a way to poll for a functional oozie without running a job that should take a known amount of time?

Revision history for this message
Andrew Lazarev (alazarev) wrote :

Matthew, I don't know. I filed the bug from a user's perspective.

Revision history for this message
Matthew Farrellee (mattf) wrote :

It seems like the options are to (0) find a way to detect a functional Oozie or (1) decide on a timeout for pending.

I'd rather do (0), but a functioning oozie may still fail if the cluster itself is not functional. The only way to detect a functioning cluster is to run a simple job through oozie. However, a functioning cluster may only run one job before crashing, which means ultimately (1) is necessary.

If urgent, I'd proceed w/ (1), after careful thought about what's too long of a PENDING wait.

If not urgent, or as a long-term approach, we should look at doing an Oozie status test and potentially running a simple job.
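
For reference, here is a rough sketch of what an Oozie status test for option (0) could look like, polling Oozie's standard /v1/admin/status web service until it reports NORMAL mode. The helper name, timeout, and polling interval are made up for illustration; only the endpoint style and port 11000 match what Savanna already talks to for job submission.

import time
import requests

def wait_for_oozie(oozie_url, timeout=300, interval=10):
    # Poll Oozie's admin status endpoint (e.g. oozie_url =
    # 'http://10.0.3.2:11000/oozie') until it reports systemMode NORMAL.
    # Returns True if Oozie came up within the timeout, False otherwise.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            resp = requests.get(oozie_url + "/v1/admin/status", timeout=5)
            if resp.ok and resp.json().get("systemMode") == "NORMAL":
                return True
        except requests.RequestException:
            pass  # Oozie not listening yet (or misconfigured); keep polling
        time.sleep(interval)
    return False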

Revision history for this message
Jonathan Maron (jmaron) wrote : Re: [Bug 1265068] Re: [EDP] Execution stays 'Pending' after failure on oozie side

Note that some of these failures can occur prior to the submission to Oozie (configuration issues, etc.), in which case the problem may be more an issue of propagating exceptions to the appropriate layer.

— Jon

Revision history for this message
Matthew Farrellee (mattf) wrote :

I agree, there are (too?) many things that can go wrong that Savanna has
no control over. That just strengthens the case that Savanna should have
a reasonable policy for giving up on instance/cluster startup.

Changed in savanna:
milestone: none → icehouse-3
tags: added: edp
Changed in savanna:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Trevor McKay (tmckay) wrote :

Just starting to take a look at this.

I'm postulating that there may be a few subcases we can fix here through use of the Oozie client.

Revision history for this message
Trevor McKay (tmckay) wrote :

Note, I'm distinguishing between "misconfigure Oozie" meaning "Oozie itself is broken" and "misconfigure Oozie" meaning "create a bogus workflow and submit it".

Which one are we talking about, or both?

Revision history for this message
Andrew Lazarev (alazarev) wrote :

@Trevor
In my case there was a wrong "pig.jar" file in the Oozie /share/lib (case #1). But the problem is general, so this bug is about both cases.

Revision history for this message
Andrew Lazarev (alazarev) wrote :

Checked with a broken pig.jar: the execution moved to "KILLED" as expected. So it seems the problem applies to case #2 only. I don't remember exactly what was wrong with Oozie in my case.

Changed in savanna:
milestone: icehouse-3 → next
Revision history for this message
Trevor McKay (tmckay) wrote :

Here is a failure case, FYI. It seems that if you submit a job right after the cluster becomes "Active", it can fail; apparently Oozie is not quite ready yet. It's an edge case, but part of the general problem:

2014-02-25 11:11:17.339 10952 ERROR savanna.context [-] Thread 'Starting Job Execution 428a39c1-817b-4c93-be1b-f1cc3124e378' fails with exception: 'HTTPConnectionPool(host='10.0.3.2', port=11000): Max retries exceeded with url: /oozie//v1/jobs (Caused by <class 'httplib.BadStatusLine'>: '')'
2014-02-25 11:11:17.339 10952 TRACE savanna.context Traceback (most recent call last):
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/savanna/context.py", line 124, in _wrapper
2014-02-25 11:11:17.339 10952 TRACE savanna.context func(*args, **kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/savanna/service/edp/job_manager.py", line 167, in run_job
2014-02-25 11:11:17.339 10952 TRACE savanna.context job_execution)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/savanna/service/edp/oozie.py", line 37, in add_job
2014-02-25 11:11:17.339 10952 TRACE savanna.context "Content-Type": "application/xml;charset=UTF-8"
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/sessions.py", line 425, in post
2014-02-25 11:11:17.339 10952 TRACE savanna.context return self.request('POST', url, data=data, **kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/sessions.py", line 383, in request
2014-02-25 11:11:17.339 10952 TRACE savanna.context resp = self.send(prep, **send_kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/sessions.py", line 486, in send
2014-02-25 11:11:17.339 10952 TRACE savanna.context r = adapter.send(request, **kwargs)
2014-02-25 11:11:17.339 10952 TRACE savanna.context File "/home/tmckay/src/savanna2/.tox/venv/lib/python2.7/site-packages/requests/adapters.py", line 378, in send
2014-02-25 11:11:17.339 10952 TRACE savanna.context raise ConnectionError(e)
2014-02-25 11:11:17.339 10952 TRACE savanna.context ConnectionError: HTTPConnectionPool(host='10.0.3.2', port=11000): Max retries exceeded with url: /oozie//v1/jobs (Caused by <class 'httplib.BadStatusLine'>: '')
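
One way to tolerate this particular edge case would be to retry the submission briefly before giving up. A rough sketch only; the retry count, delay, and the 'submit' callable standing in for the client's add_job call are illustrative, not the actual job_manager code:

import time
import requests

def submit_with_retries(submit, retries=5, delay=10):
    # Call the Oozie submission function, retrying a few times in case
    # the Oozie server is not accepting connections yet (e.g. right
    # after the cluster became 'Active').
    for attempt in range(retries):
        try:
            return submit()
        except requests.ConnectionError:
            if attempt == retries - 1:
                raise  # give up; the caller should mark the job as failed
            time.sleep(delay)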

Changed in sahara:
milestone: next → juno-2
Revision history for this message
Trevor McKay (tmckay) wrote :

Note, the problem in comment 10 can be handled with exception handlers in Sahara and isn't really a problem on the Oozie side. It is better handled as part of https://bugs.launchpad.net/sahara/+bug/1317205

https://bugs.launchpad.net/sahara/+bug/1265068/comments/10
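
For what it's worth, a rough sketch of the kind of exception handling meant here, so a submission failure moves the execution to a terminal state instead of leaving it 'Pending'. The 'run_job' and 'conductor' parameters stand in for Sahara's job manager and conductor API, and the 'FAILED' status string is illustrative; the real fix is tracked in bug 1317205.

import logging

LOG = logging.getLogger(__name__)

def run_job_safely(run_job, conductor, ctx, job_execution):
    # Run a job execution, recording the failure if submission blows up
    # instead of leaving the execution stuck in 'Pending'.
    try:
        run_job(ctx, job_execution)
    except Exception:
        LOG.exception("Error running job execution")
        conductor.job_execution_update(ctx, job_execution,
                                       {"info": {"status": "FAILED"}})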

Changed in sahara:
status: Confirmed → Triaged
Revision history for this message
Andrew Lazarev (alazarev) wrote :

It looks like a failure on the Oozie side moves the job to the FAILED state in Oozie (and we update the status). Failures caused by Sahara code are handled by #1317205. So, marking this as a duplicate.
