Handle supervision in oq-engine-server

Bug #1214813 reported by Lars Butler
This bug affects 1 person
Affects: OpenQuake Engine
Status: Fix Released
Importance: Critical
Assigned to: Michele Simionato
Milestone: 1.0.1

Bug Description

In order to run oq-engine calculations in oq-engine-server, we have to disable the supervisor forking (see https://github.com/gem/oq-engine/blob/584a5265277ce27ae90b362e3bc036f64998470c/openquake/engine/engine.py#L393), otherwise we leak celeryd processes.

We can avoid the leak by not forking, but then we lose some important monitoring done by the supervisor process, namely worker node monitoring: if a worker crashes or goes offline, the calculation is aborted.

oq-engine-server will need to implement some other kind of monitoring, perhaps something involving celerymon.
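
To make the missing piece concrete, here is a minimal sketch of worker node monitoring that does not rely on a forked supervisor. It is not the oq-engine implementation: the names missing_workers and watch_workers, the polling interval, and the abort callback are all hypothetical. The ping callable is assumed to return replies in the shape produced by Celery's app.control.ping(), a list of one-item dicts keyed by worker name.

```python
import time
from typing import Callable, Iterable, List, Mapping, Optional, Set


def missing_workers(expected: Iterable[str],
                    replies: Iterable[Mapping[str, dict]]) -> Set[str]:
    """Return the expected worker names that did not answer the ping.

    `replies` is assumed to look like the result of Celery's
    `app.control.ping()`, e.g. [{"celery@node1": {"ok": "pong"}}].
    """
    alive = {
        name
        for reply in replies
        for name, body in reply.items()
        if body.get("ok") == "pong"
    }
    return set(expected) - alive


def watch_workers(ping: Callable[[], Iterable[Mapping[str, dict]]],
                  expected: Iterable[str],
                  abort: Callable[[List[str]], None],
                  interval: float = 5.0,
                  max_checks: Optional[int] = None,
                  sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Poll the workers and call `abort` as soon as one stops answering.

    Hypothetical helper: `abort` stands in for whatever the server would
    do to stop the calculation. Returns the sorted list of lost workers,
    or an empty list if `max_checks` polls all succeeded.
    """
    expected = list(expected)
    checks = 0
    while max_checks is None or checks < max_checks:
        lost = sorted(missing_workers(expected, ping()))
        if lost:
            abort(lost)
            return lost
        checks += 1
        sleep(interval)
    return []
```

With Celery available, ping could be wired as lambda: app.control.ping(timeout=1.0), and the loop would run in a thread next to the calculation rather than in a forked process. Whether polling like this is cheap enough for oq-engine-server is an open question.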

Changed in oq-engine:
importance: Undecided → High
importance: High → Critical
Michele Simionato (michele-simionato) wrote :
Changed in oq-engine:
status: New → In Progress
assignee: nobody → Michele Simionato (michele-simionato)
milestone: none → 1.0.1
Changed in oq-engine:
status: In Progress → Fix Committed
Lars Butler (lars-butler) wrote :

Hi Michele,

It looks like you've done some great cleanup here!

Is there a corresponding patch to the oq-engine-server? Just removing supervision from the oq-engine doesn't really solve this problem, does it? And if you remove all of this, how do you detect failed celery nodes? Is this feature still there somewhere? (It's a little unclear from the pull request, though I haven't had a chance to test it.)

Michele Simionato (michele-simionato) wrote :

Hi Lars, there is actually a task scheduled for this sprint covering error management in the engine server. The detection of failed nodes is still there; see https://github.com/gem/oq-engine/blob/master/openquake/engine/engine.py#L488

Changed in oq-engine:
status: Fix Committed → Fix Released