Hazard curve calculation does not terminate

Bug #890405 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Won't Fix
Importance: High
Assigned to: Unassigned
Milestone: (not set)

Bug Description

The OpenQuake job with the attached input files results in 1 realization and 148011 sites. However, the hazard curve calculation does not terminate after 148011 hazard curves have been calculated (on the gemsun cluster).

Tags: defect hazard
Muharem Hrnjadovic (al-maisan) wrote :
Changed in openquake:
status: New → Confirmed
importance: Undecided → High
tags: added: defect hazard
Muharem Hrnjadovic (al-maisan) wrote :

After 1007241 computed hazard curves the system stalls: the workers are idle and RabbitMQ is fully operational, but the main openquake process on the control node is waiting for a result that never arrives (see the attached strace.log file).

Muharem Hrnjadovic (al-maisan) wrote :

The task that the main process was waiting for (b397e04834084796b0bf6b51fe96a65c) completed early in the calculation:

$ grep 6a65c gemsun0[134].log
gemsun01.log:[2011-11-14 20:33:03,016: INFO/MainProcess] Got task from broker: openquake.hazard.tasks.compute_hazard_curve[b397e048-3408-4796-b0bf-6b51fe96a65c]
gemsun01.log:[2011-11-14 20:33:10,447: INFO/MainProcess] Task openquake.hazard.tasks.compute_hazard_curve[b397e048-3408-4796-b0bf-6b51fe96a65c] succeeded in 7.39070105553s: ['::JOB::159::!hazard_curve_poes!0!74597818006...

Maybe a result message was dropped?
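
The grep above can be generalised into a small script that reports, for every task id seen in the worker logs, whether the worker logged it as received or as succeeded. This is only a sketch: it assumes the two log record shapes quoted above, and the helper names (normalize_task_id, task_states) are made up for illustration and are not part of OpenQuake or celery. Note that the main process shows the id without dashes while the worker logs show it with dashes, so the id is normalised first.

```python
import re
import uuid

# Matches the two worker log records quoted in this report:
#   "Got task from broker: <task name>[<uuid>]"
#   "Task <task name>[<uuid>] succeeded in ..."
TASK_RE = re.compile(
    r"(?:Got task from broker: |Task )"
    r"(?P<name>[\w.]+)\[(?P<id>[0-9a-f-]{36})\]"
    r"(?P<done> succeeded)?"
)


def normalize_task_id(raw):
    """Turn a 32-char hex id (as seen in the main process) into dashed form."""
    return str(uuid.UUID(raw))


def task_states(lines):
    """Map task id -> 'received' or 'succeeded' based on worker log lines."""
    states = {}
    for line in lines:
        match = TASK_RE.search(line)
        if match is None:
            continue
        task_id = match.group("id")
        if match.group("done"):
            states[task_id] = "succeeded"
        else:
            # Never downgrade a task that was already seen as succeeded.
            states.setdefault(task_id, "received")
    return states


if __name__ == "__main__":
    # The two records from gemsun01.log, with the result payload cut off.
    sample = [
        "[2011-11-14 20:33:03,016: INFO/MainProcess] Got task from broker: "
        "openquake.hazard.tasks.compute_hazard_curve"
        "[b397e048-3408-4796-b0bf-6b51fe96a65c]",
        "[2011-11-14 20:33:10,447: INFO/MainProcess] Task "
        "openquake.hazard.tasks.compute_hazard_curve"
        "[b397e048-3408-4796-b0bf-6b51fe96a65c] succeeded in 7.39070105553s",
    ]
    wanted = normalize_task_id("b397e04834084796b0bf6b51fe96a65c")
    print(wanted, task_states(sample).get(wanted, "not seen"))
```

A task that shows up as "succeeded" here while the control node still waits for it would be consistent with a lost result message, but the script cannot prove that on its own.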

Muharem Hrnjadovic (al-maisan) wrote :

Re. "After 1007241 computed hazard curves": this statement is unreliable. I based it on the log records observed on gemsun02, but a closer look at the time stamps (which suggest many duplicate log records?) and a comparison with the workers' log files make it very uncertain how many hazard curves were *actually* computed.
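
One way to get a count that is not inflated by duplicate log records is to count distinct task ids rather than raw "succeeded" lines. The sketch below assumes the "succeeded" record format from gemsun01.log; count_succeeded is a hypothetical helper, and the second task id in the sample is invented purely for the demonstration. It counts tasks, which is not necessarily the same as the number of hazard curves.

```python
import re
from collections import Counter

# "Task openquake.hazard.tasks.compute_hazard_curve[<uuid>] succeeded in ..."
SUCCEEDED_RE = re.compile(
    r"Task openquake\.hazard\.tasks\.compute_hazard_curve"
    r"\[(?P<id>[0-9a-f-]{36})\] succeeded"
)


def count_succeeded(lines):
    """Return (raw_records, distinct_tasks, duplicated_ids) for log lines."""
    seen = Counter()
    for line in lines:
        match = SUCCEEDED_RE.search(line)
        if match:
            seen[match.group("id")] += 1
    raw_records = sum(seen.values())
    duplicated_ids = sorted(tid for tid, n in seen.items() if n > 1)
    return raw_records, len(seen), duplicated_ids


if __name__ == "__main__":
    prefix = "Task openquake.hazard.tasks.compute_hazard_curve"
    sample = [
        # Same record logged twice, as suspected on gemsun02.
        prefix + "[b397e048-3408-4796-b0bf-6b51fe96a65c] succeeded in 7.39s",
        prefix + "[b397e048-3408-4796-b0bf-6b51fe96a65c] succeeded in 7.39s",
        # Hypothetical id, not taken from the real logs.
        prefix + "[00000000-0000-0000-0000-000000000001] succeeded in 1.00s",
    ]
    raw, distinct, dupes = count_succeeded(sample)
    print("raw records:", raw)
    print("distinct tasks:", distinct)
    print("duplicated ids:", dupes)
```

If the raw and distinct counts differ for a given log file, the raw line count from that file should not be used as the number of completed tasks.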

Muharem Hrnjadovic (al-maisan) wrote :

See also bug #890703.

Muharem Hrnjadovic (al-maisan) wrote :

RabbitMQ has been identified as the critical component. The investigations in progress are documented here: https://bugs.launchpad.net/openquake/+bug/894024

Changed in openquake:
status: Confirmed → Won't Fix