After failed job redis at 100%, celery queue not draining

Bug #943292 reported by Muharem Hrnjadovic
Affects: OpenQuake (deprecated)
Status: Won't Fix
Importance: High
Assigned to: Muharem Hrnjadovic

Bug Description

Observed this morning on the model facility cluster:

    1 - job with hazard calculation failures aborts and is terminated
    2 - redis load goes up and it uses 100% of cpu time
    3 - the worker machines seem idle
    4 - the remaining task messages of the dead job are *not* drained from the celery queue

A job started subsequently appears hung. Redis is unresponsive, which is likely the root cause of the problem.
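
For anyone trying to confirm the "redis is not responsive" symptom, a bounded PING probe is one option. The following is only a minimal sketch and was not part of the original observation: the function name `redis_is_responsive` and the host, port and timeout defaults are assumptions. It uses only the Python standard library and relies on Redis accepting inline commands.

```python
import socket


def redis_is_responsive(host="localhost", port=6379, timeout=2.0):
    """Return True if the server answers an inline PING with +PONG in time.

    Hypothetical helper for illustration; host, port and timeout are
    assumed defaults, not values taken from the cluster configuration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # Redis accepts inline commands terminated by CRLF.
            conn.sendall(b"PING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections and timeouts (socket.timeout is an OSError).
        return False
    # A server that requires authentication answers with an error line
    # instead, which this sketch also reports as "not responsive".
    return reply.startswith(b"+PONG")
```

A call such as `redis_is_responsive("localhost", 6379, timeout=2.0)` returning `False` while the redis process sits at 100% CPU would be consistent with the behaviour described above, although it does not by itself identify the cause.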

Muharem Hrnjadovic (al-maisan) wrote :

This can probably only be reproduced on the model facility cluster.

Changed in openquake:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Muharem Hrnjadovic (al-maisan)
milestone: none → 0.6.0
tags: added: cluster defect redis
tags: added: enduser-visible
Changed in openquake:
milestone: 0.6.0 → 0.6.1
Muharem Hrnjadovic (al-maisan) wrote :

I managed to capture an strace log of redis in that situation. Please see attachment.

tags: added: mfcluster
matley (matley)
Changed in openquake:
status: Confirmed → Won't Fix