OpenQuake (deprecated)

After failed job redis at 100%, celery queue not draining

Bug #943292 reported by Muharem Hrnjadovic on 2012-02-29

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	OpenQuake (deprecated)	Won't Fix	High	Muharem Hrnjadovic	OpenQuake (deprecated) 0.6.1

Bug Description

Observed this morning on the model facility cluster:

    1 - job with hazard calculation failures aborts and is terminated
    2 - redis load goes up and it uses 100% of cpu time
    3 - the worker machines seem idle
    4 - the remaining task messages of the dead job are *not* drained from the celery queue

A job started subsequently appears hung. Redis itself is unresponsive at that point, which is likely the root cause of the problem.
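One way to confirm the "redis is not responsive" observation independently of OpenQuake is to probe the server directly. The following is only a minimal sketch, not part of the OpenQuake code base: the function name `redis_is_responsive`, the default host and port, and the 2-second timeout are all assumptions chosen for illustration, and it assumes a redis instance without authentication. It sends a single `PING` over a plain socket and reports whether a `+PONG` reply arrives in time.

```python
import socket


def redis_is_responsive(host="localhost", port=6379, timeout=2.0):
    """Return True if the server answers PING with +PONG within `timeout` seconds.

    Hypothetical helper for diagnosis only; host, port and timeout are
    assumed defaults, not values taken from the cluster configuration.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            # PING encoded in the redis wire protocol: an array of one bulk string.
            conn.sendall(b"*1\r\n$4\r\nPING\r\n")
            reply = conn.recv(64)
    except OSError:
        # Covers refused connections as well as connect/read timeouts.
        return False
    return reply.startswith(b"+PONG")
```

A server that accepts the connection but is too busy to answer (as in step 2 above) would make this return `False` after the timeout, whereas a healthy server returns `True` almost immediately.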

Muharem Hrnjadovic (al-maisan) wrote on 2012-02-29:

This can probably only be reproduced on the model facility cluster.

Changed in openquake:
status:	New → Confirmed
importance:	Undecided → High
assignee:	nobody → Muharem Hrnjadovic (al-maisan)
milestone:	none → 0.6.0
tags:	added: cluster defect redis
tags:	added: enduser-visible

Muharem Hrnjadovic (al-maisan) on 2012-03-01

Changed in openquake:
milestone:	0.6.0 → 0.6.1

Muharem Hrnjadovic (al-maisan) wrote on 2012-03-05:

redis-strace.log (46.3 MiB, text/plain)

I managed to capture an strace log of redis in that situation. Please see attachment.

Muharem Hrnjadovic (al-maisan) on 2012-03-06

tags:	added: mfcluster

matley (matley) on 2013-04-03

Changed in openquake:
status:	Confirmed → Won't Fix
