Redis connections leak on the MF cluster

Bug #1008532 reported by Muharem Hrnjadovic
This bug affects 1 person
Affects: OpenQuake (deprecated)
Status: Won't Fix
Importance: High
Assigned to: Muharem Hrnjadovic
Milestone: (none)

Bug Description

gemcontrol ~ $ sudo lsof | grep redis | wc -l
1034

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :
Changed in openquake:
status: New → In Progress
importance: Undecided → High
tags: added: devop enduser-visible mfcluster
Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

Redis connection breakdown by machine:

$ for m in 129.132.181.134 129.132.181.136 129.132.181.138 gemsun01.ethz.ch gemsun02.ethz.ch gemsun03.ethz.ch gemsun04.ethz.ch localhost; do echo "$m"; grep -c "$m" redis-connections.txt; done

129.132.181.134
212
129.132.181.136
197
129.132.181.138
162
gemsun01.ethz.ch
126
gemsun02.ethz.ch
99
gemsun03.ethz.ch
113
gemsun04.ethz.ch
108
localhost
2
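
The shell loop above can also be written as a single pass over the file. This is a minimal sketch, assuming redis-connections.txt holds saved `lsof` output lines as the loop suggests. The function name `count_by_host` and the sample lines are made up for illustration and are not data from the cluster.

```python
from collections import Counter


def count_by_host(lines, hosts):
    """Count the lines that mention each host.

    Plain substring match, so slightly stricter than `grep $m`,
    where the dots in a hostname match any character.
    """
    counts = Counter({host: 0 for host in hosts})
    for line in lines:
        for host in hosts:
            if host in line:
                counts[host] += 1
    return counts


# Hypothetical sample lines in lsof's usual column layout, not real cluster data.
sample = [
    "redis-ser 1234 redis 5u IPv4 111 0t0 TCP localhost:6379->localhost:50001 (ESTABLISHED)",
    "redis-ser 1234 redis 6u IPv4 112 0t0 TCP gemcontrol:6379->gemsun01.ethz.ch:40001 (ESTABLISHED)",
    "redis-ser 1234 redis 7u IPv4 113 0t0 TCP gemcontrol:6379->gemsun01.ethz.ch:40002 (ESTABLISHED)",
]
hosts = ["gemsun01.ethz.ch", "gemsun02.ethz.ch", "localhost"]

for host, n in count_by_host(sample, hosts).items():
    print(host, n)
```

To run it against the real file, pass `open("redis-connections.txt")` as `lines` and the host list from the loop as `hosts`.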

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

Celery workers broken down by machine

$ for m in gemsun01 gemsun02 gemsun03 gemsun04 gemmicro01 gemmicro02 bigstar04; do echo ">> $m"; ssh $m.ethz.ch "ps ax | grep celeryd | grep -v grep" | wc -l; done
>> gemsun01
17
>> gemsun02
17
>> gemsun03
17
>> gemsun04
17
>> gemmicro01
147
>> gemmicro02
99
>> bigstar04
41

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

gemmicro01 = 129.132.181.134
gemmicro02 = 129.132.181.136
bigstar04 = 129.132.181.138
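
With this mapping the two breakdowns can be lined up per machine. The sketch below divides each machine's Redis connection count by its celeryd process count, using only the figures posted in this report; the dictionary names are illustrative. The two counts came from separate commands, so the ratios are only a rough indication.

```python
# Redis connections per machine, from the first breakdown in this report.
connections = {
    "gemmicro01": 212,  # 129.132.181.134
    "gemmicro02": 197,  # 129.132.181.136
    "bigstar04": 162,   # 129.132.181.138
    "gemsun01": 126,
    "gemsun02": 99,
    "gemsun03": 113,
    "gemsun04": 108,
}

# celeryd processes per machine, from the second breakdown.
workers = {
    "gemmicro01": 147,
    "gemmicro02": 99,
    "bigstar04": 41,
    "gemsun01": 17,
    "gemsun02": 17,
    "gemsun03": 17,
    "gemsun04": 17,
}

ratios = {host: connections[host] / workers[host] for host in connections}

for host, ratio in ratios.items():
    print(f"{host}: {ratio:.2f} connections per celeryd process")
```

On the posted figures the four gemsun machines come out at roughly 5.8 to 7.4 connections per celeryd process, while gemmicro01, gemmicro02 and bigstar04 range from about 1.4 to 4.0.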

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

I believe errors like the following caused the Redis connection depletion:

JavaException: Java traceback (most recent call last):
  File "HazardCalculator.java", line 125, in org.gem.calc.HazardCalculator.getHazardCurvesAsJson
  File "HazardCalculator.java", line 92, in org.gem.calc.HazardCalculator.getHazardCurves
  File "HazardCurveCalculator.java", line 415, in org.opensha.sha.calc.HazardCurveCalculator.getHazardCurve
  File "CY_2008_AttenRel.java", line 366, in org.opensha.sha.imr.attenRelImpl.CY_2008_AttenRel.setEqkRupture
  File "DoubleParameter.java", line 493, in org.opensha.commons.param.DoubleParameter.setValue
  File "WarningDoubleParameter.java", line 76, in org.opensha.commons.param.WarningDoubleParameter.setValue
  File "WarningDoubleParameter.java", line 519, in org.opensha.commons.param.WarningDoubleParameter.setValue
org.opensha.commons.exceptions.ConstraintException: Rupture Top Depth: setValue(): Value is not allowed: 40.0

This is obviously undesirable; we need to be able to recover from such errors *without* leaking Redis connections.
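
One general way to make sure a connection is released even when a task fails is to wrap the work in a context manager. The sketch below is not the OpenQuake code and is not a diagnosis of this bug. `FakeConnection`, `borrowed_connection` and `compute_hazard_curve` are hypothetical stand-ins so the example runs without a Redis server or a JVM, and the depth limit is invented.

```python
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a Redis client connection."""

    open_count = 0  # number of connections currently open

    def __init__(self):
        FakeConnection.open_count += 1
        self.closed = False

    def close(self):
        if not self.closed:
            self.closed = True
            FakeConnection.open_count -= 1


@contextmanager
def borrowed_connection():
    """Yield a connection and close it even if the body raises."""
    conn = FakeConnection()
    try:
        yield conn
    finally:
        conn.close()


def compute_hazard_curve(conn, rupture_top_depth):
    """Hypothetical task body that rejects a value, like the Java calculator did."""
    if rupture_top_depth > 20.0:  # limit made up for illustration
        raise ValueError(f"Value is not allowed: {rupture_top_depth}")
    return "ok"


for depth in (5.0, 40.0):
    try:
        with borrowed_connection() as conn:
            compute_hazard_curve(conn, depth)
    except ValueError as exc:
        print("task failed:", exc)
    print("open connections:", FakeConnection.open_count)
```

The failing task still raises, but the `finally` clause closes the connection first, so the open count returns to zero after both runs. With the redis-py client, a similar bound could come from sharing one `redis.ConnectionPool` per process and setting its `max_connections`; whether that would have helped here is an assumption, since the report does not show how the workers create their connections.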

Revision history for this message
Muharem Hrnjadovic (al-maisan) wrote :

I spent two days debugging this, trying to figure out why the workers open new Redis connections in the failure case (involving Java), but did not get anywhere. Giving up.

Changed in openquake:
status: In Progress → Won't Fix