mf cluster still leaking redis connections

Bug #952023 reported by Muharem Hrnjadovic
Affects: OpenQuake (deprecated)
Status: Won't Fix
Importance: High
Assigned to: Muharem Hrnjadovic
Milestone: (none shown)

Bug Description

Running strace on the redis process on gemcontrol shows many failures like these:

accept(4, 0x7fffdf236d10, [16]) = -1 EMFILE (Too many open files)
open("/var/log/redis/redis-server.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = -1 EMFILE (Too many open files)
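
EMFILE means the process has reached its limit on open file descriptors. A minimal sketch for comparing a process's open descriptor count with its soft limit, assuming Linux with /proc mounted. `fd_usage` is a hypothetical helper name, not something from the OpenQuake tooling, and reading another user's /proc entries generally needs root.

```shell
# Report how many descriptors a process has open and its soft limit.
# Linux only: relies on /proc/<pid>/fd and /proc/<pid>/limits.
# The body runs in a subshell, so its variables do not leak to the caller.
fd_usage() (
    pid=$1
    open=$(ls "/proc/$pid/fd" | wc -l)
    # The fourth field of the "Max open files" row is the soft limit.
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "open=$open limit=$limit"
)

# Example (24987 is the redis-server PID in the lsof listing in this report):
#   fd_usage 24987    # run as root or as the redis user
```

If the two numbers are equal or nearly so, further accept() and open() calls can be expected to fail as in the strace output above.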

Muharem Hrnjadovic (al-maisan) wrote :

openquake 1576 laurentiu mem REG 8,1 180263 394356 /usr/share/java/jredis-2.2.jar
openquake 1576 laurentiu 19r REG 8,1 180263 394356 /usr/share/java/jredis-2.2.jar
openquake 1576 laurentiu 38r REG 8,1 180263 394356 /usr/share/java/jredis-2.2.jar
redis-ser 24987 redis cwd DIR 8,1 4096 4720856 /var/lib/redis
redis-ser 24987 redis rtd DIR 8,1 4096 2 /
redis-ser 24987 redis txt REG 8,1 250904 8130866 /usr/bin/redis-server
redis-ser 24987 redis mem REG 8,1 1677624 7606784 /lib/x86_64-linux-gnu/libc-2.13.so
redis-ser 24987 redis mem REG 8,1 135500 7606792 /lib/x86_64-linux-gnu/libpthread-2.13.so
redis-ser 24987 redis mem REG 8,1 538928 7606794 /lib/x86_64-linux-gnu/libm-2.13.so
redis-ser 24987 redis mem REG 8,1 141088 7606791 /lib/x86_64-linux-gnu/ld-2.13.so
redis-ser 24987 redis 0u CHR 1,3 0t0 7278 /dev/null
redis-ser 24987 redis 1u CHR 1,3 0t0 7278 /dev/null
redis-ser 24987 redis 2u CHR 1,3 0t0 7278 /dev/null
redis-ser 24987 redis 3u 0000 0,9 0 7269 anon_inode
redis-ser 24987 redis 4u IPv4 1509201265 0t0 TCP *:6379 (LISTEN)
redis-ser 24987 redis 5u IPv4 2514834102 0t0 TCP 129.132.181.130:6379->gemsun04.ethz.ch:46729 (ESTABLISHED)
redis-ser 24987 redis 6u IPv4 2514826191 0t0 TCP 129.132.181.130:6379->gemsun04.ethz.ch:46561 (ESTABLISHED)
redis-ser 24987 redis 7u IPv4 2196140422 0t0 TCP 129.132.181.130:6379->129.132.181.138:46553 (ESTABLISHED)
redis-ser 24987 redis 8u IPv4 1795361325 0t0 TCP 129.132.181.130:6379->129.132.181.138:46554 (ESTABLISHED)
redis-ser 24987 redis 9u IPv4 1795361326 0t0 TCP 129.132.181.130:6379->gemsun01.ethz.ch:50897 (ESTABLISHED)
redis-ser 24987 redis 10u IPv4 2196140423 0t0 TCP 129.132.181.130:6379->gemsun03.ethz.ch:60594 (ESTABLISHED)
redis-ser 24987 redis 11u IPv4 1795361327 0t0 TCP 129.132.181.130:6379->gemsun03.ethz.ch:60595 (ESTABLISHED)
redis-ser 24987 redis 12u IPv4 1795361328 0t0 TCP 129.132.181.130:6379->gemsun01.ethz.ch:50899 (ESTABLISHED)
redis-ser 24987 redis 13u IPv4 1795361329 0t0 TCP 129.132.181.130:6379->129.132.181.138:46557 (ESTABLISHED)
redis-ser 24987 redis 14u IPv4 1795362022 0t0 TCP 129.132.181.130:6379->gemsun03.ethz.ch:60597 (ESTABLISHED)
redis-ser 24987 redis 15u IPv4 1795362023 0t0 TCP 129.132.181.130:6379->gemsun03.ethz.ch:60599 (ESTABLISHED)
redis-ser 24987 redis 16u IPv4 2514045300 0t0 TCP 129.132.181.130:6379->gemsun01.eth...

Changed in openquake:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Muharem Hrnjadovic (al-maisan)
milestone: none → 0.6.1
tags: added: defect mfcluster redis

Muharem Hrnjadovic (al-maisan) wrote :

$ for m in gemsun01 gemsun02 gemsun03 gemsun04 gemmicro01 gemmicro02 bigstar04; do echo ""; echo ">> $m"; ssh $m.ethz.ch "ps ax | grep celeryd | wc -l"; done

>> gemsun01
66

>> gemsun02
48

>> gemsun03
98

>> gemsun04
86

>> gemmicro01
174

>> gemmicro02
103

>> bigstar04
39
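
Note that `ps ax | grep celeryd` run through ssh can also match the grep process itself and the remote shell whose command line contains the word, so each figure above may be slightly high. A minimal sketch of a count that does not match itself, using the bracket trick. `count_celeryd` is a hypothetical helper that reads ps output on stdin.

```shell
# Count lines of ps output that mention celeryd.
# The pattern [c]eleryd matches the text "celeryd" but not the literal text
# "[c]eleryd", so a grep or shell carrying the pattern on its own command
# line is not counted.
count_celeryd() {
    grep -c '[c]eleryd'
}

# Example against a live host (host name taken from the loop above):
#   ssh gemsun01.ethz.ch "ps ax | grep -c '[c]eleryd'"
```

The count still includes zombie entries, since ps lists those as well.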

Muharem Hrnjadovic (al-maisan) wrote :

Redis connection breakdown by machine:

$ for sys in `grep -i established redis-connection.log | sed 's/..*->\([^:][^:]*\):.*/\1/' | sort -u`; do echo ""; echo ":: $sys"; grep -i established redis-connection.log | grep $sys | wc -l; done

:: 129.132.181.134
207

:: 129.132.181.136
224

:: 129.132.181.138
176

:: gemsun01.ethz.ch
107

:: gemsun02.ethz.ch
83

:: gemsun03.ethz.ch
101

:: gemsun04.ethz.ch
120

:: localhost
1
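
The eight counts above add up to 1019. Together with the few non-socket descriptors in the lsof listing, that is close to 1024, a common default limit on open files. The report does not state the actual limit on gemcontrol, so this is only an assumption, but it would be consistent with the EMFILE errors.

The same breakdown can be produced in one pass, keyed on the exact peer name rather than on a `grep $sys` pattern match. This is a sketch only: `connections_by_peer` is a hypothetical helper, and it assumes lsof-style lines on stdin with the state printed in upper case.

```shell
# Count ESTABLISHED connections per remote peer.
# The peer is the text between "->" and the colon before the remote port.
connections_by_peer() {
    awk '
        /ESTABLISHED/ && match($0, /->[^:]+:/) {
            # Skip the leading "->" and drop the trailing ":".
            peer = substr($0, RSTART + 2, RLENGTH - 3)
            count[peer]++
        }
        END { for (p in count) print count[p], p }
    ' | sort -k1,1nr -k2
}

# Example, using the log file named in the command above:
#   connections_by_peer < redis-connection.log
```

Output is one "count peer" pair per line, largest count first.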

Muharem Hrnjadovic (al-maisan) wrote :

Number of celeryd processes in the cluster (including zombies): 614

Muharem Hrnjadovic (al-maisan) wrote :

A more accurate number of celeryd processes in the cluster (including zombies): 600

Muharem Hrnjadovic (al-maisan) wrote :

Number of zombie celeryd processes in the cluster: 348
non-zombies: 252

That's too many non-zombie celeryd processes. gemsun03 and gemsun04 each have a duplicate set of worker processes (probably stuck/hanging).
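
A zombie/non-zombie split like the one above can be taken from the process state column of ps. This is a minimal sketch and not necessarily how the numbers in this comment were obtained. `celeryd_states` is a hypothetical helper, and it assumes each input line starts with the state field followed by the command line.

```shell
# Split celeryd processes into zombies and live ones.
# Reads "STAT ARGS..." lines on stdin, e.g. from: ps -eo stat=,args=
# A state beginning with Z marks a zombie (defunct) process.
celeryd_states() {
    awk '
        # [c]eleryd keeps this awk process itself out of the count.
        /[c]eleryd/ {
            if ($1 ~ /^Z/) zombie++; else live++
        }
        END { printf "zombie=%d live=%d\n", zombie, live }
    '
}

# Example on one host:
#   ps -eo stat=,args= | celeryd_states
```

Comparing the live figure per host with the expected worker count for that host would show where duplicate worker sets are running.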

Muharem Hrnjadovic (al-maisan) wrote :

For the record: we should have a total of 186 celeryd worker processes in the cluster.

matley (matley)
Changed in openquake:
status: Confirmed → Won't Fix