gearmand still accepts new jobs from time to time, but all workers are stuck too and won't receive any jobs
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Gearman | Triaged | Critical | Brian Aker | |
Bug Description
.../gearmand --port=4730 --pid-file=
I am trying the 1.1.6 release and noticed that gearmand almost stops answering requests after a while.
This is how a normal "status" request via the admin protocol looks in the debug logfile:
DEBUG 2013-04-22 14:27:54.377299 [ main ] accept() 340 -> libgearman-
INFO 2013-04-22 14:27:54.377365 [ main ] Accepted connection from 127.0.0.1:49282
DEBUG 2013-04-22 14:27:54.377437 [ 1 ] Received CON wakeup event -> libgearman-
DEBUG 2013-04-22 14:27:54.377487 [ 1 ] setsockopt() 340 -> libgearman-
DEBUG 2013-04-22 14:27:54.377504 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-
INFO 2013-04-22 14:27:54.377512 [ 1 ] Gear connection made
DEBUG 2013-04-22 14:27:54.377521 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-
DEBUG 2013-04-22 14:27:54.377539 [ 1 ] read 7 bytes -> libgearman-
INFO 2013-04-22 14:27:54.377546 [ 1 ] Gear unpack
DEBUG 2013-04-22 14:27:54.377552 [ 1 ] Received TEXT 127.0.0.1:40460705 -> libgearman-
DEBUG 2013-04-22 14:27:54.377574 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-
DEBUG 2013-04-22 14:27:54.377593 [ proc ] 127.0.0.1:49282 packet command TEXT -> libgearman-
DEBUG 2013-04-22 14:27:54.377626 [ proc ] text command status 1 arguments -> libgearman-
DEBUG 2013-04-22 14:27:54.377693 [ 1 ] Received RUN wakeup event -> libgearman-
DEBUG 2013-04-22 14:27:54.377705 [ 1 ] GEAR length: 139 gearmand_command_t: GEARMAN_
DEBUG 2013-04-22 14:27:54.377774 [ 1 ] send() 139 bytes to peer 127.0.0.1:49282 -> libgearman-
DEBUG 2013-04-22 14:27:54.377789 [ 1 ] Sent TEXT to 127.0.0.1:40460705 -> libgearman-
DEBUG 2013-04-22 14:27:54.377796 [ 1 ] free() packet's data -> libgearman-
DEBUG 2013-04-22 14:28:22.362232 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-
INFO 2013-04-22 14:28:22.362306 [ 1 ] Peer connection has called close() 127.0.0.1:49282
INFO 2013-04-22 14:28:22.362317 [ 1 ] Disconnected 127.0.0.1:49282
DEBUG 2013-04-22 14:28:22.362479 [ 1 ] Received RUN wakeup event -> libgearman-
INFO 2013-04-22 14:28:22.362490 [ 1 ] Gear connection disconnected
However, after a while the only thing that happens is:
DEBUG 2013-04-22 14:33:29.292611 [ main ] accept() 345 -> libgearman-
INFO 2013-04-22 14:33:29.292659 [ main ] Accepted connection from 127.0.0.1:50728
Currently I can only reproduce the problem when setting the log level to at least notice. At debug log level gearmand
runs for hours without problems. At notice log level the problem starts after a few minutes, when CLOSE_WAIT connections
pile up. The logfile only contains the accepted jobs and some errors on shutdown:
ERROR 2013-05-03 08:57:50.000000 [ proc ] GEARMAND_
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_
Any hints on how to debug that any further?
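To catch the moment the daemon stops answering, a small probe could send the same "status" text command as in the normal log above (the 7 bytes read are `status\n`) and give up after a timeout. This is only a sketch: the function names, the 5 second timeout, and the default host are assumptions for illustration, and it relies on the admin protocol replying with tab-separated lines terminated by a single "." line.

```python
import socket


def parse_status(text):
    """Parse a 'status' reply: FUNCTION<TAB>TOTAL<TAB>RUNNING<TAB>AVAILABLE_WORKERS
    per line, terminated by a line containing only '.'."""
    rows = []
    for line in text.splitlines():
        if line == ".":
            break
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # skip anything that is not a status row
        name, total, running, workers = parts
        rows.append((name, int(total), int(running), int(workers)))
    return rows


def probe_status(host="127.0.0.1", port=4730, timeout=5.0):
    """Return parsed status rows, or None if gearmand does not answer in time.

    host, port and timeout are illustrative defaults, not values from the report.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"status\n")
            buf = b""
            # Read until the terminating '.' line arrives or the peer closes.
            while not (buf == b".\n" or buf.endswith(b"\n.\n")):
                chunk = sock.recv(4096)
                if not chunk:
                    break
                buf += chunk
    except OSError:  # covers timeouts and refused connections
        return None
    return parse_status(buf.decode("utf-8", "replace"))
```

Running `probe_status()` in a loop and logging the first time it returns `None` would give a timestamp to match against the gearmand logfile.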
Changed in gearmand: assignee: nobody → Brian Aker (brianaker)
Meanwhile I tried different command line options, and none of them seems to change the behaviour. I tried
--threads=30 as suggested by mail, which may delay the problem a little, but CLOSE_WAIT connections still pile
up until the daemon does not respond anymore.
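To put numbers on the pile-up, the CLOSE_WAIT sockets on the gearmand port can be counted over time. The sketch below assumes a Linux host and parses the text of `/proc/net/tcp` (IPv4 only; `/proc/net/tcp6` has the same layout for IPv6), where the state column `08` means CLOSE_WAIT and ports are hexadecimal. The function name is made up for illustration.

```python
CLOSE_WAIT = "08"  # TCP state code used in /proc/net/tcp on Linux


def count_close_wait(proc_net_tcp_text, port=4730):
    """Count sockets in CLOSE_WAIT whose local port matches `port`.

    Takes the contents of /proc/net/tcp as a string so it can be fed
    either from the live file or from a saved snapshot.
    """
    count = 0
    for line in proc_net_tcp_text.splitlines()[1:]:  # first line is the header
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is local_address as HEXIP:HEXPORT, fields[3] is the state
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port and fields[3] == CLOSE_WAIT:
            count += 1
    return count
```

Calling it every few seconds with `open("/proc/net/tcp").read()` and printing the result next to a timestamp should show whether the count grows steadily or jumps at the point where the daemon stops responding.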