gearmand still accepts new jobs from time to time, but all worker are stuck too and won't receiver any jobs

Bug #1176324 reported by Brian Aker
22
This bug affects 5 people
Affects Status Importance Assigned to Milestone
Gearman
Triaged
Critical
Brian Aker

Bug Description

.../gearmand --port=4730 --pid-file=/tmp/run/gearmand.pid --daemon --job-retries=0 --threads=10 -q libsqlite3 --store-queue-on-shutdown --libsqlite3-db=/tmp/gearmand.sqlite --log-file=/tmp/gearmand.log --verbose=DEBUG --listen=localhost

i am trying the 1.1.6 release version and noticed the gearmand almost stops answering requests after a while.

This is how a normal "status" request via the admin protocol looks in the debug logfile:

  DEBUG 2013-04-22 14:27:54.377299 [ main ] accept() 340 -> libgearman-server/gearmand.cc:854
   INFO 2013-04-22 14:27:54.377365 [ main ] Accepted connection from 127.0.0.1:49282
  DEBUG 2013-04-22 14:27:54.377437 [ 1 ] Received CON wakeup event -> libgearman-server/gearmand_thread.cc:602
  DEBUG 2013-04-22 14:27:54.377487 [ 1 ] setsockopt() 340 -> libgearman-server/io.cc:809
  DEBUG 2013-04-22 14:27:54.377504 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
   INFO 2013-04-22 14:27:54.377512 [ 1 ] Gear connection made
  DEBUG 2013-04-22 14:27:54.377521 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:118
  DEBUG 2013-04-22 14:27:54.377539 [ 1 ] read 7 bytes -> libgearman-server/io.cc:686
   INFO 2013-04-22 14:27:54.377546 [ 1 ] Gear unpack
  DEBUG 2013-04-22 14:27:54.377552 [ 1 ] Received TEXT 127.0.0.1:40460705 -> libgearman-server/thread.cc:310
  DEBUG 2013-04-22 14:27:54.377574 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
  DEBUG 2013-04-22 14:27:54.377593 [ proc ] 127.0.0.1:49282 packet command TEXT -> libgearman-server/server.cc:98
  DEBUG 2013-04-22 14:27:54.377626 [ proc ] text command status 1 arguments -> libgearman-server/text.cc:63
  DEBUG 2013-04-22 14:27:54.377693 [ 1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:607
  DEBUG 2013-04-22 14:27:54.377705 [ 1 ] GEAR length: 139 gearmand_command_t: GEARMAN_COMMAND_TEXT -> libgearman-server/plugins/protocol/gear/protocol.cc:274
  DEBUG 2013-04-22 14:27:54.377774 [ 1 ] send() 139 bytes to peer 127.0.0.1:49282 -> libgearman-server/io.cc:275
  DEBUG 2013-04-22 14:27:54.377789 [ 1 ] Sent TEXT to 127.0.0.1:40460705 -> libgearman-server/thread.cc:357
  DEBUG 2013-04-22 14:27:54.377796 [ 1 ] free() packet's data -> libgearman-server/packet.cc:393

  DEBUG 2013-04-22 14:28:22.362232 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:118
   INFO 2013-04-22 14:28:22.362306 [ 1 ] Peer connection has called close() 127.0.0.1:49282
   INFO 2013-04-22 14:28:22.362317 [ 1 ] Disconnected 127.0.0.1:49282
  DEBUG 2013-04-22 14:28:22.362479 [ 1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:607
   INFO 2013-04-22 14:28:22.362490 [ 1 ] Gear connection disconnected

However, after a while the only thing happens is:
  DEBUG 2013-04-22 14:33:29.292611 [ main ] accept() 345 -> libgearman-server/gearmand.cc:854
   INFO 2013-04-22 14:33:29.292659 [ main ] Accepted connection from 127.0.0.1:50728

Currently i can only reproduce the problem when setting the loglevel to at least notice. In debug loglevel the gearmand
runs for hours without problems. In Notice loglevel the problem starts after a few minutes when close_wait connections
pile up. The logfile only contains the accepted jobs and some errors on shutdown:

ERROR 2013-05-03 08:57:50.000000 [ proc ] GEARMAND_WAKEUP_RUN(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:382
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_mutex_lock(Invalid argument) -> libgearman-server/gearmand_con.cc:685
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_mutex_lock(Invalid argument) -> libgearman-server/gearmand_con.cc:685
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_mutex_lock(Invalid argument) -> libgearman-server/gearmand_con.cc:685
FATAL 2013-05-03 08:57:50.000000 [ main ] pthread_mutex_lock(Invalid argument) -> libgearman-server/gearmand_con.cc:685

Any hints on how to debug that any further?

Brian Aker (brianaker)
Changed in gearmand:
assignee: nobody → Brian Aker (brianaker)
Revision history for this message
Sven Nierlein (sven-nierlein) wrote :

Meanwhile i tried different command line options and it seems like none of them changes the behaviour. I tried
--threads=30 like suggested by mail which maybe delays the problem a little bit, but close wait connections still pile
up till the daemon does not respond anymore.

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 1176324] Re: gearmand still accepts new jobs from time to time, but all worker are stuck too and won't receiver any jobs
Download full text (4.9 KiB)

Could you provide a small rest case?

Sent from my Ti85

On May 6, 2013, at 3:02, Sven Nierlein <email address hidden> wrote:

> Meanwhile i tried different command line options and it seems like none of them changes the behaviour. I tried
> --threads=30 like suggested by mail which maybe delays the problem a little bit, but close wait connections still pile
> up till the daemon does not respond anymore.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1176324
>
> Title:
> gearmand still accepts new jobs from time to time, but all worker are
> stuck too and won't receiver any jobs
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> .../gearmand --port=4730 --pid-file=/tmp/run/gearmand.pid --daemon
> --job-retries=0 --threads=10 -q libsqlite3 --store-queue-on-shutdown
> --libsqlite3-db=/tmp/gearmand.sqlite --log-file=/tmp/gearmand.log
> --verbose=DEBUG --listen=localhost
>
>
> i am trying the 1.1.6 release version and noticed the gearmand almost stops answering requests after a while.
>
> This is how a normal "status" request via the admin protocol looks in
> the debug logfile:
>
> DEBUG 2013-04-22 14:27:54.377299 [ main ] accept() 340 -> libgearman-server/gearmand.cc:854
> INFO 2013-04-22 14:27:54.377365 [ main ] Accepted connection from 127.0.0.1:49282
> DEBUG 2013-04-22 14:27:54.377437 [ 1 ] Received CON wakeup event -> libgearman-server/gearmand_thread.cc:602
> DEBUG 2013-04-22 14:27:54.377487 [ 1 ] setsockopt() 340 -> libgearman-server/io.cc:809
> DEBUG 2013-04-22 14:27:54.377504 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> INFO 2013-04-22 14:27:54.377512 [ 1 ] Gear connection made
> DEBUG 2013-04-22 14:27:54.377521 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:118
> DEBUG 2013-04-22 14:27:54.377539 [ 1 ] read 7 bytes -> libgearman-server/io.cc:686
> INFO 2013-04-22 14:27:54.377546 [ 1 ] Gear unpack
> DEBUG 2013-04-22 14:27:54.377552 [ 1 ] Received TEXT 127.0.0.1:40460705 -> libgearman-server/thread.cc:310
> DEBUG 2013-04-22 14:27:54.377574 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> DEBUG 2013-04-22 14:27:54.377593 [ proc ] 127.0.0.1:49282 packet command TEXT -> libgearman-server/server.cc:98
> DEBUG 2013-04-22 14:27:54.377626 [ proc ] text command status 1 arguments -> libgearman-server/text.cc:63
> DEBUG 2013-04-22 14:27:54.377693 [ 1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:607
> DEBUG 2013-04-22 14:27:54.377705 [ 1 ] GEAR length: 139 gearmand_command_t: GEARMAN_COMMAND_TEXT -> libgearman-server/plugins/protocol/gear/protocol.cc:274
> DEBUG 2013-04-22 14:27:54.377774 [ 1 ] send() 139 bytes to peer 127.0.0.1:49282 -> libgearman-server/io.cc:275
> DEBUG 2013-04-22 14:27:54.377789 [ 1 ] Sent TEXT to 127.0.0.1:40460705 -> libgearman-server/thread.cc:357
> DEBUG 2013-04-22 14:27:54.377796 [ 1 ] free() packet's data -> libgearman-server/packet.cc:393...

Read more...

Revision history for this message
Sven Nierlein (sven-nierlein) wrote :

I will try, but it really seems to be some kind of race condition. For example it works perfectly when enabling --verbose=DEBUG.

Revision history for this message
Brian Aker (brianaker) wrote :
Download full text (5.0 KiB)

Hi,

I think the issue is with EXCEPTION being poorly defined in the protocol. I went digging into the Python library and think I have found an issue...

Right now I am trying to see hunt it down. Thanks for your patience.

Cheers,
 -Brian

On May 6, 2013, at 2:36 PM, Sven Nierlein <email address hidden> wrote:

> I will try, but it really seems to be some kind of race condition. For
> example it works perfectly when enabling --verbose=DEBUG.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1176324
>
> Title:
> gearmand still accepts new jobs from time to time, but all worker are
> stuck too and won't receiver any jobs
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> .../gearmand --port=4730 --pid-file=/tmp/run/gearmand.pid --daemon
> --job-retries=0 --threads=10 -q libsqlite3 --store-queue-on-shutdown
> --libsqlite3-db=/tmp/gearmand.sqlite --log-file=/tmp/gearmand.log
> --verbose=DEBUG --listen=localhost
>
>
> i am trying the 1.1.6 release version and noticed the gearmand almost stops answering requests after a while.
>
> This is how a normal "status" request via the admin protocol looks in
> the debug logfile:
>
> DEBUG 2013-04-22 14:27:54.377299 [ main ] accept() 340 -> libgearman-server/gearmand.cc:854
> INFO 2013-04-22 14:27:54.377365 [ main ] Accepted connection from 127.0.0.1:49282
> DEBUG 2013-04-22 14:27:54.377437 [ 1 ] Received CON wakeup event -> libgearman-server/gearmand_thread.cc:602
> DEBUG 2013-04-22 14:27:54.377487 [ 1 ] setsockopt() 340 -> libgearman-server/io.cc:809
> DEBUG 2013-04-22 14:27:54.377504 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> INFO 2013-04-22 14:27:54.377512 [ 1 ] Gear connection made
> DEBUG 2013-04-22 14:27:54.377521 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:118
> DEBUG 2013-04-22 14:27:54.377539 [ 1 ] read 7 bytes -> libgearman-server/io.cc:686
> INFO 2013-04-22 14:27:54.377546 [ 1 ] Gear unpack
> DEBUG 2013-04-22 14:27:54.377552 [ 1 ] Received TEXT 127.0.0.1:40460705 -> libgearman-server/thread.cc:310
> DEBUG 2013-04-22 14:27:54.377574 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> DEBUG 2013-04-22 14:27:54.377593 [ proc ] 127.0.0.1:49282 packet command TEXT -> libgearman-server/server.cc:98
> DEBUG 2013-04-22 14:27:54.377626 [ proc ] text command status 1 arguments -> libgearman-server/text.cc:63
> DEBUG 2013-04-22 14:27:54.377693 [ 1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:607
> DEBUG 2013-04-22 14:27:54.377705 [ 1 ] GEAR length: 139 gearmand_command_t: GEARMAN_COMMAND_TEXT -> libgearman-server/plugins/protocol/gear/protocol.cc:274
> DEBUG 2013-04-22 14:27:54.377774 [ 1 ] send() 139 bytes to peer 127.0.0.1:49282 -> libgearman-server/io.cc:275
> DEBUG 2013-04-22 14:27:54.377789 [ 1 ] Sent TEXT to 127.0.0.1:40460705 -> libgearman-server/thread.cc:357
> DEBUG 2013-04-22 14:27:54.377796 [ 1 ] free() ...

Read more...

Revision history for this message
Sven Nierlein (sven-nierlein) wrote :

I don't use the python library. I only use the C-API. Let me know if i can test something.

Revision history for this message
Brian Aker (brianaker) wrote :
Download full text (4.8 KiB)

Can you test trunk?

Is it possible to get the code you are running (or a stripped down version)?

Sent from my Ti85

On May 7, 2013, at 8:26, Sven Nierlein <email address hidden> wrote:

> I don't use the python library. I only use the C-API. Let me know if i
> can test something.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1176324
>
> Title:
> gearmand still accepts new jobs from time to time, but all worker are
> stuck too and won't receiver any jobs
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> .../gearmand --port=4730 --pid-file=/tmp/run/gearmand.pid --daemon
> --job-retries=0 --threads=10 -q libsqlite3 --store-queue-on-shutdown
> --libsqlite3-db=/tmp/gearmand.sqlite --log-file=/tmp/gearmand.log
> --verbose=DEBUG --listen=localhost
>
>
> i am trying the 1.1.6 release version and noticed the gearmand almost stops answering requests after a while.
>
> This is how a normal "status" request via the admin protocol looks in
> the debug logfile:
>
> DEBUG 2013-04-22 14:27:54.377299 [ main ] accept() 340 -> libgearman-server/gearmand.cc:854
> INFO 2013-04-22 14:27:54.377365 [ main ] Accepted connection from 127.0.0.1:49282
> DEBUG 2013-04-22 14:27:54.377437 [ 1 ] Received CON wakeup event -> libgearman-server/gearmand_thread.cc:602
> DEBUG 2013-04-22 14:27:54.377487 [ 1 ] setsockopt() 340 -> libgearman-server/io.cc:809
> DEBUG 2013-04-22 14:27:54.377504 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> INFO 2013-04-22 14:27:54.377512 [ 1 ] Gear connection made
> DEBUG 2013-04-22 14:27:54.377521 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:118
> DEBUG 2013-04-22 14:27:54.377539 [ 1 ] read 7 bytes -> libgearman-server/io.cc:686
> INFO 2013-04-22 14:27:54.377546 [ 1 ] Gear unpack
> DEBUG 2013-04-22 14:27:54.377552 [ 1 ] Received TEXT 127.0.0.1:40460705 -> libgearman-server/thread.cc:310
> DEBUG 2013-04-22 14:27:54.377574 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> DEBUG 2013-04-22 14:27:54.377593 [ proc ] 127.0.0.1:49282 packet command TEXT -> libgearman-server/server.cc:98
> DEBUG 2013-04-22 14:27:54.377626 [ proc ] text command status 1 arguments -> libgearman-server/text.cc:63
> DEBUG 2013-04-22 14:27:54.377693 [ 1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:607
> DEBUG 2013-04-22 14:27:54.377705 [ 1 ] GEAR length: 139 gearmand_command_t: GEARMAN_COMMAND_TEXT -> libgearman-server/plugins/protocol/gear/protocol.cc:274
> DEBUG 2013-04-22 14:27:54.377774 [ 1 ] send() 139 bytes to peer 127.0.0.1:49282 -> libgearman-server/io.cc:275
> DEBUG 2013-04-22 14:27:54.377789 [ 1 ] Sent TEXT to 127.0.0.1:40460705 -> libgearman-server/thread.cc:357
> DEBUG 2013-04-22 14:27:54.377796 [ 1 ] free() packet's data -> libgearman-server/packet.cc:393
>
> DEBUG 2013-04-22 14:28:22.362232 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:11...

Read more...

Revision history for this message
Sven Nierlein (sven-nierlein) wrote :

i tried latest lp:gearmand but bootstrap.sh gives me this:

...
Makefile.am: installing `build-aux/depcomp'
tests/include.am:83: variable `EXTRA_tests_cli_DEPENDENCIES' is defined but no program or
tests/include.am:83: library has `EXTRA_tests_cli' as canonical name (possible typo)
Makefile.am:55: `tests/include.am' included from here
tests/include.am:35: variable `EXTRA_tests_cycle_DEPENDENCIES' is defined but no program or
tests/include.am:35: library has `EXTRA_tests_cycle' as canonical name (possible typo)
Makefile.am:55: `tests/include.am' included from here
tests/include.am:95: variable `EXTRA_tests_gearmand_DEPENDENCIES' is defined but no program or
tests/include.am:95: library has `EXTRA_tests_gearmand' as canonical name (possible typo)
Makefile.am:55: `tests/include.am' included from here
gearmand/include.am:38: variable `EXTRA_gearmand_hostile_gearmand_DEPENDENCIES' is defined but no program or
gearmand/include.am:38: library has `EXTRA_gearmand_hostile_gearmand' as canonical name (possible typo)
Makefile.am:50: `gearmand/include.am' included from here
gearmand/include.am:23: variable `EXTRA_gearmand_gearmand_DEPENDENCIES' is defined but no program or
gearmand/include.am:23: library has `EXTRA_gearmand_gearmand' as canonical name (possible typo)
Makefile.am:50: `gearmand/include.am' included from here
libtest/include.am:115: variable `EXTRA_libtest_libtest_la_DEPENDENCIES' is defined but no program or
libtest/include.am:115: library has `EXTRA_libtest_libtest_la' as canonical name (possible typo)
Makefile.am:41: `libtest/include.am' included from here
autoreconf: automake failed with exit status: 1
./bootstrap.sh:1090: Cannot execute /usr/bin/autoreconf

besides that, i will try to reproduce it with the smalles setup possible...

Revision history for this message
Sven Nierlein (sven-nierlein) wrote :

I could reproduce the problem with the 1.1.7 and the 1.0.6. Seems like the _wake_event never happens.

I added some more debug output, this is the normal workflow:
   INFO 2013-05-12 11:37:42.000000 [ main ] Accepted connection from 127.0.0.1:52706
   INFO 2013-05-12 11:37:42.000000 [ main ] gearmand_con_create() begin
   INFO 2013-05-12 11:37:42.000000 [ main ] gearmand_con_create() end
   INFO 2013-05-12 11:37:42.000000 [ main ] _listen_event() end 127.0.0.1:52706
   INFO 2013-05-12 11:37:42.000000 [ 4 ] _wakeup_event()
   INFO 2013-05-12 11:37:42.000000 [ 4 ] gearmand_con_check_queue()
   INFO 2013-05-12 11:37:42.000000 [ 4 ] Gear connection made
   INFO 2013-05-12 11:37:42.000000 [ 4 ] _thread_packet_read()
   INFO 2013-05-12 11:37:42.000000 [ 4 ] gearman_io_recv()

In some cases the wake up event is missing which then leads to the close waits with a stuck gearmand:
   INFO 2013-05-12 11:58:06.000000 [ main ] Accepted connection from 127.0.0.1:42747
   INFO 2013-05-12 11:58:06.000000 [ main ] gearmand_con_create() begin
   INFO 2013-05-12 11:58:06.000000 [ main ] gearmand_con_create() end
   INFO 2013-05-12 11:58:06.000000 [ main ] _listen_event() end 127.0.0.1:42747

Revision history for this message
Sven Nierlein (sven-nierlein) wrote :

I tested some more releases and the problem has been introduced between version 0.33 and 0.34. All releases after and including 0.34 are stuck after a few minutes. The versions before 0.33 work fine. Unfortunatly there have been tons of changes in that release.

Revision history for this message
Brian Aker (brianaker) wrote :

That for hunting down the version... let me see if anything pops out in a diff.

Revision history for this message
Thorsten Singer (singer-t) wrote :

I 've got the same problem with 1.1.8.

Revision history for this message
Brian Aker (brianaker) wrote :

A fix just went into trunk that will solve for the case where a bad connection would cause not all NOOP to be sent.

Revision history for this message
Brian Moon (brianlmoon) wrote :

We had this happen last night with gearmand 1.1.12. Was that fix that went in in 2013 rolled in by the time 1.1.12 was rolled? None of our workers could connect and the servers with gearmand on them were full of CLOSE_WAIT connections. We had to restart gearmand and wokers on all servers to get things going again.

Revision history for this message
Peter Valdemar Mørch (pmorch) wrote :

Brian, on 2013-08-12 you write:

> A fix just went into trunk that will solve for the case where a bad connection would cause not all NOOP to be sent.

Could you confirm or deny whether the patch in question is:

http://bazaar.launchpad.net/~tangent-trunk/gearmand/1.2/revision/853.1.1 :

    Committer: Brian Aker
    Date: 2013-08-11 01:50:44 UTC
    mto: This revision was merged to the branch mainline in revision 854.
    Revision ID: <email address hidden>

    https://launchpad.net/bugs/1196552
    Fix for removing job from durable queue on cancel.

? It mentions NOOP packet-s and has a time stamp that fits your commit "a fix just went into trunk"...

Also, this patch does not apply to 1.0.6 which is the version currently in debian sid and jessie. Do you think it would be easy to port this patch back to 1.0.6, and could you provide any pointers in the right direction? The patch mostly modifies gearman_server_job_cancel() which is not found in 1.0.6's gearmand_con.cc.

Sven points out above that this got introduced between 0.33 and 0.34. Could you point to exactly which commit introduced the problem? That would go a long way towards my understanding of how to create a patch for 0.34 and presumably for 1.0.6 too. My bzr is lacking, there are 33000+ lines of diffs between 0.33 and 0.34 and (apparently) no tags for 0.33 and 0.34 in bzr, so I must admit I gave up trying to identify the culprit.

Which do you think is the path of least resistence: Getting 1.1.12 compiling and installing in a package in Debian sid/jessie or creating a patch for 1.0.6?

Peter

Revision history for this message
Peter Valdemar Mørch (pmorch) wrote :

Oh, and for info:

I've seen this on debian jessie using mod-gearman. Full explanation in this thread:

Can't get mod-gearman to work no jessie's gearmand version 1.0.6 - Google Groups
https://groups.google.com/forum/#!topic/mod_gearman/IQYHy0qc1CE

(I wish I hadn't misspelled the topic! :-D)

Basically a tiny virtual machine with nagios3 at its default configuration and mod-gearman in between nagios3 and 1 worker on localhost. Close to the smallest mod-gearman installation possible on jessie.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Brian, can you please confirm for people the fix that you were talking about? Also reporters, do you still see this in recent gearmand's?

Changed in gearmand:
importance: Undecided → Critical
status: New → Triaged
Revision history for this message
Jaroslav Štěpánek (cz-guardian) wrote :

Hi Clint,

I'm having similar issues with gearman 1.1.12. The server gets stuck out of nowhere, sometimes accepts few jobs, but otherwise you cannot do anything until restarted.
Here's what I have in the logs, unfortunately it's only INFO, but for the next time I'll switch it to DEBUG.

INFO 2015-08-05 07:45:14.000000 [ main ] Accepted connection from 1.2.3.146:37334
INFO 2015-08-05 07:45:14.000000 [ 4 ] Peer connection has called close()
INFO 2015-08-05 07:45:14.000000 [ 4 ] Disconnected 1.2.3.134:43932
INFO 2015-08-05 07:45:14.000000 [ 4 ] Gear connection disconnected: -:-
INFO 2015-08-05 07:45:14.000000 [ main ] Accepted connection from 1.2.3.134:43973
INFO 2015-08-05 07:45:14.000000 [ main ] Accepted connection from 1.2.3.134:43989
INFO 2015-08-05 07:45:14.000000 [ 7 ] Peer connection has called close()
INFO 2015-08-05 07:45:14.000000 [ 7 ] Disconnected 1.2.3.134:48585
INFO 2015-08-05 07:45:15.000000 [ 7 ] Peer connection has called close()
INFO 2015-08-05 07:45:15.000000 [ 7 ] Disconnected 1.2.3.134:38350
INFO 2015-08-05 07:45:15.000000 [ main ] Accepted connection from 1.2.3.134:44037
INFO 2015-08-05 07:45:15.000000 [ main ] Accepted connection from 1.2.3.134:44058
INFO 2015-08-05 07:45:15.000000 [ 1 ] Peer connection has called close()
INFO 2015-08-05 07:45:15.000000 [ 1 ] Disconnected 1.2.3.134:48665
INFO 2015-08-05 07:45:16.000000 [ 3 ] Peer connection has called close()
INFO 2015-08-05 07:45:16.000000 [ 3 ] Disconnected 1.2.3.146:36709
INFO 2015-08-05 07:45:16.000000 [ 10 ] Peer connection has called close()
INFO 2015-08-05 07:45:16.000000 [ 10 ] Disconnected 1.2.3.146:37268
INFO 2015-08-05 07:45:16.000000 [ main ] Accepted connection from 1.2.3.134:44137
INFO 2015-08-05 07:45:16.000000 [ 8 ] Peer connection has called close()
INFO 2015-08-05 07:45:16.000000 [ 8 ] Disconnected 1.2.3.134:43416
INFO 2015-08-05 07:45:16.000000 [ 3 ] Peer connection has called close()
INFO 2015-08-05 07:45:16.000000 [ 3 ] Disconnected 1.2.3.146:37285
INFO 2015-08-05 07:45:16.000000 [ main ] Accepted connection from 1.2.3.146:37356
INFO 2015-08-05 07:45:16.000000 [ main ] Accepted connection from 1.2.3.146:37358
INFO 2015-08-05 07:45:16.000000 [ main ] Accepted connection from 1.2.3.146:37360
INFO 2015-08-05 07:45:17.000000 [ 2 ] Peer connection has called close()
INFO 2015-08-05 07:45:17.000000 [ 2 ] Disconnected 1.2.3.134:60799
INFO 2015-08-05 07:45:17.000000 [ main ] Accepted connection from 1.2.3.134:44178
INFO 2015-08-05 07:45:17.000000 [ 1 ] Peer connection has called close()
INFO 2015-08-05 07:45:17.000000 [ 1 ] Disconnected 1.2.3.134:41112
INFO 2015-08-05 07:45:17.000000 [ 2 ] Peer connection has called close()
INFO 2015-08-05 07:45:17.000000 [ 2 ] Disconnected 1.2.3.146:37035
INFO 2015-08-05 07:45:17.000000 [ main ] Accepted connection from 1.2.3.146:37369

FYI the ips are changed...

Best regards,
Jaroslav

Revision history for this message
Saad Amin (saadvi) wrote :

same problem

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 1176324] gearmand still accepts new jobs from time to time, but all worker are stuck too and won't receiver any jobs
Download full text (4.9 KiB)

I will look now.

> On Jul 23, 2015, at 17:49, Clint Byrum <email address hidden> wrote:
>
> Brian, can you please confirm for people the fix that you were talking
> about? Also reporters, do you still see this in recent gearmand's?
>
> ** Changed in: gearmand
> Importance: Undecided => Critical
>
> ** Changed in: gearmand
> Status: New => Triaged
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1176324
>
> Title:
> gearmand still accepts new jobs from time to time, but all worker are
> stuck too and won't receiver any jobs
>
> Status in Gearman:
> Triaged
>
> Bug description:
> .../gearmand --port=4730 --pid-file=/tmp/run/gearmand.pid --daemon
> --job-retries=0 --threads=10 -q libsqlite3 --store-queue-on-shutdown
> --libsqlite3-db=/tmp/gearmand.sqlite --log-file=/tmp/gearmand.log
> --verbose=DEBUG --listen=localhost
>
>
> i am trying the 1.1.6 release version and noticed the gearmand almost stops answering requests after a while.
>
> This is how a normal "status" request via the admin protocol looks in
> the debug logfile:
>
> DEBUG 2013-04-22 14:27:54.377299 [ main ] accept() 340 -> libgearman-server/gearmand.cc:854
> INFO 2013-04-22 14:27:54.377365 [ main ] Accepted connection from 127.0.0.1:49282
> DEBUG 2013-04-22 14:27:54.377437 [ 1 ] Received CON wakeup event -> libgearman-server/gearmand_thread.cc:602
> DEBUG 2013-04-22 14:27:54.377487 [ 1 ] setsockopt() 340 -> libgearman-server/io.cc:809
> DEBUG 2013-04-22 14:27:54.377504 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> INFO 2013-04-22 14:27:54.377512 [ 1 ] Gear connection made
> DEBUG 2013-04-22 14:27:54.377521 [ 1 ] 127.0.0.1:49282 Ready POLLIN -> libgearman-server/gearmand_con.cc:118
> DEBUG 2013-04-22 14:27:54.377539 [ 1 ] read 7 bytes -> libgearman-server/io.cc:686
> INFO 2013-04-22 14:27:54.377546 [ 1 ] Gear unpack
> DEBUG 2013-04-22 14:27:54.377552 [ 1 ] Received TEXT 127.0.0.1:40460705 -> libgearman-server/thread.cc:310
> DEBUG 2013-04-22 14:27:54.377574 [ 1 ] 127.0.0.1:49282 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
> DEBUG 2013-04-22 14:27:54.377593 [ proc ] 127.0.0.1:49282 packet command TEXT -> libgearman-server/server.cc:98
> DEBUG 2013-04-22 14:27:54.377626 [ proc ] text command status 1 arguments -> libgearman-server/text.cc:63
> DEBUG 2013-04-22 14:27:54.377693 [ 1 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:607
> DEBUG 2013-04-22 14:27:54.377705 [ 1 ] GEAR length: 139 gearmand_command_t: GEARMAN_COMMAND_TEXT -> libgearman-server/plugins/protocol/gear/protocol.cc:274
> DEBUG 2013-04-22 14:27:54.377774 [ 1 ] send() 139 bytes to peer 127.0.0.1:49282 -> libgearman-server/io.cc:275
> DEBUG 2013-04-22 14:27:54.377789 [ 1 ] Sent TEXT to 127.0.0.1:40460705 -> libgearman-server/thread.cc:357
> DEBUG 2013-04-22 14:27:54.377796 [ 1 ] free() packet's data -> libgearman-server/packet.cc:393
>
> DEBUG 2013-04-22 14:28:22.362232 [ 1 ] 127.0.0.1:49282 Read...

Read more...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.