Comment 8 for bug 1220168

Brian Aker (brianaker) wrote : Re: [Bug 1220168] Re: Python Gearman worker intermittently stop receiving jobs from gearmand

I audited the noop code in the server and found a case where the noop
count might be off.

I don't think this is the fix to your problem, unless this is some I/O
issue that isn't surfacing in error reporting.

On Sep 17, 2013, at 3:25, Aldrian Obaja <email address hidden> wrote:

> I conducted further tests and found that the client actually stops
> receiving NOOP commands from the server. Attached is the sample run
> (this one is shorter than the previous one).
>
> In this attachment, I now log when each worker receives a NOOP, NO_JOB,
> or JOB_ASSIGN from any server (unfortunately I couldn't log which
> server each signal came from).
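
One way to recover the missing server identity is to tag each packet at the point where it is read off a connection. The sketch below is stdlib-only and is not python-gearman's code; `describe_packet` and the `server` label are hypothetical names. It relies only on the Gearman binary protocol header (4-byte magic, 4-byte big-endian type, 4-byte big-endian payload size) and the type codes NOOP = 6, NO_JOB = 10, JOB_ASSIGN = 11.

```python
import struct

# Response packet types from the Gearman binary protocol.
PACKET_NAMES = {6: "NOOP", 10: "NO_JOB", 11: "JOB_ASSIGN"}
RESPONSE_MAGIC = b"\x00RES"
HEADER = struct.Struct(">4sII")  # magic, type, payload size (big-endian)


def describe_packet(server, data):
    """Return a log line for the first packet in `data`, tagged with `server`.

    `server` is a hypothetical "host:port" label supplied by the caller.
    """
    if len(data) < HEADER.size:
        raise ValueError("incomplete header")
    magic, packet_type, size = HEADER.unpack_from(data)
    if magic != RESPONSE_MAGIC:
        raise ValueError("not a response packet")
    name = PACKET_NAMES.get(packet_type, "TYPE_%d" % packet_type)
    return "Receive %s from %s (%d payload bytes)" % (name, server, size)


# Example: a NOOP packet carries no payload.
noop = HEADER.pack(RESPONSE_MAGIC, 6, 0)
print(describe_packet("localhost:4730", noop))
```

With a line like this per packet, the log would show directly which server stopped sending NOOP to a given worker.
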
>
> Notice that line 188 is the last time worker 01 receives a job from server 4730.
> Line 7552 is the last time worker 01 receives a job from any server. Note that there is no "Receive NOOP" from worker 01 past this line, although other workers are still receiving it.
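
Finding these "last time" line numbers by hand is error-prone, so a small script can help. This is a minimal sketch: the line format matched by `LINE_RE` is an assumption, not the format of the attached log, and the sample lines are invented for illustration. `last_seen` is a hypothetical helper name.

```python
import re

# Assumed line format; adjust the pattern to match the real worker.log.
LINE_RE = re.compile(r"worker (\d+).*Receive (NOOP|NO_JOB|JOB_ASSIGN)")


def last_seen(lines):
    """Map (worker, event) to the 1-based line number of its last occurrence."""
    seen = {}
    for number, line in enumerate(lines, start=1):
        match = LINE_RE.search(line)
        if match:
            seen[match.groups()] = number
    return seen


# Invented sample lines, only to show the shape of the result.
sample = [
    "worker 01: Receive NOOP",
    "worker 02: Receive NOOP",
    "worker 01: Receive JOB_ASSIGN",
    "worker 02: Receive NO_JOB",
    "worker 02: Receive NOOP",
]
for (worker, event), number in sorted(last_seen(sample).items()):
    print("worker %s last %s at line %d" % (worker, event, number))
```

A worker whose last NOOP is far earlier than everyone else's is a candidate for the stall described here.
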
>
> I would really appreciate it if you could also take a look at the
> python-distribution code, to check whether each signal is handled
> correctly or whether there could be any race condition.
>
> ** Attachment added: "Further log file"
> https://bugs.launchpad.net/gearmand/+bug/1220168/+attachment/3825102/+files/logfile_1.zip
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1220168
>
> Title:
> Python Gearman worker intermittently stop receiving jobs from gearmand
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> I found that when using multiple gearmand servers and having the
> workers connect to both of them, somehow (intermittently) some of the
> workers will accept jobs from only one of the servers. From the
> log I created, it seems that gearmand stops sending job offers to those
> workers. This causes the jobs from that server to be executed only by
> the workers which are still connected to that server. I experienced
> this on a 16-core Ubuntu machine.
>
> Configuration:
> 1. Start gearmand on ports 4730 and 4731
> 2. Start workers B, C, D, and E, all connected to both 4730 and 4731
> 3. Start client A, connected to both 4730 and 4731
> 4. After many jobs are sent, some of the workers will randomly stop receiving job offers from one of the servers.
>
> To test, try the minimal reproducing code attached below.
> 1. Start gearmand on ports 4730 and 4731
> 2. Start gearman_worker.py
> 3. Start test.bash
> 4. Check that for the last n jobs (as seen in worker.log), only some of the workers are processing the requests (i.e., some others just don't fetch jobs from the other server anymore). Note that this happens intermittently, so please rerun the worker and the bash script if the behaviour hasn't occurred yet.
>
> I used gearmand 1.1.9 on 16-core Ubuntu 11.04
> The worker and client are using Python 2.7 with python-gearman version 2.0.2
>
> Attached is the zip of logfiles.
>
> Note that from line 40969 onwards, only two workers receive jobs
> from server 4730 (worker 00 and worker 01). Also
> note that the other workers didn't lose their connection to the server, as we
> can see at lines 40970 and 40971 that the other two workers are
> waiting with connections open to both servers.
>
> I modified connection_manager.py (lines 128-131) in the python-gearman distribution to log the connection status.
> By analyzing the code in connection_manager.py, we can conclude that the workers are actually waiting (blocked in the gearman.util.select statement), but they just didn't get anything from the servers.
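
The blocked-in-select symptom is consistent with how the protocol's sleep/wake cycle works: after PRE_SLEEP a worker sends nothing until the server wakes it with NOOP, then it sends GRAB_JOB and gets JOB_ASSIGN or NO_JOB back. The toy model below illustrates that dependency on one connection. It is a simplified sketch, not python-gearman's or gearmand's code; `SleepingWorker` and its state names are hypothetical, and job execution and WORK_COMPLETE are left out.

```python
class SleepingWorker:
    """Toy model of one worker-to-server connection."""

    def __init__(self):
        self.state = "SLEEPING"  # PRE_SLEEP has already been sent
        self.jobs_done = 0

    def on_packet(self, packet):
        """Advance the state machine; return the command the worker sends next."""
        if self.state == "SLEEPING" and packet == "NOOP":
            self.state = "GRABBING"
            return "GRAB_JOB"
        if self.state == "GRABBING" and packet == "JOB_ASSIGN":
            self.jobs_done += 1
            # After finishing the job, the worker asks for another one.
            return "GRAB_JOB"
        if self.state == "GRABBING" and packet == "NO_JOB":
            self.state = "SLEEPING"
            return "PRE_SLEEP"
        return None  # anything else is ignored in this toy model


worker = SleepingWorker()
sent = [worker.on_packet(p) for p in ["NOOP", "JOB_ASSIGN", "NO_JOB"]]
print(worker.state, worker.jobs_done, sent)
```

Once the model is back in SLEEPING, the only packet that moves it forward is NOOP. If a server skips that wake-up for one connection, the worker stays connected and idle on that server while still serving the other one.
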
>
> Expected behaviour:
> A worker should always receive jobs from any server that still has jobs available.