Python Gearman worker intermittently stop receiving jobs from gearmand

Bug #1220168 reported by Aldrian Obaja on 2013-09-03
This bug affects 2 people
Affects: Gearman
Importance: Undecided
Assigned to: Brian Aker

Bug Description

I found that when using multiple gearmand servers with the workers connected to both of them, some of the workers intermittently end up accepting jobs from only one of the servers. From the log I created, it seems that gearmand stops sending job offers to those workers. As a result, the jobs from that server are executed only by the workers that are still being served by it. I experienced this on a 16-core Ubuntu machine.

Configuration:
1. Start gearmand at ports 4730 and 4731
2. Start workers B, C, D, and E, each connected to both 4730 and 4731
3. Start client A, connected to both 4730 and 4731
4. After many jobs are sent, some of the workers will randomly stop receiving job offers from one of the servers.
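
The configuration above could be set up roughly as follows. This is only a sketch, not the attached gearman_worker.py: it assumes the third-party python-gearman 2.0.x package is installed, and the task name "echo" and the function names are illustrative.

```python
# Sketch only: requires the third-party python-gearman package (2.0.x API).
# The task name "echo" and the helper names are illustrative assumptions.

SERVERS = ["localhost:4730", "localhost:4731"]  # both gearmand instances


def echo_task(gearman_worker, gearman_job):
    """Task callback: python-gearman passes in the worker and the job."""
    return gearman_job.data


def run_worker(client_id, servers=SERVERS):
    """Connect one worker to every server in `servers` and block in work()."""
    import gearman  # imported lazily so this file loads without the package

    worker = gearman.GearmanWorker(servers)
    worker.set_client_id(client_id)
    worker.register_task("echo", echo_task)
    worker.work()  # loops until interrupted, serving jobs from either server


def submit_jobs(count, servers=SERVERS):
    """Submit `count` background jobs through a client connected to both servers."""
    import gearman

    client = gearman.GearmanClient(servers)
    for i in range(count):
        client.submit_job("echo", "job %d" % i, background=True,
                          wait_until_complete=False)
```

Each worker would call run_worker() in its own process (for example with client ids "B" to "E"), and the client would then call submit_jobs() with a large count.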

To test, try the minimal reproduction code attached below.
1. Start gearmand at 4730 and 4731
2. Start gearman_worker.py
3. Start test.bash
4. Check that for the last n jobs (as seen in worker.log), only some of the workers are processing the requests (i.e., the others no longer fetch jobs from the other server). Note that this happens intermittently, so please rerun the worker and the bash script if the behaviour hasn't occurred yet.

I used gearmand 1.1.9 on 16-core Ubuntu 11.04
The worker and client are using Python 2.7 with python-gearman version 2.0.2

Attached is the zip of logfiles.

Note that from line 40969 onwards, only two workers receive jobs from server 4730 (worker 00 and worker 01). Also note that the other workers did not lose their connection to the server: lines 40970 and 40971 show that the other two workers are waiting with connections open to both servers.

I modified connection_manager.py (lines 128-131) in the python-gearman distribution to log the connection status.
By analyzing the code in connection_manager.py, we can conclude that the workers are actually waiting (blocked in the gearman.util.select statement), but they simply did not receive anything from the servers.
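
To illustrate what "blocked in select but receiving nothing" means, here is a minimal sketch using only the standard library. It is not python-gearman code: the socket pairs are stand-ins for the two server connections, and the stdlib select.select stands in for gearman.util.select.

```python
import select
import socket

# Two socket pairs stand in for the worker's connections to the two servers.
server_a, worker_conn_a = socket.socketpair()
server_b, worker_conn_b = socket.socketpair()


def wait_for_servers(connections, timeout):
    """Return the connections that become readable within `timeout` seconds."""
    readable, _, _ = select.select(connections, [], [], timeout)
    return readable


# Only "server B" sends a wake-up; "server A" stays silent.
server_b.sendall(b"wake")

ready = wait_for_servers([worker_conn_a, worker_conn_b], timeout=1.0)

# Connection A is still open and still being watched, but it never becomes
# readable, so the worker never learns about jobs queued on that server.
print(worker_conn_b in ready, worker_conn_a in ready)
```

An open connection is therefore not enough: the worker only reacts when the server actually writes something to it.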

Expected behaviour:
A worker should always receive jobs from any server that still has jobs available.

Brian Aker (brianaker) wrote :

Thank you! Investigating.

Changed in gearmand:
assignee: nobody → Brian Aker (brianaker)

Brian Aker (brianaker) wrote :

Still looking at it.

Brian Aker (brianaker) wrote :

When I look in the server I see NOOP being sent to the workers.

Somewhere in the Python code you need to check whether the event loop catches those. If it does not, the problem looks to be in the Python code.

Aldrian Obaja (aldrian-math) wrote :

But why is it sending NOOP when it's supposed to send a job request?

Brian Aker (brianaker) wrote :

http://gearman.org/protocol

NOOP

    This is used to wake up a sleeping worker so that it may grab a
    pending job.

    Arguments:
    - None.
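
As a rough illustration of the exchange the protocol describes, the sketch below builds the relevant packets by hand with the standard library. The packet type numbers and the 12-byte header layout follow the Gearman protocol document; the helper names (pack, unpack, worker_reply) are made up for this example and are not part of python-gearman.

```python
import struct

# Packet type numbers from the Gearman binary protocol.
PRE_SLEEP, NOOP, GRAB_JOB, NO_JOB, JOB_ASSIGN = 4, 6, 9, 10, 11
REQ, RES = b"\x00REQ", b"\x00RES"


def pack(magic, ptype, payload=b""):
    """Build a packet: 4-byte magic, 4-byte type, 4-byte payload size, payload."""
    return magic + struct.pack(">II", ptype, len(payload)) + payload


def unpack(packet):
    """Split a single packet into (magic, type, payload)."""
    magic = packet[:4]
    ptype, size = struct.unpack(">II", packet[4:12])
    return magic, ptype, packet[12:12 + size]


def worker_reply(packet):
    """Return the request an idle worker sends in response to a server packet."""
    _, ptype, _ = unpack(packet)
    if ptype == NOOP:
        return pack(REQ, GRAB_JOB)   # woken up: ask the server for a job
    if ptype == NO_JOB:
        return pack(REQ, PRE_SLEEP)  # nothing queued: go back to sleep
    return None                      # JOB_ASSIGN and others: handled elsewhere
```

In other words, the server does not push the job itself to a sleeping worker. It sends NOOP, and the job is only handed over after the worker answers with GRAB_JOB, so a worker that never sees the NOOP stays asleep.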


Brian Aker (brianaker) on 2013-09-15
summary: - Gearman worker intermittently stop receiving jobs from gearmand
+ Python Gearman worker intermittently stop receiving jobs from gearmand

Aldrian Obaja (aldrian-math) wrote :

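I conducted further tests and found that the client actually stops receiving NOOP commands from the server. Attached is a sample run, "Further log file" (this one is shorter than the previous one):
https://bugs.launchpad.net/gearmand/+bug/1220168/+attachment/3825102/+files/logfile_1.zip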

In this attachment, I now log when each worker receives a NOOP, NO_JOB, or JOB_ASSIGN from any server (unfortunately, I couldn't log which server the signal came from).

Notice that line 188 is the last time worker 01 receives a job from server 4730.
Line 7552 is the last time worker 01 receives a job from any server. Note that there is no "Receive NOOP" from worker 01 past this line, although the other workers are still receiving it.
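
A check like this could be automated with a short script. The sketch below is hypothetical: the real log format is not shown in this report, so the regular expression and the sample lines are assumptions that would need to be adapted to the actual log file.

```python
import re

# Assumed line shape, e.g. "... worker 01 ... Receive NOOP ..."; adjust the
# pattern to the real log format.
EVENT_RE = re.compile(r"worker (\d+).*?Receive (NOOP|NO_JOB|JOB_ASSIGN)")


def last_seen(lines):
    """Map (worker, event) to the 1-based number of the last line logging it."""
    last = {}
    for lineno, line in enumerate(lines, start=1):
        match = EVENT_RE.search(line)
        if match:
            last[match.groups()] = lineno
    return last


# Synthetic sample lines, not taken from the attached log.
sample = [
    "worker 00 Receive NOOP",
    "worker 01 Receive NOOP",
    "worker 01 Receive JOB_ASSIGN",
    "worker 00 Receive NOOP",
    "worker 00 Receive JOB_ASSIGN",
]
for (worker, event), lineno in sorted(last_seen(sample).items()):
    print("worker %s last %-10s at line %d" % (worker, event, lineno))
```

A worker whose last NOOP line is far earlier than the others' is a candidate for having stopped being woken up.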

I would really appreciate it if you could also take a look at the python-gearman distribution code, to check whether each signal is handled correctly or whether there could be a race condition.

Brian Aker (brianaker) wrote :

I audited the noop code in the server and found a case where the noop
count might be off.

I don't think this is the fix to your problem unless this is some io
issue that isn't surfacing in error reporting.


orainxiong (orain-xiong) wrote :

I would like to know whether this issue has been resolved.
My environment is gearmand 1.1.12 with multiple job servers (4731, 4732) and MySQL as the persistent queue type.
I encountered the same problem: after the program has been running for a week, some of the jobs remain stuck in the MySQL queue. The strange thing is that some jobs are still processed fine.

yunfei (233602551-t) wrote :

I have resolved this problem by gathering some code from the branches of python-gearman and adding some code to grab jobs from the servers.
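
One way "grabbing jobs from the servers" could work is to poll each server periodically instead of relying only on NOOP wake-ups. The sketch below is a hypothetical illustration of that idea, not the code from the fork: the class name GrabScheduler and its methods are invented for this example.

```python
import time


class GrabScheduler(object):
    """Decide when an idle worker should send GRAB_JOB without waiting for NOOP."""

    def __init__(self, max_idle_seconds, clock=time.time):
        self.max_idle = max_idle_seconds
        self.clock = clock
        self.last_grab = {}  # server name -> time of the last GRAB_JOB

    def mark_grab(self, server):
        """Record that GRAB_JOB was just sent to `server`."""
        self.last_grab[server] = self.clock()

    def servers_to_poll(self, servers):
        """Return servers that have gone at least max_idle seconds without a grab."""
        now = self.clock()
        return [s for s in servers
                if now - self.last_grab.get(s, float("-inf")) >= self.max_idle]
```

A worker loop could call servers_to_poll() each time its select call times out, send GRAB_JOB to the returned servers, and then call mark_grab() for each. The trade-off is some extra traffic in exchange for not stalling when a wake-up is lost.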

My fork is https://github.com/yunjianfei/python-gearman

Aldrian Obaja (aldrian-math) wrote :

It's great, Yunfei, to see someone actually working on this.

I see that you introduced quite a lot of changes and added a new class. Can you elaborate on what the problem was and how you fixed it?
