Thanks, either I or someone else will look at this in the next day or so.
Sent from my C64
On Oct 7, 2011, at 8:36 AM, Marc Easen <email address hidden> wrote:
> During high load I was seeing the same issue but this is what was happening:
>
> strace of worker:
>
> sendto(45, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(45, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=45, events=POLLIN}], 1, 5000) = 1 ([{fd=45, revents=POLLIN}])
> getsockopt(45, SOL_SOCKET, SO_ERROR, [1459455931862482944], [4]) = 0
> sendto(45, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(45, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
> sendto(45, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
>
> From what I can see, the following is happening:
>
> 1) The worker requests work
> 2) A response of GEARMAN_LOST_CONNECTION is returned from the recv() function
> 3) The worker then sends a PRE_SLEEP command
> 4) The worker then requests work again
> 5) Gearmand then responds with the responses for both job requests, which causes an odd spike in the server (I have located where this is happening)
> 6) The worker then sends a PRE_SLEEP command
>
> I have fixed this issue by removing the break for a
> GEARMAN_LOST_CONNECTION status, along with saving the state of the
> worker, which allows it to continue where it left off.
>
> (Please ignore the git references, I downloaded 0.24 and imported it
> into a git repo so I could track the changes)
>
>
> diff --git a/libgearman/worker.cc b/libgearman/worker.cc
> index 21488e0..745ca20 100644
> --- a/libgearman/worker.cc
> +++ b/libgearman/worker.cc
> @@ -672,17 +672,11 @@ gearman_job_st *gearman_worker_grab_job(gearman_worker_st *worker,
>
>          if (gearman_failed(*ret_ptr))
>          {
> -          if (*ret_ptr == GEARMAN_IO_WAIT)
> -          {
> -            worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_RECV;
> -          }
> -          else
> +          worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_RECV;
> +          if (*ret_ptr != GEARMAN_IO_WAIT)
>            {
>              gearman_job_free(worker->job);
>              worker->job= NULL;
> -
> -            if (*ret_ptr == GEARMAN_LOST_CONNECTION)
> -              break;
>            }
>
>            return NULL;
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> ....
>
>
> Strace of gearmand:
>
>
> # strace -p 2820
> Process 2820 attached - interrupt to quit
> clock_gettime(CLOCK_MONOTONIC, {3794803, 727265822}) = 0
> epoll_wait(3,
>
> (... and nothing more.... )
>
>
> All workers are PHP based.
>
>
> # php --ri gearman
>
> gearman
>
> gearman support => enabled
> extension version => 0.8.0
> libgearman version => 0.22
> Default TCP Host => 127.0.0.1
> Default TCP Port => 4730
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/gearmand/+bug/802850/+subscriptions
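
For anyone reading the strace output in the quoted message: each 12-byte buffer is a Gearman binary packet header (4-byte magic, 4-byte big-endian command number, 4-byte big-endian payload size). Below is a minimal sketch of a decoder for those headers. GearmanHeader, parse_header and command_name are names made up for this note, not libgearman API, and the command numbers are taken from my reading of the Gearman protocol document, so please check them against your copy before relying on them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper type for this note; not part of libgearman.
struct GearmanHeader {
    bool is_request;         // true for "\0REQ", false for "\0RES"
    std::uint32_t command;   // command number, big-endian on the wire
    std::uint32_t data_size; // payload length in bytes, big-endian on the wire
};

// Read a 32-bit big-endian integer from four bytes.
static std::uint32_t read_be32(const unsigned char *p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8) | std::uint32_t(p[3]);
}

// Parse one 12-byte header. Returns false if the buffer is too short
// or the magic is neither "\0REQ" nor "\0RES".
bool parse_header(const unsigned char *buf, std::size_t len, GearmanHeader *out) {
    assert(buf != nullptr && out != nullptr);
    if (len < 12) return false;
    const bool req = std::memcmp(buf, "\0REQ", 4) == 0;
    const bool res = std::memcmp(buf, "\0RES", 4) == 0;
    if (!req && !res) return false;
    out->is_request = req;
    out->command = read_be32(buf + 4);
    out->data_size = read_be32(buf + 8);
    return true;
}

// Only the commands that appear in the traces; numbering is an assumption
// based on the Gearman protocol document.
const char *command_name(std::uint32_t command) {
    switch (command) {
        case 4:  return "PRE_SLEEP";
        case 10: return "NO_JOB";
        case 30: return "GRAB_JOB_UNIQ";
        case 39: return "GRAB_JOB_ALL";
        default: return "(other)";
    }
}
```

Read this way, and assuming that numbering is right, the trace from the quoted message shows two GRAB_JOB_ALL requests answered by two NO_JOB responses in a single 24-byte recvfrom, while the trace in the original report is GRAB_JOB_UNIQ, NO_JOB, PRE_SLEEP repeating.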
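
To make the effect of the quoted patch easier to discuss, here is a toy model of just the error branch it touches. This is not libgearman code: ToyWorker, Status, State and Action are invented for illustration, and has_job merely stands in for worker->job being non-NULL. The diff does not show what runs after the removed break, so the model only reports that the break was taken.

```cpp
// Toy model of the error branch changed by the patch; not libgearman code.
enum class Status { IoWait, LostConnection, OtherError };
enum class State { GrabJobSend, GrabJobRecv };
enum class Action { ReturnNull, BreakOut };

struct ToyWorker {
    State state = State::GrabJobSend;
    bool has_job = true;  // stands in for worker->job != NULL
};

// Behaviour of the removed lines: the state is saved only on IO_WAIT,
// and a lost connection breaks out instead of returning.
Action on_failure_before(ToyWorker &w, Status s) {
    if (s == Status::IoWait) {
        w.state = State::GrabJobRecv;
    } else {
        w.has_job = false;
        if (s == Status::LostConnection) return Action::BreakOut;
    }
    return Action::ReturnNull;
}

// Behaviour of the added lines: the state is always saved, the job is
// dropped on any error other than IO_WAIT, and the function always returns.
Action on_failure_after(ToyWorker &w, Status s) {
    w.state = State::GrabJobRecv;
    if (s != Status::IoWait) w.has_job = false;
    return Action::ReturnNull;
}
```

In this model the only observable differences are for non-IO_WAIT errors: after the patch the worker records that it was waiting on a GRAB_JOB response, and a lost connection no longer takes the break path.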