Hi!
Have you confirmed that your timeout is set high (or disabled entirely), and that the server has been set to request only a fixed number of workers to come online?
Sent from my C64
On Jul 8, 2011, at 1:34 AM, Artur Bodera <email address hidden> wrote:
> Just compiled and installed 0.23 daemon,
> recompiled pecl extension,
> restarted everything
>
> ! same thing.
>
> Below is strace of one of the workers.
>
> I have noticed that "THE LOOP" happens when the machine has an avg load
> of 10+. It does not occur when the machine has no load (i.e. <1) - then
> the gearman daemon behaves nicely, idle workers have cpu time < 1s in 1h
> of waiting.
>
> Unfortunately as soon as some cpu-intensive work kicks in, one of
> "gearman -d" threads starts eating up the CPU resulting in avg load >
> 30.
>
>
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(12, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 60
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000^Cto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
> Process 30713 detached
>
>
> ** Summary changed:
>
> - Gearman 100% cpu usage, workers in a loop (PHP, 0.22)
> + Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> ....
>
>
> Strace of gearmand:
>
>
> # strace -p 2820
> Process 2820 attached - interrupt to quit
> clock_gettime(CLOCK_MONOTONIC, {3794803, 727265822}) = 0
> epoll_wait(3,
>
> (... and nothing more.... )
>
>
> All workers are PHP based.
>
>
> # php --ri gearman
>
> gearman
>
> gearman support => enabled
> extension version => 0.8.0
> libgearman version => 0.22
> Default TCP Host => 127.0.0.1
> Default TCP Port => 4730
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/gearmand/+bug/802850/+subscriptions
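The 12-byte strings in the worker straces look like Gearman binary-protocol packet headers. Each header is a 4-byte magic ("\0REQ" or "\0RES"), followed by a big-endian 4-byte packet type and a big-endian 4-byte payload size. The sketch below decodes them so the loop is easier to read. The type table is only a subset of the protocol's packet numbers, and the helper names decode_header and decode_stream are hypothetical, not part of libgearman or the PHP extension.

Decoded this way, the worker loop reads as follows, repeated:

* PRE_SLEEP (4)
* GRAB_JOB_UNIQ (30, octal \36)
* NO_JOB (10, "\n")

The 60-byte recvfrom is presumably five back-to-back NO_JOB replies.

```python
import struct

# Subset of Gearman binary-protocol packet type numbers relevant to the traces.
PACKET_TYPES = {
    4: "PRE_SLEEP",
    6: "NOOP",
    9: "GRAB_JOB",
    10: "NO_JOB",
    11: "JOB_ASSIGN",
    30: "GRAB_JOB_UNIQ",
    31: "JOB_ASSIGN_UNIQ",
}

# Header layout: 4-byte magic, 4-byte type, 4-byte payload size (big-endian).
HEADER = struct.Struct(">4sII")


def decode_header(raw: bytes) -> tuple[str, str, int]:
    """Decode one 12-byte packet header into (REQ/RES, type name, payload size)."""
    if len(raw) < HEADER.size:
        raise ValueError("need at least 12 bytes")
    magic, ptype, size = HEADER.unpack_from(raw)
    if magic not in (b"\0REQ", b"\0RES"):
        raise ValueError(f"bad magic {magic!r}")
    name = PACKET_TYPES.get(ptype, f"UNKNOWN({ptype})")
    return magic[1:].decode("ascii"), name, size


def decode_stream(buf: bytes) -> list[tuple[str, str, int]]:
    """Split a buffer of back-to-back packets into headers.

    Payload bytes are skipped, not validated; a trailing partial header is ignored.
    """
    headers = []
    offset = 0
    while offset + HEADER.size <= len(buf):
        kind, name, size = decode_header(buf[offset:offset + HEADER.size])
        headers.append((kind, name, size))
        offset += HEADER.size + size
    return headers


if __name__ == "__main__":
    # Python bytes literals accept the same octal escapes strace prints.
    for raw in (b"\0REQ\0\0\0\4\0\0\0\0",
                b"\0REQ\0\0\0\36\0\0\0\0",
                b"\0RES\0\0\0\n\0\0\0\0"):
        print(decode_header(raw))
```

For example, decode_header(b"\0REQ\0\0\0\36\0\0\0\0") returns ('REQ', 'GRAB_JOB_UNIQ', 0).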