Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)

Bug #802850 reported by Artur Bodera
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Gearman
Medium
Brian Aker

Bug Description

After some time, minutes to hours, with a slight load on the gearman server (~1 job/min), workers get lost in a loop (per strace) and gearmand eats up 100% cpu.

Strace of a worker:
getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
....

Strace of gearmand:

# strace -p 2820
Process 2820 attached - interrupt to quit
clock_gettime(CLOCK_MONOTONIC, {3794803, 727265822}) = 0
epoll_wait(3,

(... and nothing more.... )

All workers are PHP based.

# php --ri gearman

gearman

gearman support => enabled
extension version => 0.8.0
libgearman version => 0.22
Default TCP Host => 127.0.0.1
Default TCP Port => 4730

Revision history for this message
Brian Aker (brianaker) wrote :

What version of the server are you using?

Revision history for this message
Artur Bodera (abodera) wrote :

# gearmand --version

gearmand 0.22 - https://launchpad.net/gearmand

Revision history for this message
Artur Bodera (abodera) wrote :

I though it might be related to PHP workers getting crazy after long runtimes (mem leaking or shm problems or something) so I implemented worker manager which kills and spawns new workers every 2h. So each php worker process lives (listens) for a maximum of 2h then is killed and a new one spawned.

That didn't help either, and "fresh" workers with just a few minutes of lifetime get crazy:

getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(12, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(12, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 36
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [-4886713802550149120], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(12, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12

Revision history for this message
Artur Bodera (abodera) wrote :

Just compiled and installed 0.23 daemon,
recompilled pecl extension,
restarted everything

! same thing.

Below is strace of one of the workers.

I have noticed that "THE LOOP" happens when the machine has an avg load of 10+. It does not occur when the machine has no load (i.e. <1) - then the gearman daemon behaves nicely, idle workers have cpu time < 1s in 1h of waiting.

Unfortunatelly as soon as some cpu-intensive work kicks in, one of "gearman -d" threads starts eatung up the CPU resulting in avg load > 30.

sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(12, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 60
sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=12, events=POLLIN}], 1, 5000^Cto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
Process 30713 detached

summary: - Gearman 100% cpu usage, workers in a loop (PHP, 0.22)
+ Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 802850] Re: Gearman 100% cpu usage, workers in a loop (PHP, 0.22)
Download full text (6.1 KiB)

Hi!

You have confirmed that you have a high to no timeout, and that the server has been set to only request a set number of workers to come online?

Sent from my C64

On Jul 8, 2011, at 1:34 AM, Artur Bodera <email address hidden> wrote:

> Just compiled and installed 0.23 daemon,
> recompilled pecl extension,
> restarted everything
>
> ! same thing.
>
> Below is strace of one of the workers.
>
> I have noticed that "THE LOOP" happens when the machine has an avg load
> of 10+. It does not occur when the machine has no load (i.e. <1) - then
> the gearman daemon behaves nicely, idle workers have cpu time < 1s in 1h
> of waiting.
>
> Unfortunatelly as soon as some cpu-intensive work kicks in, one of
> "gearman -d" threads starts eatung up the CPU resulting in avg load >
> 30.
>
>
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(12, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000) = 1 ([{fd=12, revents=POLLIN}])
> getsockopt(12, SOL_SOCKET, SO_ERROR, [823261839558180864], [4]) = 0
> sendto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(12, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 60
> sendto(12, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=12, events=POLLIN}], 1, 5000^Cto(12, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0 <unfinished ...>
> Process 30713 detached
>
>
> ** Summary changed:
>
> - Gearman 100% cpu usage, workers in a loop (PHP, 0.22)
> + Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 1...

Read more...

Revision history for this message
Artur Bodera (abodera) wrote :

Number of workers is always <20

Each worker listens for max 2h, then shuts itself down and a new process is spawned (not a fork, completely new php shell process).

$gmworker = new GearmanWorker();
$gmworker->addOptions(GEARMAN_WORKER_GRAB_UNIQ);
$gmworker->setTimeout(5000);
$gmworker->addFunction($funcName, array($this,'workGearman'))

main loop (with fat removed):

while(
!$this->_signal &&
(
   @$gmworker->work() ||
   $gmworker->returnCode() == GEARMAN_TIMEOUT
) && !$this->_signal
){
   // nothing unusual happened, run another loop iteration
   $db->closeConnection();
   $log->debug('Waiting for jobs...');
}

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 802850] Re: Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
Download full text (4.0 KiB)

What happens if you use the default timeout (latest version of
libgearman), and start the server up with the wake workers at a value
<4?

Sent from my TI85

On Jul 8, 2011, at 4:38 AM, Artur Bodera <email address hidden>
wrote:

> Number of workers is always <20
>
> Each worker listens for max 2h, then shuts itself down and a new
> process
> is spawned (not a fork, completely new php shell process).
>
> $gmworker = new GearmanWorker();
> $gmworker->addOptions(GEARMAN_WORKER_GRAB_UNIQ);
> $gmworker->setTimeout(5000);
> $gmworker->addFunction($funcName, array($this,'workGearman'))
>
> main loop (with fat removed):
>
> while(
> !$this->_signal &&
> (
> @$gmworker->work() ||
> $gmworker->returnCode() == GEARMAN_TIMEOUT
> ) && !$this->_signal
> ){
> // nothing unusual happened, run another loop iteration
> $db->closeConnection();
> $log->debug('Waiting for jobs...');
> }
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7,
> revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7,
> revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7,
> revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7,
> revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7,
> revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [1175289...

Read more...

Revision history for this message
piavlo (piavka) wrote :

Just wanted to add that we been experiencing exactly the same problem since version 0.14 - which is the version we started from.
But I want to note that not only gearmand is at high cpu - but the php workers also consume some considerable abount of cpu - which does not happen - until gearmand cpu usage rises. Once the high cpu usage starts it persists even them there are zero jobs.

The workaround is to restart ALL php workers (we have several hosts with remote php workers - and all need to be restarted) after that
gearmand cpu usage drops. Also note that we have a redundand setup of gearmand servers - workers register at both servers - the load
cpu usage rises concurrently at both of them - and concurrently drops as well then all workers are restarted.

Does that give you any clues to the problem?
For me it seems most likely as problem in php client library - though grearmand could also be the one causes everyone to go cpu crazy.

Anyway what's not clear to me is why everything could be running with low cpu usage for several hours - and then suddenly the load rises - even though there are a lot of concurrent jobs during the low cpu usage.

Thanks
Alex

Revision history for this message
piavlo (piavka) wrote :

Also forgot to add that we have a constant number of workers - around 30.

Revision history for this message
Artur Bodera (abodera) wrote :

Tried with smaller amount of workers and removed the setTimeout() call from worker startup code. Restarted all workers and started 3 cpu-intensive tasks. Only 1 gearman server, no persistence, running as root.

2 workers handled those 3 tasks, resulting in a 20-30 avg load in a few minutes (encoding video). As subprocesses were "niced", machine handled that load quite nicely. After about 2h of continuous work, server stopped responding to pings, http and ssh ports timed out, load skyrocketed above 40+. Other workers started to spin at 100% cpu and the whole thing became unresponsive for about 20 minutes :-(

Revision history for this message
Artur Bodera (abodera) wrote :

I suspected that work() was releasing too early, causing the loop in php to iterate without delay flooding daemon with messages. I've put usleep(500000) into the main loop to compensate.

Unfortunately that didn't help.
It seems that the loop/leaking is inside gearman's php extension (c code).

I wanted to check if it happens only when workers generate machine load.

So I stopped all tasks, left about 10 idle workers connected to gearmand, I've decreased their max lifetime to 1 hour.

Then I started a CPU-intensive process (not a worker, just a normal video encoder process from bash).
After about 5 minutes, the avg load went above 15. After another 30 minutes random gearman worker processes went berserk and were eating 100% cpu, together with 100% cpu gearmand process.

:( Please fix it asap.

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

I wanted to follow up here as well...

I've found that this started happening to us after we have gone from 0.20 to anything after. 0.20 is fairly stable under slight load. We've put one of our larger load items on here and I can tell you some of the issues that we have found:

1. Gearman Job Server is reporting more items being worked on than there is workers.
2. Gearman Job Server is stating that there are more workers than actually exist (there are 2)
3. PHP based workers run stable and low CPU load
4. Gearman Job Server peaks between 60-90% CPU; every once in a while peaking it to 99-100%. (These are huge boxes with a ton of memory).
5. We operate Gearman under a persistence layer for sqlite3.
6. We've found that it does not matter the current load of the server; number of workers nor anything but how long gearmand has been running but that it often happens at "random". Yes - I know that gives you no details but it often happens during idle times.
* Further - I am using supervisord to ensure that if the PHP worker leaks it restarts or if we need to kill it off due to inactivity we can and have it reboot.

Things that solve this issue:
1. Stop workers
2. Stop gearmand
3. Remove sqlite3 database
4. Start gearmand
5. Start workers

Note that we can only do this on lower environments; therefore we have maintained an older version which has substantially helped out. The issue is that it would be very very difficult to run an strace / ptrace over a long period of time in order to catch gearmand when it starts to run out of control. If you want to catch me off the bug tracker I could likely get you in through the VPN to be able to see this in action (we have approximately 6 lower environments and one of them on the later versions are generally spinning out of control at some point).

Regards,

Mike

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

FYI: https://github.com/mwillbanks/rpm/blob/master/SPECS/gearmand.spec
I am using that as the spec (made some modifications to the baseline - also packaging my own boost). This is what is deployed all to the same environments (Cent OS & Redhat):
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
CentOS release 5.6 (Final)

These are all x86-84 machines; each one have a minimum of 2GB of free memory per individual developer environment; development servers have a minimum of 4GB of memory and production servers have 20GB dedicated. This ultimately means that gearman has more than 1GB free memory at all times in the lowest environments and up to 18GB free in production. All of these machines have substantial CPU power whereas when they do fail it consumes the entire CPU.

Thanks!

Revision history for this message
Artur Bodera (abodera) wrote :
Download full text (4.1 KiB)

To complement our env report:
On our servers, we are using fedora x64, there is always at least
1.5GB free ram, we never use persistence. It still occurs regularly.

Please help.

On 2 sie 2011, at 05:15, Mike Willbanks <email address hidden> wrote:

> FYI: https://github.com/mwillbanks/rpm/blob/master/SPECS/gearmand.spec
> I am using that as the spec (made some modifications to the baseline - also packaging my own boost). This is what is deployed all to the same environments (Cent OS & Redhat):
> Red Hat Enterprise Linux Server release 5.6 (Tikanga)
> CentOS release 5.6 (Final)
>
> These are all x86-84 machines; each one have a minimum of 2GB of free
> memory per individual developer environment; development servers have a
> minimum of 4GB of memory and production servers have 20GB dedicated.
> This ultimately means that gearman has more than 1GB free memory at all
> times in the lowest environments and up to 18GB free in production. All
> of these machines have substantial CPU power whereas when they do fail
> it consumes the entire CPU.
>
> Thanks!
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0...

Read more...

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

Some additional information as this just happened on our production environment a few hours ago... generally it has been stable with 0.13. Here is what I observed throughout the duration of the issue:

1. Connections to gearman worked
2. Workers seemed to have little or no impact to the job server
3. The workers were shutdown in order to attempt to restore critical functions
4. After stopping the workers the issue continued
5. Stopping and restarting gearman resolved the issue

It seems like when work was being sent to gearman it was spinning out of control. The amount of work being pushed in was very small. Not sure if it was a queue that was hanging or if it was all around; however, it does seem likely that it was happening on an individual queue. Eventually the work would timeout and the PHP extension putting the work into gearman would timeout - thus gearman not ever receiving the job. The same exact issue occurred this afternoon on our development server which was running 0.24.

Gearman is running with the following options: OPTIONS="-q libsqlite3 --libsqlite3-db=/opt/gearmand/gearmand.sqlite". We've also noticed this when simply running the gearman client from the command line where in some cases when we were load testing we would find odd results... however, these were more infrequent. But yet again the test cycles were limited whereas the job server has been running for over a month without issue in the production environment. Further - it does seem that the more jobs or load that is added to gearman the more often that this can happen - especially longer running background jobs.

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

Error from PHP during this time:
'GearmanClient::doBackground() [gearmanclient.dobackground]: gearman_connection_read:lost connection to server (104)

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

Had another one spin out of control this morning on our test environment....
Looks like we had the following:
4 - stale workers... no connections
4 - active workers (queues seemed to be locked)

Connection thread seemed to be fine but the gearmand process seems to have continued but was high jacking the CPU. One of the workers was taking approximately 30% of the cpu. a restart of the gearmand process seems to resolve this when it happens.

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 802850] Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
Download full text (3.6 KiB)

Were you logging the worker to see what it might be doing?

On Aug 9, 2011, at 5:51 AM, Mike Willbanks wrote:

> Had another one spin out of control this morning on our test environment....
> Looks like we had the following:
> 4 - stale workers... no connections
> 4 - active workers (queues seemed to be locked)
>
> Connection thread seemed to be fine but the gearmand process seems to
> have continued but was high jacking the CPU. One of the workers was
> taking approximately 30% of the cpu. a restart of the gearmand process
> seems to resolve this when it happens.
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> ....
>
>
> Strace of gearmand:
> ...

Read more...

Revision history for this message
timuckun (timuckun) wrote : Re: [Bug 802850] Re: Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)

On Wed, Aug 10, 2011 at 12:51 AM, Mike Willbanks
<email address hidden> wrote:
> Had another one spin out of control this morning on our test environment....
> Looks like we had the following:
> 4 - stale workers... no connections
> 4 - active workers (queues seemed to be locked)
>

Have you tried using the Perl or the Java version of gearman to see if
it's the client or the server?

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :
Download full text (3.5 KiB)

Brian - I was not logging the worker during that time in any meaningful way...
timuckun - I have not; however, it seems related to both the client and the server.

Although some more information that I've uncovered (the queue is not locked at this point). Here are some workers... this first one is running at approximately 19% CPU - unfortunately it is not doing anything different than the others (it does keep a database connection but I have not really seen any leaks).
21:46:05.167822 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0
21:46:05.167883 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:46:05.167952 [ 3691ed4ef5] recvfrom(15, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 36
21:46:05.168019 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:46:05.168086 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}])

Here is an example of another worker - it is also not doing anything but CPU usage is at 0%.
21:47:23.106105 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0
21:47:23.106156 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:47:23.106934 [ 3691ed4ef5] recvfrom(15, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 36
21:47:23.107013 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:47:23.107071 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}])

Doing an strace of gearmand shows much of the same but process is running at 64% CPU:
21:54:43.410761 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0
21:54:43.410820 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:54:43.410892 [ 3691ed4ef5] recvfrom(15, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
21:54:43.410965 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:54:43.411037 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}])

What is interesting; is that the workers do support a kill signal by leveraging the timeout in the server. We set the gearman timeout at: 1000 (1s); this allows us to actually handle kill signals. For one worker that maintains a db connection it never works and this is one of the workers that is spinning out of control. Unfortunately; after sending a kill signal to the process things simply get worse on the strace (especially in speed they are sent):
22:07:50.410273 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000039>
22:07:50.410394 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000039>
22:07:50.410507 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}]) <0.000025>
22:07:50.410604 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = ...

Read more...

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

After a restart of all of the workers - the process through strace for gearmand is far, far, far different:

[pid 14267] 22:16:37.171837 [ 3692a0d605] futex(0x1838fc28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 14269] 22:16:37.171975 [ 3691ed4f41] <... recvfrom resumed> 0x183b3de8, 8192, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000134>
[pid 14267] 22:16:37.172017 [ 3692a0d605] <... futex resumed> ) = 0 <0.000153>
[pid 14269] 22:16:37.172118 [ 3691ed48a8] epoll_wait(21, <unfinished ...>
[pid 14267] 22:16:37.172158 [ 3691ec680b] write(25, "\4", 1 <unfinished ...>
[pid 14269] 22:16:37.172267 [ 3691ed48a8] <... epoll_wait resumed> {{EPOLLIN, {u32=24, u64=24}}}, 32, 4294967295) = 1 <0.000111>
[pid 14267] 22:16:37.172307 [ 3691ec680b] <... write resumed> ) = 1 <0.000119>
[pid 14269] 22:16:37.172402 [ 36932044eb] clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
[pid 14267] 22:16:37.172443 [ 3692a0aee9] futex(0x1838fc54, FUTEX_WAIT_PRIVATE, 3690998537, NULL <unfinished ...>
[pid 14269] 22:16:37.172545 [ 36932044eb] <... clock_gettime resumed> {7654352, 151281722}) = 0 <0.000105>
[pid 14269] 22:16:37.172638 [ 3691ec678b] read(24, "\4", 256) = 1 <0.000058>
[pid 14269] 22:16:37.172798 [ 3691ed5151] sendto(33, "\0RES\0\0\0\n\0\0\0\0", 12, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 12 <0.000074>
[pid 14269] 22:16:37.172986 [ 3691ec678b] read(24, 0x435fff40, 256) = -1 EAGAIN (Resource temporarily unavailable) <0.000057>
[pid 14269] 22:16:37.173144 [ 3691ed48a8] epoll_wait(21, {{EPOLLIN, {u32=33, u64=33}}}, 32, 4294967295) = 1 <0.000086>
[pid 14269] 22:16:37.173343 [ 36932044eb] clock_gettime(CLOCK_MONOTONIC, {7654352, 152178722}) = 0 <0.000059>
[pid 14269] 22:16:37.173507 [ 3691ed4f41] recvfrom(33, "\0REQ\0\0\0\4\0\0\0\0", 8192, MSG_DONTWAIT, NULL, NULL) = 12 <0.000056>
[pid 14269] 22:16:37.173674 [ 3692a0b316] futex(0x1838fc54, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x1838fc50, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 <0.000067>
[pid 14267] 22:16:37.173793 [ 3692a0aee9] <... futex resumed> ) = 0 <0.001323>
[pid 14269] 22:16:37.173883 [ 3691ed4f41] recvfrom(33, <unfinished ...>
[pid 14267] 22:16:37.173924 [ 3692a0d605] futex(0x1838fc28, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 14269] 22:16:37.174023 [ 3691ed4f41] <... recvfrom resumed> 0x183b3de8, 8192, 64, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) <0.000102>
[pid 14267] 22:16:37.174061 [ 3692a0d605] <... futex resumed> ) = 0 <0.000109>
[pid 14269] 22:16:37.174156 [ 3691ed48a8] epoll_wait(21, <unfinished ...>
[pid 14267] 22:16:37.174198 [ 3692a0aee9] futex(0x1838fc54, FUTEX_WAIT_PRIVATE, 3690998539, NULL <unfinished ...>

This really looks like there is a much larger issue based on how this would normally react.

Revision history for this message
Mike Willbanks (mike-digitalstruct) wrote :

One last item... a worker difference here as well:

22:31:05.556626 [ 3691ed5105] sendto(14, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000115>
22:31:05.556905 [ 3691ed4ef5] recvfrom(14, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12 <0.000065>
22:31:05.557093 [ 3691ed5105] sendto(14, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000101>
22:31:05.557312 [ 3691ecb65f] poll([{fd=14, events=POLLIN}], 1, 1000) = 0 (Timeout) <0.997563>
22:31:06.555256 [ 3691e306d0] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 <0.000203>
22:31:06.555726 [ 3691e306d0] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000095>
22:31:06.555968 [ 3691e306d0] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 <0.000057>
22:31:06.556133 [ 3691e306d0] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000059>
22:31:06.556298 [ 3691e306d0] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 <0.000055>
22:31:06.556457 [ 3691e306d0] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000058>
22:31:06.556623 [ 3691ed5105] sendto(14, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000116>
22:31:06.556901 [ 3691ed4ef5] recvfrom(14, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12 <0.000065>
22:31:06.557086 [ 3691ed5105] sendto(14, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000102>
22:31:06.557308 [ 3691ecb65f] poll([{fd=14, events=POLLIN}], 1, 1000) = 0 (Timeout) <0.997496>
22:31:07.555176 [ 3691e306d0] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 <0.000197>
22:31:07.555641 [ 3691e306d0] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000074>
22:31:07.555847 [ 3691e306d0] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 <0.000056>
22:31:07.556010 [ 3691e306d0] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000058>
22:31:07.556175 [ 3691e306d0] rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0 <0.000055>
22:31:07.556334 [ 3691e306d0] rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 <0.000057>
22:31:07.556501 [ 3691ed5105] sendto(14, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000119>
22:31:07.556785 [ 3691ed4ef5] recvfrom(14, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12 <0.000065>
22:31:07.556972 [ 3691ed5105] sendto(14, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000101>
22:31:07.557195 [ 3691ecb65f] poll([{fd=14, events=POLLIN}], 1, 1000 <unfinished ...>

Revision history for this message
Herman J. Radtke III (hermanradtke) wrote :

Mike Willbanks opened http://pecl.php.net/bugs/bug.php?id=23887

I think discussion should be kept here until we determine whether or not this unique to pecl/gearman.

Revision history for this message
timuckun (timuckun) wrote :

I have reported this and similar bugs using the ruby client. I am
pretty sure this is related to the server itself.

On Tue, Aug 23, 2011 at 12:29 PM, Herman J. Radtke III
<email address hidden> wrote:
> Mike Willbanks opened http://pecl.php.net/bugs/bug.php?id=23887
>
> I think discussion should be kept here until we determine whether or not
> this unique to pecl/gearman.
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
>  Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/gearmand/+bug/802850/+subscriptions
>

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Excerpts from timuckun's message of Tue Aug 23 00:44:44 UTC 2011:
> I have reported this and similar bugs using the ruby client. I am
> pretty sure this is related to the server itself.
>

The ruby client that uses libgearman, or a pure ruby implementation?

If the former, I'd lean toward this being an issue with libgearman,
if the latter, then its less clear.

Revision history for this message
timuckun (timuckun) wrote :

> The ruby client that uses libgearman, or a pure ruby implementation?
>
> If the former, I'd lean toward this being an issue with libgearman,
> if the latter, then its less clear.
>

It's a pure ruby client. Sorry.

Revision history for this message
Marc Easen (marc-easen) wrote :

During high load I was seeing the same issue but this is what was happening:

strace of worker:

sendto(45, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
sendto(45, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=45, events=POLLIN}], 1, 5000) = 1 ([{fd=45, revents=POLLIN}])
getsockopt(45, SOL_SOCKET, SO_ERROR, [1459455931862482944], [4]) = 0
sendto(45, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(45, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
sendto(45, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12

From what I can see is the following

1) The worker is requesting work
2) The reponse of GEARMAN_LOST_CONNECTION is returned from the recv() function
3) The worker then sends a PRE_SLEEP command
4) The worker then request for work
4) Gearmand then responds with the response for two requests for jobs, which causes and odd spike in the server (I have located where this is happening)
5) The worker then send a PRE_SLEEP command

I have fixed this issue by removing the break for a GEARMAN_LOST_CONNECTION status, along with saving the state of the worker, which allows it continue where it left off.

(Please ignore the git references, I downloaded 0.24 and imported it into a git repo so I could track the changes)

diff --git a/libgearman/worker.cc b/libgearman/worker.cc
index 21488e0..745ca20 100644
--- a/libgearman/worker.cc
+++ b/libgearman/worker.cc
@@ -672,17 +672,11 @@ gearman_job_st *gearman_worker_grab_job(gearman_worker_st *worker,

           if (gearman_failed(*ret_ptr))
           {
- if (*ret_ptr == GEARMAN_IO_WAIT)
- {
- worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_RECV;
- }
- else
+ worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_RECV;
+ if (*ret_ptr != GEARMAN_IO_WAIT)
             {
               gearman_job_free(worker->job);
               worker->job= NULL;
-
- if (*ret_ptr == GEARMAN_LOST_CONNECTION)
- break;
             }

             return NULL;

Revision history for this message
Brian Aker (brianaker) wrote :
Download full text (5.4 KiB)

Thanks, either I or someone else will look at this in the next day or so.

Sent from my C64

On Oct 7, 2011, at 8:36 AM, Marc Easen <email address hidden> wrote:

> During high load I was seeing the same issue but this is what was happening:
>
> strace of worker:
>
> sendto(45, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> sendto(45, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=45, events=POLLIN}], 1, 5000) = 1 ([{fd=45, revents=POLLIN}])
> getsockopt(45, SOL_SOCKET, SO_ERROR, [1459455931862482944], [4]) = 0
> sendto(45, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(45, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
> sendto(45, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
>
>> From what I can see is the following
>
> 1) The worker is requesting work
> 2) The reponse of GEARMAN_LOST_CONNECTION is returned from the recv() function
> 3) The worker then sends a PRE_SLEEP command
> 4) The worker then request for work
> 4) Gearmand then responds with the response for two requests for jobs, which causes and odd spike in the server (I have located where this is happening)
> 5) The worker then send a PRE_SLEEP command
>
> I have fixed this issue by removing the break for a
> GEARMAN_LOST_CONNECTION status, along with saving the state of the
> worker, which allows it continue where it left off.
>
> (Please ignore the git references, I downloaded 0.24 and imported it
> into a git repo so I could track the changes)
>
>
> diff --git a/libgearman/worker.cc b/libgearman/worker.cc
> index 21488e0..745ca20 100644
> --- a/libgearman/worker.cc
> +++ b/libgearman/worker.cc
> @@ -672,17 +672,11 @@ gearman_job_st *gearman_worker_grab_job(gearman_worker_st *worker,
>
> if (gearman_failed(*ret_ptr))
> {
> - if (*ret_ptr == GEARMAN_IO_WAIT)
> - {
> - worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_RECV;
> - }
> - else
> + worker->state= GEARMAN_WORKER_STATE_GRAB_JOB_RECV;
> + if (*ret_ptr != GEARMAN_IO_WAIT)
> {
> gearman_job_free(worker->job);
> worker->job= NULL;
> -
> - if (*ret_ptr == GEARMAN_LOST_CONNECTION)
> - break;
> }
>
> return NULL;
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(...

Read more...

Revision history for this message
Brian Aker (brianaker) wrote :

Do you have any sort of test case? I can see where storing the state makes sense, but I need to see what would happen if a lost connection did occur.

Revision history for this message
Marc Easen (marc-easen) wrote :

Hi Brian,

The test case I used was to overload the gearmand with more background jobs than the workers can process, in our case that was 600 requests in a 4 second period to 16 workers, where each worker took around 3 seconds on average to complete. After the workers had processed the backlog, gearmand was idling at 100% (in one test where I set gearmand's thread count to 5 it was idling 450%) and the PHP workers were idling at around 20-30% each. I hope this helps

Revision history for this message
Brian Aker (brianaker) wrote :

So the above solution doesn't work.

If you set state to GEARMAN_WORKER_STATE_GRAB_JOB_RECV but you remove the job do to failure then if the loop returns you have no job and the assert that protects from that will assert().

I've got a test case that will now trigger the recv failure, but I don't have the bit that will cause the lockup to occur.

If the worker is seeing a recv() failure, it shouldn't be resending the PRE_SLEEP without reconnecting first (or exiting based on configuration).

Revision history for this message
Brian Aker (brianaker) wrote :

Now if recv() return zero, and shutdown() fails on the socket, the connection state is never set correctly back to zero.

What does this mean? A PRE_SLEEP/etc will be sent though the socket is not connected.

I'm not sure if this will fix the above case, but it does fix one possible failure case (a patch will arrive soonish to lp:gearmand/build ith this fix).

Revision history for this message
Brian Aker (brianaker) wrote : Re: [Bug 802850] Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
Download full text (3.7 KiB)

Thanks, I've got a test case now that I believe tickles the bug.

Cheers,
 -Brian

On Oct 20, 2011, at 11:57 PM, Marc Easen wrote:

> Hi Brian,
>
> The test case I used was to overload the gearmand with more background
> jobs than the workers can process, in our case that was 600 requests in
> a 4 second period to 16 workers, where each worker took around 3 seconds
> on average to complete. After the workers had processed the backlog,
> gearmand was idling at 100% (in one test where I set gearmand's thread
> count to 5 it was idling 450%) and the PHP workers were idling at around
> 20-30% each. I hope this helps
>
> --
> You received this bug notification because you are subscribed to
> Gearman.
> https://bugs.launchpad.net/bugs/802850
>
> Title:
> Gearman 100% cpu usage, workers in a loop (PHP, 0.22, 0.23)
>
> Status in Gearman Server and Client Libraries:
> New
>
> Bug description:
> After some time, minutes to hours, with a slight load on the gearman
> server (~1 job/min), workers get lost in a loop (per strace) and
> gearmand eats up 100% cpu.
>
> Strace of a worker:
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=POLLIN}])
> getsockopt(7, SOL_SOCKET, SO_ERROR, [117528996916232192], [4]) = 0
> sendto(7, "\0REQ\0\0\0\36\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> recvfrom(7, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
> sendto(7, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
> poll([{fd=7, events=POLLIN}], 1, 5000) = 1 ([{fd=7, revents=PO...

Read more...

Revision history for this message
Brian Aker (brianaker) wrote :

Do a pull from lp:gourmand and see if it solves the problem for you.

Changed in gearmand:
importance: Undecided → Medium
status: New → Fix Committed
assignee: nobody → Brian Aker (brianaker)
Revision history for this message
Marc Easen (marc-easen) wrote :

I'm unable to compile lp:gearmand on Redhat 5

make all-am
make[1]: Entering directory `/root/gearmand-bzr/gearmand'
  CXX libgearman/libgearman_libgearman_la-packet.lo
cc1plus: warnings being treated as errors
libgearman/packet.cc:63: warning: ignoring #pragma GCC diagnostic [-Wunknown-pragmas]
make[1]: *** [libgearman/libgearman_libgearman_la-packet.lo] Error 1
make[1]: Leaving directory `/root/gearmand-bzr/gearmand'
make: *** [all] Error 2

Please advise

Revision history for this message
piavlo (piavka) wrote :

removing/renaming the .bzr dirr will do the trick. autoconf has a rule that adds -Werror to gcc flags then it sees the source dir isunder bazaar.

Brian Aker (brianaker)
Changed in gearmand:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers