Comment 20 for bug 802850

Mike Willbanks (mike-digitalstruct) wrote :

Brian - I was not logging the worker during that time in any meaningful way...
timuckun - I have not; however, it seems related to both the client and the server.

Some more information that I've uncovered (the queue is not locked at this point). Here are some workers... this first one is running at approximately 19% CPU; unfortunately it is not doing anything different from the others (it does keep a database connection, but I have not really seen any leaks).
21:46:05.167822 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0
21:46:05.167883 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:46:05.167952 [ 3691ed4ef5] recvfrom(15, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 36
21:46:05.168019 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:46:05.168086 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}])
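
For what it's worth, here is how I read those frames (based on the public gearman binary protocol description, not on anything specific to this build): each 12-byte packet is a header of 4-byte magic (\0REQ or \0RES), a 4-byte big-endian command type and a 4-byte payload length. The 0x27 (the ' in the trace) request appears to be GRAB_JOB_ALL, the 0x04 request PRE_SLEEP, and the \0RES type 0x0a frames NO_JOB; so the worker asks for a job, is told there is none, announces it will sleep, and is immediately woken again. A throwaway C sketch that decodes those headers (the command names are my assumption from the protocol docs):

/* Sketch only: decode the 12-byte gearman frame headers seen in the traces
 * above. Assumes the standard gearman wire format (4-byte magic, 4-byte
 * big-endian command type, 4-byte payload length); command names are taken
 * from the public protocol description, not from this gearmand build. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

static const char *cmd_name(uint32_t type)
{
    switch (type) {
    case 4:  return "PRE_SLEEP";    /* worker: wake me when work arrives */
    case 6:  return "NOOP";         /* server: wake-up after PRE_SLEEP */
    case 9:  return "GRAB_JOB";
    case 10: return "NO_JOB";       /* server: nothing queued */
    case 39: return "GRAB_JOB_ALL"; /* 0x27, the ' in the traces above */
    default: return "OTHER";
    }
}

static void decode(const unsigned char frame[12])
{
    uint32_t type, len;
    memcpy(&type, frame + 4, 4);  /* bytes 4-7: command type (network order) */
    memcpy(&len,  frame + 8, 4);  /* bytes 8-11: payload length */
    printf("%c%c%c %-12s len=%u\n",
           frame[1], frame[2], frame[3], cmd_name(ntohl(type)), ntohl(len));
}

int main(void)
{
    /* The two requests each worker sends in a tight loop above, plus one of
     * the NO_JOB responses packed three at a time into the 36-byte reads. */
    const unsigned char grab[12]   = "\0REQ\0\0\0\x27\0\0\0";
    const unsigned char sleep_[12] = "\0REQ\0\0\0\x04\0\0\0";
    const unsigned char nojob[12]  = "\0RES\0\0\0\x0a\0\0\0";
    decode(grab);
    decode(sleep_);
    decode(nojob);
    return 0;
}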

Here is an example of another worker; it is also not doing anything, but its CPU usage is at 0%.
21:47:23.106105 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0
21:47:23.106156 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:47:23.106934 [ 3691ed4ef5] recvfrom(15, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n"..., 8192, 0, NULL, NULL) = 36
21:47:23.107013 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:47:23.107071 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}])

Doing an strace of gearmand shows much of the same, but the process is running at 64% CPU:
21:54:43.410761 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0
21:54:43.410820 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:54:43.410892 [ 3691ed4ef5] recvfrom(15, "\0RES\0\0\0\n\0\0\0\0\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 24
21:54:43.410965 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
21:54:43.411037 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}])

What is interesting is that the workers do handle a kill signal by leveraging the timeout against the server. We set the gearman timeout to 1000 (1s); this lets the work loop come back around and actually handle kill signals (a sketch of the pattern follows the trace below). For the one worker that maintains a db connection this never works, and it is one of the workers that is spinning out of control. Unfortunately, after sending a kill signal to that process, things simply get worse in the strace (especially in how quickly the packets are sent):
22:07:50.410273 [ 3691ed5105] sendto(15, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000039>
22:07:50.410394 [ 3691ed5105] sendto(15, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12 <0.000039>
22:07:50.410507 [ 3691ecb65f] poll([{fd=15, events=POLLIN}], 1, 1000) = 1 ([{fd=15, revents=POLLIN}]) <0.000025>
22:07:50.410604 [ 3691ed4e5a] getsockopt(15, SOL_SOCKET, SO_ERROR, [569467480393646080], [4]) = 0 <0.000065>
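
For clarity on what I mean by the timeout/kill-signal pattern, here is a minimal sketch using the libgearman C API (this is not our actual worker; the host, port and function name are placeholders). gearman_worker_set_timeout(1000) makes gearman_worker_work() return GEARMAN_TIMEOUT instead of blocking forever, so the loop can check a signal flag between iterations:

/* Minimal sketch of the timeout + kill-signal pattern described above,
 * using the libgearman C API. Not our production worker; the server
 * address and function name are placeholders. */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <libgearman/gearman.h>

static volatile sig_atomic_t shutting_down = 0;

static void on_term(int sig)
{
    (void)sig;
    shutting_down = 1;   /* checked between gearman_worker_work() calls */
}

static void *do_job(gearman_job_st *job, void *context,
                    size_t *result_size, gearman_return_t *ret_ptr)
{
    (void)job; (void)context;
    /* Real work would go here; this stub just completes the job. */
    *result_size = 0;
    *ret_ptr = GEARMAN_SUCCESS;
    return NULL;
}

int main(void)
{
    gearman_worker_st worker;
    if (gearman_worker_create(&worker) == NULL)
        return EXIT_FAILURE;

    signal(SIGTERM, on_term);
    signal(SIGINT, on_term);

    gearman_worker_add_server(&worker, "127.0.0.1", 4730);
    gearman_worker_add_function(&worker, "example_function", 0, do_job, NULL);

    /* The 1000ms timeout: gearman_worker_work() returns GEARMAN_TIMEOUT
     * instead of blocking forever, so the loop can notice the signal. */
    gearman_worker_set_timeout(&worker, 1000);

    while (!shutting_down) {
        gearman_return_t ret = gearman_worker_work(&worker);
        if (ret != GEARMAN_SUCCESS && ret != GEARMAN_TIMEOUT) {
            fprintf(stderr, "worker: %s\n", gearman_worker_error(&worker));
            break;
        }
    }

    gearman_worker_free(&worker);
    return EXIT_SUCCESS;
}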

This means that the timeout never fires. I am guessing this portion might be on the client end... Killing off the rest of the workers restores sanity to the server. However, we have encountered other servers where the gearmand process actually deadlocks on its I/O threads. Not sure whether this happens because of the overload of server checks or not...