Worker crash with long function name

Bug #833394 reported by Sebastian Herbert
This bug affects 3 people
Affects: Gearman
Status: Fix Released
Importance: Medium
Assigned to: Brian Aker

Bug Description

Version 0.23 appears to have introduced a bug, still present in 0.24, where a worker crashes upon receiving a job if the function name is too long (somewhere in the mid-50s, in characters). It may take several jobs before the crash occurs. Everything below was run on Ubuntu Natty using boost 1.47.0.

Client side is just running this in a loop:
gearman -f 55_char_function_name_________________________________ payload
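For anyone who wants to vary the name length while reproducing this, here is a minimal sketch of the client loop in Python. It assumes the `gearman` command-line client is on the PATH and is invoked exactly as above; the helper names and the `long_function_name` prefix are made up for illustration.

```python
import subprocess

def make_function_name(length: int, prefix: str = "long_function_name") -> str:
    """Pad a (hypothetical) prefix with underscores to the requested length."""
    if length < len(prefix):
        raise ValueError("length is shorter than the prefix")
    return prefix.ljust(length, "_")

def client_command(function_name: str, payload: str = "payload") -> list:
    """Build the same client invocation used in this report."""
    return ["gearman", "-f", function_name, payload]

def submit_jobs(function_name: str, count: int) -> None:
    """Submit `count` jobs one after another; raises if the client fails."""
    for _ in range(count):
        subprocess.run(client_command(function_name), check=True)

if __name__ == "__main__":
    # Requires a running gearmand and a worker registered for the same name.
    submit_jobs(make_function_name(55), 100)
```

The worker must be started with the identical name (`gearman -w -f <name>`), so printing `make_function_name(55)` and pasting it on both sides avoids miscounting underscores.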

Worker side runs:
ubuntu@ip-10-17-168-221:~$ gearmand -d
ubuntu@ip-10-17-168-221:~$ gearadmin --server-version
0.24
ubuntu@ip-10-17-168-221:~$ gearman -w -f 55_char_function_name__________________________________
payloadSegmentation fault
ubuntu@ip-10-17-168-221:~$ gearman -w -f 55_char_function_name__________________________________
payloadpayloadpayloadSegmentation fault
ubuntu@ip-10-17-168-221:~$ gearman -w -f 55_char_function_name__________________________________
payloadpayloadSegmentation fault
ubuntu@ip-10-17-168-221:~$ gearman -w -f 55_char_function_name__________________________________
payloadpayloadSegmentation fault
ubuntu@ip-10-17-168-221:~$ gearman -w -f 55_char_function_name__________________________________
payloadpayloadSegmentation fault
etc.

Here's the end of the worker strace on 0.24:
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
recvfrom(3, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(3, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0\200H:ip-10-"..., 8192, 0, NULL, NULL) = 152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0payload", 29payload) = 29
sendto(3, "\0REQ\0\0\0\r\0\0\0\26H:ip-10-17-168-179:7"..., 34, MSG_NOSIGNAL, NULL, 0) = 34
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(3, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0\200H:ip-10-"..., 8192, 0, NULL, NULL) = 152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0payload", 29payload) = 29
sendto(3, "\0REQ\0\0\0\r\0\0\0\26H:ip-10-17-168-179:7"..., 34, MSG_NOSIGNAL, NULL, 0) = 34
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(3, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0\200H:ip-10-"..., 8192, 0, NULL, NULL) = 152
write(1, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0payload", 29payload) = 29
sendto(3, "\0REQ\0\0\0\r\0\0\0\26H:ip-10-17-168-179:7"..., 34, MSG_NOSIGNAL, NULL, 0) = 34
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\n\0\0\0\0", 8192, 0, NULL, NULL) = 12
sendto(3, "\0REQ\0\0\0\4\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
poll([{fd=3, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
sendto(3, "\0REQ\0\0\0'\0\0\0\0", 12, MSG_NOSIGNAL, NULL, 0) = 12
recvfrom(3, "\0RES\0\0\0\6\0\0\0\0\0RES\0\0\0(\0\0\0\200H:ip-10-"..., 8192, 0, NULL, NULL) = 152
write(1, "\0\0\0\0\0\0\0payload", 14payload) = 14
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
Segmentation fault
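
To make the trace easier to read, here is a rough sketch that decodes the 12-byte packet headers seen in the sendto/recvfrom calls. It assumes the usual Gearman binary framing (4-byte magic, 4-byte big-endian type, 4-byte big-endian payload size); the type names in the table are from my reading of the protocol and should be treated as an assumption, not as something taken from this trace.

```python
import struct

HEADER_SIZE = 12

# Packet type codes (assumed from the Gearman protocol description).
PACKET_TYPES = {
    4: "PRE_SLEEP",
    6: "NOOP",
    10: "NO_JOB",
    13: "WORK_COMPLETE",
    39: "GRAB_JOB_ALL",
    40: "JOB_ASSIGN_ALL",
}

def parse_header(data: bytes):
    """Return (magic, type name, payload size) for the first packet in data."""
    magic, ptype, size = struct.unpack(">4sII", data[:HEADER_SIZE])
    return magic, PACKET_TYPES.get(ptype, "UNKNOWN(%d)" % ptype), size

# Second header inside the 152-byte recvfrom, written with hex escapes:
# strace shows it as "\0RES\0\0\0(\0\0\0\200".
job_assign = b"\x00RES\x00\x00\x00\x28\x00\x00\x00\x80"
magic, name, size = parse_header(job_assign)
print(magic, name, size)
```

Under that reading, the 152-byte recvfrom is a NOOP header (12 bytes) followed by a job-assignment header (12 bytes) with a 128-byte payload, and the 34-byte sendto is a 12-byte header plus a 22-byte payload.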

Slightly shorter function names dramatically lower the failure rate. I was able to get through 30K jobs with no issues using a 52-character name.

The same problem exists in 0.23, while on 0.22 I can run tens of thousands of jobs with much longer function names.

Brian Aker (brianaker)
Changed in gearmand:
assignee: nobody → Brian Aker (brianaker)
importance: Undecided → Medium
Brian Aker (brianaker)
Changed in gearmand:
status: New → In Progress
Brian Aker (brianaker) wrote :

I've not been able to repeat this. I am pushing up a test case; if you can modify it to demonstrate the crash, that would be useful.

Changed in gearmand:
status: In Progress → Incomplete
Sebastian Herbert (herbert-dc-energy) wrote :

I'll give it a go, but probably won't be able to get to it until next week.

Brian Aker (brianaker) wrote :

Found it...

Changed in gearmand:
status: Incomplete → Confirmed
status: Confirmed → In Progress
Brian Aker (brianaker) wrote :

Please check lp:gearmand for fix.

Changed in gearmand:
status: In Progress → Fix Committed
James (jimmybot) wrote :

Hi Brian, saw that you had posted a fix. We were able to reproduce a similar situation when the unique key provided was longish (also above mid-50s). Curious, what was the fix specifically? The latest build does seem to have fixed things, though we are still doing more testing. Thanks!

Brian Aker (brianaker) wrote : Re: [Bug 833394] Re: Worker crash with long function name

Hi,

On Nov 3, 2011, at 7:57 PM, James wrote:

> Hi Brian, saw that you had posted a fix. We were able to reproduce a
> similar situation when the unique key provided was longish (also above
> mid-50s). Curious, what was the fix specifically? The latest build does
> seem to have fixed things, though we are still doing more testing.
> Thanks!

Overflow on the header in the server. Because of the way the memory was aligned, it wouldn't show up in valgrind or anything else.

Cheers,
 -Brian
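
As an illustration only (this is not the gearmand code, and the struct layout, field names, and 8-byte capacity are all hypothetical), the sketch below shows one way an overflow can stay invisible to tools that check allocation boundaries: the extra bytes land in a neighbouring field of the same allocation, so nothing is written outside it. It uses Python's ctypes to get a C-like layout.

```python
import ctypes

NAME_CAPACITY = 8  # hypothetical, much smaller than any real limit

class Packet(ctypes.Structure):
    # A fixed-size name buffer followed directly by another field.
    _fields_ = [
        ("name", ctypes.c_char * NAME_CAPACITY),
        ("payload_size", ctypes.c_uint32),
    ]

def unchecked_copy(packet: Packet, name: bytes) -> None:
    """Copy without a bounds check, like a bare memcpy into the name field.

    Only call this with len(name) <= ctypes.sizeof(packet); anything longer
    would write outside the struct.
    """
    ctypes.memmove(ctypes.addressof(packet), name, len(name))

def checked_copy(packet: Packet, name: bytes) -> None:
    """Same copy, but refuse names that do not fit in the buffer."""
    if len(name) > NAME_CAPACITY:
        raise ValueError("function name too long for header buffer")
    ctypes.memmove(ctypes.addressof(packet), name, len(name))

packet = Packet()
packet.payload_size = 7
unchecked_copy(packet, b"A" * 12)   # 4 bytes past the name buffer
print(hex(packet.payload_size))     # the size field has been overwritten
```

The write never leaves the 12 bytes owned by the struct, which is why a checker that only tracks whole allocations has nothing to report, while the neighbouring field silently changes.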

Brian Aker (brianaker)
Changed in gearmand:
status: Fix Committed → Fix Released
Sebastian Herbert (herbert-dc-energy) wrote :

Confirming that this is fixed in 0.25.

Keyur (keyurdg) wrote :

Hi Brian, one small inconsistency: the max function length defined in the header is 512 (http://bazaar.launchpad.net/~gearman-developers/gearmand/trunk/view/head:/libgearman-1.0/constants.h#L76) while the persistent queue using drizzle only has it as 255 (http://gearman.org/index.php?id=manual:job_server#persistent_queues).
