Segmentation fault in libgearman runtask

Bug #1477798 reported by eric
This bug affects 1 person
Affects: Gearman
Status: New
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

A segfault occurs seemingly at random. I cannot reproduce it on demand.

I've been getting Segmentation faults in my Apache error logs.
[notice] child pid 26412 exit signal Segmentation fault (11)

It's been running in production for many months, and the segfault only recently started rearing its head at random. Gearmand is running with a MySQL queue. Restarting gearmand seems to make it go away for a week or so, but then it comes back. I also have a cron job that acts as a keep-alive (submitting and running a job) so that gearman's MySQL connection doesn't time out.

I have been able to load the debug symbols for gearman and review:

(gdb) bt
#0 0x00002aaab1fab1f8 in _client_run_tasks (client=0x2b7ebc9b52b0, exit_task=0x0) at libgearman/client.cc:1412
#1 0x00002aaab1fad6cc in gearman_client_run_tasks (client=0x2b7eb4cbf9e0) at libgearman/client.cc:1721
#2 0x00002aaab1d89e01 in zif_gearman_client_run_tasks (ht=<value optimized out>, return_value=0x2b7eb6d03fe0, return_value_ptr=<value optimized out>, this_ptr=<value optimized out>, return_value_used=<value optimized out>) at /var/tmp/gearman/php_gearman.c:3168
#3 0x00002aaab09cfe4a in zend_parse_method_parameters (num_args=-978219944, this_ptr=0x1, type_spec=0x0) at /usr/src/debug/php-5.4.43/Zend/zend_API.c:914
#4 0x00002b7ebc855350 in ?? ()
#5 0x010000000000000f in ?? ()
#6 0x0000000000000000 in ?? ()
(gdb) frame 0
#0 0x00002aaab1fab1f8 in _client_run_tasks (client=0x2b7ebc9b52b0, exit_task=0x0) at libgearman/client.cc:1412
1412 for (client->impl()->task= client->impl()->task_list; client->impl()->task;
(gdb)

It looks like it is crashing at libgearman/client.cc:1412. Why?

Code:
// $jobdata is validated prior to this point

$gmclient = new GearmanClient();
// cast port to int
$gmclient->addServer("127.0.0.1", (int)$gearmanPort); // add default server (localhost)
$gmclient->setTimeout(300); // set timeout so gearman returns errors

// assign a unique id for the job (limit the length to prevent db errors)
$uniqueid = substr('process' . uniqid(rand(), true) . rand(0, 32767), 0, 63);

// add the job as a background task
$job_handle = $gmclient->addTaskBackground("processqueue", $jobdata, null, $uniqueid);

// had some instances where gearmand was slow to respond; the loops below try a few times before giving up
// ping gearman to make sure it is working before sending the job
// keep trying for 2 seconds (4 times per second) until success or failure
$pingCount = 1;
while (@$gmclient->ping(serialize("Ping Test")) === FALSE && $pingCount <= 8) {
    log("couldn't ping gearman server");
    usleep(250000); // sleep for 0.25 seconds before trying again
    $pingCount++;
}

// queue the job
// keep trying for 2 seconds (4 times per second) until success or failure
$runTaskCount = 1;
// suppress gearman fatal errors so we can catch them
while (@!$gmclient->runTasks() && $runTaskCount <= 8) {
    log('Error:: Could not run task on runTaskCount=' . $runTaskCount . '/8 error=' . $gmclient->error());
    usleep(250000); // sleep for 0.25 seconds before trying again
    $runTaskCount++;
}

// suppress gearman fatal errors so we can catch them
if (@$gmclient->returnCode() != GEARMAN_SUCCESS) {
    log("job didn't queue");
}

Clint Byrum (clint-fewbar) wrote :

Seems like this is probably something corrupting the pointers, but it's hard to say without a test.

Changed in gearmand:
importance: Undecided → Medium
eric (nospamthankyou) wrote :

I've improved the logging in my app so that it writes entries as they occur, instead of writing all the logs when the script finishes (which never happens if a segfault occurs).

When queuing a job, the ping test is first attempted 8 times and fails all 8 times:
@$gmclient->ping(serialize("Ping Test"))

runTasks() is supposed to be attempted 8 times; after the first attempt it reports the following error and then segfaults:
error=send_packet(GEARMAN_TIMEOUT) Failed in receiving() -> libgearman/connection.cc:494

I have now been able to reproduce the error. The system had been running for 9 days in production. After receiving the segfault notices, I tested the app (which uploads images and queues them for resizing). I uploaded a batch of 6 images: 4 were queued successfully and 2 segfaulted. I then restarted gearmand and uploaded another batch of 10 images; all were queued successfully.
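
Since the log shows the crash on the runTasks() call that follows a GEARMAN_TIMEOUT on the same client object, one way to narrow this down might be to retry with a freshly constructed client on every attempt instead of reusing the one that just failed. The sketch below is only that: retry_with_fresh_client() is a hypothetical helper, not part of the gearman extension, and whether it avoids the segfault is untested. The commented usage only calls GearmanClient methods that already appear in the code above.

```php
<?php
// Hypothetical helper (not part of the gearman extension): run $attempt
// against a newly built object on every try, so that no state from a
// failed try is carried over into the next one.
// Returns the number of the successful try, or false if all tries failed.
function retry_with_fresh_client($makeClient, $attempt, $maxAttempts = 8, $delayMicros = 250000)
{
    for ($try = 1; $try <= $maxAttempts; $try++) {
        $client = call_user_func($makeClient);      // new object per attempt
        if (call_user_func($attempt, $client, $try)) {
            return $try;
        }
        unset($client);                             // drop the failed object
        if ($try < $maxAttempts) {
            usleep($delayMicros);                   // wait before the next try
        }
    }
    return false;
}

// Usage sketch (assumes the gearman extension plus $gearmanPort, $jobdata
// and $uniqueid from the code above):
//
// $tries = retry_with_fresh_client(
//     function () use ($gearmanPort) {
//         $c = new GearmanClient();
//         $c->addServer("127.0.0.1", (int)$gearmanPort);
//         $c->setTimeout(300);
//         return $c;
//     },
//     function ($c) use ($jobdata, $uniqueid) {
//         $c->addTaskBackground("processqueue", $jobdata, null, $uniqueid);
//         return @$c->runTasks() && $c->returnCode() == GEARMAN_SUCCESS;
//     }
// );
// if ($tries === false) { /* job didn't queue */ }
```

If the segfaults stop with this pattern, that would point towards stale state inside the reused client after the timeout; if they continue, the reuse is probably not the cause.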
