Segmentation fault in libgearman runtask

Bug #1477798 reported by eric on 2015-07-24
This bug affects 1 person
Affects: Gearman
Importance: Medium
Assigned to: Unassigned

Bug Description

The segfault occurs seemingly at random. I cannot reproduce it on demand.

I've been getting Segmentation faults in my Apache error logs.
[notice] child pid 26412 exit signal Segmentation fault (11)

It's been running in production for many months, and the segfault only recently started rearing its head at random. Gearmand is running with a MySQL queue. Restarting gearmand seems to make it go away for a week or so, but then it comes back. I also have a cron job that acts as a keep-alive (submitting and running a job) so that gearman's MySQL connection doesn't time out.

I have been able to load the debug symbols for gearman and get a backtrace:

(gdb) bt
#0 0x00002aaab1fab1f8 in _client_run_tasks (client=0x2b7ebc9b52b0, exit_task=0x0) at libgearman/client.cc:1412
#1 0x00002aaab1fad6cc in gearman_client_run_tasks (client=0x2b7eb4cbf9e0) at libgearman/client.cc:1721
#2 0x00002aaab1d89e01 in zif_gearman_client_run_tasks (ht=<value optimized out>, return_value=0x2b7eb6d03fe0, return_value_ptr=<value optimized out>, this_ptr=<value optimized out>, return_value_used=<value optimized out>) at /var/tmp/gearman/php_gearman.c:3168
#3 0x00002aaab09cfe4a in zend_parse_method_parameters (num_args=-978219944, this_ptr=0x1, type_spec=0x0) at /usr/src/debug/php-5.4.43/Zend/zend_API.c:914
#4 0x00002b7ebc855350 in ?? ()
#5 0x010000000000000f in ?? ()
#6 0x0000000000000000 in ?? ()
(gdb) frame 0
#0 0x00002aaab1fab1f8 in _client_run_tasks (client=0x2b7ebc9b52b0, exit_task=0x0) at libgearman/client.cc:1412
1412 for (client->impl()->task= client->impl()->task_list; client->impl()->task;
(gdb)

It looks like it is crashing at libgearman/client.cc:1412. Why?

Code:
//$jobdata is validated prior

$gmclient= new GearmanClient();
//cast port as int
$gmclient->addServer("127.0.0.1", ((int)$gearmanPort)); # Add default server (localhost).
$gmclient->setTimeout(300);//socket I/O timeout in milliseconds, so gearman returns an error instead of blocking

//assign a unique id for the job (limit the length to prevent db errors)
$uniqueid = substr('process'.uniqid (rand (),true).rand(0, 32767), 0, 63);

//add the job to the background
$job_handle = $gmclient->addTaskBackground("processqueue", $jobdata, null, $uniqueid);

//had some instances where gearmand was slow to respond, loops below try a few times before giving up.
//ping gearman to make sure it is working before sending job
//keep trying for 2 seconds (4 times per second) until success or failure
$pingCount = 1;
while (@$gmclient->ping(serialize("Ping Test")) === FALSE && $pingCount <= 8) {
    log("couldn't ping gearman server");
    usleep(250000); //sleep for 0.25 seconds before trying again
    $pingCount++;
}

//queue the job
//keep trying for 2 seconds (4 times per second) until success or failure
$runTaskCount = 1;
//suppress gearman fatal errors so we can catch them
while (@!$gmclient->runTasks() && $runTaskCount <= 8) {
    log('Error:: Could not run task on runTaskCount='.$runTaskCount.'/8 error='.$gmclient->error());
    usleep(250000); //sleep for 0.25 seconds before trying again
    $runTaskCount++;
}

if (@$gmclient->returnCode() != GEARMAN_SUCCESS) //suppress gearman fatal errors so we can catch them
{
    log("job didn't queue");
}
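
For anyone trying to build a test case from this: the ping loop and the runTasks loop above share one bounded-retry pattern (up to 8 attempts, 0.25 seconds apart). A minimal sketch of that pattern on its own, with no Gearman dependency, might look like the following. The name retryUntil and its parameters are hypothetical and are not part of the application or of the Gearman extension; the callable here is a stand-in that succeeds on its third call.

```php
<?php
// Hypothetical helper (an assumption for illustration, not from the report):
// call $attempt up to $maxAttempts times, sleeping $delayMicros microseconds
// between failed attempts. Returns the 1-based number of the attempt that
// succeeded, or 0 if every attempt failed.
function retryUntil(callable $attempt, $maxAttempts, $delayMicros)
{
    for ($try = 1; $try <= $maxAttempts; $try++) {
        if ($attempt($try) === true) {
            return $try;
        }
        if ($try < $maxAttempts) {
            usleep($delayMicros); // no sleep after the final failure
        }
    }
    return 0;
}

// Stand-in for a flaky call: fails twice, then succeeds.
$calls = 0;
$result = retryUntil(function ($try) use (&$calls) {
    $calls++;
    return $try >= 3;
}, 8, 1000);
```

In the code above, the callable would wrap $gmclient->ping(...) or $gmclient->runTasks(). This sketch only isolates the loop structure. It does not say whether calling runTasks() again on the same client after a failure is safe, which is the open question in this report.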

Clint Byrum (clint-fewbar) wrote :

Seems like this is probably something corrupting the pointers, but it's hard to say without a test.

Changed in gearmand:
importance: Undecided → Medium

eric (nospamthankyou) wrote :

I've improved the logging in my app so that it writes each entry immediately, instead of writing all the logs when the script finishes (in which case nothing is written if a segfault occurs).

When queuing a job, the ping test is attempted 8 times first, and it fails all 8 times.
@$gmclient->ping(serialize("Ping Test"))

runTasks is supposed to be attempted up to 8 times. After the first attempt it reports the following error, then segfaults.
error=send_packet(GEARMAN_TIMEOUT) Failed in receiving() -> libgearman/connection.cc:494

I have now been able to reproduce the error. The system had been running for 9 days in production. After receiving the segfault notices, I proceeded to test the app (which uploads images and queues them for resizing). I uploaded a batch of 6 images: 4 were queued successfully and 2 segfaulted. I restarted gearmand and uploaded another batch of 10 images, and all were queued successfully.
