Segfault in gearman_worker_grab_job

Bug #961904 reported by Vladimir Fedotov on 2012-03-22
This bug affects 3 people
Affects: Gearman
Importance: Medium
Assigned to: Brian Aker

Bug Description

Sometimes my gearman worker crashes with SIGSEGV.
1. It happens when I terminate gearmand with Ctrl-C. It sometimes takes a few cycles of Ctrl-C followed by restarting the gearmand
process to cause this to happen. This problem seems like bug "357881", but in reverse.
2. Sometimes, though less often, it also happens without restarting gearmand.
My stack trace:
/lib64/libc.so.6 [0x3d876302d0]
/usr/local/lib/libgearman.so.6 [0x2b9da60064f0]
/usr/local/lib/libgearman.so.6(_ZN21gearman_connection_st9receivingER17gearman_packet_stR16gearman_return_tb+0xab) [0x2b9da6009a4b]
/usr/local/lib/libgearman.so.6(gearman_worker_grab_job+0xa1) [0x2b9da6011771]
/usr/local/lib/libgearman.so.6(gearman_worker_work+0x43) [0x2b9da6011ca3]
/home/user/bin/MyApp(_ZN7Gearman13WorkerWrapper4workEv+0x18) [0x56a4da]

gearman 0.27
CentOS 5.4

Vladimir Fedotov (4fedotov) wrote :

Once I got this message:
Assertion "packet->universal" failed for function "gearman_packet_free" likely for "Packet that is being freed has not been allocated, most likely this is do to freeing a gearman_task_st or other object twice", at libgearman/packet.cc:256
Maybe it's not related.

Brian Aker (brianaker) wrote :

You need to write a signal handler if you want to exit the worker by pressing Ctrl-C.
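A minimal sketch of the kind of handler meant here, assuming the worker loop can check a flag between jobs. The names g_stop, on_signal, install_handlers and run_worker_loop are hypothetical and not part of libgearman; do_one_unit stands in for a single call such as gearman_worker_work(), which is not invoked in this sketch.

```cpp
#include <csignal>

// Hypothetical flag, set from the signal handler and polled by the loop.
static volatile std::sig_atomic_t g_stop = 0;

// Only async-signal-safe work here: record the request and return.
extern "C" void on_signal(int) { g_stop = 1; }

// Installs the handler for Ctrl-C (SIGINT) and SIGTERM.
// Returns false if either registration fails.
bool install_handlers() {
  return std::signal(SIGINT, on_signal) != SIG_ERR &&
         std::signal(SIGTERM, on_signal) != SIG_ERR;
}

// Runs do_one_unit until a stop was requested, then returns the number
// of completed iterations so the caller can free the worker normally.
template <typename Work>
int run_worker_loop(Work do_one_unit) {
  int iterations = 0;
  while (!g_stop) {
    do_one_unit();  // assumption: one blocking "grab and run a job" step
    ++iterations;
  }
  return iterations;
}
```

With this shape the process leaves the loop after the current step instead of being killed in the middle of a library call. It does not address a crash caused by the server side going away.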

Changed in gearmand:
status: New → Invalid
Vladimir Fedotov (4fedotov) wrote :

1. I wrote about exiting GEARMAND (not the worker) with Ctrl-C.
2. I also wrote that it sometimes happens without any Ctrl-C.

Changed in gearmand:
status: Invalid → New
Brian Aker (brianaker) wrote :

OK, got it... now to reproduce it.

Changed in gearmand:
assignee: nobody → Brian Aker (brianaker)
Brian Aker (brianaker) wrote :

The test case gearman_worker_set_timeout_FAILOVER_TEST can be used to trigger this when run under valgrind.

You can see this for yourself if you enable the test and run:
make valgrind-worker

Brian Aker (brianaker) wrote :

==13910== Invalid write of size 8
==13910== at 0x4C1FAD0: gearman_packet_free(gearman_packet_st*) (packet.cc:299)
==13910== by 0x4C248EC: _worker_unregister(gearman_worker_st*, char const*, unsigned long) (worker.cc:506)
==13910== by 0x4C249DF: gearman_worker_unregister (worker.cc:535)
==13910== by 0x4078A1: gearman_worker_set_timeout_FAILOVER_TEST(void*) (worker_test.cc:814)
==13910== by 0x41CE53: libtest::Runner::run(test_return_t (*)(void*), void*) (runner.cc:36)
==13910== by 0x4128DA: main (test.cc:317)
==13910== Address 0x50e1ce0 is 80 bytes inside a block of size 712 free'd
==13910== at 0x4A062BC: operator delete(void*) (vg_replace_malloc.c:387)
==13910== by 0x4C1E89F: gearman_job_free (job.cc:564)
==13910== by 0x4C24FCD: gearman_worker_grab_job (worker.cc:745)
==13910== by 0x4C25608: gearman_worker_work (worker.cc:951)
==13910== by 0x407856: gearman_worker_set_timeout_FAILOVER_TEST(void*) (worker_test.cc:811)
==13910== by 0x41CE53: libtest::Runner::run(test_return_t (*)(void*), void*) (runner.cc:36)
==13910== by 0x4128DA: main (test.cc:317)
==13910==
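The trace above reports a write into a block that gearman_job_free had already deleted, which is the usual signature of an owner keeping a pointer to a freed object. A minimal sketch of that pattern and one common defence (clearing the owner's pointer on free) follows. Job, Worker, worker_free_job and worker_touch_job are simplified hypothetical stand-ins, not the libgearman types or its actual fix.

```cpp
// Simplified stand-ins; NOT the real gearman_job_st / gearman_worker_st.
struct Job {
  int packet_state = 0;
};

struct Worker {
  Job* job = nullptr;  // owned by the worker
};

// Frees the job and clears the owner's pointer so that no later code
// path can reach the freed block through the worker.
void worker_free_job(Worker& w) {
  delete w.job;     // deleting nullptr is a no-op, so repeated calls are safe
  w.job = nullptr;
}

// Any later cleanup step checks the pointer before writing through it.
// Returns false when there is no live job to touch.
bool worker_touch_job(Worker& w) {
  if (w.job == nullptr) return false;
  w.job->packet_state = 1;
  return true;
}
```

If worker_free_job left w.job dangling, the write in worker_touch_job would be the same kind of invalid write valgrind flags here.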

Changed in gearmand:
importance: Undecided → Medium
Vladimir Fedotov (4fedotov) wrote :

gearman_worker_set_timeout_FAILOVER_TEST is now skipped (1.1.2).
Are you planning to do something about this bug? It still happens. Could this be related to bug 1063648?

James E. Flemer (jflemer) wrote :

Does this help? Using 1.0.2-0~801-2~precise1 on 12.04 LTS, I get this stack from the crash:

#0 0x00007ffdd6a2fa74 in gearman_packet_free (packet=0x214f1a8) at libgearman/packet.cc:305
#1 0x00007ffdd6a35569 in gearman_worker_grab_job (worker=0x1ffc598, job=0x0, ret_ptr=0x7fffc56608bc) at libgearman/worker.cc:788
#2 0x00007ffdd6a35c87 in gearman_worker_work (worker=0x1ffc598) at libgearman/worker.cc:976
#3 0x00007ffdd6c4f868 in zif_gearman_worker_work (ht=<optimized out>, return_value=0x209fd70, return_value_ptr=<optimized out>, this_ptr=<optimized out>,
    return_value_used=<optimized out>) at /tmp/pear/temp/gearman/php_gearman.c:3696

James E. Flemer (jflemer) wrote :

In the case above,
  packet->universal = NULL
So trying to update the (linked list) "packet_list" pointer causes the segfault:
  if (packet->universal->packet_list == packet)

Is the call to gearman_packet_free() (libgearman/worker.cc:788) correct for "NO_JOB"?
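To make the failure mode concrete, here is a minimal sketch of an intrusive packet list where detaching a packet that was never attached (universal == nullptr) is treated as a no-op instead of dereferencing the null pointer. Universal, Packet, packet_attach and packet_detach are simplified hypothetical stand-ins modelled on the fields printed in this report. They are not the libgearman definitions, and whether a guard like this is the right fix for worker.cc is an open question.

```cpp
#include <cstddef>

// Simplified stand-ins; NOT the real libgearman structures.
struct Packet;

struct Universal {
  Packet* packet_list = nullptr;  // head of the intrusive list
  std::size_t packet_count = 0;
};

struct Packet {
  Universal* universal = nullptr;  // nullptr means "never attached"
  Packet* next = nullptr;
  Packet* prev = nullptr;
};

// Pushes p onto the front of u's packet list.
void packet_attach(Universal& u, Packet& p) {
  p.universal = &u;
  p.prev = nullptr;
  p.next = u.packet_list;
  if (u.packet_list) u.packet_list->prev = &p;
  u.packet_list = &p;
  ++u.packet_count;
}

// Unlinks p from its list. Returns false, touching nothing, when the
// packet has no owner; that is the case that segfaults in the report.
bool packet_detach(Packet& p) {
  if (p.universal == nullptr) return false;  // guard against the null owner
  Universal& u = *p.universal;
  if (u.packet_list == &p) u.packet_list = p.next;
  if (p.prev) p.prev->next = p.next;
  if (p.next) p.next->prev = p.prev;
  --u.packet_count;
  p.universal = nullptr;
  p.next = nullptr;
  p.prev = nullptr;
  return true;
}
```

Clearing universal on detach also makes a second detach of the same packet harmless, which covers the double-free situation that the assertion message in packet.cc warns about.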

Here's the whole "packet":
$4 = {options = {allocated = false, complete = false, free_data = false}, magic = GEARMAN_MAGIC_RESPONSE, command = GEARMAN_COMMAND_NO_JOB, argc = 0 '\000', args_size = 12,
  data_size = 0, universal = 0x0, next = 0x0, prev = 0x0, args = 0x214f270 "", data = 0x0, arg = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 0, 0, 0, 0, 0},
  args_buffer = "\000RES\000\000\000\n", '\000' <repeats 119 times>}

Up one level of the stack, the "job" (worker->job) is:

$2 = {options = {allocated = true, assigned_in_use = false, work_in_use = false, finished = false}, worker = 0x1ffc598, next = 0x0, prev = 0x0, con = 0x0, assigned = {
    options = {allocated = false, complete = false, free_data = false}, magic = GEARMAN_MAGIC_RESPONSE, command = GEARMAN_COMMAND_NO_JOB, argc = 0 '\000', args_size = 12,
    data_size = 0, universal = 0x0, next = 0x0, prev = 0x0, args = 0x214f270 "", data = 0x0, arg = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 0, 0, 0, 0,
      0}, args_buffer = "\000RES\000\000\000\n", '\000' <repeats 119 times>}, work = {options = {allocated = false, complete = false, free_data = false},
    magic = GEARMAN_MAGIC_TEXT, command = GEARMAN_COMMAND_TEXT, argc = 0 '\000', args_size = 0, data_size = 0, universal = 0x0, next = 0x0, prev = 0x0, args = 0x0, data = 0x0,
    arg = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 0, 0, 0, 0, 0}, args_buffer = '\000' <repeats 127 times>}, reducer = 0x0,
  error_code = GEARMAN_UNKNOWN_STATE}

Joe Batting (joe-batting22) wrote :

Hello,

We're running into this same issue as well. Any update on a fix?

Of note, this seems to occur when we have heavy network load.

Environment:

O/S: RHEL 6.5
PHP 5.4.28
PHP gearman extension 1.1.2
libgearman 1.1.8

Backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00007f7d18ef7b35 in ?? () from /usr/lib64/libgearman.so.8
(gdb) bt
#0 0x00007f7d18ef7b35 in ?? () from /usr/lib64/libgearman.so.8
#1 0x00007f7d18efd719 in gearman_worker_grab_job () from /usr/lib64/libgearman.so.8
#2 0x00007f7d18efda30 in gearman_worker_work () from /usr/lib64/libgearman.so.8
#3 0x00007f7d1911bd08 in zif_gearman_worker_work (ht=<value optimized out>, return_value=0x1e87198, return_value_ptr=<value optimized out>, this_ptr=<value optimized out>, return_value_used=<value optimized out>) at /var/tmp/gearman/php_gearman.c:3697
#4 0x000000000065eecc in ?? ()
#5 0x000000000064c808 in execute ()
#6 0x00000000005e2370 in zend_execute_scripts ()
#7 0x0000000000584ef8 in php_execute_script ()
#8 0x000000000068e183 in ?? ()
#9 0x000000000068e948 in ?? ()
#10 0x000000381741ed1d in __libc_start_main () from /lib64/libc.so.6
#11 0x0000000000424269 in _start ()

Pai-Wei Lai (paiwei-lai) wrote :

Hello,

It seems we are having the same issue:

gdb backtrace:
#0 0x00007fd2bc992026 in gearman_packet_free (packet=0x40fe898) at libgearman/packet.cc:305
#1 0x00007fd2bc9977f3 in gearman_worker_grab_job (worker=0x419d670, job=0x0, ret_ptr=0x7fff5ae5b31c) at libgearman/worker.cc:797
#2 0x00007fd2bcbb1e10 in pygear_worker_grab_job (self=0x414cfc0) at worker.c:270
...

memory trace:
(gdb) frame 1
#1 0x00007fd2bc9977f3 in gearman_worker_grab_job (worker=0x419d670, job=0x0, ret_ptr=0x7fff5ae5b31c) at libgearman/worker.cc:797
797 in libgearman/worker.cc
(gdb) print worker->job->assigned.command
$7 = GEARMAN_COMMAND_TEXT
(gdb) print worker->job->assigned
$8 = {options = {allocated = false, complete = false, free_data = false}, magic = GEARMAN_MAGIC_TEXT, command = GEARMAN_COMMAND_TEXT,
  argc = 0 '\000', args_size = 0, data_size = 0, universal = 0x0, next = 0x0, prev = 0x0, args = 0x0, data = 0x0, arg = {0x0, 0x0, 0x0, 0x0,
    0x0, 0x0, 0x0, 0x0}, arg_size = {0, 0, 0, 0, 0, 0, 0, 0}, args_buffer = '\000' <repeats 127 times>}
(gdb) frame 0
#0 0x00007fd2bc992026 in gearman_packet_free (packet=0x40fe898) at libgearman/packet.cc:305
305 libgearman/packet.cc: No such file or directory.
        in libgearman/packet.cc
(gdb) print packet
$9 = (gearman_packet_st *) 0x40fe898
(gdb) print packet->args
$10 = 0x0
(gdb) print packet->args_buffer
$11 = '\000' <repeats 127 times>
