segfault in server/mpm/event/event.c:process_socket

Bug #1630413 reported by Ian Wienand on 2016-10-05
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
apache2 (Ubuntu)
Undecided
Unassigned

Bug Description

We have seen consistent but infrequent segfaults of apache on a trusty production server with 2.4.7-1ubuntu4.13 (for more examples, see [1])

---
Oct 2 19:01:03 static kernel: [8029151.932468] apache2[10642]: segfault at 7fac797803a8 ip 00007fac90b345e0 sp 00007fac84ff8e20 error 6 in mod_mpm_event.so[7fac90b2e000+d000]
---

Taking the ip - base seems to put us at a consistent offset

---
(gdb) p/x 0x7fac90b345e0 - 0x7fac90b2e000
$1 = 0x65e0

$ addr2line -e ./mod_mpm_event.so 0x65e0
/build/apache2-Rau9Dr/apache2-2.4.7/server/mpm/event/event.c:1064
---

which is at the bottom of process_socket(), which looks like

---
  1058 /*
  1059 * Prevent this connection from writing to our connection state after it
  1060 * is no longer associated with this thread. This would happen if the EOR
  1061 * bucket is destroyed from the listener thread due to a connection abort
  1062 * or timeout.
  1063 */
  1064 c->sbh = NULL;
  1065 return;
  1066 }
---

1064 seems at least plausible as a faulting location...

Some digging through httpd history reveals that this assignment was removed on the 2.4 branch with commit [2], which seems to be largely based on [3]. Things have been shuffled around so much it's hard to tell exactly what might have avoided us going down this path. Even so I'm honestly not sure how to reproduce it -- on a fairly busy server it's seen at most a few times a day.

[1] http://paste.openstack.org/show/584330/
[2] https://github.com/apache/httpd/commit/043eba1a0a190829c073d9ef084358f6693dbbd2
[3] https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827

Brian Morton (rokclimb15) wrote :

Hi Ian, can you raise ulimit, add CoreDumpDirectory, and install apache2-dbg (will restart to make prior two changes effective)? If you make CoreDumpDirectory /tmp, make sure to move your core dump out of the way before you reboot.

https://httpd.apache.org/dev/debugging.html#crashes

Then you'll get a core dump for analysis. If you post it here I can analyze further.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apache2 (Ubuntu):
status: New → Confirmed

We also encountered this issue with 2.4.7-1ubuntu4.17
I can reproduce it from times to times with a light parralel load (50-200 requests in //)

Here is some details:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f811d7fa700 (LWP 26536)]
process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
1064 event.c: No such file or directory.
(gdb) bt
#0 process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
#2 0x00007f8129ec6184 in start_thread (arg=0x7f811d7fa700) at pthread_create.c:312
#3 0x00007f8129bf2ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) bt full
#0 process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
        c = 0x7f8124196330
        sbh = 0x7f8124196908
        conn_id = <optimized out>
        rc = <optimized out>
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 8
        csd = 0x7f81241960b0
        cs = 0x7f81241962b8
        ptrans = 0x7f8124196028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#2 0x00007f8129ec6184 in start_thread (arg=0x7f811d7fa700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f811d7fa700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140192522413824, -4554003917395454359, 0, 0, 140192522414528, 140192522413824, 4597849143617525353, 4597821488042743401}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3 0x00007f8129bf2ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
(gdb) p c
$1 = (conn_rec *) 0x7f8124196330
(gdb) p *c
Cannot access memory at address 0x7f8124196330
(gdb) p cs
$2 = (event_conn_state_t *) 0x7f81241962b8
(gdb) p *cs
Cannot access memory at address 0x7f81241962b8

And the corresponding core. If you need more information, please ask

Core from previous comment.

Brian Morton (rokclimb15) wrote :

Thanks for the core dump and bt Christophe. After a bit of research, I believe this is a race condition present in 2.4.7 which was subsequently patched, and then the patch refactored when the suspend/resume hooks were added in 2.4.10. The fix in 2.4.7 seems simply enough (just move c->sbh = NULL into the suspend condition above it) but I don't think it would pass SRU justification since it only happens under load and is hard to reproduce. Can you and/or Ian use 2.4.10 from trusty-backports? That shouldn't suffer from this problem.

If not, reply here and I'll get someone to validate my SRU opinion before proceeding.

Download full text (3.4 KiB)

Hi Brian,

Thanks for your fast answer and apologies for my delay

Unfortunately, i reproduced a similar crash on trusty-backports version
Not exactly the same though:

ii apache2 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache HTTP Server
un apache2-api-20120211 <none> <none> (no description available)
ii apache2-bin 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache HTTP Server (binary files and modules)
ii apache2-data 2.4.10-1ubuntu1.1~ubuntu14.04.2 all Apache HTTP Server (common files)
ii apache2-dbg 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache debugging symbols
un apache2-doc <none> <none> (no description available)
ii apache2-mpm-event 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 transitional event MPM package for apache2

(gdb) bt
#0 0x00007feee4331861 in notify_suspend (cs=0x7feee01612a8) at event.c:891
#1 process_socket (my_thread_num=23, my_child_num=<optimized out>, cs=0x7feee01612a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1124
#2 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1875
#3 0x00007feee71eb184 in start_thread (arg=0x7feed2fed700) at pthread_create.c:312
#4 0x00007feee6f17ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) bt full
#0 0x00007feee4331861 in notify_suspend (cs=0x7feee01612a8) at event.c:891
No locals.
#1 process_socket (my_thread_num=23, my_child_num=<optimized out>, cs=0x7feee01612a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1124
        c = <optimized out>
        sbh = 0x7feee0161928
        conn_id = <optimized out>
        rc = <optimized out>
#2 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1875
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 23
        csd = 0x7feee01610a0
        cs = 0x7feee01612a8
        ptrans = 0x7feee0161028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#3 0x00007feee71eb184 in start_thread (arg=0x7feed2fed700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7feed2fed700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140663718860544, 4768455907926276535, 0, 0, 140663718861248, 140663718860544, -4759934683652884041, -4759958051315451465}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#4 0x00007feee6f17ffd in...

Read more...

Brian Morton (rokclimb15) wrote :

Hi Christophe,

I believe I've narrowed down the problem to one fixed in these two changesets:
https://github.com/apache/httpd/commit/59eea59c4be383d004e92fa63b57b995e7a8ef01
https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827

The latter only applies to 2.4.10 since it applies to the suspend/resume hooks. That leaves the first one, which I've applied in my PPA for testing. I've started with 2.4.7 since typically backports aren't for bugfixes.

Would you mind being my guinea pig?

https://launchpad.net/~rokclimb15/+archive/ubuntu/apache2

Sure! Tried with your PPA, still got a segv

Program received signal SIGSEGV, Segmentation fault.
process_socket (my_thread_num=2, my_child_num=<optimized out>, cs=0x7fd636a032a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1066
gdb) bt full
#0 process_socket (my_thread_num=2, my_child_num=<optimized out>, cs=0x7fd636a032a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1066
        c = 0x7fd636a03320
        sbh = 0x7fd636a038f8
        conn_id = <optimized out>
        rc = <optimized out>
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1819
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 2
        csd = 0x7fd636a030a0
        cs = 0x7fd636a032a8
        ptrans = 0x7fd636a03028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#2 0x00007fd64c8c5184 in start_thread (arg=0x7fd642ffd700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7fd642ffd700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140558223791872, -8454169976660319628, 0, 0, 140558223792576, 140558223791872, 8430814151234003572, 8430835441543520884}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3 0x00007fd64c5f1ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers