segfault in server/mpm/event/event.c:process_socket

Bug #1630413 reported by Ian Wienand on 2016-10-05
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
apache2 (Ubuntu)
Undecided
Unassigned

Bug Description

We have seen consistent but infrequent segfaults of apache on a trusty production server with 2.4.7-1ubuntu4.13 (for more examples, see [1])

---
Oct 2 19:01:03 static kernel: [8029151.932468] apache2[10642]: segfault at 7fac797803a8 ip 00007fac90b345e0 sp 00007fac84ff8e20 error 6 in mod_mpm_event.so[7fac90b2e000+d000]
---

Taking the ip - base seems to put us at a consistent offset

---
(gdb) p/x 0x7fac90b345e0 - 0x7fac90b2e000
$1 = 0x65e0

$ addr2line -e ./mod_mpm_event.so 0x65e0
/build/apache2-Rau9Dr/apache2-2.4.7/server/mpm/event/event.c:1064
---

which is at the bottom of process_socket(), which looks like

---
  1058 /*
  1059 * Prevent this connection from writing to our connection state after it
  1060 * is no longer associated with this thread. This would happen if the EOR
  1061 * bucket is destroyed from the listener thread due to a connection abort
  1062 * or timeout.
  1063 */
  1064 c->sbh = NULL;
  1065 return;
  1066 }
---

1064 seems at least plausible as a faulting location...

Some digging through httpd history reveals that this assignment was removed on the 2.4 branch with commit [2], which seems to be largely based on [3]. Things have been shuffled around so much it's hard to tell exactly what might have avoided us going down this path. Even so I'm honestly not sure how to reproduce it -- on a fairly busy server it's seen at most a few times a day.

[1] http://paste.openstack.org/show/584330/
[2] https://github.com/apache/httpd/commit/043eba1a0a190829c073d9ef084358f6693dbbd2
[3] https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827

Brian Morton (rokclimb15) wrote :

Hi Ian, can you raise ulimit, add CoreDumpDirectory, and install apache2-dbg (will restart to make prior two changes effective)? If you make CoreDumpDirectory /tmp, make sure to move your core dump out of the way before you reboot.

https://httpd.apache.org/dev/debugging.html#crashes

Then you'll get a core dump for analysis. If you post it here I can analyze further.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apache2 (Ubuntu):
status: New → Confirmed

We also encountered this issue with 2.4.7-1ubuntu4.17
I can reproduce it from times to times with a light parralel load (50-200 requests in //)

Here is some details:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f811d7fa700 (LWP 26536)]
process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
1064 event.c: No such file or directory.
(gdb) bt
#0 process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
#2 0x00007f8129ec6184 in start_thread (arg=0x7f811d7fa700) at pthread_create.c:312
#3 0x00007f8129bf2ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) bt full
#0 process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
        c = 0x7f8124196330
        sbh = 0x7f8124196908
        conn_id = <optimized out>
        rc = <optimized out>
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 8
        csd = 0x7f81241960b0
        cs = 0x7f81241962b8
        ptrans = 0x7f8124196028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#2 0x00007f8129ec6184 in start_thread (arg=0x7f811d7fa700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f811d7fa700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140192522413824, -4554003917395454359, 0, 0, 140192522414528, 140192522413824, 4597849143617525353, 4597821488042743401}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3 0x00007f8129bf2ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
(gdb) p c
$1 = (conn_rec *) 0x7f8124196330
(gdb) p *c
Cannot access memory at address 0x7f8124196330
(gdb) p cs
$2 = (event_conn_state_t *) 0x7f81241962b8
(gdb) p *cs
Cannot access memory at address 0x7f81241962b8

And the corresponding core. If you need more information, please ask

Core from previous comment.

Brian Morton (rokclimb15) wrote :

Thanks for the core dump and bt Christophe. After a bit of research, I believe this is a race condition present in 2.4.7 which was subsequently patched, and then the patch refactored when the suspend/resume hooks were added in 2.4.10. The fix in 2.4.7 seems simply enough (just move c->sbh = NULL into the suspend condition above it) but I don't think it would pass SRU justification since it only happens under load and is hard to reproduce. Can you and/or Ian use 2.4.10 from trusty-backports? That shouldn't suffer from this problem.

If not, reply here and I'll get someone to validate my SRU opinion before proceeding.

Download full text (3.4 KiB)

Hi Brian,

Thanks for your fast answer and apologies for my delay

Unfortunately, i reproduced a similar crash on trusty-backports version
Not exactly the same though:

ii apache2 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache HTTP Server
un apache2-api-20120211 <none> <none> (no description available)
ii apache2-bin 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache HTTP Server (binary files and modules)
ii apache2-data 2.4.10-1ubuntu1.1~ubuntu14.04.2 all Apache HTTP Server (common files)
ii apache2-dbg 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache debugging symbols
un apache2-doc <none> <none> (no description available)
ii apache2-mpm-event 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 transitional event MPM package for apache2

(gdb) bt
#0 0x00007feee4331861 in notify_suspend (cs=0x7feee01612a8) at event.c:891
#1 process_socket (my_thread_num=23, my_child_num=<optimized out>, cs=0x7feee01612a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1124
#2 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1875
#3 0x00007feee71eb184 in start_thread (arg=0x7feed2fed700) at pthread_create.c:312
#4 0x00007feee6f17ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) bt full
#0 0x00007feee4331861 in notify_suspend (cs=0x7feee01612a8) at event.c:891
No locals.
#1 process_socket (my_thread_num=23, my_child_num=<optimized out>, cs=0x7feee01612a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1124
        c = <optimized out>
        sbh = 0x7feee0161928
        conn_id = <optimized out>
        rc = <optimized out>
#2 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1875
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 23
        csd = 0x7feee01610a0
        cs = 0x7feee01612a8
        ptrans = 0x7feee0161028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#3 0x00007feee71eb184 in start_thread (arg=0x7feed2fed700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7feed2fed700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140663718860544, 4768455907926276535, 0, 0, 140663718861248, 140663718860544, -4759934683652884041, -4759958051315451465}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#4 0x00007feee6f17ffd in...

Read more...

Brian Morton (rokclimb15) wrote :

Hi Christophe,

I believe I've narrowed down the problem to one fixed in these two changesets:
https://github.com/apache/httpd/commit/59eea59c4be383d004e92fa63b57b995e7a8ef01
https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827

The latter only applies to 2.4.10 since it applies to the suspend/resume hooks. That leaves the first one, which I've applied in my PPA for testing. I've started with 2.4.7 since typically backports aren't for bugfixes.

Would you mind being my guinea pig?

https://launchpad.net/~rokclimb15/+archive/ubuntu/apache2

Sure! Tried with your PPA, still got a segv

Program received signal SIGSEGV, Segmentation fault.
process_socket (my_thread_num=2, my_child_num=<optimized out>, cs=0x7fd636a032a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1066
gdb) bt full
#0 process_socket (my_thread_num=2, my_child_num=<optimized out>, cs=0x7fd636a032a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1066
        c = 0x7fd636a03320
        sbh = 0x7fd636a038f8
        conn_id = <optimized out>
        rc = <optimized out>
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1819
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 2
        csd = 0x7fd636a030a0
        cs = 0x7fd636a032a8
        ptrans = 0x7fd636a03028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#2 0x00007fd64c8c5184 in start_thread (arg=0x7fd642ffd700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7fd642ffd700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140558223791872, -8454169976660319628, 0, 0, 140558223792576, 140558223791872, 8430814151234003572, 8430835441543520884}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3 0x00007fd64c5f1ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Brian Morton (rokclimb15) wrote :

Hi Christophe,

Let's try something completely different. I have a new build uploaded for testing.

Thanks,

Brian

Good news, it seems good this time !

I almost always reproduced the segfault with a single run of my script (which spawn 200 parrallel requests on a file on our daemon behind fastcgi). Once i had to run it twice to reproduce.

With your last version, i didnt had any crash for 100 consecutives runs of the script or when increasing parrallelism

Brian Morton (rokclimb15) wrote :

Fantastic news! My biggest concern now is that my monkey-patch has introduced some unexpected behavior since we don't try to dereference sbh on each read request (only when the connection state is suspended). This is based on my own observation of the problem rather than an upstream patch since all of the fixes rely on APR functionality introduced in 2.4.10.

Can you do some parallel tests of functionality in addition to crash testing? Ideally, I would test from two different clients to see if it confuses connection information or something else strange. Assuming it doesn't, I might ask if an Apache dev could review my patch for a sanity check.

Hi Brian,
I finally did some functional testing

With 2 differents clients (!= IPs), i started on each of them 100 parrallel GETs, targeting 3 different files. I ran this in a loop with some randomness for an hour and it didnt raised any crash nor consistency mismatch on thoses fetched files

I didnt do any checks on headers nor tested anything else than PUTs but it gives a first idea
Tell me if you had anything else in mind

Brian Morton (rokclimb15) wrote :

Hi Christophe,

That is excellent. Could you please provide me with a test case that previously reproduced the crash? I'd like to try to boil it down to something simple. I will need to demonstrate that it can be reproduced easily and consistently to get an SRU approved. There aren't a lot of reporters of this issue, so it's pretty critical.

Of course, if it doesn't get approved you're welcome to use my PPA until you upgrade to 16.04.

I would have wanted to provide the full chain of the reproducer, but days has already passed, here is the information i can provide for now:

I reproduce the crash with a simple:
rm failed; for i in $(seq 200); do (curl -qo /dev/null "$URL" || touch failed) & done

A single run usually is enough, rarely two. In addition to the curl failure and 'failed' file, there's a error line in Apache's error.log

I used the attached script for the consistency checks, tried with and without the sleep.

I run all of thoses with our closed source daemon as the backend behind fastcgi with the following directive parameters:
FastCgiExternalServer /var/www/proxy -host 127.0.0.1:10000 -flush -idle-timeout 300

So you should be able to reproduce if you have some kind of stub behind fastcgi, which you maybe already have ?

Hi Brian,

Some news:

I tried to reproduce the crash with php5-fpm as a backend of fastcgi. I ran the same test on a php file doing a readfile() on a 50MB file again and i can reproduce the crash systematically !

Here is the script and config file attached
Nothing particular, used php5-fpm package with standard config, and enabled fastcgi, mpm_event

Tell me if you need any specific details

Previous archive was missing content, here is the right one

Dont forget to generate the 50M file in /tmp as i doesnt seems to trigger the bug if the content if not big enough

Brian Morton (rokclimb15) wrote :

Hi Christophe,

Thanks for your hard work on this one. Unfortunately I can't reproduce the crash with your test. I even raised the file size to 500M, but still nothing.

Is there anything I could be missing? Any PPA packages with newer versions of PHP or other Apache modules loaded?

root@trusty-mpm-event-crash-test:~# apache2ctl -M
Loaded Modules:
 core_module (static)
 so_module (static)
 watchdog_module (static)
 http_module (static)
 log_config_module (static)
 logio_module (static)
 version_module (static)
 unixd_module (static)
 access_compat_module (shared)
 actions_module (shared)
 alias_module (shared)
 auth_basic_module (shared)
 authn_core_module (shared)
 authn_file_module (shared)
 authz_core_module (shared)
 authz_host_module (shared)
 authz_user_module (shared)
 autoindex_module (shared)
 deflate_module (shared)
 dir_module (shared)
 env_module (shared)
 fastcgi_module (shared)
 filter_module (shared)
 mime_module (shared)
 mpm_event_module (shared)
 negotiation_module (shared)
 setenvif_module (shared)
 status_module (shared)
root@trusty-mpm-event-crash-test:~# apache2ctl -S
VirtualHost configuration:
*:80 trusty-mpm-event-crash-test.lxd (/etc/apache2/sites-enabled/000-default.conf:1)
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex watchdog-callback: using_defaults
Mutex default: dir="/var/lock/apache2" mechanism=fcntl
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33
root@trusty-mpm-event-crash-test:~# apache2ctl -V
Server version: Apache/2.4.7 (Ubuntu)
Server built: Jul 27 2017 15:20:24
Server's Module Magic Number: 20120211:27
Server loaded: APR 1.5.1-dev, APR-UTIL 1.5.3
Compiled using: APR 1.5.1-dev, APR-UTIL 1.5.3
Architecture: 64-bit
Server MPM: event
  threaded: yes (fixed thread count)
    forked: yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

Hi Brian,

Check the attached archive, i did a small dockerfile so you see what i exactly install in it, and hopefully could reproduce it

cd <unpacked targzdir>
docker build -t apache-1630413 .
docker run -it apache-1630413

# in container:
service php5-fpm start && service apache2 start
curl localhost/50M.php | cmp /tmp/50M - && echo PHP-OK
bash /root/test.sh
tail /var/log/apache2/error.log

i got multiple segv each time i run it
Is that working for you ?

Here is the Xenial version which does *not* reproduce the issue, hopefully :)

(docker build -t apache-1630413 . && docker run apache-1630413 bash /root/run.sh)

Hi Brian,

Here is the requested config output if it helps
Were you able to reproduce with the docker containers ?

Christophe

# apache2ctl -M
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.6. Set the 'ServerName' directive globally to suppress this message
Loaded Modules:
 core_module (static)
 so_module (static)
 watchdog_module (static)
 http_module (static)
 log_config_module (static)
 logio_module (static)
 version_module (static)
 unixd_module (static)
 access_compat_module (shared)
 actions_module (shared)
 alias_module (shared)
 auth_basic_module (shared)
 authn_core_module (shared)
 authn_file_module (shared)
 authz_core_module (shared)
 authz_host_module (shared)
 authz_user_module (shared)
 autoindex_module (shared)
 deflate_module (shared)
 dir_module (shared)
 env_module (shared)
 fastcgi_module (shared)
 filter_module (shared)
 mime_module (shared)
 mpm_event_module (shared)
 negotiation_module (shared)
 setenvif_module (shared)
 status_module (shared)

# apache2ctl -S
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.6. Set the 'ServerName' directive globally to suppress this message
VirtualHost configuration:
*:80 172.17.0.6 (/etc/apache2/sites-enabled/000-default.conf:1)
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex default: dir="/var/lock/apache2" mechanism=fcntl
Mutex watchdog-callback: using_defaults
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33

# apache2ctl -V
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.6. Set the 'ServerName' directive globally to suppress this message
Server version: Apache/2.4.7 (Ubuntu)
Server built: Sep 18 2017 16:37:54
Server's Module Magic Number: 20120211:27
Server loaded: APR 1.5.1-dev, APR-UTIL 1.5.3
Compiled using: APR 1.5.1-dev, APR-UTIL 1.5.3
Architecture: 64-bit
Server MPM: event
  threaded: yes (fixed thread count)
    forked: yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers