segfault in server/mpm/event/event.c:process_socket

Bug #1630413 reported by Ian Wienand
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
apache2 (Ubuntu)
Fix Released
Medium
Unassigned

Bug Description

We have seen consistent but infrequent segfaults of apache on a trusty production server with 2.4.7-1ubuntu4.13 (for more examples, see [1])

---
Oct 2 19:01:03 static kernel: [8029151.932468] apache2[10642]: segfault at 7fac797803a8 ip 00007fac90b345e0 sp 00007fac84ff8e20 error 6 in mod_mpm_event.so[7fac90b2e000+d000]
---

Taking the ip - base seems to put us at a consistent offset

---
(gdb) p/x 0x7fac90b345e0 - 0x7fac90b2e000
$1 = 0x65e0

$ addr2line -e ./mod_mpm_event.so 0x65e0
/build/apache2-Rau9Dr/apache2-2.4.7/server/mpm/event/event.c:1064
---

which is at the bottom of process_socket(), which looks like

---
  1058 /*
  1059 * Prevent this connection from writing to our connection state after it
  1060 * is no longer associated with this thread. This would happen if the EOR
  1061 * bucket is destroyed from the listener thread due to a connection abort
  1062 * or timeout.
  1063 */
  1064 c->sbh = NULL;
  1065 return;
  1066 }
---

1064 seems at least plausible as a faulting location...

Some digging through httpd history reveals that this assignment was removed on the 2.4 branch with commit [2], which seems to be largely based on [3]. Things have been shuffled around so much it's hard to tell exactly what might have avoided us going down this path. Even so I'm honestly not sure how to reproduce it -- on a fairly busy server it's seen at most a few times a day.

[1] http://paste.openstack.org/show/584330/
[2] https://github.com/apache/httpd/commit/043eba1a0a190829c073d9ef084358f6693dbbd2
[3] https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827

Revision history for this message
Brian Morton (rokclimb15) wrote :

Hi Ian, can you raise ulimit, add CoreDumpDirectory, and install apache2-dbg (will restart to make prior two changes effective)? If you make CoreDumpDirectory /tmp, make sure to move your core dump out of the way before you reboot.

https://httpd.apache.org/dev/debugging.html#crashes

Then you'll get a core dump for analysis. If you post it here I can analyze further.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apache2 (Ubuntu):
status: New → Confirmed
Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

We also encountered this issue with 2.4.7-1ubuntu4.17
I can reproduce it from times to times with a light parralel load (50-200 requests in //)

Here is some details:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7f811d7fa700 (LWP 26536)]
process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
1064 event.c: No such file or directory.
(gdb) bt
#0 process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
#2 0x00007f8129ec6184 in start_thread (arg=0x7f811d7fa700) at pthread_create.c:312
#3 0x00007f8129bf2ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) bt full
#0 process_socket (my_thread_num=8, my_child_num=<optimized out>, cs=0x7f81241962b8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1064
        c = 0x7f8124196330
        sbh = 0x7f8124196908
        conn_id = <optimized out>
        rc = <optimized out>
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1815
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 8
        csd = 0x7f81241960b0
        cs = 0x7f81241962b8
        ptrans = 0x7f8124196028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#2 0x00007f8129ec6184 in start_thread (arg=0x7f811d7fa700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7f811d7fa700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140192522413824, -4554003917395454359, 0, 0, 140192522414528, 140192522413824, 4597849143617525353, 4597821488042743401}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3 0x00007f8129bf2ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
No locals.
(gdb) p c
$1 = (conn_rec *) 0x7f8124196330
(gdb) p *c
Cannot access memory at address 0x7f8124196330
(gdb) p cs
$2 = (event_conn_state_t *) 0x7f81241962b8
(gdb) p *cs
Cannot access memory at address 0x7f81241962b8

And the corresponding core. If you need more information, please ask

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Core from previous comment.

Revision history for this message
Brian Morton (rokclimb15) wrote :

Thanks for the core dump and bt Christophe. After a bit of research, I believe this is a race condition present in 2.4.7 which was subsequently patched, and then the patch refactored when the suspend/resume hooks were added in 2.4.10. The fix in 2.4.7 seems simply enough (just move c->sbh = NULL into the suspend condition above it) but I don't think it would pass SRU justification since it only happens under load and is hard to reproduce. Can you and/or Ian use 2.4.10 from trusty-backports? That shouldn't suffer from this problem.

If not, reply here and I'll get someone to validate my SRU opinion before proceeding.

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :
Download full text (3.4 KiB)

Hi Brian,

Thanks for your fast answer and apologies for my delay

Unfortunately, i reproduced a similar crash on trusty-backports version
Not exactly the same though:

ii apache2 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache HTTP Server
un apache2-api-20120211 <none> <none> (no description available)
ii apache2-bin 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache HTTP Server (binary files and modules)
ii apache2-data 2.4.10-1ubuntu1.1~ubuntu14.04.2 all Apache HTTP Server (common files)
ii apache2-dbg 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 Apache debugging symbols
un apache2-doc <none> <none> (no description available)
ii apache2-mpm-event 2.4.10-1ubuntu1.1~ubuntu14.04.2 amd64 transitional event MPM package for apache2

(gdb) bt
#0 0x00007feee4331861 in notify_suspend (cs=0x7feee01612a8) at event.c:891
#1 process_socket (my_thread_num=23, my_child_num=<optimized out>, cs=0x7feee01612a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1124
#2 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1875
#3 0x00007feee71eb184 in start_thread (arg=0x7feed2fed700) at pthread_create.c:312
#4 0x00007feee6f17ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) bt full
#0 0x00007feee4331861 in notify_suspend (cs=0x7feee01612a8) at event.c:891
No locals.
#1 process_socket (my_thread_num=23, my_child_num=<optimized out>, cs=0x7feee01612a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1124
        c = <optimized out>
        sbh = 0x7feee0161928
        conn_id = <optimized out>
        rc = <optimized out>
#2 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1875
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 23
        csd = 0x7feee01610a0
        cs = 0x7feee01612a8
        ptrans = 0x7feee0161028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#3 0x00007feee71eb184 in start_thread (arg=0x7feed2fed700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7feed2fed700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140663718860544, 4768455907926276535, 0, 0, 140663718861248, 140663718860544, -4759934683652884041, -4759958051315451465}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#4 0x00007feee6f17ffd in...

Read more...

Revision history for this message
Brian Morton (rokclimb15) wrote :

Hi Christophe,

I believe I've narrowed down the problem to one fixed in these two changesets:
https://github.com/apache/httpd/commit/59eea59c4be383d004e92fa63b57b995e7a8ef01
https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827

The latter only applies to 2.4.10 since it applies to the suspend/resume hooks. That leaves the first one, which I've applied in my PPA for testing. I've started with 2.4.7 since typically backports aren't for bugfixes.

Would you mind being my guinea pig?

https://launchpad.net/~rokclimb15/+archive/ubuntu/apache2

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Sure! Tried with your PPA, still got a segv

Program received signal SIGSEGV, Segmentation fault.
process_socket (my_thread_num=2, my_child_num=<optimized out>, cs=0x7fd636a032a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1066
gdb) bt full
#0 process_socket (my_thread_num=2, my_child_num=<optimized out>, cs=0x7fd636a032a8, sock=<optimized out>, p=<optimized out>, thd=<optimized out>) at event.c:1066
        c = 0x7fd636a03320
        sbh = 0x7fd636a038f8
        conn_id = <optimized out>
        rc = <optimized out>
#1 worker_thread (thd=<optimized out>, dummy=<optimized out>) at event.c:1819
        ti = <optimized out>
        process_slot = <optimized out>
        thread_slot = 2
        csd = 0x7fd636a030a0
        cs = 0x7fd636a032a8
        ptrans = 0x7fd636a03028
        rv = <optimized out>
        is_idle = 0
        te = 0x0
#2 0x00007fd64c8c5184 in start_thread (arg=0x7fd642ffd700) at pthread_create.c:312
        __res = <optimized out>
        pd = 0x7fd642ffd700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140558223791872, -8454169976660319628, 0, 0, 140558223792576, 140558223791872, 8430814151234003572, 8430835441543520884}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0},
            data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#3 0x00007fd64c5f1ffd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Revision history for this message
Brian Morton (rokclimb15) wrote :

Hi Christophe,

Let's try something completely different. I have a new build uploaded for testing.

Thanks,

Brian

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Good news, it seems good this time !

I almost always reproduced the segfault with a single run of my script (which spawn 200 parrallel requests on a file on our daemon behind fastcgi). Once i had to run it twice to reproduce.

With your last version, i didnt had any crash for 100 consecutives runs of the script or when increasing parrallelism

Revision history for this message
Brian Morton (rokclimb15) wrote :

Fantastic news! My biggest concern now is that my monkey-patch has introduced some unexpected behavior since we don't try to dereference sbh on each read request (only when the connection state is suspended). This is based on my own observation of the problem rather than an upstream patch since all of the fixes rely on APR functionality introduced in 2.4.10.

Can you do some parallel tests of functionality in addition to crash testing? Ideally, I would test from two different clients to see if it confuses connection information or something else strange. Assuming it doesn't, I might ask if an Apache dev could review my patch for a sanity check.

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Hi Brian,
I finally did some functional testing

With 2 differents clients (!= IPs), i started on each of them 100 parrallel GETs, targeting 3 different files. I ran this in a loop with some randomness for an hour and it didnt raised any crash nor consistency mismatch on thoses fetched files

I didnt do any checks on headers nor tested anything else than PUTs but it gives a first idea
Tell me if you had anything else in mind

Revision history for this message
Brian Morton (rokclimb15) wrote :

Hi Christophe,

That is excellent. Could you please provide me with a test case that previously reproduced the crash? I'd like to try to boil it down to something simple. I will need to demonstrate that it can be reproduced easily and consistently to get an SRU approved. There aren't a lot of reporters of this issue, so it's pretty critical.

Of course, if it doesn't get approved you're welcome to use my PPA until you upgrade to 16.04.

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

I would have wanted to provide the full chain of the reproducer, but days has already passed, here is the information i can provide for now:

I reproduce the crash with a simple:
rm failed; for i in $(seq 200); do (curl -qo /dev/null "$URL" || touch failed) & done

A single run usually is enough, rarely two. In addition to the curl failure and 'failed' file, there's a error line in Apache's error.log

I used the attached script for the consistency checks, tried with and without the sleep.

I run all of thoses with our closed source daemon as the backend behind fastcgi with the following directive parameters:
FastCgiExternalServer /var/www/proxy -host 127.0.0.1:10000 -flush -idle-timeout 300

So you should be able to reproduce if you have some kind of stub behind fastcgi, which you maybe already have ?

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Hi Brian,

Some news:

I tried to reproduce the crash with php5-fpm as a backend of fastcgi. I ran the same test on a php file doing a readfile() on a 50MB file again and i can reproduce the crash systematically !

Here is the script and config file attached
Nothing particular, used php5-fpm package with standard config, and enabled fastcgi, mpm_event

Tell me if you need any specific details

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Previous archive was missing content, here is the right one

Dont forget to generate the 50M file in /tmp as i doesnt seems to trigger the bug if the content if not big enough

Revision history for this message
Brian Morton (rokclimb15) wrote :

Hi Christophe,

Thanks for your hard work on this one. Unfortunately I can't reproduce the crash with your test. I even raised the file size to 500M, but still nothing.

Is there anything I could be missing? Any PPA packages with newer versions of PHP or other Apache modules loaded?

root@trusty-mpm-event-crash-test:~# apache2ctl -M
Loaded Modules:
 core_module (static)
 so_module (static)
 watchdog_module (static)
 http_module (static)
 log_config_module (static)
 logio_module (static)
 version_module (static)
 unixd_module (static)
 access_compat_module (shared)
 actions_module (shared)
 alias_module (shared)
 auth_basic_module (shared)
 authn_core_module (shared)
 authn_file_module (shared)
 authz_core_module (shared)
 authz_host_module (shared)
 authz_user_module (shared)
 autoindex_module (shared)
 deflate_module (shared)
 dir_module (shared)
 env_module (shared)
 fastcgi_module (shared)
 filter_module (shared)
 mime_module (shared)
 mpm_event_module (shared)
 negotiation_module (shared)
 setenvif_module (shared)
 status_module (shared)
root@trusty-mpm-event-crash-test:~# apache2ctl -S
VirtualHost configuration:
*:80 trusty-mpm-event-crash-test.lxd (/etc/apache2/sites-enabled/000-default.conf:1)
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex watchdog-callback: using_defaults
Mutex default: dir="/var/lock/apache2" mechanism=fcntl
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33
root@trusty-mpm-event-crash-test:~# apache2ctl -V
Server version: Apache/2.4.7 (Ubuntu)
Server built: Jul 27 2017 15:20:24
Server's Module Magic Number: 20120211:27
Server loaded: APR 1.5.1-dev, APR-UTIL 1.5.3
Compiled using: APR 1.5.1-dev, APR-UTIL 1.5.3
Architecture: 64-bit
Server MPM: event
  threaded: yes (fixed thread count)
    forked: yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Hi Brian,

Check the attached archive, i did a small dockerfile so you see what i exactly install in it, and hopefully could reproduce it

cd <unpacked targzdir>
docker build -t apache-1630413 .
docker run -it apache-1630413

# in container:
service php5-fpm start && service apache2 start
curl localhost/50M.php | cmp /tmp/50M - && echo PHP-OK
bash /root/test.sh
tail /var/log/apache2/error.log

i got multiple segv each time i run it
Is that working for you ?

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Here is the Xenial version which does *not* reproduce the issue, hopefully :)

(docker build -t apache-1630413 . && docker run apache-1630413 bash /root/run.sh)

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Hi Brian,

Here is the requested config output if it helps
Were you able to reproduce with the docker containers ?

Christophe

# apache2ctl -M
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.6. Set the 'ServerName' directive globally to suppress this message
Loaded Modules:
 core_module (static)
 so_module (static)
 watchdog_module (static)
 http_module (static)
 log_config_module (static)
 logio_module (static)
 version_module (static)
 unixd_module (static)
 access_compat_module (shared)
 actions_module (shared)
 alias_module (shared)
 auth_basic_module (shared)
 authn_core_module (shared)
 authn_file_module (shared)
 authz_core_module (shared)
 authz_host_module (shared)
 authz_user_module (shared)
 autoindex_module (shared)
 deflate_module (shared)
 dir_module (shared)
 env_module (shared)
 fastcgi_module (shared)
 filter_module (shared)
 mime_module (shared)
 mpm_event_module (shared)
 negotiation_module (shared)
 setenvif_module (shared)
 status_module (shared)

# apache2ctl -S
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.6. Set the 'ServerName' directive globally to suppress this message
VirtualHost configuration:
*:80 172.17.0.6 (/etc/apache2/sites-enabled/000-default.conf:1)
ServerRoot: "/etc/apache2"
Main DocumentRoot: "/var/www"
Main ErrorLog: "/var/log/apache2/error.log"
Mutex default: dir="/var/lock/apache2" mechanism=fcntl
Mutex watchdog-callback: using_defaults
PidFile: "/var/run/apache2/apache2.pid"
Define: DUMP_VHOSTS
Define: DUMP_RUN_CFG
User: name="www-data" id=33
Group: name="www-data" id=33

# apache2ctl -V
AH00558: apache2: Could not reliably determine the server's fully qualified domain name, using 172.17.0.6. Set the 'ServerName' directive globally to suppress this message
Server version: Apache/2.4.7 (Ubuntu)
Server built: Sep 18 2017 16:37:54
Server's Module Magic Number: 20120211:27
Server loaded: APR 1.5.1-dev, APR-UTIL 1.5.3
Compiled using: APR 1.5.1-dev, APR-UTIL 1.5.3
Architecture: 64-bit
Server MPM: event
  threaded: yes (fixed thread count)
    forked: yes (variable process count)
Server compiled with....
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=256
 -D HTTPD_ROOT="/etc/apache2"
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="mime.types"
 -D SERVER_CONFIG_FILE="apache2.conf"

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

The trusty docker file does reproduce the segfault, fwiw. I'll just still try to reproduce it in a lxd container, now that I have the setup instructions.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Interesting, I don't get the segfault with the script in a lxd container.

@chris+launchpad-ielf, is your original segfault scenario inside a docker container as well? I wonder if some resource constraints could be at play here.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Sorry, I had a setup problem. I can reproduce the problem in an LXD container as well.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Some more testing. With the packages from Brian's ppa, the crash doesn't happen. I can't vouch for the patch itself since I'm not familiar at all with that code, though.

Changed in apache2 (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Hi Andreas, thanks for your update !

For your initial question: we initially reproduced on docker containers, yes. I just tried on a VM with the same procedure and i could reproduce the issue so it doesnt seems related to container environment

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Just a ping on this bug, confirming it's still in our queue.

Revision history for this message
Brian Morton (rokclimb15) wrote :

Hi Christophe,

Sorry for the delay. Apparently I wasn't getting these notifications for some reason. I'm not well versed enough with Docker to set up an environment to reproduce. I use LXD almost exclusively. Does the crash occur in your Docker container with my patched PPA build? Andreas seems to indicate that it's fixed by the PPA build. Is there any non-default configuration you're running?

Andreas, it might be worth getting a review from someone better versed in Apache internals if possible. I just took an educated guess about the problem based on the traceback.

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Hi Brian,

The crash doesn't occur with your PPA packages (2.4.7-1ubuntu4.19)
To be more precise: it systematically crash on 1 run of the reproducer without your PPA, and i dont have crash with 10 runs of the reproducer on your PPA. I didnt test further but it sound reasonable to me so far.

Concerning the configuration, no there's nothing specific, basically just enabling PHP5

You can examine files in apache-1630413-docker.tgz without running it. Especially php5-fpm.conf and the Dockerfile itself, its straightforward.

And thoses changes are the only changes i made to the base trusty to reproduce.

As said before, i did thoses steps manually on a new trusty VM and i could also reproduce it

Revision history for this message
Brian Morton (rokclimb15) wrote :

Thanks for the clarification Christophe. So it sounds like the fix addresses the problem. I think the patch in that PPA should get more review from an Apache developer before it is used further.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I'm not sure how likely this bug is to get attention from upstream for an old version of apache, unless the problem still happens with the latest release. Any ideas?

Revision history for this message
Christophe Meron (chris+launchpad-ielf) wrote :

Unfortunately, not really

I can argue on why we use Trusty: as we deploy storage software which runs for years in controlled environment, we never upgrade OSes to new releases. Our older platforms are still on Trusty and that makes sense to me.

But that doesn't make an argument to why they should fix an old version of apache.

We can workaround our issue by using backports or hand-made packages. But as it seems to affect anyone using MPM + a not so heavy parallel workload, it seems worth fixing this in the distribution by default

Revision history for this message
Brian Morton (rokclimb15) wrote : Re: [Bug 1630413] Re: segfault in server/mpm/event/event.c:process_socket

Andreas,

I think patching this in Ubuntu only rather than upstream makes sense for
the reasons you've outlined. However, I would prefer that someone with more
Apache experience reviewed the fix.

Thanks,

Brian

On Fri, Dec 7, 2018 at 10:21 AM Christophe Meron <email address hidden>
wrote:

> Unfortunately, not really
>
> I can argue on why we use Trusty: as we deploy storage software which
> runs for years in controlled environment, we never upgrade OSes to new
> releases. Our older platforms are still on Trusty and that makes sense
> to me.
>
> But that doesn't make an argument to why they should fix an old version
> of apache.
>
> We can workaround our issue by using backports or hand-made packages.
> But as it seems to affect anyone using MPM + a not so heavy parallel
> workload, it seems worth fixing this in the distribution by default
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1630413
>
> Title:
> segfault in server/mpm/event/event.c:process_socket
>
> Status in apache2 package in Ubuntu:
> Triaged
>
> Bug description:
> We have seen consistent but infrequent segfaults of apache on a trusty
> production server with 2.4.7-1ubuntu4.13 (for more examples, see [1])
>
> ---
> Oct 2 19:01:03 static kernel: [8029151.932468] apache2[10642]: segfault
> at 7fac797803a8 ip 00007fac90b345e0 sp 00007fac84ff8e20 error 6 in
> mod_mpm_event.so[7fac90b2e000+d000]
> ---
>
> Taking the ip - base seems to put us at a consistent offset
>
> ---
> (gdb) p/x 0x7fac90b345e0 - 0x7fac90b2e000
> $1 = 0x65e0
>
> $ addr2line -e ./mod_mpm_event.so 0x65e0
> /build/apache2-Rau9Dr/apache2-2.4.7/server/mpm/event/event.c:1064
> ---
>
> which is at the bottom of process_socket(), which looks like
>
> ---
> 1058 /*
> 1059 * Prevent this connection from writing to our connection
> state after it
> 1060 * is no longer associated with this thread. This would
> happen if the EOR
> 1061 * bucket is destroyed from the listener thread due to a
> connection abort
> 1062 * or timeout.
> 1063 */
> 1064 c->sbh = NULL;
> 1065 return;
> 1066 }
> ---
>
> 1064 seems at least plausible as a faulting location...
>
> Some digging through httpd history reveals that this assignment was
> removed on the 2.4 branch with commit [2], which seems to be largely
> based on [3]. Things have been shuffled around so much it's hard to
> tell exactly what might have avoided us going down this path. Even so
> I'm honestly not sure how to reproduce it -- on a fairly busy server
> it's seen at most a few times a day.
>
> [1] http://paste.openstack.org/show/584330/
> [2]
> https://github.com/apache/httpd/commit/043eba1a0a190829c073d9ef084358f6693dbbd2
> [3]
> https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827
>
> To manage notifications about this bug go to:
>
> https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1630413/+subscriptions
>

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

> However, I would prefer that someone with more Apache experience reviewed the fix.

Right, that was actually my (very unclear, sorry) point when I commented on upstream's interest in this, since they would be experienced reviewers.

Revision history for this message
Brian Morton (rokclimb15) wrote :

Ah, that makes sense.

On Mon, Dec 10, 2018 at 6:50 AM Andreas Hasenack <email address hidden>
wrote:

> > However, I would prefer that someone with more Apache experience
> reviewed the fix.
>
> Right, that was actually my (very unclear, sorry) point when I commented
> on upstream's interest in this, since they would be experienced
> reviewers.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1630413
>
> Title:
> segfault in server/mpm/event/event.c:process_socket
>
> Status in apache2 package in Ubuntu:
> Triaged
>
> Bug description:
> We have seen consistent but infrequent segfaults of apache on a trusty
> production server with 2.4.7-1ubuntu4.13 (for more examples, see [1])
>
> ---
> Oct 2 19:01:03 static kernel: [8029151.932468] apache2[10642]: segfault
> at 7fac797803a8 ip 00007fac90b345e0 sp 00007fac84ff8e20 error 6 in
> mod_mpm_event.so[7fac90b2e000+d000]
> ---
>
> Taking the ip - base seems to put us at a consistent offset
>
> ---
> (gdb) p/x 0x7fac90b345e0 - 0x7fac90b2e000
> $1 = 0x65e0
>
> $ addr2line -e ./mod_mpm_event.so 0x65e0
> /build/apache2-Rau9Dr/apache2-2.4.7/server/mpm/event/event.c:1064
> ---
>
> which is at the bottom of process_socket(), which looks like
>
> ---
> 1058 /*
> 1059 * Prevent this connection from writing to our connection
> state after it
> 1060 * is no longer associated with this thread. This would
> happen if the EOR
> 1061 * bucket is destroyed from the listener thread due to a
> connection abort
> 1062 * or timeout.
> 1063 */
> 1064 c->sbh = NULL;
> 1065 return;
> 1066 }
> ---
>
> 1064 seems at least plausible as a faulting location...
>
> Some digging through httpd history reveals that this assignment was
> removed on the 2.4 branch with commit [2], which seems to be largely
> based on [3]. Things have been shuffled around so much it's hard to
> tell exactly what might have avoided us going down this path. Even so
> I'm honestly not sure how to reproduce it -- on a fairly busy server
> it's seen at most a few times a day.
>
> [1] http://paste.openstack.org/show/584330/
> [2]
> https://github.com/apache/httpd/commit/043eba1a0a190829c073d9ef084358f6693dbbd2
> [3]
> https://github.com/apache/httpd/commit/285e67883e396f97dc3aad50d9dc345f15220827
>
> To manage notifications about this bug go to:
>
> https://bugs.launchpad.net/ubuntu/+source/apache2/+bug/1630413/+subscriptions
>

Revision history for this message
Bryce Harrington (bryce) wrote :

From the context of the bug report, this work was towards an SRU for trusty, for fixes included in newer releases.
Trusty is no longer under standard support. Unless this is reproducible in bionic or newer, I think we can consider this as fixed for supported releases.

Changed in apache2 (Ubuntu):
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.