apache2+ssl hangs on high load

Bug #1028470 reported by Evgeny Anisiforov on 2012-07-24
This bug affects 2 people
Affects: apache2 (Ubuntu)
Importance: Medium
Assigned to: Unassigned

Bug Description

Apache2 stops accepting connections when mod_ssl is enabled and more than 1000 processes are running. This happens only on Ubuntu 12.04 and only with mod_ssl enabled.

Steps to reproduce:
- take a clean install of Ubuntu 12.04 server 64-bit (I used the English installer and all standard settings)
- execute the following commands as root:
$ apt-get update
$ apt-get upgrade
$ apt-get install apache2-mpm-prefork

- change /etc/apache2/apache2.conf to start at least 1001 processes:
<IfModule mpm_prefork_module>
    ServerLimit 1500
    StartServers 1500
    MinSpareServers 1400
    MaxSpareServers 1500
    MaxClients 1500
    MaxRequestsPerChild 1200
</IfModule>

- enable mod_ssl and restart apache:
$ a2enmod ssl
$ service apache2 restart

[no further configuration changes required,
I did not configure any SSL hosts,
only enabled the module]

- verify that Apache is running at least 1001 processes:
$ ps ax | grep apache | wc -l
1502

- verify you can connect to localhost:
$ curl http://localhost/
<html><body><h1>It works!</h1>
<p>This is the default web page for this server.</p>
<p>The web server software is running but no content has been added, yet.</p>
</body></html>

- start high load:
$ ab -n 5000 -c 1000 http://localhost/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 500 requests
apr_poll: The timeout specified has expired (70007)
Total of 998 requests completed

- ready, now apache is not working properly:
$ curl -v http://localhost/
* About to connect() to localhost port 80 (#0)
* Trying 127.0.0.1... connected
> GET / HTTP/1.1
> User-Agent: curl/7.22.0 (x86_64-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.1 zlib/1.2.3.4 libidn/1.23 librtmp/2.3
> Host: localhost
> Accept: */*
>
..... silence
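
A note on the process-count check in the steps above: `ps ax | grep apache | wc -l` also counts the grep process itself, so the reported 1502 is presumably 1500 children, one parent and the grep. A minimal sketch of a stricter count, assuming the processes are named exactly `apache2` (the helper name `count_procs` and the sample data are made up for illustration):

```shell
# Count processes whose command name is exactly "$1", reading one command
# name per line on stdin (the format produced by `ps -e -o comm=`).
count_procs() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Canned sample standing in for real ps output: one parent, three children,
# and the grep that the original pipeline would also have counted.
sample='apache2
apache2
apache2
apache2
grep'
printf '%s\n' "$sample" | count_procs apache2   # prints 4
```

On a live system this would be used as `ps -e -o comm= | count_procs apache2`.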

There are no errors in the logs. After restarting Apache it will work for some time,
but it keeps crashing regularly if there is some traffic coming to the server.
In my tests I sometimes had crashes even with very few users connecting to the server.
For better reproducibility, however, you will need this high connection count for ab.

This is reproducible every time, and I also tested it on 3 different machines.
This is specific to 12.04, as I have the same setup working properly on 11.10 and 12.10.
I'm aware that 1000 processes will consume a lot of RAM. The machine that is supposed
to run this config has 32GB, so this should not be the problem here.

Notice:
 - apache crashed only with mod_ssl enabled
 - apache crashed only with >1000 processes: 1000 processes run fine, 1001 will produce a crash

Additional information:
1) The release of Ubuntu you are using
$ lsb_release -rd
Description: Ubuntu 12.04 LTS
Release: 12.04
2) The version of the package you are using
$ apt-cache policy apache2-mpm-prefork
apache2-mpm-prefork:
  Installed: 2.2.22-1ubuntu1
  Candidate: 2.2.22-1ubuntu1
  Version table:
 *** 2.2.22-1ubuntu1 0
        500 http://de.archive.ubuntu.com/ubuntu/ precise/main amd64 Packages
        100 /var/lib/dpkg/status
3) What you expected to happen
I expect Apache to handle the 5000 requests as usual and continue accepting connections afterwards
4) What happened instead
Apache handles only 1000 requests and then stops accepting new connections entirely, which is a disaster for any website running on the host

description: updated
Stefan Fritsch (sf-sfritsch) wrote :

I cannot reproduce this on Debian unstable with either 2.2.22-9 or 2.2.22-1.

Wild guess: Do you have a per-user process limit configured in /etc/security/limits.conf ?

If not, it would be helpful if you could provide a backtrace of the hanging process that curl connects to. There is some documentation about how to do that in
/usr/share/doc/apache2.2-common/README.backtrace. But the doc is for Debian. On Ubuntu, installing the debugging symbols works differently (maybe someone else can provide a pointer).

As I wrote, this is a clean Ubuntu install; I did not change any config files beyond those mentioned.
/etc/security/limits.conf is the standard config file with no limits inside (only comments).

I can't figure out how to determine the pid of apache child process that curl connects to. If you have an idea, please let me know.

Here is some output, that i get with gdb for a hanging apache:

main apache process (determined with pstree -p):
#0 0x00007f9b07745803 in select () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007f9b07c630fd in apr_sleep () from /usr/lib/libapr-1.so.0
No symbol table info available.
#2 0x00007f9b0853bc69 in ap_wait_or_timeout ()
No symbol table info available.
#3 0x00007f9b08548e79 in ap_mpm_run ()
No symbol table info available.
#4 0x00007f9b0851e4a4 in main ()
No symbol table info available.

some child process
(gdb) bt full
#0 0x00007f9b0774e5b7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007f9b07c4f68e in ?? () from /usr/lib/libapr-1.so.0
No symbol table info available.
#2 0x00007f9b07c504a6 in apr_proc_mutex_lock () from /usr/lib/libapr-1.so.0
No symbol table info available.
#3 0x00007f9b085480dd in ?? ()
No symbol table info available.
#4 0x00007f9b0854893a in ?? ()
No symbol table info available.
#5 0x00007f9b085489f7 in ?? ()
No symbol table info available.
#6 0x00007f9b08549374 in ap_mpm_run ()
No symbol table info available.
#7 0x00007f9b0851e4a4 in main ()
No symbol table info available.

This is basically the same as for the working instance before the crash. The output changes, if i disable mod_ssl.
Any ideas?

Changed in apache2 (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Clint Byrum (clint-fewbar) wrote :

This appears to be legitimate, I was able to reproduce it on an HP cloud instance with the given parameters. The first 1000 actual requests always finish, but after that all fail.

I notice these kernel messages:
[ 1131.976324] TCP: Possible SYN flooding on port 80. Dropping request. Check SNMP counters.

But I don't think it is related.

I see this as well *sometimes*:

[Wed Jul 25 20:20:10 2012] [error] server reached MaxClients setting, consider raising the MaxClients setting

But MaxClients is set to 1500 so I'm not sure what that is.

The one difference mod_ssl would introduce would be the use of shared memory for statistical gathering. So maybe the stats are running into a shm limit.

I tried raising shmall to 4194304, but that just slowed things down a bit, it still fails right at 1000. I also tried raising shmmni to 8192, and that did nothing. Same for doubling shmmax.

Comparing straces with and without mod_ssl enabled, the problem most likely lies with shared memory or semaphore operations, which only seem to happen with mod_ssl. I also tried adjusting the numbers in /proc/sys/kernel/sem, but that did not alleviate the problem.
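
For anyone repeating the /proc/sys/kernel/sem experiment, it may help to record which limit is which. The file holds four numbers in the order SEMMSL, SEMMNS, SEMOPM, SEMMNI. A small sketch that labels them (the helper name `describe_sem` is hypothetical and the sample values are only illustrative, not taken from the affected machine):

```shell
# Label the four whitespace-separated values of /proc/sys/kernel/sem.
# Field order: SEMMSL (semaphores per set), SEMMNS (semaphores system-wide),
# SEMOPM (operations per semop call), SEMMNI (number of semaphore sets).
describe_sem() {
    awk 'NF >= 4 { printf "semmsl=%s semmns=%s semopm=%s semmni=%s\n", $1, $2, $3, $4 }'
}

# Illustrative input only; on a real system use:
#   describe_sem < /proc/sys/kernel/sem
printf '250\t32000\t32\t128\n' | describe_sem
```

The semaphore sets currently allocated can be listed with `ipcs -s`.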

It's also worth noting that 1000 processes is inefficient for more reasons than just memory. Context switching at the process level will be far more expensive than with a threaded model. For that reason alone I've set this to "Medium", as it's really just not a great way to configure Apache.

tags: added: precise

I could verify getting the same log messages on my system.

This, however, does not seem to be directly related to the problem. I see the same messages when testing on Ubuntu 11.10, but there Apache remains stable and responsive.

I have found some interesting behavior when using the -k switch with ab, which ensures that only the specified number of processes is really used and no process is busy waiting for a connection to close:
$ ab -k -n 5000 -c 999 http://localhost/
.... all requests will complete without error
$ curl http://localhost/
<html><body><h1>It works!</h1>
<p>This is the default web page for this server.</p>
<p>The web server software is running but no content has been added, yet.</p>
</body></html>
$ curl http://localhost/
... no answer

So here you can literally see how connection #1001 (the "magic number" that appeared before) breaks Apache. Maybe some kind of buffer is filling up?
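
Since the hung curl calls above never return on their own, a probe with a hard time limit makes this check scriptable. A rough sketch, assuming a 5-second limit is long enough for a healthy server (the function names and the limit are illustrative; the exit codes are curl's documented ones):

```shell
# Map a curl exit status to a short verdict.
# 0 = success, 7 = failed to connect, 28 = operation timed out.
classify_curl_exit() {
    case "$1" in
        0)  echo "responding" ;;
        7)  echo "could not connect" ;;
        28) echo "hung (timed out)" ;;
        *)  echo "other error ($1)" ;;
    esac
}

# Probe URL $1, giving up after ${2:-5} seconds, and print the verdict.
probe() {
    curl --silent --output /dev/null --max-time "${2:-5}" "$1"
    classify_curl_exit "$?"
}

classify_curl_exit 28   # prints: hung (timed out)
```

Running `probe http://localhost/` after each ab run would then distinguish a hang, where the connection opens but no reply arrives, from a refused connection.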

I also agree with you that this issue has medium importance. I think our Apache setup is not very common. We are running mod_php and therefore rely on the prefork model, because PHP is not thread safe. Also please note that the presented config was changed to maximize reproducibility. On our production system we have about 700 processes running most of the time, but at peak traffic this number rises to 1000-1200, producing the hang that I have described.
So while systems with such load may be uncommon, the reported problem does exist in the real world. Thanks for paying attention to it!

Stefan Fritsch (sf-sfritsch) wrote :

Evgeny, you can use "netstat -tnp |grep curl " to get the other port number of the connection from curl to apache2. With that, you can look for the other end of the connection in the "netstat -tnp" output. The last column should be "123/apache2", where 123 is the pid of the apache2 process. You will have to execute netstat -tnp as root to get the info.
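
The lookup described here can be sketched as a small filter over `netstat -tnp` output. This is only an illustration: the helper name `server_side_owner` and the sample line (client port 40000) are made up, and the column positions assume the usual Linux net-tools layout with the PID/program name in column 7.

```shell
# Print the "PID/Program name" column of the server-side socket for a given
# client port, reading `netstat -tnp` output on stdin.
# $1 = client (ephemeral) port, $2 = server port (defaults to 80).
server_side_owner() {
    awk -v cport="$1" -v sport="${2:-80}" '
        # Server side: local address ends in :sport, foreign address in :cport.
        $4 ~ (":" sport "$") && $5 ~ (":" cport "$") { print $7 }
    '
}

# Hypothetical sample: the first line is the server's end, the second is curl's.
sample='tcp 0 0 127.0.0.1:80 127.0.0.1:40000 ESTABLISHED 123/apache2
tcp 0 0 127.0.0.1:40000 127.0.0.1:80 ESTABLISHED 347/curl'
printf '%s\n' "$sample" | server_side_owner 40000   # prints 123/apache2
```

On a live system: `netstat -tnp | server_side_owner <client-port>`, run as root.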

The backtrace of the child process you posted looks more like a process that is waiting for a connection. But one would need the debug info installed to be absolutely sure.

Clint Byrum (clint-fewbar) wrote :

One thing to consider as a workaround to this is to use php5-fpm for per-user PHP, running a daemon per user. This has the additional benefit of being able to limit each user's memory usage individually. You can then switch to apache worker, which I'm sure does not have this issue. This should also be quite a bit more memory efficient as static files will be served from the apache threads rather than all 1000+ processes.

Unfortunately I do not get any PID with this method. The other end of the connection is simply "-", not associated with any apache2 process:

root@ubuntu:/home/jeff# netstat -tnp |grep curl
tcp 0 161 127.0.0.1:33399 127.0.0.1:80 ESTABLISHED 347/curl
root@ubuntu:/home/jeff# netstat -tnp | grep 33399
tcp 0 0 127.0.0.1:80 127.0.0.1:33399 SYN_RECV -
tcp 0 161 127.0.0.1:33399 127.0.0.1:80 ESTABLISHED 347/curl

I have tried capturing the HTTP traffic to get some insight: tcpdump -p -s0 -w dump.cap -i lo port 80
This is the dump of the last curl request. Wireshark shows me multiple TCP retransmits, but no reply from the server. So it may be something going wrong at the TCP level. Could someone with a deeper understanding of TCP take a look at the dump?
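
Alongside the capture, a per-state count of the sockets on the server port gives a quick overview. The sketch below summarizes `netstat -tn` style output; the helper name `count_states` is hypothetical. As an interpretation rather than an established finding for this bug: many server-side sockets stuck in SYN_RECV while clients report ESTABLISHED would be consistent with connections never being picked up by accept().

```shell
# Count sockets per TCP state for local port $1 (default 80), reading
# `netstat -tn` output on stdin. Prints "STATE COUNT" lines sorted by state.
count_states() {
    awk -v sport="${1:-80}" '
        $1 ~ /^tcp/ && $4 ~ (":" sport "$") { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Sample input modelled on the netstat output quoted in this report.
sample='tcp 0 0 127.0.0.1:80 127.0.0.1:33399 SYN_RECV -
tcp 0 161 127.0.0.1:33399 127.0.0.1:80 ESTABLISHED 347/curl'
printf '%s\n' "$sample" | count_states 80   # prints: SYN_RECV 1
```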

If someone is interested, i can provide a downloadable virtual appliance running ubuntu 12.04 with the reported bug (.OVF) from my virtualbox for debugging purposes.

I have made some further experiments:
the bug still occurs with the latest updates.

I have traced the problem down to the main Apache functions. It is not a mod_ssl issue!
What actually happens is that, due to "a2enmod ssl", the server has to listen on two ports: 80 and 443. This activates AcceptMutex to synchronize the running processes. And here some problem occurs under Ubuntu 12.04.
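
To check whether a given configuration falls into the multi-listener case described here, one can count the active Listen directives. A minimal sketch (the helper name `count_listeners` is made up; it does not evaluate <IfModule> conditions or follow Include directives, so it only approximates what Apache actually binds):

```shell
# Count uncommented Listen directives in Apache config text on stdin.
# Directive names are case-insensitive, hence tolower().
count_listeners() {
    awk 'tolower($1) == "listen" { n++ } END { print n + 0 }'
}

# Illustrative config text, not a copy of the shipped ports.conf.
sample='Listen 80
# Listen 8080
Listen 443'
printf '%s\n' "$sample" | count_listeners   # prints 2
```

A result greater than 1 means the children are accepting on several sockets, which is the situation in which the accept mutex comes into play according to the analysis above.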

So an alternative way to reproduce is:
$ a2dismod ssl
$ echo "Listen 81" >> /etc/apache2/ports.conf
$ service apache2 restart
$ ab -c 1000 -n 5000 http://localhost/

Changing the "AcceptMutex" config option causes different results. For example with:
AcceptMutex posixsem
I get all 5000 requests done without error. Running ab again then causes Apache to stop accepting requests.

description: updated

The issue is in part kernel related.
This commit (included in 3.2.9) http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=28d82dc1c4edbc352129f97f4ca22624d1fe61de introduced an epoll path limit of 1000 (1000 processes listening on the same socket).

This commit (3.2.17) http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=93dc6107a76daed81c07f50215fa6ae77691634f restores the old behavior for non-nested epoll paths (so you can use an unlimited number of Apache processes).

Apache shouldn't hang but updating your kernel will solve the issue.
I've opened an apache ticket (https://issues.apache.org/bugzilla/show_bug.cgi?id=54502)
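
The version window given above (limit introduced in 3.2.9, old behavior restored in 3.2.17) can be turned into a quick check. This is a sketch under assumptions: it only makes sense for upstream 3.2.y version strings, it relies on GNU `sort -V`, and the function names are made up. Ubuntu's `uname -r` reports a package ABI version rather than the upstream stable release, so the upstream number has to come from elsewhere (to my understanding, /proc/version_signature carries it on Ubuntu kernels).

```shell
# Succeed if dotted version $1 is strictly lower than $2 (uses GNU sort -V).
version_lt() {
    [ "$1" != "$2" ] &&
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Succeed if $1 lies in [3.2.9, 3.2.17), the range named in the comment above.
in_reported_range() {
    ! version_lt "$1" 3.2.9 && version_lt "$1" 3.2.17
}

for v in 3.2.8 3.2.9 3.2.16 3.2.17; do
    if in_reported_range "$v"; then
        echo "$v: in reported range"
    else
        echo "$v: outside reported range"
    fi
done
```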

I can confirm that upgrading the kernel to the most current version solves the problem:
$ sudo apt-get dist-upgrade

Thank you!
