Slow memory leak, seen on two machines, appears to be dupe of 224945 even after -updates

Bug #422138 reported by Joe McDonagh
26
This bug affects 3 people
Affects Status Importance Assigned to Milestone
apache2 (Ubuntu)
Triaged
Medium
Chuck Short

Bug Description

Binary package hint: apache2

1. Ubuntu Release Info:

Description: Ubuntu 8.04.3 LTS
Release: 8.04

2. Package Information:

apache2-mpm-prefork:
  Installed: 2.2.8-1ubuntu0.11
  Candidate: 2.2.8-1ubuntu0.11
  Version table:
 *** 2.2.8-1ubuntu0.11 0
        500 http://aptproxy hardy-security/main Packages
        500 http://aptproxy hardy-updates/main Packages
        100 /var/lib/dpkg/status
     2.2.8-1 0
        500 http://aptproxy hardy/main Packages

3. I expect apache to not chew up to a couple gigs per process when it runs for a long time.

4. Apache chews threw a couple gigs per process when it runs for a long time

Enabled modules:

alias.conf
alias.load
auth_basic.load
authn_file.load
authz_default.load
authz_groupfile.load
authz_host.load
authz_user.load
autoindex.conf
autoindex.load
cgi.load
deflate.conf
deflate.load
dir.conf
dir.load
env.load
mime.conf
mime.load
negotiation.conf
negotiation.load
php5.conf
php5.load
rewrite.load
setenvif.conf
setenvif.load
ssl.conf
ssl.load
status.conf
status.load

This DOES appear to be a dupe of 224945 (reproducible with ab), however as you can see I am on the hardy-updates version. Is there something obvious I am missing here?

Revision history for this message
Chuck Short (zulcss) wrote :

Interesting,

Can you specify the ab command that you used to reproduce this?

Thanks
chuck

Changed in apache2 (Ubuntu):
assignee: nobody → Chuck Short (zulcss)
status: New → Incomplete
Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :
Download full text (4.6 KiB)

Hey Chuck, thanks for the response. The command is:

ab -n 2000000 -c 100 -f TLS11 https://webserver/

At ten seconds:

30253 www-data 20 0 254m 75m 4432 S 31 0.5 0:03.50 apache2
30252 www-data 20 0 251m 72m 4464 S 30 0.4 0:03.34 apache2
30250 www-data 20 0 250m 70m 4420 S 29 0.4 0:03.26 apache2
30251 www-data 20 0 250m 70m 4448 S 29 0.4 0:03.28 apache2
30249 www-data 20 0 249m 70m 4444 S 28 0.4 0:03.20 apache2
30331 www-data 20 0 244m 65m 4404 R 30 0.4 0:02.96 apache2
30369 www-data 20 0 233m 53m 4432 R 24 0.3 0:02.36 apache2
30410 www-data 20 0 231m 51m 4348 S 22 0.3 0:02.18 apache2
30458 www-data 20 0 224m 45m 4372 S 19 0.3 0:01.86 apache2
30506 www-data 20 0 217m 38m 4324 R 15 0.2 0:01.50 apache2
30557 www-data 20 0 212m 32m 4356 R 12 0.2 0:01.20 apache2
30620 www-data 20 0 207m 27m 4356 S 9 0.2 0:00.92 apache2

This is a significant jump in memory allocation, and children get spawned. I expect the children per my configuration, but the memory use is obscene. At three minutes (stopped ab at this point)

30253 www-data 20 0 366m 187m 4636 S 3 1.2 0:09.52 apache2
30252 www-data 20 0 364m 185m 4656 S 3 1.2 0:09.46 apache2
30251 www-data 20 0 363m 184m 4668 S 3 1.1 0:09.40 apache2
30250 www-data 20 0 361m 182m 4632 S 3 1.1 0:09.30 apache2
30249 www-data 20 0 359m 179m 4676 S 3 1.1 0:09.14 apache2
30331 www-data 20 0 355m 176m 4668 S 3 1.1 0:08.94 apache2
30369 www-data 20 0 343m 164m 4636 S 3 1.0 0:08.34 apache2
30410 www-data 20 0 343m 163m 4628 S 3 1.0 0:08.26 apache2
30458 www-data 20 0 337m 158m 4684 S 3 1.0 0:07.96 apache2
30557 www-data 20 0 320m 141m 4608 R 3 0.9 0:07.08 apache2
30620 www-data 20 0 320m 140m 4676 S 3 0.9 0:07.06 apache2
30675 www-data 20 0 315m 135m 4620 S 3 0.8 0:06.74 apache2
30731 www-data 20 0 311m 131m 4624 S 3 0.8 0:06.54 apache2
30841 www-data 20 0 300m 121m 4656 S 3 0.8 0:05.96 apache2
30901 www-data 20 0 295m 116m 4608 S 3 0.7 0:05.70 apache2
30957 www-data 20 0 294m 114m 4624 S 3 0.7 0:05.62 apache2

Now ab is not running, but my stopwatch program is. Now at 5 minutes:

30458 www-data 20 0 341m 162m 4692 S 0 1.0 0:08.16 apache2
31126 www-data 20 0 290m 111m 4596 S 0 0.7 0:05.42 apache2
31112 www-data 20 0 288m 109m 4604 S 0 0.7 0:05.30 apache2
31210 www-data 20 0 287m 108m 4612 S 0 0.7 0:05.24 apache2
31257 www-data 20 0 287m 107m 4600 S 0 0.7 0:05.24 apache2
31222 www-data 20 0 286m 107m 4636 S 0 0.7 0:05.20 apache2
31410 www-data 20 0 285m 105m 4588 S 0 0.7 0:05....

Read more...

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Sorry forgot to attach the output of AB after I killed it:

ab -n 2000000 -c 100 -f TLS11 https://scan-stag.osdc/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking scan-stag.osdc (be patient)
^C

Server Software: Apache
Server Hostname: scan-stag.osdc
Server Port: 443
SSL/TLS Protocol: TLSv1/SSLv3,DHE-RSA-AES256-SHA,4096,256

Document Path: /
Document Length: 0 bytes

Concurrency Level: 100
Time taken for tests: 185.122 seconds
Complete requests: 9751
Failed requests: 0
Write errors: 0
Non-2xx responses: 9751
Total transferred: 7906663 bytes
HTML transferred: 0 bytes
Requests per second: 52.67 [#/sec] (mean)
Time per request: 1898.496 [ms] (mean)
Time per request: 18.985 [ms] (mean, across all concurrent requests)
Transfer rate: 41.71 [Kbytes/sec] received

Connection Times (ms)
              min mean[+/-sd] median max
Connect: 88 319 487.6 156 3229
Processing: 20 1571 446.0 1700 2235
Waiting: 20 1570 445.9 1699 2235
Total: 140 1889 202.1 1874 5089

Percentage of the requests served within a certain time (ms)
  50% 1874
  66% 1916
  75% 1948
  80% 1972
  90% 2050
  95% 2144
  98% 2293
  99% 2408
 100% 5089 (longest request)

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Chuck, is there anything else you need from me to promote this from incomplete?

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

This took down a box today, had to visit the DC for it... is there any more information-gathering I can do for you or myself here?

Revision history for this message
Stefan Fritsch (sf-sfritsch) wrote :

Is the parameter -f TLS1 necessary to reproduce the problem?

Is the URL / of your webserver a php page? If yes, why is the content length of the page 0, what does the php script do? If not php, what is it? A simple redirect?

A workaround may be to add MaxRequestsPerChild 1000 (or an even lower value) to your configuration.

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote : Re: [Bug 422138] Re: Slow memory leak, seen on two machines, appears to be dupe of 224945 even after -updates

Hi Stefan, thanks for the response. My responses are in-line.
> Is the parameter -f TLS1 necessary to reproduce the problem?
>
No, same behavior. Long-running children just never free up the memory
as long as I am hitting the SSL port, seemingly regardless if I pass an
algorithm to AB or not. I see radically different (as in this time,
normal) behavior if I hit the non-SSL port.
> Is the URL / of your webserver a php page? If yes, why is the content
> length of the page 0, what does the php script do? If not php, what is
> it? A simple redirect?
>
It is the index.php for the CodeIgniter framework, the redirect is
somewhat contrived, as it's handled by some internal CodeIgniter stuff
that I am not very familiar with. I don't think PHP is involved at all
in the leak, since the non-HTTPS port works fine.
> A workaround may be to add MaxRequestsPerChild 1000 (or an even lower
> value) to your configuration.
>
>
That wouldn't really help since long-running children just hold onto
memory no matter what it seems.

I feel like this issue would be huge and brought up by someone else by
now if it's an actual regression. Does the package information I put in
the beginning look right?

--
Joe McDonagh
Operations Engineer
www.colonfail.com

Revision history for this message
Stefan Fritsch (sf-sfritsch) wrote :

> > A workaround may be to add MaxRequestsPerChild 1000 (or an even lower
> > value) to your configuration.
> >
> >
> That wouldn't really help since long-running children just hold onto
> memory no matter what it seems.

A lower MaxRequestsPerChild should make the children terminate earlier.

> Does the package information I put in the beginning look right?

I think so.

Do you use SSLSessionCache dbm? If yes, try switching to shmcb. I seem to remeber a known memory leak in dbm.

You can also try if disabling mod_deflate helps.

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Stefan, thanks for re-iterating the MaxRequestsPerChild tidbit, I had forgotten that option was for lifetime, not concurrency. That'll probably help workaround momentarily.

Also, I am not currently doing any tweaking wrt the SSLSessionCache, I recall messing with it when I first started seeing this problem, but had removed the configuration options since it didn't seem to help much.

I would just like to see some other pair of eyes confirm that a default install with SSL still leaks. This is apparent because when I run the same tests against the HTTP only port things look normal.

Chuck Short (zulcss)
Changed in apache2 (Ubuntu):
importance: Undecided → Low
Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Chuck Short wrote:
> ** Changed in: apache2 (Ubuntu)
> Importance: Undecided => Low
>
>

Chuck, was wondering, should this be marked Low from Undecided if it
hasn't been confirmed nor denied by anyone?

--
Joe McDonagh
Operations Engineer
www.colonfail.com

Revision history for this message
Chuck Short (zulcss) wrote :

@Joe:

Its marked Low so our qa scripts pick it up. Ill take a look at this again after karmic has been released.

Regards
chuck

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Chuck Short wrote:
> @Joe:
>
> Its marked Low so our qa scripts pick it up. Ill take a look at this
> again after karmic has been released.
>
> Regards
> chuck
>
>
Ah, ok, cool. Thanks Chuck.

Revision history for this message
Chuck Short (zulcss) wrote :

@Joe,

have you tried the configuratoin change as suggested?

Thanks
chuck

Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Chuck Short wrote:
> @Joe,
>
> have you tried the configuratoin change as suggested?
>
> Thanks
> chuck
>
>
Hey Chuck, yea to the extreme I think it's at maybe 1 or 100 for
MaxRequestsPerChild, but it seems to resemble the old bug on the
surface. Either way it's quieted down for me, thanks for your attention
to the issue.

--
Joe McDonagh
Operations Engineer
Silent Penguin Services
AIM: YoosingYoonickz
IRC: joe-mac on freenode
www.colonfail.com

Chuck Short (zulcss)
Changed in apache2 (Ubuntu):
assignee: Chuck Short (zulcss) → nobody
assignee: nobody → Chuck Short (zulcss)
importance: Low → Medium
status: Incomplete → Triaged
Revision history for this message
Joe McDonagh (joseph-e-mcdonagh) wrote :

Chuck, you're thinking this is that Debian bug that had an advisory last week right?

Revision history for this message
Chuck Short (zulcss) wrote :

probably

Revision history for this message
Tomas Emma Johansson (etletlien) wrote :

this bug started happening a few weeks ago on poeter.se. the admin said he fixed it, and it worked for a day or two, but the errors are actually worse now, day by day. in the mornings i just got timeouts. otherwise its a 10 sec loading time (from 2-3 secs before). people on smartphones report still getting a SSL error and 2-3 minute loading times. the admin seems to not care though, as usual. the weird part is i also got this on many other small sites, and it all stopped when they "fixed" it on poeter.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.