reload apache2 with mpm_event cause scoreboard is full

Bug #1466926 reported by Branislav Staron on 2015-06-19
This bug affects 14 people
Affects            Importance  Assigned to
apache2 (Ubuntu)   Medium      Unassigned
Trusty             Undecided   Unassigned
Xenial             Undecided   Unassigned
Zesty              Undecided   Unassigned

Bug Description

[Impact]

 * An apache2 reload can fill up the scoreboard with gracefully
   restarting workers to the point that the server is unable to serve
   further requests.

 * Backport the fix from upstream to avoid the issue.

[Test Case]

 * In comment #8 I outlined some steps to fill up the scoreboard with
   gracefully stopping processes, but that never triggered the reported
   bug of eventually breaking new requests for me (it is still useful to
   run some load tests against this part of the Apache code).
   -> Therefore one of the reporters of the bug tested the PPA for quite
      a while on his main machines, which were formerly affected by the
      issue.
   -> There are no clear "do this, then X happens" test case steps.
 * Perform additional functional tests on the package to make sure no
   regressions are visible.

[Regression Potential]

 * It is a rather complex code change in the mpm_event handling.
   Only the first of the backported changes touches code; the other two
   update the documentation to match.
   We ran some Apache benchmarks to check the effect, but found neither
   a positive nor a negative impact in general (other than the bug being
   resolved). Still, if people rely on very specific behavior of the MPM
   handling, that behavior might change slightly.
   It should change for the better, but always remember xkcd.com/1172.

   TL;DR: IMHO it clearly has a greater-than-none regression potential
   on the mpm_event handling.

[Other Info]

 * Since this is hard to test, we ran a 2+ week test via the PPA; see
   comments #8 - #15.

 * It clearly is a fix for some (e.g. the reporter of the bug), but I'd
   understand if the SRU Team rated it as a feature and denied it for an
   SRU - it depends on the POV; it is certainly worth reviewing at least.

----

On a clean install of Ubuntu 14.04 with Apache and almost no client load, the command "service apache2 reload" itself causes the server to allocate slots marked "Gracefully finishing", which reject new connections.

Running "service apache2 reload" four times is sufficient for the server to reject all new requests.

Ubuntu 14.04.2 LTS
Apache 2.4.7-ubuntu4.4 (mpm_event)
Kernel 3.16.0-30-generic

Reproduce problem:
#################################################
1/ service apache2 start
______________________________________________________W_________
___________.....................................................
......................

2/ service apache2 reload

.........................GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGG__________________________________________________W__
______________________

3/ service apache2 reload

___W_____________________GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGG__________________________________________________...
......................

4/ service apache2 reload

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG___
W_____________________

5/ service apache2 reload -> Apache server not responding,
with these entries in the Apache error log:
... [mpm_event:error] [pid 9381:tid 1234563234] AH00485: scoreboard is full, not at MaxRequestWorkers
...
#################################################

My workaround was to change the MPM module from "mpm_event" to "mpm_worker".
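That workaround can be sketched with the standard Debian/Ubuntu module helpers. This is a hedged sketch, not part of the report: the DRY_RUN guard is an illustrative addition so the commands only echo by default; set DRY_RUN=0 and run as root to actually apply the change.

```shell
# Sketch of the workaround: swap mpm_event for mpm_worker using the
# standard Debian/Ubuntu helpers (a2dismod/a2enmod). DRY_RUN=1 (the
# default here) only echoes the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "would run: $*"; else "$@"; fi; }

run a2dismod mpm_event        # disable the event MPM
run a2enmod mpm_worker        # enable the worker MPM
run service apache2 restart   # full restart: MPMs cannot be swapped on reload

# Verify afterwards with: apache2ctl -V | grep -i mpm
```

Note the full restart: unlike a config reload, the active MPM cannot be switched gracefully.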

affects: installation-report (Ubuntu) → apache2 (Ubuntu)
Henti Smith (henti) wrote :

This bug has been discussed on the apache bug tracker :

https://bz.apache.org/bugzilla/show_bug.cgi?id=53555

There seems to be no movement to fix this that I can see. There is a patch which seems to raise the usage level at which users hit the issue.

https://bz.apache.org/bugzilla/attachment.cgi?id=33158

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in apache2 (Ubuntu):
status: New → Confirmed
Stewart Campbell (sc-pulsion) wrote :

A patch has now been committed to trunk for this bug:
https://bz.apache.org/bugzilla/show_bug.cgi?id=53555#c65

Thanks Stewart for the ping; the issue is now resolved upstream and a patch is available!

Changed in apache2 (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → Medium
Robie Basak (racb) on 2016-12-12
tags: added: server-next
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apache2 - 2.4.25-3ubuntu2

---------------
apache2 (2.4.25-3ubuntu2) zesty; urgency=medium

  * Undrop (LP 1658469):
    - Don't build experimental http2 module for LTS:
      + debian/control: removed libnghttp2-dev Build-Depends (in universe).
      + debian/config-dir/mods-available/http2.load: removed.
      + debian/rules: removed proxy_http2 from configure.
      + debian/apache2.maintscript: remove http2 conffile.

 -- Nishanth Aravamudan <email address hidden> Fri, 10 Feb 2017 08:53:43 -0800

Changed in apache2 (Ubuntu):
status: Triaged → Fix Released
Nick (n6ck) wrote :

Will this be backported to Trusty or Xenial?

Yes, this is impacting Xenial - when will this be backported to the currently supported LTS releases?

Haw Loeung (hloeung) on 2017-07-10
Changed in apache2 (Ubuntu Xenial):
status: New → Confirmed
Changed in apache2 (Ubuntu Trusty):
status: New → Confirmed
Haw Loeung (hloeung) on 2017-07-10
Changed in apache2 (Ubuntu Zesty):
status: New → Fix Released
Changed in apache2 (Ubuntu Xenial):
status: Confirmed → Triaged

First of all, I beg your pardon for this bug lying dormant.
We started to clear this kind of bug recently, but obviously one can't do it all in one day :-/
I was made aware of this one in particular by others being affected.

## CASE ##
For an SRU we need a reproducible case of some sort.
On a first try with the mpm_event config as installed by default I can't reproduce this issue. I tried on Trusty and Xenial, but the server just keeps waiting for connections.

I'm through some iterations on this and, while not complete yet, have some lessons learned. We need:
1. long-running requests
2. a graceful restart that puts all of those into "G" for a while
3. a lot of requests that fail because most/all slots are blocked

After some iterations I arrived at this two-system setup:

# Server
# Prep a somewhat large, non-compressible file on the server,
# plus a small one for the fast requests below
$ dd if=/dev/urandom of=/var/www/html/test1 bs=1M count=32
$ dd if=/dev/urandom of=/var/www/html/test2 bs=1k count=4

# Client
# slow the link down to something like an internet connection
$ tc qdisc add dev eth0 root handle 1: htb default 12
$ tc class add dev eth0 parent 1:1 classid 1:12 htb rate 4000kbps ceil 12000kbps
$ tc qdisc add dev eth0 parent 1:12 netem delay 200ms

# Client - 150 slow requests
$ ab -q -S -c 150 -n 150 10.0.4.30/test1
# Server reload to cause "G" state
$ apache2ctl status; apache2ctl graceful; apache2ctl status; sleep 5s; apache2ctl status
# Client - many fast requests, exceeding the few/no remaining workers
$ ab -q -S -c 150 -n 5000 10.0.4.30/test2

# I can see the status clogged up with most workers in "G" {1}, but things still work fine :-/
There must be a way to reproduce this that isn't "be a webhoster for 4000 people".
If one of those affected has something better, please let me know.
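When mod_status is enabled, the "G" clog-up described above can be quantified from the machine-readable scoreboard. A small sketch; the /server-status location and localhost URL are assumptions about the local configuration:

```shell
# count_graceful: count "G" (gracefully finishing) slots in the
# "Scoreboard:" line of Apache mod_status "?auto" output.
count_graceful() {
  printf '%s\n' "$1" |
    awk -F': ' '/^Scoreboard:/ { print $2 }' |  # isolate the scoreboard string
    tr -cd 'G' | wc -c                          # keep only the G's and count them
}

# Typical use against a live server (assumes mod_status at /server-status):
#   count_graceful "$(curl -s http://localhost/server-status?auto)"
```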

## FIX ##
On the fix itself it is also a bit messy, as there were multiple revisions, splits of PRs and such.
What I found is that the initial proposal of the fix that eventually got into 2.4.25 is attached as [2], but it was broken up upstream. On the 2.4 branch it actually is [3] plus some documentation fixups [4],[5] to be correct after the fix.

## Testing ##
For now I have made a PPA available for testing at [6].
This is a backport of the referred fix for Xenial - yet untested.
Since I can't reproduce the issue yet, I'm depending on you to:
a) test from the PPA whether the fix works (and shows no related regression)
b) help me, with or without the PPA, to create some working steps to reproduce

[1]: https://bz.apache.org/bugzilla/show_bug.cgi?id=53555#c39
[2]: https://bz.apache.org/bugzilla/attachment.cgi?id=34202&action=diff&collapsed=&headers=1&format=raw
[3]: https://github.com/apache/httpd/commit/e7407f84ec2a1b7f2c04775a230f147c08860c7c
[4]: https://github.com/apache/httpd/commit/86db1247c70699df6acad75f2491b8baa0030ff6
[5]: https://github.com/apache/httpd/commit/1a7e2114393c9dd9f8d87e53dfd74ce9ede3c3c0
[6]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3034

{1}
18.8 requests/sec - 22.7 MB/second - 1.2 MB/request
1 requests currently being processed, 24 idle workers

PID Connections Threads Async connections
     total accepting busy idle writing keep-alive closing
2661 15 no 0 0 0 0 ...


To make it very clear that I have to rely on feedback from other affected users regarding the test case and the PPA backport, I'll mark the X/T tasks as Incomplete.

Changed in apache2 (Ubuntu Xenial):
status: Triaged → Incomplete
Changed in apache2 (Ubuntu Trusty):
status: Confirmed → Incomplete
Tobias Oetiker (tobi-oetiker) wrote :

thanks for the test PPA ... I am running 2.4.18-2ubuntu3.6~ppa1 now on a system that has been exhibiting the problem every few days ...

It is being triggered by logrotate restarting apache every day btw ...
/etc/logrotate.d/apache2
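For context, the daily reload Tobias mentions comes from the postrotate hook in /etc/logrotate.d/apache2, which looks roughly like the following (abridged sketch from memory; the exact shipped contents vary by release):

```
/var/log/apache2/*.log {
        daily
        rotate 14
        compress
        delaycompress
        sharedscripts
        postrotate
                if /etc/init.d/apache2 status > /dev/null; then
                        /etc/init.d/apache2 reload > /dev/null
                fi
        endscript
}
```

So every rotation issues exactly the kind of reload that gradually fills the scoreboard.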

Hi Tobias,
thanks for the update - it isn't 100% clear to me: does the PPA completely fix the issue for you, and does it show no other regressions that you'd notice?

Tobias Oetiker (tobi-oetiker) wrote :

I have it running on our production server since this morning ... log rotation will happen every day around 6:25 CET ... I will let you know in a few days if the problem is gone ... I have not noticed any regressions in normal operation.

Tobias Oetiker (tobi-oetiker) wrote :

The PPA version has been running without a hitch for the last 19 days on 16.04 ... I'd say it works fine

Thanks Tobias for all the tests.

With all that prep done I'll try to push it to SRU review now.

Changed in apache2 (Ubuntu Trusty):
status: Incomplete → Won't Fix

14.04 is too far away to be reasonable for the backport for now; if one wants to try, please feel free.
But Xenial is what was prepared and tested, so let's only go for that atm.

SRU template prepared and uploaded for the SRU Team's consideration.

description: updated
Changed in apache2 (Ubuntu Xenial):
status: Incomplete → In Progress
Łukasz Zemczak (sil2100) wrote :

With my SRU hat on I am a bit reluctant about accepting this into xenial. Don't get me wrong: I'm not saying no, but it's certainly something I need to know a bit more about to make a proper decision.

The reported problem of course does seem like a bug, but my doubts come from the complexity of the required changes more than anything else. The backported commit for code changes itself seems to be 149 additions and 181 deletions + documentation changes. The changes look a bit invasive, I would like to learn more on how severe this issue is. Is it worth the risk for a bug that's currently set to importance Medium? How frequently does this affect users? From what the reporter mentions there is some working workaround for the problem - does it have any shortcomings?

Hi Łukasz,
that is a fair set of questions, but I'm not a high profile webserver admin either to give you better answers.

We have a few active bug reporters that are affected subscribed, maybe they can try to answer the SRU Teams (=Łukasz in this case) questions?

On Tue, Mar 13, 2018 at 06:25:35AM -0000, ChristianEhrhardt wrote:
> Hi Łukasz,
> that is a fair set of questions, but I'm not a high profile webserver admin either to give you better answers.
>
> We have a few active bug reporters that are affected subscribed, maybe
> they can try to answer the SRU Teams (=Łukasz in this case) questions?

This issue was brought to us some months ago by an Ubuntu Advantage
user, because one of the OpenStack charms (IIRC the keystone charm) was
reloading the configuration often enough to trigger this problem. I
don't have more details than those, as I was not actively working on
the issue.

It is also a common (bad?) practice in configuration management software
to reload configurations often; some tools are better written and reload
only when needed, but I've seen a fair number of sysadmins' scripts
where calling "reload" is considered a cheap and harmless operation.

Haw Loeung (hloeung) wrote :

We ran into this with the main Ubuntu Archive servers (archive.ubuntu.com). They were only configured to reload/graceful on log rotation. Basically, workers were held up until eventually all slots were full and no requests were being serviced.

We've now switched from MPM event to worker, but would like this backported so we can switch back to event.

Hello Branislav, or anyone else affected,

Accepted apache2 into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/apache2/2.4.18-2ubuntu3.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

description: updated
Changed in apache2 (Ubuntu Xenial):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-xenial
Łukasz Zemczak (sil2100) wrote :

With the additional rationale and context I have accepted the apache2 upload into xenial-proposed. But since the change is rather big, please be sure to do additional exploratory testing of the packages. I would also request this to age in -proposed for slightly longer than just 7 days, if possible.

Thanks!

Haw Loeung (hloeung) wrote :

We ran with the apache2 package from xenial-proposed on one of the main Ubuntu Archive servers. It's quite high traffic, and what we saw was quite a lot of processes being killed off (a mixture of both SIGTERM and SIGKILL).

Unfortunately, it was quite user noticeable and we've had to revert.

See error logs for that time period - https://paste.ubuntu.com/p/Pnzr2R7SFv/

tags: added: verification-failed verification-failed-xenial
removed: verification-needed verification-needed-xenial

Thanks a lot hloeung for trying in that environment.
This is exactly what we needed.

Too bad though about the fix; the TL;DR now is:
- there is an issue if users/scripts reload apache too often (bad practice, but it exists)
- we backported the fix that is upstream and tested it, but tests show that it causes new issues in the SRU environment

From the SRU perspective there are five options now:
a) - backport the upstream fix and check for regressions - we tried that and failed

b) - One could start identifying more related upstream changes, but to be honest that would end up being a backport of a full new apache2 release to Xenial (with probably even more potential fallout - not in terms of stability, but e.g. the need to adapt configs) -> not going to happen IMHO

c) - Usually we would try to identify a smaller subset of the fix that is more SRUable, but that doesn't apply in this case -> so not going to happen either

d) - those (few) affected need to adapt their environments to not call reload so often.
     This is the most likely outcome for now :-/

e) - One could get creative.
     What comes to my mind is, for example, rate limiting the reloads:
     one would have to wait up to x time, collecting all reload requests
     until then; then one reload would happen and all of them would
     return.
     But that is fixing a symptom and would (if done in apache2) surely
     affect some things out there that expect a reload to take effect
     immediately.
     So even for these cases, fixing the environment that issues the
     high reload counts is wiser, as the special cases can be better
     considered there.
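A minimal sketch of that rate-limiting idea follows. It is entirely hypothetical: the stamp-file path, the interval variable, and the notion of wrapping apache2ctl are illustrative assumptions, not anything shipped in the package.

```shell
# debounced CMD...: run CMD at most once per MIN_INTERVAL seconds, using
# a stamp file to remember the last run; calls within the window become
# no-ops. STAMP_FILE and MIN_INTERVAL are illustrative knobs.
debounced() {
  stamp="${STAMP_FILE:-/run/apache2-reload.stamp}"
  min="${MIN_INTERVAL:-60}"
  now=$(date +%s)
  last=$(cat "$stamp" 2>/dev/null)
  last=${last:-0}                 # no stamp yet -> treat as long ago
  if [ $((now - last)) -ge "$min" ]; then
    echo "$now" > "$stamp"        # record this run before executing
    "$@"                          # e.g. apache2ctl graceful
  fi
}

# Scripts would call this instead of reloading directly:
#   debounced apache2ctl graceful
```

As noted above, this only papers over the symptom: anything that expects the reload to take effect immediately would be surprised by the silent no-op inside the window.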

In terms of an "overall user base SRU tradeoff" it feels safer to recommend that those affected fix their environments, instead of forcing the change onto everyone.
So, until someone has a better idea, I sadly feel I have to set this to "Won't Fix" for now.

@SRU Team - could one cancel the current upload from x-proposed?

Note: removed from proposed

Changed in apache2 (Ubuntu Xenial):
status: Fix Committed → Triaged