First of all, I beg your pardon for this bug lying dormant.
We started clearing this kind of bug recently, but obviously one can't do it all in one day :-/
Of this one in particular I was made aware by others being affected.
## CASE ##
For an SRU we need a reproducible case of some sort.
On a first try with the mpm_event config as installed by default I can't see this issue.
Tried on Trusty and Xenial, but it just stays waiting for connections.
I'm through some iterations on this and, while not complete yet, have some lessons learned. We need:
1. long running requests
2. a graceful restart that puts all those into "G" for a while
3. a lot of requests that fail because most/all slots are blocked
After some iterations I arrived at this two-system setup:
# Server
# Prep a somewhat large, non-compressible file on the server
$ dd if=/dev/urandom of=/var/www/html/test1 bs=1M count=32
$ dd if=/dev/urandom of=/var/www/html/test2 bs=1k count=4
# Client
# slow down to something like an internet connection
$ tc qdisc add dev eth0 root handle 1: htb default 12
$ tc class add dev eth0 parent 1:1 classid 1:12 htb rate 4000kbps ceil 12000kbps
$ tc qdisc add dev eth0 parent 1:12 netem delay 200ms
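A unit note on the shaping above: in tc, "kbps" means kilobytes per second (kilobits would be "kbit"), so the base rate is roughly 4 MB/s. A quick back-of-the-envelope check of how long one download of the 32 MiB file should take at that rate (pure shell arithmetic, no live shaping needed; with 150 clients sharing the link each transfer takes correspondingly longer):

```shell
# tc's "kbps" is kiloBYTES per second; estimate seconds per download
file_kib=$((32 * 1024))   # the 32 MiB test file, in KiB
rate_kbps=4000            # htb base rate from the tc class above
secs=$((file_kib / rate_kbps))
echo "approx ${secs}s per slow download at the base rate"
```

That is long enough to keep workers busy across a graceful restart, which is what step 1 of the list above needs.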
# Client - 150 slow requests
$ ab -q -S -c 150 -n 150 10.0.4.30/test1
# Server reload to cause "G" state
$ apache2ctl status; apache2ctl graceful; apache2ctl status; sleep 5s; apache2ctl status
# Client - many fast requests, exceeding the few/no remaining workers
$ ab -q -S -c 150 -n 5000 10.0.4.30/test2
# I can see the status clogged up in "G" on most workers {1}, but things still work fine :-/
There must be a way to reproduce this that isn't "be a webhoster for 4000 people".
If one of those affected has something better, please let me know.
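To put a number on the clogging, the scoreboard from mod_status can be counted mechanically. A minimal sketch — the scoreboard string below is a shortened stand-in; on a live box you would feed in the Scoreboard line from `curl -s localhost/server-status?auto` (assuming mod_status is enabled, as in the default Ubuntu config):

```shell
# Count "G" (gracefully finishing) vs "_" (idle) slots in a scoreboard string.
# Stand-in sample; replace with the real Scoreboard line from server-status?auto.
scoreboard='GGGGGGGGGG___W'
g=$(( $(printf '%s' "$scoreboard" | tr -cd 'G' | wc -c) ))
idle=$(( $(printf '%s' "$scoreboard" | tr -cd '_' | wc -c) ))
echo "graceful=$g idle=$idle"
```

Watching those two numbers across an `apache2ctl graceful` makes it easy to see whether the "G" workers ever drain.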
## FIX ##
The fix itself is also a bit messy, as there were multiple revisions, splits of PRs and such.
What I found is that the initial proposal of the fix that eventually went into 2.4.25 is attached as [2], but it was broken up upstream. On the 2.4 branch it actually is [3], plus some doc fixups [4],[5] to be correct after the fix.
## Testing ##
For now I have made a PPA available for testing at [6].
This is a backport of the referred fix for Xenial - yet untested.
Since I can't reproduce it yet, I'm depending on you to:
a) test from the PPA whether the fix works (and shows no related regressions)
b) help me, with or without the PPA, to create some working steps to reproduce
[1]: https://bz.apache.org/bugzilla/show_bug.cgi?id=53555#c39
[2]: https://bz.apache.org/bugzilla/attachment.cgi?id=34202&action=diff&collapsed=&headers=1&format=raw
[3]: https://github.com/apache/httpd/commit/e7407f84ec2a1b7f2c04775a230f147c08860c7c
[4]: https://github.com/apache/httpd/commit/86db1247c70699df6acad75f2491b8baa0030ff6
[5]: https://github.com/apache/httpd/commit/1a7e2114393c9dd9f8d87e53dfd74ce9ede3c3c0
[6]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3034
{1}
18.8 requests/sec - 22.7 MB/second - 1.2 MB/request
1 requests currently being processed, 24 idle workers

PID   Connections      Threads     Async connections
      total accepting  busy idle   writing keep-alive closing
2661  15    no         0    0      0       0          5
2697  22    no         0    0      0       0          0
2725  5     no         0    0      0       0          1
2753  15    no         0    0      0       0          3
2589  8     no         0    0      0       0          3
2815  52    yes        1    24     0       0          34
Sum   117              1    24     0       0          46

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG__________________GGGGGGGGGGGGGGG_W_____
Appendix: default mpm event conf is /etc/apache2/mods-available/mpm_event.conf
        StartServers            2
        MinSpareThreads         25
        MaxSpareThreads         75
        ThreadsPerChild         25
        MaxRequestWorkers       150
        MaxConnectionsPerChild  0
        ThreadLimit             64
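For reference, those defaults give 150 request slots in total; with MaxRequestWorkers 150 and ThreadsPerChild 25 the server needs 6 children at saturation, which is why 150 concurrent slow downloads pin every slot. A quick sanity check of that arithmetic (shell only; variable names just mirror the directives above):

```shell
# Default event MPM capacity: MaxRequestWorkers / ThreadsPerChild children
max_request_workers=150
threads_per_child=25
children=$((max_request_workers / threads_per_child))
echo "children at saturation: $children"
echo "total request slots: $max_request_workers"
```

That also matches the status dump in {1}: six PIDs listed, 25 threads each.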