Passing health check does not recover worker from its error state
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Apache2 Web Server | Fix Released | Unknown | |
apache2 (Ubuntu) | Fix Released | Undecided | Unassigned |
Jammy | Fix Released | Undecided | Michał Małoszewski |
Kinetic | Won't Fix | Undecided | Unassigned |
Bug Description
[Impact]
The mod_proxy_hcheck module provides dynamic health checking of workers. The following error is encountered in jammy and kinetic when enabling mod_proxy_hcheck on some Apache2 nodes, which causes unusual behavior and is simply confusing. This is caused by the lack of a defined macro (PROXY_
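For context, mod_proxy_hcheck is configured through per-BalancerMember parameters. A minimal sketch of such a configuration follows; the backend URL, interval, and thresholds are illustrative values, not settings taken from this report:
<Proxy balancer://myapp>
    # hcmethod selects the probe type, hcinterval the seconds between probes,
    # hcpasses/hcfails the number of consecutive results needed to flip the worker state
    BalancerMember http://127.0.0.1:8080 hcmethod=GET hcuri=/ hcinterval=5 hcpasses=1 hcfails=1
</Proxy>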
[Test Plan]
Make a container for testing:
$ lxc launch ubuntu-daily:jammy jammy-test
$ lxc shell jammy-test
Do the same two steps as above for kinetic:
$ lxc launch ubuntu-daily:kinetic kinetic-test
$ lxc shell kinetic-test
Run the commands below inside both the jammy and kinetic containers:
# apt update && apt dist-upgrade -y
# apt install -y apache2 python3
# cat > /etc/apache2/
<VirtualHost *:80>
ServerAdmin <email address hidden>
DocumentRoot "/var/www/html"
ServerName myapp.example.com
ErrorLog "${APACHE_
CustomLog "${APACHE_
ProxyPass / balancer://myapp stickysession=
</VirtualHost>
__EOF__
# cat > /etc/apache2/
<Proxy balancer://myapp>
BalancerMember http://
</Proxy>
__EOF__
# a2enmod status
# a2enmod proxy
# a2enmod proxy_balancer
# a2enmod proxy_http
# a2enmod proxy_hcheck
# a2enmod lbmethod_byrequests
# a2enconf balancers
# a2ensite httpd-hcheck-
# python3 -m http.server --bind 127.0.0.1 8080 &
# PYTHON_PID=$!
# systemctl restart apache2
# curl -s localhost/
# kill -9 $PYTHON_PID
# curl -s localhost -H 'host: myapp.example.com' -o /dev/null
# curl -s localhost/
# python3 -m http.server --bind 127.0.0.1 8080 &
# sleep 10
# curl -s localhost/
Example of failed output:
Serving HTTP on 127.0.0.1 port 8080 (http://
ProxyBalancer[
scripts: line 6: 203609 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[
Serving HTTP on 127.0.0.1 port 8080 (http://
127.0.0.1 - - [18/Jan/2023 12:24:05] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:24:10] "GET / HTTP/1.0" 200 -
ProxyBalancer[
Example of (expected) successful output:
Serving HTTP on 127.0.0.1 port 8080 (http://
ProxyBalancer[
scripts: line 6: 202190 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[
Serving HTTP on 127.0.0.1 port 8080 (http://
127.0.0.1 - - [18/Jan/2023 12:23:12] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:23:17] "GET / HTTP/1.0" 200 -
ProxyBalancer[
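The truncated "ProxyBalancer[...]" lines above are the per-worker status fields reported by mod_status. One way to watch that state while the backend is killed and restarted, assuming Ubuntu's default /server-status location from status.conf (the grep pattern is only an illustration), is:
# for i in $(seq 1 6); do curl -s 'localhost/server-status?auto' | grep '^ProxyBalancer'; sleep 5; done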
[Where problems could occur]
* The patch itself modifies the code of mod_proxy_hcheck, so any new bugs involving that module would be suspect (a basic smoke test is sketched after this list).
* The patch changes the state of the workers, so issues that appear related to workers and their state could also be suspect.
* Since the patch modifies C code, issues typical of C code (segfaults, memory leaks, …) are possible; however, since the change moves a chunk of code unmodified, this seems unlikely.
* There may be a problem if a possible future fix introduces a discrepancy between the value of the defined MODULE_
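A basic smoke test after installing the patched package, assuming the Ubuntu apache2ctl wrapper and the configuration from the test plan above, might be:
# apache2ctl configtest
# apache2ctl -M | grep proxy_hcheck
# systemctl restart apache2 && systemctl --no-pager status apache2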
-------
[Original Report]
While we were in the process of enabling mod_proxy_hcheck on some of our apache2 nodes we encountered an unusual behavior: sometimes, after rebooting a backend, its worker status remains marked as "Init Err" (balancer manager) until another request is made to the backend, no matter how many health checks complete successfully.
The following list shows the sequence of events leading to the problem:
1. Watchdog triggers health check, request is successful; worker status is "Init Ok"
2. HTTP request to apache2 with unreachable backend (rebooting); status becomes "Init Err"
3. Watchdog triggers another health check, request is again successful because the backend recovered; worker status remains "Init Err"
4. same as 3
5. same as 4
The only way for the worker status to recover is to wait for "hcfails" unsuccessful health checks and then for "hcpasses" successful ones to complete, or to wait for legitimate traffic to retry the failed worker, which may not happen for a long time for rarely used applications.
This was surprising to us, since we were expecting the worker status to recover after "hcpasses" successful health checks; however, this doesn't seem to happen when the error status is triggered by ordinary traffic to the backend (i.e. not health checks).
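For reference, the "Init Ok" / "Init Err" strings above are what mod_proxy_balancer's balancer-manager handler displays for each worker. A minimal, illustrative way to expose it locally (this location block and access rule are assumptions, not part of the original report) is:
<Location /balancer-manager>
    # keep this handler restricted: balancer-manager allows runtime changes to the balancer
    SetHandler balancer-manager
    Require local
</Location>
It can then be inspected with something like: curl -s localhost/balancer-manager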
[Test Case]
# apt update && apt dist-upgrade -y
# apt install -y apache2 python3
# cat > /etc/apache2/
<VirtualHost *:80>
ServerAdmin <email address hidden>
DocumentRoot "/var/www/html"
ServerName myapp.example.com
ErrorLog "${APACHE_
CustomLog "${APACHE_
ProxyPass / balancer://myapp stickysession=
</VirtualHost>
__EOF__
# cat > /etc/apache2/
<Proxy balancer://myapp>
BalancerMember http://
</Proxy>
__EOF__
# a2enmod status
# a2enmod proxy
# a2enmod proxy_balancer
# a2enmod proxy_http
# a2enmod proxy_hcheck
# a2enmod lbmethod_byrequests
# a2enconf balancers
# a2ensite httpd-hcheck-
# python3 -m http.server --bind 127.0.0.1 8080 &
# PYTHON_PID=$!
# systemctl restart apache2
# curl -s localhost/
# kill -9 $PYTHON_PID
# curl -s localhost -H 'host: myapp.example.com' -o /dev/null
# curl -s localhost/
# python3 -m http.server --bind 127.0.0.1 8080 &
# sleep 10
# curl -s localhost/
Example of failed output:
Serving HTTP on 127.0.0.1 port 8080 (http://
ProxyBalancer[
scripts: line 6: 203609 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[
Serving HTTP on 127.0.0.1 port 8080 (http://
127.0.0.1 - - [18/Jan/2023 12:24:05] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:24:10] "GET / HTTP/1.0" 200 -
ProxyBalancer[
Example of (expected) successful output:
Serving HTTP on 127.0.0.1 port 8080 (http://
ProxyBalancer[
scripts: line 6: 202190 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[
Serving HTTP on 127.0.0.1 port 8080 (http://
127.0.0.1 - - [18/Jan/2023 12:23:12] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:23:17] "GET / HTTP/1.0" 200 -
ProxyBalancer[
Upstream bug: https:/
Upstream fix: https:/
We would like to see this fix backported to Jammy.
Related branches
- Bryce Harrington (community): Approve
- git-ubuntu bot: Approve
- Canonical Server Reporter: Pending requested
- Diff: 74 lines (+17/-8), 2 files modified:
  debian/changelog (+8/-0)
  debian/patches/reenable-workers-in-standard-error-state-kinetic-apache2.patch (+9/-8)
- git-ubuntu bot: Approve
- Bryce Harrington (community): Approve
- Canonical Server Reporter: Pending requested
- Diff: 90 lines (+68/-0), 3 files modified:
  debian/changelog (+8/-0)
  debian/patches/reenable-workers-in-standard-error-state-jammy-apache2.patch (+59/-0)
  debian/patches/series (+1/-0)
tags: added: needs-merge
Changed in apache2:
status: Unknown → Confirmed
Changed in apache2 (Ubuntu Jammy):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
Changed in apache2 (Ubuntu Kinetic):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
Changed in apache2:
status: Confirmed → Fix Released
tags: removed: needs-merge
Changed in apache2 (Ubuntu Jammy):
status: Triaged → In Progress
Changed in apache2 (Ubuntu Kinetic):
status: Triaged → In Progress
description: updated
description: updated
Changed in apache2 (Ubuntu Kinetic):
assignee: Michał Małoszewski (michal-maloszewski99) → nobody
Changed in apache2 (Ubuntu Kinetic):
status: Fix Committed → Won't Fix
Created attachment 38407
mod_proxy_hcheck: recover worker from error state
While we were in the process of enabling mod_proxy_hcheck on some of our apache2 nodes we encountered an unusual behavior: sometimes, after rebooting a backend, its worker status remains marked as "Init Err" (balancer manager) until another request is made to the backend, no matter how many health checks complete successfully.
The following list shows the sequence of events leading to the problem:
1. Watchdog triggers health check, request is successful; worker status is "Init Ok"
2. HTTP request to apache2 with unreachable backend (rebooting); status becomes "Init Err"
3. Watchdog triggers another health check, request is again successful because the backend recovered; worker status remains "Init Err"
4. same as 3
5. same as 4
The only way for the worker status to recover is to wait for "hcfails" unsuccessful health checks and then for "hcpasses" successful ones to complete, or to wait for legitimate traffic to retry the failed worker, which may not happen for a long time for rarely used applications.
This was surprising to us, since we were expecting the worker status to recover after "hcpasses" successful health checks; however, this doesn't seem to happen when the error status is triggered by ordinary traffic to the backend (i.e. not health checks).
We believe this behavior was accidentally introduced in r1725523. The patch we are proposing seems to fix the problem in our environment.
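To review the change as shipped in the Ubuntu packaging, one option is to fetch the source package; the commands below assume deb-src entries are enabled and use the jammy patch name listed under "Related branches" above:
$ apt-get source apache2
$ less apache2-*/debian/patches/reenable-workers-in-standard-error-state-jammy-apache2.patch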