Passing health check does not recover worker from its error state

Bug #2003189 reported by Sistemi CeSIA
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Apache2 Web Server
Fix Released
Unknown
apache2 (Ubuntu)
Fix Released
Undecided
Unassigned
Jammy
Fix Committed
Undecided
Michał Małoszewski
Kinetic
Fix Committed
Undecided
Michał Małoszewski

Bug Description

[Impact]
The mod_proxy_hcheck module provides dynamic health checking of workers. The following error is encountered in jammy and kinetic when enabling mod_proxy_hcheck on some Apache2 nodes, which causes unusual behavior and is simply confusing.This is caused by the lack of a defined macro (PROXY_WORKER_IS_ERROR) and the lack of an operation on that in the source code. Fix is based on adding the missing part.

[Test Plan]
Make a container for testing:
$ lxc launch ubuntu-daily:jammy jammy-test
$ lxc shell jammy-test
Do the same 2 steps from above for kinetic:

$ lxc launch ubuntu-daily:kinetic kinetic-test
$ lxc shell kinetic-test
Type in the code from below inside both the kinetic and jammy containers:
# apt update && apt dist-upgrade -y
# apt install -y apache2 python3
# cat > /etc/apache2/sites-available/httpd-hcheck-initerr.conf << '__EOF__'
<VirtualHost *:80>
    ServerAdmin <email address hidden>
    DocumentRoot "/var/www/html"
    ServerName myapp.example.com
    ErrorLog "${APACHE_LOG_DIR}/myapp.example.com-error_log"
    CustomLog "${APACHE_LOG_DIR}/myapp.example.com-access_log" common
    ProxyPass / balancer://myapp stickysession=JSESSIONID
</VirtualHost>
__EOF__
# cat > /etc/apache2/conf-available/balancers.conf << '__EOF__'
<Proxy balancer://myapp>
    BalancerMember http://127.0.0.1:8080/ route=app-route hcmethod=GET hcinterval=5 hcpasses=1 hcfails=1
</Proxy>
__EOF__
# a2enmod status
# a2enmod proxy
# a2enmod proxy_balancer
# a2enmod proxy_http
# a2enmod proxy_hcheck
# a2enmod lbmethod_byrequests
# a2enconf balancers
# a2ensite httpd-hcheck-initerr
# python3 -m http.server --bind 127.0.0.1 8080 &
# PYTHON_PID=$!
# systemctl restart apache2
# curl -s localhost/server-status?auto | grep ProxyBalancer | grep Status
# kill -9 $PYTHON_PID
# curl -s localhost -H 'host: myapp.example.com' -o /dev/null
# curl -s localhost/server-status?auto | grep ProxyBalancer | grep Status
# python3 -m http.server --bind 127.0.0.1 8080 &
# sleep 10
# curl -s localhost/server-status?auto | grep ProxyBalancer | grep Status
Example failed output:
Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
ProxyBalancer[0]Worker[0]Status: Init Ok
scripts: line 6: 203609 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[0]Worker[0]Status: Init Err
Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
127.0.0.1 - - [18/Jan/2023 12:24:05] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:24:10] "GET / HTTP/1.0" 200 -
ProxyBalancer[0]Worker[0]Status: Init Err
Example of (expected) successful output:
Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
ProxyBalancer[0]Worker[0]Status: Init Ok
scripts: line 6: 202190 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[0]Worker[0]Status: Init Err
Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
127.0.0.1 - - [18/Jan/2023 12:23:12] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:23:17] "GET / HTTP/1.0" 200 -
ProxyBalancer[0]Worker[0]Status: Init Ok

[Where problems could occur]
The patch itself modifies the code of mod_proxy_hcheck, so any new bugs involving that module would be suspect. The patch changes the state of the workers, so issues cropping up that seem related to workers and its state could be suspect. Finally, since the patch modifies C code, issues typical of C code (segfaults, memory leaks, …) would be possible, however since this moves a chunk of code unmodified this seems unlikely.

----------------------------------------------------------------------

[Original Report]

While we were in the process of enabling mod_proxy_hcheck on some of our apache2 nodes we encountered an unusual behavior: sometimes, after rebooting a backend, its worker status remains marked as "Init Err" (balancer manager) until another request is made to the backend, no matter how many health checks complete successfully.

The following list shows the sequence of events leading to the problem:

1. Watchdog triggers health check, request is successful; worker status is "Init Ok"
2. HTTP request to apache2 with unreachable backend (rebooting); status becomes "Init Err"
3. Watchdog triggers another health check, request is again successful because the backend recovered; worker status remains "Init Err"
4. same as 3
5. same as 4

The only way for the worker status to recover is to wait for "hcfails" unsuccessful health checks and then again for "hcpasses" requests to be completed or just wait for legitimate traffic to retry the failed worker, which may not happen for a long time for rarely used applications.

This was surprising to us since we were expecting the worker status to be recovered after "hcpasses" successful health checks; however this doesn't seem to happen when the error status is triggered by ordinary traffic to the backend (i.e not health checks).

[Test Case]
# apt update && apt dist-upgrade -y
# apt install -y apache2 python3
# cat > /etc/apache2/sites-available/httpd-hcheck-initerr.conf << '__EOF__'
<VirtualHost *:80>
    ServerAdmin <email address hidden>
    DocumentRoot "/var/www/html"
    ServerName myapp.example.com
    ErrorLog "${APACHE_LOG_DIR}/myapp.example.com-error_log"
    CustomLog "${APACHE_LOG_DIR}/myapp.example.com-access_log" common

    ProxyPass / balancer://myapp stickysession=JSESSIONID

</VirtualHost>
__EOF__
# cat > /etc/apache2/conf-available/balancers.conf << '__EOF__'
<Proxy balancer://myapp>
    BalancerMember http://127.0.0.1:8080/ route=app-route hcmethod=GET hcinterval=5 hcpasses=1 hcfails=1
</Proxy>
__EOF__
# a2enmod status
# a2enmod proxy
# a2enmod proxy_balancer
# a2enmod proxy_http
# a2enmod proxy_hcheck
# a2enmod lbmethod_byrequests
# a2enconf balancers
# a2ensite httpd-hcheck-initerr
# python3 -m http.server --bind 127.0.0.1 8080 &
# PYTHON_PID=$!
# systemctl restart apache2
# curl -s localhost/server-status?auto | grep ProxyBalancer | grep Status
# kill -9 $PYTHON_PID
# curl -s localhost -H 'host: myapp.example.com' -o /dev/null
# curl -s localhost/server-status?auto | grep ProxyBalancer | grep Status
# python3 -m http.server --bind 127.0.0.1 8080 &
# sleep 10
# curl -s localhost/server-status?auto | grep ProxyBalancer | grep Status

Example failed output:

Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
ProxyBalancer[0]Worker[0]Status: Init Ok
scripts: line 6: 203609 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[0]Worker[0]Status: Init Err
Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
127.0.0.1 - - [18/Jan/2023 12:24:05] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:24:10] "GET / HTTP/1.0" 200 -
ProxyBalancer[0]Worker[0]Status: Init Err

Example of (expected) successful output:

Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
ProxyBalancer[0]Worker[0]Status: Init Ok
scripts: line 6: 202190 Killed python3 -m http.server --bind 127.0.0.1 8080
ProxyBalancer[0]Worker[0]Status: Init Err
Serving HTTP on 127.0.0.1 port 8080 (http://127.0.0.1:8080/) ...
127.0.0.1 - - [18/Jan/2023 12:23:12] "GET / HTTP/1.0" 200 -
127.0.0.1 - - [18/Jan/2023 12:23:17] "GET / HTTP/1.0" 200 -
ProxyBalancer[0]Worker[0]Status: Init Ok

Upstream bug: https://bz.apache.org/bugzilla/show_bug.cgi?id=66302
Upstream fix: https://svn.apache.org/viewvc?view=revision&revision=1906496

We would like to see this fix backported in Jammy.

Related branches

Revision history for this message
In , Alessandro-cavalier7 (alessandro-cavalier7) wrote :

Created attachment 38407
mod_proxy_hcheck: recover worker from error state

While we were in the process of enabling mod_proxy_hcheck on some of our apache2 nodes we encountered an unusual behavior: sometimes, after rebooting a backend, its worker status remains marked as "Init Err" (balancer manager) until another request is made to the backend, no matter how many health checks complete successfully.

The following list shows the sequence of events leading to the problem:

1. Watchdog triggers health check, request is successful; worker status is "Init Ok"
2. HTTP request to apache2 with unreachable backend (rebooting); status becomes "Init Err"
3. Watchdog triggers another health check, request is again successful because the backend recovered; worker status remains "Init Err"
4. same as 3
5. same as 4

The only way for the worker status to recover is to wait for "hcfails" unsuccessful health checks and then again for "hcpasses" requests to be completed or just wait for legitimate traffic to retry the failed worker, which may not happen for a long time for rarely used applications.

This was surprising to us since we were expecting the worker status to be recovered after "hcpasses" successful health checks; however this doesn't seem to happen when the error status is triggered by ordinary traffic to the backend (i.e not health checks).

We believe this behavior was accidentally introduced in r1725523. The patch we are proposing seems to fix the problem in our environment.

Revision history for this message
In , V-jiz-h (v-jiz-h) wrote :

Thanks for the report.

In general, Health Check errors are considered different from "normal" errors and I can see why the behavior below is both confusing and could be considered "wrong".

The patch looks like a reasonable approach.

Revision history for this message
Paride Legovini (paride) wrote :

Hello and thanks for this bug report and for the reproducer. According to the upstream release notes [1] this bug is fixed in apache2 2.4.55, which is currently in Debian unstable but not in Lunar, so step 1 for fixing this is merging apache2 from Debian to Ubuntu. From what I can tell from the Debian d/changelog this shouldn't be a problematic merge.

Then we'll have to cherry-pick the patch you identified and apply it to the apache2 package in Jammy and Kinetic, following the SRU process [2].

---

From your comment in the linked upstream bug looks like you already rebuild the package with the fix for your local environment. Moreover I see that you already formatted the bug report in a way that is similar to the SRU bug template (again [2]).

If you're interested in getting involved in Ubuntu development I think you are in a good position here. The process the Server Team follows to do SRUs is documented in [3], and I volunteer to offer some guidance where needed, just let us know. (Should you want to get involved I suggest turning your account into a "personal" one, as in my understanding it is now a team account.)

Paride

[1] https://downloads.apache.org/httpd/CHANGES_2.4.55
[2] https://wiki.ubuntu.com/StableReleaseUpdates
[3] https://github.com/canonical/ubuntu-maintainers-handbook/blob/main/PackageFixing.md

tags: added: server-todo
Changed in apache2 (Ubuntu):
status: New → Triaged
Changed in apache2 (Ubuntu Jammy):
status: New → Triaged
Changed in apache2 (Ubuntu Kinetic):
status: New → Triaged
Paride Legovini (paride)
tags: added: needs-merge
Changed in apache2:
status: Unknown → Confirmed
Revision history for this message
In , tititou (christophe-jaillet) wrote :

Fix in trunk in r1904518
Backported in 2.4.x in r1906496

This is part of 2.4.55

Changed in apache2 (Ubuntu Jammy):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
Changed in apache2 (Ubuntu Kinetic):
assignee: nobody → Michał Małoszewski (michal-maloszewski99)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

It is in Lunar:
 apache2 | 2.4.55-1ubuntu1 | lunar | source, amd64, arm64, armhf, i386, ppc64el, riscv64, s390x
Set fixed in Lunar.

Michal will have a look, verifying the testcase, apply the patch for Jammy and Kinetic, build a PPA and then get this into SRU processing.

Changed in apache2 (Ubuntu):
status: Triaged → Fix Released
Changed in apache2:
status: Confirmed → Fix Released
Bryce Harrington (bryce)
tags: removed: needs-merge
Changed in apache2 (Ubuntu Jammy):
status: Triaged → In Progress
Changed in apache2 (Ubuntu Kinetic):
status: Triaged → In Progress
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

you're missing the "Where problems could occur" part of the SRU header

Revision history for this message
Steve Langasek (vorlon) wrote : Please test proposed package

Hello Sistemi, or anyone else affected,

Accepted apache2 into kinetic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/apache2/2.4.54-2ubuntu1.4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-kinetic to verification-done-kinetic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-kinetic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in apache2 (Ubuntu Kinetic):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-kinetic
Changed in apache2 (Ubuntu Jammy):
status: In Progress → Fix Committed
tags: added: verification-needed-jammy
Revision history for this message
Steve Langasek (vorlon) wrote :

Hello Sistemi, or anyone else affected,

Accepted apache2 into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/apache2/2.4.52-1ubuntu4.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (apache2/2.4.54-2ubuntu1.4)

All autopkgtests for the newly accepted apache2 (2.4.54-2ubuntu1.4) for kinetic have finished running.
The following regressions have been reported in tests triggered by the package:

apache2/2.4.54-2ubuntu1.4 (amd64, arm64, armhf, ppc64el, s390x)
backuppc/4.4.0-6ubuntu1 (amd64)
cinder/2:21.1.0-0ubuntu2.1 (armhf)
gnocchi/4.4.2+git2022100517.b1742b0b-0ubuntu1 (armhf)
lemonldap-ng/2.0.14+ds-1 (arm64)
mediawiki/1:1.35.7-1 (armhf)
mediawiki-extension-codemirror/4.0.0~git20210902.a63f3c2-3 (armhf, s390x)
mediawiki-extension-youtube/1.9.3~git20200711.0f87a53-4 (armhf)
mediawiki-skin-greystuff/1.0.8~git20200711.479faf1-4 (armhf)
piuparts/1.1.5 (amd64, arm64, ppc64el, s390x)
redmine/5.0.2-2 (amd64, armhf)
resource-agents/1:4.11.0-1ubuntu1 (amd64, armhf)
sahara/2:17.0.0-0ubuntu1 (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/kinetic/update_excuses.html#apache2

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

description: updated
Revision history for this message
Ubuntu SRU Bot (ubuntu-sru-bot) wrote : Autopkgtest regression report (apache2/2.4.52-1ubuntu4.6)

All autopkgtests for the newly accepted apache2 (2.4.52-1ubuntu4.6) for jammy have finished running.
The following regressions have been reported in tests triggered by the package:

backuppc/4.4.0-5ubuntu2 (amd64)
cinder/2:20.1.0-0ubuntu2.1 (amd64, armhf)
gnocchi/4.4.1+git2022033113.2339b9e9-0ubuntu1 (armhf)
gvfs/1.48.2-0ubuntu1 (amd64)
libapache2-mod-perl2/2.0.12-1build1 (ppc64el, s390x)
mediawiki/1:1.35.6-1 (armhf)
mediawiki-skin-greystuff/1.0.8~git20200711.479faf1-4 (armhf)
nova/3:25.1.0-0ubuntu2.1 (armhf)
piuparts/1.1.5 (amd64, arm64, ppc64el, s390x)
resource-agents/1:4.7.0-1ubuntu7 (amd64)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/jammy/update_excuses.html#apache2

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Revision history for this message
Bryce Harrington (bryce) wrote (last edit ):

Those are mostly all flaky tests which is usual for apache2. (It would be helpful if autopkgtest would re-run such failures automatically since this is so common with its infrastructure). I've retriggered all these and for Jammy they all passed except for piuparts.

piuparts fails because of a known issue in debootstrap (LP: #2020530), since debootstrap needs an SRU update every cycle to add the new devel release codename (i.e. LP: #1995612 (lunar), LP: #1970454 (kinetic), LP: #1947362 (jammy), LP: #1925753 (impish)) and it has not yet been updated for mantic. That's "not a real regression"[1] for apache2 but I guess just a perennial problem that happens at the start of each new development cycle.

1: https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

On Kinetic, there still seems to be some trouble with backuppc on one arch (which I've retriggered once again in case its a persistent flaky issue). Also, all the apache2 tests fail here due to a 403 permissions error in one of t/apache/mmn.t's test cases. I'm unclear why that fails on kinetic but not jammy, but it's doing so for all architectures so seems more than just flakiness. It does not seem like this relates to the change for this bug, but I don't have evidence to definitively convince me of that.

So, on the jammy side I think the SRU can proceed to release. Kinetic is probably ok but needs more sleuthing.

Revision history for this message
Bryce Harrington (bryce) wrote :

I've now resolved backuppc.

---
As mentioned before, the piuparts failure is a known, unrelated issue; it shouldn't block acceptance of this SRU.

---
The kinetic apache2 failures are due to certificate issues, maybe TLSv1.3-related? The logged error says:

"<p>You don't have permission to access this resource.Reason: Cannot perform Post-Handshake Authentication.<br /></p>"

Given it occurs on kinetic but not jammy, could this be something particular about certificate configuration of kinetic autopkgtest hosts?

In any case, it does not seem to be a flaky test issue and will require further debugging to understand what is going on. So Michal, I'll bump the task back to you to sort this out.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.