haproxy in bionic can get stuck

Bug #1848902 reported by James Troup on 2019-10-20
This bug affects 4 people
Affects            Status         Importance   Assigned to   Milestone
haproxy (Ubuntu)   Fix Released   High         Unassigned
Bionic             Fix Released   High         Unassigned

Bug Description

[Impact]

 * The master process will exit with the status of the last worker.
   When the worker is killed with SIGTERM, it is expected to get 143 as an
   exit status. Therefore, we consider this exit status as normal from a
   systemd point of view. If it happens when not stopping, the systemd
   unit is configured to always restart, so it has no adverse effect.

 * Backport upstream fix - adding another accepted RC to the systemd
   service
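
   As a rough illustration of where 143 comes from: a shell reports a
   child killed by signal N as exit status 128 + N, and SIGTERM is
   signal 15. The sketch below uses a throwaway sleep process as a
   stand-in, so it only demonstrates the arithmetic, not haproxy's
   master/worker handling itself.

```shell
#!/bin/bash
# Start a throwaway background process (a stand-in for the haproxy worker).
sleep 30 &
pid=$!

# Terminate it with SIGTERM, the signal systemd sends by default on stop.
kill -TERM "$pid"

# For a child killed by a signal, 'wait' returns 128 + the signal number.
wait "$pid"
status=$?

echo "exit status: $status (128 + 15 = $((128 + 15)))"
```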

[Test Case]

 * Install haproxy and have it running, then send it SIGTERM repeatedly.
   With the fix, systemd restarts the service every time (until the
   restart limit is hit). Without the fix, the service stays down and
   systemd does not even try to restart it.

   $ apt install haproxy
   $ for x in {1..100}; do pkill -TERM -x haproxy ; sleep 0.1 ; done
   $ systemctl status haproxy

   The above is a hacky way to trigger A/B behavior for the fix.
   It isn't perfect, as systemd restart counters will kick in and you
   essentially check a secondary symptom.
   I'd recommend additionally running the following:

   $ apt install haproxy
   $ for x in {1..1000}; do pkill -TERM -x haproxy ; sleep 0.001 ; systemctl reset-failed haproxy.service ; done
   $ systemctl status haproxy

   You can do so with even smaller sleeps; that should keep the service up
   and running (this doesn't change with the fix, but should still work
   with the new code).

[Regression Potential]

 * This is ultimately a conffile modification, so users who have made
   other modifications will get a prompt, but that isn't a regression.
   I checked the code and can't think of another RC=143 case that,
   because of this change, would no longer be detected as an error.
   Other than the update itself triggering a restart (as usual for
   services), I really think there is no further regression potential.

[Other Info]

 * Fix has been active in the IS-hosted cloud for a while without issues
 * Also reports (comment #5) show that others use this in production as
   well

---

On a Bionic/Stein cloud, after a network partition, we saw several units (glance, swift-proxy and cinder) fail to start haproxy, like so:

root@juju-df624b-6-lxd-4:~# systemctl status haproxy.service
● haproxy.service - HAProxy Load Balancer
   Loaded: loaded (/lib/systemd/system/haproxy.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2019-10-20 00:23:18 UTC; 1h 35min ago
     Docs: man:haproxy(1)
           file:/usr/share/doc/haproxy/configuration.txt.gz
  Process: 2002655 ExecStart=/usr/sbin/haproxy -Ws -f $CONFIG -p $PIDFILE $EXTRAOPTS (code=exited, status=143)
  Process: 2002649 ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q $EXTRAOPTS (code=exited, status=0/SUCCESS)
 Main PID: 2002655 (code=exited, status=143)

Oct 20 00:16:52 juju-df624b-6-lxd-4 systemd[1]: Starting HAProxy Load Balancer...
Oct 20 00:16:52 juju-df624b-6-lxd-4 systemd[1]: Started HAProxy Load Balancer.
Oct 20 00:23:18 juju-df624b-6-lxd-4 systemd[1]: Stopping HAProxy Load Balancer...
Oct 20 00:23:18 juju-df624b-6-lxd-4 haproxy[2002655]: [WARNING] 292/001652 (2002655) : Exiting Master process...
Oct 20 00:23:18 juju-df624b-6-lxd-4 haproxy[2002655]: [ALERT] 292/001652 (2002655) : Current worker 2002661 exited with code 143
Oct 20 00:23:18 juju-df624b-6-lxd-4 haproxy[2002655]: [WARNING] 292/001652 (2002655) : All workers exited. Exiting... (143)
Oct 20 00:23:18 juju-df624b-6-lxd-4 systemd[1]: haproxy.service: Main process exited, code=exited, status=143/n/a
Oct 20 00:23:18 juju-df624b-6-lxd-4 systemd[1]: haproxy.service: Failed with result 'exit-code'.
Oct 20 00:23:18 juju-df624b-6-lxd-4 systemd[1]: Stopped HAProxy Load Balancer.
root@juju-df624b-6-lxd-4:~#

The Debian maintainer came up with the following patch for this:

  https://<email address hidden>/msg30477.html

Which was added to the 1.8.10-1 Debian upload and merged into upstream 1.8.13.
Unfortunately Bionic is on 1.8.8-1ubuntu0.4 and doesn't have this patch.

Please consider pulling this patch into an SRU for Bionic.


Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in haproxy (Ubuntu):
status: New → Confirmed
Paul Collins (pjdc) wrote :

It looks like this is the patch that was merged: https://<email address hidden>/msg30473.html

2018/07/30 : 1.8.13
    - MINOR: systemd: consider exit status 143 as successful

I've built a package with this patch and deployed it to the problematic cloud.

Here's the PPA: https://launchpad.net/~canonical-is-sre/+archive/ubuntu/haproxy-lp1848902/+packages

Paul Collins (pjdc) wrote :

This seems to fix our haproxy problem, although the HA stack still needs help to recover from the situation. But at least haproxy isn't getting in the way anymore.

Paul Collins (pjdc) wrote :

To reproduce:

   apt install haproxy
   for x in {1..100}; do pkill -TERM -x haproxy ; sleep 0.1 ; done
   systemctl status haproxy

Observe that the unit has failed due to exit-code: "Active: failed (Result: exit-code)"

To test the fix:

Repeat the steps above with the fixed package installed.

Observe that although the unit may still have failed, it is not due to exit-code, e.g.: "Active: failed (Result: start-limit-hit)".

If you perform upgrade testing, note that simply installing the fixed package seems to revive haproxy without needing systemctl reset-failed haproxy. Confirm this by checking systemctl status haproxy for "Active: active (running)" after the upgrade.
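
The distinction above (failed due to exit-code versus start-limit-hit) could be checked in a script roughly as follows. This is only a sketch: the function name classify_result is made up for illustration, and it assumes the unit's Result property can be read with "systemctl show -p Result --value", which should be available in the systemd version shipped in Bionic (237).

```shell
#!/bin/bash
# classify_result is a hypothetical helper: it maps a systemd "Result"
# value to a short verdict for this bug's test case.
classify_result() {
    case "$1" in
        exit-code)
            echo "bad: unit failed because of its exit status" ;;
        start-limit-hit)
            echo "ok: unit only hit the restart limit" ;;
        success)
            echo "ok: unit has not failed" ;;
        *)
            echo "unknown result: $1" ;;
    esac
}

# On a real host the value would come from systemd, e.g.:
#   classify_result "$(systemctl show -p Result --value haproxy.service)"
# Here two fixed sample values are used so the sketch runs anywhere.
classify_result "exit-code"
classify_result "start-limit-hit"
```

A start-limit-hit result can be cleared with systemctl reset-failed haproxy before retrying.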

tags: added: server-next
Simon Déziel (sdeziel) wrote :

In our HAProxy deployments, we always add the following drop-in on Bionic:

    [Service]
    # XXX: returns 143 when SIGTERM'ed by systemd
    SuccessExitStatus=143

It would be great to have the default unit fixed, so thanks!
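
For reference, a drop-in like the one above could be installed with something along these lines. This is a sketch under assumptions: the helper name write_dropin and the file name exit-status.conf are arbitrary choices, and the target directory in the usage comment follows the usual systemd drop-in layout for haproxy.service.

```shell
#!/bin/bash
# write_dropin is a hypothetical helper; it writes the override file into
# the directory given as its first argument, creating it if needed.
write_dropin() {
    dir=$1
    mkdir -p "$dir"
    cat > "$dir/exit-status.conf" <<'EOF'
[Service]
# haproxy returns 143 when SIGTERM'ed by systemd
SuccessExitStatus=143
EOF
}

# On a real host (as root), followed by a reload of systemd's configuration:
#   write_dropin /etc/systemd/system/haproxy.service.d
#   systemctl daemon-reload
```

Taking the directory as a parameter makes it possible to try the helper against a scratch directory first and inspect the result before touching /etc.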

Changed in haproxy (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged

This was also backported to the 2.x and 1.8 series.
According to git this is included in:

1.8.13
fdc6c62dbebf4b646b4f80c383e3b00f34b0440f

Anything >=1.9
3b479bd5f5f50ce91cabed32bb26556313552d23

Therefore all releases newer than Bionic are already fixed.
Consider pulling the fix into the Bionic version.

Changed in haproxy (Ubuntu Bionic):
status: New → Triaged
importance: Undecided → High
Changed in haproxy (Ubuntu):
status: Triaged → Fix Released

Note it was not ported to the 1.6.x series and I'd also want to keep Xenial out of this in general.

Confirmed the test case, waiting for MP review now

Uploaded to Bionic-unapproved

description: updated

Hello James, or anyone else affected,

Accepted haproxy into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/haproxy/1.8.8-1ubuntu0.8 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in haproxy (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic

All autopkgtests for the newly accepted haproxy (1.8.8-1ubuntu0.8) for bionic have finished running.
The following regressions have been reported in tests triggered by the package:

haproxy/1.8.8-1ubuntu0.8 (armhf)

Please visit the excuses page listed below and investigate the failures, proceeding afterwards as per the StableReleaseUpdates policy regarding autopkgtest regressions [1].

https://people.canonical.com/~ubuntu-archive/proposed-migration/bionic/update_excuses.html#haproxy

[1] https://wiki.ubuntu.com/StableReleaseUpdates#Autopkgtest_Regressions

Thank you!

Simon Déziel (sdeziel) wrote :

Verified to be working on Bionic using the provided test case and another simpler one (simply stopping haproxy resulted in the error/143 status).

Preparing to unpack .../haproxy_1.8.8-1ubuntu0.8_amd64.deb ...
Unpacking haproxy (1.8.8-1ubuntu0.8) over (1.8.8-1ubuntu0.7) ...
Setting up haproxy (1.8.8-1ubuntu0.8) ...
Processing triggers for systemd (237-3ubuntu10.33) ...
Processing triggers for rsyslog (8.32.0-1ubuntu4) ...

root@hap1:~# for x in {1..1000}; do pkill -TERM -x haproxy ; sleep 0.001; systemctl reset-failed haproxy.service; done
root@hap1:~# systemctl status haproxy
● haproxy.service - HAProxy Load Balancer
   Loaded: loaded (/lib/systemd/system/haproxy.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-11-28 15:20:01 UTC; 1s ago
     Docs: man:haproxy(1)
           file:/usr/share/doc/haproxy/configuration.txt.gz
  Process: 8076 ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q $EXTRAOPTS (code=exited, status=0/SUCCESS)
 Main PID: 8077 (haproxy)
    Tasks: 2 (limit: 4915)
   CGroup: /system.slice/haproxy.service
           ├─8077 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
           └─8078 /usr/sbin/haproxy -Ws -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid

Nov 28 15:20:01 hap1 systemd[1]: Starting HAProxy Load Balancer...
Nov 28 15:20:01 hap1 systemd[1]: Started HAProxy Load Balancer.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic

The verification of the Stable Release Update for haproxy has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package haproxy - 1.8.8-1ubuntu0.8

---------------
haproxy (1.8.8-1ubuntu0.8) bionic; urgency=medium

  * d/p/lp-1848902-MINOR-systemd-consider-exit-status-143-as-successful.patch:
    fix potential hang in haproxy (LP: #1848902)

 -- Christian Ehrhardt <email address hidden> Tue, 12 Nov 2019 13:16:22 +0100

Changed in haproxy (Ubuntu Bionic):
status: Fix Committed → Fix Released