haproxy crashes on in __pool_get_first if unique-id-header is used

Bug #1884149 reported by Simon Fels on 2020-06-18
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
HAProxy
Fix Released
Unknown
haproxy (Debian)
Fix Released
Unknown
haproxy (Ubuntu)
High
Unassigned
Bionic
Undecided
Christian Ehrhardt 

Bug Description

[Impact]

 * The handling of locks in haproxy led to a state that between idle http
   connections one could have indicated a connection was destroyed. In that
   case the code went on and accessed a just freed resource. As upstream
   puts it "It can have random implications between requests as
     it may lead a wrong connection's polling to be re-enabled or disabled
     for example, especially with threads."

 * Backport the fix from upstreams 1.8 stable branch

[Test Case]

 * It is a race and might be hard to trigger.
   An haproxy config to be in front of three webservers can be seen below.
   Setting up three apaches locally didn't trigger the same bug, but we
   know it is timing sensitive.

 * Simon (anbox) has a setup which reliably triggers this and will run the
   tests there.

 * The bad case will trigger a crash as reported below.

[Regression Potential]

 * This change is in >=Disco and has no further bugs reported against it
   (no follow on change) which should make it rather safe. Also no other
   change to that file context in 1.8 stable since then.
   The change is on the locking of connections. So if we want to expect
   regressions, then they would be at the handling of concurrent
   connections.

[Other Info]

 * Strictly speaking it is a race, so triggering it depends on load and
   machine cpu/IO speed.

---

Version 1.8.8-1ubuntu0.10 of haproxy in Ubuntu 18.04 (bionic) crashes with

------------------------------------

Thread 2.1 "haproxy" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xfffff77b1010 (LWP 17174)]
__pool_get_first (pool=0xaaaaaac6ddd0, pool=0xaaaaaac6ddd0) at include/common/memory.h:124
124 include/common/memory.h: No such file or directory.
(gdb) bt
#0 __pool_get_first (pool=0xaaaaaac6ddd0, pool=0xaaaaaac6ddd0) at include/common/memory.h:124
#1 pool_alloc_dirty (pool=0xaaaaaac6ddd0) at include/common/memory.h:154
#2 pool_alloc (pool=0xaaaaaac6ddd0) at include/common/memory.h:229
#3 conn_new () at include/proto/connection.h:655
#4 cs_new (conn=0x0) at include/proto/connection.h:683
#5 connect_conn_chk (t=0xaaaaaacb8820) at src/checks.c:1553
#6 process_chk_conn (t=0xaaaaaacb8820) at src/checks.c:2135
#7 process_chk (t=0xaaaaaacb8820) at src/checks.c:2281
#8 0x0000aaaaaabca0b4 in process_runnable_tasks () at src/task.c:231
#9 0x0000aaaaaab76f44 in run_poll_loop () at src/haproxy.c:2399
#10 run_thread_poll_loop (data=<optimized out>) at src/haproxy.c:2461
#11 0x0000aaaaaaad79ec in main (argc=<optimized out>, argv=0xaaaaaac61b30) at src/haproxy.c:3050

------------------------------------

when running on an ARM64 system. The haproxy.cfg looks like this:

------------------------------------

global
    log /dev/log local0
    log /dev/log local1 notice
    maxconn 4096
    user haproxy
    group haproxy
    spread-checks 0
    tune.ssl.default-dh-param 1024
    ssl-default-bind-ciphers ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:!DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:!DHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA256:AES256-SHA256:AES128-SHA:AES256-SHA:AES:!CAMELLIA:DES-CBC3-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!aECDH:!EDH-DSS-DES-CBC3-SHA:!EDH-RSA-DES-CBC3-SHA:!KRB5-DES-CBC3-SHA

defaults
    log global
    mode tcp
    option httplog
    option dontlognull
    retries 3
    timeout queue 20000
    timeout client 50000
    timeout connect 5000
    timeout server 50000

frontend anbox-stream-gateway-lb-5-80
    bind 0.0.0.0:80
    default_backend api_http
    mode http
    http-request redirect scheme https

backend api_http
    mode http

frontend anbox-stream-gateway-lb-5-443
    bind 0.0.0.0:443 ssl crt /var/lib/haproxy/default.pem no-sslv3
    default_backend app-anbox-stream-gateway
    mode http

backend app-anbox-stream-gateway
    mode http
    balance leastconn
    server anbox-stream-gateway-0-4000 10.212.218.61:4000 check ssl verify none inter 2000 rise 2 fall 5 maxconn 4096
    server anbox-stream-gateway-1-4000 10.212.218.93:4000 check ssl verify none inter 2000 rise 2 fall 5 maxconn 4096
    server anbox-stream-gateway-2-4000 10.212.218.144:4000 check ssl verify none inter 2000 rise 2 fall 5 maxconn 4096

------------------------------------

The crash occurs after a first few HTTP requests going through and happens again when systemd restarts the service.

The bug is already reported in Debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=921981 and upstream at https://github.com/haproxy/haproxy/issues/40

Using the 1.8.19-1+deb10u2 package from Debian fixes the crash.

Related branches

Robie Basak (racb) wrote :

Based on the upstream report this doesn't appear to be arm64 specific.

tags: added: bitesize server-next
Changed in haproxy (Ubuntu):
status: New → Triaged
importance: Undecided → High
summary: - haproxy crashes on arm64 in __pool_get_first
+ haproxy crashes on in __pool_get_first if unique-id-header is used
Changed in haproxy:
status: Unknown → Fix Released

Fixed in 1.8.19-1 and later which means >=Disco

Changed in haproxy (Ubuntu Bionic):
status: New → Triaged
Changed in haproxy (Ubuntu):
status: Triaged → Fix Released
Download full text (3.5 KiB)

The proper fix on the 1.8 branch for the linked issue is [1]

While checking if that applies to the 1.8.8-1ubuntu0.10 in Bionic it turned out that we don't even have the code that is fixed. So I'm not entirely sure the identified Debian/Upstream bugs are really the "same thing".

The offending commit of that is [2] and only in 1.8.18.

Without [2] there'd be a memory leak which isn't good, but not the crash that you are seeing.

The list of interesting fixes isn't too long:
$ git log --oneline v1.8.8..v1.8.19 -- src/stream.c
109b76f51 BUG/MAJOR: stream: avoid double free on unique_id
56fd86588 BUG/MEDIUM: stream: Don't forget to free s->unique_id in stream_free().
ec70cf52e BUG/MINOR: stream: don't close the front connection when facing a backend error
4b57858a4 BUG/MEDIUM: cli: make "show sess" really thread-safe
784260e63 MINOR: stream/cli: report more info about the HTTP messages on "show sess all"
6d9b1b723 MINOR: stream/cli: fix the location of the waiting flag in "show sess all"
0539df4a0 BUILD: threads: fix minor build warnings when threads are disabled
4bf6d76a2 BUG/MEDIUM: stream: don't crash on out-of-memory
8342ef909 BUG/MEDIUM: session: fix reporting of handshake processing time in the logs
9e1754816 BUG/MINOR: stream: use atomic increments for the request counter

Of these the only "this could be it" seems "4bf6d76a2 BUG/MEDIUM: stream: don't crash on out-of-memory" but you are saying this "occurs after a first few HTTP requests going through" which doesn't sound like usual OOM conditions.

What is the indication that we look at src/stream.c? Is it just the expected fix that was linked - which I disagree? If so we need to look further.

Upstream usually classifies crashes as major, the full list would be:

109b76f51 BUG/MAJOR: stream: avoid double free on unique_id
7cd8fc9eb BUG/MAJOR: spoe: Don't try to get agent config during SPOP healthcheck
4f256797f BUG/MAJOR: spoe: verify that backends used by SPOE cover all their callers' processes
a7f9b5545 BUG/MAJOR: config: verify that targets of track-sc and stick rules are present
a64e5574e BUG/MAJOR: cache: fix confusion between zero and uninitialized cache key
ca3a8768d BUG/MAJOR: stream-int: Update the stream expiration date in stream_int_notify()
69d4ddf91 BUG/MAJOR: http: http_txn_get_path() may deference an inexisting buffer
8e5b0923a BUG/MAJOR: kqueue: Don't reset the changes number by accident.
5877e9b88 BUG/MAJOR: thread: lua: Wrong SSL context initialization.
c28c2bfba BUG/MAJOR: stick_table: Complete incomplete SEGV fix
de9d4c677 BUG/MAJOR: Stick-tables crash with segfault when the key is not in the stick-table
30b244818 BUG/MAJOR: ssl: OpenSSL context is stored in non-reserved memory slot
ade2721ed BUG/MAJOR: ssl: Random crash with cipherlist capture
2b5ef62fc BUG/MAJOR: map: fix a segfault when using http-request set-map
293225b75 MAJOR: spoe: upgrade the SPOP version to 2.0 and remove the support for 1.0
de3b6d5db BUG/MAJOR: lua: Dead lock with sockets
e0f6d4a4e BUG/MAJOR: channel: Fix crash when trying to read from a closed socket

If you look at those does any of them seem to better match your case?

@Simon, if it is so reproducible for you, do you think you'd have a ...

Read more...

Changed in haproxy (Ubuntu Bionic):
assignee: nobody → Christian Ehrhardt  (paelzer)
Simon Fels (morphis) wrote :

Thanks Christian! I was planing to bisect today as I still have the environment up and can easily reproduce.

Simon Fels (morphis) wrote :

Just ran through the bisect:

git bisect start '--term-new=fixed' '--term-old=unfixed'
# fixed: [ebf033b47d58aa04ae9913038c9369dab8740411] [RELEASE] Released version 1.8.19
git bisect fixed ebf033b47d58aa04ae9913038c9369dab8740411
# unfixed: [cd117685f0cff4f2f5577ef6a21eaae96ebd9f28] [RELEASE] Released version 1.8.8
git bisect unfixed cd117685f0cff4f2f5577ef6a21eaae96ebd9f28
# fixed: [88c166870a779985d50f6a2cf840c844bc9b64da] BUG/MINOR: tools: fix set_net_port() / set_host_port() on IPv4
git bisect fixed 88c166870a779985d50f6a2cf840c844bc9b64da
# unfixed: [fdc6c62dbebf4b646b4f80c383e3b00f34b0440f] MINOR: systemd: consider exit status 143 as successful
git bisect unfixed fdc6c62dbebf4b646b4f80c383e3b00f34b0440f
# unfixed: [ad84851746243d85f9be59703e9bee0f5c5f8eba] BUG/MEDIUM: threads: fix the double CAS implementation for ARMv7
git bisect unfixed ad84851746243d85f9be59703e9bee0f5c5f8eba
# fixed: [d9a130e1962c2a5352f33088c563f4248a102c48] BUG/MEDIUM: mux_pt: dereference the connection with care in mux_pt_wake()
git bisect fixed d9a130e1962c2a5352f33088c563f4248a102c48
# unfixed: [a1110e24e5be53ba5fe9ab82372c02a60da06cf9] BUG/MINOR: map: fix map_regm with backref
git bisect unfixed a1110e24e5be53ba5fe9ab82372c02a60da06cf9
# unfixed: [3c42f13badd149c9c3152d7b2e653bde5da7c17a] BUG/MEDIUM: cli/threads: protect all "proxy" commands against concurrent updates
git bisect unfixed 3c42f13badd149c9c3152d7b2e653bde5da7c17a
# unfixed: [d13cb1516cb5ae4cb8322ed630e1d4e1f584fd77] DOC: Fix spelling error in configuration doc
git bisect unfixed d13cb1516cb5ae4cb8322ed630e1d4e1f584fd77
# unfixed: [5b58c92dc9357a87aa3fe94c8121f683feb9c80e] BUG/MINOR: lua: Bad HTTP client request duration.
git bisect unfixed 5b58c92dc9357a87aa3fe94c8121f683feb9c80e
# first fixed commit: [d9a130e1962c2a5352f33088c563f4248a102c48] BUG/MEDIUM: mux_pt: dereference the connection with care in mux_pt_wake()

Cherry picking the identified commit (d9a130e1962c2a5352f33088c563f4248a102c48) on top of 1.8.8 makes things work. I will keep it running for a bit longer and see if it stays stable.

Changed in haproxy (Debian):
status: Unknown → Fix Released

Thanks Simon,
taking a look at that change then ...

Yeah, that one looks better.
The fix is in 1.8.14 and applies as-is to our code in Bionic.

@Simon - would you mind keeping your test system around to test a PPA now and the SRU later on?

@Simon - I have a PPA for a preliminary test at
https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/4108/+packages

If you can give it a try let me know if it works as expected.

description: updated

Everything complete uploaded to Bionic-unapproved for the SRU Team to take a look at it.

Simon Fels (morphis) wrote :

Thanks Christian! Will give the PPA a go tomorrow.

Thanks Simon, and keep the env around - as soon as the SRU Team accepts this into -proposed we will need the same test again.

Robie Basak (racb) wrote :

This looks good, thanks!

One minor note. The dep3 header says: Origin: upstream, https://github.com/haproxy/haproxy/commit/d9a130e1962c2a5352f33088c563f4248a102c48; however that link says "This commit does not belong to any branch on this repository." so it isn't technically "upstream". However, it also says "cherry picked from commit ad7f0ad" and https://github.com/haproxy/haproxy/commit/ad7f0ad1c3c9c541a4c315b24d4500405d1383ee *is* upstream and clearly is the same change, reported as present in the upstream master branch since what looks like their v1.9-dev2 tag. So everything is fine after all. Recording this here in case the d9a130 commit disappears and someone wants to track it down.

Changed in haproxy (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic

Hello Simon, or anyone else affected,

Accepted haproxy into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/haproxy/1.8.8-1ubuntu0.11 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Simon, this is now your time to test once the new version is built in -proposed.
Once you've done that please update the tags here.
Thanks in advance

Simon Fels (morphis) wrote :

I've just applied the package from https://launchpad.net/ubuntu/+source/haproxy/1.8.8-1ubuntu0.11 to our deployment and the crash is fixed. I will keep it running for a while to see if it remains that way.

Thanks for all the work!

Thaks for the test Simon.
I have marked things verified, if in the 7 day aging period you find any issues speak up here.
Otherwise this can hopefully be released in a while and help everone.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package haproxy - 1.8.8-1ubuntu0.11

---------------
haproxy (1.8.8-1ubuntu0.11) bionic; urgency=medium

  * Avoid crashes on idle connections between http requests (LP: #1884149)

 -- Christian Ehrhardt <email address hidden> Mon, 22 Jun 2020 10:41:43 +0200

Changed in haproxy (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for haproxy has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.