Activity log for bug #1894879

Date Who What changed Old value New value Message
2020-09-08 17:37:06 Simon Déziel bug added bug
2020-09-10 18:02:15 Simon Déziel description (old value: the original report, reproduced verbatim under [Original description] below; new value follows)

[Impact]

 * Using the do-resolve action causes frequent crashes (>15 times/day on a given machine) and sometimes puts haproxy into a bogus state where DNS resolution stops working forever without crashing (DoS).

 * To address the problem, 3 upstream fixes that were released in a later micro release are backported. Those fixes address problems affecting the do-resolve action, which was introduced in the 2.0 branch (which is why only Focal is affected).

[Test case]

Those crashes are due to the do-resolve action not being thread-safe. As such, they are hard to trigger on demand, except with production load. The proposed fix should be tested for any regression, ideally on an SMP-enabled (multiple CPUs/cores) machine, as the crashes happen more easily on those.

haproxy should be configured as in your production environment and put under test to see whether it still behaves as it should. The more concurrent traffic the better, especially if you make use of the do-resolve action (a load-generation sketch follows this log entry). Any crash should be visible in the journal, so keep an eye on "journalctl -fu haproxy.service" during the tests. Abnormal exits should be visible with:

  journalctl -u haproxy --grep ALERT | grep -F 'exited with code '

With the proposed fix, there should be none, even under heavy concurrent traffic.

[Regression Potential]

The backported fixes address locking and memory management issues. It's possible they could introduce more locking problems or memory leaks. That said, they only change code related to the DNS/do-resolve action, which is already known to trigger frequent crashes.

The backported fixes come from a later micro version (2.0.16, released on 2020-07-17), so they have already seen real-world production usage. Upstream has also released another micro version since then and did not revert or rework the DNS/do-resolve action.

Furthermore, the patch was tested on a production machine that used to crash >15 times per day. 48 hours after deploying the patch, we have seen no crash and no sign of regression either.

[Original description]

haproxy 2.0.13-2 keeps crashing for us:

  # journalctl --since yesterday -u haproxy --grep ALERT | grep -F 'exited with code '
  Sep 07 18:14:57 foo haproxy[16831]: [ALERT] 250/181457 (16831) : Current worker #1 (16839) exited with code 139 (Segmentation fault)
  Sep 07 19:45:23 foo haproxy[16876]: [ALERT] 250/194523 (16876) : Current worker #1 (16877) exited with code 139 (Segmentation fault)
  Sep 07 19:49:01 foo haproxy[16916]: [ALERT] 250/194901 (16916) : Current worker #1 (16919) exited with code 139 (Segmentation fault)
  Sep 07 19:49:02 foo haproxy[16939]: [ALERT] 250/194902 (16939) : Current worker #1 (16942) exited with code 139 (Segmentation fault)
  Sep 07 19:49:03 foo haproxy[16953]: [ALERT] 250/194903 (16953) : Current worker #1 (16955) exited with code 139 (Segmentation fault)
  Sep 07 19:49:37 foo haproxy[16964]: [ALERT] 250/194937 (16964) : Current worker #1 (16965) exited with code 139 (Segmentation fault)
  Sep 07 23:41:13 foo haproxy[16982]: [ALERT] 250/234113 (16982) : Current worker #1 (16984) exited with code 139 (Segmentation fault)
  Sep 07 23:41:14 foo haproxy[17076]: [ALERT] 250/234114 (17076) : Current worker #1 (17077) exited with code 139 (Segmentation fault)
  Sep 07 23:43:20 foo haproxy[17090]: [ALERT] 250/234320 (17090) : Current worker #1 (17093) exited with code 139 (Segmentation fault)
  Sep 07 23:43:50 foo haproxy[17113]: [ALERT] 250/234350 (17113) : Current worker #1 (17116) exited with code 139 (Segmentation fault)
  Sep 07 23:43:51 foo haproxy[17134]: [ALERT] 250/234351 (17134) : Current worker #1 (17135) exited with code 139 (Segmentation fault)
  Sep 07 23:44:44 foo haproxy[17146]: [ALERT] 250/234444 (17146) : Current worker #1 (17147) exited with code 139 (Segmentation fault)
  Sep 08 00:18:54 foo haproxy[17164]: [ALERT] 251/001854 (17164) : Current worker #1 (17166) exited with code 134 (Aborted)
  Sep 08 00:27:51 foo haproxy[17263]: [ALERT] 251/002751 (17263) : Current worker #1 (17266) exited with code 139 (Segmentation fault)
  Sep 08 00:30:36 foo haproxy[17286]: [ALERT] 251/003036 (17286) : Current worker #1 (17289) exited with code 134 (Aborted)
  Sep 08 00:37:01 foo haproxy[17307]: [ALERT] 251/003701 (17307) : Current worker #1 (17310) exited with code 139 (Segmentation fault)
  Sep 08 00:40:31 foo haproxy[17331]: [ALERT] 251/004031 (17331) : Current worker #1 (17334) exited with code 139 (Segmentation fault)
  Sep 08 00:41:14 foo haproxy[17650]: [ALERT] 251/004114 (17650) : Current worker #1 (17651) exited with code 139 (Segmentation fault)
  Sep 08 00:41:59 foo haproxy[17669]: [ALERT] 251/004159 (17669) : Current worker #1 (17672) exited with code 139 (Segmentation fault)

The server in question uses a config that looks like this:

  global
    maxconn 50000
    log /dev/log local0
    log /dev/log local1 notice
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
    stats timeout 30s
    user haproxy
    group haproxy
    daemon

  defaults
    maxconn 50000
    log global
    mode tcp
    option tcplog
    # dontlognull won't log sessions where the DNS resolution failed
    #option dontlognull
    timeout connect 5s
    timeout client 15s
    timeout server 15s

  resolvers mydns
    nameserver local 127.0.0.1:53
    accepted_payload_size 1400
    timeout resolve 1s
    timeout retry 1s
    hold other 30s
    hold refused 30s
    hold nx 30s
    hold timeout 30s
    hold valid 10s
    hold obsolete 30s

  frontend foo
    ...
    # dns lookup
    tcp-request content do-resolve(txn.remote_ip,mydns,ipv4) var(txn.remote_host)
    tcp-request content capture var(txn.remote_ip) len 40
    # XXX: the remaining rejections happen in the backend to provide better logging
    # reject connection on DNS resolution error
    use_backend be_dns_failed unless { var(txn.remote_ip) -m found }
    ...
    # at this point, we should let the connection through
    default_backend be_allowed

When reaching out to haproxy folks in #haproxy, https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae was mentioned as a potential fix.

https://www.haproxy.org/bugs/bugs-2.0.13.html shows 3 commits with "dns":

 * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=ef131ae
 * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=39eb766
 * https://git.haproxy.org/?p=haproxy-2.0.git;a=commitdiff;h=74d704f

It would be great to have those (at least ef131ae) SRU'ed to Focal (Bionic isn't affected as it runs 1.8).

Additional information:

  # lsb_release -rd
  Description:  Ubuntu 20.04.1 LTS
  Release:      20.04

  # apt-cache policy haproxy
  haproxy:
    Installed: 2.0.13-2
    Candidate: 2.0.13-2
    Version table:
   *** 2.0.13-2 500
          500 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages
          100 /var/lib/dpkg/status
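The [Test case] above asks for heavy concurrent traffic through a do-resolve frontend. A minimal load-generation sketch along those lines; the target host, port, connection count, and the use of nc are assumptions for illustration, not part of the report:

  #!/bin/sh
  # Sketch only: open many concurrent TCP connections through the proxy
  # (each one exercises the frontend's do-resolve action), then check the
  # journal for abnormal worker exits.
  # proxy.example.com, port 443 and the count of 500 are placeholders.

  for i in $(seq 1 500); do
      nc -z -w 2 proxy.example.com 443 &
  done
  wait    # let every probe finish

  # With the backported fixes applied, this should print "OK".
  if journalctl -u haproxy --grep ALERT | grep -qF 'exited with code '; then
      echo 'FAIL: a worker exited abnormally'
  else
      echo 'OK: no abnormal exits'
  fi

Running it repeatedly, ideally alongside real production traffic on an SMP machine, approximates the concurrency under which the crashes were observed.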
2020-09-10 18:02:34 Simon Déziel bug added subscriber Ubuntu Sponsors Team
2020-09-10 18:03:06 Simon Déziel attachment added haproxy-lp1894879.debdiff https://bugs.launchpad.net/ubuntu/+source/haproxy/+bug/1894879/+attachment/5409451/+files/haproxy-lp1894879.debdiff
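For anyone retracing the review of that attachment, a hypothetical way to build a test package from the debdiff (pull-lp-source, patch, and dpkg-buildpackage are standard Ubuntu development tools; the unpacked directory name is an assumption):

  # fetch the focal source, apply the debdiff, rebuild unsigned
  pull-lp-source haproxy focal          # from the ubuntu-dev-tools package
  cd haproxy-2.0.13                     # directory name assumed from the version
  patch -p1 < ../haproxy-lp1894879.debdiff
  dpkg-buildpackage -us -uc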
2020-09-10 18:29:04 Rafael David Tinoco bug added subscriber Ubuntu Server
2020-09-10 18:36:38 Rafael David Tinoco nominated for series Ubuntu Focal
2020-09-10 18:36:38 Rafael David Tinoco bug task added haproxy (Ubuntu Focal)
2020-09-10 18:40:23 Rafael David Tinoco haproxy (Ubuntu): status New → Fix Released
2020-09-10 18:40:27 Rafael David Tinoco haproxy (Ubuntu Focal): status New → Triaged
2020-09-11 16:04:06 Launchpad Janitor merge proposal linked https://code.launchpad.net/~rafaeldtinoco/ubuntu/+source/haproxy/+git/haproxy/+merge/390625
2020-09-16 17:00:41 Robie Basak haproxy (Ubuntu Focal): status Triaged → Fix Committed
2020-09-16 17:00:42 Robie Basak bug added subscriber Ubuntu Stable Release Updates Team
2020-09-16 17:00:44 Robie Basak bug added subscriber SRU Verification
2020-09-16 17:00:46 Robie Basak tags (none) → verification-needed verification-needed-focal
2020-09-16 17:01:57 Robie Basak removed subscriber Ubuntu Sponsors Team
2020-09-17 22:05:22 Simon Déziel tags verification-needed verification-needed-focal → verification-done verification-done-focal
2020-09-24 08:21:45 Launchpad Janitor haproxy (Ubuntu Focal): status Fix Committed → Fix Released
2020-09-24 08:21:49 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team