named: TCP connections sometimes never close due to race in socket teardown

Bug #1909950 reported by Adam Heathcote
This bug affects 1 person

Affects                Status        Importance  Assigned to
bind9 (Ubuntu)         Fix Released  High        Unassigned
bind9 (Ubuntu Focal)   Fix Released  High        Matthew Ruffell

Bug Description

[Impact]

We are seeing busy Bind9 servers stop accepting TCP connections after a period of time. Looking at netstat, named is still listening on port 53 on all interfaces, but if you send a query with dig, the connection will just time out:

$ dig +tcp ubuntu.com @192.168.122.2
;; Connection to 192.168.122.2#53(192.168.122.2) for ubuntu.com failed: timed out.

The symptoms are that the number of tcp connections slowly increases, as does the tcp high water mark, as reported by the "rndc status" command. Eventually, the number of tcp connections reaches the tcp connection limit, and named "breaks" and no longer accepts any new tcp connections.

There will also be a number of connections in the conntrack table stuck in the ESTABLISHED state, even though they are idle and ready to close, and a number of connections in the SYN_SENT state, since new connections get stuck once the tcp connection limit has been reached.

This appears to be caused by a race between deactivating a netmgr handle and processing an asynchronous callback for the socket close code, which can be triggered when a client sends a broken packet to the server and then doesn't close the connection properly.

[Testcase]

You will need two VMs to reproduce this issue.

On the first, install bind9:

$ sudo apt install bind9

Set up a caching resolver by editing /etc/bind/named.conf.options and uncommenting the forwarding block, and adding a DNS provider:

forwarders {
        8.8.8.8;
};

If the DNS provider runs on dnsmasq/libvirt, also set:

dnssec-validation yes;
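
Putting both changes together, /etc/bind/named.conf.options should end up looking roughly like this (a sketch, assuming an otherwise stock Focal file; other default directives are kept as shipped):

options {
        directory "/var/cache/bind";

        forwarders {
                8.8.8.8;
        };

        dnssec-validation yes;

        listen-on-v6 { any; };
};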

Next, restart the named service:

$ sudo systemctl restart named.service

Edit /etc/resolv.conf and change the resolver to 127.0.0.1.
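
That is, /etc/resolv.conf should contain:

nameserver 127.0.0.1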

Disable the systemd-resolved service:

$ sudo systemctl stop systemd-resolved.service

Test to make sure resolving ubuntu.com works, using the IP of the NIC:

$ dig +tcp @192.168.122.21 ubuntu.com
https://paste.ubuntu.com/p/7NQJ6RRJHN/

Now, go to the second VM:

Test to make sure that you can dig the other VM with:

$ dig +tcp @192.168.122.21 ubuntu.com

After that, use tc to intentionally drop some packets, simulating bad clients that drop connections without closing them properly, to see if we can trigger the race.

My NIC is enp1s0, and 30% drop should do the trick.

$ sudo tc qdisc add dev enp1s0 root netem loss 30%
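
When you are finished testing, the rule can be removed with:

$ sudo tc qdisc del dev enp1s0 root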

Next, open gnome-terminal and paste and run the command below in 10-15 tabs; the more, the better:

$ for run in {1..10000}; do dig +tcp @192.168.122.21 ubuntu.com & done

This parallelizes the connections to the bind9 server, to try to get above the 150 connection limit.
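
If juggling tabs is a pain, a single one-liner can approximate the same thundering herd (an untested sketch; it stresses the client VM just as much as the multi-tab approach):

$ for tab in {1..15}; do (for run in {1..10000}; do dig +tcp @192.168.122.21 ubuntu.com & done) & done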

Back on the server, watch the tcp high water mark in:

$ sudo rndc status
..
tcp clients: 0/150
TCP high-water: 10
..

$ sudo rndc status
..
tcp clients: 31/150
TCP high-water: 58
..

$ sudo rndc status
..
tcp clients: 56/150
TCP high-water: 141
..

$ sudo rndc status
..
tcp clients: 142/150
TCP high-water: 150
..
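
To poll these counters continuously instead of re-running the command by hand, something like the following works (assuming watch is installed):

$ sudo watch -n1 'rndc status | grep -i tcp'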

If you can't hit the 150 mark on tcp high water, add more tabs to the other VM and keep hitting the DNS server. This will likely make the other VM unstable as well, FYI.

Eventually, you will hit the 150 mark. After hammering it a little longer, your bind9 server will be broken.

$ dig +tcp @192.168.122.21 ubuntu.com
;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.
;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.

; <<>> DiG 9.16.1-Ubuntu <<>> +tcp @192.168.122.21 ubuntu.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

;; Connection to 192.168.122.21#53(192.168.122.21) for ubuntu.com failed: timed out.

Run this dig from the bind9 server itself, so the results aren't skewed by the 30% packet drop on the other VM.

If you install the test package from the below ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/lp1909950-test

You can hit this bind9 server as hard as you like, but it will never break. If you stop the thundering herd at the 150 max connections, the server will correctly tear down its tcp connections, and you will be able to successfully query the DNS server.

[Where problems could occur]

This patch doesn't really introduce any new code; it rearranges the ordering of events in existing code.

Before, depending on when a thread was scheduled, we could either deactivate the netmgr handle before calling the asynchronous callback for the socket close code, or vice versa.

The patches change this to ensure that the netmgr handle is deactivated before the socket close callback is issued.
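
To illustrate the idea, here is a minimal self-contained C sketch of the ordering change. The names are invented for illustration; this is not the actual bind9 netmgr code:

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a netmgr handle; only the "active" flag matters here. */
typedef struct {
    atomic_bool active;
} nm_handle_t;

/* Stand-in for the asynchronous socket close callback. If it can observe
   a still-active handle, the teardown path bails out and the connection
   is never closed; that is the race. */
static void *close_cb(void *arg) {
    nm_handle_t *h = arg;
    if (atomic_load(&h->active))
        printf("close_cb: handle still active, socket left open (the bug)\n");
    else
        printf("close_cb: handle inactive, socket closed cleanly\n");
    return NULL;
}

int main(void) {
    nm_handle_t h = { .active = true };
    pthread_t t;

    /* The fix: deactivate the handle BEFORE scheduling the async close
       callback, so the callback can never observe an active handle.
       Before the fix, these two steps could effectively happen in
       either order, depending on thread scheduling. */
    atomic_store(&h.active, false);
    pthread_create(&t, NULL, close_cb, &h);
    pthread_join(&t, NULL);
    return 0;
}

Build with "cc -pthread sketch.c" and run; with the deactivation first, the callback always sees an inactive handle.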

If a regression were to occur, we would see similar symptoms to this bug, with sockets not closing properly and eventually exhausting the tcp connection limit, preventing new tcp connections from being accepted.

In this case, a workaround would be to restart the named service when the tcp high water mark is nearing the tcp connection limit, and wait for a fix to be developed.

Regardless, only TCP connections would be affected, and UDP will still function, meaning at worst, a partial outage would occur.
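
UDP service can be verified independently, since dig uses UDP by default:

$ dig @192.168.122.21 ubuntu.com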

[Other]

This was fixed in bind9 9.16.2 by the below commit:

commit 01c4c3301e55b7d6a935a95ac0829e37fb317a0e
Author: Witold Kręcicki <email address hidden>
Date: Thu Mar 26 14:25:06 2020 +0100
Subject: Deactivate the handle before sending the async close callback.
Link: https://gitlab.isc.org/isc-projects/bind9/-/commit/01c4c3301e55b7d6a935a95ac0829e37fb317a0e

Upstream bug: https://gitlab.isc.org/isc-projects/bind9/-/issues/1700

This commit is already present in Groovy and Hirsute. Only Focal needs this patch.

Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thanks for taking the time to file this bug and try to make Ubuntu better.

I suppose you are talking about the 'tcp-initial-timeout' config option, whose upstream default is 30 seconds (300 * 100 milliseconds). Am I right? Or are you referring to something else?

FWIW I got this info from upstream doc:

https://bind9.readthedocs.io/en/v9_16/reference.html

I checked whether the source package available in Ubuntu makes any change to this config option, and found nothing. So I am not sure this is a problem on the packaging side. Could you please provide your config files and describe how I can reproduce this? That would help determine whether this is a bug.

For now I am marking this bug as Incomplete, but when you provide more details, please set the Status back to New and we will revisit it.

Changed in bind9 (Ubuntu):
status: New → Incomplete
Revision history for this message
Adam Heathcote (agent-squirrel) wrote :

Thanks for the update.

Config files are attached. You will note that the only override we use for TCP clients is the number that are allowed to connect, as the default is very low for an ISP DNS server.

However, we are relying on the default tcp-timeout values. I will note that since we removed the Ubuntu package for Bind9 and compiled it from upstream instead, the software has been obeying the config file settings. When running the Ubuntu packaged version, the lines regarding tcp in the conf are completely ignored, or so it would seem.

Thanks

Changed in bind9 (Ubuntu):
status: Incomplete → New
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Interesting; as Lucas said, I don't see any config change or patch on our side that would intend to change any of that.

As for testing, could we tap into your experience please?
As I understand it, you limit the number of connections (and we could use an even lower number).
So, if we deploy a default bind9 with e.g. a limit of 100 clients and then run something like "dig +tcp +keepalive +keepopen example.com @IPOFSERVER" 100 times, would you expect the 101st invocation to fail/hang?

If not, could you outline how you used to test/trigger the issue, so that we can try to recreate it? That is step one of tracking this down.

Revision history for this message
Adam Heathcote (agent-squirrel) wrote :

Yes, exactly that. Essentially, we had the connection limit set to 1000, and one client opened a few connections that never closed; they then naturally opened more and more.

In the end, one client had over 900 connections. Once we hit 1000, we ran a quick dig against the server and it failed/timed out.

Although, now that I think about it: to get BIND to run after compilation, I temporarily disabled apparmor, because BIND wanted to write log files into /var/log, and that allowed the write.

I have not tested the packaged bind without apparmor. I can't imagine apparmor forcing bind to hold TCP connections open, though; it was, however, one difference between the two versions of bind being run.

Just for comparison, none of our Debian, CentOS, or Fedora boxes exhibit this issue.

Revision history for this message
Paride Legovini (paride) wrote :

Hi Adam,

Could you please let us know the upstream version you compiled and reported as working and the Debian version you found working? I had a look at the upstream commit and bug history but couldn't spot an obvious change regarding the handling of the tcp timeouts, at least past-9.16.1 (there is a relevant change [1] but that was released in 9.16.1, which is the version in Focal). Knowing the versions you tested will be very useful when setting up reproducers. Thanks!

[1] https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/3172/commits

Revision history for this message
Adam Heathcote (agent-squirrel) wrote :

Sure thing:

Upstream Build: BIND 9.16.10
Debian Distro Build: 9.11.5

I have since migrated all the affected machines to KNOT-Resolver but I have kept one off our network with Bind9 still running if you need any extra info.

Thanks

Robie Basak (racb)
Changed in bind9 (Ubuntu):
importance: Undecided → High
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hello,

Another user has reported a very similar issue. They operate bind9 resolvers in both recursive and authoritative configurations, and occasionally their servers will stop responding to TCP requests.

UDP still operates as normal, but TCP connections will time out.

named is still listening on port 53 for both TCP and UDP:

$ sudo ss -lnp | grep 192.168.122.2:53
udp UNCONN 3840 0 192.168.122.2:53 0.0.0.0:* users:(("named",pid=3054136,fd=14))
tcp LISTEN 11 10 192.168.122.2:53 0.0.0.0:* users:(("named",pid=3054136,fd=42),("named",pid=3054136,fd=41))

$ dig +tcp ubuntu.com @192.168.122.2
;; Connection to 192.168.122.2#53(192.168.122.2) for ubuntu.com failed: timed out.

I was reading through the logs; they use stunnel as a proxy, and stunnel logs:

Jan 04 22:32:02 <host> stunnel[3186033]: LOG3[1261474]: s_connect: s_poll_wait 127.0.0.1:53: TIMEOUTconnect exceeded
Jan 04 22:32:02 <host> stunnel[3186033]: LOG3[1261474]: No more addresses to connect
Jan 04 22:32:02 <host> stunnel[3186033]: LOG5[1261474]: Connection reset: 0 byte(s) sent to TLS, 0 byte(s) sent to socket

It almost appears that the TCP handshake does not complete, which gives evidence towards hitting some sort of file descriptor limit.

You have to restart bind9 to restore DNS over TCP.

I have been trying to reproduce this myself in Focal containers. I configured bind9 as a caching recursive resolver, set the connection limits to 1000, and ran:

$ for i in {1..1500} ; do dig +tcp +keepalive +keepopen @127.0.0.1 ubuntu.com ; done

I then checked /proc/<named pid>/fd, "ss -t", and "sudo ss -lnp | grep 127.0.0.1:53", but there is no evidence of sockets remaining open that would indicate a file descriptor leak. All 1500 queries succeeded and did not hang.

I then tried to resolve from a second focal container over the network, but it didn't make any difference.

I swapped in the config that Adam provided in comment #2, changed the IP addresses to those of my containers, and tried again, but I still can't reproduce it.

Adam, thanks for letting us know that it appears to be fixed in BIND 9.16.10, and I will start reading the commits between BIND 9.16.1 (which is in Focal), and BIND 9.16.10 for clues.

I will also keep trying to reproduce this.

Adam, if you have any spare time to tinker with some reproduction instructions or to try upstream builds between 9.16.1 and 9.16.10 to see where it is first fixed, it would be much appreciated.

Thanks,
Matthew

tags: added: sts
Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in bind9 (Ubuntu):
status: New → Confirmed
Richard Laager (rlaager)
Changed in bind9 (Ubuntu):
status: Confirmed → New
Revision history for this message
Adam Heathcote (agent-squirrel) wrote :

Yeah testing should be no problem. I will get a server ready.

It may take me some time to get this done however.

Changed in bind9 (Ubuntu Focal):
status: New → In Progress
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Adam,

I think I have found the issue that you are experiencing. I don't have a way to reproduce just yet, but I have built a test package for you to try.

I came across the following upstream bug:

https://gitlab.isc.org/isc-projects/bind9/-/issues/1700

The users mention that tcp connections slowly increase and never time out, and when they look in their conntrack tables, they see a lot of established connections with timeouts of many days, far longer than the default 300s Bind usually offers.

It seems there is a race between deactivating a netmgr handle and processing an asynchronous callback for the socket close code, which can be triggered when a client sends a broken packet to the server and then doesn't close the connection properly.

Relevant commit:

commit 01c4c3301e55b7d6a935a95ac0829e37fb317a0e
Author: Witold Kręcicki <email address hidden>
Date: Thu Mar 26 14:25:06 2020 +0100
Subject: Deactivate the handle before sending the async close callback.
Link: https://gitlab.isc.org/isc-projects/bind9/-/commit/01c4c3301e55b7d6a935a95ac0829e37fb317a0e

I have prepared a test package with the above commit in it. Would you be able to install it to your test server and see if it fixes your problem of connections never getting closed?

Please note that these packages are NOT SUPPORTED by Canonical, and are for TEST PURPOSES only. ONLY install in a dedicated test environment.

Instructions to install (on 20.04)
1) sudo add-apt-repository ppa:mruffell/lp1909950-test
2) sudo apt update
3) sudo apt install bind9
4) sudo apt-cache policy bind9 | grep Installed
Installed: 1:9.16.1-0ubuntu2.4+lp1909950v20210201b1
5) sudo systemctl restart named.service

You might want to install the currently released 9.16.1-0ubuntu2.4 version from -updates, make sure you can reproduce the problem, and then install the above test package to make sure the test package fixes things.

Let me know how the test package goes.

Thanks,
Matthew

Revision history for this message
Adam Heathcote (agent-squirrel) wrote :

Hi Matthew

I can confirm that this package did indeed fix the issue. The current packaged version still presented the problem, so it looks like your patch fixed it.

I would be interested to know why compiling from upstream doesn't present the issue however.

Thanks

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Hi Adam,

Thank you very much for testing. It is good to hear that the test package solves your issue.

The patch that was in the test package was in the upstream release 9.16.2, so when you built 9.16.10, the patch would have been present and hence compiling from upstream doesn't present the issue.

I will start getting this ready to SRU to Ubuntu 20.04.

Although, I still need to come up with a good reproducer. Maybe I will try sending dig requests in a loop from a different VM, and use tc to randomly drop a percentage of the traffic, to see if we can trigger the race that leaves the sockets stuck in ESTABLISHED and never torn down.

I will let you know once there is a new build in -proposed for verification.

Thanks,
Matthew

Changed in bind9 (Ubuntu Focal):
importance: Undecided → High
assignee: nobody → Matthew Ruffell (mruffell)
Changed in bind9 (Ubuntu):
status: New → Fix Released
summary: - TCP connections never close
+ named: TCP connections sometimes never close due to race in socket
+ teardown
description: updated
description: updated
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a debdiff for Bind9 on Focal to fix this bug.

Revision history for this message
Matthew Ruffell (mruffell) wrote :

Attached is a V2 of the patch that fixes this bug, for Bind9 on Focal.

tags: added: sts-sponsor
tags: added: sts-sponsor-mfo
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Matthew,

Thanks for the great work on this bug. Sponsored to focal.

I've reviewed the debdiff and made only two minor changes:
1) the Description: field, to conform with DEP3/deb822 [1,2]
rules for multiline fields (first line and paragraph separators),
2) trimmed 'and-' off the patch name to keep its line under
80 chars in the changelog (we could break the line, but it's awkward).

The package built fine on all archs w/ focal-proposed enabled.

The test-case consistently broke bind9 within ~30 seconds
with a powerful client VM (32 CPUs, 5 tmux tabs w/ the loop)
for the version in focal-updates.
The test package consistently survived the test-case.

cheers,
Mauricio

[1] https://dep-team.pages.debian.net/deps/dep3/
[2] https://manpages.debian.org/unstable/dpkg-dev/deb822.5.en.html

description: updated
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

For doc purposes: I had an interesting time debugging why the bind9 forwarding didn't work to a host running dnsmasq/libvirt (the DNS server).

After some tcpdump comparisons against a local dig client that worked fine, it turns out that dnssec-validation must be changed from 'auto' to 'yes', and then bind9 forwarding worked OK!

bind forwarder / default (see percent symbol): FAIL / NotImp
---

$ sudo tcpdump -i vnet9 'port 53'
...
22:59:07.461914 IP 192.168.122.11.48475 > rotom.domain: 36180+% [1au] A? ubuntu.com. (51)
22:59:07.462424 IP rotom.domain > 192.168.122.11.48475: 36180 NotImp 0/0/1 (62)
...

local client (no percent symbol): PASS
---

$ sudo tcpdump -i lo 'port 53'
...
22:58:24.444288 IP rotom.47673 > rotom.domain: 30984+ [1au] A? ubuntu.com. (51)
22:58:24.444915 IP rotom.domain > rotom.47673: 30984 4/0/1 A 91.189.88.181, A 91.189.91.44, A 91.189.91.45, A 91.189.88.180 (103)
...

bind forwarder / dnssec-validation yes (NO percent symbol): PASS
---

$ sudo tcpdump -i vnet9 'port 53'
...
23:04:28.551700 IP 192.168.122.11.47530 > rotom.domain: 36699+ [1au] A? ubuntu.com. (51)
23:04:28.648898 IP rotom.domain > 192.168.122.11.47530: 36699 4/0/1 A 91.189.91.45, A 91.189.88.181, A 91.189.88.180, A 91.189.91.44 (126)
...

Reference: https://serverfault.com/questions/399911/tcpdump-dns-output-codes#400044

Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Adam, or anyone else affected,

Accepted bind9 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in bind9 (Ubuntu Focal):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-focal
Revision history for this message
Marc Deslauriers (mdeslaur) wrote :

Unfortunately this SRU got superseded by a security update, and will have to be re-uploaded.

Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Ack; will do. Thanks for the notice!

Changed in bind9 (Ubuntu Focal):
status: Fix Committed → In Progress
Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Hello Adam, or anyone else affected,

Accepted bind9 into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/bind9/1:9.16.1-0ubuntu2.7 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in bind9 (Ubuntu Focal):
status: In Progress → Fix Committed
Revision history for this message
Matthew Ruffell (mruffell) wrote :

Performing verification for Bind9 on Focal.

I first installed 9.16.1-0ubuntu2.6 from -updates to ensure that the issue is still present.

I checked that I could look up ubuntu.com through the local caching resolver. From there I started a second VM, and checked that I could look up addresses through the first VM. I then added the 30% packet loss rule with tc. From there I opened up 11 tabs in gnome-terminal and hit the first VM with:

$ for run in {1..10000}; do dig +tcp @192.168.122.21 ubuntu.com & done

https://paste.ubuntu.com/p/sF9SXkWpZK/

We can see that the "TCP high-water" mark kept rising until it reached 150, at which point I killed the thundering herd from the second VM. I then did a DNS lookup, and found that named was no longer answering TCP, and the lookup timed out.

This confirms that 9.16.1-0ubuntu2.6 from -updates is affected.

I then enabled -proposed, and installed bind9 9.16.1-0ubuntu2.7 and rebooted.

From there, I checked that I could look up ubuntu.com through the local caching resolver, and again started the second VM. The second VM could also look up addresses through the first VM. I again added 30% packet loss with tc. I then opened up 11 tabs of gnome-terminal and hit the first VM with the dig for loop of doom.

Except this time, once I reached the TCP high-water mark and killed the herd on the second VM, the number of TCP connections fell back down to 1, and did not get stuck at a higher number. I did a TCP DNS lookup for ubuntu.com on the server, and the request was successful and did not time out. named is listening for TCP connections as it is supposed to.

https://paste.ubuntu.com/p/SzJMzz6xbh/

bind9 9.16.1-0ubuntu2.7 fixes the problem. Happy to mark as verified.

I will also write back in a few days time with feedback from a user, who is testing this fixed package in production.

tags: added: verification-done-focal
removed: verification-needed verification-needed-focal
Revision history for this message
Richard Laager (rlaager) wrote :

> I will also write back in a few days time with feedback from a user,
> who is testing this fixed package in production.

That user is me. I've been running 1:9.16.1-0ubuntu2.7 on an ISP production recursive server "since Fri 2021-02-19 17:44:17 CST; 5 days ago" (per systemd). The system remains stable. The TCP numbers in "rndc status" look good compared to when we were experiencing the problem of named hitting the TCP client limit due to sockets not closing.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package bind9 - 1:9.16.1-0ubuntu2.7

---------------
bind9 (1:9.16.1-0ubuntu2.7) focal; urgency=medium

  * Fix a race between deactivating socket handle and processing
    async callbacks, which can lead to sockets not being closed
    properly, exhausting TCP connection limits. (LP: #1909950)
    - d/p/lp-1909950-fix-race-between-deactivating-handle-async-callback.patch

 -- Matthew Ruffell <email address hidden> Thu, 18 Feb 2021 16:28:44 +1300

Changed in bind9 (Ubuntu Focal):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for bind9 has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

tags: removed: sts-sponsor-mfo