named: TCP connections sometimes never close due to race in socket teardown
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
bind9 (Ubuntu) | Fix Released | High | Unassigned |
Focal | Fix Released | High | Matthew Ruffell |
Bug Description
[Impact]
We are seeing busy Bind9 servers stop accepting TCP connections after a period of time. Looking at netstat, named is still listening on port 53 on all interfaces, but if you send a query with dig, the connection will just time out:
$ dig +tcp ubuntu.com @192.168.122.2
;; Connection to 192.168.
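While named is in this state, something like the following ss one-liner (assuming iproute2's ss is available; this command is an illustration, not part of the original report) can confirm that named is in fact still listening on TCP port 53:
$ sudo ss -ltnp 'sport = :53'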
The symptoms are that the number of TCP connections slowly increases, as does the TCP high-water mark, which you can see by running the "rndc status" command. Eventually, the number of TCP connections reaches the TCP connection limit, and named "breaks" and no longer accepts any new TCP connections.
There will also be a number of connections in the conntrack table stuck in the ESTABLISHED state, even though they are idle and ready to close, and a number of connections stuck in the SYN_SENT state, since the TCP connection limit has been reached.
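To see those stuck entries, something like this (assuming the conntrack tool from the conntrack package is installed; again an illustration, not from the original report) will summarise connection states for port 53:
$ sudo conntrack -L -p tcp | grep 'dport=53' | awk '{print $4}' | sort | uniq -c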
This appears to be caused by a race between deactivating a netmgr handle and processing an asynchronous callback in the socket close code, which can be triggered when a client sends a broken packet to the server and then doesn't close the connection properly.
[Testcase]
You will need two VMs to reproduce this issue.
On the first, install bind9:
$ sudo apt install bind9
Set up a caching resolver by editing /etc/bind/
forwarders {
8.8.8.8;
};
If the DNS provider runs on dnsmasq/libvirt, also set:
dnssec-validation yes;
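For reference, the resulting options block might look something like this minimal sketch (assuming the stock Ubuntu layout, where these settings live in /etc/bind/named.conf.options):
options {
        directory "/var/cache/bind";
        forwarders {
                8.8.8.8;
        };
        dnssec-validation yes;
        listen-on-v6 { any; };
};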
Next, restart the named service:
$ sudo systemctl restart named.service
Edit /etc/resolv.conf and change the resolver to 127.0.0.1.
Disable the systemd-resolved service:
$ sudo systemctl stop systemd-resolved.service
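If /etc/resolv.conf is a symlink managed by systemd-resolved (the Ubuntu default), replacing it with a static file is one way to make the edit stick; a sketch:
$ sudo rm -f /etc/resolv.conf
$ echo "nameserver 127.0.0.1" | sudo tee /etc/resolv.conf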
Test to make sure resolving ubuntu.com works, using the IP of the NIC:
$ dig +tcp @192.168.122.21 ubuntu.com
Now, go to the second VM:
Test to make sure that you can dig the other VM with:
$ dig +tcp @192.168.122.21 ubuntu.com
After that, use tc to intentionally drop some packets, simulating bad clients that drop connections without closing them properly, to see if we can trigger the race.
My NIC is enp1s0, and a 30% drop should do the trick.
$ sudo tc qdisc add dev enp1s0 root netem loss 30%
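When you are done testing, the impairment can be removed again with the matching delete:
$ sudo tc qdisc del dev enp1s0 root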
Next, open gnome-terminal and paste and run the below command in 10-15 tabs; the more, the better:
$ for run in {1..10000}; do dig +tcp @192.168.122.21 ubuntu.com & done
This parallelizes the connections to the bind9 server, to try to get above the 150 connection limit.
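If juggling terminal tabs is inconvenient, roughly the same load can be generated from a single shell; a sketch, with 15 background loops standing in for the 15 tabs (this is gentler than backgrounding every dig, so increase the loop count if needed):
$ for tab in {1..15}; do (for run in {1..10000}; do dig +tcp @192.168.122.21 ubuntu.com >/dev/null; done) & done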
Back on the server, watch the TCP high-water mark with:
$ sudo rndc status
..
tcp clients: 0/150
TCP high-water: 10
..
$ sudo rndc status
..
tcp clients: 31/150
TCP high-water: 58
..
$ sudo rndc status
..
tcp clients: 56/150
TCP high-water: 141
..
$ sudo rndc status
..
tcp clients: 142/150
TCP high-water: 150
..
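Rather than re-running rndc status by hand, a watch loop (assuming procps' watch is installed) keeps both counters on screen:
$ sudo watch -n1 'rndc status | grep -E "tcp clients|high-water"'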
If you can't hit the 150 mark on the TCP high-water, add more tabs on the other VM and keep hitting the DNS server. Be aware that this will likely make the other VM unstable as well.
Eventually, you will hit the 150 mark. After keeping up the load a little longer, your bind9 server will be broken.
$ dig +tcp @192.168.122.21 ubuntu.com
;; Connection to 192.168.
;; Connection to 192.168.
; <<>> DiG 9.16.1-Ubuntu <<>> +tcp @192.168.122.21 ubuntu.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached
;; Connection to 192.168.
Run this dig from the bind9 server itself, so the 30% packet drop on the other VM doesn't confuse the results.
If you install the test package from the below ppa:
https:/
You can hammer this bind9 server as much as you like, but it will never break. If you stop the thundering herd at the 150 max connections, the server will correctly tear down its TCP connections, and you will once again be able to successfully query the DNS server.
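A quick way to confirm the teardown works is a short serial query loop from the server itself; a sketch (dig exits non-zero when a query times out):
$ for i in {1..20}; do dig +tcp +time=2 +tries=1 +short @192.168.122.21 ubuntu.com || echo "query $i failed"; done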
[Where problems could occur]
This patch doesn't really introduce any new code; it re-arranges the ordering of events in existing code.
Before, depending on when a thread was scheduled, we could either deactivate the netmgr handle before calling the asynchronous callback for the socket close code, or vice versa.
The patches change this to ensure that the netmgr handle is deactivated before the socket close callback is issued.
If a regression were to occur, we would see symptoms similar to this bug: sockets not closing properly and eventually exhausting the TCP connection limit, which would cause new TCP connections to not be accepted.
In this case, a workaround would be to restart the named service whenever the TCP high-water mark nears the TCP connection limit, and wait for a fix to be developed.
Regardless, only TCP connections would be affected; UDP would still function, meaning at worst a partial outage would occur.
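As a sketch of that workaround (the 140 threshold, scheduling, and script name are assumptions, not part of any fix), a small watchdog script run periodically could parse rndc status and restart named before the limit is reached:
#!/bin/sh
# Hypothetical watchdog: restart named when active TCP clients
# approach the default 150-connection limit.
used=$(rndc status | awk '/tcp clients/ {split($3, a, "/"); print a[1]}')
if [ "${used:-0}" -ge 140 ]; then
    systemctl restart named.service
fi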
[Other]
This was fixed in bind9 9.16.2 by the below commit:
commit 01c4c3301e55b7d
Author: Witold Kręcicki <email address hidden>
Date: Thu Mar 26 14:25:06 2020 +0100
Subject: Deactivate the handle before sending the async close callback.
Link: https:/
Upstream bug: https:/
This commit is already present in Groovy and Hirsute. Only Focal needs this patch.
Changed in bind9 (Ubuntu):
importance: Undecided → High
Changed in bind9 (Ubuntu):
status: Confirmed → New
Changed in bind9 (Ubuntu Focal):
status: New → In Progress
summary: TCP connections never close → named: TCP connections sometimes never close due to race in socket teardown
description: updated
description: updated
tags: added: sts-sponsor-mfo
tags: removed: sts-sponsor-mfo
Thanks for taking the time to file this bug and help make Ubuntu better.
I suppose you are talking about the 'tcp-initial-timeout' config option, whose upstream default is 30 seconds (300 * 100 milliseconds). Am I right? Or are you mentioning something else?
FWIW I got this info from upstream doc:
https://bind9.readthedocs.io/en/v9_16/reference.html
I checked whether any change to this config option was made in the source package available in Ubuntu, and I found nothing, so I am not sure this is a problem on the packaging side. Could you please provide your config files and explain how I can reproduce this? That would help to determine whether this is a bug.
For now I am marking this bug as Incomplete, but when you provide more details, please set the status back to New and we will revisit it.