unbound stops processing tcp (1.9.4-2ubuntu1.4 on focal)

Bug #2012064 reported by Walter

This bug affects 1 person
Affects: unbound (Ubuntu)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi!

We use unbound as a simple DNS recursor. It is bound to a handful of IPs, on both TCP and UDP.

On more than one occasion we have observed that one of the TCP sockets stopped processing requests.

At that point, netstat would look like this:

```
# netstat -apnAinet | grep :53.*unbound
tcp        0      0 127.0.0.1:53       0.0.0.0:*   LISTEN   526703/unbound
tcp      257      0 xx.xxx.xxx.xx:53   0.0.0.0:*   LISTEN   526703/unbound
tcp        0      0 yyy.yy.yy.yyy:53   0.0.0.0:*   LISTEN   526703/unbound
tcp        0      0 zzz.zzz.zz.zz:53   0.0.0.0:*   LISTEN   526703/unbound
udp        0      0 127.0.0.1:53       0.0.0.0:*            526703/unbound
udp        0      0 xx.xxx.xxx.xx:53   0.0.0.0:*            526703/unbound
udp        0      0 yyy.yy.yy.yyy:53   0.0.0.0:*            526703/unbound
udp        0      0 zzz.zzz.zz.zz:53   0.0.0.0:*            526703/unbound
```

(And a bunch of established connections. But no actual traffic for xx.xxx.xxx.xx:53 TCP.)

The receive queue is non-zero and unbound is not picking anything up on that TCP socket. In the meantime the other sockets work just fine.
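
For context (not part of the original report): on a LISTEN socket, netstat's Recv-Q counts completed connections waiting in the kernel's accept queue, so the 257 above suggests unbound stopped calling accept() and the backlog filled up. A minimal Python sketch that reproduces the same Recv-Q signature, using an arbitrary port 5300:

```
import socket
import subprocess
import time

# Listen, but never call accept(): the kernel still completes handshakes
# and parks the connections in the accept queue.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(('127.0.0.1', 5300))
srv.listen(16)

# Five completed handshakes that nobody accept()s.
clients = [socket.create_connection(('127.0.0.1', 5300)) for _ in range(5)]
time.sleep(0.2)

# The listener on port 5300 now shows a non-zero Recv-Q (the accept-queue
# length), just like the stuck xx.xxx.xxx.xx:53 socket above.
subprocess.run(['ss', '-ltn'])
```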

I found nothing of interest in the verbosity:1 log.

The process spent its time here:

```
(gdb) bt
#0 0x00007f061830e42a in epoll_wait (epfd=15, events=0x55984a71a0b0, maxevents=64, timeout=2100000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f0618d0d5e9 in ?? () from /lib/x86_64-linux-gnu/libevent-2.1.so.7
#2 0x00007f0618d03625 in event_base_loop () from /lib/x86_64-linux-gnu/libevent-2.1.so.7
#3 0x000055984802f7e0 in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#4 0x0000559847f92fad in worker_work (worker=<optimized out>) at daemon/worker.c:1897
#5 0x0000559847f869db in daemon_fork (daemon=0x559849ac52d0) at daemon/daemon.c:674
#6 0x0000559847f822e8 in run_daemon (need_pidfile=1, log_default_identity=0x7ffc638cdf07 "unbound", debug_mode=1, cmdline_verbose=0, cfgfile=0x55984804be98 "/etc/unbound/unbound.conf") at daemon/unbound.c:700
#7 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:800
```

I do not know whether fd 8 (which is the relevant TCP listening socket) was still in the epoll set of fd 15. An strace saw no events for that fd:

```
# grep 'epoll_wait.*=8' trace.log
(void)
```

While other fds showed up just fine:

```
11:26:28.068155 epoll_wait(15, [{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=5, u64=5}}, {EPOLLIN, {u32=6, u64=6}}], 64, 306) = 3
...
11:26:28.071341 accept4(6, {sa_family=AF_INET, sin_port=htons(36211), sin_addr=inet_addr("aaa.aa.aaa.aa")}, [128->16], SOCK_NONBLOCK) = 41
```

Maybe fd 8 got EPOLL_CTL_DELeted? Maybe there is no new notification because it should already be known to the application? Maybe there is no notification because the receive queue is full? I'm not intimately familiar with the epoll_wait semantics for listening stream sockets.
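
One way to answer that question without a debugger (a hypothetical sketch, not something from the report): for an epoll instance, /proc/<pid>/fdinfo/<epfd> lists every registered descriptor as a "tfd:" entry, so you can check whether the listening socket is still in the set:

```
import re
import sys

def epoll_registered_fds(pid, epfd):
    """Return the fds registered in epoll instance <epfd> of process <pid>."""
    # Requires permission to read the target process's fdinfo (e.g. root).
    with open(f'/proc/{pid}/fdinfo/{epfd}') as f:
        return [int(m.group(1))
                for m in re.finditer(r'^tfd:\s*(\d+)', f.read(), re.M)]

if __name__ == '__main__':
    # e.g.: ./check_epoll.py 526703 15 8
    pid, epfd, listen_fd = (int(a) for a in sys.argv[1:4])
    fds = epoll_registered_fds(pid, epfd)
    print('registered fds:', sorted(fds))
    print(f'fd {listen_fd}:',
          'still registered' if listen_fd in fds else 'NOT registered')
```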

Right now I have no leads on the cause, but I'm filing the ticket here so others can chime in.

Cheers,
Walter Doekes
OSSO B.V.

----
Version:
unbound 1.9.4-2ubuntu1.4 on Focal
----
Config:

server:
  interface: zzz.zzz.zz.zz
  interface: yyy.yy.yy.yyy
  interface: xx.xxx.xxx.xx
  interface: 127.0.0.1
  verbosity: 1
  access-control: ... allow
  ...

  prefetch: yes
  serve-expired: yes
  serve-expired-ttl: 129600
  serve-expired-ttl-reset: no
  cache-max-ttl: 86400
  cache-min-ttl: 30
  cache-max-negative-ttl: 600
  do-ip4: yes
  do-ip6: no

  extended-statistics: yes
  log-servfail: yes

  qname-minimisation: yes

  auto-trust-anchor-file: "/var/lib/unbound/root.key"

remote-control:
  control-enable: yes
  control-interface: 127.0.0.1
  control-port: 8953
----

Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thanks for taking the time to report the bug.

I haven't tried to reproduce it yet, but I'd like to ask you a few questions.

1) Did you happen to notice whether this problem manifests after certain specific scenarios, or is it some random event?

2) How often does the problem happen?

3) What happens if you tell unbound to listen on fewer interfaces? Does the problem still occur?

4) Do you think you would be able to come up with a reproducer?

That's it for now.

I'm marking this bug as Incomplete to reflect the fact that we're waiting on more information from you. Please set its status back to New when you provide the information, and we'll continue from there. Thanks!

Changed in unbound (Ubuntu):
status: New → Incomplete
Revision history for this message
Walter (wdoekes) wrote :

Thanks Sergio.

I tried to include all the information I had, but here are some additional answers:

(1) No, no known triggering events yet. I'm collecting pcaps in the hope of finding something the next time this happens.

(2) It has happened at least twice:
- 2023-02-22 we noticed this, with unbound:amd64 1.9.4-2ubuntu1.2;
- 2023-03-16 this occurred again, with unbound:amd64 1.9.4-2ubuntu1.4.

The system where it happens does about 120qps (90%+ cached responses).

(3) Hard to tell what happens if we change/add/remove interfaces; this has so far not happened often enough for us to set up a test system. We have _not_ seen this happen on other instances with fewer listening interfaces, but those instances are used far less (and probably handle even less TCP), which is a more likely reason we haven't seen it there.

(4) If I can reproduce it, I'll be sure to update this bug and also file one upstream if that seems affected too.

Cheers!

Changed in unbound (Ubuntu):
status: Incomplete → New
Revision history for this message
Sergio Durigan Junior (sergiodj) wrote :

Thank you very much for the quick reply, Walter.

This sounds like one of those bugs that are hard to reproduce, which might make it a challenge for us to find a proper fix and SRU it into Ubuntu. But let's keep our hopes up.

I'm going to set this bug's status back to Incomplete just to make it clear to future triagers that we're waiting on more information from your side. If I have some spare time, I will try to investigate it a bit more. Please don't hesitate to update the bug with any new information you might have, and set it back to New when you do so. Thank you again.

Changed in unbound (Ubuntu):
status: New → Incomplete
Revision history for this message
Walter (wdoekes) wrote :

Okay. I've managed to reproduce at least one problem with some scapy code.

We also updated one of our resolvers to Jammy: there the problem is gone, at least the one I could reproduce (*).

If I look at the changes, the most likely fix is this:
https://github.com/NLnetLabs/unbound/pull/122/commits/dc0b1699e53df94aae35e7e4104d6e7c35bf39ce

Reported here:
- https://lists.nlnetlabs.nl/pipermail/unbound-users/2019-August/006361.html ("unbound vs idle tcp connections")
- https://lists.nlnetlabs.nl/pipermail/unbound-users/2019-November/006519.html ("1.9.4: TCP queries when some threads are full")

Broken version: 1.9.4-2ubuntu1.4
Seemingly fixed version: 1.13.1-1ubuntu5.3
The fix above was added in: 1.9.6

Cheers,
Walter
OSSO B.V.

(*) Reproducing:

```

#!/usr/bin/env python3
import sys
from scapy import __version__
from scapy.all import (
    DNS, DNSQR, IP, TCP_client, TCP, UDP, Raw, sr1)

SCAPY_VER = [
    (int(i) if i.isdigit() else 0)
    for i in __version__.split('.')]
# commit 1a7061988050f0183868385fc261ed01d1187156
assert SCAPY_VER >= [2, 4, 5], (
    f'scapy 2.4.5+ required for TCP teardown, got {SCAPY_VER}')

BREAK_UNBOUND = 1 # non-zero to break unbound 1.9.4
DNS_SERVER = sys.argv[1] # IP of unbound recursor
DNS_QUERY = DNSQR(qname='daisy.ubuntu.com', qtype='AAAA')

class Hacked_TCP_client(TCP_client):
    """
    TCP_client that stops sending anything after data.

    By not sending an ACK after receiving data, we make the peer
    start retransmitting. In the meantime, unbound(1) might stall
    because the TCP socket is busy.
    """
    def send(self, pkt):
        if hasattr(self, '_stop_sending'):
            print(' (discarding send)', pkt)
            return

        if 'P' in pkt[TCP].flags:
            print(' (sending data)', pkt)
            # We send a request/query. We get an ACK. Next, the
            # peer will send a response/answer. We will not ACK.
            self._stop_sending = 1
        else:
            print(' (sending)', pkt)

        return super().send(pkt)

dns_req = DNS(
    qr=0, # is_query (not response)
    opcode=0, # query
    rd=1, # request recursion
    qd=DNS_QUERY)

print(f'TCP DNS request @ {DNS_SERVER}:53...')
if BREAK_UNBOUND:
    sock = Hacked_TCP_client.tcplink(Raw, DNS_SERVER, 53)
else:
    sock = TCP_client.tcplink(Raw, DNS_SERVER, 53)

# Prefix the data with the size of the request.
assert len(dns_req) <= 255, len(dns_req)
tcp_dns_req = bytes([0, len(dns_req)]) + bytes(dns_req)

sock.send(tcp_dns_req)
answer = sock.recv(8192)
data = bytes(answer)
# The response also has a 2-byte size prelude.
tcp_dns_ans_len = (data[0] << 8 | data[1])
answer_dns = DNS(data[2:(tcp_dns_ans_len + 2)])
print(answer_dns)
answer_dns.show2()
print(' ^- tcp query, ok')
print()

# With scapy-2.4.5+ this will send FIN+ACK and do
# a clean teardown.
sock.close()

```

After running the above code a couple of times, the TCP socket stops accepting connections, at least for a while. Confirm with: `dig anything.com @YOURSERVER +tcp`
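
An equivalent check in Python (a sketch along the same lines, not part of the original reproducer; the 2-second timeout and query name are arbitrary): if the listener is stalled, the kernel may still complete the TCP handshake, but no answer ever arrives.

```
import socket
import struct
import sys

from scapy.all import DNS, DNSQR

server = sys.argv[1]
query = bytes(DNS(rd=1, qd=DNSQR(qname='anything.com', qtype='A')))

try:
    with socket.create_connection((server, 53), timeout=2.0) as sock:
        sock.settimeout(2.0)
        # TCP DNS messages carry a 2-byte length prefix.
        sock.sendall(struct.pack('!H', len(query)) + query)
        size = struct.unpack('!H', sock.recv(2))[0]
        print('answer:', DNS(sock.recv(size)).summary())
except socket.timeout:
    print('timeout: the TCP listener looks stalled')
```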
