unbound stops processing tcp (1.9.4-2ubuntu1.4 on focal)
This bug report will be marked for expiration in 58 days if no further activity occurs. (find out why)
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
unbound (Ubuntu) |
Incomplete
|
Undecided
|
Unassigned |
Bug Description
Hi!
We use unbound as a simple dns recursor. It is bound to a handful IPs. And it is bound to both TCP and UDP.
On more than one occassion have we observed that one of the TCP sockets stopped processing requests.
At that point, netstat would look like this:
# netstat -apnAinet | grep :53.*unbound
tcp 0 0 127.0.0.1:53 0.0.0.0:* LISTEN 526703/unbound
tcp 257 0 xx.xxx.xxx.xx:53 0.0.0.0:* LISTEN 526703/unbound
tcp 0 0 yyy.yy.yy.yyy:53 0.0.0.0:* LISTEN 526703/unbound
tcp 0 0 zzz.zzz.zz.zz:53 0.0.0.0:* LISTEN 526703/unbound
udp 0 0 127.0.0.1:53 0.0.0.0:* 526703/unbound
udp 0 0 xx.xxx.xxx.xx:53 0.0.0.0:* 526703/unbound
udp 0 0 yyy.yy.yy.yyy:53 0.0.0.0:* 526703/unbound
udp 0 0 zzz.zzz.zz.zz:53 0.0.0.0:* 526703/unbound
(And a bunch of established connections. But no actual traffic for xx.xxx.xxx.xx:53 TCP.)
The receive queue is non-zero and unbound is not picking anything up on that TCP socket. In the mean time the other sockets work just fine.
I found nothing of interest in the verbosity:1 log.
The process spent its time here:
```
(gdb) bt
#0 0x00007f061830e42a in epoll_wait (epfd=15, events=
#1 0x00007f0618d0d5e9 in ?? () from /lib/x86_
#2 0x00007f0618d03625 in event_base_loop () from /lib/x86_
#3 0x000055984802f7e0 in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#4 0x0000559847f92fad in worker_work (worker=<optimized out>) at daemon/
#5 0x0000559847f869db in daemon_fork (daemon=
#6 0x0000559847f822e8 in run_daemon (need_pidfile=1, log_default_
#7 main (argc=<optimized out>, argv=<optimized out>) at daemon/
```
I do know know whether fd 8 (which is the relevant tcp listening socket) was still in the epoll_wait list of fd 15. An strace saw no events for that fd:
# grep 'epoll_wait.*=8' trace.log
(void)
While other fds showed up just fine:
11:26:28.068155 epoll_wait(15, [{EPOLLIN, {u32=7, u64=7}}, {EPOLLIN, {u32=5, u64=5}}, {EPOLLIN, {u32=6, u64=6}}], 64, 306) = 3
...
11:26:28.071341 accept4(6, {sa_family=AF_INET, sin_port=
Maybe fd 8 got EPOLL_CTL_DELeted? Maybe there is no new notification because it should already be known to the application? Maybe there is no notification because the recv queue is full? I'm not intimate with the epoll_wait sematics for listening stream sockets.
Right now I have no leads on the cause yet, but filing the ticket here so others can chime in.
Cheers,
Walter Doekes
OSSO B.V.
----
Version:
unbound 1.9.4-2ubuntu1.4 on Focal
----
Config:
server:
interface: zzz.zzz.zz.zz
interface: yyy.yy.yy.yyy
interface: xx.xxx.xxx.xx
interface: 127.0.0.1
verbosity: 1
access-control: ... allow
...
prefetch: yes
serve-expired: yes
serve-
serve-
cache-max-ttl: 86400
cache-min-ttl: 30
cache-
do-ip4: yes
do-ip6: no
extended-
log-servfail: yes
qname-
auto-
remote-control:
control-enable: yes
control-
control-port: 8953
----
Thanks for taking the time to report the bug.
I haven't tried to reproduce it yet, but I'd like to ask you a few questions.
1) Did you happen to notice whether this problem manifests after certain specific scenarios, or is it some random event?
2) How often does the problem happen?
3) What happens if you tell unbound to listen to fewer interfaces? Does this also happen?
4) Do you think you would be able to come up with a reproducer?
That's it for now.
I'm marking this bug as Incomplete to reflect the fact that we're waiting on more information from you. Please set its status back to New when you provide the information, and we'll continue from there. Thanks!