Bug #1752411 “bind9-host, avahi-daemon-check-dns.sh hang forever...” : Bugs : openconnect package : Ubuntu

Revision history for this message

Liam (liam-smit) wrote on 2018-02-28:

#1

Dependencies.txt (3.2 KiB, text/plain; charset="utf-8")
JournalErrors.txt (6.9 KiB, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.1 KiB, text/plain; charset="utf-8")
ProcEnviron.txt (322 bytes, text/plain; charset="utf-8")

Launchpad Janitor (janitor) wrote on 2018-03-21:

#2

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openconnect (Ubuntu):
status:	New → Confirmed

Marc Dietrich (marvin24) wrote on 2018-04-03:

#3

I'm also affected. Looking at the process list (ps ax) I found:

23386 pts/4 S+ 0:00 openconnect -vvvvv -s /usr/share/vpnc-scripts/vpnc-script vpn.example.com
23400 pts/4 S+ 0:00 /bin/sh -c /usr/share/vpnc-scripts/vpnc-script
23405 pts/4 S+ 0:00 /bin/sh /usr/share/vpnc-scripts/vpnc-script
23440 ? Ssl 0:00 /usr/lib/NetworkManager/nm-dispatcher
23443 ? S 0:00 /bin/sh -e /etc/NetworkManager/dispatcher.d/01-ifupdown tun0 up
23468 pts/4 S+ 0:00 run-parts --arg=-a --arg=tun0 /etc/resolvconf/update.d
23479 pts/4 S+ 0:00 run-parts /etc/resolvconf/update-libc.d
23500 ? S 0:00 /bin/sh /etc/network/if-up.d/ntpdate
23502 ? S 0:00 flock -n /run/lock/ntpdate /usr/sbin/ntpdate-debian -s
23504 ? S< 0:00 /usr/sbin/ntpdate -s de.pool.ntp.org
23515 pts/4 S+ 0:00 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
23534 pts/4 Sl+ 0:00 host -t soa local.
23539 ? S 0:00 run-parts /etc/network/if-up.d
23550 ? S 0:00 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
23567 ? Sl 0:00 host -t soa local.
23574 pts/3 R+ 0:00 ps ax

The "avahi-daemon-check-dns.sh" process hangs, maybe because the route to the DNS server isn't set up yet (vpnc-script is still running). If I kill this process (pid 23550), the script continues to run and the connection is alive and stable.

Marc Dietrich (marvin24) wrote on 2018-04-03:

#4

In fact, it is the "host" command that is hanging, sorry.

Liam (liam-smit) wrote on 2018-04-03:

#5

Nice work Marc.

I can confirm that if I kill the host command (with PID 23567 in your example) then my VPN connection works.
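
For anyone needing the same stopgap, the stuck lookups can be picked out of the process list mechanically. The following is only a sketch: stuck_host_pids is a hypothetical helper name, and it assumes the five-column "ps ax" layout shown in comment #3 (PID, TTY, STAT, TIME, COMMAND).

```shell
#!/bin/sh
# Hypothetical helper: read "ps ax"-style lines on stdin and print the PID
# of every process whose command line is exactly "host -t soa local.".
# Assumes the command starts in column 5, as in plain "ps ax" output.
stuck_host_pids() {
    awk '$5 == "host" && $6 == "-t" && $7 == "soa" && $8 == "local." { print $1 }'
}
```

On a live system this could be run as "ps ax | stuck_host_pids" and the resulting PIDs passed to kill, which is the manual step described above.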

Trent Lloyd (lathiat) wrote on 2018-04-04:

#6

The default timeout for the 'host' command is 10 seconds. Is it taking longer than that?

Marc Dietrich (marvin24) wrote on 2018-04-04:

#7

If you define "infinite" as longer than 10 seconds - yes. Something is quite bogus here.

Trent Lloyd (lathiat) wrote on 2018-04-09:

#8

I ran into this myself today after upgrading a machine to bionic.

Two copies of the script were running at once, both stuck on host.
If I execute a new 'host' command it works, but the existing ones are stuck.

root     14181  0.0  0.0   4628   868 ?        Ss   13:05   0:00   /bin/sh -c cat /run/systemd/resolve/stub-resolv.conf | /sbin/resolvconf -a systemd-resolved
root     14292  0.0  0.0   4520   752 ?        S    13:05   0:00     run-parts --arg=-a --arg=systemd-resolved /etc/resolvconf/update.d
root     14320  0.0  0.0   4520   748 ?        S    13:05   0:00       run-parts /etc/resolvconf/update-libc.d
root     14354  0.0  0.0   4628  1672 ?        S    13:05   0:00         /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
root     14607  0.0  0.0 187532  8380 ?        Sl   13:05   0:00           host -t soa local.

root     13775  0.0  0.0   4628   868 ?        Ss   13:05   0:00   /bin/sh -ec ifup --allow=hotplug eno1; ifup --allow=auto eno1;      if ifquery eno1 >/dev/null; then ifquery --state eno1 >/dev/null; fi
root     13787  0.0  0.0   4592  1868 ?        S    13:05   0:00     ifup --allow=auto eno1
root     14179  0.0  0.0   4628   772 ?        S    13:05   0:00       /bin/sh -c /bin/run-parts --exit-on-error /etc/network/if-up.d
root     14182  0.0  0.0   4520   768 ?        S    13:05   0:00         /bin/run-parts --exit-on-error /etc/network/if-up.d
root     14183  0.0  0.0   4628   772 ?        S    13:05   0:00           /bin/sh /etc/network/if-up.d/000resolvconf
root     14461  0.0  0.0   4520   752 ?        S    13:05   0:00             run-parts --arg=-a --arg=eno1.inet /etc/resolvconf/update.d
root     14479  0.0  0.0   4520   728 ?        S    13:05   0:00               run-parts /etc/resolvconf/update-libc.d
root     14503  0.0  0.0   4628  1612 ?        S    13:05   0:00                 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
root     14606  0.0  0.0 187532  8384 ?        Sl   13:05   0:00                   host -t soa local.

(gdb) t a a bt

Thread 4 (Thread 0x7f3231f00700 (LWP 14796)):
#0  0x00007f3237b06bb7 in epoll_wait (epfd=5, events=0x7f3238eb1010, maxevents=64, timeout=timeout@entry=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1  0x00007f323804773b in watcher (uap=0x7f3238eb0010) at ../../../../lib/isc/unix/socket.c:4280
#2  0x00007f3237ddd6db in start_thread (arg=0x7f3231f00700) at pthread_create.c:463
#3  0x00007f3237b0688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f3232701700 (LWP 14794)):
#0  0x00007f3237de39f3 in futex_wait_cancelable (private=<optimised out>, expected=0, futex_word=0x7f3238eae0a4)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f3238eae028, cond=0x7f3238eae078) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=0x7f3238eae078, mutex=mutex@entry=0x7f3238eae028) at pthread_cond_wait.c:655
#3  0x00007f3238039370 in run (uap=0x7f3238eae010) at ../../../lib/isc/timer.c:808
#4  0x00007f3237ddd6db in start_thread (arg=0x7f3232701700) at pthread_create.c:463
#5  0x00007f3237b0688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f3232f02700 (LWP 14791)):
#0  0x00007f3237de39f3 in futex_wait_cancelable (private=<optimised out>, expected=0, futex_word=0x7f3238eac0cc)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1  __pthread_cond_wait_common (abstime=0x0, mutex=0x7f3238eac028, cond=0x7f3238eac0a0) at pthread_cond_wait.c:502
#2  __pthread_cond_wait (cond=cond@entry=0x7f3238eac0a0, mutex=mutex@entry=0x7f3238eac028) at pthread_cond_wait.c:655
#3  0x00007f32380329e0 in dispatch (manager=0x7f3238eac010) at ../../../lib/isc/task.c:1086
#4  run (uap=0x7f3238eac010) at ../../../lib/isc/task.c:1312
#5  0x00007f3237ddd6db in start_thread (arg=0x7f3232f02700) at pthread_create.c:463
#6  0x00007f3237b0688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 1 (Thread 0x7f3238eeabc0 (LWP 14607)):
#0  0x00007f3237a24236 in __GI___sigsuspend (set=set@entry=0x7ffd71576aa0) at ../sysdeps/unix/sysv/linux/sigsuspend.c:26
#1  0x00007f323803c4d1 in isc__app_ctxrun (ctx0=ctx0@entry=0x7f32382718e0 <isc_g_appctx>) at ../../../../lib/isc/unix/app.c:721
#2  0x00007f323803c7cc in isc__app_run () at ../../../../lib/isc/unix/app.c:754
#3  0x00007f323803d090 in isc_app_run () at ../../../../lib/isc/unix/../app_api.c:198
#4  0x0000563cebce752a in main (argc=4, argv=0x7ffd71576d08) at ../../../bin/dig/host.c:906
(gdb)
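
The backtrace above came from an interactive gdb session ("t a a bt" is short for "thread apply all bt"). A non-interactive way to capture the same information might look like the sketch below. dump_backtraces is a hypothetical helper name, and it assumes gdb is installed and that the caller is allowed to attach to the process (typically root).

```shell
#!/bin/sh
# Hypothetical helper: print backtraces of all threads of a running process.
# Usage: dump_backtraces PID
dump_backtraces() {
    pid="$1"
    # Refuse anything that is not a plain decimal PID.
    case "$pid" in
        ''|*[!0-9]*)
            echo "usage: dump_backtraces PID" >&2
            return 2
            ;;
    esac
    # -batch makes gdb exit after running the -ex command; the process is
    # detached again and keeps running afterwards.
    gdb -p "$pid" -batch -ex 'thread apply all bt' 2>&1
}
```

For the processes listed above that would be, for example, "dump_backtraces 14607". Debug symbols for bind9-host need to be installed for the frames to show file and line information.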

Trent Lloyd (lathiat) wrote on 2018-04-09:

#9

No VPN in use.. this is probably a bug equally in bind9-host and avahi-daemon

The host command shouldn't be getting stuck, and avahi should probably make the script time out somehow.

Changed in bind9 (Ubuntu):
importance:	Undecided → Critical
Changed in avahi (Ubuntu):
importance:	Undecided → High
Changed in bind9 (Ubuntu):
importance:	Critical → High
status:	New → Confirmed
Changed in avahi (Ubuntu):
status:	New → Confirmed
summary:	- Can not ping IP addresses on remote network after connect
	+ bind9-host, avahi-daemon-check-dns.sh hang forever causes network
	+ connections to get stuck

Trent Lloyd (lathiat) wrote on 2018-04-09:

#10

I did some testing using strace and looking at backtraces of why "host" is stuck, and it's not immediately clear to me why it's getting stuck. I will need to look more in depth by tracing its actual execution; it's multi-threaded and uses poll, so it's not straightforward to follow from the trace for someone unfamiliar with the code base.

I did confirm that when it happens, the network interfaces are up and systemd-resolved is started, and I can see a sendmsg/recvmsg appear to succeed to the systemd stub resolver and my local DNS server. I also tried explicitly setting the timeout with host -W 5 (this should be the default, but I wanted to test it as there is a -w option that waits indefinitely). However, the 'host' command always works when I log into the system while the other commands are still stuck in the background, so something strange is going on.

What does work, is executing 'host' under /usr/bin/timeout. Given the severity of this issue (makes startup hang without SSH for several minutes, and blocks everything else from starting up seemingly forever), I would suggest that we should ship a fix for bionic to use timeout to work around the issue for now.

/usr/lib/avahi/avahi-daemon-check-dns.sh : dns_has_local()
OUT=`LC_ALL=C /usr/bin/timeout 5 host -t soa local. 2>&1`
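
To illustrate what the changed line does when the lookup hangs, here is a minimal sketch in which "sleep 5" stands in for a stuck "host" (an assumption purely for demonstration) and the limit is shortened to 1 second. With GNU coreutils, timeout exits with status 124 when it has to kill the command, so the script's existing non-zero-status branch is taken.

```shell
#!/bin/sh
# "sleep 5" stands in for a hung "host -t soa local." (illustration only).
rc=0
OUT=$(LC_ALL=C timeout 1 sleep 5 2>&1) || rc=$?

# GNU timeout reports 124 when the time limit was hit.
if [ "$rc" -eq 124 ]; then
    echo "lookup timed out; treating it as a failed check"
fi
```

The cost is that an affected machine still waits for the full limit on each trigger before continuing.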

Changed in openconnect (Ubuntu):
status:	Confirmed → Invalid

Andreas Hasenack (ahasenack) wrote on 2018-04-09:

#11

Can you try running "host" with -d in the scenario where it is hanging?

Also, a fresh bionic system shouldn't run the ifupdown scripts, since it uses netplan, unless openconnect or one of its dependencies pulls it in explicitly. A quick apt-cache rdepends didn't show that.

Marc Dietrich (marvin24) wrote on 2018-04-09:

#12

output of "trace -Sf host -d -v -w5 -t soa local." (10.0 KiB, text/plain)

My system was upgraded, so maybe it's a leftover. So I uninstalled ifupdown and upstart. The problem still persists :-(
I will attach the output of ltrace -Sf host; maybe it helps...

Trent Lloyd (lathiat) wrote on 2018-04-16:

#13

With host -d I simply get

> Trying "local"

When it works normally I get:

Trying "local"
Host local. not found: 3(NXDOMAIN)
Received 98 bytes from 10.48.134.6#53 in 1 ms
Received 98 bytes from 10.48.134.6#53 in 1 ms

The system I am hitting this issue on is an upgraded system (rather than a fresh install which wouldn't use ifupdown)

Because this is a serious issue for bionic upgraders and release is imminent, I am attaching a debdiff that uses 'timeout' to work around the issue for now. The core issue with 'host' still needs to be investigated (as this workaround may add 5s delays to boot-up), but the timeout is probably a good backup anyway. Arguably the entire check-dns script should be launched under timeout.

Changed in avahi (Ubuntu):
assignee:	nobody → Trent Lloyd (lathiat)

Trent Lloyd (lathiat) wrote on 2018-04-18:

#15

There is a new bind9 upload to bionic-proposed (9.11.3+dfsg-1ubuntu1)

Tested with this version and 'host' is still hanging. So this fix is still required.

Andreas Hasenack (ahasenack) wrote on 2018-04-19:

#16

Some troubleshooting I did with Trent today showed:

a) the "host -t soa local." call triggered a query to 127.0.0.53 as expected, network-wise, which got a response right away

b) we snapshotted ip route and ip addr just before the host call, and saw that the interface responsible for the default route (and route to his dns server) was still down. I wonder if dns_reachable() in /usr/lib/avahi/avahi-daemon-check-dns.sh is doing the right thing. It looks for 127.0.0.1 (and not 127.0.0.53), and, failing that, for a default route. The default route exists, but the link it goes through is still down:

2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000

default via x.x.x.x dev eno1 onlink linkdown
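
One way a reachability check could take link state into account is to ignore default routes that the kernel marks "linkdown", as in the output above. This is a sketch only: default_route_usable is a hypothetical function, not the existing dns_reachable() code, and it reads "ip route show default"-style text on stdin.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if stdin contains at least one
# "default ..." route line that is NOT flagged "linkdown".
default_route_usable() {
    awk '/^default / {
             found = 1
             if ($0 !~ / linkdown/) ok = 1
         }
         END { exit !(found && ok) }'
}
```

It could be driven with "ip route show default | default_route_usable"; with the route quoted above it would report failure until the link comes up.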

Trent Lloyd (lathiat) wrote on 2018-04-20:

#17

I'd still like to get the upload debdiff for 'timeout' that I prepared uploaded. Even if we manage to debug the bind9-host issue, it will still be useful to have the timeout command there as a backup. Not long before we run out of time for bionic release.

I am actively looking at the bind9-host issue also, but I do not expect to get that fixed before release.

Andreas Hasenack (ahasenack) wrote on 2018-04-20:

#18

I think that time is past, we were in beta freeze in the past week, and are in final freeze now. Unless there is a clear test case showing under which conditions this happens and how widespread it is, it's probably best to start thinking in SRU terms.

It looks like a safe change, but since I don't understand the problem entirely yet (when it happens, why), I can't say.

Andrej Shadura (andrew.sh) wrote on 2018-05-06:

#19

Please also see https://bugs.debian.org/898038

Andreas Hasenack (ahasenack) wrote on 2018-05-07:

#20

It should probably also check for 127.0.0.53, not just 127.0.0.1
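
A check that accepts both addresses might look like the sketch below. has_local_nameserver is a hypothetical name and this is not the actual dns_reachable() code; it only shows the idea of matching the systemd-resolved stub address (127.0.0.53) alongside 127.0.0.1 in a resolv.conf-style file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given resolv.conf-style file lists a
# loopback resolver, either the classic 127.0.0.1 or the systemd-resolved
# stub at 127.0.0.53. Defaults to /etc/resolv.conf when no file is given.
has_local_nameserver() {
    awk '$1 == "nameserver" && ($2 == "127.0.0.1" || $2 == "127.0.0.53") { found = 1 }
         END { exit !found }' "${1:-/etc/resolv.conf}"
}
```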

Bug Watch Updater (bug-watch-updater) on 2018-05-08

Changed in avahi (Debian):
status:	Unknown → New

Trent Lloyd (lathiat) wrote on 2018-05-22:

#21

Sponsors: Can we get this debdiff uploaded now? We've had a few more reports and I'd like to get this workaround in place.

David Sitsky (david-sitsky) wrote on 2018-07-10:

#22

I was hit by this exact problem (update to Ubuntu 18.04), the host command would hang and adding timeout works around the issue. Please put the temporary fix in so others are not affected.

Laban Sköllermark (laban-skoller) wrote on 2018-07-19:

#23

I've run into this problem as well after upgrading my Ubuntu 17.10 installation (upgraded from 17.04) to Ubuntu 18.04 last week. My VPN script calling "openconnect" hung and no packets got forwarded. My first workaround, however, was to connect using nm-applet instead, which worked fine (available via package network-manager-openconnect-gnome).

I found this thread after a tip from a colleague after asking for help. It would be nice to have a proper fix for users upgrading to Bionic. The timeout fix/workaround would be good enough in my opinion.

@lathiat: Thanks for the fix in avahi-daemon-check-dns.sh (wrapping "timeout" around the "host" call). I can confirm that this solves the problem for me as well.

A note from my setup:
I have *two* stuck "host -t soa local." processes launched by two different "avahi-daemon-check-dns.sh" instances. pstree says that one is launched by "vpnc-script" (launched by openconnect) and the other one is launched by "01-ifupdown" launched by "nm-dispatcher". Killing the host process started by openconnect->vpnc-script solves the problem.

And as @marvin24 said, uninstalling "ifupdown" does *not* solve the problem.

@muetze-bsw (in duplicate Bug #1772692): I can confirm that uninstalling "avahi-daemon" solves the problem. This will be my "permanent" workaround.

Trent Lloyd (lathiat) wrote on 2018-08-01:

#24

Hoping to get attention to this again. Since 18.04.1 is out now, more and more users are likely to hit this issue as more users will be upgrading. This issue applies equally to desktop and server scenarios.

I would like to get lp1752411-avahi-host-timeout.diff sponsored for upload please

Robie Basak (racb) on 2018-08-02

tags:

added: server-next

Robie Basak (racb) wrote on 2018-08-03:

#25

Thank you for working on this. I agree with your approach. The debdiff looks good.

I think that though it's clear that the bug is in the host command, given that we haven't been able to figure out the fix (I spent some time on it too), it's reasonable to add the timeout command as in your patch as a workaround. No need for the problem to persist when the workaround is so clean and clear. I also approve the timeout workaround for SRU in principle. We can leave a bug task open for bind, but consider a separate bug task resolved in avahi packaging once this workaround is applied.

A couple of comments from your current debdiff:

Please leave a comment above the timeout line explaining why it is there ("Workaround for LP: #1752411" is sufficient). For the SRU, I would prefer a version string of "0.7-3.1ubuntu1.1" ("0.7-3.1ubuntu2" is technically OK but doesn't convey that it is an SRU so well).

Please could you prepare a debdiff for Cosmic so that we can fix it there first? Then follow https://wiki.ubuntu.com/StableReleaseUpdates#Procedure and attach an updated debdiff for the SRU to Bionic. I'll be happy to sponsor both, but will then need review from another SRU team member to accept it from the queue.

Simon Quigley (tsimonq2) wrote on 2018-08-17:

#26

Unsubscribing sponsors for now, awaiting the fixes Robie commented about.

Christian Ehrhardt  (paelzer) wrote on 2018-08-20:

#27

FYI bug 1786261 could be another symptom of this; reporters there will take a look and might add another affected package to this bug.

Erich E. Hoover (ehoover) wrote on 2018-08-20:

#28

Definitely a duplicate. @paelzer was suggesting in bug 1786261 that dig be used instead, do you guys know if that's a reasonable possibility?

fermulator (fermulator) wrote on 2018-08-20:

#29

Notes;

when "things are working", host does either:

while on VPN:
{{{
$ LC_ALL=C host -t soa local.
Host local. not found: 3(NXDOMAIN)

$ LC_ALL=C dig -t soa local.

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 7637
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: e1ff5e7222ad62da (echoed)
;; QUESTION SECTION:
;local.				IN	SOA

;; Query time: 21 msec
;; SERVER: 192.168.194.20#53(192.168.194.20)
;; WHEN: Mon Aug 20 12:01:19 EDT 2018
;; MSG SIZE  rcvd: 46
}}}

while off VPN:
{{{
$ LC_ALL=C host -t soa local.
Host local not found: 2(SERVFAIL)

$ LC_ALL=C dig -t soa local.

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 61619
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;local.				IN	SOA

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Aug 20 12:02:24 EDT 2018
;; MSG SIZE  rcvd: 34

}}}

=====
while in the broken/hung state:
             ^^^^^^^^^^^
=====

{{{
$ LC_ALL=C host -t soa local.

:(

}}}
 (even hangs w/ "-W 1") ...

dig command augmented returns!:
{{{
$ LC_ALL=C dig -t soa local.

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16967
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;local.				IN	SOA

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Aug 20 11:56:58 EDT 2018
;; MSG SIZE  rcvd: 34
}}}

(I am not familiar enough with SOA lookups for local. to say whether dig can replace the host invocation in this function)
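
If dig were to be used, the script would need to read the response status from dig's output, since dig prints its result in a different format from host. A rough sketch follows; dig_status is a hypothetical helper, and it only parses text in the header format shown in the output above.

```shell
#!/bin/sh
# Hypothetical helper: read dig output on stdin and print the DNS response
# status (for example SERVFAIL or FORMERR) taken from the ->>HEADER<<- line.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}
```

Fed the hung-state output quoted above ("LC_ALL=C dig -t soa local. | dig_status"), this would print SERVFAIL. Whether any given status should count as ".local exists in unicast DNS" is a separate decision that this sketch does not make.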

/usr/lib/avahi/avahi-daemon-check-dns.sh

dns_has_local() { 
  # Some magic to do tests 
  if [ -n "${FAKE_HOST_RETURN}" ] ; then
    if [ "${FAKE_HOST_RETURN}" = "true" ]; then
      return 0;
    else
      return 1;
    fi
  fi

  OUT=`LC_ALL=C host -t soa local. 2>&1`
  if [ $? -eq 0 ] ; then
    if echo "$OUT" | egrep -vq 'has no|not found'; then
      return 0
    fi
  else 
    # Checking the dns servers failed. Assuming no .local unicast dns, but
    # remove the nameserver cache so we recheck the next time we're triggered
    rm -f ${NS_CACHE}
  fi
  return 1
}

fermulator (fermulator) wrote on 2018-08-20:

#30

(btw; while we're fixing that script ... change the backticks to the POSIX $() command substitution form?)
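
For reference, both forms are valid POSIX command substitution; the practical difference shows up when substitutions are nested. A small sketch (variable names are arbitrary):

```shell
#!/bin/sh
# Backtick form: the inner backticks must be escaped to nest.
nested_bt=`echo "a \`echo b\`"`

# $() form: nests without any escaping.
nested_ds=$(echo "a $(echo b)")

echo "$nested_bt"
echo "$nested_ds"
```

Both variables end up holding the same string; the $() version is simply easier to read and to extend.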

fermulator (fermulator) wrote on 2018-08-20:

#31

(this is currently in the "openconnect" path despite marked as "invalid" against that package, bug was submitted originally to that project -- can we move to avahi?)

$ dpkg -S /usr/lib/avahi/avahi-daemon-check-dns.sh
avahi-daemon: /usr/lib/avahi/avahi-daemon-check-dns.sh

$ dpkg -s avahi-daemon
Package: avahi-daemon
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 278
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: foreign
Source: avahi
Version: 0.7-3.1ubuntu1
Depends: libavahi-common3 (>= 0.6.16), libavahi-core7 (>= 0.6.24), libc6 (>= 2.14), libcap2 (>= 1:2.10), libdaemon0 (>= 0.14), libdbus-1-3 (>= 1.9.14), libexpat1 (>= 2.0.1), adduser, dbus (>= 0.60), lsb-base (>= 3.0-6), bind9-host | host
Recommends: libnss-mdns
Suggests: avahi-autoipd
Conffiles:
 /etc/avahi/avahi-daemon.conf 8d4be860ead4cacc2ba5f77e7fadb11d
 /etc/avahi/hosts 186990ae1edac95a88dbef6a36a07716
 /etc/dbus-1/system.d/avahi-dbus.conf 4b8ff37c10615ae704b7827a438ff534
 /etc/default/avahi-daemon 292bdbb95b392a71a0c363eb58b3a119
 /etc/init.d/avahi-daemon 7e648c77846d70c4ef1b49c0c4f7cfad
 /etc/network/if-up.d/avahi-daemon 6dbf1a91ab420a99d1205972d6401e67
 /etc/resolvconf/update-libc.d/avahi-daemon 2cf53ff5a00f9d1fed653a2913de5bc7
 /etc/init/avahi-cups-reload.conf 56a60d600cd80a95f2e3b6909c3bda74 obsolete
 /etc/init/avahi-daemon.conf 0303b3961d5ffee8f05805b1dd06f475 obsolete
Description: Avahi mDNS/DNS-SD daemon
 Avahi is a fully LGPL framework for Multicast DNS Service Discovery.
 It allows programs to publish and discover services and hosts
 running on a local network with no specific configuration. For
 example you can plug into a network and instantly find printers to
 print to, files to look at and people to talk to.
 .
 This package contains the Avahi Daemon which represents your machine
 on the network and allows other applications to publish and resolve
 mDNS/DNS-SD records.
Homepage: http://avahi.org/
Original-Maintainer: Utopia Maintenance Team <pkg-utopia-maintainers@lists.alioth.debian.org>

Christian Ehrhardt  (paelzer) wrote on 2018-08-21:

#32

I agree strongswan/openconnect (and maybe more) are affected by the symptom, while the bug lies in bind9-host/avahi packages at least according to current debugging.

From my experience I guess what would be great to get more traction on this is to get a shorter reproducer than setting up some sort of VPN.
To me it seems several people involved here had those steps to work on it, but no one pasted them clearly here.
So if anybody can provide that please feel free to add them here.

@Trent - you worked on a mitigation of this at least in avahi daemon and currently the status of this bug IMHO is waiting on you to update for the review feedback provided by rbasak in comment #25.
Could you do that or are you having to drop your work on it so that somebody else should take it over?

Changed in strongswan (Ubuntu):
status:	New → Invalid

Trent Lloyd (lathiat) wrote on 2018-08-21:

#35

lp1752411-avahi-host-timeout-bionic.patch (1.3 KiB, text/plain)

description:	updated

Trent Lloyd (lathiat) wrote on 2018-08-21:

#36

lp1752411-avahi-host-timeout-cosmic.patch (1.3 KiB, text/plain)

Revision history for this message

Trent Lloyd (lathiat) wrote on 2018-08-21:

#37

Request sponsorship of this upload for cosmic and then SRU to bionic
- New debdiff uploaded for both bionic and cosmic
- Fixed the SRU version for bionic
- Added a comment about the workaround to the script
- Updated bug description with SRU template

Tested the patch on bionic, using a package built from this diff, on my machine which consistently exhibits the issue; it works (albeit with a 5 second delay on network interface up; hopefully after this we can switch to fixing the actual issue with host).

The key thing I see on the machine I can reproduce this on (a linux bridge over an Intel I219-LM) is that both the interface route and the default route are in the 'linkdown' state for about 0.7 seconds in total when the host command fires. When I looked at a different machine, that state never occurred, or at least lasted for a much shorter time (I'd have to check ip monitor again).

I don't expect anyone to reproduce this for testing; I'm happy to test the -proposed packages on an affected machine.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-22:

#38

I've made some minor cleanups to the changelog, but other than that I think the most recent submission is good.

Also I found no issues in testing.
The code path it takes when the timeout triggers is that of a failing host command (bad RC) which I think is just right. It will set things in a way that it considers .local not available, but will rescan later on again - that is perfect for all our cases if later devices have recovered from the odd state.

Thanks Trent, sponsored into Cosmic (and git ubuntu tag pushed).

Please track the migration to cosmic, from there we can then consider to queue up the SRU.

no longer affects:	strongswan (Ubuntu Cosmic)
no longer affects:	strongswan (Ubuntu Bionic)
no longer affects:	openconnect (Ubuntu Cosmic)
no longer affects:	openconnect (Ubuntu Bionic)
Changed in bind9 (Ubuntu Bionic):
status:	New → Confirmed
Changed in avahi (Ubuntu Bionic):
status:	New → Triaged
Changed in avahi (Ubuntu Cosmic):
status:	Confirmed → In Progress

Revision history for this message

fermulator (fermulator) wrote on 2018-08-23:

#39

Are we sure timeout of 5 seconds is appropriate? (it FEELS too long)
My intuition says that if a DNS query takes longer than 1 second it took too long ...

However (consider also the "wait" (-W) parameter for the host command itself)
```

       -W wait
           Timeout: Wait for up to wait seconds for a reply. If wait is less
           than one, the wait interval is set to one second.

           By default, host will wait for 5 seconds for UDP responses and 10
           seconds for TCP connections. These defaults can be overridden by
           the timeout option in /etc/resolv.conf.

See also the -w option.
```

Nonetheless, this is a _workaround_ for the issue -- (will the ticket remain open to fix the underlying issue, or will a subsequent issue be submitted?)

Revision history for this message

fermulator (fermulator) wrote on 2018-08-23:

#40

PS: I've been running with a hacked /usr/lib/avahi/avahi-daemon-check-dns.sh for a few days with this code:
```
OUT=`LC_ALL=C /usr/bin/timeout 2 host -t soa local. 2>&1`
```
, works like a charm

Revision history for this message

fermulator (fermulator) wrote on 2018-08-23:

#41

We should also consider:

```
# CLEAN
fermulator@fermmy:~$ host -t soa local.
Host local. not found: 3(NXDOMAIN)
fermulator@fermmy:~$ echo $?
1

# BROKEN (host hangs)
fermulator@fermmy:~$ LC_ALL=C /usr/bin/timeout 1 host -t soa local. 2>&1
fermulator@fermmy:~$ echo $?
124

# timeout
fermulator@fermmy:~$ timeout 1 sleep 2
fermulator@fermmy:~$ echo $?
124

# no timeout
fermulator@fermmy:~$ timeout 5 sleep 1
fermulator@fermmy:~$ echo $?
0
```

Isn't the existing logic broken? (perhaps insufficient comments/documentation in this method for me to conclude either way ... the intention maybe is unclear)

```
  if [ $? -eq 0 ] ; then
    if echo "$OUT" | egrep -vq 'has no|not found'; then
      return 0
    fi
  else
    # Checking the dns servers failed. Assuming no .local unicast dns, but
    # remove the nameserver cache so we recheck the next time we're triggered
    rm -f ${NS_CACHE}
  fi

```

later it's used only here
```
if dns_has_local ; then
  # .local from dns server, disabling avahi
  disable_avahi
else
  # no .local from dns server, enabling avahi
  enable_avahi
fi
```

When host call fails (even with timeout), it returns "1" claiming "dns_has_local()=true".
```
fermulator@fermmy:~$ OUT="Host local. not found: 3(NXDOMAIN)"
fermulator@fermmy:~$ if echo "$OUT" | egrep -vq 'has no|not found'; then echo "RETURN 0"; else echo "RETURN 1"; fi
RETURN 1
```

At least the additional wrapping of timeout (workaround) doesn't make it any worse I suppose ...

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-23:

#42

Hi fermulator,
I was wondering vice versa if 5 seconds would be too short actually.
Yes the good cases will return in sub-second, but it is the bad cases we want to fix here.
Having a bit more time to recover if it can doesn't seem too bad to me.

On the question of host -W for waiting.
While it isn't fully clarified, we have to consider that in this case there is a real hang inside host or one of its syscalls due to the odd states the devices are in.
That said it could be that the call just blocks forever and a "host internal" timeout would never trigger leaving the system in just the bad state it is now.

Instead the external timeout wrapper should be immune to that, and therefore is better for this case.
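
To illustrate the difference, here is a minimal sketch of the external-wrapper approach. The helper names `run_with_deadline` and `describe_status` are made up for this example and are not part of the avahi script; the only behaviour relied on is that coreutils `timeout` exits with status 124 when it has to stop the command, as the transcripts in comment #41 show.

```shell
# Hypothetical helper (not from the avahi script): run any command under an
# external deadline enforced by coreutils "timeout".
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
run_with_deadline() {
    limit="$1"
    shift
    # timeout signals the child from outside when the limit expires,
    # so it does not depend on the child's own timeout handling.
    timeout "$limit" "$@"
}

# Hypothetical helper: turn the exit status into a readable label.
describe_status() {
    case "$1" in
        0)   echo "answered" ;;
        124) echo "timed out" ;;   # status timeout uses when the limit expired
        *)   echo "failed" ;;
    esac
}
```

For instance, `run_with_deadline 5 host -t soa local.` should give up after 5 seconds even if `host` itself never returns, which is the case an internal `-W` timeout cannot be trusted to cover.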

Revision history for this message

Trent Lloyd (lathiat) wrote on 2018-08-24:

#43

I agree with the sentiment that 5 seconds feels too long, however as a workaround I decided I would just copy the existing timeout. I certainly would not want to make it longer since this is in the critical boot path.

I would agree that in general a DNS request should fail faster; however, there are some cases where it won't, e.g. spanning tree bring-up on ports can take 2 seconds.

My hope is to correctly fix host after getting this in, since the impact is very high for affected users.

This check may actually be able to go away. I believe both systemd-resolved and libnss-mdns (the latest version, which I think is not in bionic) implement the .local label checking to do this at runtime instead of this old hack. So for cosmic+ we can probably get rid of this logic, which always sucked anyway, since we only really needed to disable nss-mdns and not avahi entirely (apps should normally resolve the IPs using avahi's API anyway, so the impact on actual avahi usage is low).

Since the impact is high but only on a smaller subset of users, I think we should go with matching the current timeout for now and worry about further improvements later.

I've verified the cosmic upload is working as expected on a non-affected system.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-24:

#44

Thanks Trent for the extra verification.
Tests also look good so far, but currently since a lot got uploaded due to feature freeze some tests take a while.
We should have that in cosmic soon and then can pick the same for Bionic.

Thanks also for your thoughts on a better long term solution.

Revision history for this message

Trent Lloyd (lathiat) wrote on 2018-08-24:

#45

> When host call fails (even with timeout), it returns "1" claiming "dns_has_local()=true".

0 = true, 1 = false (you implied the opposite)

What may add confusion here is the grep -vq check is like an extra check to make sure host didn't return 0 (success = we found .local) but then say 'not found' anyway. So it returns 0 (true) when host returns 0. It returns 1 when host returns anything else (including timeout); 1 = false which means leave avahi enabled.
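
The decision logic described above can be sketched as follows. This is an illustration only, not the packaged script: `lookup_local_soa` and `dns_has_local_sketch` are hypothetical names, `grep -E` stands in for `egrep`, and the NS_CACHE cleanup from the real script is left out.

```shell
# Hypothetical stand-in for the lookup; it can be redefined to experiment
# with different outcomes without touching the network.
lookup_local_soa() {
    LC_ALL=C timeout 5 host -t soa local. 2>&1
}

# Returns 0 (true) only when the lookup succeeded AND its output does not
# say "has no" / "not found". Every other case returns 1 (false).
dns_has_local_sketch() {
    out=$(lookup_local_soa)
    rc=$?
    if [ "$rc" -eq 0 ]; then
        # Extra guard: the lookup reported success but the text disagrees.
        if printf '%s\n' "$out" | grep -Evq 'has no|not found'; then
            return 0
        fi
    fi
    # Lookup failed, timed out (124), or the text said "not found".
    return 1
}
```

Any non-zero status from the lookup, including 124 from `timeout`, falls through to `return 1`, i.e. "no .local in unicast DNS", so avahi stays enabled.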

Revision history for this message

fermulator (fermulator) wrote on 2018-08-24:

#46

> RE -W/w in `host`
, correct -- even with timeout set, it blocks forever (I tested this several days ago in the dup'd ticket iirc)

> RE timeout
, good thoughts all - sure let's just stick with 5 seconds then

> RE logic true/false (@Trent)
, thanks yes! that'll do it; clarified now in my mind

Revision history for this message

Erich E. Hoover (ehoover) wrote on 2018-08-24:

#47

I've been trying to figure out how to test this with dig instead, and I think I found something. If you have a normal /etc/resolv.conf then you see this:
===
$ dig -t soa local.; echo $?

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2061
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;local. IN SOA

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Aug 24 09:16:38 MDT 2018
;; MSG SIZE rcvd: 34

0
===

If, instead, you add "local" to the search then you get this:
===
$ dig -t soa local.; echo $?

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; connection timed out; no servers could be reached
9
===

This may not be a good test (maybe under some other configuration some sort of response is sent?), but it might be a good idea to figure out how to accomplish this without using host.

Christian Ehrhardt  (paelzer) on 2018-08-28

Changed in avahi (Ubuntu Cosmic):
status:	In Progress → Fix Released

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-29:

#48

The suggested mitigation is in bionic-unapproved for the consideration of the SRU team.

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-29:

#49

@Erich - infinite hangs are usually due to the kernel somewhere. Suggesting dig was a good idea to try, but I wonder if we first have to find what "host" actually hangs on, to be sure that "dig" will not some day block on just the same thing.

Can one of you who is affected check, when the "host" command hangs, whether it is spinning in userspace or waiting in the kernel (wchan)?
$ cat /proc/<pid of host>/wchan
and
$ perf top -p <pid of host>
$ strace -rtf -p <pid of host>
should help to get an idea what it is blocking on.
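
A small sketch that collects the first of these for every running `host` process at once might look like the following. The function names are invented for this example; it assumes a Linux /proc and that `pgrep` (procps) is installed.

```shell
# Hypothetical helper: print the kernel wait channel of one process.
report_wchan() {
    pid="$1"
    if [ -r "/proc/$pid/wchan" ]; then
        printf '%s: %s\n' "$pid" "$(cat "/proc/$pid/wchan")"
    else
        printf '%s: wchan not readable\n' "$pid"
    fi
}

# Hypothetical helper: do the same for every process named exactly "host".
report_all_host_wchan() {
    for p in $(pgrep -x host); do
        report_wchan "$p"
    done
}
```

Calling `report_all_host_wchan` while the interface is stuck prints one line per hung `host` process; a name such as a sleep or poll function suggests it is waiting in the kernel rather than spinning in userspace.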

@Trent - you said you started on strace already, maybe you can provide the full logs here?
Also was it spinning in strace (on the same things) or just waiting?

Revision history for this message

Marc Dietrich (marvin24) wrote on 2018-08-30:

#50

strace.txt (48.7 KiB, text/plain)

In fact, there are two host commands running; both show wchan=sigsuspend, perf shows nothing, and strace shows the command is suspended. A new strace is attached.

Revision history for this message

Marc Dietrich (marvin24) wrote on 2018-08-30:

#51

gdb.txt (8.7 KiB, text/plain)

Also attaching a full gdb backtrace, which shows that epoll_wait is called with a timeout value of "-1" (infinite) in thread #4.

Revision history for this message

Marc Dietrich (marvin24) wrote on 2018-08-30:

#52

I found a source reference here: https://github.com/fanf2/bind-9/blob/master/lib/isc/unix/socket.c#L4127

Revision history for this message

Brian Murray (brian-murray) wrote on 2018-08-30: Please test proposed package

#53

Hello Liam, or anyone else affected,

Accepted avahi into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/avahi/0.7-3.1ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in avahi (Ubuntu Bionic):
status:	Triaged → Fix Committed
tags:	added: verification-needed verification-needed-bionic

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-31:

#54

@Trent - since you had the most reproducible setup could you take a look at verifying also the Bionic upload?

Also anyone else affected with the VPN cases please give it a try.

Finally, thanks Marc D. for adding all the data.
If it is really hanging on that epoll forever, we might want to report that upstream as a bug. I feel we now have enough data for that; I'll take a look later at what a bug report to bind needs ...

Revision history for this message

Christian Ehrhardt  (paelzer) wrote on 2018-08-31:

#55

Reported upstream as https://gitlab.isc.org/isc-projects/bind9/issues/520

Revision history for this message

Liam (liam-smit) wrote on 2018-08-31:

#56

I tried the numerous suggested fixes and they all worked for me, including uninstalling "avahi-daemon". That last one resolved my problem so I can't test the latest fix.

Revision history for this message

Laban Sköllermark (laban-skoller) wrote on 2018-08-31: Re: [Bug 1752411] Re: bind9-host, avahi-daemon-check-dns.sh hang forever causes network connections to get stuck

#57

Hi Liam!

I also uninstalled avahi-daemon some months ago but I verified that the
problem came back when I installed it again, so I think you should still be
able to verify the fix if you want to.

Best regards
Laban

On Fri, Aug 31, 2018, 11:21 Liam <email address hidden> wrote:

> I tried the numerous suggested fixes and they all worked for me including
> uninstalling "avahi-daemon". That last one resolved my problem so I
> can't test the latest fix.
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1752411
>
> Title:
> bind9-host, avahi-daemon-check-dns.sh hang forever causes network
> connections to get stuck
>
> Status in avahi package in Ubuntu:
> Fix Released
> Status in bind9 package in Ubuntu:
> Confirmed
> Status in openconnect package in Ubuntu:
> Invalid
> Status in strongswan package in Ubuntu:
> Invalid
> Status in avahi source package in Bionic:
> Fix Committed
> Status in bind9 source package in Bionic:
> Confirmed
> Status in avahi source package in Cosmic:
> Fix Released
> Status in bind9 source package in Cosmic:
> Confirmed
> Status in avahi package in Debian:
> New
>
> Bug description:
> [Impact]
>
> * Network connections for some users fail (in some cases a direct
> interface, in others when connecting a VPN) because the 'host' command
> to check for .local in DNS called by /usr/lib/avahi/avahi-daemon-check-dns.sh
> never times out like it should - leaving the script
> hanging indefinitely blocking interface up and start-up. This appears
> to be a bug in host caused in some circumstances however we implement
> a workaround to call it under 'timeout' as the issue with 'host' has
> not easily been identified, and in any case acts as a fall-back.
>
> [Test Case]
>
> * Multiple people have been unable to create a reproducer on a
> generic machine (e.g. it does not occur in a VM), I have a specific
> machine I can reproduce it on (a Skull Canyon NUC with Intel I219-LM)
> by simply "ifdown br0; ifup br0" and there are clearly 10s of other
> users affected in varying circumstances that all involve the same
> symptoms but no clear test case exists. Best I can suggest is that I
> test the patch on my system to ensure it works as expected, and the
> change is only 1 line which is fairly easily auditable and
> understandable.
>
> [Regression Potential]
>
> * The change is a single line change to the shell script to call host
> with "timeout". When tested on working and non-working system this appears
> to function as expected. I believe the regression potential for this is
> subsequently low.
> * In attempt to anticipate possible issues, I checked that the timeout
> command is in the same path (/usr/bin) as the host command that is already
> called without a path, and the coreutils package (which contains timeout)
> is an Essential package. I also checked that timeout is not a built-in in
> bash, for those that have changed /bin/sh to bash (just in case).
>
> [Other Info]
>
> * N/A
>
> [Original Bug Description]
>
> On 18.04 Openconnect connects successfully to any of multiple VP...


	Status	Importance	Assigned to
avahi (Debian)	Fix Released	Unknown	debbugs #898038
avahi (Ubuntu)	Fix Released	High	Trent Lloyd
Bionic	Fix Released	High	Trent Lloyd
Cosmic	Fix Released	High	Trent Lloyd
bind9 (Ubuntu)	Confirmed	High	Unassigned
Bionic	Confirmed	Undecided	Unassigned
Cosmic	Won't Fix	High	Unassigned
openconnect (Ubuntu)	Invalid	Undecided	Unassigned
strongswan (Ubuntu)	Invalid	Undecided	Unassigned

Changed in avahi (Ubuntu Bionic):
status:	Fix Committed → Fix Released

Changed in bind9 (Ubuntu Cosmic):
status:	Confirmed → Won't Fix
