bind9-host, avahi-daemon-check-dns.sh hang forever causes network connections to get stuck

Bug #1752411 reported by Liam on 2018-02-28
This bug affects 20 people
Affects | Status | Importance | Assigned to | Milestone
avahi (Debian) | New | Unknown | |
avahi (Ubuntu) | Status tracked in Cosmic | | |
  Bionic | | High | Trent Lloyd |
  Cosmic | | High | Trent Lloyd |
bind9 (Ubuntu) | Status tracked in Cosmic | | |
  Bionic | | Undecided | Unassigned |
  Cosmic | | High | Unassigned |
openconnect (Ubuntu) | | Undecided | Unassigned |
strongswan (Ubuntu) | | Undecided | Unassigned |

Bug Description

[Impact]

 * Network connections for some users fail (in some cases on a direct interface, in others when connecting a VPN) because the 'host' command, which /usr/lib/avahi/avahi-daemon-check-dns.sh calls to check for .local in DNS, never times out as it should. The script then hangs indefinitely, blocking interface bring-up and start-up. This appears to be a bug in 'host' that is triggered only in some circumstances. Because the underlying problem in 'host' has not been easy to identify, we implement a workaround that calls it under 'timeout'; this is in any case useful as a fall-back.

[Test Case]

 * Multiple people have been unable to create a reproducer on a generic machine (e.g. it does not occur in a VM). I have one specific machine on which I can reproduce it (a Skull Canyon NUC with Intel I219-LM) simply by running "ifdown br0; ifup br0". There are clearly tens of other users affected in varying circumstances that all involve the same symptoms, but no clear test case exists. The best I can suggest is that I test the patch on my system to ensure it works as expected. The change is only one line, which is fairly easy to audit and understand.

[Regression Potential]

 * The change is a single-line change to the shell script to call host with "timeout". When tested on working and non-working systems it functions as expected. I believe the regression potential is consequently low.
 * In an attempt to anticipate possible issues, I checked that the timeout command is in the same path (/usr/bin) as the host command that is already called without a path, and that the coreutils package (which contains timeout) is an Essential package. I also checked that timeout is not a built-in in bash, for those who have changed /bin/sh to bash (just in case).
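
The path and built-in checks described above can be approximated with the sketch below. It is only an illustration, not the commands actually run for this SRU; `is_external_command` is a hypothetical helper name, not something shipped in the avahi package.

```shell
# Hypothetical helper (not part of the avahi package): succeeds only when
# the name resolves to a program on disk rather than a builtin, function,
# or alias. "command -v" prints an absolute path for external programs.
is_external_command() {
  case "$(command -v "$1" 2>/dev/null)" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

for cmd in timeout host; do
  if is_external_command "$cmd"; then
    echo "$cmd: external program at $(command -v "$cmd")"
  else
    echo "$cmd: not an external program in PATH (builtin, alias, or missing)"
  fi
done
```

Running this under both dash and bash gives a quick indication of whether the script would pick up the same `timeout` binary regardless of what /bin/sh points to.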

[Other Info]

 * N/A

[Original Bug Description]

On 18.04 Openconnect connects successfully to any of multiple VPN concentrators but network traffic does not flow across the VPN tunnel connection. When testing on 16.04 this works flawlessly. This also worked on this system when it was on 17.10.

I have tried reducing the mtu of the tun0 network device but this has not resulted in me being able to successfully ping the IP address.

Example showing ping attempt to the IP of DNS server:

~$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
# 127.0.0.53 is the systemd-resolved stub resolver.
# run "systemd-resolve --status" to see details about the actual nameservers.

nameserver 172.29.88.11
nameserver 127.0.0.53

liam@liam-lat:~$ netstat -nr
Kernel IP routing table
Destination Gateway Genmask Flags MSS Window irtt Iface
0.0.0.0 192.168.1.1 0.0.0.0 UG 0 0 0 wlp2s0
105.27.198.106 192.168.1.1 255.255.255.255 UGH 0 0 0 wlp2s0
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
172.29.0.0 0.0.0.0 255.255.0.0 U 0 0 0 tun0
172.29.88.11 0.0.0.0 255.255.255.255 UH 0 0 0 tun0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 wlp2s0
liam@liam-lat:~$ ping 172.29.88.11
PING 172.29.88.11 (172.29.88.11) 56(84) bytes of data.
^C
--- 172.29.88.11 ping statistics ---
4 packets transmitted, 0 received, 100% packet loss, time 3054ms

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: openconnect 7.08-3
ProcVersionSignature: Ubuntu 4.15.0-10.11-generic 4.15.3
Uname: Linux 4.15.0-10-generic x86_64
ApportVersion: 2.20.8-0ubuntu10
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Wed Feb 28 22:11:33 2018
InstallationDate: Installed on 2017-06-15 (258 days ago)
InstallationMedia: Ubuntu 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
SourcePackage: openconnect
UpgradeStatus: Upgraded to bionic on 2018-02-22 (6 days ago)

Liam (liam-smit) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in openconnect (Ubuntu):
status: New → Confirmed
Marc Dietrich (marvin24) wrote :

I'm also affected. Looking at the process list (ps ax) I found:

23386 pts/4 S+ 0:00 openconnect -vvvvv -s /usr/share/vpnc-scripts/vpnc-script vpn.example.com
23400 pts/4 S+ 0:00 /bin/sh -c /usr/share/vpnc-scripts/vpnc-script
23405 pts/4 S+ 0:00 /bin/sh /usr/share/vpnc-scripts/vpnc-script
23440 ? Ssl 0:00 /usr/lib/NetworkManager/nm-dispatcher
23443 ? S 0:00 /bin/sh -e /etc/NetworkManager/dispatcher.d/01-ifupdown tun0 up
23468 pts/4 S+ 0:00 run-parts --arg=-a --arg=tun0 /etc/resolvconf/update.d
23479 pts/4 S+ 0:00 run-parts /etc/resolvconf/update-libc.d
23500 ? S 0:00 /bin/sh /etc/network/if-up.d/ntpdate
23502 ? S 0:00 flock -n /run/lock/ntpdate /usr/sbin/ntpdate-debian -s
23504 ? S< 0:00 /usr/sbin/ntpdate -s de.pool.ntp.org
23515 pts/4 S+ 0:00 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
23534 pts/4 Sl+ 0:00 host -t soa local.
23539 ? S 0:00 run-parts /etc/network/if-up.d
23550 ? S 0:00 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
23567 ? Sl 0:00 host -t soa local.
23574 pts/3 R+ 0:00 ps ax

The "avahi-daemon-check-dns.sh" process hangs, maybe because the route to DNS isn't set up yet (vpnc-script is still running). If I kill this process (PID 23550), the script continues to run and the connection is alive and stable.

Marc Dietrich (marvin24) wrote :

in fact, the "host" command is hanging, sorry.

Liam (liam-smit) wrote :

Nice work Marc.

I can confirm that if I kill the host command (with PID 23567 in your example) then my VPN connection works.
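
The manual workaround from the comments above (find the hung "host -t soa local." process and kill it) could be scripted roughly as follows. This is only a sketch: `find_stuck_host` and the 10-second threshold are assumptions made for illustration, and the `etimes` column requires a procps-ng `ps`.

```shell
# Hypothetical helper: read "PID ELAPSED_SECONDS ARGS..." lines on stdin and
# print the PID of every "host -t soa local." process that has been running
# for more than $1 seconds (default 10).
find_stuck_host() {
  awk -v min="${1:-10}" '
    $2 + 0 > min + 0 && $3 == "host" && $4 == "-t" && $5 == "soa" && $6 == "local." { print $1 }
  '
}

# List candidates first; only kill after checking the list by hand.
ps -e -o pid= -o etimes= -o args= 2>/dev/null | find_stuck_host 10
# ps -e -o pid= -o etimes= -o args= | find_stuck_host 10 | xargs -r kill
```

Matching on the exact argument list keeps the filter from touching unrelated `host` invocations that a user may have started interactively.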

Trent Lloyd (lathiat) wrote :

The default timeout for the 'host' command is 10 seconds. Is it taking longer than that?

Marc Dietrich (marvin24) wrote :

If you define "infinite" as longer than 10 seconds - yes. Something is quite bogus here.

Trent Lloyd (lathiat) wrote :

I ran into this myself today after upgrading a machine to bionic..

Two copies of it were running at once, both stuck on host.
If I execute a new 'host' command it works, but the existing ones are stuck.

root 14181 0.0 0.0 4628 868 ? Ss 13:05 0:00 /bin/sh -c cat /run/systemd/resolve/stub-resolv.conf | /sbin/resolvconf -a systemd-resolved
root 14292 0.0 0.0 4520 752 ? S 13:05 0:00 run-parts --arg=-a --arg=systemd-resolved /etc/resolvconf/update.d
root 14320 0.0 0.0 4520 748 ? S 13:05 0:00 run-parts /etc/resolvconf/update-libc.d
root 14354 0.0 0.0 4628 1672 ? S 13:05 0:00 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
root 14607 0.0 0.0 187532 8380 ? Sl 13:05 0:00 host -t soa local.

root 13775 0.0 0.0 4628 868 ? Ss 13:05 0:00 /bin/sh -ec ifup --allow=hotplug eno1; ifup --allow=auto eno1; if ifquery eno1
 >/dev/null; then ifquery --state eno1 >/dev/null; fi
root 13787 0.0 0.0 4592 1868 ? S 13:05 0:00 ifup --allow=auto eno1
root 14179 0.0 0.0 4628 772 ? S 13:05 0:00 /bin/sh -c /bin/run-parts --exit-on-error /etc/network/if-up.d
root 14182 0.0 0.0 4520 768 ? S 13:05 0:00 /bin/run-parts --exit-on-error /etc/network/if-up.d
root 14183 0.0 0.0 4628 772 ? S 13:05 0:00 /bin/sh /etc/network/if-up.d/000resolvconf
root 14461 0.0 0.0 4520 752 ? S 13:05 0:00 run-parts --arg=-a --arg=eno1.inet /etc/resolvconf/update.d
root 14479 0.0 0.0 4520 728 ? S 13:05 0:00 run-parts /etc/resolvconf/update-libc.d
root 14503 0.0 0.0 4628 1612 ? S 13:05 0:00 /bin/sh /usr/lib/avahi/avahi-daemon-check-dns.sh
root 14606 0.0 0.0 187532 8384 ? Sl 13:05 0:00 host -t soa local.

(gdb) t a a bt

Thread 4 (Thread 0x7f3231f00700 (LWP 14796)):
#0 0x00007f3237b06bb7 in epoll_wait (epfd=5, events=0x7f3238eb1010, maxevents=64, timeout=timeout@entry=-1)
    at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x00007f323804773b in watcher (uap=0x7f3238eb0010) at ../../../../lib/isc/unix/socket.c:4280
#2 0x00007f3237ddd6db in start_thread (arg=0x7f3231f00700) at pthread_create.c:463
#3 0x00007f3237b0688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 3 (Thread 0x7f3232701700 (LWP 14794)):
#0 0x00007f3237de39f3 in futex_wait_cancelable (private=<optimised out>, expected=0, futex_word=0x7f3238eae0a4)
    at ../sysdeps/unix/sysv/linux/futex-internal.h:88
#1 __pthread_cond_wait_common (abstime=0x0, mutex=0x7f3238eae028, cond=0x7f3238eae078) at pthread_cond_wait.c:502
#2 __pthread_cond_wait (cond=0x7f3238eae078, mutex=mutex@entry=0x7f3238eae028) at pthread_cond_wait.c:655
#3 0x00007f3238039370 in run (uap=0x7f3238eae010) at ../../../lib/isc/timer.c:808
#4 0x00007f3237ddd6db in start_thread (arg=0x7f3232701700) at pthread_create.c:463
#5 0x00007f3237b0688f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Thread 2 (Thread 0x7f3232f02700 (LWP ...


Trent Lloyd (lathiat) wrote :

No VPN is in use. This is probably equally a bug in bind9-host and avahi-daemon.

The host command shouldn't be getting stuck, and avahi should probably make the script time out somehow.

Changed in bind9 (Ubuntu):
importance: Undecided → Critical
Changed in avahi (Ubuntu):
importance: Undecided → High
Changed in bind9 (Ubuntu):
importance: Critical → High
status: New → Confirmed
Changed in avahi (Ubuntu):
status: New → Confirmed
summary: - Can not ping IP addresses on remote network after connect
+ bind9-host, avahi-daemon-check-dns.sh hang forever causes network
+ connections to get stuck
Trent Lloyd (lathiat) wrote :

I did some testing using strace and looked at backtraces to see why "host" is stuck, and it's not immediately clear to me why. I will need to look in more depth, tracing its actual execution; it's multi-threaded and uses poll, so the trace is not straightforward for someone unfamiliar with the code base.

I did test that when it happens, the network interfaces are up and systemd-resolved is started, and I can see a sendmsg/recvmsg that appears to succeed to the systemd stub resolver and my local DNS server. I also tried explicitly setting the timeout with host -W 5 (this should be the default, but I wanted to test as there is a -w indefinite option). However, the 'host' command always works when I log into the system while the other commands are still stuck in the background, so something strange is going on.

What does work, is executing 'host' under /usr/bin/timeout. Given the severity of this issue (makes startup hang without SSH for several minutes, and blocks everything else from starting up seemingly forever), I would suggest that we should ship a fix for bionic to use timeout to work around the issue for now.

/usr/lib/avahi/avahi-daemon-check-dns.sh : dns_has_local()
  OUT=`LC_ALL=C /usr/bin/timeout 5 host -t soa local. 2>&1`

Changed in openconnect (Ubuntu):
status: Confirmed → Invalid
Andreas Hasenack (ahasenack) wrote :

Can you try running "host" with -d in the scenario where it is hanging?

Also, a fresh bionic system shouldn't run the ifup-down scripts, since it uses netplan, unless openconnect or one of its dependencies pulls ifupdown in explicitly. A quick apt-cache rdepends didn't show that.

Marc Dietrich (marvin24) wrote :

My system was upgraded, so maybe a leftover. So I uninstalled ifupdown and upstart. Problem still persists :-(
I will attach a ltrace -Sf host output - maybe it helps...

Trent Lloyd (lathiat) wrote :

With host -d I simply get

> Trying "local"

When it works normally I get;

Trying "local"
Host local. not found: 3(NXDOMAIN)
Received 98 bytes from 10.48.134.6#53 in 1 ms
Received 98 bytes from 10.48.134.6#53 in 1 ms

The system I am hitting this issue on is an upgraded system (rather than a fresh install which wouldn't use ifupdown)

Because this is a serious issue for bionic upgraders and release is imminent, I am attaching a debdiff that uses 'timeout' to fix the issue for now. The core issue with 'host' probably still needs to be investigated (as this may add 5s delays to boot-up); however, the timeout is probably a good backup anyway. Arguably the entire check-dns script should be launched under timeout.

Changed in avahi (Ubuntu):
assignee: nobody → Trent Lloyd (lathiat)
Trent Lloyd (lathiat) wrote :

There is a new bind9 upload to bionic-proposed (9.11.3+dfsg-1ubuntu1)

Tested with this version and 'host' is still hanging. So this fix is still required.

Andreas Hasenack (ahasenack) wrote :

Some troubleshooting I did with Trent today showed:

a) the "host -t soa local." call triggered a query to 127.0.0.53 as expected, network-wise, which got a response right away

b) we snapshotted ip route and ip addr just before the host call, and saw that the interface responsible for the default route (and route to his dns server) was still down. I wonder if dns_reachable() in /usr/lib/avahi/avahi-daemon-check-dns.sh is doing the right thing. It looks for 127.0.0.1 (and not 127.0.0.53), and, failing that, for a default route. The default route exists, but the link it goes through is still down:

2: eno1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN group default qlen 1000

default via x.x.x.x dev eno1 onlink linkdown
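
The situation described above, where a default route exists but is flagged "linkdown", can be detected with a check along these lines. This is a sketch under assumptions: `default_route_usable` is a hypothetical helper, and it is not what dns_reachable() currently does (the body of that function is not shown in this report).

```shell
# Hypothetical helper: reads the output of "ip route show default" on stdin
# and succeeds only if at least one default route is not flagged "linkdown".
default_route_usable() {
  awk '
    /^default / && !/(^| )linkdown( |$)/ { ok = 1 }
    END { exit ok ? 0 : 1 }
  '
}

if ip route show default 2>/dev/null | default_route_usable; then
  echo "default route looks usable"
else
  echo "no default route, or its link is still down"
fi
```

Reading the route text from stdin keeps the helper easy to exercise with captured output such as the "default via x.x.x.x dev eno1 onlink linkdown" line above.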

Trent Lloyd (lathiat) wrote :

I'd still like to get the upload debdiff for 'timeout' that I prepared uploaded. Even if we manage to debug the bind9-host issue, it will still be useful to have the timeout command there as a backup. Not long before we run out of time for bionic release.

I am actively looking at the bind9-host issue also, but I do not expect to get that fixed before release.

Andreas Hasenack (ahasenack) wrote :

I think that time is past, we were in beta freeze in the past week, and are in final freeze now. Unless there is a clear test case showing under which conditions this happens and how widespread it is, it's probably best to start thinking in SRU terms.

It looks like a safe change, but since I don't understand the problem entirely yet (when it happens, why), I can't say.

Andrej Shadura (andrew.sh) wrote :
Andreas Hasenack (ahasenack) wrote :

It should probably also check for 127.0.0.53, not just 127.0.0.1
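
One way to cover both 127.0.0.1 and 127.0.0.53 is to accept any loopback nameserver. The sketch below assumes that matching the whole 127.0.0.0/8 range is acceptable, which goes slightly beyond the suggestion above; `has_loopback_nameserver` is a hypothetical helper and not the existing dns_reachable() code.

```shell
# Hypothetical helper: reads resolv.conf-style text on stdin and succeeds if
# any nameserver entry is a loopback address (127.0.0.1, 127.0.0.53, ...).
has_loopback_nameserver() {
  awk '
    $1 == "nameserver" && $2 ~ /^127\./ { found = 1 }
    END { exit found ? 0 : 1 }
  '
}

if [ -r /etc/resolv.conf ] && has_loopback_nameserver < /etc/resolv.conf; then
  echo "a local (loopback) resolver is configured"
else
  echo "no loopback resolver found in /etc/resolv.conf"
fi
```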

Changed in avahi (Debian):
status: Unknown → New
Trent Lloyd (lathiat) wrote :

Sponsors: Can we get this debdiff uploaded now? We've had a few more reports and I'd like to get this workaround in place.

David Sitsky (david-sitsky) wrote :

I was hit by this exact problem (update to Ubuntu 18.04), the host command would hang and adding timeout works around the issue. Please put the temporary fix in so others are not affected.

I've run into this problem as well after upgrading my Ubuntu 17.10 installation (upgraded from 17.04) to Ubuntu 18.04 last week. My VPN script calling "openconnect" hung and no packets got forwarded. My first workaround, however, was to connect using nm-applet instead, which worked fine (available via package network-manager-openconnect-gnome).

I found this thread after a tip from a colleague after asking for help. It would be nice to have a proper fix for users upgrading to Bionic. The timeout fix/workaround would be good enough in my opinion.

@lathiat: Thanks for the fix in avahi-daemon-check-dns.sh (wrapping "timeout" around the "host" call). I can confirm that this solves the problem for me as well.

A note from my setup:
I have *two* stuck "host -t soa local." processes launched by two different "avahi-daemon-check-dns.sh" instances. pstree says that one is launched by "vpnc-script" (launched by openconnect) and the other one is launched by "01-ifupdown" launched by "nm-dispatcher". Killing the host process started by openconnect->vpnc-script solves the problem.

And as @marvin24 said, uninstalling "ifupdown" does *not* solve the problem.

@muetze-bsw (in duplicate Bug #1772692): I can confirm that uninstalling "avahi-daemon" solves the problem. This will be my "permanent" workaround.

Trent Lloyd (lathiat) wrote :

Hoping to get attention to this again. Since 18.04.1 is out now, more and more users are likely to hit this issue as more users will be upgrading. This issue applies equally to desktop and server scenarios.

I would like to get lp1752411-avahi-host-timeout.diff sponsored for upload please

Robie Basak (racb) on 2018-08-02
tags: added: server-next
Robie Basak (racb) wrote :

Thank you for working on this. I agree with your approach. The debdiff looks good.

I think that though it's clear that the bug is in the host command, given that we haven't been able to figure out the fix (I spent some time on it too), it's reasonable to add the timeout command as in your patch as a workaround. No need for the problem to persist when the workaround is so clean and clear. I also approve the timeout workaround for SRU in principle. We can leave a bug task open for bind, but consider a separate bug task resolved in avahi packaging once this workaround is applied.

A couple of comments from your current debdiff:

Please leave a comment above the timeout line explaining why it is there ("Workaround for LP: #1752411" is sufficient). For the SRU, I would prefer a version string of "0.7-3.1ubuntu1.1" ("0.7-3.1ubuntu2" is technically OK but doesn't convey that it is an SRU so well).

Please could you prepare a debdiff for Cosmic so that we can fix it there first? Then follow https://wiki.ubuntu.com/StableReleaseUpdates#Procedure and attach an updated debdiff for the SRU to Bionic. I'll be happy to sponsor both, but will then need review from another SRU team member to accept it from the queue.

Simon Quigley (tsimonq2) wrote :

Unsubscribing sponsors for now, awaiting the fixes Robie commented about.

FYI, bug 1786261 could be another symptom of this. The reporters there will take a look and might add another affected package to this bug.

Erich E. Hoover (ehoover) wrote :

Definitely a duplicate. @paelzer was suggesting in bug 1786261 that dig be used instead, do you guys know if that's a reasonable possibility?

fermulator (fermulator) wrote :

Notes;

when "things are working", host does either:

while on VPN:
{{{
$ LC_ALL=C host -t soa local.
Host local. not found: 3(NXDOMAIN)

$ LC_ALL=C dig -t soa local.

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 7637
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: e1ff5e7222ad62da (echoed)
;; QUESTION SECTION:
;local. IN SOA

;; Query time: 21 msec
;; SERVER: 192.168.194.20#53(192.168.194.20)
;; WHEN: Mon Aug 20 12:01:19 EDT 2018
;; MSG SIZE rcvd: 46
}}}

while off VPN:
{{{
$ LC_ALL=C host -t soa local.
Host local not found: 2(SERVFAIL)

$ LC_ALL=C dig -t soa local.

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 61619
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;local. IN SOA

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Aug 20 12:02:24 EDT 2018
;; MSG SIZE rcvd: 34

}}}

=====
while in the broken/hung state:
             ^^^^^^^^^^^
=====

{{{
$ LC_ALL=C host -t soa local.

<HANGS FOREVER>

:(

}}}
 (even hangs w/ "-W 1") ...

The dig command, by contrast, returns:
{{{
$ LC_ALL=C dig -t soa local.

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 16967
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;local. IN SOA

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Mon Aug 20 11:56:58 EDT 2018
;; MSG SIZE rcvd: 34
}}}

(I am not familiar enough with "SOA local." lookups, though, to say whether dig can replace the host invocation in this method)

/usr/lib/avahi/avahi-daemon-check-dns.sh

dns_has_local() {
  # Some magic to do tests
  if [ -n "${FAKE_HOST_RETURN}" ] ; then
    if [ "${FAKE_HOST_RETURN}" = "true" ]; then
      return 0;
    else
      return 1;
    fi
  fi

  OUT=`LC_ALL=C host -t soa local. 2>&1`
  if [ $? -eq 0 ] ; then
    if echo "$OUT" | egrep -vq 'has no|not found'; then
      return 0
    fi
  else
    # Checking the dns servers failed. Assuming no .local unicast dns, but
    # remove the nameserver cache so we recheck the next time we're triggered
    rm -f ${NS_CACHE}
  fi
  return 1
}
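
For comparison, the function above with the proposed workaround applied might look like the sketch below. It is an approximation based on the one-line change quoted earlier in this report, not the contents of the actual debdiff. The `NS_CACHE` default is a placeholder so the snippet runs on its own, and `$(...)` and `grep -E` replace the backticks and `egrep` purely as stylistic choices.

```shell
# Placeholder so the sketch is self-contained; the real script defines
# NS_CACHE elsewhere, and this path is hypothetical.
NS_CACHE="${NS_CACHE:-/tmp/avahi-checked-nameservers.example}"

dns_has_local() {
  # Test hook carried over from the original function
  if [ -n "${FAKE_HOST_RETURN:-}" ] ; then
    if [ "${FAKE_HOST_RETURN}" = "true" ]; then
      return 0
    else
      return 1
    fi
  fi

  # Workaround for LP: #1752411 (host can hang forever)
  OUT=$(LC_ALL=C /usr/bin/timeout 5 host -t soa local. 2>&1)
  if [ $? -eq 0 ] ; then
    if echo "$OUT" | grep -Evq 'has no|not found'; then
      return 0
    fi
  else
    # host failed or timed out (timeout exits with 124): assume no unicast
    # .local, and drop the nameserver cache so the check is repeated later
    rm -f "${NS_CACHE}"
  fi
  return 1
}
```

When the timeout fires, the function takes the same branch as any other failing host call, so callers see "no .local in DNS" and avahi stays enabled.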

fermulator (fermulator) wrote :

(btw, while we're fixing that script: change the backticks to the POSIX-compliant $() form of command substitution?)

fermulator (fermulator) wrote :

(this is currently in the "openconnect" path despite marked as "invalid" against that package, bug was submitted originally to that project -- can we move to avahi?)

$ dpkg -S /usr/lib/avahi/avahi-daemon-check-dns.sh
avahi-daemon: /usr/lib/avahi/avahi-daemon-check-dns.sh

$ dpkg -s avahi-daemon
Package: avahi-daemon
Status: install ok installed
Priority: optional
Section: net
Installed-Size: 278
Maintainer: Ubuntu Developers <email address hidden>
Architecture: amd64
Multi-Arch: foreign
Source: avahi
Version: 0.7-3.1ubuntu1
Depends: libavahi-common3 (>= 0.6.16), libavahi-core7 (>= 0.6.24), libc6 (>= 2.14), libcap2 (>= 1:2.10), libdaemon0 (>= 0.14), libdbus-1-3 (>= 1.9.14), libexpat1 (>= 2.0.1), adduser, dbus (>= 0.60), lsb-base (>= 3.0-6), bind9-host | host
Recommends: libnss-mdns
Suggests: avahi-autoipd
Conffiles:
 /etc/avahi/avahi-daemon.conf 8d4be860ead4cacc2ba5f77e7fadb11d
 /etc/avahi/hosts 186990ae1edac95a88dbef6a36a07716
 /etc/dbus-1/system.d/avahi-dbus.conf 4b8ff37c10615ae704b7827a438ff534
 /etc/default/avahi-daemon 292bdbb95b392a71a0c363eb58b3a119
 /etc/init.d/avahi-daemon 7e648c77846d70c4ef1b49c0c4f7cfad
 /etc/network/if-up.d/avahi-daemon 6dbf1a91ab420a99d1205972d6401e67
 /etc/resolvconf/update-libc.d/avahi-daemon 2cf53ff5a00f9d1fed653a2913de5bc7
 /etc/init/avahi-cups-reload.conf 56a60d600cd80a95f2e3b6909c3bda74 obsolete
 /etc/init/avahi-daemon.conf 0303b3961d5ffee8f05805b1dd06f475 obsolete
Description: Avahi mDNS/DNS-SD daemon
 Avahi is a fully LGPL framework for Multicast DNS Service Discovery.
 It allows programs to publish and discover services and hosts
 running on a local network with no specific configuration. For
 example you can plug into a network and instantly find printers to
 print to, files to look at and people to talk to.
 .
 This package contains the Avahi Daemon which represents your machine
 on the network and allows other applications to publish and resolve
 mDNS/DNS-SD records.
Homepage: http://avahi.org/
Original-Maintainer: Utopia Maintenance Team <email address hidden>

I agree strongswan/openconnect (and maybe more) are affected by the symptom, while the bug lies in bind9-host/avahi packages at least according to current debugging.

From my experience I guess what would be great to get more traction on this is to get a shorter reproducer than setting up some sort of VPN.
To me it seems several people involved here had those steps to work on it, but no one pasted them clearly here.
So if anybody can provide that please feel free to add them here.

@Trent - you worked on a mitigation of this at least in avahi daemon and currently the status of this bug IMHO is waiting on you to update for the review feedback provided by rbasak in comment #25.
Could you do that or are you having to drop your work on it so that somebody else should take it over?

Changed in strongswan (Ubuntu):
status: New → Invalid
Trent Lloyd (lathiat) wrote :
description: updated
Trent Lloyd (lathiat) wrote :
Trent Lloyd (lathiat) wrote :

Request sponsorship of this upload for cosmic and then SRU to bionic
 - New debdiff uploaded for both bionic and cosmic
 - Fixed the SRU version for bionic
 - Added a comment about the workaround to the script
 - Updated bug description with SRU template

Tested the patch on bionic, using a package built from this diff, on my machine which consistently exhibits the issue; it works (albeit with a 5 second delay on network interface up; hopefully after this we can switch to fixing the actual issue with host).

The key thing I see on the machine I can reproduce this on (a Linux bridge over an Intel I219-LM) is that both the interface route and the default route are in the 'linkdown' state, for about 0.7 seconds in total, when the host command fires. When I looked at a different machine, that stage never happened, or at least lasted a much shorter time (I'd have to check ip monitor again).

I don't expect anyone to reproduce this for testing; I'm happy to test the -proposed packages on an affected machine.

I've had some minor cleanups on the changelog, but other than that I think the most recent submission is good.

Also I found no issues in testing.
The code path it takes when the timeout triggers is that of a failing host command (bad RC) which I think is just right. It will set things in a way that it considers .local not available, but will rescan later on again - that is perfect for all our cases if later devices have recovered from the odd state.

Thanks Trent, sponsored into Cosmic (and git ubuntu tag pushed).

Please track the migration to cosmic, from there we can then consider to queue up the SRU.

no longer affects: strongswan (Ubuntu Cosmic)
no longer affects: strongswan (Ubuntu Bionic)
no longer affects: openconnect (Ubuntu Cosmic)
no longer affects: openconnect (Ubuntu Bionic)
Changed in bind9 (Ubuntu Bionic):
status: New → Confirmed
Changed in avahi (Ubuntu Bionic):
status: New → Triaged
Changed in avahi (Ubuntu Cosmic):
status: Confirmed → In Progress
fermulator (fermulator) wrote :

Are we sure timeout of 5 seconds is appropriate? (it FEELS too long)
My intuition says that if a DNS query takes longer than 1 second it took too long ...

However (consider also the "wait" (-W) parameter for the host command itself)
```

       -W wait
           Timeout: Wait for up to wait seconds for a reply. If wait is less
           than one, the wait interval is set to one second.

           By default, host will wait for 5 seconds for UDP responses and 10
           seconds for TCP connections. These defaults can be overridden by
           the timeout option in /etc/resolv.conf.

           See also the -w option.
```

Nonetheless, this is a _workaround_ for the issue (will the ticket remain open to fix the underlying issue, or will a subsequent issue be submitted?)

fermulator (fermulator) wrote :

PS: I've been running with a hacked /usr/lib/avahi/avahi-daemon-check-dns.sh for a few days with this code:
```
  OUT=`LC_ALL=C /usr/bin/timeout 2 host -t soa local. 2>&1`
```
, works like a charm

fermulator (fermulator) wrote :

We should also consider:

```
# CLEAN
fermulator@fermmy:~$ host -t soa local.
Host local. not found: 3(NXDOMAIN)
fermulator@fermmy:~$ echo $?
1

# BROKEN (host hangs)
fermulator@fermmy:~$ LC_ALL=C /usr/bin/timeout 1 host -t soa local. 2>&1
fermulator@fermmy:~$ echo $?
124

# timeout
fermulator@fermmy:~$ timeout 1 sleep 2
fermulator@fermmy:~$ echo $?
124

# no timeout
fermulator@fermmy:~$ timeout 5 sleep 1
fermulator@fermmy:~$ echo $?
0
```

Isn't the existing logic broken? (perhaps insufficient comments/documentation in this method for me to conclude either way ... the intention maybe is unclear)

```
  if [ $? -eq 0 ] ; then
    if echo "$OUT" | egrep -vq 'has no|not found'; then
      return 0
    fi
  else
    # Checking the dns servers failed. Assuming no .local unicast dns, but
    # remove the nameserver cache so we recheck the next time we're triggered
    rm -f ${NS_CACHE}
  fi

```

later it's used only here
```
if dns_has_local ; then
  # .local from dns server, disabling avahi
  disable_avahi
else
  # no .local from dns server, enabling avahi
  enable_avahi
fi
```

When host call fails (even with timeout), it returns "1" claiming "dns_has_local()=true".
{{{
fermulator@fermmy:~$ OUT="Host local. not found: 3(NXDOMAIN)"
fermulator@fermmy:~$ if echo "$OUT" | egrep -vq 'has no|not found'; then echo "RETURN 0"; else echo "RETURN 1"; fi
RETURN 1
}}}

At least the additional wrapping of timeout (workaround) doesn't make it any worse I suppose ...

Hi fermulator,
I was wondering vice versa if 5 seconds would be too short actually.
Yes the good cases will return in sub-second, but it is the bad cases we want to fix here.
Having a bit more time to recover if it can doesn't seem too bad to me.

On the question of host -W for waiting.
While it isn't fully clarified, we have to consider that in this case there is a real hang inside host or one of its syscalls due to the odd states the devices are in.
That said it could be that the call just blocks forever and a "host internal" timeout would never trigger leaving the system in just the bad state it is now.

Instead the external timeout wrapper should be immune to that, and therefore is better for this case.

Trent Lloyd (lathiat) wrote :

I agree with the sentiment that 5 seconds feels too long, however as a workaround I decided I would just copy the existing timeout. I certainly would not want to make it longer since this is in the critical boot path.

I would agree that in general a DNS request should fail faster; however, there are some cases where it won't, e.g. spanning tree bring-up on ports can take 2 seconds.

My hope is to correctly fix host after getting this in, since the impact is very high for affected users.

This check may actually be able to go away, I believe both systemd-resolved and libnss-mdns (latest version that I think is not in bionic) implement the .local label checking to do this at runtime instead of this old hack. So for cosmic+ we can probably get rid of this logic, which always sucked anyway. As we only needed to really disable nss-mdns and not avahi entirely (since apps should normally resolve the IPs using avahi's API anyway, the impact to actual avahi usage is low).

Since the impact is high but only on a smaller subset of users, I think we should go with matching the current timeout for now and worry about further improvements later.

I've verified the cosmic upload is working as expected on a non-affected system.

Thanks Trent for the extra verification.
Tests also look good so far, but currently since a lot got uploaded due to feature freeze some tests take a while.
We should have that in cosmic soon and then can pick the same for Bionic.

Thanks also for your thoughts on a better long term solution.

Trent Lloyd (lathiat) wrote :

> When host call fails (even with timeout), it returns "1" claiming "dns_has_local()=true".

0 = true, 1 = false (you implied the opposite)

What may add confusion here is the grep -vq check is like an extra check to make sure host didn't return 0 (success = we found .local) but then say 'not found' anyway. So it returns 0 (true) when host returns 0. It returns 1 when host returns anything else (including timeout); 1 = false which means leave avahi enabled.
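
The true/false convention explained above can be checked by hand with a small sketch that mirrors the branch logic of dns_has_local(). `classify_host_result` and `report` are hypothetical helpers written for this illustration, and the SOA answer in the first call is made up.

```shell
# Hypothetical helper mirroring the branch logic of dns_has_local():
# $1 = exit status of host (or of timeout), $2 = its output.
# Returns 0 ("true": unicast .local found) or 1 ("false": not found).
classify_host_result() {
  if [ "$1" -eq 0 ] && echo "$2" | grep -Evq 'has no|not found'; then
    return 0
  fi
  return 1
}

report() {
  if classify_host_result "$1" "$2"; then
    echo "status=$1 -> returns 0 (true): disable avahi"
  else
    echo "status=$1 -> returns 1 (false): keep avahi enabled"
  fi
}

report 0   "local has SOA record ns.example. admin.example. 1 2 3 4 5"  # made-up answer
report 1   "Host local. not found: 3(NXDOMAIN)"                         # normal failure
report 124 ""                                                           # timeout fired
```

Only the first case, a zero exit status with output that does not say "has no" or "not found", counts as ".local is served by unicast DNS".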

fermulator (fermulator) wrote :

> RE -W/w in `host`
, correct -- even with timeout set, it blocks forever (I tested this several days ago in the dup'd ticket iirc)

> RE timeout
, good thoughts all - sure let's just stick with 5 seconds then

> RE logic true/false (@Trent)
, thanks yes! that'll do it; clarified now in my mind

Erich E. Hoover (ehoover) wrote :

I've been trying to figure out how to test this with dig instead, and I think I found something. If you have a normal /etc/resolv.conf then you see this:
===
$ dig -t soa local.; echo $?

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; Got answer:
;; WARNING: .local is reserved for Multicast DNS
;; You are currently testing what happens when an mDNS query is leaked to DNS
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2061
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;local. IN SOA

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: Fri Aug 24 09:16:38 MDT 2018
;; MSG SIZE rcvd: 34

0
===

If, instead, you add "local" to the search then you get this:
===
$ dig -t soa local.; echo $?

; <<>> DiG 9.11.3-1ubuntu1.1-Ubuntu <<>> -t soa local.
;; global options: +cmd
;; connection timed out; no servers could be reached
9
===

This may not be a good test (maybe under some other configuration some sort of response is sent?), but it might be a good idea to figure out how to accomplish this without using host.
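
One possible way to act on the dig output above, sketched under assumptions: dig_status is a hypothetical helper that pulls the status field out of dig's header line, so a script could tell SERVFAIL apart from NOERROR rather than relying on the exit code alone. It is not known from this thread whether dig can hang the same way host does, so the usage comment still wraps it in timeout.

```shell
#!/bin/sh
# Hypothetical helper: read dig output on stdin and print the header status
# (for example SERVFAIL or NOERROR). Prints nothing if no header is present,
# as in the "connection timed out" case.
dig_status() {
    sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p'
}

# Example usage (not run here):
#   timeout 5 dig -t soa local. | dig_status
```

Whether a given status should count as ".local exists in unicast DNS" is a separate decision that this sketch does not make.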

Changed in avahi (Ubuntu Cosmic):
status: In Progress → Fix Released

The suggested mitigation is in bionic-unapproved for the consideration of the SRU team.

@Erich - infinite hangs are usually due to the kernel somewhere. Suggesting dig was a good idea to try, but I wonder if we would first have to find out what "host" actually hangs on, to be sure that "dig" will not some day block on just the same thing.

Can one of you who is affected check, when the "host" command hangs, whether it is spinning in userspace or waiting in a kernel wchan?
$ cat /proc/<pid of host>/wchan
and
$ perf top -p <pid of host>
$ strace -rtf -p <pid of host>
should help to get an idea what it is blocking on.
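
If there is more than one host process, the wchan check above can be looped. This is only a rough sketch: wchan_of and report_host_wchan are hypothetical helper names, and it assumes pgrep (from procps) is installed.

```shell
#!/bin/sh
# Hypothetical helper: print the kernel wait channel for a PID, or
# "unavailable" if /proc/<pid>/wchan cannot be read or is empty.
wchan_of() {
    w=$(cat "/proc/$1/wchan" 2>/dev/null) || w=
    printf '%s\n' "${w:-unavailable}"
}

# Report every process named exactly "host" (pgrep -x matches the full name).
report_host_wchan() {
    for pid in $(pgrep -x host); do
        printf '%s %s\n' "$pid" "$(wchan_of "$pid")"
    done
}
```

Calling report_host_wchan while the interface is stuck prints one "pid wchan" line per host process.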

@Trent - you said you started on strace already, maybe you can provide the full logs here?
Also was it spinning in strace (on the same things) or just waiting?

Marc Dietrich (marvin24) wrote :

In fact, there are two host commands running; both show wchan=sigsuspend, perf shows nothing, and strace shows the command is suspended. A new strace is attached.

Marc Dietrich (marvin24) wrote :

Also attaching a full gdb backtrace, which shows that epoll_wait is called with a timeout value of "-1" (infinite) in thread #4.

Hello Liam, or anyone else affected,

Accepted avahi into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/avahi/0.7-3.1ubuntu1.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in avahi (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic

@Trent - since you had the most reproducible setup could you take a look at verifying also the Bionic upload?

Also anyone else affected with the VPN cases please give it a try.

Finally, thanks Marc D. for adding all the Data.
If it is really hanging on that epoll forever we might want to report that upstream as a bug - I feel we now have enough data for that - I'll take a look later at what a bug report to bind needs ...

Liam (liam-smit) wrote :

I tried the numerous suggested fixes and they all worked for me, including uninstalling "avahi-daemon". That last one resolved my problem so I can't test the latest fix.

Hi Liam!

I also uninstalled avahi-daemon some months ago but I verified that the
problem came back when I installed it again, so I think you should still be
able to verify the fix if you want to.

Best regards
Laban

On Fri, Aug 31, 2018, 11:21 Liam <email address hidden> wrote:

> I tried the numerous suggested fixes and they all worked for me including
> uninstalling "avahi-daemon". That last one resolved my problem so I
> can't test the latest fix.

Trent Lloyd (lathiat) on 2018-09-03
tags: added: verification-done-bionic
removed: verification-needed-bionic
Trent Lloyd (lathiat) wrote :

Confirmed the fix works on my affected system, after upgrade to 0.7-3.1ubuntu1.1 from bionic-proposed and a system reboot the boot works (relatively) quickly as expected and doesn't get stuck.

Verified the file in place is the original from the package (and not one modified by me).

Lastly, to further verify that it was the timeout and not some other change, I changed the timeout from 5 to 17 (to make the time taken more obvious), rebooted, and compared the time spent in network.service with systemd-analyze. It changed from 18 seconds (with timeout=5) to 30.368s (with timeout=17), which is roughly an additional 12 seconds as expected. I also observed that the time before SSH was ready increased by about the same amount.
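
The same kind of check can be tried outside of boot. The following is a rough sketch under assumptions: elapsed_with_timeout is a hypothetical helper, and sleep stands in for a lookup that never returns.

```shell
#!/bin/sh
# Hypothetical helper: run a command under timeout(1) and print how many whole
# seconds it took, to confirm that the limit really bounds the runtime.
elapsed_with_timeout() {
    limit=$1
    shift
    start=$(date +%s)
    # A non-zero status (124 on timeout) is expected here, so ignore it.
    timeout "$limit" "$@" >/dev/null 2>&1 || true
    end=$(date +%s)
    echo $((end - start))
}
```

With a command that hangs, `elapsed_with_timeout 5 sleep 60` and `elapsed_with_timeout 17 sleep 60` should differ by about 12 seconds, matching the difference seen in the boot times.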

Looks good to me, changed the bionic verification tag to done but left 'verification-needed'.

Changed in avahi (Ubuntu Bionic):
importance: Undecided → High
assignee: nobody → Trent Lloyd (lathiat)

Package: avahi-daemon
Version: 0.7-3.1ubuntu1.1

Works. Thank you.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package avahi - 0.7-3.1ubuntu1.1

---------------
avahi (0.7-3.1ubuntu1.1) bionic; urgency=medium

  [ Trent Lloyd ]
  * debian/avahi-daemon-check-dns.sh: On some hardware, the 'host'
    command gets stuck and does not timeout as it should leaving this script
    and boot-up hanging indefinitely. Launch host with 'timeout' to kill it
    after 5 seconds in these cases as a workaround. (LP: #1752411)

 -- Christian Ehrhardt <email address hidden> Tue, 28 Aug 2018 11:37:21 +0200

Changed in avahi (Ubuntu Bionic):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for avahi has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

tags: removed: server-next
Andreas Hasenack (ahasenack) wrote :

In https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1797926 it looks like the host command, which doesn't time out and is the subject of this bug, crashed.
