Intermittent DNS server non-responsiveness

Bug #1847448 reported by Frode Nordahl
Affects:      OpenStack Charm Test Infra
Status:       Invalid
Importance:   Undecided
Assigned to:  Unassigned
Milestone:

Bug Description

On multiple occasions I have observed long-running deployments, either on ServerStack or in the Icarus MAAS, where charms error out during the update-status hook.

Looking at one occurrence of this issue, it appears to be caused by the DNS server not responding (see the traceback below).

While charms doing additional work in the update-status hook is a generic challenge with the reactive framework that we need to tackle, it does not change the fact that we appear to have intermittent DNS server (or networking) issues in our lab.

Example Traceback:
2019-10-09 09:21:42 DEBUG juju-log tracer>
tracer: hooks phase, 1 handlers queued
tracer: ++ queue handler reactive/layer_openstack.py:54:default_update_status
2019-10-09 09:21:42 INFO juju-log Invoking reactive handler: reactive/layer_openstack.py:54:default_update_status
2019-10-09 09:21:42 DEBUG juju-log tracer: set flag run-default-update-status
2019-10-09 09:21:43 DEBUG juju-log tracer>
tracer: main dispatch loop, 7 handlers queued
tracer: ++ queue handler hooks/relations/ovsdb-cms/requires.py:34:joined:ovsdb-cms
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:79:joined:certificates
tracer: ++ queue handler reactive/layer_openstack.py:62:run_default_update_status
tracer: ++ queue handler reactive/layer_openstack.py:95:default_request_certificates
tracer: ++ queue handler reactive/neutron_api_plugin_ovn_handlers.py:109:assess_status
tracer: ++ queue handler reactive/neutron_api_plugin_ovn_handlers.py:115:poke_ovsdb
tracer: ++ queue handler reactive/neutron_api_plugin_ovn_handlers.py:50:configure_neutron
2019-10-09 09:21:43 INFO juju-log Invoking reactive handler: reactive/layer_openstack.py:62:run_default_update_status
2019-10-09 09:21:43 DEBUG juju-log tracer>
tracer: cleared flag run-default-update-status
tracer: -- dequeue handler reactive/layer_openstack.py:62:run_default_update_status
2019-10-09 09:21:43 INFO juju-log Invoking reactive handler: reactive/layer_openstack.py:95:default_request_certificates
2019-10-09 09:22:13 ERROR juju-log Hook error:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charms/reactive/__init__.py", line 74, in main
    bus.dispatch(restricted=restricted_mode)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 390, in dispatch
    _invoke(other_handlers)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 359, in _invoke
    handler.invoke()
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charms/reactive/bus.py", line 181, in invoke
    self._action(*args)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/charm/reactive/layer_openstack.py", line 103, in default_request_certificates
    for cn, req in instance.get_certificate_requests().items():
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charms_openstack/charm/classes.py", line 269, in get_certificate_requests
    json_encode=False).get('cert_requests', {})
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charmhelpers/contrib/openstack/cert_utils.py", line 121, in get_certificate_request
    req.add_hostname_cn()
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charmhelpers/contrib/openstack/cert_utils.py", line 86, in add_hostname_cn
    'cn': get_hostname(ip),
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charmhelpers/contrib/network/ip.py", line 522, in get_hostname
    result = ns_query(rev)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/charmhelpers/contrib/network/ip.py", line 478, in ns_query
    answers = dns.resolver.query(address, rtype)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/dns/resolver.py", line 1102, in query
    lifetime)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/dns/resolver.py", line 992, in query
    timeout = self._compute_timeout(start, lifetime)
  File "/var/lib/juju/agents/unit-neutron-api-plugin-ovn-0/.venv/lib/python3.6/site-packages/dns/resolver.py", line 799, in _compute_timeout
    raise Timeout(timeout=duration)
dns.exception.Timeout: The DNS operation timed out after 30.001458883285522 seconds
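
For reference, the failing call path boils down to a reverse (PTR) lookup via dnspython. A minimal reproduction sketch of what charmhelpers' get_hostname()/ns_query() end up doing (an illustration only: the unit IP is taken from the output further down, and dnspython 1.x is assumed, matching the venv paths in the traceback):

# Hedged sketch; uses the affected unit's address and the default 30 s
# resolver lifetime, which matches the timeout reported above.
import dns.exception
import dns.resolver
import dns.reversename

ip = '10.246.114.24'                      # affected unit's address
rev = dns.reversename.from_address(ip)    # -> 24.114.246.10.in-addr.arpa.

resolver = dns.resolver.Resolver()        # reads the unit's /etc/resolv.conf
resolver.lifetime = 30.0                  # same 30 s budget seen in the traceback

try:
    answers = resolver.query(rev, 'PTR')  # what ns_query() boils down to
    print([str(rr) for rr in answers])
except dns.exception.Timeout as exc:
    print('reverse lookup timed out: {}'.format(exc))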

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Actually, the DNS server appears to be completely dead at the moment!

root@juju-7153cf-0-lxd-3:~# systemd-resolve --status
Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 29 (eth0)
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: 10.246.112.3
          DNS Domain: maas
root@juju-7153cf-0-lxd-3:~# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
29: eth0@if30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 00:16:3e:59:2b:82 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.246.114.24/21 brd 10.246.119.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::216:3eff:fe59:2b82/64 scope link
       valid_lft forever preferred_lft forever
root@juju-7153cf-0-lxd-3:~# time host 10.246.114.24 10.246.112.3
;; connection timed out; no servers could be reached

real 0m10.015s
user 0m0.004s
sys 0m0.011s
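
A scripted equivalent of the host check above, querying the MAAS DNS server directly with a short explicit timeout (a sketch only; assumes dnspython is available on the probing host):

# Hedged sketch: send a single PTR query straight to 10.246.112.3 and report
# whether it answers within 5 seconds.
import dns.exception
import dns.message
import dns.query
import dns.reversename

nameserver = '10.246.112.3'               # MAAS DNS server under suspicion
rev = dns.reversename.from_address('10.246.114.24')
request = dns.message.make_query(rev, 'PTR')

try:
    response = dns.query.udp(request, nameserver, timeout=5)
    print(response.answer or 'empty answer')
except dns.exception.Timeout:
    print('no response from {} within 5 s'.format(nameserver))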

Revision history for this message
Frode Nordahl (fnordahl) wrote :

ubuntu@node-mees:~$ systemd-resolve --status|grep "DNS Serv"
         DNS Servers: 10.246.112.3
         DNS Servers: 10.246.112.3
         DNS Servers: 10.246.112.3
         DNS Servers: 10.246.112.3
         DNS Servers: 10.246.112.3
         DNS Servers: 10.246.112.3
         DNS Servers: 10.246.112.3
ubuntu@node-mees:~$ ip a show dev br-bond0
20: br-bond0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue state UP group default qlen 1000
    link/ether 24:6e:96:47:8c:48 brd ff:ff:ff:ff:ff:ff
    inet 10.246.114.13/21 brd 10.246.119.255 scope global br-bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::266e:96ff:fe47:8c48/64 scope link
       valid_lft forever preferred_lft forever
ubuntu@node-mees:~$ time host 10.246.114.13 10.246.112.3
;; connection timed out; no servers could be reached

real 0m10.015s
user 0m0.015s
sys 0m0.000s

Revision history for this message
Frode Nordahl (fnordahl) wrote :

ubuntu@icarus-maas:~⟫ date
Wed Oct 9 10:45:01 UTC 2019
ubuntu@icarus-maas:~⟫ ps axuw |grep bind
bind 404 4.3 0.1 3703664 326056 ? Ssl Sep30 550:09 /usr/sbin/named -f -u bind
maas 15192 0.0 0.0 174476 5292 ? Sl 06:18 0:00 rndc -c /etc/bind/maas/rndc.conf.maas reload

The RNDC socket does not appear to be serviced by BIND.

All BIND threads appear stuck/quiet.

ubuntu@icarus-maas:~⟫ dpkg -l |grep bind9
ii bind9 1:9.10.3.dfsg.P4-8ubuntu1.14 amd64 Internet Domain Name Server

bind9 (1:9.10.3.dfsg.P4-8ubuntu1.15) xenial; urgency=medium

  * d/p/ubuntu//lp-1833400*: fix race on shutdown (LP: #1833400)
  * d/p/fix-shutdown-race.diff: dig/host/nslookup could crash when interrupted
    close to a query timeout (LP: #1797926)

This does not appear to be relevant to the problem at hand.

ubuntu@icarus-maas:~⟫ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial

Perhaps we should upgrade to Bionic?

ubuntu@icarus-maas:~⟫ sudo strace -ffp 404
sudo: unable to resolve host icarus-maas
strace: Process 404 attached with 51 threads
[pid 458] epoll_wait(7, <unfinished ...>
[pid 457] restart_syscall(<... resuming interrupted restart_syscall ...> <unfinished ...>
[pid 456] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 455] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 454] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 453] futex(0x7f88afcf90d4, FUTEX_WAIT_PRIVATE, 17981, NULL <unfinished ...>
[pid 452] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 451] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 450] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 449] futex(0x7f88004f25c8, FUTEX_WAIT_PRIVATE, 2, NULL <unfinished ...>
[pid 448] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 447] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 446] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 445] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 444] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 443] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 442] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 441] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 440] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 439] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 438] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 437] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 436] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unfinished ...>
[pid 435] futex(0x7f88afcf90a4, FUTEX_WAIT_PRIVATE, 92336049, NULL <unf...


Revision history for this message
Frode Nordahl (fnordahl) wrote :
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

It's a MaaS upstream bug (according to #4) -- thus nothing to fix in the Test Infra.

Changed in charm-test-infra:
status: New → Invalid