bind9 slow response after netplan apply

Bug #1811554 reported by Mitchell Rinzel on 2019-01-13
Affects               Importance   Assigned to
bind9 (Ubuntu)        High         Unassigned
netplan.io (Ubuntu)   Undecided    Unassigned

Bug Description

System:
VM running on ESXI 6.0
Description: Ubuntu 18.04.1 LTS
Release: 18.04

Package:
bind9:
  Installed: 1:9.11.3+dfsg-1ubuntu1.3
  Candidate: 1:9.11.3+dfsg-1ubuntu1.3
  Version table:
 *** 1:9.11.3+dfsg-1ubuntu1.3 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1:9.11.3+dfsg-1ubuntu1.2 500
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     1:9.11.3+dfsg-1ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

3. Expected to happen: After issuing "sudo netplan apply" with no network changes, bind9 was expected to continue running as it had.

This happened once when reapplying the config manually, and once during a "Daily apt upgrade and clean activities" job.

4. Within 2 minutes of the netplan apply completing, we started seeing timeouts and long response times from the server. We have two identical builds currently running as caching servers for a large network; the servers were built on the same day and have both experienced the issue. These servers are under heavy load and are answering hundreds of queries a second or more. Investigating the bind logs and the syslog shows no indication of reaching the maximum number of connections, maximum open files, or any other limit. However, dropped packets are observed, external monitoring of the bind service begins to flap, and manual testing shows some queries timing out.
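For anyone trying to reproduce this, the checks described above can be approximated with standard tools. This is only a sketch of how one might look for kernel-level UDP drops and query named's own view of its state; it assumes a stock bind9 install with rndc configured:

```shell
# Kernel-level UDP statistics: receive errors and buffer overflows
# would show up here even when named's own logs are clean
netstat -su

# Summary of socket state, useful for spotting backlog growth
ss -s

# Ask named itself whether it is up and how many queries are in flight
sudo rndc status
```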

Issuing a "sudo systemctl restart bind9" instantly resolves the issue.

If there is any other information you need, please let me know; I am unsure where else to look, as the named log, kernel log, and syslog are all clear of errors during the timeout issue.

Andreas Hasenack (ahasenack) wrote :

What does /etc/resolv.conf look like? Is it using bind9 as the nameserver, or systemd-resolve (127.0.0.53)? You might not even be using bind9 directly, but via systemd-resolve, so that changes how this should be debugged.

Changed in bind9 (Ubuntu):
status: New → Incomplete
Mitchell Rinzel (mrinzel) wrote :

You are correct, right now /etc/resolv.conf has only the below entry:

nameserver 127.0.0.53
search companydomain.com

Andreas Hasenack (ahasenack) wrote :

You should also check the output of systemd-resolve --status, as that will tell you which dns servers the 127.0.0.53 resolver is using.

You also need to clarify this statement: "After the netplan apply completed within 2 minutes we started seeing timeouts and long response times from the server."
- is bind on the same server where you ran netplan apply?
- when you say "long response times from the server", do you mean bind (wherever it is running), or 127.0.0.53, which in turn may query the bind server you are talking about?
- how did you query the nameserver, was it using a directed tool like dig querying the server directly (dig @<server> <name>), or using the 127.0.0.53 resolver with something generic like "ping <name>", or "host <name>"? There are many pieces involved in name resolution.

We need better reproduction steps in order to evaluate this issue and determine if it's a bug or not.
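For reference, the two query paths Andreas distinguishes can be exercised like this (the server IP and hostname below are placeholders, not values from this report):

```shell
# Directed query: talks to the bind9 server only, bypassing any local stub resolver
dig @192.0.2.10 example.com

# Indirect query: goes through the local stub resolver (127.0.0.53 on Ubuntu 18.04),
# which in turn forwards to whatever systemd-resolved has as its upstream DNS
host example.com
```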

Mitchell Rinzel (mrinzel) wrote :

I apologize for the vague information, answers below:

- is bind on the same server where you ran netplan apply?

Yes, bind is running on the system where the netplan apply was done.

- when you say "long response times from the server", do you mean bind (wherever it is running), or 127.0.0.53, which in turn may query the bind server you are talking about?

After the netplan apply command was run, some queries made to the nameserver started timing out. When a query failed, dig/nslookup reported that no servers could be reached. Some queries received a response only after 5-10 seconds, yet with a reported query time of 0-1 ms. Finally, some queries returned responses as expected.

- how did you query the nameserver, was it using a directed tool like dig querying the server directly (dig @<server> <name>), or using the 127.0.0.53 resolver with something generic like "ping <name>", or "host <name>"? There are many pieces involved in name resolution.

The queries were run from a workstation to the nameserver. Both "dig @<server> <name>" and "nslookup <name> <server>" were attempted.

I did not attempt to use the 127.0.0.53 resolver on the nameserver during the issue as I did not know bind was utilizing it. If the opportunity presents itself I will test the resolver as well.
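Should the issue recur, one way to test both layers side by side is to query the stub resolver and bind9 directly and compare the reported query times (the x.x.x.x address stands in for the server's own public IP, as in the report above):

```shell
# Query the systemd-resolved stub on the nameserver itself
dig @127.0.0.53 example.com +stats

# Query bind9 on its own listening address, bypassing systemd-resolved
dig @x.x.x.x example.com +stats
```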

-----

Results from systemd-resolve --status

Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 3 (ens192) - Private IP, Management Network
      Current Scopes: none
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no

Link 2 (ens160) - Public IP
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: y.y.y.y (Primary Caching Server)
                      x.x.x.x (Servers own public IP)

I added a netplan.io bug task so that cyphermox sees this bug as well.
He will know best all the substeps that happen on "netplan apply" and whether one of them could be related. Furthermore, he might have seen such reports filed against netplan.io instead of bind9.

Well, this would squarely be a bind9 issue. The use of 'netplan apply' there just means that an IP might have changed, or the state of the interface changed enough (bringing it down, then up again, re-adding static addresses, etc.) that bind couldn't make sense of it.

'netplan apply' is supposed to do just that: apply the network configuration, via systemd-networkd. If it happened to be DHCP, you could potentially use 'critical: true', but that's not really going to help much for static addresses (which you'd most likely use for a DNS server).
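For a DHCP-managed interface, the 'critical: true' option mentioned above would look roughly like this in the netplan YAML (the interface name is an assumption, not taken from this report's config):

```yaml
network:
  version: 2
  ethernets:
    ens160:
      dhcp4: true
      critical: true   # ask networkd to keep the connection/lease across reconfiguration
```

As noted, this only helps the DHCP case; a DNS server with static addresses gains little from it.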

I think this needs to be investigated in bind9: how does it bind addresses? How does it watch for network devices changes and what does it do in that case?
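On the bind9 side, the relevant mechanism is named's periodic interface scan: by default it re-scans network interfaces on a timer (interface-interval, in minutes, default 60) and rebinds its listening sockets when addresses appear or disappear. A hedged named.conf sketch, with placeholder addresses, of the options that control this behavior:

```
options {
    // Listen only on explicit addresses instead of all scanned interfaces
    listen-on { 192.0.2.10; };
    listen-on-v6 { none; };

    // Re-scan interfaces every 10 minutes instead of the default 60
    interface-interval 10;
};
```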

Robie Basak (racb) wrote :

I'm not sure that it's the same issue, but bug 1811554 fits into the same category of "performance/reliability issue with named under edge case" to me. I think this needs to go into the backlog of things that need deep investigation.

Changed in bind9 (Ubuntu):
importance: Undecided → High
Robie Basak (racb) wrote :

I'm also un-marking Incomplete as I think there's enough information from the reporter to attempt to reproduce.

Changed in bind9 (Ubuntu):
status: Incomplete → Confirmed
Robie Basak (racb) wrote :

And Incomplete (or maybe Invalid) for netplan, since I agree with Mathieu's assessment, at least based on what we know right now.

Changed in netplan.io (Ubuntu):
status: New → Incomplete