bind9 slow response after netplan apply

Bug #1811554 reported by Mitchell Rinzel on 2019-01-13
Affects               Importance   Assigned to
bind9 (Ubuntu)        High         Unassigned
netplan.io (Ubuntu)   Undecided    Unassigned

Bug Description

System:
VM running on ESXI 6.0
Description: Ubuntu 18.04.1 LTS
Release: 18.04

Package:
bind9:
  Installed: 1:9.11.3+dfsg-1ubuntu1.3
  Candidate: 1:9.11.3+dfsg-1ubuntu1.3
  Version table:
 *** 1:9.11.3+dfsg-1ubuntu1.3 500
        500 http://archive.ubuntu.com/ubuntu bionic-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1:9.11.3+dfsg-1ubuntu1.2 500
        500 http://archive.ubuntu.com/ubuntu bionic-security/main amd64 Packages
     1:9.11.3+dfsg-1ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu bionic/main amd64 Packages

3. Expected to happen: After issuing "sudo netplan apply" with no network changes, bind9 was expected to continue running as it had.

This happened once when reapplying the config manually, and once during a "Daily apt upgrade and clean activities" job.

4. Within 2 minutes of the netplan apply completing, we started seeing timeouts and long response times from the server. We have two identical builds currently running as caching servers for a large network; the servers were built on the same day and have both experienced the issue. These servers are under heavy load and are answering hundreds of queries a second or more. Investigating the bind logs and the syslog shows no indication of reaching the maximum number of connections, maximum open files, or any other limit. However, dropped packets are observed, external monitoring of the bind service begins to flap, and manual testing shows some queries timing out.
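For anyone trying to reproduce this, the checks described above can be approximated with standard tools. This is only a sketch of how one might look for kernel-level UDP drops and query named's own view of its state; it assumes a stock bind9 install with rndc configured:

```shell
# Kernel-level UDP statistics: receive errors and buffer overflows
# would show up here even when named's own logs are clean
netstat -su

# Summary of socket state, useful for spotting backlog growth
ss -s

# Ask named itself whether it is up and how many queries are in flight
sudo rndc status
```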

Issuing a "sudo systemctl restart bind9" instantly resolves the issue.

If there is any other information you need, please let me know; I am unsure where else to look, as the named log, kernel log, and syslog are all clear of errors during the timeout issue.

Andreas Hasenack (ahasenack) wrote :

What does /etc/resolv.conf look like? Is it using bind9 as the nameserver, or systemd-resolve (127.0.0.53)? You might not even be using bind9 directly, but via systemd-resolve, so that changes how this should be debugged.

Changed in bind9 (Ubuntu):
status: New → Incomplete
Mitchell Rinzel (mrinzel) wrote :

You are correct, right now /etc/resolv.conf has only the below entry:

nameserver 127.0.0.53
search companydomain.com

Andreas Hasenack (ahasenack) wrote :

You should also check the output of systemd-resolve --status, as that will tell you which dns servers the 127.0.0.53 resolver is using.

You also need to clarify this statement: "After the netplan apply completed within 2 minutes we started seeing timeouts and long response times from the server."
- is bind on the same server where you ran netplan apply?
- when you say "long response times from the server", do you mean bind (wherever it is running), or 127.0.0.53, which in turn may query the bind server you are talking about?
- how did you query the nameserver, was it using a directed tool like dig querying the server directly (dig @<server> <name>), or using the 127.0.0.53 resolver with something generic like "ping <name>", or "host <name>"? There are many pieces involved in name resolution.

We need better reproduction steps in order to evaluate this issue and determine if it's a bug or not.
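For reference, the two query paths Andreas distinguishes can be exercised like this (the server IP and hostname below are placeholders, not values from this report):

```shell
# Directed query: talks to the bind9 server only, bypassing any local stub resolver
dig @192.0.2.10 example.com

# Indirect query: goes through the local stub resolver (127.0.0.53 on Ubuntu 18.04),
# which in turn forwards to whatever systemd-resolved has as its upstream DNS
host example.com
```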

Mitchell Rinzel (mrinzel) wrote :

I apologize for the vague information, answers below:

- is bind on the same server where you ran netplan apply?

Yes, bind is running on the system where the netplan apply was done.

- when you say "long response times from the server", do you mean bind (wherever it is running), or 127.0.0.53, which in turn may query the bind server you are talking about?

After the netplan apply command was run, some queries made to the nameserver started timing out. When a query failed, dig/nslookup reported that no servers could be reached. Some queries received a response only after 5-10 seconds, yet with a reported query time of 0-1 ms. Finally, some queries returned responses as expected.

- how did you query the nameserver, was it using a directed tool like dig querying the server directly (dig @<server> <name>), or using the 127.0.0.53 resolver with something generic like "ping <name>", or "host <name>"? There are many pieces involved in name resolution.

The queries were run from a workstation to the nameserver. Both "dig @<server> <name>" and "nslookup <name> <server>" were attempted.

I did not attempt to use the 127.0.0.53 resolver on the nameserver during the issue as I did not know bind was utilizing it. If the opportunity presents itself I will test the resolver as well.
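Should the issue recur, one way to test both layers side by side is to query the stub resolver and bind9 directly and compare the reported query times (the x.x.x.x address stands in for the server's own public IP, as in the report above):

```shell
# Query the systemd-resolved stub on the nameserver itself
dig @127.0.0.53 example.com +stats

# Query bind9 on its own listening address, bypassing systemd-resolved
dig @x.x.x.x example.com +stats
```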

-----

Results from systemd-resolve --status

Global
          DNSSEC NTA: 10.in-addr.arpa
                      16.172.in-addr.arpa
                      168.192.in-addr.arpa
                      17.172.in-addr.arpa
                      18.172.in-addr.arpa
                      19.172.in-addr.arpa
                      20.172.in-addr.arpa
                      21.172.in-addr.arpa
                      22.172.in-addr.arpa
                      23.172.in-addr.arpa
                      24.172.in-addr.arpa
                      25.172.in-addr.arpa
                      26.172.in-addr.arpa
                      27.172.in-addr.arpa
                      28.172.in-addr.arpa
                      29.172.in-addr.arpa
                      30.172.in-addr.arpa
                      31.172.in-addr.arpa
                      corp
                      d.f.ip6.arpa
                      home
                      internal
                      intranet
                      lan
                      local
                      private
                      test

Link 3 (ens192) - Private IP, Management Network
      Current Scopes: none
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no

Link 2 (ens160) - Public IP
      Current Scopes: DNS
       LLMNR setting: yes
MulticastDNS setting: no
      DNSSEC setting: no
    DNSSEC supported: no
         DNS Servers: y.y.y.y (Primary Caching Server)
                      x.x.x.x (Servers own public IP)

I added a netplan.io bug task so that cyphermox sees this bug as well.
He will know best all the substeps that happen on "netplan apply" and whether one of them could be related. Furthermore, he might have seen such reports filed against netplan.io instead of bind9.

Well, this would squarely be a bind9 issue. The use of 'netplan apply' there just means that an IP might have changed, or the state of the interface changed enough (bringing it down, then up again, re-adding static addresses, etc.) that bind couldn't make sense of it.

'netplan apply' is supposed to do just that: apply the network configuration, via systemd-networkd. If it happened to be DHCP, you could potentially use 'critical: true', but that's not really going to help much for static addresses (which you'd most likely use for a DNS server).
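For a DHCP-managed interface, the 'critical: true' option mentioned above would look roughly like this in the netplan YAML (the interface name is an assumption, not taken from this report's config):

```yaml
network:
  version: 2
  ethernets:
    ens160:
      dhcp4: true
      critical: true   # ask networkd to keep the connection/lease across reconfiguration
```

As noted, this only helps the DHCP case; a DNS server with static addresses gains little from it.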

I think this needs to be investigated in bind9: how does it bind addresses? How does it watch for network devices changes and what does it do in that case?
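On the bind9 side, the relevant mechanism is named's periodic interface scan: by default it re-scans network interfaces on a timer (interface-interval, in minutes, default 60) and rebinds its listening sockets when addresses appear or disappear. A hedged named.conf sketch, with placeholder addresses, of the options that control this behavior:

```
options {
    // Listen only on explicit addresses instead of all scanned interfaces
    listen-on { 192.0.2.10; };
    listen-on-v6 { none; };

    // Re-scan interfaces every 10 minutes instead of the default 60
    interface-interval 10;
};
```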

Robie Basak (racb) wrote :

I'm not sure that it's the same issue, but bug 1811554 fits into the same category of "performance/reliability issue with named under edge case" to me. I think this needs to go into the backlog of things that need deep investigation.

Changed in bind9 (Ubuntu):
importance: Undecided → High
Robie Basak (racb) wrote :

I'm also un-marking Incomplete as I think there's enough information from the reporter to attempt to reproduce.

Changed in bind9 (Ubuntu):
status: Incomplete → Confirmed
Robie Basak (racb) wrote :

And Incomplete (or maybe Invalid) for netplan, since I agree with Mathieu's assessment, at least based on what we know right now.

Changed in netplan.io (Ubuntu):
status: New → Incomplete