Bug #1965521 “named.service starts too early: Unable to fetch DN...” : Bugs : bind9 package : Ubuntu

Nick Tait (nick.t) wrote on 2022-03-18:

#1

I discovered that the above workaround isn't ideal when the server has multiple network interfaces, because the systemd-networkd-wait-online command will wait for all interfaces to reach routable status. This may cause systemd-networkd-wait-online to time out (after 10 seconds, as per the --timeout argument), and if you then run "systemctl status named.service", it shows a failed status for the ExecStartPre command, which isn't ideal.

I experimented with including "--any" in the systemd-networkd-wait-online arguments, but found this wasn't 100% reliable and TBH I'm not entirely sure why. But for now I've resorted to including the interface name in the above command instead, such as:

ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --interface=eno1:routable --timeout=10 --quiet

Obviously the interface name is machine-specific, which makes it impractical to include this command as a general purpose fix in the repo version of named.service. So I've now come to the conclusion that the best way to fix this issue is to implement a change to BIND itself (i.e. /usr/sbin/named), to make it retry a few times before logging the error message above? (FYI This is outside of the realm of my skill set so I guess I'm asking for the maintainer of BIND to determine the feasibility of this request?)

Thanks,
Nick.
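
For reference, a minimal sketch of how the interface-specific workaround above could be packaged as a systemd drop-in instead of editing the packaged named.service. The function name `render_wait_online_dropin` is hypothetical; the ExecStartPre line is the one quoted in the comment, with the interface name and timeout turned into parameters because they are machine-specific.

```shell
#!/bin/sh
# Print a systemd drop-in that delays named until ONE named interface is routable.
# $1 = interface name (machine-specific, e.g. eno1), $2 = timeout in seconds (default 10).
render_wait_online_dropin() {
    iface="$1"
    timeout="${2:-10}"
    # The leading "-" tells systemd to ignore a non-zero exit status,
    # so named still starts if the wait times out.
    printf '%s\n' \
        '[Service]' \
        "ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --interface=${iface}:routable --timeout=${timeout} --quiet"
}

# Example: render_wait_online_dropin eno1 10
```

One way to use it, assuming a drop-in file name of your choosing (wait-online.conf here is only an example): write the output to /etc/systemd/system/named.service.d/wait-online.conf and run "systemctl daemon-reload". A drop-in survives package upgrades, but it does not remove the machine-specific interface name, which is the limitation described above.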

Paride Legovini (paride) wrote on 2022-03-21:

#2

Hello Nick and thanks for this bug report. I didn't try to reproduce your specific issue, however I can see how it can happen. Unfortunately detecting when network is ready is a tricky thing, as the definition of "ready" is not fixed and it's very dependent on the specific configuration of the system.

Implementing a wait/retry mechanism in named could work, but that's out of scope of Ubuntu; my suggestion here is to file an upstream bug. If you do so please link it here so we'll be able to follow it.

This said, I think there is room to improve on the "network ready enough" detection mechanism in Ubuntu, and bind9 is not the only package that will benefit from it. I'll discuss the topic with the Server Team.

tags:

added: server-triage-discuss

Paride Legovini (paride) wrote on 2022-03-21:

#3

@Nick: LP: #1909822 has been reported as fixed in Jammy. Could you please test if you can still reproduce the issue you described here on a clean Jammy system? Thanks!

Marking this as Incomplete for now.

tags:	removed: server-triage-discuss
Changed in bind9 (Ubuntu):
status:	New → Incomplete

Nick Tait (nick.t) wrote on 2022-03-21:

#4

Hi Paride.

Thanks for your updates. It is good news that there is a fix for #1909822.

However this fix won't help with the current issue, because the problem here isn't whether or not BIND uses interfaces that are added after BIND is running, but rather the fact that BIND doesn't have connectivity to the root DNS servers when it starts.

After reading your first update, I do agree that improving systemd-networkd-wait-online "network ready enough" is the preferred way to go, so let's focus on that solution...

In update #1 I mentioned: I experimented with including "--any" in the systemd-networkd-wait-online arguments, but found this wasn't 100% reliable and TBH I'm not entirely sure why.

The ExecStartPre command I'd tested was:

ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --operational-state=routable --any --timeout=10 --quiet

I don't know why adding "--any" didn't work, because this particular server only has one network interface. I wondered if it was actually picking up the "lo" interface? I only just now noticed there is an option to ignore interfaces, so maybe I should have tried adding "--ignore=lo" above? But TBH if that is the problem it seems like something that should be fixed in systemd-networkd-wait-online?

Regardless of why it didn't work, can I please specify the behaviour that I believe the above command should implement: Where you've specified a state of "routable" and said "any" interface, IMO the goal should be the existence of a static route in the main routing table?

What do you think?

Thanks,
Nick.

Paride Legovini (paride) wrote on 2022-03-22:

#5

@Nick I'm not convinced by --any, as AIUI we don't want "any" interface, but "the right one(s)". I'll mark this bug (again) for further discussion.

tags:

added: server-triage-discuss

Nick Tait (nick.t) wrote on 2022-03-23:

#6

Hi Paride.

The fundamental problem I see with your last statement is how do you know what "the right one(s)" are? That will depend on BIND configuration, such as whether named is launched with a '-4' or '-6' option, and possibly even the value of configuration options such as 'listen-on' and 'listen-on-v6'?

Perhaps if we start with the 'need' that BIND has, and then work backwards, we will converge on a solution?

Here is my thinking:
1. The "Unable to fetch DNSKEY set '.': failure" error results from BIND trying to query the root DNS servers when it starts up, but not having the requisite level of network connectivity to do so.
2. In order to access the root DNS servers, the host needs access to the Internet...
3. The best indicator that Internet connectivity is available is the presence of a default route.
4. The default route requirement could be met by IPv4 or IPv6, so this could be satisfied by either of the following:
    * The IPv4 'main' routing table contains an entry for "0.0.0.0/0"
    * The IPv6 'main' routing table contains an entry for "::/0"
5. Therefore I believe we need a command that can be added as an 'ExecStartPre' option in named.service, that will wait until either of the above conditions (described in 4 above) is met.
6. Some potential solutions could be:
    a) Invocation of "systemd-networkd-wait-online" with a combination of existing parameters that the program will interpret to mean "wait until either of the requirements described in 4 above is met".
    b) Invocation of "systemd-networkd-wait-online" with a new parameter that the program will interpret to mean "wait until either of the requirements described in 4 above is met".
    c) Use of a different (new?) tool whose specific purpose is to "wait until either of the requirements described in 4 above is met".
7. Whichever solution is chosen, the tool should be generic enough that it can be used for other services, and should provide the ability to select only IPv4 or only IPv6, or both. This should be controlled via a command-line parameter, which for consistency with other Linux programs should be:
    * "-4" = use only IPv4, even if the host machine is capable of IPv6.
    * "-6" = use only IPv6, even if the host machine is capable of IPv4.
    * Specifying neither option should mean both IPv4 and IPv6.
    * NB: "-4" and "-6" are mutually exclusive.

My suggestion in my earlier comment #4 is an example of solution (a), but it doesn't satisfy 7 above. So I concede there are other (better) options that probably need to be considered?

Keen to hear your thoughts?

Thanks,
Nick.
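
As an illustration of option (c) in the list above, here is a rough sketch of a "wait for a default route" helper. It is not an existing tool: the function names `has_default_route` and `wait_for_default_route` are hypothetical, and the only external command assumed is iproute2's `ip`, queried with "ip -4 route show default" and "ip -6 route show default" against the main routing table.

```shell
#!/bin/sh
# Return 0 if the main routing table has a default route for the given
# address family ("4" for 0.0.0.0/0, "6" for ::/0).
has_default_route() {
    [ -n "$(ip "-$1" route show default 2>/dev/null)" ]
}

# wait_for_default_route FAMILIES [TIMEOUT_SECONDS]
# FAMILIES is "4", "6", or "4 6"; with "4 6" either family satisfies the wait.
# Returns 0 as soon as a default route is seen, 1 if the timeout expires.
wait_for_default_route() {
    families="$1"
    remaining="${2:-10}"
    while :; do
        for family in $families; do
            if has_default_route "$family"; then
                return 0
            fi
        done
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Example: wait_for_default_route "4 6" 10
```

A wrapper script could map the proposed "-4" and "-6" options onto the first argument ("4" or "6") and pass "4 6" when neither is given. Polling once per second is a simplification; a real tool would more likely subscribe to route change notifications rather than poll.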

Simon Déziel (sdeziel) wrote on 2022-03-23:

#7

Hi Nick,

As you mentioned in the issue description, "Unable to fetch DNSKEY set '.': failure" is not a fatal error as named is still fully functional.

This is because named comes with the current root zone KSK (key id 20326) compiled in. The error occurs because named tries to refresh the key using the RFC 5011 mechanism (https://www.rfc-editor.org/rfc/rfc5011.html), but that will be retried, so failing to do it on startup isn't a big deal IMHO. It is even less worrying since the root zone KSK changes very infrequently.

To double check this, I created a Jammy container and provided it with only an IPv6 address. There, I can see the error message due to named starting before the IPv6 address is configured. However, named has no problem providing resolution once the IPv6 address becomes available:

root@jammy-bind:~# journalctl -n 8 -u named
Mar 23 13:40:36 jammy-bind systemd[1]: Started BIND Domain Name Server.
Mar 23 13:40:36 jammy-bind named[120]: network unreachable resolving './NS/IN': 192.112.36.4#53
Mar 23 13:40:36 jammy-bind named[120]: network unreachable resolving './DNSKEY/IN': 192.33.4.12#53
Mar 23 13:40:36 jammy-bind named[120]: managed-keys-zone: Unable to fetch DNSKEY set '.': failure
Mar 23 13:40:36 jammy-bind named[120]: network unreachable resolving './NS/IN': 192.33.4.12#53
Mar 23 13:40:36 jammy-bind named[120]: resolver priming query complete
Mar 23 13:40:38 jammy-bind named[120]: listening on IPv6 interface eth0, fd42:2192:4f89:5adc:216:3eff:fe19:df84#53
Mar 23 13:40:49 jammy-bind named[120]: resolver priming query complete

root@jammy-bind:~# dig +rrcomments +dnssec -t dnskey . @::1

; <<>> DiG 9.18.0-2ubuntu3-Ubuntu <<>> +rrcomments +dnssec -t dnskey . @::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63243
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; COOKIE: ae8a685e179cfece01000000623b23e881248f1ef945af75 (good)
;; QUESTION SECTION:
;.				IN	DNSKEY

;; ANSWER SECTION:
.			172665	IN	DNSKEY	256 3 8 AwEAAak/ZU9wDNQD7XTAGTDkn32UR8I6auRDekbGky+yyWKdUHmwAJv9 0YHCUTib8aVBgNgbxkeeZGRx3W4+XhMZbfUr5fMwmD3u9P2yzJpbRtjG NM/XZvzGs9HHNymz3Bp851anHZfNy6pJud265/XMKzFlAY8sMJjum0hv x/DuCDELLyhsvdfOD9rHM93UXO0bcAjvI8tjZsGI+Pfp9KdxF9vS/sAz pFXKsldix+e6xv8rRS6WPg2LAooxF+eO5DgFSilYmnyCK4VPJ7ntjD/8 m0bs128ZT1eY3oXCbojDv59lLAgrdGSbcVxQF2KHoUHDmkOC5BzG/1xR tW4v/3y4/H8=  ; ZSK; alg = RSASHA256 ; key id = 47671
.			172665	IN	DNSKEY	256 3 8 AwEAAZym4HCWiTAAl2Mv1izgTyn9sKwgi5eBxpG29bVlefq/r+TGCtmU ElvFyBWHRjvf9mBglIlTBRse22dvzNOI+cYrkjD6LOHuxMoc/d4WtXWK dviNmrtWF2GpjmDOI98gLd4BZ0U/lY847mJP9LypFABZcEn3zM3vce4E e1A3upSlFQ2TFyJSD9HvMnP4XneFexBxV96RpLcy2O+u2W6ChIiDCjlr owPCcU3zXfXxyWy/VKM6TOa8gNf+aKaVkcv/eIh5er8rrsqAi9KT8O5h mhzYLkUOQEXVSRORV0RMt9l3JSwWxT1MebEDvtfBag3uo+mZwWSFlpc9 kuzyWBd72Ec=  ; ZSK; alg = RSASHA256 ; key id = 9799
.			172665	IN	DNSKEY	257 3 8 AwEAAaz/tAm8yTn4Mfeh5eyI96WSVexTBAvkMgJzkKTOiW1vkIbzxeF3 +/4RgWOq7HrxRixHlFlExOLAJr5emLvN7SWXgnLh4+B5xQlNVz8Og8kv ArMtNROxVQuCaSnIDdD5LKyWbRd2n9WGe2R8PzgCmr3EgVLrjyBxWezF 0jLHwVN8efS3rCj/EWgvIWgb9tarpVUDK/b58Da+sqqls3eNbuv7pr+e oZG+SrDK6nWeL3c6H5Apxz7LjVc1uTIdsIXxuOLYA4/ilBmSVIzuDWfd RUfhHdY6+cn8HFRm+2hM8AnXGXws9555KrUB5qihylGa8subX2Nn6UwN R1AkUTV74bU=  ; KSK; alg = RSASHA256 ; key id = 20326
.			172665	IN	RRSIG	DNSKEY 8 0 172800 20220412000000 20220322000000 20326 . g2Rjm8rCMXEN7BJezHm7o67VTPmp9ETDJqiTQG9HNK31nAyp8iXGEcux uviojbobzmjuvjI9KSOLQX6QD1C/4lWovapyZQrEl8L5Ja0tP9H720mw y5TYgcsE5wmojjugOLAW+avQ1L62J+dh3wqmuOqS3K7wIzJ6eciOi3cB rlEXJYK5w1b7jM+qf+sOt5xTUQ3YhpmYJK94gPYMBrkLEaWKcU2DP6LT HqeFQviBhUb8hN60kitd92zHt3qfaCIFrbTm3fGdttu7LYlN3OwSlN21 m0/3iuoA9Q4LNimgqhxKEFzKQ/96477E1V9wyjiaxMcp7IL30Ocb8nmQ Ub2FKg==

;; Query time: 0 msec
;; SERVER: ::1#53(::1) (UDP)
;; WHEN: Wed Mar 23 13:43:04 UTC 2022
;; MSG SIZE  rcvd: 1169

Because named works fine despite the annoying failure message, I'd be reluctant to make things more complicated by trying to delay named's startup.

Please note that I only tested with Jammy/Ubuntu 22.04 so your mileage may vary on Focal/Ubuntu 20.04.
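
The check in the transcript above can be scripted. In the dig output, the "ad" (authenticated data) flag in the header line shows that the resolver validated the answer with DNSSEC, which is what confirms that named recovered after the startup error. The following is a small sketch only; the helper name `has_ad_flag` is hypothetical, and it assumes dig's usual ";; flags: ..." header format as shown above.

```shell
#!/bin/sh
# Read dig output on stdin; return 0 if the header's flags list contains "ad".
has_ad_flag() {
    # Match " ad" only inside the flags list, i.e. before the first ";"
    # that terminates it on the ";; flags:" line.
    grep -Eq '^;; flags:[^;]* ad[ ;]'
}

# Example (assumes named is listening on ::1, as in the transcript):
#   dig +dnssec -t dnskey . @::1 | has_ad_flag && echo "DNSSEC validation OK"
```

Run a little while after boot, this gives a yes/no answer on whether the startup failure had any lasting effect.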

Christian Ehrhardt  (paelzer) on 2022-03-23

tags:

added: network-online-ordering
removed: server-triage-discuss

Paride Legovini (paride) wrote on 2022-03-23:

#8

Thanks Simon for setting up a reproducer and verifying! @Nick do you agree with Simon's findings, which basically mean that the error this bug report is about is mostly a cosmetic thing, as named will retry?

Anyway, we acknowledge that in general "service X starts before network is ready" is an issue; we have some other bugs in the same category, collected under the network-online-ordering tag:

https://bugs.launchpad.net/bugs/+bugs?field.tag=network-online-ordering

Thing is, there is no general way (that we can see) to implement a "wait for network to be ready" logic, especially given that the concept of "ready" is not well defined and can vary case by case. Being generally more strict on waiting for networking is certain to produce unwanted side effects in other cases (think of "bind9 never starts" bugs). Waiting for network can only be handled with cooperation from the service in question, which can for example retry on failure, as apparently bind9 does.
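
To make the "retry on failure" idea concrete, here is a generic sketch of the pattern in shell. It is not how named implements its own retries (named is a C daemon with its own internal logic); the function name `retry` and its arguments are purely illustrative.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Run COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
# Returns 0 on the first success, 1 if every attempt failed.
retry() {
    max_attempts="$1"
    delay="$2"
    shift 2
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
    return 0
}

# Example: retry 5 2 dig +short -t dnskey . @::1
```

The point of the pattern is that the service, not the init system, decides what "network ready enough" means for its own operation, and simply tries again until that condition holds.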

I'm marking this as Triaged with Low importance (like most other network-online-ordering).

Changed in bind9 (Ubuntu):
status:	Incomplete → Triaged
importance:	Undecided → Low

Nick Tait (nick.t) wrote on 2022-03-25:

#9

Thanks Simon & Paride.

That is reassuring to know that BIND will retry. Based on that I'm happy for you to treat this as a low priority issue. I still do think it is worth fixing (somehow), but better to deal with it in a generic way that helps other packages too, rather than trying to cobble together a BIND-specific fix.

I have a workaround that stops this error from bugging me (i.e. using systemd-networkd-wait-online) so I'm happy. :-)

Thanks for all your time and efforts.

Nick.

Paride Legovini (paride) wrote on 2022-09-22:

#10

Retriaging this bug about 6 months later. I think it still is in its correct state: we still don't have a general and reliable way to detect that "network is at least as online as it should be", see my comment above for more details.
