named.service starts too early: Unable to fetch DNSKEY set '.': failure

Bug #1965521 reported by Nick Tait
This bug affects 1 person
Affects: bind9 (Ubuntu)
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

I have two servers that run named.service, and I recently discovered that (on both servers), when I reboot and then run "systemctl status named.service" (or "journalctl -u named.service"), I see messages like this:

Mar 18 21:03:05 mail named[859]: managed-keys-zone/xxx: Unable to fetch DNSKEY set '.': failure

...where xxx is the view name, and this error is repeated for each view. (I have many views.)

(OTOH if the server is already up and running, and then I start named.service it starts up with no errors.)

By creating a shell script that ran various "ip" diagnostic commands, and adding this to named.service as an "ExecStartPre" hook, I was able to determine that the error above occurs because BIND is being started before the network is available. (The network interfaces didn't even have IP addresses at that time.)
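For anyone wanting to repeat that kind of diagnosis, here is a minimal sketch of such a hook, written in Python rather than shell. It is not the script used for this report: the names summarize_addresses and main are made up for illustration, and it assumes an iproute2 version that supports JSON output ("ip -j addr show"). It prints one line per interface with its operational state and the addresses assigned at that moment.

```python
import json
import subprocess


def summarize_addresses(ip_json_text):
    """Turn the JSON text printed by `ip -j addr show` into one line per interface."""
    lines = []
    for iface in json.loads(ip_json_text or "[]"):
        # Each addr_info entry normally carries the address in "local".
        addrs = [
            f"{a['local']}/{a.get('prefixlen', '?')}"
            for a in iface.get("addr_info", [])
            if "local" in a
        ]
        state = iface.get("operstate", "UNKNOWN")
        lines.append(f"{iface['ifname']}: state={state} addrs={','.join(addrs) or 'none'}")
    return lines


def main():
    # Assumption: `ip` is on PATH and understands -j (JSON output).
    result = subprocess.run(
        ["ip", "-j", "addr", "show"], capture_output=True, text=True, check=True
    )
    for line in summarize_addresses(result.stdout):
        print(line)  # output of an ExecStartPre command is captured in the journal
```

A script that ends with a call to main() could be wired in through an ExecStartPre= line in a drop-in, prefixed with "-" so that a failure of the diagnostic does not block named from starting. Where the script is installed is up to you.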

I should highlight at this point that in spite of the error, as far as I know BIND was running OK, serving DNS as normal. I can only guess that it had cached copies of the required records or something like that?

Anyway I don't like seeing errors like this in my logs, so...

My initial attempt to solve this problem involved setting named.service to start after network-online.target. Results were mixed. Sometimes there were no errors on reboot, but more often than not the same errors were there.

Then I worked out that network-online.target is based on systemd-networkd-wait-online, which by default only waits until IP addresses are assigned to interfaces. To solve the error above, I needed it to wait for the operational status to become "routable". I was able to achieve this by specifying the following in /etc/systemd/system/named.service.d/override.conf (i.e. file content is between the "-----" lines)

-----
[Unit]
After=network-online.target

[Service]
ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --operational-state=routable --timeout=10 --quiet
-----

Effectively this causes systemd to delay starting named.service until the network interfaces have addresses, and then when it does start named.service, the ExecStartPre line above waits (for up to 10 seconds) until network routes are added before BIND (i.e. /usr/sbin/named) is launched.

Can I please request that the named.service definition in the bind9 package is updated to include the options above?

Final note: Although this bug would appear to be similar to 1909822 ( https://bugs.launchpad.net/ubuntu/+source/bind9/+bug/1909822 ), the error message I observed is different, and so I've raised this as a separate bug report. Having said that, I suspect the solution that I'm offering above would fix both issues (and would be slightly more optimal than that offered by 1909822).

System-specific information follows:
# lsb_release -rd
Description: Ubuntu 21.10
Release: 21.10

# apt-cache policy bind9
bind9:
  Installed: 1:9.16.15-1ubuntu1.2
  Candidate: 1:9.16.15-1ubuntu1.2
  Version table:
 *** 1:9.16.15-1ubuntu1.2 500
        500 http://nz.archive.ubuntu.com/ubuntu impish-updates/main amd64 Packages
        500 http://nz.archive.ubuntu.com/ubuntu impish-security/main amd64 Packages
        100 /var/lib/dpkg/status
     1:9.16.15-1ubuntu1 500
        500 http://nz.archive.ubuntu.com/ubuntu impish/main amd64 Packages

Thanks,
Nick.

Nick Tait (nick.t) wrote :

I discovered that the above workaround isn't ideal when the server has multiple network interfaces, because the systemd-networkd-wait-online command above will wait for all interfaces to reach routable status. This may cause systemd-networkd-wait-online to time out (after 10 seconds, as per the --timeout argument), and if you then run "systemctl status named.service", it shows a failed status for the ExecStartPre command.

I experimented with including "--any" in the systemd-networkd-wait-online arguments, but found this wasn't 100% reliable and TBH I'm not entirely sure why. But for now I've resorted to including the interface name in the above command instead, such as:

ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --interface=eno1:routable --timeout=10 --quiet

Obviously the interface name is machine-specific, which makes it impractical to include this command as a general-purpose fix in the repo version of named.service. So I've now come to the conclusion that the best way to fix this issue is to implement a change to BIND itself (i.e. /usr/sbin/named), to make it retry a few times before logging the error message above. (FYI this is outside my skill set, so I guess I'm asking the maintainer of BIND to determine the feasibility of this request.)
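The "retry a few times before logging" idea can be illustrated with a short sketch. This is plain Python for illustration only and says nothing about how named is actually implemented; fetch_with_retry, its parameters and the delays are all hypothetical. The point is simply that intermediate failures stay silent and only the final one is logged.

```python
import time


def fetch_with_retry(fetch, attempts=3, first_delay=1.0, sleep=time.sleep, log=print):
    """Call fetch() until it succeeds; log an error only after the last attempt fails."""
    delay = first_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:          # e.g. "network unreachable" at boot
            if attempt == attempts:
                log(f"giving up after {attempts} attempts: {exc}")
                raise
            sleep(delay)                # wait quietly, then try again
            delay *= 2                  # back off: 1s, 2s, 4s, ...
```

With the defaults above, a fetch that only starts working on the third try would return normally and log nothing, which is the behaviour being asked for.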

Thanks,
Nick.

Paride Legovini (paride) wrote :

Hello Nick and thanks for this bug report. I didn't try to reproduce your specific issue, however I can see how it can happen. Unfortunately detecting when network is ready is a tricky thing, as the definition of "ready" is not fixed and it's very dependent on the specific configuration of the system.

Implementing a wait/retry mechanism in named could work, but that's out of scope of Ubuntu; my suggestion here is to file an upstream bug. If you do so please link it here so we'll be able to follow it.

This said, I think there is room to improve on the "network ready enough" detection mechanism in Ubuntu, and bind9 is not the only package that will benefit from it. I'll discuss the topic with the Server Team.

tags: added: server-triage-discuss

Paride Legovini (paride) wrote :

@Nick: LP: #1909822 has been reported as fixed in Jammy. Could you please test if you can still reproduce the issue you described here on a clean Jammy system? Thanks!

Marking this as Incomplete for now.

tags: removed: server-triage-discuss
Changed in bind9 (Ubuntu):
status: New → Incomplete

Nick Tait (nick.t) wrote :

Hi Paride.

Thanks for your updates. It is good news that there is a fix for #1909822.

However this fix won't help with the current issue, because the problem here isn't whether or not BIND uses interfaces that are added after BIND is running, but rather the fact that BIND doesn't have connectivity to the root DNS servers when it starts.

After reading your first update, I do agree that improving systemd-networkd-wait-online "network ready enough" is the preferred way to go, so let's focus on that solution...

In update #1 I mentioned: I experimented with including "--any" in the systemd-networkd-wait-online arguments, but found this wasn't 100% reliable and TBH I'm not entirely sure why.

The ExecStartPre command I'd tested was:

ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --operational-state=routable --any --timeout=10 --quiet

I don't know why adding "--any" didn't work, because this particular server only has one network interface. I wondered if it was actually picking up the "lo" interface? I only just now noticed there is an option to ignore interfaces, so maybe I should have tried adding "--ignore=lo" above? But TBH if that is the problem it seems like something that should be fixed in systemd-networkd-wait-online?

Regardless of why it didn't work, can I please specify the behaviour that I believe the above command should implement: where you've specified a state of "routable" and said "any" interface, IMO the goal should be the existence of a static route in the main routing table?

What do you think?

Thanks,
Nick.

Paride Legovini (paride) wrote :

@Nick I'm not convinced by --any, as AIUI we don't want "any" interface, but "the right one(s)". I'll mark this bug (again) for further discussion.

tags: added: server-triage-discuss

Nick Tait (nick.t) wrote :

Hi Paride.

The fundamental problem I see with your last statement is how do you know what "the right one(s)" are? That will depend on BIND configuration, such as whether named is launched with a '-4' or '-6' option, and possibly even the value of configuration options such as 'listen-on' and 'listen-on-v6'?

Perhaps if we start with the 'need' that BIND has, and then work backwards, we will converge on a solution?

Here is my thinking:
1. The "Unable to fetch DNSKEY set '.': failure" error results from BIND trying to query the root DNS servers when it starts up, but not having the requisite level of network connectivity to do so.
2. In order to access the root DNS servers, the host needs access to the Internet...
3. The best indicator that Internet connectivity is available is the presence of a default route.
4. The default route requirement could be met by IPv4 or IPv6, so this could be satisfied by either of the following:
    * The IPv4 'main' routing table contains an entry for "0.0.0.0/0"
    * The IPv6 'main' routing table contains an entry for "::/0"
5. Therefore I believe we need a command that can be added as an 'ExecStartPre' option in named.service, that will wait until either of the conditions described in 4 above is met.
6. Some potential solutions could be:
    a) Invocation of "systemd-networkd-wait-online" with a combination of existing parameters that the program will interpret to mean "wait until either of the requirements described in 5 above are met".
    b) Invocation of "systemd-networkd-wait-online" with a new parameter that the program will interpret to mean "wait until either of the requirements described in 5 above are met".
    c) Use of a different (new?) tool whose specific purpose is to "wait until either of the requirements described in 5 above are met".
7. Whichever solution is chosen, the tool should be generic enough that it can be used for other services, and should provide the ability to select only IPv4 or only IPv6, or both. This should be controlled via a command-line parameter, which for consistency with other Linux programs should be:
    * "-4" = use only IPv4, even if the host machine is capable of IPv6.
    * "-6" = use only IPv6, even if the host machine is capable of IPv4.
    * Specifying neither option should mean both IPv4 and IPv6.
    * NB: "-4" and "-6" are mutually exclusive.
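As an illustration of option (c), points 4 to 7 could be sketched roughly as follows in Python. This is a hypothetical helper, not an existing tool and not part of systemd or BIND; every function name and the --timeout option are invented for the example. It assumes that the kernel's /proc/net/route and /proc/net/ipv6_route files are a good enough stand-in for "the main routing table contains a default route", which may not hold on systems that use policy routing.

```python
import argparse
import time

RTF_UP = 0x0001      # route is usable
RTF_REJECT = 0x0200  # unreachable/reject route; does not count as connectivity


def _usable(flags):
    return bool(flags & RTF_UP) and not flags & RTF_REJECT


def has_default_route_v4(text):
    """True if /proc/net/route content lists a usable route to 0.0.0.0/0."""
    for line in text.splitlines()[1:]:          # first line is the column header
        f = line.split()
        # Columns: Iface Destination Gateway Flags RefCnt Use Metric Mask ...
        if len(f) >= 8 and f[1] == "00000000" and f[7] == "00000000" and _usable(int(f[3], 16)):
            return True
    return False


def has_default_route_v6(text):
    """True if /proc/net/ipv6_route content lists a usable route to ::/0."""
    for line in text.splitlines():
        f = line.split()
        # Columns: dest plen src plen nexthop metric refcnt use flags device
        if len(f) >= 10 and f[0] == "0" * 32 and f[1] == "00" and _usable(int(f[8], 16)):
            return True
    return False


def _read(path):
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""                               # e.g. IPv6 disabled


def default_route_present(use_v4=True, use_v6=True):
    return bool((use_v4 and has_default_route_v4(_read("/proc/net/route")))
                or (use_v6 and has_default_route_v6(_read("/proc/net/ipv6_route"))))


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Wait until a default route exists.")
    family = parser.add_mutually_exclusive_group()   # -4 and -6 cannot be combined
    family.add_argument("-4", dest="only_v4", action="store_true", help="IPv4 only")
    family.add_argument("-6", dest="only_v6", action="store_true", help="IPv6 only")
    parser.add_argument("--timeout", type=float, default=10.0, help="seconds to wait")
    return parser.parse_args(argv)


def wait_for_default_route(args, check=default_route_present, sleep=time.sleep):
    """Return 0 once a default route is seen, or 1 if the timeout expires first."""
    deadline = time.monotonic() + args.timeout
    while True:
        if check(not args.only_v6, not args.only_v4):
            return 0
        if time.monotonic() >= deadline:
            return 1
        sleep(0.2)
```

A script built on this would finish with "raise SystemExit(wait_for_default_route(parse_args()))" and could then be called from an ExecStartPre= line in the same way as systemd-networkd-wait-online above. Polling is used only to keep the sketch short; a real tool would more likely listen for netlink route notifications.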

My suggestion in my earlier comment #4 is an example of solution (a), but it doesn't satisfy 7 above. So I concede there are other (better) options that probably need to be considered?

Keen to hear your thoughts?

Thanks,
Nick.

Simon Déziel (sdeziel) wrote :

Hi Nick,

As you mentioned in the issue description, "Unable to fetch DNSKEY set '.': failure" is not a fatal error as named is still fully functional.

This is because named comes with the current root zone KSK (key id 20326) compiled in. The error occurs because it tries to refresh the key using the RFC 5011 mechanism (https://www.rfc-editor.org/rfc/rfc5011.html), but that will be retried, so failing to do it on startup isn't a big deal IMHO. It is even less worrying since the root zone KSK changes very infrequently.

To double check this, I created a Jammy container and provided it with only an IPv6 address. There, I can see the error message due to named starting before the IPv6 address is configured. However, named has no problem providing resolution once the IPv6 address becomes available:

root@jammy-bind:~# journalctl -n 8 -u named
Mar 23 13:40:36 jammy-bind systemd[1]: Started BIND Domain Name Server.
Mar 23 13:40:36 jammy-bind named[120]: network unreachable resolving './NS/IN': 192.112.36.4#53
Mar 23 13:40:36 jammy-bind named[120]: network unreachable resolving './DNSKEY/IN': 192.33.4.12#53
Mar 23 13:40:36 jammy-bind named[120]: managed-keys-zone: Unable to fetch DNSKEY set '.': failure
Mar 23 13:40:36 jammy-bind named[120]: network unreachable resolving './NS/IN': 192.33.4.12#53
Mar 23 13:40:36 jammy-bind named[120]: resolver priming query complete
Mar 23 13:40:38 jammy-bind named[120]: listening on IPv6 interface eth0, fd42:2192:4f89:5adc:216:3eff:fe19:df84#53
Mar 23 13:40:49 jammy-bind named[120]: resolver priming query complete

root@jammy-bind:~# dig +rrcomments +dnssec -t dnskey . @::1

; <<>> DiG 9.18.0-2ubuntu3-Ubuntu <<>> +rrcomments +dnssec -t dnskey . @::1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63243
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags: do; udp: 1232
; COOKIE: ae8a685e179cfece01000000623b23e881248f1ef945af75 (good)
;; QUESTION SECTION:
;. IN DNSKEY

;; ANSWER SECTION:
. 172665 IN DNSKEY 256 3 8 AwEAAak/ZU9wDNQD7XTAGTDkn32UR8I6auRDekbGky+yyWKdUHmwAJv9 0YHCUTib8aVBgNgbxkeeZGRx3W4+XhMZbfUr5fMwmD3u9P2yzJpbRtjG NM/XZvzGs9HHNymz3Bp851anHZfNy6pJud265/XMKzFlAY8sMJjum0hv x/DuCDELLyhsvdfOD9rHM93UXO0bcAjvI8tjZsGI+Pfp9KdxF9vS/sAz pFXKsldix+e6xv8rRS6WPg2LAooxF+eO5DgFSilYmnyCK4VPJ7ntjD/8 m0bs128ZT1eY3oXCbojDv59lLAgrdGSbcVxQF2KHoUHDmkOC5BzG/1xR tW4v/3y4/H8= ; ZSK; alg = RSASHA256 ; key id = 47671
. 172665 IN DNSKEY 256 3 8 AwEAAZym4HCWiTAAl2Mv1izgTyn9sKwgi5eBxpG29bVlefq/r+TGCtmU ElvFyBWHRjvf9mBglIlTBRse22dvzNOI+cYrkjD6LOHuxMoc/d4WtXWK dviNmrtWF2GpjmDOI98gLd4BZ0U/lY847mJP9LypFABZcEn3zM3vce4E e1A3upSlFQ2TFyJSD9HvMnP4XneFexBxV96RpLcy2O+u2W6ChIiDCjlr owPCcU3zXfXxyWy/VKM6TOa8gNf+aKaVkcv/eIh5er8rrsqAi9KT8O5h mhzYLkUOQEXVSRORV0RMt9l3JSwWxT1MebEDvtfBag3uo+mZwWSFlpc9 kuzyWBd72Ec= ; ZSK; alg = RSASHA256 ; key id = 9799
. 172665 IN DNSKEY 257 3 8 AwEAAaz/tAm8yTn4Mfeh5eyI96WSVexTBAvkMgJzkKTOiW1vkIbzxeF3 +/4RgWOq7HrxRixHlFlExOLAJr5emLvN7SWXgnLh4+B5xQlNVz8Og8kv ArMtNROxVQuCaSnIDdD5LKyWbRd2n9WGe2R8PzgCmr3EgVLrjyBxWezF 0jLHwVN8efS3rCj/EWgvIWgb9tarpVUDK/b58Da+sqqls3eNbuv7pr+e oZG+SrDK6nWeL3c6H5Apxz7LjVc1uTIdsIXxuOLYA4/ilBmSVIzuDWfd RUfhHdY6+cn8...


tags: added: network-online-ordering
removed: server-triage-discuss

Paride Legovini (paride) wrote :

Thanks Simon for setting up a reproducer and verifying! @Nick do you agree with Simon's findings, which basically mean that the error this bug report is about is mostly a cosmetic thing, as named will retry?

Anyway, we acknowledge that in general "service X starts before network is ready" is an issue; we have some other bugs in the same category, collected under the network-online-ordering tag:

  https://bugs.launchpad.net/bugs/+bugs?field.tag=network-online-ordering

Thing is, there is no general way (that we can see) to implement a "wait for network to be ready" logic, especially given that the concept of "ready" is not well defined and can vary case by case. Being generally stricter about waiting for networking is certain to produce unwanted side effects in other cases (think of "bind9 never starts" bugs). Waiting for network can only be handled with cooperation from the service in question, which can for example retry on failure, as apparently bind9 does.

I'm marking this as Triaged with Low importance (like most other network-online-ordering).

Changed in bind9 (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → Low

Nick Tait (nick.t) wrote :

Thanks Simon & Paride.

That is reassuring to know that BIND will retry. Based on that I'm happy for you to treat this as a low priority issue. I still do think it is worth fixing (somehow), but better to deal with it in a generic way that helps other packages too, rather than trying to cobble together a BIND-specific fix.

I have a workaround that stops this error from bugging me (i.e. using systemd-networkd-wait-online) so I'm happy. :-)

Thanks for all your time and efforts.

Nick.

Paride Legovini (paride) wrote :

Retriaging this bug about 6 months later. I think it still is in its correct state: we still don't have a general and reliable way to detect that "network is at least as online as it should be", see my comment above for more details.
