named.service starts too early: Unable to fetch DNSKEY set '.': failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
bind9 (Ubuntu) | Triaged | Low | Unassigned |
Bug Description
I have two servers that run named.service, and I recently discovered that (on both servers), when I reboot and then run "systemctl status named.service" (or "journalctl -u named.service"), I see messages like this:
Mar 18 21:03:05 mail named[859]: managed-
...where xxx is the view name, and this error is repeated for each view. (I have many views.)
(OTOH if the server is already up and running, and then I start named.service it starts up with no errors.)
By creating a shell script that ran various "ip" diagnostic commands, and adding it to named.service as an "ExecStartPre" hook, I was able to determine that the error above occurs because BIND is being started before the network is available. (The network interfaces didn't even have IP addresses at that time.)
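A minimal sketch of what such a diagnostic hook could look like follows. It is not the script actually used for this report; the function name net_snapshot is hypothetical, and it assumes the iproute2 "ip" tool is installed.

```shell
#!/bin/sh
# Hypothetical diagnostic hook; not the reporter's actual script.
# Prints a snapshot of interface addresses and routes so the log shows
# what the network looked like just before named is launched.

net_snapshot() {
    echo "== network snapshot at $(date '+%F %T') =="
    if command -v ip >/dev/null 2>&1; then
        echo "-- addresses --"
        ip -brief address show 2>&1 || echo "(ip address failed)"
        echo "-- routes --"
        ip route show 2>&1 || echo "(ip route failed)"
    else
        echo "ip command not found"
    fi
    # Always succeed so the hook never prevents the service from starting.
    return 0
}

net_snapshot
```

When a script like this runs as an ExecStartPre command, its output should end up in the journal for the unit (assuming the default systemd output settings), so it can be read with "journalctl -u named.service" next to the named messages.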
I should highlight at this point that in spite of the error, as far as I know BIND was running OK, serving DNS as normal. I can only guess that it had cached copies of the required records or something like that?
Anyway I don't like seeing errors like this in my logs, so...
My initial attempt to solve this problem involved setting named.service to start after network-
Then I worked out that network-
-----
[Unit]
After=network-
[Service]
ExecStartPre=
-----
Effectively this causes systemd to delay starting named.service until the network interfaces have addresses, and then when it does start named.service, the ExecStartPre line above waits (for up to 10 seconds) until network routes are added before BIND (i.e. /usr/sbin/named) is launched.
Can I please request that the named.service definition in the bind9 package is updated to include the options above?
Final note: Although this bug would appear to be similar to 1909822 ( https:/
System-specific information follows:
# lsb_release -rd
Description: Ubuntu 21.10
Release: 21.10
# apt-cache policy bind9
bind9:
Installed: 1:9.16.
Candidate: 1:9.16.
Version table:
*** 1:9.16.
500 http://
500 http://
100 /var/lib/
1:
500 http://
Thanks,
Nick.
tags: added: network-online-ordering; removed: server-triage-discuss
I discovered that the above workaround isn't ideal when the server has multiple network interfaces, because the systemd-networkd-wait-online command above will wait for all interfaces to reach routable status. This may cause systemd-networkd-wait-online to time out (after 10 seconds, as per the --timeout argument), and if you then run "systemctl status named.service", it shows a failed status for the ExecStartPre command.
I experimented with including "--any" in the systemd-networkd-wait-online arguments, but found this wasn't 100% reliable and TBH I'm not entirely sure why. But for now I've resorted to including the interface name in the above command instead, such as:
ExecStartPre=-/lib/systemd/systemd-networkd-wait-online --interface=eno1:routable --timeout=10 --quiet
Obviously the interface name is machine-specific, which makes it impractical to include this command as a general-purpose fix in the repo version of named.service. So I've now come to the conclusion that the best way to fix this issue is to change BIND itself (i.e. /usr/sbin/named) so that it retries a few times before logging the error message above. (FYI this is outside my skill set, so I guess I'm asking the maintainer of BIND to determine the feasibility of this request.)
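To illustrate the "retry a few times before logging" behaviour I have in mind, here is a rough shell sketch. It only shows the idea; it says nothing about how named works internally and is not a proposed patch. The helper name retry_quiet and its arguments are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to $1 times, sleeping $2 seconds
# between attempts, and print an error only once every attempt has failed.
retry_quiet() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # Success on any attempt ends the loop without logging anything.
        if "$@"; then
            return 0
        fi
        # Wait before the next attempt, but not after the last one.
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max_attempts attempts: $*" >&2
    return 1
}
```

For example, "retry_quiet 5 2 dig +short . DNSKEY" would try the lookup up to five times, two seconds apart, and complain only if all five attempts fail. Early failures that clear up once the network is ready would then never reach the logs.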
Thanks,
Nick.