NM-controlled dnsmasq prevents other DNS servers from starting

Bug #959037 reported by Alkis Georgopoulos
This bug affects 18 people
Affects                     Status        Importance  Assigned to              Milestone
djbdns (Ubuntu)             Confirmed     Undecided   Unassigned
  Precise                   Won't Fix     Undecided   Unassigned
dnsmasq (Ubuntu)            Fix Released  Undecided   Unassigned
  Precise                   Won't Fix     High        Mathieu Trudel-Lapierre
network-manager (Ubuntu)    Confirmed     Low         Unassigned
  Precise                   Won't Fix     High        Mathieu Trudel-Lapierre
pdns-recursor (Ubuntu)      Invalid       Undecided   Unassigned
  Precise                   Invalid       Undecided   Unassigned
pdnsd (Ubuntu)              Invalid       Undecided   Unassigned
  Precise                   Invalid       Undecided   Unassigned

Bug Description

As described in https://blueprints.launchpad.net/ubuntu/+spec/foundations-p-dns-resolving, network manager now starts a dnsmasq instance for local DNS resolving.

That breaks the default bind9 and dnsmasq installations, for people that actually want to install a DNS server.
Having to manually comment out "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf doesn't sound good, and if it stays that way, it should be moved to the bind9 and dnsmasq postinst scripts.

Please make network-manager smarter so that it checks if bind9 or dnsmasq are installed, so that it doesn't start the local resolver in that case.
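
For reference, the manual workaround described above can be sketched as a small shell function. This is only an illustration, not something shipped by any package: the function name `disable_nm_dnsmasq` is made up, and the config path is an argument so the edit can be tried on a copy first.

```shell
#!/bin/sh
# Hypothetical helper: comment out an active "dns=dnsmasq" line in a
# NetworkManager.conf-style file. Pass the file path as the first argument.
disable_nm_dnsmasq() {
    conf=${1:-/etc/NetworkManager/NetworkManager.conf}
    [ -f "$conf" ] || { echo "no such file: $conf" >&2; return 1; }
    # -i.bak edits in place and keeps a backup copy next to the file.
    sed -i.bak 's/^[[:space:]]*dns=dnsmasq[[:space:]]*$/#dns=dnsmasq/' "$conf"
}
```

Lines that are already commented out are left alone, so running it twice is harmless.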

Mathieu Trudel-Lapierre (cyphermox) wrote :

Well, that's already partly done. dnsmasq will fail to start when bind is running, as it should, based on whether port 53 is already in use.

As another option, you may also wish to switch dns=dnsmasq to dns=bind to use bind directly as a resolver. There are other reasons to have dnsmasq and/or bind installed, so even checking for existence isn't the right way to cover this.

Mathieu Trudel-Lapierre (cyphermox) wrote :

I don't think we'll cover this particular use case for Precise. I understand your requirement and how the need to change the settings in /etc/NetworkManager/NetworkManager.conf isn't great, but it's a one-time thing and isn't something we can safely do as part of the install processes for dnsmasq or bind. Then, for the reasons above other options aren't available.

There's another possibility to make this easier by making sure Bind always starts before NetworkManager, but most cases will not actually see bind and NetworkManager installed on the same system; and fixing this would require migrating bind from a sysvinit script to a new upstart job.

I'm keeping the task open as it's absolutely a valid request, we just won't have time to focus on fixing this for the Precise release. (Sorry)

Changed in network-manager (Ubuntu):
status: New → Triaged
importance: Undecided → Low
Alkis Georgopoulos (alkisg) wrote :

> I don't think we'll cover this particular use case for Precise.

Excuse me, but how is installing bind9 or dnsmasq a "particular use case"?
I'm talking about the default installation, not some corner case...

> most cases will not actually see bind and NetworkManager installed on the same system

We have 250 schools here that use NetworkManager and dnsmasq as the DNS server, are there any stats that show that this is actually rare?
And, actually more rare than the split VPN need that the local resolver addresses?

Since the local resolver implementation seems a bit immature and needs to break two packages in order to work, one of them in main, wouldn't it be better if it was postponed and not be applied in an LTS release until it's more cooperative?

Kind regards,
Alkis Georgopoulos

Mathieu Trudel-Lapierre (cyphermox) wrote :

I think I've been unclear. Using NetworkManager with *bind* is a relatively unusual use case. dnsmasq with NetworkManager for resolution is what we're aiming for *by default*, and that's also part of the default install. Everything has been put in place so that split VPN and such are correctly addressed with NetworkManager spawning dnsmasq as necessary, which is what dns=dnsmasq achieves.

I'm not sure in this case what you mean by breaks two packages. There are a lot of benefits to having a local resolver other than the libc one (split DNS, faster and more efficient resolution, etc.).

I do feel we've tested this well, thoroughly, and that it's very cooperative and efficient. Please, tell me more about your setup so we can make sure we cater for this use case before release.

Mathieu Trudel-Lapierre (cyphermox) wrote :

What I mean here is that default installs normally don't involve installing a local DNS server, except perhaps as a caching resolver. The caching resolver use case is covered by spawning dnsmasq from NetworkManager; the local DNS server isn't. We do think that there are relatively few such installs of a server that depends on NetworkManager running; and that's definitely not the default setup for Ubuntu Server (where NetworkManager isn't installed by default).

Alkis Georgopoulos (alkisg) wrote :

> Please, tell me more about your setup so we can make sure we cater for this use case before release.

1) Install precise-desktop-i386.iso to some-pc.
2) Install dnsmasq. Fails to start. OK, annoying but let's see if the problem goes away after reboot.
3) Reboot. Try to `dig @some-pc ubuntu.com` from *another* PC.

Here's the problem. It *sometimes* works. The "caching resolver" implementation introduced a race condition.
So if the nm-spawned dnsmasq starts first, then the dnsmasq package is broken, and doesn't fulfill its stated goal to "provide DNS to a small network" out of the box and without manual editing of nm conffiles.
If the real dnsmasq starts first, then the "caching resolver" is broken instead.

Because of time constraints, I think that checking if [ -d /etc/dnsmasq.d ] before spawning dnsmasq from nm would satisfy most dnsmasq users. I don't think there are many users that want to keep the nm-spawned dnsmasq when they install the real one. Maybe something similar can be done for bind too.
In the future, maybe the "caching resolver" implementation can start using /etc/dnsmasq.d itself, along with the KVM-spawned instances too, so that people only have one dnsmasq instance instead of multiple ones?

(The reason we're using the desktop iso instead of the server one, is that we need a desktop environment in our servers for our LTSP thin clients, and because teachers work on our servers, they're not headless).

Alkis Georgopoulos (alkisg) wrote :

Another idea would be to create a "spawn-local-resolver" sysvinit or upstart job that lists dnsmasq and bind in its dependencies, so that it always starts after any known DNS servers, ensuring that no race conditions occur for the :53 port checking.

Alkis Georgopoulos (alkisg) wrote :

And yet another idea would be to make a package out of the local resolver configuration, and declare that it Breaks: dnsmasq, bind9.
That way anyone installing dnsmasq or bind9 would get rid of the local resolver package and its conflicting configuration.

Mathieu Trudel-Lapierre (cyphermox) wrote :

If you're installing dnsmasq on top of the standard desktop install, why is it such an issue to edit the NetworkManager configuration to cater it to your needs? Wouldn't it make sense in this case to go a step further and make sure the network connection is set up in /etc/network/interfaces rather than NM, to ensure you don't suddenly get a different IP address from DHCP?

I don't think adding complexity by creating new virtual packages for configurations is a sensible thing to do; and setting up a special upstart job to spawn a local resolver won't work (NM spawns it itself, using a custom configuration on purpose).

Since NM relies on dnsmasq-base for the standalone binary rather than the 'dnsmasq' package itself; I guess a workable solution would be to check for /etc/default/dnsmasq and not spawn dnsmasq if the value of ENABLED is 1. Working on top of that for later releases we might then be able to try speaking to a running instance via DBus in such cases to pass server changes to it.
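
The check proposed here can be modelled in a few lines of shell. NM is written in C and spawns dnsmasq itself, so this is only a sketch of the decision logic, not NM code; both function names are made up for illustration, and the only assumption about /etc/default/dnsmasq is the one stated above, namely that ENABLED=1 means the standalone daemon is meant to run.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit 0) when the standalone dnsmasq daemon
# is enabled in its defaults file.
standalone_dnsmasq_enabled() {
    defaults=${1:-/etc/default/dnsmasq}
    [ -r "$defaults" ] || return 1
    # Match an uncommented ENABLED=1 line. The file is only read with grep,
    # never sourced, so nothing in it gets executed.
    grep -Eq '^[[:space:]]*ENABLED=1[[:space:]]*(#.*)?$' "$defaults"
}

# A caller would skip spawning the local resolver when the check succeeds.
should_spawn_local_resolver() {
    ! standalone_dnsmasq_enabled "$1"
}
```

A missing or unreadable defaults file is treated as "standalone dnsmasq not enabled", which keeps the behaviour of a plain desktop install unchanged.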

Setting to Triaged; we've got a way to possibly deal with this use case...

Changed in network-manager (Ubuntu):
importance: Low → Medium
Mathieu Trudel-Lapierre (cyphermox) wrote :

Does it help any if the daemon dnsmasq is configured to only listen on the interface meant for the ltsp clients, if there's a specific interface for this?

Mathieu Trudel-Lapierre (cyphermox) wrote :

There are probably other, far simpler (and safer) workarounds. What's your configuration for dnsmasq like?

Upstream mentions some configurations at the dnsmasq level that are very relevant for this particular case:

in /etc/dnsmasq.conf:

#except-interface=
# Or which to listen on by address (remember to include 127.0.0.1 if
# you use this.)
#listen-address=

The problem is that listen-address probably shouldn't contain 127.0.0.1 if dnsmasq is meant to be used to resolve things for ltsp clients; also, except-interface=lo may be a good idea here to avoid listening on the loopback interface. That way both instances should start fine.

Alkis Georgopoulos (alkisg) wrote :

Hi Mathieu,

> If you're installing dnsmasq on top of the standard desktop install, why is it such an issue to edit the NetworkManager configuration to cater it to your needs?
> except-interface=lo may be a good idea here to avoid listening on the loopback interface

It's not about me; it's that the default dnsmasq/bind installations are now broken on desktop installations.
For the needs of our schools here in every LTS release we're making repositories with custom packages for automated installation + configuration, so the nm configuration editing is just a sed away, much less trouble than even reporting the bug in the first place.

> Wouldn't it make sense it this case to go further steps and make sure the network connection is setup in /etc/network/interfaces rather than NM, to ensure you don't suddenly get a different IP address from DHCP?

No, network manager supports static IPs (even though we don't always need them even on LTSP servers) and doing it without /etc/network/interfaces allows teachers to see the network status from the nm applet.

> and setting up a special upstart job to spawn a local resolver won't work (NM spawns it itself, using a custom configuration on purpose).

Right, that's why I'm saying that the local resolver implementation is immature, it doesn't integrate with the rest of the distro, but it breaks other packages by launching a DNS server from hardcoded C code instead of a regular sysvinit/upstart script like all the other daemons.

> I guess a workable solution would be to check for /etc/default/dnsmasq and not spawn dnsmasq if the value of ENABLED is 1.

That would indeed be workable, please do implement it.

> listen-address probably shouldn't contain 127.0.0.1 if dnsmasq is meant to be used to resolve things for ltsp clients

Thin client sessions run on the server, and would be resolved from the nm-spawned dnsmasq instance without caching, while LTSP fat client sessions would be resolved from the normal dnsmasq instance with caching.
Having one DNS server for half of the clients and another for the other half is bound to cause confusion and problems.

Anyway, I think I've made my point; if it's too difficult to do for Precise, just postpone it until the next release. To work around the problem for Greek schools I'll make an ltsp-server-dnsmasq package and sed the nm configuration in its postinst.

Cheers,
Alkis

Mathieu Trudel-Lapierre (cyphermox) wrote :

The parsing of /etc/default/dnsmasq won't fly.

Please, do post your dnsmasq configuration so we can try to figure out the right way to integrate this with the current setup.

As for the set of resolvers on the network, that's not exactly the "plan": all systems used to have the libc resolver. Now any system that runs NetworkManager will also be running a local dnsmasq instance since that handles a bunch of issues (more than three servers, split DNS, broken IPv6 DNS, etc) far better than libc. Then they can easily speak to a network DNS server if necessary or resolve directly to the internet.

I don't understand how your systems are set up, and I think that's where the confusion comes from. What I'm expecting is that the LTSP server also runs a dnsmasq daemon to provide resolving to all the LTSP clients; with none of the clients running dnsmasq "locally". Isn't that the case?

I do think there are simpler ways to fix this than doing a sed of the nm configuration.

Alkis Georgopoulos (alkisg) wrote :

> Please, do post your dnsmasq configuration so we can try to figure out the right way to integrate this with the current setup.

Just assume the default dnsmasq configuration, any other settings we have there are completely unrelated to this problem.
When one installs dnsmasq, it's supposed to start listening on 0.0.0.0:53, without manually editing any configuration files at all, i.e. with the stock /etc/dnsmasq.conf.
Now with the local resolver listening on 127.0.0.1:53, dnsmasq complains that the port is in use and fails to start.

> Now any system that runs NetworkManager will also be running a local dnsmasq

Let's step back a bit and talk about that. You're launching a DNS server without using a sysvinit or upstart job. So you're bypassing update-rc.d, policy-rc.d, upstart .override files, package Conflicts:, Provides: etc, all the standard framework for managing services.
Why wouldn't it be more reasonable to start the local resolver service normally like all the other daemons?
Even make a package out of it, and declare that it Conflicts: bind9, dnsmasq, so that people installing those automatically get rid of the local resolver and its conflicting configuration?
If you assume that "network-manager contains a hardcoded DNS server", then the network-manager package itself should conflict with other DNS servers... But that shouldn't be the case, people should be allowed to install any DNS server they want alongside network-manager, and that could be done seamlessly and without editing any configuration files at all if:
network-manager recommended the local-resolver package,
and the local-resolver package conflicted with the other dns server packages.

Then, when I install dnsmasq over the desktop installation, the local-resolver package would be automatically uninstalled, and I wouldn't have to edit any configuration file at all to resolve the conflict, it would be resolved by the package manager.
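
As a rough sketch of the packaging idea above, the relevant debian/control fields might look like the fragment below. The `local-resolver` package does not exist; its name comes from this comment, and the `Depends: dnsmasq-base` line is an assumption based on NM using the dnsmasq-base binary. All other fields are omitted.

```
# Hypothetical fragment, not an actual debian/control file.
Package: network-manager
Recommends: local-resolver

Package: local-resolver
Depends: dnsmasq-base
Conflicts: dnsmasq, bind9
```

With such a layout, installing dnsmasq or bind9 would make the package manager remove local-resolver, since network-manager only recommends it rather than depending on it.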

> I don't understand how your systems are setup, and I think that's where the confusion come from. What I'm expecting is that the LTSP server also runs a dnsmasq daemon to provide resolving to all the LTSP clients; with none of the clients running dnsmasq "locally".

The problem isn't LTSP specific, it applies to anyone that wants to use dnsmasq as a DNS server for his local network.
But yes, for LTSP labs that use dnsmasq, it is exactly as you described it. Now, LTSP clients are all diskless and netbooted, but of two kinds: thin and fat clients. Imagine thin clients like XDMCP clients, i.e. many users working remotely on the same server. So those would be using the local resolver, and miss the caching feature and the speed up that it offers.
Imagine fat clients like regular machines that have nameserver=the LTSP server in their resolv.conf. In the solution you proposed above, those would be using the real dnsmasq instance, with caching and everything.

Mathieu Trudel-Lapierre (cyphermox) wrote :

Then at this point the issue is that dnsmasq is shipped with a default configuration that, while technically "correct", binds on all interfaces and should normally be modified by the admin to suit the needs of their network. That configuration will break with NM making use of dnsmasq-base as a local resolver, and most likely also bombs with qemu/kvm virtual machines.

I want to make this easy for people in your situation, but having a system-wide instance isn't going to work. Not only is it way too complex for what we're trying to achieve (let alone confusing to users to see packages get removed by metapackages), but you always risk that someone modifying the system-wide config meant for use with NetworkManager then causes totally unwanted behavior when NetworkManager tries to add nameservers to the configuration. That's without counting that this still doesn't fix the issue of resolving for virtual machines, which you'll almost certainly want to resolve separately from anything else (and come to think of it, installing virt-manager and virtual machines on your setup probably breaks just as badly as NM).

I've been trying hard to offer solutions and I've proposed configuration changes to the shipped config which cover the issue nicely for your case. If you don't want to apply these changes, that's fine; you're obviously free to implement a fix however you see fit :)

For Precise+1 there may be a way to move dnsmasq initialization in NM to use 127.0.1.1, and allow this in dnsmasq with upstream's help, but that's not even going to solve this particular issue.

Reducing the priority since we won't look at this until Precise+1 and there aren't many reports about such issues.

Changed in network-manager (Ubuntu):
importance: Medium → Low
Marco Menardi (mmenaz) wrote :

I run ltsp also, and even if I remove NM completely, I think that Alkis's setup is interesting and would love to be able to use it also in the near future, so this "breakage" will affect me too.
As a general consideration, I find it scary that installing a package can bring such problems "just because we think that it usually is not used often". I really want GNU/Linux to keep being a predictable system and apt packaging a very good one, so please consider fixing this issue before release.
Thanks in advance

Asmo Koskinen (asmok) wrote :

Me, too. Fix this one. '#dns=dnsmasq' is an ugly hack, not for real humans who run an ltsp server at school.

Here is my bug report:

https://bugs.launchpad.net/ubuntu/+source/ltsp/+bug/955785

Best Regards Asmo Koskinen.

Mathieu Trudel-Lapierre (cyphermox) wrote :

Please read the whole thread and see the various other workarounds provided; granted, the default shipped configuration for dnsmasq doesn't play well with NetworkManager, but it's easy to adjust to your particular needs and work around this issue, which also only happens if the system acting as a server locally runs both dnsmasq and NetworkManager.

We've clearly identified that having dnsmasq bind to particular interfaces is an easy way to work around this and is a very good idea anyway. Please make sure your dnsmasq configuration sets interface= to the interface on which it should listen, and possibly also uncomment bind-interfaces in /etc/dnsmasq.conf. At that point the changes to /etc/NetworkManager/NetworkManager.conf won't be required.
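
As a sketch of that workaround, the two settings could be generated like this. The helper name `write_listen_fragment` is made up, and `eth0` is only an example interface; the output path is left to the caller so the result can be reviewed before the two lines are merged into /etc/dnsmasq.conf.

```shell
#!/bin/sh
# Hypothetical helper: write a dnsmasq config fragment that restricts the
# standalone daemon to a single interface.
write_listen_fragment() {
    out=$1
    iface=${2:-eth0}   # example interface name, adjust to the LAN-facing one
    cat > "$out" <<EOF
# Listen only on the LAN-facing interface...
interface=$iface
# ...and really bind to that interface instead of the wildcard address.
bind-interfaces
EOF
}
```

For example, `write_listen_fragment /tmp/lan-only.conf eth0` produces a file containing just the `interface=eth0` and `bind-interfaces` settings plus comments.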

There isn't just a simple fix for this; the default shipped configuration for dnsmasq is just as "guilty" as network-manager for assuming it should bind on all addresses and all interfaces.

Alkis Georgopoulos (alkisg) wrote :

> This isn't just a simple fix for this; the default shipped configuration for dnsmasq is just as "guilty" as network-manager for assuming it should bind on all addresses and all interfaces.

I disagree; most system services bind to all addresses and interfaces by default (sshd, cupsd, bind, dnsmasq, dhcp, tftp, nbd, inetd, rpc...). And I do want DNS services for my thin client sessions running on the server, so I do want dnsmasq listening on all addresses.

Alkis Georgopoulos (alkisg) wrote :

Mathieu, some help please?

After my ltsp-pnp package comments out dns=dnsmasq in /etc/NetworkManager/NetworkManager.conf, it runs
invoke-rc.d dnsmasq restart from its postinst,
but that fails as the nm-spawned dnsmasq instance is still listening on port 53.

And if I kill it before starting the normal dnsmasq, that leaves the DNS configuration broken...

How can I tell resolv.conf and network-manager to reload their configurations?
Is it necessary to restart the network-manager service? And if it is, is that enough? I'd hate to have to tell the users that they need to restart their servers... :(

Thanks!

Mathieu Trudel-Lapierre (cyphermox) wrote :

You need to restart network-manager after changing the configuration value.
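
The resulting order of operations can be sketched as below. The `run` wrapper and the function name are made up for illustration, and `service network-manager restart` is an assumption about how the NM job is restarted on this release; `invoke-rc.d dnsmasq restart` is the command quoted earlier in the thread. The sketch assumes dns=dnsmasq has already been commented out.

```shell
#!/bin/sh
# Hypothetical wrapper: print commands instead of running them when
# DRY_RUN is set, so the sequence can be reviewed first.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

switch_to_standalone_dnsmasq() {
    # 1. Restart NetworkManager so it re-reads its configuration and no
    #    longer spawns its own dnsmasq instance.
    run service network-manager restart || return 1
    # 2. Port 53 should now be free, so the packaged daemon can start.
    run invoke-rc.d dnsmasq restart
}
```

Running it with `DRY_RUN=1` only prints the two commands in order; without it, both need root.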

It's unfortunate that the configuration needs to be changed, but it's needed. I sympathize with your use case, but using NM together with dnsmasq and resolvconf solves enough other DNS resolution issues to justify inconveniencing those who run dnsmasq separately as a standalone daemon (who have to change the config to suit their needs).

We won't be fixing this for Precise, but I've started discussion with dnsmasq upstream to possibly deal differently with the binding and allow running instances on other IP addresses (such as 127.0.1.1 or so). It's still going to need a fair amount of work to fix dnsmasq's method of binding to interfaces and how NM starts and interfaces with dnsmasq (though I already have patches for NM, but they're useless without the fixes in dnsmasq). At this point though, the simplest way to deal with this remains to edit interface= to map to the relevant external interfaces (eth0, wlan0, etc.) and let the NM-spawned instance get started on lo.

Alkis Georgopoulos (alkisg) wrote :

> At this point though, the simplest way to deal with this remains to edit interfaces= to map to the relevant external interfaces (eth0, wlan0, etc.) and let the NM-spawned instance get started on lo.

We can't do that; we need DNS caching for thin client sessions which run on the server with DNS=127.0.0.1. We need to completely disable the nm dnsmasq spawning.

> You need to restart network-manager after changing the configuration value.

Thank you, I think that's too much to do from a postinst so I'll probably document it as part of the installation process.

For the record, I think that the proper way to solve the problem is from libc itself. Ask Simon to allow calling dnsmasq like a library, or communicate with it via a socket, whatever's needed, but no :53 port hooking, this is reserved for real DNS servers, not for helpers for libc shortcomings.

Thanks again for all the feedback,
Alkis

Mathieu Trudel-Lapierre (cyphermox) wrote :

caching> That's one good reason where this is currently failing. The NM instance won't cache. That's disabled on purpose, but we'll re-enable for 12.10 or later once we can have per-user caches and something secure.

library> unfortunately, that won't help. library use, with not being able to keep state (e.g. have I tried this server yet? did it respond?) is one of the issues we're fixing with dnsmasq, which can't be tackled by a library.

Using dnsmasq via dbus is a likely good way to fix this, but there are countless possible issues with assuming that the centrally running instance of dnsmasq is the one you also want to use for resolving your own stuff, and to update with information from DHCP.

Alkis Georgopoulos (alkisg) wrote :

Since this won't be fixed for Precise from the network-manager side, the dnsmasq package now is broken by default in desktop installations.
So I've added the dnsmasq package in the "Affects:" list, to make it easier for people to locate the cause of the problem so that fewer duplicate bug reports are filed (it's an LTS release, I suppose many people will be bitten by it in the next 5 years).

Also, even though it's not the correct place to solve the problem, the dnsmasq.postinst could be temporarily modified to disable the local resolver. I can propose a patch for it if the maintainer is interested.

Mathieu Trudel-Lapierre (cyphermox) wrote :

That wouldn't be the right process though. The configuration itself shipped by default should be patched; that can be done with a simple patch to the dnsmasq package.

Alkis Georgopoulos (alkisg) wrote :

> The configuration itself shipped by default should be patched

If you mean something like:
except-interface=lo
bind-interfaces

...I just tested them and they do allow both dnsmasq instances to run.

But of course those settings won't be acceptable to most dnsmasq users, as listening on "lo" is usually desired too (local DNS cache; DHCP/TFTP for VMs etc). So I don't think that crippling the default dnsmasq functionality is a good way to solve this problem. DNS clients shouldn't hook port 53; it's reserved for DNS servers.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dnsmasq (Ubuntu):
status: New → Confirmed
Bert Voegele (bertvoegele-deactivatedaccount) wrote :

Just as a short reminder, there are more DNS resolvers/servers available as packages out there than just bind and dnsmasq, e.g. djbdns and its derivatives. Until I removed the annoying dns=dnsmasq line in /e/N/Nconf, NM disconnected the WLAN after a couple of minutes, throwing an error about dnsmasq not being able to bind to 127.0.0.1.
I'm puzzled about the default inclusion of dnsmasq as a local resolver for standard users. If a connection is to be shared, it might be useful to bind dnsmasq to the shared iface to provide DHCP and DNS, like it's done with libvirt-bin.

Thomas Hood (jdthood)
summary: - Don't start local resolver if a DNS server is installed
+ Standalone dnsmasq is not compatible out of the box with NM+dnsmasq
summary: - Standalone dnsmasq is not compatible out of the box with NM+dnsmasq
+ Don't start local resolver if a DNS server is installed
Alkis Georgopoulos (alkisg) wrote :

@jdthood: the "Standalone dnsmasq is not compatible out of the box with NM+dnsmasq" title hints that the problem is caused by the dnsmasq package, i.e. that it should be crippled and not listen on "lo" by default in order to coexist with the local resolver implementation.

I don't think this is the case, I don't think the dnsmasq package does anything wrong; I just cross-linked the bug report in case other people hit the problem and try to find it in the dnsmasq bug page.

The problem should be fixed from the network-manager side.

Otherwise, similar bug reports should be filed against all other DNS server packages, not just dnsmasq. But I really think that people do want their DNS servers to listen on "lo" by default. They wouldn't want to break that just to help the local resolver implementation.

Mathieu Trudel-Lapierre (cyphermox) wrote :

Listening on lo is fine; and blocking other DNS servers from being started isn't. I think we're in violent agreement there. The problem is how to fix this.

I'm not saying dnsmasq should be crippled, but that it should special-case lo and not just listen on 0.0.0.0, because binding the wildcard address claims port 53 on every address and leaves no room for other processes that might legitimately want to listen on port 53.

That's pretty much how the solution is shaping up: when listening on all interfaces, listen on each interface separately, binding to the IP address attached to the interface (or by any other means). We should then be able to have dnsmasq listen on 127.0.1.1:53 to satisfy the need for a local resolver.

Thomas Hood (jdthood) wrote :

@Alkis: Your title "Dont..." is not a description of a problem.

Alkis Georgopoulos (alkisg) wrote : Re: Local resolver prohibits DNS servers from running

@Thomas: cool, I hope this one's better.

summary: - Don't start local resolver if a DNS server is installed
+ Local resolver prohibits DNS servers from running
Thomas Hood (jdthood) wrote :

I just re-read the whole discussion and thought it would be useful (for me, at least) to summarize it.

The original bug report was that NM+dnsmasq and standalone dnsmasq are incompatible because they have overlapping network socket address ranges, 0.0.0.0:53 and 127.0.0.1:53.

One solution is for the administrator to comment out "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf.

Another solution is as described by the submitter's title: "[Hey NetworkManager,] Don't start local resolver if a DNS server is installed".

Another solution favored by Mathieu is for the NM-enslaved dnsmasq and the standalone dnsmasq to use disjoint network socket address ranges.
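
To make "overlapping" versus "disjoint" concrete, here is a simplified model of when two listeners clash. It is a sketch only: IPv4 `address:port` strings, no SO_REUSEADDR or SO_REUSEPORT, and the function name `listeners_overlap` is made up for illustration.

```shell
#!/bin/sh
# Simplified model: two listeners clash when they use the same port and
# either address is the wildcard 0.0.0.0, or both addresses are identical.
listeners_overlap() {
    addr_a=${1%:*}; port_a=${1##*:}
    addr_b=${2%:*}; port_b=${2##*:}
    [ "$port_a" = "$port_b" ] || return 1
    [ "$addr_a" = "0.0.0.0" ] && return 0
    [ "$addr_b" = "0.0.0.0" ] && return 0
    [ "$addr_a" = "$addr_b" ]
}
```

Under this model `listeners_overlap 0.0.0.0:53 127.0.0.1:53` succeeds (the reported clash), while `listeners_overlap 192.168.0.1:53 127.0.1.1:53` fails, which is the disjoint arrangement: one instance bound to a LAN address (192.168.0.1 is an example) and the other to a separate loopback address.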

Early on, Mathieu said that solving this problem would not be a top priority because not many users want to combine the DNS server role (running bind or dnsmasq) with the DNS client role (running NM+dnsmasq).

Alkis argued that the incompatibility is a serious bug that should be prevented using package dependencies or eliminated automatically by maintainer scripts or other means. The administrator shouldn't have to search the web to figure out how to make the dnsmasq package work. Troublesome is the fact that standalone dnsmasq sometimes works, sometimes doesn't, in the presence of NM+dnsmasq.

Along the way Alkis levelled some fundamental criticisms against the design of NM+dnsmasq.

I think that there is a clash of civilizations here: the Debian way (modular components that just work together in any combination allowed by package dependencies) versus the RedHat way (big daemons with limited options that own subsystems).

Thomas Hood (jdthood) wrote :

Alkis: Why do you need the dnsmasq package at all? You want NM and dnsmasq. Why not just use the NM-enslaved dnsmasq?

If the latter doesn't meet your needs, could it be adapted somehow to meet your needs?

Assuming that there are good reasons for using NM and standalone dnsmasq, I'd be inclined to agree with Alkis (if I understood him correctly) that a good solution would be to put the NM-dnsmasq integration stuff into a package and make this conflict with the standalone dnsmasq package.

Thomas Hood (jdthood) wrote :

Hmm, I wasn't very clear. What I meant in my questions above (#34) was this. If NM+dnsmasq is the best solution for name service for the local host, isn't it also a better solution than NM-together-with-standalone-dnsmasq for remote hosts? If so then another solution approach is to enhance NM so that its enslaved dnsmasq listens on non-loopback addresses too. Once this is implemented the network-manager package could be made to Conflict with the dnsmasq package.

Alkis Georgopoulos (alkisg) wrote :

Thomas, that was a very good summary at comment #33!

> Why do you need the dnsmasq package at all? You want NM and dnsmasq. Why not just use the NM-enslaved dnsmasq?

The NM-enslaved dnsmasq uses hardcoded options (in C) that provide extremely limited functionality.
 * It doesn't listen on ethX (--listen-address=127.0.0.1). So we can't use our servers as DNS servers for our local network PCs, i.e. it's completely useless for LANs.
 * It doesn't cache requests (--cache-size=0). No caching ==> no DNS queries speedup. This again is very significant for LANs as there are many concurrent users.
 * Finally, we also need the DHCP and TFTP functionality of dnsmasq, so even if NM+dnsmasq included a real DNS server, we'd have to run another dnsmasq instance (without a DNS service in that case) for its 2 other services.

> a good solution would be to put the NM-dnsmasq integration stuff into a package and make this conflict with the standalone dnsmasq package.

I completely agree, and to also conflict with bind9 and any other DNS server packages.

Revision history for this message
Thomas Hood (jdthood) wrote :

What lies behind the problem being discussed here is the simple fact that there exists no single adequate network configuration utility for GNU/Linux. I am most familiar with Debian. From Debian we inherit ifupdown which was designed for static configuration. Debian developers have known for more than ten years that ifupdown needed to be replaced, but have never managed to come up with a replacement. From RedHat we get NetworkManager which was never intended to be a general network configurer but in the absence of any alternative continues to be enhanced with new features. Considerable effort has obviously been spent in Ubuntu just to get NM to coexist with other networking packages. It still doesn't fully cooperate with them (see #47379 for another example) and will probably never be well integrated with them.

So we are still forced to choose between two network configuration approaches, NM-oriented in the desktop version and ifupdown-oriented in the server version. Each one has its limitations. If you try to combine the two, as you (Alkis) want to do, then you are confronted with these limitations. You are lucky that all you have to do is comment out one line in a configuration file to get things to work!

We can continue playing around with the existing tools so that they work better in particular use cases but what we really need is a properly designed network configuration utility to supersede both ifupdown and NM.

I am vaguely aware of the Wicd project. Must go read up on that.

Revision history for this message
Thomas Hood (jdthood) wrote :

* Some thinking about[0][1], if not much coding of[2], a successor to ifupdown was done in the netconf project[3] led by Debian Developer martin krafft[4][5].

[0]http://people.debian.org/~madduck/talks/netconf_fosdem_2007.02.25/slides.s5.html
[1]http://lists.alioth.debian.org/pipermail/netconf-devel/
[2]http://lists.alioth.debian.org/pipermail/netconf-commits/
[3]https://alioth.debian.org/projects/netconf/
[4]madduck AT debian.org
[5]http://people.debian.org/~madduck/

* One small step toward harmonizing desktop network configuration and server network configuration was taken with the introduction of resolvconf in both versions of 12.04. But there again, NM integrates bare-minimally with resolvconf; NM doesn't let resolvconf prioritize nameserver information according to interface-order(5) but sends resolvconf one big lump of nameserver information called "NetworkManager".

* If Ubuntu doesn't switch to wicd or netconf or something else then another possibility to be explored is to break up NM into components that can be better integrated with other parts of the distro. This is, of course, rather difficult without cooperation from upstream.

Revision history for this message
Thomas Hood (jdthood) wrote : Re: NM-controlled dnsmasq prevents other DNS servers from running

Based on comment #28, marked as affecting djbdns.

summary: - Local resolver prohibits DNS servers from running
+ NM-controlled dnsmasq prevents other DNS servers from running
Thomas Hood (jdthood)
summary: - NM-controlled dnsmasq prevents other DNS servers from running
+ NM-controlled dnsmasq prevents other DNS servers from running, yet
+ network-manager doesn't Conflict with their packages
Revision history for this message
Thomas Hood (jdthood) wrote : Re: NM-controlled dnsmasq prevents other DNS servers from running, yet network-manager doesn't Conflict with their packages

But enough dreaming. Given the world as it is, the immediate challenge is to make NM+dnsmasq compatible with standalone nameservers. (Otherwise network-manager should Conflict with those nameservers' packages.)

Solutions mentioned earlier:
* Tell the administrator to comment out "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf after installing dnsmasq or another DNS server package.
* Change NM so that it acts as if "dns=dnsmasq" is absent if a DNS server package is installed.
* Change standalone dnsmasq such that it doesn't listen on 0.0.0.0:53, doesn't listen on 127.0.1.1:53 and change NM so that its dnsmasq listens only on 127.0.1.1:53.

Here's a new idea.

* Enhance the resolver(3) so that nameservers can be specified in resolv.conf using the <address>:<port> notation.
* Change NM such that it causes its slave dnsmasq to listen on a port number P other than 53 and sends "nameserver 127.0.0.1:P" to resolvconf.

Thomas Hood (jdthood)
summary: - NM-controlled dnsmasq prevents other DNS servers from running, yet
- network-manager doesn't Conflict with their packages
+ NM-controlled dnsmasq prevents other DNS servers from starting
Thomas Hood (jdthood)
Changed in pdnsd (Ubuntu):
status: New → Invalid
Thomas Hood (jdthood)
Changed in pdns-recursor (Ubuntu):
status: New → Invalid
Thomas Hood (jdthood)
Changed in dnsmasq (Ubuntu):
status: Confirmed → Invalid
status: Invalid → Confirmed
Changed in pdns-recursor (Ubuntu Precise):
status: New → Invalid
Changed in pdnsd (Ubuntu Precise):
status: New → Invalid
Changed in network-manager (Ubuntu Precise):
status: New → Triaged
Changed in dnsmasq (Ubuntu Precise):
status: New → Confirmed
Changed in network-manager (Ubuntu Precise):
importance: Undecided → Low
Changed in network-manager (Ubuntu):
status: Triaged → Fix Released
Changed in djbdns (Ubuntu Precise):
status: New → Confirmed
Changed in djbdns (Ubuntu):
status: New → Confirmed
Changed in network-manager (Ubuntu Precise):
assignee: nobody → Mathieu Trudel-Lapierre (mathieu-tl)
Changed in dnsmasq (Ubuntu):
status: Confirmed → Fix Released
Changed in dnsmasq (Ubuntu Precise):
assignee: nobody → Mathieu Trudel-Lapierre (mathieu-tl)
importance: Undecided → High
status: Confirmed → Triaged
Changed in network-manager (Ubuntu Precise):
importance: Low → High
Revision history for this message
Robin Battey (zanfur) wrote :

> Are you sure? I am only aware of named.conf's "listen-on { IP_ADDRESS; }". If there is a feature such as you describe then presumably named binds ALL:53 and then filters according to the addresses on the specified interfaces.

Nope, I just verified, you're quite correct. I hadn't heard of it either, but upon (mis)reading comments above I presumed without verifying. Bad on me.

> A question about the NSS plugin idea. Will this work only for software that uses glibc? What about alternative resolver libraries?

Anything that uses the gethostbyname(3) call uses the NSS chain. That means essentially everything that isn't a resolver itself uses nsswitch.conf. DNS resolver libraries won't use NSS by design, because they are the resolvers themselves that are *used* by NSS. This is why there are no names in their respective configuration files, save for what they're serving (remote addresses are specified by address). If any DNS resolver itself reads nsswitch.conf, it's doing something Very Wrong.

The idea of NSS is that the DNS resolvers aren't *supposed* to use it. They are the exporters of NSS services, not the consumers. I don't know of any of them that use NSS for their own resolution; they are just one link in the NSS chain that is used by the (libc) name resolver libraries. When you hit the DNS service itself, you really *don't* want it to start the NSS chain over, because that would just lead to a loop.

My proposal for using NSS in place of NetworkManager's dnsmasq is to create a new NSS plugin and place it earlier in the NSS chain than the standard DNS resolver. For instance, a line like so:

  hosts: files mdns4_minimal [NOTFOUND=return] network_manager [NOTFOUND=return] dns mdns4

This is straight from my Precise install, with the addition of the "network_manager [NOTFOUND=return]" stanza. It says that first you check /etc/hosts (that's "files"), then a subset of avahi ("mdns4_minimal [NOTFOUND=return]"), then your NM plugin ("network_manager [NOTFOUND=return]"), then plain old DNS ("dns"), then avahi again ("mdns4").

It would not conflict with any other NSS plugin, because they are all tried in turn until a match is found. If you place it directly in front of the DNS resolver plugin in nsswitch.conf, it will be used before the standard DNS lookup, allowing you to do all the fancy connection-specific magic you need to do, while returning "Try Next" for anything non-connection specific, thus allowing the normal DNS resolver plugin (which reads resolv.conf) to do things as normal. This is *instead* of hooking in at resolv.conf, as you do now. People can install any resolver they want, and it works as designed. This lets you listen on high-numbered ports as well, *and* lets you have per-user dnsmasq instances (per user vpns?), while still running Bind or a normal dnsmasq instance on *:53.
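
To make the lookup order in the "hosts:" line above concrete, here is a minimal Python sketch of how such a chain is walked. It is a toy model, not glibc code: the module functions, the host names, the addresses and the "network_manager" plugin itself are all hypothetical, and the parser handles only single "[STATUS=action]" items as used in that line. One thing the model makes visible: a plugin has to report UNAVAIL (rather than NOTFOUND) for names it does not handle, because "[NOTFOUND=return]" stops the lookup before "dns" is tried.

```python
# Toy model of walking an nsswitch.conf "hosts:" chain (not glibc code).
DEFAULT_ACTIONS = {"SUCCESS": "return", "NOTFOUND": "continue",
                   "UNAVAIL": "continue", "TRYAGAIN": "continue"}

def parse_hosts_line(line):
    """Return [(service, {STATUS: action})] for a 'hosts:' line."""
    _, _, spec = line.partition(":")
    chain = []
    for token in spec.split():
        if token.startswith("[") and token.endswith("]"):
            status, _, action = token[1:-1].partition("=")
            # An action item modifies the service named just before it.
            chain[-1][1][status.upper()] = action.lower()
        else:
            chain.append((token, dict(DEFAULT_ACTIONS)))
    return chain

def lookup(chain, modules, name):
    """Try each service in order; return (service, address) or (None, None)."""
    for service, actions in chain:
        status, address = modules[service](name)
        if actions[status] == "return":
            return (service, address) if status == "SUCCESS" else (None, None)
    return (None, None)

# Hypothetical stand-ins for the real plugins.
def files(name):
    return ("SUCCESS", "127.0.0.1") if name == "localhost" else ("NOTFOUND", None)

def mdns_minimal(name):
    # Only claims authority over .local names.
    return ("NOTFOUND", None) if name.endswith(".local") else ("UNAVAIL", None)

def network_manager(name):
    # Answers only names belonging to a made-up VPN connection.
    if name.endswith(".vpn.example"):
        return ("SUCCESS", "10.8.0.5")
    return ("UNAVAIL", None)

def dns(name):
    return ("SUCCESS", "192.0.2.10")

def mdns(name):
    return ("NOTFOUND", None)

HOSTS = ("hosts: files mdns4_minimal [NOTFOUND=return] "
         "network_manager [NOTFOUND=return] dns mdns4")
CHAIN = parse_hosts_line(HOSTS)
MODULES = {"files": files, "mdns4_minimal": mdns_minimal,
           "network_manager": network_manager, "dns": dns, "mdns4": mdns}

print(lookup(CHAIN, MODULES, "intranet.vpn.example"))  # answered by the NM plugin
print(lookup(CHAIN, MODULES, "www.example.org"))       # falls through to dns
print(lookup(CHAIN, MODULES, "printer.local"))         # stopped by mdns4_minimal
```

In this model a standalone nameserver on port 53 is untouched: names the hypothetical plugin does not claim simply fall through to the ordinary "dns" entry.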

Right now, the dnsmasq for NM basically hijacks resolv.conf, which means it's hooking into the DNS NSS plugin's resolution (it's the plugin that reads resolv.conf, not the applications, using code in libc). This is causing conflicts, because in order to use resolv.conf, you need to be running on port 53 -- and it would take re-writing ...


Revision history for this message
Svartalf (frank-earlconsult) wrote :

This is a bad idea as it's been implemented, guys: there's tons of local installations that use internal DNS (my CenturyLink router or my day job's setup, for example) that this flatly breaks out of the box. You've got to do a bunch of manual interventions for MANY corporate desktop and home desktop situations. It doesn't honor lookups against the local, specified by DHCP, DNS servers; it goes out to the DNS roots and goes from there. Works FINE for JUST surfing the 'net. It's an EPIC FAIL for normal, typical DNS use right now because it doesn't honor any internal-only DNS entries as it is out of the box.

It's nice that you're trying to make it easier for VPN, etc. but in the corporate desktop story, you're using OpenVPN, PPTP, or something like Sonicwall's solution. This means it's going to re-direct DNS on you ANYHOW, defeating the nice thing you're attempting here. If you think you're changing their minds, think again.

As it stands, I'm going off to cripple this less-than-well-thought-out design decision so that things MIGHT work better on my setups. I suggest thinking through *ALL* prospective use cases before implementing something like this in the future. It really, really ticks people off when it doesn't work like it's supposed to.

Revision history for this message
Thomas Hood (jdthood) wrote :

@Svartalf: Can you please describe in more technical detail what fails to work on the machines in question, and share with us what you know about the causes of these malfunctions? Once we have some idea what you're talking about we can help you further.

You wrote:
> there's tons of local installations that use internal DNS

What do you mean by "internal DNS"?

> It doesn't honor lookups against the local, specified by DHCP, DNS servers [...]

Ubuntu 12.04 *does* use DNS nameserver addresses provided by DHCP. Can you please explain what you are talking about here?

> OpenVPN, PPTP, or something like Sonicwall's solution [is] going to re-direct DNS on you ANYHOW
> If you think you're changing their minds, think again.

Ubuntu software works properly in Ubuntu 12.04 (except where it doesn't --- see the BTS). Third party software may fail to work properly, but it's up to the third party to fix that.

Third parties who think they can dictate how free host operating systems work can go fly a kite. Just my personal view.

Revision history for this message
John Hupp (john.hupp) wrote :

I don't know how my case enters this discussion, but it is certainly connected to the current default installation wherein network-manager starts an instance of dnsmasq to act as a DHCP, DNS and TFTP server.

I was troubleshooting an LTSP-PNP client boot problem under Lubuntu Quantal. I installed with a single NIC per https://help.ubuntu.com/community/UbuntuLTSP/ltsp-pnp.

The problem is that the LTSP client, after successfully getting DHCP assignments, fails to download the pxelinux boot image. It reports "PXE-E32: TFTP open timeout."

To be more specific on the DHCP assignments, it identifies my hardware router as the DHCP server and the default gateway. It identifies the LTSP server as proxy and boot server.

I can also run this on the server itself to get a similar failure:
$ cd /tmp
$ tftp 192.168.1.102 -v -m binary -c get /ltsp/i386/pxelinux.0
mode set to octet
Connected to 192.168.1.102 (192.168.1.102), port 69
getting from 192.168.1.102:/var/lib/tftpboot/ltsp/i386/pxelinux.0 to pxelinux.0 [octet]
Transfer timed out.

A CRITICAL NOTE: This is using the default network-manager configuration of the network interface (using the default DHCP configuration, and the connection is "Available to all users").

If I merely configure the network interface (again for DHCP) via /etc/network/interfaces, the TFTP error disappears and the LTSP client boots.

But it introduces a new problem on both server and client: DNS resolution fails.

I can fix the DNS resolution problem by creating /etc/resolvconf/resolv.conf.d/tail with contents:
nameserver (my nameserver 1)
nameserver (my nameserver 2)

But trying to identify and perhaps work around the problem with network-manager and dnsmasq, I undid the changes to /etc/network/interfaces and deleted /etc/resolvconf/resolv.conf.d/tail.

It turns out that if I merely
$ sudo service dnsmasq restart
then the LTSP client will boot normally.

Hunting for some diagnostic information, I ran this command before and after restarting dnsmasq:
$ sudo netstat -nap | grep dnsmasq

Relevant output before restarting:
udp 0 0 127.0.0.1:69 0.0.0.0:* 887/dnsmasq

After restarting:
udp 0 0 127.0.0.1:69 0.0.0.0:* 1967/dnsmasq
udp 0 0 192.168.1.102:69 0.0.0.0:* 1967/dnsmasq
(where 192.168.1.102 is the server IP)

So dnsmasq is not binding to my server IP during boot.

If I remove /etc/dnsmasq.d/network-manager (which issues the sole dnsmasq directive to bind all the interfaces instead of listening on 0.0.0.0) and restart the server it allows the client to boot normally.

Revision history for this message
Thomas Hood (jdthood) wrote :

> the current default installation wherein network-manager starts
> an instance of dnsmasq to act as a DHCP, DNS and TFTP server.

NetworkManager starts an instance of dnsmasq to act only as a non-caching DNS nameserver forwarder. This instance listens only on the loopback address 127.0.1.1.

If your client is DHCPing with a dnsmasq instance on an Ubuntu server then that dnsmasq instance is most probably a "standalone" instance, configured by means of files included in the "dnsmasq" package (not to be confused with the "dnsmasq-base" package which contains little more than the dnsmasq binary and which both the dnsmasq package and the network-manager package depend on) and started by an initscript, not by NetworkManager.

In reading further into your text my understanding is hampered by the fact that I am not entirely sure which machine you are referring to at different points in your text.

> The problem is that the LTSP client, after successfully getting
> DHCP assignments, fails to download the pxelinux boot image.
> It reports "PXE-E32: TFTP open timeout."
> To be more specific on the DHCP assignments, it identifies
> my hardware router as the DHCP server and the default gateway.
> It identifies the LTSP server as proxy and boot server.

Is your LTSP server running Ubuntu and standalone dnsmasq? Then shouldn't the client use your LTSP server as the DHCP server?

> So dnsmasq is not binding to my server IP during boot.
> If I remove /etc/dnsmasq.d/network-manager
> (which issues the sole dnsmasq directive to bind all the
> interfaces instead of listening on 0.0.0.0) and restart the
> server it allows the client to boot normally.

I think I know what is happening. The network-manager package causes (by means of the /etc/dnsmasq.d/network-manager file) the standalone dnsmasq to start in "bind-interfaces" mode. In that mode dnsmasq doesn't listen on the wildcard IP address; it only listens on the addresses assigned to interfaces that are up when it (dnsmasq) starts. At boot, dnsmasq starts before the external interface is configured via DHCP, so dnsmasq doesn't listen on the external interface. If dnsmasq is restarted after the external interface is configured then dnsmasq listens on that interface too.

If you remove /etc/dnsmasq.d/network-manager then standalone dnsmasq listens on the wildcard address when it starts and all is well except that now standalone dnsmasq conflicts with the NetworkManager-controlled dnsmasq instance. To fix this you have to disable the latter. Edit /etc/NetworkManager/NetworkManager.conf and comment out the line "dns=dnsmasq": put a '#' at the beginning of the line.

In the future we hope that standalone dnsmasq running in bind-interfaces mode will be enhanced such that it listens on interfaces that are brought up after it (dnsmasq) starts. The author of dnsmasq, Simon Kelley, has already implemented this enhancement experimentally. Once that work is done it will be possible to run dnsmasq in bind-interfaces mode without causing the problem that you ran into.

Revision history for this message
John Hupp (john.hupp) wrote :

RE Thomas Hood's #120: That is very interesting, though I admit it is near the outer limits of my current understanding.

To address the only questions above:

>> The problem is that the LTSP client, after successfully getting
>> DHCP assignments, fails to download the pxelinux boot image.
>> It reports "PXE-E32: TFTP open timeout."
>> To be more specific on the DHCP assignments, it identifies
>> my hardware router as the DHCP server and the default gateway.
>> It identifies the LTSP server as proxy and boot server.

> Is your LTSP server running Ubuntu and standalone dnsmasq? Then shouldn't the client use your LTSP server as the DHCP server?

The LTSP server is running Lubuntu with the default network configuration, whatever that is. I understand you to be saying that this would be a standalone instance of dnsmasq started by an initscript, prepared to handle DHCP and TFTP. And apart from that, network-manager starts another instance of dnsmasq to handle DNS.
Regarding whether the client should use the LTSP server as the DHCP server: I imagine that it is prepared to handle DHCP, and probably does in a standard LTSP setup with two NICs and the client connected to the second NIC, but in this LTSP-PNP setup with a single NIC, the client is connected to the router, and the LTSP server defers to the router handling DHCP.

------------------

Your explanation is very interesting because it explains why my blindly-applied work-around is effective. (And kudos to Simon Kelley who is working to make it possible for everything to work as configured right out of the box.)

But I don't understand what you said about standalone dnsmasq conflicting with network-manager's instance of dnsmasq when /etc/dnsmasq.d/network-manager is removed.

Apart from not understanding how the conflict arises, I wonder: Should this conflict be manifesting itself somehow? Everything seems to be working right now.

And would disabling network-manager's DNS-handling instance of dnsmasq then result in the need to set up an alternative DNS handler?

I'm willing to apply another solution blindly, as I did in removing /etc/dnsmasq.d/network-manager, but it would be nice to understand more about it.

Revision history for this message
Thomas Hood (jdthood) wrote :

> the LTSP server defers to the router handling DHCP.

OK, I get it.

> I don't understand what you said about standalone dnsmasq
> conflicting with network-manager's instance of dnsmasq
> when /etc/dnsmasq.d/network-manager is removed.

When /etc/dnsmasq.d/network-manager is present, standalone dnsmasq starts in bind-interfaces mode and only listens on the addresses assigned to configured network interfaces. This does not include 127.0.1.1, since 127.0.1.1 is not the address of any configured interface. So in this mode standalone dnsmasq does not conflict with NM-dnsmasq which listens on 127.0.1.1. (At most one process can listen on any given address:port combination.)

Remove that file and standalone dnsmasq starts in a mode where it tries to listen at all addresses. But it can't do this if NM-dnsmasq is already listening at some address.
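
The address:port rule described above can be tried out with a short Python sketch. Assumptions: it runs on Linux (where all of 127.0.0.0/8 is reachable on the loopback interface), it uses UDP sockets with default socket options, and it uses a free unprivileged port picked by the kernel instead of port 53 so that it needs no root rights. The sockets are only stand-ins for the two dnsmasq instances; a real daemon may set socket options that change the outcome.

```python
import errno
import socket

def try_bind(address, port):
    """Bind a UDP socket; return (socket, None) on success or (None, errno) on failure."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((address, port))
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno

# Stand-in for NM-dnsmasq: claim 127.0.1.1 on a free port picked by the kernel.
nm_sock, _ = try_bind("127.0.1.1", 0)
port = nm_sock.getsockname()[1]

# Stand-in for standalone dnsmasq without bind-interfaces: wildcard bind.
wild_sock, wild_err = try_bind("0.0.0.0", port)

# Stand-in for bind-interfaces mode: bind one specific, different address.
iface_sock, iface_err = try_bind("127.0.0.1", port)

print("wildcard bind refused with EADDRINUSE:", wild_err == errno.EADDRINUSE)
print("specific-address bind succeeded:", iface_sock is not None)

for s in (nm_sock, wild_sock, iface_sock):
    if s is not None:
        s.close()
```

The wildcard bind is refused because it would overlap the address already taken, while a bind to a different specific address on the same port goes through. That is the difference between the two standalone modes.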

> Should this conflict be manifesting itself somehow?
> Everything seems to be working right now.

Well, I am not sure which workaround, if any, you are currently relying on.

If you commented out "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf then there is no conflict because NM doesn't start the NM-dnsmasq process.

> And would disabling network-manager's DNS-handling
> instance of dnsmasq then result in the need to set up
> an alternative DNS handler?

No. If NM-dnsmasq is enabled then resolv.conf contains "nameserver 127.0.1.1" so that applications using the resolver library access NM-dnsmasq; NM-dnsmasq forwards queries to the upstream nameserver at the address A.A.A.A which was obtained via DHCP or otherwise. If NM-dnsmasq is disabled then resolv.conf simply contains "nameserver A.A.A.A".
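
As a rough way to tell the two cases apart, the following Python sketch reads the "nameserver" lines out of resolv.conf text and reports whether the first one is 127.0.1.1. The function names and the sample contents are made up for illustration; 192.0.2.53 stands in for the upstream address A.A.A.A.

```python
def nameservers(resolv_conf_text):
    """Return the addresses from 'nameserver' lines, in file order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(("#", ";")):
            continue  # comment line
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers

def uses_nm_dnsmasq(resolv_conf_text):
    """True if the first nameserver is the NM-dnsmasq address 127.0.1.1."""
    servers = nameservers(resolv_conf_text)
    return bool(servers) and servers[0] == "127.0.1.1"

# Sample contents (hypothetical): NM-dnsmasq enabled vs. disabled.
WITH_NM_DNSMASQ = "# generated file\nnameserver 127.0.1.1\n"
WITHOUT_NM_DNSMASQ = "# generated file\nnameserver 192.0.2.53\n"

print(uses_nm_dnsmasq(WITH_NM_DNSMASQ))
print(uses_nm_dnsmasq(WITHOUT_NM_DNSMASQ))
```

On a real machine the text would come from reading /etc/resolv.conf.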

> I'm willing to apply another solution blindly, as I did
> in removing /etc/dnsmasq.d/network-manager,
> but it would be nice to understand more about it.

If you are running Ubuntu 12.04 then the best solution for now is to
* comment out the "bind-interfaces" line in /etc/dnsmasq.d/network-manager;
* comment out the "dns=dnsmasq" line in /etc/NetworkManager/NetworkManager.conf.

If you are running Ubuntu 12.10 and have dnsmasq version 2.63-1ubuntu1 then you can, instead,
* replace the "bind-interfaces" line in /etc/dnsmasq.d/network-manager with a "bind-dynamic" line.

The "bind-dynamic" mode is the new mode that I referred to above and which Simon referred to earlier in comment #94. Please test it! If it works well then it should become the default, as mentioned above in comments #99 and #102.

Revision history for this message
John Hupp (john.hupp) wrote :

Thanks for the explanation of how removal of /etc/dnsmasq.d/network-manager sets up a conflict between standalone dnsmasq and NM-dnsmasq. (But also see my surprising observation below.)

>> Should this conflict be manifesting itself somehow?
>> Everything seems to be working right now.

> Well, I am not sure which workaround, if any, you are currently relying on.

> If you commented out "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf then there is no conflict because NM doesn't start the NM-dnsmasq process.

My workaround was simply to remove /etc/dnsmasq.d/network-manager. Everything seemed to work after that.

I did not comment out "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf, which apparently should have subjected the system to the conflict described above. But as I say, I saw no problems. Could this be due to some compensation made by the new dnsmasq, since I am in fact running v2.63?

Thanks also for the explanation of how disabling NM-dnsmasq does not break DNS.

Since I have dnsmasq v2.63, I tried the experimental solution: I restored /etc/dnsmasq.d/network-manager and replaced the "bind-interfaces" line with a "bind-dynamic" line. As far as I can tell, everything works.

Thank you!

Revision history for this message
Thomas Hood (jdthood) wrote :

Question: Why did everything work on your machine when standalone dnsmasq wasn't in bind-interfaces mode but /etc/NM/NM.conf contained "dns=dnsmasq"?

Hypothesis: Standalone dnsmasq started first; network-manager second. NM tried to start NM-dnsmasq but this failed because of the address conflict and NM fell back to non-dnsmasq mode, which works fine. If this hypothesis is correct then there may be lines in the syslog that look like this:

   [date] [hostname] NetworkManager[pid]: <info> DNS: starting dnsmasq...
   [date] [hostname] dnsmasq[pid]: failed to create listening socket for 127.0.1.1: Address already in use
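
A quick way to check the hypothesis is to search the syslog for that dnsmasq message. The Python sketch below does this for log text passed in as a string. The regular expression is an assumption based only on the message wording shown above, and the sample lines (date, hostname, process IDs) are invented for illustration.

```python
import re

# Pattern built from the message wording quoted above; treat it as a guess.
CONFLICT = re.compile(
    r"dnsmasq\[\d+\]: failed to create listening socket for (\S+): "
    r"Address already in use")

def conflicting_addresses(log_text):
    """Return the addresses that dnsmasq reported it could not listen on."""
    return CONFLICT.findall(log_text)

# Invented sample log lines.
SAMPLE_LOG = (
    "Oct  1 12:00:00 myhost NetworkManager[901]: <info> DNS: starting dnsmasq...\n"
    "Oct  1 12:00:00 myhost dnsmasq[1234]: failed to create listening socket "
    "for 127.0.1.1: Address already in use\n"
)

print(conflicting_addresses(SAMPLE_LOG))
```

An empty result for the real syslog would count against the hypothesis, although it would not rule out that the message was rotated away or logged elsewhere.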

Revision history for this message
John Hupp (john.hupp) wrote :

I thought I was done with this kind of issue, but I may be back for more.

It turns out that the only LTSP client that boots normally is the one that I was doing all of the above troubleshooting on. Others that I have tried in my little 2-PC setup all stop at a blank/black screen after successfully getting to the Lubuntu splash screen.

I have now set up forwarding of the client syslog messages to the server, and the log always ends with a string of ntpd items, the last of which is "ntpd[1314]: Listening on routing socket on fd #24 for interface updates"

I found this other Ubuntu Precise bug (#999725) https://bugs.launchpad.net/ubuntu/+source/ntp/+bug/999725 which reports that ntp is being started before DNS resolution is available. A quick scan of the initial comments shows that the discussion revolves around network-manager's handling of network configuration. The bug is currently marked Expired due to inactivity.

Bug #999725 seems to involve some of the same issues as the ones dealt with here.

Comments? Troubleshooting? Workarounds?

Revision history for this message
Thomas Hood (jdthood) wrote :

That the last syslog entries are made by ntpd doesn't necessarily mean that the machine is hanging because of ntpd. It could be hanging at the next step, for example.

Bug #999725 reports that ntp doesn't work properly when it is started before NIS, which is not to be confused with DNS. Probably not related.

Unfortunately I don't have any idea why the second client hangs whereas the first one doesn't.

Revision history for this message
John Hupp (john.hupp) wrote :

Agreed. And I had hoped that I could eliminate ntpd as the source of the problem by using a simple switch in the LTSP configuration to turn it off for the client. Unfortunately that does not seem to be effective in disabling ntpd. Troubleshooting that elsewhere .....

Revision history for this message
Thomas Hood (jdthood) wrote :

Belated reply to Robin Battey's #116.

My question in #115 was about alternative resolver libraries, not about DNS resolver libraries. There are libraries that play the same role as the whole glibc resolver. Generally these alternative resolver libraries include DNS resolvers and read /etc/resolv.conf for compatibility with the glibc resolver but I'd like to know whether or not, or to what extent, they also obey /etc/nsswitch.conf.

I believe I understand your basic idea well enough. Instead of using resolv.conf to direct name queries to nm-dnsmasq, use a new NSS module. This new NSS module, foo, would be like the existing dns "module" except that it would only talk to nm-dnsmasq, or would allow other ports than 53 to be specified so that nm-dnsmasq could be talked to over another port than 53. The new module would be named on the "hosts:" line in /etc/nsswitch.conf instead of "dns". (I don't see the point of listing both foo and dns, since foo *is* DNS.)

But how much less work would this be than adapting the glibc code so that ports other than 53 can be specified, e.g., via a new config file with enhanced semantics that (if present) overrides resolv.conf? And how much less is the risk of breaking software that uses alternative resolver libraries?

Revision history for this message
Robin Battey (zanfur) wrote :

You've got the basic idea. The nsswitch.conf file is where Name Service services are configured, and "hosts" is one of them. DNS is *one* way to look up hosts, but so is "files" (/etc/hosts) and "mdns4" (avahi). Anything that extends how names are translated to addresses should, imnho, be done through NSS. This is because *everything* supports NSS. For instance, NIS and NIS+ hosts are done through NSS, and this is supported by essentially everything, because it's the standard. All of the "enterprise" directory services I've come across use an NSS plugin (usually the "nis" one). It's just simply the right way to do it.

I wouldn't worry about resolver libraries that don't use glibc. They're typically DNS-specific, and are typically configured by their own files anyway. Dig, for instance, will use whatever server you tell it to, and ignore resolv.conf (though it uses it as a default). Same goes for the "host" tool -- they're used for querying specific DNS servers. However, those resolvers *also* ignore /etc/hosts, because that's referenced by the "files" NSS plugin. Any service that uses gethostbyname(3) is using glibc, and that's going to be everything except edge cases that are intentionally doing their own thing anyway. Things that try to emulate glibc behavior by first checking /etc/hosts and then /etc/resolv.conf are simply doing it wrong, and will miss (for instance) avahi, NIS, and any other directory service that may be installed.

I'm surprised at the idea that it will be less work to modify glibc. Even if it's technically easier to make a change in the glibc code than to create your own NSS plugin, you have a myriad of problems: NM functionality would now have a dependency on a nonstandard patch of glibc, the documentation for /etc/resolv.conf will be inconsistent for only this distribution, there could (will) be resistance by the glibc folks, who knows what you'll break when you alter how glibc behaves, etc.

However, with an NSS module, we have a huge number of advantages:

  * It's the standard way of achieving this type of thing and is hence supported by most everything
  * It's the standard way of achieving this type of thing so it's very well documented
  * It's the standard way of achieving this type of thing so it's very modularized and isolated, and if NM stops working it will continue along the chain without screwing up plugins further up (unlike when dnsmasq dies with the proposed glibc change)
  * It's the standard way of achieving this type of thing so the things that don't support it are, in general, doing it wrong and that's a bug on their end
  * It's the standard way of achieving this type of thing so there's already a package (libnss-mdns) that adds a hosts NSS module, meaning both that we know it works and that it is "officially supported by ubuntu"
  * It could be owned by the NM project instead of creating a dependency on a glibc patch that would not be taken up by distributions very quickly
  * You could make it do other interesting things like have static /etc/hosts-like entries per connection.

You get the idea. If you want to see an example of an NSS hosts plugin packaged for ubuntu, that ...



Revision history for this message
Thomas Hood (jdthood) wrote :

You may be right that developing a new "nm-dns" module would be easier than trying to enhance the existing dns module to support nonstandard ports.

But the more immediately relevant comparison is the comparison between the current solution and any solution involving a new or an enhanced NSS module. The current solution is to run nm-dnsmasq at 127.0.1.1:53. This solution has already been rolled out and seems to be working well. (To my own surprise I haven't seen any complaints related to the switch from 127.0.0.1 to 127.0.1.1, even though I have been following AskUbuntu and ubuntuforums.) Any alternative has to offer significant benefits if it's going to be considered for adoption, considering the amount of work and the risk involved. What benefits would the nm-dns module or the enhanced dns module give us relative to what we have now? One is: the ability to run nm-dnsmasq on another port, freeing up port 53 for BIND named listening on ALL:53. What else? Would the NSS-module approach make it easier to implement per-user caches, for example? (I see that Solaris provides per-user instances of nscd for this purpose.)

Robin, please submit a version of your comment #129 as a new bug report against network-manager, requesting that the connection to nm-dnsmasq be implemented by means of a new NSS module. Give your arguments in favor. Then we can continue the discussion in an open bug report rather than in this fix-released one.

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

> To my own surprise I haven't seen any complaints related to the switch from 127.0.0.1 to 127.0.1.1, even though I have been following AskUbuntu and ubuntuforums.

It's possible that a large portion of the Ubuntu users who run dnsmasq as a DNS server only use LTS releases, so complaints might only show up after 2 years.
E.g. in 300 schools here we settled on disabling the nm-spawned dnsmasq in NetworkManager.conf, and haven't seen the implemented solution yet.
Btw please don't backport the current solution to Precise, the "bind-interfaces" part will break all those existing setups.
The nss-based solution does sound like it wouldn't cause any problems at all, though.

Revision history for this message
Thomas Hood (jdthood) wrote :

> Btw please don't backport the current solution to Precise

In comment #110 MTL said that backporting the fix to Precise *is* planned.

Quantal includes dnsmasq 2.63 which has the new "bind-dynamic" option. In bind-dynamic mode dnsmasq works as it does in bind-interfaces mode but also updates its list of listen addresses whenever network interfaces are configured and deconfigured. It appears to work well. In bind-dynamic mode, as in bind-interfaces mode, standalone dnsmasq is compatible with NM-dnsmasq listening at 127.0.1.1. I would suggest therefore that if the switch from 127.0.0.1 to 127.0.1.1 for NM-dnsmasq is backported to Precise then dnsmasq 2.63 should simultaneously be backported to Precise and dnsmasq should be forced into bind-dynamic mode rather than into bind-interfaces mode.

Revision history for this message
Thomas Hood (jdthood) wrote :

I wrote in comment #131:
> What benefits would the nm-dns module or the enhanced
> dns module give us relative to what we have now? One is:
> the ability to run nm-dnsmasq on another port, freeing up
> port 53 for BIND named listening on ALL:53. What else?

I just installed bind9 and was surprised to see that in its default configuration named behaves just like dnsmasq in bind-dynamic mode. That is, it listens on port 53 at all addresses assigned to interfaces. When interfaces are created or configured, named starts listening on those as well. With this behavior, it shouldn't often (ever?) be necessary to configure named to listen on the wildcard address.

Is there any nameserver out there that does still conflict with nm-dnsmasq listening at 127.0.1.1:53?
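The kind of conflict being asked about can be reproduced without any DNS software. Below is a minimal Python sketch, assuming Linux (where the whole 127.0.0.0/8 range is local to the loopback interface) and using an ephemeral UDP port instead of 53 so that no root privileges are needed. `can_coexist` is a hypothetical helper name for illustration, not part of any package discussed in this report.

```python
import errno
import socket


def can_coexist(first_addr, second_addr, sock_type=socket.SOCK_DGRAM):
    """Bind first_addr on an ephemeral port, then try second_addr on that same port.

    Returns True if both binds succeed, False if the second one fails
    with EADDRINUSE. Neither socket sets SO_REUSEADDR.
    """
    first = socket.socket(socket.AF_INET, sock_type)
    second = socket.socket(socket.AF_INET, sock_type)
    try:
        first.bind((first_addr, 0))          # port 0: let the kernel pick one
        port = first.getsockname()[1]
        try:
            second.bind((second_addr, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                return False
            raise
        return True
    finally:
        first.close()
        second.close()
```

On Linux, `can_coexist("127.0.1.1", "0.0.0.0")` is expected to return False, because a wildcard bind collides with the listener already holding 127.0.1.1. `can_coexist("127.0.1.1", "127.0.0.1")` should normally return True, which is why a server that binds individual addresses can share a port with a listener on 127.0.1.1.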

Revision history for this message
Thomas Hood (jdthood) wrote :

The O'Reilly book _DNS and BIND_ says:

[QUOTE]
10.4.3.2 Interface interval

We've said already that BIND, by default, listens on all of a host's network interfaces. BIND 8 is actually smart enough to notice when a network interface on the host it's running on comes up or goes down. To do this, it periodically scans the host's network interfaces. This happens once each interface interval, which is 60 minutes by default. If you know the host your name server runs on has no dynamic network interfaces, you can disable scanning for new interfaces by setting the interface interval to zero to avoid unnecessary hourly overhead:

options {
    interface-interval 0;
};

On the other hand, if your host brings up or tears down network interfaces more often than every hour, you may want to reduce the interval.

[/QUOTE]

But when I tried it, named noticed right away that I had brought up an interface. Will investigate further.

Revision history for this message
Robin Battey (zanfur) wrote :

In response to #131 and #134 by Thomas:

I would argue that "will it conflict with anything that exists?" is the wrong question here. Certainly it will conflict in the future, and removing the user's ability to run a DNS service on the wildcard address is suboptimal at best, even if they don't *need* to. To directly answer the question about something that conflicts: the internal resolver of the samba4 packages. They're beta right now, but the scheduled release date is December, and there's no parameter (yet) for altering the port or interfaces. This is actually the one that bit me originally.

To answer "what does it give us?": currently NM invokes a single dnsmasq instance that must be shared between all users. This isn't ideal, because NM connections can be per-user, and this could lead to information disclosure at worst and oddly rearranged DNS resolve orders at best. With an NSS module, you could spin up one dnsmasq instance for the system on a possibly privileged port (but not 53) and one per user (above 1024), and link them together as forwarders so that only the user owning the connection will use the resolution they've specified in the GUI. It would require some tracking of which user's instance is on which port, auto-invoking the instances when necessary, and shutting them down when the user logs out, but it would allow for much more flexible and clean separation of user settings.
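The bookkeeping part of that idea (which user's instance is on which port) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `ResolverRegistry` and the port range 5300 to 5399 are hypothetical, and the sketch does not start or stop any dnsmasq process.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ResolverRegistry:
    """Tracks which unprivileged port each user's resolver instance would use."""

    first_port: int = 5300  # assumption: any free range above 1024 would do
    last_port: int = 5399
    _by_uid: Dict[int, int] = field(default_factory=dict)

    def port_for(self, uid: int) -> int:
        """Return the user's port, allocating the lowest free one on first use."""
        if uid in self._by_uid:
            return self._by_uid[uid]
        used = set(self._by_uid.values())
        for port in range(self.first_port, self.last_port + 1):
            if port not in used:
                self._by_uid[uid] = port
                return port
        raise RuntimeError("no free resolver ports left")

    def release(self, uid: int) -> Optional[int]:
        """Forget the user's port at logout; returns it, or None if unknown."""
        return self._by_uid.pop(uid, None)
```

A real implementation would also have to launch the per-user instance on first lookup and tear it down in `release`, which is where the changes in NM core would come in.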

For the record, I am happy to write the NSS plugin myself, but it would require some changes in NM core itself, so I would have to work with someone on the NM team to implement it. If you're interested, and know who that would be, please do let me know.

I will also create a new bug report as requested.

Revision history for this message
Thomas Hood (jdthood) wrote :

> something that conflicts: the internal resolver of the samba4 packages

Please file another report against samba4 describing the conflict with nm-dnsmasq.

Revision history for this message
Robin Battey (zanfur) wrote :

I would if I considered it a bug. (I didn't fully describe the current state of samba4, because I figured it was irrelevant: You can alter the interfaces it binds to, but not for *only* the dns resolver -- so currently, if you want samba4 listening on the wildcard address you'll need the dns resolver listening there too.) It would be a nice feature, sure. But nm-dnsmasq is the one breaking away from standards in ways that will break other packages, so I'm reporting the conflict here.

Btw, named immediately notices because of the /etc/network/if-{up,down}.d/bind9 scripts that trigger "rndc reconfig" when an interface goes up or down.

Revision history for this message
Thomas Hood (jdthood) wrote :

If "libnss-nm-dns" would make it easier to introduce per-user caching and/or if it improved security then those would be important benefits.

Currently nm-dnsmasq has caching disabled because of concerns about cache poisoning and information leakage.

    https://blueprints.launchpad.net/ubuntu/+spec/foundations-p-dns-resolving

If there have already been discussions of per-user caching in Ubuntu then someone please give me the link.

The only approach that I have seen so far is per-user nscd in Solaris and (I now see) FreeBSD.

    http://docs.oracle.com/cd/E19963-01/html/821-1462/nscd-1m.html
    http://www.unix.com/man-page/freebsd/8/NSCD

Revision history for this message
Thomas Hood (jdthood) wrote :

> Btw, named immediately notices because of the
> /etc/network/if-{up,down}.d/bind9 scripts that trigger
> "rndc reconfig" when an interface goes up or down.

Ah, yes. There is also a hook at /etc/ppp/ip-{up,down}.d/bind9.

But named also notices immediately when I bring up an interface with NetworkManager. Any idea what the mechanism is there?

When I bring down an interface with NetworkManager, named does *not* notice this right away.

Revision history for this message
Thomas Hood (jdthood) wrote :

Whoa. When an interface is brought up with NM the scripts in /etc/network/if-up.d/ somehow get run (how?) but when an interface is downed with NM, the scripts in /etc/network/if-down.d/ don't get run (inconsistent!).

Revision history for this message
Thomas Hood (jdthood) wrote :

Aha. /etc/NetworkManager/dispatcher.d/01ifupdown runs run-parts on /etc/network/if-up.d/ on "up" and on /etc/network/if-post-down.d/ on "down" (which is actually "post-down" in ifupdown terminology). And there is no /etc/network/if-post-down.d/bind9, so named doesn't get nudged when NM takes down an interface. Just reported this in bug #1087228.
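The asymmetry can be written down as a small lookup. The following Python sketch only encodes the mapping described in this comment; the function names are hypothetical, and the `exists` parameter is there so the check can be tried without touching a real /etc.

```python
import os
import posixpath

# NM dispatcher action -> ifupdown hook directory, as described in the comment above.
HOOK_DIRS = {
    "up": "/etc/network/if-up.d",
    "down": "/etc/network/if-post-down.d",
}


def hook_path(action, hook_name):
    """Return the path a hook must have in order to be run for this NM action."""
    return posixpath.join(HOOK_DIRS[action], hook_name)


def hook_is_triggered(action, hook_name, exists=os.path.exists):
    """True if a hook with that name is present in the directory used for the action."""
    return exists(hook_path(action, hook_name))
```

With a bind9 hook installed only under if-up.d and if-down.d, `hook_is_triggered("down", "bind9")` comes out False, matching the observation that named is not nudged when NM takes an interface down.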

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

I'm still having problems with this on 14.04.

After the default installation, I installed dnsmasq and DNS stopped working until system restart.

Now it's only working for a few seconds after each network-manager restart!

If I comment out
#dns=dnsmasq
in NetworkManager.conf, then everything is fine again.

For the 500+ schools that we're supporting here, we'll just continue commenting out #dns=dnsmasq because it doesn't cooperate with the regular dnsmasq installation,
but if you want me to provide more info to troubleshoot this issue, I'd be glad to.

I'm attaching the output of nm-tool. My effective dnsmasq.conf is:

$ egrep -rv '^#|^$' /etc/dnsmasq.*
/etc/dnsmasq.d/network-manager:bind-interfaces
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-range=10.160.67.0,proxy
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-range=10.161.254.0,proxy
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-range=192.168.67.20,192.168.67.250,8h
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:enable-tftp
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:tftp-root=/var/lib/tftpboot/
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-option=17,/opt/ltsp/i386
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-vendorclass=etherboot,Etherboot
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-vendorclass=pxe,PXEClient
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-vendorclass=ltsp,"Linux ipconfig"
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-boot=net:pxe,/ltsp/i386/pxelinux.0
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-boot=net:etherboot,/ltsp/i386/nbi.img
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-boot=net:ltsp,/ltsp/i386/lts.conf
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-option=vendor:pxe,6,2b
/etc/dnsmasq.d/ltsp-server-dnsmasq.conf:dhcp-no-override
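
For anyone who wants to collect the same "effective configuration" view from a script, here is a rough Python equivalent of the egrep filter above. It is a sketch, not a drop-in replacement: `effective_lines` is a hypothetical helper, and it walks a single directory such as /etc/dnsmasq.d rather than expanding the /etc/dnsmasq.* glob.

```python
import os


def effective_lines(root):
    """Yield (path, line) for every line that is neither empty nor starts with '#'."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    # Same filter as egrep -v '^#|^$': drop comments and empty lines.
                    if line and not line.startswith("#"):
                        yield path, line
```

Usage would be along the lines of `for path, line in effective_lines("/etc/dnsmasq.d"): print(f"{path}:{line}")`.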

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

The fix for this issue caused another regression: dnsmasq now doesn't function correctly as a tftp server either.

I just tried Trusty (dnsmasq 2.68-1), and network manager ships /etc/dnsmasq.d/network-manager with:
bind-interfaces

So now dnsmasq only binds 127.0.0.1 for its tftp service:
udp 0 0 127.0.0.1:69 0.0.0.0:* 954/dnsmasq
udp6 0 0 ::1:69 :::* 954/dnsmasq

...and of course that breaks everything. Removing that file makes tftp work again.
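
A quick way to spot this symptom in a script is to check whether every listening address in the netstat output is a loopback address. The Python sketch below assumes lines in the same column layout as the two shown above (local address in the fourth column); `local_address` and `loopback_only` are hypothetical helper names.

```python
import ipaddress


def local_address(netstat_line):
    """Extract the local host part from a netstat-style line."""
    local = netstat_line.split()[3]       # e.g. "127.0.0.1:69" or "::1:69"
    host, _port = local.rsplit(":", 1)    # the port follows the last colon
    return ipaddress.ip_address(host)


def loopback_only(netstat_lines):
    """True if every listed socket is bound to a loopback address."""
    return all(local_address(line).is_loopback for line in netstat_lines)
```

Fed the two lines above, `loopback_only` returns True, which is the broken state: a TFTP service bound only to 127.0.0.1 and ::1 cannot be reached by network clients.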

Mathieu, could you please package the modifications to /etc/NetworkManager/NetworkManager.conf and to /etc/dnsmasq.d/network-manager as a separate network-manager-local-resolver package, maybe even produced by the network manager source code, and Recommends: it from network-manager,

...so that people that want to use dnsmasq as a real server can just blacklist it without suffering on each new Ubuntu installation?

E.g. for the 500+ schools we maintain here, we could then just Conflicts: network-manager-local-resolver from our main package and forget the whole thing...

Thanks,
Alkis

Changed in network-manager (Ubuntu):
status: Fix Released → Confirmed
Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

Or better yet, ltsp-server-standalone could "Conflicts: network-manager-local-resolver" so that all LTSP sysadmins that use dnsmasq don't bother searching for a solution and manually editing configuration files...

Revision history for this message
Thomas Hood (jdthood) wrote :

> I just tried Trusty (dnsmasq 2.68-1), and network manager ships /etc/dnsmasq.d/network-manager with:
>
> bind-interfaces
>
> So now dnsmasq only binds 127.0.0.1 for its tftp service:
>
> udp 0 0 127.0.0.1:69 0.0.0.0:* 954/dnsmasq
> udp6 0 0 ::1:69 :::* 954/dnsmasq
>
> ...and of course that breaks everything. Removing that file makes tftp work again.

Alkis, does it work properly if you change "bind-interfaces" to "bind-dynamic"?

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

Thomas, yup, TFTP appears to be working fine with bind-dynamic.

I'll test if re-enabling "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf along with bind-dynamic allows dnsmasq to co-exist with nm-dnsmasq, and report back.

Thanks!

Revision history for this message
John Hupp (john.hupp) wrote :

Through Raring and Saucy, my two modifications to the given LTSP-PNP setup have been:

In /etc/dnsmasq.d/network-manager replace the "bind-interfaces" line with a "bind-dynamic" line.

Edit /etc/dnsmasq.d/ltsp-server-dnsmasq.conf: comment out the port=0 line

And those two mods still work for me in Saucy, but I'm running into what seems to be an NBD-related kernel bug, which I'm trying to commit bisect on the upstream kernel. Clients fail to boot, generating "Error: socket failed: connection refused."

It's off-topic, but does this problem not appear in Trusty?

Revision history for this message
Mathieu Trudel-Lapierre (cyphermox) wrote :

Now that we can use bind-dynamic, I have nothing against setting that value instead of bind-interfaces, if it indeed solves the latest issues that were reported.

However, I'd really appreciate it if separate bugs could be opened rather than reopening this bug; it would make each individual issue easier to see and fix.

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

Mathieu, I reopened this bug because it was never resolved... not just for the TFTP issue.
Please see my #143 comment.
If you want more feedback tell me what to send, but DNS never worked properly for me when dnsmasq and nm-dnsmasq are both running.

Revision history for this message
Warwick Bruce Chapman (warwickchapman) wrote :

What is the status of this as of 16.04?

Revision history for this message
Alkis Georgopoulos (alkisg) wrote :

The network-manager package still ships /etc/dnsmasq.d/network-manager
with "bind-interfaces" in it
and that breaks the TFTP server of dnsmasq
and sometimes even the DNS server of dnsmasq.

"bind-dynamic" is a little better, but too unreliable to be used in production.

So this bug is still not resolved; after 150 messages it was just made a little worse.

One workaround is to undo the "solution" offered in this bug report:
1) In /etc/NetworkManager/NetworkManager.conf, comment out: # dns=dnsmasq
2) And in /etc/dnsmasq.d/network-manager, comment out: #bind-interfaces
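
For sysadmins applying this workaround on many machines, the two edits can be scripted. Below is a minimal Python sketch of the text transformation only; `comment_out` is a hypothetical helper, and reading and writing the actual files (as root) is left out.

```python
def comment_out(config_text, directive):
    """Prefix every line equal to `directive` (ignoring surrounding spaces) with '#'."""
    out = []
    for line in config_text.splitlines(keepends=True):
        if line.strip() == directive:
            line = "#" + line
        out.append(line)
    return "".join(out)
```

Applied with "dns=dnsmasq" to the contents of NetworkManager.conf and with "bind-interfaces" to the contents of /etc/dnsmasq.d/network-manager, it leaves every other line untouched. Running it twice is harmless, because an already commented line no longer matches the directive.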

A better solution would be for Mathieu to create a separate package for the nm-spawned dnsmasq, one that would conflict with the real dnsmasq server so that it would be automatically uninstalled when the sysadmin installs the real dnsmasq.

I can send a patch for that if it will be accepted.

Revision history for this message
Steve Langasek (vorlon) wrote :

The Precise Pangolin has reached end of life, so this bug will not be fixed for that release.

Changed in dnsmasq (Ubuntu Precise):
status: Triaged → Won't Fix
Changed in network-manager (Ubuntu Precise):
status: Triaged → Won't Fix
Steve Langasek (vorlon)
Changed in djbdns (Ubuntu Precise):
status: Confirmed → Won't Fix
Displaying first 40 and last 40 comments. View all 155 comments or add a comment.