Dnsmasq fails to resolve if any upstream nameserver is unreachable

Bug #991308 reported by William Lightning
This bug affects 2 people
Affects                   Status   Importance  Assigned to  Milestone
dnsmasq (Ubuntu)          Expired  Undecided   Unassigned
network-manager (Ubuntu)  Invalid  Medium      Unassigned

Bug Description

Release: 12.04
Hardware: HP Compaq 6910p Laptop

If the backup DNS server is unreachable, dnsmasq does not try to reach the primary DNS server.

In the following default configuration (clean install of 12.04), 1 out of 4 DNS queries fails. (dnsmasq seems to rotate queries between the listed servers.)

wlightning@archon:~$ cat /run/nm-dns-dnsmasq.conf
server=75.75.75.75
server=75.75.76.76
wlightning@archon:~$

wlightning@archon:~$ cat /etc/resolv.conf
# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
# DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN
nameserver 127.0.0.1
search hsd1.or.comcast.net
wlightning@archon:~$

wlightning@archon:~$ nm-tool

NetworkManager Tool

State: connected (global)

- Device: ttyUSB0 --------------------------------------------------------------
  Type: Mobile Broadband (CDMA)
  Driver: sierra
  State: disconnected
  Default: no

  Capabilities:

- Device: wlan0 ----------------------------------------------------------------
  Type: 802.11 WiFi
  Driver: iwl4965
  State: unavailable
  Default: no
  HW Address: 00:1D:E0:35:69:AF

  Capabilities:

  Wireless Properties
    WEP Encryption: yes
    WPA Encryption: yes
    WPA2 Encryption: yes

  Wireless Access Points

- Device: eth0 [Wired connection 1] -------------------------------------------
  Type: Wired
  Driver: e1000e
  State: connected
  Default: yes
  HW Address: 00:1B:38:E7:F5:02

  Capabilities:
    Carrier Detect: yes
    Speed: 1000 Mb/s

  Wired Properties
    Carrier: on

  IPv4 Settings:
    Address: 10.0.0.11
    Prefix: 24 (255.255.255.0)
    Gateway: 10.0.0.1

    DNS: 75.75.75.75
    DNS: 75.75.76.76

wlightning@archon:~$

mtr of 75.75.75.75 (anycast addr):
                                       My traceroute [v0.80]
archon (0.0.0.0) Sun Apr 29 10:13:58 2012
Keys: Help Display mode Restart statistics Order of fields quit
                    Packets              Pings
 Host                Loss%  Snt   Last    Avg   Best   Wrst  StDev
 1. 10.0.0.1          0.0%   51    0.6    0.6    0.6    1.5    0.2
 2. 73.94.124.1       0.0%   51    9.0    9.2    7.1   14.7    1.5
 3. 68.85.148.1       0.0%   51    7.9   10.6    7.4   24.7    4.0
 4. 68.87.216.13      0.0%   51    9.2   19.8    8.5  150.8   29.0
 5. 68.87.218.162     0.0%   50   10.8   14.1    8.1   85.0   13.4
    68.87.218.158
 6. 68.87.216.41      0.0%   50    9.5   13.2    9.5   67.4    8.6
 7. 75.75.75.75       0.0%   50    9.9   11.0    9.1   17.2    1.8

mtr of 75.75.76.76 (anycast addr):
                                       My traceroute [v0.80]
archon (0.0.0.0) Sun Apr 29 10:12:42 2012
Keys: Help Display mode Restart statistics Order of fields quit
                    Packets              Pings
 Host                Loss%  Snt   Last    Avg   Best   Wrst  StDev
 1. 10.0.0.1          0.0%   51    0.6    0.8    0.5    3.6    0.5
 2. 73.94.124.1       0.0%   51    9.0    9.3    7.0   18.3    2.5
 3. 68.85.148.9       0.0%   51    7.3    9.6    7.2   22.3    2.9
 4. 68.85.243.253     0.0%   51    8.3   13.8    7.5   81.9   15.6
 5. 68.86.95.97       0.0%   51   14.5   17.2   11.9   33.0    4.4
    68.86.91.197
    68.86.90.213
    68.86.95.93
    68.86.95.89
 6. 68.86.95.214      0.0%   51   39.5   41.5   38.1   74.5    6.4
 7. 68.86.199.130     0.0%   51   51.9   40.8   38.2   55.3    3.3
 8. 75.75.76.76      44.0%   50   38.5   40.2   38.3   46.2    2.2

William Lightning (kassah) wrote :

Added the regression-release tag because the root issue that caused me to find this was not present on 11.10: "In the following default configuration 1 out of 4 DNS queries fail. (DNSMasq seems to rotate queries between the listed servers)"

More information:
I originally noticed this when the computer reported that it was unable to update packages because it could not resolve the hostnames needed to download them. It was further evidenced by Firefox loading pages inconsistently (i.e. it would load Google fine for a couple of clicks, then fail, then start working again), while my 11.10 desktop was able to load the same pages consistently.

tags: added: regression-release
Mathieu Trudel-Lapierre (cyphermox) wrote :

Well, dnsmasq is supposed to try the servers in parallel and use the fastest response; not really rotate using one or the other.

It would be useful if you could provide a packet trace of what is happening when you open Firefox and notice this behavior; I would like to see whether dnsmasq really is only asking one of the DNS servers in this case. I suspect you may instead notice that one is not responding and the other is very slow, or the other way around.

Changed in network-manager (Ubuntu):
status: New → Incomplete
importance: Undecided → Medium
Thomas Hood (jdthood) wrote :

Workaround: In /etc/NetworkManager/NetworkManager.conf comment out "dns=dnsmasq" so that dnsmasq isn't used.

Thomas Hood (jdthood) wrote :

 Simon, do you think that dnsmasq could misbehave as described here?

Simon Kelley (simon-thekelleys) wrote :

>Simon, do you think that dnsmasq could misbehave as described here?
The only way I can see for this to occur is if a DNS server is returning wrong (i.e. NXDOMAIN or NODATA) answers; no answer shouldn't be a problem.

I suggest adding --log-queries to the dnsmasq configuration to try and get a handle on what's happening.

Simon.

William Lightning (kassah) wrote :

Well, when I did a manual dig @75.75.76.76, half of the queries timed out. The few responses it gave did seem good. That may not have been representative of the overall behavior, though.

Unfortunately (or fortunately) the issue has stopped happening.

Thank you for your attention to this issue.
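
The manual check described above (querying one upstream server directly and counting timeouts) can be repeated for both servers with a short script. This is only a sketch using the Python standard library: the function names, the attempt count and the 2-second timeout are assumptions for illustration, and the reply is not parsed, so any datagram that comes back counts as an answer.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed with its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def count_failures(server, name, attempts=10, timeout=2.0):
    """Send `attempts` UDP queries to `server`; return how many got no reply."""
    failures = 0
    for i in range(attempts):
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            try:
                sock.sendto(build_query(name, query_id=i + 1), (server, 53))
                sock.recvfrom(512)  # any datagram counts; the reply is not parsed
            except OSError:  # covers timeouts and unreachable-network errors
                failures += 1
    return failures
```

Calling count_failures("75.75.75.75", "gmail.com") and count_failures("75.75.76.76", "gmail.com") sends real queries to the two servers from this report and would show whether one or both of them drop a noticeable share of queries.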

William Lightning (kassah) wrote :

Apparently I spoke too soon =(

Thomas Hood (jdthood) wrote :

Does the standalone dnsmasq behave the same way?

To find out, disable nm-dnsmasq by commenting out the line "dns=dnsmasq" in /etc/NetworkManager/NetworkManager.conf. Restart network-manager. Then install the dnsmasq package.

If standalone dnsmasq also misbehaves then turn on log-queries mode: add the line

    DNSMASQ_OPTS="--log-queries"

to /etc/default/dnsmasq and restart dnsmasq with "sudo /etc/init.d/dnsmasq restart". When you look up names dnsmasq will now log its actions in /var/log/syslog. Attach a file containing syslog snippets showing what happens behind the scenes when dnsmasq fails to resolve a name.

Thomas Hood (jdthood) wrote :

In bug #979067 there is a similar description of apparent dnsmasq misbehavior.

There Guillaume Melquiond writes:
> It takes a lot of failures before all the unreachable servers
> have been exhausted, which makes for a poor user experience.

William Lightning (kassah) wrote :

Yes, it happens with dnsmasq by itself. However, it appears that dnsmasq is caching the result, so once it sees a good one it will keep serving it.

Here is some example output from the syslog:

Jun 22 13:42:18 archon dnsmasq[9635]: started, version 2.59 cachesize 150
Jun 22 13:42:18 archon dnsmasq[9635]: compile time options: IPv6 GNU-getopt DBus i18n DHCP TFTP conntrack IDN
Jun 22 13:42:18 archon dnsmasq[9635]: reading /var/run/dnsmasq/resolv.conf
Jun 22 13:42:18 archon dnsmasq[9635]: using nameserver 75.75.76.76#53
Jun 22 13:42:18 archon dnsmasq[9635]: using nameserver 75.75.75.75#53
Jun 22 13:42:18 archon dnsmasq[9635]: read /etc/hosts - 8 addresses
Jun 22 13:42:18 archon dnsmasq[9635]: query[SOA] local from 127.0.0.1
Jun 22 13:42:18 archon dnsmasq[9635]: forwarded local to 75.75.75.75
Jun 22 13:42:18 archon dnsmasq[9635]: forwarded local to 75.75.76.76
Jun 22 13:42:18 archon dnsmasq[9635]: query[SOA] local from 127.0.0.1
Jun 22 13:42:18 archon dnsmasq[9635]: forwarded local to 75.75.75.75
..
Jun 22 13:43:09 archon dnsmasq[9635]: query[A] slashdot.org from 127.0.0.1
Jun 22 13:43:09 archon dnsmasq[9635]: forwarded slashdot.org to 75.75.75.75
Jun 22 13:43:09 archon dnsmasq[9635]: reply slashdot.org is 216.34.181.45
Jun 22 13:43:26 archon dnsmasq[9635]: query[A] slashdot.org from 127.0.0.1
Jun 22 13:43:26 archon dnsmasq[9635]: cached slashdot.org is 216.34.181.45
..
Jun 22 13:46:00 archon dnsmasq[9814]: query[A] gmail.com from 127.0.0.1
Jun 22 13:46:00 archon dnsmasq[9814]: forwarded gmail.com to 75.75.75.75
Jun 22 13:46:00 archon dnsmasq[9814]: forwarded gmail.com to 75.75.76.76
Jun 22 13:46:05 archon dnsmasq[9814]: query[A] gmail.com from 127.0.0.1
Jun 22 13:46:05 archon dnsmasq[9814]: forwarded gmail.com to 75.75.75.75
Jun 22 13:46:05 archon dnsmasq[9814]: forwarded gmail.com to 75.75.76.76
Jun 22 13:46:10 archon dnsmasq[9814]: query[A] gmail.com from 127.0.0.1
Jun 22 13:46:10 archon dnsmasq[9814]: forwarded gmail.com to 75.75.75.75
Jun 22 13:46:10 archon dnsmasq[9814]: forwarded gmail.com to 75.75.76.76

the last was generated by dig @localhost gmail.com
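
Picking the failed lookups out of a longer log by eye is tedious, so here is a rough sketch of a filter for it. It assumes the line format shown in the excerpt above ("query[...]", "forwarded", "reply" and "cached" entries); the function name and the regular expression are my own and have not been checked against other dnsmasq versions. A name is reported if it was queried and no "reply" or "cached" line for it followed.

```python
import re
from collections import Counter

# Matches the log-queries lines seen in the excerpt, e.g.
# "... dnsmasq[9814]: query[A] gmail.com from 127.0.0.1"
LINE = re.compile(r"dnsmasq\[\d+\]: (query\[[^\]]+\]|forwarded|reply|cached) (\S+) ")


def unanswered(lines):
    """Return {name: number of queries seen since the last answer for that name}."""
    pending = Counter()
    for line in lines:
        match = LINE.search(line)
        if not match:
            continue  # e.g. "using nameserver ..." or startup messages
        action, name = match.groups()
        if action.startswith("query"):
            pending[name] += 1
        elif action in ("reply", "cached"):
            pending[name] = 0  # any answer clears the outstanding queries
        # "forwarded" lines are ignored: they show an attempt, not a result
    return {name: count for name, count in pending.items() if count > 0}
```

Run over the excerpt above (for example with open("/var/log/syslog") as the input), it would flag gmail.com and leave out slashdot.org, which was answered first by an upstream server and then from the cache.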

William Lightning (kassah) wrote :

Just noticed this as well:
http://mydeviceinfo.comcast.net/device.php?tier=-1&devid=298&e=0&d3=0&s=n&so=0&sc=1546

To quote the red bit at top:
"Comcast has identified a software defect on the Arris TG852 and TG862, which may cause problems for a small number of users attempting to use third party DNS services. Arris and Comcast are working to correct this issue and will deploy updated device firmware to resolve the issue. If a customer does not wish to wait for the updated firmware, the customer may email us at <email address hidden> and a replacement device will be provided at no cost to the customer."

However, I am using the default Comcast-provided DNS servers, 75.75.75.75 and 75.75.76.76, so I am not sure whether this would affect me.

Guillaume Melquiond (guillaume-melquiond) wrote :

Here is what I get when routing to the DNS servers goes through the wrong interfaces. See bug #979067 for additional details on my setup.

Jun 25 12:50:51 dnsmasq[3799]: reading /var/run/dnsmasq/resolv.conf
Jun 25 12:50:51 dnsmasq[3799]: using nameserver yy.yy.36.37#53
Jun 25 12:50:51 dnsmasq[3799]: using nameserver yy.yy.34.35#53
Jun 25 12:50:51 dnsmasq[3799]: using nameserver xx.xx.213.253#53
Jun 25 12:50:51 dnsmasq[3799]: query[A] www.google.fr from 127.0.0.1
Jun 25 12:50:51 dnsmasq[3799]: forwarded www.google.fr to xx.xx.213.253
Jun 25 12:50:51 dnsmasq[3799]: forwarded www.google.fr to yy.yy.34.35
Jun 25 12:50:51 dnsmasq[3799]: forwarded www.google.fr to yy.yy.36.37
Jun 25 12:50:51 dnsmasq[3799]: nameserver yy.yy.34.35 refused to do a recursive query

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in dnsmasq (Ubuntu):
status: New → Confirmed
Thomas Hood (jdthood) wrote :

Guillaume wrote in #979067:

> [Dnsmasq] tries to access DNS servers from the wireless
> network [...and...] fails to resolve requests. It takes a lot of
> failures before all the unreachable servers have been
> exhausted, which makes for a poor user experience.

William reports that dnsmasq does not try other nameservers at all, whereas Guillaume seems to be saying that dnsmasq will only try other nameservers after a long time. Can you guys describe in more detail what happens?

Guillaume Melquiond (guillaume-melquiond) wrote :

What I meant is that DNS resolution always fails at first, and then after lots of tries, it starts to work. But now that I look at the logs, it may just have been a fluke: it eventually works because one of the reachable nameservers happens to reply fast enough.

Thomas Hood (jdthood) wrote :

What do you mean by "fast enough", though? Do you mean that all the listed nameservers are very slow, some of them infinitely so (perhaps because of a routing problem) and dnsmasq either always (William's case) or usually (your case) times out before any nameserver replies?

Guillaume Melquiond (guillaume-melquiond) wrote :

No, on the contrary: they reply too fast, in a sense. If you look at the log I posted, dnsmasq does receive some kind of reply from the unreachable nameservers. I don't know whether this is a crafted reply by the firewall or whether the query actually reached the nameserver but it replied with some kind of failure because the query did not come through the proper route. If you are interested, I can capture a dump of the packets the next time I'm on these networks. Anyway, as I currently understand it, all the nameservers queried by dnsmasq are replying somehow, and dnsmasq only considers the first reply. As a consequence, there is a race between the servers, so DNS resolution will succeed only if the properly routed server replies the fastest.
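
The race described here can be put into a toy model. This is a sketch of the hypothesis only, not of dnsmasq's actual forwarding code: every server replies after a random delay, only the first reply is used, and the lookup succeeds if that first reply comes from a server that answers properly. The exponential delay distribution, the mean latencies and the function names are all assumptions made for illustration.

```python
import random


def first_reply_is_good(servers, rng):
    """servers: list of (mean_latency_ms, answers_properly) pairs.

    Draw one reply time per server and report whether the earliest
    reply comes from a server that answers properly.
    """
    arrivals = [(rng.expovariate(1.0 / mean), good) for mean, good in servers]
    return min(arrivals)[1]


def success_rate(servers, trials=10_000, seed=0):
    """Estimate the fraction of lookups won by a properly answering server."""
    rng = random.Random(seed)
    wins = sum(first_reply_is_good(servers, rng) for _ in range(trials))
    return wins / trials
```

With one good server and one server that sends a quick failure, both at the same mean latency, success_rate([(10.0, True), (10.0, False)]) comes out near one half, and it drops further as more misrouted servers join the race. That would match lookups that fail at first and then succeed "after lots of tries".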

Thomas Hood (jdthood) wrote :

(Marked as invalid for n-m because this is probably a dnsmasq issue. Can add n-m back in if it turns out nm-dnsmasq needs to be run with other options.)

> dnsmasq does receive some kind of reply from the unreachable nameservers

This bug is getting interesting! :)

summary: - DNS Querying fails if any DNS server is unreachable
+ Dnsmasq fails to resolve if any upstream nameserver is unreachable
Changed in network-manager (Ubuntu):
status: Incomplete → Invalid
William Lightning (kassah) wrote :

I continue to have issues here. This is likely unrelated, but it is worth mentioning that VirtualBox VMs using the NAT virtual network are also affected by the issues in question (VMs using the Bridged virtual network adapter are not).

I've also noticed that on the website www.portlandgeneral.com, logging in with the quick login always fails on lookup (in VirtualBox VMs with NAT too!).

William Lightning (kassah) wrote :

I may need to look in the logs; I'm not 100% sure it's the lookup that's failing. Just figured I'd mention it here.

Thomas Hood (jdthood) wrote :

Returning to this issue....

Consider the log in comment #10. A successful lookup is done.

> Jun 22 13:43:09 archon dnsmasq[9635]: query[A] slashdot.org from 127.0.0.1
> Jun 22 13:43:09 archon dnsmasq[9635]: forwarded slashdot.org to 75.75.75.75
> Jun 22 13:43:09 archon dnsmasq[9635]: reply slashdot.org is 216.34.181.45

Later an unsuccessful one is done.

> Jun 22 13:46:00 archon dnsmasq[9814]: query[A] gmail.com from 127.0.0.1
> Jun 22 13:46:00 archon dnsmasq[9814]: forwarded gmail.com to 75.75.75.75
> Jun 22 13:46:00 archon dnsmasq[9814]: forwarded gmail.com to 75.75.76.76

It looks as if *both* nameservers are failing to respond in the latter case.

It was originally reported that one out of four queries fails. This is the failure frequency one would get if each nameserver failed half the time and the servers fail independently of each other. William reported in comment #6 that one of the nameservers, 75.75.76.76, fails half the time. Is it possible that the other nameserver is also failing half the time?
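
The arithmetic behind that estimate, as a minimal sketch: if a lookup fails only when every upstream server fails, and the servers fail independently, the overall failure rate is the product of the per-server rates. The 0.5 figures are the assumed rates from the paragraph above (only the rate for 75.75.76.76 was actually observed), and the function name is made up for illustration.

```python
import math


def p_lookup_fails(per_server_failure_rates):
    """Probability that all servers fail, assuming independent failures."""
    return math.prod(per_server_failure_rates)


both_flaky = p_lookup_fails([0.5, 0.5])   # 0.25, i.e. one lookup in four
one_healthy = p_lookup_fails([0.5, 0.0])  # 0.0, a reliable second server masks the first
```

So a one-in-four failure rate fits two servers that each fail half the time, but not one flaky server next to a reliable one.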

More generally, William, have you learned anything new about the problem in the past month and a half?

Changed in dnsmasq (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for dnsmasq (Ubuntu) because there has been no activity for 60 days.]

Changed in dnsmasq (Ubuntu):
status: Incomplete → Expired
William Lightning (kassah) wrote :

Apologies, this is no longer an issue for me in 12.10.
