[Jaunty] Multiple DNS resolver issues on ia32 and AMD64

Bug #326718 reported by floid
This bug report is a duplicate of: Bug #313218: IPV6 causes slow internet access.
Affects         Status   Importance  Assigned to  Milestone
apt (Ubuntu)    Invalid  Undecided   Unassigned
  Nominated for Jaunty by floid
glibc (Ubuntu)  New      Undecided   Unassigned
  Nominated for Jaunty by floid

Bug Description

[EDIT: This bug was originally filed against apt as "apt-get "Could not resolve" DNS names (64-bit Jaunty Alpha 4)"]
[It has since become obvious that I have uncovered at least two separate bugs or unexpected behaviors in system-wide resolver behavior, though I haven't ruled out further fragility in apt-get itself!]
[I am hesitant to completely rewrite this "Description" which describes my first experience of the problem; please see the further comments and attachments on this bug that expand the scope of the issue.]
[Of course, prior to this edit, the original "Binary package hint:" for this bug was "apt".]

===ORIGINAL DESCRIPTION BELOW:===

apt-get fails to resolve DNS names while other programs, including nslookup, dig, Firefox, and ping, are unaffected.

This results in warnings like:
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/jaunty-security/Release.gpg Could not resolve 'security.ubuntu.com'
...and obviously makes it difficult to use apt!

In researching the problem, people seem fond of blaming nsswitch/mdns/avahi, specific DNS servers or the phase of the moon. However, there is clearly some real bug (or unexpected behavior) here.

Not aiding my attempts to debug the issue: when I attempt to use tcpdump to monitor the DNS lookups, e.g. with `sudo tcpdump -ni eth0 -vvvv > dump`, the problem often goes away! Then, without tcpdump running, it's back! Perhaps this is actually a kernel bug of some flavor?

I am having a bit of trouble following the actual lookups (and results) in the strace output [produced via `sudo strace -v -s4096 apt-get update`] attached. Can someone have a look at what is going on there? At the time that was taken:

resolv.conf was/is:

# Generated by NetworkManager
domain gateway.2wire.net
search gateway.2wire.net
nameserver 172.16.0.1

[This is a 2Wire Homeportal gateway, which has some sort of DNS cache/DNS proxy in it -- at least enough to let you address the device by the hostname 'homeportal'. I've never had any trouble running lookups through it on numerous other Linux, BSD, and Windows systems.]

nsswitch.conf was:

# /etc/nsswitch.conf
#
# Example configuration of GNU Name Service Switch functionality.
# If you have the `glibc-doc-reference' and `info' packages installed, try:
# `info libc "Name Service Switch"' for information about this file.

passwd: compat
group: compat
shadow: compat

#hosts: files mdns4_minimal [NOTFOUND=continue] dns mdns4
hosts: files dns
networks: files

protocols: db files
services: db files
ethers: db files
rpc: db files

netgroup: nis

[The behavior was first observed with Jaunty's default nsswitch.conf; I first tried the [NOTFOUND=continue] edit for mdns lookups commented above, then chopped it down to only files and dns, and have since chopped it down to *only* "dns." Mysteriously, chopping it down to only "dns" has "switched the bug around" -- now apt-get can resolve us.archive.ubuntu.com but security.ubuntu.com fails. Unless I'm running tcpdump on the interface - then everything works!?]

floid (jkanowitz) wrote :

Hardware is SMP (4850e dual-core) based around an Asus M3A78-EM with a RealTek gigabit chip -- I think this is really an 8111c but would have to open the case to be sure.

floid (jkanowitz) wrote :

Figures - I've jumped the gun on reporting this as apt's bug.

I decided to test Jaunty Alpha 4, *32-bit* on the same hardware; it shows the same behavior with apt, but crucially -- and different from what I've observed with the 64-bit install -- Firefox was also getting confused, not finding "bugs.launchpad.net," for instance.

I now smell two bugs, one an apparent race in "search" domain handling, the other still mysterious.

...

Regarding the race condition, my gateway *is* sort of to blame; via DHCP, the following ends up in resolv.conf (note the domain and search keywords):

ubuntu@ubuntu:~$ cat /etc/resolv.conf
# Generated by NetworkManager
domain gateway.2wire.net
search gateway.2wire.net
nameserver 172.16.0.1

[I should note that I'm not aware of a way to *not* make it advertise that as the domain via DHCP, but that hasn't been a problem for other systems. Notably, this does not even pose a problem for an Ubuntu 8.10 box here with an identical resolv.conf!]

Attached here is verbose tcpdump (-n -s1500 -vvv -XX) output when trying to reach bugs.launchpad.net with Firefox: With the above search keyword in resolv.conf, it appears queries for both "bugs.launchpad.net" and "bugs.launchpad.net.gateway.2wire.net" are made, and if the response for the latter is received later, the resolver ignores the properly returned A record for the former. "Oops!" Now why is it sending the request for "bugs.launchpad.net.gateway.2wire.net" *after* the A record for "bugs.launchpad.net" is properly received!?

Removing the domain and search statements from resolv.conf convinces the resolver to stop making those queries. bugs.launchpad.net then resolves properly -- and I'm able to post this update.
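For reference, the way a stub resolver is documented to use the search list can be sketched roughly as below. This follows the rules in resolv.conf(5) (the last `domain` or `search` line wins, and `ndots` defaults to 1); it is not glibc's actual code, and the helper names `parse_resolv_conf` and `candidate_names` are made up for this illustration.

```python
# Sketch of the documented resolv.conf(5) search-list rules, not glibc's code.
# parse_resolv_conf() and candidate_names() are hypothetical helper names.

RESOLV_CONF = """\
# Generated by NetworkManager
domain gateway.2wire.net
search gateway.2wire.net
nameserver 172.16.0.1
"""


def parse_resolv_conf(text):
    """Return (nameservers, search_list) from resolv.conf-style text."""
    nameservers, search = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or comment
        fields = line.split()
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword == "domain" and args:
            search = [args[0]]  # "domain" and "search" are exclusive: last wins
        elif keyword == "search":
            search = args
    return nameservers, search


def candidate_names(name, search, ndots=1):
    """List the fully qualified names to try, in the documented order."""
    if name.endswith("."):
        return [name]  # already absolute, the search list is not applied
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]


if __name__ == "__main__":
    _, search_list = parse_resolv_conf(RESOLV_CONF)
    for host in ("bugs.launchpad.net", "homeportal"):
        print(host, "->", candidate_names(host, search_list))
```

Under these rules the search-domain variant of "bugs.launchpad.net" is only a fallback after the plain name, so seeing it queried after a good A record had already arrived is consistent with something else in the lookup having been treated as a failure.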

floid (jkanowitz) wrote :

This is the dump for a run of `apt-get update` made *after* resolv.conf was edited (and confirmed not to have been overwritten by any DHCP renewal), where apt-get still generated the following warnings:

W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jaunty/Release.gpg Could not resolve 'archive.ubuntu.com'

W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jaunty/main/i18n/Translation-en_US.bz2 Could not resolve 'archive.ubuntu.com'

W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jaunty/restricted/i18n/Translation-en_US.bz2 Could not resolve 'archive.ubuntu.com'

W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jaunty-updates/Release.gpg Could not resolve 'archive.ubuntu.com'

W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jaunty-updates/main/i18n/Translation-en_US.bz2 Could not resolve 'archive.ubuntu.com'

W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/jaunty-updates/restricted/i18n/Translation-en_US.bz2 Could not resolve 'archive.ubuntu.com'

...

Again, both of these packet dumps are from this test boot of *32-bit* Jaunty, so this is not just a 64-bit problem.

Even more bizarrely, while removing the search keyword from resolv.conf *initially* let me reach bugs.launchpad.net reliably, something unknown has happened while I've been attempting to post *this* update - once again, Firefox can't resolve the host even as `dig`, `host`, and a plethora of other tools reliably resolve it. The problem is intermittent, since clearly I'm posting now...

floid (jkanowitz) wrote :

Agh, wrong attachment for the post directly above. *This* attachment is the full run; I accidentally attached a snippet taken previously.

floid (jkanowitz)
description: updated
floid (jkanowitz) wrote :

I should also note that this smells similar to bug #291589, but an 8.10 install here with libc6 "2.8~20080505-0ubuntu7" is not having problems, while "2.9-0ubuntu9" on 64-bit [and whatever ia32 Alpha 4 ships with, if different] is showing the issue.

floid (jkanowitz) wrote :

Still a problem with libc6 2.9-0ubuntu10 (as anticipated, nothing obviously related in the changelog).

floid (jkanowitz) wrote :

I notice RedHatters are experiencing this as well with glibc 2.9.x, although it may be stirred in with other issues:
https://bugzilla.redhat.com/show_bug.cgi?id=459756

Everyone seems to be poking at it rather blindly right now; my own (uneducated) guess is that even in relatively sane configurations, there can be a race based on the order in which A and AAAA answers are returned, absent more mechanisms to ensure good A answers aren't discarded or ignored in favor of AAAA on systems that really only have IPv4 routes anyway.
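One way to poke at this guess from user space is to ask the libc resolver for the same name with different address families and compare the results. The sketch below uses Python's `socket.getaddrinfo`, which wraps the C library call; the assumption (not verified here) is that the `AF_UNSPEC` request is the one that triggers both A and AAAA queries, while `AF_INET` asks for A only. The function name `resolve` is made up for this example.

```python
# Diagnostic sketch: compare libc resolver results per address family.
# Assumption: AF_UNSPEC exercises the combined A + AAAA path in the resolver.
import socket
import sys


def resolve(host, family=socket.AF_UNSPEC):
    """Return (sorted unique addresses, error text or None) for host."""
    try:
        infos = socket.getaddrinfo(host, None, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return [], str(exc)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos}), None


if __name__ == "__main__":
    hosts = sys.argv[1:] or ["localhost"]
    families = (
        ("AF_UNSPEC", socket.AF_UNSPEC),
        ("AF_INET", socket.AF_INET),
        ("AF_INET6", socket.AF_INET6),
    )
    for host in hosts:
        for label, family in families:
            addresses, error = resolve(host, family)
            print(f"{host:30} {label:10} {addresses or error}")
```

Run with the failing names as arguments, e.g. `python3 check.py archive.ubuntu.com security.ubuntu.com` (the script name is arbitrary). If `AF_INET` succeeds reliably while `AF_UNSPEC` fails intermittently, that would point toward the A/AAAA interaction rather than the nameserver.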

[If this is the case... then the idea in the RedHat bug is probably not bad; being able to specify policy in nsswitch.conf along the lines of "hosts: files dnsv4 [notfound=continue] dnsv6" would 'prefer' A responses without masking the existence of v6-only systems that only have AAAA records, not A.]

I'm surprised this isn't getting more attention - am I alone, or is it that anyone else affected is having trouble reaching bugs.launchpad.net to comment? :}

Jane Atkinson (irihapeti) wrote :

I'm not sure that this is just a Jaunty issue, unless the problem I'm having is caused by something totally different.

I'm running Hardy 32 bit on two machines linked by crossover ethernet cable. The desktop machine is on dialup and is set up to forward to the laptop.

Laptop is fine as far as resolving addresses is concerned. The desktop is returning this message when I run apt-get:

....
W: Failed to fetch http://archive.canonical.com/ubuntu/dists/hardy/Release.gpg Could not resolve 'archive.canonical.com'

W: Failed to fetch http://archive.canonical.com/ubuntu/dists/hardy/partner/i18n/Translation-en_NZ.bz2 Could not resolve 'archive.canonical.com'

W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/hardy-security/Release.gpg Could not resolve 'security.ubuntu.com'

W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/hardy-security/main/i18n/Translation-en_NZ.bz2 Could not resolve 'security.ubuntu.com'

W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/hardy-security/restricted/i18n/Translation-en_NZ.bz2 Could not resolve 'security.ubuntu.com'

W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/hardy-security/universe/i18n/Translation-en_NZ.bz2 Could not resolve 'security.ubuntu.com'

W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/hardy-security/multiverse/i18n/Translation-en_NZ.bz2 Could not resolve 'security.ubuntu.com'

W: Failed to fetch http://debian.scribus.net/debian/dists/hardy/Release.gpg Could not resolve 'debian.scribus.net'

W: Failed to fetch http://debian.scribus.net/debian/dists/hardy/main/i18n/Translation-en_NZ.bz2 Could not resolve 'debian.scribus.net'

W: Failed to fetch http://debian.scribus.net/debian/dists/hardy/non-free/i18n/Translation-en_NZ.bz2 Could not resolve 'debian.scribus.net'

W: Failed to fetch http://ppa.launchpad.net/psyke83/ubuntu/dists/hardy/Release.gpg Could not resolve 'ppa.launchpad.net'

W: Failed to fetch http://ppa.launchpad.net/psyke83/ubuntu/dists/hardy/main/i18n/Translation-en_NZ.bz2 Could not resolve 'ppa.launchpad.net'

W: Failed to fetch http://ppa.launchpad.net/laney/ubuntu/dists/hardy/Release.gpg Could not resolve 'ppa.launchpad.net'

W: Failed to fetch http://ppa.launchpad.net/laney/ubuntu/dists/hardy/main/i18n/Translation-en_NZ.bz2 Could not resolve 'ppa.launchpad.net'

W: Failed to fetch http://packages.medibuntu.org/dists/hardy/Release.gpg Could not resolve 'packages.medibuntu.org'

W: Failed to fetch http://packages.medibuntu.org/dists/hardy/free/i18n/Translation-en_NZ.bz2 Could not resolve 'packages.medibuntu.org'

W: Failed to fetch http://packages.medibuntu.org/dists/hardy/non-free/i18n/Translation-en_NZ.bz2 Could not resolve 'packages.medibuntu.org'

W: Some index files failed to download, they have been ignored, or old ones used instead.
W: You may want to run apt-get update to correct these problems
...

I'm also noticing some problems with Firefox. I can get a "page not found" error, and then click on "try again", and the page resolves straight away.

Initially I blamed apt-cacher-ng for the problems. Having tried changing all sorts of host config files, I ended up reinstalling from a backup. I didn't reinstall apt-cac...


floid (jkanowitz) wrote :

Hi "H" - Thank you for the attention; may I suggest filing a separate bug so we don't reenact the RedHat bug's confusion over multiple glibc versions? :}

To better understand your issue, it will help to post the versions of glibc installed on each machine - `dpkg-query -s libc6` will return this, if memory serves - and what each machine is using for a name server (contents of /etc/resolv.conf for each; contents of /etc/nsswitch.conf also).

I'm not familiar with the plug'n'play forwarding features of Ubuntu, if they're what you're using, but it can be easy to get into strange situations when using NAT, particularly if UDP packets are involved. You could have a resolver bug, but you could also have a NAT bug/misconfiguration that, for instance, is routing some responses to the laptop incorrectly at the expense of the desktop. If you are also using a software firewall on the desktop, that's another variable that can also accidentally block legitimate packets. Are you only having the problem with apt-get and other glibc resolver consumers, or does `dig` fail as well?

In my case, I am using a DNS server that is "known good" as far as not presenting any problems with 8.10 / libc6 "2.8~20080505-0ubuntu7" or BSD boxes on the same network, or to `dig` on the same machine, so I can at least claim to have identified specific new misbehavior between libc6 2.9-0ubuntu10 and the working libc6 2.8~20080505-0ubuntu7.

Jane Atkinson (irihapeti) wrote :

Floid

Thanks for the prompt reply.

Version of libc6 is the same for both machines: 2.7-10ubuntu4

resolv.conf is also the same on both machines:
...
nameserver 202.27.158.40
nameserver 202.27.156.72
...

These belong to my isp, and there are no problems reported at the moment.

...
# /etc/nsswitch.conf

passwd: compat
group: compat
shadow: compat

hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4
networks: files

protocols: db files
services: db files
ethers: db files
rpc: db files

netgroup: nis
...
Again, the same on both machines.

`dig` resolves all the addresses that apt-get can't find. I tested it on both machines.

I'm using a software firewall on the desktop (just an iptables script). It has been working OK previously and I just added some packet forwarding commands. (Incidentally, I'm new to a lot of this, and I'm finding that much of the documentation assumes that one already knows about running servers.)

I take your point about filing a separate bug. At this stage I'll not do so, because there's a fairly good to excellent chance that my situation is self-inflicted. :) I just wondered if my situation would throw any light on your issues. I'm happy to answer further questions if it's going to be helpful to you, otherwise I'll leave it to you people.

I'll probably need to restore the desktop from backup again in the next day or so, but I have taken a backup copy of the current system so it can be revisited later if needed.

Heni

floid (jkanowitz) wrote :

Hm. Specific to Heni's issue here:

Nothing really rings a bell for me right now, but `dig` working reliably is telling - packets must be getting through, so you're having some type of resolver or nsswitch trouble.

One way to achieve a little more determinism is to remove, or reorder, the "mdns4_minimal" priority in nsswitch.conf; for instance, change:
hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4
to:
hosts: files dns mdns4

This removes the priority given to Avahi/mDNS lookups, which I see people have sometimes blamed for erratic resolver behavior. I doubt this is really the culprit, but it never hurts to simplify while troubleshooting. (One thing that has bothered me: what is the rationale for making [NOTFOUND=return] the default there? For desktop installs, shouldn't it be =continue, so if mDNS is somehow "accidentally" tried -- say in situations with both mDNS and a .local domain in the conventional DNS -- there's the best chance that the right thing happens in the end?)
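To make the effect of [NOTFOUND=return] concrete, here is a small sketch of how a "hosts:" line is walked according to nsswitch.conf(5): sources are tried left to right, and after each one the status decides whether to stop or carry on (defaults: SUCCESS=return, everything else continue). The function names and the per-source outcomes fed in are hypothetical; negated criteria such as "!UNAVAIL" are not handled.

```python
# Sketch of nsswitch.conf(5) source ordering and [STATUS=action] handling.
# parse_hosts_line() and sources_consulted() are made-up names for illustration.
import re

DEFAULT_ACTIONS = {
    "success": "return",
    "notfound": "continue",
    "unavail": "continue",
    "tryagain": "continue",
}

DEFAULT_HOSTS = "hosts: files mdns4_minimal [NOTFOUND=return] dns mdns4"
SIMPLIFIED_HOSTS = "hosts: files dns mdns4"


def parse_hosts_line(line):
    """Turn a 'hosts:' line into [(source, {status: action}), ...]."""
    _, _, spec = line.partition(":")
    entries = []
    for token in re.findall(r"\[[^\]]*\]|\S+", spec):
        if token.startswith("["):
            if not entries:
                continue  # a criterion with no preceding source is ignored here
            for item in token[1:-1].split():
                status, _, action = item.partition("=")
                entries[-1][1][status.lower()] = action.lower()
        else:
            entries.append((token, dict(DEFAULT_ACTIONS)))
    return entries


def sources_consulted(line, outcomes):
    """Return the sources tried, given a hypothetical status for each source."""
    consulted = []
    for source, actions in parse_hosts_line(line):
        consulted.append(source)
        status = outcomes.get(source, "notfound")
        if actions.get(status, "continue") == "return":
            break
    return consulted


if __name__ == "__main__":
    # Hypothetical outcome: mDNS says "not found", conventional DNS would succeed.
    outcomes = {"files": "notfound", "mdns4_minimal": "notfound", "dns": "success"}
    print(sources_consulted(DEFAULT_HOSTS, outcomes))
    print(sources_consulted(SIMPLIFIED_HOSTS, outcomes))
```

With the default line, a NOTFOUND from mdns4_minimal ends the lookup before dns is ever asked; with the simplified line, dns is consulted directly after files.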

...

For a basic primer on how DNS works, I've always found djb's writings helpful and to-the-point, if technical -- have a look at http://cr.yp.to/djbdns/intro-dns.html and the rest of the djbdns documentation if you need a quick understanding of how the protocol works; you'll probably have to pull out `tcpdump` or another sniffer to get to the bottom of your situation (and if using tcpdump, remember the -n option, or the act of tcpdumping will itself generate a lot of DNS lookups!).

Things have gotten more complicated recently, as resolvers have added more protections against poisoning (though I think any major changes would've come after glibc 2.7, unless "ubuntu4" includes a backported security patch -- I only see up to ubuntu3 in the changelog!), and as they try to be more IPv6-ready or otherwise deal with the reality of a world with both A and AAAA records; even so, that should get you started. I need to eat my own dogfood and analyze my own dumps there, but I was hoping someone who could read some of the more arcane aspects 'by eye' would come along and spare me nights flipping back and forth with reference material. :}

...

In your case, if it was working "until recently," and nothing obvious is haywire in packet dumps, it would be interesting to try to figure out when you last upgraded the libc6 package, what the previous version had been, and if downgrading to it magically solves your problem. I don't have a lot of experience with apt forensics, but a record of that change might be hidden somewhere. That's what the kernel kids call "bisecting," a fancy word for flipping between versions until you narrow down which one started causing problems!
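The "flipping between versions" idea amounts to a binary search over an ordered list of versions. A minimal sketch, assuming the oldest entry is known good, the newest is known bad, and the breakage persists once introduced; the version labels and the `is_bad` check are placeholders, not real package versions:

```python
# Bisection sketch. Version labels and is_bad() are placeholders for illustration;
# in practice is_bad() would mean "install this version and see if lookups fail".

def first_bad(versions, is_bad):
    """Return the earliest bad version.

    Assumes versions are sorted oldest to newest, versions[0] is good,
    versions[-1] is bad, and every version after the first bad one is bad too.
    """
    good, bad = 0, len(versions) - 1
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(versions[middle]):
            bad = middle   # breakage is at or before the middle
        else:
            good = middle  # breakage was introduced after the middle
    return versions[bad]


if __name__ == "__main__":
    versions = [f"v{n}" for n in range(1, 9)]  # hypothetical v1 .. v8
    print(first_bad(versions, lambda v: int(v[1:]) >= 6))
```

Each test halves the remaining range, so eight candidate versions need only three installs.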

Jane Atkinson (irihapeti) wrote :

An update:

I noticed, looking at my firewall log, that some packets to the nameserver were getting blocked by the firewall. On the off-chance, I switched back to my original stand-alone ufw firewall to see if that would change things - apt-get update works now! Now I need to figure out why the iptables script should have been blocking apt-get but not dig.

Slightly red face, and sorry if I've wasted anyone's time. At least I've learned something, even if it isn't of much help to you people. (And I don't have to re-restore from backup, which is nice.)

Heni

floid (jkanowitz) wrote :

Heh! Let me know if you notice a difference between the apt-get / libc resolver queries and the dig ones; that may be useful!

I've just realized that, of course, wireshark should have a lazy-person's DNS dump interpreter, which should save me some time on the manual sanity-checking of individual bytes. Of course, for all I know, I may be forgetting an option to tcpdump itself that'd do the equivalent; I'm lucky to remember -X on a good day.
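For anyone checking bytes by hand in the meantime, the fixed 12-byte DNS header (RFC 1035, section 4.1.1) is easy to decode. The sketch below takes the DNS payload only; in a `tcpdump -XX` dump over Ethernet with an IPv4 header without options, that payload would start 42 bytes in (14 + 20 + 8). The function name and the sample bytes are made up for illustration and are not taken from the attached dumps.

```python
# Decode the fixed DNS header per RFC 1035 section 4.1.1.
# decode_dns_header() is a hypothetical helper; SAMPLE is invented data.
import struct

SAMPLE = bytes.fromhex("1a2b81800001000100000000")


def decode_dns_header(payload):
    """Return the header fields of a DNS message as a dict."""
    if len(payload) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    ident, flags, questions, answers, authority, additional = struct.unpack(
        "!6H", payload[:12]
    )
    return {
        "id": ident,
        "is_response": bool(flags & 0x8000),          # QR bit
        "opcode": (flags >> 11) & 0xF,
        "authoritative": bool(flags & 0x0400),        # AA bit
        "truncated": bool(flags & 0x0200),            # TC bit
        "recursion_desired": bool(flags & 0x0100),    # RD bit
        "recursion_available": bool(flags & 0x0080),  # RA bit
        "rcode": flags & 0x000F,                      # 0 = no error, 3 = NXDOMAIN
        "questions": questions,
        "answers": answers,
        "authority": authority,
        "additional": additional,
    }


if __name__ == "__main__":
    for field, value in decode_dns_header(SAMPLE).items():
        print(f"{field:20} {value}")
```

The `id` field is what pairs a response with its query, and `authoritative` is the AA bit that separates an answer from a zone's own nameserver from one relayed by a cache.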

floid (jkanowitz) wrote :

*bonk* - the sound of my head hitting the desk.

New theory:

A race would have been fun, but for my bug(s), pointing /etc/resolv.conf straight to ns.ubuntu.com satisfies apt-get.

The difference? ns.ubuntu.com is authoritative. So is apt-get's use of getaddrinfo(3) throwing out non-authoritative responses?
...If it is, is there a reason for this? Did I miss a memo where non-authoritative caches are now considered dangerous?

Of course, `dig` hand-crafts its own queries so it's only good for looking at DNS "reality," not resolver behavior.

floid (jkanowitz) wrote :

Hmm, I must not be thinking straight - blame the flu, even if I'm almost over it. :} If it were *simply* a matter of authority, this wouldn't explain the intermittent behavior.

For instance, now I'm running apt-get update (in gdb, but I need to go back and build a non-stripped version, doh) with resolv.conf pointed at my non-authoritative cache at 172.16.0.1 and, strangely enough, on this run everything has resolved happily and fine.

floid (jkanowitz) wrote :

Figures, I sit down to finally debug this and find that glibc 2.9-0ubuntu12 was released yesterday; it contains a patch which appears to... well, at least patch the issue. Lookups, even with apt, appear to be working with this version!

So is the fix appropriate? From the changelog:

glibc (2.9-0ubuntu12) jaunty; urgency=low

  * debian/patches/all/fedora-nss_dns-gethostbyname4-disable.diff: Patch
    from Fedora 2.9-3 to temporarily disable _nss_dns_gethostbyname4_r,
    which caused problems for systems with broken IPv6 connectivity
    (LP: #313218, https://bugzilla.redhat.com/show_bug.cgi?id=459756).

I am having trouble finding this particular .diff - where am I supposed to look? - but assume it is substantially similar to:

http://pasky.or.cz/~pasky/dev/glibc/glibc-2.10-dns-no-gethostbyname4.diff

found via Google.

...so apparently parallel lookups were codified as _nss_dns_gethostbyname4_r. Fair enough.

That version bears the following comment: "This should work in theory, but it turns out that many cheap DSL modems and similar devices have buggy DNS servers - if the AAAA query arrives too quickly after the A query, the server will generate only a single reply with the A query id but returning an error for the AAAA query; we get stuck waiting for the second reply."

This blames "cheap DSL modems and similar devices," but if I understand my own dumps (see: tcpdump_snippet - search race.txt for example) with the "broken" resolver, this was not the case for my configuration: separate queries were issued from the same source port but with different IDs, the nameserver properly responded to both, then for inexplicable reasons the resolver reissued the same queries (reusing the same IDs, but a new source port) a second time, before blindly trying a third set of requests with the search domain appended. Rather than "getting stuck waiting," it rapidly repeated itself.
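The pattern described here (queries answered, then the same IDs reissued from a new source port) can be flagged mechanically once a capture has been reduced to one record per packet. A rough sketch; the `Packet` record layout, the function name, and the sample ports and IDs are all invented for illustration and do not come from the attached dumps:

```python
# Flag queries that repeat an (id, name, type) already answered earlier in a capture.
# Packet, find_requeries() and SAMPLE_CAPTURE are hypothetical illustration only.
from collections import namedtuple

Packet = namedtuple("Packet", "client_port dns_id qname qtype is_response")

SAMPLE_CAPTURE = [
    Packet(40001, 0x1111, "archive.ubuntu.com", "A", False),
    Packet(40001, 0x2222, "archive.ubuntu.com", "AAAA", False),
    Packet(40001, 0x1111, "archive.ubuntu.com", "A", True),
    Packet(40001, 0x2222, "archive.ubuntu.com", "AAAA", True),
    Packet(40002, 0x1111, "archive.ubuntu.com", "A", False),     # same ID, new port
    Packet(40002, 0x2222, "archive.ubuntu.com", "AAAA", False),  # same ID, new port
]


def find_requeries(packets):
    """Return queries sent after a response to the same (id, name, type) was seen."""
    answered = set()
    repeats = []
    for packet in packets:
        key = (packet.dns_id, packet.qname, packet.qtype)
        if packet.is_response:
            answered.add(key)
        elif key in answered:
            repeats.append(packet)
    return repeats


if __name__ == "__main__":
    for packet in find_requeries(SAMPLE_CAPTURE):
        print(f"port {packet.client_port}: id {packet.dns_id:#06x} {packet.qtype} "
              f"{packet.qname} re-sent after an answer")
```

A non-empty result means the resolver asked again for something the server had already answered, which is the behavior at issue, as opposed to a server that never replied.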

Clearly something was screwy with the resolver algorithm, rather than the particular DNS server, unless I have overlooked some subtle noncompliance in the responses. If I get a chance, I hope to explore the "_nss_dns_gethostbyname4_r" behavior in greater depth and come to an absolute and reasoned conclusion. :}

To reiterate, though: In the meantime, this patch, reverting to the earlier strategies, does seem to "fix everything."

Colin Watson (cjwatson) wrote :

Indeed, and sorry I didn't notice this bug sooner. The patch in our source package is http://bazaar.launchpad.net/~ubuntu-toolchain/ubuntu-toolchain/glibc-2.5-package/annotate/head%3A/patches/all//fedora-nss_dns-gethostbyname4-disable.diff and is indeed the same as the one you found on pasky.or.cz. I'll mark this bug as a duplicate; I believe upstream are working on a better fix, but in the meantime we should get by fine with this workaround.

Colin Watson (cjwatson) wrote :

BTW, thanks for your very detailed report!

Changed in apt:
status: New → Invalid