dnsmasq fails when the ARP cache is full

Bug #1702726 reported by Christopher Berner on 2017-07-06
This bug affects 2 people
Affects: dnsmasq (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Test setup:
OS: Ubuntu 16.04
Hardware: D15_v2 VM on Azure

Steps to reproduce:
1) sudo apt-get install dnsmasq
2) sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=1
3) sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=1
4) sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=1
5) dig @127.0.0.1 google.com
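
Steps 2 to 4 overwrite the kernel's neighbour-table limits, so it may help to record the current values before running them. The following is a minimal Python sketch and not part of the original report. The helper names `read_thresholds` and `restore_commands` are made up for illustration, and the sketch assumes a Linux host that exposes the values under /proc/sys/net/ipv4/neigh/default/.

```python
from pathlib import Path

# Assumed location of the sysctl values on a Linux host.
NEIGH_DIR = Path("/proc/sys/net/ipv4/neigh/default")
THRESHOLDS = ("gc_thresh1", "gc_thresh2", "gc_thresh3")


def read_thresholds(base=NEIGH_DIR):
    """Return the current gc_thresh1..3 values as a dict of ints."""
    return {
        name: int((Path(base) / name).read_text().strip())
        for name in THRESHOLDS
    }


def restore_commands(values):
    """Build the sysctl commands that would put the saved values back."""
    return [
        f"sudo sysctl -w net.ipv4.neigh.default.{name}={values[name]}"
        for name in THRESHOLDS
    ]
```

Calling `restore_commands(read_thresholds())` before step 2 and keeping the resulting list gives the commands needed to undo the test afterwards.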

Result:
~$ dig @127.0.0.1 google.com
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument

; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 google.com
; (1 server found)
;; global options: +cmd
;; connection timed out; no servers could be reached

However, an external DNS server still works fine (dig @8.8.8.8 google.com, for example).

We discovered this because the default maximum ARP cache size is 1024 and we're running a large cluster with a lot of intra-cluster network traffic. Increasing the size of the ARP cache solves the problem, but it seems like dnsmasq should still work and just be slow, like other applications (curl, for example, just takes longer to connect).
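
To judge how close a host is to the limit mentioned above, one option is to compare the number of entries in /proc/net/arp with gc_thresh3. This is a rough Python sketch under stated assumptions: the function names are hypothetical, /proc/net/arp only lists IPv4 entries visible in the current network namespace, and the kernel's own accounting may differ, so the result is an indicator rather than an exact figure.

```python
from pathlib import Path


def count_arp_entries(arp_table_text):
    """Count entries in text formatted like /proc/net/arp (header line, then rows)."""
    lines = [line for line in arp_table_text.splitlines() if line.strip()]
    return max(len(lines) - 1, 0)  # the first non-empty line is the column header


def table_usage(entries, gc_thresh3):
    """Return the fraction of the hard limit (gc_thresh3) currently in use."""
    if gc_thresh3 <= 0:
        raise ValueError("gc_thresh3 must be positive")
    return entries / gc_thresh3


def current_usage(arp_path="/proc/net/arp",
                  thresh_path="/proc/sys/net/ipv4/neigh/default/gc_thresh3"):
    """Read the live values on a Linux host and return (entries, limit, fraction)."""
    entries = count_arp_entries(Path(arp_path).read_text())
    limit = int(Path(thresh_path).read_text().strip())
    return entries, limit, table_usage(entries, limit)
```

A fraction near 1.0 from `current_usage()` would suggest the host is close to the condition in this report. The workaround described above amounts to raising gc_thresh3 with the same `sysctl -w` form used in the reproduction steps.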

Hi Christopher,
thanks for the report and the nice steps to reproduce.
I can absolutely confirm your finding.

I checked up to the latest dnsmasq in the current development release (Artful).
That is 2.77 from 01-Jun-2017, so really not too old :-)

I appreciate the quality of this bug report and I'm sure it'll be helpful to others experiencing the same issue. But I checked, and neither Ubuntu nor Debian carries any patches on top of upstream dnsmasq.

Therefore this sounds like an upstream bug to me. The best route to getting it fixed in Ubuntu (and actually everywhere) in this case would be to file an upstream bug, if you're able to do that.

If you do end up filing an upstream bug, please link to it from here - that would be awesome. Thanks in advance!

Changed in dnsmasq (Ubuntu):
status: New → Confirmed

Testing this, the results are not quite as clear-cut as the example. I
don't always see the same errors.

Also, I don't understand why the send() calls in dig, which are sending
UDP packets over the loopback interface, should return an "Invalid
argument" error. ARP is not needed over loopback, surely?

Looking at an strace of dnsmasq, what I see is that either the query
never arrives at dnsmasq, or it gets answered correctly but the answer
never makes it back to dig: the UDP packets are being dropped in the
kernel. (In the latter case, the send() of the reply gets the same
"Invalid argument" error that dig is seeing.)

The lesson here is that if the ARP cache overflows, UDP (even over lo)
drops packets. There's really not much dnsmasq can do about that. I
guess the only answer is "don't let your ARP cache overflow".

(or possibly, work on getting the kernel to behave better under these
circumstances)

TL;DR not a dnsmasq bug.

Cheers,

Simon.

