Comment 2 for bug 1702726

Revision history for this message
Simon Kelley (simon-thekelleys) wrote : Re: [Bug 1702726] [NEW] dnsmasq fails when the ARP cache is full

Testing this, the results are not quite as clear-cut as the example. I
don't always see the same errors.

Also, I don't understand why the send() calls in dig, which are sending
UDP packets over the loopback interface, should return the invalid
argument. ARP is not needed over loopback, surely?

Looking at an strace of dnsmasq, what I see is that either the query
never arrives at dnsmasq, or it gets answered correctly but the answers
never makes it back to dig: the UDP packets are being dropped in the
kernel. (In the later case, the send() of the reply gets the same
invalid argument error that dig is seeing)

The lesson here is that if the arp-cache overflows, UDP, (even over lo)
drops packets. There's really not much dnsmasq can do about that. I
guess the only answer is "don't let your arp-cache overflow".

(or possibly, work on getting the kernel to behave better under these
circumstances)

TL;DR not a dnsmasq bug.

Cheers,

Simon.

On 06/07/17 18:17, Christopher Berner wrote:
> Public bug reported:
>
> Test setup:
> OS: Ubuntu 16.04
> Hardware: D15_v2 VM on Azure
>
> Steps to reproduce:
> 1) sudo apt-get install dnsmasq
> 2) sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=1
> 3) sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=1
> 4) sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=1
> 5) dig @127.0.0.1 google.com
>
> Result:
> ~$ dig @127.0.0.1 google.com
> ../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
> ../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
> ../../../../lib/isc/unix/socket.c:2104: internal_send: 127.0.0.1#53: Invalid argument
>
> ; <<>> DiG 9.10.3-P4-Ubuntu <<>> @127.0.0.1 google.com
> ; (1 server found)
> ;; global options: +cmd
> ;; connection timed out; no servers could be reached
>
> However, an external DNS server still works fine (dig @8.8.8.8
> google.com, for example).
>
> We discovered this as the default max ARP cache size is 1024, and we're
> running a large cluster with a lot of intra-cluster network traffic.
> Increasing the size of the ARP cache solves this problem, but it seems
> like dnsmasq should still work and just be slow, like other applications
> (curl for example just takes longer to connect)
>
> ** Affects: dnsmasq (Ubuntu)
> Importance: Undecided
> Status: New
>