Bug #1465000 “MAAS TFTP fails when MAAS interface is not the def...” : Bugs : MAAS


Mike Pontillo (mpontillo) wrote on 2015-06-15:

#1


Short answer:

As you saw from the "File exists" error, having addresses in the same subnet on two different interfaces is an area of Linux networking that ifupdown does not support well. Try setting the following sysctl and then re-running the test:

sysctl -w net.ipv4.conf.all.arp_filter=1

Assuming I have reproduced the same issue you are seeing, I don't think we should attempt to fix this in MAAS. Read on for the details.

Long answer:

I configured an extra interface on a test virtual machine in order to replicate the described setup (more or less; I don't think there is a need for the first interface to be a bond).

My setup in /etc/network/interfaces:

# The primary network interface
auto eth0
iface eth0 inet static
  address 172.16.100.11/24
  # gateway 172.16.100.1
  dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4

auto eth1
iface eth1 inet static
  address 172.16.100.12/24
  gateway 172.16.100.1
  dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4

Previously, my MAAS setup was working fine with eth0 being the only active interface talking to the cluster's hosts.

As soon as I moved the "gateway" statement to eth1, I started seeing strange behavior in MAAS. I didn't see a PXE/TFTP failure, but I *did* notice that MAAS was timing out while talking to the BMC, and the node remained in the "Commissioning" state. (The BMC is off-subnet, so it makes sense that the cluster needs to reach it through the default gateway in order to power on the node.)

Firing up Wireshark, I can see that in this situation, the cluster will ARP for its gateway via the MAC on eth0, and the gateway dutifully responds. However, nothing happens after that. My current theory is that the ARP response doesn't match the reverse-path filter and is silently discarded.

With that being the theory, there are two potential sysctl values that may affect this behavior[1]:

$ sudo sysctl -w net.ipv4.conf.all.arp_filter=1
$ sudo sysctl -w net.ipv4.conf.all.rp_filter=0

(On Trusty, the default value for arp_filter is 0, and the default value for rp_filter is 1.)
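
The kernel documentation cited in [1] also notes that the conf/all value is not the whole story: for rp_filter the kernel uses the maximum of the conf/all and per-interface values, and arp_filter is enabled on an interface if either the conf/all or the per-interface value is set. A minimal sketch for checking what is actually in effect on one interface follows. The function name effective_filters and the PROC_ROOT override are made up for this example and are not part of any tool.

```shell
#!/bin/sh
# Print the effective rp_filter and arp_filter values for one interface.
# PROC_ROOT is a hypothetical override so the function can be pointed at a
# fake tree; by default it reads the live values under /proc/sys.
effective_filters() {
    iface=$1
    base="${PROC_ROOT:-/proc/sys}/net/ipv4/conf"

    rp_all=$(cat "$base/all/rp_filter")
    rp_if=$(cat "$base/$iface/rp_filter")
    arp_all=$(cat "$base/all/arp_filter")
    arp_if=$(cat "$base/$iface/arp_filter")

    # rp_filter: the larger of the "all" and per-interface values applies.
    if [ "$rp_all" -gt "$rp_if" ]; then rp=$rp_all; else rp=$rp_if; fi

    # arp_filter: enabled if either the "all" or per-interface value is non-zero.
    if [ "$arp_all" -ne 0 ] || [ "$arp_if" -ne 0 ]; then arp=1; else arp=0; fi

    echo "$iface rp_filter=$rp arp_filter=$arp"
}
```

For example, "effective_filters eth0" and "effective_filters eth1" on the setup above would show whether a change to conf/all actually took effect on each interface.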

For me, applying *either* setting was a workaround that allowed nodes to complete commissioning.
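
Note that "sysctl -w" only changes the running kernel, so the workaround is lost on reboot. One way to keep it, sketched here under assumptions, is to write the setting to a drop-in file under /etc/sysctl.d. The function name persist_sysctl, the file name 60-same-subnet.conf, and the SYSCTL_DIR override are all invented for this example.

```shell
#!/bin/sh
# Append one sysctl setting to a drop-in file so it is applied at boot.
# The file name and the SYSCTL_DIR override are assumptions for this sketch;
# writing to the default /etc/sysctl.d location requires root.
persist_sysctl() {
    key=$1
    value=$2
    dir=${SYSCTL_DIR:-/etc/sysctl.d}
    printf '%s = %s\n' "$key" "$value" >> "$dir/60-same-subnet.conf"
}
```

Run as root, "persist_sysctl net.ipv4.conf.all.arp_filter 1" followed by "sysctl -p /etc/sysctl.d/60-same-subnet.conf" should load the value immediately and keep it across reboots.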

Another similar situation I can think of is having a wired NIC plugged in while you are also using wireless on the same subnet. It looks like NetworkManager solves this, at least in part, by adding routes with a 'metric', so that when the wired connection is plugged in, no packets are sourced from the wireless interface. From /sbin/ip route:

172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.166 metric 1
172.16.0.0/24 dev wlan1 proto kernel scope link src 172.16.0.189 metric 9
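
To check quickly whether a machine is in this same-subnet-on-two-interfaces situation, the route table can be scanned for directly connected prefixes that appear on more than one device. This is only a rough sketch: the function name duplicate_subnets is made up, and it assumes route lines in the "PREFIX dev IFACE ... scope link ..." layout shown above.

```shell
#!/bin/sh
# Read "ip route" output on stdin and print every link-scope prefix that is
# reachable through more than one device, followed by those devices.
duplicate_subnets() {
    awk '
        $2 == "dev" && /scope link/ {
            devs[$1] = devs[$1] " " $3   # remember each device per prefix
            count[$1]++
        }
        END {
            for (prefix in count)
                if (count[prefix] > 1)
                    print prefix ":" devs[prefix]
        }
    '
}
```

Typical use would be "ip -4 route show | duplicate_subnets"; no output means no prefix is shared between interfaces.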

[1] from https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt -

rp_filter - INTEGER
0 - No source validation.
1 - Strict mode as defined in RFC3704 Strict Reverse Path
     Each incoming packet is tested against the FIB and if the interface
     is not the best reverse path the packet check will fail.
     By default failed packets are discarded.
2 - Loose mode as defined in RFC3704 Loose Reverse Path
     Each incoming packet's source address is also tested against the FIB
     and if the source address is not re...