MAAS TFTP fails when MAAS interface is not the default route device
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MAAS | Invalid | Wishlist | Unassigned | | |
Bug Description
I have a reproducible failure with MAAS based on the exact default route to the cluster LAN.
In my case I have two interfaces on the cluster LAN (where machines PXE boot) for the cluster controller:
bond0 is 192.168.9.2 and is dual-10G
p2p1 is 192.168.9.4 and is 1G
The bond0 interface is declared to the cluster controller and is described as fully managed (DHCP and DNS). The p2p1 interface is not declared at all in the current config.
In /etc/network/
Only one of them can have a "gateway" entry to provide a default route, otherwise ifup will fail on the second one trying to configure the default route, saying the file already exists.
MAAS works fine when the default route device is bond0 but PXE fails (thanks to TFTP hanging) when p2p1 is the default gateway device.
Interestingly, debugging this is tricky because normal Linux clients can TFTP just fine from MAAS with either configuration.
Changed in maas: | |
milestone: | next → 1.9.0 |
Short answer:
As you saw from the "File exists" error, this is an area of Linux networking that ifupdown does not support well (that is, addresses in the same subnet on two different interfaces). Try setting the following sysctl and then re-running the test:
sysctl -w net.ipv4.conf.all.arp_filter=1
Unless I haven't reproduced the same issue you have, I don't think we should attempt to fix this in MAAS. Read on for the details.
Long answer:
I configured an extra interface on a test virtual machine in order to replicate the described setup (more or less; I don't think there is a need for the first interface to be a bond).
My setup in /etc/network/interfaces:
# The primary network interface
auto eth0
iface eth0 inet static
address 172.16.100.11/24
# gateway 172.16.100.1
dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4
auto eth1
iface eth1 inet static
address 172.16.100.12/24
gateway 172.16.100.1
dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4
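To check a configuration like the one above for more than one active "gateway" line, something along these lines might help. It is a minimal sketch, not part of MAAS or ifupdown: the function name `gateways_by_iface` is made up for illustration, and the parser only understands simple `iface` stanzas with whole-line `#` comments.

```python
def gateways_by_iface(text):
    """Map each iface stanza to its active (uncommented) gateway, or None."""
    result = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and whole-line comments such as "# gateway ...".
        if not line or line.startswith("#"):
            continue
        words = line.split()
        if words[0] == "iface" and len(words) >= 2:
            current = words[1]
            result.setdefault(current, None)
        elif words[0] == "gateway" and current is not None and len(words) >= 2:
            result[current] = words[1]
    return result


# The stanzas from the comment above, reproduced as test input.
SAMPLE = """\
# The primary network interface
auto eth0
iface eth0 inet static
address 172.16.100.11/24
# gateway 172.16.100.1
dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4
auto eth1
iface eth1 inet static
address 172.16.100.12/24
gateway 172.16.100.1
dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4
"""

gateways = gateways_by_iface(SAMPLE)
with_gateway = sorted(name for name, addr in gateways.items() if addr)
if len(with_gateway) > 1:
    print("more than one default gateway configured:", with_gateway)
else:
    print("default gateway device(s):", with_gateway)
```

With the sample input, only eth1 carries an active gateway, which matches the failing configuration described in this comment.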
Previously, my MAAS setup was working fine with eth0 being the only active interface talking to the cluster's hosts.
As soon as I moved the "gateway" statement to eth1, I started seeing strange behavior in MAAS. I didn't see a PXE/TFTP failure, but I *did* notice that MAAS was timing out while talking to the BMC, and the node remained in "Commissioning" state. (the BMC is off-subnet, so it would make sense that it needs to talk through the default gateway in order to power on the node.)
Firing up Wireshark, I can see that in this situation, the cluster will ARP for its gateway via the MAC on eth0, and the gateway dutifully responds. However, nothing happens after that. My current theory is that the ARP response doesn't match the reverse-path filter and is silently discarded.
With that being the theory, there are two potential sysctl values that may affect this behavior[1]:
$ sudo sysctl -w net.ipv4.conf.all.arp_filter=1
$ sudo sysctl -w net.ipv4.conf.all.rp_filter=0
(On Trusty, the default value for arp_filter is 0, and the default value for rp_filter is 1.)
For me, applying *either* setting was a workaround that allowed nodes to complete commissioning.
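Before changing anything, it may be useful to see what these two values currently are on each interface. The following is a rough sketch that reads them from /proc/sys, where `net.ipv4.conf.<iface>.arp_filter` maps to `/proc/sys/net/ipv4/conf/<iface>/arp_filter`. The function name `read_filters` and its `base` parameter are assumptions made for this example.

```python
from pathlib import Path


def read_filters(base="/proc/sys/net/ipv4/conf"):
    """Return {iface: {"arp_filter": int or None, "rp_filter": int or None}}."""
    out = {}
    root = Path(base)
    if not root.is_dir():
        # Not a Linux host, or /proc is unavailable: report nothing.
        return out
    for iface_dir in sorted(root.iterdir()):
        values = {}
        for key in ("arp_filter", "rp_filter"):
            try:
                values[key] = int((iface_dir / key).read_text().strip())
            except (OSError, ValueError):
                # Unreadable or unexpected content: record as unknown.
                values[key] = None
        out[iface_dir.name] = values
    return out


if __name__ == "__main__":
    for name, values in read_filters().items():
        print(name, values)
```

Reading these files does not require root; only changing them (as with `sysctl -w`) does.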
A similar situation arises if you have a wired NIC plugged in while you are also using wireless on the same subnet. It looks like NetworkManager solves this, at least in part, by adding routes with a 'metric', so that when the wired connection is plugged in, no packets are sourced from the wireless interface. From /sbin/ip route:
172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.166 metric 1
172.16.0.0/24 dev wlan1 proto kernel scope link src 172.16.0.189 metric 9
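To illustrate how the metric decides between the two routes above, here is a small sketch that groups `ip route` lines by prefix and sorts them so the lowest metric (the one the kernel prefers) comes first. The function name `routes_by_prefix` is hypothetical, and the parser assumes every word after the prefix is part of a key/value pair, which holds for the two lines shown but not for routes carrying bare flags such as `onlink`.

```python
from collections import defaultdict


def routes_by_prefix(text):
    """Group routes by destination prefix, sorted by (metric, device)."""
    groups = defaultdict(list)
    for line in text.splitlines():
        words = line.split()
        if not words:
            continue
        prefix = words[0]
        # Remaining words are treated as pairs: "dev eth0", "metric 1", ...
        fields = dict(zip(words[1::2], words[2::2]))
        if "dev" not in fields:
            continue
        # A route without an explicit metric has metric 0.
        metric = int(fields.get("metric", 0))
        groups[prefix].append((metric, fields["dev"]))
    return {prefix: sorted(entries) for prefix, entries in groups.items()}


# The two routes quoted above.
SAMPLE_ROUTES = """\
172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.166 metric 1
172.16.0.0/24 dev wlan1 proto kernel scope link src 172.16.0.189 metric 9
"""

for prefix, entries in routes_by_prefix(SAMPLE_ROUTES).items():
    if len(entries) > 1:
        preferred = entries[0][1]
        print(prefix, "is reachable via", [dev for _, dev in entries],
              "and the preferred device is", preferred)
```

For the sample, the wired interface eth0 (metric 1) sorts ahead of wlan1 (metric 9).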
[1] from https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt -
rp_filter - INTEGER
0 - No source validation.
1 - Strict mode as defined in RFC3704 Strict Reverse Path
Each incoming packet is tested against the FIB and if the interface
is not the best reverse path the packet check will fail.
By default failed packets are discarded.
2 - Loose mode as defined in RFC3704 Loose Reverse Path
Each incoming packet's source address is also tested against the FIB
and if the source address is not re...