MAAS TFTP fails when MAAS interface is not the default route device

Bug #1465000 reported by Mark Shuttleworth
This bug affects 3 people
Affects: MAAS
Status: Invalid
Importance: Wishlist
Assigned to: Unassigned
Milestone: none

Bug Description

I have a reproducible failure with MAAS based on the exact default route to the cluster LAN.

In my case I have two interfaces on the cluster LAN (where machines PXE boot) for the cluster controller:

  bond0 is 192.168.9.2 and is dual-10G
  p2p1 is 192.168.9.4 and is 1G

The bond0 interface is declared to the cluster controller, and is described as fully managed (DHCP and DNS). The p2p1 interface is not declared at all in the current config.

In /etc/network/interfaces both p2p1 and bond0 are declared and assigned static IP addresses.

Only one of them can have a "gateway" entry to provide a default route, otherwise ifup will fail on the second one trying to configure the default route, saying the file already exists.

MAAS works fine when the default route device is bond0 but PXE fails (thanks to TFTP hanging) when p2p1 is the default gateway device.

Interestingly, debugging this is tricky because normal Linux clients can TFTP just fine from MAAS with either configuration.
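
The layout described above (two interfaces holding addresses in the same subnet) can be spotted mechanically. The following is a minimal Python sketch, not MAAS code: `find_shared_subnets` is a hypothetical helper, and the /24 prefix length is an assumption, since the report gives only the addresses.

```python
import ipaddress
from collections import defaultdict


def find_shared_subnets(addresses):
    """Return {network: [interface names]} for subnets present on more than one interface.

    `addresses` maps an interface name to one address in CIDR notation.
    """
    by_network = defaultdict(list)
    for ifname, cidr in addresses.items():
        network = ipaddress.ip_interface(cidr).network
        by_network[network].append(ifname)
    return {net: sorted(names) for net, names in by_network.items() if len(names) > 1}


# Addresses from the report; the /24 prefix length is assumed, not stated.
reported = {"bond0": "192.168.9.2/24", "p2p1": "192.168.9.4/24"}
for net, names in find_shared_subnets(reported).items():
    print(f"{net} is configured on: {', '.join(names)}")
```

A check along these lines only flags the ambiguous layout; it says nothing about which interface the kernel will actually use for a given packet.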

Mike Pontillo (mpontillo) wrote :

Short answer:

As you saw from the "File exists" error, this is an area of Linux networking that ifupdown does not support well (that is, addresses in the same subnet on two different interfaces). Try setting the following sysctl and then re-running the test:

sysctl -w net.ipv4.conf.all.arp_filter=1

Unless I haven't reproduced the same issue you have, I don't think we should attempt to fix this in MAAS. Read on for the details.

Long answer:

I configured an extra interface on a test virtual machine in order to replicate the described setup (more or less; I don't think there is a need for the first interface to be a bond).

My setup in /etc/network/interfaces:

# The primary network interface
auto eth0
iface eth0 inet static
  address 172.16.100.11/24
  # gateway 172.16.100.1
  dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4

auto eth1
iface eth1 inet static
  address 172.16.100.12/24
  gateway 172.16.100.1
  dns-nameserver 127.0.0.1 8.8.8.8 8.8.4.4

Previously, my MAAS setup was working fine with eth0 being the only active interface talking to the cluster's hosts.

As soon as I moved the "gateway" statement to eth1, I started seeing strange behavior in MAAS. I didn't see a PXE/TFTP failure, but I *did* notice that MAAS was timing out while talking to the BMC, and the node remained in "Commissioning" state. (the BMC is off-subnet, so it would make sense that it needs to talk through the default gateway in order to power on the node.)

Firing up Wireshark, I can see that in this situation, the cluster will ARP for its gateway via the MAC on eth0, and the gateway dutifully responds. However, nothing happens after that. My current theory is that the ARP response doesn't match the reverse-path filter and is silently discarded.

With that being the theory, there are two potential sysctl values that may affect this behavior[1]:

$ sudo sysctl -w net.ipv4.conf.all.arp_filter=1
$ sudo sysctl -w net.ipv4.conf.all.rp_filter=0

(On Trusty, the default value for arp_filter is 0, and the default value for rp_filter is 1.)

For me, applying *either* setting was a workaround that allowed nodes to complete commissioning.
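
Before and after changing these values it helps to confirm what is actually in effect. The sketch below reads the values from /proc/sys using only the Python standard library. The helper names are hypothetical. It relies on the ip-sysctl.txt statements that the maximum of conf/all and conf/{interface} is used for rp_filter, and that arp_filter is enabled if either of the two is set, so the per-interface value matters as well as the "all" value. The interface name "eth0" is taken from the test setup above and may not exist on another machine.

```python
from pathlib import Path


def read_ipv4_conf(name, scope="all", proc_root="/proc/sys"):
    """Read net.ipv4.conf.<scope>.<name> as an int; return None if it cannot be read."""
    path = Path(proc_root) / "net" / "ipv4" / "conf" / scope / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        return None


def effective_ipv4_conf(name, ifname, proc_root="/proc/sys"):
    """Combine conf/all and conf/<ifname> by taking the maximum of the readable values.

    For a 0/1 flag such as arp_filter, the maximum is the same as "either is set".
    """
    values = [read_ipv4_conf(name, scope, proc_root) for scope in ("all", ifname)]
    known = [v for v in values if v is not None]
    return max(known) if known else None


# "eth0" is only an example name; a missing interface simply contributes no value.
for sysctl in ("arp_filter", "rp_filter"):
    print(sysctl, "all =", read_ipv4_conf(sysctl),
          "effective on eth0 =", effective_ipv4_conf(sysctl, "eth0"))
```

The `proc_root` parameter exists only so the functions can be pointed at a copy of the tree for testing.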

Another similar situation I can think of is if you have a wired NIC plugged in at the same time as you are using wireless (on the same subnet). It looks like NetworkManager solves this, at least in part, by adding routes with a 'metric', so that when the wired connection is plugged in, no packets are sourced from the wireless interface. From /sbin/ip route:

172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.166 metric 1
172.16.0.0/24 dev wlan1 proto kernel scope link src 172.16.0.189 metric 9
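
To make the effect of the metric concrete, here is a small Python sketch that parses lines like the two above and picks the route with the lowest metric for a destination. `parse_route` and `preferred_route` are hypothetical helpers that handle only a subset of the `ip route` output format, and treating a missing metric as 0 is an assumption of this sketch.

```python
# Keywords in `ip route` output that are followed by a value (a subset, assumed for this sketch).
VALUE_KEYS = {"via", "dev", "proto", "scope", "src", "metric"}


def parse_route(line):
    """Parse one line of `ip route` output into a dict of the fields listed in VALUE_KEYS."""
    tokens = line.split()
    route = {"dst": tokens[0]}
    i = 1
    while i < len(tokens):
        if tokens[i] in VALUE_KEYS and i + 1 < len(tokens):
            route[tokens[i]] = tokens[i + 1]
            i += 2
        else:
            i += 1  # skip bare flags and unknown tokens
    route["metric"] = int(route.get("metric", 0))  # assumption: no metric shown means 0
    return route


def preferred_route(lines, dst):
    """Return the lowest-metric route to `dst`, or None if there is no such route."""
    candidates = [parse_route(line) for line in lines if line.strip()]
    candidates = [r for r in candidates if r["dst"] == dst]
    return min(candidates, key=lambda r: r["metric"]) if candidates else None


sample = [
    "172.16.0.0/24 dev eth0 proto kernel scope link src 172.16.0.166 metric 1",
    "172.16.0.0/24 dev wlan1 proto kernel scope link src 172.16.0.189 metric 9",
]
best = preferred_route(sample, "172.16.0.0/24")
print(best["dev"], best["src"])  # eth0 172.16.0.166
```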

[1] from https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt -

rp_filter - INTEGER
 0 - No source validation.
 1 - Strict mode as defined in RFC3704 Strict Reverse Path
     Each incoming packet is tested against the FIB and if the interface
     is not the best reverse path the packet check will fail.
     By default failed packets are discarded.
 2 - Loose mode as defined in RFC3704 Loose Reverse Path
     Each incoming packet's source address is also tested against the FIB
     and if the source address is not re...

Changed in maas:
status: New → Triaged
Mike Pontillo (mpontillo) wrote :

I spoke with Jay Vosburgh about this issue, as we've started touching on aspects of networking deep inside the kernel. When testing this, there are a few additional points to consider:

(1) "sudo ip neigh flush all" (or the equivalent) must be executed whenever a change to the sysctl values or /etc/network/interfaces occurs. This must be done on any nodes under test, the MAAS server, and the gateway. (otherwise, the systems under test may behave inconsistently until ARP times out.)[1]

(2) The reason this works in a similar situation for NetworkManager (wired and wireless on the same subnet) is because a metric (preference) value is added to the per-interface route to the connected subnet. This eliminates any ambiguity in the route table, and also causes ARP to only respond on the interface with the lowest metric value, thereby forcing all traffic through a single interface.[2]

(3) Another potentially useful sysctl in this situation is "arp_ignore".[3] If the "arp_ignore" sysctl is set to 1, this will cause ARP to only respond on a particular interface if the address is actually configured there. (it seems to me that this sysctl and arp_filter would achieve the same result; one does so by implicit filtering, the other is explicit.)

[1] Normally when a host has multiple network interfaces on a single subnet, the reason is for redundancy (failover). In a failover situation, a userland process (or, in the case of a bond interface, the kernel) is often responsible for managing the failover by moving an IP address/MAC to the new desired interface. (if the MAC is moved, and/or the interface is brought down and back up again, the kernel could also send the gratuitous ARP if the "arp_notify" sysctl is set.) After the gratuitous ARPs are sent, the neighbor tables of the other hosts on the subnet will be updated. (Obviously, ifupdown doesn't do this for us.)

[2] note that one thing that NetworkManager does NOT do is send gratuitous ARPs when route changes happen. So you may have to wait until remote hosts expire their neighbor cache and re-ARP for in-flight TCP sessions, etc, to recover. (since rp_filter=1 by default)

[3] arp_ignore - INTEGER
 Define different modes for sending replies in response to
 received ARP requests that resolve local target IP addresses:
 0 - (default): reply for any local target IP address, configured
 on any interface
 1 - reply only if the target IP address is local address
 configured on the incoming interface
 2 - reply only if the target IP address is local address
 configured on the incoming interface and both with the
 sender's IP address are part from same subnet on this interface
 3 - do not reply for local addresses configured with scope host,
 only resolutions for global and link addresses are replied
 4-7 - reserved
 8 - do not reply for all local addresses

 The max value from conf/{all,interface}/arp_ignore is used
 when ARP request is received on the {interface}
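
As an illustration of point (3), the following Python sketch models the reply decision for arp_ignore modes 0, 1, 2 and 8 as described in the quoted text. It is a simplified model, not kernel code: `should_reply` is a hypothetical function, mode 3 (address scope) is not modelled, and mode 2 is approximated as "the target is configured on the incoming interface and the sender falls inside that address's subnet". The addresses are the ones from my test setup; the sender 172.16.100.1 is the gateway.

```python
import ipaddress


def should_reply(arp_ignore, target_ip, sender_ip, incoming_if, local):
    """Simplified model of whether an ARP request for target_ip gets a reply.

    `local` maps interface name -> list of configured addresses in CIDR notation.
    """
    target = ipaddress.ip_address(target_ip)
    sender = ipaddress.ip_address(sender_ip)
    on_incoming = [ipaddress.ip_interface(c) for c in local.get(incoming_if, [])]
    everywhere = [ipaddress.ip_interface(c) for cidrs in local.values() for c in cidrs]

    if arp_ignore == 0:  # reply for any local address, whichever interface holds it
        return any(i.ip == target for i in everywhere)
    if arp_ignore == 1:  # reply only if the address is on the incoming interface
        return any(i.ip == target for i in on_incoming)
    if arp_ignore == 2:  # as mode 1, and the sender must be in the same subnet
        return any(i.ip == target and sender in i.network for i in on_incoming)
    if arp_ignore == 8:  # never reply for local addresses
        return False
    raise ValueError(f"arp_ignore mode {arp_ignore} is not modelled in this sketch")


local = {"eth0": ["172.16.100.11/24"], "eth1": ["172.16.100.12/24"]}
# An ARP request for eth1's address that arrives on eth0:
for mode in (0, 1):
    print(mode, should_reply(mode, "172.16.100.12", "172.16.100.1", "eth0", local))
```

With the default mode 0 the request is answered even though it arrived on the "wrong" interface, which is the ambiguity the sysctl removes.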

Mark Shuttleworth (sabdfl) wrote : Re: [Bug 1465000] Re: MAAS TFTP fails when MAAS interface is not the default route device

Mike, that's FANTASTIC background and debugging, thank you! Appreciate
the insights. Am travelling but will test this and make a recommendation
once I'm home and have easy-manual-access to the gMAAS.

It looks like we should try to detect potential problem situations and
make a recommendation through the UI because quite a few folks have been
bitten by this, and very few indeed would be as knowledgeable about the
guts of Linux networking policy configuration. I think we are being
distinctly more conservative than other common developer OS's - as we
are with icmp_redirect, and that makes for a lot of head scratching.

Mark

Mike Pontillo (mpontillo) wrote :

Sure, let's sync with Andres about how to get this in given other priorities. Meanwhile, I look forward to the results of any additional testing!

Somewhat unrelated to this issue, but I'm curious what issues are being seen with ICMP redirects. I noticed that we have accept_redirects=0 and secure_redirects=1 as defaults on Trusty. It's worth noting that in IPv6, redirects work a little differently:

IPv6:
https://tools.ietf.org/html/rfc4861#section-4.5

IPv4:
https://tools.ietf.org/html/rfc792#page-12

IPv6 redirects have the interesting feature that if you have multiple L3 subnets on a L2 network, the router can tell you "don't go through me to reach this host; it's on the same L2 as you".

Changed in maas:
importance: Undecided → Wishlist
milestone: none → next
Mark Shuttleworth (sabdfl) wrote :

Right. Currently we are conservative about redirects, and that is
probably because an adversary might try to reroute traffic through
themselves in inappropriate ways. However, it mostly results in cases
where networking fails in silent and hard to debug ways on Ubuntu but
works just fine on most but not all other devices. We wasted a lot of
time with the SA team in Malta working on Orange Boxes because of this,
for example. Most people just don't know how that works so can't figure
out why something that looks like it should work, fails.

Mark Shuttleworth (sabdfl) wrote :

So far I have managed to confirm that adding the ifmetric package and a "metric 10" to the p2p1 interface (the second, alternate interface) appears to solve the problem. I haven't yet dove into sysctl option permutations.

Mike Pontillo (mpontillo) wrote :

Sounds good. Keep in mind that the interface routes are only part of the story. The gateway routes may need corresponding metric values as well. A quick test shows that this should be possible; no "File exists" error! For example:

# /sbin/ip route add default via 172.16.100.1 dev eth0 proto static metric 1
# /sbin/ip route add default via 172.16.100.1 dev wlan0 proto static metric 9

# /sbin/ip route | grep ^default
default via 172.16.100.1 dev eth0 proto static metric 1
default via 172.16.100.1 dev wlan0 proto static metric 9
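
A quick way to check a host for this is to confirm that every default route carries its own metric. The Python sketch below does that for text in the format shown above. Both function names are hypothetical, and treating a default route with no metric shown as metric 0 is an assumption of this sketch.

```python
import re
from collections import defaultdict


def default_route_metrics(ip_route_output):
    """Map metric -> list of devices for every default route in `ip route` output."""
    by_metric = defaultdict(list)
    for line in ip_route_output.splitlines():
        if not line.startswith("default "):
            continue
        dev = re.search(r"\bdev (\S+)", line)
        metric = re.search(r"\bmetric (\d+)", line)
        # Assumption: a default route printed without a metric has metric 0.
        by_metric[int(metric.group(1)) if metric else 0].append(dev.group(1) if dev else None)
    return dict(by_metric)


def ambiguous_metrics(by_metric):
    """Return the metrics that are shared by more than one default route."""
    return sorted(m for m, devs in by_metric.items() if len(devs) > 1)


output = """\
default via 172.16.100.1 dev eth0 proto static metric 1
default via 172.16.100.1 dev wlan0 proto static metric 9
"""
metrics = default_route_metrics(output)
print(metrics)                     # {1: ['eth0'], 9: ['wlan0']}
print(ambiguous_metrics(metrics))  # []
```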

Changed in maas:
milestone: next → 1.9.0
Dustin Kirkland  (kirkland) wrote :

Interestingly, I think I've filed a related bug, whose root cause might actually be the same: https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1664748

See @mpontillo's analysis in https://bugs.launchpad.net/ubuntu/+source/maas/+bug/1664748/comments/8

Mark Shuttleworth (sabdfl) wrote :

Right. I think this is one area where our security approach hurts us
more than it helps.

Jerzy Husakowski (jhusakowski) wrote :

Closing this issue as it was caused by a misconfiguration of the network interfaces on the provisioned machines. Using ifmetric to prioritise traffic on one of the interfaces solves the PXE booting issue.

Changed in maas:
milestone: 1.9.0 → none
status: Triaged → Invalid