Comment 2 for bug 1790941

Patrick Bonnell (pbonnell) wrote :

This isn't actually a bug; it is how Linux behaves.
To see why, consider the following excerpts from the kernel's ip-sysctl documentation:

rp_filter - INTEGER
 0 - No source validation.
 1 - Strict mode as defined in RFC3704 Strict Reverse Path
     Each incoming packet is tested against the FIB and if the interface
     is not the best reverse path the packet check will fail.
     By default failed packets are discarded.
 2 - Loose mode as defined in RFC3704 Loose Reverse Path
     Each incoming packet's source address is also tested against the FIB
     and if the source address is not reachable via any interface
     the packet check will fail.

 Current recommended practice in RFC3704 is to enable strict mode
 to prevent IP spoofing from DDos attacks. If using asymmetric routing
 or other complicated routing, then loose mode is recommended.

 The max value from conf/{all,interface}/rp_filter is used
 when doing source validation on the {interface}.

 Default value is 0. Note that some distributions enable it
 in startup scripts.

arp_ignore - INTEGER
 Define different modes for sending replies in response to
 received ARP requests that resolve local target IP addresses:
 0 - (default): reply for any local target IP address, configured
 on any interface
 1 - reply only if the target IP address is local address
 configured on the incoming interface
 2 - reply only if the target IP address is local address
 configured on the incoming interface and both with the
 sender's IP address are part from same subnet on this interface
 3 - do not reply for local addresses configured with scope host,
 only resolutions for global and link addresses are replied
 4-7 - reserved
 8 - do not reply for all local addresses

 The max value from conf/{all,interface}/arp_ignore is used
 when ARP request is received on the {interface}
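
To make the "max value" rule above concrete, here is a minimal sketch. The helper name effective_value is my own (hypothetical, not a kernel or sysctl tool); it just compares the two numbers the way the documentation describes.

```shell
# Hypothetical helper: print the value that applies on one interface,
# i.e. the larger of the conf/all and conf/<interface> settings.
effective_value() {
    all=$1
    iface=$2
    if [ "$all" -gt "$iface" ]; then
        echo "$all"
    else
        echo "$iface"
    fi
}

# Example: all=2 and eth1=1 means loose mode (2) applies on eth1.
effective_value 2 1
```

On a live system the two inputs could be read with 'sysctl -n net.ipv4.conf.all.rp_filter' and 'sysctl -n net.ipv4.conf.eth1.rp_filter'.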

Now consider this system:

  -----------      10.10.0.5      ----                     -----------
 |       eth0|-------------------|    |                   |           |
 |           | 00:00:00:00:00:05 |    |    10.10.0.12     |           |
 |   VM-1    |                   | br |-------------------|   VM-2    |
 |           |     10.10.0.6     |    | 00:00:00:00:00:12 |           |
 |       eth1|-------------------|    |                   |           |
  -----------  00:00:00:00:00:06  ----                     -----------

IPs are assigned on a per-node basis, not on a per-interface basis.
By default, each vNIC is configured with rp_filter=1, meaning that when an ARP
request is received on an interface, the kernel must find that the best route back
to the source IP uses that same interface; otherwise no ARP response is sent.
In addition, each interface is configured with arp_ignore=0,
meaning that any interface can respond to the ARP request as long as the node
holds that IP.

This is what happens when 10.10.0.12 (VM-2) pings 10.10.0.6 (VM-1):
 - An ARP request is broadcast by 10.10.0.12 asking for the MAC address
  of 10.10.0.6.
 - eth1 (10.10.0.6) sees the request and checks the routing table to determine
  if a route back to the source IP address can be found (this is a security
  mechanism put in place with rp_filter).
 - VM-1's routing table has two entries on how to find the source address
  (one entry for eth0 and one entry for eth1).
 - It sees that the source address (10.10.0.12) can be found on eth0 and
  stops looking through the routing table since it has found a working
  route.
 - Since rp_filter=1, eth1 cannot reply to the ARP request because the route
  back to the source address is on a different interface. eth1 does have a
  route back to the source address; it's just that when the kernel looks
  through the routing table for a reverse path, the first matching entry
  uses eth0, so eth1 never replies to the ARP request.
 - Since the ARP request is broadcast, eth0 also receives the request,
  and since arp_ignore=0, eth0 is allowed to reply to it.
 - As I mentioned above, IPs are assigned on a per node basis, not on a per
  interface basis, so when eth0 receives the ARP request for 10.10.0.6,
  it can see that 10.10.0.6 belongs to this node and therefore sends an ARP
  reply. The problem with this is that eth0 does not reply with the MAC address
  of eth1 (00:00:00:00:00:06), it replies with its MAC address (00:00:00:00:00:05).
 - When the ARP reply from eth0 reaches the bridge, it is dropped because the bridge
  is aware that the MAC address 00:00:00:00:00:05 does not belong to the IP 10.10.0.6,
  thus the ARP reply never makes it back to VM-2.
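
The reverse-path decision in the steps above can be sketched as a toy shell function. This is only a model of the rp_filter modes as documented, not kernel code; the name rp_check and its arguments are hypothetical.

```shell
# Toy model of the rp_filter decision (hypothetical helper, not kernel code).
#   $1 = rp_filter mode (0, 1 or 2)
#   $2 = interface the packet arrived on
#   $3 = interface of the best route back to the source ("" if unreachable)
rp_check() {
    mode=$1
    in_if=$2
    best_if=$3
    case "$mode" in
        0) echo pass ;;                      # no source validation
        1) if [ "$in_if" = "$best_if" ]; then echo pass; else echo fail; fi ;;
        2) if [ -n "$best_if" ]; then echo pass; else echo fail; fi ;;
        *) echo fail ;;                      # unknown mode, treated as a failure here
    esac
}

# The ARP request from 10.10.0.12 arrives on eth1, but the best route is via eth0.
rp_check 1 eth1 eth0   # strict mode
rp_check 2 eth1 eth0   # loose mode
```

On VM-1, 'ip route get 10.10.0.12' should show which interface the kernel actually prefers for the return path.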

Proposed Solutions:
-------------------

1) Set rp_filter=2:
 - The rp_filter field can be modified on VM-1 with 'sysctl -w net.ipv4.conf.all.rp_filter=2'.
 - I mentioned above that eth1 could not reply to the ARP request because the route
  it found back to the source IP was using eth0. By setting rp_filter=2, eth1 is
  allowed to send the reply back to VM-2.
 - Now if you ping 10.10.0.6 from VM-2, the ping request is sent to eth1, and the reply
  is from eth0.

2) Create a new routing table for each vNIC:
 - Modify /etc/iproute2/rt_tables and create a table for each vNIC (for this example, I've
  named the tables t0 and t1).
 - Execute the following commands (based on the example above):
  - For eth0:
   ip route add default via 10.10.0.1 dev eth0 table t0
   ip route add 10.10.0.0/24 dev eth0 src 10.10.0.5 table t0
   ip rule add from 10.10.0.5 table t0
  - For eth1:
   ip route add default via 10.10.0.1 dev eth1 table t1
   ip route add 10.10.0.0/24 dev eth1 src 10.10.0.6 table t1
   ip rule add from 10.10.0.6 table t1
 - Note that these routes are only temporary and will be deleted from their new tables if the
  interface is turned down. If you need persistent routes, you will have to create the
  appropriate network-scripts.
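
The rt_tables edit in the first step is not spelled out above, so here is a minimal sketch of one way to do it. The helper name add_rt_table and the table IDs used below are assumptions for illustration only; the table names t0 and t1 are the ones from this example.

```shell
# Hypothetical helper: append "<id> <name>" to an rt_tables-style file
# unless an identical entry is already present.
#   $1 = numeric table ID, $2 = table name, $3 = path to the file
add_rt_table() {
    id=$1
    name=$2
    file=$3
    if ! grep -qE "^[[:space:]]*${id}[[:space:]]+${name}[[:space:]]*\$" "$file" 2>/dev/null; then
        printf '%s\t%s\n' "$id" "$name" >> "$file"
    fi
}
```

As root, this could be called as 'add_rt_table 100 t0 /etc/iproute2/rt_tables' and 'add_rt_table 101 t1 /etc/iproute2/rt_tables'. The IDs 100 and 101 are arbitrary picks; any unused ID that does not collide with the reserved tables (253 default, 254 main, 255 local) should do.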

3) Create a new namespace for each additional vNIC:
 - For eth1:
  ip netns add new-space0
  ip link set dev eth1 netns new-space0
  ip netns exec new-space0 ip link set dev eth1 up
  ip netns exec new-space0 ip addr add 10.10.0.6/24 brd + dev eth1

I am sure more solutions are available, but these are the ones I tested. The way the VM connectivity
tests are done today will have to be modified, but once again, this is just how Linux operates.