StarlingX

Comment 2 for bug 1790941

Patrick Bonnell (pbonnell) wrote on 2018-09-28:

This issue isn't actually a bug; this is how Linux behaves. 
For this to make more sense, please consider the following information:

rp_filter - INTEGER
	0 - No source validation.
	1 - Strict mode as defined in RFC3704 Strict Reverse Path
	    Each incoming packet is tested against the FIB and if the interface
	    is not the best reverse path the packet check will fail.
	    By default failed packets are discarded.
	2 - Loose mode as defined in RFC3704 Loose Reverse Path
	    Each incoming packet's source address is also tested against the FIB
	    and if the source address is not reachable via any interface
	    the packet check will fail.

	Current recommended practice in RFC3704 is to enable strict mode
	to prevent IP spoofing from DDos attacks. If using asymmetric routing
	or other complicated routing, then loose mode is recommended.

	The max value from conf/{all,interface}/rp_filter is used
	when doing source validation on the {interface}.

	Default value is 0. Note that some distributions enable it
	in startup scripts.

arp_ignore - INTEGER
	Define different modes for sending replies in response to
	received ARP requests that resolve local target IP addresses:
	0 - (default): reply for any local target IP address, configured
	on any interface
	1 - reply only if the target IP address is local address
	configured on the incoming interface
	2 - reply only if the target IP address is local address
	configured on the incoming interface and both with the
	sender's IP address are part from same subnet on this interface
	3 - do not reply for local addresses configured with scope host,
	only resolutions for global and link addresses are replied
	4-7 - reserved
	8 - do not reply for all local addresses

	The max value from conf/{all,interface}/arp_ignore is used
	when ARP request is received on the {interface}
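
To check which values are actually in effect on a given interface, something like the following could be used. This is only a sketch: effective_value and show_effective are helper names made up for this example, and eth1 is simply the interface from this report. It applies the "max value from conf/{all,interface}" rule quoted above to the values under /proc/sys.

```shell
#!/bin/sh
# Return the larger of two integer sysctl values, mirroring the
# "max value from conf/{all,interface}" rule quoted above.
# (effective_value is a hypothetical helper, not an existing tool.)
effective_value() {
    all_value=$1
    iface_value=$2
    if [ "$all_value" -gt "$iface_value" ]; then
        echo "$all_value"
    else
        echo "$iface_value"
    fi
}

# Print the effective rp_filter and arp_ignore for one interface.
# $1 = interface name (eth1 is only the example from this report).
show_effective() {
    iface=$1
    for key in rp_filter arp_ignore; do
        all_value=$(cat "/proc/sys/net/ipv4/conf/all/$key")
        iface_value=$(cat "/proc/sys/net/ipv4/conf/$iface/$key")
        echo "$iface $key: $(effective_value "$all_value" "$iface_value")"
    done
}

# Usage (on VM-1): show_effective eth1
```

This also shows why setting only conf/all/rp_filter=2 is enough to override a per-interface value of 1.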

Now consider this system:

  -----------      10.10.0.5      ----                     -----------
 |       eth0|-------------------|    |                   |           |
 |           | 00:00:00:00:00:05 |    |    10.10.0.12     |           |
 |    VM-1   |                   | br |-------------------|    VM-2   |
 |           |     10.10.0.6     |    | 00:00:00:00:00:12 |           |
 |       eth1|-------------------|    |                   |           |
  -----------  00:00:00:00:00:06  ----                     -----------

Linux assigns IP addresses on a per-node basis, not on a per-interface basis:
an address configured on one interface is treated as belonging to the whole
node. By default, each vNIC is configured with rp_filter=1, meaning that when
an ARP request is received on an interface, the route back to the source IP
must use that same interface; otherwise no ARP response is sent back. In
addition, each interface is configured with arp_ignore=0, meaning that any
interface can respond to the ARP request as long as the node holds that IP.

This is what happens when 10.10.0.12 (VM-2) pings 10.10.0.6 (VM-1):
	- VM-2 (10.10.0.12) broadcasts an ARP request asking for the MAC address
		of 10.10.0.6.
	- eth1 (10.10.0.6) sees the request and checks the routing table for a
		route back to the source IP address (this is the security check
		enforced by rp_filter).
	- VM-1's routing table has two entries that reach the source address
		(one entry for eth0 and one entry for eth1).
	- The lookup finds that the source address (10.10.0.12) is reachable via
		eth0 and stops there, since it has found a working route.
	- Since rp_filter=1, eth1 cannot reply to the ARP request because the route
		back to the source address is on a different interface. eth1 does have a
		route back to the source address; it's just that the first matching entry
		in the reverse-path lookup uses eth0, so eth1 never replies to the ARP
		request.
	- Since the ARP request is a broadcast, eth0 also receives it, and since
		arp_ignore=0, eth0 is allowed to reply.
	- As I mentioned above, IPs are assigned on a per-node basis, not on a
		per-interface basis, so when eth0 receives the ARP request for 10.10.0.6,
		it sees that 10.10.0.6 belongs to this node and sends an ARP reply. The
		problem is that eth0 does not reply with the MAC address of eth1
		(00:00:00:00:00:06); it replies with its own MAC address
		(00:00:00:00:00:05).
	- When the ARP reply from eth0 reaches the bridge, it is dropped because the
		bridge knows that the MAC address 00:00:00:00:00:05 does not belong to
		the IP 10.10.0.6, so the ARP reply never makes it back to VM-2.
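
The reverse-path decision in the steps above can be sketched as a small toy model. This is not kernel code, just the strict/loose rule from the quoted documentation written out; rp_check is a name invented for this example, and the third argument stands in for whatever interface the routing lookup picks for the source address.

```shell
#!/bin/sh
# Toy model of the reverse-path check described above; not kernel code.
# $1 = effective rp_filter value (0, 1 or 2)
# $2 = interface the packet arrived on
# $3 = interface the routing lookup picks for the source address
#      (empty string if the source is not reachable at all)
rp_check() {
    mode=$1
    in_if=$2
    route_if=$3
    case "$mode" in
        0) echo accept ;;                       # no source validation
        1) if [ "$in_if" = "$route_if" ]; then  # strict: must be same interface
               echo accept
           else
               echo drop
           fi ;;
        2) if [ -n "$route_if" ]; then          # loose: any interface will do
               echo accept
           else
               echo drop
           fi ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

rp_check 1 eth1 eth0   # strict, request on eth1, reverse path via eth0: drop
rp_check 2 eth1 eth0   # loose, source reachable via some interface: accept
```

On a real system, 'ip route get 10.10.0.12' run on VM-1 shows which interface the kernel picks for the route back to VM-2.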

Proposed Solutions:
-------------------

1) Set rp_filter=2:
	- The rp_filter setting can be changed on VM-1 with 'sysctl -w net.ipv4.conf.all.rp_filter=2'.
	- I mentioned above that eth1 could not reply to the ARP request because the route
		it found back to the source IP was using eth0. With rp_filter=2 (loose mode), the
		source only has to be reachable via some interface, so eth1 is allowed to send the
		reply back to VM-2.
	- Now if you ping 10.10.0.6 from VM-2, the ping request arrives on eth1, and the reply
		is sent from eth0.
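
Note that 'sysctl -w' only changes the running value. One possible way to keep the setting across reboots is a sysctl drop-in file, sketched below. The helper name write_rp_filter_conf and the file name 90-rp-filter.conf are assumptions for this example; I have not tested this on the affected image.

```shell
#!/bin/sh
# Write a sysctl drop-in that selects loose reverse-path filtering.
# $1 = target file; the file name used below is a hypothetical choice.
write_rp_filter_conf() {
    conf_file=$1
    printf '%s\n' 'net.ipv4.conf.all.rp_filter = 2' > "$conf_file"
}

# Usage (as root on VM-1):
#   write_rp_filter_conf /etc/sysctl.d/90-rp-filter.conf
#   sysctl --system                      # reload all sysctl configuration files
#   sysctl net.ipv4.conf.all.rp_filter   # confirm the running value
```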

2) Create a new routing table for each vNIC:
	- Modify /etc/iproute2/rt_tables and create a table for each vNIC (For this example, I've 
		named the tables t0 and t1).
	- Execute the following commands (based on the example above):
		- For eth0:
			ip route add default via 10.10.0.1 dev eth0 table t0
			ip route add 10.10.0.0/24 dev eth0 src 10.10.0.5 table t0
			ip rule add from 10.10.0.5 table t0
		- For eth1:
			ip route add default via 10.10.0.1 dev eth1 table t1
			ip route add 10.10.0.0/24 dev eth1 src 10.10.0.6 table t1
			ip rule add from 10.10.0.6 table t1
	- Note that these routes are only temporary and will be deleted from their new tables if the
		interface is turned down. If you need persistent routes, you will have to create the
		appropriate network-scripts.
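
For the rt_tables step, each table needs a line of the form "<id> <name>". A minimal sketch of adding the two entries follows. The helper name add_rt_table and the IDs 100 and 101 are arbitrary choices for this example; any unused table ID should work.

```shell
#!/bin/sh
# Append "<id> <name>" to an rt_tables file unless the name is already present.
# $1 = file, $2 = numeric table id, $3 = table name
# (add_rt_table is a hypothetical helper written for this example.)
add_rt_table() {
    rt_file=$1
    table_id=$2
    table_name=$3
    if ! grep -q "[[:space:]]$table_name\$" "$rt_file" 2>/dev/null; then
        printf '%s\t%s\n' "$table_id" "$table_name" >> "$rt_file"
    fi
}

# Usage (as root on VM-1); IDs 100 and 101 are assumed to be unused:
#   add_rt_table /etc/iproute2/rt_tables 100 t0
#   add_rt_table /etc/iproute2/rt_tables 101 t1
# After running the ip route / ip rule commands above, check the result with:
#   ip rule show
#   ip route show table t0
#   ip route show table t1
```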

3) Create a new namespace for each additional vNIC:
	- For eth1:
		ip netns add new-space0
		ip link set dev eth1 netns new-space0
		ip netns exec new-space0 ip link set dev eth1 up
		ip netns exec new-space0 ip addr add 10.10.0.6/24 brd + dev eth1

I am sure more solutions are available, but these are the ones I tested. The way the VM connectivity
tests are done today will have to be modified, but once again, this is just how Linux operates.