This issue isn't actually a bug; this is how Linux behaves.
For this to make more sense, please consider the following information:

rp_filter - INTEGER
    0 - No source validation.
    1 - Strict mode as defined in RFC3704 Strict Reverse Path
        Each incoming packet is tested against the FIB and if the interface
        is not the best reverse path the packet check will fail.
        By default failed packets are discarded.
    2 - Loose mode as defined in RFC3704 Loose Reverse Path
        Each incoming packet's source address is also tested against the FIB
        and if the source address is not reachable via any interface
        the packet check will fail.

    Current recommended practice in RFC3704 is to enable strict mode
    to prevent IP spoofing from DDos attacks. If using asymmetric routing
    or other complicated routing, then loose mode is recommended.

    The max value from conf/{all,interface}/rp_filter is used
    when doing source validation on the {interface}.

    Default value is 0. Note that some distributions enable it
    in startup scripts.

arp_ignore - INTEGER
    Define different modes for sending replies in response to
    received ARP requests that resolve local target IP addresses:
    0 - (default): reply for any local target IP address, configured
        on any interface
    1 - reply only if the target IP address is local address
        configured on the incoming interface
    2 - reply only if the target IP address is local address
        configured on the incoming interface and both with the
        sender's IP address are part from same subnet on this interface
    3 - do not reply for local addresses configured with scope host,
        only resolutions for global and link addresses are replied
    4-7 - reserved
    8 - do not reply for all local addresses

    The max value from conf/{all,interface}/arp_ignore is used
    when ARP request is received on the {interface}
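
The "max value" rule quoted above can be checked on a running system. The following is a minimal sketch, assuming the interface name eth1 from this example and the usual /proc/sys/net/ipv4/conf layout; the helper names effective_value and show_effective are made up for illustration:

```shell
# Hypothetical helper: the kernel applies max(conf/all, conf/<iface>)
# for both rp_filter and arp_ignore, so return the larger of two integers.
effective_value() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# Hypothetical helper: print the effective value of a setting on an interface.
# $1: rp_filter or arp_ignore, $2: interface name (eth1 in this example).
# Prints nothing if the /proc entries are not readable on this machine.
show_effective() {
    local key=$1 iface=$2 base=/proc/sys/net/ipv4/conf
    if [ -r "$base/all/$key" ] && [ -r "$base/$iface/$key" ]; then
        effective_value "$(cat "$base/all/$key")" "$(cat "$base/$iface/$key")"
    fi
}

show_effective rp_filter eth1
show_effective arp_ignore eth1
```

Because the larger value wins, a per-interface value lower than conf/all has no effect.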

IPs are assigned on a per-node basis, not on a per-interface basis.
By default, each vNIC is configured with rp_filter=1, meaning that when an ARP
request is received on an interface, the best route back to the source IP must
use that same interface; otherwise no ARP response is sent back. In addition,
each interface is configured with arp_ignore=0, meaning that any interface can
respond to the ARP request as long as the node holds that IP.

This is what happens when 10.10.0.12 (VM-2) pings 10.10.0.6 (VM-1):
- An ARP request is broadcast by 10.10.0.12 asking for the MAC address
  of 10.10.0.6.
- eth1 (10.10.0.6) sees the request and checks the routing table to determine
  if a route back to the source IP address can be found (this is a security
  mechanism put in place with rp_filter).
- VM-1's routing table has two entries for reaching the source address
  (one entry for eth0 and one entry for eth1).
- It sees that the source address (10.10.0.12) can be reached via eth0 and
  stops looking through the routing table since it has found a working
  route.
- Since rp_filter=1, eth1 cannot reply to the ARP request because the route
  back to the source address is on a different interface. We know that eth1
  does have a route back to the source address; it's just that when the kernel
  looks through the routing table for a reverse path, the first matching entry
  uses eth0, so eth1 never replies to the ARP request.
- Since the ARP request is a broadcast, eth0 also receives the request,
  and since arp_ignore=0, eth0 is allowed to reply to it.
- As I mentioned above, IPs are assigned on a per-node basis, not on a
  per-interface basis, so when eth0 receives the ARP request for 10.10.0.6,
  it can see that 10.10.0.6 belongs to this node and therefore sends an ARP
  reply. The problem with this is that eth0 does not reply with the MAC address
  of eth1 (00:00:00:00:00:06); it replies with its own MAC address (00:00:00:00:00:05).
- When the ARP reply from eth0 reaches the bridge, it is dropped because the bridge
  is aware that the MAC address 00:00:00:00:00:05 does not belong to the IP 10.10.0.6,
  so the ARP reply never makes it back to VM-2.
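
The decision described in the steps above can be written down as a small toy model. This is only a sketch of the rule as the quoted documentation states it, not kernel code; the function name rp_check and its arguments are invented for illustration:

```shell
# Toy model of the source validation step (not kernel code).
# $1: rp_filter mode (0, 1 or 2)
# $2: interface the ARP request arrived on
# $3: interface the routing lookup picks for the source address
#     (empty string if the source is not reachable at all)
rp_check() {
    local mode=$1 in_if=$2 best_if=$3
    case "$mode" in
        0) echo pass ;;
        1) if [ "$in_if" = "$best_if" ]; then echo pass; else echo drop; fi ;;
        2) if [ -n "$best_if" ]; then echo pass; else echo drop; fi ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

rp_check 1 eth1 eth0   # strict, request on eth1, best route via eth0: prints "drop"
rp_check 1 eth0 eth0   # strict, request on eth0, best route via eth0: prints "pass"
rp_check 2 eth1 eth0   # loose, source reachable via some interface: prints "pass"
```

On VM-1 itself, 'ip route get 10.10.0.12' shows which interface the routing lookup selects for the source address.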
Proposed Solutions:
-------------------
1) Set rp_filter=2:
- The rp_filter field can be modified on VM-1 with 'sysctl -w net.ipv4.conf.all.rp_filter=2'.
- I mentioned above that eth1 could not reply to the ARP request because the route
  it found back to the source IP was using eth0. With rp_filter=2 (loose mode), the
  source only has to be reachable via some interface, so eth1 is allowed to send the
  reply back to VM-2.
- Now if you ping 10.10.0.6 from VM-2, the ping request arrives on eth1, and the reply
  is sent from eth0, since the routing table still picks eth0 for the return path.
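
A value set with 'sysctl -w' does not survive a reboot. The following is a minimal sketch of one way to persist it, assuming a distribution that reads drop-in files from /etc/sysctl.d; the file name 90-rp-filter.conf and the helper name write_rp_filter_conf are assumptions made for this example:

```shell
# Hypothetical helper: write a sysctl drop-in that selects loose mode.
# $1: destination file (for example /etc/sysctl.d/90-rp-filter.conf,
#     an assumed name; any name ending in .conf should work there).
write_rp_filter_conf() {
    local dest=$1
    {
        echo "# Loose reverse path filtering (RFC3704 loose mode)"
        echo "net.ipv4.conf.all.rp_filter = 2"
    } > "$dest"
}

# Example (as root):
#   write_rp_filter_conf /etc/sysctl.d/90-rp-filter.conf
#   sysctl --system                          # reload sysctl configuration files
#   sysctl -n net.ipv4.conf.all.rp_filter    # read the value back
```

Setting only conf/all is enough here because the kernel uses the max of conf/all and the per-interface value.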
2) Create a new routing table for each vNIC:
- Modify /etc/iproute2/rt_tables and create a table for each vNIC (for this example, I've
  named the tables t0 and t1).
- Execute the following commands (based on the example above):
  - For eth0:
      ip route add default via 10.10.0.1 dev eth0 table t0
      ip route add 10.10.0.0/24 dev eth0 src 10.10.0.5 table t0
      ip rule add from 10.10.0.5 table t0
  - For eth1:
      ip route add default via 10.10.0.1 dev eth1 table t1
      ip route add 10.10.0.0/24 dev eth1 src 10.10.0.6 table t1
      ip rule add from 10.10.0.6 table t1
- Note that these routes are only temporary and will be deleted from their new tables if the
  interface is brought down. If you need persistent routes, you will have to create the
  appropriate network-scripts.
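
Each line in rt_tables maps a numeric table ID to a name. The following is a minimal sketch of adding the t0 and t1 entries without duplicating them on repeated runs; the IDs 100 and 101 are arbitrary picks for this example, and add_rt_table is a made-up helper name:

```shell
# Hypothetical helper: append "<id> <name>" to an rt_tables file unless
# a table with that name is already listed.
# $1: file (normally /etc/iproute2/rt_tables), $2: table ID, $3: table name
add_rt_table() {
    local file=$1 id=$2 name=$3
    if ! grep -qE "^[0-9]+[[:space:]]+$name\$" "$file" 2>/dev/null; then
        printf '%s\t%s\n' "$id" "$name" >> "$file"
    fi
}

# Example (as root); the IDs are assumptions, any unused ID will do:
#   add_rt_table /etc/iproute2/rt_tables 100 t0
#   add_rt_table /etc/iproute2/rt_tables 101 t1
#   ip rule show                 # lists the "from 10.10.0.x lookup tN" rules
#   ip route show table t1       # lists the routes added to t1
```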
3) Create a new namespace for each additional vNIC:
- For eth1:
    ip netns add new-space0
    ip link set dev eth1 netns new-space0
    ip netns exec new-space0 ip link set dev eth1 up
    ip netns exec new-space0 ip addr add 10.10.0.6/24 brd + dev eth1
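
The namespace steps can be wrapped in a small function so they can be repeated for further vNICs. This is a sketch under a few assumptions: the interface is moved with 'ip link set ... netns', loopback is brought up as well because a new namespace starts with lo down, and the run and setup_vnic_netns helpers are invented for illustration. By default the sketch only prints the commands; set DRY_RUN=0 (as root) to execute them.

```shell
# Hypothetical helper: print the command when DRY_RUN is 1 (the default),
# otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# Hypothetical helper: move a vNIC into its own namespace and configure it.
# $1: namespace name, $2: interface, $3: address with prefix length
setup_vnic_netns() {
    local ns=$1 dev=$2 addr=$3
    run ip netns add "$ns"
    run ip link set dev "$dev" netns "$ns"
    run ip netns exec "$ns" ip link set dev lo up
    run ip netns exec "$ns" ip link set dev "$dev" up
    run ip netns exec "$ns" ip addr add "$addr" brd + dev "$dev"
    run ip netns exec "$ns" ip addr show dev "$dev"   # verify the result
}

DRY_RUN=1 setup_vnic_netns new-space0 eth1 10.10.0.6/24
```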
I am sure more solutions are available, but these are the ones I tested. The way the VM connectivity
tests are done today will have to be modified, but once again, this is just how Linux operates.

For reference, this is the system the walkthrough above describes:

-----------  10.10.0.5          ------                        -----------
|     eth0|---------------------|    |                        |         |
|         |  00:00:00:00:00:05  |    |  10.10.0.12            |         |
|  VM-1   |                     | br |------------------------|  VM-2   |
|         |  10.10.0.6          |    |  00:00:00:00:00:12     |         |
|     eth1|---------------------|    |                        |         |
-----------  00:00:00:00:00:06  ------                        -----------