Bug #1011134 “hairpin mode on vnet bridge ports causes false positives on IPv6 duplicate address detection” : OpenStack Compute (nova)

lucas kauffman (lucas-kauffman) wrote on 2012-06-10:

#1

I can confirm this problem.

Manu Sporny (msporny) wrote on 2012-06-21:

#2

Confirmed here as well.

IPv6 can fail to initialize in a VM that is running on a host machine which is bridging the network traffic to the host's physical ethernet port. The issue has two causes:

1. IPv6 has a duplicate address detection feature, where if it sees packets from the same IPv6 Link-Local (MAC-based) address as itself, it assumes that there is another box on the same network with the same MAC address.
2. With network bridging hairpin'ing turned on, all packets are reflected back to the VMs, so any IPv6 traffic is duplicated and sent back to the VM... which means that the IPv6 duplicate address detection code activates and the IPv6 subsystem skips initialization.

The bug has to do with two separate systems stomping on each other:

1. The bridge is misconfigured. Either it is in promiscuous mode, or the virtual network interface has hairpin'ing turned on. Hairpin'ing reflects all traffic, including IPv6 Neighbor Solicitation messages, back to the sender.
2. The Linux kernel IPv6 code, upon seeing the reflected IPv6 Neighbor Solicitation message, assumes the address is in use (when it really isn't) and doesn't bring the link up all the way as a result.

Detecting the issue
---------------------------

On the VM, run the following command to see if you have an IPv6 Duplicate Address Detection issue:

dmesg | grep duplicate

If you see a line that matches something like the following, you have the IPv6 DAD issue:

eth0: IPv6 duplicate address detected!

You can also issue the following command to see if you have any issues with your IPv6 links:

ip addr | grep tentative

If you see a line that matches something like the following, you most likely have the IPv6 DAD issue:

inet6 fe80::a816:aeff:be53:d00d/64 scope link tentative
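The two checks above can be combined into one small filter. This is a minimal sketch, not part of the original report: the function name `dad_suspects` is made up for illustration, and it assumes the usual `ip -6 addr` output format, where the flags `tentative` and `dadfailed` appear on the `inet6` line.

```shell
# Read `ip -6 addr` style output on stdin and print every IPv6 address
# that is still tentative or has failed duplicate address detection.
# dad_suspects is a hypothetical helper name, not an existing tool.
dad_suspects() {
    awk '$1 == "inet6" && /tentative|dadfailed/ { print $2 }'
}

# Usage inside the VM:
#   ip -6 addr show | dad_suspects
```

An address is briefly tentative during normal startup while DAD runs, so an address that stays in this list after a few seconds is the symptom to look for.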

Fixing the Issue
---------------------

There are two potential causes, each with its own fix:

1. The bridge interface is in promiscuous mode.
2. Hairpin'ing is turned on for the virtual network device and is thus reflecting IPv6 Neighbor Solicitation messages back to the sender.

To solve #1, turn promiscuous mode for the bridge device off by doing the following:

ifconfig br100 -promisc

To solve #2, you have to turn off hairpin'ing mode for the virtual network interface that is associated with the VM that is not able to set up a valid IPv6 link. So, on the OpenStack Compute Node that is hosting the VM, assuming that the bridge device is named 'br100' and the virtual network interface is named 'vnet0', you would run the following command to turn off hairpin'ing:

echo 0 > /sys/class/net/br100/brif/vnet0/hairpin_mode
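When a host runs several VMs it can help to see the hairpin setting of every port at once. The sketch below walks the same sysfs path used above; the function names and the optional second argument (an alternate sysfs root, handy for dry runs) are assumptions made for illustration.

```shell
# Print "<port> <hairpin_mode>" for each port attached to a bridge.
# $1: bridge name (e.g. br100)
# $2: sysfs net root, defaults to /sys/class/net (hypothetical override)
hairpin_report() {
    bridge=$1
    root=${2:-/sys/class/net}
    for port in "$root/$bridge"/brif/*; do
        [ -r "$port/hairpin_mode" ] || continue
        printf '%s %s\n' "$(basename "$port")" "$(cat "$port/hairpin_mode")"
    done
}

# Write 0 to hairpin_mode on every port of the bridge.
# On a real host this needs root.
hairpin_off_all() {
    bridge=$1
    root=${2:-/sys/class/net}
    for port in "$root/$bridge"/brif/*; do
        [ -w "$port/hairpin_mode" ] || continue
        echo 0 > "$port/hairpin_mode"
    done
}

# Usage on the compute node:
#   hairpin_report br100
#   hairpin_off_all br100
```

On systems that ship the iproute2 `bridge` utility, `bridge link set dev vnet0 hairpin off` should have the same effect for a single port. Either way the setting lives on the port, so it is lost when the port is recreated.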

If using OpenStack, to make the change more permanent, you can comment out the hairpin code in /usr/share/pyshared/nova/virt/libvirt/connection.py, starting at line 906. Specifically, comment out the "self._enable_hairpin(instance)" line.

Links
-------

* See Section '''14.2.2. Neighbor discovery''': http://tldp.org/HOWTO/html_single/Linux+IPv6-HOWTO/#EXAMPLES-TCPDUMP
* Hairpin'ing feature: http://www.networkworld.com/news/tech/2010/101223techupdate-vepa.html

Evan Callicoat (diopter) wrote on 2012-06-30:

#3

I am the author of the hairpin_mode change and I recently had this bug brought to my attention, much to my chagrin!

I believe what is happening here is that when ICMPv6 sends certain messages (like Neighbor Solicitations in Duplicate Address Detection) it uses a multicast destination MAC address (33:33:xx:xx:xx:xx) in the ethernet frame sent to the host bridge. When the bridge receives the frame, given that it doesn't engage in IGMP snooping, it treats multicast MAC addresses just like the broadcast MAC address, and forwards the frame to all ports. With hairpin_mode enabled on the port the frame entered the bridge on, it will get copied back out that same port, resulting in the behavior seen above.
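To make the 33:33:xx:xx:xx:xx address concrete: the Neighbor Solicitation sent during DAD goes to the solicited-node multicast group ff02::1:ffXX:XXXX, where XX:XXXX are the low 24 bits of the address being tested, and IPv6 multicast maps to Ethernet by prefixing 33:33 to the last four bytes of the group address. The sketch below derives that destination MAC. The function name is hypothetical, and it assumes the address ends in explicit hex groups rather than a trailing "::".

```shell
# Print the Ethernet destination MAC used by the DAD Neighbor Solicitation
# for a given IPv6 address: 33:33:ff followed by the low 24 bits.
# dad_dst_mac is a hypothetical helper name.
dad_dst_mac() {
    addr=${1%%/*}                    # strip an optional /prefix length
    last=${addr##*:}                 # final 16-bit group
    rest=${addr%:*}
    prev=${rest##*:}                 # second-to-last group ("" after "::")
    prev=${prev:-0}
    last=$(printf '%04x' "0x$last")  # zero-pad to four hex digits
    prev=$(printf '%04x' "0x$prev")
    printf '33:33:ff:%s:%s:%s\n' \
        "$(printf '%s' "$prev" | cut -c3-4)" \
        "$(printf '%s' "$last" | cut -c1-2)" \
        "$(printf '%s' "$last" | cut -c3-4)"
}

# Usage:
#   dad_dst_mac fe80::a816:aeff:be53:d00d/64
```

A frame with this destination MAC and the VM's own source MAC arriving back at the VM is the reflection described above.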

I believe the simplest approach to solving the problem without potentially breaking or altering any other behaviors is to add an nwfilter to libvirt which identifies this particular scenario and filters it, like this:

</filter>

I haven't tested this yet (so hopefully my syntax is correct there) but the idea is very simple: there's no normal scenario in which we should receive a v6 multicast frame originating from our own interface, so we can accurately identify this as a reflection to be dropped by a bridge filter.

I should have a chance to test this at some point soon and if it works, see about where to submit new nwfilters, but in the meantime it'd be great if anyone affected by this bug could try dropping in a no-mac-reflection.xml containing the filter listed above and see if the problem is resolved!
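The body of the filter did not survive in this copy of the thread; only its closing tag remains. The sketch below is a reconstruction from the description in this comment, not the author's original text: it drops inbound frames addressed to an IPv6 multicast MAC (33:33:xx:xx:xx:xx) whose source MAC is the guest's own. The filter name and the helper function are assumptions; `$MAC` is the variable libvirt substitutes with the interface's MAC address.

```shell
# Write a candidate no-mac-reflection.xml into an nwfilter directory.
# $1: target directory, defaults to /etc/libvirt/nwfilter
# The XML is a reconstruction based on the description in the thread.
write_no_mac_reflection() {
    dir=${1:-/etc/libvirt/nwfilter}
    cat > "$dir/no-mac-reflection.xml" <<'EOF'
<filter name='no-mac-reflection' chain='ipv6'>
  <!-- drop frames sent to a 33:33:xx:xx:xx:xx multicast MAC that carry
       our own source MAC: they can only be reflections -->
  <rule action='drop' direction='in'>
    <mac dstmacaddr='33:33:00:00:00:00' dstmacmask='ff:ff:00:00:00:00'
         srcmacaddr='$MAC'/>
  </rule>
</filter>
EOF
}

# Usage (as root on the compute node):
#   write_no_mac_reflection
```

A filter file is normally loaded with `virsh nwfilter-define <file>` and only takes effect for interfaces that reference it with a `filterref` element, so treat this as a starting point for testing rather than a drop-in fix.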

Evan Callicoat (diopter) wrote on 2012-06-30:

#4

Whoops, forgot to mention that nwfilter files go in /etc/libvirt/nwfilter, in case that wasn't well-understood.

Changed in nova:
status:	New → Confirmed

Thierry Carrez (ttx) on 2012-08-24

Changed in nova:
importance:	Undecided → Medium

Takashi Sogabe (sogabe) on 2012-10-03

Changed in nova:
assignee:	nobody → Takashi Sogabe (sogabe)

OpenStack Infra (hudson-openstack) wrote on 2012-10-03: Fix proposed to nova (master)

#5

Fix proposed to branch: master
Review: https://review.openstack.org/14017

Changed in nova:
status:	Confirmed → In Progress

Vish Ishaya (vishvananda) on 2012-10-10

tags:

added: folsom-rc-potential

OpenStack Infra (hudson-openstack) wrote on 2012-10-10: Fix merged to nova (master)

#6

Reviewed: https://review.openstack.org/14017
Committed: http://github.com/openstack/nova/commit/0436cbdb882b532f0d01c41108508c6d4da3544e
Submitter: Jenkins
Branch: master

commit 0436cbdb882b532f0d01c41108508c6d4da3544e
Author: Takashi Sogabe <email address hidden>
Date: Wed Oct 3 17:19:20 2012 +0900

    handle IPv6 race condition due to hairpin mode

    bug 1011134

    When using IPv6 an instance sees its own neighbour advertisement,
    because of the reflective property of the hairpin mode.

    Because of this the trigger-happy duplicate address detection in
    the instance's kernel deconfigures the IPv6 address on the interface,
    resulting in no IPv6 connectivity.

    Approach of this commit is to add an nwfilter to libvirt which
    identifies this particular scenario and filters it.

    Change-Id: I28f9b49cee4b2ab6ff591fae4feee623955f845f

Changed in nova:
status:	In Progress → Fix Committed

Chuck Short (zulcss) on 2012-10-24

tags:

removed: folsom-rc-potential

Thierry Carrez (ttx) on 2012-11-21

Changed in nova:
milestone:	none → grizzly-1
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2013-04-04

Changed in nova:
milestone:	grizzly-1 → 2013.1

Bernhard M. Wiedemann (ubuntubmw) wrote on 2013-06-01:

#7

We hit this bug in our Essex-based SUSE Cloud 1.0 deployment, which does not use IPv6, so I think the patch is incomplete.
After reading http://wikibon.org/wiki/v/Edge_Virtual_Bridging I am fairly sure that hairpin_mode (meant to be implemented in managed switches and working on the Ethernet layer) is not the right approach to solve bug 933640, which is an IP-layer problem.
Btw: Did anyone manage to ping a floating IP from the VM it is assigned to? It did not work for me.

https://bugzilla.novell.com/show_bug.cgi?id=821879#c23 has a summary of what I debugged during the last week.

Evan Callicoat (diopter) wrote on 2013-06-01:

#8

The bugzilla you've linked to is down for maintenance, so I'm going to go at this blind to what you know or have done. Bear with me!

I thought the same thing when I first hit bug 933640 in a customer's deployment, where they absolutely refused to run split-DNS. They had a cluster of cattle (not pets), where any service could be running on any instance, and services only knew about each other through their global DNS hostnames, which were mapped to their floats, further distinguished by ports.

I specialize in Linux networking and spent about three days going over the issue on a whiteboard, working with Vish and other folks, and hairpinning was the simplest and most elegant solution I came up with at that time, and I'll tell you my reasoning.

First of all, I agree with your preliminary reading. Hairpin mode in the Linux kernel was implemented as part of a larger set of features to allow Linux to be a Virtual Ethernet Port Aggregator (VEPA), which is related to VEB as you mentioned. Hairpinning is absolutely an L2 functionality, and talking to your own float is indeed an L3 problem. However, getting out to your float and back in to a service that's actually listening on the same private IP you're sourcing from, without having to rewrite the client or service, is both an L2 and L3 issue.

The L3 portion is common; we need to DNAT on the way to a float (ingress initiated), and SNAT on the way from it (egress initiated), so the client thinks it's talking to the original service, and the translated service thinks the translator is the original client. For talking to our own float, we actually need to do both, but from the "back" (VM rather than public) side of the host: DNAT towards our float, which translates to our (private) IP as the new destination, then SNAT on the way back to ourself, so it looks like the traffic actually came *from* our float.

However, this gives us an L2 issue now. Namely that with native Linux bridges and their bridging engine's netfilter interaction (which you can see here from one of the netfilter devs: http://inai.de/images/nf-packet-flow.png), the bridge won't let a frame egress the same port it ingressed on without hairpin_mode enabled. So, unless we jump to a separate router beyond the compute host, and make *it* hairpin (same exact issue; usually this is discouraged even when straight routing, google "split horizon" and "reverse path filtering"), this is where the traffic needs to go.

The iptables rules in nova-network in Diablo/Essex to DNAT/SNAT floats didn't have -s restrictions, and may or may not have -i/-o restrictions depending on nova.conf flags. This turned out to be fortuitous, because it meant that I could rely on the same rules for the usual two float NAT patterns I mentioned earlier, yet hit them from the back side, without changing any iptables rules.

This left only one more minor issue, which is that the SNAT wasn't being hit on the way back to the VM, because of the iptables rule designed to -j ACCEPT fixed -> fixed VM traffic, short-circuiting before the float SNAT rules. So I had to make one minor change there; I added -m conntrack ! --ctstate DNAT (example from folsom, since essex is no longer in the upstream git repo: https://github.com/openstack/nova/blob/stable/folsom/nova/network/linux_net.py#L566).

With this minor match criterion, it would skip float SNATs for fixed -> fixed traffic unless we've already DNAT'd, which should only happen when we're hairpinning, and... voila. [VM (fixed src, float dst)] -> [Host DNAT (fixed src, fixed dst)] -> [Host SNAT (float src, fixed dst)] -> (hairpin lets it back in) -> [VM (float src, fixed dst)], so all clients talk from fixed -> float, and all services listen on fixed for traffic from floats. Of course, if you're familiar with how netfilter does NAT, you'll also know that all response traffic to an initially NAT'd flow will be automatically reversed by conntrack without explicit rules, so the existing DNAT/SNAT paths for floats do all the needful things!
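The rule ordering described above can be sketched as three nat-table rules. This is a simplified illustration under stated assumptions, not nova-network's actual rule set: it uses the built-in PREROUTING and POSTROUTING chains instead of nova's own chains, the function name is made up, and the fixed IP, float and fixed range are example values passed in by the caller. The function only prints the commands; it does not apply them.

```shell
# Print (not apply) nat rules illustrating the hairpin NAT path.
# $1: fixed (private) IP, $2: floating IP, $3: fixed range in CIDR form
# float_nat_rules is a hypothetical helper name.
float_nat_rules() {
    fixed=$1
    float=$2
    range=$3
    # 1. DNAT: traffic addressed to the float is redirected to the fixed IP.
    echo "iptables -t nat -A PREROUTING -d $float -j DNAT --to-destination $fixed"
    # 2. fixed -> fixed traffic skips SNAT, unless conntrack marks it as DNAT'd.
    echo "iptables -t nat -A POSTROUTING -s $range -d $range -m conntrack ! --ctstate DNAT -j ACCEPT"
    # 3. SNAT: remaining traffic from the fixed IP gets the float as its source.
    echo "iptables -t nat -A POSTROUTING -s $fixed -j SNAT --to-source $float"
}

# Usage with documentation-range example addresses:
#   float_nat_rules 10.0.0.5 203.0.113.7 10.0.0.0/24
```

The order matters: rule 2 has to sit before rule 3, and the `! --ctstate DNAT` match is what lets a hairpinned flow fall through to the SNAT while ordinary fixed -> fixed traffic is accepted untouched.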

----------

Now, with all of this said... I can't seem to find the source code for SUSE Cloud 1.0 to verify my patches are actually in there, or that the appropriate nwfilter bits around IPv6 will work as expected, but it's been proven to work for *many* other people, with no downsides. A few things I do know could interfere with this, however: your VM bridge not hosting your fixed gateway, your floats being on a different interface, or your bridge having promiscuous mode enabled. Vish and I worked through these things one at a time, and the customer I was solving for as an example was in line with our (Rackspace's) opinionated approach to using nova-network and floats. In the process from Diablo -> Essex -> Folsom, we disabled bridge promiscuity and enabled hairpins on every port, and never ran into any notable issues.

I'd love to find out more specifics on your environment and/or see the linux_net.py/nwfilter file from your installation, as well as hear any ideas you may have on better or different approaches!

-Evan

Bernhard M. Wiedemann (ubuntubmw) wrote on 2013-09-17:

#9

I made a patch for this bug: https://review.openstack.org/45389
