hairpin mode on vnet bridge ports causes false positives on IPv6 duplicate address detection

Bug #1011134 reported by Filipe Spencer Lopes dos Santos on 2012-06-10
This bug affects 8 people
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Takashi Sogabe

Bug Description

Because of bug 933640 (https://bugs.launchpad.net/nova/+bug/933640) hairpin mode is now enabled by default on all bridge ports. When using IPv6 an instance sees its own neighbour advertisement, because of the reflective property of the hairpin mode. Because of this the trigger-happy duplicate address detection in the instance's kernel deconfigures the IPv6 address on the interface, resulting in no IPv6 connectivity.

router advertisement received -> configures IPv6 on interface -> sends neighbour advertisement -> receives own neighbour advertisement -> removes IPv6 address from interface

From the syslog of instance 'hum': Jun 10 10:19:44 hum kernel: [ 150.028370] eth0: IPv6 duplicate address 2001:6f8:1477:1111:f816:3eff:fe20:48ad detected!

After disabling hairpin mode on the compute node for the vnet interface in use, the error disappeared and IPv6 connectivity came up.

lucas kauffman (lucas-kauffman) wrote :

I can confirm this problem.

Manu Sporny (msporny) wrote :

Confirmed here as well.

There can be issues with IPv6 failing to initialize in a VM that is running on a host machine which is bridging the network traffic to the host's physical ethernet port. The issue appears because of two reasons:

1. IPv6 has a duplicate address detection feature, where if it sees packets from the same IPv6 Link-Local (MAC-based) address as itself, it assumes that there is another box on the same network with the same MAC address.
2. With network bridging hairpin'ing turned on, all packets are reflected back to the VMs, so any IPv6 traffic is duplicated and sent back to the VM... which means that the IPv6 duplicate address detection code activates and the IPv6 subsystem skips initialization.

The bug has to do with two separate systems stomping on each other:

1. The bridge is misconfigured. Either it is in promiscuous mode, or the virtual network interface has hairpin'ing turned on. Hairpin'ing reflects all traffic, including IPv6 Neighbor Solicitation messages, back to the sender.
2. The Linux kernel IPv6 code, upon seeing the reflected IPv6 Neighbor Solicitation message, assumes the address is in use (when it really isn't) and doesn't bring the link up all the way as a result.

Detecting the issue
---------------------------

On the VM, run the following command to see if you have an IPv6 Duplicate Address Detection issue:

dmesg | grep duplicate

If you see a line like the following, you have the IPv6 DAD issue:

eth0: IPv6 duplicate address detected!

You can also issue the following command to see if you have any issues with your IPv6 links:

ip addr | grep tentative

If you see a line like the following, you most likely have the IPv6 DAD issue:

inet6 fe80::a816:aeff:be53:d00d/64 scope link tentative
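
The same check can be scripted. The following is a minimal Python sketch, assuming the `ip addr` output has been captured as text; the function name and the sample text are illustrative, and treating the `dadfailed` flag as a second indicator is an assumption added here, not something stated above.

```python
import re

# Matches address lines such as:
#   inet6 fe80::a816:aeff:be53:d00d/64 scope link tentative
INET6_LINE = re.compile(r"^\s*inet6\s+(?P<addr>\S+)\s+(?P<rest>.*)$")


def find_suspect_addresses(ip_addr_output):
    """Return IPv6 addresses flagged 'tentative' (or 'dadfailed', an assumption)."""
    suspects = []
    for line in ip_addr_output.splitlines():
        match = INET6_LINE.match(line)
        if not match:
            continue
        flags = match.group("rest").split()
        if "tentative" in flags or "dadfailed" in flags:
            suspects.append(match.group("addr"))
    return suspects


# Hypothetical captured output; only the inet6 line is taken from the thread.
sample = """\
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500
    inet 10.0.0.2/24 brd 10.0.0.255 scope global eth0
    inet6 fe80::a816:aeff:be53:d00d/64 scope link tentative
"""
print(find_suspect_addresses(sample))
```

An address that stays tentative long after the interface came up is the interesting case; a freshly configured address is briefly tentative even on a healthy link.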

Fixing the Issue
---------------------

There are two potential causes of the issue, each with its own fix:

1. The bridge interface is in promiscuous mode.
2. Hairpin'ing is turned on for the virtual network device and is thus reflecting IPv6 Neighbor Advertisement messages back to the sender.

To solve #1, turn promiscuous mode for the bridge device off by doing the following:

ifconfig br100 -promisc

To solve #2, you have to turn off hairpin'ing mode for the virtual network interface that is associated with the VM that is not able to set up a valid IPv6 link. So, on the OpenStack Compute Node that is hosting the VM, assuming that the bridge device is named 'br100' and the virtual network interface is named 'vnet0', you would run the following command to turn off hairpin'ing:

echo 0 > /sys/class/net/br100/brif/vnet0/hairpin_mode
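
For a host with many ports, the same sysfs files can be read and written from a script. This is a rough Python sketch, assuming the `/sys/class/net/<bridge>/brif/<port>/hairpin_mode` layout used in the command above; the function names and the `sysfs_root` parameter are made up for illustration, and writing to the real files requires root.

```python
import os


def hairpin_ports(bridge, sysfs_root="/sys/class/net"):
    """Return {port_name: hairpin_enabled} for every port of the bridge."""
    brif = os.path.join(sysfs_root, bridge, "brif")
    state = {}
    for port in sorted(os.listdir(brif)):
        with open(os.path.join(brif, port, "hairpin_mode")) as handle:
            state[port] = handle.read().strip() == "1"
    return state


def disable_hairpin(bridge, port, sysfs_root="/sys/class/net"):
    """Write 0 to the port's hairpin_mode file (same effect as the echo above)."""
    path = os.path.join(sysfs_root, bridge, "brif", port, "hairpin_mode")
    with open(path, "w") as handle:
        handle.write("0")


# Example use on a compute node (as root), with the names from this comment:
#   print(hairpin_ports("br100"))
#   disable_hairpin("br100", "vnet0")
```

Like the echo command, this change does not survive the port being re-created.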

If using OpenStack, to make the change more permanent, you can comment out the hairpin code in /usr/share/pyshared/nova/virt/libvirt/connection.py, starting at line 906. Specifically, comment out the "self._enable_hairpin(instance)" line.

Links
-------

* See Section '''14.2.2. Neighbor discovery''': http://tldp.org/HOWTO/html_single/Linux+IPv6-HOWTO/#EXAMPLES-TCPDUMP
* Hairpin'ing feature: http://www.networkworld.com/news/tech/2010/101223techupdate-vepa.html

Evan Callicoat (diopter) wrote :

I am the author of the hairpin_mode change and I recently had this bug brought to my attention, much to my chagrin!

I believe what is happening here is that when ICMPv6 sends certain messages (like Neighbor Solicitations in Duplicate Address Detection) it uses a multicast destination MAC address (33:33:xx:xx:xx:xx) in the ethernet frame sent to the host bridge. When the bridge receives the frame, given that it doesn't engage in IGMP snooping, it treats multicast MAC addresses just like the broadcast MAC address, and forwards the frame to all ports. With hairpin_mode enabled on the port the frame entered the bridge on, it will get copied back out that same port, resulting in the behavior seen above.
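
The flooding behavior described above can be illustrated with a toy model. This Python sketch is only an approximation under a stated assumption: a flooded frame is copied to every port except the one it arrived on, unless that port has hairpin mode enabled. It is not the kernel's bridge code, and the port names are hypothetical.

```python
def egress_ports(ports, ingress, hairpin_ports):
    """Ports that receive a copy of a flooded (multicast or broadcast) frame.

    Simplified model: skip the ingress port unless hairpin mode is on for it.
    """
    return [
        port for port in ports
        if port != ingress or port in hairpin_ports
    ]


def sender_sees_own_frame(ports, ingress, hairpin_ports):
    """True when the flooded frame is reflected back to its sender."""
    return ingress in egress_ports(ports, ingress, hairpin_ports)


ports = ["vnet0", "vnet1", "eth0"]  # hypothetical port names
print(sender_sees_own_frame(ports, "vnet0", hairpin_ports=set()))      # False
print(sender_sees_own_frame(ports, "vnet0", hairpin_ports={"vnet0"}))  # True
```

In the second case the VM behind vnet0 receives its own multicast frame, which is the condition that trips duplicate address detection.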

I believe the simplest approach to solving the problem without potentially breaking or altering any other behaviors is to add an nwfilter to libvirt which identifies this particular scenario and filters it, like this:

<filter name='no-mac-reflection' chain='ipv6'>
    <!-- drop if destination mac is v6 mcast mac addr and we sent it. -->
    <rule action='drop' direction='in'>
        <mac dstmacaddr='33:33:00:00:00:00' dstmacmask='ff:ff:00:00:00:00' srcmacaddr='$MAC'/>
    </rule>

    <!-- not doing anything with sending side ... -->
</filter>

I haven't tested this yet (so hopefully my syntax is correct there) but the idea is very simple: there's no normal scenario in which we should receive a v6 multicast frame originating from our own interface, so we can accurately identify this as a reflection to be dropped by a bridge filter.
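
To make the matching rule concrete, here is a small Python sketch of the same test the filter is meant to apply: the destination MAC falls under 33:33:00:00:00:00 with mask ff:ff:00:00:00:00, and the source MAC is the interface's own. The helper names and the example MAC addresses are illustrative assumptions; this models the intent of the rule, not how libvirt evaluates it.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to a 48-bit integer."""
    return int(mac.replace(":", ""), 16)


V6_MCAST_ADDR = mac_to_int("33:33:00:00:00:00")
V6_MCAST_MASK = mac_to_int("ff:ff:00:00:00:00")


def is_reflection(src_mac, dst_mac, own_mac):
    """True for an inbound frame that should be dropped as a reflection."""
    dst_is_v6_mcast = (mac_to_int(dst_mac) & V6_MCAST_MASK) == V6_MCAST_ADDR
    sent_by_us = mac_to_int(src_mac) == mac_to_int(own_mac)
    return dst_is_v6_mcast and sent_by_us


own = "fa:16:3e:20:48:ad"  # example MAC, assumed for illustration
print(is_reflection(own, "33:33:ff:20:48:ad", own))                  # True
print(is_reflection("fa:16:3e:00:00:01", "33:33:ff:20:48:ad", own))  # False
print(is_reflection(own, "fa:16:3e:00:00:01", own))                  # False
```

Only the first frame, a multicast frame carrying our own source MAC, is classified as a reflection; multicast from other hosts and unicast traffic pass.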

I should have a chance to test this at some point soon and if it works, see about where to submit new nwfilters, but in the meantime it'd be great if anyone affected by this bug could try dropping in a no-mac-reflection.xml containing the filter listed above and see if the problem is resolved!

Evan Callicoat (diopter) wrote :

Whoops, forgot to mention that nwfilter files go in /etc/libvirt/nwfilter, in case that wasn't well-understood

Changed in nova:
status: New → Confirmed
Thierry Carrez (ttx) on 2012-08-24
Changed in nova:
importance: Undecided → Medium
Takashi Sogabe (sogabe) on 2012-10-03
Changed in nova:
assignee: nobody → Takashi Sogabe (sogabe)

Fix proposed to branch: master
Review: https://review.openstack.org/14017

Changed in nova:
status: Confirmed → In Progress
tags: added: folsom-rc-potential

Reviewed: https://review.openstack.org/14017
Committed: http://github.com/openstack/nova/commit/0436cbdb882b532f0d01c41108508c6d4da3544e
Submitter: Jenkins
Branch: master

commit 0436cbdb882b532f0d01c41108508c6d4da3544e
Author: Takashi Sogabe <email address hidden>
Date: Wed Oct 3 17:19:20 2012 +0900

    handle IPv6 race condition due to hairpin mode

    bug 1011134

    When using IPv6 an instance sees its own neighbour advertisement,
    because of the reflective property of the hairpin mode.

    Because of this the trigger-happy duplicate address detection in
    the instance's kernel deconfigures the IPv6 address on the interface,
    resulting in no IPv6 connectivity.

    Approach of this commit is to to add an nwfilter to libvirt which
    identifies this particular scenario and filters it.

    Change-Id: I28f9b49cee4b2ab6ff591fae4feee623955f845f

Changed in nova:
status: In Progress → Fix Committed
Chuck Short (zulcss) on 2012-10-24
tags: removed: folsom-rc-potential
Thierry Carrez (ttx) on 2012-11-21
Changed in nova:
milestone: none → grizzly-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-04-04
Changed in nova:
milestone: grizzly-1 → 2013.1

We hit this bug in our Essex-based SUSE Cloud 1.0 deployment which does not use IPv6, so I think the patch is incomplete.
After reading http://wikibon.org/wiki/v/Edge_Virtual_Bridging I am rather sure that hairpin_mode (meant to be implemented in managed switches and working on Ethernet layer) is not the right approach to solve bug 933640 which is an IP-layer problem.
Btw: Did anyone manage to ping a floating IP from the VM it is assigned to? It did not work for me.

https://bugzilla.novell.com/show_bug.cgi?id=821879#c23
has a summary of what I debugged during the last week

Evan Callicoat (diopter) wrote :

The bugzilla you've linked to is down for maintenance, so I'm going to go at this blind to what you know or have done. Bear with me!

I thought the same thing when I first hit bug 933640 in a customer's deployment, where they absolutely refused to run split-DNS. They had a cluster of cattle (not pets), where any service could be running on any instance, and services only knew about each other through their global DNS hostnames, which were mapped to their floats, further distinguished by ports.

I specialize in Linux networking and spent about three days going over the issue on a whiteboard, working with Vish and other folks, and hairpinning was the simplest and most elegant solution I came up with at that time, and I'll tell you my reasoning.

First of all, I agree with your preliminary reading. Hairpin mode in the Linux kernel was implemented as part of a larger implementation of features to allow Linux to be a Virtual Ethernet Port Aggregator (VEPA), which is related to VEB as you mentioned. Hairpinning is absolutely an L2 functionality, and talking to your own float is indeed an L3 problem. However, getting out to your float and back in to a service that's actually listening on the same private IP you're sourcing from, without having to rewrite the client or service, is both an L2 and L3 issue.

The L3 portion is common; we need to DNAT on the way to a float (ingress initiated), and SNAT on the way from it (egress initiated), so the client thinks it's talking to the original service, and the translated service thinks the translator is the original client. For talking to our own float, we actually need to do both, but from the "back" (VM rather than public) side of the host: DNAT towards our float, which translates to our (private) IP as the new destination, then SNAT on the way back to ourself, so it looks like the traffic actually came *from* our float.
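
As a toy illustration of that double translation, the Python sketch below rewrites a packet a VM sends to its own float: DNAT replaces the float destination with the fixed IP, then SNAT replaces the fixed source with the float. The `Packet` type, the function names, and both IP addresses are hypothetical; this is not the nova-network iptables logic, only the address arithmetic it results in.

```python
from collections import namedtuple

Packet = namedtuple("Packet", ["src", "dst"])


def dnat(packet, float_ip, fixed_ip):
    """Rewrite a destination of float_ip to the instance's fixed IP."""
    if packet.dst == float_ip:
        return packet._replace(dst=fixed_ip)
    return packet


def snat(packet, float_ip, fixed_ip):
    """Rewrite a source of fixed_ip so traffic appears to come from the float."""
    if packet.src == fixed_ip:
        return packet._replace(src=float_ip)
    return packet


FIXED, FLOAT = "10.0.0.2", "203.0.113.10"  # hypothetical addresses
outbound = Packet(src=FIXED, dst=FLOAT)
delivered = snat(dnat(outbound, FLOAT, FIXED), FLOAT, FIXED)
print(delivered)  # Packet(src='203.0.113.10', dst='10.0.0.2')
```

The delivered packet has to return through the same bridge port it left on, which is where the L2 side of the problem comes in.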

However, this gives us an L2 issue now. Namely that with native Linux bridges and its bridging engine's netfilter interaction (which you can see here from one of the netfilter devs: http://inai.de/images/nf-packet-flow.png), the bridge won't let the same frame egress the same port it ingressed without hairpin_mode enabled. So, unless we jump to a separate router beyond the compute host, and make *it* hairpin (same exact issue; usually this is discouraged even when straight routing, google "split horizon" and "reverse path filtering"), this is where the traffic needs to go.

The iptables rules in nova-network in Diablo/Essex to DNAT/SNAT floats didn't have -s restrictions, and may or may not have -i/-o restrictions depending on nova.conf flags. This turned out to be fortuitous, because it meant that I could rely on the same rules for the usual two float NAT patterns I mentioned earlier, yet hit them from the back side, without changing any iptables rules.

This left only one more minor issue, which is that the SNAT wasn't being hit on the way back to the VM, because of the iptables rule designed to -j ACCEPT fixed -> fixed VM traffic, short-circuiting before the float SNAT rules. So I had to make one minor change there; I added -m conntrack ! --ctstate DNAT (example from folsom, since esse...


I made a patch for this bug https://review.openstack.org/45389
