Comment 2 for bug 1596473

wondra (wondra) wrote : Re: Packet loss with DVR and IPv6

Hi!
I still haven't upgraded to Mitaka, but I have some more insight into this. It also affects IPv4.
Story:
A customer complained about connectivity issues: pings to his instance showed about 2% packet loss. I watched the forwarding table of Open vSwitch on a compute node with DVR:
watch --differences=permanent -n0.1 "ovs-appctl fdb/show br-int | grep fa:16:3e:a5:d8:e7"
...where the MAC belongs to the .1 address of the distributed router, the one that exists on every compute and network node.

From time to time, the port number learned for that MAC jumped to another port and back again. This coincided with the lost pings.
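To log the flips instead of watching for them by eye, something like the following could work. It is only a sketch: the function name detect_fdb_flips is made up for illustration, and it assumes that fdb/show prints rows as "port VLAN MAC Age", which is what I see here but may differ between Open vSwitch versions.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open vSwitch): reads repeated
# "ovs-appctl fdb/show" snapshots on stdin and prints a line whenever
# the port learned for the given MAC changes within a VLAN.
# Assumed row layout: port, VLAN, MAC, age (whitespace separated).
detect_fdb_flips() {
    mac="$1"
    awk -v mac="$mac" '
        tolower($3) == tolower(mac) {
            vlan = $2
            # Report only when a previously seen port differs from the new one.
            if ((vlan in last) && last[vlan] != $1) {
                printf "vlan %s: port %s -> %s\n", vlan, last[vlan], $1
                fflush()
            }
            last[vlan] = $1
        }'
}

# Possible usage on a compute node (polls roughly every 0.1 s):
#   while sleep 0.1; do ovs-appctl fdb/show br-int; done \
#       | detect_fdb_flips fa:16:3e:a5:d8:e7
```

Each reported line can then be matched against the timestamps of the lost pings. Header lines and entries for other MACs are ignored because their third column does not match.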

I thought that enabling l2population could solve the issue, but alas, that only populates the br-tun bridge, not br-int. (!)

Then I ran tcpdump in the router namespace and on the instance's iptables bridge:
ip netns exec qrouter-ba8c8b17-5649-474b-ac81-4960c2358611 tcpdump -i qr-2f1aa754-89 -ln ether host fa:16:3e:a5:d8:e7
tcpdump -i qbre6b1046f-7c -ln ether host fa:16:3e:a5:d8:e7

a) The ping requests showed up in both captures; the reply was missing only in the router namespace.
b) Around the time of the lost ping, I saw a connection attempt to another IP address (a TCP SYN), even on the instance's bridge. It was flooded from another compute node, which flipped the Open vSwitch forwarding table and, for a short time, caused packets from all nodes in the cloud to go to the node that had flooded the packet.

Steps to reproduce:
1. On a compute node cmp01, run an instance and start pinging its floating IP from the outside (not a requirement, but the traffic needs to pass through DVR).
2. On a compute node cmp02, run an instance and then stop it (shutoff state).
3. Ping the floating IP of the shutoff instance.
4. Observe the flooded packets, the flipping forwarding table, and the packet loss.

My original bug report and this one are closely related. Both are caused by duplicate MAC addresses of the router in the DVR model. l2population does not save the day because the conflict happens on br-int, not br-tun.
Is this a design flaw of DVR?