OpenStack Compute (nova)

ARP table removed for br0 on host when node terminated

Bug #908194 reported by Steven Dake on 2011-12-23

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
OpenStack Compute (nova)	Invalid	Medium	Unassigned

Bug Description

openstack-nova-2011.3-13.fc16.noarch

Test case:
Fedora 16 Guest
Fedora 16 Host
Start two test guests
ping the bridge gateway from each guest

terminate one of the virtual machines with euca-terminate-instances

The remaining VM takes 30-60 seconds before network connectivity is restored. Immediately after termination on the host:
[root@beast net]# cat /proc/net/arp
IP address HW type Flags HW address Mask Device
192.168.1.1 0x1 0x2 00:1d:7e:c1:2a:81 * wlan0

It appears that terminating qemu clears the ARP cache.

After 30-60 seconds have elapsed:
[root@beast net]# cat /proc/net/arp
IP address HW type Flags HW address Mask Device
192.168.1.1 0x1 0x2 00:1d:7e:c1:2a:81 * wlan0
10.0.0.3 0x1 0x2 02:16:3e:62:ee:4b * br0
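The check above can be automated. The following is a minimal sketch, assuming the usual six whitespace-separated columns of /proc/net/arp; the function names arp_entries and read_arp are hypothetical and not part of nova.

```python
def arp_entries(table_text, device):
    """Return (ip, mac) pairs for one device from /proc/net/arp-style text."""
    entries = []
    for line in table_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or malformed rows
        ip, _hw_type, _flags, mac, _mask, dev = fields
        if dev == device:
            entries.append((ip, mac))
    return entries


def read_arp(device, path="/proc/net/arp"):
    """Read the live ARP table (Linux only) and filter it by device."""
    with open(path) as f:
        return arp_entries(f.read(), device)
```

An empty result from read_arp("br0") right after the guest is terminated would correspond to the first listing above; the 10.0.0.3 entry reappearing would correspond to the second.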

The vnets:
vnet0 Link encap:Ethernet HWaddr FE:16:3E:3F:89:E3
vnet1 Link encap:Ethernet HWaddr FE:16:3E:62:EE:4B

tcpdump of br0 shows:

At the same time as the VM is killed:
09:02:49.498818 ARP, Request who-has 10.0.0.3 tell reserved, length 28
09:02:49.499612 ARP, Reply 10.0.0.3 is-at 02:16:3e:62:ee:4b (oui Unknown), length 28

Note "02:..." is the guest's MAC address.

Some time later (30-60 seconds) the network comes back when this ARP exchange is seen:
09:03:34.499680 ARP, Request who-has reserved tell 10.0.0.3, length 28
09:03:34.499722 ARP, Reply reserved is-at fe:16:3e:62:ee:4b (oui Unknown), length 28

Note "fe:...." is the MAC address of the guest's vnet device.

Then the networking is active again.

start two

Steven Dake (sdake) wrote on 2011-12-23:

Note I found the root cause of this issue to be that the MAC address of br0 bounces around the vnetXX devices. If the active vnet device has the same MAC address as br0, it will flush the arp table and result in loss of connectivity for 30-60 seconds.
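To illustrate why the bridge MAC moves, here is a small model of the selection rule. It is only a sketch: it assumes the bridge adopts the numerically lowest MAC among its attached ports, and falls back to all zeros when no port is attached. That is how the Linux bridge behaves when no address has been set explicitly, as far as I know, but the helper names are hypothetical and nothing here touches a real bridge.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer for numeric comparison."""
    return int(mac.replace(":", ""), 16)


def bridge_mac(port_macs):
    """Assumed rule: lowest port MAC wins; all zeros if there are no ports."""
    macs = list(port_macs)
    if not macs:
        return "00:00:00:00:00:00"
    return min(macs, key=mac_to_int).lower()


# vnet MACs taken from the ifconfig output in the bug description.
ports = {"vnet0": "FE:16:3E:3F:89:E3", "vnet1": "FE:16:3E:62:EE:4B"}
before = bridge_mac(ports.values())
del ports["vnet0"]  # the first guest is terminated
after = bridge_mac(ports.values())
print(before, "->", after)
```

Under this model br0 holds vnet0's address while both guests run and switches to vnet1's address once vnet0 is removed, which lines up with the "is-at fe:16:3e:62:ee:4b" reply in the tcpdump above.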

This does raise the problem of what to do when the eth0/em1 interface is downed. I proposed a solution here:

http://lists.fedoraproject.org/pipermail/cloud/2011-December/001116.html

Brian Waldon (bcwaldon) wrote on 2012-01-27:

Justin pointed out that the bug he filed may be related to this: https://bugs.launchpad.net/nova/+bug/921838. What do you think, Steven?

Changed in nova:
status:	New → Incomplete

Steven Dake (sdake) wrote on 2012-01-27:

Brian,

Yes, his bug is the same, but his workaround is invalid. What is happening is that the MAC address of the bridge is being set to 00:00:00:00:00:00 when the first vnet is removed from the bridge. This causes the bridge to pick a new vnet, which triggers a network outage of 30-60 seconds for the bridge.

This problem at one point also existed in libvirt. The workaround they used was to insert a "dummy" vnic into the bridge at system startup. See:

http://lists.fedoraproject.org/pipermail/cloud/2012-January/001125.html

Regards
-steve
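
A rough sketch of the dummy-port idea described above, for illustration only. It assumes the bridge takes the lowest MAC among its ports, so a permanent port with a MAC below every vnet MAC would keep the bridge address from changing. The device name dummy0, the MAC 52:54:00:00:00:01, and the helper names are hypothetical; the code only builds the command lines and does not run them.

```python
def mac_to_int(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to an integer for numeric comparison."""
    return int(mac.replace(":", ""), 16)


def pins_bridge_mac(dummy_mac, vnet_macs):
    """True if the dummy MAC sorts below every vnet MAC (assumed selection rule)."""
    return all(mac_to_int(dummy_mac) < mac_to_int(m) for m in vnet_macs)


def dummy_port_commands(bridge, dummy, dummy_mac):
    """Build (not execute) commands that add a permanent dummy port to a bridge."""
    return [
        ["ip", "link", "add", dummy, "type", "dummy"],
        ["ip", "link", "set", "dev", dummy, "address", dummy_mac],
        ["brctl", "addif", bridge, dummy],
        ["ip", "link", "set", "dev", dummy, "up"],
    ]


vnets = ["FE:16:3E:3F:89:E3", "FE:16:3E:62:EE:4B"]
if pins_bridge_mac("52:54:00:00:00:01", vnets):
    for argv in dummy_port_commands("br0", "dummy0", "52:54:00:00:00:01"):
        print(" ".join(argv))
```

Since the vnet devices in this report all carry an fe: prefix, almost any ordinary MAC on the dummy port would sort lower; the check is there to make that assumption explicit.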

Thierry Carrez (ttx) on 2012-01-30

Changed in nova:
status:	Incomplete → New

Russell Bryant (russellb) on 2012-02-16

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Medium

Dan Smith (danms) wrote on 2012-08-09:

Is this bug still valid? I'm unable to reproduce it on current master...

Steven Dake (sdake) wrote on 2012-08-09:

Dan,

Bug remains.

A simple reproducer:
Start one VM on a fresh system (no vnet shown with ifconfig)

Start a second VM on that same system
ssh into the second VM and run ping www.news.com

Terminate the first vm

The ping stalls for about 35 seconds. My output from the second VM is:
42 time=32.9 ms
64 bytes from phx1-rb-gtm3-tron-xw-lb.cnet.com (64.30.224.82): icmp_req=16 ttl=242 time=32.8 ms
64 bytes from phx1-rb-gtm3-tron-xw-lb.cnet.com (64.30.224.82): icmp_req=17 ttl=242 time=32.8 ms
64 bytes from phx1-rb-gtm3-tron-xw-lb.cnet.com (64.30.224.82): icmp_req=53 ttl=242 time=35.5 ms
64 bytes from phx1-rb-gtm3-tron-xw-lb.cnet.com (64.30.224.82): icmp_req=54 ttl=242 time=33.6 ms

Notice the delay happens between icmp_req 17 and icmp_req 53.
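
For anyone trying to reproduce this, the gap can be pulled out of saved ping output mechanically. This is a minimal sketch, assuming ping's default one-second interval so that each missing sequence number is roughly one second of outage; the function name sequence_gaps is hypothetical. The pattern accepts both icmp_req (as printed above) and icmp_seq (as printed by other ping versions).

```python
import re

SEQ_RE = re.compile(r"icmp_(?:req|seq)=(\d+)")


def sequence_gaps(ping_output):
    """Return (last_seen, next_seen) pairs wherever sequence numbers are skipped."""
    seqs = [int(m.group(1)) for m in SEQ_RE.finditer(ping_output)]
    return [(a, b) for a, b in zip(seqs, seqs[1:]) if b - a > 1]


sample = """\
64 bytes from 64.30.224.82: icmp_req=16 ttl=242 time=32.8 ms
64 bytes from 64.30.224.82: icmp_req=17 ttl=242 time=32.8 ms
64 bytes from 64.30.224.82: icmp_req=53 ttl=242 time=35.5 ms
64 bytes from 64.30.224.82: icmp_req=54 ttl=242 time=33.6 ms
"""
for last_seen, next_seen in sequence_gaps(sample):
    print("lost", next_seen - last_seen - 1, "replies after icmp_req", last_seen)
```

With the output above this reports 35 lost replies after icmp_req 17, matching the roughly 35-second stall.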

Dan Smith (danms) wrote on 2012-08-10:

Yeah, that's what I did before asking if it's still a problem :) I don't see the outage.

Are you running a bridge with no physical nic attached and NAT'ing out of your guests? I'm not, but perhaps that's the difference. If so, I'll try to reproduce it that way.

Dan Smith (danms) wrote on 2013-03-15:

Marking as incomplete without more information on how to reproduce (since I can't)

Changed in nova:
status:	Confirmed → Incomplete

Davanum Srinivas (DIMS) (dims-v) wrote on 2013-03-18:

This bug lacks the necessary information to effectively reproduce and fix it, therefore it has been closed. Feel free to reopen the bug by providing the requested information and set the bug status back to ''New''.

Changed in nova:
status:	Incomplete → Invalid
