[DVR] Recovery from openvswitch restart fails when veth are used for bridges interconnection

Bug #1877977 reported by Slawek Kaplonski
This bug affects 2 people
Affects: neutron
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

In the case of DVR routers, when use_veth_interconnection is set to True and the openvswitch service is restarted, recovery from the restart is not done correctly and FIPs are unreachable until neutron-ovs-agent is restarted.

Everything works fine when patch ports are used for the interconnection.
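
A quick way to confirm which interconnection mode is in use (assuming the usual br-int/br-ex bridge names) is to inspect the ports with ovs-vsctl; with patch ports the interfaces show type "patch" and a peer option, with veths they are plain interfaces:

$ sudo ovs-vsctl show | grep -A3 int-br-ex
# patch-port mode shows:   Interface int-br-ex
#                              type: patch
#                              options: {peer=phy-br-ex}
# veth mode shows no patch type; the endpoints appear in "ip link" instead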

Bence Romsics (bence-romsics) wrote :

I managed to reproduce this and noticed that the reproduction is nondeterministic: sometimes connectivity recovers after the ovs restart, other times it does not. Both outcomes are frequent, so either can easily be caught.

For the record, this is the exact reproduction:

# the default is to not use veth pairs, check that we don't have them at start
$ sudo ip l | egrep phy-br-ex
[nothing]

# change the config to use veth interconnections
$ vim /etc/neutron/plugins/ml2/ml2_conf.ini
[ovs]
use_veth_interconnection = True

$ sudo systemctl restart devstack@neutron-agent

# now we have veth interconnections
$ sudo ip l | egrep phy-br-ex
37: phy-br-ex@int-br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000
38: int-br-ex@phy-br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc noqueue master ovs-system state UP mode DEFAULT group default qlen 1000
$ sudo ethtool -S phy-br-ex
NIC statistics:
     peer_ifindex: 38

# boot vm with floating ip
$ openstack server create vm0 --flavor cirros256 --image cirros-0.4.0-x86_64-disk --nic net-id=private --wait
$ openstack floating ip create --port "$( openstack port list --device-id "$( openstack server show vm0 -f value -c id )" -f value -c id | head -1 )" public -f value -c floating_ip_address
172.24.4.211

# start ping and keep it running, while...
$ ping 172.24.4.211

# ... we restart ovs
$ sudo systemctl restart openvswitch-switch

In some cases ping recovers in a few seconds. In other cases it never recovers.
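
The dump files diffed below were presumably captured with ovs-ofctl before and after the restart; a sketch of how to collect them (the .0/.1 file names are assumptions matching the diff commands) is:

$ sudo ovs-ofctl dump-flows br-int > dump-flows.br-int.0   # before the ovs restart
$ sudo ovs-ofctl dump-flows br-ex > dump-flows.br-ex.0
$ sudo ovs-ofctl dump-flows br-int > dump-flows.br-int.1   # after restart, once ping fails to recover
$ sudo ovs-ofctl dump-flows br-ex > dump-flows.br-ex.1
# depending on the protocols configured on the bridge, -O OpenFlow13 may be needed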

flow diff for br-int (.0 is the working state before ovs restart, .1 is when ping did not recover):

# diff -u <( cat dump-flows.br-int.0 | cut -d ' ' -f4,8- | sort ) <( cat dump-flows.br-int.1 | cut -d ' ' -f4,8- | sort )
--- /dev/fd/63 2020-05-18 13:25:50.235895198 +0000
+++ /dev/fd/62 2020-05-18 13:25:50.239895241 +0000
@@ -4,8 +4,12 @@
 table=0, priority=10,icmp6,in_port=18,icmp_type=136 actions=resubmit(,24)
 table=0, priority=2,in_port=23 actions=drop
 table=0, priority=2,in_port=24 actions=drop
-table=0, priority=3,in_port=23,vlan_tci=0x0000/0x1fff actions=mod_vlan_vid:2,resubmit(,60)
-table=0, priority=3,in_port=24,dl_vlan=100 actions=mod_vlan_vid:3,resubmit(,60)
+table=0, priority=2,in_port=41 actions=drop
+table=0, priority=2,in_port=42 actions=drop
+table=0, priority=2,in_port=43 actions=drop
+table=0, priority=2,in_port=ANY actions=drop
+table=0, priority=3,in_port=43,dl_vlan=100 actions=mod_vlan_vid:3,resubmit(,60)
+table=0, priority=3,in_port=ANY,vlan_tci=0x0000/0x1fff actions=mod_vlan_vid:2,resubmit(,60)
 table=0, priority=5,in_port=23,dl_dst=fa:16:3f:ca:bf:17 actions=resubmit(,4)
 table=0, priority=5,in_port=24,dl_dst=fa:16:3f:ca:bf:17 actions=resubmit(,4)
 table=0, priority=5,in_port=3,dl_dst=fa:16:3f:ca:bf:17 actions=resubmit(,3)
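
To correlate the in_port numbers above with actual interfaces, the current OpenFlow port assignments can be listed with standard OVS tooling (a generic check, not part of the original comment):

$ sudo ovs-ofctl show br-int                      # lists the current ofport of every port
$ sudo ovs-vsctl get Interface int-br-ex ofport
# a value of -1 means the port is not attached, which may explain the in_port=ANY flows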

flow diff for br-ex:

# diff -u <( cat dump-flows.br-ex.0 | cut -d ' ' -f4,8- | sort ) <( cat dump-flows.br-ex.1 | cut -d ' ' -f4,8- | sort )
--- /dev/fd/63 2020-05-18 13:27:07.036710753 +0000
+++ /dev/fd/62 2020-05-18 13:27:07.036710753 +0000
@@ -1,8 +1,10 @@

 table=0, priority=0 actions=NORMAL
 table=0, priority=1 actions=resubmit(,3)
+table=0, priority=2,in_port=14 acti...
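
As noted in the bug description, connectivity only comes back after the OVS agent is restarted; on this devstack setup that is the same service restarted in the reproduction above:

# workaround until recovery works: restart the agent so it resyncs and reinstalls the flows
$ sudo systemctl restart devstack@neutron-agent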

