The virtual network is broken on the node after neutron-openvswitch-agent is restarted if RPC requests return an error for a while.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Opinion | Undecided | Unassigned |
Bug Description
We ran into a problem in our OpenStack cluster: traffic does not go through the virtual network on the node on which the neutron-openvswitch-agent was restarted.
We were updating from one version of OpenStack to another and, by chance, ended up with an inconsistency between the DB and neutron-server: any port select from the DB returned an error.
For a while, neutron-server returned an error for every RPC request from the agents.
But after the database was fixed, we got a broken virtual network on the node where the neutron-openvswitch-agent had been restarted while the errors were occurring.
It seems to me that I have found the problematic place in the logic of neutron-ovs-agent.
To demonstrate it, the easiest way is to emulate a failing RPC request from neutron-ovs-agent to neutron-server.
Here are the steps to reproduce on devstack setup from the master branch.
Two nodes: node0 is controller, node1 is compute.
0) Prepare a VXLAN-based network and a VM.
[root@node0 ~]# openstack network create net1
[root@node0 ~]# openstack subnet create sub1 --network net1 --subnet-range 192.168.1.0/24
[root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny --image cirros-
Just after creating the VM, there is a message in the devstack@q-agt logs:
Nov 16 09:53:35 node1 neutron-openvswitch-agent[...]: ... Assigning 1 as local vlan for net-id=710bcfcd-...
So, the local vlan used on node1 for this network is `1`.
A ping from node0 to the VM on node1 works:
[root@node0 ~]# ip netns exec qdhcp-710bcfcd-... ping 192.168.1.211
PING 192.168.1.211 (192.168.1.211) 56(84) bytes of data.
64 bytes from 192.168.1.211: icmp_seq=1 ttl=64 time=1.86 ms
64 bytes from 192.168.1.211: icmp_seq=2 ttl=64 time=0.891 ms
1) Now, please don't misunderstand me: I am not patching the code just to show that something will obviously break.
I only want to emulate a problem that is hard to reproduce in a normal way, but that can happen.
So, let's emulate a failure where the RPC-based method get_resource_by_id returns an error just after the neutron-ovs-agent restart:
[root@node1 neutron]# git diff
diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
index 9a133afb07..
--- a/neutron/agent/rpc.py
+++ b/neutron/agent/rpc.py
@@ -327,6 +327,11 @@ class CacheBackedPluginApi(PluginApi):
 
     def get_device_details(self, context, device, agent_id, host=None):
+        import time
+        if not hasattr(self, '_stime'):
+            self._stime = time.time()
+        if self._stime + 5 > time.time():
+            raise Exception('Emulate RPC error in get_resource_by_id call')
         port_obj = self.remote_resource_cache.get_resource_by_id(
             resources.PORT, device)
         if not port_obj:
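The idea of the patch: every get_device_details call raises for the first 5 seconds after the agent process starts, so the agent performs its initial sync while RPC is failing, and after that RPC "recovers" by itself. This approximates what happened in our cluster while the DB was inconsistent.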
Restart the neutron-openvswitch-agent and try the ping again:
[root@node1 ~]# systemctl restart devstack@q-agt
[root@node0 ~]# ip netns exec qdhcp-710bcfcd-... ping 192.168.1.234
PING 192.168.1.234 (192.168.1.234) 56(84) bytes of data.
--- 192.168.1.234 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1058ms
[root@node0 ~]#
Ping doesn't work.
Just after the neutron-ovs-agent restart, once RPC starts working correctly, the following log messages appear:
Nov 16 09:55:13 node1 neutron-openvswitch-agent[...]: ... Assigning 2 as local vlan for net-id=710bcfcd-...
Nov 16 09:55:13 node1 neutron-openvswitch-agent[...]: ... Initializing port ... that was already initialized.
Nov 16 09:55:13 node1 neutron-openvswitch-agent[...]: ...
So, `Assigning 2 as local vlan` is followed by `Initializing port ... that was already initialized.`
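As a side check (the port name below is a placeholder for whatever tap device belongs to vm1), the tag that the agent actually wrote for the port into the OVS database can be read directly; after the restart it should show the newly assigned local vlan 2:
[root@node1 ~]# ovs-vsctl get Port <tap-device-of-vm1> tag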
2) Using pyrasite, an eventlet backdoor was set up, and I can see that in the internal structures of the OVSFirewallDriver the `vlan_tag` of the port is still `1` instead of `2`:
>>> import gc
>>> from neutron.agent.linux.openvswitch_firewall.firewall import OVSFirewallDriver
>>> for ob in gc.get_objects():
... if isinstance(ob, OVSFirewallDriver):
... break
...
>>> ob.sg_port_map.ports['<port-id>'].vlan_tag
1
>>>
So, the OVSFirewallDriver still thinks that the port has local vlan 1, although at the ovs_neutron_agent level local vlan 2 was assigned.
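The failure mode can be summarized with a minimal sketch (illustrative only, not neutron code; FirewallCache is a made-up stand-in for the firewall driver's port map): the local vlan of a port is recorded when the port is first initialized, and a later re-initialization with a new vlan does not refresh the cached value.

class FirewallCache(object):
    """Illustrative stand-in for the firewall driver's port map."""
    def __init__(self):
        self.ports = {}  # port_id -> vlan_tag recorded at init time

    def initialize_port(self, port_id, vlan_tag):
        if port_id in self.ports:
            # Mirrors "Initializing port ... that was already initialized":
            # the old entry, with the stale vlan, is kept.
            return
        self.ports[port_id] = vlan_tag

cache = FirewallCache()
cache.initialize_port('port-1', 1)  # first pass: RPC failing, old tag 1 from OVS
cache.initialize_port('port-1', 2)  # retry after RPC recovers, new local vlan 2
print(cache.ports['port-1'])        # prints 1 -> flows keep matching the wrong vlan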
tags: added: ovs
Changed in neutron:
status: New → Opinion
Hello Anton:
The issue you are describing is not related to the OVS agent itself. When the OVS agent is restarted, it requests the network, port and SG information from the server. Once this information is retrieved, the OVS agent recreates the OF rule set in the local switch.
If, due to an error external to the OVS agent, the RPC communication fails, the OF rule state is undetermined.
The VLAN tags used by the OVS agent are locally assigned. Each time the OVS agent starts, it creates a local map between the network segmentation IDs (external VLAN tags or tunnel segmentation IDs) and the local VLAN tags. This map is built locally, and the mapping between external and internal tags is not necessarily the same from one run to the next. That means that when you restart the OVS agent, the port VLAN tags can change, as you experienced; a minimal sketch after this reply illustrates it.
Regards.
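To make the last point concrete, here is a minimal sketch (simplified and assumed; LocalVlanAllocator is illustrative, not the actual agent code) of an order-dependent local vlan allocator:

import itertools

class LocalVlanAllocator(object):
    """Illustrative: hands out local vlan tags in first-seen order."""
    def __init__(self):
        self._next_tag = itertools.count(1)
        self.map = {}  # (network_type, segmentation_id) -> local vlan tag

    def allocate(self, network_type, segmentation_id):
        key = (network_type, segmentation_id)
        if key not in self.map:
            self.map[key] = next(self._next_tag)
        return self.map[key]

# First agent run: the vxlan segment happens to be seen first.
allocator = LocalVlanAllocator()
assert allocator.allocate('vxlan', 101) == 1

# After a restart, a fresh allocator may see another network first,
# so the same segment ends up with a different local tag.
allocator = LocalVlanAllocator()
assert allocator.allocate('vxlan', 202) == 1
assert allocator.allocate('vxlan', 101) == 2

Because the tag depends only on the order in which networks are first seen, any component that caches a tag across re-initialization (like the firewall's vlan_tag above) can silently go stale.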