The virtual network is broken on the node after neutron-openvswitch-agent is restarted if RPC requests return an error for a while.

Bug #1996788 reported by Anton Kurbatov
Affects: neutron
Status: Opinion
Importance: Undecided
Assigned to: Unassigned

Bug Description

We ran into a problem in our OpenStack cluster: traffic does not pass through the virtual network on the node where neutron-openvswitch-agent was restarted.
We were upgrading from one OpenStack version to another and, by chance, ended up with an inconsistency between the DB and neutron-server: any port select from the DB returned an error.
For a while after its restart, neutron-openvswitch-agent could not get any information via RPC in its rpc_loop iterations because of this DB/neutron-server inconsistency.
But even after the database was fixed, the virtual network remained broken on the node where neutron-openvswitch-agent had been restarted.

It seems to me that I have found the problematic place in the neutron-ovs-agent logic.
To demonstrate it, the easiest way is to emulate a failing RPC request from neutron-ovs-agent to neutron-server.

Here are the steps to reproduce on a devstack setup from the master branch.
Two nodes: node0 is the controller, node1 is a compute node.

0) Prepare a vxlan based network and a VM.
[root@node0 ~]# openstack network create net1
[root@node0 ~]# openstack subnet create sub1 --network net1 --subnet-range 192.168.1.0/24
[root@node0 ~]# openstack server create vm1 --network net1 --flavor m1.tiny --image cirros-0.5.2-x86_64-disk --host node1

Just after creating the VM, there is a message in the devstack@q-agt logs:

Nov 16 09:53:35 node1 neutron-openvswitch-agent[374810]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-77753b72-cb23-4dae-b68a-7048b63faf8b None None] Assigning 1 as local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466

So, the local VLAN used on node1 for this network is `1`.
A ping from node0 to the VM on node1 works:

[root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping 192.168.1.211
PING 192.168.1.211 (192.168.1.211) 56(84) bytes of data.
64 bytes from 192.168.1.211: icmp_seq=1 ttl=64 time=1.86 ms
64 bytes from 192.168.1.211: icmp_seq=2 ttl=64 time=0.891 ms

1) Now, please don't misunderstand me: I'm not patching the code just to show that something will obviously break after that;
I only want to emulate a problem that is hard to reproduce in a normal way, but which can still happen.
So, emulate get_device_details (which fetches the port via get_resource_by_id, an RPC-backed method) returning an error just after the neutron-ovs-agent restart:

[root@node1 neutron]# git diff
diff --git a/neutron/agent/rpc.py b/neutron/agent/rpc.py
index 9a133afb07..299eb25981 100644
--- a/neutron/agent/rpc.py
+++ b/neutron/agent/rpc.py
@@ -327,6 +327,11 @@ class CacheBackedPluginApi(PluginApi):

     def get_device_details(self, context, device, agent_id, host=None,
                            agent_restarted=False):
+        import time
+        if not hasattr(self, '_stime'):
+            self._stime = time.time()
+        if self._stime + 5 > time.time():
+            raise Exception('Emulate RPC error in get_resource_by_id call')
         port_obj = self.remote_resource_cache.get_resource_by_id(
             resources.PORT, device, agent_restarted)
         if not port_obj:
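
For context on how such a failure is consumed: the agent fetches port details through get_devices_details_list_and_failed_devices, and with CacheBackedPluginApi a device whose details call raises is reported as a failed device instead of aborting rpc_loop; roughly like this simplified sketch (not the exact neutron source):

def get_devices_details_list_and_failed_devices(plugin_api, context, devices,
                                                agent_id, host=None):
    # Simplified: collect details per device; a device whose RPC call raises
    # is recorded as failed and retried in a later rpc_loop iteration.
    result = {'devices': [], 'failed_devices': []}
    for device in devices:
        try:
            result['devices'].append(
                plugin_api.get_device_details(context, device, agent_id, host))
        except Exception:
            result['failed_devices'].append(device)
    return result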

Restart the neutron-openvswitch-agent and try to ping after 1-2 minutes:

[root@node1 ~]# systemctl restart devstack@q-agt

[root@node0 ~]# ip netns exec qdhcp-710bcfcd-44d9-445d-a895-8ec522f64016 ping -c 2 192.168.1.234
PING 192.168.1.234 (192.168.1.234) 56(84) bytes of data.

--- 192.168.1.234 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1058ms

[root@node0 ~]#

Ping doesn't work.
Just after the neutron-ovs-agent restart, once RPC starts working correctly again, the logs show:

Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Assigning 2 as local vlan for net-id=710bcfcd-44d9-445d-a895-8ec522f64016, seg-id=466
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.agent.securitygroups_rpc [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Preparing filters for devices {'40d82f69-274f-4de5-84d9-6290159f288b'}
Nov 16 09:55:13 node1 neutron-openvswitch-agent[375032]: INFO neutron.agent.linux.openvswitch_firewall.firewall [None req-135ae96d-905e-485f-8c1f-b0a70616b4c7 None None] Initializing port 40d82f69-274f-4de5-84d9-6290159f288b that was already initialized.

So, `Assigning 2 as local vlan` is followed by `Initializing port ... that was already initialized.`

2) Using pyrasite, I set up the eventlet backdoor and found that, in OVSFirewallDriver's internal structures, the port's `vlan_tag` is still `1` instead of `2`:

>>> import gc
>>> from neutron.agent.linux.openvswitch_firewall.firewall import OVSFirewallDriver
>>> for ob in gc.get_objects():
...     if isinstance(ob, OVSFirewallDriver):
...         break
...
>>> ob.sg_port_map.ports['40d82f69-274f-4de5-84d9-6290159f288b'].vlan_tag
1
>>>

So, the OVSFirewallDriver still thinks the port has local VLAN 1, although at the ovs_neutron_agent level local VLAN 2 was assigned.
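
As a hypothetical cross-check from node1 itself (assuming the usual naming scheme where the tap device is "tap" plus the first 11 characters of the port ID), one could compare the tag in OVSDB with the cached one:

import subprocess

port_id = '40d82f69-274f-4de5-84d9-6290159f288b'
tap = 'tap' + port_id[:11]  # assumed device name: tap40d82f69-27

# Tag actually set on the OVS port by ovs_neutron_agent after the restart.
ovs_tag = subprocess.check_output(
    ['ovs-vsctl', 'get', 'Port', tap, 'tag']).decode().strip()
print('tag in OVSDB:', ovs_tag)              # 2 after the restart in this run
print('tag cached by OVSFirewallDriver: 1')  # the stale vlan_tag seen above

Since the firewall driver builds its OpenFlow rules around this cached vlan_tag, a stale value would explain why traffic tagged with the new local VLAN no longer matches them.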

Tags: ovs
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Anton:

The issue you are describing is not related to the OVS agent itself. When the OVS agent is restarted, it requests the network, port and SG information from the server. Once this information is retrieved, the OVS agent recreates the OF rule set in the local switch.

If the RPC communication fails due to an error external to the OVS agent, the OF rule state is undetermined.

The VLAN tags used by the OVS agent are locally assigned. Each time the OVS agent is started, it creates a local map between the network segmentation ID (external VLAN tags or tunnel segmentation IDs) and the local VLAN tags. This map is built locally and the map between external and internal tags is not always the same. That means when you restart the OVS agent, the port VLAN tags can change, as you experienced.

Regards.
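
To illustrate that point, here is a toy sketch (not the real neutron code; the second network in the second run is purely hypothetical) of why the same segmentation ID can end up on a different local VLAN after a restart:

import itertools

class LocalVlanAllocator:
    # Toy model: local tags are handed out in the order networks are seen.
    def __init__(self):
        self._next = itertools.count(1)
        self.map = {}  # (network_type, segmentation_id) -> local vlan

    def assign(self, network_type, seg_id):
        key = (network_type, seg_id)
        if key not in self.map:
            self.map[key] = next(self._next)
        return self.map[key]

run1 = LocalVlanAllocator()            # first agent run
assert run1.assign('vxlan', 466) == 1  # matches "Assigning 1 as local vlan"

run2 = LocalVlanAllocator()            # after the restart the map is rebuilt
run2.assign('vxlan', 999)              # hypothetical: another network seen first
assert run2.assign('vxlan', 466) == 2  # matches "Assigning 2 as local vlan"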

Anton Kurbatov (akurbatov) wrote :

Hello Rodolfo,
Yes, I got it.
As a fix for this bug, it might be worth keeping _local_vlan_hints if some of the ports could not be handled.
For example, do not call _dispose_local_vlan_hints if there are failed_devices:
https://opendev.org/openstack/neutron/src/commit/bf44e70db6219e7f3a45bd61b7dd14a31ae33bb0/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2822

Or, at the OVSFirewallDriver level, somehow detect whether the local VLAN has changed.
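
A rough sketch of the first option, assuming failed_devices is the {'added', 'removed'} mapping that rpc_loop already tracks (names taken from ovs_neutron_agent.py; the surrounding structure is simplified and untested):

# At the end of the rpc_loop iteration, only drop the saved hints once every
# device has been processed successfully; otherwise keep them so a retried
# port can be re-bound to the same local VLAN it had before the restart.
if not (failed_devices['added'] or failed_devices['removed']):
    self._dispose_local_vlan_hints()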

tags: added: ovs
Changed in neutron:
status: New → Opinion