ovs restart can lead to critical ovs flows missing

Bug #1758868 reported by Junien F
This bug affects 1 person
Affects           Status  Importance  Assigned to  Milestone
neutron           New     Undecided   Unassigned
neutron (Ubuntu)  New     Undecided   Unassigned

Bug Description

Hi,

Running mitaka on xenial (neutron 2:8.4.0-0ubuntu6). We have l2pop and no l3ha. Using ovs with GRE tunnels.

The cloud has around 30 compute nodes (mostly arm64). Last week, ovs got restarted during a package upgrade:

2018-03-21 17:17:25 upgrade openvswitch-common:arm64 2.5.2-0ubuntu0.16.04.3 2.5.4-0ubuntu0.16.04.1

This led to instances on 2 arm64 compute nodes losing networking completely. Upon closer inspection, I realized that a flow was missing in br-tun table 3: https://pastebin.ubuntu.com/p/VXRJJX8J3k/

I believe this is due to a race in ovs_neutron_agent.py. These table 3 flows are set up in provision_local_vlan(): https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L675

which is called by port_bound() :
https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L789-L791

which is called by treat_vif_port() : https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1405-L1410

which is called by treat_devices_added_or_updated() : https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1517-L1525

which is called by process_network_ports() : https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1618-L1623

which is called by the big rpc_loop() : https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2023-L2029
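
To summarize that chain, here is a minimal, self-contained sketch (my own simplification for illustration, not the actual agent code; the method bodies are placeholders, and local_vlan_map stands in for the agent's record of networks that already have a local VLAN):

# Simplified call chain mirroring the Mitaka ovs_neutron_agent links above.
# Only the call structure is meant to match; everything else is stubbed.

class SketchAgent(object):
    def __init__(self):
        self.local_vlan_map = {}   # net_uuid -> local VLAN, for already-wired networks

    def process_network_ports(self, port_info, ovs_restarted):
        self.treat_devices_added_or_updated(port_info['added'], ovs_restarted)

    def treat_devices_added_or_updated(self, devices, ovs_restarted):
        for device, net_uuid in devices:
            self.treat_vif_port(device, net_uuid, ovs_restarted)

    def treat_vif_port(self, device, net_uuid, ovs_restarted):
        self.port_bound(device, net_uuid, ovs_restarted)

    def port_bound(self, device, net_uuid, ovs_restarted):
        # provision_local_vlan(), which installs the br-tun table 3 flows, is
        # only reached if the network has no local VLAN yet or if OVS was
        # detected as restarted in this same iteration.
        if net_uuid not in self.local_vlan_map or ovs_restarted:
            self.provision_local_vlan(net_uuid)

    def provision_local_vlan(self, net_uuid):
        self.local_vlan_map.setdefault(net_uuid, 1)
        print("would (re)install br-tun table 3 flows for network %s" % net_uuid)

agent = SketchAgent()
agent.process_network_ports({'added': [('tap1234', 'net-1')]}, ovs_restarted=True)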

So how does the agent know when to create these table 3 flows? Well, in rpc_loop(), it checks for OVS restarts (https://github.com/openstack/neutron/blob/mitaka-eol/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1947-L1948), and if OVS did restart, it does some basic OVS setup (default flows, etc.) and, very importantly for later, it restarts the OVS polling manager.

Later (still in rpc_loop()), it sets "ovs_restarted" to True and processes the ports as usual. The expected behaviour here is that, since the polling manager got restarted, any port that is up will be marked as "added" and processed as such in port_bound() (see the call stack above). If this function is called on a port when ovs_restarted is True, then provision_local_vlan() will get called and will add the table 3 flows.
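
Roughly, the restart handling looks like this (a hand-written sketch based on the links above, not a copy of the real code; check_ovs_status(), setup_default_flows() and process_port_info() stand in for the real methods):

# Sketch of one rpc_loop() iteration around an OVS restart. The key point:
# ports only get the full treatment (including the table 3 flows) if they
# show up as "added" in the same iteration that has ovs_restarted = True.

def rpc_loop_iteration(agent, polling_manager):
    ovs_restarted = False
    if agent.check_ovs_status() == 'restarted':
        agent.setup_default_flows()     # basic br-int / br-tun setup
        polling_manager.stop()
        polling_manager.start()         # async ovsdb monitor restarted here
        ovs_restarted = True

    # port_info is built from the async polling events (unless sync is forced)
    port_info = agent.process_port_info(polling_manager)
    agent.process_network_ports(port_info, ovs_restarted)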

This all works great under the assumption that the polling manager (which is an async process) raises the "I got a new port!" event before rpc_loop() checks for it (in process_port_events(), called by process_port_info()). However, if the node is under load, for example, this may not always be the case.

What happens then is that the rpc_loop iteration in which OVS is detected as restarted doesn't see any change on the ports, and so does nothing. The next iteration of rpc_loop will process the "I got a new port!" events, but that iteration will not be running with ovs_restarted set to True, so the ports won't be brought up properly; more specifically, the table 3 flows in br-tun will be missing.

This is shown in the debug logs: https://pastebin.ubuntu.com/p/M8yYn3YnQ6/ - you can see that the iteration in which "OVS is restarted" is detected (loop iteration 320773) doesn't process any port ("iteration:320773 completed. Processed ports statistics: {'regular': {'updated': 0, 'added': 0, 'removed': 0}}."), but the next iteration does process 3 "added" ports. You can also see that "output received" is logged in the first iteration 49ms after "starting polling" is logged, which is presumably the problem. On all the non-failing nodes, the output is received before "starting polling".
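
To make the ordering problem concrete, here is a tiny illustration (my own, not agent code) of the two iterations from the log above: the iteration that knows about the restart sees no added ports, and the iteration that finally sees the added ports no longer knows about the restart.

# The network already has a local VLAN from before the restart, so
# provision_local_vlan() is only re-run when ovs_restarted is True.

already_provisioned = True   # network was wired up before OVS restarted

iterations = [
    # (iteration number, ovs_restarted, ports reported as added)
    (320773, True,  []),                           # restart seen, events not delivered yet
    (320774, False, ['tap-a', 'tap-b', 'tap-c']),  # events delivered, restart flag gone
]

for number, ovs_restarted, added in iterations:
    for port in added:
        if not already_provisioned or ovs_restarted:
            print("%d: %s -> provision_local_vlan(), table 3 flows re-created" % (number, port))
        else:
            print("%d: %s -> skipped, br-tun table 3 flows stay missing" % (number, port))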

I believe the proper thing to do is to set "sync" to True (in rpc_loop()) if an ovs restart is detected, forcing process_port_info() to not use async events and scan the ports itself using scan_ports().
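
In terms of the sketches above, the suggestion amounts to something like this (my interpretation of the proposal, not an actual patch):

# Proposed change, sketched: when an OVS restart is detected, also force
# sync = True so that process_port_info() falls back to scan_ports() instead
# of trusting the (possibly late) async polling events for this iteration.

def handle_ovs_restart(agent, polling_manager, sync):
    if agent.check_ovs_status() == 'restarted':
        agent.setup_default_flows()
        polling_manager.stop()
        polling_manager.start()
        sync = True    # proposed addition: force a full scan_ports()
    return sync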

Thanks

Revision history for this message
Junien F (axino) wrote :

This problem can probably be easily replicated if you replace /usr/bin/ovsdb-client (used by the polling manager) with a shell script that sleeps 0.5s and then execs the actual ovsdb-client.
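
For example, something like this as a stand-in for /usr/bin/ovsdb-client might do (sketched in Python rather than shell; it assumes the real binary has first been moved aside to a hypothetical /usr/bin/ovsdb-client.real):

#!/usr/bin/env python
# Hypothetical delay wrapper to widen the race window. Assumes the real
# binary was moved to /usr/bin/ovsdb-client.real (name made up here).
import os
import sys
import time

time.sleep(0.5)
real = '/usr/bin/ovsdb-client.real'
os.execv(real, [real] + sys.argv[1:])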

Revision history for this message
Edward Hope-Morley (hopem) wrote :

Hi Junien, I see that you have stated that you are using neutron version 2:8.4.0-0ubuntu6, which for the sake of this comment is fine, but just to close the loop:

The current version of neutron in xenial-updates, i.e. 2:8.4.0-0ubuntu7.1, introduced a regression [1][2] that produces almost exactly the same behaviour that you see. That neutron package has subsequently been reverted and is currently waiting for a release to xenial-updates, so in any case please be sure not to upgrade neutron until the version currently in xenial-proposed gets released. All UCA releases have already been fixed.

I'm going to look closer at what you have observed to see if I can repro it and see what's going on.

[1] https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1752838 <- introduced regression
[2] https://bugs.launchpad.net/ubuntu/+source/neutron/+bug/1758411 <- revert submitted

Revision history for this message
Seyeong Kim (seyeongkim) wrote :

Was neutron-ovs-cleanup run after upgrading the package?

You can check whether that service is in a failed state with:

systemctl list-units --failed

If it is in a failed state, ovs-cleanup can run during the upgrade and blow away all the ports.

This seems like a different issue, but checking to be sure.

Revision history for this message
Junien F (axino) wrote :

I'm fairly confident this is not the ovs-cleanup bug. "grep neutron-ovs-cleanup syslog" doesn't return anything, I'm using neutron-plugin-openvswitch-agent 2:8.4.0-0ubuntu6, and neutron-ovs-cleanup.service is "Active: active (exited) since Mon 2018-02-19 05:09:12 UTC; 1 months 4 days ago", so it didn't run in March.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Based on neutron-ovs-cleanup not being restarted since early February, I don't think this is related to the recent neutron regression.

I would like to test with the patch from the following bug. After a brief look at the code, it doesn't appear to fix the issue, but perhaps I missed something, so I think it's worth testing with:
https://bugs.launchpad.net/neutron/+bug/1646526

Revision history for this message
Corey Bryant (corey.bryant) wrote :

I've added upstream neutron to the bug. Keep in mind that this is mitaka and unsupported by upstream, but perhaps someone from upstream knows whether this is fixed in a mitaka+ release or not.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

We need to triage this on the master branch to see if we still see such an issue. We also need to check whether a bug fix for this issue went in after Mitaka.

Revision history for this message
Edward Hope-Morley (hopem) wrote :

I believe that part or all of this issue is resolved by https://bugs.launchpad.net/neutron/+bug/1584647, which is currently being backported to Xenial/Mitaka.

Revision history for this message
James Page (james-page) wrote :

On the assumption that bug 1584647 resolves this issue, marking as a duplicate - please comment if this is not the case or the issue remains.
