ovs restart can lead to critical ovs flows missing
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | New | Undecided | Unassigned |
neutron (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Hi,
Running mitaka on xenial (neutron 2:8.4.0-0ubuntu6). We have l2pop and no l3ha. Using ovs with GRE tunnels.
The cloud has around 30 compute nodes (mostly arm64). Last week, ovs got restarted during a package upgrade :
2018-03-21 17:17:25 upgrade openvswitch-
This led to instances on 2 arm64 compute nodes losing networking completely. Upon closer inspection, I realized that a flow was missing in br-tun table 3 : https:/
I believe this is due to a race in ovs_neutron_
which is called by port_bound() :
https:/
which is called by treat_vif_port() : https:/
which is called by treat_devices_
which is called by process_
which is called by the big rpc_loop() : https:/
So how does the agent know when to create these table 3 flows ? Well, in rpc_loop(), it checks for OVS restarts (https:/
Later (still in rpc_loop()), it sets "ovs_restarted" to True, and processes the ports as usual. The expected behaviour here is that, since the polling manager got restarted, any port that is up will be marked as "added" and processed as such, in port_bound() (see call stack above). If this function is called on a port when ovs_restarted is True, then provision_
This is all working great under the assumption that the polling manager (which is an async process) will raise the "I got new port !" event before the rpc_loop() checks it (in process_
What happens then is that the rpc_loop in which OVS is detected as restarted doesn't see any change on the ports, and so does nothing. The next run of the rpc_loop will process the "I got new port !" events, but that loop will not be running with ovs_restarted set to True, so the ports won't be brought up properly - more specifically, the table 3 flows in br-tun will be missing. This is shown in the debug logs : https:/
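To make the ordering problem concrete, here is a small toy model of it. This is not the agent code : ToyAgent, rpc_loop_iteration() and simulate() are names I made up for illustration, and the only behaviour modelled is "ports reported as added get their table 3 flows only if the iteration runs with ovs_restarted set to True".

```python
class ToyAgent:
    """Toy stand-in for the agent loop; hypothetical, not neutron code."""

    def __init__(self):
        # Events delivered by the (asynchronous) polling manager.
        self.pending_events = []
        # port name -> set of flow kinds installed for that port.
        self.flows = {}

    def rpc_loop_iteration(self, ovs_restarted):
        # Consume whatever events have arrived so far.
        added, self.pending_events = self.pending_events, []
        for port in added:
            kinds = {"normal"}
            # Table 3 flows are only provisioned when the iteration
            # that sees the port also knows OVS was restarted.
            if ovs_restarted:
                kinds.add("br-tun table 3")
            self.flows[port] = kinds


def simulate(event_is_late):
    """Run two loop iterations around one OVS restart."""
    agent = ToyAgent()
    if not event_is_late:
        agent.pending_events.append("port-1")
    # Iteration in which the restart is detected.
    agent.rpc_loop_iteration(ovs_restarted=True)
    if event_is_late:
        # The polling manager reports the port only after that iteration.
        agent.pending_events.append("port-1")
    # Next iteration: the restart flag is no longer set.
    agent.rpc_loop_iteration(ovs_restarted=False)
    return agent.flows["port-1"]
```

In this model, simulate(event_is_late=False) returns a set containing "br-tun table 3", while simulate(event_is_late=True) returns only {"normal"}, which matches what I saw on the affected nodes.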
I believe the proper thing to do is to set "sync" to True (in rpc_loop()) if an ovs restart is detected, forcing process_port_info() to not use async events and scan the ports itself using scan_ports().
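A rough sketch of the idea, again as a toy model rather than a patch : ports_to_process() and its parameters are hypothetical names, and sorted(ports_on_bridge) merely stands in for whatever scan_ports() would return.

```python
def ports_to_process(pending_events, ports_on_bridge, ovs_restarted):
    """Pick the ports one loop iteration should handle (toy model)."""
    # Proposed behaviour: an OVS restart forces a full resync.
    sync = ovs_restarted
    if sync:
        # Stand-in for scan_ports(): look at the bridge directly instead
        # of waiting for asynchronous events that may not have arrived yet.
        return sorted(ports_on_bridge)
    # Normal case: only handle what the polling manager reported.
    return list(pending_events)
```

With no events delivered yet and "port-1" present on the bridge, the restart iteration still returns ["port-1"], so the port would be processed while ovs_restarted is True.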
Thanks
This problem can probably be easily replicated if you replace /usr/bin/ovsdb-client (used by the polling manager) with a shell script that sleeps 0.5s and then execs the actual ovsdb-client.