Missing secure fail mode on physical bridges
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
neutron | Fix Released | Critical | Hynek Mlnarik | | |
Bug Description
Restarting the current OVS neutron agent under heavy load with Ryu (ofctl=native) intermittently disrupts traffic, as shown by occasional failures in a not-yet-committed fullstack test [1]. More specifically, the cause seems to be a too-slow restart of the Ryu controller in combination with OVS vswitchd timeouts. The traffic disruption always occurs after the following log entry is recorded in ovs-vswitchd.log:
fail_
The issue manifests regardless of network type (VLAN, flat) and vsctl interface (cli, native). It has not occurred with ofctl=cli, though.
The issue occurs on physical bridges because they are left in the default fail mode: once the controller connection is lost, OVS takes over management of the flows and clears them. This conclusion is based on flows dumped just before the traffic was blocked and again just after, when the flows had been cleared. Before the event:
====== br-eth23164fdb4 =======
Fri Jul 29 10:44:04 UTC 2016
OFPST_FLOW reply (OF1.3) (xid=0x2):
cookie=
cookie=
cookie=
After the disruption:
====== br-eth23164fdb4 =======
Fri Jul 29 10:44:05 UTC 2016
OFPST_FLOW reply (OF1.3) (xid=0x2):
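The fail mode of a bridge can be inspected and changed with `ovs-vsctl get-fail-mode` and `ovs-vsctl set-fail-mode`. The Python sketch below builds those command lines and decides whether a bridge would still fall back to the default (standalone) behaviour. The helper names are hypothetical, and this only illustrates the idea in the bug title; it is not necessarily the change that was merged into neutron.

```python
import subprocess

SECURE = "secure"


def get_fail_mode_cmd(bridge):
    """Command that prints the configured fail mode of a bridge."""
    return ["ovs-vsctl", "get-fail-mode", bridge]


def set_secure_cmd(bridge):
    """Command that switches a bridge to the secure fail mode."""
    return ["ovs-vsctl", "set-fail-mode", bridge, SECURE]


def needs_secure(get_fail_mode_output):
    """Return True if the bridge is not in secure mode.

    ovs-vsctl prints an empty line when no fail mode is configured,
    in which case OVS behaves as in standalone mode.
    """
    return get_fail_mode_output.strip() != SECURE


def ensure_secure(bridge, run=subprocess.check_output):
    """Set secure fail mode if needed; return True if a change was made.

    `run` is injectable so the logic can be exercised without OVS installed.
    """
    output = run(get_fail_mode_cmd(bridge))
    if isinstance(output, bytes):
        output = output.decode()
    if needs_secure(output):
        run(set_secure_cmd(bridge))
        return True
    return False
```

On a host with OVS installed, `ensure_secure("br-eth23164fdb4")` would run both commands for the bridge shown in the dumps above; it typically needs root privileges.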
The same bug appears under the following conditions (courtesy of Jakub Libosvar): set up a physical bridge, block traffic to the respective bridge controller via iptables until the OVS timeout occurs, and then check the OpenFlow rules of the affected bridge.
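The final step of that reproduction, checking the OpenFlow rules, can be automated by comparing two `ovs-ofctl dump-flows` outputs. The sketch below is a minimal illustration in Python; the function names are hypothetical, and it assumes that every flow entry in the dump starts with `cookie=`, as in the dumps quoted in this report.

```python
def dump_flows_cmd(bridge):
    """Command that dumps the flows of a bridge using OpenFlow 1.3."""
    return ["ovs-ofctl", "-O", "OpenFlow13", "dump-flows", bridge]


def count_flows(dump_output):
    """Count flow entries in dump-flows output.

    The reply header line (e.g. "OFPST_FLOW reply ...") is ignored;
    only lines beginning with "cookie=" are treated as flow entries.
    """
    return sum(
        1
        for line in dump_output.splitlines()
        if line.strip().startswith("cookie=")
    )


def flows_were_cleared(before, after):
    """True if the bridge had flows before the event and none after it."""
    return count_flows(before) > 0 and count_flows(after) == 0
```

Feeding it the output captured before and after the controller connection is blocked gives a simple pass/fail signal for the condition described in this bug.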
tags: | added: ovs |
I'd add that this happens not only on restarts: a machine under heavy load can miss probes from ovsdb, which leads to a controller disconnect. This also drops flows, so compute nodes with VLAN or flat network types are left with a broken data plane until the OVS agent is restarted.