Default NORMAL flows on OVS bridges at boot has potential to cause network storm

Bug #1324703 reported by James Denton
Affects    Status         Importance   Assigned to    Milestone
neutron    Fix Released   Medium       Kyle Mestery
Icehouse   Fix Released   Undecided    Unassigned

Bug Description

There have been many times where we have restarted an environment only to find that it becomes laggy or inaccessible within minutes of coming back up. Unfortunately, the systems were too unresponsive through DRAC to allow packet captures. The usual resolution is to shut the physical switchports, reboot the system, and bring the ports back up once the system is online. Switchport statistics, however, show a high number of packets in and out of the interfaces.

From what I can tell, when the openvswitch service is started at boot, every bridge is populated with a NORMAL learning flow. This includes the provider, integration and tunnel bridges. If/when there is a delay in the openvswitch plugin agent populating the flows, there is a potential for traffic coming in one port to be forwarded out all other bridges/ports. If there are multiple servers in this state, a traffic storm is generated on the network due to a bridging loop.
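
One way to check whether a host is in this state is to look for bridges whose flow table holds nothing but a single NORMAL flow. The sketch below is a heuristic only. It assumes the standard `ovs-vsctl list-br` and `ovs-ofctl dump-flows` commands; the function names and the `OVS_VSCTL` / `OVS_OFCTL` override variables are hypothetical and exist only so the real binaries can be stubbed out.

```shell
#!/bin/sh
# Sketch: list bridges whose only flow is a NORMAL flow, i.e. bridges that
# the agent has apparently not programmed yet.
# OVS_VSCTL / OVS_OFCTL are hypothetical overrides, not part of OVS.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"
OVS_OFCTL="${OVS_OFCTL:-ovs-ofctl}"

has_only_default_normal_flow() {
    flows=$("$OVS_OFCTL" dump-flows "$1")
    # Every flow line in dump-flows output carries an "actions=" field;
    # the reply header line does not.
    total=$(printf '%s\n' "$flows" | grep -c 'actions=' || true)
    normal=$(printf '%s\n' "$flows" | grep -c 'actions=NORMAL' || true)
    [ "$total" -eq 1 ] && [ "$normal" -eq 1 ]
}

list_unprogrammed_bridges() {
    for br in $("$OVS_VSCTL" list-br); do
        if has_only_default_normal_flow "$br"; then
            echo "$br"
        fi
    done
}

# Usage:
#   list_unprogrammed_bridges
```

A bridge that the agent has programmed may still contain NORMAL actions among its flows, which is why the check requires the NORMAL flow to be the only one present.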

This issue could also be seen after a restart/upgrade of openvswitch prior to Kyle's patch for the following bug:

"neutron-openvswitch-agent does not recreate flows after ovsdb-server restarts"
https://bugs.launchpad.net/tripleo/+bug/1290486

In the above case, we could see physical VLAN traffic on the provider bridge being forwarded through the overlay networks, and vice versa. Rather than defaulting to NORMAL flows for each bridge when openvswitch starts, would it be possible to default to DROP flows? That way, if there are any timing issues between openvswitch starting and the flows actually getting written, traffic will not be allowed between bridges.

Eugene Nikanorov (enikanorov) wrote :

Is the issue still seen after the fix you've mentioned?

tags: added: ovs
Changed in neutron:
status: New → Incomplete
Kyle Mestery (mestery) wrote :

Changing from "NORMAL" to "drop" at startup would likely solve this, as it would cause all traffic to be dropped until proper flows are programmed. There is some logic to this approach, but I'd like to hear feedback from others as well.

Changed in neutron:
importance: Undecided → Medium
James Denton (james-denton) wrote :

Kyle - It looks like changing the default flow from NORMAL to DROP would have a negative impact if users have configured an IP address on the bridge interface. In cases where users are using a single interface for host management and a provider bridge, they would not have connectivity to the box without the NORMAL flow in place.

James Denton (james-denton) wrote :

To add to my above comment, maybe just changing the default action of the integration bridge from NORMAL to DROP would be enough to prevent the circumstances of this bug from occurring.

Kyle Mestery (mestery)
Changed in neutron:
assignee: nobody → Kyle Mestery (mestery)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/96782

Changed in neutron:
status: Incomplete → In Progress
James Denton (james-denton) wrote :

Tested the patch and it works as advertised. However, prior to implementing the SIGTERM and SIGINT wipe of the flows in the latest patch, traffic would have been switched based on the flows that existed before the agent was shut down. What concerns me is the case where openvswitch-switch is restarted, or where openvswitch-switch is started and the agent isn't running or operating properly.

In those conditions, the bridges remain patched together and openvswitch itself implements its default NORMAL flows on each bridge. If the agent, for whatever reason, isn't running or operating properly, those NORMAL flows will never get overwritten. I'm not sure, based on this behavior, that this is something that can be mitigated in the ovs agent. My understanding is that bridges default to learning (NORMAL) mode, and when they're cross-connected like this without STP enabled it sets up the storm conditions.

Kyle Mestery (mestery) wrote :

James, given your comment in #6 above, I'm now thinking that perhaps my patch to handle reprogramming flows at restart will cover the case where you were having issues. I'm keen to possibly abandon the patch and this bug. The real fix would be to have the OVS startup scripts start with a DROP rule instead of a NORMAL rule, and this could be done using a deployer tool such as Puppet, Chef, Ansible, or Salt. Thoughts?

James Denton (james-denton) wrote :

Kyle,

Thanks for your help and quick response to this issue. I agree that you can abandon the patch, and will keep you posted as to how we end up resolving it.

James Denton (james-denton) wrote :

I am looking to add the following to the start section of the openvswitch-switch script on our hosts:

ovs-ofctl del-flows br-int
ovs-ofctl add-flow br-int actions=drop

This will ensure that if for any reason the OVS agent fails to start, the NORMAL flows on each bridge don't cause a bridging loop to occur.
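
The two commands above could be wrapped in a small helper so that the start section skips bridges that do not exist yet. This is only a sketch of that idea: `ovs-vsctl br-exists`, `ovs-ofctl del-flows`, and `ovs-ofctl add-flow` are standard commands, while the function name `drop_all_on_bridge` and the `OVS_VSCTL` / `OVS_OFCTL` override variables are hypothetical.

```shell
#!/bin/sh
# Sketch: replace every flow on a bridge with a single drop flow.
# OVS_VSCTL / OVS_OFCTL are hypothetical overrides, not part of OVS.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"
OVS_OFCTL="${OVS_OFCTL:-ovs-ofctl}"

drop_all_on_bridge() {
    bridge="$1"
    # br-exists exits 0 when the bridge exists and nonzero otherwise.
    if ! "$OVS_VSCTL" br-exists "$bridge"; then
        echo "bridge $bridge not found, skipping" >&2
        return 0
    fi
    # Remove whatever flows are present, then install the drop flow.
    "$OVS_OFCTL" del-flows "$bridge" &&
        "$OVS_OFCTL" add-flow "$bridge" "actions=drop"
}

# Usage (start section, once ovs-vswitchd is running):
#   drop_all_on_bridge br-int
```

Limiting the helper to br-int follows the earlier comment in this thread that a drop flow on a provider bridge carrying the host's management IP would cut off access to the box.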

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/98857

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/96782
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0412c8a3c43e77c1bb04b0c3ba4b2781fc52e7f8
Submitter: Jenkins
Branch: master

commit 0412c8a3c43e77c1bb04b0c3ba4b2781fc52e7f8
Author: Kyle Mestery <email address hidden>
Date: Fri May 30 08:48:37 2014 +0000

    Default to setting secure mode on the integration bridge

    Set the fail-mode on the integration bridge to be secure. This means if the agent
    is stopped or crashes, and OVS is also restarted, OVS will not program a default
    NORMAL action. As soon as the agent is restarted, it will correctly program the
    integration bridge.

    Change-Id: Icf7e3e14ee747c8ce92c14c95a0a1bbf35986252
    Closes-Bug: #1324703
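
The agent's own code for this change is not shown in this report. For reference, the same setting can be applied and checked by hand with the standard `ovs-vsctl set-fail-mode` and `ovs-vsctl get-fail-mode` commands. In the sketch below, the function name `set_secure_fail_mode` and the `OVS_VSCTL` override variable are hypothetical.

```shell
#!/bin/sh
# Sketch: put a bridge into secure fail-mode and confirm the setting.
# OVS_VSCTL is a hypothetical override, not part of OVS.
OVS_VSCTL="${OVS_VSCTL:-ovs-vsctl}"

set_secure_fail_mode() {
    bridge="$1"
    "$OVS_VSCTL" set-fail-mode "$bridge" secure || return 1
    # get-fail-mode prints the configured mode (empty output if unset).
    [ "$("$OVS_VSCTL" get-fail-mode "$bridge")" = "secure" ]
}

# Usage:
#   set_secure_fail_mode br-int
```

The fail-mode is stored in the OVS database, so it persists across an openvswitch restart, which is the case the commit message describes.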

Changed in neutron:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/icehouse)

Reviewed: https://review.openstack.org/98857
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5e0ea72a4263a7604cb7ff44eca40a5bd0cffc8c
Submitter: Jenkins
Branch: stable/icehouse

commit 5e0ea72a4263a7604cb7ff44eca40a5bd0cffc8c
Author: Kyle Mestery <email address hidden>
Date: Fri May 30 08:48:37 2014 +0000

    Default to setting secure mode on the integration bridge

    Set the fail-mode on the integration bridge to be secure. This means if the agent
    is stopped or crashes, and OVS is also restarted, OVS will not program a default
    NORMAL action. As soon as the agent is restarted, it will correctly program the
    integration bridge.

    Change-Id: Icf7e3e14ee747c8ce92c14c95a0a1bbf35986252
    Closes-Bug: #1324703
    (cherry picked from commit 0412c8a3c43e77c1bb04b0c3ba4b2781fc52e7f8)

tags: added: in-stable-icehouse
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → juno-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-1 → 2014.2