Restarting neutron openvswitch while having broadcast/multicast traffic going into br-tun makes a broadcast storm over the tunnel network

Bug #1421232 reported by Miguel Angel Ajo
This bug affects 3 people
Affects   Status         Importance   Assigned to        Milestone
neutron   Fix Released   Critical     Miguel Angel Ajo
Juno      Fix Released   Critical     Miguel Angel Ajo

Bug Description

This is a consequence of the following bug (br-tun being reset across agent restarts): https://bugs.launchpad.net/neutron/+bug/1383674

If, in addition, a broadcast or multicast packet enters br-tun from
br-int while br-tun has no flows, openvswitch will bring down the
network by creating a broadcast storm.

At least 3 nodes connected via tunnels are needed for this to happen.

The packets will then loop:

NodeA -> NodeB -> NodeC -> NodeA

With more nodes, the storm is amplified further.
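
To make the loop and the amplification concrete, here is a rough simulation. It is only a toy model under stated assumptions, not Open vSwitch behaviour in detail: every node has a tunnel to every other node, a flow-less br-tun in "standalone" mode floods a broadcast to all tunnel ports except the ingress one, and a flow-less br-tun in "secure" mode drops it. The function name `flood` and the `max_hops` cut-off are made up for the illustration (a real L2 broadcast has no hop limit, which is why the storm does not stop on its own).

```python
from collections import deque


def flood(nodes, origin, fail_mode, max_hops):
    """Count tunnel transmissions caused by one broadcast entering br-tun.

    Toy model (an assumption, not Open vSwitch code): full mesh of tunnels,
    no OpenFlow rules installed on any br-tun.
    """
    if fail_mode == "secure":
        # No flows + secure fail mode: nothing is forwarded.
        return 0

    sent = 0
    # Each entry: (node holding a copy, node it came from, hops so far)
    queue = deque([(origin, None, 0)])
    while queue:
        node, came_from, hops = queue.popleft()
        if hops == max_hops:
            continue  # artificial cut-off so the simulation terminates
        for peer in nodes:
            # NORMAL-style flooding: every tunnel port except the ingress one
            if peer != node and peer != came_from:
                sent += 1
                queue.append((peer, node, hops + 1))
    return sent


three = flood(["A", "B", "C"], "A", "standalone", max_hops=3)
four = flood(["A", "B", "C", "D"], "A", "standalone", max_hops=3)
safe = flood(["A", "B", "C"], "A", "secure", max_hops=3)
print(three, four, safe)
```

With three nodes the number of copies in flight stays constant per hop and never dies out; with four nodes it doubles at every hop in this model.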

This would be avoided if we at least re-created br-tun in fail-mode "secure", because that mode doesn't install the default "NORMAL" switching rule
on the switch at creation (the origin of this problem).

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

Fabio Di Nitto and Jiri Benc (openvswitch) found the issue, nailed down the problem,
and worked out how to reproduce it:

Stop neutron-openvswitch-agent (maintenance), restart openvswitch, and be a bit unlucky...

Changed in neutron:
status: New → Confirmed
assignee: nobody → Miguel Angel Ajo (mangelajo)
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/155328

Changed in neutron:
status: Confirmed → In Progress
Kyle Mestery (mestery)
Changed in neutron:
importance: Undecided → High
milestone: none → kilo-3
Revision history for this message
Kevin Fox (kevpn) wrote :

Seen this with Juno. It took almost 2 days to track down; Ajo finally figured it out. The fix seems to be working for me. Please mark this as backport potential.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/156177

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Moving to Critical. The bug brings the whole network down. It's new in Juno+, where we started to rebuild br-tun.

Changed in neutron:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/155328
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=740ddc5043e39f3babb57896652d11c223f1f385
Submitter: Jenkins
Branch: master

commit 740ddc5043e39f3babb57896652d11c223f1f385
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Feb 12 14:32:58 2015 +0000

    Setup br-tun in secure fail mode to avoid broadcast storms

    When not creating br-tun in secure fail mode, there are chances to
    get a broadcast storm from br-tun.

    For example, this occurs when at least three nodes have the br-tun
    OpenFlow rules reset in and a broadcast/multicast packet enters br-tun.

    This can happen if:
      * openvswitch is restarted, until the agent reloads the Openflow rules.
      * during neutron-openvswitch-agent restart, br-tun is reset, and there
    is a few seconds timeframe where tunnel endpoints are plugged and OF
    rules are reset.

    Secure fail mode doesn't forward traffic by default if no rule is hit.

    Change-Id: Iba5ded14179156decb16dcd4b898c026660f9653
    Closes-bug: #1421232

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
Kevin Fox (kevpn) wrote :

This one killed our cloud several times. RDO Juno. I applied the patch and it seems to have resolved the issue. Please mark it for Juno.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/156177
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a52f19f9bdfc2eb1357e27b07bdf534289de9c7f
Submitter: Jenkins
Branch: stable/juno

commit a52f19f9bdfc2eb1357e27b07bdf534289de9c7f
Author: Miguel Angel Ajo <email address hidden>
Date: Thu Feb 12 14:32:58 2015 +0000

    Setup br-tun in secure fail mode to avoid broadcast storms

    When not creating br-tun in secure fail mode, there are chances to
    get a broadcast storm from br-tun.

    For example, this occurs when at least three nodes have the br-tun
    OpenFlow rules reset in and a broadcast/multicast packet enters br-tun.

    This can happen if:
      * openvswitch is restarted, until the agent reloads the Openflow rules.
      * during neutron-openvswitch-agent restart, br-tun is reset, and there
    is a few seconds timeframe where tunnel endpoints are plugged and OF
    rules are reset.

    Secure fail mode doesn't forward traffic by default if no rule is hit.

    For stable/juno: create bridge operation is implemented with the secure_mode
    parameter, since we don't have ovsdb transactions in juno, and we need
    to creation+fail_mode secure to be atomic to avoid race conditions.

    Closes-bug: #1421232

    Conflicts:
     neutron/agent/linux/ovs_lib.py
     neutron/tests/functional/agent/test_ovs_lib.py

    Change-Id: Iba5ded14179156decb16dcd4b898c026660f9653

tags: added: in-stable-juno
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-3 → 2015.1.0
Revision history for this message
Florian Ermisch (0xf10e) wrote :

I see this problem in my Icehouse test-setups, too.

Had a controller and two compute nodes on KVM all using 100% of one CPU core.
`top` inside the VMs showed up to 98% software interrupts.
Running `tcpdump` on their internal/mgmt network revealed a multicast storm that disappeared after running `ovs-vsctl set-fail-mode br-tun secure` on all nodes.
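
After applying that workaround, the setting can be checked per node with `ovs-vsctl get-fail-mode br-tun`, which prints the mode (or an empty line when none is set). A minimal sketch of such a check follows; the function names are made up for the example, and `bridge_is_secure` assumes `ovs-vsctl` is on the PATH of the node it runs on.

```python
import subprocess


def is_secure(get_fail_mode_output):
    """Interpret the text printed by `ovs-vsctl get-fail-mode <bridge>`."""
    return get_fail_mode_output.strip() == "secure"


def bridge_is_secure(bridge="br-tun"):
    """Query the local Open vSwitch (requires ovs-vsctl to be installed)."""
    result = subprocess.run(
        ["ovs-vsctl", "get-fail-mode", bridge],
        capture_output=True, text=True, check=True,
    )
    return is_secure(result.stdout)
```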
