arp storm when neutron-openvswitch-agent is configured with of_interface=native

Bug #1586462 reported by Inessa Vasilevskaya
This bug affects 1 person
Affects             Status   Importance  Assigned to      Milestone
Mirantis OpenStack  Invalid  Critical    Ilya Chukhnakov

Bug Description

Note that of_interface=native is not the default value.

Seen on mos9.0, 3 controllers/1 compute.

Steps to reproduce:
1. Deploy MOS and set of_interface to 'native' in /etc/neutron/plugins/ml2/openvswitch_agent.ini.
2. Restart neutron-openvswitch-agent on all controllers and on the compute node.
3. Run the OSTF test "Check network connectivity from instance via floating IP".
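
For illustration, the configuration change in step 1 could be scripted roughly as follows. This is only a sketch: it assumes the option lives in the [ovs] section of openvswitch_agent.ini, works on an in-memory string rather than the real file, and the bridge_mappings value in the sample is hypothetical. Note that configparser does not preserve comments when it rewrites a file.

```python
import configparser
import io


def set_of_interface(ini_text, value="native"):
    """Return ini_text with of_interface set to value in the [ovs] section.

    Assumption: of_interface belongs to the [ovs] section of
    openvswitch_agent.ini.
    """
    # interpolation=None keeps '%' characters in values untouched;
    # strict=False tolerates duplicate keys in the input.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_section("ovs"):
        parser.add_section("ovs")
    parser.set("ovs", "of_interface", value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


# Hypothetical sample content, not taken from a real deployment.
sample = "[ovs]\nbridge_mappings = physnet1:br-floating\nof_interface = ovs-ofctl\n"
updated = set_of_interface(sample)
print(updated)
```

The agents still have to be restarted afterwards, as in step 2.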

Within a couple of minutes the load average on the controllers exceeds 75 and SSH connections to the controller nodes are dropped.

ovs-ofctl dump-flows br-tun on the compute node shows that a huge number of packets are received and dropped (see the rules with the highest n_packets values).

node-1:br-tun
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.230s, table=0, n_packets=115, n_bytes=10273, priority=1,in_port=1 actions=goto_table:2
 cookie=0x9e3a99e5bc9ec2d4, duration=3591.527s, table=0, n_packets=4724868, n_bytes=198449650, priority=1,in_port=3 actions=goto_table:4
 cookie=0x9e3a99e5bc9ec2d4, duration=3591.096s, table=0, n_packets=5749482, n_bytes=241478272, priority=1,in_port=2 actions=goto_table:4
 cookie=0x9e3a99e5bc9ec2d4, duration=3590.737s, table=0, n_packets=6036999, n_bytes=253554350, priority=1,in_port=4 actions=goto_table:4
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.225s, table=0, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.217s, table=2, n_packets=99, n_bytes=8507, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=goto_table:20
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.209s, table=2, n_packets=16, n_bytes=1766, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=goto_table:22
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.204s, table=3, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x9e3a99e5bc9ec2d4, duration=282.190s, table=4, n_packets=71, n_bytes=8512, priority=1,tun_id=0x2 actions=push_vlan:0x8100,set_field:4097->vlan_vid,goto_table:10
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.198s, table=4, n_packets=16511278, n_bytes=693473760, priority=0 actions=drop
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.194s, table=6, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.188s, table=10, n_packets=71, n_bytes=8512, priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x9e3a99e5bc9ec2d4,OXM_OF_VLAN_VID[],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->OXM_OF_VLAN_VID[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:1
 cookie=0x9e3a99e5bc9ec2d4, duration=254.439s, table=20, n_packets=0, n_bytes=0, hard_timeout=300, priority=1,vlan_tci=0x0001/0x0fff,dl_dst=fa:16:3e:4b:94:d7 actions=load:0->OXM_OF_VLAN_VID[],load:0x2->NXM_NX_TUN_ID[],output:3
 cookie=0x9e3a99e5bc9ec2d4, duration=254.438s, table=20, n_packets=0, n_bytes=0, hard_timeout=300, priority=1,vlan_tci=0x0001/0x0fff,dl_dst=fa:16:3e:5d:5f:7d actions=load:0->OXM_OF_VLAN_VID[],load:0x2->NXM_NX_TUN_ID[],output:4
 cookie=0x9e3a99e5bc9ec2d4, duration=254.263s, table=20, n_packets=99, n_bytes=8507, hard_timeout=300, priority=1,vlan_tci=0x0001/0x0fff,dl_dst=fa:16:3e:9b:3c:e8 actions=load:0->OXM_OF_VLAN_VID[],load:0x2->NXM_NX_TUN_ID[],output:3
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.179s, table=20, n_packets=0, n_bytes=0, priority=0 actions=goto_table:22
 cookie=0x9e3a99e5bc9ec2d4, duration=282.207s, table=22, n_packets=11, n_bytes=1376, priority=1,dl_vlan=1 actions=pop_vlan,set_field:0x2->tun_id,output:3,output:2,output:4
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.173s, table=22, n_packets=5, n_bytes=390, priority=0 actions=drop
=============
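
To make the busiest rules easier to spot, a dump like the one above can be sorted by n_packets. The following is a minimal sketch, not a general ovs-ofctl parser: it assumes the exact line layout shown in this report, and the sample holds five lines copied from the node-1 dump.

```python
# Five lines copied from the node-1 br-tun dump in this report.
DUMP = """\
 cookie=0x9e3a99e5bc9ec2d4, duration=3591.527s, table=0, n_packets=4724868, n_bytes=198449650, priority=1,in_port=3 actions=goto_table:4
 cookie=0x9e3a99e5bc9ec2d4, duration=3591.096s, table=0, n_packets=5749482, n_bytes=241478272, priority=1,in_port=2 actions=goto_table:4
 cookie=0x9e3a99e5bc9ec2d4, duration=3590.737s, table=0, n_packets=6036999, n_bytes=253554350, priority=1,in_port=4 actions=goto_table:4
 cookie=0x9e3a99e5bc9ec2d4, duration=282.190s, table=4, n_packets=71, n_bytes=8512, priority=1,tun_id=0x2 actions=push_vlan:0x8100,set_field:4097->vlan_vid,goto_table:10
 cookie=0x9e3a99e5bc9ec2d4, duration=3595.198s, table=4, n_packets=16511278, n_bytes=693473760, priority=0 actions=drop
"""


def parse_flow(line):
    """Split one dump-flows line into table, counters, match and actions."""
    head, sep, actions = line.strip().partition(" actions=")
    if not sep:
        return None  # header or blank line
    fields = head.split(", ")
    # Every field except the last is key=value; the last one is the match.
    info = dict(field.split("=", 1) for field in fields[:-1])
    return {
        "table": int(info["table"]),
        "n_packets": int(info["n_packets"]),
        "match": fields[-1],
        "actions": actions,
    }


flows = [f for f in map(parse_flow, DUMP.splitlines()) if f]
busiest = sorted(flows, key=lambda f: f["n_packets"], reverse=True)
for flow in busiest:
    print(f'{flow["n_packets"]:>10}  table={flow["table"]:<3} '
          f'{flow["match"]} -> {flow["actions"]}')

# Packets sent from the tunnel ports to table 4, and how many table 4 dropped.
tunnel_ingress = sum(f["n_packets"] for f in flows
                     if f["table"] == 0 and f["actions"] == "goto_table:4")
table4_dropped = sum(f["n_packets"] for f in flows
                     if f["table"] == 4 and f["actions"] == "drop")
print(f"tunnel ingress: {tunnel_ingress}, dropped in table 4: {table4_dropped}")
```

On these counters the three tunnel ports deliver 16511349 packets to table 4, and the default drop rule there accounts for 16511278 of them. The remaining 71 are the packets matched by the tun_id=0x2 rule.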

Tags: area-neutron
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

http://paste.openstack.org/show/505492/ - native after VM boot
http://paste.openstack.org/show/505494/ - default configuration after VM boot
http://paste.openstack.org/show/505509/ - native right before becoming living dead

I believe the reproduction can be reduced to a single successful VM boot (the storm begins right after the transition to the ACTIVE state).

nova boot VM --flavor m1.tiny --image "TestVM" --nic net-name=admin_internal_net

summary: - rpc storm when neutron-openvswitch-agent is configured with
+ arp storm when neutron-openvswitch-agent is configured with
of_interface=native
Inessa Vasilevskaya (ivasilevskaya) wrote :

A close examination of the flows in the native configuration showed that br-tun on node-2 was somehow configured to drop all traffic.

node-2:br-tun
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0xaf3b7046d9c0b1f1, duration=3365.416s, table=0, n_packets=3167, n_bytes=165002, priority=1,in_port=1 actions=goto_table:2
 cookie=0xaf3b7046d9c0b1f1, duration=3340.526s, table=0, n_packets=2, n_bytes=84, priority=1,in_port=3 actions=goto_table:4
 cookie=0xaf3b7046d9c0b1f1, duration=3339.894s, table=0, n_packets=10, n_bytes=1306, priority=1,in_port=2 actions=goto_table:4
 cookie=0xaf3b7046d9c0b1f1, duration=3338.011s, table=0, n_packets=4, n_bytes=224, priority=1,in_port=4 actions=goto_table:4
 cookie=0xaf3b7046d9c0b1f1, duration=3365.340s, table=0, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0xaf3b7046d9c0b1f1, duration=3365.234s, table=2, n_packets=0, n_bytes=0, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=goto_table:20
 cookie=0xaf3b7046d9c0b1f1, duration=3365.161s, table=2, n_packets=3167, n_bytes=165002, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=goto_table:22
 cookie=0xaf3b7046d9c0b1f1, duration=3365.093s, table=3, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0xaf3b7046d9c0b1f1, duration=3365.020s, table=4, n_packets=13, n_bytes=1460, priority=0 actions=drop
 cookie=0xaf3b7046d9c0b1f1, duration=3364.930s, table=6, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0xaf3b7046d9c0b1f1, duration=3364.847s, table=10, n_packets=3, n_bytes=154, priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0xaf3b7046d9c0b1f1,OXM_OF_VLAN_VID[],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->OXM_OF_VLAN_VID[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:OXM_OF_IN_PORT[]),output:1
 cookie=0xaf3b7046d9c0b1f1, duration=3364.745s, table=20, n_packets=0, n_bytes=0, priority=0 actions=goto_table:22
 cookie=0xaf3b7046d9c0b1f1, duration=3364.651s, table=22, n_packets=3165, n_bytes=164862, priority=0 actions=drop

Every pipeline that a packet can take (0->2->22 | 0->2->20->22 | 0->4) inevitably ends in a drop.
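
The same conclusion can be checked mechanically. The sketch below uses a hand-transcribed, simplified model of the node-2 dump above (only the actions of each rule, with matches ignored, and the learn action abbreviated to its output) and enumerates every path reachable from table 0. The names NODE2_BR_TUN and terminal_paths are made up for this illustration.

```python
# Actions of each rule per table, transcribed from the node-2 br-tun dump.
# Matches are ignored, so this over-approximates the paths a packet can take.
NODE2_BR_TUN = {
    0: ["goto_table:2", "goto_table:4", "goto_table:4", "goto_table:4", "drop"],
    2: ["goto_table:20", "goto_table:22"],
    3: ["drop"],
    4: ["drop"],
    6: ["drop"],
    10: ["output:1"],  # learn(...) action abbreviated; unreachable from table 0
    20: ["goto_table:22"],
    22: ["drop"],
}


def terminal_paths(tables, start=0):
    """Return the set of (path, final_action) pairs reachable from start."""
    results = set()

    def walk(table, path):
        for action in tables.get(table, []):
            if action.startswith("goto_table:"):
                # goto_table only moves to a higher table, so this terminates.
                nxt = int(action.split(":", 1)[1])
                walk(nxt, path + (nxt,))
            else:
                results.add((path, action))

    walk(start, (start,))
    return results


paths = terminal_paths(NODE2_BR_TUN)
for path, action in sorted(paths):
    print("->".join(map(str, path)), "ends in", action)
all_dropped = all(action == "drop" for _, action in paths)
print("every reachable path ends in drop:", all_dropped)
```

Besides the three pipelines listed above, the walk also reports the default drop rule in table 0 itself. Table 10 is never reached because table 4 on node-2 contains only the drop rule.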

More logs from other reproduction runs (ideally the output of ovs-ofctl show together with ovs-ofctl dump-flows) would be of great help.

Warning: I do NOT recommend reproducing this on a lab with the default external network, because it may cause a network outage.

Alexander Ignatov (aignatov) wrote :

Please note that this bug is applicable only to MOS 10.0, although it was reproduced with MOS 9.0: of_interface=native is turned off by default in 9.0.

Changed in mos:
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Inessa Vasilevskaya (ivasilevskaya)
milestone: none → 10.0
no longer affects: mos/10.0.x
tags: added: area-neutron
Changed in mos:
assignee: Inessa Vasilevskaya (ivasilevskaya) → Ilya Chukhnakov (ichukhnakov)
Ilya Chukhnakov (ichukhnakov) wrote :

Turning on the native of_interface causes some action=normal rules to be added because of in-band control (they are only visible with ovs-appctl bridge/dump-flows; see the In-Band Control section in [1] for more details). Some of those rules allow packets from the public network to pass into br-tun, where they cause the storm. It is not yet clear whether this is a configuration issue or a bug in OVS itself.

Also note that it seems safe to turn in-band control off, since the controller is set to 127.0.0.1, which corresponds to an out-of-band control configuration. Turning in-band control off with disable-in-band after the bridge has already been configured with the controller settings still leaves the action=normal rules visible, but they are non-functional (see [2]). So the proper solution would be to set disable-in-band in the same command that sets the controllers.
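
As a rough illustration of "the same command", the sketch below builds a single ovs-vsctl invocation that sets the controller and disable-in-band together. It only constructs and prints the command line and does not run it. The helper name and the tcp:127.0.0.1:6633 target are assumptions made for this example (the port in particular depends on the agent configuration), and this is not necessarily how the upstream fix was implemented.

```python
import shlex


def controller_setup_argv(bridge, target="tcp:127.0.0.1:6633"):
    """Build one ovs-vsctl call that sets the controller and disables
    in-band control on the bridge (hypothetical helper).

    Assumption: the controller listens on 127.0.0.1:6633.
    """
    return [
        "ovs-vsctl",
        # Commands separated by "--" are applied together by ovs-vsctl.
        "--", "set-controller", bridge, target,
        "--", "set", "Bridge", bridge, "other_config:disable-in-band=true",
    ]


argv = controller_setup_argv("br-tun")
print(shlex.join(argv))
```

Passing the resulting list to subprocess.run would apply both settings in one ovs-vsctl call, so the bridge should not get a controller configured while in-band control is still enabled.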

[1] http://openvswitch.org/support/dist-docs-2.5/DESIGN.md.html
[2] http://openvswitch.org/pipermail/git/2011-August/001676.html

Ilya Chukhnakov (ichukhnakov) wrote :
Changed in mos:
status: Confirmed → In Progress
Alexander Ignatov (aignatov) wrote :

This bug is fixed upstream in Newton, so closing it as Invalid.

Changed in mos:
status: In Progress → Invalid