ovs bridge flow table is dropped by unknown cause

Bug #1697243 reported by MarginHu
This bug affects 7 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Rodolfo Alonso

Bug Description

Hi,

My OpenStack deployment has a provider network on an OVS bridge named "provision". It had been running fine, but after several hours the network broke down and I found that the bridge's flow table was empty.

Is there a way to trace changes to a bridge's flow table?
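
(For reference, one way to watch a bridge's flow table live, assuming a reasonably recent Open vSwitch where ovs-ofctl supports flow monitoring; "provision" is the bridge from this report:)

# print every flow add/delete/modify on the bridge as it happens
ovs-ofctl monitor provision watch:

# or watch the raw OpenFlow messages exchanged with the controller
ovs-ofctl snoop provision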

[root@cloud-sz-master-b12-01 neutron]# ovs-ofctl dump-flows provision
NXST_FLOW reply (xid=0x4):

[root@cloud-sz-master-b12-02 nova]# ovs-ofctl dump-flows provision
NXST_FLOW reply (xid=0x4):
[root@cloud-sz-master-b12-02 nova]#
[root@cloud-sz-master-b12-02 nova]#
[root@cloud-sz-master-b12-02 nova]# ip r
...
10.53.33.0/24 dev provision proto kernel scope link src 10.53.33.11
10.53.128.0/24 dev docker0 proto kernel scope link src 10.53.128.1
169.254.0.0/16 dev br-ex scope link metric 1055
169.254.0.0/16 dev provision scope link metric 1056
...

[root@cloud-sz-master-b12-02 nova]# ovs-ofctl show provision
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000248a075541e8
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(bond0): addr:24:8a:07:55:41:e8
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 2(phy-provision): addr:76:b5:88:cc:a6:74
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(provision): addr:24:8a:07:55:41:e8
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0

[root@cloud-sz-master-b12-02 nova]# ifconfig bond0
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
        inet6 fe80::268a:7ff:fe55:41e8 prefixlen 64 scopeid 0x20<link>
        ether 24:8a:07:55:41:e8 txqueuelen 1000 (Ethernet)
        RX packets 93588032 bytes 39646246456 (36.9 GiB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 8655257217 bytes 27148795388 (25.2 GiB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

[root@cloud-sz-master-b12-02 nova]# cat /proc/net/bonding/bond0
Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 0
Up Delay (ms): 0
Down Delay (ms): 0

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 24:8a:07:55:41:e8
Active Aggregator Info:
        Aggregator ID: 19
        Number of ports: 2
        Actor Key: 13
        Partner Key: 11073
        Partner Mac Address: 38:bc:01:c2:26:a1

Slave Interface: enp4s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:8a:07:55:41:e8
Slave queue ID: 0
Aggregator ID: 19
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 24:8a:07:55:41:e8
    port key: 13
    port priority: 255
    port number: 1
    port state: 61
details partner lacp pdu:
    system priority: 32768
    system mac address: 38:bc:01:c2:26:a1
    oper key: 11073
    port priority: 32768
    port number: 43
    port state: 61

Slave Interface: enp5s0f0
MII Status: up
Speed: 10000 Mbps
Duplex: full
Link Failure Count: 0
Permanent HW addr: 24:8a:07:55:44:64
Slave queue ID: 0
Aggregator ID: 19
Actor Churn State: none
Partner Churn State: none
Actor Churned Count: 0
Partner Churned Count: 0
details actor lacp pdu:
    system priority: 65535
    system mac address: 24:8a:07:55:41:e8
    port key: 13
    port priority: 255
    port number: 2
    port state: 61
details partner lacp pdu:
    system priority: 32768
    system mac address: 38:bc:01:c2:26:a1
    oper key: 11073
    port priority: 32768
    port number: 91
    port state: 61

Revision history for this message
MarginHu (margin2017) wrote :

There are two other servers with the same network configuration, but the issue hasn't appeared on them.

Revision history for this message
MarginHu (margin2017) wrote :

The flow table is critical to network connectivity, so why doesn't the source code log flow-rule changes such as add, delete, and modify?

Revision history for this message
Trevor McCasland (twm2016) wrote :

Can you state what the preconditions are for this bug and what the expected outcomes are? That will make it easier to understand what the problem is.

Right now the bug report reads more like a question, which is more appropriate for the mailing list.

tags: added: ovs
Changed in neutron:
status: New → Incomplete
Revision history for this message
MarginHu (margin2017) wrote :

I am not sure whether this is a bug or not. I think my description is enough; feel free to ask me for more info or logs.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

If you have debug enabled, the neutron agent will log when it deletes flows.

Revision history for this message
MarginHu (margin2017) wrote :

I found that it doesn't behave as you said.

[DEFAULT]
debug = True

Now I have manually deleted the flow rule as follows, but I don't find any related info in the neutron log.
[root@cloud-sz-master-b12-02 ~]# ovs-ofctl dump-flows provision
NXST_FLOW reply (xid=0x4):
 cookie=0x80d0c34387b440b5, duration=54165.293s, table=0, n_packets=856, n_bytes=63275, idle_age=161, priority=4,in_port=2,dl_vlan=2 actions=strip_vlan,NORMAL
[root@cloud-sz-master-b12-02 ~]# ovs-ofctl del-flows provision priority=4
ovs-ofctl: unknown keyword priority
[root@cloud-sz-master-b12-02 ~]# ovs-ofctl del-flows provision in_port=2

Revision history for this message
Kevin Benton (kevinbenton) wrote :

If something else deletes a flow, neutron isn't going to log it because it doesn't know about it. It will only log when it deletes flows.

So if you don't see a log entry, something else is probably deleting flows on the system. I suggest increasing the OVS logging levels and watching for the timestamp when all of the flows are removed from that bridge, then checking whether neutron or nova did anything at that time.
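
A minimal sketch of that, assuming a standard install where ovs-appctl can reach the running ovs-vswitchd and its log lives under /var/log/openvswitch/ (paths differ per distribution):

# show the current per-module log levels
ovs-appctl vlog/list

# raise the OpenFlow-related modules to debug in the log file
ovs-appctl vlog/set vconn:file:dbg
ovs-appctl vlog/set ofproto:file:dbg

# once the flows have vanished, look for FLOW_MOD messages around that timestamp
grep -i flow_mod /var/log/openvswitch/ovs-vswitchd.log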

Revision history for this message
MarginHu (margin2017) wrote :

From the log I found that the flow rules were dropped, but I don't know why.

Fri Jun 23 06:20:56 CST 2017

ovs-ofctl show provision
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000248a07554190
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(bond0): addr:24:8a:07:55:41:90
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 2(phy-provision): addr:02:66:68:17:37:71
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(provision): addr:24:8a:07:55:41:90
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
ovs-ofctl dump-flows provision
NXST_FLOW reply (xid=0x4):
 cookie=0x838540493f7eb89d, duration=116.540s, table=0, n_packets=1846, n_bytes=171564, idle_age=0, priority=0 actions=NORMAL

Fri Jun 23 06:21:57 CST 2017

ovs-ofctl show provision
OFPT_FEATURES_REPLY (xid=0x2): dpid:0000248a07554190
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS QUEUE_STATS ARP_MATCH_IP
actions: output enqueue set_vlan_vid set_vlan_pcp strip_vlan mod_dl_src mod_dl_dst mod_nw_src mod_nw_dst mod_nw_tos mod_tp_src mod_tp_dst
 1(bond0): addr:24:8a:07:55:41:90
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 2(phy-provision): addr:02:66:68:17:37:71
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 LOCAL(provision): addr:24:8a:07:55:41:90
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
OFPT_GET_CONFIG_REPLY (xid=0x4): frags=normal miss_send_len=0
ovs-ofctl dump-flows provision
NXST_FLOW reply (xid=0x4):

Please notice the following log:

18905:2017-06-23 06:21:07.514 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-381c7dde-6c3d-4e3b-af32-ae20c75f59d9 - - - - -] Starting to process devices in:{'current': set([u'9d3250d4-c789-4bdd-b325-31145863050e', u'0d474f55-dfa7-4cbf-91e6-123b2757b2f9']), 'removed': set([]), 'added': set([u'9d3250d4-c789-4bdd-b325-31145863050e', u'0d474f55-dfa7-4cbf-91e6-123b2757b2f9'])} rpc_loop /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2036

19066 2017-06-23 06:21:10.491 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-381c7dde-6c3d-4e3b-af32-ae20c75f59d9 - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0xc25b34bf,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=255,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=218570,cookie=9477051674213136541L,duration_nsec=874000000,duration_sec=129,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionActions(actions=[OFPActionOutput(len=16,max_len=0,port=4294967290,type=0)],len=24,type=4)],length=80,match=OFPMatch(oxm_fields={}),packet_count=2215,priority=0,table_id=0)],flags=0,type=1)] _send_msg /usr/lib/python2.7/si...


Revision history for this message
MarginHu (margin2017) wrote :
Revision history for this message
Kevin Benton (kevinbenton) wrote :

There is definitely something wrong here. The statement:

"Deleting flow with cookie 0x838540493f7eb89d" is coming from "Cleaning stale provider flows" even though that cookie belongs to the provision bridge. I'm concerned the cookiebridge is leaking things between instances.

Revision history for this message
MarginHu (margin2017) wrote :

But what's wrong? What events could lead to this behavior?

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

I've had only a quick look for now.

First, some information that is currently missing from the bug report but would help:
- the version used
- full logs (the ones provided seem to start long after the event)

Looking at the logs, it seems that 0x838540493f7eb89d is a cookie of a current bridge (not from a previous run), because "9477051674213136541" (the int for 0x838540493f7eb89d) is present in multiple lines such as:

2017-06-23 06:19:00.617 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-9a0971d8-f9fc-49c7-b640-013efa97dfd4 - - - - -] ofctl request version=0x4,msg_type=0xe,msg_len=0x50,xid=0xe427dd6f,OFPFlowMod(buffer_id=4294967295,command=0,cookie=9477051674213136541L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionActions(actions=[OFPActionOutput(len=16,max_len=0,port=4294967290,type=0)],len=24,type=4)],match=OFPMatch(oxm_fields={}),out_group=0,out_port=0,priority=0,table_id=0) result None _send_msg /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:93

The behavior for stale flow deletion is "delete things for all cookies except cookies that we know about", so I believe the error may lie in failing to identify 0x838540493f7eb89d as a cookie we know about (i.e. having the cookie of our bridge in .reserved_cookies of the master bridge). In fact I suspect that if we derive a cookie bridge from another cookie bridge that is itself derived from the master, the reservation is possibly not propagated to the master bridge as it should be.
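
For hand-correlation, note that the agent log prints cookies in decimal while ovs-ofctl prints them in hex; a quick sketch, using the bridge name and cookie value from this report:

# cookies currently installed on the bridge, as ovs-ofctl shows them (hex)
ovs-ofctl dump-flows provision | grep -o 'cookie=0x[0-9a-f]*' | sort -u

# convert a hex cookie to the decimal form used in the agent's debug log
printf '%u\n' 0x838540493f7eb89d    # -> 9477051674213136541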

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

Attaching here the ml2_conf.ini that MarginHu posted to openstack-dev.

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

(TLDR: this comments list possible causes that, at least on a first approximation, I would rule out as cause for this specific issue -- but issues that the code may have nonetheless)

After looking at the config: since no l2 agent extension is loaded, I don't see how OVSCookieBridge could be involved (even though I suspect the code may have an issue lurking there as said in #12).

Another possibility I initially considered is multiple physical networks mapped to the same OVS bridge; in that case multiple OVSBridge instances would be created for the same OVS bridge, each unaware of the cookies of the others. But (a) since the ml2 config does *not* have a given OVS bridge present multiple times in bridge_mappings, this can't be what is happening here, and (b) the bridge_mapping parsing code seems to prevent that anyway (use of helpers.parse_mappings(bridge_mappings) with unique_values left to True).

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

(MarginHu, please disregard my comment above on missing info on version and truncated logs, I realize that the logs had, in the middle, a restart of the agent)

Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

I currently don't have an explanation for why the _default_cookie of the bridge for "provision" would be absent from ._reserved_cookies, but on the other hand it seems to me that this is a prerequisite for cleanup_flows ending up calling delete_flows on this cookie.

Revision history for this message
MarginHu (margin2017) wrote :

openstack-neutron-10.0.1-1.el7.noarch
python2-neutronclient-6.1.0-1.el7.noarch
openstack-neutron-common-10.0.1-1.el7.noarch
python-neutron-lbaas-10.0.0-1.el7.noarch
openstack-neutron-ml2-10.0.1-1.el7.noarch
openstack-neutron-openvswitch-10.0.1-1.el7.noarch
python-neutron-lib-1.1.0-1.el7.noarch
python-neutron-10.0.1-1.el7.noarch
openstack-neutron-lbaas-10.0.0-1.el7.noarch
openvswitch-2.6.1-4.1.git20161206.el7.x86_64
openstack-neutron-openvswitch-10.0.1-1.el7.noarch
python-openvswitch-2.6.1-4.1.git20161206.el7.noarch

Revision history for this message
MarginHu (margin2017) wrote :
Revision history for this message
MarginHu (margin2017) wrote :

The issue is 100% reproducible when rebooting the server.
My service boot sequence after rebooting the server is as follows:

1.neutron_openvswitch_agent
2.neutron_dhcp_agent
3.neutron_l3_agent
4.neutron_server

The services are started from a script after the server has booted, so I can easily control the sequence and check the flow tables.

I found two facts:

1) The flow tables of all OVS bridges were normal after steps 1 and 2.
2) Only after step 3 did the flow table of "provider" become empty.

It seems the neutron_l3_agent service makes the flow table empty.

Revision history for this message
MarginHu (margin2017) wrote :

If I don't start "neutron_l3_agent" service , the issue is disappeared.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Can you provide the config for the l3 agent as well?

Changed in neutron:
status: Incomplete → Confirmed
importance: Undecided → High
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Looking at the code, I don't even see how starting the neutron_l3_agent could trigger the flow cleanup code. Can you also provide a set of logs from both the OVS agent and the L3 agent so we can see what the L3 agent did at the time the flow cleanup was triggered?

Revision history for this message
Kevin Benton (kevinbenton) wrote :

Oh, I see the reason L3 agent is triggering it. After a fresh restart, there are no interfaces on the OVS integration bridge so the flow cleanup logic isn't triggered. Once the L3 agent starts up, it creates an interface which triggers the cleanup.

Revision history for this message
Kevin Benton (kevinbenton) wrote :

I'm not able to reproduce this on my Ocata setup.

Is it possible for you to modify the neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py file to include some additional logging statements to help debug this?

The change you need to apply is here: https://review.openstack.org/#/c/477052/

Revision history for this message
MarginHu (margin2017) wrote :
Revision history for this message
Kevin Benton (kevinbenton) wrote :

Yeah, the L3 agent is fine. Whichever port goes into OVS first would have triggered it. Can you apply that patch above (or just manually add the debug line yourself) and run it?

Revision history for this message
MarginHu (margin2017) wrote :

Sorry, I no longer have a chance to try your code because that environment no longer exists.

I guess the cause may be the following usage of OVS bridges:

1. bond0 is built from eth0 and eth1; vlan17 is a VLAN interface on top of bond0.
2. bond0 is added to the OVS bridge "provision".
3. vlan17 is added to the OVS bridge "provider".

In this scenario, the flow table was often dropped on "provision" or "provider".

After I switched to using only one OVS bridge, there were no issues.

I then changed my network design and removed the "provider" bridge, to avoid the scenario where "provision" and "provider" exist at the same time on one server.

Revision history for this message
Burkhard Linke (blinke) wrote :

We are affected by the same problem. Flows on physical bridges are deleted upon restart of the neutron-openvswitch-agent on a _compute_ host.

OS: Ubuntu Xenial 16.04
Kernel: 4.4.0-83-generic
Openstack Distribution: Fuel Community Edition 10
neutron-openvswitch-agent: 2:9.2.0-1~u16.04+mos15
openvswitch: 2.6.1-0~u1604+mos1

Flow dump with a working setup and one VM:

root@dl580-r4-1:~# ovs-ofctl dump-flows br-biodb
NXST_FLOW reply (xid=0x4):
 cookie=0xa870e454201864c5, duration=31.624s, table=0, n_packets=34, n_bytes=2992, idle_age=1, priority=4,in_port=1,dl_vlan=3 actions=strip_vlan,NORMAL
 cookie=0xa870e454201864c5, duration=62679.598s, table=0, n_packets=491, n_bytes=69574, idle_age=1, priority=2,in_port=1 actions=drop
 cookie=0xa870e454201864c5, duration=62679.717s, table=0, n_packets=634, n_bytes=51232, idle_age=1, priority=0 actions=NORMAL

Bridge setup:
    Bridge br-biodb
        Controller "tcp:127.0.0.1:6633"
            is_connected: true
        fail_mode: secure
        Port br-biodb
            Interface br-biodb
                type: internal
        Port phy-br-biodb
            Interface phy-br-biodb
                type: patch
                options: {peer=int-br-biodb}
        Port "bond0.603"
            Interface "bond0.603"

bond0.603 is a VLAN-tagged LACP bond of the ethernet interfaces. The network associated with the bridge uses the flat network type.
The VM running on that host is able to ping the external router / baremetal machines outside of the cloud setup.

After a restart of the host (the error is not always reproducible by restarting the agent alone), a side note:
neutron-openvswitch-agent does not start properly, stating:

2017-07-13 08:15:05.699 2659 ERROR neutron.plugins.ml2.drivers.openvswitch.agent
.ovs_neutron_agent [-] Tunneling can't be enabled with invalid local_ip '192.168.11.83'. IP couldn't be found on this host's interfaces.

The IP address is associated with the br-mesh OVS bridge and is present, so this is probably a startup race. Starting the agent manually afterwards works.

Current flows before agent starts:
root@dl580-r4-1:~# ovs-ofctl dump-flows br-biodb
NXST_FLOW reply (xid=0x4):
root@dl580-r4-1:~#

Flows after agent start:
root@dl580-r4-1:~# ovs-ofctl dump-flows br-biodb
NXST_FLOW reply (xid=0x4):
root@dl580-r4-1:~#

ovs-ofctl snoop output:
OFPT_FEATURES_REQUEST (OF1.3) (xid=0x2a8f299c):
OFPT_FEATURES_REPLY (OF1.3) (xid=0x2a8f299c): dpid:00005cb901e425b0
n_tables:254, n_buffers:256
capabilities: FLOW_STATS TABLE_STATS PORT_STATS GROUP_STATS QUEUE_STATS
OFPST_PORT_DESC request (OF1.3) (xid=0x2a8f299d): port=ANY
OFPST_PORT_DESC reply (OF1.3) (xid=0x2a8f299d):
 1(phy-br-biodb): addr:c2:cc:e6:c1:7c:bf
     config: 0
     state: 0
     speed: 0 Mbps now, 0 Mbps max
 2(bond0.603): addr:5c:b9:01:e4:25:b0
     config: 0
     state: 0
     current: 10GB-FD
     speed: 10000 Mbps now, 0 Mbps max
 LOCAL(br-biodb): addr:5c:b9:01:e4:25:b0
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max
OFPT_ECHO_REQUEST (OF1.3) (xid=0x0): 0 bytes of payload
OFPT_ECHO_REPLY (OF1.3) (xid=0x0): 0 bytes of payload
....

Second restart of the agent:
OFPT_ECHO_REQUEST (OF1.3) (xid=0x0): 0 bytes of payload
OFPT_ECHO_REPLY (...

Revision history for this message
MarginHu (margin2017) wrote :

A similar issue came back. This time bond1 (mode=4, LACP) is built from 2 NICs, and two VLAN interfaces (vlan1162 and vlan1163) are created on top of bond1.

vlan1162 is added to bridge "br-ex", and vlan1163 is added to bridge "br-ex2".

I can easily reproduce the issue by restarting the neutron-openvswitch-agent service; the flow table on br-ex is then empty.

I applied your patch and found the following log:

2017-07-15 21:13:33.128 6 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-570e6b8c-ddf4-454b-8924-4e2edea1904b - - - - -] Deleting flow with cookie 0xbdc758dfcf1b2f4d
2017-07-15 21:13:33.129 6 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-570e6b8c-ddf4-454b-8924-4e2edea1904b - - - - -] ofctl request version=0x4,msg_type=0xe,msg_len=0x38,xid=0xaebd71d9,OFPFlowMod(buffer_id=4294967295,command=3,cookie=13674996511809417037L,cookie_mask=18446744073709551615L,flags=0,hard_timeout=0,idle_timeout=0,instructions=[],match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,priority=0,table_id=255) result None _send_msg /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:93
2017-07-15 21:13:33.130 6 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-570e6b8c-ddf4-454b-8924-4e2edea1904b - - - - -] Cleaning stale br-ex flows

2017-07-15 21:13:33.133 6 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-570e6b8c-ddf4-454b-8924-4e2edea1904b - - - - -] Reserved cookies for br-ex: set([13674996511809417037L]) cleanup_flows /usr/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:141

Revision history for this message
MarginHu (margin2017) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/485054

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/485054
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bc51380ded25eb679209c379a07a1ac176af30f3
Submitter: Jenkins
Branch: master

commit bc51380ded25eb679209c379a07a1ac176af30f3
Author: Kevin Benton <email address hidden>
Date: Fri Jun 23 18:57:02 2017 -0700

    Log reserved cookies in cleanup_flows method

    This will help us debug why flows are unexpectedly being
    cleaned up if the related bug ever resurfaces.

    Related-Bug: #1697243
    Change-Id: I517b16c550037f41a5f4915b98963c2232daa78c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/486211

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/477052
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=282d4af115cfbad5afc614bafb4939249b2ebc20
Submitter: Jenkins
Branch: stable/ocata

commit 282d4af115cfbad5afc614bafb4939249b2ebc20
Author: Kevin Benton <email address hidden>
Date: Fri Jun 23 18:57:02 2017 -0700

    Log reserved cookies in cleanup_flows method

    This will help us debug why flows are unexpectedly being
    cleaned up if the related bug ever resurfaces.

    Related-Bug: #1697243
    Change-Id: I517b16c550037f41a5f4915b98963c2232daa78c
    (cherry picked from commit bc51380ded25eb679209c379a07a1ac176af30f3)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/486211
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cdc216bcac7b440783ea9f0831a473db00a12e4a
Submitter: Jenkins
Branch: stable/newton

commit cdc216bcac7b440783ea9f0831a473db00a12e4a
Author: Kevin Benton <email address hidden>
Date: Fri Jun 23 18:57:02 2017 -0700

    Log reserved cookies in cleanup_flows method

    This will help us debug why flows are unexpectedly being
    cleaned up if the related bug ever resurfaces.

    Related-Bug: #1697243
    Change-Id: I517b16c550037f41a5f4915b98963c2232daa78c
    (cherry picked from commit bc51380ded25eb679209c379a07a1ac176af30f3)

tags: added: in-stable-newton
Revision history for this message
Thomas Morin (tmmorin-orange) wrote :

I've had a look at Margin's last trace.

(only the lines related to cookies and stale flow deletion, edited for readability)

2017-07-15 21:13:27.052 6 INFO neutron.common.config [-] /usr/bin/neutron-openvswitch-agent version 10.0.1
2017-07-15 21:13:33.118 6 DEBUG ....openflow.native.ofswitch [ - - - - -] Reserved cookies for br-int: set([10385564020546830277L])
2017-07-15 21:13:33.120 6 DEBUG ....openflow.native.ofswitch [ - - - - -] Reserved cookies for provision: set([10900502818113047970L])
2017-07-15 21:13:33.124 6 DEBUG ....openflow.native.ofswitch [ - - - - -] Reserved cookies for br-ex2: set([9707595023152995281L])

2017-07-15 21:13:33.125 6 WARNING ....openflow.native.ofswitch [ - - - - -] Deleting flow with cookie 0x0
2017-07-15 21:13:33.128 6 WARNING ....openflow.native.ofswitch [ - - - - -] Deleting flow with cookie 0xbdc758dfcf1b2f4d
                                                                                                      ^^^^^^^^^^^^^^^^^^

2017-07-15 21:13:33.133 6 DEBUG ....openflow.native.ofswitch [ - - - - -] Reserved cookies for br-ex: set([13674996511809417037L])
                                                                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So 0xbdc758dfcf1b2f4d is one of the reserved cookies of br-ex, but it was also found when listing flows on br-ex2. A flow was created on one bridge (br-ex2) with a cookie that is a registered cookie on another bridge (br-ex). This, of course, is not supposed to happen.

I haven't yet looked further what could be the root cause.
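
A quick way to check for this kind of cross-bridge cookie duplication on a live host (a sketch; substitute the bridge names from your own bridge_mappings):

for br in br-int provision br-ex br-ex2; do
    echo "== $br"
    ovs-ofctl dump-flows "$br" | grep -o 'cookie=0x[0-9a-f]*' | sort -u
done
# any cookie that shows up under more than one bridge is a red flag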

tags: added: neutron-proactive-backport-potential
tags: removed: in-stable-newton in-stable-ocata neutron-proactive-backport-potential
Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We are facing the same issue on Pike (using Kolla).

bond0 (mode=LACP)
bond0.2710 on bond0

bond0 in br-provider
bond0.2710 in br-ex

If I remove the br-ex bridge from OpenvSwitch, I don't see the "Deleting flow with cookie 0x474eaa664ecc52f4" message anymore and my provider network works.

We tried creating a Linux bridge (brvlanprovider) with bond0 inside and adding brvlanprovider as an interface of the br-provider OpenvSwitch bridge. We got the same issue.

We don't have this issue on Mitaka.

Revision history for this message
Arjun Baindur (abaindur) wrote :

Hello,

This problem occurs because an OVS bridge inherits its datapath-id from the physical NIC's MAC. It looks like you have a VLAN interface, which has the same MAC address as bond0, so br-provider and br-ex end up with the same datapath ID. You will only hit this issue when using the native OVS controller, as the legacy interface (the ovs-vsctl and ovs-ofctl CLIs) identifies bridges by name, which is unique.

I have a fix for this which I can send for review upstream.

Changed in neutron:
assignee: nobody → Arjun Baindur (xagent-9)
Revision history for this message
Arjun Baindur (abaindur) wrote :

FYI the fix is more of a workaround, as it just reads the datapath-ID of the user-configured bridges during setup and manually changes them if it detects any duplicates. This situation will arise any time you have a VLAN interface on one OVS bridge and its underlying NIC (or another VLAN interface) on a second OVS bridge.

The default dpid is something set by Open vSwitch. Alternatively, the install/setup docs could instruct operators to explicitly set the datapath ID of a bridge when creating it, using ovs-vsctl. As far as I know it has no purpose other than identifying a bridge, so you can assign it whatever you like:

ovs-vsctl set bridge <mybr> other-config:datapath-id=<datapathid>

You can change this yourself using the above command.

You can view/verify the current datapath-id via:

[root@centos7-neutron-template ~]# ovs-vsctl get Bridge br-vlan datapath-id
"00006ea5a4b38a4a"

(please note that other-config is missing in get but needed in set)
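
Building on that, a small host-level check for colliding datapath-ids might look like this (a sketch, nothing neutron-specific):

# print "datapath-id bridge" pairs; flags every bridge after the first that reuses a datapath-id
for br in $(ovs-vsctl list-br); do
    printf '%s %s\n' "$(ovs-vsctl get Bridge "$br" datapath-id)" "$br"
done | sort | awk 'seen[$1]++ { print "duplicate datapath-id:", $1, $2 }'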

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/587244

Revision history for this message
IWAMOTO Toshihiro (iwamoto) wrote :

VLAN interfaces created by the ip(8) command may not work well with OVS.
Does using ovs-vsctl add-port instead help with the problem?

http://docs.openvswitch.org/en/latest/faq/vlan/

Changed in neutron:
assignee: Arjun Baindur (abaindur) → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/633260

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/633261

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/587244
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=379a9faf6206039903555ce7e3fc4221e5f06a7a
Submitter: Zuul
Branch: master

commit 379a9faf6206039903555ce7e3fc4221e5f06a7a
Author: Arjun Baindur <email address hidden>
Date: Mon Jul 30 15:31:50 2018 -0700

    Change duplicate OVS bridge datapath-ids

    The native OVS/ofctl controllers talk to the bridges using a
    datapath-id, instead of the bridge name. The datapath ID is
    auto-generated based on the MAC address of the bridge's NIC.
    In the case where bridges are on VLAN interfaces, they would
    have the same MACs, therefore the same datapath-id, causing
    flows for one physical bridge to be programmed on each other.

    The datapath-id is a 64-bit field, with lower 48 bits being
    the MAC. We set the upper 12 unused bits to identify each
    unique physical bridge

    This could also be fixed manually using ovs-vsctl set, but
    it might be beneficial to automate this in the code.

    ovs-vsctl set bridge <mybr> other-config:datapath-id=<datapathid>

    You can change this yourself using above command.

    You can view/verify current datapath-id via

    ovs-vsctl get Bridge br-vlan datapath-id
    "00006ea5a4b38a4a"

    (please note that other-config is needed in the set, but not get)

    Closes-Bug: #1697243
    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I575ddf0a66e2cfe745af3874728809cf54e37745

Changed in neutron:
status: In Progress → Fix Released
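
For operators on releases without this fix, a minimal illustration of the manual workaround described in the commit message, using the bond MAC from this report (24:8a:07:55:41:e8, auto-generated dpid 0000248a075541e8) and assuming "provision" and "provider" are the two colliding bridges:

# keep the MAC in the low 48 bits, vary the upper bits per bridge
ovs-vsctl set bridge provision other-config:datapath-id=0001248a075541e8
ovs-vsctl set bridge provider other-config:datapath-id=0002248a075541e8

# verify (no other-config prefix on get)
ovs-vsctl get Bridge provision datapath-id
ovs-vsctl get Bridge provider datapath-id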
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/633260
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=777dc929dd7aab7be7a6e49843ff68c6935021ab
Submitter: Zuul
Branch: stable/rocky

commit 777dc929dd7aab7be7a6e49843ff68c6935021ab
Author: Arjun Baindur <email address hidden>
Date: Mon Jul 30 15:31:50 2018 -0700

    Change duplicate OVS bridge datapath-ids

    The native OVS/ofctl controllers talk to the bridges using a
    datapath-id, instead of the bridge name. The datapath ID is
    auto-generated based on the MAC address of the bridge's NIC.
    In the case where bridges are on VLAN interfaces, they would
    have the same MACs, therefore the same datapath-id, causing
    flows for one physical bridge to be programmed on each other.

    The datapath-id is a 64-bit field, with lower 48 bits being
    the MAC. We set the upper 12 unused bits to identify each
    unique physical bridge

    This could also be fixed manually using ovs-vsctl set, but
    it might be beneficial to automate this in the code.

    ovs-vsctl set bridge <mybr> other-config:datapath-id=<datapathid>

    You can change this yourself using above command.

    You can view/verify current datapath-id via

    ovs-vsctl get Bridge br-vlan datapath-id
    "00006ea5a4b38a4a"

    (please note that other-config is needed in the set, but not get)

    Closes-Bug: #1697243
    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I575ddf0a66e2cfe745af3874728809cf54e37745
    (cherry picked from commit 379a9faf6206039903555ce7e3fc4221e5f06a7a)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/633261
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c7031e2cd303fca0c418e040e89d3428fce5dffe
Submitter: Zuul
Branch: stable/queens

commit c7031e2cd303fca0c418e040e89d3428fce5dffe
Author: Arjun Baindur <email address hidden>
Date: Mon Jul 30 15:31:50 2018 -0700

    Change duplicate OVS bridge datapath-ids

    The native OVS/ofctl controllers talk to the bridges using a
    datapath-id, instead of the bridge name. The datapath ID is
    auto-generated based on the MAC address of the bridge's NIC.
    In the case where bridges are on VLAN interfaces, they would
    have the same MACs, therefore the same datapath-id, causing
    flows for one physical bridge to be programmed on each other.

    The datapath-id is a 64-bit field, with lower 48 bits being
    the MAC. We set the upper 12 unused bits to identify each
    unique physical bridge

    This could also be fixed manually using ovs-vsctl set, but
    it might be beneficial to automate this in the code.

    ovs-vsctl set bridge <mybr> other-config:datapath-id=<datapathid>

    You can change this yourself using above command.

    You can view/verify current datapath-id via

    ovs-vsctl get Bridge br-vlan datapath-id
    "00006ea5a4b38a4a"

    (please note that other-config is needed in the set, but not get)

    Closes-Bug: #1697243
    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I575ddf0a66e2cfe745af3874728809cf54e37745
    (cherry picked from commit 379a9faf6206039903555ce7e3fc4221e5f06a7a)
    (cherry picked from commit c02b1148db6c5183a9de0f032aec90e0bd5d8b9e)

tags: added: in-stable-queens
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b2

This issue was fixed in the openstack/neutron 14.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/648981

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649192

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/648981
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1c72d30d2d9f89ebd3d61520d1a088116b71e280
Submitter: Zuul
Branch: stable/pike

commit 1c72d30d2d9f89ebd3d61520d1a088116b71e280
Author: Arjun Baindur <email address hidden>
Date: Mon Jul 30 15:31:50 2018 -0700

    Change duplicate OVS bridge datapath-ids

    The native OVS/ofctl controllers talk to the bridges using a
    datapath-id, instead of the bridge name. The datapath ID is
    auto-generated based on the MAC address of the bridge's NIC.
    In the case where bridges are on VLAN interfaces, they would
    have the same MACs, therefore the same datapath-id, causing
    flows for one physical bridge to be programmed on each other.

    The datapath-id is a 64-bit field, with lower 48 bits being
    the MAC. We set the upper 12 unused bits to identify each
    unique physical bridge

    This could also be fixed manually using ovs-vsctl set, but
    it might be beneficial to automate this in the code.

    ovs-vsctl set bridge <mybr> other-config:datapath-id=<datapathid>

    You can change this yourself using above command.

    You can view/verify current datapath-id via

    ovs-vsctl get Bridge br-vlan datapath-id
    "00006ea5a4b38a4a"

    (please note that other-config is needed in the set, but not get)

    Closes-Bug: #1697243
    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I575ddf0a66e2cfe745af3874728809cf54e37745
    (cherry picked from commit 379a9faf6206039903555ce7e3fc4221e5f06a7a)
    (cherry picked from commit c02b1148db6c5183a9de0f032aec90e0bd5d8b9e)
    (cherry picked from commit c7031e2cd303fca0c418e040e89d3428fce5dffe)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/649192
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e68cc4d849e8732c81a737e432ec72623faf7216
Submitter: Zuul
Branch: stable/ocata

commit e68cc4d849e8732c81a737e432ec72623faf7216
Author: Arjun Baindur <email address hidden>
Date: Mon Jul 30 15:31:50 2018 -0700

    Change duplicate OVS bridge datapath-ids

    The native OVS/ofctl controllers talk to the bridges using a
    datapath-id, instead of the bridge name. The datapath ID is
    auto-generated based on the MAC address of the bridge's NIC.
    In the case where bridges are on VLAN interfaces, they would
    have the same MACs, therefore the same datapath-id, causing
    flows for one physical bridge to be programmed on each other.

    The datapath-id is a 64-bit field, with lower 48 bits being
    the MAC. We set the upper 12 unused bits to identify each
    unique physical bridge

    This could also be fixed manually using ovs-vsctl set, but
    it might be beneficial to automate this in the code.

    ovs-vsctl set bridge <mybr> other-config:datapath-id=<datapathid>

    You can change this yourself using above command.

    You can view/verify current datapath-id via

    ovs-vsctl get Bridge br-vlan datapath-id
    "00006ea5a4b38a4a"

    (please note that other-config is needed in the set, but not get)

    Closes-Bug: #1697243
    Co-Authored-By: Rodolfo Alonso Hernandez <email address hidden>

    Change-Id: I575ddf0a66e2cfe745af3874728809cf54e37745
    (cherry picked from commit 379a9faf6206039903555ce7e3fc4221e5f06a7a)
    (cherry picked from commit c02b1148db6c5183a9de0f032aec90e0bd5d8b9e)
    (cherry picked from commit c7031e2cd303fca0c418e040e89d3428fce5dffe)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7

This issue was fixed in the openstack/neutron 11.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6

This issue was fixed in the openstack/neutron 12.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.3

This issue was fixed in the openstack/neutron 13.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.
