Inconsistent flows with DVR l2pop VxLAN on br-tun

Bug #1794991 reported by Gaëtan Trellu
This bug affects 16 people

Affects: neutron
Status: Fix Released
Importance: Undecided
Assigned to: LIU Yulong

Bug Description

We are using Neutron (Pike) configured as DVR with l2pop, ARP responder and VXLAN. For a few weeks we have been experiencing unexpected behaviors:

- [1] Some instances are not able to get a DHCP address
- [2] Instances are not able to ping other instances on a different compute node

This is totally random: sometimes it works as expected and sometimes we see the behaviors described above.

After checking the flows between the network and compute nodes, we discovered that behavior [1] is due to missing flows on the compute nodes pointing to the DHCP agents on the network nodes.

Behavior [2] is also related to missing flows: some compute nodes are missing outputs to other compute nodes (vxlan-xxxxxx), which prevents an instance on compute 1 from communicating with an instance on compute 2.

Adding the missing flows for [1] and [2] manually fixes the issues, but if we restart neutron-openvswitch-agent the flows are missing again.

For [1], sometimes simply disabling/enabling the ports related to each DHCP agent on the network nodes solves the problem, and sometimes it does not.

For [2], the only way we found to fix the flows without adding them manually is to remove all instances of a network from the compute node and create a new instance on that network, which sends a notification message to all compute and network nodes. But again, when neutron-openvswitch-agent restarts, the flows vanish.

We cherry-picked these commits but nothing changed:
  - https://review.openstack.org/#/c/600151/
  - https://review.openstack.org/#/c/573785/

Information about our deployment:
  - OS: Ubuntu 16.04.5
  - Deployer: Kolla
  - Docker: 18.06
  - OpenStack: Pike/Rocky

Any ideas ?

summary: - Inconsistent flows with DVR l2pop on br-tun
+ Inconsistent flows with DVR l2pop VxLAN on br-tun
tags: added: l3-dvr-backlog
tags: added: l3-ipam-dhcp
Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Example of an inconsistent flow across compute nodes having an instance on the same network:

cookie=0xf305729cf4d16a1e, table=22, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a830845",output:"vxlan-0a830841",output:"vxlan-0a83083e"

A correct flow should be:

cookie=0xf305729cf4d16a1e, table=22, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a830845",output:"vxlan-0a830841",output:"vxlan-0a83083e",output:"vxlan-0a830843",output:"vxlan-0a830844"
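
For reference, here is a minimal sketch of how such a truncated flood entry can be patched by hand until the agent repopulates it (assumptions: the kolla openvswitch_vswitchd container and the port names shown above; ovs-ofctl mod-flows replaces the actions of every flow matching the given fields):

# inspect the current flood entry for this local VLAN
docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun table=22,dl_vlan=8

# rewrite it with the full list of tunnel outputs
docker exec -ti openvswitch_vswitchd ovs-ofctl mod-flows br-tun 'table=22,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a830845",output:"vxlan-0a830841",output:"vxlan-0a83083e",output:"vxlan-0a830843",output:"vxlan-0a830844"'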

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We tried with both the native and ovs-ofctl interfaces, same behavior.
Where does Neutron get the flows to add/update/remove from? The ports table?

Revision history for this message
Manuel Rodriguez (manuel.rodriguez) wrote :

Just adding more information about issue [1] that Gaëtan described in the initial comment:

These are the flows listed for a VXLAN network with segmentation ID 97 (0x61 in hexadecimal) on one of our compute nodes:

# dex -u0 openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x61
 cookie=0xde6f920d0d405dbc, duration=500977.143s, table=4, n_packets=427, n_bytes=41999, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)

As you can see, there is only one flow, in table 4. There were already two instances of that network running on the compute node, one created a few days ago and another a few minutes ago; the latter never got a DHCP IP because there were no flows to reach the DHCP namespaces running on the three controllers. To fix this we have found two workarounds:

1. Add the flows manually
2. Modify the network so it will trigger a flow addition across computes running instances from that network.

We tried the second option by adding a second subnet to the network, and we could see log events about update_fdb_entries for the DHCP agents. The flows now look like this, even after deleting the second subnet:

# dex -u0 openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x61
 cookie=0xde6f920d0d405dbc, duration=501247.247s, table=4, n_packets=436, n_bytes=44143, priority=1,tun_id=0x61 actions=mod_vlan_vid:8,resubmit(,9)
 cookie=0xde6f920d0d405dbc, duration=15.405s, table=20, n_packets=1, n_bytes=42, priority=2,dl_vlan=8,dl_dst=fa:16:3e:4c:67:28 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083e"
 cookie=0xde6f920d0d405dbc, duration=14.697s, table=20, n_packets=2, n_bytes=84, priority=2,dl_vlan=8,dl_dst=fa:16:3e:5b:63:e3 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d"
 cookie=0xde6f920d0d405dbc, duration=14.711s, table=22, n_packets=9, n_bytes=1818, priority=1,dl_vlan=8 actions=strip_vlan,load:0x61->NXM_NX_TUN_ID[],output:"vxlan-0a83083d",output:"vxlan-0a83083e"

From the flows above we can identify two things:
- They only handle communication to the DHCP agents; in other words, none of the VXLAN tunnel outputs toward the other computes running instances on that network are added in table 22.
- We are still missing a flow for one of the DHCP agents.

If we restart the instance that didn't get a DHCP lease, it now gets an IP. However, once we restart the neutron-openvswitch-agent on the compute node, we lose the flows from tables 20 and 22. What we don't understand is why it always returns to the same unhealthy state, with only one flow in table 4.
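
For reference, the "modify the network" workaround mentioned above can be approximated like this (a sketch only; it assumes the OpenStack CLI is available, that default-network is the affected network, and that 192.0.2.0/24 is an unused range):

# creating a throwaway subnet spawns a new DHCP port, which triggers update_fdb_entries notifications
openstack subnet create --network default-network --subnet-range 192.0.2.0/24 tmp-l2pop-kick
openstack subnet delete tmp-l2pop-kick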

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Based on your description, when you remove all the instances belonging to a network and recreate an instance, l2pop creates all the required entries and updates the compute host.
This means it does so for the first active port that pops up on the compute host.

Did you see any errors or tracebacks in the openvswitch agent that look fishy, suggesting it did not configure the flows?
Was there any RPC timeout?

How many instances were running on that particular compute, or on that network?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Hey Swaminathan, thanks for answering !

We have 8 instances on this network spread like this:
  - 1 on compute002
  - 1 on compute003
  - 3 on compute005
  - 3 on compute007

The only RPC timeouts we got are these (they are not related to any port of the network/instances with the issue):

2018-10-10 17:43:09.282 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Setting status for 90f27db6-9db2-4954-93de-07876d9e6edb to UP _bind_devices /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:889
2018-10-10 17:43:09.548 7 ERROR neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Configuration for devices [u'90f27db6-9db2-4954-93de-07876d9e6edb'] failed!: MessagingTimeout: Timed out waiting for a reply to message ID be464cf0bdee41f0a88577379d2ade86
2018-10-10 17:43:09.559 7 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Configuration for devices up [u'90f27db6-9db2-4954-93de-07876d9e6edb'] and devices down [] completed.
2018-10-10 17:43:09.559 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-20947bce-dfd3-464c-a516-787caacb7405 - - - - -] Agent rpc_loop - iteration:509186 - ports processed. Elapsed:7.430 rpc_loop /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2117

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Output from Manuel's script to check the flows per project:

│ Project: 24288331d192493eb08dbfeeea6eb864
│ │
├── Network: dcabfe60-576d-4820-9ad2-4ac0aeb39090 [default-network] - vxlan id dec: 97 [hex: 0x61]
│ │ │
│   │ │
│   ├── Flows for compute compute002
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── Missing flow for controller001 dhcp agent output vxlan-0a83083c in table 22
│   │   ├── There is a flow for controller002 dhcp agent output vxlan-0a83083d in table 22
│   │   ├── There is a flow for controller003 dhcp agent output vxlan-0a83083e in table 22
│   │   ├── There is a flow for compute003 vxlan output vxlan-0a830841 in table 22
│   │   ├── Missing flow for compute005 vxlan output vxlan-0a830843 in table 22
│   │   ├── Missing flow for compute007 vxlan output vxlan-0a830845 in table 22
│   │ │
│   ├── Flows for compute compute003
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── There is a flow for controller001 dhcp agent output vxlan-0a83083c in table 22
│   │   ├── There is a flow for controller002 dhcp agent output vxlan-0a83083d in table 22
│   │   ├── There is a flow for controller003 dhcp agent output vxlan-0a83083e in table 22
│   │   ├── There is a flow for compute002 vxlan output vxlan-0a830840 in table 22
│   │   ├── There is a flow for compute005 vxlan output vxlan-0a830843 in table 22
│   │   ├── There is a flow for compute007 vxlan output vxlan-0a830845 in table 22
│   │ │
│   ├── Flows for compute compute005
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── Missing flow for controller001 dhcp agent output vxlan-0a83083c in table 22
│   │   ├── There is a flow for controller002 dhcp agent output vxlan-0a83083d in table 22
│   │   ├── There is a flow for controller003 dhcp agent output vxlan-0a83083e in table 22
│   │   ├── Missing flow for compute002 vxlan output vxlan-0a830840 in table 22
│   │   ├── There is a flow for compute003 vxlan output vxlan-0a830841 in table 22
│   │   ├── There is a flow for compute007 vxlan output vxlan-0a830845 in table 22
│   │ │
│   ├── Flows for compute compute007
│   │   ├─ Table 20
│   │   ├── There is a flow for controller001 dhcp agent fa:16:3e:4f:3e:db in table 20
│   │   ├── There is a flow for controller002 dhcp agent fa:16:3e:5b:63:e3 in table 20
│   │   ├── There is a flow for controller003 dhcp agent fa:16:3e:4c:67:28 in table 20
│   │   ├─ Table 22
│   │   ├── There is a flow for contro...


Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Did you check the ovs-vswitchd logs? Did you find anything odd in there, for example VXLAN tunnel interfaces being created and deleted, or duplicate VXLAN tunnel interfaces?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We have a lot of WARN messages in ovs-vswitchd.log like this:

2018-10-11T00:36:57.802Z|47693|tunnel(handler467)|WARN|Dropped 23 log messages in last 59 seconds (most recently, 4 seconds ago) due to excessive rate

2018-10-11T00:36:57.802Z|47694|tunnel(handler467)|WARN|receive tunnel port not found (arp,tun_id=0x5e,tun_src=10.131.8.72,tun_dst=10.131.8.69,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_flags=key,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:92:43:23,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=192.168.3.18,arp_tpa=192.168.3.18,arp_op=1,arp_sha=fa:16:3e:92:43:23,arp_tha=ff:ff:ff:ff:ff:ff)

2018-10-11T00:37:57.860Z|47697|tunnel(handler467)|WARN|Dropped 24 log messages in last 60 seconds (most recently, 4 seconds ago) due to excessive rate

2018-10-11T00:37:57.860Z|47698|tunnel(handler467)|WARN|receive tunnel port not found (arp,tun_id=0x5e,tun_src=10.131.8.72,tun_dst=10.131.8.69,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_flags=key,in_port=8,vlan_tci=0x0000,dl_src=fa:16:3e:92:43:23,dl_dst=ff:ff:ff:ff:ff:ff,arp_spa=192.168.3.18,arp_tpa=192.168.3.18,arp_op=1,arp_sha=fa:16:3e:92:43:23,arp_tha=ff:ff:ff:ff:ff:ff)

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I suspect there are tunnel ports that were originally created, then deleted, and new ones created in their place. The flow rules still contain the original tunnel port data but not the new ports created by the l2 agent.
Do you see anything similar to this, where two tunnel interfaces are created back to back?
Or check in your ovs-vswitchd log whether there is an entry for the tunnel port while it still complains that the tunnel port was not found because it was deleted.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Is there an easy way to reproduce this problem with a Pike devstack?
I would be interested in fixing this bug, but I need a way to reproduce it.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

When we spawn an instance on the "broken" network on a compute node that failed, we can see this in ovs-vswitchd.log:

2018-10-16T20:40:49.616Z|94255|bridge|INFO|bridge br-int: added interface qvof5e19b3e-f8 on port 1887
2018-10-16T20:41:05.950Z|94256|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 9 flow_mods in the 6 s starting 10 s ago (8 adds, 1 deletes)
2018-10-16T20:41:08.942Z|00119|tunnel(handler378)|WARN|Dropped 45 log messages in last 61 seconds (most recently, 5 seconds ago) due to excessive rate
2018-10-16T20:41:08.942Z|00120|tunnel(handler378)|WARN|receive tunnel port not found (igmp,tun_id=0xad,tun_src=10.131.8.83,tun_dst=10.131.8.64,tun_ipv6_src=::,tun_ipv6_dst=::,tun_gbp_id=0,tun_gbp_flags=0,tun_tos=0,tun_ttl=64,tun_flags=key,in_port=7,vlan_tci=0x0000,dl_src=fa:16:3e:bc:0b:94,dl_dst=01:00:5e:00:00:16,nw_src=192.168.3.16,nw_dst=224.0.0.22,nw_tos=0,nw_ecn=0,nw_ttl=1,igmp_type=0,igmp_code=0)
2018-10-16T20:41:08.942Z|00167|ofproto_dpif_upcall(handler364)|INFO|Dropped 22 log messages in last 61 seconds (most recently, 6 seconds ago) due to excessive rate
2018-10-16T20:41:08.942Z|00168|ofproto_dpif_upcall(handler364)|INFO|received packet on unassociated datapath port 7

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Is it possible that if a flow fails in Open vSwitch, the whole process stops and then restarts, fails, restarts, fails, etc.?

Is there a way to fix the flows, like dropping the ovsdb and restarting the node?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Swaminathan Vasudevan, we don't really know how to reproduce it; this is why we are kind of desperate. This issue appears on multiple deployments, all with Pike.

It basically appears when we restart the neutron-openvswitch-agent container.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

So you are saying that if we have at least a couple of compute nodes, with 2 or 3 VMs on each, and we repeatedly restart the neutron-openvswitch-agent, we can reproduce it on Pike?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

This is how it happens for us.
Another weird fact: it seems to be related to old networks only; newer networks don't have this issue.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

It just happened without any restart: a few instances on the same network across different compute nodes lost Internet connectivity.

Logs about the MAC address of one of the instances:

2018-10-16 20:48:52.245 7 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-e043990b-2cff-43bf-a0e6-65b24421ec4b - - - - -] ofctl request version=0x4,msg_type=0x12,msg_len=0x38,xid=0xb26581cc,OFPFlowStatsRequest(cookie=0,cookie_mask=0,flags=0,match=OFPMatch(oxm_fields={}),out_group=4294967295,out_port=4294967295,table_id=25,type=1) result [OFPFlowStatsReply(body=[OFPFlowStats(byte_count=63937013,cookie=1865276097587823870,duration_nsec=225000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:cc:af:31', 'in_port': 67}),packet_count=122641,priority=2,table_id=25), OFPFlowStats(byte_count=0,cookie=1865276097587823870,duration_nsec=144000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:d7:bd:b7', 'in_port': 201}),packet_count=0,priority=2,table_id=25), OFPFlowStats(byte_count=0,cookie=1865276097587823870,duration_nsec=116000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:2b:3e:10', 'in_port': 68}),packet_count=0,priority=2,table_id=25), OFPFlowStats(byte_count=870,cookie=1865276097587823870,duration_nsec=89000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:2c:6e:6b', 'in_port': 252}),packet_count=15,priority=2,table_id=25), OFPFlowStats(byte_count=236338958,cookie=1865276097587823870,duration_nsec=32000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:a8:60:0b', 'in_port': 307}),pac
ket_count=338352,priority=2,table_id=25), OFPFlowStats(byte_count=368190,cookie=1865276097587823870,duration_nsec=3000000,duration_sec=9052,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:73:cd:9d', 'in_port': 112}),packet_count=1462,priority=2,table_id=25), OFPFlowStats(byte_count=145838491,cookie=1865276097587823870,duration_nsec=269000000,duration_sec=8876,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:77:81:ec', 'in_port': 57}),packet_count=67025,priority=2,table_id=25), OFPFlowStats(byte_count=581037,cookie=1865276097587823870,duration_nsec=223000000,duration_sec=8876,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionGotoTable(len=8,table_id=60,type=1)],length=80,match=OFPMatch(oxm_fields={'eth_src': 'fa:16:3e:14:e2:2f', 'in_port': 89}),packet_count=9460,pr...

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Yes, I think the problem is that when the first VM on that network is created on a node, the l2pop rules have to be populated, provided the tunnel interfaces have been created properly and the VLAN manager has configured the VLAN mapping; if everything is in place, the l2pop rules will be populated properly.
If any of these are missing, it may error out and not configure the l2pop rules for the network.

Newly created networks may not have this problem, presumably because their tunnel interfaces and VLAN mapping are configured right.

The most important thing we need to figure out is whether there is a race between the tunnel interface creation / VLAN mapping and the l2pop flow add.
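
A quick way to check the two preconditions mentioned above (tunnel interfaces present and the local VLAN mapping configured) on a suspect compute node, assuming the same kolla container layout used earlier in this thread and the example network 0x61:

# list the vxlan tunnel ports the agent has created on br-tun
docker exec -ti openvswitch_vswitchd ovs-vsctl list-ports br-tun | grep vxlan

# confirm the tun_id -> local VLAN mapping exists for the network (table 4)
docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun table=4 | grep tun_id=0x61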

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We tried restarting neutron-server: some of the missing flows were added back, and after a few minutes they were removed again...

On the compute node we can see the fdb_add and fdb_remove calls.
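
For reference, a simple way to watch those l2pop updates arriving on the compute node (a sketch; it assumes debug logging is enabled on the agent and the default kolla log path):

tail -f /var/log/kolla/neutron/neutron-openvswitch-agent.log | grep -E 'add_fdb_entries|remove_fdb_entries|update_fdb_entries'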

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

I saw an error in neutron-server.log:

2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers [req-61134bd9-1f45-4aad-9148-d64dba96b1d0 - - - - -] Mechanism driver 'l2population' failed in update_port_postcommit: AttributeError: 'Context' object has no attribute 'begin'
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers Traceback (most recent call last):
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/managers.py", line 426, in _call_on_drivers
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers getattr(driver.obj, method_name)(context)
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 183, in update_port_postcommit
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers self.update_port_up(context)
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 295, in update_port_up
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers if is_first_port or self.agent_restarted(context):
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/mech_driver.py", line 246, in agent_restarted
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers agent = l2pop_db.get_agent_by_host(port_context, agent_host)
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/l2pop/db.py", line 60, in get_agent_by_host
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers with session.begin(subtransactions=True):
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers AttributeError: 'Context' object has no attribute 'begin'
2018-10-18 16:23:39.258 30 ERROR neutron.plugins.ml2.managers
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager [req-61134bd9-1f45-4aad-9148-d64dba96b1d0 - - - - -] Error during notification for neutron.plugins.ml2.plugin.Ml2Plugin._port_provisioned--9223372036854067064 port, provisioning_complete: MechanismDriverError
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager Traceback (most recent call last):
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron_lib/callbacks/manager.py", line 197, in _notify_loop
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager callback(resource, event, trigger, **kwargs)
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager File "/var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/plugin.py", line 260, in _port_provisioned
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager self.update_port_status(context, port_id, const.PORT_STATUS_ACTIVE)
2018-10-18 16:23:39.294 30 ERROR neutron_lib.callbacks.manager File "/var/lib/kolla/venv/loc...


Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

One other suggestion would be to test this in your environment (since I am not able to reproduce it locally, it may make sense to test it in your environment):

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/rpc.py#L327

Can you comment out this check in rpc.py (#L327, #L328 and #L329) and restart the service?
You may see an ERROR log in the openvswitch agent log that is added for DVR routers, but it should not affect the functionality.

Revision history for this message
Brian Haley (brian-haley) wrote :

Hi - just wondering if you've been able to test with Swami's suggestion of adding some logging to see which code path is taken so we can further triage? Thanks.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Hey Guys,

Sorry, it's hard for us to test this in our production environment.
We'll keep you posted about it!

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi,

I have been trying to find the trigger for this problem for a while, and I finally found something in one of our OpenStack environments.

I had set up a VXLAN tenant network with DHCP enabled, a router and three instances. Two instances are on the same compute node, the other is on a separate compute node.
The OVS flows were good after the creation of these resources, and they stayed that way for a while, but yesterday I finally saw some problematic differences in the flows.
On November 21st, early afternoon, everything was good. On November 22nd, the three controllers had missing or bad flows. Furthermore, the issues were not the same on all three controllers.

Below are the correct flows for each controller and then the problematic flows from the day after for comparison.

Controller001 Nov21 (good)
root@controller001[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
 cookie=0x7449e7306e86ff8f, duration=423786.590s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
 cookie=0x7449e7306e86ff8f, duration=423757.339s, table=20, n_packets=2, n_bytes=420, priority=2,dl_vlan=25,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x7449e7306e86ff8f, duration=423754.230s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x7449e7306e86ff8f, duration=423107.131s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"
 cookie=0x7449e7306e86ff8f, duration=423061.839s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
 cookie=0x7449e7306e86ff8f, duration=438618.539s, table=22, n_packets=46, n_bytes=4140, priority=1,dl_vlan=25 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903",output:"vxlan-0aa04902",output:"vxlan-0aa04922",output:"vxlan-0aa0491b"

Controller001 Nov22 (bad)
root@controller001[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
 cookie=0x9d709590904de6cd, duration=77469.794s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
 cookie=0x9d709590904de6cd, duration=77280.104s, table=20, n_packets=2, n_bytes=420, priority=2,dl_vlan=25,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x9d709590904de6cd, duration=77279.096s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
 cookie=0x9d709590904de6cd, duration=76996.452s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
 cookie=0x9d709590904de6cd, duration=76846.245s, ...

Revision history for this message
Brian Haley (brian-haley) wrote :

Alexandre - thanks for all the information, we've just been swamped with other work, but we will take a look as soon as we have time.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Glad to hear that @Brian!

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Alexandre,
From your logs, it seems this 'unicast-to-tun' rule is missing between the good and the bad dumps:

n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"

Is the port with MAC fa:16:3e:bf:ae:93 still alive and seen on other nodes?
Is this a VM port or a non-VM port?

Was this rule only missing on the controller node and not on the other compute nodes?

If it is only on the controller, was the controller configured for HA?

If it is a VM port, did this VM ever migrate, or did this port at least go down and come up (I mean, was the VM shut down and then restarted, or similar)?
When this event happened, did the neutron server or the neutron agent go down and come up?

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi Swaminathan,

Did you download the full text? The website does not show the full comment. I am asking because I'm not sure what part of my comment you are referring to when you talk about the missing rule for MAC fa:16:3e:bf:ae:93; I don't see it missing.

Most of my text was referring to MAC fa:16:3e:ab:ab:ae, which is missing a table 20 flow on controller003. I also discuss issues with table 22 on all controllers and that the trigger seemed to have been a restart of neutron services.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Hi Alexandre,
Yes, I might have missed the full text; I only noticed the part that shows up in the activity log.

I have read the full text now.

So the problem you mentioned in your log is that restarting the neutron-server and the openvswitch agent would have triggered this issue, based on your statement:

"The next question is what triggered the flows to be refreshed/recreated at 22:24?

After some investigation I found that neutron-server and neutron-openvswitch services were restarted around that time."

The unicast-to-tun flow rules are created when the actual VM pops up on the compute node: when there is a new MAC, that MAC entry is populated on all nodes other than the host node that is hosting the VM.
If the VM is already created and the rule is already in place, then an agent restart can bring back those rules. Also, enabling and disabling those ports would trigger a 'port up'.

The table 22 rule is the default (flood) action, where the traffic from the DVR routers should be sent. As the VXLAN tunnels are added, they should be appended to the list of output ports in the flood-to-tun rule.

This is first populated on a node when the first port on a particular network shows up. It collects all the VXLAN tunnel details for that particular network and adds them on the node.

So there is a possibility of a mismatch happening on a particular node when the VXLAN tunnels are deleted and the agent is restarted without restarting neutron-server.

I have always seen things in sync when the neutron-server is restarted first and the agent restarted later.

Having understood your problem, the only case in which fdb_add_tun will not call add_fdb_flow is when agent_ports does not exist.

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L560

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

In your case we should identify why, on agent restart, it did not pick up the 'unicast-to-tun' rule for the missing MAC (dl_dst=fa:16:3e:ab:ab:ae).

On restart,
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/rpc.py#L335
lines L335 to L350 have the logic to rewrite the flow rules if the port is in ACTIVE state.
But if the port is not in ACTIVE state, it might not rewrite those rules and will instead remove them.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This is where the tunnel ports are getting added to the local agent list.

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1544

When the FLOOD_ENTRY is built, the tunnels are taken from this local list.
So somehow either the locally cached list is wiped out by some other function, or else the agent misses it.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

This is where the ports (VXLAN ports) are written into the flooding-table flow rule:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_tun.py#L173

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Thanks @Swaminathan for your help on this !!!

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi Swaminathan,

Thanks for the replies.

OK, so it is always better to restart neutron-server before the openvswitch agent. Should we never restart the agent alone?

If I understand correctly, the cause of the unicast-to-tun issue is different from the flood-to-tun issue?

I want to mention that nothing was erased or moved during the test case. I had created three instances on a new network and copied the flows. Then every day I connected to the environment and checked whether my flows stayed the same. Then one day they had changed; that's when I posted my long comment. To my knowledge the ports were always active. It is strange that the flow to MAC fa:16:3e:ab:ab:ae was recreated on controller001 but not on controller002 and controller003. That's another reason to believe the port was active.

As for flood-to-tun, we see a lot of inconsistencies in our different environments. You mention that maybe the locally cached list is wiped out. Is there a way to check?

I am open to any suggestions or tests. It's been a while since I wrote that long comment (November 23rd). When I left it, things were in a really bad state (see the end of my long comment): there was no table 22 flow and no table 20 flows to each instance. I checked again this week and the flows have slowly fixed themselves but still have some problems. Controller002 is missing the table 20 flow to controller003, and its table 22 is missing all tunnels except the one to controller001. Controller003's table 22 is missing all tunnels except the one to controller001. Controller001 is good. The two computes which have instances are missing the table 20 flow to controller003. I couldn't tell you what actions were taken on the cloud since November 23rd, though the tenant network and instances themselves have not been modified in any way. I can keep testing on this network, or create something new (network and instances) and run tests on that.

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Hi Swaminathan,

Just another update: pretty much all the flows on all servers are back. The only issue is on controller002, which is missing tunnels in table 22 to the two compute servers. All other controllers and computes have correct flows for my test network.
It's unclear why this is happening. Since this cloud infrastructure is used by many people I can't follow all the activity, but what I can say is that no one is trying to fix my flows on purpose.

On a different production OpenStack infrastructure, we had a neutron maintenance yesterday. We had to change ml2_conf.ini to increase the VXLAN ID range, and neutron services were restarted after making the change. I also had a test network on this infrastructure with DHCP, a router and a few instances. I checked the flows for my network and saw abnormalities.
Table 22 disappeared from 3 of my physical hosts, which either have instances or DHCP namespaces on them. On my other host, table 22 exists for my network but it has tunnels to physical hosts that do not have my network, and it is missing tunnels to the other hosts that do have my network. I still have not been able to find the cause of this issue.

Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

Thanks for the information. I will check the code path again to see what might cause table 22 to misbehave.

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Guys, do you have any update about this bug?
We are still facing the issue.

Revision history for this message
Alexandre Perreault (alexperreault) wrote :

Here is an example of what I described in my previous comment. Our biggest problem is with table 22 as it causes outages for our clients.

Here is an example of flows for a vxlan network on a host. As you can see there is no table 22 at all.

docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x2b4
 cookie=0xfdd60859f2658796, duration=2037831.278s, table=4, n_packets=245, n_bytes=52188, priority=1,tun_id=0x2b4 actions=mod_vlan_vid:181,resubmit(,9)
 cookie=0xfdd60859f2658796, duration=304285.121s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=181,dl_dst=fa:16:3e:6d:55:e1 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xfdd60859f2658796, duration=299805.493s, table=20, n_packets=34, n_bytes=6528, priority=2,dl_vlan=181,dl_dst=fa:16:3e:99:cc:4c actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490c"
 cookie=0xfdd60859f2658796, duration=293855.521s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=181,dl_dst=fa:16:3e:f2:2f:7d actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xfdd60859f2658796, duration=293234.777s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=181,dl_dst=fa:16:3e:38:08:be actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490e"

Then on another host, for the SAME VXLAN network, we have the opposite: there is a table 22, but all the tunnels are incorrect (they don't match the table 20 flows/tunnels).

docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x2b4
 cookie=0xe4add62aa34905c9, duration=2040773.745s, table=4, n_packets=0, n_bytes=0, priority=1,tun_id=0x2b4 actions=mod_vlan_vid:669,resubmit(,9)
 cookie=0xe4add62aa34905c9, duration=318238.768s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:af:9d:f4 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490d"
 cookie=0xe4add62aa34905c9, duration=318227.433s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:4c:c0:93 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490d"
 cookie=0xe4add62aa34905c9, duration=304524.808s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:6d:55:e1 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xe4add62aa34905c9, duration=300045.175s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:99:cc:4c actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490c"
 cookie=0xe4add62aa34905c9, duration=294095.211s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=669,dl_dst=fa:16:3e:f2:2f:7d actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490a"
 cookie=0xe4add62aa34905c9, duration=2040773.792s, table=22, n_packets=8, n_bytes=720, priority=1,dl_vlan=669 actions=strip_vlan,load:0x2b4->NXM_NX_TUN_ID[],output:"vxlan-0aa2490f",output:"vxlan-0aa24911",output:"vxlan-0aa24910",output:"vxlan-0aa2490b",output:"vxlan-0aa2490d"

These are outputs from today, from the same environment I talked about on December 19th...

I can also provide similar outputs for other networks on different environments. In another environment, on one host, table 22 only...


Revision history for this message
Bruca Lock (lockshi) wrote :

Hi Swaminathan,

   I did more analysis at the code level and here are my observations:

1 - cleanup_stale_flows will wipe out all flows, and somehow tables 22 and 20 for the compute node don't get recreated:
https://github.com/openstack/neutron/blob/886782c1777d8f104b5c57ffd4eff7680056602c/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2165

My question is: should it systematically call cleanup_stale_flows on restart of nova_ovs? Note that no failed devices are detected during process_network_ports.
I tried forcing provisioning_needed on restart to ensure it does local VLAN provisioning. The side effect of this is that table 22 is maintained with some extra flows, but I still lose table 20 for the compute instance.

2 - I added logs to the flooding-table flow rule https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/br_tun.py#L173
as per your suggestion and noticed that table 22 for my specific tunnel didn't get populated during process_network_ports.

3 - Looking at _setup_tunnel_port https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1544
I didn't notice any errors related to this function in the code.

4 - For process_network_ports, most of the processing goes into treat_devices_added_or_updated and can take quite a while; in some cases we had to wait around 15 minutes.

Question about cleanup_stale_flows: I just noticed that it simply loops over the bridges and cleans up the flows. I didn't see any logic there that checks for stale flows, unless I'm wrong?
The problem probably lies in the fact that the flooding-table flow rule didn't get installed for our specific vlan_id. Any insight into what could cause that?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Could this be related to Docker (SIGTERM, SIGKILL) ?

Bruca Lock (lockshi)
Changed in neutron:
assignee: nobody → Bruca Lock (lockshi)
Revision history for this message
Bruca Lock (lockshi) wrote :

An interesting observation: when we use the neutron-ovs-cleanup tool on the controller, flows get repopulated quickly after a restart of neutron-ovs-agent. However, another problem then occurs where I'm not able to do ICMP to a remote host. This could be because I may have rebooted the compute node.

So from a quick extrapolation, this could be due either to the amount of resources that need to be repopulated during restart, which somehow times out (which I didn't see in the logs), or to dangling resources that corrupt the repopulation.

Is anyone aware of this situation, or does anyone know whether neutron-ovs-cleanup is still required?

Revision history for this message
Bruca Lock (lockshi) wrote :

Finally found the reason why it's not populating the fdb entries. In the case where there are lots of ports, the gateway port binding gets skipped during restart by the neutron server, as per the following code, which relies on a timer check that is not optimal:
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/rpc.py#L332

I propose we come up with a more deterministic way of knowing whether the restart completed, either through an RPC confirmation or the iteration counter.

For now, a fix that works is to increase the l2pop agent_boot_time in the neutron server ml2 config from the default value of 180 to, in my case, 1800.
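
For anyone wanting to try the same workaround, this is roughly what the change looks like (a sketch; the exact location of the ml2_conf.ini used by neutron-server depends on the deployment, and neutron-server must be restarted afterwards):

[l2pop]
agent_boot_time = 1800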

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

Swaminathan, what is the downside of having too high a value for agent_boot_time?

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

We tried it in different environments and so far we are not facing the issue anymore.
The pending questions from Bruca (#41) and me (#42) still stand. :)

Revision history for this message
Brian Haley (brian-haley) wrote :

I wonder how much of this will be covered in the fix(es) for https://bugs.launchpad.net/neutron/+bug/1813703 ? There is a lot of good info there as well and a patch, https://review.openstack.org/#/c/640797/

Revision history for this message
Gaëtan Trellu (goldyfruit) wrote :

@Brian, this is great news!!!

Changed in neutron:
assignee: Bruca Lock (lockshi) → LIU Yulong (dragon889)
status: New → In Progress
Changed in neutron:
assignee: LIU Yulong (dragon889) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → LIU Yulong (dragon889)
Revision history for this message
Bruca Lock (lockshi) wrote :

Tested with a very large agent_boot_time and it has no side effects besides the flows being repopulated many times.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/645405

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/645406

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/645408

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/640797
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a5244d6d44d2b66de27dc77efa7830fa657260be
Submitter: Zuul
Branch: master

commit a5244d6d44d2b66de27dc77efa7830fa657260be
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0rc1

This issue was fixed in the openstack/neutron 14.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/649729

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/645408
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=62fe7852bbd70a24174853997096c52ee015e269
Submitter: Zuul
Branch: stable/pike

commit 62fe7852bbd70a24174853997096c52ee015e269
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/645405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=cc49ab550179bdc76d79f48be67886681cb43d4e
Submitter: Zuul
Branch: stable/rocky

commit cc49ab550179bdc76d79f48be67886681cb43d4e
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.7

This issue was fixed in the openstack/neutron 11.0.7 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.3

This issue was fixed in the openstack/neutron 13.0.3 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.6

This issue was fixed in the openstack/neutron 12.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.opendev.org/649729
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9583dc0549da2b4529a59b5862ba42aebc5ae15f
Submitter: Zuul
Branch: stable/ocata

commit 9583dc0549da2b4529a59b5862ba42aebc5ae15f
Author: LIU Yulong <email address hidden>
Date: Mon Mar 4 21:17:20 2019 +0800

    More accurate agent restart state transfer

    Ovs-agent can be very time-consuming in handling a large number
    of ports. At this point, the ovs-agent status report may have
    exceeded the set timeout value. Some flows updating operations
    will not be triggerred. This results in flows loss during agent
    restart, especially for hosts to hosts of vxlan tunnel flow.

    This fix will let the ovs-agent explicitly, in the first rpc loop,
    indicate that the status is restarted. Then l2pop will be required
    to update fdb entries.

    Conflicts:
     neutron/plugins/ml2/rpc.py

    Conflicts:
     neutron/plugins/ml2/drivers/l2pop/mech_driver.py

    Closes-Bug: #1813703
    Closes-Bug: #1813714
    Closes-Bug: #1813715
    Closes-Bug: #1794991
    Closes-Bug: #1799178

    Change-Id: I8edc2deb509216add1fb21e1893f1c17dda80961
    (cherry picked from commit a5244d6d44d2b66de27dc77efa7830fa657260be)
    (cherry picked from commit cc49ab550179bdc76d79f48be67886681cb43d4e)
    (cherry picked from commit 5ffca4966877454c605442e9e429aa83ea7d7348)

tags: added: in-stable-ocata
tags: added: neutron-proactive-backport-potential
Revision history for this message
Hua Zhang (zhhuabj) wrote :

I can still see the following problem, reported in the bug description, with neutron-server=2:14.0.2-0ubuntu1~cloud0, which actually includes all the patches from this bug.

Behavior [2] is also related to missing flows: some compute nodes are missing outputs to other compute nodes (vxlan-xxxxxx), which prevents an instance on compute 1 from communicating with an instance on compute 2.

On one of our production systems we saw the following flow rule (sorry, we don't know how to reproduce this problem yet).

cookie=0x7e47f520bdf5215b, duration=55484.529s, table=22, n_packets=3441, n_bytes=218308, idle_age=794, priority=1,dl_vlan=10 actions=strip_vlan,load:0x40a->NXM_NX_TUN_ID[],output:47

port 47 is port vxlan-0af64525

# readlink -f /tmp/tmp.Z4pKS6ZnFC/vcd41026/ovs/bridges/br-tun/ports/47
/tmp/tmp.Z4pKS6ZnFC/vcd41026/ovs/ports/vxlan-0af64525

vxlan-0af64525 is a tunnel endpoint with remote_ip=10.246.69.37

root@vcd41026:~# ovs-vsctl show| grep 'Interface "vxlan-0af64525"' -A 2
Interface "vxlan-0af64525"
type: vxlan
options: {df_default="true", in_key=flow, local_ip="10.246.69.32", out_key=flow, remote_ip="10.246.69.37"}

where 10.246.69.37 is on vcd41025, but our test VM is running on vcd41021.

So I believe it should be:

cookie=0x7e47f520bdf5215b, duration=55484.529s, table=22, n_packets=3441, n_bytes=218308, idle_age=794, priority=1,dl_vlan=10 actions=strip_vlan,load:0x40a->NXM_NX_TUN_ID[],output:45,output:46,output:47,output:48,output:49,output:50,output:51,output:52

Alvaro Uria (aluria)
tags: added: canonical-bootstack
Revision history for this message
Jörn Volkhausen (jvolkhausen) wrote :

We see the problem in our infrastructure, like zhhuabj does, after updating to the following versions:
Compute Nodes:
  neutron-common/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-l3-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-metadata-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed,automatic]
  neutron-openvswitch-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
Network Nodes:
  neutron-common/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed,automatic]
  neutron-dhcp-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-l3-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-lbaas-common/bionic,now 2:12.0.0-0ubuntu1 all [installed,automatic]
  neutron-lbaasv2-agent/bionic,now 2:12.0.0-0ubuntu1 all [installed]
  neutron-metadata-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed,automatic]
  neutron-metering-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]
  neutron-openvswitch-agent/bionic-updates,now 2:12.1.0-0ubuntu1 all [installed]

Here we have an infrastructure with three network and six compute nodes, and GRE tunnels between them.

At the moment we are manipulating the flow entries in br-tun table 22 with a script to output the packets to every compute/network node.

I think that even if the patch is included in these package versions, it does not completely solve the issues described above and seen here.

We will now check the agent_boot_time quick fix mentioned by Bruca Lock in post #41.

tags: removed: neutron-proactive-backport-potential
Revision history for this message
Florian Engelmann (engelmann) wrote :

looks like "agent_boot_time" was removed in Train. Why?

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

Hi,

It was deprecated in Stein and we removed it in Train: https://review.opendev.org/#/c/641866/

And more info why it was deprecated is in https://review.opendev.org/#/c/640797/

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I saw this problem again in another customer environment: one host lacks the flows to ports 44, 46 and 48, as comment #59 said.

cookie=0x885c49e981aec5b2, duration=32295.028s, table=22, n_packets=225, n_bytes=14724, idle_age=1318, priority=1,dl_vlan=29 actions=strip_vlan,load:0x3f7->NXM_NX_TUN_ID[],output:36,output:37,output:39,output:43,output:45,output:47

$ ls /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/
1 36 37 39 43 44 45 46 47 48
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/1
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/patch-int
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/44
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/vxlan-0abf0615
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/46
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/vxlan-0abf060d
$ readlink -f /var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/bridges/br-tun/ports/48
/var/snap/ovs-stat/common/tmp/tmp.U6AafAs8Ip/lcpip-xxx002/ovs/ports/vxlan-0abf0613

After restarting OVS, the flows to 44, 46 and 48 came back:

cookie=0x8e8b842be4291fe7, duration=5223.823s, table=22, n_packets=250, n_bytes=16496, priority=1,dl_vlan=29 actions=strip_vlan,load:0x3f7->NXM_NX_TUN_ID[],output:"vxlan-0abf0616",output:"vxlan-0abf0612",output:"vxlan-0abf060f",output:"vxlan-0abf0611",output:"vxlan-0abf0615",output:"vxlan-0abf0614",output:"vxlan-0abf060d",output:"vxlan-0abf060e",output:"vxlan-0abf0613",output:"vxlan-0abf0617",output:"vxlan-0abf0618",output:"vxlan-0abf0619"

Revision history for this message
Paul Goins (vultaire) wrote :

Could the occurrences in stein and newer be linked to https://bugs.launchpad.net/neutron/+bug/1853613?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.

Revision history for this message
Hemanth Nakkina (hemanth-n) wrote :

Tested on OpenStack Train and Ussuri: I see some inconsistency in the l2pop flows whenever openvswitch is restarted (without restarting the neutron-*-agents).

The workaround is to restart neutron-openvswitch-agent to repopulate the correct l2pop flows.
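
For reference, that workaround amounts to something like the following (the service or container name depends on the deployment):

# systemd-based deployments
systemctl restart neutron-openvswitch-agent

# kolla-ansible deployments
docker restart neutron_openvswitch_agent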
