Hi,

I have been trying to find the trigger for this problem for a while, and I was finally able to find something in one of our OpenStack environments.

I had set up a vxlan tenant network with DHCP enabled, a router and three instances. Two instances are on the same compute node; the third is on a separate compute node. The OVS flows were good after these resources were created and stayed that way for a while, but yesterday I finally saw problematic differences in the flows.

On November 21st, early afternoon, everything was good. On November 22nd, all three controllers had missing or bad flows. Furthermore, the issues were not the same on all three controllers.

Below are the correct flows for each controller, followed by the problematic flows from the day after for comparison.

Controller001 Nov21 (good)
root@controller001[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
cookie=0x7449e7306e86ff8f, duration=423786.590s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
cookie=0x7449e7306e86ff8f, duration=423757.339s, table=20, n_packets=2, n_bytes=420, priority=2,dl_vlan=25,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x7449e7306e86ff8f, duration=423754.230s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x7449e7306e86ff8f, duration=423107.131s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"
cookie=0x7449e7306e86ff8f, duration=423061.839s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
cookie=0x7449e7306e86ff8f, duration=438618.539s, table=22, n_packets=46, n_bytes=4140, priority=1,dl_vlan=25 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903",output:"vxlan-0aa04902",output:"vxlan-0aa04922",output:"vxlan-0aa0491b"

Controller001 Nov22 (bad)
root@controller001[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
cookie=0x9d709590904de6cd, duration=77469.794s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
cookie=0x9d709590904de6cd, duration=77280.104s, table=20, n_packets=2, n_bytes=420, priority=2,dl_vlan=25,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x9d709590904de6cd, duration=77279.096s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x9d709590904de6cd, duration=76996.452s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
cookie=0x9d709590904de6cd, duration=76846.245s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"
cookie=0x9d709590904de6cd, duration=77390.425s, table=22, n_packets=52, n_bytes=4680, priority=1,dl_vlan=25 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922",output:"vxlan-0aa0491b"

Table 20 flows are still good.
Table 22 is missing tunnels to other controllers -> output:"vxlan-0aa04903",output:"vxlan-0aa04902" are no longer there.
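Comparing the output lists of the table 22 flood flow by eye is error-prone, so here is a minimal Python sketch that diffs them. The helper name flood_outputs is made up for illustration, and the two flow strings are trimmed copies of the controller001 table 22 lines above.

```python
import re

def flood_outputs(flow_line):
    """Return the set of port names used in output:"..." actions of one flow line."""
    return set(re.findall(r'output:"?([\w-]+)"?', flow_line))

# Table 22 flows for controller001, actions copied from the dumps above.
good = ('table=22, priority=1,dl_vlan=25 actions=strip_vlan,'
        'load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903",'
        'output:"vxlan-0aa04902",output:"vxlan-0aa04922",output:"vxlan-0aa0491b"')
bad = ('table=22, priority=1,dl_vlan=25 actions=strip_vlan,'
       'load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922",output:"vxlan-0aa0491b"')

missing = sorted(flood_outputs(good) - flood_outputs(bad))
extra = sorted(flood_outputs(bad) - flood_outputs(good))

print("missing:", missing)  # ports flooded on Nov21 but not on Nov22
print("extra:", extra)      # ports flooded on Nov22 but not on Nov21
```

The same two calls can be run against the table 22 lines of the other controllers to list their missing and unexpected tunnels.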
Controller002 Nov21 (good)
root@controller002[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
cookie=0x8a55ee0ee26404e, duration=423812.628s, table=20, n_packets=31, n_bytes=6308, priority=2,dl_vlan=156,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
cookie=0x8a55ee0ee26404e, duration=423783.584s, table=20, n_packets=6, n_bytes=1240, priority=2,dl_vlan=156,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x8a55ee0ee26404e, duration=423780.439s, table=20, n_packets=133, n_bytes=17788, priority=2,dl_vlan=156,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x8a55ee0ee26404e, duration=423132.939s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=156,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"
cookie=0x8a55ee0ee26404e, duration=422890.009s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=156,dl_dst=fa:16:3e:78:a4:c7 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04904"
cookie=0x8a55ee0ee26404e, duration=438644.536s, table=22, n_packets=178, n_bytes=15980, priority=1,dl_vlan=156 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922",output:"vxlan-0aa04902",output:"vxlan-0aa04904",output:"vxlan-0aa0491b"

Controller002 Nov22 (bad)
root@controller002[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
cookie=0x6235e529b85f14b0, duration=77347.983s, table=20, n_packets=6, n_bytes=1240, priority=2,dl_vlan=156,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x6235e529b85f14b0, duration=77346.975s, table=20, n_packets=133, n_bytes=17788, priority=2,dl_vlan=156,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x6235e529b85f14b0, duration=76914.122s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=156,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"
cookie=0x6235e529b85f14b0, duration=11775.668s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=156,dl_dst=fa:16:3e:78:a4:c7 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04904"
cookie=0x6235e529b85f14b0, duration=77457.952s, table=22, n_packets=182, n_bytes=16340, priority=1,dl_vlan=156 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"

Table 20 is missing the flow to MAC fa:16:3e:ab:ab:ae via tunnel vxlan-0aa04922 (instance1).
Table 22 is missing tunnels to other controllers and to a compute (instance1) -> output:"vxlan-0aa04922",output:"vxlan-0aa04902",output:"vxlan-0aa04904" are no longer there.

Controller003 Nov21 (good)
root@controller003[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
cookie=0x381c0764f5962ef4, duration=423836.393s, table=20, n_packets=2, n_bytes=400, priority=2,dl_vlan=164,dl_dst=fa:16:3e:ab:ab:ae actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922"
cookie=0x381c0764f5962ef4, duration=423807.140s, table=20, n_packets=30, n_bytes=5944, priority=2,dl_vlan=164,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x381c0764f5962ef4, duration=423804.090s, table=20, n_packets=32, n_bytes=6364, priority=2,dl_vlan=164,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x381c0764f5962ef4, duration=423111.556s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=164,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
cookie=0x381c0764f5962ef4, duration=422913.727s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=164,dl_dst=fa:16:3e:78:a4:c7 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04904"
cookie=0x381c0764f5962ef4, duration=438668.224s, table=22, n_packets=240, n_bytes=21600, priority=1,dl_vlan=164 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903",output:"vxlan-0aa0491b",output:"vxlan-0aa04922",output:"vxlan-0aa04904"

Controller003 Nov22 (bad)
root@controller003[SRV][PRD001][NWI]:~# docker exec -ti openvswitch_vswitchd ovs-ofctl dump-flows br-tun | grep 0x7c
cookie=0x7b570e6d184cd6b6, duration=77979.624s, table=20, n_packets=35, n_bytes=6826, priority=2,dl_vlan=164,dl_dst=fa:16:3e:89:a2:e4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x7b570e6d184cd6b6, duration=77978.615s, table=20, n_packets=36, n_bytes=7204, priority=2,dl_vlan=164,dl_dst=fa:16:3e:d7:0f:a4 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa0491b"
cookie=0x7b570e6d184cd6b6, duration=77696.228s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=164,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
cookie=0x7b570e6d184cd6b6, duration=12407.309s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=164,dl_dst=fa:16:3e:78:a4:c7 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04904"
cookie=0x7b570e6d184cd6b6, duration=78065.179s, table=22, n_packets=244, n_bytes=21960, priority=1,dl_vlan=164 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04922",output:"vxlan-0aa0491f",output:"vxlan-0aa0491e",output:"vxlan-0aa04921",output:"vxlan-0aa04920",output:"vxlan-0aa0491b",output:"vxlan-0aa04918",output:"vxlan-0aa0491c"

Table 20 is missing the flow to MAC fa:16:3e:ab:ab:ae via tunnel vxlan-0aa04922 (instance1).
Table 22 is really strange!
It has the tunnel to instance1 (vxlan-0aa04922).
It is missing the tunnels to the other controllers (vxlan-0aa04903 and vxlan-0aa04904).
It has extra tunnels: output:"vxlan-0aa0491f",output:"vxlan-0aa0491e",output:"vxlan-0aa04921",output:"vxlan-0aa04920",output:"vxlan-0aa04918",output:"vxlan-0aa0491c". These tunnels should not be there!

What caused this?
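To see which hosts the unexpected tunnels point at, the port names can be decoded. The sketch below assumes the suffix after "vxlan-" is the remote IPv4 address written as 8 hex digits; that assumption is at least consistent with this environment, where the l2pop logs place fa:16:3e:ab:ab:ae behind 10.160.73.34 and the flows send it to vxlan-0aa04922. The function name tunnel_remote_ip is hypothetical.

```python
import ipaddress

def tunnel_remote_ip(port_name):
    """Decode e.g. 'vxlan-0aa04922' to a dotted IPv4 string.

    Assumption: the part after the first '-' is the remote tunnel
    endpoint's IPv4 address encoded as 8 hexadecimal digits.
    """
    hex_part = port_name.split("-", 1)[1]
    return str(ipaddress.IPv4Address(int(hex_part, 16)))

# The extra outputs seen in controller003's table 22 flow on Nov22.
extra_ports = ["vxlan-0aa0491f", "vxlan-0aa0491e", "vxlan-0aa04921",
               "vxlan-0aa04920", "vxlan-0aa04918", "vxlan-0aa0491c"]

decoded = {name: tunnel_remote_ip(name) for name in extra_ports}
for name, ip in decoded.items():
    print(name, "->", ip)
```

If the assumption holds, the resulting addresses can be matched against the tunnel IPs of the compute nodes to find out which networks those hosts actually serve.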
I looked into some of the logs. More precisely, I wanted to understand why controller001 still had the flow to instance1 (fa:16:3e:ab:ab:ae) but controller002 and controller003 did not.

Here are the logs on controller003 specifically related to fa:16:3e:ab:ab:ae.

2018-11-21 22:24:18.265 8 DEBUG neutron.plugins.ml2.drivers.l2pop.rpc_manager.l2population_rpc [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent method add_fdb_entries called with arguments (,) {u'fdb_entries': {u'31f29743-e5f6-4178-a7b1-8927a0387eb8': {u'segment_id': 124, u'ports': {u'10.160.73.34': [[u'00:00:00:00:00:00', u'0.0.0.0'], [u'fa:16:3e:ab:ab:ae', u'10.150.150.5']]}, u'network_type': u'vxlan'}}} wrapper /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_log/helpers.py:66
2018-11-21 22:25:08.194 8 DEBUG neutron.plugins.ml2.drivers.l2pop.rpc_manager.l2population_rpc [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent method add_fdb_entries called with arguments (,) {u'fdb_entries': {u'31f29743-e5f6-4178-a7b1-8927a0387eb8': {u'segment_id': 124, u'ports': {u'10.160.73.34': [[u'00:00:00:00:00:00', u'0.0.0.0'], [u'fa:16:3e:ab:ab:ae', u'10.150.150.5']]}, u'network_type': u'vxlan'}}} wrapper /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_log/helpers.py:66

These are the only logs I could find related to the MAC. Both are add_fdb_entries calls.
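For anyone who wants to pull the relevant fields out of these add_fdb_entries lines, here is a small Python sketch that parses the payload copied from the log above. The summarize helper is made up for illustration. Treating the all-zero MAC/IP pair as the l2pop flooding entry (the one that feeds the table 22 flood flow) is stated here as my understanding of l2pop, not something the log itself shows.

```python
import ast

# Payload copied verbatim from the add_fdb_entries log line above.
payload = ast.literal_eval(
    "{u'fdb_entries': {u'31f29743-e5f6-4178-a7b1-8927a0387eb8': "
    "{u'segment_id': 124, u'ports': {u'10.160.73.34': "
    "[[u'00:00:00:00:00:00', u'0.0.0.0'], "
    "[u'fa:16:3e:ab:ab:ae', u'10.150.150.5']]}, "
    "u'network_type': u'vxlan'}}}"
)

# Assumption: l2pop marks the flooding entry with an all-zero MAC and IP.
FLOODING_ENTRY = ("00:00:00:00:00:00", "0.0.0.0")

def summarize(fdb_entries):
    """Flatten fdb_entries into (segment_id, vtep_ip, kind, mac, ip) tuples."""
    rows = []
    for net in fdb_entries.values():
        for vtep_ip, ports in net["ports"].items():
            for mac, ip in ports:
                kind = "flood" if (mac, ip) == FLOODING_ENTRY else "unicast"
                rows.append((net["segment_id"], vtep_ip, kind, mac, ip))
    return rows

rows = summarize(payload["fdb_entries"])
for row in rows:
    print(row)
```

Segment 124 is 0x7c, the tunnel ID used in the grep above, so this RPC carried both the flood entry and the unicast entry for instance1 behind 10.160.73.34.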
I then checked the logs on controller001.

2018-11-21 22:24:18.344 8 DEBUG neutron.plugins.ml2.drivers.l2pop.rpc_manager.l2population_rpc [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent method add_fdb_entries called with arguments (,) {u'fdb_entries': {u'31f29743-e5f6-4178-a7b1-8927a0387eb8': {u'segment_id': 124, u'ports': {u'10.160.73.34': [[u'00:00:00:00:00:00', u'0.0.0.0'], [u'fa:16:3e:ab:ab:ae', u'10.150.150.5']]}, u'network_type': u'vxlan'}}} wrapper /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_log/helpers.py:66
2018-11-21 22:25:08.604 8 DEBUG neutron.plugins.ml2.drivers.l2pop.rpc_manager.l2population_rpc [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent method add_fdb_entries called with arguments (,) {u'fdb_entries': {u'31f29743-e5f6-4178-a7b1-8927a0387eb8': {u'segment_id': 124, u'ports': {u'10.160.73.34': [[u'00:00:00:00:00:00', u'0.0.0.0'], [u'fa:16:3e:ab:ab:ae', u'10.150.150.5']]}, u'network_type': u'vxlan'}}} wrapper /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_log/helpers.py:66
2018-11-21 22:25:08.605 8 DEBUG neutron.plugins.ml2.drivers.l2pop.rpc_manager.l2population_rpc [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent.OVSNeutronAgent method fdb_add_tun called with arguments (, , , {u'10.160.73.34': [PortInfo(mac_address=u'00:00:00:00:00:00', ip_address=u'0.0.0.0'), PortInfo(mac_address=u'fa:16:3e:ab:ab:ae', ip_address=u'10.150.150.5')]}, >) {} wrapper /var/lib/kolla/venv/local/lib/python2.7/site-packages/oslo_log/helpers.py:66
2018-11-21 22:25:09.262 8 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] ofctl request version=0x4,msg_type=0xe,msg_len=0xe8,xid=0x6681da75,OFPFlowMod(buffer_id=4294967295,command=0,cookie=11344731909475133133L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionActions(actions=[OFPActionSetField(arp_op=2), NXActionRegMove(dst_field='arp_tha',dst_ofs=0,experimenter=8992,len=24,n_bits=48,src_field='arp_sha',src_ofs=0,subtype=6,type=65535), NXActionRegMove(dst_field='arp_tpa',dst_ofs=0,experimenter=8992,len=24,n_bits=32,src_field='arp_spa',src_ofs=0,subtype=6,type=65535), OFPActionSetField(arp_sha='fa:16:3e:ab:ab:ae'), OFPActionSetField(arp_spa='10.150.150.5'), NXActionRegMove(dst_field='eth_dst',dst_ofs=0,experimenter=8992,len=24,n_bits=48,src_field='eth_src',src_ofs=0,subtype=6,type=65535), OFPActionSetField(eth_src='fa:16:3e:ab:ab:ae'), OFPActionOutput(len=16,max_len=0,port=4294967288,type=0)],len=160,type=4)],match=OFPMatch(oxm_fields={'arp_tpa': '10.150.150.5', 'eth_type': 2054, 'vlan_vid': 4121}),out_group=0,out_port=0,priority=1,table_id=21) result None _send_msg /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:114
2018-11-21 22:25:09.288 8 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-ea2a8701-df21-426a-bad3-25f78043e702 - - - - -] ofctl request version=0x4,msg_type=0xe,msg_len=0x78,xid=0x6681da8d,OFPFlowMod(buffer_id=4294967295,command=0,cookie=11344731909475133133L,cookie_mask=0,flags=0,hard_timeout=0,idle_timeout=0,instructions=[OFPInstructionActions(actions=[OFPActionPopVlan(len=8,type=18), OFPActionSetField(tunnel_id=124), OFPActionOutput(len=16,max_len=0,port=58,type=0)],len=48,type=4)],match=OFPMatch(oxm_fields={'eth_dst': 'fa:16:3e:ab:ab:ae', 'vlan_vid': 4121}),out_group=0,out_port=0,priority=2,table_id=20) result None _send_msg /var/lib/kolla/venv/local/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/openflow/native/ofswitch.py:114

As you can see, there are more logs related to the MAC here. There are two add_fdb_entries calls, which match those on controller003. But then there is an fdb_add_tun call and two OFPFlowMod requests, one for table 21 (ARP responder) and one for table 20 (unicast flow). This explains why controller001 still has the table 20 flow for fa:16:3e:ab:ab:ae. It also suggests that the table 20 flow was never created on controller003, as there is no log for it.

The next question is what triggered the flows to be refreshed/recreated at 22:24. After some investigation I found that the neutron-server and neutron-openvswitch services were restarted around that time.

So we now have one possible trigger for our issue, and the end result. The remaining questions are: why were the flows created badly or not at all? And why are the issues not identical on all three controllers? I don't have the answer and I'm hoping someone from the community can help.

Earlier in this ticket, it was suggested to comment out lines 327, 328 and 329 of neutron/plugins/ml2/rpc.py. I manually made this change and restarted neutron-server and neutron-openvswitch on all three controllers. I got even worse results.
Controller001
cookie=0x26b2b5c08704800d, duration=564.328s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
cookie=0x26b2b5c08704800d, duration=437.006s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=25,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"

Controller002
cookie=0x8d36e0eb1592f790, duration=472.327s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=156,dl_dst=fa:16:3e:78:a4:c7 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04904"
cookie=0x8d36e0eb1592f790, duration=437.023s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=156,dl_dst=fa:16:3e:bf:ae:93 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04902"

Controller003
cookie=0xd589182b73ccd982, duration=564.288s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=164,dl_dst=fa:16:3e:59:4e:da actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04903"
cookie=0xd589182b73ccd982, duration=472.265s, table=20, n_packets=0, n_bytes=0, priority=2,dl_vlan=164,dl_dst=fa:16:3e:78:a4:c7 actions=strip_vlan,load:0x7c->NXM_NX_TUN_ID[],output:"vxlan-0aa04904"

As you can see, table 20 is missing the flows to all three instances. Only the flows to the controllers are there. Even worse, the table 22 flow is missing completely!

I then uncommented the three lines and restarted neutron-server and neutron-openvswitch on all three controllers (rollback). Unfortunately nothing changed; it did not go back to its previous state. I still have no table 22 flow, and the flows to all three instances are missing.

Given what I have posted, does anyone have any ideas? What else can I do to try to figure this out? Are there any logs I should post?