On our dvr_snat host, the "install_dvr_to_src_mac" is installing the rule in br-int with the SNAT MAC instead instead of the DVR mac address (subnet's gateway aka network:router_interface_distributed). For example, the subnet's gateway is 172.16.0.1, with MAC fa:16:3e:42:a2:ec.
On most hosts, we see following rules in br-int:
[root@stan ~]# ovs-ofctl dump-flows br-int | grep fa:16:3e:42:a2:ec
cookie=0x77f69fee58f51737, duration=11872.801s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:22:eb:8b actions=mod_dl_src:fa:16:3e:42:a2:ec,resubmit(,60)
cookie=0x77f69fee58f51737, duration=11872.790s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:cd:71:e1 actions=mod_dl_src:fa:16:3e:42:a2:ec,resubmit(,60)
cookie=0x77f69fee58f51737, duration=11865.953s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:20:77:00 actions=mod_dl_src:fa:16:3e:42:a2:ec,resubmit(,60)
cookie=0x77f69fee58f51737, duration=11865.933s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:ab:2d:1a actions=mod_dl_src:fa:16:3e:42:a2:ec,resubmit(,60)
cookie=0x77f69fee58f51737, duration=11860.735s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:76:e9:ae actions=mod_dl_src:fa:16:3e:42:a2:ec,resubmit(,60)
cookie=0x77f69fee58f51737, duration=11859.335s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:cb:48:27 actions=mod_dl_src:fa:16:3e:42:a2:ec,resubmit(,60)
However, on our dvr_snat host, these rules are all missing for the dl_src MAC. Instead, they get added with the MAC of the network:router_centralized_snat instead:
root@krusty:~# ovs-ofctl dump-flows br-int | grep fa:16:3e:84:0b:42
cookie=0xbb5ebbfa2dfadb74, duration=5351.368s, table=2, n_packets=2976001, n_bytes=362273213, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:84:0b:42 actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5195.362s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:86:91:e2 actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5195.349s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:a2:04:d3 actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5195.336s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:82:ef:3b actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5195.325s, table=2, n_packets=24, n_bytes=2044, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:e4:d9:f3 actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5195.272s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:b9:a0:fe actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5194.118s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:1a:42:fa actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5194.098s, table=2, n_packets=56, n_bytes=4792, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:84:33:df actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5193.995s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:34:e1:92 actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5193.509s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:6d:3e:f3 actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5191.408s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:30:97:8f actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5188.895s, table=2, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:57:e5:ad actions=mod_dl_src:fa:16:3e:84:0b:42,resubmit(,60)
cookie=0xbb5ebbfa2dfadb74, duration=5351.361s, table=60, n_packets=0, n_bytes=0, idle_age=65534, priority=4,dl_vlan=795,dl_dst=fa:16:3e:84:0b:42 actions=strip_vlan,output:951
root@krusty:~#
I have traced this to the get_subnet_for_dvr call. In the subnet_info, the gateway_mac returned is incorrect. Initially upon restarting OVS agent, the dvr_local_map is empty. So OVS agent makes the get_subnet_for_dvr call to populate local subnet info map. On good hosts, it is querying with fixed_ip = subnet gateway (172.16.0.1). On the snat host, it is querying first with fixed_ip = 172.16.0.3.
Either this is incorrect, or even when querying with SNAT port, the gateway_mac in subnet should be DVR MAC, not snat MAC:
Good host:
root@barney:~# cat ovs.log | grep get_subnet_for_dvr | grep "172.16"
2018-07-24 19:42:24.454 15840 DEBUG neutron.api.rpc.handlers.dvr_rpc [req-5ece05d6-f2cd-46a4-b81e-7e579e61990b - - - - -] neutron.api.rpc.handlers.dvr_rpc.DVRServerRpcApi method get_subnet_for_dvr called with arguments (<neutron_lib.context.ContextBase object at 0x7f52f1983150>, '3707b250-b6f5-4701-9b17-01a8f288c17a') {'fixed_ips': [{'subnet_id': '3707b250-b6f5-4701-9b17-01a8f288c17a', 'ip_address': '172.16.0.1'}]} wrapper /opt/pf9/pf9-neutron/lib/python2.7/site-packages/oslo_log/helpers.py:66
2018-07-24 19:42:24.820 15840 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-5ece05d6-f2cd-46a4-b81e-7e579e61990b - - - - -] get_subnet_for_dvr for subnet 3707b250-b6f5-4701-9b17-01a8f288c17a returned with {u'shared': True, u'service_types': [], u'description': None, u'enable_dhcp': True, u'tags': [], u'network_id': u'3f6ec232-7649-4639-b828-c3af9960481b', u'tenant_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'dns_nameservers': [u'10.1.10.19', u'8.8.8.8', u'8.8.4.4'], u'gateway_ip': u'172.16.0.1', u'ipv6_ra_mode': None, u'allocation_pools': [{u'start': u'172.16.0.2', u'end': u'172.16.255.254'}], u'host_routes': [], u'revision_number': 2, u'ipv6_address_mode': None, u'ip_version': 4, u'gateway_mac': u'fa:16:3e:42:a2:ec', u'cidr': u'172.16.0.0/16', u'project_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'id': u'3707b250-b6f5-4701-9b17-01a8f288c17a', u'subnetpool_id': None, u'name': u'172.16.0.0/16'} _bind_distributed_router_interface_port /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py:371
2018-07-24 19:42:25.686 15840 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-5ece05d6-f2cd-46a4-b81e-7e579e61990b - - - - -] get_subnet_for_dvr for subnet 98d2750d-60ce-4b53-88ef-423b77d5f5f5 returned with {u'shared': True, u'service_types': [], u'description': None, u'enable_dhcp': True, u'tags': [], u'network_id': u'655c3eb4-b9f5-4e30-92de-2262d6e87c92', u'tenant_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'dns_nameservers': [], u'gateway_ip': u'10.100.0.1', u'ipv6_ra_mode': None, u'allocation_pools': [{u'start': u'10.100.0.2', u'end': u'10.100.255.254'}], u'host_routes': [{u'destination': u'0.0.0.0/0', u'nexthop': u'172.16.0.1'}], u'revision_number': 0, u'ipv6_address_mode': None, u'ip_version': 4, u'gateway_mac': u'fa:16:3e:13:61:98', u'cidr': u'10.100.0.0/16', u'project_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'id': u'98d2750d-60ce-4b53-88ef-423b77d5f5f5', u'subnetpool_id': None, u'name': u'dogfood-vxlan-8000-sub'} _bind_distributed_router_interface_port /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py:371
Bad Host:
root@krusty:~# cat ovs.log | grep get_subnet_for_dvr | grep "172.16"
2018-07-24 19:44:44.135 31138 DEBUG neutron.api.rpc.handlers.dvr_rpc [req-6d269f17-c49c-4f64-93f8-139639020c5d - - - - -] neutron.api.rpc.handlers.dvr_rpc.DVRServerRpcApi method get_subnet_for_dvr called with arguments (<neutron_lib.context.ContextBase object at 0x7f1c09d3b410>, '3707b250-b6f5-4701-9b17-01a8f288c17a') {'fixed_ips': [{'subnet_id': '3707b250-b6f5-4701-9b17-01a8f288c17a', 'ip_address': '172.16.0.3'}]} wrapper /opt/pf9/pf9-neutron/lib/python2.7/site-packages/oslo_log/helpers.py:66
2018-07-24 19:44:44.369 31138 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-6d269f17-c49c-4f64-93f8-139639020c5d - - - - -] get_subnet_for_dvr for subnet 3707b250-b6f5-4701-9b17-01a8f288c17a returned with {u'shared': True, u'service_types': [], u'description': None, u'enable_dhcp': True, u'tags': [], u'network_id': u'3f6ec232-7649-4639-b828-c3af9960481b', u'tenant_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'dns_nameservers': [u'10.1.10.19', u'8.8.8.8', u'8.8.4.4'], u'gateway_ip': u'172.16.0.1', u'ipv6_ra_mode': None, u'allocation_pools': [{u'start': u'172.16.0.2', u'end': u'172.16.255.254'}], u'host_routes': [], u'revision_number': 2, u'ipv6_address_mode': None, u'ip_version': 4, u'gateway_mac': u'fa:16:3e:84:0b:42', u'cidr': u'172.16.0.0/16', u'project_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'id': u'3707b250-b6f5-4701-9b17-01a8f288c17a', u'subnetpool_id': None, u'name': u'172.16.0.0/16'} _bind_centralized_snat_port_on_dvr_subnet /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py:553
2018-07-24 19:44:51.786 31138 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-6d269f17-c49c-4f64-93f8-139639020c5d - - - - -] get_subnet_for_dvr for subnet 98d2750d-60ce-4b53-88ef-423b77d5f5f5 returned with {u'shared': True, u'service_types': [], u'description': None, u'enable_dhcp': True, u'tags': [], u'network_id': u'655c3eb4-b9f5-4e30-92de-2262d6e87c92', u'tenant_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'dns_nameservers': [], u'gateway_ip': u'10.100.0.1', u'ipv6_ra_mode': None, u'allocation_pools': [{u'start': u'10.100.0.2', u'end': u'10.100.255.254'}], u'host_routes': [{u'destination': u'0.0.0.0/0', u'nexthop': u'172.16.0.1'}], u'revision_number': 0, u'ipv6_address_mode': None, u'ip_version': 4, u'gateway_mac': u'fa:16:3e:b1:bd:33', u'cidr': u'10.100.0.0/16', u'project_id': u'f175f441ebbb4c2b8fedf6469d6415fc', u'id': u'98d2750d-60ce-4b53-88ef-423b77d5f5f5', u'subnetpool_id': None, u'name': u'dogfood-vxlan-8000-sub'} _bind_centralized_snat_port_on_dvr_subnet /opt/pf9/pf9-neutron/lib/python2.7/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py:553
This causes a whole slew of problems - packets are sent into network infrastructure with src MAC of the local DVR mac, causing this MAC to flap on remote hosts' br-int between patch cable and qr interface. If we shut the snat host's interfaces or bring the host down, the dvr MAC stops flapping on br-int on other hosts, and network connectivity is restored.
The problem appears to be the internal_ gateway_ ports returned by self.plugin. get_ports( ) when called with a filter of the SNAT port. On our snat host, this is the very first port that is used when querying the subnet info.
The subnet is the same, as is the gateway, so why is the gateway_mac returned different depending on query port?
2018-07-25 05:26:25.643 1033 INFO neutron. db.dvr_ mac_db [req-d69336a0- 3db0-4e8a- b24f-b0446c5d97 35 - - - - -] ARJUN: get_subnet_for_dvr: filter = {'fixed_ips': {'subnet_id': [u'3707b250- b6f5-4701- 9b17-01a8f288c1 7a'], 'ip_address': [u'172.16.0.3']}} db.dvr_ mac_db [req-d69336a0- 3db0-4e8a- b24f-b0446c5d97 35 - - - - -] ARJUN: internal_ gateway_ ports = [{'status': u'ACTIVE', 'dns_name': '', 'binding:host_id': u'25f77b4e- 1f9d-45c3- 937e-49eed272fa af', 'description': None, 'allowed_ address_ pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'dns_assignment': [{'hostname': u'host-172-16-0-3', 'ip_address': u'172.16.0.3', 'fqdn': u'host- 172-16- 0-3.platform9. sys.'}] , 'device_owner': u'network: router_ centralized_ snat', 'revision_number': 242L, 'port_security_ enabled' : True, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'3707b250- b6f5-4701- 9b17-01a8f288c1 7a', 'ip_address': u'172.16.0.3'}], 'id': u'bc6797ed- 2aa8-4764- b42f-852438d21b d5', 'security_groups': [], 'device_id': u'37176403- cfb0-478d- b51c-971d89597c f5', 'name': u'', 'admin_state_up': True, 'network_id': u'3f6ec232- 7649-4639- b828-c3af996048 1b', 'tenant_id': u'', 'binding: vif_details' : {u'port_filter': True, u'ovs_hybrid_plug': True}, 'binding: vnic_type' : u'normal', 'binding:vif_type': u'ovs', 'mac_address': u'fa:16: 3e:84:0b: 42', 'project_id': u''}] db.dvr_ mac_db [req-d69336a0- 3db0-4e8a- b24f-b0446c5d97 35 - - - - -] ARJUN: using internal_ port['mac_ address' ] = {'status': u'ACTIVE', 'dns_name': '', 'binding:host_id': u'25f77b4e- 1f9d-45c3- 937e-49eed272fa af', 'description': None, 'allowed_ address_ pairs': [], 'tags': [], 'extra_dhcp_opts': [], 'dns_assignment': [{'hostname': u'host-172-16-0-3', 'ip_address': u'172.16.0.3', 'fqdn': u'host- 172-16- 0-3.platform9. sys.'}] , 'device_owner': u'network: router_ centralized_ snat', 'revision_number': 242L, 'port_security_ enabled' : True, 'binding:profile': {}, 'fixed_ips': [{'subnet_id': u'3707b250- b6f5-4701- 9b17-01a8f288c1 7a', 'ip_address': u'172.16.0.3'}], 'id': u'bc6797ed- 2aa8-4764- b42f-852438d21b d5', 'security_groups': [], 'device_id': u'37176403- cfb0-478d- b51c-971d89597c f5', 'name': u'', 'admin_state_up': True, 'network_id': u'3f6ec232- 7649-4639- b828-c3af996048 1b', 'tenant_id': u'', 'binding: vif_details' : {u'port_filter': True, u'ovs_hybrid_plug': True}, 'binding: vnic_type' : u'normal', 'binding:vif_type': u'ovs', 'mac_address': u'fa:16: 3e:84:0b: 42', 'project_id': u''}
2018-07-25 05:26:25.766 1033 INFO neutron.
2018-07-25 05:26:25.767 1033 INFO neutron.