After deleting a router's gateway port, l3_dvr_db.py checks whether
any other router gateway ports remain on the external network. If
none remain, it calls delete_floatingip_agent_gateway_port [1]. This
call should fan out to the l3 agents on all compute nodes [2], and
each agent should then delete the port [3].
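As a rough illustration of the expected flow, here is a toy model in Python. It is not Neutron code: ToyL3Plugin, its fields, and the "fip_ns_cleanup" message name are hypothetical stand-ins. Only the logic (delete the gateway port, look for remaining gateway ports on the same external network, fan out if there are none) comes from the description above.

```python
# Toy model of the cleanup flow described above. Not Neutron code:
# every name in this block is a hypothetical stand-in.

class ToyL3Plugin:
    def __init__(self, gw_ports, agents):
        # gw_ports maps router id -> external network id of its gateway port.
        self.gw_ports = dict(gw_ports)
        # agents maps host name -> list of messages that host received.
        self.agents = {host: [] for host in agents}

    def delete_router_gateway(self, router_id):
        ext_net_id = self.gw_ports.pop(router_id)
        # Look for any other router gateway ports on the same network.
        if ext_net_id in self.gw_ports.values():
            return False
        # None left: fan out so every agent removes its agent gateway port.
        for inbox in self.agents.values():
            inbox.append(("fip_ns_cleanup", ext_net_id))
        return True


plugin = ToyL3Plugin(
    gw_ports={"router-a": "ext-net", "router-b": "ext-net"},
    agents=["compute-1", "compute-2"],
)
first = plugin.delete_router_gateway("router-a")   # another gw port remains
second = plugin.delete_router_gateway("router-b")  # last one, so fan out
```

In this model, the failure being reported would correspond to an inbox staying empty after the second call, or to an agent receiving the message but not acting on it.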
In some cases, the fip namespace and the gateway port are not deleted.
I don't know where things are going wrong; the flow seems
straightforward. Do some agents miss the fanout? We know that at least
some of them receive it, so it is definitely being sent.
When I checked, the port had been deleted from the database. The fact
that a new one is created supports this: if one already existed in the
DB, it would have been returned instead.
[1] https://github.com/openstack/neutron/blob/d3cd20151a67289f023875de682a6d3c4ccee645/neutron/db/l3_dvr_db.py#L179
[2] https://github.com/openstack/neutron/blob/d3cd20151a67289f023875de682a6d3c4ccee645/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L166
[3] https://github.com/openstack/neutron/blob/d3cd20151a67289f023875de682a6d3c4ccee645/neutron/agent/l3/dvr.py#L73
I'm marking this High because of what happens when there are multiple fg ports in the fip namespace. Because DVR uses proxy_arp on the fg port, two fg ports with the same route to the external network make the host reply to essentially any ARP request on the subnet, receive the traffic, and then send it right back out the other fg interface.
This happens because proxy_arp responds to any ARP request for an IP address that it thinks it can route to on another interface. With two fg interfaces sharing the same route, it thinks it can always route the packet to another interface, regardless of the IP address.
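The reasoning above can be sketched as a small Python model. This is a deliberate simplification of the behavior described here, not the kernel's actual route lookup. The function name, the interface names fg-1 and fg-2, and the 203.0.113.0/24 documentation subnet are all made up for illustration.

```python
import ipaddress


def proxy_arp_would_reply(in_iface, target_ip, routes):
    """Return True if a route to target_ip exists via an interface
    other than the one the ARP request arrived on.

    routes is a list of (cidr, out_iface) pairs. Simplified model only.
    """
    target = ipaddress.ip_address(target_ip)
    return any(
        target in ipaddress.ip_network(cidr) and out_iface != in_iface
        for cidr, out_iface in routes
    )


# Healthy namespace: a single fg port with the external subnet route.
one_fg = [("203.0.113.0/24", "fg-1")]
# Broken namespace: a stale fg port left behind with the same route.
two_fg = [("203.0.113.0/24", "fg-1"), ("203.0.113.0/24", "fg-2")]

single = proxy_arp_would_reply("fg-1", "203.0.113.50", one_fg)
double = [
    proxy_arp_would_reply(iface, "203.0.113.50", two_fg)
    for iface in ("fg-1", "fg-2")
]
```

With one fg port the model declines to answer for an address on the same subnet, because the only route points back out the receiving interface. With two fg ports it answers on both, whatever the address, since the other interface always offers a route.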
With one of these fip namespaces on the network, the problem manifests as a performance degradation because traffic passes through an extra host. With two or three, things get really ugly. These hosts can form a routing loop, and packets go round and round until the TTL expires. Yikes!
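To put a rough number on the loop, here is a minimal Python sketch. It assumes each forwarding host decrements the TTL by one and drops the packet when the TTL reaches zero, and it assumes an initial TTL of 64, which is a common default rather than something measured in this bug. The function name is hypothetical.

```python
def forwards_before_drop(initial_ttl):
    """Count how many times a looping packet is forwarded.

    Each forwarding host decrements the TTL and drops the packet
    once the TTL reaches zero.
    """
    ttl = initial_ttl
    forwards = 0
    while True:
        ttl -= 1
        if ttl <= 0:
            break  # this host drops the packet instead of forwarding it
        forwards += 1
    return forwards


# Assumed initial TTL of 64 (a common default, not taken from this bug).
forwards = forwards_before_drop(64)
```

Under these assumptions, a single packet caught in the loop is forwarded 63 times before it is dropped, so every such packet is multiplied many times over on the external network.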