[dvr][fast-exit] router add/remove subnet operations are not idempotent

Bug #1761555 reported by Dmitrii Shcherbakov
This bug affects 2 people
Affects: neutron
Status: New
Importance: Medium
Assigned to: Unassigned

Bug Description

OpenStack Queens from UCA (xenial, GA kernel), 2 external subnets (one routed provider network), 2 tenant subnets, all subnets in the same address scope to trigger "fast exit" logic.

Tenant subnet cidr: 192.168.100.0/24
Other tenant subnet cidr: 192.168.200.0/24

Relevant agent configs:
http://paste.openstack.org/show/718514/

Commands and outputs:
http://paste.openstack.org/show/JFYmGJwF1pdtliQOfXgd/

Overall, the situation is similar to https://bugs.launchpad.net/neutron/+bug/1759956, except that here there is only one tenant subnet at first, and its routes and rules do not get deleted at all.

Problem description:

* router add subnet tenantsubnet
* routes in fip namespace and rules in qrouter namespace get created and a distributed port gets created for DVR;
* router remove subnet tenantsubnet
* routes are still there, no new logged events in DVR l3 agent logs

If two networks are added then removing one of them triggers removal of routes and rules and new messages are logged in l3 agent log (the rules removed are affected by pad.lv/1759956).

A sequence of add subnet/remove subnet commands may result in errors logged in l3 agent logs: http://paste.openstack.org/show/718511/

Sometimes, after re-adding tenantsubnet in the presence of othertenantsubnet, a proper route is added for a few seconds but then removed:

# just do some operations
(openstack) router add subnet pubrouter tenantsubnet
(openstack) router add subnet pubrouter othertenantsubnet
(openstack) router add subnet pubrouter tenantsubnet
(openstack) router add subnet pubrouter tenantsubnet
(openstack) router remove subnet pubrouter tenantsubnet

# lots of errors, see http://paste.openstack.org/show/718511/

# try again without restarting agents
(openstack) router add subnet pubrouter tenantsubnet # ran client command

# ... got 192.168.100.0/24 here for a few seconds while l3 agent was doing something
10.232.16.0/21 dev fg-7f42af4f-ad proto kernel scope link src 10.232.17.5
169.254.106.114/31 dev fpr-3182a7c6-b proto kernel scope link src 169.254.106.115
192.168.100.0/24 via 169.254.106.114 dev fpr-3182a7c6-b
192.168.200.0/24 via 169.254.106.114 dev fpr-3182a7c6-b

# server and l3 agent finished processing "router add subnet pubrouter tenantsubnet"
# route got deleted
root@ipotane:~# ip netns exec fip-64ab1ec3-4927-4f09-87f9-804e7f4f8748 ip r
10.232.16.0/21 dev fg-7f42af4f-ad proto kernel scope link src 10.232.17.5
169.254.106.114/31 dev fpr-3182a7c6-b proto kernel scope link src 169.254.106.115
192.168.200.0/24 via 169.254.106.114 dev fpr-3182a7c6-b
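
The two route listings above can be compared mechanically. The following is a minimal sketch, assuming the `ip r` output has been captured as plain text; the helper names `parse_routes` and `missing_routes` are hypothetical and are not part of neutron or iproute2.

```python
def parse_routes(ip_route_output):
    """Return the set of route lines, skipping blanks and '#' comments."""
    return {
        line.strip()
        for line in ip_route_output.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }


def missing_routes(before, after):
    """Return the routes present in `before` but absent from `after`."""
    return sorted(parse_routes(before) - parse_routes(after))


# Route table seen for a few seconds while the l3 agent was still working.
BEFORE = """\
10.232.16.0/21 dev fg-7f42af4f-ad proto kernel scope link src 10.232.17.5
169.254.106.114/31 dev fpr-3182a7c6-b proto kernel scope link src 169.254.106.115
192.168.100.0/24 via 169.254.106.114 dev fpr-3182a7c6-b
192.168.200.0/24 via 169.254.106.114 dev fpr-3182a7c6-b
"""

# Route table after the server and l3 agent finished processing.
AFTER = """\
10.232.16.0/21 dev fg-7f42af4f-ad proto kernel scope link src 10.232.17.5
169.254.106.114/31 dev fpr-3182a7c6-b proto kernel scope link src 169.254.106.115
192.168.200.0/24 via 169.254.106.114 dev fpr-3182a7c6-b
"""

if __name__ == "__main__":
    for route in missing_routes(BEFORE, AFTER):
        print("removed:", route)
```

With the listings from this report, the only route that differs is the one for the tenantsubnet cidr (192.168.100.0/24).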

It seems there is something wrong with how tenant network add/remove notifications are sent: on the first removal of a tenant network nothing is logged in the l3 agent logs, although there is activity in the neutron server logs.

tags: added: l3-dvr-backlog
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

I think we should take a single case here to see what is going wrong.

Thanks for your efforts in doing multiple use case tests.

But let us break the problem down into a simple case and go from there, so that it is readable and easy to follow.

What you may be seeing is that, when more than one subnet with address scopes is added to the router and we try to remove one, failures occur when it tries to remove the routes from the fip namespace.

Changed in neutron:
importance: Undecided → Medium
Dmitrii Shcherbakov (dmitriis) wrote :

Yes, you are right, let's solve the problem with notifications to l3 agents first.

So here is what happens:

1) a brand new distributed non-HA router gets created;
2) gets enabled and a gateway port is added to it;
3) a distributed port on one tenant network is added by subnet;
4) after 'router remove subnet <router> <subnet-id>' the distributed port is deleted from the database but no notifications are sent to l3 agents.

This is confirmed by tracing and l3 agent log and is 100% reproducible:

http://paste.openstack.org/show/718632/

--Return--
> /usr/lib/python2.7/dist-packages/neutron/db/l3_dvrscheduler_db.py(250)get_hosts_to_notify()->[]
-> return hosts
(Pdb)

Network agent list:
http://paste.openstack.org/show/718633/

At first glance, I think the problem is that we first remove an interface by subnet and then try to query hosts by subnet:

> /usr/lib/python2.7/dist-packages/neutron/db/l3_db.py(1019)remove_router_interface()
-> port, subnets = self._remove_interface_by_subnet(

...

> /usr/lib/python2.7/dist-packages/neutron/db/l3_db.py(1945)remove_router_interface()
> /usr/lib/python2.7/dist-packages/neutron/db/l3_db.py(1924)notify_router_interface_action()
> /usr/lib/python2.7/dist-packages/neutron/db/l3_db.py(1893)notify_routers_updated()
-> def routers_updated(self, context, router_ids, operation=None, data=None,
    def _notification(self, context, method, router_ids, operation,
                      shuffle_agents, schedule_routers=True):
    def _agent_notification(self, context, method, router_ids, operation,
                            shuffle_agents):
    def get_hosts_to_notify(self, context, router_id):
    def _get_dvr_hosts_for_router(self, context, router_id):
    def _get_dvr_hosts_for_subnets(self, context, subnet_ids):
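
The suspected ordering problem can be illustrated with a toy model. This is only a sketch of the hypothesis above, not neutron's code: `ToyRouterDB` and all of its methods are hypothetical, and it assumes that the host lookup is driven solely by the subnets still attached to the router at the time of the query.

```python
class ToyRouterDB:
    """Simplified, in-memory stand-in for the router/subnet/host lookup."""

    def __init__(self, hosts_by_subnet):
        # subnet_id -> set of hosts with ports on that subnet (assumed data)
        self.hosts_by_subnet = hosts_by_subnet
        # router_id -> set of subnet_ids currently attached to the router
        self.router_subnets = {}

    def add_interface(self, router_id, subnet_id):
        self.router_subnets.setdefault(router_id, set()).add(subnet_id)

    def hosts_to_notify(self, router_id):
        """Collect hosts from the subnets still attached to the router."""
        hosts = set()
        for subnet_id in self.router_subnets.get(router_id, set()):
            hosts |= self.hosts_by_subnet.get(subnet_id, set())
        return sorted(hosts)

    def remove_interface_then_query(self, router_id, subnet_id):
        """Suspected order: delete the interface first, then look up hosts."""
        self.router_subnets[router_id].discard(subnet_id)
        return self.hosts_to_notify(router_id)

    def query_then_remove_interface(self, router_id, subnet_id):
        """Alternative order: snapshot the hosts before deleting."""
        hosts = self.hosts_to_notify(router_id)
        self.router_subnets[router_id].discard(subnet_id)
        return hosts


if __name__ == "__main__":
    db = ToyRouterDB({"tenantsubnet": {"ipotane"},
                      "othertenantsubnet": {"ipotane"}})
    db.add_interface("pubrouter", "tenantsubnet")
    # Single subnet: nothing is left to derive hosts from after removal.
    print(db.remove_interface_then_query("pubrouter", "tenantsubnet"))
```

Under these assumptions, removing the only attached subnet leaves no subnet to derive hosts from, so the lookup returns an empty list, while a router with a second subnet still attached yields a non-empty list. That would be consistent with the `get_hosts_to_notify()->[]` result in the trace.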

Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

When you say that no 'notifications' are sent to the l3-agent when the subnet is removed, were there any active 'Service ports' (VM ports) on the compute node when you tried to remove the subnet?

A notification is sent to the respective host to update the router only when there is an active port on a compute node.
Can you confirm that?

Dmitrii Shcherbakov (dmitriis) wrote :

There were no instance ports there - only a distributed router port, centralized SNAT port and DHCP ports.

Dmitrii Shcherbakov (dmitriis) wrote :

What I noticed is that when there are two tenant network ports attached to a router (no VMs) and one tenant network gets removed, notifications are sent out correctly to L3 agents.
