Excessive number of DVR routers where a VM got a fixed IP on the floating network

Bug #1840579 reported by norman shen on 2019-08-18

Bug Description

We are running into an unexpected situation where the number of DVR routers on a compute node increases to nearly 2000, on a node where some instances have a NIC on the floating IP network.

We are using the Queens release:

neutron-common/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
neutron-l3-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
neutron-metadata-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
neutron-openvswitch-agent/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed]
python-neutron/xenial,now 2:12.0.5-5~u16.04+mcp155 all [installed,automatic]
python-neutron-fwaas/xenial,xenial,now 2:12.0.1-1.0~u16.04+mcp6 all [installed,automatic]
python-neutron-lib/xenial,xenial,now 1.13.0-1.0~u16.04+mcp9 all [installed,automatic]
python-neutronclient/xenial,xenial,now 1:6.7.0-1.0~u16.04+mcp17 all [installed,automatic]

Currently, my guess is that some application mistakenly invokes RPC calls like this https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/api/rpc/agentnotifiers/l3_rpc_agent_api.py#L166 for a DVR router associated with a floating IP network, on a host that has a fixed IP address allocated from the floating network (i.e. a port whose device_owner has the compute: prefix). Such a router is then kept by this function https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L427, because `get_subnet_ids_on_router` does not filter out router:gateway ports.
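To make the claimed fix concrete, here is a hypothetical, simplified sketch (not the actual Neutron code) of what filtering gateway ports out of `get_subnet_ids_on_router` could look like; the dict shapes and the `keep_gateway` flag are assumptions for illustration only:

```python
# Hypothetical sketch: collect the subnet IDs of a router's ports, skipping
# the external gateway port so that VMs with fixed IPs on the floating
# network do not cause the DVR router to be scheduled on their host.
ROUTER_GW = "network:router_gateway"  # device_owner of external gateway ports

def get_subnet_ids_on_router(router_ports, keep_gateway=False):
    """Return the set of subnet IDs used by the router's ports.

    router_ports: list of dicts with 'device_owner' and 'fixed_ips' keys,
    where each fixed_ip is a dict with a 'subnet_id' key.
    """
    subnet_ids = set()
    for port in router_ports:
        if not keep_gateway and port["device_owner"] == ROUTER_GW:
            continue  # skip router:gateway ports
        for fixed_ip in port["fixed_ips"]:
            subnet_ids.add(fixed_ip["subnet_id"])
    return subnet_ids
```

With this filter, a router whose only relationship to the floating network is its external gateway would no longer report the floating subnet, so DVR scheduling based on subnet overlap would skip hosts that merely have a VM with a fixed IP on that network.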

I think this is a bug: as long as a host has no ports with the relevant device owners, there should not be a DVR router on it.

Besides, it is pretty easy to reproduce this bug.

First, create a DVR router with an external gateway on the floating network.
Then create a virtual machine with a fixed IP on the floating network.
Then call `routers_updated_on_host` manually; the DVR router will be created on the host where the VM resides, but it actually should not be there.


Fix proposed to branch: master
Review: https://review.opendev.org/677092

Changed in neutron:
assignee: nobody → norman shen (jshen28)
status: New → In Progress
Changed in neutron:
importance: Undecided → Medium

DVR routers will be created on the host where the service port is bound.
If the VM that you are creating is bound to host A, then the DVR router will be created on that host.

So you are creating a VM on the floating IP network and not on a fixed IP network, and the VM got an IP within the same range as the floating IPs.

That is fine: irrespective of floating IPs, your VM has an IP now, and since that VM is bound to a host, the DVR router is supposed to be provisioned on that host.

I don't understand this part of your comment: "Then call `routers_updated_on_host` manually; the DVR router will be created on the host where the VM resides, but it actually should not be there."

As long as your device_owner is 'compute:none', 'dhcp', or 'lbaas', you should see a router pop up.

norman shen (jshen28) wrote :

I do not understand. The floating IP network is not associated with any router directly; it is just used as the external gateway. So I personally do not believe it is necessary to have a DVR router for it.

norman shen (jshen28) wrote :

Regarding `routers_updated_on_host`: again, please take a look at this method https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L174. It returns every port on the router; even the router:gateway port is returned, and I do not think it is necessary to check the router:gateway port.

norman shen (jshen28) wrote :

For example, let's define `floating` as a floating IP network.
Let's assume that the DVR router `router1` has an external gateway using `floating`,
and all of the servers using this router run on compute01. Now let's create a server using `floating` on compute02.

for instance,

openstack server create --nic net-id=floating --availability-zone :compute02

Up to this point, compute02 does not have a DVR router called `router1`, but if I manually call

neutron.api.rpc.agentnotifiers.l3_rpc_agent_api.L3AgentNotifyAPI.routers_updated_on_host(context, ['router1'], 'compute02')
I can see `router1` created on this host, and I believe this is not necessary.

The root cause is that the method https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L174 also returns the gateway's subnet ID. If one looks at this code https://github.com/openstack/neutron/blob/490471ebd3ac56d0cee164b9c1c1211687e49437/neutron/db/l3_dvrscheduler_db.py#L426, one can see that `get_subnet_ids_on_router` is called when the router ID is not in the result set, and then every router using `floating` as its gateway qualifies to be created on compute02, which does not make sense.

tags: added: l3-dvr-backlog

Hello Norman:

First of all, and not directly related to this bug, let me say that we may be mixing concepts here between FIPs and DVR. Take the FIP part out of this bug.

If you have a network attached to a DVR router and you create a VM with a port in this network, a DVR router will be created on this host. That is what DVR is meant to do: distribute the router load between the compute nodes, creating instances of this router on the servers that have ports on the router's networks. This distributes the routing load between servers.

If you want to have a centralized routing architecture, do not use DVR.

IMO, this bug is not valid.


PS: https://assafmuller.com/category/dvr/

norman shen (jshen28) wrote :

I totally disagree with your point. Again, I believe a DVR router is only necessary on a host when the router has an interface on a network with ports on that host; it should have nothing to do with which network its gateway uses. In this scenario, I use the FIP network for my instance's fixed IP (and this instance does not have a floating IP associated), and the FIP subnet does not have an interface attached to the router, so there should not be a DVR router.

Even if I step back and admit your point is valid, can you please tell me why this router is necessary? This DVR router does not even have a qr-xxx port on the FIP subnet...

I hope you will take a look at the extra test cases I added. Thanks.

Slawek Kaplonski (slaweq) wrote :

IIUC what Norman is saying, I think he is right here.
So let me explain how I understand this issue:

1. There is a network called e.g. "public".
2. There is a DVR router R1 which has the "public" network set as its external gateway and some "private" networks plugged into it, so VMs connected to the "private" networks can have FIPs from "public" associated.
3. Now, on some compute node, a new VM is spawned and plugged directly into the "public" network. Here, IIUC, DVR router R1 is created on the compute node with the new VM. But it does not need to be there, as this VM is not connected to R1 at all.

@Norman, is my understanding of the issue correct?

norman shen (jshen28) wrote :

Thank you, sir, that's exactly what I wanted to say.
