DVR routers attached to shared networks aren't being unscheduled from a compute node after deleting the VMs using the shared net

Bug #1424096 reported by Stephen Ma on 2015-02-20
This bug affects 1 person
Affects    Status    Importance    Assigned to      Milestone
neutron              Undecided     Oleg Bondarev
    Nominated for Mitaka by Oleg Bondarev
    Nominated for Trunk by Oleg Bondarev
Juno                 Undecided     Unassigned
Kilo                 Undecided     Unassigned

Bug Description

The administrator creates a shared network, then creates a DVR router and attaches it to that network.

A non-admin tenant then creates a VM whose port is on the shared network. This VM, the only one using the shared network, is scheduled to a compute node. When the VM is deleted, the qrouter namespace of the DVR router is expected to be removed from that compute node, but it is not. This does not happen with routers attached to networks that are not shared.

The environment consists of 1 controller node and 1 compute node.

The affected routers are created by the administrator and attached to shared networks that the administrator also owns.

As the administrator, run the following commands on a setup with 1 controller node and 1 compute node:

1. neutron net-create shared-net -- --shared True
   Shared net's uuid is f9ccf1f9-aea9-4f72-accc-8a03170fa242.

2. neutron subnet-create --name shared-subnet shared-net 10.0.0.0/16

3. neutron router-create shared-router
    Router's UUID is ab78428a-9653-4a7b-98ec-22e1f956f44f.

4. neutron router-interface-add shared-router shared-subnet
5. neutron router-gateway-set shared-router public

As a non-admin tenant (tenant-id: 95cd5d9c61cf45c7bdd4e9ee52659d13), boot a VM using the shared-net network:

1. neutron net-show shared-net
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| admin_state_up  | True                                 |
| id              | f9ccf1f9-aea9-4f72-accc-8a03170fa242 |
| name            | shared-net                           |
| router:external | False                                |
| shared          | True                                 |
| status          | ACTIVE                               |
| subnets         | c4fd4279-81a7-40d6-a80b-01e8238c1c2d |
| tenant_id       | 2a54d6758fab47f4a2508b06284b5104     |
+-----------------+--------------------------------------+

At this point, there are no VMs using the shared-net network running in the environment.

2. Boot a VM that uses the shared-net network: nova boot ... --nic net-id=f9ccf1f9-aea9-4f72-accc-8a03170fa242 ... vm_sharednet
3. Assign a floating IP to the VM "vm_sharednet".
4. Delete "vm_sharednet". On the compute node, the qrouter namespace of the shared router (qrouter-ab78428a-9653-4a7b-98ec-22e1f956f44f) is left behind:

stack@DVR-CN2:~/DEVSTACK/manage$ ip netns
qrouter-ab78428a-9653-4a7b-98ec-22e1f956f44f
 ...

This is consistent with the output of the "neutron l3-agent-list-hosting-router" command, which shows that the router is still hosted on the compute node.

$ neutron l3-agent-list-hosting-router ab78428a-9653-4a7b-98ec-22e1f956f44f
+--------------------------------------+----------------+----------------+-------+
| id                                   | host           | admin_state_up | alive |
+--------------------------------------+----------------+----------------+-------+
| 42f12eb0-51bc-4861-928a-48de51ba7ae1 | DVR-Controller | True           | :-)   |
| ff869dc5-d39c-464d-86f3-112b55ec1c08 | DVR-CN2        | True           | :-)   |
+--------------------------------------+----------------+----------------+-------+

Running the "neutron l3-agent-router-remove" command removes the qrouter namespace from the compute node:

$ neutron l3-agent-router-remove ff869dc5-d39c-464d-86f3-112b55ec1c08 ab78428a-9653-4a7b-98ec-22e1f956f44f
Removed router ab78428a-9653-4a7b-98ec-22e1f956f44f from L3 agent

stack@DVR-CN2:~/DEVSTACK/manage$ ip netns
stack@DVR-CN2:~/DEVSTACK/manage$

This is a workaround to get the qrouter namespace deleted from the compute node. The L3-agent scheduler should have removed the router from the compute node when the VM was deleted.
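To check for this symptom on a compute node, the "ip netns" output can be compared against the routers that are expected there. The following is only a minimal sketch: the helper names `qrouter_ids` and `stale_routers` are hypothetical, and the expected router list has to come from somewhere else (it is passed in by hand here).

```python
def qrouter_ids(ip_netns_output):
    """Extract router UUIDs from the text printed by `ip netns`."""
    ids = []
    for line in ip_netns_output.splitlines():
        fields = line.split()
        # Newer iproute2 versions may append "(id: N)" after the name,
        # so only the first whitespace-separated field is used.
        if fields and fields[0].startswith("qrouter-"):
            ids.append(fields[0][len("qrouter-"):])
    return ids


def stale_routers(ip_netns_output, expected_router_ids):
    """Return routers that have a namespace but are not expected on the host."""
    expected = set(expected_router_ids)
    return [rid for rid in qrouter_ids(ip_netns_output) if rid not in expected]


# Output captured on DVR-CN2 after the last VM was deleted.
sample = "qrouter-ab78428a-9653-4a7b-98ec-22e1f956f44f\n"
leftover = stale_routers(sample, expected_router_ids=[])
print(leftover)
```

With no VMs left on the host, any router reported here is a candidate for the "neutron l3-agent-router-remove" workaround shown above.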

Stephen Ma (stephen-ma) on 2015-02-22
Changed in neutron:
assignee: nobody → Stephen Ma (stephen-ma)

Fix proposed to branch: master
Review: https://review.openstack.org/159296

Changed in neutron:
status: New → In Progress
tags: added: l3-dvr-backlog
Stephen Ma (stephen-ma) wrote :

Explanation of why this problem is happening.

In this case, the VM is created by a non-admin tenant and uses a shared network created by the admin tenant. The subnet's router interface belongs to an admin-created router, so the qr port is also owned by the admin. When a tenant creates a VM using the shared network, the tenant owns the VM's port. As a result, the VM's port and the qr port do not have the same tenant id.

If the VM is created by the admin, the qrouter namespace on the compute node is removed when the VM is removed. However, when the VM is created by a non-admin user, the qrouter namespace stays on the compute node. This shows that the neutron API server handles the VM port deletion in the context of the VM's owner, not the admin.

The decision to delete a namespace is made in dvr_deletens_if_no_port, which makes 5 database queries to reach it. The first query is in get_dvr_routers_by_portid, which retrieves the ids of the routers affected by the VM port removal. To do this, it has to find the ports on the subnet whose device owner is 'network:router_interface_distributed'. In this case, the router-interface port is owned by the admin. Because the context is that of the VM owner only, no such port is found and the router list is empty, so no router is removed from any node. This is why the admin context is needed for the query to return the true situation.
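The effect of the tenant-scoped lookup can be illustrated with a small, self-contained model. This is only a sketch: `Port`, `Context`, `visible_ports` and `routers_for_subnet` are hypothetical stand-ins, not Neutron's actual classes or queries, and the "compute:None" device owner is a placeholder. The IDs are the ones from the report.

```python
from dataclasses import dataclass

ROUTER_INTF = "network:router_interface_distributed"


@dataclass(frozen=True)
class Port:
    id: str
    tenant_id: str
    subnet_id: str
    device_owner: str
    device_id: str  # router ID for router ports, VM name for VM ports


@dataclass(frozen=True)
class Context:
    tenant_id: str
    is_admin: bool = False


def visible_ports(context, ports):
    """Toy tenant scoping: a non-admin context sees only its own ports."""
    if context.is_admin:
        return list(ports)
    return [p for p in ports if p.tenant_id == context.tenant_id]


def routers_for_subnet(context, ports, subnet_id):
    """IDs of routers with a distributed interface on the given subnet."""
    return sorted({
        p.device_id
        for p in visible_ports(context, ports)
        if p.subnet_id == subnet_id and p.device_owner == ROUTER_INTF
    })


ADMIN = "2a54d6758fab47f4a2508b06284b5104"
TENANT = "95cd5d9c61cf45c7bdd4e9ee52659d13"
SUBNET = "c4fd4279-81a7-40d6-a80b-01e8238c1c2d"
ROUTER = "ab78428a-9653-4a7b-98ec-22e1f956f44f"

ports = [
    Port("qr-port", ADMIN, SUBNET, ROUTER_INTF, ROUTER),
    Port("vm-port", TENANT, SUBNET, "compute:None", "vm_sharednet"),
]

as_tenant = routers_for_subnet(Context(TENANT), ports, SUBNET)
as_admin = routers_for_subnet(Context(ADMIN, is_admin=True), ports, SUBNET)
print(as_tenant)  # [] because the admin-owned qr port is filtered out
print(as_admin)   # contains the shared router's ID
```

In this model the tenant-scoped lookup yields no routers at all, which corresponds to the namespace never being considered for removal.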

The admin context is also needed for the other queries made by dvr_deletens_if_no_port. To determine whether a namespace on a compute node can be deleted, it has to find out whether other ports on that compute node still use the same network and subnet. Because the network is shared, other tenants may also have VMs using it on the same compute node. Without the admin context, the queries return only the ports owned by the tenant deleting the VM. Since that tenant has already deleted its port, no remaining ports are found and the namespace is considered removable, even if other tenants' VMs still depend on it. For this reason, the following test fails if the other database queries in dvr_deletens_if_no_port do not use the admin context as well:

On a cloud setup with only 1 compute node, where dvr_deletens_if_no_port calls get_dvr_routers_by_portid with the admin context but makes the other queries without it:

  0. Create the shared network, subnet, and router as described in the description.

  1. As tenant 1, create a VM using the shared network. When the VM has booted, assign a floating IP to it.
  2. As tenant 2, repeat (1).
  3. As tenant 2, ping tenant 2's VM using its floating IP. The ping should work. Keep the ping running.
  4. As tenant 1, delete the VM.
  5. The pings to tenant 2's VM now fail.

The pings fail after step 4 because the router namespace on the compute node was deleted as a result of deleting tenant 1's VM, for the reason described above.
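The failing scenario can be reproduced in a toy model of the "are there any ports left on this host" check. This is a sketch under assumptions: `VmPort`, `ports_seen` and `namespace_can_be_removed` are hypothetical names, and the tenant labels are placeholders rather than real tenant ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmPort:
    tenant_id: str
    host: str


def ports_seen(ports, tenant_id, is_admin):
    """Toy tenant scoping: a non-admin caller sees only its own ports."""
    if is_admin:
        return list(ports)
    return [p for p in ports if p.tenant_id == tenant_id]


def namespace_can_be_removed(ports, host, tenant_id, is_admin):
    """True when no VM port visible to the caller remains on the host."""
    return not any(p.host == host
                   for p in ports_seen(ports, tenant_id, is_admin))


# State after step 4: tenant 1's port is gone, tenant 2's VM still runs.
remaining = [VmPort("tenant-2", "DVR-CN2")]

wrong = namespace_can_be_removed(remaining, "DVR-CN2", "tenant-1",
                                 is_admin=False)
right = namespace_can_be_removed(remaining, "DVR-CN2", "tenant-1",
                                 is_admin=True)
print(wrong, right)  # True False
```

Evaluated as tenant 1, the check wrongly concludes that the namespace is unused; evaluated with admin visibility, tenant 2's port keeps it in place.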

Reviewed: https://review.openstack.org/159296
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=edbade486102a219810137d1c6b916e87475d477
Submitter: Jenkins
Branch: master

commit edbade486102a219810137d1c6b916e87475d477
Author: Stephen Ma <email address hidden>
Date: Tue Feb 24 23:31:33 2015 +0000

    Router is not unscheduled when the last port is deleted

    When checking for ports that are still in use on a DVR router,
    the L3 agent scheduler makes the assumption that a port's
    network must be owned by the same tenant. This isn't always
    true as the admin could have created a shared network that
    other tenants may use. The result of this assumption is that
    the router associated with the shared network may not be
    unscheduled from a VM host when the last VM (created by a
    non-admin tenant) using the shared network is deleted from
    the compute node.

    The owner of a VM may not own all the ports of a shared
    network. Other tenants may have VMs using the same shared
    network running on the same compute node. Also the VM owner
    may not own the router ports. In order to check whether a
    router can be unscheduled from a node has to be run with
    admin context so all the ports associated with router are
    returned from database queries.

    This patch fixes this problem by using the admin context to
    make the queries needed for the DVR scheduler to make the
    correct unschedule decision.

    Change-Id: I45477713d7ce16f2451fa6fbe04c610388b06867
    Closes-bug: #1424096

Changed in neutron:
status: In Progress → Fix Committed
Stephen Ma (stephen-ma) on 2015-04-20
tags: added: juno-backport-potential

Reviewed: https://review.openstack.org/176612
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=036133f932dd86f8d34d1e99c8ad2d25db44bd07
Submitter: Jenkins
Branch: stable/juno

commit 036133f932dd86f8d34d1e99c8ad2d25db44bd07
Author: Stephen Ma <email address hidden>
Date: Tue Feb 24 23:31:33 2015 +0000

    Router is not unscheduled when the last port is deleted

    When checking for ports that are still in use on a DVR router,
    the L3 agent scheduler makes the assumption that a port's
    network must be owned by the same tenant. This isn't always
    true as the admin could have created a shared network that
    other tenants may use. The result of this assumption is that
    the router associated with the shared network may not be
    unscheduled from a VM host when the last VM (created by a
    non-admin tenant) using the shared network is deleted from
    the compute node.

    The owner of a VM may not own all the ports of a shared
    network. Other tenants may have VMs using the same shared
    network running on the same compute node. Also the VM owner
    may not own the router ports. In order to check whether a
    router can be unscheduled from a node has to be run with
    admin context so all the ports associated with router are
    returned from database queries.

    This patch fixes this problem by using the admin context to
    make the queries needed for the DVR scheduler to make the
    correct unschedule decision.

    (cherry picked from commit edbade486102a219810137d1c6b916e87475d477)

    Conflicts:
        neutron/db/l3_dvrscheduler_db.py

    Closes-bug: #1424096
    Change-Id: I45477713d7ce16f2451fa6fbe04c610388b06867

tags: added: in-stable-juno

Reviewed: https://review.openstack.org/177825
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1813da49aded224e273e0a33a90dca902fa05b75
Submitter: Jenkins
Branch: stable/kilo

commit 1813da49aded224e273e0a33a90dca902fa05b75
Author: Stephen Ma <email address hidden>
Date: Tue Feb 24 23:31:33 2015 +0000

    Router is not unscheduled when the last port is deleted

    When checking for ports that are still in use on a DVR router,
    the L3 agent scheduler makes the assumption that a port's
    network must be owned by the same tenant. This isn't always
    true as the admin could have created a shared network that
    other tenants may use. The result of this assumption is that
    the router associated with the shared network may not be
    unscheduled from a VM host when the last VM (created by a
    non-admin tenant) using the shared network is deleted from
    the compute node.

    The owner of a VM may not own all the ports of a shared
    network. Other tenants may have VMs using the same shared
    network running on the same compute node. Also the VM owner
    may not own the router ports. In order to check whether a
    router can be unscheduled from a node has to be run with
    admin context so all the ports associated with router are
    returned from database queries.

    This patch fixes this problem by using the admin context to
    make the queries needed for the DVR scheduler to make the
    correct unschedule decision.

    Change-Id: I45477713d7ce16f2451fa6fbe04c610388b06867
    Closes-bug: #1424096
    (cherry picked from commit edbade486102a219810137d1c6b916e87475d477)

tags: added: in-stable-kilo
Thierry Carrez (ttx) on 2015-06-24
Changed in neutron:
milestone: none → liberty-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-10-15
Changed in neutron:
milestone: liberty-1 → 7.0.0
Oleg Bondarev (obondarev) wrote :

I hit this bug while reworking unit tests into functional tests. When performing the steps from the description I get:
 2015-12-15 17:41:23,484 ERROR [neutron.callbacks.manager] Error during notification for neutron.db.l3_dvrscheduler_db._notify_port_delete port, after_delete
    Traceback (most recent call last):
      File "neutron/callbacks/manager.py", line 141, in _notify_loop
        callback(resource, event, trigger, **kwargs)
      File "neutron/db/l3_dvrscheduler_db.py", line 485, in _notify_port_delete
        context, router['agent_id'], router['router_id'])
      File "neutron/db/l3_dvrscheduler_db.py", line 439, in remove_router_from_l3_agent
        router = self.get_router(context, router_id)
      File "neutron/db/l3_db.py", line 451, in get_router
        router = self._get_router(context, id)
      File "neutron/db/l3_db.py", line 137, in _get_router
        raise l3.RouterNotFound(router_id=router_id)
    RouterNotFound: Router 7d52836b-8fe5-4417-842f-3cbe0920c89c could not be found

and the router is not removed from the host, which has no more DVR serviceable ports.

It looks like we also need the admin context in order to remove an admin-owned router from a host when a non-admin tenant removes the last DVR serviceable port on a shared network.
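The traceback above can be mirrored in a small model where the router lookup is tenant-scoped. This is only an illustrative sketch: `Context`, `get_router` and `remove_router_from_host` here are hypothetical stand-ins written for this example, not the Neutron functions of similar names, and the tenant labels are placeholders.

```python
from dataclasses import dataclass, replace


class RouterNotFound(Exception):
    pass


@dataclass(frozen=True)
class Context:
    tenant_id: str
    is_admin: bool = False

    def elevated(self):
        """Return a copy of this context with admin rights."""
        return replace(self, is_admin=True)


RID = "7d52836b-8fe5-4417-842f-3cbe0920c89c"
ROUTERS = {RID: "admin-tenant"}  # router ID -> owning tenant


def get_router(context, router_id):
    """Tenant-scoped lookup: a router owned by someone else looks missing."""
    owner = ROUTERS.get(router_id)
    if owner is None or (not context.is_admin and owner != context.tenant_id):
        raise RouterNotFound(router_id)
    return {"id": router_id, "tenant_id": owner}


def remove_router_from_host(context, router_id, hosted):
    """Unschedule the router from a host after verifying it exists."""
    get_router(context, router_id)
    hosted.discard(router_id)


tenant_ctx = Context("non-admin-tenant")
hosted = {RID}

try:
    remove_router_from_host(tenant_ctx, RID, hosted)
    failed_as_tenant = False
except RouterNotFound:
    failed_as_tenant = True

still_hosted_after_failure = RID in hosted
remove_router_from_host(tenant_ctx.elevated(), RID, hosted)
print(failed_as_tenant, still_hosted_after_failure, hosted)
```

The point of the sketch is that deciding which routers to remove and actually removing them are separate steps, and each one needs visibility of the admin-owned router.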

Changed in neutron:
status: Fix Released → Confirmed

Fix proposed to branch: master
Review: https://review.openstack.org/257938

Changed in neutron:
assignee: Stephen Ma (stephen-ma) → Oleg Bondarev (obondarev)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/257938
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=96ba199d733944e5b8aa3664a04d9204fd66c878
Submitter: Jenkins
Branch: master

commit 96ba199d733944e5b8aa3664a04d9204fd66c878
Author: Oleg Bondarev <email address hidden>
Date: Tue Dec 15 17:58:51 2015 +0300

    Use admin context when removing DVR router on vm port deletion

    In case non-admin tenant removes last VM on a shared network (owned
    by admin) connected to a DVR router (also owned by admin) we need
    to remove the router from the host where there are no more dvr
    serviceable ports. Commit edbade486102a219810137d1c6b916e87475d477
    fixed logic that determines routers that should be removed from host.
    However in order to actually remove the router we also need admin
    context.

    This was not caught by unit tests and one reason for that is so called
    'mock everything' approach which is evil and generally useless.
    This patch replaces unit tests with functional tests that we able
    to catch the bug.

    Closes-Bug: #1424096
    Change-Id: Ia6cdf2294562c2a2727350c78eeab155097e0c33

Changed in neutron:
status: In Progress → Fix Released

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

tags: removed: juno-backport-potential

Reviewed: https://review.openstack.org/296851
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=69a384a9af4f0fe3112d98b2eb766a8417359e1c
Submitter: Jenkins
Branch: stable/liberty

commit 69a384a9af4f0fe3112d98b2eb766a8417359e1c
Author: Oleg Bondarev <email address hidden>
Date: Tue Dec 15 17:58:51 2015 +0300

    Use admin context when removing DVR router on vm port deletion

    In case non-admin tenant removes last VM on a shared network (owned
    by admin) connected to a DVR router (also owned by admin) we need
    to remove the router from the host where there are no more dvr
    serviceable ports. Commit edbade486102a219810137d1c6b916e87475d477
    fixed logic that determines routers that should be removed from host.
    However in order to actually remove the router we also need admin
    context.

    This was not caught by unit tests and one reason for that is so called
    'mock everything' approach which is evil and generally useless.
    This patch replaces unit tests with functional tests that we able
    to catch the bug.

    Closes-Bug: #1424096
    Change-Id: Ia6cdf2294562c2a2727350c78eeab155097e0c33
    (cherry picked from commit 96ba199d733944e5b8aa3664a04d9204fd66c878)

tags: added: in-stable-liberty

This issue was fixed in the openstack/neutron 7.1.0 release.
