fg- device is not deleted after the deletion of the last VM on the compute node

Bug #1377156 reported by Stephen Ma
Affects   Status         Importance   Assigned to   Milestone
neutron   Fix Released   Undecided    Stephen Ma
Juno      Fix Released   Undecided    Unassigned

Bug Description

The external gateway port in the fip- namespace on a compute node is not removed after the user deletes the last VM running on the node.

How to reproduce the problem:

SETUP:
     * Use devstack to start up the controller node. In local.conf, set Q_DVR_MODE=dvr_snat.
     * Use devstack to set up a compute node. In local.conf, set Q_DVR_MODE=dvr.

At the start, there are no VMs hosted on the compute node. The fip namespace hasn't been created yet.

1. Create a network and subnet
2. Create a router and add the subnet to the router
3. Tie the router to the external network
4. Boot up a VM using the network, and assign it a floating IP
5. Ping the floating IP (make sure you open up your SG)
6. Note the fg- device in the fip namespace on the compute node
7. Now delete the VM

Expected results:

- The VM is deleted.
- Neutron port-list shows the gateway port is also deleted.
- The FIP namespace is also cleared

Experienced results:

- The fg- device still remains in the fip namespace on the compute node and the fip namespace isn't removed.
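A quick way to check for the leftover namespace after step 7 is to scan the output of "ip netns" for names starting with fip-. The helper below is only a sketch for that check; the function name is made up for illustration, and it works on output text you have already captured on the compute node rather than running any command itself.

```python
def leftover_fip_namespaces(ip_netns_output):
    """Return the fip- namespace names found in `ip netns` output text."""
    names = []
    for line in ip_netns_output.splitlines():
        fields = line.split()
        # Some iproute2 versions append "(id: N)" after the name,
        # so only the first field is treated as the namespace name.
        if fields and fields[0].startswith("fip-"):
            names.append(fields[0])
    return names
```

If this returns a non-empty list once the last VM with a floating IP has been deleted, the namespace was not cleaned up.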

For detailed command sequence, see:

http://paste.openstack.org/show/118174/

Stephen Ma (stephen-ma)
Changed in neutron:
assignee: nobody → Stephen Ma (stephen-ma)
Armando Migliaccio (armando-migliaccio) wrote :

I did look at this very test case in the past, and I recall these related bug reports and fixes:

https://bugs.launchpad.net/neutron/+bug/1351066
https://bugs.launchpad.net/neutron/+bug/1367588

https://review.openstack.org/#/c/120917/
https://review.openstack.org/#/c/111421/

In my experience, namespaces were cleaned up correctly, but for that I needed router_delete_namespaces=True set for the L3 Agent.

Can you confirm that you have that too? That said, I believe that no more fixes should be filed against these issues if we don't have a functional test that demonstrated that the namespaces are indeed cleared. Clearly just unit coverage is not enough.

Changed in neutron:
status: New → Incomplete
description: updated
description: updated
Armando Migliaccio (armando-migliaccio) wrote :

This is my experience on a single host Devstack, pulling Juno RC1

http://paste.openstack.org/show/118175/

I'll see if I can triage on a multi-host deployment.

Stephen Ma (stephen-ma) wrote :

Hi Armando,

Yes the delete namespace option is turned on (router_delete_namespaces=True in l3_agent.ini). The L3 agent did remove the router namespace on the compute node.

The problem is reproduced with the latest neutron code.
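For anyone who wants to confirm the same setting before retrying the reproducer, here is a small sketch that reads the option from the text of l3_agent.ini. It assumes the option lives in the [DEFAULT] section and treats a missing option as off; both are assumptions of this sketch, and the function name is hypothetical.

```python
import configparser


def router_delete_namespaces_enabled(ini_text):
    """Return True if router_delete_namespaces is set to a true value."""
    # interpolation=None and strict=False keep the parser tolerant of
    # '%' characters and repeated keys that may appear in agent configs.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    # Assumption: the option is in [DEFAULT]; unset is treated as False.
    return parser.getboolean("DEFAULT", "router_delete_namespaces",
                             fallback=False)
```

Typical use would be to pass in the contents of the agent's l3_agent.ini, for example open(path).read() with whatever path your deployment uses.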

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/126176

Changed in neutron:
status: Incomplete → In Progress
Armando Migliaccio (armando-migliaccio) wrote :

you mean did *NOT* remove the router namespace?

Stephen Ma (stephen-ma) wrote :

Hi Armando,

The L3 agent DID remove the qrouter namespace. But it didn't remove the fip namespace. The fip agent external gateway port was removed in the database.

Stephen Ma (stephen-ma) wrote :

Please disregard my comment #7. I was trying to reproduce the error and I made the observation on the controller node instead of the compute node.

My comment #6 is correct. I reran the reproducer. These are the namespaces on my compute node after booting up a VM and creating a floating IP on the VM's port:

stack@DVR-CN2:~/DEVSTACK/devstack$ ip netns
fip-ebfe6325-0c4f-4fb1-a702-7681d579291d
qrouter-09babbf7-5760-4c39-be1a-ae66c69357e2
stack@DVR-CN2:~/DEVSTACK/devstack$

I can ssh into the VM.
After deleting the VM (step #7), the namespaces on the compute node are as follows. The L3 agent removed the qrouter namespace:

stack@DVR-CN2:~/DEVSTACK/devstack$ ip netns
fip-ebfe6325-0c4f-4fb1-a702-7681d579291d
stack@DVR-CN2:~/DEVSTACK/devstack$

The public network uuid is ebfe6325-0c4f-4fb1-a702-7681d579291d:
$ neutron net-list
+--------------------------------------+-----------+--------------------------------------------------+
| id                                   | name      | subnets                                          |
+--------------------------------------+-----------+--------------------------------------------------+
| 9b19281f-3a70-45d0-adcf-d145b4c7584a | user-1net | fd7074ec-8fcc-4efb-9a63-e651b7031d88 10.1.2.0/24 |
| ebfe6325-0c4f-4fb1-a702-7681d579291d | public    | 5050f076-241c-40c0-98f1-02e962f286fe             |
+--------------------------------------+-----------+--------------------------------------------------+

The router uuid is 09babbf7-5760-4c39-be1a-ae66c69357e2:
$ neutron router-list
+--------------------------------------+--------------+-----------------------------------------------------------------------------+
| id                                   | name         | external_gateway_info                                                       |
+--------------------------------------+--------------+-----------------------------------------------------------------------------+
| 09babbf7-5760-4c39-be1a-ae66c69357e2 | user-1router | {"network_id": "ebfe6325-0c4f-4fb1-a702-7681d579291d", "enable_snat": true} |
+--------------------------------------+--------------+-----------------------------------------------------------------------------+

The state of the fip-ebfe6325-0c4f-4fb1-a702-7681d579291d after the VM is deleted:
stack@DVR-CN2:~/DEVSTACK/devstack$ sudo ip netns exec fip-ebfe6325-0c4f-4fb1-a702-7681d579291d ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
204: fg-c187c483-78: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:e1:a9:4c brd ff:ff:ff:ff:ff:ff
    inet 10.127.10.227/24 brd 10.127.10.255 scope global fg-c187c483-78
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fee1:a94c/64 scope link
       valid_lft forever preferred_lft forever
stack@DVR-CN2:~/DEVSTACK/devstack$

But running "neutron port-list" as the admin shows no port with mac address of 'fa:16:3e:e1:a9:4c':
$ ./os_admin n...
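
The check described above (an fg- device still present in the namespace while no neutron port has its MAC address) can be scripted roughly as follows. This is only a sketch: the function names are made up, it parses text captured from "ip a" inside the fip namespace, and the list of known port MACs is whatever you collected from "neutron port-list".

```python
import re


def fg_devices(ip_addr_output):
    """Map each fg- device name to its MAC address from `ip a` output."""
    devices = {}
    current = None
    for line in ip_addr_output.splitlines():
        # Interface header lines look like "204: fg-c187c483-78: <...>".
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            name = header.group(1)
            current = name if name.startswith("fg-") else None
            continue
        # The MAC is on the indented "link/ether" line that follows.
        mac = re.match(r"^\s+link/ether\s+([0-9a-f:]{17})", line)
        if mac and current:
            devices[current] = mac.group(1)
    return devices


def orphaned_fg_devices(ip_addr_output, known_port_macs):
    """Return fg- devices whose MAC is not among the known port MACs."""
    known = {m.lower() for m in known_port_macs}
    return {dev: mac
            for dev, mac in fg_devices(ip_addr_output).items()
            if mac not in known}
```

Any device returned by orphaned_fg_devices exists on the node but has no matching port in the database, which is the state this bug leaves behind.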


Armando Migliaccio (armando-migliaccio) wrote :

Stephen there's a 'hide' link right on the right side of the comment, can you hide the ones that are misleading please?

Armando Migliaccio (armando-migliaccio) wrote :

I did reproduce the issue, but I would like to understand why this happens only on L3 agents in DVR mode; the ones in DVR_SNAT mode look fine.

Stephen Ma (stephen-ma) wrote :

I have hidden the Comment #7. I believe that comment is causing confusion.

Stephen Ma (stephen-ma) wrote :

Here is an analysis of why this problem is happening:

The fip agent gw port and the fip namespace are deleted from the method floating_ip_removed_dist() in L3NATAgent. This is called from process_router_floating_ip_addresses, which in turn is called from process_routers.

The number of floating ips on a node's router is tracked using the counter dist_fip_count. floating_ip_removed_dist deletes the fg- device when this count becomes 0. The number of floating ips managed by an agent is tracked by the counter agent_fip_count. Likewise, when agent_fip_count becomes 0, floating_ip_removed_dist deletes the fip- namespace.
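
To make the counter behaviour concrete, here is a toy model of the two counters as described above. It is not the L3 agent code: the class and method names are hypothetical, and the cleanup is recorded as strings instead of touching any device or namespace.

```python
class FipCounters:
    """Toy model of dist_fip_count and agent_fip_count (not neutron code)."""

    def __init__(self):
        self.dist_fip_count = {}   # router id -> floating IPs on this node
        self.agent_fip_count = 0   # floating IPs managed by this agent
        self.actions = []          # cleanup steps that would be taken

    def floating_ip_added(self, router_id):
        self.dist_fip_count[router_id] = (
            self.dist_fip_count.get(router_id, 0) + 1)
        self.agent_fip_count += 1

    def floating_ip_removed(self, router_id):
        self.dist_fip_count[router_id] -= 1
        self.agent_fip_count -= 1
        # Cleanup only happens when a counter reaches zero.
        if self.dist_fip_count[router_id] == 0:
            self.actions.append("delete fg- device")
        if self.agent_fip_count == 0:
            self.actions.append("delete fip- namespace")
```

The point of the model is that cleanup depends entirely on the removal path being called for the last floating IP. If that call never happens, both counters stay at 1 and nothing is deleted.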

I instrumented the floating_ip_removed_dist function to track the dist_fip_count and agent_fip_count.

At the start, I have 2 VMs running on the compute node (each VM has a floating ip, so dist_fip_count=2 and agent_fip_count=2). After deleting the first VM, I see a debug message reporting that dist_fip_count and agent_fip_count have become 1. Then I deleted the second VM. I would expect to see a debug message saying dist_fip_count=0 and agent_fip_count=0. This message isn't found in the log. This means that when the last VM is deleted, the process_router function never calls process_router_floating_ip_addresses.

This is what happened when the last VM is deleted:

When the VM is deleted, delete_port in the ml2 plugin is called. delete_port, in turn, does the following:

1. Calls dvr_deletens_if_no_port. The router namespace on the compute node is going to be removed. The return value is a router, so a router_removed notification will be sent out later.
2. Calls disassociate_floatingips. This in turn calls delete_floatingip_agent_gateway_port(), which removes the fip agent gw port from the database.
3. Sends a notification to delete the VM's ARP entry.
4. Sends out a routers_updated notification.
5. Because (1) returned a router, sends a router_removed_from_agent notification.

When the L3-agent processes the routers_updated notification, the fip agent gateway port has already been removed. So ri.router['gw_port'] is None but ri.ex_gw_port is still the old value. Under this condition, process_router_floating_ip_addresses is not called. So the fg- device and the fip namespace are not deleted.
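
The condition above can be reduced to a small model. This is a sketch for illustration only and not the code from the proposed patch: the two function names are hypothetical, the first argument stands for ri.router['gw_port'] from the notification and the second for the cached ri.ex_gw_port.

```python
def floating_ips_processed_buggy(router_gw_port, cached_ex_gw_port):
    """Model of the observed behaviour: floating IP handling is skipped
    as soon as the updated router no longer reports a gateway port."""
    return router_gw_port is not None


def floating_ips_processed_fixed(router_gw_port, cached_ex_gw_port):
    """One possible correction (an assumption, not the merged change):
    also handle floating IPs when a previously seen gateway port has
    just disappeared, so the removal path still runs."""
    return router_gw_port is not None or cached_ex_gw_port is not None
```

With router_gw_port=None and a stale cached port, the first function returns False, which matches the case where the fg- device and the fip namespace are left behind.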

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/126176
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e3b949c3bc08808e3df15215bc30d6610f3a4bd3
Submitter: Jenkins
Branch: master

commit e3b949c3bc08808e3df15215bc30d6610f3a4bd3
Author: Stephen Ma <email address hidden>
Date: Sun Oct 5 04:59:40 2014 +0000

    Delete FIP namespace when last VM is deleted

    On a compute node when the last VM with a floating IP association
    is deleted, the L3 agent did not delete the fip namespace. However
    the api server has already deleted the fip agent external gateway
    port from the database.

    This problem is happening on DVRs because the deletion of a VM port,
    in addition to a floating IP disassociation, may also result in the
    removal of the external gateway port binding AND the removal of the
    fip agent external gateway port.

    When the L3 agent is handling a routers_updated notification, it is
    not processing floating ip address updates when the router has both
    a floating ip disassociated and a external gateway port deleted.
    This patch corrects this problem.

    Closes-bug: #1377156
    Change-Id: I86bdef7c9d988cb9d87c88adde55548d459f29a5

Changed in neutron:
status: In Progress → Fix Committed
Stephen Ma (stephen-ma)
tags: added: juno-backport-potential
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → kilo-1
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/142971

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/juno)

Reviewed: https://review.openstack.org/142971
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b776d1923fa387f15d3248fcfc49a69c0603a17d
Submitter: Jenkins
Branch: stable/juno

commit b776d1923fa387f15d3248fcfc49a69c0603a17d
Author: Stephen Ma <email address hidden>
Date: Sun Oct 5 04:59:40 2014 +0000

    Delete FIP namespace when last VM is deleted

    On a compute node when the last VM with a floating IP association
    is deleted, the L3 agent did not delete the fip namespace. However
    the api server has already deleted the fip agent external gateway
    port from the database.

    This problem is happening on DVRs because the deletion of a VM port,
    in addition to a floating IP disassociation, may also result in the
    removal of the external gateway port binding AND the removal of the
    fip agent external gateway port.

    When the L3 agent is handling a routers_updated notification, it is
    not processing floating ip address updates when the router has both
    a floating ip disassociated and a external gateway port deleted.
    This patch corrects this problem.

    cherry-picked from e3b949c3bc08808e3df15215bc30d6610f3a4bd3
    Closes-bug: #1377156
    Change-Id: I86bdef7c9d988cb9d87c88adde55548d459f29a5

tags: added: in-stable-juno
Itzik Brown (itzikb1) wrote :

Followed the instructions but the FIP namespace isn't deleted.

Version
======
python-neutron-2014.2.2-1.el7ost.noarch
openstack-neutron-2014.2.2-1.el7ost.noarch
openstack-neutron-openvswitch-2014.2.2-1.el7ost.noarch
python-neutronclient-2.3.9-1.el7ost.noarch

Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-1 → 2015.1.0