dhcp net namespace is not deleted well

Bug #1052535 reported by yong sheng gong on 2012-09-18
This bug affects 45 people
Affects Status Importance Assigned to Milestone
neutron
Low
Edgar Magana
Havana
Undecided
Edgar Magana

Bug Description

reproduce steps:
1. create a network
2. create a subnet
3. delete the network
4. check the dhcp-xxx namespace; the DHCP namespace for this network has not been deleted.
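
Step 4 can be checked mechanically. What follows is only a minimal sketch in Python 3, assuming the namespace names are taken from the output of `ip netns list` and that a namespace belonging to a network ends in that network's ID (as the qdhcp-... names in the logs further down suggest). The helper name `leftover_namespaces` is hypothetical and not part of quantum.

```python
def leftover_namespaces(ip_netns_list_output, network_id):
    """Return namespace names from `ip netns list` output that end in network_id."""
    leftovers = []
    for line in ip_netns_list_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Some iproute2 versions append extra fields such as "(id: 0)";
        # the namespace name is always the first field.
        name = fields[0]
        if name.endswith("-" + network_id):
            leftovers.append(name)
    return leftovers


# Sample text standing in for real `ip netns list` output.
sample = (
    "qdhcp-a793c185-16f9-483c-b250-09fcac07eb2d\n"
    "qrouter-96ddb034-d577-43e7-89b9-9c39e713c8c9 (id: 0)\n"
)
print(leftover_namespaces(sample, "a793c185-16f9-483c-b250-09fcac07eb2d"))
```

An empty list after deleting the network would mean the namespace was cleaned up.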

yong sheng gong (gongysh) wrote :

I am not sure whether it should be targeted to RC2, given its importance.

Changed in quantum:
importance: Undecided → Low
dan wendlandt (danwent) wrote :

Definitely something we should clean up, but probably not a blocker.

Gary Kotton (garyk) wrote :

Is this not a classic case where one should run the netns_cleanup utility? From my understanding it is meant to be running in the background.
This is maybe something that is worth bringing up with the people who are working on packaging.
Thanks
Gary

Gary Kotton (garyk) wrote :

That is, if sudo python bin/quantum-netns-cleanup --config-file /etc/quantum/quantum.conf --config-file /etc/quantum/dhcp_agent.ini is run in the background then the namespace is removed.

Changed in quantum:
status: New → Confirmed
dan wendlandt (danwent) wrote :

Adding Mark to comment. I hadn't realized that the clean-up tool was intended to be run in the background. If so, then we should definitely clarify this to packagers.

Janis Gengeris (janisg) wrote :

The cleanup tool is not working very well. It fails with this message when trying to remove a namespace:

Command: ['sudo', '/usr/bin/quantum-rootwrap', '/etc/quantum/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-a3e53be2-7e6c-434b-b2f5-84678dd42aa0', 'ip', '-o', 'link', 'list']
Exit code: 1
Stdout: ''
Stderr: 'seting the network namespace failed: Invalid argument\n'

The original output of the command is the following:

# ip netns exec qrouter-3442d231-2e00-4d26-823e-1feb5d02a798 ip -o link list
54: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN \ link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Trying to delete namespace manually also fails:
# ip netns delete qrouter-3442d231-2e00-4d26-823e-1feb5d02a798
Cannot remove /var/run/netns/qrouter-3442d231-2e00-4d26-823e-1feb5d02a798: Device or resource busy
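
The two failure messages quoted above can be told apart mechanically, for example in a wrapper around the cleanup tool. This is a rough sketch in Python 3, not code from quantum; the function name and the labels "busy", "unusable" and "unknown" are hypothetical, and the matched substrings are copied from the output in this report (including the "seting" spelling).

```python
def classify_netns_error(stderr):
    """Map stderr text from an `ip netns` command to a coarse failure label."""
    if "Device or resource busy" in stderr:
        # Seen here when `ip netns delete` cannot remove the /var/run/netns entry.
        return "busy"
    if "seting the network namespace failed" in stderr:
        # Seen here when `ip netns exec` cannot enter the namespace.
        return "unusable"
    return "unknown"


print(classify_netns_error("seting the network namespace failed: Invalid argument\n"))
```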

Janis Gengeris (janisg) wrote :

With --force it works better, but the results are still unpredictable.

Mark McClain (markmcclain) wrote :

The router namespace is a bit trickier to destroy. We have to figure out which resource is still active when ip netns delete runs. Once ip netns delete fails, the netns db is left in a bad state. We need to figure out a reproducible way to generate this bug manually via ip netns, so that we can submit a bug against iproute (its db is the one that is causing the problems).

tags: added: l3-ipam-dhcp
Phani Achanta (phani-achanta) wrote :

The netns db seems to require that all resources allocated in the namespace be cleared before deleting the namespace.

Currently, the code in quantum has a problem: when a network delete is issued, it does not ensure a recursive release of subnet resources; it just cleans the subnet entries in the quantum db.

I have attached a patch which solves the cleanup issue with the dhcp namespace.
The following works with the patch.
1. create a network
2. create a subnet
3. delete the network
4. quantum-netns-cleanup
5. check the dhcp-xxx namespace is deleted

I expect that the same patch will also work with a router namespace.

Eugene Nikanorov (enikanorov) wrote :

This link provides further explanation of the issue: https://bugzilla.redhat.com/show_bug.cgi?id=872689

Once you've unsuccessfully tried to delete a namespace with ip netns delete,
it clears read permission from the corresponding entry in /var/run/netns/, causing "seting the network namespace failed: Invalid argument" when the namespace is used in further operations.
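
If the symptom is as described, a namespace left in this state should be recognizable from the mode bits of its entry under /var/run/netns/. The following Python 3 sketch checks only that; it assumes the cleared read permission is visible through `os.stat`, which this report does not verify, and the function name `read_bits_cleared` is hypothetical.

```python
import os
import stat


def read_bits_cleared(path):
    """Return True if no read permission bit (owner, group, other) is set on path."""
    mode = os.stat(path).st_mode
    read_bits = stat.S_IRUSR | stat.S_IRGRP | stat.S_IROTH
    return (mode & read_bits) == 0
```

For example, `read_bits_cleared("/var/run/netns/qrouter-3442d231-2e00-4d26-823e-1feb5d02a798")` could be called before attempting `ip netns exec` on that namespace.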

Phani Achanta (phani-achanta) wrote :

The patch I provided does not address the base IP netns.
It addresses 2 issues:
1. OpenStack cleaning up the resources it has created: it ensures cleanup of subnets upon deletion of networks.
2. Subnet deletion notifications are sent out to agents such as the dhcp agent, which makes them clean up their resources (for the dhcp agent, its dnsmasq) more promptly instead of waiting for a manual cleanup.

Fix proposed to branch: master
Review: https://review.openstack.org/23828

Changed in quantum:
assignee: nobody → Phani Achanta (phani-achanta)
status: Confirmed → In Progress
dan wendlandt (danwent) on 2013-03-07
Changed in quantum:
milestone: none → grizzly-rc1
Changed in quantum:
assignee: Phani Achanta (phani-achanta) → nobody
status: In Progress → Confirmed

As the reporter affirmed not being able to repro anymore, I am setting this to 'incomplete' for the time being, aiming at setting it to invalid if nothing comes up until RC-2 is released.

Changed in quantum:
status: Confirmed → Incomplete
milestone: grizzly-rc1 → none

Hi there,
I'm totally able to reproduce this behaviour.
I'd like this bug to be reopened.

By the way, this bug affects not only the qdhcp but also the qrouter.

Mark McClain (markmcclain) wrote :

How are you reproducing this bug?

Kevin Bringard (kbringard) wrote :

I'm seeing something similar. Running: 1:2013.1+git201305151531~precise-0ubuntu1 from the Ubuntu grizzly testing repo:

I created a network, subnet and router. Linked them all up and then set the router's upstream gateway to be my external network. I then remove it all (clear the gateway, remove the interface, delete the router, subnet and network) and attempt to run quantum-netns-cleanup. The first time I run it I get the following:

quantum-netns-cleanup
2013-05-17 20:17:23 ERROR [quantum.agent.netns_cleanup_util] Error unable to destroy namespace: qrouter-96ddb034-d577-43e7-89b9-9c39e713c8c9
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/netns_cleanup_util.py", line 141, in destroy_namespace
    ip.garbage_collect_namespace()
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 123, in garbage_collect_namespace
    self.netns.delete(self.namespace)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 402, in delete
    self._as_root('delete', name, use_root_namespace=True)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 167, in _as_root
    kwargs.get('use_root_namespace', False))
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 47, in _as_root
    namespace)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 58, in _execute
    root_helper=root_helper)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
    raise RuntimeError(m)
RuntimeError:
Command: ['sudo', 'ip', 'netns', 'delete', 'qrouter-96ddb034-d577-43e7-89b9-9c39e713c8c9']
Exit code: 1
Stdout: ''
Stderr: 'Cannot remove /var/run/netns/qrouter-96ddb034-d577-43e7-89b9-9c39e713c8c9: Device or resource busy\n'
2013-05-17 20:17:23 ERROR [quantum.agent.netns_cleanup_util] Error unable to destroy namespace: qdhcp-a793c185-16f9-483c-b250-09fcac07eb2d
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/netns_cleanup_util.py", line 141, in destroy_namespace
    ip.garbage_collect_namespace()
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 123, in garbage_collect_namespace
    self.netns.delete(self.namespace)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 402, in delete
    self._as_root('delete', name, use_root_namespace=True)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 167, in _as_root
    kwargs.get('use_root_namespace', False))
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 47, in _as_root
    namespace)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 58, in _execute
    root_helper=root_helper)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
    raise RuntimeError(m)
RuntimeError:
Command: ['sudo', 'ip', 'netns', 'delete', 'qdhcp-a793c185-16f9-483c-b250-09fcac07eb2d']
Exit code: 1
Stdout: ''
Stderr: 'Cannot remove /var/run/netns/qdhcp-a793c185-16f9-483c-b250-09fcac07eb2d: Device or resource busy\n'

I then attempted to run a c...


spcla1 (spcla1) wrote :

I am experiencing the same problem with the router namespace after deleting the network, router and gateway. The l3-agent.log file has the following error message:

2013-05-20 17:05:50 ERROR [quantum.agent.l3_agent] Failed deleting namespace 'qrouter-c04d91b9-ae63-4699-a889-54772fd4d771'
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 189, in _destroy_router_namespaces
    self._destroy_router_namespace(ns)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 195, in _destroy_router_namespace
    for d in ns_ip.get_devices(exclude_loopback=True):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 73, in get_devices
    self.root_helper, self.namespace)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 58, in _execute
    root_helper=root_helper)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
    raise RuntimeError(m)
RuntimeError:
Command: ['sudo', 'quantum-rootwrap', '/etc/quantum/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-c04d91b9-ae63-4699-a889-54772fd4d771', 'ip', '-o', 'link', 'list']
Exit code: 1
Stdout: ''
Stderr: 'seting the network namespace failed: Invalid argument\n'
2013-05-20 17:05:54 ERROR [quantum.agent.l3_agent] Failed synchronizing routers
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 639, in _sync_routers_task
    self._process_routers(routers, all_routers=True)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 619, in _process_routers
    self._router_added(r['id'], r)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 232, in _router_added
    self._create_router_namespace(ri)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/l3_agent.py", line 210, in _create_router_namespace
    ip_wrapper.netns.execute(['sysctl', '-w', 'net.ipv4.ip_forward=1'])
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/ip_lib.py", line 407, in execute
    check_exit_code=check_exit_code)
  File "/usr/lib/python2.7/dist-packages/quantum/agent/linux/utils.py", line 61, in execute
    raise RuntimeError(m)
RuntimeError:

daniels (danxcai) wrote :

In my environment, when quantum runs for a while, this bug appears even though I didn't delete the network.

# ip netns exec qrouter-93a70ed7-ac17-41d3-9d92-4d8bfb0ee56a ip a
seting the network namespace failed: Invalid argument

Furthermore, I can't delete it:

# service quantum-l3-agent stop
quantum-l3-agent stop/waiting

# ip netns del qrouter-93a70ed7-ac17-41d3-9d92-4d8bfb0ee56a
Cannot remove /var/run/netns/qrouter-93a70ed7-ac17-41d3-9d92-4d8bfb0ee56a: Device or resource busy

daniels (danxcai) wrote :

Got it: it can be deleted when I kill all the metadata services.

Hi,

Debian is affected by this bug as well. I'm having it on my test server, and didn't find a solution. The issue even seems to prevent networks from being scheduled correctly for VMs (so I get no connectivity at all).

I am hitting the same issue, can the bug be reopened?

Hi,

After applying the proposed patch, it worked. I'd like this bug to be reopened, and the patch committed please. I'm adding it as a debian specific patch.

Vangelis Tasoulas (cyberang3l) wrote :

Same problem here. Similar behavior to the one described on comment 17.

Same problem here as well. How can I work around it when the situation already exists?

Marco Colombo (colo90) wrote :

Same problem here.
Confirmed: if I kill all metadata services, I can delete the namespace.

I tried all of these suggestions. The qdhcp namespace gets deleted after the quantum rootwrap line is added to the dhcp_agent.ini file.
But I am still unable to delete the qrouter namespace. I tried adding the same line to l3_agent.ini, to no avail.

Did anyone find proper solution to delete the unused namespaces?

Carl Baldwin (carl-baldwin) wrote :

I have found cases where the metadata service that was running in the namespace is no longer running. I've looked through /proc to find any other processes holding the netns file descriptor open.

Even after confirming that there are no processes holding a file descriptor on the namespace I still cannot delete the namespace. What other resources could be preventing the deletion of these namespaces?

Is it necessary to stop all metadata services in all namespaces? If so, this is not practical.
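
One way to look through /proc for the processes that belong to a single namespace, rather than stopping everything, is to compare inodes. This Python 3 sketch rests on an assumption: that /proc/<pid>/ns/net resolves to the same device and inode as the entry bound at /var/run/netns/<name>. That has not been verified for the kernels discussed in this report, and it will not hold once the entry is in the broken state described earlier. The function name and the `proc_root` parameter are hypothetical.

```python
import os


def pids_in_namespace(netns_path, proc_root="/proc"):
    """Return the PIDs whose network namespace matches the one at netns_path."""
    target = os.stat(netns_path)
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "sys"
        try:
            st = os.stat(os.path.join(proc_root, entry, "ns", "net"))
        except OSError:
            # The process exited, or we are not allowed to inspect it.
            continue
        if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
            pids.append(int(entry))
    return sorted(pids)
```

Run as root, `pids_in_namespace("/var/run/netns/qrouter-...")` would list only the processes to look at for that one namespace. It does not detect other kinds of references to the namespace, such as mounts.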

Marco Colombo (colo90) wrote :

Hi Carl,
it's possible that there are some dnsmasq processes.
If I kill all dnsmasq and stop all quantum-ns-metadata-proxy processes, I can delete the namespace.

Phil Hopkins (phil-hopkins-a) wrote :

I found that, after killing all dnsmasq and metadata processes, I still could not delete a qrouter namespace. After restarting the dbus process I was able to delete it. I did not think to check what the dbus service had mounted in that namespace, but restarting it released the mount so I could delete the namespace.

Julian Sternberg (jules-i) wrote :

Having exactly the same situation.

I've realized that the script runs just fine if there is no "network:dhcp" port up.
So it looks to me like this process (dnsmasq) is locking up the whole cleanup task.

If you cleanly set up networking without a dhcp controller (network:dhcp),
quantum-netns-cleanup just runs through.

Changed in neutron:
assignee: nobody → Sean McCully (sean-mccully)
Antonio S (ellohir) wrote :

How come this bug is low priority? It triggers every time a network is created/deleted and it completely blocks communication with the instances.

I've tried everything in the above comments with no success. I haven't tried the patch because I can't find where that file is located (I'm using Grizzly on Ubuntu 12.04; it's not in /opt or /etc/quantum).

lee jian (leejian0612) wrote :

Same problem here; I solved it by following comment 29.
To delete the namespace manually, follow these steps:
1. ps -ef | grep quantum-ns-metadata-proxy
2. kill all the processes you found
3. ip netns delete qdhcp-xxxxxx

Carl Baldwin (carl-baldwin) wrote :

Killing all processes that are run under "ip netns exec" is not acceptable. Imagine I've got many namespaces and I want to delete a small percentage of them. It is very disruptive to kill all dnsmasq and metadata processes regardless of the namespace in which they're running.

I found a problem with the iproute utility described here: http://permalink.gmane.org/gmane.linux.network/240875

The solution was committed here: https://git.kernel.org/cgit/linux/kernel/git/shemminger/iproute2.git/commit/?id=58a3e8270fe72f8ed92687d3a3132c2a708582dd

I applied this patch to my Ubuntu Precise version of iproute. It applied cleanly. I rebooted once to clear out processes that were started under the old version of "ip netns exec." Since then, I have been able to delete any namespace once the processes running under *that* namespace have been cleared out. I no longer need to stop *all* processes running under "ip netns exec" regardless of namespace.

This was the answer for me on Ubuntu Precise where I happen to be seeing the problem. I imagine if your version of iproute does not include the above patch then you will have a similar experience.

Changed in neutron:
assignee: Sean McCully (sean-mccully) → nobody
Andrey Korolyov (xdeller) wrote :

Since the issue is quite trivial and should not be fixed by replacing the base iproute package, I suggest adding a small snippet of code that does kill -s KILL for every process in 'ip netns pids qdhcp|qrouter' just before namespace removal. This should fix the current issues on distros with old iproute such as RH/Wheezy/Precise.

Eugene Nikanorov (enikanorov) wrote :

Andrey, this idea was tested and did not help to fix the issue.

Carl Baldwin (carl-baldwin) wrote :

Andrey, killing all of the processes running under all namespaces is *not* trivial. There could be hundreds of namespaces each running dnsmasq, metadata proxy, and potentially more. You are welcome to apply something like comment #33 manually but I would not advise that this be automated in the agents or the cleanup script.

Andrey Korolyov (xdeller) wrote :

OK, so we fixed it by using make-ip-netns-delete-more-likely-to-succeed.patch
from https://launchpad.net/ubuntu/precise/+source/iproute/20111117-1ubuntu2.1 because my idea of killing processes does not apply well to the veth pairs, which are affected too and are harder to map to the right namespace than regular processes when relatively old userspace tools are in use. This one applies to the RDO iproute2 package as well and fixes this problem completely, so I think we can finally close this issue.

Changed in neutron:
status: Incomplete → Fix Committed
Changed in neutron:
status: Fix Committed → Incomplete

Reviewed: https://review.openstack.org/56114
Committed: http://github.com/openstack/neutron/commit/7336f3bd27d138b3d11d601f977a1e3df2a44b3e
Submitter: Jenkins
Branch: master

commit 7336f3bd27d138b3d11d601f977a1e3df2a44b3e
Author: Carl Baldwin <email address hidden>
Date: Tue Nov 12 19:31:45 2013 +0000

    Optionally delete namespaces when they are no longer needed

    Adds a configuration option to tell the network agents to delete
    namespaces when they are no longer in use. The option defaults to
    False so that the agent will not attempt to delete namespaces in
    environments where this is not safe.

    This has been working well in deployments where iproute2 has been
    patched with commit 58a3e8270fe72f8ed92687d3a3132c2a708582dd or it is
    new enough to include it without being patched.

    Change-Id: Ice5242c6f0446d16aaaa7ee353d674310297ef72
    Closes-Bug: #1250596
    Related-Bug: #1052535

Andrey Korolyov (xdeller) wrote :

What's the status of the problem? The solution proposed above covers only metadata processes, but not LBaaS issues, for example.

Reviewed: https://review.openstack.org/84570
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=8ab6fd6d7e521a3692f57542e5c5c5d513d57ccc
Submitter: Jenkins
Branch: master

commit 8ab6fd6d7e521a3692f57542e5c5c5d513d57ccc
Author: Carl Baldwin <email address hidden>
Date: Tue Apr 1 22:16:59 2014 +0000

    Clean out namespaces even if we don't delete namespaces

    Even when we don't enable namespace deletion, we still want to run the
    code that cleans out the namespaces so that the devices get unplugged,
    etc. Otherwise, routers deleted while the agent is down will continue
    to operate as if they were never deleted.

    The trade-off to consider here is that if there are many stale
    namespaces this will slow down the restart of the L3 agent. The best
    option is to get namespace deletion working correctly. However, where
    that has not been worked out yet, this patch provides the cleaning
    service for deleted routers.

    Change-Id: Ic7b4608a23c4d9530f521d5faff3f8526200b92e
    Closes-Bug: #1301042
    Related-Bug: #1052535

Reviewed: https://review.openstack.org/84419
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=91657c1612b4cc037c74b77f4b3548f843c10fcd
Submitter: Jenkins
Branch: stable/havana

commit 91657c1612b4cc037c74b77f4b3548f843c10fcd
Author: Carl Baldwin <email address hidden>
Date: Tue Nov 12 19:31:45 2013 +0000

    Optionally delete namespaces when they are no longer needed

    Adds a configuration option to tell the network agents to delete
    namespaces when they are no longer in use. The option defaults to
    False so that the agent will not attempt to delete namespaces in
    environments where this is not safe.

    This has been working well in deployments where iproute2 has been
    patched with commit 58a3e8270fe72f8ed92687d3a3132c2a708582dd or it is
    new enough to include it without being patched.

    Change-Id: Ice5242c6f0446d16aaaa7ee353d674310297ef72
    Closes-Bug: #1250596
    Related-Bug: #1052535
    (cherry picked from commit 7336f3bd27d138b3d11d601f977a1e3df2a44b3e)
    Related-Bug: #1175695

tags: added: in-stable-havana
Changed in neutron:
assignee: nobody → Sudhakar Gariganti (sudhakar-gariganti)
status: Incomplete → In Progress
Carl Baldwin (carl-baldwin) wrote :

I wonder if we should close this bug now. I didn't realize that it had been left open. Namespaces can be deleted when they're no longer needed by setting options in the l3_agent.ini and dhcp_agent.ini files. See this commit for details:

https://review.openstack.org/84419

The reason that this is a configuration option and the default is False is the problem with the iproute package that prevents clean deletion of namespaces. Attempting to clean the namespace with a broken version of iproute can really make a mess of your network node. This was discussed in the comments above.

Thanks for the clarification, Carl. My bad, I missed the above patch. I tried setting dhcp_delete_namespaces to True and the namespaces are now removed properly.
I will abandon the new patch I submitted.

Agree with you that we can close this defect.
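
To check whether an agent configuration file has namespace deletion switched on, something like the following Python 3 sketch could be used. It assumes the option lives in the [DEFAULT] section and that a missing option means False, the default described in the commit message above. `dhcp_delete_namespaces` is the option named in this thread; `router_delete_namespaces` is assumed here to be the L3 agent counterpart and should be checked against your release. The function name is hypothetical.

```python
import configparser


def namespace_deletion_enabled(ini_text, option):
    """Return the boolean value of option in [DEFAULT], or False if it is absent."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.getboolean("DEFAULT", option, fallback=False)


# Sample text standing in for the contents of dhcp_agent.ini.
dhcp_ini = "[DEFAULT]\ndhcp_delete_namespaces = True\n"
print(namespace_deletion_enabled(dhcp_ini, "dhcp_delete_namespaces"))

# Assumed L3 option name; absent from this sample, so the result is False.
print(namespace_deletion_enabled(dhcp_ini, "router_delete_namespaces"))
```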

Change abandoned by Carl Baldwin (<email address hidden>) on branch: master
Review: https://review.openstack.org/105018
Reason: The author said they would abandon the patch but did not. Just getting it off our radar.

Changed in neutron:
assignee: Sudhakar Gariganti (sudhakar-gariganti) → nobody
Changed in neutron:
status: In Progress → Incomplete
status: Incomplete → New
Sam Betts (sambetts) wrote :

This bug has been set back to New, although the comments above seem to imply that it is closed. Can someone confirm the status of this issue?

Matt Lesko (mattlesko-nih) wrote :

I believe I have also encountered this bug recently with the lbaas environment, just as comment #41 suggests.

Changed in neutron:
assignee: nobody → Padmakanth (padmakanth-chandrapati)
Changed in neutron:
assignee: Padmakanth (padmakanth-chandrapati) → nobody
Tom Fifield (fifieldt) wrote :

This bug just featured in the ops meetup at the Paris summit.

tags: added: ops
Changed in neutron:
status: New → Confirmed
Edgar Magana (emagana) wrote :

I will give this bug a try.

Changed in neutron:
status: Confirmed → Triaged
assignee: nobody → Edgar Magana (emagana)
Edgar Magana (emagana) wrote :

OK, after testing this functionality, I found that once dhcp_agent.ini is configured to delete namespaces, everything works nicely. However, based on operators' feedback, that should be the default behavior, and therefore I will change it in my next commit.

Edgar Magana (emagana) wrote :

I am closing this bug because it is not really a bug. I will address the confusion with the operators by means of a note in the admin guide:
https://bugs.launchpad.net/openstack-manuals/+bug/1402739

Changed in neutron:
status: Triaged → Invalid

Reviewed: https://review.openstack.org/176471
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=723162501a5e2e5f202af9d95a1b946e3d43cf96
Submitter: Jenkins
Branch: master

commit 723162501a5e2e5f202af9d95a1b946e3d43cf96
Author: Eugene Nikanorov <email address hidden>
Date: Wed Apr 22 19:45:57 2015 +0400

    Finally let L3 and DHCP agents cleanup namespaces by default

    There has been a problem with iproute package that resulted in errors
    when deleting the namespaces, so deleting was turned off by default.
    According to tests with iproute version 3.12.0 there is no such issue
    so the option could be safely turned on by default.

    DocImpact
    Related-Bug: #1052535
    Related-Bug: #1402739

    Change-Id: I4c831f98fb2462382ef0f9216e265555186b965a
