neutron-netns-cleanup doesn't clean up all L3 agent spawned processes

Bug #1403455 reported by Assaf Muller on 2014-12-17
48
This bug affects 8 people
Affects Status Importance Assigned to Milestone
neutron
High
Daniel Alvarez

Bug Description

neutron-netns-cleanup cleans all DHCP resources by invoking a command on the DHCP driver. However, in the L3 agent case, it merely tries to remove the router namespaces. Any child processes that we've added over the years (Specifically metadata proxy, radvd, keepalived at this point) are not cleaned up. I think that netns-cleanup should remove any routers on the system the same way we remove DHCP resources, by invoking a method on a L3 agent instance.

Miguel Angel Ajo (mangelajo) wrote :

Correct, we need proper cleanup for the L3 agent resources.

Also for advanced services, but now that they are out of tree we may determine a way to loop them in in the cleanup (probably on a separate bug)

Changed in neutron:
assignee: nobody → Miguel Angel Ajo (mangelajo)
Assaf Muller (amuller) wrote :

One way would be for each service to provide its own cleanup script in its own repo. If we want to split up the DHCP and L3 clean ups anyway that makes sense.

Miguel Angel Ajo (mangelajo) wrote :

Hmm, nice suggestion, it makes a lot of sense actually.

For backwards compatibility we could make the main netns cleanup call them all, or otherwise let the user call
each cmdline tool separately.

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Miguel Angel Ajo (mangelajo) wrote :

I'm taking on this soon.

Miguel Angel Ajo (mangelajo) wrote :

Ok, a possible design for this, considering separate repos, and avoiding the need to directly modify netns for every new service to be handled, with a fallback plan:

1) Use stevedore HookManager to let out-of-tree repos register netns prefixes declaration, and netns cleaners,
    so every piece of code (in-tree or out-of-tree) declare which netns prefixes they use, and provide a netns cleanup
    hook to be called.
2) Before cleaning a namespace blindly in the end, identify any network service in the namespace (via netstat), kill those processes, so they aren't orphaned, and then, kill the namespace.

An ERROR log would be thrown for any unknown process left over, so we could easily identify that via tempest runs (via a final cleanup that fails when something unexpected is left-over).

Changed in neutron:
milestone: none → mitaka-1
Changed in neutron:
milestone: mitaka-1 → mitaka-2
Miguel Angel Ajo (mangelajo) wrote :

Ok, I have to tackle this before it slips another cycle.

Do my comments in #5 make sense?, if that's the case I will proceed.

Assaf Muller (amuller) wrote :

In response to comment 5, generally speaking I'd prefer to move to a model where we delete a router via the L3 agent code which already knows how to clean up resources correctly.

Cedric Brandily (cbrandily) wrote :

I would be in favor of Assaf approach which uses l3-agent code to clean its own resources

Changed in neutron:
milestone: mitaka-2 → mitaka-3
Changed in neutron:
milestone: mitaka-3 → mitaka-rc1

If this doesn't have a viable fix by the next day or so, it can't go in RC1

Changed in neutron:
milestone: mitaka-rc1 → newton-1
David Medberry (med) wrote :

@armando-migliaccio any progress?

Assaf Muller (amuller) wrote :

Armando is not the assignee. Looking at Miguel's on going, the answer is no.

Changed in neutron:
assignee: Miguel Angel Ajo (mangelajo) → nobody
Miguel Angel Ajo (mangelajo) wrote :

Thanks @assaf, let's let others pick it up.

After much thinking (and quite little doing) I believe the option "2" I proposed is a rather reasonable one:

2) Before cleaning a namespace blindly in the end, identify any network service in the namespace (via netstat), kill those processes, so they aren't orphaned, and then, kill the namespace.

Any process should be safely killed that way, and if it's not, we can complicate our lifes and code with "1":
1) Use stevedore HookManager to let out-of-tree repos register netns prefixes declaration, and netns cleaners,
    so every piece of code (in-tree or out-of-tree) declare which netns prefixes they use, and provide a netns cleanup
    hook to be called.

or something of that sort.

Changed in neutron:
milestone: newton-1 → newton-2
Changed in neutron:
milestone: newton-2 → newton-3

This is an automated cleanup. This bug report has been closed because it
is older than 18 months and there is no open code change to fix this.
After this time it is unlikely that the circumstances which lead to
the observed issue can be reproduced.

If you can reproduce the bug, please:
* reopen the bug report (set to status "New")
* AND add the detailed steps to reproduce the issue (if applicable)
* AND leave a comment "CONFIRMED FOR: <RELEASE_NAME>"
  Only still supported release names are valid (INCUBATOR-JUNO, LIBERTY, MITAKA, NEWTON).
  Valid example: CONFIRMED FOR: INCUBATOR-JUNO

Changed in neutron:
importance: Medium → Undecided
status: Confirmed → Expired
Assaf Muller (amuller) on 2016-08-17
Changed in neutron:
status: Expired → Confirmed

Ack: any volunteer?

Changed in neutron:
status: Confirmed → Incomplete
milestone: newton-3 → none
status: Incomplete → Triaged

Without a volunteer, this is going to be cleaned up again.

tags: added: low-hanging-fruit
tags: added: l3-dvr-backlog
removed: low-hanging-fruit

Let's bump the priority to see if we can get some attention.

Changed in neutron:
importance: Undecided → High
tags: added: needs-attention

@Armax i'll give it a try, from above discussion what i can think of solution is to
have a method in netns-cleanup.py which will invoke a cleanup script provided by l3-agent ?

Changed in neutron:
assignee: nobody → Manjeet Singh Bhatia (manjeet-s-bhatia)

I'd suggest you reach out to amuller on irc, to see if he can help you set yourself on the right track. Thanks for volunteering! Kudos to you!

Assaf Muller (amuller) on 2016-10-07
Changed in neutron:
status: Triaged → In Progress
Changed in neutron:
assignee: Manjeet Singh Bhatia (manjeet-s-bhatia) → Daniel Alvarez (dalvarezs)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/383936
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
assignee: Daniel Alvarez (dalvarezs) → Dilip Renkila (dilip-renkila278)
assignee: Dilip Renkila (dilip-renkila278) → nobody
Changed in neutron:
assignee: nobody → Daniel Alvarez (dalvarezs)

Reviewed: https://review.openstack.org/402140
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1d38f30555384257445f0d119a307e74e88d7fbf
Submitter: Jenkins
Branch: master

commit 1d38f30555384257445f0d119a307e74e88d7fbf
Author: Daniel Alvarez <email address hidden>
Date: Thu Nov 24 16:32:50 2016 +0000

    Kill processes when cleaning up namespaces

    This patch will kill processes that are listening on any port/UNIX
    socket within the namespace to be cleaned up. To kill them it will
    issue a SIGTERM to them (or to their parents if they were forked) and,
    if they don't die after a few seconds, a SIGKILL to them and all their
    children.

    This is intended for those cases when there's no specific cleanup and
    serves as a fallback method.

    Change-Id: I4195f633ef4a1788496d1293846f19eef89416aa
    Partial-Bug: #1403455

Reviewed: https://review.openstack.org/417957
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=340e776492ee9ec00261871996662686221f9f82
Submitter: Jenkins
Branch: stable/newton

commit 340e776492ee9ec00261871996662686221f9f82
Author: Daniel Alvarez <email address hidden>
Date: Thu Nov 24 16:32:50 2016 +0000

    Kill processes when cleaning up namespaces

    This patch will kill processes that are listening on any port/UNIX
    socket within the namespace to be cleaned up. To kill them it will
    issue a SIGTERM to them (or to their parents if they were forked) and,
    if they don't die after a few seconds, a SIGKILL to them and all their
    children.

    This is intended for those cases when there's no specific cleanup and
    serves as a fallback method.

    Change-Id: I4195f633ef4a1788496d1293846f19eef89416aa
    Partial-Bug: #1403455
    (cherry picked from commit 1d38f30555384257445f0d119a307e74e88d7fbf)

tags: added: in-stable-newton
Changed in neutron:
status: In Progress → Fix Committed
tags: added: neutron-proactive-backport-potential
tags: removed: neutron-proactive-backport-potential
tags: removed: needs-attention
Changed in neutron:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers