[RFE] Implement "stop agent" hook script support in Neutron

Bug #1825943 reported by Cédric Jeanneret
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Wishlist
Slawek Kaplonski

Bug Description

Dear all,

Following a quick thread on the openstack-discuss ML[1], I would like to request a new feature:
Hook script support for Neutron agents.

The Why:
Neutron doesn't know about containers, and just kills processes whenever it doesn't need them, keeping dangling containers.
This can lead to some issues, among them:
- disk space usage
- "failed" containers (exit-code 137 instead of a clean 0)
- issues when validating container state in CI[2]
- possible issue starting a new container due to name already used (not yet seen, but might occur)

The What:
Currently, launching an agent already supports a so-called "wrapper" (script). The current use in TripleO is to start a new container with the right options.
Adding such a feature for the "stop" part would allow to do more than just kill the process:
- clean up the configuration (in bare-metal use)
- clean up temporary files (in bare-metal use)
- clean up the container(s) instead of keeping them in failed state

The How:
Apparently, it might be possible to take advantage of the "get_stop_command" parameter for the "external_process.ProcessMonitor"[3].

Thank you!

Cheers,

C.

[1] http://lists.openstack.org/pipermail/openstack-discuss/2019-April/005258.html
[2] https://github.com/openstack/tripleo-quickstart-extras/blob/master/roles/validate-services/tasks/containers.yaml#L18-L36
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/linux/external_process.py#L98

affects: tripleo → neutron
Changed in neutron:
milestone: train-1 → none
status: Triaged → New
tags: added: rfe
Changed in neutron:
importance: Medium → Wishlist
Revision history for this message
Miguel Lavalle (minsel) wrote :

Hi Cedric,

Thanks for your submission. Of course we always consider proposals to improve Neutron. Having said that, I have a couple of questions about your proposal:

1) Shouldn't this cleanup be the responsibility of the containerization framework? Have you contemplated that possibility? Why wouldn't the containerization framework be capable of handling the cleanup? Why do we have to resort to Neutron?

2) You also mention "hooks" in your proposal. What would be the mechanism to register such hooks? An API?

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Hello Miguel,

Thank you for your interest. Here are the answers:

- Relying on the containerization framework:
Not really possible, since the "framework" doesn't exist: it's plain docker or podman calls. The only way to get those containers cleaned would be to run some heuristic job (like a garbage collector).

Running such a job is a bad idea, since it might delete containers that are really crashed, preventing an easy way to get them back (in TripleO, that would more or less require a new deploy run).
We might of course keep track of the containers in a file, "managed" by the current agent wrapper scripts, but that would be ugly, and might lead to issues if the file get corrupted for some reasons.

Since Neutron is already capable of creating and starting a container (via a wrapper script), it should also be able to stop it cleanly, and remove it, instead of keeping dangling things in place. That would also prevent misreading in case of debug.
For instance, we detected those dangling containers in OpenStack CI, because a new validation was added. This validation is checking for failed containers or services and, depending on when we run this validation, it pointed the Neutron metadata agent container (it was done after Tempest runs).

- Hook registration:
I'm talking about "hooks" in order to be wider than just "delete containers". Neutron already has a mechanism for the agent start allowing to use a wrapper script[1] - my idea/request is to make the same thing in order to stop the agent.
Being able to use a hook, or to sticking with the current nomenclature, a "wrapper script" would allow to do all the needed things, and even more.

Since the current "start" wrappers aren't added via an API, I suppose the "stop" wrappers/hooks don't need this feature. I didn't check how Neutron currently manages the start part, but in my mind, it's something like:
"does a script exist for this action? if so, execute it - if not, just kill the pid".

Does it answer your questions? Feel free to ping me if you need more details :).

Cheers,

C.

[1] those script are deployed during the installation. An example for such a file is here:
https://github.com/openstack/puppet-tripleo/blob/master/templates/neutron/haproxy.epp
It's created/managed by this class:
https://github.com/openstack/puppet-tripleo/blob/master/manifests/profile/base/neutron/wrappers/haproxy.pp

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I agree with Cedric here. We are currently simply send "kill" (-9 in some cases) to the process, even if it is in fact container.
What we can IMHO do is provide e.g. config option (or options, one per type of process managed by neutron, like e.g. one for dnsmasq, one for keepalived and so on) that will allow to set some "kill script". If it would be set, it would be used instead of kill command to stop such processes.

I think that maybe this can somehow help also to solve problem with our too wide rootwrap kill filters for python processes, as we can then add similar "killer scripts" for e.g. "python neutron-keepalived-state-change-monitor" process.

Revision history for this message
Miguel Lavalle (minsel) wrote :

@Cedric,

What do you think of Slawek's proposal? Does this address your use case? Can you add details to Slawek's proposal

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Hello Miguel,

well... that's more or less what I propose. Call it "kill script", "hook", as long as we get a way to stop and clean up containers instead of keeping them dangling with a bad exit status, I'm OK :).

Cheers,

C.

Revision history for this message
Brent Eagles (beagles) wrote :

I agree with Slaweq's proposal of having separate config options for each type of process. Certain processes continue to run in the same container as the respective agent and others processes might require special service specific cleanups.

Miguel Lavalle (minsel)
tags: added: rfe-triaged
removed: rfe
Revision history for this message
Miguel Lavalle (minsel) wrote :

The neutron drivers team discussed and approved this RFE

tags: added: rfe-approved
removed: rfe-triaged
Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
status: New → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/661760

Changed in neutron:
status: Confirmed → In Progress
Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Slawek Kaplonski (slaweq)
Changed in neutron:
assignee: Slawek Kaplonski (slaweq) → Brian Haley (brian-haley)
Changed in neutron:
assignee: Brian Haley (brian-haley) → Slawek Kaplonski (slaweq)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/661760
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=93015527f02cb54b0b2bc07c8c9c239c63878264
Submitter: Zuul
Branch: master

commit 93015527f02cb54b0b2bc07c8c9c239c63878264
Author: Slawek Kaplonski <email address hidden>
Date: Mon May 27 13:17:28 2019 +0200

    Add kill hooks for external processes

    This patch adds possibility to configure kill hooks used to kill
    external processes, like dnsmasq or keepalived.

    Change-Id: I29dfbedfb7167982323dcff1c4554ee780cc48db
    Closes-Bug: #1825943

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/663289

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/stein)

Reviewed: https://review.opendev.org/663289
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=702d4b321447d94598e1962cb67a7da968fa5fba
Submitter: Zuul
Branch: stable/stein

commit 702d4b321447d94598e1962cb67a7da968fa5fba
Author: Slawek Kaplonski <email address hidden>
Date: Mon May 27 13:17:28 2019 +0200

    Add kill hooks for external processes

    This patch adds possibility to configure kill hooks used to kill
    external processes, like dnsmasq or keepalived.

    Conflicts:
        neutron/tests/functional/agent/l3/framework.py

    Change-Id: I29dfbedfb7167982323dcff1c4554ee780cc48db
    Closes-Bug: #1825943
    (cherry picked from commit 93015527f02cb54b0b2bc07c8c9c239c63878264)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.2

This issue was fixed in the openstack/neutron 14.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 15.0.0.0b1

This issue was fixed in the openstack/neutron 15.0.0.0b1 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.