neutron

Add an option for graceful l3 agent shutdown

Bug #1851609 reported by Oleg Bondarev on 2019-11-07

This bug affects 2 people

Affects   Status         Importance   Assigned to   Milestone
neutron   Fix Released   Medium       Brian Haley   neutron ussuri-2

Bug Description

If KillMode in the systemd config of a neutron l3 agent service is set to 'process', systemd will not kill child processes when the main service stops. This is useful when we don't want data-plane downtime on agent stop/restart due to keepalived exiting.

However, in some cases graceful cleanup on l3 agent shutdown is needed. For example, with a containerised control plane, when kubernetes kills the l3-agent pod it automatically kills its children (the keepalived processes) in a non-graceful way, so keepalived does not clear its VIPs. This leads to a situation where the same VIP is present on different nodes, and hence to long downtime.

The proposal is to add a new l3 agent config option so that the agent handles stop (SIGTERM) by deleting all routers. For HA routers this results in a graceful keepalived shutdown.
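A minimal sketch of the proposed behaviour, not the actual neutron implementation: a SIGTERM handler that deletes every router only when an opt-in flag is set. The names RouterRegistry, make_sigterm_handler and cleanup_on_stop are hypothetical and chosen for illustration only.

```python
import signal


class RouterRegistry:
    """Hypothetical stand-in for the agent's table of hosted routers."""

    def __init__(self, router_ids):
        self.routers = {router_id: "active" for router_id in router_ids}
        self.deleted = []

    def delete_router(self, router_id):
        # In a real agent this step would tear the router down and, for HA
        # routers, stop keepalived gracefully so that it releases its VIPs.
        self.routers.pop(router_id)
        self.deleted.append(router_id)


def make_sigterm_handler(registry, cleanup_on_stop):
    """Build a signal handler that deletes all routers if cleanup is enabled.

    cleanup_on_stop is a hypothetical flag standing in for the proposed
    config option; when it is False the handler leaves routers untouched.
    """

    def handle_sigterm(signum, frame):
        if not cleanup_on_stop:
            return
        # Iterate over a copy because delete_router mutates the dict.
        for router_id in list(registry.routers):
            registry.delete_router(router_id)

    return handle_sigterm
```

The handler would be registered from the main thread with signal.signal(signal.SIGTERM, make_sigterm_handler(registry, True)). Keeping the flag off by default preserves the existing behaviour, where routers and keepalived survive an agent restart.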

OpenStack Infra (hudson-openstack) wrote on 2019-11-07: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/693323

Changed in neutron:
status:	New → In Progress

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2019-11-07:

Hi Oleg:

Can you bring this topic to the next L3 meeting [1]? It sounds interesting.

Regards.

[1] http://eavesdrop.openstack.org/#Neutron_L3_Sub-team_Meeting

Slawek Kaplonski (slaweq) wrote on 2019-11-10:

Hi Oleg,

I agree with Rodolfo that you should first talk about this with the L3 subteam, and then, after initial triaging by them, we can also discuss it at the drivers meeting.
I also added the "rfe" tag to track this as an RFE.

tags:

added: rfe

Oleg Bondarev (obondarev) wrote on 2019-11-11:

Hi, I added this bug to the agenda, thanks.

Slawek Kaplonski (slaweq) on 2019-11-14

tags:

added: rfe-triaged
removed: rfe

Brian Haley (brian-haley) wrote on 2019-12-04:

I have a question based on the initial description.

This seems to revolve around keepalived getting killed when the l3-agent pod is killed. So it seems to be in a different pod? Is there some way to not kill that pod when the l3-agent one is killed?

For example, in our containerized deployment we spawn keepalived in a sidecar container, so that when the l3-agent container is destroyed, keepalived continues to run. This leads to no dataplane downtime.

Oleg Bondarev (obondarev) wrote on 2019-12-05:

@Brian, well, it's the l3 agent that starts the keepalived processes, and in our deployment the l3 agent and keepalived run in the same pod. Can you please clarify how it's possible to configure neutron so that keepalived runs in a sidecar container?

Brian Haley (brian-haley) wrote on 2019-12-05:

@Oleg - I wasn't the one who did the work so I can't explain the details, but basically when keepalived, or dibbler, is spawned a separate sidecar container is created for it, which will survive even if the parent container is destroyed. This way we can restart the l3-agent without affecting the dataplane.

Slawek Kaplonski (slaweq) wrote on 2019-12-06:

@Oleg, I don't think that you can configure what Brian described directly in Neutron. We have done that in TripleO.

Brian Haley (brian-haley) wrote on 2019-12-06:

Right, I guess my point was that there is perhaps a way to do this without adding a new config option.

Brent Eagles (beagles) wrote on 2019-12-06:

#10

I see the original bug report as stating two different issues:
1. that keepalived etc. are tied to the l3 agent's lifetime, and
2. that it is desirable to have a mode of shutdown that gracefully "drops" the agent, effectively removing the router(s) (e.g. making sure the VIP is no longer assigned on that host)

For 1.), TripleO generates scripts at deployment time that create a container on the host. The exact mechanism differs depending on whether docker or podman is in use. The scripts are mounted into /usr/local/bin under whatever name is relevant (e.g. keepalived, dnsmasq, etc.) and, because /usr/local/bin comes before /usr/bin in the PATH, neutron invokes them instead. In addition to the scripts, this requires mounting some shared directories for things like the pid and state files, but it seems to work okay. We used to just have the sidecars destroyed via their pids, but that didn't work well for non-docker deployments, so a "kill script" mechanism was implemented and used. The kill script takes care of stopping and removing the container.

2.) does involve more knowledge of what is required for a graceful shutdown, so it seems like something neutron should handle.
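The PATH-shadowing trick described for 1.) can be illustrated with a small sketch. It is not the TripleO code: it only shows that a lookup finds a wrapper in an earlier PATH directory before the real binary in a later one. The directory names usr_local_bin and usr_bin are temporary stand-ins for /usr/local/bin and /usr/bin, and both helper functions are hypothetical. A POSIX system is assumed.

```python
import os
import shutil
import stat
import tempfile


def make_fake_executable(directory, name):
    """Create a trivial executable shell script called `name` in `directory`."""
    path = os.path.join(directory, name)
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\nexit 0\n")
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)
    return path


def first_match_dir(name):
    """Return the name of the directory in which `name` is resolved first.

    Two temporary directories stand in for /usr/local/bin (wrapper scripts)
    and /usr/bin (real binaries); the wrapper directory is listed first.
    """
    with tempfile.TemporaryDirectory() as root:
        wrapper_dir = os.path.join(root, "usr_local_bin")
        system_dir = os.path.join(root, "usr_bin")
        os.makedirs(wrapper_dir)
        os.makedirs(system_dir)
        make_fake_executable(wrapper_dir, name)
        make_fake_executable(system_dir, name)
        # Same ordering as in the comment above: wrappers before real binaries.
        search_path = os.pathsep.join([wrapper_dir, system_dir])
        resolved = shutil.which(name, path=search_path)
        return os.path.basename(os.path.dirname(resolved))
```

Calling first_match_dir("keepalived") resolves to the wrapper directory, which is why a process that spawns keepalived by bare name ends up running the wrapper script instead.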

Slawek Kaplonski (slaweq) wrote on 2019-12-13:

#11

At the last drivers meeting we agreed to accept this RFE, so feel free to continue work on implementing this feature :)

tags:

added: rfe-approved
removed: rfe-triaged

Slawek Kaplonski (slaweq) on 2019-12-13

Changed in neutron:
milestone:	none → ussuri-2

OpenStack Infra (hudson-openstack) on 2019-12-13

Changed in neutron:
assignee:	Oleg Bondarev (obondarev) → Brian Haley (brian-haley)

OpenStack Infra (hudson-openstack) wrote on 2019-12-18: Fix merged to neutron (master)

#12

Reviewed: https://review.opendev.org/693323
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=566351761318aa1f33650ba4d78b55cc6a4f8f7b
Submitter: Zuul
Branch: master

commit 566351761318aa1f33650ba4d78b55cc6a4f8f7b
Author: Oleg Bondarev <email address hidden>
Date: Wed Nov 6 11:43:57 2019 +0400

Support L3 agent cleanup on shutdown

Add an option to delete all routers on agent shutdown.

Closes-Bug: #1851609
Change-Id: I7a4056680d8453b2ef2dcc853437a0ec4b3e8044

Changed in neutron:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2020-02-21: Fix included in openstack/neutron 16.0.0.0b1

#13

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote on 2023-01-30: Fix proposed to neutron (stable/zed)

#14

Fix proposed to branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/872114

OpenStack Infra (hudson-openstack) wrote on 2023-01-30: Change abandoned on neutron (stable/zed)

#15

Change abandoned by "Mark Goddard <email address hidden>" on branch: stable/zed
Review: https://review.opendev.org/c/openstack/neutron/+/872114
