Make keepalived healthcheck more configurable

Bug #1892200 reported by Edward Hope-Morley
This bug affects 1 person
Affects: neutron
Status: New
Importance: Wishlist
Assigned to: Unassigned
Milestone:

Bug Description

Since the Newton release, HA routers have had a keepalived healthcheck that fails if it doesn't get a response to a single ping, or if the expected tenant network address is not configured in the local namespace being watched. While this works in most cases where an environment is stable, it appears to produce a lot of instability as soon as an environment comes under load, or a node fails and failovers occur. One example appears to be when the transition of the MASTER to a new node takes a little longer than it should. We have seen in the field that, under heavy load, the external network address that keepalived is tracking can for a very short period be configured on two interfaces/hosts at once. While neutron is still sending its GARP updates, a ping from the new master router can fail to get a response for 50% of requests, since the switch may still send the reply to either the new master or the old one.

To prevent transient problems like this from causing further instability, we would like to make the healthcheck a little more tolerant of them. Currently the healthcheck script is generated by Neutron for each router and its contents are not configurable. It would be great to be able to change, for example, the number of pings it will attempt before declaring a failure.
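The tolerance being asked for can be sketched as follows. This is only an illustration in Python, not the script Neutron generates: the names `ping_once`, `gateway_reachable`, `attempts` and `probe` are hypothetical, and the ping flags assume a Linux iputils `ping`.

```python
import subprocess
from typing import Callable


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send a single ICMP echo request; True if ping exits with status 0."""
    result = subprocess.run(
        ["ping", "-c", "1", "-w", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        check=False,
    )
    return result.returncode == 0


def gateway_reachable(
    address: str,
    attempts: int = 3,
    probe: Callable[[str], bool] = ping_once,
) -> bool:
    """Report failure only if every one of `attempts` probes fails.

    `attempts` is the hypothetical operator-tunable knob; attempts=1
    corresponds to the single-ping behaviour described in this report.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    # any() stops at the first successful probe, so a healthy gateway
    # still costs only one ping.
    return any(probe(address) for _ in range(attempts))
```

With `attempts=3`, a router that loses half of its replies during a failover would be declared unhealthy only if three consecutive probes all go unanswered.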

description: updated
Slawek Kaplonski (slaweq) wrote :

I think this is a pretty good idea for some use cases.
But since keepalived has tons of different options, could you list exactly which of them you want to make configurable through the neutron config files?

Or maybe we should do it differently and, e.g., provide a template of the config file and fill it in with per-router variables such as interface name, IP addresses, etc.

That way users could configure whatever keepalived options they need by supplying this template file to the L3 agent. What do you think?

Changed in neutron:
importance: Undecided → Wishlist
Edward Hope-Morley (hopem) wrote :

@slaweq actually, using a template for the config file rather than generating everything in code is absolutely something I think we should do, because it would make changes like this easier in the future. In terms of which config to change: right now the ping test is hardcoded in [1], so I was thinking of starting by making e.g. the number of pings configurable. But now that you mention it, perhaps a better approach would be to move the existing code to a template whose path is configurable, so that operators can make their own changes/additions but still have it rendered by Neutron with the necessary settings (gateway IP etc.).

[1] https://github.com/openstack/neutron/blob/114ac0ae89cfad124a906430ba70c14d0678391a/neutron/agent/linux/keepalived.py#L506
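A rough sketch of the template idea, to make the proposal concrete. Everything here is an assumption for illustration: the template body is not Neutron's real healthcheck script, `render_healthcheck` and its parameters are hypothetical names, and Python's stdlib `string.Template` merely stands in for whichever templating engine ends up being used.

```python
from pathlib import Path
from string import Template
from typing import Optional

# Illustrative default only; NOT the script Neutron currently generates.
# With string.Template, a literal "$" in the script must be written "$$".
DEFAULT_TEMPLATE = """\
#!/bin/bash -eu
ip a | grep -q "$monitored_address" || exit 1
ping -c $ping_count -W $ping_timeout $gateway_ip 1>/dev/null || exit 1
"""


def render_healthcheck(
    gateway_ip: str,
    monitored_address: str,
    ping_count: int = 1,
    ping_timeout: int = 1,
    template_path: Optional[Path] = None,
) -> str:
    """Render the healthcheck from an operator template or the default.

    `template_path` models the hypothetical "configurable path" option:
    when unset, the built-in default template is used.
    """
    text = template_path.read_text() if template_path else DEFAULT_TEMPLATE
    # substitute() raises KeyError if the template uses a placeholder
    # that is not supplied here, so typos in a custom template fail loudly.
    return Template(text).substitute(
        gateway_ip=gateway_ip,
        monitored_address=monitored_address,
        ping_count=ping_count,
        ping_timeout=ping_timeout,
    )
```

The point of the design is that per-router values (gateway IP, monitored address) still come from Neutron, while everything else in the script is under the operator's control.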

Slawek Kaplonski (slaweq) wrote :

I would like to discuss this at the next Neutron drivers meeting, which will be on Friday: http://eavesdrop.openstack.org/#Neutron_drivers_Meeting. It would be great if you could join in case there are any additional questions, but the RFE should be discussed even if you are not able to attend.

tags: added: l3-ha rfe-triaged
Slawek Kaplonski (slaweq) wrote :

We discussed the proposal at our last drivers meeting and decided to approve this RFE, provided it is implemented with a template (e.g. Jinja2).
The default template should produce exactly the same keepalived config as we have now.

tags: added: rfe-approved
removed: rfe-triaged
Changed in neutron:
milestone: none → next
Dan Radez (dradez)
Changed in neutron:
assignee: nobody → Dan Radez (dradez)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759886

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/760372

Dan Radez (dradez) wrote :

There are two patches proposed now. The first converts the config generation to Jinja2. The second adds the specific parameters @hopem has requested.

I'm of the opinion we should file new bugs to address more options. Let me know if you think otherwise.

Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Dan Radez (dradez) → nobody
status: In Progress → New
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/759886
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
