support disabling keepalived healthcheck

Bug #1890900 reported by Edward Hope-Morley on 2020-08-08
This bug affects 2 people
Affects                              Status        Importance  Assigned to         Milestone
OpenStack neutron-gateway charm      Fix Released  High        Edward Hope-Morley  20.10
OpenStack neutron-openvswitch charm  Fix Released  High        Edward Hope-Morley  20.10

Bug Description

We have observed that neutron l3ha can get into a situation where continuously failing healthchecks can themselves cause other nodes to fail their healthchecks if keepalived doesn't perform IP address cleanup and ARP refresh fast enough. As a result, we need to be able to temporarily disable healthchecks.

Tags: sts
Edward Hope-Morley (hopem) wrote :

Another thing that would be good to adjust is the number of pings that are tried before a failure is declared. Currently the healthcheck looks like:

#!/bin/bash -eu
ip a | grep 192.168.100.22 || exit 0
ping -c 1 -w 1 10.5.150.1 1>/dev/null || exit 1

But the number of pings is not configurable:

https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py#L558

Changed in charm-neutron-gateway:
importance: Undecided → High
Changed in charm-neutron-openvswitch:
importance: Undecided → High
Changed in charm-neutron-gateway:
assignee: nobody → Edward Hope-Morley (hopem)
Changed in charm-neutron-openvswitch:
assignee: nobody → Edward Hope-Morley (hopem)

Fix proposed to branch: master
Review: https://review.opendev.org/745440

Changed in charm-neutron-openvswitch:
status: New → In Progress
Changed in charm-neutron-gateway:
status: New → In Progress
Felipe Reyes (freyes) on 2020-08-10
tags: added: sts
Felipe Reyes (freyes) wrote :

> #!/bin/bash -eu
> ip a | grep 192.168.100.22 || exit 0
> ping -c 1 -w 1 10.5.150.1 1>/dev/null || exit 1

This would require increasing the deadline (-w 1) too, since ping exits when either the requested number of pings has completed or the deadline is reached, whichever happens first.

Since the current command assumes that a single ping takes less than a second, I think we can safely assume that 1 second per ping is the maximum to wait, hence we should use the same argument for the ping count and the deadline.
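
A minimal sketch of that suggestion, assuming the healthcheck were generated with a configurable count (neutron does not currently offer this; the function names and the count parameter are hypothetical, and the addresses are the ones from the script quoted above). One caveat: per the iputils ping man page, when both -c and -w are given, ping exits non-zero if fewer than count replies arrive before the deadline, so a higher count on its own makes the check stricter rather than more tolerant.

```shell
#!/bin/bash
# Hypothetical helper: use one value for both the ping count (-c)
# and the deadline in seconds (-w), as suggested in the comment above.
ping_args() {
    local count="$1" target="$2"
    printf '%s\n' "-c ${count} -w ${count} ${target}"
}

# Hypothetical wrapper mirroring the generated healthcheck script.
# Returns 0 when this node does not hold the VIP or the gateway replies.
healthcheck() {
    local vip="$1" target="$2" count="${3:-1}"
    # Nothing to check if the VIP is not configured on this node.
    ip a | grep -qF "$vip" || return 0
    # Word splitting of the argument string is intentional here.
    # shellcheck disable=SC2046
    ping $(ping_args "$count" "$target") 1>/dev/null || return 1
}
```

For example, `healthcheck 192.168.100.22 10.5.150.1 3` would send up to three pings with a three second deadline.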

I believe it would be better to expose the vrrp_script config options
"rise" and "fall".

From the manpage:

           # required number of successes for OK transition
           rise <INTEGER>

           # required number of successes for KO transition
           fall <INTEGER>

This would give less stable networks a better chance of
holding the VIP even when some pings have failed.

Currently the rise and fall config options are both set to 2, so two
consecutive failed pings trigger the transition.

https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/linux/keepalived.py#L521
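
To illustrate the effect of rise and fall, here is a simplified model of the counting, not keepalived's actual implementation. It assumes the script starts in the OK state and that only consecutive results count; the function name is hypothetical.

```shell
#!/bin/bash
# track_state RISE FALL RESULT...
# Each RESULT is 0 (check passed) or 1 (check failed).
# Prints the state (OK or KO) after all results have been processed.
track_state() {
    local rise="$1" fall="$2"
    shift 2
    local state="OK" ok=0 ko=0 result
    for result in "$@"; do
        if [ "$result" -eq 0 ]; then
            # A success resets the failure streak.
            ok=$((ok + 1))
            ko=0
            if [ "$ok" -ge "$rise" ]; then
                state="OK"
            fi
        else
            # A failure resets the success streak.
            ko=$((ko + 1))
            ok=0
            if [ "$ko" -ge "$fall" ]; then
                state="KO"
            fi
        fi
    done
    echo "$state"
}
```

With the current values, `track_state 2 2 1 1` ends in KO after two consecutive failures, while a larger fall such as `track_state 2 4 1 1 1` would still be OK after three.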

Edward Hope-Morley (hopem) wrote :

FWIW I have raised a bug at https://bugs.launchpad.net/neutron/+bug/1892200 to discuss what changes we can apply in neutron to improve the test itself.

Changed in charm-neutron-gateway:
milestone: none → 20.10
Changed in charm-neutron-openvswitch:
milestone: none → 20.10

Reviewed: https://review.opendev.org/745440
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=5d83c2c702d9bd720286f8da4329aa36417228a8
Submitter: Zuul
Branch: master

commit 5d83c2c702d9bd720286f8da4329aa36417228a8
Author: Edward Hope-Morley <email address hidden>
Date: Sat Aug 8 15:37:50 2020 +0100

    Add keepalived-healthcheck-interval config option

    Defaults to 30s (i.e. enabled) but also allows disabling
    healthchecks by setting to 0.

    Change-Id: I5bb7d362f0d957237e24f79f1f82583661bed470
    Closes-Bug: #1890900

Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
Changed in charm-neutron-gateway:
status: In Progress → Fix Committed

Reviewed: https://review.opendev.org/745441
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-gateway/commit/?id=8d71c414811a78714e79bb0743975458aa5cd2a2
Submitter: Zuul
Branch: master

commit 8d71c414811a78714e79bb0743975458aa5cd2a2
Author: Edward Hope-Morley <email address hidden>
Date: Sat Aug 8 15:39:31 2020 +0100

    Add keepalived-healthcheck-interval config option

    Defaults to 30s (i.e. enabled) but also allows disabling
    healthchecks by setting to 0.

    Change-Id: I49603c22d8085aabd6085058e4d4eb9c74e84a20
    Closes-Bug: #1890900
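
As a rough usage sketch for the new option: the option name and the meaning of 0 come from the commit messages above, while the application names are an assumption (they depend on how the charms were deployed) and the helper functions are hypothetical. The script only prints the juju commands unless DRY_RUN=0 is set.

```shell
#!/bin/bash
# Print the command by default; run it for real only when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Set keepalived-healthcheck-interval on both charms.
# 0 disables the healthcheck; 30 is the charm default.
set_healthcheck_interval() {
    local interval="$1" app
    # Assumed application names; adjust to match the deployed model.
    for app in neutron-gateway neutron-openvswitch; do
        run juju config "$app" "keepalived-healthcheck-interval=${interval}"
    done
}
```

Calling `set_healthcheck_interval 0` would temporarily disable the healthchecks, and `set_healthcheck_interval 30` would restore the default.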

Changed in charm-neutron-gateway:
status: Fix Committed → Fix Released
Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released