support disabling keepalived healthcheck

Bug #1890900 reported by Edward Hope-Morley
This bug affects 2 people
Affects                               Status        Importance  Assigned to         Milestone
OpenStack Neutron Gateway Charm       Fix Released  High        Edward Hope-Morley
OpenStack Neutron Open vSwitch Charm  Fix Released  High        Edward Hope-Morley

Bug Description

We have observed that neutron l3ha can get into a situation where continuously failing healthchecks can themselves cause other nodes to fail their healthchecks if keepalived doesn't perform IP address cleanup and ARP refresh fast enough. As a result, we need to be able to temporarily disable healthchecks.

Tags: sts
Edward Hope-Morley (hopem) wrote :

Another thing that would be good to adjust is the number of pings that are tried before a failure is declared. Currently the healthcheck looks like:

#!/bin/bash -eu
ip a | grep 192.168.100.22 || exit 0
ping -c 1 -w 1 10.5.150.1 1>/dev/null || exit 1

But the number of pings is not configurable:

https://github.com/openstack/neutron/blob/master/neutron/agent/linux/keepalived.py#L558

Changed in charm-neutron-gateway:
importance: Undecided → High
Changed in charm-neutron-openvswitch:
importance: Undecided → High
Changed in charm-neutron-gateway:
assignee: nobody → Edward Hope-Morley (hopem)
Changed in charm-neutron-openvswitch:
assignee: nobody → Edward Hope-Morley (hopem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-openvswitch (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745440

Changed in charm-neutron-openvswitch:
status: New → In Progress
Changed in charm-neutron-gateway:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-gateway (master)

Fix proposed to branch: master
Review: https://review.opendev.org/745441

Felipe Reyes (freyes)
tags: added: sts
Felipe Reyes (freyes) wrote :

> #!/bin/bash -eu
> ip a | grep 192.168.100.22 || exit 0
> ping -c 1 -w 1 10.5.150.1 1>/dev/null || exit 1

This would also require increasing the deadline (-w 1), since ping exits when either the requested number of pings has completed or the deadline is reached, whichever happens first.

Since the current command assumes that a single ping should take less than a second, I think we should be able to safely assume that 1 second per ping is the maximum to wait, hence we should use the same argument for the ping count and the deadline.
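
To make the suggestion above concrete, here is a minimal sketch of how the ping invocation could be built so that the count (-c) and the deadline (-w) always share one value. PING_COUNT and ping_cmd are hypothetical names used only for illustration; they are not existing neutron or charm options, and the function prints the command instead of running it.

```shell
#!/bin/bash
# Hypothetical knob: the generated healthcheck currently hard-codes one ping.
PING_COUNT="${PING_COUNT:-1}"

# Print a ping command whose deadline (-w) equals its count (-c),
# i.e. a budget of roughly one second per ping.
ping_cmd() {
    local count="$1" target="$2"
    echo "ping -c ${count} -w ${count} ${target}"
}

# Gateway address taken from the healthcheck quoted above.
ping_cmd "$PING_COUNT" 10.5.150.1
```

With PING_COUNT unset this prints the same invocation as the current healthcheck; with PING_COUNT=3 both flags become 3.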

Felipe Reyes (freyes) wrote : Re: [Bug 1890900] Re: support disabling keepalived healthcheck

I believe it would be better to expose the vrrp_script config options
"rise" and "fall".

From the manpage:

           # required number of successes for OK transition
           rise <INTEGER>

           # required number of failures for KO transition
           fall <INTEGER>

This would give less stable networks a better chance of holding the
VIP even when some pings have failed.

Currently the rise and fall config options are both set to 2, so two
consecutive failed pings trigger the transition.

https://opendev.org/openstack/neutron/src/branch/master/neutron/agent/linux/keepalived.py#L521
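
As an illustration of what exposing these options might look like, the sketch below renders a keepalived vrrp_script stanza with the rise and fall thresholds passed in as parameters. The function name render_vrrp_script, the check name and the script path are placeholders made up for this example; only the stanza keywords (script, interval, rise, fall) come from keepalived itself.

```shell
#!/bin/bash
# Render a vrrp_script stanza with tunable rise/fall thresholds.
# Arguments: check name, script path, interval (s), rise, fall.
render_vrrp_script() {
    local name="$1" script="$2" interval="$3" rise="$4" fall="$5"
    cat <<EOF
vrrp_script ${name} {
    script "${script}"
    interval ${interval}
    rise ${rise}
    fall ${fall}
}
EOF
}

# Placeholder values: require 5 consecutive failures before the KO
# transition, instead of the current 2.
render_vrrp_script ha_health_check /path/to/healthcheck.sh 30 2 5
```

A larger fall value trades slower failover for more tolerance of transient packet loss.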

Edward Hope-Morley (hopem) wrote :

fwiw I have raised a bug at https://bugs.launchpad.net/neutron/+bug/1892200 to discuss what changes we can apply in neutron to improve the test itself.

Changed in charm-neutron-gateway:
milestone: none → 20.10
Changed in charm-neutron-openvswitch:
milestone: none → 20.10
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-openvswitch (master)

Reviewed: https://review.opendev.org/745440
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-openvswitch/commit/?id=5d83c2c702d9bd720286f8da4329aa36417228a8
Submitter: Zuul
Branch: master

commit 5d83c2c702d9bd720286f8da4329aa36417228a8
Author: Edward Hope-Morley <email address hidden>
Date: Sat Aug 8 15:37:50 2020 +0100

    Add keepalived-healthcheck-interval config option

    Defaults to 30s (i.e. enabled) but also allows disabling
    healthchecks by setting to 0.

    Change-Id: I5bb7d362f0d957237e24f79f1f82583661bed470
    Closes-Bug: #1890900

Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
Changed in charm-neutron-gateway:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-gateway (master)

Reviewed: https://review.opendev.org/745441
Committed: https://git.openstack.org/cgit/openstack/charm-neutron-gateway/commit/?id=8d71c414811a78714e79bb0743975458aa5cd2a2
Submitter: Zuul
Branch: master

commit 8d71c414811a78714e79bb0743975458aa5cd2a2
Author: Edward Hope-Morley <email address hidden>
Date: Sat Aug 8 15:39:31 2020 +0100

    Add keepalived-healthcheck-interval config option

    Defaults to 30s (i.e. enabled) but also allows disabling
    healthchecks by setting to 0.

    Change-Id: I49603c22d8085aabd6085058e4d4eb9c74e84a20
    Closes-Bug: #1890900

Changed in charm-neutron-gateway:
status: Fix Committed → Fix Released
Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released
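
For reference, a minimal sketch of how the new keepalived-healthcheck-interval option from the commits above might be set with juju. The application names in APPS are assumptions and may differ per deployment; the helper healthcheck_cmds is hypothetical and only prints the commands so they can be reviewed before being run.

```shell
#!/bin/bash
# Assumed application names; adjust to match the deployed model.
APPS="neutron-gateway neutron-openvswitch"

# Print one "juju config" command per application for the given interval.
healthcheck_cmds() {
    local interval="$1" app
    for app in $APPS; do
        echo "juju config ${app} keepalived-healthcheck-interval=${interval}"
    done
}

# Per the commit message, 0 disables healthchecks; 30 is the default.
healthcheck_cmds 0
```

Calling healthcheck_cmds 30 prints the commands that would restore the default interval.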