Neutron VRRP healthchecks should not be enabled by default

Bug #1921010 reported by Billy Olsen
Affects | Status | Importance | Assigned to | Milestone
OpenStack Neutron Gateway Charm | Fix Released | High | Billy Olsen |
OpenStack Neutron Open vSwitch Charm | Fix Released | High | Billy Olsen |

Bug Description

The neutron-openvswitch and neutron-gateway charms configure neutron to run an extra VRRP health check, which gives keepalived more feedback about a node's connectivity status. The charms configure the VRRP check to run every 30 seconds by default.

In theory this is a good thing, as it provides more input into determining whether the router needs to be migrated to another node. In practice it can lead to problems.

The neutron vrrp check is implemented as a bash script and looks as follows:

#!/bin/bash -eu
ip a | grep <fip_address> || exit 0
ping -c 1 -w 1 <route.nexthop> 1>/dev/null || exit 1

Essentially, it exits quietly with status 0 if the FIP is not on the local node. If the FIP is on the local node, it sends a single ICMP echo request (ping) to the next hop with a timeout of 1 second, and exits with status 1 if no reply arrives in time. Relying on a single ICMP ping answered within one second is not reasonable in some environments, for a number of reasons:

1. ICMP pings are low priority and may be delayed in favor of other packets
2. If the next hop router does not have the FIP in its ARP cache, the address must first be resolved and the cache updated, which takes some time in larger networks
3. Some networks disable ICMP/ping for security reasons
4. It is not uncommon for the first ping to be dropped in certain networks
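
The exit-code behaviour described above can be sketched as a small bash function. This is an illustrative restatement, not the script neutron generates: the function name vrrp_check and its two positional arguments are assumptions standing in for the templated <fip_address> and <route.nexthop> values.

```shell
#!/bin/bash
# Hypothetical restatement of the health check as a function.
# $1 stands in for <fip_address>, $2 for <route.nexthop>.
vrrp_check() {
    local fip_address="$1" next_hop="$2"

    # The address is not configured on this node: nothing to verify,
    # so report success and let keepalived carry on.
    ip a | grep -q -- "$fip_address" || return 0

    # One echo request with a one-second deadline. A single lost or
    # late reply is reported as a failed check.
    ping -c 1 -w 1 "$next_hop" >/dev/null 2>&1 || return 1

    return 0
}
```

Calling vrrp_check with an address and a next hop and then inspecting $? shows which path was taken. Note that a delayed reply, a dropped first packet, and filtered ICMP all end in the same non-zero status, so the check cannot tell them apart from a real loss of connectivity.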

When this ping fails, it triggers a VRRP failover transition to another node, which is disruptive to network traffic. The check is not without value: it provides an additional sanity check for the case where a node loses external network connectivity while its internal network is still available.

Starting in the 19.07 charm release, the VRRP checks were enabled whenever dvr/snat was enabled and were hardcoded to run every 30 seconds. A subsequent change in the 20.10 release of the charms made the VRRP check interval configurable; however, the default was left at 30 seconds, the value the charms had previously hardcoded.

The problem here is that the charm makes this check opt-out rather than opt-in. Thus, users who deployed with a version of the charms prior to 19.07, use dvr+snat, and upgrade to the latest versions are exposed to an unintended behavior change. Users who do not hit the problem immediately are likely to hit it once their workload scales sufficiently.

Changed in charm-neutron-gateway:
status: New → Triaged
importance: Undecided → High
importance: High → Critical
Changed in charm-neutron-openvswitch:
status: New → Triaged
importance: Undecided → Critical
Changed in charm-neutron-gateway:
status: Triaged → In Progress
Changed in charm-neutron-openvswitch:
status: Triaged → In Progress
Changed in charm-neutron-gateway:
assignee: nobody → Billy Olsen (billy-olsen)
Changed in charm-neutron-openvswitch:
assignee: nobody → Billy Olsen (billy-olsen)
Edward Hope-Morley (hopem) wrote :

One small correction to the above description: it actually requires two failures to trigger a failover. After a single failure, however, two successive successes are required in order for a failover to be prevented. This is as per the static configuration from Neutron [1]. It is also worth noting that there is a plan upstream to make the keepalived config configurable so that, for example, the charms could define the healthcheck test to be whatever they deem appropriate [2][3].

[1] https://github.com/openstack/neutron/blob/3cbe340846cb00e542afbad238207186cc22a858/neutron/agent/linux/keepalived.py#L547
[2] https://bugs.launchpad.net/neutron/+bug/1892200
[3] https://review.opendev.org/c/openstack/neutron/+/759886/
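
The thresholds described in the comment above can be modelled with a short bash function. This is a simplified sketch of the behaviour as the comment states it, not a reproduction of keepalived's implementation; the function name would_fail_over and the "ok"/"fail" input tokens are assumptions made for illustration.

```shell
#!/bin/bash
# Simplified model: one argument per health check run, "ok" or "fail".
# Prints "failover" if the sequence would trigger one, else "stable".
would_fail_over() {
    local pending=0    # 1 after a failure that has not yet been cleared
    local successes=0  # consecutive successes since that failure
    local result

    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            # A second failure before the first one is cleared.
            if [ "$pending" -eq 1 ]; then
                echo "failover"
                return 0
            fi
            pending=1
            successes=0
        elif [ "$pending" -eq 1 ]; then
            # Two successes in a row clear the earlier failure.
            successes=$((successes + 1))
            if [ "$successes" -ge 2 ]; then
                pending=0
            fi
        fi
    done

    echo "stable"
}
```

Under this model, would_fail_over fail ok fail reports a failover because only one success separates the two failures, while would_fail_over fail ok ok fail stays stable because the first failure has been cleared before the second arrives.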

Changed in charm-neutron-gateway:
status: In Progress → Fix Committed
Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
Changed in charm-neutron-gateway:
milestone: none → 21.10
Changed in charm-neutron-openvswitch:
milestone: none → 21.10
Changed in charm-neutron-gateway:
importance: Critical → High
Changed in charm-neutron-openvswitch:
importance: Critical → High
Changed in charm-neutron-gateway:
status: Fix Committed → Fix Released
Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released