[SRU] L3 HA: multiple agents are active at the same time

Bug #1744062 reported by Corey Bryant on 2018-01-18
This bug affects 5 people
Affects                 Importance   Assigned to
Ubuntu Cloud Archive    High         Unassigned
  Mitaka                High         Unassigned
  Ocata                 High         Unassigned
  Pike                  High         Unassigned
  Queens                High         Unassigned
neutron                 Undecided    Unassigned
keepalived (Ubuntu)     High         Unassigned
  Xenial                High         Unassigned
  Bionic                High         Unassigned
neutron (Ubuntu)        Undecided    Unassigned
  Xenial                Undecided    Unassigned
  Bionic                Undecided    Unassigned

Bug Description

[Impact]

This is the same issue reported in https://bugs.launchpad.net/neutron/+bug/1731595. However, that bug is marked 'Fix Released', the issue is still occurring, and I can't change it back to 'New', so it seems best to just open a new bug.

It seems as if this bug surfaces due to load issues. While the fix provided by Venkata in https://bugs.launchpad.net/neutron/+bug/1731595 (https://review.openstack.org/#/c/522641/) should help clean things up at the time of l3 agent restart, issues seem to come back later down the line in some circumstances. xavpaice mentioned he saw multiple routers active at the same time when they had 464 routers configured on 3 neutron gateway hosts using L3HA, and each router was scheduled to all 3 hosts. However, jhebden mentions that things seem stable at the 400 L3HA router mark, and it's worth noting this is the same deployment that xavpaice was referring to.

keepalived has a patch upstream in 1.4.0 that provides a fix for removing left-over addresses if keepalived aborts. That patch will be cherry-picked to Ubuntu keepalived packages.

[Test Case]
The following SRU process will be followed:
https://wiki.ubuntu.com/OpenStackUpdates

In order to avoid regression of existing consumers, the OpenStack team will run their continuous integration test against the packages that are in -proposed. A successful run of all available tests will be required before the proposed packages can be let into -updates.

The OpenStack team will be in charge of attaching the output summary of the executed tests. The OpenStack team members will not mark ‘verification-done’ until this has happened.

[Regression Potential]
The regression potential is lowered as the fix is cherry-picked without change from upstream. In order to mitigate the regression potential, the results of the aforementioned tests are attached to this bug.

[Discussion]

description: updated
description: updated
no longer affects: neutron
summary: - L3 HA: multiple agents are active at the same time
+ -
Changed in neutron (Ubuntu):
status: New → Incomplete
summary: - -
+ L3 HA: multiple agents are active at the same time
description: updated
Changed in neutron (Ubuntu):
status: Incomplete → Triaged
importance: Undecided → High
Changed in neutron (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Changed in neutron (Ubuntu Artful):
importance: Undecided → High
status: New → Triaged
description: updated
Alvaro Uría (aluria) on 2018-01-18
tags: added: canonical-bootstack
John George (jog) wrote :

This bug falls under the Canonical Cloud Engineering service-level agreement (SLA) process, as a field critical bug.

Hua Zhang (zhhuabj) wrote :

I have some thoughts on this problem:

1. First of all, we need to figure out why, in theory, multiple ACTIVE master HA nodes can appear.

Assume the master dies (at this point its status in the DB is still ACTIVE); a slave is then elected as the new master. After the old master recovers, enable_keepalived() (L444) [4] is invoked to spawn a keepalived instance, so multiple ACTIVE master HA nodes occur. (Related patch - https://review.openstack.org/#/c/357458/)

So the key to solving this problem is to reset the status of all HA ports to DOWN somewhere in the code; the patch https://review.openstack.org/#/c/470905/ addresses this point. However, that patch sets status=DOWN in the code path 'fetch_and_sync_all_routers -> get_router_ids', which leads to a bigger problem when the load is high.

2. Why does setting status=DOWN in the code path 'fetch_and_sync_all_routers -> get_router_ids' lead to a bigger problem when the load is high?

If the l3-agent is not seen as alive by the heartbeat check, it is marked AGENT_REVIVED [1] and then triggered to do a full sync (self.fullsync=True) [2], so the code path 'periodic_sync_routers_task -> fetch_and_sync_all_routers' is called again and again [3].

All these operations increase the load on the l2-agent, l3-agent, DB, MQ, etc. Conversely, high load also makes the AGENT_REVIVED case more likely.

So it's a vicious circle; the patch https://review.openstack.org/#/c/522792/ addresses this point by moving the reset to the code path '__init__ -> get_service_plugin_list -> _update_ha_network_port_status' instead of 'periodic_sync_routers_task -> fetch_and_sync_all_routers'.
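
A minimal sketch of the idea behind that patch, in plain Python rather than neutron's actual code: force all L3 HA ports to DOWN once when the service plugin initialises, so agents must re-report their real VRRP state before a port can show ACTIVE again. The Port class and the in-memory list are hypothetical stand-ins for the real neutron models and DB session.

    from dataclasses import dataclass

    HA_PORT_OWNER = "network:router_ha_interface"  # device_owner of L3 HA ports

    @dataclass
    class Port:
        id: str
        device_owner: str
        status: str

    def reset_ha_port_status(ports):
        # Run once at plugin start-up, not on every periodic router sync.
        for port in ports:
            if port.device_owner == HA_PORT_OWNER and port.status == "ACTIVE":
                port.status = "DOWN"

    ports = [Port("p1", HA_PORT_OWNER, "ACTIVE"), Port("p2", "compute:nova", "ACTIVE")]
    reset_ha_port_status(ports)
    print([(p.id, p.status) for p in ports])   # p1 -> DOWN, p2 untouched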

3. As we have seen, a small heartbeat value can cause AGENT_REVIVED and thus increase the load, and high load can in turn cause other problems, such as the symptoms Xav mentioned before, which I paste below as well:

- We later found that openvswitch had run out of filehandles, see LP: #1737866
- Resolving that allowed ovs to create a ton more filehandles.

This is just one example; there may be other circumstances. All of these can mislead us into thinking the fix doesn't fix the problem.

High load can also cause other similar problems, for example:

a, it can cause the neutron-keepalived-state-change process to exit due to a TERM signal [5] (https://paste.ubuntu.com/26450042/). neutron-keepalived-state-change monitors VRRP VIP changes and reports the ha_router's state to neutron-server [6] (see the sketch after these examples), so without it the l3-agent is unable to update the status of HA ports, and we can end up with multiple ACTIVE, multiple STANDBY, or other inconsistent states.

b, it can cause the RPC messages sent from here [6] to not be handled properly.
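
A rough sketch of that monitoring idea (not the actual neutron-keepalived-state-change implementation): watch "ip -o monitor address" events for the VRRP VIP being added to or removed from the router's HA interface, and report master/backup transitions. The interface name and VIP below are made-up examples.

    import subprocess

    HA_INTERFACE = "ha-1234abcd"   # hypothetical HA port device name
    VIP = "169.254.0.1/24"         # hypothetical VRRP VIP for the router

    # "ip -o monitor address" prints one line per address add/delete event;
    # deletions are prefixed with "Deleted".
    proc = subprocess.Popen(["ip", "-o", "monitor", "address"],
                            stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        if HA_INTERFACE in line and VIP in line:
            state = "backup" if line.lstrip().startswith("Deleted") else "master"
            print("HA state transition -> %s: %s" % (state, line.strip()))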

So for this problem, my concrete recommendations are:

a, bump up heartbeat option (agent_down_time)

b, we need this patch: https://review.openstack.org/#/c/522641/

c, Ensure that other components (like MQ, DB etc) have no performance problems

[1] https://github.com/openstack/neutron/blob/stable/ocata/neutron/db/agents_db.py#L354
[2] https://github.com/openstack/neutron/blob/stable/ocata/neutron/agent/l3/agent.py#L736
[3] https://github.com/openstack/ne...


Ryan Beisner (1chb1n) wrote :

FYI: Unsubscribing field SLA based on re-triage with Kiko R.

Ryan Beisner (1chb1n) wrote :

I do think this issue is still of high importance to OpenStack's overall scale and resilience story.

James Troup (elmo) wrote :

FYI: Resubscribing field SLA. It was not raised as critical by Kiko R.; I raised it and it's still an active ongoing problem on a customer site. Please do not unsubscribe again without discussion with the correct people.

James Troup (elmo) wrote :

Downgrading to Field High - I think the Critical part is tracked in LP #1749425

James Page (james-page) wrote :

Just as another thing to consider - the deployment where this is happening also experienced bug 1749425, which resulted in packet loss; the networks between network/gateway units are also made via OVS, so if OVS was dropping packets due to the large number of missing tap devices, it's possible this was also impacting connectivity between keepalived instances for HA routers, resulting in active/active nastiness.

Chris Gregan (cgregan) wrote :

We need an assigned engineer to meet the requirements of the Field High SLA. Please assign one.

Ryan Beisner (1chb1n) wrote :

@cgregan I think this is really a situation where the desired scale/density is at odds with the fundamental design of neutron HA routers. It's not something to address in the charms or in the packaging. As such, I'd consider this a feature request, against upstream Neutron, and I don't have an assignee for that.

George (lmihaiescu) wrote :

We experience the same issue, although we have a smaller environment with only around 40 Neutron routers running HA (two agents per router) over three physical controllers running Ocata on Ubuntu 16.04.

Joris S'heeren (jsheeren) wrote :

Our environment is experiencing the same behavior.

Ocata on 16.04 - around 320 routers, some have 2 agents per router, others have 3 agents per router.

LIU Yulong (dragon889) wrote :

VRRP heartbeat loss may cause the multiple-active-router behaviour. The underlying connectivity is the key point here. Monitoring for such behaviour is also necessary.
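
One way to spot this from the API side is to count the agents reporting "active" for each HA router. A sketch using the legacy neutron CLI follows; column names and output format may differ between client versions, so treat this as an assumption to adapt rather than a definitive tool.

    import subprocess

    def agent_states(router_id):
        # "neutron l3-agent-list-hosting-router" shows the ha_state per hosting agent.
        out = subprocess.check_output(
            ["neutron", "l3-agent-list-hosting-router", router_id,
             "-f", "value", "-c", "host", "-c", "ha_state"], text=True)
        return [line.split() for line in out.splitlines() if line.strip()]

    router_ids = subprocess.check_output(
        ["neutron", "router-list", "-f", "value", "-c", "id"], text=True).split()

    for rid in router_ids:
        active = [host for host, state in agent_states(rid) if state == "active"]
        if len(active) > 1:
            print("router %s is active on multiple agents: %s" % (rid, active))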

Corey Bryant (corey.bryant) wrote :

As reported by Xav in https://bugs.launchpad.net/ubuntu/+bug/1731595:

"Comment for the folks that are noticing this as 'fix released' but still affected - see https://github.com/acassen/keepalived/commit/e90a633c34fbe6ebbb891aa98bf29ce579b8b45c for the rest of this fix, we need keepalived to be at least 1.4.0 in order to have this commit."

I just checked and the patch Xav referenced can be backported fairly cleanly to at least keepalived 1:1.2.19-1 (xenial/mitaka) and above.

no longer affects: keepalived (Ubuntu Artful)
Changed in keepalived (Ubuntu):
importance: Undecided → High
status: New → Triaged
Changed in keepalived (Ubuntu Xenial):
importance: Undecided → High
status: New → Triaged
Changed in keepalived (Ubuntu Bionic):
importance: Undecided → High
status: New → Triaged
no longer affects: cloud-archive/newton
no longer affects: neutron (Ubuntu Artful)
Corey Bryant (corey.bryant) wrote :

It appears the following commits are required to fix this for keepalived:

commit e90a633c34fbe6ebbb891aa98bf29ce579b8b45c
Author: Quentin Armitage <email address hidden>
Date: Fri Dec 15 21:14:24 2017 +0000

    Fix removing left-over addresses if keepalived aborts

    Issue #718 reported that if keepalived terminates abnormally when
    it has vrrp instances in master state, it doesn't remove the
    left-over VIPs and eVIPs when it restarts. This is despite
    commit f4c10426c saying that it resolved this problem.

    It turns out that commit f4c10426c did not resolve the problem for
    VIPs or eVIPs, although it did resolve the issue for iptables and ipset
    configuration.

    This commit now really resolves the problem, and residual VIPs and
    eVIPs are removed at startup.

    Signed-off-by: Quentin Armitage <email address hidden>

commit f4c10426ca0a7c3392422c22079f1b71e7d4ebe9
Author: Quentin Armitage <email address hidden>
Date: Sun Mar 6 09:53:27 2016 +0000

    Remove ip addresses left over from previous failure

    If keepalived terminates unexpectedly, for any instances for which
    it was master, it leaves ip addresses configured on the interfaces.
    When keepalived restarts, if it starts in backup mode, the addresses
    must be removed. In addition, any iptables/ipsets entries added for
    !accept_mode must also be removed, in order to avoid multiple entries
    being created in iptables.

    This commit removes any addresses and iptables/ipsets configuration
    for any interfaces that exist when keepalived starts up. If keepalived
    shut down cleanly, that will only be for non-vmac interfaces, but if
    it terminated unexpectedly, it can also be for any left-over vmacs.

    Signed-off-by: Quentin Armitage <email address hidden>

f4c10426ca0a7c3392422c22079f1b71e7d4ebe9 is already included in:
* keepalived 1:1.3.9-1build1 (bionic/queens, cosmic/rocky)
* keepalived 1:1.3.2-1build1 (artful/pike)
* keepalived 1:1.3.2-1 (zesty/ocata) [1]

[1] zesty is EOL - https://launchpad.net/ubuntu/+source/keepalived/1:1.3.2-1

f4c10426ca0a7c3392422c22079f1b71e7d4ebe9 is not included in:
* keepalived 1:1.2.19-1ubuntu0.2 (xenial/mitaka)

The backport of f4c10426ca0a7c3392422c22079f1b71e7d4ebe9 to xenial does not look trivial. I'd prefer to backport keepalived 1:1.3.2-* to the pike/ocata cloud archives.

summary: - L3 HA: multiple agents are active at the same time
+ [SRU] L3 HA: multiple agents are active at the same time
Corey Bryant (corey.bryant) wrote :

Moving back to New for neutron for the time being since we think this may be fixed in keepalived.

description: updated
Changed in neutron (Ubuntu):
status: Triaged → New
importance: High → Undecided
Changed in neutron (Ubuntu Xenial):
importance: High → Undecided
status: Triaged → New
Changed in neutron (Ubuntu Bionic):
importance: High → Undecided
status: Triaged → New
Corey Bryant (corey.bryant) wrote :

I've uploaded new versions of keepalived to cosmic, bionic (awaiting SRU team review), pike-staging, and ocata-staging. I need to confirm with other cloud archive admins that this is ok to backport to the pike/ocata cloud archives prior to promoting to pike-proposed/ocata-proposed. In the meantime, if you'd like to test the ocata fix (where this was initially reported) you can install from the staging PPA:

sudo add-apt-repository ppa:ubuntu-cloud-archive/ocata-staging
sudo apt update

And from pike:

sudo add-apt-repository ppa:ubuntu-cloud-archive/pike-staging
sudo apt update

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package keepalived - 1:1.3.9-1ubuntu1

---------------
keepalived (1:1.3.9-1ubuntu1) cosmic; urgency=medium

  * d/p/fix-removing-left-over-addresses-if-keepalived-abort.patch:
    Cherry-picked from upstream to ensure left-over VIPs and eVIPs are
    properly removed on restart if keepalived terminates abnormally. This
    fix is from the upstream 1.4.0 release (LP: #1744062).

 -- Corey Bryant <email address hidden> Tue, 03 Jul 2018 10:26:45 -0400

Changed in keepalived (Ubuntu):
status: Triaged → Fix Released

Hello Corey, or anyone else affected,

Accepted keepalived into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/keepalived/1:1.3.9-1ubuntu0.18.04.1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in keepalived (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
Xav Paice (xavpaice) wrote :

Subscribed field-high. This is affecting production environments.

tags: added: sts-sru-needed
Corey Bryant (corey.bryant) wrote :

Hello Corey, or anyone else affected,

Accepted keepalived into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed