All HA routers become active on the same agent

Bug #1365429 reported by Assaf Muller
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Assaf Muller
Milestone: 2014.2

Bug Description

How to reproduce:
On a setup with two L3 agents, create ten HA routers. The scheduler places them on both agents, but the same agent ends up hosting the active instance of all ten routers, which defeats the idea of sharing traffic load across all L3 agents.

Solutions:
This can be solved in one of two ways:
1) Enable preemptive elections for HA routers. Keepalived exposes a configuration option that enables VRRP preemptive elections. We can set a random VRRP priority for each router instance, and the election process will then distribute the active routers randomly across the available agents. Preemptive elections have a major downside: if an agent hosting a master instance drops, the backup router comes into play, but when the node is fixed the old master re-assumes its role. This second state transition is costly and redundant.
2) With non-preemptive elections, the first router instance to come up becomes the master. We can exploit this by having the server send the notification to the agents in a random order.
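Option 2 can be sketched in a few lines. This is a minimal illustration, not neutron's actual scheduler code: the function name, the `agents` list, and the `notify` callback are all hypothetical. The only idea it demonstrates is shuffling the recipient order per router, so that under non-preemptive VRRP the first instance up (and hence the master) lands on a random agent.

```python
import random


def notify_agents_in_random_order(agents, router_id, notify):
    """Sketch of solution 2: shuffle the agent list before sending the
    router notification. With non-preemptive VRRP elections, the first
    instance to come up becomes the master, so randomizing the order
    spreads the master instances across agents.

    `agents`, `router_id`, and `notify` are hypothetical placeholders
    for whatever the server-side scheduler actually uses.
    """
    shuffled = list(agents)      # don't mutate the caller's list
    random.shuffle(shuffled)
    for agent in shuffled:
        notify(agent, router_id)
    return shuffled              # shuffled[0] is the likely master
```

Over many routers, each agent should come first roughly half the time on a two-agent setup, which is exactly the load-sharing property the bug asks for.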

Tags: l3-ha
Assaf Muller (amuller)
tags: removed: vrrp
Revision history for this message
Kevin Benton (kevinbenton) wrote:

Why is option 1 costly and redundant? A VRRP takeover isn't that bad. The other advantage is that the load will be shared again across the agents, whereas without preemption everything stays stuck on one agent after the failure.

Assaf Muller (amuller) wrote:

A takeover is bad because it disrupts connectivity for a minimum of ~8 seconds. If you don't *have* to perform a failover, don't... You do bring up a good point, though. If the failed node doesn't take back a few routers when it comes back online, we'll also be in a less than ideal state when it comes to load sharing, which is what we're trying to solve. Definitely something to think about...

Assaf Muller (amuller)
Changed in neutron:
assignee: nobody → Assaf Muller (amuller)
Assaf Muller (amuller)
Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/121620

Changed in neutron:
importance: Undecided → High
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → juno-rc1
OpenStack Infra (hudson-openstack) wrote: Fix merged to neutron (master)

Reviewed: https://review.openstack.org/121620
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0bd4472ef7bdb9d94f988669f34f7eaa53ca0a89
Submitter: Jenkins
Branch: master

commit 0bd4472ef7bdb9d94f988669f34f7eaa53ca0a89
Author: Assaf Muller <email address hidden>
Date: Mon Sep 15 18:11:17 2014 +0300

    HA routers master state now distributed amongst agents

    We're currently running with no pre-emption, meaning that
    the first router in a cluster to go up will be the master,
    regardless of priority. Since the order in which we sent
    notifications was constant, the same agent hosted the
    master instances of all HA routers, defeating the idea
    of load sharing.

    Closes-Bug: #1365429
    Change-Id: Ia6fe2bd0317c241bf7eb55915df7650dfdc68210

Changed in neutron:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-rc1 → 2014.2