Neutron HA should allow min_l3_agents_per_router to equal one

Bug #1555042 reported by Dr. Jens Harbott
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Wishlist
Assigned to: Dr. Jens Harbott

Bug Description

As an operator, when I am running a setup with two network nodes, the idea of running L3 HA is that an outage of one of the network nodes should have minimum customer impact. With the current code, existing setups will indeed have little to no impact, but customers will not be able to create new routers during the outage.

If neutron allowed setting min_l3_agents_per_router=1, new routers would be created even when just one agent is available. That is certainly not optimal, but it at least fulfills the customer request. Once the second network node recovers, the second router instance will be added and redundancy thus restored.
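The effect of the minimum on router creation can be illustrated with a small model. This is only a sketch under stated assumptions: the function and exception names are hypothetical and are not taken from neutron's code; `min_agents` and `max_agents` merely stand in for min_l3_agents_per_router and max_l3_agents_per_router.

```python
# Illustrative model only; names are hypothetical, not neutron's actual code.
class NotEnoughAgents(Exception):
    """Raised when fewer live L3 agents exist than the configured minimum."""


def pick_agents_for_new_router(live_agents, min_agents, max_agents):
    """Return the agents that would host a new HA router.

    live_agents: names of the L3 agents that are currently alive
    min_agents:  stand-in for min_l3_agents_per_router
    max_agents:  stand-in for max_l3_agents_per_router
    """
    if len(live_agents) < min_agents:
        raise NotEnoughAgents(
            "need %d agents, only %d alive" % (min_agents, len(live_agents)))
    return list(live_agents[:max_agents])


# Two network nodes, one of them currently down.
live = ["network-node-1"]

# Minimum of 2: router creation is rejected during the outage.
try:
    pick_agents_for_new_router(live, min_agents=2, max_agents=2)
    created_with_min_2 = True
except NotEnoughAgents:
    created_with_min_2 = False

# Minimum of 1: the router is created with a single instance.
hosts_with_min_1 = pick_agents_for_new_router(live, min_agents=1, max_agents=2)
```

In this model the only difference between the two cases is whether the request is rejected; the surviving agent is the same in both.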

Changed in neutron:
assignee: nobody → Dr. Jens Rosenboom (j-rosenboom-j)
status: New → In Progress
tags: added: rfe
Changed in neutron:
importance: Undecided → Wishlist
tags: added: l3-ha
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

A patch for this is proposed in https://review.openstack.org/289925

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

I'd like Assaf's input, but I am personally against relaxing this constraint.

Changed in neutron:
status: In Progress → Won't Fix
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Let's fast-track a decision on this. The change is limited, even though the outcome may have its own repercussions.

tags: removed: rfe
Revision history for this message
Assaf Muller (amuller) wrote :

It's hard to say. I'm currently suggesting we remove the functionality for neutron to note a new / recently dead L3 agent and schedule another router replica to it (for the sake of simplicity and stability). If we remove that functionality, creating a router with only 1 replica makes no sense, as it'll never be able to 'grow' to 2 or 3 replicas.

Revision history for this message
Assaf Muller (amuller) wrote :

Super WIP / early design phase: https://review.openstack.org/#/c/285480/

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

I agree that maybe allowing min_l3_agents_per_router=1 is not necessary, but I do think that there should be some solution for my use case:

- Run L3-HA on two network nodes / L3 agents
- Allow customers to create new routers even while one of the nodes is temporarily out of service

Would it make sense to rephrase the bug description with that or should I file a new bug?

If I understood our IRC conversation correctly, even with your patch in place it will still be possible to run a script that performs the task of growing from 1 to 2, or maybe from 2 to 3, replicas, similar to what we use in the non-HA case to reschedule routers from failed L3 agents to others.

If that is correct, I think having my patch will still be sensible, as it will solve the goal of hiding outages from customers, which IMHO is the basic idea behind providing some "high availability" service.

Changed in neutron:
status: Won't Fix → In Progress
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

People asked for a design discussion, but this doesn't seem to be happening.

This patch is the only patch remaining for our Mitaka deployment that would require us to build our own packages, so once more I ask you from an operator perspective to give us an option to get the behaviour we need without having to perform local workarounds.

tags: added: ops
Revision history for this message
Jan Klare (j-klare) wrote :

Hey Carl,

as we discussed in the neutron feedback session in Austin last Thursday, this change is needed for us to deploy and run a small cluster with only two network nodes and still be able to serve full functionality, including the creation of routers and not just the functionality of the ones created before. It would be great if this could be merged. I will also try to attend the l3 meeting this week, so we can continue the discussion if necessary.

Cheers,
Jan

Revision history for this message
Carl Baldwin (carl-baldwin) wrote :

Here's my $0.02.

I think the min agents configuration is the number of agents required for a non-degraded router, not necessarily the number of agents required for router creation to succeed. I personally think that the API should *not* fail if this can't be met at the time of router creation.

Look at it this way. In one case, you can try to create a router milliseconds before an agent goes offline. As long as the creation completes, the API will return successfully. In another case, you can try to create a router milliseconds after an agent goes offline. In this case, the API will fail. This doesn't make sense to me.

IMO, successful return of the API is not a guarantee that the router is in a non-degraded state at the time of creation or any time after that. To me, the degraded nature of the router over time is orthogonal to the successful creation of the HA router. So, I think we should allow a fix to this bug that allows the creation of an HA router in the degraded state. All of the other routers are degraded, so why not allow creating another and let the operator worry about getting that other agent back online.
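The distinction drawn here, between creation succeeding and the router being non-degraded, can be sketched as two separate checks. This is a minimal illustration, not neutron's implementation; all names are hypothetical and `min_agents` only stands in for the min agents setting.

```python
# Illustrative model only; names are hypothetical, not neutron's actual code.
def create_ha_router(live_agents, max_agents):
    """Create a router on whatever agents are alive; fail only if none are."""
    if not live_agents:
        raise RuntimeError("no L3 agent available")
    return {"agents": list(live_agents[:max_agents])}


def is_degraded(router, live_agents, min_agents):
    """A router is degraded when fewer than min_agents of its hosts are alive."""
    alive_hosts = [a for a in router["agents"] if a in live_agents]
    return len(alive_hosts) < min_agents


# Router created just before node-2 goes offline.
router_before = create_ha_router(["node-1", "node-2"], max_agents=2)

# node-2 goes offline; another router is created just afterwards.
live = ["node-1"]
router_after = create_ha_router(live, max_agents=2)

# Both routers end up in the same degraded state, regardless of timing.
before_degraded = is_degraded(router_before, live, min_agents=2)
after_degraded = is_degraded(router_after, live, min_agents=2)
```

With the two checks separated, the timing of the request relative to the outage no longer decides whether the API call succeeds.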

Revision history for this message
Assaf Muller (amuller) wrote :

I'd be in favor of removing the option entirely and removing the enforcement of a minimum.

One thing to note is that https://review.openstack.org/#/c/285480/ simplifies the auto_schedule_routers code path and also removes the ability to schedule HA routers to new nodes as they pop in. This means that if an HA router is created with only one alive L3 agent, the router will be scheduled to only that agent, and when the second agent comes back up, the HA router will *not* be automatically scheduled to that node. That scenario currently *does* work as you would expect on master, and the patch changes that behavior. While we were enforcing a minimum of 2, I didn't think removing the auto-scale-up-for-HA-routers functionality was a big deal, but if we remove the minimum, it probably is, so let's make sure the patch doesn't merge in its current state.
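The scale-up behavior at stake can be sketched as follows. This is a simplified model under stated assumptions: the function name and data layout are hypothetical and do not reflect the actual auto-scheduling code in neutron.

```python
# Illustrative model only; names are hypothetical, not neutron's actual code.
def schedule_missing_replicas(routers, live_agents, max_agents):
    """Add replicas to HA routers hosted on fewer than max_agents agents."""
    for router in routers:
        for agent in live_agents:
            if len(router["agents"]) >= max_agents:
                break
            if agent not in router["agents"]:
                router["agents"].append(agent)
    return routers


# Router created while only node-1 was alive.
routers = [{"name": "r1", "agents": ["node-1"]}]

# node-2 comes back; with auto scale-up the router gains a second replica.
schedule_missing_replicas(routers, ["node-1", "node-2"], max_agents=2)
agents_after_recovery = routers[0]["agents"]
```

If this step is removed, a router created during the outage stays on a single agent until someone reschedules it manually.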

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/289925
Reason: This review is > 4 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/289925
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ab131ee0af32d9eea04f1403a9a134a070e80e02
Submitter: Jenkins
Branch: master

commit ab131ee0af32d9eea04f1403a9a134a070e80e02
Author: Jens Rosenboom <email address hidden>
Date: Tue Mar 8 14:31:34 2016 +0100

    Allow min_l3_agents_per_router to equal one

    As an operator, when I am running a setup with two network nodes, the
    idea of running L3 HA is that an outage of one of the network nodes
    should have minimum customer impact. With the current rules in place,
    existing setups will indeed have little to no impact, but customers will
    not be able to create new routers during the outage.

    With this change in place, we can set min_l3_agents_per_router=1, so the
    customers will be affected even less. New routers will be created with
    just one instance, which certainly is not optimal, but at least will
    fulfill the customer request. Once the second network node recovers, the
    second router instance will be added and thus redundancy restored.

    Also change the help text to specify the effect of setting
    min_l3_agents_per_router more clearly.

    Closes-Bug: 1555042
    Change-Id: I8a5fc74a96c784d474aefe2d9b27eeb66521ca82

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/337985

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/339755

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Sindhu Devale (<email address hidden>) on branch: master
Review: https://review.openstack.org/339755
Reason: Looks like we have two patches for this: https://review.openstack.org/#/c/338169/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/338169
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e82494f9cd07706f086b036a15cf7bf1e141074a
Submitter: Jenkins
Branch: master

commit e82494f9cd07706f086b036a15cf7bf1e141074a
Author: Jens Rosenboom <email address hidden>
Date: Wed Jul 6 11:49:19 2016 +0200

    Deprecate option min_l3_agents_per_router

    As was discussed in [1], we should not only allow setting
    min_l3_agents_per_router to one [2], but deprecate this option
    completely.

    [1] https://bugs.launchpad.net/bugs/1555042
    [2] https://review.openstack.org/289925

    Related-Bug: 1555042
    Closes-Bug: 1599275
    Change-Id: I518e12edd4bfb7a036b278d5f108cf0fc3de0353

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/337985
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=412ba9f1219085200b66cc1ed373ac3304ed3282
Submitter: Jenkins
Branch: stable/mitaka

commit 412ba9f1219085200b66cc1ed373ac3304ed3282
Author: Jens Rosenboom <email address hidden>
Date: Tue Mar 8 14:31:34 2016 +0100

    Allow min_l3_agents_per_router to equal one

    As an operator, when I am running a setup with two network nodes, the
    idea of running L3 HA is that an outage of one of the network nodes
    should have minimum customer impact. With the current rules in place,
    existing setups will indeed have little to no impact, but customers will
    not be able to create new routers during the outage.

    With this change in place, we can set min_l3_agents_per_router=1, so the
    customers will be affected even less. New routers will be created with
    just one instance, which certainly is not optimal, but at least will
    fulfill the customer request. Once the second network node recovers, the
    second router instance will be added and thus redundancy restored.

    Also change the help text to specify the effect of setting
    min_l3_agents_per_router more clearly.

    Conflicts:
     neutron/db/l3_hamode_db.py
     neutron/extensions/l3_ext_ha_mode.py

    Closes-Bug: 1555042
    Change-Id: I8a5fc74a96c784d474aefe2d9b27eeb66521ca82
    (cherry picked from commit ab131ee0af32d9eea04f1403a9a134a070e80e02)

tags: added: in-stable-mitaka
Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b2

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

Revision history for this message
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 8.2.0

This issue was fixed in the openstack/neutron 8.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/385604

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-dynamic-routing (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/420189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-dynamic-routing (master)

Reviewed: https://review.openstack.org/420189
Committed: https://git.openstack.org/cgit/openstack/neutron-dynamic-routing/commit/?id=3aa73ef3365b04b3bc39137063cf8ff427547845
Submitter: Jenkins
Branch: master

commit 3aa73ef3365b04b3bc39137063cf8ff427547845
Author: Ihar Hrachyshka <email address hidden>
Date: Sun Jan 8 16:53:41 2017 +0000

    Don't override min_l3_agents_per_router in tests

    The option is about to be removed which will break the unit tests suite
    for the repo.

    Change-Id: I908c3717b5db970080c7880a6d2455cb382d7104
    See: I3a9195ff6fd18fad9f85cec03a632e7e52d954e7
    Related-Bug: #1555042

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/385604
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=dd5aca38f90e1b387837c555e110d825062bfb5a
Submitter: Jenkins
Branch: master

commit dd5aca38f90e1b387837c555e110d825062bfb5a
Author: Assaf Muller <email address hidden>
Date: Wed Oct 12 15:32:45 2016 -0400

    Remove deprecated min_l3_agents_per_router

    The option was deprecated [1] for removal in Newton
    and is being removed in Ocata.

    [1] Deprecated in patch with Gerrit Change-Id of:
        I8a5fc74a96c784d474aefe2d9b27eeb66521ca82

    DocImpact remove all references to the option.

    Change-Id: I3a9195ff6fd18fad9f85cec03a632e7e52d954e7
    Closes-Bug: #1555042

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.0.0b3

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.
