Multiple l3 agents are scheduled to host one newly created router if multiple interfaces are added at the same time

Bug #1535557 reported by Lujin Luo
This bug affects 3 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: John Schwarz

Bug Description

I have three all-in-one controller nodes deployed by DevStack with the latest code. The Neutron servers on these controllers sit behind Pacemaker and HAProxy to provide active/active HA. A MariaDB Galera cluster is the database backend.

In neutron.conf, I have made the following change:
router_scheduler_driver = neutron.scheduler.l3_agent_scheduler.ChanceScheduler

When we add interfaces for multiple subnets to a newly created router at the same time, we might end up with more than one l3 agent hosting this router. This bug is not easy to reproduce; you may need to repeat the following steps several times.
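The report does not spell out the mechanism. One plausible reading, consistent with the fix that was eventually merged, is a check-then-insert race: each request checks whether the router is already scheduled, sees that it is not, picks an agent at random, and writes its own binding. The toy model below is only a sketch of that assumption and is not neutron's scheduler code; the agent names and the function schedule_concurrently are hypothetical.

```python
import random
import threading

# Hypothetical agent names, one per controller.
AGENTS = ["l3-agent-controller1", "l3-agent-controller2", "l3-agent-controller3"]


def schedule_concurrently(n_requests=2, choose=random.choice):
    """Toy model: n_requests schedule the same unscheduled router at once."""
    bindings = []  # stands in for a binding table with no unique constraint
    lock = threading.Lock()
    barrier = threading.Barrier(n_requests)

    def handle_request():
        already_scheduled = bool(bindings)  # check: is the router bound yet?
        barrier.wait()  # force the worst case: every check precedes every insert
        if not already_scheduled:
            with lock:
                # Random pick, in the spirit of a chance-based scheduler.
                bindings.append(choose(AGENTS))

    threads = [threading.Thread(target=handle_request) for _ in range(n_requests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bindings
```

In this model every request writes a binding, so the router ends up hosted by two different agents whenever the random picks differ.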

How to reproduce:

Prerequisite: make the following change in neutron.conf
[DEFAULT]
router_scheduler_driver = neutron.scheduler.l3_agent_scheduler.ChanceScheduler

Step 0: Confirm multiple l3 agents are running
$ neutron agent-list --agent_type='L3 agent'
My result is shown at http://paste.openstack.org/show/483963/

Step 1: Create two networks
$ neutron net-create net-l3agent-test-1
$ neutron net-create net-l3agent-test-2

Step 2: Add one subnet to each of the two networks
$ neutron subnet-create --name subnet-l3agent-test-1 net-l3agent-test-1 192.168.11.0/24
$ neutron subnet-create --name subnet-l3agent-test-2 net-l3agent-test-2 192.168.12.0/24

Step 3: Create a router
$ neutron router-create router-l3agent-test

Step 4: Immediately after creating the router, add the two subnets as router interfaces at the same time
On controller1:
$ neutron router-interface-add router-l3agent-test subnet-l3agent-test-1
On controller2:
$ neutron router-interface-add router-l3agent-test subnet-l3agent-test-2

Step 5: Check which l3 agent(s) is/are hosting the router
$ neutron l3-agent-list-hosting-router router-l3agent-test
My result is shown at http://paste.openstack.org/show/483962/

If you end up with only one l3 agent, proceed as follows:
Step 6: Clear interfaces on the router
$ neutron router-interface-delete router-l3agent-test subnet-l3agent-test-1
$ neutron router-interface-delete router-l3agent-test subnet-l3agent-test-2

Step 7: Delete the router
$ neutron router-delete router-l3agent-test

Then repeat Steps 3-5.
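Since several attempts may be needed, Steps 3-7 can be scripted. The following is a minimal sketch, not a tested reproducer. It assumes the networks and subnets from Steps 1-2 already exist, that admin credentials are loaded in the environment, and that both interface-add calls can be issued from one host through the load-balanced API endpoint (the steps above issue them from two controllers). It also assumes the neutron CLI accepts the generic output options -f value -c id on list commands. The helper names run_cli, attempt, cleanup and reproduce are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

ROUTER = "router-l3agent-test"
SUBNETS = ["subnet-l3agent-test-1", "subnet-l3agent-test-2"]


def run_cli(args):
    """Run one neutron CLI command and return its stdout as text."""
    result = subprocess.run(
        ["neutron", *args], check=True, capture_output=True, text=True
    )
    return result.stdout


def attempt(run=run_cli):
    """One pass through Steps 3-5; returns the IDs of the hosting l3 agents."""
    run(["router-create", ROUTER])
    # Step 4: issue both interface-add calls at (nearly) the same time.
    with ThreadPoolExecutor(max_workers=len(SUBNETS)) as pool:
        list(pool.map(lambda s: run(["router-interface-add", ROUTER, s]), SUBNETS))
    # Step 5: assumed output options; one agent ID per line.
    out = run(["l3-agent-list-hosting-router", "-f", "value", "-c", "id", ROUTER])
    return [line.strip() for line in out.splitlines() if line.strip()]


def cleanup(run=run_cli):
    """Steps 6-7: detach both subnets, then delete the router."""
    for subnet in SUBNETS:
        run(["router-interface-delete", ROUTER, subnet])
    run(["router-delete", ROUTER])


def reproduce(max_attempts=20, run=run_cli):
    """Repeat Steps 3-7 until more than one agent hosts the router."""
    for n in range(1, max_attempts + 1):
        agents = attempt(run)
        if len(agents) > 1:
            return n, agents  # leave the router in place for inspection
        cleanup(run)
    return None
```

Calling reproduce() returns the attempt number and the agent IDs on the first over-scheduled router, or None if every attempt ended with a single agent.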

Tags: l3-ipam-dhcp
Lujin Luo (luo-lujin)
Changed in neutron:
assignee: nobody → Lujin Luo (luo-lujin)
tags: added: l3-ipam-dhcp
Changed in neutron:
importance: Undecided → Medium
ZongKai LI (zongkai) wrote :

What kind of router is involved in this issue? Legacy? HA?

Lujin Luo (luo-lujin) wrote :

@Zongkai Li, legacy router

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/289190

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/364278

Changed in neutron:
assignee: Lujin Luo (luo-lujin) → John Schwarz (jschwarz)
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/364278
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b1ec8d523d4c45616dd71016f7e218b4b732c2ee
Submitter: Jenkins
Branch: master

commit b1ec8d523d4c45616dd71016f7e218b4b732c2ee
Author: John Schwarz <email address hidden>
Date: Fri Aug 19 15:17:21 2016 +0100

    Add binding_index to RouterL3AgentBinding

    The patch proposes adding a new binding_index to the
    RouterL3AgentBinding table, with an additional Unique Constraint that
    enforces a single <router_id, binding_index> per router. This goes a long
    way into fixing 2 issues:

    1. When scheduling a non-HA router, we only use binding_index=1. This
       means that only a single row containing that router_id can be
       committed into the database. This in fact prevents over-scheduling of
       non-HA routers. Note that for the HA router case, the binding_index
       is simply copied from the L3HARouterAgentPortBinding (since they are
       always created together they should always match).

    2. This sets the ground-work for a refactor of the l3 scheduler - by
       using this binding and db-based limitation, we can schedule a router
       to agents using the RouterL3AgentBinding, while postponing the
       creation of L3HARouterAgentPortBinding objects for the agents until
       they ask for it (using sync_routers). This will be a major
       improvement over today's "everything can create
       L3HARouterAgentPortBinding" way of things.

    Closes-Bug: #1535557
    Change-Id: I3447ea5bcb7c57365c6f50efe12a1671e86588b3
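
The effect of point 1 can be illustrated with a small in-memory SQLite sketch. This is not the neutron migration or model code: the table below is a simplified stand-in, its name and the bind helper are assumptions made for illustration, and only the unique constraint on (router_id, binding_index) is taken from the commit message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the binding table described in the commit message.
conn.execute(
    """
    CREATE TABLE routerl3agentbindings (
        router_id     TEXT    NOT NULL,
        l3_agent_id   TEXT    NOT NULL,
        binding_index INTEGER NOT NULL DEFAULT 1,
        PRIMARY KEY (router_id, l3_agent_id),
        UNIQUE (router_id, binding_index)
    )
    """
)


def bind(router_id, l3_agent_id, binding_index=1):
    """Try to bind a router to an agent; return False if the database refuses."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO routerl3agentbindings VALUES (?, ?, ?)",
                (router_id, l3_agent_id, binding_index),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# Non-HA router: both racing requests use binding_index=1, so only one row fits.
first = bind("router-l3agent-test", "agent-on-controller1")
second = bind("router-l3agent-test", "agent-on-controller2")
print(first, second)  # True False
```

An HA router would use a distinct binding_index per agent, so bind("router-ha", "agent-a", 1) and bind("router-ha", "agent-b", 2) would both succeed in this sketch.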

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Lujin Luo (<email address hidden>) on branch: master
Review: https://review.openstack.org/289190

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0rc1

This issue was fixed in the openstack/neutron 9.0.0.0rc1 release candidate.
