At scale router scheduling takes a long time with DVR routers with multiple compute nodes hosting thousands of VMs

Bug #1513678 reported by Swaminathan Vasudevan on 2015-11-06
This bug affects 2 people
Affects: neutron
Importance: High
Assigned to: Swaminathan Vasudevan

Bug Description

At scale, when we have hundreds of compute nodes and thousands of VMs in networks that are routed by a Distributed Virtual Router (DVR), we see a control plane performance issue.
It takes a long time for all the routers to be scheduled on the nodes.

_schedule_router calls _get_candidates, which in turn calls get_l3_agent_candidates. For DVR routers, all the active agents are passed to get_l3_agent_candidates, which iterates through them and, for each agent, checks whether any dvr_service ports exist in the routed subnet.

This per-agent check may be where most of the time goes.

We need to identify the bottleneck and reduce the time that scheduling takes.

Fix proposed to branch: master
Review: https://review.openstack.org/242286

Changed in neutron:
assignee: nobody → Swaminathan Vasudevan (swaminathan-vasudevan)
status: New → In Progress
Oleg Bondarev (obondarev) wrote :

I think this falls in scope of https://github.com/openstack/neutron-specs/blob/master/specs/mitaka/improve-dvr-l3-agent-binding.rst, not sure we should have a separate bug for this.

Kyle Mestery (mestery) wrote :

Agree with Oleg, do we want to collapse the two here?

Changed in neutron:
importance: Undecided → High

I am OK with collapsing the two and binding this bug to the blueprint.

Related fix proposed to branch: master
Review: https://review.openstack.org/250150

Reviewed: https://review.openstack.org/241843
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=062ad0a0a62ab49bd0c27e73fbe93d739f82f410
Submitter: Jenkins
Branch: master

commit 062ad0a0a62ab49bd0c27e73fbe93d739f82f410
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Nov 4 18:02:09 2015 -0800

    Change check_ports_exist_on_l3agent to pass the subnet_ids

    The get_subnet_ids_on_router is called for every
    available l3agent in check_ports_exist_on_l3agent.
    This introduces un-necessary call to the same
    function multiple times which is expensive since it
    calls get_ports internally.

    In large scale the time taken to schedule a VM
    on a given N-Node increases.

    By passing the subnet_ids to check_ports_exist_on_l3agent
    we would be only calling once get_subnet_ids_on_router in
    the get_l3_agent_candidates.

    This patch addresses the above problem by calling
    get_subnet_ids_on_router just once.

    Change-Id: I9d130f98e07bfe571eee32b827ff9af4010ff0fb
    Related-Bug: #1513678
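
The idea in this commit message, looking up the router's subnet IDs once and passing them into the per-agent check, can be illustrated with a minimal sketch. This is not the neutron code: the function names, the callables, and the counting stub below are all hypothetical and only show the shape of the change.

```python
from typing import Callable, Dict, List, Set


def candidates_per_agent_lookup(
    agents: List[str],
    router_id: str,
    get_subnet_ids: Callable[[str], Set[str]],
    ports_exist: Callable[[str, Set[str]], bool],
) -> List[str]:
    """Looks up the router's subnets again for every agent (the slow shape)."""
    selected = []
    for agent in agents:
        subnet_ids = get_subnet_ids(router_id)  # repeated, expensive call
        if ports_exist(agent, subnet_ids):
            selected.append(agent)
    return selected


def candidates_single_lookup(
    agents: List[str],
    router_id: str,
    get_subnet_ids: Callable[[str], Set[str]],
    ports_exist: Callable[[str, Set[str]], bool],
) -> List[str]:
    """Looks up the router's subnets once and reuses them for every agent."""
    subnet_ids = get_subnet_ids(router_id)  # one call, hoisted out of the loop
    return [agent for agent in agents if ports_exist(agent, subnet_ids)]


class CountingSubnetLookup:
    """Hypothetical stand-in for the expensive lookup; counts invocations."""

    def __init__(self, subnets_by_router: Dict[str, Set[str]]) -> None:
        self.subnets_by_router = subnets_by_router
        self.calls = 0

    def __call__(self, router_id: str) -> Set[str]:
        self.calls += 1
        return self.subnets_by_router.get(router_id, set())
```

Both functions return the same candidates; the difference is that the number of subnet lookups no longer grows with the number of agents.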

Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Oleg Bondarev (obondarev)
Changed in neutron:
assignee: Oleg Bondarev (obondarev) → Swaminathan Vasudevan (swaminathan-vasudevan)
Changed in neutron:
assignee: Swaminathan Vasudevan (swaminathan-vasudevan) → Oleg Bondarev (obondarev)
Changed in neutron:
assignee: Oleg Bondarev (obondarev) → Carl Baldwin (carl-baldwin)
Changed in neutron:
assignee: Carl Baldwin (carl-baldwin) → Swaminathan Vasudevan (swaminathan-vasudevan)

Reviewed: https://review.openstack.org/242286
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=411e6ff1570f9508424eb985201943e881084d7a
Submitter: Jenkins
Branch: master

commit 411e6ff1570f9508424eb985201943e881084d7a
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Nov 5 17:00:49 2015 -0800

    Tune _get_candidates for faster scheduling in dvr

    Right now we have seen some performance issues when
    dvr routers are scheduled on multiple compute nodes
    with thousands of VMs on the routed subnets.

    The _get_candidates call get_l3_agent_candidates with
    a complete list of agents irrespective of the routers
    already hosted on the agents or not.

    So this fix will reduce the amount of iterations that
    get_l3_agent_candidates need to process for all the
    agents and would increase the control plane performance.

    Closes-Bug: #1513678
    Change-Id: I8f781d4cbc996ce13441303c9296e4f6ec822b94
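
The report does not show how the patch narrows the agent list, so the following is only a sketch under an assumption: that agents already hosting the router are dropped before the per-agent candidate check runs. The function name and the dictionary layout of an agent are hypothetical.

```python
from typing import Dict, Iterable, List


def agents_not_hosting_router(
    active_agents: List[Dict[str, str]],
    hosting_agent_ids: Iterable[str],
) -> List[Dict[str, str]]:
    """Return only the active agents that do not already host the router.

    Assumption: each agent is a dict with an "id" key, and the IDs of the
    agents already hosting the router are known up front.
    """
    hosting = set(hosting_agent_ids)  # set membership keeps the filter O(n)
    return [agent for agent in active_agents if agent["id"] not in hosting]
```

With such a pre-filter, the expensive per-agent port check would only run for agents that could still become new hosts for the router.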

Changed in neutron:
status: In Progress → Fix Released
tags: added: liberty-backport-potential

Reviewed: https://review.openstack.org/258403
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=5bbff07beb5e22ba365c3073043e0c1f03811c99
Submitter: Jenkins
Branch: stable/liberty

commit 5bbff07beb5e22ba365c3073043e0c1f03811c99
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Nov 5 17:00:49 2015 -0800

    Tune _get_candidates for faster scheduling in dvr

    Right now we have seen some performance issues when
    dvr routers are scheduled on multiple compute nodes
    with thousands of VMs on the routed subnets.

    The _get_candidates call get_l3_agent_candidates with
    a complete list of agents irrespective of the routers
    already hosted on the agents or not.

    So this fix will reduce the amount of iterations that
    get_l3_agent_candidates need to process for all the
    agents and would increase the control plane performance.

    Closes-Bug: #1513678

    Conflicts:

     neutron/scheduler/l3_agent_scheduler.py

    Change-Id: I8f781d4cbc996ce13441303c9296e4f6ec822b94
    (cherry picked from commit 411e6ff1570f9508424eb985201943e881084d7a)

tags: added: in-stable-liberty

Reviewed: https://review.openstack.org/250075
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c2483b73c2ca6586d7b169511be50f85230fd0f7
Submitter: Jenkins
Branch: master

commit c2483b73c2ca6586d7b169511be50f85230fd0f7
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Nov 25 15:15:17 2015 -0800

    Remove check on dhcp enabled subnets while scheduling dvr

    In check_ports_exist_on_l3agent we have an optimization fix
    that checks for the subnets associated with the router and if
    the subnets have dhcp enabled we go ahead and create the
    router if it is a dvr_snat agent.

    This was introduced in liberty since we saw some race condition
    in the gate with single node failures.
    It may not be completely right, since the dhcp agents can
    run on non dvr_snat nodes as well.

    Based on recommendation from the reviews, and a recent upstream
    patch that sends notification on port create, we would want to
    remove this and monitor the situation.

    This would reduce the load on check_ports_exist_on_l3agent for
    non dvr_snat nodes.

    Depends-On: I40b8684f6ec9ddd31753f7bbbdb364d1c0ec838a
    Related-Bug: #1513678

    Change-Id: I0f50dc1101b2013caf03a64a4f48e2d03ea87b26

Reviewed: https://review.openstack.org/250150
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=226c999de3e342bf7ce667e21f4ab685b7fd5622
Submitter: Jenkins
Branch: master

commit 226c999de3e342bf7ce667e21f4ab685b7fd5622
Author: Oleg Bondarev <email address hidden>
Date: Wed Dec 2 14:52:30 2015 +0300

    DVR: optimize check_ports_exist_on_l3_agent()

    Currently the function gets all ports on the subnet and iterates
    through them to find dvr serviceable ports on a particular host.
    This patch makes it a single DB query to see if any port exists
    matching criterias.

    Partial-Bug: #1513678
    Change-Id: Ie17885497aacb8fda4a2c4a05f19d08991038557
    Co-Authored-By: Oleg Bondarev <email address hidden>
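
To illustrate the difference between fetching all ports and asking the database a single existence question, here is a minimal sketch using Python's standard sqlite3 module. The table layout, the dvr_serviceable column, and the function names are assumptions made for the example; they do not reflect the neutron schema or the actual query in the patch.

```python
import sqlite3
from typing import Sequence


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory ports table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ports ("
        "id TEXT PRIMARY KEY, subnet_id TEXT, host TEXT, "
        "dvr_serviceable INTEGER)"
    )
    conn.executemany(
        "INSERT INTO ports VALUES (?, ?, ?, ?)",
        [
            ("p1", "s1", "host-a", 1),
            ("p2", "s1", "host-b", 0),
            ("p3", "s2", "host-b", 1),
        ],
    )
    return conn


def port_exists_on_host(
    conn: sqlite3.Connection, subnet_ids: Sequence[str], host: str
) -> bool:
    """Ask the database whether any matching port exists, in one query."""
    if not subnet_ids:
        return False  # an empty IN () list is not valid SQL, and nothing can match
    placeholders = ",".join("?" for _ in subnet_ids)
    sql = (
        "SELECT EXISTS("
        "SELECT 1 FROM ports "
        f"WHERE host = ? AND subnet_id IN ({placeholders}) "
        "AND dvr_serviceable = 1)"
    )
    row = conn.execute(sql, [host, *subnet_ids]).fetchone()
    return bool(row[0])
```

The existence query lets the database stop at the first matching row, instead of returning every port on the subnet for the caller to iterate over.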

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/265999
Reason: I see some issue with this one, I will resubmit another patch.

Change abandoned by Swaminathan Vasudevan (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/266026
Reason: There are still some issues preventing me from pushing all the files upstream on this patch. I will try to sort them out and push a new one.

This issue was fixed in the openstack/neutron 8.0.0.0b2 development milestone.

Reviewed: https://review.openstack.org/268066
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=96d4ab3de5ff09ae3fd7cdf8060f60ecb7ad6979
Submitter: Jenkins
Branch: stable/liberty

commit 96d4ab3de5ff09ae3fd7cdf8060f60ecb7ad6979
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Nov 25 15:15:17 2015 -0800

    Remove check on dhcp enabled subnets while scheduling dvr

    In check_ports_exist_on_l3agent we have an optimization fix
    that checks for the subnets associated with the router and if
    the subnets have dhcp enabled we go ahead and create the
    router if it is a dvr_snat agent.

    This was introduced in liberty since we saw some race condition
    in the gate with single node failures.
    It may not be completely right, since the dhcp agents can
    run on non dvr_snat nodes as well.

    Based on recommendation from the reviews, and a recent upstream
    patch that sends notification on port create, we would want to
    remove this and monitor the situation.

    This would reduce the load on check_ports_exist_on_l3agent for
    non dvr_snat nodes.

    Related-Bug: #1513678

    Conflicts:

     neutron/tests/functional/services/l3_router/test_l3_dvr_router_plugin.py
     neutron/tests/unit/scheduler/test_l3_agent_scheduler.py

    Change-Id: I0f50dc1101b2013caf03a64a4f48e2d03ea87b26
    (cherry picked from commit c2483b73c2ca6586d7b169511be50f85230fd0f7)

This issue was fixed in the openstack/neutron 7.0.2 release.

Reviewed: https://review.openstack.org/266026
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9246cffc7ca8f6731c5448e63f779b926f769e78
Submitter: Jenkins
Branch: stable/liberty

commit 9246cffc7ca8f6731c5448e63f779b926f769e78
Author: Swaminathan Vasudevan <email address hidden>
Date: Wed Nov 4 18:02:09 2015 -0800

    Change check_ports_exist_on_l3agent to pass the subnet_ids

    The get_subnet_ids_on_router is called for every
    available l3agent in check_ports_exist_on_l3agent.
    This introduces un-necessary call to the same
    function multiple times which is expensive since it
    calls get_ports internally.

    In large scale the time taken to schedule a VM
    on a given N-Node increases.

    By passing the subnet_ids to check_ports_exist_on_l3agent
    we would be only calling once get_subnet_ids_on_router in
    the get_l3_agent_candidates.

    This patch addresses the above problem by calling
    get_subnet_ids_on_router just once.

    Related-Bug: #1513678

    Conflicts:

     neutron/tests/unit/db/test_agentschedulers_db.py

    Change-Id: I9d130f98e07bfe571eee32b827ff9af4010ff0fb
    (cherry picked from commit 062ad0a0a62ab49bd0c27e73fbe93d739f82f410)

Reviewed: https://review.openstack.org/274037
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0bd401c8638f22977977ffc2ce267063eb83b195
Submitter: Jenkins
Branch: stable/liberty

commit 0bd401c8638f22977977ffc2ce267063eb83b195
Author: Oleg Bondarev <email address hidden>
Date: Wed Dec 2 14:52:30 2015 +0300

    DVR: optimize check_ports_exist_on_l3_agent()

    Currently the function gets all ports on the subnet and iterates
    through them to find dvr serviceable ports on a particular host.
    This patch makes it a single DB query to see if any port exists
    matching criterias.

    Partial-Bug: #1513678

    Conflicts:

     neutron/common/utils.py
     neutron/db/l3_agentschedulers_db.py

    Change-Id: Ie17885497aacb8fda4a2c4a05f19d08991038557
    Co-Authored-By: Oleg Bondarev <email address hidden>
    (cherry picked from commit 226c999de3e342bf7ce667e21f4ab685b7fd5622)

tags: removed: liberty-backport-potential