L3 Agent's fullsync is raceful with creation of HA router

Bug #1550886 reported by John Schwarz
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: John Schwarz

Bug Description

When creating an HA router, after the server creates all the DB objects (including the HA network and ports if it's the first one), the server goes on to schedule the router to (some of) the available agents.

The race is triggered when an L3 agent issues a sync_router request, which later down the line ends up in an auto_schedule_routers() call. If this happens before the above scheduling (done by create_router()) is complete, the server will refuse to schedule the router to the other intended L3 agents, resulting in fewer agents being scheduled.

The only way to fix this is either restarting one of the L3 agents which didn't get scheduled, or recreating the router. Either is a bad option.

An example of the state:
$ neutron l3-agent-list-hosting-router router2
+--------------------------------------+----------+----------------+-------+----------+
| id                                   | host     | admin_state_up | alive | ha_state |
+--------------------------------------+----------+----------------+-------+----------+
| d05da32b-34e7-4c7f-b0dd-938328a0c0ed | vpn-6-12 | True           | :-)   | active   |
+--------------------------------------+----------+----------------+-------+----------+
(Only 1 of the agents got scheduled with the router, even though there are 3 suitable agents that normally get scheduled without the race.)
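
The interleaving described above can be modelled with a small, self-contained toy. This is only a sketch of the ordering problem, not Neutron code: `ToyServer`, `MAX_AGENTS`, and the host names `vpn-6-10` and `vpn-6-11` are made up for illustration, and the "skip a router that already has a binding" rule is an assumption standing in for the refusal the report describes.

```python
"""Toy model of the scheduling race; not Neutron code."""

MAX_AGENTS = 3  # hypothetical: how many agents an HA router should land on


class ToyServer:
    def __init__(self, agents):
        self.agents = list(agents)
        self.bindings = {}  # router_id -> list of agent hosts

    def create_db_objects(self, router_id):
        # Step 1 of create_router(): the router exists but has no bindings yet.
        self.bindings[router_id] = []

    def auto_schedule_routers(self, host):
        # Triggered by an agent's full sync: bind every unscheduled router
        # to the agent that asked.
        for bound in self.bindings.values():
            if not bound:
                bound.append(host)

    def schedule_router(self, router_id):
        # Step 2 of create_router(). Assumption: a router that already has
        # a binding is treated as scheduled and left alone.
        bound = self.bindings[router_id]
        if bound:
            return
        bound.extend(self.agents[:MAX_AGENTS])


def create_router(server, router_id, sync_from=None):
    server.create_db_objects(router_id)
    if sync_from is not None:
        # The agent's sync lands in the window between the two steps.
        server.auto_schedule_routers(sync_from)
    server.schedule_router(router_id)
    return server.bindings[router_id]


agents = ["vpn-6-10", "vpn-6-11", "vpn-6-12"]
print(create_router(ToyServer(agents), "router1"))
# ['vpn-6-10', 'vpn-6-11', 'vpn-6-12']
print(create_router(ToyServer(agents), "router2", sync_from="vpn-6-12"))
# ['vpn-6-12']
```

Without the interleaved sync, all three agents are bound; with it, only the syncing agent is, which matches the single-row listing shown for router2.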

John Schwarz (jschwarz)
Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
Changed in neutron:
status: New → In Progress
John Schwarz (jschwarz)
description: updated
Changed in neutron:
importance: Undecided → Medium
LIU Yulong (dragon889) wrote :

Maybe the following trace is related to this bug:
http://paste.openstack.org/show/488732/

Ann Taraday (akamyshnikova) wrote :
Assaf Muller (amuller) wrote :
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/284400
Reason: This seems like a complicated patch and it looks like it's only going to get more complicated. I'm abandoning this in favour of https://review.openstack.org/#/c/285480/ which simplifies the code greatly and also solves the races at the same time.

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/257059
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9c3c19f07ce52e139d431aec54341c38a183f0b7
Submitter: Jenkins
Branch: master

commit 9c3c19f07ce52e139d431aec54341c38a183f0b7
Author: Kevin Benton <email address hidden>
Date: Thu Feb 18 03:48:29 2016 -0800

    Add ALLOCATING state to routers

    This patch adds a new ALLOCATING status to routers
    to indicate that the routers are still being built on the
    Neutron server. Any routers in this state are excluded in
    router retrievals by the L3 agent since they are not yet
    ready to be wired up.

    This is necessary when a router is made up of several
    distinct Neutron resources that cannot all be put
    into a single transaction. This patch applies this new
    state to HA routers while their internal HA ports and
    networks are being created/deleted so the L3 HA agent
    will never retrieve a partially formed HA router. It's
    important to note that the ALLOCATING status carries over
    until after the scheduling is done, which ensures that
    routers that weren't fully scheduled will not be sent to
    the agents.

    An HA router is placed in this state only when it is being
    created or converted to/from the HA state since this is
    disruptive to the dataplane.

    This patch also reverts the changes introduced in
    Iadb5a69d4cbc2515fb112867c525676cadea002b since they will
    be handled by the ALLOCATING logic instead.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>
    Co-Authored-By: John Schwarz <email address hidden>

    APIImpact
    Closes-Bug: #1550886
    Related-bug: #1499647
    Change-Id: I22ff5a5a74527366da8f82982232d4e70e455570
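
The idea in the commit message above can be illustrated with a toy sketch: a router stays in ALLOCATING from the moment it is created until scheduling has finished, and anything an agent retrieves is filtered on that status. This is not the code from the patch. `ToyPlugin` and its method names are hypothetical, the agent names are placeholders, and the use of `ACTIVE` as the status that follows ALLOCATING is an assumption.

```python
"""Toy model of hiding ALLOCATING routers from agents; not Neutron code."""

ALLOCATING = "ALLOCATING"
ACTIVE = "ACTIVE"  # assumption: the status a finished router ends up in


class ToyPlugin:
    def __init__(self):
        self.routers = {}  # router_id -> {"status": ..., "agents": [...]}

    def begin_create(self, router_id):
        # The router row exists from here on, but is marked as unfinished.
        self.routers[router_id] = {"status": ALLOCATING, "agents": []}

    def schedule(self, router_id, agents):
        # Scheduling happens while the router is still ALLOCATING.
        self.routers[router_id]["agents"] = list(agents)

    def finish_create(self, router_id):
        # Only now does the router become visible to agents.
        self.routers[router_id]["status"] = ACTIVE

    def routers_for_agent_sync(self):
        # What an agent's sync would see: ALLOCATING routers are excluded.
        return sorted(
            router_id
            for router_id, router in self.routers.items()
            if router["status"] != ALLOCATING
        )


plugin = ToyPlugin()
plugin.begin_create("router2")
print(plugin.routers_for_agent_sync())  # []
plugin.schedule("router2", ["agent-a", "agent-b", "agent-c"])
print(plugin.routers_for_agent_sync())  # []
plugin.finish_create("router2")
print(plugin.routers_for_agent_sync())  # ['router2']
```

Because the status only changes after scheduling, a sync that arrives mid-creation sees nothing to act on, so it cannot interfere with the pending schedule.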

Changed in neutron:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/305622

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/305774

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by John Schwarz (<email address hidden>) on branch: master
Review: https://review.openstack.org/284400
Reason: Fixed by the ALLOCATING patch

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/305622
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=36305c0c4f4ebf498020f5956e103832da75f8a9
Submitter: Jenkins
Branch: stable/mitaka

commit 36305c0c4f4ebf498020f5956e103832da75f8a9
Author: Kevin Benton <email address hidden>
Date: Thu Feb 18 03:48:29 2016 -0800

    Add ALLOCATING state to routers

    This patch adds a new ALLOCATING status to routers
    to indicate that the routers are still being built on the
    Neutron server. Any routers in this state are excluded in
    router retrievals by the L3 agent since they are not yet
    ready to be wired up.

    This is necessary when a router is made up of several
    distinct Neutron resources that cannot all be put
    into a single transaction. This patch applies this new
    state to HA routers while their internal HA ports and
    networks are being created/deleted so the L3 HA agent
    will never retrieve a partially formed HA router. It's
    important to note that the ALLOCATING status carries over
    until after the scheduling is done, which ensures that
    routers that weren't fully scheduled will not be sent to
    the agents.

    An HA router is placed in this state only when it is being
    created or converted to/from the HA state since this is
    disruptive to the dataplane.

    This patch also reverts the changes introduced in
    Iadb5a69d4cbc2515fb112867c525676cadea002b since they will
    be handled by the ALLOCATING logic instead.

    Co-Authored-By: Ann Kamyshnikova <email address hidden>
    Co-Authored-By: John Schwarz <email address hidden>

    APIImpact
    Closes-Bug: #1550886
    Related-bug: #1499647
    Change-Id: I22ff5a5a74527366da8f82982232d4e70e455570
    (cherry picked from commit 9c3c19f07ce52e139d431aec54341c38a183f0b7)

tags: added: in-stable-mitaka
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 8.1.0

This issue was fixed in the openstack/neutron 8.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/314250

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/314250
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3bf73801df169de40d365e6240e045266392ca63
Submitter: Jenkins
Branch: master

commit a323769143001d67fd1b3b4ba294e59accd09e0e
Author: Ryan Moats <email address hidden>
Date: Tue Oct 20 15:51:37 2015 +0000

    Revert "Improve performance of ensure_namespace"

    This reverts commit 81823e86328e62850a89aef9f0b609bfc0a6dacd.

    Unneeded optimization: this commit only improves execution
    time on the order of milliseconds, which is less than 1% of
    the total router update execution time at the network node.

    This also

    Closes-bug: #1574881

    Change-Id: Icbcdf4725ba7d2e743bb6761c9799ae436bd953b

commit 7fcf0253246832300f13b0aa4cea397215700572
Author: OpenStack Proposal Bot <email address hidden>
Date: Thu Apr 21 07:05:16 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I9e930750dde85a9beb0b6f85eeea8a0962d3e020

commit 643b4431606421b09d05eb0ccde130adbf88df64
Author: OpenStack Proposal Bot <email address hidden>
Date: Tue Apr 19 06:52:48 2016 +0000

    Imported Translations from Zanata

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: I52d7460b3265b5460b9089e1cc58624640dc7230

commit 1ffea42ccdc14b7a6162c1895bd8f2aae48d5dae
Author: OpenStack Proposal Bot <email address hidden>
Date: Mon Apr 18 15:03:30 2016 +0000

    Updated from global requirements

    Change-Id: Icb27945b3f222af1d9ab2b62bf2169d82b6ae26c

commit b970ed5bdac60c0fa227f2fddaa9b842ba4f51a7
Author: Kevin Benton <email address hidden>
Date: Fri Apr 8 17:52:14 2016 -0700

    Clear DVR MAC on last agent deletion from host

    Once all agents are deleted from a host, the DVR MAC generated
    for that host should be deleted as well to prevent a buildup of
    pointless flows generated in the OVS agent for hosts that don't
    exist.

    Closes-Bug: #1568206
    Change-Id: I51e736aa0431980a595ecf810f148ca62d990d20
    (cherry picked from commit 92527c2de2afaf4862fddc101143e4d02858924d)

commit eee9e58ed258a48c69effef121f55fdaa5b68bd6
Author: Mike Bayer <email address hidden>
Date: Tue Feb 9 13:10:57 2016 -0500

    Add an option for WSGI pool size

    Neutron currently hardcodes the number of
    greenlets used to process requests in a process to 1000.
    As detailed in
    http://lists.openstack.org/pipermail/openstack-dev/2015-December/082717.html

    this can cause requests to wait within one process
    for available database connection while other processes
    remain available.

    By adding a wsgi_default_pool_size option functionally
    identical to that of Nova, we can lower the number of
    greenlets per process to be more in line with a typical
    max database connection pool size.

    DocImpact: a previously unused configuration value
               wsgi_default_pool_size is now used to a...

tags: added: neutron-proactive-backport-potential
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

John Schwarz (jschwarz) wrote :

Regarding the discussion on whether to include this in stable/liberty: the race occurs when creating and deleting HA routers, and is especially apparent when the router is the first one a tenant has created. In this case, the resulting race can produce a wide array of effects, such as not creating all the resources and failing to schedule the router to the minimum required number of HA agents (even though more are available).

This can happen very easily when running rally's create_and_delete_routers sample task and has been reported to happen on a few large deployments.

This rather complex patch makes sure an agent is not made aware of routers that are currently ALLOCATING. As a safeguard, during sensitive sections of the code (specifically, when scheduling a router), the router's status is set to ALLOCATING.

John Schwarz (jschwarz) wrote :

The workaround for the router not being scheduled to the minimum required number of HA agents is, btw, manually re-creating the agent or manually scheduling the router to more agents.
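
As a rough illustration of the manual scheduling mentioned above, the helper below builds one command line per agent that is not yet hosting the router. It is a sketch only: `missing_agent_commands` is a hypothetical helper, the two extra agent ids are placeholders, and it assumes the legacy `neutron` CLI's `l3-agent-router-add` subcommand taking the agent id followed by the router. The commands are only printed here, not executed.

```python
def missing_agent_commands(router, hosting_agent_ids, candidate_agent_ids):
    """Return argv lists that would bind `router` to each missing agent."""
    hosting = set(hosting_agent_ids)
    return [
        ["neutron", "l3-agent-router-add", agent_id, router]
        for agent_id in candidate_agent_ids
        if agent_id not in hosting
    ]


# The first id is the one agent from the listing in the report; the other
# two are placeholders for the agents that were skipped.
hosting = ["d05da32b-34e7-4c7f-b0dd-938328a0c0ed"]
candidates = hosting + ["<second-agent-id>", "<third-agent-id>"]

for argv in missing_agent_commands("router2", hosting, candidates):
    print(" ".join(argv))
```

The list of agents currently hosting the router can be taken from the `neutron l3-agent-list-hosting-router` output shown in the bug description.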

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/liberty)

Change abandoned by John Schwarz (<email address hidden>) on branch: stable/liberty
Review: https://review.openstack.org/305774

tags: removed: neutron-proactive-backport-potential