L3 Agent cannot process RPC messages until _sync_routers_task is finished

Bug #1289066 reported by Carl Baldwin
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: Medium
Assigned to: Carl Baldwin
Milestone: 2014.2

Bug Description

When the L3 agent starts or restarts, it almost immediately goes into a _sync_routers_task run. This task is synchronized with _rpc_loop so that only one of them can run at a time.

The problem with this is that -- at least at scale -- the _sync_routers_task can take a VERY LONG time to run. I've observed it take 1-2 hours! This is WAY too long to wait before I can do something with my router, like adding a floating IP.

The thing is, _sync_routers_task is important to do periodically but it is mostly just checking that things are still in the right state. It should never take precedence over responding to RPC messages. The RPC messages represent work that the system has just been asked to perform. It is silly to make it wait a long time for a maintenance task to complete.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/78819

Changed in neutron:
assignee: nobody → Carl Baldwin (carl-baldwin)
status: New → In Progress
Kyle Mestery (mestery)
Changed in neutron:
milestone: none → juno-2
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/78819
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fe2ca9a75878a445a54ecfe4a97c79b696abf503
Submitter: Jenkins
Branch: master

commit fe2ca9a75878a445a54ecfe4a97c79b696abf503
Author: Carl Baldwin <email address hidden>
Date: Thu Mar 6 23:57:11 2014 +0000

    L3 agent prefers RPC messages over full sync

    When the L3 agent starts up and runs the sync task it doesn't process
    any incoming RPC events until the sync task is complete.

    This change combines the work from _rpc_loop and _sync_routers_task in
    to a single loop called _process_routers_loop. This loop spawns
    threads that pull from a priority queue. The queue ensures that RPC
    messages are handled before _sync_routers_task. The latter is
    generally maintenance tasks triggered by the agent rather than user
    triggered tasks.

    Synchronization between RPC and sync routers loops is no longer
    necessary since they both feed in to a single queue. There were
    places where it was necessary to reorder some things to allow for the
    lack of synchronization. For example, it is necessary to list
    namespaces before fetching the full list of routers to ensure that it
    doesn't delete a new namespace that gets created after listing
    namespaces. The lack of the need for synchronization between loops is
    probably the main strength of this patch.

    With multiple worker threads, need to handle the case where an RPC
    message came in while a thread was working on a router. Another
    thread should not handle the same router that is already in progress.
    Adds a mechanism to signal to the working thread that an update came
    in for the router it is working on. The original thread will repeat
    processing the router when it is finished to get the update.
    Multiple rapid updates to the same router will be consolidated.
    Essentially, there is still synchronization of work for a given router
    but not between routers. Much better than before.

    blueprint l3-agent-responsiveness
    Closes-Bug: #1289066
    Change-Id: I39afe86c66f864d71adf865d7bd1c9db35511505
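
The two mechanisms described in the commit message can be illustrated roughly as follows. This is a minimal sketch, not the code from the Neutron patch: the names RouterUpdateQueue, ExclusiveRouterTracker, PRIORITY_RPC and PRIORITY_SYNC are hypothetical, and the real agent uses its own worker threads and data structures. The first class shows a priority queue in which RPC-triggered updates are served before full-sync work. The second shows one possible way to flag a router that is already being processed so that its worker repeats the work once, no matter how many updates arrived in the meantime.

```python
import itertools
import queue
import threading

# Lower value is served first. Illustrative constants, not Neutron's.
PRIORITY_RPC = 0
PRIORITY_SYNC = 1


class RouterUpdateQueue:
    """Priority queue in which RPC-triggered updates outrank sync work."""

    def __init__(self):
        self._queue = queue.PriorityQueue()
        # Monotonic counter keeps FIFO order among equal priorities.
        self._counter = itertools.count()

    def add(self, priority, router_id):
        self._queue.put((priority, next(self._counter), router_id))

    def get(self):
        """Return the next router id, or None if the queue is empty."""
        try:
            _priority, _seq, router_id = self._queue.get_nowait()
        except queue.Empty:
            return None
        return router_id


class ExclusiveRouterTracker:
    """Ensure at most one worker handles a given router at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        # router_id -> True if an update arrived while it was in progress
        self._in_progress = {}

    def try_start(self, router_id):
        """Claim the router; if already claimed, flag it and return False."""
        with self._lock:
            if router_id in self._in_progress:
                # Signal the owning worker; repeated signals collapse to one.
                self._in_progress[router_id] = True
                return False
            self._in_progress[router_id] = False
            return True

    def finish(self, router_id):
        """Return True if the owner must process the router again."""
        with self._lock:
            if self._in_progress[router_id]:
                self._in_progress[router_id] = False
                return True
            del self._in_progress[router_id]
            return False
```

In this sketch a worker would call get(), then try_start() on the returned router, do its work, and keep looping while finish() returns True. Updates for different routers never wait on each other; only work for the same router is serialized.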

Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: juno-2 → 2014.2