Neutron operators using the L3 agent might need to tune SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE

Bug #1692971 reported by Cristian Calin
This bug affects 2 people
Affects: neutron
Status: In Progress
Importance: Undecided
Assigned to: Cristian Calin
Milestone: (none)

Bug Description

Summary
=======
OpenStack operators deploying the L3 agent might need to tune the SYNC_ROUTERS_MIN/MAX_CHUNK_SIZE parameters to avoid flooding the neutron-server.

High level description
======================
The neutron L3 agent and its derivatives (such as neutron-vpn-agent) perform a full sync when they start: they fetch the list of routers scheduled to the agent from the neutron-server, then issue a sync_routers RPC call for the delta between what they already have online and what still needs to be synchronised.
The call time depends linearly on the number of routers scheduled to the agent and can result in an RPC timeout if the server is overloaded (for example after a complete datacenter outage or during a multi-step upgrade). The L3 agent does chunk the call down when it catches oslo_messaging.MessagingTimeout, but by the time it eventually scales down the chunk size the server may already be swamped with calls and can take considerable time to start onlining routers.
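The back-off behaviour described above can be sketched roughly as follows. This is a minimal illustration, not neutron's actual implementation: the `fetch_routers` helper is made up for the example, and the 256/32 defaults are the hardcoded values as I understand them.

```python
# Illustrative sketch of the L3 agent's chunked full-sync with
# back-off on timeout; names and defaults are assumptions, not
# neutron's actual API.
def fetch_routers(router_ids, rpc_call, max_chunk=256, min_chunk=32):
    """Fetch routers in chunks, halving the chunk size on timeout."""
    chunk = max_chunk
    fetched = []
    i = 0
    while i < len(router_ids):
        batch = router_ids[i:i + chunk]
        try:
            fetched.extend(rpc_call(batch))
            i += len(batch)
        except TimeoutError:
            if chunk <= min_chunk:
                raise  # server cannot keep up even at the minimum size
            # retry the same slice with a smaller chunk
            chunk = max(chunk // 2, min_chunk)
    return fetched
```

Note that every failed oversized attempt still lands on the server before the agent backs off, which is why a large starting chunk can swamp an already overloaded neutron-server.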

Pre-conditions
==============
We faced this issue in a production environment and managed to reproduce an approximate behaviour in a pre-production environment.

Details of the test environment:
* 4 instances of the neutron-server
 - 8 RPC workers
 - 8 API workers
 - 700 networks with 1 subnetwork each
 - 100 tenants
 - 9 external networks
 - 1 shared network with instances attached to it
* 6 neutron vpn agents (also tested with neutron-l3-agent)
 - L3 HA configured
 - no l2-population configured
 - 240 routers scheduled per agent
 - rpc_timeout = 600
* 3 nova-compute nodes
 - running 600 instances
 - 100 instances with 2 network interfaces
 - 50 instances attached to the shared network

Observations:
* sync_routers RPC call takes 7-10 minutes to get processed
* in production we observe messaging timeout and chunk scaling after 40 minutes
* in this environment we do not see an RPC timeout, but the sync_routers call still exceeds the rpc_timeout of 60 and causes neutron-server to consume 100% CPU for almost 40 minutes before the agent eventually scales down the chunk size and manages to fully online all the routers

Modifications:
We modified neutron/agent/l3/agent.py on the L3 agent nodes and set:
SYNC_ROUTERS_MAX_CHUNK_SIZE = 32
SYNC_ROUTERS_MIN_CHUNK_SIZE = 8
... this resulted in the neutron-l3-agent starting to create qrouter-* namespaces within 10 seconds of a clean restart.
A clean restart for this test means killing all keepalived and neutron agent processes, deleting the OVS ports, and deleting all namespaces on the node. This effectively ensures a full clean resync.

Versions tested:
* stable/mitaka (head)
* 8.4.0 tag
* 8.3.0 tag
I checked the code and the logic is the same in master, so I don't expect much improvement with newton or ocata.

I want to propose that we make these hardcoded values operator-configurable while keeping the current defaults. This would not change the behaviour of the code for anybody except operators who need to adjust these values, and it would save us from carrying private patches.
I have a working patch set I can submit upstream for this, which should be backportable all the way back to mitaka.
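To illustrate the shape of the proposal (not the actual patch): the two constants become configuration options with the current values as defaults. The option names and `[agent]` section below are made up for the example; neutron itself would use oslo.config, but the sketch uses stdlib configparser to stay self-contained.

```python
# Hypothetical sketch of making the chunk sizes operator-tunable
# while keeping the current hardcoded values as defaults.
import configparser

DEFAULT_MAX_CHUNK = 256  # current SYNC_ROUTERS_MAX_CHUNK_SIZE
DEFAULT_MIN_CHUNK = 32   # current SYNC_ROUTERS_MIN_CHUNK_SIZE

def load_chunk_sizes(ini_text):
    """Read the chunk sizes from agent config, falling back to defaults."""
    cp = configparser.ConfigParser()
    cp.read_string(ini_text)
    max_chunk = cp.getint('agent', 'sync_routers_max_chunk_size',
                          fallback=DEFAULT_MAX_CHUNK)
    min_chunk = cp.getint('agent', 'sync_routers_min_chunk_size',
                          fallback=DEFAULT_MIN_CHUNK)
    return max_chunk, min_chunk
```

With this shape, an operator in the situation described above would simply set the two options to 32 and 8 in the agent configuration file, with no code change and no behaviour change for anyone who leaves them unset.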

tags: added: neutron-l3-agent
tags: added: neutron-von-agent
tags: added: rpc slow
tags: added: sync-routers
tags: added: neutron-vpn-agent slow-rpc
removed: neutron-von-agent rpc slow
Revision history for this message
Cristian Calin (cristi-calin) wrote :

Assigning to myself to push patch for review.

Changed in neutron:
assignee: nobody → Cristian Calin (cristi-calin)
Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/470295

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/470428

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/470429

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/mitaka)

Related fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/470430

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/470295
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a858ca7d8fdd625524859b723b9ce3fdb747b6b6
Submitter: Jenkins
Branch: master

commit a858ca7d8fdd625524859b723b9ce3fdb747b6b6
Author: Thomas Morin <email address hidden>
Date: Fri Jun 2 15:36:43 2017 +0200

    l3_ha_mode: call bulk _populate_mtu_and_subnets_for_ports

    Based on the observation that a call to sync_routers can be very
    slow (minutes) on some setup, and that profiling data show that
    a significant amount of time is spent in many individual calls
    of _process_sync_ha_data to _populate_mtu_and_subnets_for_ports for
    a single interface, this change refactors _process_sync_ha_data to
    call _populate_mtu_and_subnets_for_ports only once on a list of
    interfaces instead of <n> times.

    Said otherwise:
    - before: O(#routers) SQL queries
              (one per network of an HA interface of a router)
    - after : O(1) SQL queries
              (on the set of networks with an HA interface on a router)

    A basic test shows a drastic improvements, from minutes to around
    one second, in the processing of a sync_routers call with 256 routers.

    Change-Id: I3a00c8fbb245ab3b6d93bdaa97f3435570992791
    Related-Bug: 1692971
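The refactor the commit above describes can be illustrated with a toy model (not neutron's actual code): instead of calling _populate_mtu_and_subnets_for_ports once per HA interface, the fix collects the interfaces and makes a single bulk call. The fake database below just counts round trips to show the O(#routers) vs O(1) difference.

```python
# Toy illustration of the per-port vs bulk query pattern; all names
# here are invented for the example.
class FakeDB:
    def __init__(self):
        self.queries = 0

    def mtu_for_networks(self, network_ids):
        self.queries += 1  # one round trip per call
        return {n: 1500 for n in network_ids}

def per_port_sync(db, ports):
    # before: one query per HA interface -> O(#routers) round trips
    return {p['network_id']: db.mtu_for_networks([p['network_id']])
            for p in ports}

def bulk_sync(db, ports):
    # after: a single query over all networks -> O(1) round trips
    return db.mtu_for_networks([p['network_id'] for p in ports])
```

With 256 routers, the per-port version issues 256 queries where the bulk version issues one, which is consistent with the "minutes to around one second" improvement reported in the commit message.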

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/470429
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e6c1d6dc0f80a27eba29009cf7a6bf7be3bb9c75
Submitter: Jenkins
Branch: stable/newton

commit e6c1d6dc0f80a27eba29009cf7a6bf7be3bb9c75
Author: Thomas Morin <email address hidden>
Date: Fri Jun 2 15:36:43 2017 +0200

    l3_ha_mode: call bulk _populate_mtu_and_subnets_for_ports

    Based on the observation that a call to sync_routers can be very
    slow (minutes) on some setup, and that profiling data show that
    a significant amount of time is spent in many individual calls
    of _process_sync_ha_data to _populate_mtu_and_subnets_for_ports for
    a single interface, this change refactors _process_sync_ha_data to
    call _populate_mtu_and_subnets_for_ports only once on a list of
    interfaces instead of <n> times.

    Said otherwise:
    - before: O(#routers) SQL queries
              (one per network of an HA interface of a router)
    - after : O(1) SQL queries
              (on the set of networks with an HA interface on a router)

    A basic test shows a drastic improvements, from minutes to around
    one second, in the processing of a sync_routers call with 256 routers.

    Change-Id: I3a00c8fbb245ab3b6d93bdaa97f3435570992791
    Related-Bug: 1692971

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Cristian Calin (<email address hidden>) on branch: master
Review: https://review.openstack.org/468421
Reason: Discussion is going nowhere based on feelings

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/470428
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a08aa3bf2f567d4d0a5678d470cc3a3aec382d74
Submitter: Jenkins
Branch: stable/ocata

commit a08aa3bf2f567d4d0a5678d470cc3a3aec382d74
Author: Thomas Morin <email address hidden>
Date: Fri Jun 2 15:36:43 2017 +0200

    l3_ha_mode: call bulk _populate_mtu_and_subnets_for_ports

    Based on the observation that a call to sync_routers can be very
    slow (minutes) on some setup, and that profiling data show that
    a significant amount of time is spent in many individual calls
    of _process_sync_ha_data to _populate_mtu_and_subnets_for_ports for
    a single interface, this change refactors _process_sync_ha_data to
    call _populate_mtu_and_subnets_for_ports only once on a list of
    interfaces instead of <n> times.

    Said otherwise:
    - before: O(#routers) SQL queries
              (one per network of an HA interface of a router)
    - after : O(1) SQL queries
              (on the set of networks with an HA interface on a router)

    A basic test shows a drastic improvements, from minutes to around
    one second, in the processing of a sync_routers call with 256 routers.

    Change-Id: I3a00c8fbb245ab3b6d93bdaa97f3435570992791
    Related-Bug: 1692971

tags: added: in-stable-ocata
tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/mitaka)

Change abandoned by Thomas Morin (<email address hidden>) on branch: stable/mitaka
Review: https://review.openstack.org/470430
Reason: mitaka is EOL since 2017-04-10

tags: removed: neutron-proactive-backport-potential