neutron operators using L3 agent might need to tune SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | In Progress | Undecided | Cristian Calin |
Bug Description
Summary
=======
OpenStack operators deploying the L3 agent might need to tune SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE.
High level description
======================
The neutron l3 agent and its derivatives (neutron-vpn-agent) do a full sync when they start. The process is to fetch the list of routers associated with the agent from neutron-server, then issue a sync_routers RPC call with the delta between the routers already online locally and those that still need to be synchronised.
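The startup delta described above amounts to a set difference. A minimal sketch, with hypothetical helper and parameter names (not the actual agent code):

```python
# Minimal illustration of the delta the agent computes on startup:
# routers the server says belong to this agent, minus routers the
# agent already has running locally. Names here are hypothetical.

def routers_to_sync(assigned_router_ids, online_router_ids):
    """Return the router ids the agent still needs to synchronise."""
    return sorted(set(assigned_router_ids) - set(online_router_ids))
```

On a clean restart nothing is online locally, so the delta is the full list of assigned routers, which is why a cold start produces the largest sync_routers calls.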
The call time is linearly dependent on the number of routers associated with that agent and might hit the RPC timeout if the server is overloaded (as in a complete datacenter outage or a multi-step upgrade). When oslo_messaging raises a timeout, the l3 agent attempts to retry the call in progressively smaller chunks.
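The chunk-and-back-off behaviour can be sketched as follows. This is a simplified illustration of the logic discussed in this bug, not the actual neutron code; the helper names (`sync_all_routers`, `sync_chunk`) are hypothetical, and only the two constants mirror the hardcoded values in question:

```python
# Simplified sketch: sync routers in chunks, halving the chunk size
# whenever the RPC call times out, down to a hardcoded floor.

class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.MessagingTimeout."""

SYNC_ROUTERS_MAX_CHUNK_SIZE = 256  # initial chunk size (hardcoded upstream)
SYNC_ROUTERS_MIN_CHUNK_SIZE = 32   # smallest chunk size to back off to

def sync_all_routers(router_ids, sync_chunk):
    """Sync all routers; returns the final chunk size that succeeded."""
    chunk_size = SYNC_ROUTERS_MAX_CHUNK_SIZE
    i = 0
    while i < len(router_ids):
        chunk = router_ids[i:i + chunk_size]
        try:
            sync_chunk(chunk)  # issues the sync_routers RPC call
            i += len(chunk)
        except MessagingTimeout:
            if chunk_size <= SYNC_ROUTERS_MIN_CHUNK_SIZE:
                raise  # even the smallest chunk timed out; give up
            chunk_size = max(chunk_size // 2, SYNC_ROUTERS_MIN_CHUNK_SIZE)
    return chunk_size
```

Each failed attempt still costs a full rpc_timeout wait on the agent side and real processing work on the server side, which is why converging from a too-large initial chunk can take tens of minutes on an overloaded deployment.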
Pre-conditions
==============
We faced this issue in a production environment and managed to reproduce an approximate behaviour in a pre-production environment.
Details of the test environment:
* 4 instances of the neutron-server
- 8 RPC workers
- 8 API workers
- 700 networks with 1 subnetwork each
- 100 tenants
- 9 external networks
- 1 shared network with instances attached to it
* 6 neutron vpn agents (also tested with neutron-l3-agent)
- L3 HA configured
- no l2-population configured
- 240 routers scheduled per agent
- rpc_timeout = 600
* 3 nova-compute nodes
- running 600 instances
- 100 instances with 2 network interfaces
- 50 instances attached to the shared network
Observations:
* sync_routers RPC call takes 7-10 minutes to get processed
* in production we observe messaging timeout and chunk scaling after 40 minutes
* in this environment we don't see an RPC timeout error surface, but the sync_routers call still exceeds the rpc_timeout of 600 and triggers neutron-server to consume 100% CPU for almost 40 minutes before the agent eventually scales down the chunk size and manages to bring all routers fully online
Modifications:
We modified the hardcoded SYNC_ROUTERS_MAX_CHUNK_SIZE and SYNC_ROUTERS_MIN_CHUNK_SIZE constants under neutron/; with the adjusted values, the neutron-l3-agent started creating qrouter-* namespaces within 10 seconds of a clean restart.
A clean restart for this test means killing all keepalived and neutron agent processes, deleting the ovs ports, and deleting all namespaces on the node. This effectively forces a full clean resync.
Versions tested:
* stable/mitaka (head)
* 8.4.0 tag
* 8.3.0 tag
I checked the code and the logic is the same on master, so I don't expect much improvement with newton or ocata.
I want to propose making these hardcoded values operator-configurable while keeping the current defaults. This would not change the behaviour of the code for anybody except operators who need to adjust these values, and would spare us from carrying private patches.
I have a working patch set I can submit upstream for this, which should be backportable all the way to mitaka.
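One way the proposal could look, as a sketch using oslo.config; the option names and registration site here are hypothetical (the actual patch may differ), and the defaults shown are the current hardcoded values so existing deployments would see no behaviour change:

```python
# Sketch: expose the chunk-size constants as operator-tunable options.
# Option names are hypothetical; defaults match the hardcoded values.

from oslo_config import cfg

sync_chunk_opts = [
    cfg.IntOpt('sync_routers_max_chunk_size',
               default=256,
               help='Initial chunk size for the sync_routers RPC call '
                    'issued by the l3 agent on full sync.'),
    cfg.IntOpt('sync_routers_min_chunk_size',
               default=32,
               help='Smallest chunk size the l3 agent will back off to '
                    'after repeated RPC timeouts.'),
]

cfg.CONF.register_opts(sync_chunk_opts)
```

Operators hitting this bug could then lower both values in the agent configuration file instead of patching the source.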
tags: | added: neutron-l3-agent |
tags: | added: neutron-von-agent |
tags: | added: rpc slow |
tags: | added: sync-routers |
tags: | added: neutron-vpn-agent slow-rpc removed: neutron-von-agent rpc slow |
Changed in neutron: | |
status: | New → In Progress |
tags: | added: neutron-proactive-backport-potential |
tags: | removed: neutron-proactive-backport-potential |
Assigning to myself to push patch for review.