Docs needed for tunables at large scale

Bug #1858419 reported by Slawek Kaplonski on 2020-01-06

Bug Description

Various things come up in IRC every once in a while about configuration options that need to be tweaked at large scale (Blizzard, CERN, etc.): once you hit hundreds or thousands of compute nodes, these options must be changed to avoid overloading the control plane.

A similar bug has been reported for Nova here:

So let's use this bug to discuss and track which Neutron options are also important to tune at large scale.

Tags: doc
LIU Yulong (dragon889) wrote:

Yes, we run some deployments at large scale, and indeed many config options need tuning. Not only Neutron itself, but also the centralized services (DB and MQ) used by neutron-server and the agents require adjustment and optimization.

Besides, we did a lot of optimization and service reduction for Neutron itself, such as using distributed DHCP instead of the dhcp-agent, a distributed metadata proxy instead of the metadata agent, lazy resource loading for L3, a local cache for the ovs-agent, and direct DB access for the ovs-agent.

So, this could be a really long story.

LIU Yulong (dragon889) wrote:

Let's recall some scale issues first:
[scale issue] the root rootwrap daemon makes L3 agent router processing very slow
[L2][scale issue] RPC timeout during ovs-agent restart
[l3][scale issue] unrestricted hosting of routers on network nodes increases service operating pressure
[scale issue] ovs-agent port processing time increases linearly and eventually times out
[L2][scale issue] local connection to ovs-vswitchd was dropped or timed out
[L2][scale issue] ovs-agent failed to restart
[L2][scale issue] ovs-agent restart takes too long
[L2][scale issue] ovs-agent has too many flows to troubleshoot
[L2][scale issue] ovs-agent dump-flows takes a lot of time
[L2][scale issue] ovs-agent has flows with multiple cookies (stale flows)

So for the L3 agent, we have this setting:
use_helper_for_ns_read = False
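As a sketch of where this goes (file path and section may differ by release and deployment tool), the option is typically set in the [AGENT] section of the L3 agent config:

```ini
# /etc/neutron/l3_agent.ini (path may vary by deployment)
[AGENT]
# Read namespace data directly instead of going through the root
# helper for every read; avoids rootwrap overhead at scale.
use_helper_for_ns_read = False
```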

For the ovs-agent, in order to prevent timeouts against ovsdb, we have these settings:
of_connect_timeout = 600
of_request_timeout = 600
ovsdb_timeout = 600
ovs_probe_interval = 600
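As a sketch, these options usually live in the [OVS] section of the OVS agent config (section and file names may differ between releases, so verify against your release's configuration reference):

```ini
# /etc/neutron/plugins/ml2/openvswitch_agent.ini (path may vary)
[OVS]
# OpenFlow connection/request timeouts (seconds), raised well above
# the defaults so flow programming survives slow restarts at scale.
of_connect_timeout = 600
of_request_timeout = 600
# ovsdb command timeout and connection probe interval (seconds).
ovsdb_timeout = 600
ovs_probe_interval = 600
```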

For network nodes without any compute resources, we disable security groups entirely:
firewall_driver = noop
enable_security_group = False
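These two options belong to the [securitygroup] section of the agent configuration; a minimal sketch:

```ini
[securitygroup]
# A pure network node hosts no VM ports, so skip security-group
# processing entirely and save the per-port filtering work.
firewall_driver = noop
enable_security_group = False
```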

A more general tip is to disable 'debug'-level logging for all your Neutron services.

And for the centralized services (DB and MQ), find all config options ending with "_timeout" and tune them until they meet your requirements. Here is a grepped list (all values shown are defaults; do not use them directly):
#check_timeout = 20000
#client_socket_timeout = 900
#default_notify_timeout = 30
#default_reply_timeout = 30
#default_sender_link_timeout = 600
#default_send_timeout = 30
#heartbeat_timeout_threshold = 60
#http_connect_timeout = <None>
#idle_timeout = 0
#kafka_consumer_timeout = 1.0
#kombu_missing_consumer_retry_timeout = 60
#memcache_pool_conn_get_timeout = 10
#memcache_pool_socket_timeout = 3
#memcache_pool_unused_timeout = 60
#pool_timeout = <None>
#producer_batch_timeout = 0.0
#rpc_ack_timeout_base = 15
#rpc_ack_timeout_multiplier = 2
#rpc_poll_timeout = 1
#rpc_response_timeout = 60
#socket_timeout = 0.25
#tcp_user_timeout = 0.25
#timeout = <None>
#wait_timeout = 2000
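As an illustrative sketch only (the values below are hypothetical starting points, not recommendations), tuning two of the options from the list above might look like this in neutron.conf:

```ini
[DEFAULT]
# Give agents more time to answer RPC calls when the message
# queue is saturated (default is 60 seconds).
rpc_response_timeout = 180

[oslo_messaging_rabbit]
# Raise the heartbeat threshold so busy agents are not marked
# dead while they are still processing (default is 60 seconds).
heartbeat_timeout_threshold = 120
```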

Hongbin Lu on 2020-01-12
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)