Docs needed for tunables at large scale
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | | Medium | Slawek Kaplonski | |
Bug Description
Questions come up in IRC every once in a while about configuration options that need to be tweaked at large scale (Blizzard, CERN, etc.). Once a deployment reaches hundreds or thousands of compute nodes, these options must be changed to avoid overloading the control plane.
A similar bug is reported for Nova here: https:/
So let's use this bug to discuss and track which Neutron options are also important to tune at large scale.
LIU Yulong (dragon889) wrote (#1):
LIU Yulong (dragon889) wrote (#2):
Let's recall some scale issues first:
[scale issue] the rootwrap daemon run as root causes very slow L3 agent router processing
https:/
[L2][scale issue] RPC timeout during ovs-agent restart
https:/
[l3][scale issue] unrestricted hosting routers in network node increase service operating pressure
https:/
[scale issue] ovs-agent port processing time increases linearly and eventually times out
https:/
[L2][scale issue] local connection to ovs-vswitchd was dropped or timed out
https:/
[L2][scale issue] ovs-agent failed to restart
https:/
[L2][scale issue] ovs-agent restart takes too long
https:/
[L2][scale issue] ovs-agent has too many flows to troubleshoot
https:/
[L2][scale issue] ovs-agent dump-flows takes a lot of time
https:/
[L2][scale issue] ovs-agent has flows with multiple cookies (stale flows)
https:/
So for the L3 agent, we have these settings:
[agent]
use_helper_
root_helper_daemon=
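The option values above were cut off in the report. As a sketch of what a rootwrap-daemon setup for the L3 agent typically looks like (the paths assume a stock packaged installation; adjust them to your deployment):

```ini
[agent]
# Run rootwrap as a long-lived daemon instead of spawning a new
# sudo/rootwrap process per command; this is the main lever against
# slow router processing in the L3 agent.
root_helper = sudo neutron-rootwrap /etc/neutron/rootwrap.conf
root_helper_daemon = sudo neutron-rootwrap-daemon /etc/neutron/rootwrap.conf
```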
For the ovs-agent, in order to prevent timeouts against ovsdb and the OpenFlow connection, we have these settings:
[ovs]
of_connect_timeout = 600
of_request_timeout = 600
ovsdb_timeout = 600
ovs_probe_interval = 600
For network nodes without any compute resources, we permanently disable security groups:
[securitygroup]
firewall_driver = noop
enable_
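The option names above are truncated. A minimal sketch of a network-node configuration that disables security-group filtering entirely (assuming no compute workloads run on that node) could be:

```ini
[securitygroup]
# No-op driver: skip all iptables/OVS firewall programming.
firewall_driver = noop
# Do not process security group rules at all on this node.
enable_security_group = false
```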
A more general tweak is to disable 'debug' level logging for all your Neutron services.
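For example, in neutron.conf (and in each agent's configuration file):

```ini
[DEFAULT]
# Debug logging is expensive at scale; keep it off in production.
debug = false
```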
And for the centralized services (DB and MQ), review all config options ending in "_timeout" and tune them until they meet your requirements. Here is a grepped list (all of these show default values; do not use them directly):
#check_timeout = 20000
#client_
#default_
#default_
#default_
#default_
#heartbeat_
#http_connect_
#idle_timeout = 0
#kafka_
#kombu_
#memcache_
#memcache_
#memcache_
#pool_timeout = <None>
#producer_
#rpc_ack_
#rpc_ack_
#rpc_poll_timeout = 1
#rpc_response_
#socket_timeout = 0.25
#tcp_user_timeout = 0.25
#timeout = <None>
#wait_timeout = 2000
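Most of the truncated names above come from the oslo.messaging, oslo.db, and cache sections of the config. As an illustrative sketch only (the numbers are assumptions to be load-tested against your own deployment, not recommendations), a large-scale tuning might look like:

```ini
[DEFAULT]
# Give agents more time to answer RPC calls when the MQ is saturated.
rpc_response_timeout = 600

[database]
# Recycle idle DB connections before the server side drops them.
connection_recycle_time = 3600

[oslo_messaging_rabbit]
# Detect dead AMQP connections faster than TCP timeouts would.
heartbeat_timeout_threshold = 60
```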
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
assignee: nobody → Slawek Kaplonski (slaweq)
Yes, we run some deployments at large scale, and indeed many config options need tuning. Not only Neutron itself, but also the centralized services (DB and MQ) used by neutron-server and the agents require adjustment and optimization.
Besides, we did a lot of optimization and service subtraction in Neutron itself, such as using distributed DHCP instead of the dhcp-agent, a distributed metadata proxy instead of the metadata agent, lazy loading of resources in L3, a local cache in the ovs-agent, and direct DB access from the ovs-agent, etc.
So, this could be a really long story.