Bug #1858419 “Docs needed for tunables at large scale” : Bugs : neutron

LIU Yulong (dragon889) wrote on 2020-01-08:

#1

Yes, we run some deployments at large scale, and there are indeed many config options to tune. It is not only neutron itself, though: the centralized services (DB and MQ) used by neutron-server and the agents also require adjustment and optimization.

Besides, we did a lot of optimization and service subtraction for neutron itself, such as using distributed DHCP instead of dhcp-agent, using distributed-metadata-proxy instead of the metadata agent, lazy resource loading for L3, a local cache for ovs-agent, and direct DB access for ovs-agent.

So, this could be a really long story.

LIU Yulong (dragon889) wrote on 2020-01-10:

#2

Let's first recall some scale issues:
[scale issue] the root rootwrap deamon causes l3 agent router procssing very very slow
https://bugs.launchpad.net/neutron/+bug/1825152
[L2][scale issue] RPC timeout during ovs-agent restart
https://bugs.launchpad.net/neutron/+bug/1813704
[l3][scale issue] unrestricted hosting routers in network node increase service operating pressure
https://bugs.launchpad.net/neutron/+bug/1828605
[scale issue] ovs-agent port processing time increases linearly and eventually timeouts
https://bugs.launchpad.net/neutron/+bug/1838431
[L2][scale issue] local connection to ovs-vswitchd was drop or timeout
https://bugs.launchpad.net/neutron/+bug/1813705
[L2][scale issue] ovs-agent failed to restart
https://bugs.launchpad.net/neutron/+bug/1813706
[L2][scale issue] ovs-agent restart costs too long time
https://bugs.launchpad.net/neutron/+bug/1813707
[L2][scale issue] ovs-agent has too many flows to do trouble shooting
https://bugs.launchpad.net/neutron/+bug/1813708
[L2][scale issue] ovs-agent dump-flows takes a lots of time
https://bugs.launchpad.net/neutron/+bug/1813709
[L2][scale issue] ovs-agent has multipe cookies flows (stale flows)
https://bugs.launchpad.net/neutron/+bug/1813712

So for the L3 agent, we use these settings:
[agent]
use_helper_for_ns_read = False
root_helper_daemon=

For ovs-agent, to prevent timeouts when talking to ovs-db, we use these settings:
[ovs]
of_connect_timeout = 600
of_request_timeout = 600
ovsdb_timeout = 600
ovs_probe_interval = 600

For network nodes without any compute resources, we disable security groups permanently:
[securitygroup]
firewall_driver = noop
enable_security_group = False

A more general setting is to disable 'debug' level logging for all your neutron services.

And for the centralized services (DB and MQ), find all config options ending with "_timeout" and tune them until they meet your requirements. Here is a grepped list (all of these are default values, please do not use them directly):
#check_timeout = 20000
#client_socket_timeout = 900
#default_notify_timeout = 30
#default_reply_timeout = 30
#default_sender_link_timeout = 600
#default_send_timeout = 30
#heartbeat_timeout_threshold = 60
#http_connect_timeout = <None>
#idle_timeout = 0
#kafka_consumer_timeout = 1.0
#kombu_missing_consumer_retry_timeout = 60
#memcache_pool_conn_get_timeout = 10
#memcache_pool_socket_timeout = 3
#memcache_pool_unused_timeout = 60
#pool_timeout = <None>
#producer_batch_timeout = 0.0
#rpc_ack_timeout_base = 15
#rpc_ack_timeout_multiplier = 2
#rpc_poll_timeout = 1
#rpc_response_timeout = 60
#socket_timeout = 0.25
#tcp_user_timeout = 0.25
#timeout = <None>
#wait_timeout = 2000
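
For anyone who wants to build a similar list for their own deployment, here is a minimal sketch in Python. It is only an illustration, not the command used to produce the list above: the function name `find_timeout_options` is hypothetical, and it assumes oslo.config style files where sample defaults appear as `#name = value` with no space after the `#`. It matches any option whose name contains "timeout", since the list above also includes names such as `heartbeat_timeout_threshold`.

```python
import re
import sys

# Matches "[section]" headers in an oslo.config style INI file.
SECTION_RE = re.compile(r"^\[(?P<section>[^\]]+)\]\s*$")
# Matches "name = value", optionally commented out as "#name = value".
# No whitespace is allowed after "#", so prose comments are skipped.
OPTION_RE = re.compile(
    r"^(?P<comment>#?)(?P<name>[A-Za-z0-9_]+)\s*=\s*(?P<value>.*)$"
)


def find_timeout_options(text):
    """Return (section, name, value, commented) for every option
    whose name contains 'timeout'."""
    results = []
    section = "DEFAULT"  # options before any header belong to DEFAULT
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group("section")
            continue
        option = OPTION_RE.match(line)
        if option and "timeout" in option.group("name"):
            results.append((
                section,
                option.group("name"),
                option.group("value").strip(),
                option.group("comment") == "#",  # True: sample default
            ))
    return results


if __name__ == "__main__":
    # Usage: pass one or more config file paths on the command line.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            found = find_timeout_options(handle.read())
        for section, name, value, commented in found:
            state = "commented default" if commented else "explicitly set"
            print(f"{path}: [{section}] {name} = {value} ({state})")
```

Running it against your own config files (whichever paths your deployment uses) shows which timeouts are still at their commented defaults and which have been set explicitly, which is a reasonable starting point for deciding what to tune.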

Hongbin Lu (hongbin.lu) on 2020-01-12

Changed in neutron:
importance:	Undecided → Medium

Slawek Kaplonski (slaweq) on 2020-01-14

Changed in neutron:
assignee:	nobody → Slawek Kaplonski (slaweq)
