In the DHCP agent's sync_state, the get_active_networks_info RPC times out when there is a large number of networks.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
neutron | Expired | Medium | Unassigned |
Bug Description
In our scale tests, for the scenario of supporting a large number of networks, we encountered frequent RPC timeouts for the get_active_
Once this timeout happens, it takes an indefinite amount of time for the DHCP agent to recover, because it keeps doing a lot of redundant work.
Assume I am provisioning roughly the 600th tenant network and enabling DHCP for that network fails. A resync is then scheduled for this network alone.
Now in the sync_state method, we fire get_active_
1. fetch all networks from the DB which are hosted on this agent and try to schedule
2. fetch subnet info for all networks,
3. fetch port info for all networks.
By the time the response arrives, the agent has already hit the default 60-second RPC timeout.
Although step 1 makes sense in some cases, we don't need to fetch subnet and port info for all the networks when we actually want to resync only one network.
I think we need to resurrect the get_active_networks RPC and have filtering in get_active_
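The filtering idea can be sketched with an in-memory stand-in for the server side. This is only an illustration under stated assumptions, not Neutron's plugin code: the `FakeServer` class, the `NetworkInfo` record, the `network_ids` parameter, and the `detail_fetches` counter are all hypothetical names, and the real RPC signature may differ.

```python
from collections import namedtuple

# Hypothetical, simplified stand-in for a network record with its details.
NetworkInfo = namedtuple('NetworkInfo', ['id', 'subnets', 'ports'])


class FakeServer(object):
    """In-memory model of the server side of the RPC (illustration only)."""

    def __init__(self, network_ids):
        self.network_ids = list(network_ids)
        self.detail_fetches = 0  # counts subnet/port lookups performed

    def _fetch_details(self, net_id):
        # Stands in for the subnet and port DB queries for one network.
        self.detail_fetches += 1
        return NetworkInfo(net_id, ['subnet-%s' % net_id],
                           ['port-%s' % net_id])

    def get_active_networks_info(self, network_ids=None):
        """Return details for all hosted networks, or only the requested ones.

        The network_ids filter is an assumed extension, not an existing
        parameter of the real RPC.
        """
        wanted = self.network_ids
        if network_ids is not None:
            requested = set(network_ids)
            wanted = [n for n in self.network_ids if n in requested]
        return [self._fetch_details(n) for n in wanted]


server = FakeServer('net-%d' % i for i in range(600))

# Unfiltered call: details are fetched for every hosted network.
full = server.get_active_networks_info()
full_cost = server.detail_fetches

# Filtered call: details are fetched only for the network being resynced.
server.detail_fetches = 0
single = server.get_active_networks_info(network_ids=['net-599'])
single_cost = server.detail_fetches
```

In this model, resyncing one network costs one detail lookup with the filter and 600 without it. The agent would still need the full set of active network IDs to detect deleted networks, which is presumably where a lightweight ID-only call would fit.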
P.S.: Increasing the rpc_timeout is definitely an option, but given the room for improvement in the agent code, I do not want to take that route yet.
Changed in neutron:
status: New → In Progress
Below is the current sync_state method, for quick reference:
@utils.synchronized('dhcp-agent')
def sync_state(self, networks=None):
    """Sync the local DHCP state with Neutron. If no networks are passed,
    or 'None' is one of the networks, sync all of the networks.
    """
    only_nets = set([] if (not networks or None in networks) else networks)
    LOG.info(_LI('Synchronizing state'))
    pool = eventlet.GreenPool(cfg.CONF.num_sync_threads)
    known_network_ids = set(self.cache.get_network_ids())

    try:
        active_networks = self.plugin_rpc.get_active_networks_info()  # <--- Area of concern
        active_network_ids = set(network.id for network in active_networks)
        for deleted_id in known_network_ids - active_network_ids:
            try:
                self.disable_dhcp_helper(deleted_id)
            except Exception as e:
                self.schedule_resync(e, deleted_id)
                LOG.exception(_LE('Unable to sync network state on '
                                  'deleted network %s'), deleted_id)

        for network in active_networks:
            if (not only_nets or  # specifically resync all
                    network.id not in known_network_ids or  # missing net
                    network.id in only_nets):  # specific network to sync
                pool.spawn(self.safe_configure_dhcp_for_network, network)
        pool.waitall()
        LOG.info(_LI('Synchronizing state complete'))

    except Exception as e:
        self.schedule_resync(e)
        LOG.exception(_LE('Unable to sync network state.'))
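
To make the selection logic in sync_state easier to follow, here is a small standalone sketch of the two decisions it makes: which cached networks to disable and which active networks to (re)configure. The helper names `networks_to_disable` and `networks_to_configure` are hypothetical and operate on plain IDs rather than network objects; the conditions mirror the method above.

```python
def networks_to_disable(active_ids, known_ids):
    """IDs present in the agent cache but no longer active on the server."""
    return set(known_ids) - set(active_ids)


def networks_to_configure(active_ids, known_ids, requested=None):
    """IDs that sync_state would hand to the DHCP configuration step."""
    # An empty request, or a request containing None, means "sync everything".
    only_nets = set([] if (not requested or None in requested) else requested)
    return [net_id for net_id in active_ids
            if (not only_nets or            # specifically resync all
                net_id not in known_ids or  # missing from the cache
                net_id in only_nets)]       # specific network to sync


active = ['a', 'b', 'c', 'd']
known = {'a', 'b', 'c', 'x'}

to_disable = networks_to_disable(active, known)
sync_all = networks_to_configure(active, known)
sync_one = networks_to_configure(active, known, ['b'])
sync_with_none = networks_to_configure(active, known, ['b', None])
```

Even when only network 'b' is requested, both helpers need the complete list of active networks as input. Only the IDs are used for the comparisons, though, so the subnet and port details for the networks that are not reconfigured go unused.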