In DHCP agent's sync_state, get_active_networks_info RPC times out, when there are large number of networks.

Bug #1490308 reported by Sudhakar Gariganti
This bug affects 2 people
Affects: neutron
Status: Expired
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

In our scale tests, in a scenario with a large number of networks, we encountered frequent RPC timeouts for the get_active_networks_info call in the sync_state method.

Once this timeout happens, the DHCP agent takes an indefinite amount of time to recover, because it keeps doing a lot of redundant work.

Assume I am provisioning roughly the 600th tenant network and enabling DHCP for that network fails. A resync is then scheduled for this network alone.

Now, in the sync_state method, we fire the get_active_networks_info call, which doesn't take any 'filters'. The Neutron server takes a long time to return because it has to:

1. fetch from the DB all networks which are hosted on this agent and try to schedule,
2. fetch subnet info for all of those networks,
3. fetch port info for all of those networks.

By the time the response arrives, the agent has already hit the default 60-second RPC timeout.

Though step 1 makes sense in some cases, we don't need subnet and port info for all the networks when we actually want to resync only one network.

I think we need to resurrect the get_active_networks RPC and have filtering in get_active_networks_info RPC.
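
To make the proposal concrete, here is a minimal sketch of the idea, not the actual Neutron code. It uses an in-memory stand-in for the server; `FakeNeutronServer`, `plan_resync`, and the `network_ids` filter argument are all hypothetical names chosen for illustration. The agent first asks for the cheap list of active network IDs, then requests subnet and port details only for the networks it really needs to reconfigure.

```python
class FakeNeutronServer:
    """In-memory stand-in for the server side of the RPC (hypothetical)."""

    def __init__(self, hosted_ids, subnets_by_net, ports_by_net):
        self.hosted_ids = list(hosted_ids)
        self.subnets_by_net = subnets_by_net
        self.ports_by_net = ports_by_net
        self.detail_lookups = 0  # counts per-network subnet/port fetches

    def get_active_networks(self):
        """Cheap call: only the IDs of networks hosted on this agent."""
        return list(self.hosted_ids)

    def get_active_networks_info(self, network_ids=None):
        """Full details; network_ids=None keeps today's 'fetch all' behavior."""
        wanted = None if network_ids is None else set(network_ids)
        result = []
        for net_id in self.hosted_ids:
            if wanted is not None and net_id not in wanted:
                continue
            self.detail_lookups += 1
            result.append({
                "id": net_id,
                "subnets": self.subnets_by_net.get(net_id, []),
                "ports": self.ports_by_net.get(net_id, []),
            })
        return result


def plan_resync(server, known_network_ids, only_nets):
    """Return (deleted IDs, detailed info for networks to reconfigure)."""
    known = set(known_network_ids)
    active_ids = set(server.get_active_networks())
    deleted_ids = known - active_ids
    if only_nets:
        # Networks missing from the cache plus the ones explicitly requested.
        to_fetch = (active_ids - known) | (set(only_nets) & active_ids)
    else:
        # An empty only_nets means "resync everything".
        to_fetch = active_ids
    networks = server.get_active_networks_info(network_ids=to_fetch)
    return deleted_ids, networks
```

With 600 hosted networks and a resync scheduled for a single one, this sketch performs one detail lookup instead of 600, while still detecting networks that were deleted on the server.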

P.S.: Increasing the rpc_timeout is definitely an option, but given the room for improvement in the agent code, I do not want to resort to that yet.

Tags: l3-ipam-dhcp
Sudhakar Gariganti (sudhakar-gariganti) wrote :

Below is the current sync_state method, for quick reference:

    @utils.synchronized('dhcp-agent')
    def sync_state(self, networks=None):
        """Sync the local DHCP state with Neutron. If no networks are passed,
        or 'None' is one of the networks, sync all of the networks.
        """
        only_nets = set([] if (not networks or None in networks) else networks)
        LOG.info(_LI('Synchronizing state'))
        pool = eventlet.GreenPool(cfg.CONF.num_sync_threads)
        known_network_ids = set(self.cache.get_network_ids())

        try:
            # <--- Area of concern: unfiltered call, fetches info for all networks
            active_networks = self.plugin_rpc.get_active_networks_info()
            active_network_ids = set(network.id for network in active_networks)
            for deleted_id in known_network_ids - active_network_ids:
                try:
                    self.disable_dhcp_helper(deleted_id)
                except Exception as e:
                    self.schedule_resync(e, deleted_id)
                    LOG.exception(_LE('Unable to sync network state on '
                                      'deleted network %s'), deleted_id)

            for network in active_networks:
                if (not only_nets or # specifically resync all
                        network.id not in known_network_ids or # missing net
                        network.id in only_nets): # specific network to sync
                    pool.spawn(self.safe_configure_dhcp_for_network, network)
            pool.waitall()
            LOG.info(_LI('Synchronizing state complete'))

        except Exception as e:
            self.schedule_resync(e)
            LOG.exception(_LE('Unable to sync network state.'))
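
The selection logic in the method above can be isolated as the following sketch. The two function names are hypothetical helpers introduced only for illustration; the expressions themselves are copied from sync_state. It shows that the only_nets filter is applied on the agent side, after the full, unfiltered result has already been fetched.

```python
def normalize_only_nets(networks=None):
    """Empty set means 'sync everything' (no networks given, or None among them)."""
    return set([] if (not networks or None in networks) else networks)


def needs_configure(network_id, only_nets, known_network_ids):
    """Mirror of the condition guarding pool.spawn() in sync_state."""
    return (not only_nets                            # specifically resync all
            or network_id not in known_network_ids   # missing net
            or network_id in only_nets)              # specific network to sync
```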

Changed in neutron:
assignee: nobody → Sudhakar Gariganti (sudhakar-gariganti)
summary: - In DHCP agent's sync_state, get_active_networks_info results in RPC
- timeout
+ In DHCP agent's sync_state, get_active_networks_info times out
summary: - In DHCP agent's sync_state, get_active_networks_info times out
+ In DHCP agent's sync_state, get_active_networks_info RPC times out, when
+ there are large number of networks.
Changed in neutron:
status: New → In Progress
Carl Baldwin (carl-baldwin) wrote :

Any progress on this? I don't see a patch proposed?

Changed in neutron:
importance: Undecided → Medium
Sudhakar Gariganti (sudhakar-gariganti) wrote :

Not sure why the notification got missed. I posted a patch [ https://review.openstack.org/#/c/219573/ ] quite a while ago, but have not had time to revise it of late.
Will resume the work and try to post an update.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio on branch: master
Review: https://review.openstack.org/219573
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Armando Migliaccio (armando-migliaccio) wrote :

If you are still working on this please resume, or allow someone else to pick this up.

Changed in neutron:
status: In Progress → Incomplete
assignee: Sudhakar Gariganti (sudhakar-gariganti) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired