neutron

In DHCP agent's sync_state, get_active_networks_info RPC times out, when there are large number of networks.

Bug #1490308 reported by Sudhakar Gariganti on 2015-08-30

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
neutron	Expired	Medium	Unassigned

Bug Description

In our scale tests, for the scenario of supporting a large number of networks, we encountered frequent RPC timeouts for the get_active_networks_info call in the sync_state method.

Once this timeout happens, the DHCP agent takes an indefinite amount of time to recover, because it keeps doing a lot of redundant work.

Assume I am provisioning the 600th tenant network or so, and enabling DHCP for that network fails. A resync is then scheduled for this network alone.

Now, in the sync_state method, we fire the get_active_networks_info call, which does not take any 'filters'. The Neutron server takes a long time to return, because it has to:

1. fetch all networks from the DB that are hosted on this agent, and try to schedule,
2. fetch subnet info for all networks,
3. fetch port info for all networks.

By the time the response arrives, the agent has already hit the default 60-second timeout.

Though step 1 makes sense in some cases, we don't need to fetch subnet and port info for all the networks when we actually want to resync only one network.

I think we need to resurrect the get_active_networks RPC and have filtering in get_active_networks_info RPC.
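The filtering idea above could look roughly like the following sketch. It is not Neutron code: the in-memory NETWORKS, SUBNETS and PORTS tables, the host name, and the network_ids parameter are all assumptions made for illustration, standing in for the server-side DB queries and for a filter that the RPC does not have at the time of this report.

```python
# Hypothetical in-memory stand-ins for the Neutron DB; not real Neutron APIs.
NETWORKS = {"net-%d" % i: {"id": "net-%d" % i, "host": "dhcp-agent-1"}
            for i in range(600)}
SUBNETS = {net_id: [{"id": "sub-" + net_id, "network_id": net_id}]
           for net_id in NETWORKS}
PORTS = {net_id: [{"id": "port-" + net_id, "network_id": net_id}]
         for net_id in NETWORKS}


def get_active_networks(host):
    """Lightweight call: return only the IDs of networks hosted on `host`."""
    return [net["id"] for net in NETWORKS.values() if net["host"] == host]


def get_active_networks_info(host, network_ids=None):
    """Return full network info for `host`.

    `network_ids` is an assumed, optional filter: when given, subnet and
    port details are looked up only for those networks.
    """
    wanted = set(network_ids) if network_ids else None
    result = []
    for net_id in get_active_networks(host):
        if wanted is not None and net_id not in wanted:
            continue  # skip the subnet and port lookups for this network
        result.append({
            "id": net_id,
            "subnets": SUBNETS[net_id],
            "ports": PORTS[net_id],
        })
    return result
```

With such a filter, a resync of a single network would request details for that one network, e.g. get_active_networks_info("dhcp-agent-1", network_ids=["net-599"]), while the ID-only call would still let the agent detect deleted networks.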

P.S.: Increasing the rpc_timeout is definitely an option, but given the possible room for improvement in the agent code, I do not want to call that shot already.

Sudhakar Gariganti (sudhakar-gariganti) wrote on 2015-08-30:

Below is the current sync_state method, for quick reference:

    @utils.synchronized('dhcp-agent')
    def sync_state(self, networks=None):
        """Sync the local DHCP state with Neutron. If no networks are passed,
        or 'None' is one of the networks, sync all of the networks.
        """
        only_nets = set([] if (not networks or None in networks) else networks)
        LOG.info(_LI('Synchronizing state'))
        pool = eventlet.GreenPool(cfg.CONF.num_sync_threads)
        known_network_ids = set(self.cache.get_network_ids())

        try:
            # Area of concern: unfiltered call, fetches info for all networks
            active_networks = self.plugin_rpc.get_active_networks_info()
            active_network_ids = set(network.id for network in active_networks)
            for deleted_id in known_network_ids - active_network_ids:
                try:
                    self.disable_dhcp_helper(deleted_id)
                except Exception as e:
                    self.schedule_resync(e, deleted_id)
                    LOG.exception(_LE('Unable to sync network state on '
                                      'deleted network %s'), deleted_id)

            for network in active_networks:
                if (not only_nets or # specifically resync all
                        network.id not in known_network_ids or # missing net
                        network.id in only_nets): # specific network to sync
                    pool.spawn(self.safe_configure_dhcp_for_network, network)
            pool.waitall()
            LOG.info(_LI('Synchronizing state complete'))

        except Exception as e:
            self.schedule_resync(e)
            LOG.exception(_LE('Unable to sync network state.'))
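As a rough illustration of how the selection logic in the method above could work on IDs alone, here is a minimal sketch. It assumes the agent can first obtain just the IDs of active networks (as a get_active_networks RPC would allow); the helper names ids_needing_full_info and deleted_ids are hypothetical and do not exist in the agent.

```python
def ids_needing_full_info(active_ids, known_ids, only_nets):
    """Mirror the condition in sync_state's loop, using network IDs only.

    An empty `only_nets` means "resync everything", as in sync_state.
    """
    active_ids = set(active_ids)
    known_ids = set(known_ids)
    only_nets = set(only_nets)
    if not only_nets:
        return active_ids  # specifically resync all
    return {
        net_id for net_id in active_ids
        if net_id not in known_ids  # missing net
        or net_id in only_nets      # specific network to sync
    }


def deleted_ids(active_ids, known_ids):
    """Networks the agent still caches but the server no longer reports."""
    return set(known_ids) - set(active_ids)


# Example: 600 active networks, 599 already cached, resync requested for one.
active = ["net-%d" % i for i in range(600)]
known = active[:599]
to_fetch = ids_needing_full_info(active, known, ["net-599"])
```

In this example only one network would need its subnet and port details fetched, instead of all 600.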

Changed in neutron:
assignee:	nobody → Sudhakar Gariganti (sudhakar-gariganti)
summary:
- In DHCP agent's sync_state, get_active_networks_info results in RPC timeout
+ In DHCP agent's sync_state, get_active_networks_info times out
summary:
- In DHCP agent's sync_state, get_active_networks_info times out
+ In DHCP agent's sync_state, get_active_networks_info RPC times out, when there are large number of networks.

Sudhakar Gariganti (sudhakar-gariganti) on 2015-08-31

Changed in neutron:
status:	New → In Progress

Carl Baldwin (carl-baldwin) wrote on 2015-12-03:

Any progress on this? I don't see a patch proposed.

Changed in neutron:
importance:	Undecided → Medium

Sudhakar Gariganti (sudhakar-gariganti) wrote on 2015-12-03:

Not sure why the notification got missed. I posted a patch [ https://review.openstack.org/#/c/219573/ ] quite a while ago, but have not had time to revise it of late.
I will resume the work and try to post an update.

OpenStack Infra (hudson-openstack) wrote on 2016-03-12: Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/219573
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Armando Migliaccio (armando-migliaccio) wrote on 2016-03-12:

If you are still working on this please resume, or allow someone else to pick this up.

Changed in neutron:
status:	In Progress → Incomplete
assignee:	Sudhakar Gariganti (sudhakar-gariganti) → nobody

Launchpad Janitor (janitor) wrote on 2016-05-11:

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status:	Incomplete → Expired

Duplicates of this bug

Bug #1525753
