Comment 15 for bug 1224001

Salvatore Orlando (salvatore-orlando) wrote :

In some cases rpc_loop or _sync_routers_task blocks. From log observations this always happens while executing subprocess.communicate, and the root cause could be this: https://github.com/eventlet/eventlet/pull/24

This is a bit strange, since this popen.communicate is also used in common.processutils and no other blocking issue has been reported there. Perhaps neutron.agent.linux.utils.execute should leverage openstack.common.

Another thing that is hard to explain at the moment is why this would not affect the DHCP agent.
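For reference, the call pattern in question looks roughly like the plain-Python sketch below (this is an illustration, not the agent's actual code). communicate() blocks the calling thread until the child process exits; under eventlet, if the pipe reads are not green-thread aware, that wait can stall the whole hub, which matches the symptom above:

```python
import subprocess

def run_cmd(cmd):
    # Popen + communicate(), as used by helpers like
    # neutron.agent.linux.utils.execute. communicate() waits for
    # the child to exit; the eventlet issue linked above is about
    # this wait not yielding to other green threads.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    out, err = proc.communicate()
    return proc.returncode, out, err

rc, out, err = run_cmd(["echo", "hello"])
```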

In other cases, the following exception is raised instead (and probably shouldn't be):

2013-10-04 12:28:21.360 1259 ERROR neutron.agent.l3_agent [-] Failed synchronizing routers
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent Traceback (most recent call last):
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 730, in _rpc_loop
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent self._process_router_delete()
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 739, in _process_router_delete
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent self._router_removed(router_id)
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent File "/opt/stack/new/neutron/neutron/agent/l3_agent.py", line 313, in _router_removed
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent ri = self.router_info[router_id]
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent KeyError: u'b17f5fe6-8354-4af7-b271-a4ab0896dcb7'
2013-10-04 12:28:21.360 1259 TRACE neutron.agent.l3_agent

This triggers a full synchronization, which has the following effects:
- it blocks the rpc loop, so the update for the floating IP is delayed. With many routers (and the tenant isolation jobs added many routers) this might mean the floating IP is applied only after the tempest timeout has elapsed.
- performing many execute operations increases the chance of the thread blocking.

The current approach is to 'blindly' ignore the router_removed error so that it does not trigger the full router synchronization.
If it works, this should be regarded only as the first step of a more complex fix aimed at getting the gate going again.