@james-page, @axino -
Just a +1: changing the HA property requires the router to be set down beforehand and brought back up afterwards, which starts the recreation of the router as HA.
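For reference, the down / change / up sequence can be sketched as below. This is only an illustration, not what we ran: the function name `convert_router_ha` is made up here, and `network` is assumed to be something exposing `update_router(router, **attrs)` in the way the openstacksdk network proxy does, with `is_admin_state_up` and `is_ha` as the attribute names.

```python
def convert_router_ha(network, router_id, ha=True):
    """Change a router's HA property, wrapped in admin-down / admin-up.

    Assumption: `network` has update_router(router, **attrs), as the
    openstacksdk network proxy (conn.network) does.
    """
    # 1. Set the router administratively down; the HA flag can only be
    #    changed while the router is down.
    network.update_router(router_id, is_admin_state_up=False)
    try:
        # 2. Flip the HA property.
        network.update_router(router_id, is_ha=ha)
    finally:
        # 3. Always bring the router back up, even if step 2 failed, so it
        #    is not left down. Coming back up is what starts the recreation.
        network.update_router(router_id, is_admin_state_up=True)
```

With openstacksdk this would be called roughly as `convert_router_ha(openstack.connect(cloud="mycloud").network, "<router-id>")`, where the cloud name is a placeholder.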
We have seen various other side effects in Neutron/OVS environments, and specifically in the environment in question, such as:
* Missing interfaces inside qrouter namespaces (OVS taps)
* Missing iptables rules
* Missing floating IP aliases on OVS interfaces inside the qrouter namespaces
All of these are tasks performed during bring-up of HA routers. We have seen fewer of these issues on non-HA routers. Whether or not the router is HA, rescheduling it or converting it from HA to non-HA (or vice versa) rebuilds the router and, as a result, repairs it.
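A rough way to spot the missing floating IP symptom is to compare the addresses Neutron expects with what is actually configured inside the namespace. The sketch below is an assumption about how one might check, not a tool we have: the helper names are invented, and it parses the one-line output of `ip -o addr show` run inside `qrouter-<router-id>`. For an HA router it is only meaningful on the node holding the active instance, since the backups do not carry the addresses.

```python
import subprocess


def namespace_cmd(router_id, *cmd):
    """Build a command that runs `cmd` inside the router's qrouter namespace."""
    return ["ip", "netns", "exec", "qrouter-" + router_id, *cmd]


def addresses_in_output(ip_addr_output):
    """Parse `ip -o addr show` output into a set of addresses (no prefix length)."""
    found = set()
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> inet|inet6 <addr>/<len> ..."
        if len(fields) >= 4 and fields[2] in ("inet", "inet6"):
            found.add(fields[3].split("/")[0])
    return found


def missing_floating_ips(expected_ips, ip_addr_output):
    """Return the expected floating IPs that are not configured in the namespace."""
    return sorted(set(expected_ips) - addresses_in_output(ip_addr_output))


def check_router(router_id, expected_ips):
    """Run the check on a network node (needs root to enter the namespace)."""
    output = subprocess.check_output(
        namespace_cmd(router_id, "ip", "-o", "addr", "show"), text=True
    )
    return missing_floating_ips(expected_ips, output)
```

The same `namespace_cmd` helper can wrap `iptables-save` or `ip -o link show` to eyeball the missing rules and interfaces.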
I should also point out that we have rarely observed high system load at the time of these issues. I do agree, though, that the number of routers, and therefore the work Neutron and OVS must do to orchestrate interface plugging and unplugging and namespace setup (with the associated network stack plumbing), is much higher than in a typical environment. Having three servers do this work, rather than scaling horizontally, may be exposing bottlenecks in either Neutron or OVS in the orchestration of these tasks.
I'm not sure whether you are seeing the following traceback in the logs provided, but it has also been common when this issue crops up. It shows an example of a task performed during router bring-up falling over: immediately after IPTablesManager.apply completes, the HA router's external gateway update fails in _remove_vip.
2018-02-14 05:04:32.101 1352665 DEBUG neutron.agent.linux.utils [-] Exit code: 0 execute /usr/lib/python2.7/dist-packages/neutron/agent/linux/utils.py:158
2018-02-14 05:04:32.103 1352665 DEBUG neutron.agent.linux.iptables_manager [-] IPTablesManager.apply completed with success. 0 iptables commands were issued _apply_synchronized /usr/lib/python2.7/dist-packages/neutron/agent/linux/iptables_manager.py:576
2018-02-14 05:04:32.103 1352665 DEBUG oslo_concurrency.lockutils [-] Releasing semaphore "iptables-qrouter-43801324-72ce-469f-a628-a5c645041e30" lock /usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py:228
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info [-] 'NoneType' object has no attribute 'remove_vip_by_ip_address'
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info Traceback (most recent call last):
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/common/utils.py", line 253, in call
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info return func(*args, **kwargs)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 1115, in process
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self.process_external()
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 890, in process_external
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self._process_external_gateway(ex_gw_port)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 777, in _process_external_gateway
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self.external_gateway_updated(ex_gw_port, interface_name)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 403, in external_gateway_updated
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info self._remove_vip(old_gateway_cidr)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 202, in _remove_vip
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info instance.remove_vip_by_ip_address(ip_cidr)
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info AttributeError: 'NoneType' object has no attribute 'remove_vip_by_ip_address'
2018-02-14 05:04:32.103 1352665 ERROR neutron.agent.l3.router_info
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent [-] Failed to process compatible router: 43801324-72ce-469f-a628-a5c645041e30
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 517, in _process_router_update
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self._process_router_if_compatible(router)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 454, in _process_router_if_compatible
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self._process_updated_router(router)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/agent.py", line 469, in _process_updated_router
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent ri.process()
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 426, in process
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent super(HaRouter, self).process()
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/common/utils.py", line 256, in call
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self.logger(e)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self.force_reraise()
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent six.reraise(self.type_, self.value, self.tb)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/common/utils.py", line 253, in call
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent return func(*args, **kwargs)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 1115, in process
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self.process_external()
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 890, in process_external
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self._process_external_gateway(ex_gw_port)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/router_info.py", line 777, in _process_external_gateway
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self.external_gateway_updated(ex_gw_port, interface_name)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 403, in external_gateway_updated
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent self._remove_vip(old_gateway_cidr)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent File "/usr/lib/python2.7/dist-packages/neutron/agent/l3/ha_router.py", line 202, in _remove_vip
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent instance.remove_vip_by_ip_address(ip_cidr)
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent AttributeError: 'NoneType' object has no attribute 'remove_vip_by_ip_address'
2018-02-14 05:04:32.104 1352665 ERROR neutron.agent.l3.agent
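To find which routers hit this, the agent log can be scanned for the "Failed to process compatible router" message seen above. The following is a minimal sketch with an invented function name; it assumes the message format stays exactly as in this log excerpt.

```python
import collections
import re

# Matches the L3 agent error line shown above and captures the router UUID.
FAILED_ROUTER = re.compile(
    r"Failed to process compatible router: "
    r"([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})"
)


def failed_routers(log_lines):
    """Count, per router ID, how often the agent failed to process it."""
    counts = collections.Counter()
    for line in log_lines:
        match = FAILED_ROUTER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file (for example the L3 agent log on each network node) gives a per-router failure count, which is a reasonable list of candidates for rescheduling.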