Version: Mitaka
While performing failover testing of L3 HA routers, we've discovered an issue where an agent fails to report its state.
In this scenario, we have a router (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to three L3 agents:
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id                                   | host                                             | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
The infra03 node was shut down completely and abruptly. The router transitioned to master on infra02 as indicated in these log messages:
2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848
Traffic to/from VMs through the new master router functioned as expected. However, the ha_state remained 'standby':
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id                                   | host                                             | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | xxx   | standby  |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
A traceback related to a messaging timeout was observed in the logs, probably due to the loss of the AMQP server on infra03:
2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client
2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying again in 1 seconds.
2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 686, in _report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent True)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent return method(context, 'report_state', **kwargs)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=self.retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent timeout=timeout, retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent result = self._waiter.wait(msg_id, timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent message = self.waiters.get(msg_id, timeout=timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent 'to message ID %s' % msg_id)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 03e3909c25d749908652c050eeeaec91
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent
2016-12-07 15:22:55.829 18450 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 0.01 sec
2016-12-07 15:23:01.086 18450 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 03e3909c25d749908652c050eeeaec91
2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 4 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_ha_routers_states calls to 120 seconds. Restart the agent to restore it to the default value.
2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 35 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
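For what it's worth, the last three messages describe an escalation pattern: on each RPC timeout the agent waits a randomized number of seconds, then doubles the per-method timeout (from the 60-second rpc_response_timeout default to 120). A rough sketch of that behavior, purely as a hypothetical illustration (the class name is made up; this is not Neutron's actual code):

```python
import random

class TimeoutEscalator:
    """Hypothetical sketch of the per-RPC-method timeout escalation the
    log messages above describe; not Neutron's actual implementation."""

    def __init__(self, default=60):  # 60s is the rpc_response_timeout default
        self.timeout = default

    def on_timeout(self):
        # "Increasing timeout for update_ha_routers_states calls to 120 seconds."
        self.timeout *= 2
        # "Waiting for 4 seconds before next attempt." -- a randomized backoff
        return random.randint(0, self.timeout)

esc = TimeoutEscalator()
backoff = esc.on_timeout()
print(esc.timeout)  # 120 after the first timeout, matching the log
```

Note that the doubled timeout persists until the agent is restarted, which matches the "Restart the agent to restore it to the default value" warning.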
While subsequent transitions for other routers were processed and updated Neutron accordingly, this router's ha_state was never updated after the initial failure. We could force an update by restarting the L3 agent, or by causing a transition from infra02 to another node and back. Neither is ideal.
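For completeness, the mitigation the log itself suggests is raising rpc_response_timeout in neutron.conf, which would look something like the fragment below. That only helps when the Neutron servers are overloaded rather than unreachable, so it would not have prevented the stuck ha_state seen here:

```ini
[DEFAULT]
# Seconds to wait for an RPC response before raising MessagingTimeout
# (the default is 60).
rpc_response_timeout = 120
```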
If you need any other information please let me know.
Thanks,
James