Activity log for bug #1648242

Date Who What changed Old value New value Message
2016-12-07 22:35:02 James Denton bug added bug
2016-12-08 07:16:23 Ann Taraday neutron: status New Confirmed
2016-12-08 07:16:41 Ann Taraday tags l3-ha
2016-12-08 07:17:35 Ann Taraday neutron: importance Undecided Low
2016-12-16 12:34:07 OpenStack Infra neutron: status Confirmed In Progress
2016-12-16 12:34:07 OpenStack Infra neutron: assignee Ann Taraday (akamyshnikova)
2017-01-19 17:42:33 OpenStack Infra neutron: status In Progress Fix Released
2017-01-24 01:37:09 OpenStack Infra tags l3-ha in-stable-newton l3-ha
2017-01-26 02:15:06 OpenStack Infra tags in-stable-newton l3-ha in-stable-mitaka in-stable-newton l3-ha
2017-03-08 05:59:11 Dominique Poulain bug added subscriber Dominique Poulain
2017-03-31 09:02:14 Edward Hope-Morley bug task added cloud-archive
2017-03-31 09:03:44 Louis Bouchard bug task added neutron (Ubuntu)
2017-03-31 09:03:54 Louis Bouchard nominated for series Ubuntu Xenial
2017-03-31 09:03:54 Louis Bouchard bug task added neutron (Ubuntu Xenial)
2017-03-31 09:04:33 Edward Hope-Morley nominated for series cloud-archive/mitaka
2017-04-05 20:32:41 Edward Hope-Morley description

Old value:

Version: Mitaka

While performing failover testing of L3 HA routers, we've discovered an issue regarding the failure of an agent to report its state. In this scenario, we have a router (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to three L3 agents:

+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id                                   | host                                             | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | :-)   | active   |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+

The infra03 node was shut down completely and abruptly. The router transitioned to master on infra02, as indicated by these log messages:

2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848

Traffic to/from VMs through the new master router functioned as expected. However, the ha_state remained 'standby':

+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id                                   | host                                             | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True           | xxx   | standby  |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True           | :-)   | standby  |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True           | :-)   | standby  |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+

A traceback related to a message timeout was observed in the logs, probably due to the loss of AMQP connectivity to infra03:

2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client
2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying again in 1 seconds.
2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 686, in _report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     True)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     return method(context, 'report_state', **kwargs)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     retry=self.retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     timeout=timeout, retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     result = self._waiter.wait(msg_id, timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     message = self.waiters.get(msg_id, timeout=timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent   File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent     'to message ID %s' % msg_id)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 03e3909c25d749908652c050eeeaec91
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent
2016-12-07 15:22:55.829 18450 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 0.01 sec
2016-12-07 15:23:01.086 18450 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 03e3909c25d749908652c050eeeaec91
2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 4 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_ha_routers_states calls to 120 seconds. Restart the agent to restore it to the default value.
2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 35 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.

While subsequent transitions for other routers completed and updated Neutron accordingly, this router's ha_state was never updated after the initial failure. We could force the update with a restart of the L3 agent, or by causing a transition from infra02 to another node and back. Neither is ideal.

If you need any other information, please let me know.

Thanks,
James

New value:

[Impact]

Mitigates the risk of an incorrect ha_state being reported by the l3-agent for HA routers when the RabbitMQ connection is lost during the update window. The fix is already in Ubuntu for Ocata and Newton, but the upstream backport just missed the Mitaka point release, hence this SRU.

[Test Case]

* Deploy OpenStack Mitaka (Xenial) with l3-ha enabled and min/max l3-agents-per-router set to 3.
* Configure a network and a router, boot an instance with a floating IP and start pinging it.
* Check the status: 1 agent should show active and 2 should show standby.
* Trigger some router failovers while the rabbit server is stopped, e.g. go to the l3-agent hosting your router and do:
    ip netns exec qrouter-${router} ip link set dev <ha-iface> down
    (check the other units to see whether the ha iface has failed over)
    ip netns exec qrouter-${router} ip link set dev <ha-iface> up
* Ensure the ping is still running.
* Eventually all agents will show xxx/standby.
* Start the rabbit server.
* Wait for the correct ha_state to be set (takes a few seconds).

[Regression Potential]

I do not envisage any regression from this patch. One potential side effect is mildly increased RabbitMQ traffic, but it should be negligible.

--------

(Followed by the original description, repeated verbatim; see Old value above.)
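For convenience, the failover steps from the [Test Case] above can be scripted. The sketch below is illustrative only: the router ID is the one from this report, the HA interface name is left as the <ha-iface> placeholder from the test case, and it assumes a node that hosts the qrouter namespace and has the Mitaka-era neutron CLI available (neutron l3-agent-list-hosting-router is assumed to be the command that produces the ha_state tables shown above).

    #!/usr/bin/env python
    # Illustrative sketch of the [Test Case] failover steps; not part of the fix.
    # ROUTER_ID is the router from this report; HA_IFACE is a placeholder.
    import subprocess

    ROUTER_ID = "7629f5d7-b205-4af5-8e0e-a3c4d15e7677"
    HA_IFACE = "<ha-iface>"  # placeholder, as in the test case above


    def run(cmd):
        """Run a shell command, echo it, and return its output."""
        print("+ %s" % cmd)
        return subprocess.check_output(cmd, shell=True).decode()


    # Show which agent currently reports 'active' for the router.
    print(run("neutron l3-agent-list-hosting-router %s" % ROUTER_ID))

    # Trigger a failover: take the HA interface down inside the router
    # namespace on the node hosting the active instance, then bring it back up.
    run("ip netns exec qrouter-%s ip link set dev %s down" % (ROUTER_ID, HA_IFACE))
    run("ip netns exec qrouter-%s ip link set dev %s up" % (ROUTER_ID, HA_IFACE))

    # Re-check ha_state; with the fix applied, the newly active agent should be
    # reported once RabbitMQ is reachable again.
    print(run("neutron l3-agent-list-hosting-router %s" % ROUTER_ID))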
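The bug summary ("Failure to retry update_ha_routers_states") points at the missing retry around the agent's state-report RPC. The following is a minimal, generic sketch of the retry-on-timeout idea, not the actual upstream patch; the names RPCTimeout, report_states, MAX_ATTEMPTS and RETRY_DELAY_SECONDS are hypothetical and do not come from the Neutron source.

    import logging
    import time

    LOG = logging.getLogger(__name__)

    MAX_ATTEMPTS = 3         # hypothetical retry budget
    RETRY_DELAY_SECONDS = 5  # hypothetical pause between attempts


    class RPCTimeout(Exception):
        """Stand-in for oslo.messaging's MessagingTimeout."""


    def notify_with_retry(report_states, router_states):
        """Call a state-report function, retrying if it times out.

        Without a retry, a single timeout (e.g. while RabbitMQ is failing
        over) leaves the server with a stale ha_state until the agent is
        restarted, which is the behaviour described in this bug.
        """
        for attempt in range(1, MAX_ATTEMPTS + 1):
            try:
                report_states(router_states)
                return True
            except RPCTimeout:
                LOG.warning("update_ha_routers_states timed out (attempt %d/%d)",
                            attempt, MAX_ATTEMPTS)
                time.sleep(RETRY_DELAY_SECONDS)
        return False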
2017-04-05 20:32:52 Edward Hope-Morley summary Failure to retry update_ha_routers_states [SRU] Failure to retry update_ha_routers_states
2017-04-05 20:33:11 Edward Hope-Morley tags in-stable-mitaka in-stable-newton l3-ha in-stable-mitaka in-stable-newton l3-ha sts sts-sru-needed
2017-04-05 20:33:34 Edward Hope-Morley attachment added lp1648242-xenial-mitaka.debdiff https://bugs.launchpad.net/ubuntu/xenial/+source/neutron/+bug/1648242/+attachment/4855750/+files/lp1648242-xenial-mitaka.debdiff
2017-04-06 00:28:17 Ubuntu Foundations Team Bug Bot tags in-stable-mitaka in-stable-newton l3-ha sts sts-sru-needed in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed
2017-04-06 00:28:25 Ubuntu Foundations Team Bug Bot bug added subscriber Ubuntu Sponsors Team
2017-04-10 09:50:46 Edward Hope-Morley neutron (Ubuntu Xenial): assignee Edward Hope-Morley (hopem)
2017-04-15 01:55:51 Mathew Hodson neutron (Ubuntu): importance Undecided Low
2017-04-15 01:55:55 Mathew Hodson neutron (Ubuntu Xenial): importance Undecided Low
2017-04-19 09:36:06 James Page neutron (Ubuntu Xenial): status New Triaged
2017-04-19 09:36:08 James Page neutron (Ubuntu): status New Fix Released
2017-04-19 09:36:13 James Page bug task added cloud-archive/mitaka
2017-04-19 09:36:21 James Page cloud-archive: status New Fix Released
2017-04-19 09:36:27 James Page cloud-archive/mitaka: importance Undecided Low
2017-04-19 09:36:30 James Page cloud-archive/mitaka: status New Triaged
2017-04-19 10:44:37 James Page removed subscriber Ubuntu Sponsors Team
2017-04-19 10:44:45 James Page bug added subscriber Ubuntu Stable Release Updates Team
2017-05-04 20:54:15 Brian Murray neutron (Ubuntu Xenial): status Triaged Fix Committed
2017-05-04 20:54:21 Brian Murray bug added subscriber SRU Verification
2017-05-04 20:54:29 Brian Murray tags in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed verification-needed
2017-05-10 17:45:23 Edward Hope-Morley tags in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed verification-needed in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed verification-done
2017-05-15 08:25:25 James Page cloud-archive/mitaka: status Triaged Fix Committed
2017-05-15 08:25:29 James Page tags in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed verification-done in-stable-mitaka in-stable-newton l3-ha patch sts sts-sru-needed verification-done verification-mitaka-needed
2017-05-15 17:34:56 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team
2017-05-15 17:45:02 Launchpad Janitor neutron (Ubuntu Xenial): status Fix Committed Fix Released
2017-05-17 08:11:32 James Page cloud-archive/mitaka: status Fix Committed Fix Released