[SRU] Failure to retry update_ha_routers_states
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ubuntu Cloud Archive | Fix Released | Undecided | Unassigned |
Mitaka | Fix Released | Low | Unassigned |
neutron | Fix Released | Low | Ann Taraday |
neutron (Ubuntu) | Fix Released | Low | Unassigned |
Xenial | Fix Released | Low | Edward Hope-Morley |
Bug Description
[Impact]
Mitigates the risk of an incorrect ha_state being reported by the l3-agent for
HA routers when the RabbitMQ connection is lost during the update window. The
fix is already in Ubuntu for Ocata and Newton, but the upstream backport just
missed the Mitaka PR, hence this SRU.
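A minimal sketch of the retry pattern the fix applies, not the actual neutron patch: the function name, MAX_ATTEMPTS, and the backoff are assumptions for illustration; update_ha_routers_states is the server-side RPC method named in this bug.

```python
import time

import oslo_messaging

MAX_ATTEMPTS = 5  # assumed value, for illustration only


def report_ha_states_with_retry(rpc_client, context, host, states):
    """Retry the update_ha_routers_states RPC instead of dropping the
    state report after a single MessagingTimeout (e.g. a lost rmq link)."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            cctxt = rpc_client.prepare()
            return cctxt.call(context, 'update_ha_routers_states',
                              host=host, states=states)
        except oslo_messaging.MessagingTimeout:
            if attempt == MAX_ATTEMPTS:
                raise
            time.sleep(2 ** attempt)  # simple backoff before retrying
```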
[Test Case]
* deploy OpenStack Mitaka (Xenial) with l3-ha enabled and min/max l3-agents configured
* configure a network and a router, boot an instance with a floating ip and start pinging it
* check that one agent shows the router as active and two show standby
* trigger some router failovers while the rabbit server is stopped, e.g.:
  - go to the l3-agent hosting your router and do:
    ip netns exec qrouter-${router} ip link set dev <ha-iface> down
  - check the other units to see whether the ha iface has failed over
  - ip netns exec qrouter-${router} ip link set dev <ha-iface> up
* ensure the ping is still running
* eventually all agents will show xxx/standby
* start the rabbit server
* wait for the correct ha_state to be set (takes a few seconds); a polling sketch follows below
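For that last step, a hedged convenience sketch using python-neutronclient to poll the agents hosting the router. The OS_* environment variables and the router id are assumed to be supplied by you, and the ha_state field assumes the l3-ha extension is enabled:

```python
import os
import time

from neutronclient.v2_0 import client

# Credentials are assumed to be in the usual OS_* environment variables.
neutron = client.Client(
    username=os.environ['OS_USERNAME'],
    password=os.environ['OS_PASSWORD'],
    tenant_name=os.environ['OS_TENANT_NAME'],
    auth_url=os.environ['OS_AUTH_URL'],
)

ROUTER_ID = 'your-router-id'  # placeholder: substitute your router's id

# Poll until exactly one agent reports the router as 'active' again.
while True:
    agents = neutron.list_l3_agent_hosting_routers(ROUTER_ID)['agents']
    states = {a['host']: a.get('ha_state') for a in agents}
    print(states)
    if list(states.values()).count('active') == 1:
        break
    time.sleep(5)
```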
[Regression Potential]
I do not envisage any regression from this patch. One potential side effect is
mildly increased RabbitMQ traffic, but it should be negligible.
--------
Version: Mitaka
While performing failover testing of L3 HA routers, we've discovered an issue where an agent fails to report its state.
In this scenario, we have a router (7629f5d7-
+------
| id | host | admin_state_up | alive | ha_state |
+------
| 4434f999-
| 710e7768-
| 7f0888ba-
+------
The infra03 node was shut down completely and abruptly. The router transitioned to master on infra02 as indicated in these log messages:
2016-12-06 16:15:06.457 18450 INFO neutron.
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-
2016-12-07 15:16:51.811 18450 INFO eventlet.
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-
2016-12-07 15:18:29.229 18450 INFO eventlet.
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-
2016-12-07 15:21:49.537 18450 INFO eventlet.
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-
2016-12-07 15:22:09.515 18450 INFO eventlet.
Traffic to/from VMs through the new master router functioned as expected. However, the ha_state remained 'standby':
+------
| id | host | admin_state_up | alive | ha_state |
+------
| 4434f999-
| 710e7768-
| 7f0888ba-
+------
A traceback related to a message timeout was observed in the logs, probably because AMQP connectivity to infra03 was cut:
2016-12-07 15:22:30.525 18450 ERROR oslo.messaging.
2016-12-07 15:22:36.537 18450 ERROR oslo.messaging.
2016-12-07 15:22:37.553 18450 INFO oslo.messaging.
2016-12-07 15:22:51.210 18450 ERROR oslo.messaging.
2016-12-07 15:22:52.262 18450 INFO oslo.messaging.
2016-12-07 15:22:55.827 18450 ERROR neutron. (traceback truncated)
2016-12-07 15:22:55.829 18450 WARNING oslo.service.
2016-12-07 15:23:01.086 18450 INFO oslo_messaging.
2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_
2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_
2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_
While subsequent transitions for other routers worked and updated Neutron accordingly, this router's ha_state was never updated after the initial failure. We could force the update by restarting the L3 agent, or by causing a transition from infra02 to another node and back, but neither is ideal.
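Until the fix is in place, the stuck state is at least detectable. A hedged sketch (same python-neutronclient assumptions as the polling sketch above) that flags HA routers for which no agent reports 'active', which is the symptom seen here:

```python
def find_stale_ha_routers(neutron):
    """Return ids of HA routers for which no l3-agent reports 'active',
    i.e. routers showing the stuck-standby symptom from this bug."""
    stale = []
    for router in neutron.list_routers()['routers']:
        if not router.get('ha'):
            continue  # only HA routers carry a meaningful ha_state
        agents = neutron.list_l3_agent_hosting_routers(
            router['id'])['agents']
        if not any(a.get('ha_state') == 'active' for a in agents):
            stale.append(router['id'])
    return stale
```

Running this periodically would surface routers that need a manual agent restart or forced transition.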
If you need any other information, please let me know.
Thanks,
James
Changed in neutron (Ubuntu Xenial):
assignee: nobody → Edward Hope-Morley (hopem)
Changed in neutron (Ubuntu):
importance: Undecided → Low
Changed in neutron (Ubuntu Xenial):
importance: Undecided → Low
Changed in neutron (Ubuntu Xenial):
status: New → Triaged
Changed in neutron (Ubuntu):
status: New → Fix Released
Changed in cloud-archive:
status: New → Fix Released
I've seen the same issue.