[SRU] Failure to retry update_ha_routers_states

Bug #1648242 reported by James Denton on 2016-12-07
This bug affects 2 people
Affects (importance, assignee):
  Ubuntu Cloud Archive: Undecided, Unassigned
    Mitaka: Low, Unassigned
  neutron: Low, Ann Taraday
  neutron (Ubuntu): Low, Unassigned
    Xenial: Low, Edward Hope-Morley

Bug Description

[Impact]

  Mitigates the risk of an incorrect ha_state being reported by the
  l3-agent for HA routers when the RabbitMQ connection is lost during
  the update window. The fix is already in Ubuntu for Ocata and Newton,
  but the upstream backport just missed the Mitaka point release, hence
  this SRU.

[Test Case]

  * deploy OpenStack Mitaka (Xenial) with l3-ha enabled and min/max l3-agents-per-router set to 3

  * configure a network and router, boot an instance with a floating IP and start pinging it

  * check that exactly 1 agent shows the router as active and 2 show standby

  * trigger some router failovers while the rabbit server is stopped, e.g.

    - go to l3-agent hosting your router and do:

      ip netns exec qrouter-${router} ip link set dev <ha-iface> down

      check the other units to see whether the HA interface has failed over

      ip netns exec qrouter-${router} ip link set dev <ha-iface> up

  * ensure ping still running

  * eventually all agents will be xxx/standby

  * start rabbit server

  * wait for correct ha_state to be set (takes a few seconds)

[Regression Potential]

 I do not envisage any regression from this patch. One potential side effect is
 mildly increased RabbitMQ traffic, but it should be negligible.

--------

Version: Mitaka

While performing failover testing of L3 HA routers, we've discovered an issue with an agent failing to report its state.

In this scenario, we have a router (7629f5d7-b205-4af5-8e0e-a3c4d15e7677) scheduled to (3) L3 agents:

+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True | :-) | active |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True | :-) | standby |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True | :-) | standby |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+

The infra03 node was shut down completely and abruptly. The router transitioned to master on infra02 as indicated in these log messages:

2016-12-06 16:15:06.457 18450 INFO neutron.agent.linux.interface [-] Device qg-d48918fa-eb already exists
2016-12-07 15:16:51.145 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to master
2016-12-07 15:16:51.811 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:16:51] "GET / HTTP/1.1" 200 115 0.666464
2016-12-07 15:18:29.167 18450 INFO neutron.agent.l3.ha [-] Router c8b5d5b7-ab57-4f56-9838-0900dc304af6 transitioned to backup
2016-12-07 15:18:29.229 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:18:29] "GET / HTTP/1.1" 200 115 0.062110
2016-12-07 15:21:48.870 18450 INFO neutron.agent.l3.ha [-] Router 7629f5d7-b205-4af5-8e0e-a3c4d15e7677 transitioned to master
2016-12-07 15:21:49.537 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:21:49] "GET / HTTP/1.1" 200 115 0.667920
2016-12-07 15:22:08.796 18450 INFO neutron.agent.l3.ha [-] Router 4676e7a5-279c-4114-8674-209f7fd5ab1a transitioned to master
2016-12-07 15:22:09.515 18450 INFO eventlet.wsgi.server [-] <local> - - [07/Dec/2016 15:22:09] "GET / HTTP/1.1" 200 115 0.719848

Traffic to/from VMs through the new master router functioned as expected. However, the ha_state remained 'standby':

+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| id | host | admin_state_up | alive | ha_state |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+
| 4434f999-51d0-4bbb-843c-5430255d5c64 | 726404-infra03-neutron-agents-container-a8bb0b1f | True | xxx | standby |
| 710e7768-df47-4bfe-917f-ca35c138209a | 726402-infra01-neutron-agents-container-fc937477 | True | :-) | standby |
| 7f0888ba-1e8a-4a36-8394-6448b8c606fb | 726403-infra02-neutron-agents-container-0338af5a | True | :-) | standby |
+--------------------------------------+--------------------------------------------------+----------------+-------+----------+

A traceback was observed in the logs related to a message timeout, likely due to the loss of the AMQP server on infra03:

2016-12-07 15:22:30.525 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:36.537 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.155:5671 is unreachable: timed out. Trying again in 1 seconds.
2016-12-07 15:22:37.553 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.238.65:5671 via [amqp] client
2016-12-07 15:22:51.210 18450 ERROR oslo.messaging._drivers.impl_rabbit [-] AMQP server on 172.29.237.246:5671 is unreachable: Basic.cancel: (0) 1. Trying again in 1 seconds.
2016-12-07 15:22:52.262 18450 INFO oslo.messaging._drivers.impl_rabbit [-] Reconnected to AMQP server on 172.29.237.246:5671 via [amqp] client
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent [-] Failed reporting state!
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent Traceback (most recent call last):
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/l3/agent.py", line 686, in _report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent True)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/neutron/agent/rpc.py", line 87, in report_state
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent return method(context, 'report_state', **kwargs)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 158, in call
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=self.retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/transport.py", line 90, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent timeout=timeout, retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 470, in send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent retry=retry)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 459, in _send
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent result = self._waiter.wait(msg_id, timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 342, in wait
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent message = self.waiters.get(msg_id, timeout=timeout)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent File "/openstack/venvs/neutron-13.3.9/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 244, in get
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent 'to message ID %s' % msg_id)
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent MessagingTimeout: Timed out waiting for a reply to message ID 03e3909c25d749908652c050eeeaec91
2016-12-07 15:22:55.827 18450 ERROR neutron.agent.l3.agent
2016-12-07 15:22:55.829 18450 WARNING oslo.service.loopingcall [-] Function 'neutron.agent.l3.agent.L3NATAgentWithStateReport._report_state' run outlasted interval by 0.01 sec
2016-12-07 15:23:01.086 18450 INFO oslo_messaging._drivers.amqpdriver [-] No calling threads waiting for msg_id : 03e3909c25d749908652c050eeeaec91
2016-12-07 15:23:45.180 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 4 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.
2016-12-07 15:23:45.181 18450 WARNING neutron.common.rpc [-] Increasing timeout for update_ha_routers_states calls to 120 seconds. Restart the agent to restore it to the default value.
2016-12-07 15:23:51.941 18450 ERROR neutron.common.rpc [-] Timeout in RPC method update_ha_routers_states. Waiting for 35 seconds before next attempt. If the server is not down, consider increasing the rpc_response_timeout option as Neutron server(s) may be overloaded and unable to respond quickly enough.

While subsequent transitions for other routers completed and updated Neutron accordingly, this router's ha_state was never updated after the initial failure. We could force an update by restarting the L3 agent, or by triggering a transition from infra02 to another node and back. Neither is ideal.
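The timeout escalation visible in the log above ("Increasing timeout for update_ha_routers_states calls to 120 seconds") can be sketched as follows. This is an illustrative model of the observed behaviour, not neutron's actual RPC code; BackoffClientSketch, MessagingTimeoutSketch and the send callable are hypothetical names:

```python
class MessagingTimeoutSketch(Exception):
    """Stand-in for oslo.messaging's MessagingTimeout."""


class BackoffClientSketch:
    """Illustrative model of the behaviour in the log above: each RPC
    timeout doubles the per-method timeout, up to a ceiling, before the
    next attempt (a 60s default becomes 120s after the first failure)."""

    def __init__(self, send, start_timeout=60, max_timeout=600):
        self._send = send            # callable(method, timeout=...) -> reply
        self._timeouts = {}          # per-method timeout, grows on failure
        self._start = start_timeout
        self._max = max_timeout

    def call(self, method, **kwargs):
        timeout = self._timeouts.get(method, self._start)
        try:
            return self._send(method, timeout=timeout, **kwargs)
        except MessagingTimeoutSketch:
            # Double the timeout used for the next attempt, as the
            # "Increasing timeout ... to 120 seconds" log line shows.
            self._timeouts[method] = min(timeout * 2, self._max)
            raise
```

Note that in this model (as in the logs) the timeout only grows after a failure; nothing here re-sends the lost state update, which is the gap this bug is about.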

If you need any other information please let me know.

Thanks,
James

Ann Taraday (akamyshnikova) wrote :

I've seen the same issue.

Changed in neutron:
status: New → Confirmed
tags: added: l3-ha
Changed in neutron:
importance: Undecided → Low
Ann Taraday (akamyshnikova) wrote :

When I was looking at this issue I put a LOG.debug("...") at https://github.com/openstack/neutron/blob/master/neutron/agent/l3/keepalived_state_change.py#L106 and the master appeared as expected. I think this was a coincidence, and I'm also only testing in a virtual environment.

I'm not sure what the right solution is here: perhaps check on the server whether all agents are standby and, if so, notify the agents to recheck which one of them is the master?

Can you verify the version of keepalived being used? I believe the version was important for this feature.

James Denton (james-denton) wrote :

Jesse - Looks like Keepalived v1.2.7 (08/14,2013)

Ann Taraday (akamyshnikova) wrote :

I've seen this with v1.2.13. I don't think the Keepalived version is relevant here, as the issue is on the Neutron side.

Related patches could be https://review.openstack.org/#/c/300245/ and https://review.openstack.org/#/c/364803/

Ann Taraday (akamyshnikova) wrote :

Checked both patches; the issue still appears.

Does infra03 house your only rabbitmq service? It seems VRRP is operating normally underneath: if you look at your state_path/neutron/ha_conf/{router_id}/state you'll probably see VRRP working without issue. The only way the l3 agent can update the neutron API is through rabbitmq, and if that's not available, no update will be made until the service returns.

Not sure what can be done here if rabbitmq dies as well.

Ann Taraday (akamyshnikova) wrote :

Yes, you are correct: if I check state_path/neutron/ha_conf/{router_id}/state it shows which agent is active for the router. As I said in comment #2, one option could be to notify the L3 agent to recheck states when the server sees all agents as standby, but I haven't tried anything like that yet.
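Reading the agent's local view from that state file can be sketched as below. This is a minimal sketch: the path layout follows the comments above, and the file contents ('master'/'backup') are an assumption based on keepalived's notifier scripts, not verified against neutron's code:

```python
from pathlib import Path


def local_ha_state(state_path, router_id):
    """Return the locally observed HA state for a router, as written by
    keepalived's state-change notifier under
    <state_path>/ha_conf/<router_id>/state (layout assumed from the
    comments above)."""
    state_file = Path(state_path) / 'ha_conf' / router_id / 'state'
    return state_file.read_text().strip()
```

Comparing this local value against what the API reports is exactly the discrepancy described in this bug: the file says master while the server still shows standby.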

@Ann, so the main question I had didn't get answered, since failure testing of infra01/02 is OK. The main thing we need to understand is whether the rabbit node is being killed during this testing on infra03. No matter what solution is proposed, if there's no RPC, no states will get updated. The reason I ask is that failure testing on infra01/02 shows the states get updated just fine.

James Denton (james-denton) wrote :

@Brian - This is an OSA/RPC build with RabbitMQ clustering across 3 infra nodes. In this test we simply wanted to kill the server where the router was 'master' and observe the behavior. As you pointed out, VRRP under the hood was operating correctly - one of the two remaining nodes took over as master and the user did not sustain a long period of downtime. However, there was no clear indication from the API which infra node had taken over, and there was no way to update that state without either restarting the L3 agent(s) or causing a transition through manual means.

It looks like the state notification will choose only one RPC node, and if that node becomes unavailable there doesn't seem to be enough time to detect the failure before the state change notification fails:

https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L679

It might be best to detect a failed send, determine whether other nodes are available, and if so wait for a reconnect to rabbit before trying again. I'll have to play with this a bit and test out some solutions.

I was able to duplicate this in devstack. However, oslo handles the reconnect to the secondary rabbit host correctly and reissues the call to the newly connected host. Checking stable/mitaka for the differences.

@James, what is the oslo.messaging version?

@James, and also what version is kombu?

So this is a very narrow window that is being hit:

https://github.com/openstack/neutron/blob/master/neutron/common/rpc.py#L127

When the oslo.messaging driver fetches the connection, it has to hit a tiny window when it views the socket as not timed out:

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/amqpdriver.py#L419

The connection is validated and if it's viewed as a valid connection, the message is sent:

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L705

Ultimately, we get the traceback when it times out waiting for the reply, and this is handled by simply killing the request:

https://github.com/openstack/neutron/blob/master/neutron/common/rpc.py#L128

Now, we could add an rpc_ha_retry_limit option to attempt another send, basically another parameter that we act on. This narrow window also explains why the issue is intermittent: whether you hit it depends on how quickly everything falls into place.
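The retry idea above could look roughly like this. To be clear, rpc_ha_retry_limit does not exist in neutron; send_once and SendTimeout are stand-ins for the real RPC call and oslo.messaging's MessagingTimeout, so this is a sketch of the proposal, not an implementation:

```python
import time


class SendTimeout(Exception):
    """Stand-in for oslo.messaging's MessagingTimeout."""


def send_with_retries(send_once, payload, retry_limit=3, delay=1.0,
                      sleep=time.sleep):
    """Retry the HA-state notification a bounded number of times instead
    of dropping it on the first timeout (the hypothetical
    rpc_ha_retry_limit discussed above)."""
    last_error = None
    for attempt in range(retry_limit):
        try:
            return send_once(payload)
        except SendTimeout as exc:
            last_error = exc
            # Give the messaging layer a moment to reconnect to another
            # rabbit node before the next attempt.
            sleep(delay)
    raise last_error
```

The sleep parameter is injectable mainly so the backoff can be tested without real waits; a real implementation would also need to cope with the agent shutting down mid-retry.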

Some questions though:
If the host dies, how can the amqp client view the connection as valid and send anything if it can't establish a connection to the rabbit host? How is it able to accept a connection?

Is it possible that ensure_publishing isn't handling a disconnect well? Or kombu?

https://github.com/openstack/oslo.messaging/blob/master/oslo_messaging/_drivers/impl_rabbit.py#L1156

Fix proposed to branch: master
Review: https://review.openstack.org/411784

Changed in neutron:
assignee: nobody → Ann Taraday (akamyshnikova)
status: Confirmed → In Progress

Regarding the versions, we run the following versions of kombu and oslo.messaging:

kombu (3.0.34)
oslo.messaging (4.6.1)

Reviewed: https://review.openstack.org/411784
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=1927da1bc7c4e56162dd3704d58d3b922d4ebce9
Submitter: Jenkins
Branch: master

commit 1927da1bc7c4e56162dd3704d58d3b922d4ebce9
Author: AKamyshnikova <email address hidden>
Date: Fri Dec 16 16:00:59 2016 +0400

    Add check for ha state

    If all agents are shown as a standby it is possible changing state
    were lost due to problems with RabbitMQ. Current change adds check
    for ha state in fetch_and_sync_all_routers. If state is different -
    notify server that state should be changed.

    Also change _get_bindings_and_update_router_state_for_dead_agents
    to set standby for dead agent only in case we have more than one
    active.

    Change-Id: If5596eb24041ea9fae1d5d2563dcaf655c5face7
    Closes-bug:#1648242
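A simplified sketch of the idea in the commit message above: when the agent resyncs routers, any router whose server-side HA state disagrees with the agent's local state should be re-notified. This is not neutron's actual code; the dict-based interface is an assumption made for illustration:

```python
def states_to_renotify(server_states, local_states):
    """Return the router_id -> state pairs the agent should resend.

    server_states / local_states map router_id to 'active' or 'standby'.
    If the server's view disagrees with what the agent knows locally
    (e.g. every binding is 'standby' because a transition was lost while
    RabbitMQ was down), the local state wins and is re-sent.
    """
    return {
        router_id: local
        for router_id, local in local_states.items()
        if server_states.get(router_id) != local
    }
```

In the scenario from this bug, the server shows all three bindings as standby while infra02's agent knows locally that it is active, so the active state would be re-sent on the next sync instead of requiring an agent restart.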

Changed in neutron:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/423055
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=ff7c5c25d3845e4cbdd318389c02564290a7a0f2
Submitter: Jenkins
Branch: stable/newton

commit ff7c5c25d3845e4cbdd318389c02564290a7a0f2
Author: AKamyshnikova <email address hidden>
Date: Fri Dec 16 16:00:59 2016 +0400

    Add check for ha state

    If all agents are shown as a standby it is possible changing state
    were lost due to problems with RabbitMQ. Current change adds check
    for ha state in fetch_and_sync_all_routers. If state is different -
    notify server that state should be changed.

    Also change _get_bindings_and_update_router_state_for_dead_agents
    to set standby for dead agent only in case we have more than one
    active.

    (cherry picked from commit 1927da1bc7c4e56162dd3704d58d3b922d4ebce9)
    Change-Id: If5596eb24041ea9fae1d5d2563dcaf655c5face7
    Closes-bug:#1648242

tags: added: in-stable-newton

Reviewed: https://review.openstack.org/423064
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c8a4fa46948e5026923ae22eb75aeeb4a93161c4
Submitter: Jenkins
Branch: stable/mitaka

commit c8a4fa46948e5026923ae22eb75aeeb4a93161c4
Author: AKamyshnikova <email address hidden>
Date: Fri Dec 16 16:00:59 2016 +0400

    Add check for ha state

    If all agents are shown as a standby it is possible changing state
    were lost due to problems with RabbitMQ. Current change adds check
    for ha state in fetch_and_sync_all_routers. If state is different -
    notify server that state should be changed.

    Also change _get_bindings_and_update_router_state_for_dead_agents
    to set standby for dead agent only in case we have more than one
    active.

    (cherry picked from commit 1927da1bc7c4e56162dd3704d58d3b922d4ebce9)
    Change-Id: If5596eb24041ea9fae1d5d2563dcaf655c5face7
    Closes-bug:#1648242

tags: added: in-stable-mitaka

This issue was fixed in the openstack/neutron 10.0.0.0b3 development milestone.

This issue was fixed in the openstack/neutron 9.2.0 release.

Waiting for the neutron 9.2.0 point release (bug 1664306) to hit the Newton UCA (it has already landed in Ubuntu Yakkety) before submitting for Xenial/Mitaka.

Edward Hope-Morley (hopem) wrote :
description: updated
summary: - Failure to retry update_ha_routers_states
+ [SRU] Failure to retry update_ha_routers_states
tags: added: sts sts-sru-needed
Edward Hope-Morley (hopem) wrote :

Newton 9.2.0 is now fully released, so proceeding with the Mitaka SRU.

The attachment "lp1648242-xenial-mitaka.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Changed in neutron (Ubuntu Xenial):
assignee: nobody → Edward Hope-Morley (hopem)
Changed in neutron (Ubuntu):
importance: Undecided → Low
Changed in neutron (Ubuntu Xenial):
importance: Undecided → Low
James Page (james-page) on 2017-04-19
Changed in neutron (Ubuntu Xenial):
status: New → Triaged
Changed in neutron (Ubuntu):
status: New → Fix Released
Changed in cloud-archive:
status: New → Fix Released
James Page (james-page) wrote :

Uploaded to xenial-proposed for SRU team review as part of 2:8.4.0-0ubuntu2

Hello James, or anyone else affected,

Accepted neutron into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:8.4.0-0ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed
Edward Hope-Morley (hopem) wrote :

Tested on Xenial Mitaka and lgtm.

tags: added: verification-done
removed: verification-needed
James Page (james-page) wrote :

Hello James, or anyone else affected,

Accepted neutron into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2

---------------
neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium

  [ Edward Hope-Morley ]
  * Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
    - d/p/add-check-for-ha-state.patch

  [ Chuck Short ]
  * d/neutron-common.install, d/neutron-dhcp-agent.install:
    Remove cron jobs since they will cause a race when
    using an L3 agent. The L3 agent cleans up after itself now.
    (LP: #1623664)

 -- Chuck Short <email address hidden> Wed, 19 Apr 2017 11:39:09 +0100

Changed in neutron (Ubuntu Xenial):
status: Fix Committed → Fix Released
James Page (james-page) wrote :

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package neutron - 2:8.4.0-0ubuntu2~cloud0
---------------

 neutron (2:8.4.0-0ubuntu2~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 neutron (2:8.4.0-0ubuntu2) xenial; urgency=medium
 .
   [ Edward Hope-Morley ]
   * Backport fix for Failure to retry update_ha_routers_states (LP: #1648242)
     - d/p/add-check-for-ha-state.patch
 .
   [ Chuck Short ]
   * d/neutron-common.install, d/neutron-dhcp-agent.install:
     Remove cron jobs since they will cause a race when
     using an L3 agent. The L3 agent cleans up after itself now.
     (LP: #1623664)
