neutron-openvswitch-agent stuck on no queue 'q-agent-notifier-port-update_fanout..

Bug #1393391 reported by Miguel Angel Ajo on 2014-11-17
34
This bug affects 5 people
Affects Status Importance Assigned to Milestone
neutron
Undecided
Unassigned
neutron (Ubuntu)
Trusty
Medium
Unassigned

Bug Description

Under an HA deployment, neutron-openvswitch-agent can get stuck
when receiving a close command on a fanout queue the agent is not subscribed to.

It stops responding to any other messages, so it stops effectively working at all.

2014-11-11 10:27:33.092 3027 INFO neutron.common.config [-] Logging enabled!
2014-11-11 10:27:34.285 3027 INFO neutron.openstack.common.rpc.common [req-66ba318b-0fcc-42c2-959e-9a5233c292ef None] Connected to AMQP server on vip-rabbitmq:5672
2014-11-11 10:27:34.370 3027 INFO neutron.openstack.common.rpc.common [req-66ba318b-0fcc-42c2-959e-9a5233c292ef None] Connected to AMQP server on vip-rabbitmq:5672
2014-11-11 10:27:35.348 3027 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-66ba318b-0fcc-42c2-959e-9a5233c292ef None] Agent initialized successfully, now running...
2014-11-11 10:27:35.351 3027 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-66ba318b-0fcc-42c2-959e-9a5233c292ef None] Agent out of sync with plugin!
2014-11-11 10:27:35.401 3027 INFO neutron.plugins.openvswitch.agent.ovs_neutron_agent [req-66ba318b-0fcc-42c2-959e-9a5233c292ef None] Agent tunnel out of sync with plugin!
2014-11-11 10:27:35.414 3027 INFO neutron.openstack.common.rpc.common [req-66ba318b-0fcc-42c2-959e-9a5233c292ef None] Connected to AMQP server on vip-rabbitmq:5672
2014-11-11 10:32:33.143 3027 INFO neutron.agent.securitygroups_rpc [req-22c7fa11-882d-4278-9f83-6dd56ab95ba4 None] Security group member updated [u'4c7b3ad2-4526-48a7-959e-a8b8e4da6413']
2014-11-11 10:58:11.916 3027 INFO neutron.agent.securitygroups_rpc [req-484fd71f-8f61-496c-aa8a-2d3abf8de365 None] Security group member updated [u'4c7b3ad2-4526-48a7-959e-a8b8e4da6413']
2014-11-11 10:59:43.954 3027 INFO neutron.agent.securitygroups_rpc [req-2c0bc777-04ed-470a-aec5-927a59100b89 None] Security group member updated [u'4c7b3ad2-4526-48a7-959e-a8b8e4da6413']
2014-11-11 11:00:22.500 3027 INFO neutron.agent.securitygroups_rpc [req-df447d01-d132-40f2-8528-1c1c4d57c0f5 None] Security group member updated [u'4c7b3ad2-4526-48a7-959e-a8b8e4da6413']
2014-11-12 01:27:35.662 3027 ERROR neutron.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 579, in ensure
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common return method(*args, **kwargs)
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 659, in _consume
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 281, in drain_events
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 94, in drain_events
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common return connection.drain_events(**kwargs)
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 266, in drain_events
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common chanmap, None, timeout=timeout,
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 328, in _wait_multiple
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common channel, method_sig, args, content = read_timeout(timeout)
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 292, in read_timeout
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common return self.method_reader.read_method()
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 192, in read_method
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common raise m
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common IOError: Socket closed
2014-11-12 01:27:35.662 3027 TRACE neutron.openstack.common.rpc.common
2014-11-12 01:27:35.695 3027 INFO neutron.openstack.common.rpc.common [-] Reconnecting to AMQP server on vip-rabbitmq:5672
2014-11-12 01:27:35.722 3027 INFO neutron.openstack.common.rpc.common [-] Connected to AMQP server on vip-rabbitmq:5672
2014-11-12 02:00:22.682 3027 ERROR neutron.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common Traceback (most recent call last):
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 579, in ensure
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common return method(*args, **kwargs)
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 659, in _consume
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/connection.py", line 281, in drain_events
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/kombu/transport/pyamqp.py", line 94, in drain_events
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common return connection.drain_events(**kwargs)
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 266, in drain_events
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common chanmap, None, timeout=timeout,
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 328, in _wait_multiple
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common channel, method_sig, args, content = read_timeout(timeout)
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/connection.py", line 292, in read_timeout
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common return self.method_reader.read_method()
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common File "/usr/lib/python2.7/site-packages/amqp/method_framing.py", line 192, in read_method
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common raise m
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common IOError: Socket closed
2014-11-12 02:00:22.682 3027 TRACE neutron.openstack.common.rpc.common
2014-11-12 02:00:22.683 3027 INFO neutron.openstack.common.rpc.common [-] Reconnecting to AMQP server on vip-rabbitmq:5672
2014-11-12 02:00:23.017 3027 INFO neutron.openstack.common.rpc.common [-] Connected to AMQP server on vip-rabbitmq:5672
2014-11-12 02:00:23.021 3027 ERROR root [-] Unexpected exception occurred 1 time(s)... retrying.
2014-11-12 02:00:23.021 3027 TRACE root Traceback (most recent call last):
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/excutils.py", line 92, in inner_func
2014-11-12 02:00:23.021 3027 TRACE root return infunc(*args, **kwargs)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 746, in _consumer_thread
2014-11-12 02:00:23.021 3027 TRACE root self.consume()
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 737, in consume
2014-11-12 02:00:23.021 3027 TRACE root six.next(it)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 664, in iterconsume
2014-11-12 02:00:23.021 3027 TRACE root yield self.ensure(_error_callback, _consume)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 579, in ensure
2014-11-12 02:00:23.021 3027 TRACE root return method(*args, **kwargs)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 657, in _consume
2014-11-12 02:00:23.021 3027 TRACE root queues_tail.consume(nowait=False)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 190, in consume
2014-11-12 02:00:23.021 3027 TRACE root self.queue.consume(*args, callback=_callback, **options)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/kombu/entity.py", line 598, in consume
2014-11-12 02:00:23.021 3027 TRACE root nowait=nowait)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 1769, in basic_consume
2014-11-12 02:00:23.021 3027 TRACE root (60, 21), # Channel.basic_consume_ok
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 71, in wait
2014-11-12 02:00:23.021 3027 TRACE root return self.dispatch_method(method_sig, args, content)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/amqp/abstract_channel.py", line 88, in dispatch_method
2014-11-12 02:00:23.021 3027 TRACE root return amqp_method(self, args)
2014-11-12 02:00:23.021 3027 TRACE root File "/usr/lib/python2.7/site-packages/amqp/channel.py", line 224, in _close
2014-11-12 02:00:23.021 3027 TRACE root raise ChannelError(reply_code, reply_text, (class_id, method_id))
2014-11-12 02:00:23.021 3027 TRACE root ChannelError: 404: (NOT_FOUND - no queue 'q-agent-notifier-port-update_fanout_cc21f47607704321860757b7e6a1194a' in vhost '/', (60, 20), None)
2014-11-12 02:00:23.021 3027 TRACE root
2014-11-12 02:01:24.268 3027 ERROR root [-] Unexpected exception occurred 61 time(s)... retrying.
2014-11-12 02:01:24.268 3027 TRACE root Traceback (most recent call last):
2014-11-12 02:01:24.268 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/excutils.py", line 92, in inner_func
2014-11-12 02:01:24.268 3027 TRACE root return infunc(*args, **kwargs)
2014-11-12 02:01:24.268 3027 TRACE root File "/usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py", line 746, in _consumer_thread

---------------------------

[Impact]

This patch addresses an issue under a RabbitMQ HA deployment where neutron-openvswitch-agent stuck on no queue 'q-agent-notifier-port-update_fanout_xx' error when one of the RabbitMQ cluster node goes down, if there are more than 100 nova compute nodes, all neutron agents are down which is awful, even restart neutron-openvswitch agent can solve it, it is not the idea reality to restart all of the agents on all compute nodes, it broke HA.

[Test Case]

Note steps are for trusty-icehouse, including neutron package 1:2014.1.5-0ubuntu1.

Deploy an OpenStack cloud w/ multiple rabbit nodes and then abruptly kill one of the rabbit nodes (e.g. sudo service rabbitmq-server stop, etc). Observe that the neutron agents stopped to consume messages and keep throw no queue 'q-agent-notifier-port-update_fanout..' exception.

[Regression Potential]

The regression potential is low. The fix is fairly minimal and is limited to the code path where a 404 error occurs.

[Other Info]

Oslo library has this fix, but due to Neutron is using kombu other than oslo library in Icehouse, it still suffer this issue.

Related branches

Eugene Nikanorov (enikanorov) wrote :

The essential part of the issue are those conditions that trigger it. Marking as Incomplete for now

Changed in neutron:
status: New → Incomplete
Miguel Angel Ajo (mangelajo) wrote :

I'm working on it. It's reproduced on HA environment.

If you kill one of the rabbit servers, and the agent needs to reconnect to one of the others, and at the same time, we were deleting ports. But guessing yet.

Working to provide a more complete bug report.

description: updated
Miguel Angel Ajo (mangelajo) wrote :

Until fully understood, I'm trying next hack to avoid indefinite retry loop.

[root@rhos5-node3 ~]# diff -u /usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py.orig /usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py
--- /usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py.orig 2014-11-18 12:30:53.365256421 +0100
+++ /usr/lib/python2.7/site-packages/neutron/openstack/common/rpc/impl_kombu.py 2014-11-18 12:48:26.371730454 +0100
@@ -746,6 +746,12 @@
                 self.consume()
             except greenlet.GreenletExit:
                 return
+ except ChannelError as ce:
+ if ce.amqp_reply_code == 404:
+ LOG.error(_LE("Discarded ChannelError exception: %s"), ce)
+ else:
+ raise ce
+
         if self.consumer_thread is None:
             self.consumer_thread = eventlet.spawn(_consumer_thread)
         return self.consumer_thread

description: updated
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Miguel Angel Ajo (mangelajo) wrote :

The issue was with rabbitmq, and oslo messaging libraries, reconnection order, and queue creation

Kevin Bringard (kbringard) wrote :

Miguel: I think we're running into this exact same problem. Can you give me some details as to what specifically the issue was and how you addressed it?

From what I can tell, it looks as though what may have happened on our side is as follows:

* [Event] happened on the rabbit node where the queue was originally created (and thus the clients were consuming it)
* Channels were cancelled
* Because auto-delete is true, and expiry wasn't set, the fanout queues got deleted (because nothing is consuming the queue now)
* Clients reconnected to a new node
* Clients don't recreate the fanout queues (likely due to how it reconnects from a channel closure vs a connection closure)
* Clients start tossing the 404 error and never stop, unless they're forced to reconnect, which recreates the queues

So it seems like your hack is pretty spot on... when the failover happens the queue isn't properly recreated if it doesn't exist, which your code appears to fix.

Kevin Bringard (kbringard) wrote :

I'm also wondering, at least in our case, if setting the expiry would help, because the queue won't be immediately deleted. Something like this:

rabbitmqctl set_policy expiry ".*" '{"expires":1800000}' --apply-to queues

(taken from https://www.rabbitmq.com/ttl.html)

Xiang Hui (xianghui) on 2015-12-18
Changed in neutron:
status: Expired → Confirmed
Xiang Hui (xianghui) on 2015-12-18
description: updated
Xiang Hui (xianghui) on 2015-12-18
tags: added: patch
Xiang Hui (xianghui) wrote :
Changed in neutron (Ubuntu):
importance: Undecided → Medium
Changed in neutron (Ubuntu Trusty):
importance: Undecided → Medium
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in neutron (Ubuntu Trusty):
status: New → Confirmed
Changed in neutron (Ubuntu):
status: New → Confirmed
Corey Bryant (corey.bryant) wrote :

Hui,

Thanks for the patch. I've applied your debdiff. I've not uploaded quite yet as I want to see if there's anything else outstanding that we need to fix up.

Thanks,
Corey

description: updated
Xiang Hui (xianghui) wrote :

Corey,

  Thanks for looking at it, I will keep watching it for the new updates and please help to update it if there is any news, thanks.

Corey Bryant (corey.bryant) wrote :

Hui, I've uploaded your fix and it is now awaiting review by the sru team for trusty-proposed: https://launchpad.net/ubuntu/trusty/+queue?queue_state=1&queue_text=

Changed in neutron (Ubuntu):
status: Confirmed → Invalid
Changed in neutron (Ubuntu Trusty):
status: Confirmed → Fix Committed
Xiang Hui (xianghui) wrote :

Corey,

  Thank you very much for your efforts.

Hello Miguel, or anyone else affected,

Accepted neutron into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/1:2014.1.5-0ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-needed
tags: added: sts-sru
Xiang Hui (xianghui) wrote :

Used neutron 1:2014.1.5-0ubuntu4 from trusty-proposed, confirmed that didn't have such problem.

tags: added: verification-done
removed: verification-needed

The verification of the Stable Release Update for neutron has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 1:2014.1.5-0ubuntu4

---------------
neutron (1:2014.1.5-0ubuntu4) trusty; urgency=medium

  * neutron-openvswitch-agent stuck on 'q-agent-notifier-port-update_fanout-x'
    is NOT_FOUND exception, this bug caused RabbitMQ HA(active + active) failover
    in Icehouse not work with large scaled nova compute nodes. (LP: #1393391):
    - d/p/fix-neutron-agent-fanout-queue-not-found-loop.patch

 -- Hui Xiang <email address hidden> Mon, 25 Jan 2016 15:25:13 +0800

Changed in neutron (Ubuntu Trusty):
status: Fix Committed → Fix Released
Ryan Beisner (1chb1n) wrote :

FYI, following additional regression tests, today we promoted neutron 2014.1.5-0ubuntu4~cloud0 from proposed to icehouse-updates in the Ubuntu Cloud Archive.

Changed in neutron:
status: Confirmed → Invalid
no longer affects: neutron (Ubuntu)
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers