Bug #856764 “RabbitMQ connections lack heartbeat or TCP keepaliv...” : Bugs : oslo.messaging

Thierry Carrez (ttx) on 2011-10-21

Changed in nova:
importance:	Undecided → Wishlist
status:	New → Confirmed

Revision history for this message

Andrea Rosa (andrea-rosa-m) wrote on 2011-11-17:

#1

For the solution 2 (heartbeat functionality) we need to use another amqp client (for example pika), at this moment the python-amqplib doesn't implement the heartbeat.

Revision history for this message

Brad McConnell (bmcconne) wrote on 2012-02-23:

#2

Just wanted to add an alternate solution to this for the folks that run into this bug while searching. If you make the ASA send active resets instead of silently dropping the connections out of their table, your environment should stabilize. Something along the lines of the following, plus any appropriate adjustments for port/policy-map based upon your individual environment:

class-map rabbit-hop
match port tcp eq 5672
policy-map global_policy
class rabbit-hop
set connection timeout idle 12:00:00 reset

Revision history for this message

Russell Bryant (russellb) wrote on 2012-12-15:

#3

From searching around it sounds like this should no longer be an issue due to enabling TCP keepalives.

"amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"

Changed in nova:
status:	Confirmed → Invalid

Revision history for this message

Justin Hopper (justin-hopper) wrote on 2013-07-03:

#4

The version of kombu we are now using and the py-amqp lib that provides the transport supports heartbeat.

Heartbeat will help close connections when a client using rabbit is forcefully terminated.

Using heartbeats may be an option and if so can either be exposed to the rpc-component user by way of server-params or a configuration for the rpc-component.

Changed in nova:
status:	Invalid → New

Revision history for this message

Kiall Mac Innes (kiall) wrote on 2013-07-03:

#5

By pure fluke, I submitted this a few days back: https://review.openstack.org/#/c/34949

It adds heartbeat support to the Kombu driver.

OpenStack Infra (hudson-openstack) on 2013-07-04

Changed in oslo:
assignee:	nobody → Kiall Mac Innes (kiall)
status:	New → In Progress

Revision history for this message

Mark McLoughlin (markmc) wrote on 2013-07-06:

#6

Russell's point should be addressed:

"amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"

Revision history for this message

Mark McLoughlin (markmc) wrote on 2013-07-06:

#7

I asked a bunch of questions in the oslo review

Main thing missing is what exactly the heartbeat fixes that SO_KEEPALIVE doesn't already address

Changed in nova:
status:	New → Incomplete
Changed in oslo:
status:	In Progress → Incomplete

Revision history for this message

Kiall Mac Innes (kiall) wrote on 2013-07-06:

#8

Hey Mark - I've responded to your comments in the review comments. Rather than split the conversation over two places, I'll just leave a link here:

https://review.openstack.org/#/c/34949/

Revision history for this message

Mark McLoughlin (markmc) wrote on 2013-07-08:

#9

The convincing point made in the review is that a service sitting there listening for RPC requests will have to wait 2 hours by default to be notified that it has lost its connection with the broker if we rely on SO_KEEPALIVE

Changed in oslo:
status:	Incomplete → Triaged
importance:	Undecided → High
Changed in nova:
status:	Incomplete → Confirmed
importance:	Wishlist → High
status:	Confirmed → Triaged

OpenStack Infra (hudson-openstack) on 2013-07-08

Changed in oslo:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2013-07-08: Fix merged to oslo-incubator (master)

#10

Reviewed: https://review.openstack.org/34949
Committed: http://github.com/openstack/oslo-incubator/commit/c37f6aaab3ac00b7865dee18158114433350237e
Submitter: Jenkins
Branch: master

commit c37f6aaab3ac00b7865dee18158114433350237e
Author: Kiall Mac Innes <email address hidden>
Date: Fri Jun 28 21:14:26 2013 +0100

Add support for heartbeating in the kombu RPC driver

This aides in detecting connection interruptions that would otherwise
go unnoticed.

Fixes bug #856764

Change-Id: Id4eb3d36036969b62890175d6a33b4e304be0527

Changed in oslo:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2013-07-17

Changed in oslo:
milestone:	none → havana-2
status:	Fix Committed → Fix Released

Revision history for this message

Mike Lundy (novas0x2a) wrote on 2013-07-19:

#11

Note that the fix for this was reverted: https://github.com/openstack/oslo-incubator/commit/23f2ea29bcfbcfd27a1e4eb2712b0f65245cc7ec

Thierry Carrez (ttx) on 2013-08-14

Changed in oslo:
status:	Fix Released → Triaged

Revision history for this message

Kevin Bringard (kbringard) wrote on 2013-08-20:

#12

I spoke with MarkMc about this in #openstack-dev, but another thing I've discovered:

I should start by saying I'm in no way an ampq or rabbit expert. This is just based on a lot of googling, testing in my environment and trial and error. If I say something which doesn't make sense, it's quite possible it doesn't :-D

In rabbit, when master promotion occurs a slave queue will kick off all of it's consumers, but not kill the connection (http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-January/017341.html). An almost identical issue was brought up on the springsource client forums here: http://forum.springsource.org/archive/index.php/t-121480.html.

While the ampq libraries support connection disruption handling, they don't appear to handle channel disruption or consumer cancel notifications. The end result of which is that when a master promotion occurs in rabbit, the OpenStack services will continue to consume from a queue whose channel has been closed.

Once you get all your consumers to re-establish their channels, messages begin flowing again, but the ultimate result is that a single node failure can cause the majority (or even all) messages to stop flowing to OS services until you force them to re-establish (either by bouncing all rabbit nodes with attached/hung consumers or by restarting individual OS services).

You can reproduce the effects like so:

* Determine the master for any given queue.
** I generally do this by running watch "rabbitmqctl list_queues -p /nova name slave_pids synchronised_slave_pids messages messages_unacknowledged consumers | grep -v fanout" and look for the node in the cluster which is not a slave (inherently making it the master)
* Stop rabbit on the master node
* Watch the consumers column. It should mostly drop to 0, and busy queues (such as q-plugin) will likely begin backing up
* Pick a service (quantum-server works well, as it will drain q-plugin) and validate which rabbit node it is connected to (netstat, grepping the logs of the service, or rabbitmqctl list_connections name should find it pretty easily)
* Restart said service or the rabbit broker it is connected to
*Once it restarts and/or determines the connection has been lost, the connection will be re-established
* Go back to your watch command, and you should now see the new subscriber on its specific queue

I'm adding notes here because I'm not sure if the heartbeat implementation works at the channel level, or if we need to implement consumer cancel notification support (https://lists.launchpad.net/openstack/msg15111.html).

Regardless, without properly handling master promotion in rabbit, it makes using HA queues a moot exercise as losing a single node can cause all messages to stop flowing. Given the heavy reliance on the message queue, I think we need to be especially careful how we handle this and make it as solid as possible.

I spoke with MarkMc about this in #openstack-dev, but another thing I've discovered:

I should start by saying I'm in no way an ampq or rabbit expert. This is just based on a lot of googling, testing in my environment and trial and error. If I say something which doesn't make sense, it's quite possible it doesn't :-D

In rabbit, when master promotion occurs a slave queue will kick off all of it's consumers, but not kill the connection (http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-January/017341.html). An almost identical issue was brought up on the springsource client forums here: http://forum.springsource.org/archive/index.php/t-121480.html.

While the ampq libraries support connection disruption handling, they don't appear to handle channel disruption or consumer cancel notifications. The end result of which is that when a master promotion occurs in rabbit, the OpenStack services will continue to consume from a queue whose channel has been closed.

Once you get all your consumers to re-establish their channels, messages begin flowing again, but the ultimate result is that a single node failure can cause the majority (or even all) messages to stop flowing to OS services until you force them to re-establish (either by bouncing all rabbit nodes with attached/hung consumers or by restarting individual OS services).

You can reproduce the effects like so:

* Determine the master for any given queue. 
** I generally do this by running watch "rabbitmqctl list_queues -p /nova name slave_pids synchronised_slave_pids messages messages_unacknowledged consumers | grep -v fanout"  and look for the node in the cluster which is not a slave (inherently making it the master)
* Stop rabbit on the master node
* Watch the consumers column. It should mostly drop to 0, and busy queues (such as q-plugin) will likely begin backing up
* Pick a service (quantum-server works well, as it will drain q-plugin) and validate which rabbit node it is connected to (netstat, grepping the logs of the service, or rabbitmqctl list_connections name should find it pretty easily)
* Restart said service or the rabbit broker it is connected to
*Once it restarts and/or determines the connection has been lost, the connection will be re-established
* Go back to your watch command, and you should now see the new subscriber on its specific queue

I'm adding notes here because I'm not sure if the heartbeat implementation works at the channel level, or if we need to implement consumer cancel notification support (https://lists.launchpad.net/openstack/msg15111.html).

Regardless, without properly handling master promotion in rabbit, it makes using HA queues a moot exercise as losing a single node can cause all messages to stop flowing. Given the heavy reliance on the message queue, I think we need to be especially careful how we handle this and make it as solid as possible.

Revision history for this message

Kevin Bringard (kbringard) wrote on 2013-08-22:

#13

So it looks like Ask Solem outlines how we need to do heartbeats in this post:

https://lists.launchpad.net/openstack/msg15111.html

Specifically:

An example of enabling heartbeats with eventlet could be:

import weakref
from kombu import Connection
from eventlet import spawn_after

def monitor_heartbeats(connection, rate=2):
if not connection.heartbeat:
return
interval = connection.heartbeat / 2.0
cref = weakref.ref(connection)

def heartbeat_check():
    conn = cref()
    if conn is not None and conn.connected:
        conn.heartbeat_check(rate=rate)
        spawn_after(interval, heartbeat_check)

return spawn_after(interval, heartbeat_check)

connection = Connection('pyamq://', heartbeat=10)

or:

connection = Connection('pyamqp://?heartbeat=10')

Additionally, I think adding support for consumer cancel notifications would aid in the master promotion issues I outlined above. From Ask's email:

- Consumer cancel notifications

Requires no changes to your code,
all you need is to properly reconnect when one of the
errors in Connection.channel_errors occur, which is handled
automatically by Connection.ensure / Connection.autoretry (I don't believe
Nova uses that, but it probably should).

Of course, this all requires updating to a newer version of kombu and amqp as well, but based on our experiences with rabbit, I really think the benefits of adding this functionality will help tremendously from an enterprise operationally ready standpoint. Without it, the HA story in rabbit is pretty dismal :-/

Revision history for this message

Kevin Bringard (kbringard) wrote on 2013-08-22:

#14

impl_kombu.py.patch Edit (871 bytes, text/plain)

Download full text (4.8 KiB)

So, based on Ask's comment about notifications, I started looking into it. As it turns out, *if* you're running a version of kombu/amqp which supports the channel_errors object (version 2.1.4 seems to be when it was introduced: http://kombu.readthedocs.org/en/latest/changelog.html), the following simple patch resolves the issue (also attached):

--- impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
+++ impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
@@ -488,7 +488,6 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
- self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -562,7 +561,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
- except (self.channel_errors, socket.timeout, IOError), e:
+ except (self.connection_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Basically, in ensure() you want to watch the channel and not the connection.

I verified this in a 2 node rabbit cluster. There are 2 nodes: .139 and .141. .139 is currently the master.

The following is from the nova logs when .139 is stopped (and .141 is promoted to the master):

Notice, we're connected to 192.168.128.141:

013-08-22 21:27:45.807 INFO nova.openstack.common.rpc.common [req-20aa6610-b0df-4730-9773-6024e47a6da7 None None] Connected to AMQP server on 192.168.128.141:5672
2013-08-22 21:27:45.843 INFO nova.openstack.common.rpc.common [req-c82c8ea0-aa8b-49b0-925c-b79399f011de None None] Connected to AMQP server on 192.168.128.141:5672

...

Then, we stop rabbit on .139 and see the following *channel* error:

2013-08-22 21:28:13.475 20003 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: tag u'2'
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 572, in ensure
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return method(*args, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 654, in _consume
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 281, in drain_events
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", lin...

So, based on Ask's comment about notifications, I started looking into it. As it turns out, *if* you're running a version of kombu/amqp which supports the channel_errors object  (version 2.1.4 seems to be when it was introduced: http://kombu.readthedocs.org/en/latest/changelog.html), the following simple patch resolves the issue (also attached):

--- impl_kombu.py.new   2013-08-22 21:52:54.711337602 +0000
+++ impl_kombu.py.orig  2013-08-22 21:52:37.727386558 +0000
@@ -488,7 +488,6 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
-   self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -562,7 +561,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
-            except (self.channel_errors, socket.timeout, IOError), e:
+            except (self.connection_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Basically, in ensure() you want to watch the channel and not the connection.

I verified this in a 2 node rabbit cluster. There are 2 nodes: .139 and .141. .139 is currently the master.

The following is from the nova logs when .139 is stopped (and .141 is promoted to the master):

Notice, we're connected to 192.168.128.141:

013-08-22 21:27:45.807 INFO nova.openstack.common.rpc.common [req-20aa6610-b0df-4730-9773-6024e47a6da7 None None] Connected to AMQP server on 192.168.128.141:5672
2013-08-22 21:27:45.843 INFO nova.openstack.common.rpc.common [req-c82c8ea0-aa8b-49b0-925c-b79399f011de None None] Connected to AMQP server on 192.168.128.141:5672

...

Then, we stop rabbit on .139 and see the following *channel* error:

2013-08-22 21:28:13.475 20003 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: tag u'2'
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 572, in ensure
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common     return method(*args, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common   File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 654, in _consume
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common     return self.connection.drain_events(timeout=timeout)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common   File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 281, in drain_events
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common     return self.transport.drain_events(self.connection, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common   File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 91, in drain_events
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common     return connection.drain_events(**kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common   File "/usr/local/lib/python2.7/dist-packages/amqp/connection.py", line 286, in drain_events
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common     return amqp_method(channel, args)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common   File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1628, in _basic_cancel_notify
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common     raise ConsumerCancel('tag %r' % (consumer_tag, ))
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common ConsumerCancel: tag u'2'
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common

Ensure fails due to the channel error and causes the service to reconnect. It reconnects to the same host (as it is now the only one alive):

2013-08-22 21:28:13.478 20003 INFO nova.openstack.common.rpc.common [-] Reconnecting to AMQP server on 192.168.128.141:5672
2013-08-22 21:28:13.510 20003 INFO nova.openstack.common.rpc.common [-] Connected to AMQP server on 192.168.128.141:5672
2013-08-22 21:28:17.007 INFO nova.openstack.common.rpc.common [req-482627bb-812e-4997-90c0-96fbf3c8de34 None None] Connected to AMQP server on 192.168.128.141:5672

Message processing then continues as per usual.

Pip install --upgrade kombu works (even on Ubuntu 12.04) to upgrade kombu to support this, however the ultimate solution will likely need to be more robust than this patch as we should do our best to support the shipping version in LTS out of the box.

Revision history for this message

Kevin Bringard (kbringard) wrote on 2013-08-23:

#15

impl_kombu.py.patch Edit (876 bytes, text/plain)

Sorry, realized I created the patch the wrong way. :facepalm:

This is how it *should* be:

--- impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
+++ impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
@@ -488,6 +488,7 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
+ self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -561,7 +562,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
- except (self.connection_errors, socket.timeout, IOError), e:
+ except (self.channel_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Revision history for this message

Kevin Bringard (kbringard) wrote on 2013-08-27:

#16

impl_kombu.py.patch Edit (1.0 KiB, text/plain)

Quick update on this... I will probably submit this patch upstream, but the channel_errors object seems to exist in the old kombu, so we can declare it without an error, but it doesn't get populated as that version of kombu doesn't populate it.

The supplied patch should "work" on any version, but will only detect channel_errors when running versions of kombu which support it.

Doubtlessly this could be cleaner, and I still think that adding heartbeat support to actively populate and check the channel would be worthwhile, but this should also help with the issue in the short term.

It's also worth pointing out that the newer versions of kombu inherently support a lot of the functionality we're duplicating, such as ensuring connections exist, pooling connections and determining which servers to use and in what order. It's probably worth looking at implementing those once the newer versions of kombu are "standard" on the bulk of distros.

Revision history for this message

Sam Morrison (sorrison) wrote on 2013-10-16:

#17

Hi Kevin,
Just wondering if you've had a chance to submit this upstream?

Thierry Carrez (ttx) on 2013-10-17

Changed in oslo:
milestone:	havana-2 → 2013.2
milestone:	2013.2 → none

Revision history for this message

Chris Friesen (cbf123) wrote on 2013-11-30:

#18

Any update on this issue? I've just run into an issue that I think might be related. We have active/standby controllers (using pacemaker) and multiple compute nodes.

If a controller is killed uncleanly all the services come up on the other controller but it takes about 9 minutes or so before I can boot up a new instance. After that time I see "nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed" on the compute nodes, then it reconnects to the AMQP server and I can then boot an instance.

Unfortunately, any instances I tried to boot during those 9 minutes stay in the "BUILD/scheduling" state forever.

Revision history for this message

Vish Ishaya (vishvananda) wrote on 2013-12-02:

#19

The following fix works for failover, but doesn't solve all of the problems in HA mode. For that kevin's patch above is needed.

When a connection to a socket is cut off completely, the receiving side doesn't know that the connection has dropped, so can end up with a half-open connection. The general solution for this in linux is to turn on TCP_KEEPALIVES. Kombu will enable keepalives if the version number is high enough (>1.0 iirc), but rabbit needs to be specially configured to send keepalives on the connections that it creates.

So solving the HA issue generally involves a rabbit config with a section like the following:

[
{rabbit, [{tcp_listen_options, [binary,
                                {packet, raw},
                                {reuseaddr, true},
                                {backlog, 128},
                                {nodelay, true},
                                {exit_on_close, false},
                                {keepalive, true}]}
          ]}
].

Then you should also shorten the keepalive sysctl settings or it will still take ~2 hrs to terminate the connections:

echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl

Obviously this should be done in a sysctl config file instead of at the command line. Note that if you only want to shorten the rabbit keepalives but keep everything else as a default, you can use an LD_PRELOAD library to do so. For example you could use:

https://github.com/meebey/force_bind/blob/master/README

Mark McLoughlin (markmc) on 2013-12-11

Changed in oslo.messaging:
importance:	Undecided → High
status:	New → Triaged

Revision history for this message

Chet Burgess (cfb-n) wrote on 2013-12-21:

#20

I have done extensive testing using both Vish's keepalive tuning parameters and Kevin's proposed fix. We've been able to validate that the following occur correctly.

1) A client will reconnect if the server they are actively connected to dies (Vish's tuning).
2) A client will reconnect if the AMQP master for the queue its subscribed too goes away (Kevin's proposed fix).

As the original reporters of this we feel the combination successfully addresses the issue and allows for a complete HA solution at the RPC level with rabbit.

Given the time since the patch was posted to the issue I plan on submitting a review to oslo.messaging with the proposed fix as soon as I have definitively confirmed what version of kombu will be required.

I also think we should open a doc bug to document the tuning parameters Vish has outlined. The default behavior out of the box is fairly poor and the HA story isn't really complete until both things are done.

I'm not entirely sure of the proper procedure for the doc bug so any guidance would be appreciated.

Revision history for this message

Sergey Pimkov (sergey-pimkov) wrote on 2013-12-27:

#21

transport.py.patch Edit (458 bytes, text/plain)

Seems like tcp keepalive settings are not enough to provide good failure tolerance. For example, in my openstack cluster nova-conductor and neutron agents always stuck with some unacknowledged tcp traffic, so tcp keepalive timer is never been started. After 900 seconds services began to work.

This problem was expained on Stack Overflow: http://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket

Currently I use a hacky workaround: set TCP_USER_TIMEOUT with hardcoded value for socket in amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!

Revision history for this message

Nicolas Simonds (nicolas.simonds) wrote on 2014-02-26:

#22

I'm not sure if this is germane to the original bug report, but this seems to be where the discussion about RabbitMQ failover is happening, so here's the current state of the art, as far as we can tell:

With the RabbitMQ configs described above (and RabbitMQ 3.2.2), failover works pretty seamlessly, and Kombu 2.5.x and newer handle the Consumer Cancel Notifications properly and promptly.

Where things get interesting is when you have a cluster of >2 RabbitMQ servers and mirrored queues enabled. We're seeing an odd phenomenon where, upon failover, a random subset of nova-compute nodes will "orphan" their topic and fanout queues, and never consume messages from them. They will still publish messages successfully, though, so commands like "nova service-list" will show the nodes as active, although for all intents and purposes, they're dead.

We're not 100% sure why this is happening, but log analysis and observation causes us to wildly speculate that on failover with mirrored queues, RabbitMQ forces an election to determine a new master, and if clients attempt to teardown and re-establish their queues before the election has concluded, they will encounter a race condition where their termination requests get eaten and are unacknowledged by the server, and the clients just hang out forever waiting for their requests to complete, and never retry.

With Kombu 2.5.x, a restart of nova-compute is required to get them to reconnect, and the /usr/bin/nova-clear-rabbit-queues command must be run to clear out the "stale" fanout queues. With Kombu 3.x and newer, the situation is improved, and stopping RabbitMQ on all but one server will cause new CCNs to be generated, and the clients will cleanly migrate to the remaining server and begin working again.

This is still sub-wonderful because when the compute nodes "go dead", they can't receive messages on the bus, but Nova still thinks they're fine. As a dodge around this, we've added a config option to the conductor to introduce an artificial delay before Kombu responds to CCNs. The default value of 1.0 seconds seems to be more than enough time for RabbitMQ to get itself sorted out and avoid races, but users can turn it up (or down) as desired.

I'm not sure if this is germane to the original bug report, but this seems to be where the discussion about RabbitMQ failover is happening, so here's the current state of the art, as far as we can tell:

With the RabbitMQ configs described above (and RabbitMQ 3.2.2), failover works pretty seamlessly, and Kombu 2.5.x and newer handle the Consumer Cancel Notifications properly and promptly.

Where things get interesting is when you have a cluster of >2 RabbitMQ servers and mirrored queues enabled.  We're seeing an odd phenomenon where, upon failover, a random subset of nova-compute nodes will "orphan" their topic and fanout queues, and never consume messages from them.  They will still publish messages successfully, though, so commands like "nova service-list" will show the nodes as active, although for all intents and purposes, they're dead.

We're not 100% sure why this is happening, but log analysis and observation causes us to wildly speculate that on failover with mirrored queues, RabbitMQ forces an election to determine a new master, and if clients attempt to teardown and re-establish their queues before the election has concluded, they will encounter a race condition where their termination requests get eaten and are unacknowledged by the server, and the clients just hang out forever waiting for their requests to complete, and never retry.

With Kombu 2.5.x, a restart of nova-compute is required to get them to reconnect, and the /usr/bin/nova-clear-rabbit-queues command must be run to clear out the "stale" fanout queues.  With Kombu 3.x and newer, the situation is improved, and stopping RabbitMQ on all but one server will cause new CCNs to be generated, and the clients will cleanly migrate to the remaining server and begin working again.

This is still sub-wonderful because when the compute nodes "go dead", they can't receive messages on the bus, but Nova still thinks they're fine.  As a dodge around this, we've added a config option to the conductor to introduce an artificial delay before Kombu responds to CCNs.  The default value of 1.0 seconds seems to be more than enough time for RabbitMQ to get itself sorted out and avoid races, but users can turn it up (or down) as desired.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-02-26: Fix proposed to oslo.messaging (master)

#23

Fix proposed to branch: master
Review: https://review.openstack.org/76686

Changed in oslo.messaging:
assignee:	nobody → Nicolas Simonds (nicolas.simonds)
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-02-28:

#24

Fix proposed to branch: master
Review: https://review.openstack.org/77276

Changed in oslo.messaging:
assignee:	Nicolas Simonds (nicolas.simonds) → Chet Burgess (cfb-n)
assignee:	Chet Burgess (cfb-n) → Nicolas Simonds (nicolas.simonds)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-03-04: Fix merged to oslo.messaging (master)

#25

Reviewed: https://review.openstack.org/77276
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0400cbf4f83cf8d58076c7e65e08a156ec3508a8
Submitter: Jenkins
Branch: master

commit 0400cbf4f83cf8d58076c7e65e08a156ec3508a8
Author: Chet Burgess <email address hidden>
Date: Fri Feb 28 13:39:09 2014 -0800

Gracefully handle consumer cancel notifications

    With mirrored queues and clustered rabbit nodes a queue is still
    mastered by a single rabbit node. When the rabbit node dies an
    election occurs amongst the remaining nodes and a new master is
    elected. When a slave is promoted to master it will close all the
    open channels to its consumers but it will not close the
    connections. This is reported to consumers as a consumer cancel
    notification (CCN). Consumers need to re-subscribe to these queues
    when they recieve a CCN.

    kombu 2.1.4+ reports CCNs as channel errors. This patch updates
    the ensure function to be more inline with the upstream kombu
    functionality. We now monitor for channel errors as well as
    connection errors and initiate a reconnect if we detect an error.

Change-Id: Ie00f67e65250dc983fa45877c14091ad4ae136b4
Partial-Bug: 856764

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-03-04:

#26

Reviewed: https://review.openstack.org/76686
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
Submitter: Jenkins
Branch: master

commit fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
Author: Nicolas Simonds <email address hidden>
Date: Wed Feb 26 15:21:01 2014 -0800

Slow down Kombu reconnect attempts

For a rationale for this patch, see the discussion surrounding Bug

    When reconnecting to a RabbitMQ cluster with mirrored queues in
    use, the attempt to release the connection can hang "indefinitely"
    somewhere deep down in Kombu. Blocking the thread for a bit
    prior to release seems to kludge around the problem where it is
    otherwise reproduceable.

DocImpact

Change-Id: Ic2ede3046709b831adf8204e4c909c589c1786c4
Partial-Bug: 856764

Revision history for this message

Mark McLoughlin (markmc) wrote on 2014-03-31:

#27

Marking as Invalid for Nova because any fix would be in oslo.messaging

Changed in nova:
status:	Triaged → Invalid

Doug Hellmann (doug-hellmann) on 2014-05-14

Changed in oslo.messaging:
importance:	High → Critical

Kiall Mac Innes (kiall) on 2014-05-14

Changed in oslo:
assignee:	Kiall Mac Innes (kiall) → nobody

James Page (james-page) on 2014-05-21

Changed in oslo.messaging:
assignee:	Nicolas Simonds (nicolas.simonds) → James Page (james-page)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-05-21: Fix proposed to oslo.messaging (master)

#28

Fix proposed to branch: master
Review: https://review.openstack.org/94656

Tim Bell (tim-bell) on 2014-05-21

tags:

added: havana-backport-potential

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-05-26:

#29

Please sync kombu_reconnect_delay for all affected projects as well.

Changed in neutron:
status:	New → Confirmed
Changed in heat:
status:	New → Confirmed

Bogdan Dobrelya (bogdando) on 2014-05-26

Changed in ceilometer:
status:	New → Confirmed

Bogdan Dobrelya (bogdando) on 2014-05-26

Changed in neutron:
assignee:	nobody → Bogdan Dobrelya (bogdando)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-05-26: Related fix proposed to neutron (master)

#30

Related fix proposed to branch: master
Review: https://review.openstack.org/95479

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-05-26: Related fix proposed to heat (master)

#31

Related fix proposed to branch: master
Review: https://review.openstack.org/95482

Bogdan Dobrelya (bogdando) on 2014-05-26

Changed in heat:
assignee:	nobody → Bogdan Dobrelya (bogdando)
Changed in neutron:
status:	Confirmed → In Progress
Changed in heat:
status:	Confirmed → In Progress

Bogdan Dobrelya (bogdando) on 2014-05-26

Changed in ceilometer:
status:	Confirmed → New

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-05-26: Related fix proposed to ceilometer (stable/icehouse)

#32

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/95489

Bogdan Dobrelya (bogdando) on 2014-05-26

Changed in ceilometer:
assignee:	nobody → Bogdan Dobrelya (bogdando)
status:	New → In Progress

Eoghan Glynn (eglynn) on 2014-05-30

Changed in ceilometer:
importance:	Undecided → High
milestone:	none → 2014.1.1

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-05-30: Related fix merged to ceilometer (stable/icehouse)

#33

Reviewed: https://review.openstack.org/95489
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=06eb8bc53225c2b58cd2ffeedad17b7428b5f1de
Submitter: Jenkins
Branch: stable/icehouse

commit 06eb8bc53225c2b58cd2ffeedad17b7428b5f1de
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 26 13:28:40 2014 +0300

Sync kombu_reconnect_delay from Oslo

    When reconnecting to a RabbitMQ cluster
    with mirrored queues in use, the attempt to release the
    connection can hang "indefinitely" somewhere deep down
    in Kombu. Blocking the thread for a bit prior to
    release seems to kludge around the problem where it is
    otherwise reproduceable.
    The value 5.0 fits for low perfomance environments as well.

    Cherry-picked from Oslo.messaging:
    fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
    Related-bug: #856764

Change-Id: Ifadda4dd9122df9ccb4ecf560ce3db3e38adf2b9
Signed-off-by: Bogdan Dobrelya <email address hidden>

tags:

added: in-stable-icehouse

Alan Pevec (apevec) on 2014-06-05

Changed in ceilometer:
milestone:	2014.1.1 → none
tags:	removed: in-stable-icehouse

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-06-06:

#34

Please note, the patch https://review.openstack.org/95489 does not close the subject of the issue, it only syncs kombu_reconnect_delay - the TCP heartbeat is still in TODO list, please reassign as appropriate.

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Related fix proposed to oslo-incubator (stable/icehouse)

#35

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/99007

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Related fix proposed to oslo-incubator (master)

#36

Related fix proposed to branch: master
Review: https://review.openstack.org/99009

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Related fix proposed to oslo-incubator (stable/icehouse)

#37

Related fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/99015

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Related fix proposed to oslo-incubator (master)

#38

Related fix proposed to branch: master
Review: https://review.openstack.org/99017

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Related fix proposed to neutron (master)

#39

Related fix proposed to branch: master
Review: https://review.openstack.org/99018

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-06-10: Change abandoned on oslo-incubator (stable/icehouse)

#40

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/99007
Reason: new change id is Ic2ede3046709b831adf8204e4c909c589c1786c4

OpenStack Infra (hudson-openstack) on 2014-06-10

Changed in oslo:
assignee:	nobody → Bogdan Dobrelya (bogdando)
status:	Triaged → In Progress

Bogdan Dobrelya (bogdando) on 2014-06-11

Changed in neutron:
assignee:	Bogdan Dobrelya (bogdando) → nobody

Bogdan Dobrelya (bogdando) on 2014-06-17

Changed in heat:
status:	In Progress → Confirmed
assignee:	Bogdan Dobrelya (bogdando) → nobody
Changed in ceilometer:
assignee:	Bogdan Dobrelya (bogdando) → nobody

Bogdan Dobrelya (bogdando) on 2014-07-01

Changed in fuel:
milestone:	none → 5.1
importance:	Undecided → High
status:	New → Confirmed

Vladimir Kuklin (vkuklin) on 2014-07-03

Changed in mos:
assignee:	nobody → MOS Oslo (mos-oslo)
importance:	Undecided → High
milestone:	none → 5.1
status:	New → Confirmed

Mike Scherbakov (mihgen) on 2014-07-09

Changed in mos:
milestone:	5.1 → 5.0.1

Alexei Kornienko (alexei-kornienko) on 2014-07-09

Changed in mos:
assignee:	MOS Oslo (mos-oslo) → Alexei Kornienko (alexei-kornienko)

Matthew Mosesohn (raytrac3r) on 2014-07-14

Changed in fuel:
assignee:	nobody → Fuel Library Team (fuel-library)

Bogdan Dobrelya (bogdando) on 2014-07-15

Changed in oslo:
assignee:	Bogdan Dobrelya (bogdando) → nobody

Vladimir Kuklin (vkuklin) on 2014-07-15

Changed in fuel:
assignee:	Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)

Dmitry Mescheryakov (dmitrymex) on 2014-07-15

no longer affects:	fuel
Changed in mos:
status:	Confirmed → Fix Committed

OSCI Robot (oscirobot) on 2014-07-23

Changed in mos:
status:	Fix Committed → In Progress

Dmitry Mescheryakov (dmitrymex) on 2014-08-11

Changed in mos:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-08-24: Fix merged to oslo-incubator (stable/icehouse)

#62

Reviewed: https://review.openstack.org/99015
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=14720138309c67d3a6dcaeb6b7a784e21cd74ad2
Submitter: Jenkins
Branch: stable/icehouse

commit 14720138309c67d3a6dcaeb6b7a784e21cd74ad2
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Jun 10 14:26:42 2014 +0300

Slow down Kombu reconnect attempts

For a rationale for this patch, see the discussion surrounding Bug

    When reconnecting to a RabbitMQ cluster with mirrored queues in
    use, the attempt to release the connection can hang "indefinitely"
    somewhere deep down in Kombu. Blocking the thread for a bit
    prior to release seems to kludge around the problem where it is
    otherwise reproduceable.

DocImpact

Change-Id: Ic2ede3046709b831adf8204e4c909c589c1786c4
Partial-Bug: #856764

tags:

added: in-stable-icehouse

Doug Hellmann (doug-hellmann) on 2014-09-05

no longer affects:

oslo-incubator

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-10-06: Related fix proposed to oslo.messaging (master)

#63

Related fix proposed to branch: master
Review: https://review.openstack.org/126329

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-10-06:

#64

Related fix proposed to branch: master
Review: https://review.openstack.org/126330

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-10-06: Change abandoned on oslo.messaging (master)

#65

Change abandoned by Ilya Pekelny (<email address hidden>) on branch: master
Review: https://review.openstack.org/126329
Reason: Invalid change ID

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-10-28:

#66

related bug https://bugs.launchpad.net/mos/+bug/1371723

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-11-06: Related fix proposed to oslo.messaging (master)

#67

Related fix proposed to branch: master
Review: https://review.openstack.org/132979

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-11-24: Change abandoned on oslo.messaging (master)

#68

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/132979

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-11-27:

#69

Change abandoned by James Page (<email address hidden>) on branch: master
Review: https://review.openstack.org/94656
Reason: Alternative implementation proposed which is more complete

Revision history for this message

sridhar basam (sri-7) wrote on 2014-12-05:

#70

Our rabbitmq problems have gone away by using a version of rabbitmq > 3.3.0 due to the following change in rabbitmq.

26070 automatically reconsume when mirrored queues fail over (and
introduce x-cancel-on-ha-failover argument for the old behaviour)

This moves the logic to enable consumption on a queue back to the server side by default. Previously during a queue failover, the server notified consumers about the need to reconsume and left it to the clients to initiate it. Using version 3.3.5 of rabbitmq and 2.5.12 of kombu, we haven't had a single stuck queue after multiple restarts of members in our rabbitmq cluster.

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#71

That is a good point, thank you. I believe Oslo.messaging should has an option (default false) to use this x-cancel-on-ha-failover for all created queues

ZhaoHangbo (497492840-9) on 2014-12-09

Changed in cinder:
status:	New → Confirmed
Changed in mos:
status:	Fix Committed → Incomplete
Changed in ceilometer:
status:	In Progress → Invalid

Dmitry Mescheryakov (dmitrymex) on 2014-12-09

Changed in mos:
status:	Incomplete → Fix Committed

Deliang Fan (vanderliang) on 2014-12-10

Changed in heat:
assignee:	nobody → Deliang Fan (vanderliang)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-09: Fix proposed to oslo.messaging (master)

#72

Fix proposed to branch: master
Review: https://review.openstack.org/146047

Changed in oslo.messaging:
assignee:	James Page (james-page) → Mehdi Abaakouk (sileht)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-21: Related fix proposed to oslo.messaging (master)

#73

Related fix proposed to branch: master
Review: https://review.openstack.org/148890

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-21: Fix proposed to oslo.messaging (master)

#74

Fix proposed to branch: master
Review: https://review.openstack.org/148891

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-21: Change abandoned on oslo.messaging (master)

#75

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/148891
Reason: wrong change id: see https://review.openstack.org/#/c/146047/

Mehdi Abaakouk (sileht) on 2015-01-29

Changed in oslo.messaging:
milestone:	none → next-kilo

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-02: Related fix merged to oslo.messaging (master)

#76

Reviewed: https://review.openstack.org/148890
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=16ee9a86830a1740655c097cd4714c67e31129bb
Submitter: Jenkins
Branch: master

commit 16ee9a86830a1740655c097cd4714c67e31129bb
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 10:24:54 2015 +0100

Refactor the replies waiter code

    This changes improves the way of we wait for replies.
    Currently, one of the rpc client is reponsible to poll the amqp connection
    used for replies and passed received answers to the correct client.

    In this way, we have some case if no client is waiting for a reply, the
    connection is not polled and no IO are done on the wire. The direct
    effect of that is we don't detect if the tcp connection is broken,
    from the system point of view, the tcp connection stay alive even if someone
    between the client and server have closed the connection.

    This change refactors the replies waiter code by creating a background
    thread responsible to poll the connection instead of a random client.
    The connection lost will be detect as soon as possible even if no rpc
    client are currently used the connection.

This is a mandatory change to be able to enable heartbeat on this
connection.

Related-Bug: #1371723
Related-Bug: #856764

Change-Id: I82d4029dd897ef13ae8ba3cda84a2fe65c8c91d2

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-02: Fix proposed to oslo.messaging (master)

#77

Fix proposed to branch: master
Review: https://review.openstack.org/152201

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-03: Change abandoned on oslo.messaging (master)

#78

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/126330
Reason: Ok Ilya, i'll mark it as abandoned

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-12:

#79

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/152201
Reason: Merged into the heartbeat patch.

Revision history for this message

Jason Harley (redmind) wrote on 2015-02-17:

#80

Is there any work being done to backport heartbeats to Icehouse's Oslo messaging?

Mehdi Abaakouk (sileht) on 2015-02-25

Changed in oslo.messaging:
milestone:	1.7.0 → none
milestone:	none → next-kilo

Doug Hellmann (doug-hellmann) on 2015-03-09

Changed in oslo.messaging:
milestone:	1.8.0 → next-liberty

Ivan Kolodyazhny (e0ne) on 2015-03-17

Changed in cinder:
assignee:	nobody → Ivan Kolodyazhny (e0ne)

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-18: Fix merged to oslo.messaging (master)

#85

Reviewed: https://review.openstack.org/146047
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=b9e134d7e955b9180482d2f7c8844501c750adf6
Submitter: Jenkins
Branch: master

commit b9e134d7e955b9180482d2f7c8844501c750adf6
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 09:13:10 2015 +0100

rabbit: heartbeat implementation

    AMQP offers a heartbeat feature to ensure that the application layer
    promptly finds out about disrupted connections (and also completely
    unresponsive peers). If the client requests heartbeats on connection, rabbit
    server will regularly send messages to each connections with the expectation of
    a response.

    To acheive this, each driver connection object spawn a thread that
    send/retrieve heartbeat packets exchanged between the server and the
    client.

    To protect the concurrency access to the kombu connection between the
    driver and this thread use a lock that always prioritize the
    heartbeat thread. So when the heartbeat thread wakes up it will acquire the
    lock quickly, to ensure we have no heartbeat starvation when the driver
    sends a lot of messages.

Also when we are polling the broker, the lock can be held for a long
time by the 'consume' method, so this one does the heartbeat stuffs itself.

DocImpact: 2 new configuration options for Rabbit driver

Co-Authored-By: Oleksii Zamiatin <email address hidden>
Co-Authored-By: Ilya Pekelny <email address hidden>

Related-Bug: #1371723
Closes-Bug: #856764

Change-Id: I1d3a635f3853bc13ffc14034468f1ac6262c11a3

Changed in oslo.messaging:
status:	In Progress → Fix Committed

Revision history for this message

OSCI Robot (oscirobot) wrote on 2015-03-18:

#86

RPM package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1.mira10.git.b9e134d.acb3abf

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: patchset-created

Files placed on repository:
python-oslo-messaging-1.8.0-fuel6.1.mira10.git.b9e134d.acb3abf.noarch.rpm
python-oslo-messaging-doc-1.8.0-fuel6.1.mira10.git.b9e134d.acb3abf.noarch.rpm

NOTE: Changeset is not merged, created temporary package repository.
RPM repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-master-4736/centos

Revision history for this message

OSCI Robot (oscirobot) wrote on 2015-03-18:

#87

RPM package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1.mira10

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: change-merged

Files placed on repository:
python-oslo-messaging-1.8.0-fuel6.1.mira10.noarch.rpm
python-oslo-messaging-doc-1.8.0-fuel6.1.mira10.noarch.rpm

Changeset merged. Package placed on primary repository
RPM repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-master/centos

Revision history for this message

OSCI Robot (oscirobot) wrote on 2015-03-18:

#88

DEB package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1~mira10

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: change-merged

Files placed on repository:
python-oslo.messaging_1.8.0-fuel6.1~mira10_all.deb

Changeset merged. Package placed on primary repository
DEB repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-master/ubuntu

Revision history for this message

OSCI Robot (oscirobot) wrote on 2015-03-18:

#89

DEB package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1~mira10+git.b9e134d.acb3abf

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: patchset-created

Files placed on repository:
python-oslo.messaging_1.8.0-fuel6.1~mira10+git.b9e134d.acb3abf_all.deb

NOTE: Changeset is not merged, created temporary package repository.
DEB repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-master-4736/ubuntu

Deliang Fan (vanderliang) on 2015-03-19

Changed in heat:
assignee:	Deliang Fan (vanderliang) → nobody

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-24: Fix proposed to oslo.messaging (master)

#90

Fix proposed to branch: master
Review: https://review.openstack.org/167299

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-24: Change abandoned on oslo.messaging (master)

#91

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/167299

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-24: Fix proposed to oslo.messaging (stable/kilo)

#92

Fix proposed to branch: stable/kilo
Review: https://review.openstack.org/167308

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-03-25: Fix merged to oslo.messaging (stable/kilo)

#93

Reviewed: https://review.openstack.org/167308
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=64bdd80c5fe4d53ac8d7ab3ed906ec9feaeb7ec4
Submitter: Jenkins
Branch: stable/kilo

commit 64bdd80c5fe4d53ac8d7ab3ed906ec9feaeb7ec4
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 09:13:10 2015 +0100

rabbit: heartbeat implementation

    AMQP offers a heartbeat feature to ensure that the application layer
    promptly finds out about disrupted connections (and also completely
    unresponsive peers). If the client requests heartbeats on connection, rabbit
    server will regularly send messages to each connections with the expectation of
    a response.

    To acheive this, each driver connection object spawn a thread that
    send/retrieve heartbeat packets exchanged between the server and the
    client.

    To protect the concurrency access to the kombu connection between the
    driver and this thread use a lock that always prioritize the
    heartbeat thread. So when the heartbeat thread wakes up it will acquire the
    lock quickly, to ensure we have no heartbeat starvation when the driver
    sends a lot of messages.

Also when we are polling the broker, the lock can be held for a long
time by the 'consume' method, so this one does the heartbeat stuffs itself.

DocImpact: 2 new configuration options for Rabbit driver

Co-Authored-By: Oleksii Zamiatin <email address hidden>
Co-Authored-By: Ilya Pekelny <email address hidden>

Related-Bug: #1371723
Closes-Bug: #856764

Change-Id: I1d3a635f3853bc13ffc14034468f1ac6262c11a3
(cherry picked from commit b9e134d7e955b9180482d2f7c8844501c750adf6)

tags:

added: in-stable-kilo

Mehdi Abaakouk (sileht) on 2015-03-25

Changed in oslo.messaging:
milestone:	next-liberty → 1.8.1
status:	Fix Committed → Fix Released

Revision history for this message

Ubuntu Foundations Team Bug Bot (crichton) wrote on 2015-03-25:

#94

The attachment "impl_kombu.py.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags:

added: patch

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-03-30:

#95

This bug was fixed in the package oslo.messaging - 1.8.1-0ubuntu1

---------------
oslo.messaging (1.8.1-0ubuntu1) vivid; urgency=medium

  * New upstream release for OpenStack Kilo, including enablement
    of RabbitMQ heartbeating for improved connection failure detection
    (LP: #856764):
    - d/p/zmq-redis-fix-topic-registration.patch,
      d/p/disable-zmq-tests.patch: Dropped, included upstream.
    - d/p/zmq-client-pooling.patch: Rebase.
    - d/p/disable-new-executors.patch: Disable hard requirement for
      trollius and aioeventlet executors for vivid release.
    - d/control: Align minimum version requirements with upstream.
  * d/pydist-overrides: Add overrides for new oslo package naming.
  * Misc fixes for zmq driver:
    - d/p/Fix-changing-keys-during-iteration-in-matchmaker-hea.patch:
      Fix changing keys during iteration in matchmaker heartbeat
      (LP: #1432966).
    - d/p/Add-pluggability-for-matchmakers.patch: Add entry points
      for matchmaker drivers (LP: #1291701).
-- James Page <email address hidden> Mon, 30 Mar 2015 09:52:29 +0100

Changed in oslo.messaging (Ubuntu):
status:	New → Fix Released

Revision history for this message

Mtl Alow (mtl-alow-deactivatedaccount) wrote on 2015-04-03:

#96

Any way to have this fixed in Juno too ?

Revision history for this message

Alejandro Comisario (alejandro-f) wrote on 2015-04-08:

#97

+ 1 is there a way to apply / backport this fix into Juno ?
Or maybe, pip install -U oslo.messagin will do ?

Revision history for this message

Tom Fifield (fifieldt) wrote on 2015-06-11:

#98

@Quentin, @Alejandro: Check out http://lists.openstack.org/pipermail/openstack-operators/2015-May/006849.html

Revision history for this message

Rongze Zhu (zrzhit) wrote on 2015-09-13:

#99

@Mehdi Abaakouk, @Alexei Kornienko, My patch about keepalive options had been merged into pyamqp[1] , it is very useful for aware the connection being terminated and raise a socket error exception.

We can add some keepalive options in oslo.messaging[2] and pass these keepalive options to kombu pyamqp transport, so the idle connections in consumer will aware the connection being terminated and the consumer will catch the socket Exception, and reconnect again.

The method of tcp keepalive is simpler than heartbeat checking thread. I had used this approach in multiple production environments more than a year, it is effective.

[1] https://github.com/celery/py-amqp/commit/b9a6601a93927449fa6f524750e3842cc5c181bd
[2] https://github.com/zhurongze/oslo.messaging/commit/c04d9b18536e8032a79c1889adde7beaf517adaf

Revision history for this message

Rongze Zhu (zrzhit) wrote on 2015-09-13:

#100

update [2] https://github.com/zhurongze/oslo.messaging/commit/71c3380c4c411eed261ab0a4741931d07d5c44c7

Alan Pevec (apevec) on 2015-11-24

tags:

removed: havana-backport-potential in-stable-icehouse in-stable-kilo

Thomas Herve (therve) on 2016-04-19

no longer affects:

heat

Mathew Hodson (mhodson) on 2016-08-31

Changed in oslo.messaging (Ubuntu):
importance:	Undecided → High

Revision history for this message

Sean McGinnis (sean-mcginnis) wrote on 2016-09-27:

#101

If I follow correctly, this no longer effects Cinder since it was implemented in oslo.messaging.

Changed in cinder:
status:	Confirmed → Invalid

Ihar Hrachyshka (ihar-hrachyshka) on 2016-10-03

no longer affects:

neutron

oslo.messaging

RabbitMQ connections lack heartbeat or TCP keepalives

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Patches

Remote bug watches

	Status	Importance	Assigned to	Milestone
Ceilometer	Invalid	High	Unassigned
Icehouse	Fix Released	High	Bogdan Dobrelya	Ceilometer 2014.1.1
Cinder	Invalid	Undecided	Ivan Kolodyazhny
Mirantis OpenStack	Fix Committed	High	Alexei Kornienko	Mirantis OpenStack 5.0.1
OpenStack Compute (nova)	Invalid	High	Unassigned
oslo.messaging	Fix Released	Critical	Mehdi Abaakouk	oslo.messaging 1.8.1
oslo.messaging (Ubuntu)	Fix Released	High	Unassigned