RabbitMQ connections lack heartbeat or TCP keepalives

Bug #856764 reported by Rafi Khardalian on 2011-09-22
Affects                     Importance   Assigned to
Ceilometer                  High         Unassigned
Ceilometer (Icehouse)       High         Bogdan Dobrelya
Cinder                      Undecided    Ivan Kolodyazhny
Mirantis OpenStack          High         Alexei Kornienko
OpenStack Compute (nova)    High         Unassigned
oslo.messaging              Critical     Mehdi Abaakouk
oslo.messaging (Ubuntu)     High         Unassigned

Bug Description

There is currently no method built into Nova to keep connections from various components into RabbitMQ alive. As a result, placing a stateful firewall (such as a Cisco ASA) between the endpoints can (and does) result in idle connections being terminated without either endpoint being aware.

This issue can be mitigated a few different ways:

1. Have connections to RabbitMQ set socket options that enable TCP keepalives.

2. Rabbit has heartbeat functionality. If the client requests heartbeats on connection, the rabbit server will regularly send messages to each connection with the expectation of a response.

3. Other?
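Mitigation 1 can be sketched with only the standard library. The TCP_KEEP* option names are Linux-specific (hence the guards), and the idle/interval/probe values are purely illustrative, not recommendations:

```python
import socket

# Sketch of mitigation 1: enable TCP keepalives on the socket underlying
# an AMQP connection, before handing it to the AMQP client library.
def enable_keepalive(sock, idle=60, interval=10, probes=5):
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # unanswered probes before drop
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)  # True
```

With the defaults above, a dead peer would be detected after roughly idle + interval * probes seconds rather than the kernel default of about two hours.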

Thierry Carrez (ttx) on 2011-10-21
Changed in nova:
importance: Undecided → Wishlist
status: New → Confirmed
Andrea Rosa (andrea-rosa-m) wrote :

For solution 2 (heartbeat functionality) we would need to use another AMQP client (for example pika); at the moment python-amqplib doesn't implement heartbeats.

Brad McConnell (bmcconne) wrote :

Just wanted to add an alternate solution to this for the folks that run into this bug while searching. If you make the ASA send active resets instead of silently dropping the connections out of its table, your environment should stabilize. Something along the lines of the following, plus any appropriate adjustments for port/policy-map based upon your individual environment:

class-map rabbit-hop
 match port tcp eq 5672
policy-map global_policy
 class rabbit-hop
  set connection timeout idle 12:00:00 reset

Russell Bryant (russellb) wrote :

From searching around, it sounds like this should no longer be an issue, since TCP keepalives are now enabled by default:

"amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"

Changed in nova:
status: Confirmed → Invalid
Justin Hopper (justin-hopper) wrote :

The version of kombu we are now using and the py-amqp lib that provides the transport both support heartbeats.

Heartbeats will help close connections when a client using rabbit is forcefully terminated.

Using heartbeats may be an option; if so, it can be exposed to the rpc-component user either by way of server-params or as a configuration option for the rpc-component.

Changed in nova:
status: Invalid → New
Kiall Mac Innes (kiall) wrote :

By pure fluke, I submitted this a few days back: https://review.openstack.org/#/c/34949

It adds heartbeat support to the Kombu driver.

Changed in oslo:
assignee: nobody → Kiall Mac Innes (kiall)
status: New → In Progress
Mark McLoughlin (markmc) wrote :

Russell's point should be addressed:

  "amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"

Mark McLoughlin (markmc) wrote :

I asked a bunch of questions in the oslo review

Main thing missing is what exactly the heartbeat fixes that SO_KEEPALIVE doesn't already address

Changed in nova:
status: New → Incomplete
Changed in oslo:
status: In Progress → Incomplete
Kiall Mac Innes (kiall) wrote :

Hey Mark - I've responded to your comments in the review comments. Rather than split the conversation over two places, I'll just leave a link here:

https://review.openstack.org/#/c/34949/

Mark McLoughlin (markmc) wrote :

The convincing point made in the review is that, if we rely on SO_KEEPALIVE, a service sitting there listening for RPC requests will have to wait 2 hours by default to be notified that it has lost its connection to the broker.

Changed in oslo:
status: Incomplete → Triaged
importance: Undecided → High
Changed in nova:
status: Incomplete → Confirmed
importance: Wishlist → High
status: Confirmed → Triaged
Changed in oslo:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/34949
Committed: http://github.com/openstack/oslo-incubator/commit/c37f6aaab3ac00b7865dee18158114433350237e
Submitter: Jenkins
Branch: master

commit c37f6aaab3ac00b7865dee18158114433350237e
Author: Kiall Mac Innes <email address hidden>
Date: Fri Jun 28 21:14:26 2013 +0100

    Add support for heartbeating in the kombu RPC driver

    This aides in detecting connection interruptions that would otherwise
    go unnoticed.

    Fixes bug #856764

    Change-Id: Id4eb3d36036969b62890175d6a33b4e304be0527

Changed in oslo:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-07-17
Changed in oslo:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-08-14
Changed in oslo:
status: Fix Released → Triaged
Kevin Bringard (kbringard) wrote :

I spoke with MarkMc about this in #openstack-dev, but another thing I've discovered:

I should start by saying I'm in no way an amqp or rabbit expert. This is just based on a lot of googling, testing in my environment, and trial and error. If I say something which doesn't make sense, it's quite possible it doesn't :-D

In rabbit, when master promotion occurs a slave queue will kick off all of its consumers, but not kill the connection (http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-January/017341.html). An almost identical issue was brought up on the springsource client forums here: http://forum.springsource.org/archive/index.php/t-121480.html.

While the amqp libraries support connection disruption handling, they don't appear to handle channel disruption or consumer cancel notifications. The end result is that when a master promotion occurs in rabbit, the OpenStack services will continue to consume from a queue whose channel has been closed.

Once you get all your consumers to re-establish their channels, messages begin flowing again, but the ultimate result is that a single node failure can cause the majority (or even all) messages to stop flowing to OS services until you force them to re-establish (either by bouncing all rabbit nodes with attached/hung consumers or by restarting individual OS services).

You can reproduce the effects like so:

* Determine the master for any given queue.
** I generally do this by running watch "rabbitmqctl list_queues -p /nova name slave_pids synchronised_slave_pids messages messages_unacknowledged consumers | grep -v fanout" and looking for the node in the cluster which is not a slave (inherently making it the master)
* Stop rabbit on the master node
* Watch the consumers column. It should mostly drop to 0, and busy queues (such as q-plugin) will likely begin backing up
* Pick a service (quantum-server works well, as it will drain q-plugin) and validate which rabbit node it is connected to (netstat, grepping the logs of the service, or rabbitmqctl list_connections name should find it pretty easily)
* Restart said service or the rabbit broker it is connected to
* Once it restarts and/or determines the connection has been lost, the connection will be re-established
* Go back to your watch command, and you should now see the new subscriber on its specific queue

I'm adding notes here because I'm not sure if the heartbeat implementation works at the channel level, or if we need to implement consumer cancel notification support (https://lists.launchpad.net/openstack/msg15111.html).

Regardless, without properly handling master promotion in rabbit, it makes using HA queues a moot exercise as losing a single node can cause all messages to stop flowing. Given the heavy reliance on the message queue, I think we need to be especially careful how we handle this and make it as solid as possible.

Kevin Bringard (kbringard) wrote :

So it looks like Ask Solem outlines how we need to do heartbeats in this post:

https://lists.launchpad.net/openstack/msg15111.html

Specifically:

An example of enabling heartbeats with eventlet could be:

import weakref
from kombu import Connection
from eventlet import spawn_after

def monitor_heartbeats(connection, rate=2):
    if not connection.heartbeat:
        return
    interval = connection.heartbeat / 2.0
    cref = weakref.ref(connection)

    def heartbeat_check():
        conn = cref()
        if conn is not None and conn.connected:
            conn.heartbeat_check(rate=rate)
            spawn_after(interval, heartbeat_check)

    return spawn_after(interval, heartbeat_check)

connection = Connection('pyamqp://', heartbeat=10)

or:

connection = Connection('pyamqp://?heartbeat=10')

Additionally, I think adding support for consumer cancel notifications would aid in the master promotion issues I outlined above. From Ask's email:

- Consumer cancel notifications

Requires no changes to your code,
all you need is to properly reconnect when one of the
errors in Connection.channel_errors occur, which is handled
automatically by Connection.ensure / Connection.autoretry (I don't believe
Nova uses that, but it probably should).

Of course, this all requires updating to a newer version of kombu and amqp as well, but based on our experiences with rabbit, I really think the benefits of adding this functionality will help tremendously from an operational-readiness standpoint. Without it, the HA story in rabbit is pretty dismal :-/
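The reconnect-on-error pattern Ask describes (Connection.ensure / autoretry) can be sketched in plain Python. Everything below, class and function names included, is hypothetical illustration, not Nova or kombu code:

```python
# Hypothetical sketch of the ensure() pattern: retry an operation,
# reconnecting whenever a connection *or* channel error occurs.
class ConnectionError_(Exception): ...
class ChannelError_(Exception): ...   # e.g. a consumer cancel notification

def ensure(method, reconnect, retries=3,
           errors=(ConnectionError_, ChannelError_)):
    for _attempt in range(retries):
        try:
            return method()
        except errors:
            reconnect()  # re-establish the connection/channel, re-subscribe
    raise RuntimeError("gave up after %d attempts" % retries)

# Simulated consumer: the first call hits a channel error (as on a rabbit
# master promotion); the retry succeeds after "reconnecting".
state = {"connected": False}

def consume():
    if not state["connected"]:
        raise ChannelError_("basic.cancel received")
    return "message"

def reconnect():
    state["connected"] = True

print(ensure(consume, reconnect))  # -> message
```

The key point, matching Kevin's patch, is that channel-level errors must be in the caught set; catching only connection errors leaves the consumer hung on a dead channel.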

Kevin Bringard (kbringard) wrote :

So, based on Ask's comment about notifications, I started looking into it. As it turns out, *if* you're running a version of kombu/amqp which supports the channel_errors object (version 2.1.4 seems to be when it was introduced: http://kombu.readthedocs.org/en/latest/changelog.html), the following simple patch resolves the issue (also attached):

--- impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
+++ impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
@@ -488,7 +488,6 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
- self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -562,7 +561,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
- except (self.channel_errors, socket.timeout, IOError), e:
+ except (self.connection_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Basically, in ensure() you want to watch the channel and not the connection.

I verified this in a 2 node rabbit cluster. There are 2 nodes: .139 and .141. .139 is currently the master.

The following is from the nova logs when .139 is stopped (and .141 is promoted to the master):

Notice, we're connected to 192.168.128.141:

2013-08-22 21:27:45.807 INFO nova.openstack.common.rpc.common [req-20aa6610-b0df-4730-9773-6024e47a6da7 None None] Connected to AMQP server on 192.168.128.141:5672
2013-08-22 21:27:45.843 INFO nova.openstack.common.rpc.common [req-c82c8ea0-aa8b-49b0-925c-b79399f011de None None] Connected to AMQP server on 192.168.128.141:5672

...

Then, we stop rabbit on .139 and see the following *channel* error:

2013-08-22 21:28:13.475 20003 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: tag u'2'
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 572, in ensure
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return method(*args, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 654, in _consume
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 281, in drain_events
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", lin...


Kevin Bringard (kbringard) wrote :

Sorry, realized I created the patch the wrong way. :facepalm:

This is how it *should* be:

--- impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
+++ impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
@@ -488,6 +488,7 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
+ self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -561,7 +562,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
- except (self.connection_errors, socket.timeout, IOError), e:
+ except (self.channel_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Kevin Bringard (kbringard) wrote :

Quick update on this... I will probably submit this patch upstream. One note: the channel_errors object exists in older kombu as well, so we can declare it without an error, but those older versions never populate it.

The supplied patch should "work" on any version, but will only detect channel_errors when running versions of kombu which support it.

Doubtlessly this could be cleaner, and I still think that adding heartbeat support to actively populate and check the channel would be worthwhile, but this should also help with the issue in the short term.

It's also worth pointing out that the newer versions of kombu inherently support a lot of the functionality we're duplicating, such as ensuring connections exist, pooling connections and determining which servers to use and in what order. It's probably worth looking at implementing those once the newer versions of kombu are "standard" on the bulk of distros.

Sam Morrison (sorrison) wrote :

Hi Kevin,
Just wondering if you've had a chance to submit this upstream?

Thierry Carrez (ttx) on 2013-10-17
Changed in oslo:
milestone: havana-2 → 2013.2
milestone: 2013.2 → none
Chris Friesen (cbf123) wrote :

Any update on this issue? I've just run into an issue that I think might be related. We have active/standby controllers (using pacemaker) and multiple compute nodes.

If a controller is killed uncleanly all the services come up on the other controller but it takes about 9 minutes or so before I can boot up a new instance. After that time I see "nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed" on the compute nodes, then it reconnects to the AMQP server and I can then boot an instance.

Unfortunately, any instances I tried to boot during those 9 minutes stay in the "BUILD/scheduling" state forever.

Vish Ishaya (vishvananda) wrote :

The following fix works for failover, but doesn't solve all of the problems in HA mode. For that, Kevin's patch above is needed.

When a connection to a socket is cut off completely, the receiving side doesn't know that the connection has dropped, so can end up with a half-open connection. The general solution for this in linux is to turn on TCP_KEEPALIVES. Kombu will enable keepalives if the version number is high enough (>1.0 iirc), but rabbit needs to be specially configured to send keepalives on the connections that it creates.

So solving the HA issue generally involves a rabbit config with a section like the following:

[
 {rabbit, [{tcp_listen_options, [binary,
                                {packet, raw},
                                {reuseaddr, true},
                                {backlog, 128},
                                {nodelay, true},
                                {exit_on_close, false},
                                {keepalive, true}]}
          ]}
].

Then you should also shorten the keepalive sysctl settings or it will still take ~2 hrs to terminate the connections:

echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl

Obviously this should be done in a sysctl config file instead of at the command line. Note that if you only want to shorten the rabbit keepalives but keep everything else as a default, you can use an LD_PRELOAD library to do so. For example you could use:

https://github.com/meebey/force_bind/blob/master/README
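For persistence, the same keepalive values can go in a sysctl config file (for example under /etc/sysctl.d/); this is simply the three echo commands above restated:

```ini
net.ipv4.tcp_keepalive_time = 5
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 1
```

Note these are system-wide settings, which is why the LD_PRELOAD approach is mentioned for scoping the change to rabbit alone.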

Mark McLoughlin (markmc) on 2013-12-11
Changed in oslo.messaging:
importance: Undecided → High
status: New → Triaged
Chet Burgess (cfb-n) wrote :

I have done extensive testing using both Vish's keepalive tuning parameters and Kevin's proposed fix. We've been able to validate that the following occur correctly.

1) A client will reconnect if the server they are actively connected to dies (Vish's tuning).
2) A client will reconnect if the AMQP master for the queue it's subscribed to goes away (Kevin's proposed fix).

As the original reporters of this we feel the combination successfully addresses the issue and allows for a complete HA solution at the RPC level with rabbit.

Given the time since the patch was posted to the issue I plan on submitting a review to oslo.messaging with the proposed fix as soon as I have definitively confirmed what version of kombu will be required.

I also think we should open a doc bug to document the tuning parameters Vish has outlined. The default behavior out of the box is fairly poor and the HA story isn't really complete until both things are done.

I'm not entirely sure of the proper procedure for the doc bug so any guidance would be appreciated.

Sergey Pimkov (sergey-pimkov) wrote :

Seems like TCP keepalive settings are not enough to provide good failure tolerance. For example, in my openstack cluster nova-conductor and the neutron agents always got stuck with some unacknowledged tcp traffic, so the tcp keepalive timer was never started. After 900 seconds the services began to work again.

This problem was expained on Stack Overflow: http://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket

Currently I use a hacky workaround: set TCP_USER_TIMEOUT with hardcoded value for socket in amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!

I'm not sure if this is germane to the original bug report, but this seems to be where the discussion about RabbitMQ failover is happening, so here's the current state of the art, as far as we can tell:

With the RabbitMQ configs described above (and RabbitMQ 3.2.2), failover works pretty seamlessly, and Kombu 2.5.x and newer handle the Consumer Cancel Notifications properly and promptly.

Where things get interesting is when you have a cluster of >2 RabbitMQ servers and mirrored queues enabled. We're seeing an odd phenomenon where, upon failover, a random subset of nova-compute nodes will "orphan" their topic and fanout queues, and never consume messages from them. They will still publish messages successfully, though, so commands like "nova service-list" will show the nodes as active, although for all intents and purposes, they're dead.

We're not 100% sure why this is happening, but log analysis and observation causes us to wildly speculate that on failover with mirrored queues, RabbitMQ forces an election to determine a new master, and if clients attempt to teardown and re-establish their queues before the election has concluded, they will encounter a race condition where their termination requests get eaten and are unacknowledged by the server, and the clients just hang out forever waiting for their requests to complete, and never retry.

With Kombu 2.5.x, a restart of nova-compute is required to get them to reconnect, and the /usr/bin/nova-clear-rabbit-queues command must be run to clear out the "stale" fanout queues. With Kombu 3.x and newer, the situation is improved, and stopping RabbitMQ on all but one server will cause new CCNs to be generated, and the clients will cleanly migrate to the remaining server and begin working again.

This is still sub-wonderful because when the compute nodes "go dead", they can't receive messages on the bus, but Nova still thinks they're fine. As a dodge around this, we've added a config option to the conductor to introduce an artificial delay before Kombu responds to CCNs. The default value of 1.0 seconds seems to be more than enough time for RabbitMQ to get itself sorted out and avoid races, but users can turn it up (or down) as desired.
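The artificial delay described above corresponds to the kombu_reconnect_delay option named later in this thread. A sketch of the setting; which config section it lives in varies by release, so the placement below is an assumption:

```ini
[DEFAULT]
# Seconds to wait before reconnecting in response to a consumer cancel
# notification, giving the rabbit cluster time to finish its election.
kombu_reconnect_delay = 1.0
```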

Fix proposed to branch: master
Review: https://review.openstack.org/76686

Changed in oslo.messaging:
assignee: nobody → Nicolas Simonds (nicolas.simonds)
status: Triaged → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/77276

Changed in oslo.messaging:
assignee: Nicolas Simonds (nicolas.simonds) → Chet Burgess (cfb-n)
assignee: Chet Burgess (cfb-n) → Nicolas Simonds (nicolas.simonds)

Reviewed: https://review.openstack.org/77276
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0400cbf4f83cf8d58076c7e65e08a156ec3508a8
Submitter: Jenkins
Branch: master

commit 0400cbf4f83cf8d58076c7e65e08a156ec3508a8
Author: Chet Burgess <email address hidden>
Date: Fri Feb 28 13:39:09 2014 -0800

    Gracefully handle consumer cancel notifications

    With mirrored queues and clustered rabbit nodes a queue is still
    mastered by a single rabbit node. When the rabbit node dies an
    election occurs amongst the remaining nodes and a new master is
    elected. When a slave is promoted to master it will close all the
    open channels to its consumers but it will not close the
    connections. This is reported to consumers as a consumer cancel
    notification (CCN). Consumers need to re-subscribe to these queues
    when they receive a CCN.

    kombu 2.1.4+ reports CCNs as channel errors. This patch updates
    the ensure function to be more in line with the upstream kombu
    functionality. We now monitor for channel errors as well as
    connection errors and initiate a reconnect if we detect an error.

    Change-Id: Ie00f67e65250dc983fa45877c14091ad4ae136b4
    Partial-Bug: 856764

Reviewed: https://review.openstack.org/76686
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
Submitter: Jenkins
Branch: master

commit fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
Author: Nicolas Simonds <email address hidden>
Date: Wed Feb 26 15:21:01 2014 -0800

    Slow down Kombu reconnect attempts

    For a rationale for this patch, see the discussion surrounding Bug

    When reconnecting to a RabbitMQ cluster with mirrored queues in
    use, the attempt to release the connection can hang "indefinitely"
    somewhere deep down in Kombu. Blocking the thread for a bit
    prior to release seems to kludge around the problem where it is
    otherwise reproducible.

    DocImpact

    Change-Id: Ic2ede3046709b831adf8204e4c909c589c1786c4
    Partial-Bug: 856764

Mark McLoughlin (markmc) wrote :

Marking as Invalid for Nova because any fix would be in oslo.messaging

Changed in nova:
status: Triaged → Invalid
Changed in oslo.messaging:
importance: High → Critical
Kiall Mac Innes (kiall) on 2014-05-14
Changed in oslo:
assignee: Kiall Mac Innes (kiall) → nobody
James Page (james-page) on 2014-05-21
Changed in oslo.messaging:
assignee: Nicolas Simonds (nicolas.simonds) → James Page (james-page)
Tim Bell (tim-bell) on 2014-05-21
tags: added: havana-backport-potential
Bogdan Dobrelya (bogdando) wrote :

Please sync kombu_reconnect_delay for all affected projects as well.

Changed in neutron:
status: New → Confirmed
Changed in heat:
status: New → Confirmed
Changed in ceilometer:
status: New → Confirmed
Changed in neutron:
assignee: nobody → Bogdan Dobrelya (bogdando)
Changed in heat:
assignee: nobody → Bogdan Dobrelya (bogdando)
Changed in neutron:
status: Confirmed → In Progress
Changed in heat:
status: Confirmed → In Progress
Changed in ceilometer:
status: Confirmed → New
Changed in ceilometer:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: New → In Progress
Eoghan Glynn (eglynn) on 2014-05-30
Changed in ceilometer:
importance: Undecided → High
milestone: none → 2014.1.1

Reviewed: https://review.openstack.org/95489
Committed: https://git.openstack.org/cgit/openstack/ceilometer/commit/?id=06eb8bc53225c2b58cd2ffeedad17b7428b5f1de
Submitter: Jenkins
Branch: stable/icehouse

commit 06eb8bc53225c2b58cd2ffeedad17b7428b5f1de
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 26 13:28:40 2014 +0300

    Sync kombu_reconnect_delay from Oslo

    When reconnecting to a RabbitMQ cluster
    with mirrored queues in use, the attempt to release the
    connection can hang "indefinitely" somewhere deep down
    in Kombu. Blocking the thread for a bit prior to
    release seems to kludge around the problem where it is
    otherwise reproducible.
    The value 5.0 fits for low-performance environments as well.

    Cherry-picked from Oslo.messaging:
    fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
    Related-bug: #856764

    Change-Id: Ifadda4dd9122df9ccb4ecf560ce3db3e38adf2b9
    Signed-off-by: Bogdan Dobrelya <email address hidden>

tags: added: in-stable-icehouse
Alan Pevec (apevec) on 2014-06-05
Changed in ceilometer:
milestone: 2014.1.1 → none
tags: removed: in-stable-icehouse
Bogdan Dobrelya (bogdando) wrote :

Please note, the patch https://review.openstack.org/95489 does not close out this issue; it only syncs kombu_reconnect_delay. Heartbeat support is still on the TODO list, so please reassign as appropriate.

Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/icehouse
Review: https://review.openstack.org/99007
Reason: new change id is Ic2ede3046709b831adf8204e4c909c589c1786c4

Changed in oslo:
assignee: nobody → Bogdan Dobrelya (bogdando)
status: Triaged → In Progress
Changed in neutron:
assignee: Bogdan Dobrelya (bogdando) → nobody
Changed in heat:
status: In Progress → Confirmed
assignee: Bogdan Dobrelya (bogdando) → nobody
Changed in ceilometer:
assignee: Bogdan Dobrelya (bogdando) → nobody
Changed in fuel:
milestone: none → 5.1
importance: Undecided → High
status: New → Confirmed
Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → High
milestone: none → 5.1
status: New → Confirmed
Mike Scherbakov (mihgen) on 2014-07-09
Changed in mos:
milestone: 5.1 → 5.0.1
Changed in mos:
assignee: MOS Oslo (mos-oslo) → Alexei Kornienko (alexei-kornienko)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in oslo:
assignee: Bogdan Dobrelya (bogdando) → nobody
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo)
no longer affects: fuel
Changed in mos:
status: Confirmed → Fix Committed
OSCI Robot (oscirobot) on 2014-07-23
Changed in mos:
status: Fix Committed → In Progress
Changed in mos:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/99015
Committed: https://git.openstack.org/cgit/openstack/oslo-incubator/commit/?id=14720138309c67d3a6dcaeb6b7a784e21cd74ad2
Submitter: Jenkins
Branch: stable/icehouse

commit 14720138309c67d3a6dcaeb6b7a784e21cd74ad2
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Jun 10 14:26:42 2014 +0300

    Slow down Kombu reconnect attempts

    For a rationale for this patch, see the discussion surrounding Bug

    When reconnecting to a RabbitMQ cluster with mirrored queues in
    use, the attempt to release the connection can hang "indefinitely"
    somewhere deep down in Kombu. Blocking the thread for a bit
    prior to release seems to kludge around the problem where it is
    otherwise reproducible.

    DocImpact

    Change-Id: Ic2ede3046709b831adf8204e4c909c589c1786c4
    Partial-Bug: #856764

tags: added: in-stable-icehouse
no longer affects: oslo-incubator

Related fix proposed to branch: master
Review: https://review.openstack.org/126330

Change abandoned by Ilya Pekelny (<email address hidden>) on branch: master
Review: https://review.openstack.org/126329
Reason: Invalid change ID

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/132979

Change abandoned by James Page (<email address hidden>) on branch: master
Review: https://review.openstack.org/94656
Reason: Alternative implementation proposed which is more complete

sridhar basam (sri-7) wrote :

Our rabbitmq problems have gone away since moving to a rabbitmq version > 3.3.0, due to the following change in rabbitmq:

26070 automatically reconsume when mirrored queues fail over (and
      introduce x-cancel-on-ha-failover argument for the old behaviour)

This moves the logic to enable consumption on a queue back to the server side by default. Previously during a queue failover, the server notified consumers about the need to reconsume and left it to the clients to initiate it. Using version 3.3.5 of rabbitmq and 2.5.12 of kombu, we haven't had a single stuck queue after multiple restarts of members in our rabbitmq cluster.

Bogdan Dobrelya (bogdando) wrote :

That is a good point, thank you. I believe oslo.messaging should have an option (default false) to use this x-cancel-on-ha-failover argument for all created queues.
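If oslo.messaging grew such an option, the change would amount to adding one broker-specific argument at queue-declare time. A minimal sketch with a hypothetical helper (not oslo.messaging code):

```python
# Hypothetical sketch: build queue-declare arguments that restore the
# pre-3.3.0 behaviour, where RabbitMQ cancels consumers on an HA failover
# so clients notice and re-subscribe themselves.
def queue_arguments(cancel_on_ha_failover=False):
    args = {}
    if cancel_on_ha_failover:
        # Broker-specific argument recognized by RabbitMQ >= 3.3.0.
        args["x-cancel-on-ha-failover"] = True
    return args

print(queue_arguments(cancel_on_ha_failover=True))
```

The dict would be passed as the queue's arguments on declaration; with the flag off (the suggested default), rabbit >= 3.3.0 re-consumes automatically on failover.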

ZhaoHangbo (497492840-9) on 2014-12-09
Changed in cinder:
status: New → Confirmed
Changed in mos:
status: Fix Committed → Incomplete
Changed in ceilometer:
status: In Progress → Invalid
Changed in mos:
status: Incomplete → Fix Committed
Changed in heat:
assignee: nobody → Deliang Fan (vanderliang)

Fix proposed to branch: master
Review: https://review.openstack.org/146047

Changed in oslo.messaging:
assignee: James Page (james-page) → Mehdi Abaakouk (sileht)

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/148891
Reason: wrong change id: see https://review.openstack.org/#/c/146047/

Mehdi Abaakouk (sileht) on 2015-01-29
Changed in oslo.messaging:
milestone: none → next-kilo

Reviewed: https://review.openstack.org/148890
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=16ee9a86830a1740655c097cd4714c67e31129bb
Submitter: Jenkins
Branch: master

commit 16ee9a86830a1740655c097cd4714c67e31129bb
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 10:24:54 2015 +0100

    Refactor the replies waiter code

    This change improves the way we wait for replies.
    Currently, one of the rpc clients is responsible for polling the amqp
    connection used for replies and passing received answers to the correct
    client.

    As a result, if no client is waiting for a reply, the connection is not
    polled and no IO is done on the wire. The direct effect of this is that
    we don't detect if the tcp connection is broken; from the system's point
    of view, the tcp connection stays alive even if something between the
    client and server has closed the connection.

    This change refactors the replies waiter code by creating a background
    thread responsible for polling the connection instead of a random client.
    A lost connection will be detected as soon as possible, even if no rpc
    client is currently using the connection.

    This is a mandatory change to be able to enable heartbeat on this
    connection.

    Related-Bug: #1371723
    Related-Bug: #856764

    Change-Id: I82d4029dd897ef13ae8ba3cda84a2fe65c8c91d2
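The idea behind this refactor can be sketched in stdlib Python (this is an illustrative stand-in, not the actual oslo.messaging code): a single background thread drains the reply connection continuously and routes each answer to the waiter registered for its msg_id, so the connection is polled even when no RPC call is in flight.

```python
import queue
import threading

class ReplyWaiter:
    def __init__(self, connection):
        # 'connection' is anything with a blocking .recv() returning
        # (msg_id, payload) tuples -- a stand-in for the amqp connection.
        self._connection = connection
        self._waiters = {}
        self._lock = threading.Lock()
        # The background thread, not a random client, owns the polling.
        self._poller = threading.Thread(target=self._poll_loop, daemon=True)
        self._poller.start()

    def listen(self, msg_id):
        # Register a mailbox the poller will deliver the reply into.
        box = queue.Queue(maxsize=1)
        with self._lock:
            self._waiters[msg_id] = box
        return box

    def _poll_loop(self):
        while True:
            msg_id, payload = self._connection.recv()
            with self._lock:
                box = self._waiters.pop(msg_id, None)
            if box is not None:
                box.put(payload)

# Usage: fake the wire with a queue to show the routing.
fake_wire = queue.Queue()

class FakeConn:
    def recv(self):
        return fake_wire.get()

waiter = ReplyWaiter(FakeConn())
box = waiter.listen('req-1')
fake_wire.put(('req-1', {'result': 42}))
result = box.get(timeout=5)
print(result)  # {'result': 42}
```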

Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https://review.openstack.org/126330
Reason: Ok Ilya, i'll mark it as abandoned

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/152201
Reason: Merged into the heartbeat patch.

Jason Harley (redmind) wrote :

Is there any work being done to backport heartbeats to Icehouse's Oslo messaging?

Mehdi Abaakouk (sileht) on 2015-02-25
Changed in oslo.messaging:
milestone: 1.7.0 → none
milestone: none → next-kilo
Changed in oslo.messaging:
milestone: 1.8.0 → next-liberty
Ivan Kolodyazhny (e0ne) on 2015-03-17
Changed in cinder:
assignee: nobody → Ivan Kolodyazhny (e0ne)

Reviewed: https://review.openstack.org/146047
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=b9e134d7e955b9180482d2f7c8844501c750adf6
Submitter: Jenkins
Branch: master

commit b9e134d7e955b9180482d2f7c8844501c750adf6
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 09:13:10 2015 +0100

    rabbit: heartbeat implementation

    AMQP offers a heartbeat feature to ensure that the application layer
    promptly finds out about disrupted connections (and also completely
    unresponsive peers). If the client requests heartbeats on connection,
    the rabbit server will regularly send messages to each connection with
    the expectation of a response.

    To achieve this, each driver connection object spawns a thread that
    sends/retrieves the heartbeat packets exchanged between the server and
    the client.

    To protect concurrent access to the kombu connection between the driver
    and this thread, we use a lock that always prioritizes the heartbeat
    thread. So when the heartbeat thread wakes up it will acquire the lock
    quickly, to ensure there is no heartbeat starvation when the driver
    sends a lot of messages.

    Also, when we are polling the broker the lock can be held for a long
    time by the 'consume' method, so that method does the heartbeat work
    itself.

    DocImpact: 2 new configuration options for Rabbit driver

    Co-Authored-By: Oleksii Zamiatin <email address hidden>
    Co-Authored-By: Ilya Pekelny <email address hidden>

    Related-Bug: #1371723
    Closes-Bug: #856764

    Change-Id: I1d3a635f3853bc13ffc14034468f1ac6262c11a3
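For reference, the two configuration options mentioned in the DocImpact note landed, as I understand the released driver, as `heartbeat_timeout_threshold` and `heartbeat_rate`. A sketch of enabling them (values shown are the documented defaults, for illustration):

```ini
[oslo_messaging_rabbit]
# Seconds after which an unanswered heartbeat marks the connection dead;
# 0 disables the heartbeat entirely.
heartbeat_timeout_threshold = 60
# How many times the heartbeat is checked within heartbeat_timeout_threshold.
heartbeat_rate = 2
```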

Changed in oslo.messaging:
status: In Progress → Fix Committed
OSCI Robot (oscirobot) wrote :

RPM package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1.mira10.git.b9e134d.acb3abf

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: patchset-created

Files placed on repository:
python-oslo-messaging-1.8.0-fuel6.1.mira10.git.b9e134d.acb3abf.noarch.rpm
python-oslo-messaging-doc-1.8.0-fuel6.1.mira10.git.b9e134d.acb3abf.noarch.rpm

NOTE: Changeset is not merged, created temporary package repository.
RPM repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-master-4736/centos

OSCI Robot (oscirobot) wrote :

RPM package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1.mira10

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: change-merged

Files placed on repository:
python-oslo-messaging-1.8.0-fuel6.1.mira10.noarch.rpm
python-oslo-messaging-doc-1.8.0-fuel6.1.mira10.noarch.rpm

Changeset merged. Package placed on primary repository
RPM repository URL: http://osci-obs.vm.mirantis.net:82/centos-fuel-master/centos

OSCI Robot (oscirobot) wrote :

DEB package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1~mira10

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: change-merged

Files placed on repository:
python-oslo.messaging_1.8.0-fuel6.1~mira10_all.deb

Changeset merged. Package placed on primary repository
DEB repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-master/ubuntu

OSCI Robot (oscirobot) wrote :

DEB package oslo.messaging has been built for project openstack/oslo.messaging
Package version == 1.8.0, package release == fuel6.1~mira10+git.b9e134d.acb3abf

Changeset: https://review.fuel-infra.org/4736
project: openstack/oslo.messaging
branch: master
author: Pekelny Ilya
committer: openstack-ci-mirrorer-jenkins
subject: rabbit: heartbeat implementation
status: patchset-created

Files placed on repository:
python-oslo.messaging_1.8.0-fuel6.1~mira10+git.b9e134d.acb3abf_all.deb

NOTE: Changeset is not merged, created temporary package repository.
DEB repository URL: http://osci-obs.vm.mirantis.net:82/ubuntu-fuel-master-4736/ubuntu

Changed in heat:
assignee: Deliang Fan (vanderliang) → nobody

Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https://review.openstack.org/167299

Reviewed: https://review.openstack.org/167308
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=64bdd80c5fe4d53ac8d7ab3ed906ec9feaeb7ec4
Submitter: Jenkins
Branch: stable/kilo

commit 64bdd80c5fe4d53ac8d7ab3ed906ec9feaeb7ec4
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 09:13:10 2015 +0100

    rabbit: heartbeat implementation

    AMQP offers a heartbeat feature to ensure that the application layer
    promptly finds out about disrupted connections (and also completely
    unresponsive peers). If the client requests heartbeats on connection,
    the rabbit server will regularly send messages to each connection with
    the expectation of a response.

    To achieve this, each driver connection object spawns a thread that
    sends/retrieves the heartbeat packets exchanged between the server and
    the client.

    To protect concurrent access to the kombu connection between the driver
    and this thread, we use a lock that always prioritizes the heartbeat
    thread. So when the heartbeat thread wakes up it will acquire the lock
    quickly, to ensure there is no heartbeat starvation when the driver
    sends a lot of messages.

    Also, when we are polling the broker the lock can be held for a long
    time by the 'consume' method, so that method does the heartbeat work
    itself.

    DocImpact: 2 new configuration options for Rabbit driver

    Co-Authored-By: Oleksii Zamiatin <email address hidden>
    Co-Authored-By: Ilya Pekelny <email address hidden>

    Related-Bug: #1371723
    Closes-Bug: #856764

    Change-Id: I1d3a635f3853bc13ffc14034468f1ac6262c11a3
    (cherry picked from commit b9e134d7e955b9180482d2f7c8844501c750adf6)

tags: added: in-stable-kilo
Mehdi Abaakouk (sileht) on 2015-03-25
Changed in oslo.messaging:
milestone: next-liberty → 1.8.1
status: Fix Committed → Fix Released

The attachment "impl_kombu.py.patch" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package oslo.messaging - 1.8.1-0ubuntu1

---------------
oslo.messaging (1.8.1-0ubuntu1) vivid; urgency=medium

  * New upstream release for OpenStack Kilo, including enablement
    of RabbitMQ heartbeating for improved connection failure detection
    (LP: #856764):
    - d/p/zmq-redis-fix-topic-registration.patch,
      d/p/disable-zmq-tests.patch: Dropped, included upstream.
    - d/p/zmq-client-pooling.patch: Rebase.
    - d/p/disable-new-executors.patch: Disable hard requirement for
      trollius and aioeventlet executors for vivid release.
    - d/control: Align minimum version requirements with upstream.
  * d/pydist-overrides: Add overrides for new oslo package naming.
  * Misc fixes for zmq driver:
    - d/p/Fix-changing-keys-during-iteration-in-matchmaker-hea.patch:
      Fix changing keys during iteration in matchmaker heartbeat
      (LP: #1432966).
    - d/p/Add-pluggability-for-matchmakers.patch: Add entry points
      for matchmaker drivers (LP: #1291701).
 -- James Page <email address hidden> Mon, 30 Mar 2015 09:52:29 +0100

Changed in oslo.messaging (Ubuntu):
status: New → Fix Released
Quentin MACHU (quentin-machu) wrote :

Is there any way to have this fixed in Juno too?

+1, is there a way to apply / backport this fix into Juno?
Or maybe pip install -U oslo.messaging will do?

Rongze Zhu (zrzhit) wrote :

@Mehdi Abaakouk, @Alexei Kornienko, my patch adding keepalive options has been merged into pyamqp [1]. It is very useful for detecting that the connection has been terminated and raising a socket error exception.

We can add keepalive options to oslo.messaging [2] and pass them to the kombu pyamqp transport, so an idle consumer connection will notice that it has been terminated; the consumer will then catch the socket exception and reconnect.

The TCP keepalive method is simpler than a heartbeat-checking thread. I have used this approach in multiple production environments for more than a year, and it is effective.

[1] https://github.com/celery/py-amqp/commit/b9a6601a93927449fa6f524750e3842cc5c181bd
[2] https://github.com/zhurongze/oslo.messaging/commit/c04d9b18536e8032a79c1889adde7beaf517adaf
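The keepalive options described above boil down to standard socket options. A minimal stdlib sketch (the timing values are illustrative, and the per-probe TCP_KEEP* constants are Linux-specific, so they are set only when available):

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalives with the given (illustrative) timings."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Start probing after 'idle' seconds of silence (Linux).
    if hasattr(socket, 'TCP_KEEPIDLE'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # Probe every 'interval' seconds thereafter (Linux).
    if hasattr(socket, 'TCP_KEEPINTVL'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # Declare the peer dead after 'count' failed probes (Linux).
    if hasattr(socket, 'TCP_KEEPCNT'):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
print(keepalive_on)  # non-zero once keepalives are enabled
```

With keepalives enabled on the socket passed to the amqp transport, the kernel itself probes idle connections, so a firewall that silently drops state is detected without any application-level heartbeat.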

Alan Pevec (apevec) on 2015-11-24
tags: removed: havana-backport-potential in-stable-icehouse in-stable-kilo
Thomas Herve (therve) on 2016-04-19
no longer affects: heat
Changed in oslo.messaging (Ubuntu):
importance: Undecided → High
Sean McGinnis (sean-mcginnis) wrote :

If I follow correctly, this no longer affects Cinder since it was implemented in oslo.messaging.

Changed in cinder:
status: Confirmed → Invalid
no longer affects: neutron