Oslo - a Library of Common OpenStack Code

RabbitMQ connections lack heartbeat or TCP keepalives

Reported by Rafi Khardalian on 2011-09-22
This bug affects 16 people
Affects                    Importance   Assigned to
OpenStack Compute (nova)   High         Unassigned
oslo                       High         Kiall Mac Innes
oslo.messaging             High         Nicolas Simonds

Bug Description

There is currently no method built into Nova to keep connections from its various components to RabbitMQ alive. As a result, placing a stateful firewall (such as a Cisco ASA) in the connection path can/does result in idle connections being terminated without either endpoint being aware.

This issue can be mitigated a few different ways:

1. Have connections to RabbitMQ set socket options to enable TCP keepalives (see the sketch after this list).

2. Rabbit has heartbeat functionality. If the client requests heartbeats on connection, the rabbit server will regularly send messages to each connection with the expectation of a response.

3. Other?
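
As a rough sketch of option 1 (illustrative only and not from this report; the host name and the Linux-specific tunable values are assumptions), a client could enable TCP keepalives on the socket before handing it to the AMQP library:

import socket

# Enable keepalives so that a stateful firewall silently dropping the idle
# connection is eventually detected on the client side.
sock = socket.create_connection(('rabbit.example.com', 5672))
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Optional per-socket overrides of the (very long) kernel defaults, Linux only:
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # seconds idle before the first probe
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # failed probes before the connection is dropped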

Thierry Carrez (ttx) on 2011-10-21
Changed in nova:
importance: Undecided → Wishlist
status: New → Confirmed
Andrea Rosa (andrea-rosa-m) wrote :

For solution 2 (heartbeat functionality) we would need to use another AMQP client (for example pika); at the moment python-amqplib doesn't implement heartbeats.

Brad McConnell (bmcconne) wrote :

Just wanted to add an alternate solution to this for the folks that run into this bug while searching. If you make the ASA send active resets instead of silently dropping the connections out of its table, your environment should stabilize. Something along the lines of the following, plus any appropriate adjustments for port/policy-map based upon your individual environment:

class-map rabbit-hop
 match port tcp eq 5672
policy-map global_policy
 class rabbit-hop
  set connection timeout idle 12:00:00 reset

Russell Bryant (russellb) wrote :

From searching around, it sounds like this should no longer be an issue, since TCP keepalives are now enabled by default:

"amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"

Changed in nova:
status: Confirmed → Invalid
Justin Hopper (justin-hopper) wrote :

The version of kombu we are now using, and the py-amqp lib that provides the transport, support heartbeats.

Heartbeats will help close connections when a client using rabbit is forcefully terminated.

Using heartbeats may be an option; if so, it can be exposed to the rpc-component user either by way of server-params or as a configuration option for the rpc component.

Changed in nova:
status: Invalid → New
Kiall Mac Innes (kiall) wrote :

By pure fluke, I submitted this a few days back: https://review.openstack.org/#/c/34949

It adds heartbeat support to the Kombu driver.

Changed in oslo:
assignee: nobody → Kiall Mac Innes (kiall)
status: New → In Progress
Mark McLoughlin (markmc) wrote :

Russell's point should be addressed:

  "amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"

Mark McLoughlin (markmc) wrote :

I asked a bunch of questions in the oslo review

The main thing missing is an explanation of what exactly the heartbeat fixes that SO_KEEPALIVE doesn't already address.

Changed in nova:
status: New → Incomplete
Changed in oslo:
status: In Progress → Incomplete
Kiall Mac Innes (kiall) wrote :

Hey Mark - I've responded to your comments in the review comments. Rather than split the conversation over two places, I'll just leave a link here:

https://review.openstack.org/#/c/34949/

Mark McLoughlin (markmc) wrote :

The convincing point made in the review is that, if we rely on SO_KEEPALIVE, a service sitting there listening for RPC requests will by default have to wait 2 hours to be notified that it has lost its connection with the broker.
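
For context, the 2 hours comes from the Linux default for net.ipv4.tcp_keepalive_time (7200 seconds). A quick way to check, as a sketch:

# Read the kernel's idle time before keepalive probes start (standard procfs path on Linux).
with open('/proc/sys/net/ipv4/tcp_keepalive_time') as f:
    print(f.read().strip() + ' seconds')   # 7200, i.e. 2 hours, by default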

Changed in oslo:
status: Incomplete → Triaged
importance: Undecided → High
Changed in nova:
status: Incomplete → Confirmed
importance: Wishlist → High
status: Confirmed → Triaged
Changed in oslo:
status: Triaged → In Progress

Reviewed: https://review.openstack.org/34949
Committed: http://github.com/openstack/oslo-incubator/commit/c37f6aaab3ac00b7865dee18158114433350237e
Submitter: Jenkins
Branch: master

commit c37f6aaab3ac00b7865dee18158114433350237e
Author: Kiall Mac Innes <email address hidden>
Date: Fri Jun 28 21:14:26 2013 +0100

    Add support for heartbeating in the kombu RPC driver

    This aides in detecting connection interruptions that would otherwise
    go unnoticed.

    Fixes bug #856764

    Change-Id: Id4eb3d36036969b62890175d6a33b4e304be0527

Changed in oslo:
status: In Progress → Fix Committed
Thierry Carrez (ttx) on 2013-07-17
Changed in oslo:
milestone: none → havana-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2013-08-14
Changed in oslo:
status: Fix Released → Triaged
Kevin Bringard (kbringard) wrote :

I spoke with MarkMc about this in #openstack-dev, but another thing I've discovered:

I should start by saying I'm in no way an amqp or rabbit expert. This is just based on a lot of googling, testing in my environment, and trial and error. If I say something which doesn't make sense, it's quite possible it doesn't :-D

In rabbit, when master promotion occurs, a slave queue will kick off all of its consumers, but not kill the connection (http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-January/017341.html). An almost identical issue was brought up on the SpringSource client forums here: http://forum.springsource.org/archive/index.php/t-121480.html.

While the amqp libraries support connection disruption handling, they don't appear to handle channel disruption or consumer cancel notifications. The end result is that when a master promotion occurs in rabbit, the OpenStack services will continue to consume from a queue whose channel has been closed.

Once you get all your consumers to re-establish their channels, messages begin flowing again, but the ultimate result is that a single node failure can cause the majority (or even all) messages to stop flowing to OS services until you force them to re-establish (either by bouncing all rabbit nodes with attached/hung consumers or by restarting individual OS services).

You can reproduce the effects like so:

* Determine the master for any given queue.
** I generally do this by running watch "rabbitmqctl list_queues -p /nova name slave_pids synchronised_slave_pids messages messages_unacknowledged consumers | grep -v fanout" and looking for the node in the cluster which is not a slave (inherently making it the master)
* Stop rabbit on the master node
* Watch the consumers column. It should mostly drop to 0, and busy queues (such as q-plugin) will likely begin backing up
* Pick a service (quantum-server works well, as it will drain q-plugin) and validate which rabbit node it is connected to (netstat, grepping the logs of the service, or rabbitmqctl list_connections name should find it pretty easily)
* Restart said service or the rabbit broker it is connected to
* Once it restarts and/or determines the connection has been lost, the connection will be re-established
* Go back to your watch command, and you should now see the new subscriber on its specific queue

I'm adding notes here because I'm not sure if the heartbeat implementation works at the channel level, or if we need to implement consumer cancel notification support (https://lists.launchpad.net/openstack/msg15111.html).

Regardless, without properly handling master promotion in rabbit, using HA queues is a moot exercise, as losing a single node can cause all messages to stop flowing. Given the heavy reliance on the message queue, I think we need to be especially careful how we handle this and make it as solid as possible.

Kevin Bringard (kbringard) wrote :

So it looks like Ask Solem outlines how we need to do heartbeats in this post:

https://lists.launchpad.net/openstack/msg15111.html

Specifically:

An example of enabling heartbeats with eventlet could be:

import weakref
from kombu import Connection
from eventlet import spawn_after

def monitor_heartbeats(connection, rate=2):
    if not connection.heartbeat:
        return
    interval = connection.heartbeat / 2.0
    cref = weakref.ref(connection)

    def heartbeat_check():
        conn = cref()
        if conn is not None and conn.connected:
            conn.heartbeat_check(rate=rate)
            spawn_after(interval, heartbeat_check)

    return spawn_after(interval, heartbeat_check)

connection = Connection('pyamqp://', heartbeat=10)

or:

connection = Connection('pyamqp://?heartbeat=10')

Additionally, I think adding support for consumer cancel notifications would help with the master promotion issues I outlined above. From Ask's email:

- Consumer cancel notifications

Requires no changes to your code; all you need is to properly reconnect when one of the errors in Connection.channel_errors occurs, which is handled automatically by Connection.ensure / Connection.autoretry (I don't believe Nova uses that, but it probably should).

Of course, this all requires updating to a newer version of kombu and amqp as well, but based on our experiences with rabbit, I really think the benefits of adding this functionality will help tremendously from an enterprise operationally ready standpoint. Without it, the HA story in rabbit is pretty dismal :-/
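
As a rough illustration of the Connection.ensure pattern Ask describes (a sketch only, not code from this bug; the broker URL, routing key, and retry values are assumptions):

from kombu import Connection, Producer

conn = Connection('amqp://guest:guest@localhost//', heartbeat=10)

def errback(exc, interval):
    # Called between retries when a recoverable connection/channel error is
    # raised; kombu re-establishes the connection before retrying.
    print('Broker error: %r, retrying in %ss' % (exc, interval))

producer = Producer(conn)
# ensure() wraps publish so that errors from Connection.channel_errors /
# connection_errors trigger a reconnect and a retry instead of wedging the caller.
safe_publish = conn.ensure(producer, producer.publish,
                           errback=errback, max_retries=3)
safe_publish({'hello': 'world'}, routing_key='test_queue')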

Kevin Bringard (kbringard) wrote :

So, based on Ask's comment about notifications, I started looking into it. As it turns out, *if* you're running a version of kombu/amqp which supports the channel_errors object (version 2.1.4 seems to be when it was introduced: http://kombu.readthedocs.org/en/latest/changelog.html), the following simple patch resolves the issue (also attached):

--- impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
+++ impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
@@ -488,7 +488,6 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
-        self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -562,7 +561,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
-            except (self.channel_errors, socket.timeout, IOError), e:
+            except (self.connection_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Basically, in ensure() you want to watch the channel and not the connection.

I verified this in a 2 node rabbit cluster. There are 2 nodes: .139 and .141. .139 is currently the master.

The following is from the nova logs when .139 is stopped (and .141 is promoted to the master):

Notice, we're connected to 192.168.128.141:

2013-08-22 21:27:45.807 INFO nova.openstack.common.rpc.common [req-20aa6610-b0df-4730-9773-6024e47a6da7 None None] Connected to AMQP server on 192.168.128.141:5672
2013-08-22 21:27:45.843 INFO nova.openstack.common.rpc.common [req-c82c8ea0-aa8b-49b0-925c-b79399f011de None None] Connected to AMQP server on 192.168.128.141:5672

...

Then, we stop rabbit on .139 and see the following *channel* error:

2013-08-22 21:28:13.475 20003 ERROR nova.openstack.common.rpc.common [-] Failed to consume message from queue: tag u'2'
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common Traceback (most recent call last):
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 572, in ensure
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return method(*args, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/lib/python2.7/dist-packages/nova/openstack/common/rpc/impl_kombu.py", line 654, in _consume
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return self.connection.drain_events(timeout=timeout)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/connection.py", line 281, in drain_events
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common return self.transport.drain_events(self.connection, **kwargs)
2013-08-22 21:28:13.475 20003 TRACE nova.openstack.common.rpc.common File "/usr/local/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", lin...


Kevin Bringard (kbringard) wrote :

Sorry, realized I created the patch the wrong way. :facepalm:

This is how it *should* be:

--- impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
+++ impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
@@ -488,6 +488,7 @@
             self.connection = None
         self.connection = kombu.connection.BrokerConnection(**params)
         self.connection_errors = self.connection.connection_errors
+        self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
             self.connection.transport.polling_interval = 0.0
@@ -561,7 +562,7 @@
         while True:
             try:
                 return method(*args, **kwargs)
-            except (self.connection_errors, socket.timeout, IOError), e:
+            except (self.channel_errors, socket.timeout, IOError), e:
                 if error_callback:
                     error_callback(e)
             except Exception, e:

Kevin Bringard (kbringard) wrote :

Quick update on this... I will probably submit this patch upstream. One note: the channel_errors object seems to exist even in older kombu, so we can declare it without an error, but it doesn't get populated, as those versions of kombu don't populate it.

The supplied patch should "work" on any version, but will only detect channel_errors when running versions of kombu which support it.

Doubtlessly this could be cleaner, and I still think that adding heartbeat support to actively populate and check the channel would be worthwhile, but this should also help with the issue in the short term.

It's also worth pointing out that the newer versions of kombu inherently support a lot of the functionality we're duplicating, such as ensuring connections exist, pooling connections and determining which servers to use and in what order. It's probably worth looking at implementing those once the newer versions of kombu are "standard" on the bulk of distros.
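
For illustration, a sketch of some of those kombu built-ins (the broker hostnames are placeholders, and the exact signatures are assumptions tied to newer kombu releases):

from kombu import Connection
from kombu.pools import connections

# Semicolon-separated alternate URLs give kombu a list of brokers to fail over between.
conn = Connection('amqp://rabbit1//;amqp://rabbit2//',
                  failover_strategy='round-robin')
conn.ensure_connection(max_retries=3)   # retry the initial connect
with connections[conn].acquire(block=True) as pooled:
    channel = pooled.default_channel    # channel from a pooled connection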

Sam Morrison (sorrison) wrote :

Hi Kevin,
Just wondering if you've had a chance to submit this upstream?

Thierry Carrez (ttx) on 2013-10-17
Changed in oslo:
milestone: havana-2 → 2013.2
milestone: 2013.2 → none
Chris Friesen (cbf123) wrote :

Any update on this issue? I've just run into a problem that I think might be related. We have active/standby controllers (using pacemaker) and multiple compute nodes.

If a controller is killed uncleanly all the services come up on the other controller but it takes about 9 minutes or so before I can boot up a new instance. After that time I see "nova.openstack.common.rpc.common [-] Failed to consume message from queue: Socket closed" on the compute nodes, then it reconnects to the AMQP server and I can then boot an instance.

Unfortunately, any instances I tried to boot during those 9 minutes stay in the "BUILD/scheduling" state forever.

Vish Ishaya (vishvananda) wrote :

The following fix works for failover, but doesn't solve all of the problems in HA mode. For that, Kevin's patch above is needed.

When a connection to a socket is cut off completely, the receiving side doesn't know that the connection has dropped, so it can end up with a half-open connection. The general solution for this on Linux is to turn on TCP keepalives. Kombu will enable keepalives if the version number is high enough (>1.0 iirc), but rabbit needs to be specially configured to send keepalives on the connections that it creates.

So solving the HA issue generally involves a rabbit config with a section like the following:

[
 {rabbit, [{tcp_listen_options, [binary,
                                {packet, raw},
                                {reuseaddr, true},
                                {backlog, 128},
                                {nodelay, true},
                                {exit_on_close, false},
                                {keepalive, true}]}
          ]}
].

Then you should also shorten the keepalive sysctl settings or it will still take ~2 hrs to terminate the connections:

echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl

Obviously this should be done in a sysctl config file instead of at the command line. Note that if you only want to shorten the rabbit keepalives but keep everything else as a default, you can use an LD_PRELOAD library to do so. For example you could use:

https://github.com/meebey/force_bind/blob/master/README
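
Coming back to the sysctl settings above, the persistent form could look something like this (the file path is an assumption; the values mirror the echo commands):

# /etc/sysctl.d/90-rabbitmq-keepalive.conf (hypothetical file name)
net.ipv4.tcp_keepalive_time = 5
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 1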

Mark McLoughlin (markmc) on 2013-12-11
Changed in oslo.messaging:
importance: Undecided → High
status: New → Triaged
Chet Burgess (cfb-n) wrote :

I have done extensive testing using both Vish's keepalive tuning parameters and Kevin's proposed fix. We've been able to validate that the following work correctly:

1) A client will reconnect if the server it is actively connected to dies (Vish's tuning).
2) A client will reconnect if the AMQP master for the queue it's subscribed to goes away (Kevin's proposed fix).

As the original reporters of this we feel the combination successfully addresses the issue and allows for a complete HA solution at the RPC level with rabbit.

Given the time since the patch was posted to the issue I plan on submitting a review to oslo.messaging with the proposed fix as soon as I have definitively confirmed what version of kombu will be required.

I also think we should open a doc bug to document the tuning parameters Vish has outlined. The default behavior out of the box is fairly poor and the HA story isn't really complete until both things are done.

I'm not entirely sure of the proper procedure for the doc bug so any guidance would be appreciated.

Sergey Pimkov (sergey-pimkov) wrote :

Seems like TCP keepalive settings are not enough to provide good failure tolerance. For example, in my OpenStack cluster nova-conductor and the neutron agents would always get stuck with some unacknowledged TCP traffic, so the TCP keepalive timer was never even started. After 900 seconds the services began to work again.

This problem is explained on Stack Overflow: http://stackoverflow.com/questions/16320039/getting-disconnection-notification-using-tcp-keep-alive-on-write-blocked-socket

Currently I use a hacky workaround: setting TCP_USER_TIMEOUT with a hardcoded value on the socket in the amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!

I'm not sure if this is germane to the original bug report, but this seems to be where the discussion about RabbitMQ failover is happening, so here's the current state of the art, as far as we can tell:

With the RabbitMQ configs described above (and RabbitMQ 3.2.2), failover works pretty seamlessly, and Kombu 2.5.x and newer handle the Consumer Cancel Notifications properly and promptly.

Where things get interesting is when you have a cluster of >2 RabbitMQ servers and mirrored queues enabled. We're seeing an odd phenomenon where, upon failover, a random subset of nova-compute nodes will "orphan" their topic and fanout queues, and never consume messages from them. They will still publish messages successfully, though, so commands like "nova service-list" will show the nodes as active, although for all intents and purposes, they're dead.

We're not 100% sure why this is happening, but log analysis and observation causes us to wildly speculate that on failover with mirrored queues, RabbitMQ forces an election to determine a new master, and if clients attempt to teardown and re-establish their queues before the election has concluded, they will encounter a race condition where their termination requests get eaten and are unacknowledged by the server, and the clients just hang out forever waiting for their requests to complete, and never retry.

With Kombu 2.5.x, a restart of nova-compute is required to get them to reconnect, and the /usr/bin/nova-clear-rabbit-queues command must be run to clear out the "stale" fanout queues. With Kombu 3.x and newer, the situation is improved, and stopping RabbitMQ on all but one server will cause new CCNs to be generated, and the clients will cleanly migrate to the remaining server and begin working again.

This is still sub-wonderful because when the compute nodes "go dead", they can't receive messages on the bus, but Nova still thinks they're fine. As a dodge around this, we've added a config option to the conductor to introduce an artificial delay before Kombu responds to CCNs. The default value of 1.0 seconds seems to be more than enough time for RabbitMQ to get itself sorted out and avoid races, but users can turn it up (or down) as desired.

Fix proposed to branch: master
Review: https://review.openstack.org/76686

Changed in oslo.messaging:
assignee: nobody → Nicolas Simonds (nicolas.simonds)
status: Triaged → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/77276

Changed in oslo.messaging:
assignee: Nicolas Simonds (nicolas.simonds) → Chet Burgess (cfb-n)
assignee: Chet Burgess (cfb-n) → Nicolas Simonds (nicolas.simonds)

Reviewed: https://review.openstack.org/77276
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=0400cbf4f83cf8d58076c7e65e08a156ec3508a8
Submitter: Jenkins
Branch: master

commit 0400cbf4f83cf8d58076c7e65e08a156ec3508a8
Author: Chet Burgess <email address hidden>
Date: Fri Feb 28 13:39:09 2014 -0800

    Gracefully handle consumer cancel notifications

    With mirrored queues and clustered rabbit nodes a queue is still
    mastered by a single rabbit node. When the rabbit node dies an
    election occurs amongst the remaining nodes and a new master is
    elected. When a slave is promoted to master it will close all the
    open channels to its consumers but it will not close the
    connections. This is reported to consumers as a consumer cancel
    notification (CCN). Consumers need to re-subscribe to these queues
    when they receive a CCN.

    kombu 2.1.4+ reports CCNs as channel errors. This patch updates
    the ensure function to be more in line with the upstream kombu
    functionality. We now monitor for channel errors as well as
    connection errors and initiate a reconnect if we detect an error.

    Change-Id: Ie00f67e65250dc983fa45877c14091ad4ae136b4
    Partial-Bug: 856764

Reviewed: https://review.openstack.org/76686
Committed: https://git.openstack.org/cgit/openstack/oslo.messaging/commit/?id=fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
Submitter: Jenkins
Branch: master

commit fcd51a67d18a9e947ae5f57eafa43ac756d1a5a8
Author: Nicolas Simonds <email address hidden>
Date: Wed Feb 26 15:21:01 2014 -0800

    Slow down Kombu reconnect attempts

    For a rationale for this patch, see the discussion surrounding Bug #856764.

    When reconnecting to a RabbitMQ cluster with mirrored queues in
    use, the attempt to release the connection can hang "indefinitely"
    somewhere deep down in Kombu. Blocking the thread for a bit
    prior to release seems to kludge around the problem where it is
    otherwise reproducible.

    DocImpact

    Change-Id: Ic2ede3046709b831adf8204e4c909c589c1786c4
    Partial-Bug: 856764

Mark McLoughlin (markmc) wrote :

Marking as Invalid for Nova because any fix would be in oslo.messaging

Changed in nova:
status: Triaged → Invalid