RabbitMQ connections lack heartbeat or TCP keepalives
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceilometer | Invalid | High | Unassigned |
Icehouse | Fix Released | High | Bogdan Dobrelya |
Cinder | Invalid | Undecided | Ivan Kolodyazhny |
Mirantis OpenStack | Fix Committed | High | Alexei Kornienko |
OpenStack Compute (nova) | Invalid | High | Unassigned |
oslo.messaging | Fix Released | Critical | Mehdi Abaakouk |
oslo.messaging (Ubuntu) | Fix Released | High | Unassigned |
Bug Description
There is currently no method built into Nova to keep connections from its various components to RabbitMQ alive. As a result, placing a stateful firewall (such as a Cisco ASA) between the endpoints can, and does, result in idle connections being terminated without either endpoint being aware.
This issue can be mitigated a few different ways:
1. Have connections to RabbitMQ set socket options that enable TCP keepalives (see the sketch after this list).
2. RabbitMQ has heartbeat functionality. If the client requests heartbeats on connection, the rabbit server will regularly send messages to each connection with the expectation of a response.
3. Other?
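For illustration, option 1 at the socket level could look like the following minimal sketch (not oslo code; the host and timing values are placeholders):

import socket

# Enable TCP keepalives on the client socket before connecting to RabbitMQ.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
# Linux-specific tuning; the kernel defaults (2 hours idle before the first
# probe) are far too slow to detect a dropped firewall session.
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)    # seconds idle before probing
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # seconds between probes
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before reset
sock.connect(('rabbit-host', 5672))  # 'rabbit-host' is a placeholder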
Changed in nova: | |
importance: | Undecided → Wishlist |
status: | New → Confirmed |
Andrea Rosa (andrea-rosa-m) wrote : | #1 |
Brad McConnell (bmcconne) wrote : | #2 |
Just wanted to add an alternate solution to this for the folks that run into this bug while searching. If you make the ASA send active resets instead of silently dropping the connections out of their table, your environment should stabilize. Something along the lines of the following, plus any appropriate adjustments for port/policy-map based upon your individual environment:
class-map rabbit-hop
match port tcp eq 5672
policy-map global_policy
class rabbit-hop
set connection timeout idle 12:00:00 reset
Russell Bryant (russellb) wrote : | #3 |
From searching around, it sounds like this should no longer be an issue, because TCP keepalives are now enabled by default:
"amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"
Changed in nova: | |
status: | Confirmed → Invalid |
Justin Hopper (justin-hopper) wrote : | #4 |
The version of kombu we are now using, and the py-amqp lib that provides the transport, support heartbeats.
Heartbeats will help close connections when a client using rabbit is forcefully terminated.
Using heartbeats may be an option; if so, it can be exposed to the rpc-component user either by way of server-params or as a configuration option for the rpc component.
Changed in nova: | |
status: | Invalid → New |
Kiall Mac Innes (kiall) wrote : | #5 |
By pure fluke, I submitted this a few days back: https:/
It adds heartbeat support to the Kombu driver.
Changed in oslo: | |
assignee: | nobody → Kiall Mac Innes (kiall) |
status: | New → In Progress |
Mark McLoughlin (markmc) wrote : | #6 |
Russell's point should be addressed:
"amqplib versions after and including 1.0 enables SO_KEEPALIVE by default, and Kombu versions after and including 1.2.1 depends on amqplib >= 1.0"
Mark McLoughlin (markmc) wrote : | #7 |
I asked a bunch of questions in the oslo review
The main thing missing is an explanation of what exactly heartbeats fix that SO_KEEPALIVE doesn't already address.
Changed in nova: | |
status: | New → Incomplete |
Changed in oslo: | |
status: | In Progress → Incomplete |
Kiall Mac Innes (kiall) wrote : | #8 |
Hey Mark - I've responded to your comments in the review comments. Rather than split the conversation over two places, I'll just leave a link here:
Mark McLoughlin (markmc) wrote : | #9 |
The convincing point made in the review is that a service sitting there listening for RPC requests will have to wait 2 hours by default to be notified that it has lost its connection to the broker if we rely on SO_KEEPALIVE.
Changed in oslo: | |
status: | Incomplete → Triaged |
importance: | Undecided → High |
Changed in nova: | |
status: | Incomplete → Confirmed |
importance: | Wishlist → High |
status: | Confirmed → Triaged |
Changed in oslo: | |
status: | Triaged → In Progress |
Reviewed: https:/
Committed: http://
Submitter: Jenkins
Branch: master
commit c37f6aaab3ac00b
Author: Kiall Mac Innes <email address hidden>
Date: Fri Jun 28 21:14:26 2013 +0100
Add support for heartbeating in the kombu RPC driver
This aids in detecting connection interruptions that would otherwise
go unnoticed.
Fixes bug #856764
Change-Id: Id4eb3d36036969
Changed in oslo: | |
status: | In Progress → Fix Committed |
Changed in oslo: | |
milestone: | none → havana-2 |
status: | Fix Committed → Fix Released |
Mike Lundy (novas0x2a) wrote : | #11 |
Note that the fix for this was reverted: https:/
Changed in oslo: | |
status: | Fix Released → Triaged |
Kevin Bringard (kbringard) wrote : | #12 |
I spoke with MarkMc about this in #openstack-dev, but another thing I've discovered:
I should start by saying I'm in no way an ampq or rabbit expert. This is just based on a lot of googling, testing in my environment and trial and error. If I say something which doesn't make sense, it's quite possible it doesn't :-D
In rabbit, when master promotion occurs a slave queue will kick off all of its consumers, but not kill the connection (http://
While the amqp libraries support connection disruption handling, they don't appear to handle channel disruption or consumer cancel notifications. The end result is that when a master promotion occurs in rabbit, the OpenStack services will continue to consume from a queue whose channel has been closed.
Once you get all your consumers to re-establish their channels, messages begin flowing again, but the ultimate result is that a single node failure can cause the majority (or even all) messages to stop flowing to OS services until you force them to re-establish (either by bouncing all rabbit nodes with attached/hung consumers or by restarting individual OS services).
You can reproduce the effects like so:
* Determine the master for any given queue.
** I generally do this by running watch "rabbitmqctl list_queues -p /nova name slave_pids synchronised_
* Stop rabbit on the master node
* Watch the consumers column. It should mostly drop to 0, and busy queues (such as q-plugin) will likely begin backing up
* Pick a service (quantum-server works well, as it will drain q-plugin) and validate which rabbit node it is connected to (netstat, grepping the logs of the service, or rabbitmqctl list_connections name should find it pretty easily)
* Restart said service or the rabbit broker it is connected to
* Once it restarts and/or determines the connection has been lost, the connection will be re-established
* Go back to your watch command, and you should now see the new subscriber on its specific queue
I'm adding notes here because I'm not sure if the heartbeat implementation works at the channel level, or if we need to implement consumer cancel notification support (https:/
Regardless, without properly handling master promotion in rabbit, it makes using HA queues a moot exercise as losing a single node can cause all messages to stop flowing. Given the heavy reliance on the message queue, I think we need to be especially careful how we handle this and make it as solid as possible.
Kevin Bringard (kbringard) wrote : | #13 |
So it looks like Ask Solem outlines how we need to do heartbeats in this post:
https:/
Specifically:
An example of enabling heartbeats with eventlet could be:

import weakref
from kombu import Connection
from eventlet import spawn_after

# (Reconstructed: the original post arrived truncated; the completed lines
# follow kombu's documented heartbeat / heartbeat_check() API.)
def monitor_heartbeats(connection, rate=2):
    if not connection.heartbeat:
        return
    interval = connection.heartbeat / 2.0
    cref = weakref.ref(connection)

    def heartbeat_check():
        conn = cref()
        if conn is not None and conn.connected:
            conn.heartbeat_check(rate=rate)
            return spawn_after(interval, heartbeat_check)

    return spawn_after(interval, heartbeat_check)

connection = Connection('pyamqp://guest@localhost//', heartbeat=10)
or:
connection = Connection('pyamqp://guest@localhost//?heartbeat=10')
Additionally, I think adding support for consumer cancel notifications would aid in the master promotion issues I outlined above. From Ask's email:
- Consumer cancel notifications
Requires no changes to your code; all you need is to properly reconnect when one of the errors in Connection.channel_errors is raised (this can be done automatically by Connection.ensure / Connection.autoretry; I'm not sure if Nova uses that, but it probably should).
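For what it's worth, a minimal sketch of that pattern with kombu (the connection URL, names, and retry values here are illustrative, not from this bug):

from kombu import Connection

def errback(exc, interval):
    # Called between retries with the error and the backoff interval.
    print('Broker error: %r, retrying in %ss' % (exc, interval))

with Connection('pyamqp://guest@localhost//', heartbeat=10) as conn:
    producer = conn.Producer()
    # ensure() wraps the callable so channel/connection errors trigger a
    # reconnect and a retry instead of leaving a dead producer behind.
    safe_publish = conn.ensure(producer, producer.publish,
                               errback=errback, max_retries=3)
    safe_publish({'hello': 'world'}, routing_key='test', exchange='')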
Of course, this all requires updating to a newer version of kombu and amqp as well, but based on our experiences with rabbit, I really think the benefits of adding this functionality will help tremendously from an operational-readiness standpoint. Without it, the HA story in rabbit is pretty dismal :-/
Kevin Bringard (kbringard) wrote : | #14 |
So, based on Ask's comment about notifications, I started looking into it. As it turns out, *if* you're running a version of kombu/amqp which supports the channel_errors object (version 2.1.4 seems to be when it was introduced: http:// ), a fairly small patch to impl_kombu.py takes care of it:
--- impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
+++ impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
@@ -488,7 +488,6 @@
-        self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
@@ -562,7 +561,7 @@
while True:
try:
-            except (self.channel_errors, self.connection_errors), e:
+            except (self.connection_errors), e:
if error_callback:
except Exception, e:
Basically, in ensure() you want to watch the channel and not the connection.
I verified this in a 2 node rabbit cluster. There are 2 nodes: .139 and .141. .139 is currently the master.
The following is from the nova logs when .139 is stopped (and .141 is promoted to the master):
Notice, we're connected to 192.168.128.141:
2013-08-22 21:27:45.807 INFO nova.openstack.
2013-08-22 21:27:45.843 INFO nova.openstack.
...
Then, we stop rabbit on .139 and see the following *channel* error:
2013-08-22 21:28:13.475 20003 ERROR nova.openstack. [...] (the channel-error traceback that followed was truncated in the original post)
Kevin Bringard (kbringard) wrote : | #15 |
Sorry, realized I created the patch the wrong way. :facepalm:
This is how it *should* be:
--- impl_kombu.py.orig 2013-08-22 21:52:37.727386558 +0000
+++ impl_kombu.py.new 2013-08-22 21:52:54.711337602 +0000
@@ -488,6 +488,7 @@
+        self.channel_errors = self.connection.channel_errors
         if self.memory_transport:
             # Kludge to speed up tests.
@@ -561,7 +562,7 @@
while True:
try:
-            except (self.connection_errors), e:
+            except (self.channel_errors, self.connection_errors), e:
if error_callback:
except Exception, e:
Kevin Bringard (kbringard) wrote : | #16 |
Quick update on this... I will probably submit this patch upstream. The channel_errors object also exists in older kombu, so we can reference it without an error, but those versions of kombu never populate it.
The supplied patch should "work" on any version, but will only detect channel_errors when running versions of kombu which support it.
Doubtlessly this could be cleaner, and I still think that adding heartbeat support to actively populate and check the channel would be worthwhile, but this should also help with the issue in the short term.
It's also worth pointing out that the newer versions of kombu inherently support a lot of the functionality we're duplicating, such as ensuring connections exist, pooling connections and determining which servers to use and in what order. It's probably worth looking at implementing those once the newer versions of kombu are "standard" on the bulk of distros.
Sam Morrison (sorrison) wrote : | #17 |
Hi Kevin,
Just wondering if you've had a chance to submit this upstream?
Changed in oslo: | |
milestone: | havana-2 → 2013.2 |
milestone: | 2013.2 → none |
Chris Friesen (cbf123) wrote : | #18 |
Any update on this issue? I've just run into an issue that I think might be related. We have active/standby controllers (using pacemaker) and multiple compute nodes.
If a controller is killed uncleanly all the services come up on the other controller but it takes about 9 minutes or so before I can boot up a new instance. After that time I see "nova.openstack
Unfortunately, any instances I tried to boot during those 9 minutes stay in the "BUILD/scheduling" state forever.
Vish Ishaya (vishvananda) wrote : | #19 |
The following fix works for failover, but doesn't solve all of the problems in HA mode. For that, Kevin's patch above is needed.
When a connection to a socket is cut off completely, the receiving side doesn't know that the connection has dropped, so it can end up with a half-open connection. The general solution for this on Linux is to turn on TCP keepalives. Kombu will enable keepalives if the version number is high enough (>1.0 iirc), but rabbit needs to be specially configured to send keepalives on the connections that it creates.
So solving the HA issue generally involves a rabbit config with a section like the following:
[
  {rabbit, [{tcp_listen_options, [{keepalive, true}]}]}
].
Then you should also shorten the keepalive sysctl settings or it will still take ~2 hrs to terminate the connections:
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_time
echo "5" > /proc/sys/net/ipv4/tcp_keepalive_probes
echo "1" > /proc/sys/net/ipv4/tcp_keepalive_intvl
Obviously this should be done in a sysctl config file instead of at the command line. Note that if you only want to shorten the rabbit keepalives but keep everything else as a default, you can use an LD_PRELOAD library to do so. For example you could use:
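For reference, the persistent equivalent (same assumed values) in /etc/sysctl.conf would look like:

net.ipv4.tcp_keepalive_time = 5
net.ipv4.tcp_keepalive_probes = 5
net.ipv4.tcp_keepalive_intvl = 1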
Changed in oslo.messaging: | |
importance: | Undecided → High |
status: | New → Triaged |
Chet Burgess (cfb-n) wrote : | #20 |
I have done extensive testing using both Vish's keepalive tuning parameters and Kevin's proposed fix. We've been able to validate that the following occur correctly.
1) A client will reconnect if the server it is actively connected to dies (Vish's tuning).
2) A client will reconnect if the AMQP master for the queue it's subscribed to goes away (Kevin's proposed fix).
As the original reporters of this we feel the combination successfully addresses the issue and allows for a complete HA solution at the RPC level with rabbit.
Given the time since the patch was posted to the issue I plan on submitting a review to oslo.messaging with the proposed fix as soon as I have definitively confirmed what version of kombu will be required.
I also think we should open a doc bug to document the tuning parameters Vish has outlined. The default behavior out of the box is fairly poor and the HA story isn't really complete until both things are done.
I'm not entirely sure of the proper procedure for the doc bug so any guidance would be appreciated.
Sergey Pimkov (sergey-pimkov) wrote : | #21 |
It seems like TCP keepalive settings are not enough to provide good failure tolerance. For example, in my openstack cluster nova-conductor and the neutron agents would always get stuck with some unacknowledged TCP traffic, so the TCP keepalive timer never even started. After 900 seconds the services began to work again.
This problem is explained on Stack Overflow: http://
Currently I use a hacky workaround: setting TCP_USER_TIMEOUT to a hardcoded value on the socket in the amqp library (the patch is attached). Is there a more elegant way to solve this problem? Thank you!
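For illustration, the workaround amounts to something like this on the AMQP socket (a sketch; the 30-second value is arbitrary, and older Pythons lack the socket.TCP_USER_TIMEOUT constant, hence the raw Linux option number):

import socket

TCP_USER_TIMEOUT = 18  # Linux option number; socket.TCP_USER_TIMEOUT in newer Pythons

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Abort the connection if transmitted data stays unacknowledged for 30s,
# which covers the case where the keepalive timer never starts.
sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 30000)  # milliseconds
sock.connect(('rabbit-host', 5672))  # placeholder host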
Nicolas Simonds (nicolas.simonds) wrote : | #22 |
I'm not sure if this is germane to the original bug report, but this seems to be where the discussion about RabbitMQ failover is happening, so here's the current state of the art, as far as we can tell:
With the RabbitMQ configs described above (and RabbitMQ 3.2.2), failover works pretty seamlessly, and Kombu 2.5.x and newer handle the Consumer Cancel Notifications properly and promptly.
Where things get interesting is when you have a cluster of >2 RabbitMQ servers and mirrored queues enabled. We're seeing an odd phenomenon where, upon failover, a random subset of nova-compute nodes will "orphan" their topic and fanout queues, and never consume messages from them. They will still publish messages successfully, though, so commands like "nova service-list" will show the nodes as active, although for all intents and purposes, they're dead.
We're not 100% sure why this is happening, but log analysis and observation cause us to wildly speculate that, on failover with mirrored queues, RabbitMQ forces an election to determine a new master. If clients attempt to tear down and re-establish their queues before the election has concluded, they hit a race condition: their termination requests get eaten and are never acknowledged by the server, so the clients hang out forever waiting for their requests to complete, and never retry.
With Kombu 2.5.x, a restart of nova-compute is required to get them to reconnect, and the /usr/bin/
This is still sub-wonderful because when the compute nodes "go dead", they can't receive messages on the bus, but Nova still thinks they're fine. As a dodge around this, we've added a config option to the conductor to introduce an artificial delay before Kombu responds to CCNs. The default value of 1.0 seconds seems to be more than enough time for RabbitMQ to get itself sorted out and avoid races, but users can turn it up (or down) as desired.
Fix proposed to branch: master
Review: https:/
Changed in oslo.messaging: | |
assignee: | nobody → Nicolas Simonds (nicolas.simonds) |
status: | Triaged → In Progress |
Fix proposed to branch: master
Review: https:/
Changed in oslo.messaging: | |
assignee: | Nicolas Simonds (nicolas.simonds) → Chet Burgess (cfb-n) |
assignee: | Chet Burgess (cfb-n) → Nicolas Simonds (nicolas.simonds) |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 0400cbf4f83cf8d
Author: Chet Burgess <email address hidden>
Date: Fri Feb 28 13:39:09 2014 -0800
Gracefully handle consumer cancel notifications
With mirrored queues and clustered rabbit nodes a queue is still
mastered by a single rabbit node. When the rabbit node dies an
election occurs amongst the remaining nodes and a new master is
elected. When a slave is promoted to master it will close all the
open channels to its consumers but it will not close the
connections. This is reported to consumers as a consumer cancel
notification (CCN). Consumers need to re-subscribe to these queues
when they receive a CCN.
kombu 2.1.4+ reports CCNs as channel errors. This patch updates
the ensure function to be more in line with the upstream kombu
functionality. We now monitor for channel errors as well as
connection errors and initiate a reconnect if we detect an error.
Change-Id: Ie00f67e65250dc
Partial-Bug: 856764
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit fcd51a67d18a9e9
Author: Nicolas Simonds <email address hidden>
Date: Wed Feb 26 15:21:01 2014 -0800
Slow down Kombu reconnect attempts
For a rationale for this patch, see the discussion surrounding Bug
When reconnecting to a RabbitMQ cluster with mirrored queues in
use, the attempt to release the connection can hang "indefinitely"
somewhere deep down in Kombu. Blocking the thread for a bit
prior to release seems to kludge around the problem where it is
otherwise reproducible.
DocImpact
Change-Id: Ic2ede3046709b8
Partial-Bug: 856764
Mark McLoughlin (markmc) wrote : | #27 |
Marking as Invalid for Nova because any fix would be in oslo.messaging
Changed in nova: | |
status: | Triaged → Invalid |
Changed in oslo.messaging: | |
importance: | High → Critical |
Changed in oslo: | |
assignee: | Kiall Mac Innes (kiall) → nobody |
Changed in oslo.messaging: | |
assignee: | Nicolas Simonds (nicolas.simonds) → James Page (james-page) |
Fix proposed to branch: master
Review: https:/
tags: | added: havana-backport-potential |
Bogdan Dobrelya (bogdando) wrote : | #29 |
Please sync kombu_reconnect
Changed in neutron: | |
status: | New → Confirmed |
Changed in heat: | |
status: | New → Confirmed |
Changed in ceilometer: | |
status: | New → Confirmed |
Changed in neutron: | |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
Related fix proposed to branch: master
Review: https:/
Related fix proposed to branch: master
Review: https:/
Changed in heat: | |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
Changed in neutron: | |
status: | Confirmed → In Progress |
Changed in heat: | |
status: | Confirmed → In Progress |
Changed in ceilometer: | |
status: | Confirmed → New |
Related fix proposed to branch: stable/icehouse
Review: https:/
Changed in ceilometer: | |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
status: | New → In Progress |
Changed in ceilometer: | |
importance: | Undecided → High |
milestone: | none → 2014.1.1 |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/icehouse
commit 06eb8bc53225c2b
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 26 13:28:40 2014 +0300
Sync kombu_reconnect
When reconnecting to a RabbitMQ cluster
with mirrored queues in use, the attempt to release the
connection can hang "indefinitely" somewhere deep down
in Kombu. Blocking the thread for a bit prior to
release seems to kludge around the problem where it is
otherwise reproducible.
The value 5.0 fits low-performance environments as well.
Cherry-picked from Oslo.messaging:
fcd51a67d18
Related-bug: #856764
Change-Id: Ifadda4dd9122df
Signed-off-by: Bogdan Dobrelya <email address hidden>
tags: | added: in-stable-icehouse |
Changed in ceilometer: | |
milestone: | 2014.1.1 → none |
tags: | removed: in-stable-icehouse |
Bogdan Dobrelya (bogdando) wrote : | #34 |
Please note, the patch https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo-incubator (stable/icehouse) | #35 |
Related fix proposed to branch: stable/icehouse
Review: https:/
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to oslo-incubator (stable/icehouse) | #37 |
Related fix proposed to branch: stable/icehouse
Review: https:/
Related fix proposed to branch: master
Review: https:/
Related fix proposed to branch: master
Review: https:/
Change abandoned by Bogdan Dobrelya (<email address hidden>) on branch: stable/icehouse
Review: https:/
Reason: new change id is Ic2ede3046709b8
Changed in oslo: | |
assignee: | nobody → Bogdan Dobrelya (bogdando) |
status: | Triaged → In Progress |
Changed in neutron: | |
assignee: | Bogdan Dobrelya (bogdando) → nobody |
Changed in heat: | |
status: | In Progress → Confirmed |
assignee: | Bogdan Dobrelya (bogdando) → nobody |
Changed in ceilometer: | |
assignee: | Bogdan Dobrelya (bogdando) → nobody |
Changed in fuel: | |
milestone: | none → 5.1 |
importance: | Undecided → High |
status: | New → Confirmed |
Changed in mos: | |
assignee: | nobody → MOS Oslo (mos-oslo) |
importance: | Undecided → High |
milestone: | none → 5.1 |
status: | New → Confirmed |
Changed in mos: | |
milestone: | 5.1 → 5.0.1 |
Changed in mos: | |
assignee: | MOS Oslo (mos-oslo) → Alexei Kornienko (alexei-kornienko) |
Changed in fuel: | |
assignee: | nobody → Fuel Library Team (fuel-library) |
Changed in oslo: | |
assignee: | Bogdan Dobrelya (bogdando) → nobody |
Changed in fuel: | |
assignee: | Fuel Library Team (fuel-library) → MOS Oslo (mos-oslo) |
no longer affects: | fuel |
Changed in mos: | |
status: | Confirmed → Fix Committed |
Changed in mos: | |
status: | Fix Committed → In Progress |
Changed in mos: | |
status: | In Progress → Fix Committed |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/icehouse
commit 14720138309c67d
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Jun 10 14:26:42 2014 +0300
Slow down Kombu reconnect attempts
For a rationale for this patch, see the discussion surrounding Bug
When reconnecting to a RabbitMQ cluster with mirrored queues in
use, the attempt to release the connection can hang "indefinitely"
somewhere deep down in Kombu. Blocking the thread for a bit
prior to release seems to kludge around the problem where it is
otherwise reproducible.
DocImpact
Change-Id: Ic2ede3046709b8
Partial-Bug: #856764
tags: | added: in-stable-icehouse |
no longer affects: | oslo-incubator |
Related fix proposed to branch: master
Review: https:/
Related fix proposed to branch: master
Review: https:/
Change abandoned by Ilya Pekelny (<email address hidden>) on branch: master
Review: https:/
Reason: Invalid change ID
Bogdan Dobrelya (bogdando) wrote : | #66 |
related bug https:/
Related fix proposed to branch: master
Review: https:/
Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https:/
Change abandoned by James Page (<email address hidden>) on branch: master
Review: https:/
Reason: Alternative implementation proposed which is more complete
sridhar basam (sri-7) wrote : | #70 |
Our rabbitmq problems have gone away since moving to a version of rabbitmq > 3.3.0, due to the following change in rabbitmq:
26070 automatically reconsume when mirrored queues fail over (and
introduce x-cancel-on-ha-failover)
This moves the logic to re-enable consumption on a queue back to the server side by default. Previously, during a queue failover the server notified consumers about the need to reconsume and left it to the clients to initiate it. Using version 3.3.5 of rabbitmq and 2.5.12 of kombu, we haven't had a single stuck queue after multiple restarts of members in our rabbitmq cluster.
Bogdan Dobrelya (bogdando) wrote : | #71 |
That is a good point, thank you. I believe oslo.messaging should have an option (default false) to use this x-cancel-on-ha-failover feature.
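As a sketch, opting in from the client side could look like this with kombu's consumer_arguments (the queue/exchange names are illustrative):

from kombu import Connection, Exchange, Queue

exchange = Exchange('nova', type='topic')
queue = Queue('compute.node-1', exchange, routing_key='compute.node-1',
              consumer_arguments={'x-cancel-on-ha-failover': True})

with Connection('pyamqp://guest@localhost//') as conn:
    # With the argument set, the broker sends this consumer a cancel
    # notification on mirrored-queue failover instead of silently
    # re-registering it, letting the client run its reconnect logic.
    consumer = conn.Consumer(queue, callbacks=[lambda body, msg: msg.ack()])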
Changed in cinder: | |
status: | New → Confirmed |
Changed in mos: | |
status: | Fix Committed → Incomplete |
Changed in ceilometer: | |
status: | In Progress → Invalid |
Changed in mos: | |
status: | Incomplete → Fix Committed |
Changed in heat: | |
assignee: | nobody → Deliang Fan (vanderliang) |
Fix proposed to branch: master
Review: https:/
Changed in oslo.messaging: | |
assignee: | James Page (james-page) → Mehdi Abaakouk (sileht) |
Related fix proposed to branch: master
Review: https:/
Fix proposed to branch: master
Review: https:/
Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https:/
Reason: wrong change id: see https:/
Changed in oslo.messaging: | |
milestone: | none → next-kilo |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 16ee9a86830a174
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 10:24:54 2015 +0100
Refactor the replies waiter code
This change improves the way we wait for replies.
Currently, one of the RPC clients is responsible for polling the AMQP connection
used for replies and passing received answers to the correct client.
As a result, if no client is waiting for a reply, the
connection is not polled and no I/O is done on the wire. The direct
effect of this is that we don't detect when the TCP connection is broken:
from the system's point of view, the TCP connection stays alive even if something
between the client and the server has closed the connection.
This change refactors the replies waiter code by creating a background
thread responsible for polling the connection, instead of a random client.
A lost connection will be detected as soon as possible, even if no RPC
client is currently using the connection.
This is a mandatory change to be able to enable heartbeats on this
connection.
Related-Bug: #1371723
Related-Bug: #856764
Change-Id: I82d4029dd897ef
Fix proposed to branch: master
Review: https:/
Change abandoned by Davanum Srinivas (dims) (<email address hidden>) on branch: master
Review: https:/
Reason: Ok Ilya, i'll mark it as abandoned
Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https:/
Reason: Merged into the heartbeat patch.
Jason Harley (redmind) wrote : | #80 |
Is there any work being done to backport heartbeats to Icehouse's Oslo messaging?
Changed in oslo.messaging: | |
milestone: | 1.7.0 → none |
milestone: | none → next-kilo |
Changed in oslo.messaging: | |
milestone: | 1.8.0 → next-liberty |
Changed in cinder: | |
assignee: | nobody → Ivan Kolodyazhny (e0ne) |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit b9e134d7e955b91
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 09:13:10 2015 +0100
rabbit: heartbeat implementation
AMQP offers a heartbeat feature to ensure that the application layer
promptly finds out about disrupted connections (and also completely
unresponsive peers). If the client requests heartbeats on connection, the rabbit
server will regularly send messages to each connection with the expectation of
a response.
To achieve this, each driver connection object spawns a thread that
sends/retrieves the heartbeat packets exchanged between the server and the
client.
To protect concurrent access to the kombu connection between the
driver and this thread, we use a lock that always prioritizes the
heartbeat thread. So when the heartbeat thread wakes up it will acquire the
lock quickly, ensuring there is no heartbeat starvation when the driver
sends a lot of messages.
Also, when we are polling the broker the lock can be held for a long
time by the 'consume' method, so that method does the heartbeat work itself.
DocImpact: 2 new configuration options for Rabbit driver
Co-Authored-By: Oleksii Zamiatin <email address hidden>
Co-Authored-By: Ilya Pekelny <email address hidden>
Related-Bug: #1371723
Closes-Bug: #856764
Change-Id: I1d3a635f3853bc
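Schematically, the mechanism this commit describes looks something like the sketch below (illustrative names, not the actual oslo.messaging code):

import threading

class HeartbeatingConnection(object):
    """Sketch: a background thread periodically runs kombu's
    heartbeat_check(), while a lock serializes access to the
    connection between the heartbeat thread and the driver."""

    def __init__(self, kombu_connection):
        self._conn = kombu_connection
        self._lock = threading.Lock()  # the real code prioritizes the heartbeat thread
        self._stop = threading.Event()
        self._interval = kombu_connection.heartbeat / 2.0
        self._thread = threading.Thread(target=self._loop)
        self._thread.daemon = True
        self._thread.start()

    def _loop(self):
        while not self._stop.wait(self._interval):
            with self._lock:
                self._conn.heartbeat_check()  # exchange heartbeat frames

    def publish(self, producer, msg, **kwargs):
        with self._lock:  # driver operations share the same lock
            producer.publish(msg, **kwargs)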
Changed in oslo.messaging: | |
status: | In Progress → Fix Committed |
OSCI Robot (oscirobot) wrote : | #86 |
RPM package oslo.messaging has been built for project openstack/
Package version == 1.8.0, package release == fuel6.1.
Changeset: https:/
project: openstack/
branch: master
author: Pekelny Ilya
committer: openstack-
subject: rabbit: heartbeat implementation
status: patchset-created
Files placed on repository:
python-
python-
NOTE: Changeset is not merged, created temporary package repository.
RPM repository URL: http://
OSCI Robot (oscirobot) wrote : | #87 |
RPM package oslo.messaging has been built for project openstack/
Package version == 1.8.0, package release == fuel6.1.mira10
Changeset: https:/
project: openstack/
branch: master
author: Pekelny Ilya
committer: openstack-
subject: rabbit: heartbeat implementation
status: change-merged
Files placed on repository:
python-
python-
Changeset merged. Package placed on primary repository
RPM repository URL: http://
OSCI Robot (oscirobot) wrote : | #88 |
DEB package oslo.messaging has been built for project openstack/
Package version == 1.8.0, package release == fuel6.1~mira10
Changeset: https:/
project: openstack/
branch: master
author: Pekelny Ilya
committer: openstack-
subject: rabbit: heartbeat implementation
status: change-merged
Files placed on repository:
python-
Changeset merged. Package placed on primary repository
DEB repository URL: http://
OSCI Robot (oscirobot) wrote : | #89 |
DEB package oslo.messaging has been built for project openstack/
Package version == 1.8.0, package release == fuel6.1~
Changeset: https:/
project: openstack/
branch: master
author: Pekelny Ilya
committer: openstack-
subject: rabbit: heartbeat implementation
status: patchset-created
Files placed on repository:
python-
NOTE: Changeset is not merged, created temporary package repository.
DEB repository URL: http://
Changed in heat: | |
assignee: | Deliang Fan (vanderliang) → nobody |
Fix proposed to branch: master
Review: https:/
Change abandoned by Mehdi Abaakouk (<email address hidden>) on branch: master
Review: https:/
Fix proposed to branch: stable/kilo
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/kilo
commit 64bdd80c5fe4d53
Author: Mehdi Abaakouk <email address hidden>
Date: Wed Jan 21 09:13:10 2015 +0100
rabbit: heartbeat implementation
AMQP offers a heartbeat feature to ensure that the application layer
promptly finds out about disrupted connections (and also completely
unresponsive peers). If the client requests heartbeats on connection, the rabbit
server will regularly send messages to each connection with the expectation of
a response.
To achieve this, each driver connection object spawns a thread that
sends/retrieves the heartbeat packets exchanged between the server and the
client.
To protect concurrent access to the kombu connection between the
driver and this thread, we use a lock that always prioritizes the
heartbeat thread. So when the heartbeat thread wakes up it will acquire the
lock quickly, ensuring there is no heartbeat starvation when the driver
sends a lot of messages.
Also, when we are polling the broker the lock can be held for a long
time by the 'consume' method, so that method does the heartbeat work itself.
DocImpact: 2 new configuration options for Rabbit driver
Co-Authored-By: Oleksii Zamiatin <email address hidden>
Co-Authored-By: Ilya Pekelny <email address hidden>
Related-Bug: #1371723
Closes-Bug: #856764
Change-Id: I1d3a635f3853bc
(cherry picked from commit b9e134d7e955b91
tags: | added: in-stable-kilo |
Changed in oslo.messaging: | |
milestone: | next-liberty → 1.8.1 |
status: | Fix Committed → Fix Released |
The attachment "impl_kombu.
[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]
tags: | added: patch |
Launchpad Janitor (janitor) wrote : | #95 |
This bug was fixed in the package oslo.messaging - 1.8.1-0ubuntu1
---------------
oslo.messaging (1.8.1-0ubuntu1) vivid; urgency=medium
* New upstream release for OpenStack Kilo, including enablement
of RabbitMQ heartbeating for improved connection failure detection
(LP: #856764):
- d/p/zmq-
d/
- d/p/zmq-
- d/p/disable-
trollius and aioeventlet executors for vivid release.
- d/control: Align minimum version requirements with upstream.
* d/pydist-overrides: Add overrides for new oslo package naming.
* Misc fixes for zmq driver:
- d/p/Fix-
Fix changing keys during iteration in matchmaker heartbeat
(LP: #1432966).
- d/p/Add-
for matchmaker drivers (LP: #1291701).
-- James Page <email address hidden> Mon, 30 Mar 2015 09:52:29 +0100
Changed in oslo.messaging (Ubuntu): | |
status: | New → Fix Released |
Any way to have this fixed in Juno too?
Alejandro Comisario (alejandro-f) wrote : | #97 |
+1, is there a way to apply / backport this fix to Juno?
Or maybe pip install -U oslo.messaging will do?
Tom Fifield (fifieldt) wrote : | #98 |
@Quentin, @Alejandro: Check out http://
Rongze Zhu (zrzhit) wrote : | #99 |
@Mehdi Abaakouk, @Alexei Kornienko: my patch adding keepalive options has been merged into pyamqp [1]. It is very useful for becoming aware that the connection has been terminated, raising a socket error exception.
We can add keepalive options to oslo.messaging [2] and pass them through to the kombu pyamqp transport, so that a consumer with an idle connection becomes aware the connection was terminated, catches the socket exception, and reconnects.
The TCP keepalive method is simpler than a heartbeat-checking thread. I have used this approach in multiple production environments for more than a year, and it is effective.
[1] https:/
[2] https:/
Rongze Zhu (zrzhit) wrote : | #100 |
tags: | removed: havana-backport-potential in-stable-icehouse in-stable-kilo |
no longer affects: | heat |
Changed in oslo.messaging (Ubuntu): | |
importance: | Undecided → High |
Sean McGinnis (sean-mcginnis) wrote : | #101 |
If I follow correctly, this no longer affects Cinder since it was implemented in oslo.messaging.
Changed in cinder: | |
status: | Confirmed → Invalid |
no longer affects: | neutron |
For solution 2 (heartbeat functionality) we need to use another AMQP client (for example pika); at the moment python-amqplib doesn't implement heartbeats.
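For example, with pika the client can request heartbeats at connection time (a sketch; the host is a placeholder, and the parameter was named heartbeat_interval in older pika releases):

import pika

params = pika.ConnectionParameters(host='rabbit-host',  # placeholder
                                   heartbeat=10)        # request 10s heartbeats
connection = pika.BlockingConnection(params)
channel = connection.channel()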