Comment 12 for bug 856764

Kevin Bringard (kbringard) wrote :

I spoke with MarkMc about this in #openstack-dev, but another thing I've discovered:

I should start by saying I'm in no way an AMQP or rabbit expert. This is just based on a lot of googling, testing in my environment, and trial and error. If I say something which doesn't make sense, it's quite possible it doesn't :-D

In rabbit, when master promotion occurs, a slave queue will kick off all of its consumers but not kill the connection (http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2012-January/017341.html). An almost identical issue was brought up on the springsource client forums here: http://forum.springsource.org/archive/index.php/t-121480.html.

While the AMQP libraries support connection disruption handling, they don't appear to handle channel disruption or consumer cancel notifications. The end result is that when a master promotion occurs in rabbit, the OpenStack services will continue to consume from a queue whose channel has been closed.

Once you get all your consumers to re-establish their channels, messages begin flowing again. Until then, though, a single node failure can cause the majority of (or even all) messages to stop flowing to OpenStack services, and they stay stopped until you force the consumers to re-establish, either by bouncing all rabbit nodes with attached/hung consumers or by restarting individual OpenStack services.

You can reproduce the effects like so:

* Determine the master for any given queue.
** I generally do this by running watch "rabbitmqctl list_queues -p /nova name slave_pids synchronised_slave_pids messages messages_unacknowledged consumers | grep -v fanout" and looking for the node in the cluster which is not a slave (inherently making it the master)
* Stop rabbit on the master node
* Watch the consumers column. It should mostly drop to 0, and busy queues (such as q-plugin) will likely begin backing up
* Pick a service (quantum-server works well, as it will drain q-plugin) and validate which rabbit node it is connected to (netstat, grepping the logs of the service, or rabbitmqctl list_connections name should find it pretty easily)
* Restart said service or the rabbit broker it is connected to
* Once it restarts and/or determines the connection has been lost, the connection will be re-established
* Go back to your watch command, and you should now see the new subscriber on its specific queue
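
If eyeballing the watch output gets tedious, the check in the steps above can be roughly scripted. This is only a sketch: it assumes rabbitmqctl prints one tab-separated row per queue in the column order given on the command line, the helper names (parse_list_queues, stalled_queues, run_list_queues) are made up for illustration, and the SAMPLE text is invented rather than captured from a real cluster.

```python
import subprocess

# Same columns as the watch command above, in the same order.
COLUMNS = ["name", "slave_pids", "synchronised_slave_pids",
           "messages", "messages_unacknowledged", "consumers"]


def run_list_queues(vhost="/nova"):
    """Run rabbitmqctl on a broker node and return its raw text output."""
    cmd = ["rabbitmqctl", "list_queues", "-p", vhost] + COLUMNS
    return subprocess.check_output(cmd, text=True)


def parse_list_queues(output, columns=COLUMNS):
    """Turn tab-separated rabbitmqctl output into a list of dicts."""
    rows = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != len(columns):
            continue  # banner lines such as "Listing queues ..."
        rows.append(dict(zip(columns, fields)))
    return rows


def stalled_queues(rows):
    """Return (name, messages) for non-fanout queues with no consumers."""
    stalled = []
    for row in rows:
        if "fanout" in row["name"]:
            continue  # mirrors the "grep -v fanout" in the watch command
        try:
            consumers = int(row["consumers"])
            messages = int(row["messages"])
        except ValueError:
            continue  # not a data row after all
        if consumers == 0:
            stalled.append((row["name"], messages))
    return stalled


# Invented sample output, only to show the expected shape of the data.
SAMPLE = (
    "Listing queues ...\n"
    "q-plugin\t[<rabbit@node2.1.234.0>]\t[<rabbit@node2.1.234.0>]\t42\t0\t0\n"
    "scheduler\t[]\t[]\t0\t0\t1\n"
    "scheduler_fanout_abc\t[]\t[]\t0\t0\t0\n"
    "...done.\n"
)

if __name__ == "__main__":
    for name, messages in stalled_queues(parse_list_queues(SAMPLE)):
        print("%s has no consumers and %d messages queued" % (name, messages))
```

On a live node you would pass run_list_queues() instead of SAMPLE. A queue that shows up here after the master has been stopped, and stays there, is one whose consumers were kicked off and never came back.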

I'm adding notes here because I'm not sure if the heartbeat implementation works at the channel level, or if we need to implement consumer cancel notification support (https://lists.launchpad.net/openstack/msg15111.html).
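
To make the consumer cancel notification idea a bit more concrete, here is a rough, library-agnostic sketch of what handling it could look like. As I understand it, rabbit sends basic.cancel to clients that advertise the consumer_cancel_notify capability; everything else below (ResubscribingConsumer, FakeBroker, the subscribe callable and its signature) is hypothetical and is not the API of kombu, amqplib, or any OpenStack RPC code.

```python
class ResubscribingConsumer:
    """Hypothetical wrapper that re-subscribes when the broker cancels it."""

    def __init__(self, subscribe, queue, max_attempts=3):
        # subscribe is an assumed callable: (queue, on_cancel) -> consumer_tag
        self._subscribe = subscribe
        self.queue = queue
        self.max_attempts = max_attempts
        self.tag = None
        self.resubscribes = 0

    def start(self):
        """Subscribe to the queue and remember the consumer tag."""
        self.tag = self._subscribe(self.queue, self.on_cancel)
        return self.tag

    def on_cancel(self, consumer_tag):
        """Called when the broker cancels a consumer (e.g. master promotion)."""
        if consumer_tag != self.tag:
            return  # stale notification for an older subscription
        self.tag = None
        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.start()
                self.resubscribes += 1
                return
            except ConnectionError as exc:
                last_error = exc
        raise RuntimeError("could not re-subscribe to %s" % self.queue) from last_error


class FakeBroker:
    """Stand-in for a broker, used only to exercise the sketch."""

    def __init__(self):
        self.consumers = {}  # consumer_tag -> (queue, on_cancel)
        self._next = 0

    def subscribe(self, queue, on_cancel):
        self._next += 1
        tag = "ctag-%d" % self._next
        self.consumers[tag] = (queue, on_cancel)
        return tag

    def promote_new_master(self):
        """Drop every consumer and notify it, leaving connections alone."""
        dropped, self.consumers = self.consumers, {}
        for tag, (_queue, on_cancel) in dropped.items():
            on_cancel(tag)


if __name__ == "__main__":
    broker = FakeBroker()
    consumer = ResubscribingConsumer(broker.subscribe, "q-plugin")
    consumer.start()
    broker.promote_new_master()
    print(consumer.tag, consumer.resubscribes)
```

The point of the sketch is only that the recovery is driven by the cancel notification itself, so the service does not have to wait for the connection to drop (which, as described above, it never does in this failure mode).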

Regardless, unless master promotion in rabbit is handled properly, using HA queues is a moot exercise, as losing a single node can cause all messages to stop flowing. Given the heavy reliance on the message queue, I think we need to be especially careful about how we handle this and make it as solid as possible.