OpenStack Compute (nova)

Bug #856764
Comment #22

Nicolas Simonds (nicolas.simonds) wrote on 2014-02-26:

I'm not sure if this is germane to the original bug report, but this seems to be where the discussion about RabbitMQ failover is happening, so here's the current state of the art, as far as we can tell:

With the RabbitMQ configs described above (and RabbitMQ 3.2.2), failover works pretty seamlessly, and Kombu 2.5.x and newer handle the Consumer Cancel Notifications (CCNs) properly and promptly.

Where things get interesting is when you have a cluster of >2 RabbitMQ servers and mirrored queues enabled. We're seeing an odd phenomenon where, upon failover, a random subset of nova-compute nodes will "orphan" their topic and fanout queues, and never consume messages from them. They will still publish messages successfully, though, so commands like "nova service-list" will show the nodes as active, although for all intents and purposes, they're dead.
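One rough way to spot the orphaned queues is to look for queues with no consumers attached. The sketch below is not part of our deployment tooling; it assumes you have captured the output of "rabbitmqctl list_queues name consumers" (one queue per line, name and consumer count separated by a tab). The function name and the sample queue names are made up for illustration.

```python
def find_orphaned_queues(list_queues_output):
    """Return the names of queues that report zero consumers.

    Expects the text printed by `rabbitmqctl list_queues name consumers`:
    one queue per line, with the name and consumer count separated by a tab.
    """
    orphaned = []
    for line in list_queues_output.splitlines():
        parts = line.rsplit("\t", 1)
        if len(parts) != 2:
            # Skip banner lines such as "Listing queues ..." that have no tab.
            continue
        name, consumers = parts[0], parts[1].strip()
        if consumers.isdigit() and int(consumers) == 0:
            orphaned.append(name)
    return orphaned


# Illustrative sample only; real queue names will differ per deployment.
sample = (
    "Listing queues ...\n"
    "compute.node1\t1\n"
    "compute.node2\t0\n"
    "compute_fanout_abc\t0\n"
)
orphans = find_orphaned_queues(sample)
```

A queue with zero consumers is not necessarily orphaned (the service may simply be stopped), so treat the result as a list of candidates to cross-check against "nova service-list".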

We're not 100% sure why this is happening, but log analysis and observation lead us to (wildly) speculate that on failover with mirrored queues, RabbitMQ forces an election to determine a new master. If clients attempt to tear down and re-establish their queues before the election has concluded, they hit a race condition where their teardown requests get eaten and are never acknowledged by the server. The clients then just hang out forever waiting for those requests to complete, and never retry.

With Kombu 2.5.x, a restart of nova-compute is required to get them to reconnect, and the /usr/bin/nova-clear-rabbit-queues command must be run to clear out the "stale" fanout queues. With Kombu 3.x and newer, the situation is improved, and stopping RabbitMQ on all but one server will cause new CCNs to be generated, and the clients will cleanly migrate to the remaining server and begin working again.

This is still sub-wonderful because when the compute nodes "go dead", they can't receive messages on the bus, but Nova still thinks they're fine. As a dodge around this, we've added a config option to the conductor to introduce an artificial delay before Kombu responds to CCNs. The default value of 1.0 seconds seems to be more than enough time for RabbitMQ to get itself sorted out and avoid races, but users can turn it up (or down) as desired.
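To make the idea concrete, here is a minimal sketch of the delay, not the actual patch. It assumes some callback is invoked when a CCN arrives; handle_consumer_cancel, reconnect and the delay parameter are hypothetical names, not Kombu or Nova APIs.

```python
import time


def handle_consumer_cancel(reconnect, delay=1.0, sleep=time.sleep):
    """Wait `delay` seconds, then call `reconnect` and return its result.

    `reconnect` is a hypothetical callable that tears down and re-declares
    the consumer's queues. The pause is meant to give RabbitMQ time to
    finish promoting a new master before the client starts talking to it.
    `sleep` is injectable so the behaviour can be exercised without waiting.
    """
    if delay > 0:
        sleep(delay)
    return reconnect()


# Example run with a fake sleep, recording the order of operations.
events = []
result = handle_consumer_cancel(
    reconnect=lambda: events.append("reconnect") or "ok",
    delay=1.0,
    sleep=lambda seconds: events.append(("sleep", seconds)),
)
```

The fixed pause is a blunt instrument: it trades a slightly slower failover for not racing the election, which is why the value is left tunable.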
