kolla-ansible

Bug #1691759
Comment #1

Paul Bourke (pauldbourke) wrote on 2017-06-09:

Just spent a few hours looking at this in more detail. It seems that, at least for now, the migration path away from rabbitmq-clusterer is not straightforward. Unfortunately the plugin docs don't give much info on why the plugin is deprecated, other than linking to the existing RabbitMQ clustering docs.

According to the docs, the plugin was created to solve two issues:

1) declarative cluster configuration
2) arbitrary node restart

From what I can tell, RabbitMQ 3.6.7 seems to have taken steps to address 1), but 2) still seems to be lacking:

"When the entire cluster is brought down, the last node to go down must be the first node to be brought online. If this doesn't happen, the nodes will wait 30 seconds for the last disc node to come back online, and fail afterwards. If the last node to go offline cannot be brought back up, it can be removed from the cluster using the forget_cluster_node command - consult the rabbitmqctl manpage for more information." [0]

and,

"While not strictly necessary, it is a good idea to decide ahead of time which disc node will be the upgrader, stop that node last, and start it first. Otherwise changes to the cluster configuration that were made between the upgrader node stopping and the last node stopping will be lost." [1]

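This kind of coordination is non-trivial in Ansible, since the playbooks would need to track which node went down last and make sure it is started first.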
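
To make the ordering constraint from the quoted docs concrete, here is a rough sketch in Python (not Ansible, and not anything that exists in kolla-ansible). It assumes we have somehow recorded when each node was stopped; the names `start_order`, `recovery_plan`, and the `rabbit@...` node names are hypothetical. The only real command referenced is `rabbitmqctl forget_cluster_node`, which the clustering docs mention, used here with its `--offline` option for the case where the whole cluster is down.

```python
from typing import Dict, FrozenSet, List, NamedTuple


class Plan(NamedTuple):
    start: List[str]         # nodes to start, in order
    forget: List[List[str]]  # rabbitmqctl invocations to run before starting


def start_order(stop_times: Dict[str, float]) -> List[str]:
    """Order nodes so the last one stopped is the first one started."""
    return sorted(stop_times, key=lambda node: stop_times[node], reverse=True)


def recovery_plan(stop_times: Dict[str, float],
                  lost: FrozenSet[str] = frozenset()) -> Plan:
    """Build a start order, skipping nodes that cannot be brought back.

    Assumption: every lost node is simply forgotten. A real deployment
    may want to be more selective about this.
    """
    order = start_order(stop_times)
    survivors = [node for node in order if node not in lost]
    # Hypothetical usage: these would be run on the first surviving node
    # while it is still stopped, hence --offline.
    forget = [["rabbitmqctl", "forget_cluster_node", "--offline", node]
              for node in order if node in lost]
    return Plan(start=survivors, forget=forget)


if __name__ == "__main__":
    # Made-up stop timestamps (seconds); rabbit@b went down last.
    stops = {"rabbit@a": 10.0, "rabbit@b": 30.0, "rabbit@c": 20.0}
    print(start_order(stops))
    print(recovery_plan(stops, lost=frozenset({"rabbit@b"})))
```

The hard part in practice is not the sorting but reliably capturing the stop order across hosts in the first place, which is what rabbitmq-clusterer currently saves us from.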

It seems further steps are being taken to address this for the upcoming 3.7.0, where more complete auto clustering will be added to mimic that in rabbitmq-autocluster [2]. Given the importance of a stable RabbitMQ implementation for Kolla deployments, I suggest we stick with clusterer and reevaluate when 3.7.0 is released.

[0] http://www.rabbitmq.com/clustering.html
[1] http://www.rabbitmq.com/clustering.html#upgrading
[2] https://groups.google.com/forum/#!msg/rabbitmq-users/Jwh0y2PRSy4/Nb4v7OhfAwAJ