Comment 36 for bug 1399272

Alexey Khivin (akhivin) wrote :

After a little investigation, I think two different issues were described in this discussion.
The first issue is a RabbitMQ cluster failure.
The second issue is a Galera cluster failure.

In our investigation we saw both of them as consequences of the actions described above.

Rebooting the primary controller and killing haproxy can be quite enough to break the Galera cluster, but I do not think we can break RabbitMQ by killing haproxy. There is some suspicion that we can break the Murano RabbitMQ instance, but that needs to be investigated in more detail and is not the subject of this discussion.

At the same time, when we see the following:
root@node-18:~# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-18' ...
[{nodes,[{disc,['rabbit@node-17','rabbit@node-18']}]},
 {running_nodes,['rabbit@node-17','rabbit@node-18']},
 {cluster_name,<<"rabbit@node-17">>},
 {partitions,[]}]
...done.

it means that one of the RabbitMQ instances has left the cluster and will never come back automatically
(I will write about this in more detail a little later).

If one of the RabbitMQ instances leaves the cluster and halts, the system keeps working. But if an instance leaves the cluster and comes back as a standalone instance (we observed exactly this), the whole cloud might be broken.
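As a rough sketch (not a definitive procedure; node names are taken from the output above, and I am assuming the stray node is node-18), such a detached node can usually be rejoined manually along these lines. Note that the reset step wipes that node's local RabbitMQ state:

root@node-18:~# rabbitmqctl stop_app
root@node-18:~# rabbitmqctl reset
root@node-18:~# rabbitmqctl join_cluster rabbit@node-17
root@node-18:~# rabbitmqctl start_app
root@node-18:~# rabbitmqctl cluster_status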

On the other hand, to say it again: killing haproxy and rebooting a controller can break the Galera cluster. So when you post issues that involve killing haproxy, please also check the Galera cluster status. There are no mysql or galera logs in the snapshot, so I cannot check it myself.
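For example (assuming the mysql client can connect locally on a controller), the Galera state can be checked with something like:

root@node-17:~# mysql -e "SHOW STATUS LIKE 'wsrep_cluster_size'; SHOW STATUS LIKE 'wsrep_cluster_status'; SHOW STATUS LIKE 'wsrep_local_state_comment';"

A healthy cluster should report wsrep_cluster_status = Primary, wsrep_local_state_comment = Synced, and wsrep_cluster_size equal to the number of controllers.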

So my suggestion is to discuss only the RabbitMQ cluster failure in this particular ticket. If someone wants to report a Galera cluster failure or an oslo.db error, it would be better to open a new ticket with galera and mysql logs and a description of the Galera cluster status.

I will now try to reproduce the RabbitMQ cluster failure.