RabbitMQ cluster contains offline node after failover

Bug #1461509 reported by Artem Panchenko
Affects             Status        Importance  Assigned to                Milestone
Fuel for OpenStack  Invalid       High        Fuel Library (Deprecated)
5.1.x               Won't Fix     High        Denis Meltsaykin
6.0.x               Won't Fix     High        Denis Meltsaykin
6.1.x               Fix Released  High        Bogdan Dobrelya
7.0.x               Won't Fix     High        Fuel Library (Deprecated)

Bug Description

Fuel version info (6.1 build #478): http://paste.openstack.org/show/256472/

After shutting down the controller node that is the master of a 3-node RabbitMQ cluster, one of the two remaining controllers does not kick the offline node out of the cluster, which leads to endless RabbitMQ server restarts by Pacemaker:

<30>Jun 3 09:52:48 node-15 lrmd: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 2: /usr/sbin/rabbitmqctl join_cluster rabbit@node-5
<27>Jun 3 09:52:48 node-15 lrmd: ERROR: p_rabbitmq-server: join_to_cluster(): Can't join to cluster by node 'rabbit@node-5'. Stopping.
<30>Jun 3 09:52:48 node-15 lrmd: INFO: p_rabbitmq-server: stop: action begin.
...
<30>Jun 3 09:53:00 node-15 lrmd: INFO: p_rabbitmq-server: notify: post-start end.
<28>Jun 3 09:53:00 node-15 lrmd: WARNING: p_rabbitmq-server: notify: Failed to join the cluster on post-start. The resource will be restarted.
...
Jun 03 09:53:00 [18700] node-15.mirantis.com lrmd: notice: operation_finished: p_rabbitmq-server_notify_0:33394:stderr [ Error: {no_running_cluster_nodes,"You cannot
 are present."} ]
Jun 03 09:53:00 [18700] node-15.mirantis.com lrmd: info: log_finished: finished - rsc:p_rabbitmq-server action:notify call_id:282 pid:33394 exit-code:7 exec-
<29>Jun 3 09:53:00 node-15 lrmd[18700]: notice: operation_finished: p_rabbitmq-server_notify_0:33394:stderr [ Error: {no_running_cluster_nodes,"You cannot leave a cluster
Jun 03 09:53:00 [18703] node-15.mirantis.com crmd: info: match_graph_event: Action p_rabbitmq-server_notify_0 (133) confirmed on node-15.mirantis.com (rc=0)
Jun 03 09:53:00 [18703] node-15.mirantis.com crmd: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-15.mirantis.com, call=282, rc=0, c
<29>Jun 3 09:53:00 node-15 crmd[18703]: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-15.mirantis.com, call=282, rc=0, cib-update=0, confi

Steps to reproduce:

1. Deploy environment: CentOS, NovaVlan, Ceph, Classic Provisioning
2. Destroy primary controller
3. Check rabbitmq cluster status on controllers.

Expected result:

- rabbitmq cluster is re-assembled, offline controller is removed from it

Actual:

- rabbitmq cluster on 1 of controllers contains offline node

I reproduced this bug in a bare-metal lab. In an Ubuntu environment (same hardware and ISO) I did not observe the issue.
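
The check in step 3 could be automated along the following lines. This is a minimal sketch, not part of the original report: it assumes the Erlang-term output that `rabbitmqctl cluster_status` prints on RabbitMQ 3.x, with `{nodes,...}` appearing before `{running_nodes,...}`, and the `SAMPLE` text and the `offline_nodes` helper are hypothetical illustrations rather than data captured from the affected environment.

```python
import re
from typing import Set

# Matches Erlang node names such as rabbit@node-15 (surrounding quotes excluded).
NODE_RE = re.compile(r"[\w.-]+@[\w.-]+")


def offline_nodes(status_text: str) -> Set[str]:
    """Return cluster members that are listed in {nodes,...} but not running.

    Assumes {nodes,...} is printed before {running_nodes,...}.
    """
    nodes_at = status_text.find("{nodes,")
    running_at = status_text.find("{running_nodes,")
    if nodes_at == -1 or running_at == -1:
        raise ValueError("unexpected cluster_status output")

    # Everything between the two terms belongs to the member list.
    members = set(NODE_RE.findall(status_text[nodes_at:running_at]))

    # The running list is flat, so it ends at the first closing bracket.
    end = status_text.find("]", running_at)
    if end == -1:
        end = len(status_text)
    running = set(NODE_RE.findall(status_text[running_at:end]))

    return members - running


# Hypothetical output for illustration only; not taken from the bug report.
SAMPLE = """Cluster status of node 'rabbit@node-15' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-15','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-5','rabbit@node-15']},
 {partitions,[]}]"""

stale = offline_nodes(SAMPLE)
```

A non-empty result on any surviving controller would correspond to the "Actual" behaviour described above, where the cluster still contains the offline node.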

Tags: ha rabbitmq
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
importance: Undecided → High
Artem Panchenko (apanchenko-8) wrote :
Bogdan Dobrelya (bogdando) wrote :

On failover of the master (node-1), removing the node from the cluster failed on node-15 because of a race between the rabbit-fence daemon and the rabbitmq stop_app logic in the OCF script. As a result, joining the cluster fails in a start/stop loop, with no reset attempts at all.

The solution is to:
a) not stop the rabbit app locally in the OCF logic when it can see that the rabbit-fence daemon is trying to kick a node out of the cluster, because the daemon assumes the rabbit app is running locally;
b) introduce an additional reset action if joining the cluster has failed.

While (b) should be enough to handle this situation and let the failed node join the cluster after an mnesia reset, the complete solution should include (a) as well.
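
The control flow of option (b) can be sketched roughly as below. This is not the actual OCF resource agent code (which is a shell script); it is a hedged Python illustration that uses only the standard `rabbitmqctl` subcommands `stop_app`, `join_cluster`, `start_app`, `reset` and `force_reset`. The function names and the injectable `run` parameter are hypothetical, introduced so the logic can be exercised without a broker.

```python
import subprocess
from typing import Callable, Sequence

# Hypothetical runner type: takes rabbitmqctl arguments, returns the exit code.
Runner = Callable[[Sequence[str]], int]


def run_rabbitmqctl(args: Sequence[str]) -> int:
    """Run rabbitmqctl with the given arguments and return its exit code."""
    return subprocess.call(["rabbitmqctl", *args])


def join_cluster_with_reset(master: str, run: Runner = run_rabbitmqctl) -> bool:
    """Try to join `master`; if that fails, reset local state and retry once."""

    def attempt() -> bool:
        # join_cluster requires the rabbit app to be stopped first.
        run(["stop_app"])
        if run(["join_cluster", master]) != 0:
            return False
        return run(["start_app"]) == 0

    if attempt():
        return True

    # Join failed: wipe the local mnesia state instead of looping forever.
    # `reset` may fail when no other cluster node is reachable, so fall
    # back to `force_reset`, which resets the node unconditionally.
    if run(["reset"]) != 0:
        run(["force_reset"])
    return attempt()
```

On a slave node this would be called as `join_cluster_with_reset("rabbit@node-5")`. The node that is starting as master is left out of this path, so its mnesia data stays intact, in line with the note in the commit message for the fix.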

Changed in fuel:
status: New → Triaged
Bogdan Dobrelya (bogdando) wrote :

Won't fix for releases earlier than 6.1, as they have no rabbit-fence daemon feature.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/187959

Changed in fuel:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/187959
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=752daa6cd569c780e91e11d8707badda8e7e72fd
Submitter: Jenkins
Branch: master

commit 752daa6cd569c780e91e11d8707badda8e7e72fd
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Jun 3 13:36:36 2015 +0200

    Erase mnesia if a rabbit node cannot join the cluster

    W/o this fix, the situation is possible when a
    rabbit node would stuck in a start/stop loop failing
    to join the cluster with an error:
    "no_running_cluster_nodes, You cannot leave a cluster
    if no online nodes are present."

    This is an issue because the rabbit node should always
    be able to join the cluster, if it was ordered to start
    by pacemaker RA.

    The solution is to force the mnesia reset, if the
    rabbit node cannot join the cluster on post-start
    notify. Note, that for the master starting, the node
    wouldn't be reset. So, the mnesia will be kept intact
    at least on the resource master.

    Partial-bug: #1461509

    Change-Id: I69bc13266a1dc784681b2677ae5616bfc28cf54f
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Bogdan Dobrelya (bogdando) wrote :

The fix for "b" closes this bug; the fix for "a" would be nice to have for 7.0 as well.

Changed in fuel:
status: In Progress → Fix Committed
Oleksiy Molchanov (omolchanov) wrote :

Verified. Rabbitmq cluster is re-assembled, offline controller is removed from it.

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "521"
  build_id: "2015-06-08_06-13-27"
  nailgun_sha: "4340d55c19029394cd5610b0e0f56d6cb8cb661b"
  python-fuelclient_sha: "4fc55db0265bbf39c369df398b9dc7d6469ba13b"
  astute_sha: "7766818f079881e2dbeedb34e1f67e517ed7d479"
  fuel-library_sha: "f43c2ae1af3b493ee0e7810eab7bb7b50c986c7d"
  fuel-ostf_sha: "7c938648a246e0311d05e2372ff43ef1eb2e2761"
  fuelmain_sha: "bcc909ffc5dd5156ba54cae348b6a07c1b607b24"

Bogdan Dobrelya (bogdando) wrote :

Fix "b" https://review.openstack.org/187959 should also be backported.

Bogdan Dobrelya (bogdando) wrote :

The "nice to have" part is optional, so the issue may be considered closed.

Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered within the scope of a Maintenance Update. A possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide in the official product documentation.
