RabbitMQ cluster contains offline node after failover
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Fuel for OpenStack | Invalid | High | Fuel Library (Deprecated) | |
| 5.1.x | Won't Fix | High | Denis Meltsaykin | |
| 6.0.x | Won't Fix | High | Denis Meltsaykin | |
| 6.1.x | Fix Released | High | Bogdan Dobrelya | |
| 7.0.x | Won't Fix | High | Fuel Library (Deprecated) | |
Bug Description
Fuel version info (6.1 build #478): http://
After shutting down the controller node that is the master for a 3-node RabbitMQ cluster, one of the 2 remaining controllers does not kick the offline node out of the cluster, which leads to endless RabbitMQ server restarts by Pacemaker:
<30>Jun 3 09:52:48 node-15 lrmd: INFO: p_rabbitmq-server: su_rabbit_cmd(): the invoked command exited 2: /usr/sbin/
<27>Jun 3 09:52:48 node-15 lrmd: ERROR: p_rabbitmq-server: join_to_cluster(): Can't join to cluster by node 'rabbit@node-5'. Stopping.
<30>Jun 3 09:52:48 node-15 lrmd: INFO: p_rabbitmq-server: stop: action begin.
...
<30>Jun 3 09:53:00 node-15 lrmd: INFO: p_rabbitmq-server: notify: post-start end.
<28>Jun 3 09:53:00 node-15 lrmd: WARNING: p_rabbitmq-server: notify: Failed to join the cluster on post-start. The resource will be restarted.
...
Jun 03 09:53:00 [18700] node-15.
are present."} ]
Jun 03 09:53:00 [18700] node-15.
<29>Jun 3 09:53:00 node-15 lrmd[18700]: notice: operation_finished: p_rabbitmq-
Jun 03 09:53:00 [18703] node-15.
Jun 03 09:53:00 [18703] node-15.
<29>Jun 3 09:53:00 node-15 crmd[18703]: notice: process_lrm_event: Operation p_rabbitmq-
Steps to reproduce:
1. Deploy environment: CentOS, NovaVlan, Ceph, Classic Provisioning
2. Destroy primary controller
3. Check rabbitmq cluster status on controllers.
Expected result:
- rabbitmq cluster is re-assembled, offline controller is removed from it
Actual result:
- rabbitmq cluster on 1 of the controllers still contains the offline node
I reproduced this bug on a bare-metal lab. On an Ubuntu environment (same hardware and ISO) I did not observe this issue.
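As a manual workaround for the actual result above, the stale member can be inspected and evicted from any healthy controller. A minimal sketch, assuming `rabbitmqctl` is on PATH and using `rabbit@node-5` (the node name from the logs above) only as an example of the offline member:

```shell
# Sketch: evict a stale RabbitMQ cluster member from a live controller.
# The node name passed in is an example; substitute the controller that
# actually went down.
forget_offline_node() {
    dead="$1"
    # cluster_status prints both the configured node list and the
    # currently running nodes; a member present in the former but not
    # the latter is the stale entry to remove.
    rabbitmqctl cluster_status
    # forget_cluster_node succeeds only while the target node is down.
    rabbitmqctl forget_cluster_node "$dead"
}
```

Both `cluster_status` and `forget_cluster_node` are standard `rabbitmqctl` subcommands; this does not replace the automatic re-assembly that pacemaker/OCF is supposed to perform.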
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
importance: Undecided → High
On the master (node-1) failover, the node unjoin from the cluster failed at node-15 due to a race between the rabbit-fence daemon and the rabbitmq stop_app logic in the OCF script. As a result, joining the cluster fails in a start/stop loop, with no reset attempts at all.
The solution is to:
a) not stop the rabbit app locally in the OCF logic when it can see that the rabbit-fence daemon is trying to kick some node out of the cluster and assumes the rabbit app is running locally;
b) introduce an additional reset action if joining the cluster has failed.
While (b) should be enough to handle this situation and let the failed node join the cluster after a mnesia reset, the complete solution should include (a) as well.
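Point (b) can be sketched as a retry wrapper around the join: if the first `join_cluster` fails (e.g. because of stale mnesia cluster state referencing the dead master), wipe the local state with `rabbitmqctl reset` and try once more. This is an illustrative sketch under those assumptions, not the actual fuel-library OCF patch; the function name is hypothetical:

```shell
# Illustrative sketch of the "reset on failed join" fallback; the real
# fix lives in the fuel-library OCF resource agent.
join_with_reset_fallback() {
    master="$1"
    # join_cluster (and reset) require the local rabbit app to be stopped.
    rabbitmqctl stop_app
    if ! rabbitmqctl join_cluster "$master"; then
        # Join failed: wipe the local mnesia state and retry once.
        rabbitmqctl reset
        rabbitmqctl join_cluster "$master" || return 1
    fi
    rabbitmqctl start_app
}
```

`stop_app`, `join_cluster`, `reset`, and `start_app` are all standard `rabbitmqctl` subcommands; the value added by (b) is the single reset-and-retry path instead of the endless start/stop loop seen in the logs.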