Pacemaker refused to recover RabbitMQ cluster

Bug #1591244 reported by Dmitry Mescheryakov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
Incomplete
High
Performance QA
Mitaka
Invalid
High
Performance QA
Newton
Incomplete
High
Performance QA

Bug Description

Version: 9.0

Steps to reproduce:
1. Run some Rally tests on scale

Results:
1. While tests were running, RabbitMQ nodes started to die one by one. Below is the timeline:
node-202 - 2016-06-07T13:01:14
node-203 - 2016-06-07T16:17:47
node-201 - 2016-06-08T11:47:13

Each time Pacemaker did not bring node back and we ended up with RabbitMQ cluster completely down. The situation resolved only after manual intervention. In pacemaker.log from node-1 one can see the following lines:

Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:1 cannot run anywhere
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:2 cannot run anywhere
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:0 cannot run anywhere

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
tags: added: area-oslo
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :
Revision history for this message
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
tags: added: scale
tags: added: 10.0-reviewed
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

We have made a lot of changes to OCF script / RabbitMQ in the last three month and I think that the current bug should not reproduce any more. Scale team, could you please see if it reproduces and if yes, provide an environment. If will be impossible to provide env, please collect the following logs:
 * from controllers:
   /var/log/pacemaker.log*
 * from master node:
   /var/log/remote/node-X/lrmd.log (for each controller)

Also, please execute the following commands on one of the controllers and attach results:
crm_mon -fotAW -1
cibadmin -Q
crm_failcount -N node-X.domain.tld -r p_rabbitmq-server # for each controller

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.