Fuel for OpenStack

Pacemaker refused to recover RabbitMQ cluster

Bug #1591244 reported by Dmitry Mescheryakov on 2016-06-10

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Fuel for OpenStack	Incomplete	High	Performance QA	Fuel for OpenStack 10.0
Mitaka	Invalid	High	Performance QA	Fuel for OpenStack 9.1
Newton	Incomplete	High	Performance QA	Fuel for OpenStack 10.0

Bug Description

Version: 9.0

Steps to reproduce:
1. Run some Rally tests at scale

Results:
1. While tests were running, RabbitMQ nodes started to die one by one. Below is the timeline:
node-202 - 2016-06-07T13:01:14
node-203 - 2016-06-07T16:17:47
node-201 - 2016-06-08T11:47:13

Each time, Pacemaker did not bring the node back, and we ended up with the RabbitMQ cluster completely down. The situation was resolved only after manual intervention. In pacemaker.log from node-201 one can see the following lines:

Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:1 cannot run anywhere
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:2 cannot run anywhere
Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:0 cannot run anywhere
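To find such entries in a large pacemaker.log, a small filter along the following lines may help. This is only a sketch: the function name unplaceable_resources and the regular expression are illustrative assumptions based on the three log lines quoted above, not part of Pacemaker or Fuel, and other Pacemaker versions may format these lines differently.

```python
import re

# Assumed line layout, taken from the lines quoted in this report:
# "<Mon DD HH:MM:SS> [pid] <host> pengine: info: native_color: Resource <name> cannot run anywhere"
PATTERN = re.compile(
    r"^(?P<ts>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\[\d+\]\s+(?P<host>\S+)\s+"
    r"pengine:\s+info:\s+native_color:\s+Resource\s+(?P<resource>\S+)\s+cannot run anywhere"
)


def unplaceable_resources(lines):
    """Return (timestamp, host, resource) for every 'cannot run anywhere' line."""
    found = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match:
            found.append((match.group("ts"), match.group("host"), match.group("resource")))
    return found


# The three lines quoted in the bug description, used as sample input.
SAMPLE = [
    "Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:1 cannot run anywhere",
    "Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:2 cannot run anywhere",
    "Jun 08 11:47:13 [39916] node-201.domain.tld pengine: info: native_color: Resource p_rabbitmq-server:0 cannot run anywhere",
]

if __name__ == "__main__":
    for ts, host, resource in unplaceable_resources(SAMPLE):
        print(ts, host, resource)
```

On a controller, the same function could be fed the lines of /var/log/pacemaker.log instead of SAMPLE.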

Dmitry Mescheryakov (dmitrymex) wrote on 2016-06-10:

node-201-pacemaker-logs.tgz (20.6 MiB, application/x-tar)

Dmitry Mescheryakov (dmitrymex) wrote on 2016-06-10:

node-202-pacemaker-logs.tgz (19.6 MiB, application/x-tar)

Dmitry Mescheryakov (dmitrymex) wrote on 2016-06-10:

node-203-pacemaker-logs.tgz (10.3 MiB, application/x-tar)

Dmitry Mescheryakov (dmitrymex) wrote on 2016-06-10:

controller-pacemaker-logs.tgz (44.3 MiB, application/x-tar)

tags:

added: area-oslo

Dmitry Mescheryakov (dmitrymex) wrote on 2016-06-10:

rabbit-logs.tgz (126.5 MiB, application/x-tar)

Bug Checker Bot (bug-checker) wrote on 2016-06-10: Autochecker

(This check was performed automatically.)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags:

added: need-info

Michael Semenov (msemenov) on 2016-06-14

tags:

added: scale

Dmitry Mescheryakov (dmitrymex) on 2016-07-11

tags:

added: 10.0-reviewed

Dmitry Mescheryakov (dmitrymex) wrote on 2016-09-14:

We have made a lot of changes to the OCF script / RabbitMQ in the last three months, and I think that the current bug should not reproduce any more. Scale team, could you please check whether it reproduces and, if so, provide an environment? If it is impossible to provide an environment, please collect the following logs:
* from controllers:
/var/log/pacemaker.log*
* from master node:
/var/log/remote/node-X/lrmd.log (for each controller)

Also, please execute the following commands on one of the controllers and attach the results:
crm_mon -fotAW -1
cibadmin -Q
crm_failcount -N node-X.domain.tld -r p_rabbitmq-server # for each controller
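Since crm_failcount has to be repeated for each controller, the command list above could be generated and run with a short helper such as the sketch below. The names diagnostic_commands and run_all are hypothetical, and the controller hostnames in the usage example are an assumption based on the node names mentioned in this report; the commands and their flags are exactly the ones listed above.

```python
import subprocess

# Pacemaker resource named in the crm_failcount command above.
RESOURCE = "p_rabbitmq-server"


def diagnostic_commands(controllers):
    """Build the argv lists for the requested diagnostics.

    `controllers` is a list of controller hostnames (node-X.domain.tld).
    """
    commands = [
        ["crm_mon", "-fotAW", "-1"],
        ["cibadmin", "-Q"],
    ]
    for node in controllers:
        commands.append(["crm_failcount", "-N", node, "-r", RESOURCE])
    return commands


def run_all(commands):
    """Run each command and return a dict of command string -> combined output.

    Must be run on a controller where the Pacemaker CLI tools are installed.
    """
    results = {}
    for argv in commands:
        proc = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            check=False,  # keep collecting even if one command fails
        )
        results[" ".join(argv)] = proc.stdout
    return results


if __name__ == "__main__":
    # Assumed hostnames; replace with the actual controllers of the environment.
    nodes = ["node-201.domain.tld", "node-202.domain.tld", "node-203.domain.tld"]
    for command, output in run_all(diagnostic_commands(nodes)).items():
        print("### " + command)
        print(output)
```

The printed output can be redirected to a file and attached to the bug together with the logs.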
