Mirantis OpenStack

Rabbitmq cluster is not recovered from split-brain by pacemaker

Bug #1559949 reported by Olga Klochkova on 2016-03-21

	Status	Importance	Assigned to	Milestone
Mirantis OpenStack	Status tracked in 10.0.x
10.0.x	Invalid	High	MOS Oslo	Mirantis OpenStack 10.0
7.0.x	Invalid	High	MOS Maintenance	Mirantis OpenStack 7.0-updates

Bug Description

The RabbitMQ cluster is not recovered from split-brain by Pacemaker.
Pacemaker reports the cluster as assembled and running even though 'rabbitmqctl cluster_status' shows different output on different controllers.
As a result, the environment is unusable.
Version: https://paste.mirantis.net/show/2000/
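
The disagreement described above can be checked mechanically. Below is a minimal sketch, assuming the Erlang-term output format that 'rabbitmqctl cluster_status' prints in RabbitMQ 3.x and assuming the outputs have already been collected from each controller. The function names and the sample outputs are hypothetical and are not taken from the attached logs.

```python
import re


def running_nodes(cluster_status_output):
    """Return the set of node names listed in the running_nodes tuple."""
    match = re.search(r"\{running_nodes,\[(.*?)\]\}", cluster_status_output, re.S)
    if match is None:
        # No running_nodes tuple found, e.g. rabbitmqctl failed on that node.
        return frozenset()
    return frozenset(re.findall(r"[\w.-]+@[\w.-]+", match.group(1)))


def is_split_brain(outputs_by_controller):
    """True when the controllers do not all report the same running nodes."""
    views = {running_nodes(out) for out in outputs_by_controller.values()}
    return len(views) > 1


# Made-up outputs for illustration only: node-16 sees itself alone,
# while node-17 and node-18 see each other.
SAMPLE_OUTPUTS = {
    "node-16": "[{nodes,[{disc,['rabbit@node-16','rabbit@node-17','rabbit@node-18']}]},\n"
               " {running_nodes,['rabbit@node-16']},\n {partitions,[]}]",
    "node-17": "[{nodes,[{disc,['rabbit@node-16','rabbit@node-17','rabbit@node-18']}]},\n"
               " {running_nodes,['rabbit@node-17','rabbit@node-18']},\n {partitions,[]}]",
    "node-18": "[{nodes,[{disc,['rabbit@node-16','rabbit@node-17','rabbit@node-18']}]},\n"
               " {running_nodes,['rabbit@node-18','rabbit@node-17']},\n {partitions,[]}]",
}

if __name__ == "__main__":
    print(is_split_brain(SAMPLE_OUTPUTS))
```

A controller whose output cannot be parsed yields an empty set, so it is also flagged as disagreeing with the others.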

Olga Klochkova (oklochkova) wrote on 2016-03-21:

node-16.tar.bz2 (1.0 MiB, application/x-tar)

Olga Klochkova (oklochkova) wrote on 2016-03-21:

node-17.tar.bz2 (587.7 KiB, application/x-tar)

Olga Klochkova (oklochkova) wrote on 2016-03-21:

node-18.tar.bz2 (3.2 MiB, application/x-tar)

Roman Podoliaka (rpodolyaka) wrote on 2016-03-21:

Oslo team, please take a look at this.

Changed in mos:
assignee:	nobody → MOS Oslo (mos-oslo)
importance:	Undecided → High
status:	New → Confirmed
tags:	added: area-oslo

Denis Meltsaykin (dmeltsaykin) wrote on 2016-03-23:

It looks like we're facing the very same issue in 7.0-swarm: https://patching-ci.infra.mirantis.net/view/7.0.swarm/job/7.0.system_test.ubuntu.ha_neutron_tun_scale/17/console

Denis Meltsaykin (dmeltsaykin) wrote on 2016-03-23:

I was able to get Pacemaker to restart the rabbitmq resource by changing the OCF script's get_monitor() to return OCF_ERR_GENERIC instead of OCF_NOT_RUNNING. It is still not clear to me whether this is a Pacemaker bug or expected behavior, and whether this change is harmful.

Denis Meltsaykin (dmeltsaykin) wrote on 2016-03-24:

According to [0][1], OCF_NOT_RUNNING has no Recovery Type attached. So I'm wondering why we're using this return code at all. I'm not very familiar with Pacemaker internals, but I have seen dozens of sources that clearly state that OCF_NOT_RUNNING is not intended to indicate a _failure_; for failures, OCF_ERR_* should be used.
Moreover, it is also said that using OCF_NOT_RUNNING outside of a monitor action is an error and should be avoided. But we're using this return code unconditionally everywhere. I'm sure we should rethink the return codes.

[0]: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-ocf-return-codes.html
[1]: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_how_are_ocf_return_codes_interpreted.html
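
For reference, the return codes under discussion can be laid out as a small lookup table. This is a sketch only: the numeric values are the standard OCF exit codes, while the recovery types are transcribed from memory of the table in [0] and should be checked against that page. The names OCF, RECOVERY_TYPE and recovery_type are hypothetical helpers, not part of Pacemaker or of the OCF script.

```python
from enum import IntEnum


class OCF(IntEnum):
    """Standard OCF resource agent exit codes."""
    SUCCESS = 0
    ERR_GENERIC = 1
    ERR_ARGS = 2
    ERR_UNIMPLEMENTED = 3
    ERR_PERM = 4
    ERR_INSTALLED = 5
    ERR_CONFIGURED = 6
    NOT_RUNNING = 7
    RUNNING_MASTER = 8
    FAILED_MASTER = 9


# Recovery type per return code, as recalled from [0] (an assumption to verify).
# None stands for "N/A": no recovery type is attached to the code.
RECOVERY_TYPE = {
    OCF.SUCCESS: "soft",
    OCF.ERR_GENERIC: "soft",
    OCF.ERR_ARGS: "hard",
    OCF.ERR_UNIMPLEMENTED: "hard",
    OCF.ERR_PERM: "hard",
    OCF.ERR_INSTALLED: "hard",
    OCF.ERR_CONFIGURED: "fatal",
    OCF.NOT_RUNNING: None,
    OCF.RUNNING_MASTER: "soft",
    OCF.FAILED_MASTER: "soft",
}


def recovery_type(rc):
    """Look up the recovery type for a numeric OCF return code."""
    return RECOVERY_TYPE[OCF(rc)]
```

With this table, recovery_type(7) gives None for OCF_NOT_RUNNING, while recovery_type(1) gives "soft" for OCF_ERR_GENERIC, which is the difference the question above is about.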

Dmitry Mescheryakov (dmitrymex) wrote on 2016-03-24:

@Denis, it is definitely OK to return OCF_NOT_RUNNING in monitor, as we do right now. If Pacemaker considers the resource to be active and the OCF script returns OCF_NOT_RUNNING, then Pacemaker must start the resource. For instance, kill RabbitMQ while no monitor op is running. The next monitor operation will return OCF_NOT_RUNNING and Pacemaker will restart RabbitMQ.

The problem here is that the lrmd daemon sends the return code of a monitor operation back to crmd (or pengine?) _only_ when it changes. If the first error sent is lost by Pacemaker, the resource is doomed to stay stuck in a broken state until the return code changes by some miracle.

For example, in that case the following would help as well:
* change OCF script to return OCF_SUCCESS instead of OCF_NOT_RUNNING
* wait for several monitor runs to succeed and then revert the changes
lrmd would then report OCF_NOT_RUNNING again, and this time Pacemaker would most probably restart the resource.
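
The reasoning above can be illustrated with a toy model. This is a sketch under the assumption stated in this comment, namely that a monitor result is forwarded only when it differs from the previous one. ChangeOnlyReporter and simulate are hypothetical names and do not model the real lrmd or crmd code.

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


class ChangeOnlyReporter:
    """Toy stand-in for lrmd: forwards a monitor result only when it changes."""

    def __init__(self):
        self.last_rc = None   # result of the previous monitor run
        self.delivered = []   # results that actually reached the upper layer

    def monitor(self, rc, lost=False):
        changed = rc != self.last_rc
        self.last_rc = rc
        # A changed result is sent once; if that one message is lost,
        # identical results on later runs are never sent again.
        if changed and not lost:
            self.delivered.append(rc)


def simulate(runs):
    """Feed (return_code, lost) pairs through the reporter, return what arrived."""
    reporter = ChangeOnlyReporter()
    for rc, lost in runs:
        reporter.monitor(rc, lost)
    return reporter.delivered


# The failure report is lost once; repeated failures are never re-sent.
stuck = simulate([
    (OCF_SUCCESS, False),
    (OCF_NOT_RUNNING, True),
    (OCF_NOT_RUNNING, False),
    (OCF_NOT_RUNNING, False),
])

# Workaround from the list above: report success for a few runs, then revert.
recovered = simulate([
    (OCF_SUCCESS, False),
    (OCF_NOT_RUNNING, True),
    (OCF_NOT_RUNNING, False),
    (OCF_SUCCESS, False),
    (OCF_SUCCESS, False),
    (OCF_NOT_RUNNING, False),
])
```

In the first run the upper layer never sees OCF_NOT_RUNNING, so nothing triggers a restart. In the second run the temporary switch to OCF_SUCCESS makes the next OCF_NOT_RUNNING count as a change, so it is delivered.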

Bug Checker Bot (bug-checker) wrote on 2016-03-28: Autochecker

(This check was performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags:	added: need-info

Denis Meltsaykin (dmeltsaykin) wrote on 2016-05-24:

#10

Dmitry, is there any progress on the bug? It seems like we need to patch pacemaker, right?

Bogdan Dobrelya (bogdando) wrote on 2016-05-24:

#11

Please backport, the issue was fixed in master

Changed in mos:
status:	Confirmed → Invalid

Bogdan Dobrelya (bogdando) wrote on 2016-05-24:

#12

Well, the ISO is old and does not contain backported fixes, see
fuel-library9.0-9.0.0-1.mos8032.noarch
* Thu Mar 03 2016 Jenkins <email address hidden>
055b235 Merge "Refactor to pcmk_ resources"