Rabbitmq cluster is not recovered from split-brain by pacemaker

Bug #1559949 reported by Olga Klochkova
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Mirantis OpenStack (status tracked in 10.0.x)
  10.0.x | Invalid | High | MOS Oslo |
  7.0.x | Invalid | High | MOS Maintenance |

Bug Description

The RabbitMQ cluster is not recovered from split-brain by Pacemaker.
Pacemaker reports the cluster as assembled and running even though 'rabbitmqctl cluster_status' shows different output on different controllers.
As a result, the environment is unusable.
Version: https://paste.mirantis.net/show/2000/
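
The symptom described above (controllers disagreeing about cluster membership) can be checked mechanically. A minimal sketch, assuming the set of running nodes reported by each controller has already been extracted from its `rabbitmqctl cluster_status` output; the node names and the helper functions are hypothetical and not part of the OCF script:

```python
def find_partitions(views):
    """Group nodes by the set of running nodes each one reports.

    `views` maps a node name to the set of nodes it believes are running.
    More than one group means the nodes disagree about membership.
    """
    groups = {}
    for node, running in views.items():
        groups.setdefault(frozenset(running), []).append(node)
    return groups


def is_split_brain(views):
    """Return True when the nodes do not all report the same membership."""
    return len(find_partitions(views)) > 1


# Hypothetical views, one entry per controller.
views = {
    "node-1": {"node-1", "node-2"},
    "node-2": {"node-1", "node-2"},
    "node-3": {"node-3"},
}
print(is_split_brain(views))
```

In this example node-3 sees only itself, so the views fall into two groups and the check reports a split.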

Olga Klochkova (oklochkova) wrote :
Olga Klochkova (oklochkova) wrote :
Olga Klochkova (oklochkova) wrote :
Roman Podoliaka (rpodolyaka) wrote :

Oslo team, please take a look at this.

Changed in mos:
assignee: nobody → MOS Oslo (mos-oslo)
importance: Undecided → High
status: New → Confirmed
tags: added: area-oslo
Denis Meltsaykin (dmeltsaykin) wrote :
Denis Meltsaykin (dmeltsaykin) wrote :

I was able to get Pacemaker to restart the rabbitmq resource by changing the OCF script's get_monitor() to return OCF_ERR_GENERIC instead of OCF_NOT_RUNNING. It is still not clear to me whether this is a Pacemaker bug or expected behavior, and whether this change is harmful.

Denis Meltsaykin (dmeltsaykin) wrote :

According to [0][1], OCF_NOT_RUNNING has no Recovery Type attached. So I'm wondering why we're using this return code at all. I'm not very confident with Pacemaker internals, but I have seen dozens of sources that clearly state that OCF_NOT_RUNNING is not intended to indicate a _failure_; for failures, OCF_ERR_* should be used.
Moreover, they also say that using OCF_NOT_RUNNING outside of a monitor action is an error and should be avoided. But we're using this return code unconditionally everywhere. I'm sure we should rethink our return codes.

[0]: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/s-ocf-return-codes.html
[1]: http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Explained/_how_are_ocf_return_codes_interpreted.html
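
For reference, the return codes discussed in this thread can be written down as a small lookup. This is a sketch covering only the three codes mentioned here; the numeric values are the standard OCF ones, and the recovery types follow the table in [0] as best recalled, so they are worth double-checking there. The `describe` helper is hypothetical.

```python
# Subset of OCF return codes mentioned in this thread: name -> (value, recovery type).
# Recovery types are taken from the table in [0]; None stands for "N/A".
OCF_CODES = {
    "OCF_SUCCESS": (0, "soft"),
    "OCF_ERR_GENERIC": (1, "soft"),
    "OCF_NOT_RUNNING": (7, None),
}


def describe(name):
    """Return a one-line summary of a return code from the table above."""
    value, recovery = OCF_CODES[name]
    if recovery is None:
        return f"{name} ({value}): no recovery type attached"
    return f"{name} ({value}): {recovery} recovery"


for code in OCF_CODES:
    print(describe(code))
```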

Dmitry Mescheryakov (dmitrymex) wrote :

@Denis, it is definitely OK to return OCF_NOT_RUNNING in monitor, as we do right now. If Pacemaker considers the resource to be active and the OCF script returns OCF_NOT_RUNNING, then Pacemaker must start the resource. For instance, kill RabbitMQ while no monitor op is running. The next monitor operation will return OCF_NOT_RUNNING and Pacemaker will restart RabbitMQ.

The problem here is that the lrmd daemon sends the return code of a monitor operation back to crmd (or pengine?) _only_ when it changes. If the first reported error is lost by Pacemaker, the resource is doomed to stay stuck in a broken state until the return code happens to change.

For example, in that case the following would help as well:
 * change the OCF script to return OCF_SUCCESS instead of OCF_NOT_RUNNING
 * wait for several monitor runs to succeed and then revert the changes
After the revert, lrmd would report OCF_NOT_RUNNING again, and this time Pacemaker would most probably restart the resource.
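
The behavior described in this comment can be illustrated with a toy model. This is not Pacemaker code: `ChangeOnlyReporter` and `run` are hypothetical names, and the model simply takes the "report only on change" description at face value to show why a single lost notification leaves the resource stuck, and why forcing the return code to flip gets a fresh notification through.

```python
OCF_SUCCESS = 0
OCF_NOT_RUNNING = 7


class ChangeOnlyReporter:
    """Toy stand-in for lrmd: forwards a monitor result only when it changes."""

    def __init__(self):
        self.last = None

    def report(self, rc):
        if rc == self.last:
            return None  # unchanged result, nothing is sent
        self.last = rc
        return rc


def run(results, lost_indexes=()):
    """Feed monitor results through the reporter.

    Returns the list of codes that actually reach the consumer; events whose
    index is in `lost_indexes` are generated but dropped on the way.
    """
    reporter = ChangeOnlyReporter()
    received = []
    for i, rc in enumerate(results):
        event = reporter.report(rc)
        if event is not None and i not in lost_indexes:
            received.append(event)
    return received


# The first OCF_NOT_RUNNING (index 1) is lost; later identical results are never resent.
stuck = run([OCF_SUCCESS, OCF_NOT_RUNNING, OCF_NOT_RUNNING, OCF_NOT_RUNNING],
            lost_indexes={1})
print(stuck)

# Workaround: temporarily return OCF_SUCCESS, then revert so the code changes again.
recovered = run([OCF_SUCCESS, OCF_NOT_RUNNING, OCF_NOT_RUNNING,
                 OCF_SUCCESS, OCF_SUCCESS, OCF_NOT_RUNNING],
                lost_indexes={1})
print(recovered)
```

In the first run the consumer never sees OCF_NOT_RUNNING; in the second, the final change back to OCF_NOT_RUNNING is delivered.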

Bug Checker Bot (bug-checker) wrote : Autochecker

(This check was performed automatically)
Please make sure that the bug description contains the following sections, filled in with the appropriate data related to the bug you are describing:

actual result

expected result

steps to reproduce

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
Denis Meltsaykin (dmeltsaykin) wrote :

Dmitry, is there any progress on the bug? It seems like we need to patch pacemaker, right?

Bogdan Dobrelya (bogdando) wrote :

Please backport; the issue was fixed in master.

Changed in mos:
status: Confirmed → Invalid
Bogdan Dobrelya (bogdando) wrote :

Well, the ISO is old and does not contain the backported fixes; see:
fuel-library9.0-9.0.0-1.mos8032.noarch
* Thu Mar 03 2016 Jenkins <email address hidden>
055b235 Merge "Refactor to pcmk_ resources"
