Rabbitmq OCF RA: Pacemaker reports a slave running and does nothing to the resource, but lrmd logs contain a periodic error from the 2nd monitor

Bug #1567355 reported by Bogdan Dobrelya
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Fuel for OpenStack | Fix Committed | High | Bogdan Dobrelya |
Mitaka | Fix Released | High | Bogdan Dobrelya |
Newton | Fix Committed | High | Bogdan Dobrelya |

Bug Description

Note that this is an intermittent issue; there are no 100% reliable repro steps.

Pacemaker sometimes reports a slave as running and takes no action on the resource, even though the lrmd logs contain a periodic error from the 2nd monitor and the rabbitmq app stays down, with no recovery, for a long time. Here is an example I caught while running a Jepsen test against a rabbit cluster:

lrmd.log:
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-server[11463]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
Apr 5 15:18:12 n1 lrmd: ERROR: p_rabbitmq-server[11463]: get_monitor(): get_status() returns generic error 1
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-server[11463]: get_monitor(): ensuring this slave does not get promoted.
Apr 5 15:18:12 n1 lrmd: INFO: p_rabbitmq-server[11463]: master_score(): Updating master score attribute with 0
... snip ...
Apr 5 15:18:49 n1 lrmd: INFO: p_rabbitmq-server[12636]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
... snip (reoccurs every 35 sec as expected) ...
 Apr 5 15:27:30 n1 lrmd: INFO: p_rabbitmq-server[27111]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
 Apr 5 15:28:08 n1 lrmd: INFO: p_rabbitmq-server[28063]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker
 Apr 5 15:28:33 n1 lrmd: INFO: p_rabbitmq-server[29010]: get_monitor(): master exists and rabbit app is not running. Exiting to be restarted by pacemaker

pacemaker.log:
 Apr 05 15:13:17 [30970] n1 crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_35000: unknown error (node=n1, call=171, rc=1, cib-update=27, confirmed=false)
 ... snip (no more logs about error exit code!) ...
 Apr 05 15:28:34 [30970] n1 crmd: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_35000: unknown error (node=n1, call=174, rc=1, cib-update=30, confirmed=false)

So Pacemaker neither restarts the resource nor "notices" the errors reported in between.
Yet it recovers on its own about 15 minutes later:
 Apr 05 15:28:36 [30970] n1 crmd: info: do_lrm_rsc_op: Performing key=3:107:0:fd9993ea-2897-4c53-ae4c-bc30faf66315 op=p_rabbitmq-server_stop_0
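
One way to check for the mismatch described above is to count how often each side logged the failure. The following is a minimal sketch, not part of the original report: the function names are hypothetical, the log file paths are supplied by the caller, and the match strings are taken from the excerpts above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; log paths are passed in by the caller.

# Print the number of lines in file $2 that contain the fixed string $1.
count_lines() {
    # grep -c prints the count but exits 1 when the count is 0, so ignore that status.
    grep -cF -- "$1" "$2" || true
}

# Compare monitor failures logged by lrmd with the monitor events logged by crmd.
compare_monitor_failures() {
    lrmd_log=$1
    pacemaker_log=$2
    failed=$(count_lines "rabbit app is not running" "$lrmd_log")
    noticed=$(count_lines "p_rabbitmq-server_monitor_35000: unknown error" "$pacemaker_log")
    echo "lrmd failures: $failed, crmd events: $noticed"
}
```

Called as `compare_monitor_failures lrmd.log pacemaker.log`, a large gap between the two numbers would match the behaviour seen here, where lrmd keeps logging the failure every monitor interval while crmd logs almost nothing.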

The solution is to stop the rabbitmq server process explicitly when the monitor fails, instead of hoping that Pacemaker will restart it.
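
The idea can be sketched roughly as follows. This is an illustration only, not the code of the actual resource agent: `monitor_sketch`, `master_exists`, `rabbit_app_running` and `stop_server` are hypothetical names, the cluster state is simulated with variables so the snippet runs standalone, and the return code after the stop is assumed to stay the generic error seen in the logs.

```shell
#!/bin/sh
# Sketch only: every function below is a hypothetical stand-in, not the
# real rabbitmq-server OCF resource agent code.

# OCF return codes used here (standard values from the OCF resource agent API).
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Simulated state so the sketch runs without a cluster.
MASTER_EXISTS=${MASTER_EXISTS:-yes}
APP_RUNNING=${APP_RUNNING:-no}
SERVER_STOPPED=no

master_exists()      { [ "$MASTER_EXISTS" = "yes" ]; }
rabbit_app_running() { [ "$APP_RUNNING" = "yes" ]; }

# Stand-in for whatever actually stops the server process in the real agent.
stop_server() {
    SERVER_STOPPED=yes
    echo "stopping the rabbitmq server process"
}

monitor_sketch() {
    if rabbit_app_running; then
        return $OCF_SUCCESS
    fi
    if master_exists; then
        echo "master exists and rabbit app is not running"
        # Do not wait for Pacemaker to react to the failed monitor:
        # stop the server process right away.
        stop_server
    fi
    # Assumption: the monitor still reports the generic error (rc=1 in the logs).
    return $OCF_ERR_GENERIC
}
```

With the server process stopped, a failed monitor no longer leaves a half-alive node behind while waiting for Pacemaker to act on the error.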

Changed in fuel:
importance: Undecided → High
milestone: none → 9.0
tags: added: pacemaker rabbitmq
Changed in fuel:
assignee: nobody → Bogdan Dobrelya (bogdando)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/302669

Changed in fuel:
status: New → In Progress
Bug Checker Bot (bug-checker) wrote : Autochecker

(This check performed automatically)
Please, make sure that bug description contains the following sections filled in with the appropriate data related to the bug you are describing:

actual result

version

For more detailed information on the contents of each of the listed sections see https://wiki.openstack.org/wiki/Fuel/How_to_contribute#Here_is_how_you_file_a_bug

tags: added: need-info
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/302669
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=fc1e8aa4c79a3181fa880ff18f883c237de2dd06
Submitter: Jenkins
Branch: master

commit fc1e8aa4c79a3181fa880ff18f883c237de2dd06
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Apr 7 12:58:59 2016 +0200

    Stop a rabbitmq pacemaker resource when monitor fails

    Upstream PR https://github.com/rabbitmq/rabbitmq-server/pull/731
    Closes-bug: #1567355

    Change-Id: I83415e0e2a40f0e99e7baa26e35b6f7463c52928
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/307635

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/307635
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=b684391018bb3d8cac20083d9217ba821cf02384
Submitter: Jenkins
Branch: stable/mitaka

commit b684391018bb3d8cac20083d9217ba821cf02384
Author: Bogdan Dobrelya <email address hidden>
Date: Thu Apr 7 12:58:59 2016 +0200

    Stop a rabbitmq pacemaker resource when monitor fails

    Upstream PR https://github.com/rabbitmq/rabbitmq-server/pull/731
    Closes-bug: #1567355

    Change-Id: I83415e0e2a40f0e99e7baa26e35b6f7463c52928
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit fc1e8aa4c79a3181fa880ff18f883c237de2dd06)

Alexey Galkin (agalkin) wrote :

The fix was missing in 9.0-242. Waiting for a new ISO.

Alexey Galkin (agalkin) wrote :

Verified as fixed in 9.0-254.
