Pacemaker never restarts RabbitMQ if rabbitmqctl times out

Bug #1618843 reported by Dmitry Mescheryakov
This bug affects 1 person
Affects             Status         Importance  Assigned to          Milestone
Fuel for OpenStack  Fix Committed  High        Dmitry Mescheryakov
Mitaka              Fix Released   High        Dmitry Mescheryakov

Bug Description

Version: 9.1

Steps to reproduce:
1. Install environment consisting of 3 controllers.
2. On one of the controllers, make the 'rabbitmqctl cluster_status' command run for more than 1 minute. You can do that, for instance, by applying the following patch to /usr/sbin/rabbitmqctl: http://paste.openstack.org/show/565123/
   Another alternative is to execute the following command on one of the controllers:
   rabbitmqctl eval 'sys:suspend(whereis(rabbit_node_monitor)).'
   That will make cluster_status hang on all controllers simultaneously until RabbitMQ is restarted.
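
To confirm that step 2 took effect, one can check whether the command really exceeds the one-minute mark. The following is only a sketch: the helper name command_times_out is hypothetical, and it relies on nothing beyond Python's standard subprocess module.

```python
import subprocess


def command_times_out(cmd, timeout_s):
    """Return True if cmd is still running after timeout_s seconds.

    Hypothetical helper for illustration; it is not part of Fuel or
    of the RabbitMQ OCF script.
    """
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return True
    return False
```

On the affected controller, command_times_out(["rabbitmqctl", "cluster_status"], 60) would be expected to return True once the command hangs as described in step 2.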

Expected result:
After some time (up to 5 minutes), Pacemaker should restart RabbitMQ on the controller where rabbitmqctl was corrupted.

Actual result:
RabbitMQ is not restarted, not even after several days.

Reproducibility: 100%

Details:
Look into /var/log/remote/node-X/lrmd.log on the master node, where node-X is the hostname of the controller on which you have corrupted rabbitmqctl.

After some time you will see the following entries appear in the log:
2016-08-29T22:16:36.741121+00:00 err: ERROR: p_rabbitmq-server[4041]: get_monitor():: is_cluster_status_ok: 'rabbitmqctl cluster_status' timed out 1 of max. 3 time(s) in a row and is not responding. The resource is failed.
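
To follow the counter across many such entries, the timestamp and the two numbers can be pulled out of each line. This is a minimal sketch that assumes every entry has the same layout as the single example above; the names TIMEOUT_ENTRY and parse_timeout_entry are hypothetical.

```python
import re

# Pattern modelled on the one log entry quoted in this report; other
# versions of the OCF script may word the message differently.
TIMEOUT_ENTRY = re.compile(
    r"^(?P<timestamp>\S+) .*'rabbitmqctl cluster_status' timed out "
    r"(?P<count>\d+) of max\. (?P<limit>\d+) time\(s\) in a row"
)


def parse_timeout_entry(line):
    """Return (timestamp, count, limit) for a timeout entry, else None."""
    match = TIMEOUT_ENTRY.search(line)
    if match is None:
        return None
    return (
        match.group("timestamp"),
        int(match.group("count")),
        int(match.group("limit")),
    )
```

Applied to each line of lrmd.log, this yields the sequence of counter values, which makes it easy to see when the count passes the limit.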

At some point the counter will go over the limit (3), but nothing will change: Pacemaker will not restart the resource even though the OCF script clearly reports an error. In pacemaker.log it can be seen that Pacemaker sends pre-stop notifications about that node. If one adds logging of such notifications to the RabbitMQ OCF script, it can be seen that they do reach the nodes. Such pre-stop notifications are sent periodically (every 3-5 minutes). But nothing else happens; Pacemaker does not initiate the stop action itself.
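
The counting behaviour described above can be modelled as follows. This is an illustrative sketch, not the OCF script's actual code: the class name TimeoutCounter is hypothetical, and resetting the count after a successful run is an assumption inferred from the "in a row" wording of the log message.

```python
class TimeoutCounter:
    """Model of a consecutive-timeout counter with a fixed limit."""

    def __init__(self, limit=3):
        self.limit = limit  # the "max. 3" value from the log message
        self.count = 0      # consecutive timeouts seen so far

    def record(self, timed_out):
        """Record one monitor run; return True once the count exceeds the limit."""
        if timed_out:
            self.count += 1
        else:
            # Assumption: a successful run breaks the "in a row" streak.
            self.count = 0
        return self.count > self.limit
```

In this model the monitor reports a failure only once the count goes over the limit. The bug is that Pacemaker does not act on that reported failure.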

Tags: area-library
Changed in fuel:
importance: Undecided → High
assignee: nobody → Dmitry Mescheryakov (dmitrymex)
milestone: none → 9.1
status: New → Confirmed
tags: added: area-library
summary: + Pacemaker never restarts RabbitMQ
summary: - Pacemaker never restarts RabbitMQ
+ Pacemaker never restarts RabbitMQ if rabbitmqctl times out
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/366058

Changed in fuel:
status: Confirmed → In Progress
no longer affects: fuel/newton
Changed in fuel:
milestone: 9.2 → 10.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/366858

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/366058
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=5fd072a4d626619fe77c591a08158cabdc729efb
Submitter: Jenkins
Branch: master

commit 5fd072a4d626619fe77c591a08158cabdc729efb
Author: Dmitry Mescheryakov <email address hidden>
Date: Thu Sep 1 17:57:09 2016 +0300

    Remove second slave monitor for RabbitMQ resource

    If the OCF script monitor action takes more than a minute, Pacemaker
    does not react on a returned error. Removing one of two slave
    monitors fixes that problem. The second monitor does not bring any
    value, so that change does not have any downside.

    Change-Id: I22adebbfeed1f128c068dc2a15d0f374337a32d4
    Closes-Bug: #1618843

Changed in fuel:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/mitaka)

Reviewed: https://review.openstack.org/366858
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=bf993fda123b08adfc2ddb2fd56330cc0a8ce5ac
Submitter: Jenkins
Branch: stable/mitaka

commit bf993fda123b08adfc2ddb2fd56330cc0a8ce5ac
Author: Dmitry Mescheryakov <email address hidden>
Date: Thu Sep 1 17:57:09 2016 +0300

    Remove second slave monitor for RabbitMQ resource

    If the OCF script monitor action takes more than a minute, Pacemaker
    does not react on a returned error. Removing one of two slave
    monitors fixes that problem. The second monitor does not bring any
    value, so that change does not have any downside.

    Change-Id: I22adebbfeed1f128c068dc2a15d0f374337a32d4
    Closes-Bug: #1618843

tags: added: on-verification
ElenaRossokhina (esolomina) wrote :

Verified on 9.1 snapshot 264

tags: removed: on-verification
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0rc1

This issue was fixed in the openstack/fuel-library 10.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/fuel-library 10.0.0

This issue was fixed in the openstack/fuel-library 10.0.0 release.
