OCF Rabbit multiple failure tolerance mode does not clean up failures count after exiting monitor subroutine

Bug #1513421 reported by Vladimir Kuklin
This bug affects 3 people
Affects             Status        Importance  Assigned to          Milestone
Fuel for OpenStack  Fix Released  High        Dmitry Mescheryakov
8.0.x               Fix Released  High        Dmitry Mescheryakov
Mitaka              Fix Released  High        Dmitry Mescheryakov

Bug Description

Fuel-Library SHA: 2c10f24398636e45f4661a4cacbeda70ec93f606

2015-11-04T17:15:39.108776+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 36 of max. 1 time(s) in a row and is not responding. The resource is failed.

Dmitry Klenov (dklenov)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Dmitry Mescheryakov (dmitrymex)
importance: Medium → High
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Guys, the OCF script cleans up the failure count on RabbitMQ start here:
https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L1416-L1418

I.e. the script should have restarted RabbitMQ a long time ago (after the first failure), and that would have cleaned up the counter. Can we see lrmd.log for the corresponding node? We need to understand what happened.

Changed in fuel:
assignee: Dmitry Mescheryakov (dmitrymex) → Vladimir Kuklin (vkuklin)
status: Confirmed → Incomplete
Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

This bug is missing a diagnostic snapshot and steps to reproduce.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Okay, Matt, it is simple: inject an error return code into the piece that does the list_channels check and you will reproduce it easily.

Dmitry, the lines that you shared are not relevant.

Also, I meant that we should actually clear the counter after one successful command exit. Otherwise, even with max_rabbitmqctl_timeout set to 3, it will fail forever once there have been at least three separate one-time failures.
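
A minimal sketch of the counter behaviour being asked for here, assuming a hypothetical `record_result` helper and a hypothetical `timeouts_in_a_row` variable (neither name is taken from the OCF script, and the `-ge` comparison is also an assumption). It only illustrates that a single success should clear the streak, so isolated failures never add up to the limit.

```shell
#!/bin/sh
# Hypothetical sketch: all names are illustrative, not from the OCF script.
MAX_TIMEOUTS=3        # stands in for max_rabbitmqctl_timeout
timeouts_in_a_row=0   # consecutive-failure counter

# record_result RC: update the counter from a command's exit code.
# Returns 1 once the streak reaches the limit, 0 otherwise.
record_result() {
    if [ "$1" -eq 0 ]; then
        # One successful run clears the streak.
        timeouts_in_a_row=0
        return 0
    fi
    timeouts_in_a_row=$((timeouts_in_a_row + 1))
    if [ "$timeouts_in_a_row" -ge "$MAX_TIMEOUTS" ]; then
        return 1
    fi
    return 0
}

# Three isolated failures separated by successes never reach the limit.
for rc in 1 0 1 0 1 0; do
    record_result "$rc" || echo "resource failed"
done
echo "counter after isolated failures: $timeouts_in_a_row"
```

With this shape, only three failures in a row would mark the resource as failed; the loop above ends with the counter back at 0.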

Changed in fuel:
status: Incomplete → Confirmed
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Or, Matt, you can make the check return RAND % 2 and set max_rabbitmqctl_timeout to 3, and you will see that it dies forever after 3 monitor commands.
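
A rough sketch of that kind of fault injection, assuming a hypothetical `flaky` wrapper that is not part of the OCF script. It fails every other call, which is a deterministic stand-in for "return RAND % 2", and tracks two counters side by side: one that is never reset and one that is reset on success. Whether the real script behaves like the first counter is exactly what is being debated in this report.

```shell
#!/bin/sh
# Hypothetical fault-injection harness; not part of the OCF script.
calls=0

# flaky CMD...: fail on odd-numbered calls instead of running CMD.
flaky() {
    calls=$((calls + 1))
    if [ $((calls % 2)) -eq 1 ]; then
        return 1
    fi
    "$@"
}

total_failures=0   # never reset: the behaviour being warned about
streak=0           # reset on every success

for attempt in 1 2 3 4 5 6; do
    if flaky true; then
        streak=0
    else
        total_failures=$((total_failures + 1))
        streak=$((streak + 1))
    fi
done

echo "total failures: $total_failures, current streak: $streak"
```

A limit of 3 compared against `total_failures` would already be hit here, while the same limit compared against `streak` would not.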

Revision history for this message
Matthew Mosesohn (raytrac3r) wrote :

Assigning back to Fuel Library to be retriaged

Changed in fuel:
assignee: Vladimir Kuklin (vkuklin) → Fuel Library Team (fuel-library)
tags: added: tricky
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

What is the operations impact? I see zero deployment impact, so this bug can be High only if it affects the stability of operations.

Changed in fuel:
importance: High → Medium
Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

Vladimir, it seems I got your point. The script already drops the counter to 0 after a successful rabbitmqctl run here:
  https://github.com/openstack/fuel-library/blob/7db9175477f35e961c0178814928e103fdd29d81/files/fuel-ha-utils/ocf/rabbitmq#L1127

Does that address your concern? I believe the log line you posted in the bug description is caused by a different bug: the OCF script returns an error on the 'monitor' call, but Pacemaker does not react and just keeps calling 'monitor' again and again.
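
To make that hypothesis concrete, here is a small simulation, assuming a hypothetical `monitor_once` function and counter (illustrative names only, not the script's code). Every call times out, nothing restarts the service, and so the counter keeps climbing past the limit, which is the pattern visible in the "timed out 36 of max. 1" log line.

```shell
#!/bin/sh
# Hypothetical simulation; monitor_once and its variables are illustrative.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
MAX_TIMEOUTS=1
timeouts=0

# Every call counts as a timeout. No restart happens in between,
# so nothing ever clears the counter.
monitor_once() {
    timeouts=$((timeouts + 1))
    if [ "$timeouts" -gt "$MAX_TIMEOUTS" ]; then
        echo "timed out $timeouts of max. $MAX_TIMEOUTS time(s) in a row"
        return "$OCF_ERR_GENERIC"
    fi
    return "$OCF_SUCCESS"
}

# The cluster manager keeps polling without acting on the error.
n=0
while [ "$n" -lt 5 ]; do
    monitor_once || true
    n=$((n + 1))
done
```

If a restart did happen after the first reported error, the start path would reset the counter and the number in the message would stay small.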

tags: added: team-bugfix
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 8.0 → 9.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Reproduced in bug 1528488 as well, see:
2015-12-29T13:42:55.513788+00:00 node-17 lrmd err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 23 of max. 1 time(s) in a row and is not responding. The resource is failed.

Revision history for this message
Alexey Lebedeff (alebedev-a) wrote :

Reproduced in https://bugs.launchpad.net/mos/+bug/1533585:

2016-01-13T07:25:38.524486+00:00 err: ERROR: p_rabbitmq-server: get_monitor(): 'rabbitmqctl list_channels' timed out 730 of max. 1 time(s) in a row and is not responding. The resource is failed.

Revision history for this message
Dmitry Mescheryakov (dmitrymex) wrote :

I've increased the priority of the issue because it seems to cause the script not to restart RabbitMQ when a restart is needed.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/266890

Changed in fuel:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/266890
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=7379106d73e2d02a629b848bed1dce65acb9c7af
Submitter: Jenkins
Branch: master

commit 7379106d73e2d02a629b848bed1dce65acb9c7af
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed Jan 13 15:46:34 2016 +0300

    Reset master score if we decide to restart RabbitMQ on timeout

    Doing otherwise might not trigger the restart while it is clearly
    needed.

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/560

    Change-Id: I480ebaddc98fa0784098efbf0c5ab8c512c8661d
    Closes-Bug: #1513421

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/270778

tags: added: area-mos
removed: area-library team-bugfix
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/270778
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=5f31ad4a21dd31a268120c866eca252b3683e38a
Submitter: Jenkins
Branch: stable/8.0

commit 5f31ad4a21dd31a268120c866eca252b3683e38a
Author: Dmitry Mescheryakov <email address hidden>
Date: Wed Jan 13 15:46:34 2016 +0300

    Reset master score if we decide to restart RabbitMQ on timeout

    Doing otherwise might not trigger the restart while it is clearly
    needed.

    Upstream PR: https://github.com/rabbitmq/rabbitmq-server/pull/560

    Change-Id: I480ebaddc98fa0784098efbf0c5ab8c512c8661d
    Closes-Bug: #1513421
