Wrong post-start notify exit code in RabbitMQ OCF causing additional resource failures in Pacemaker

Bug #1438699 reported by Bogdan Dobrelya
Affects             Status         Importance  Assigned to       Milestone
Fuel for OpenStack  Fix Committed  High        Bogdan Dobrelya
  5.0.x             Invalid        High        Unassigned
  5.1.x             Won't Fix      High        Denis Meltsaykin
  6.0.x             Won't Fix      High        Denis Meltsaykin
  6.1.x             Fix Committed  High        Bogdan Dobrelya

Bug Description

Pacemaker sends the post-start notify event to all instances of the multistate RabbitMQ clone resource every time a rabbit node starts anywhere in the cluster. An error in the OCF script logic causes the resource to be reported as $OCF_NOT_RUNNING, which leads to additional restarts of the rabbitmq resources in Pacemaker.

The message "Failed to join the cluster on post-start. Resource is failed" indicates this issue. Nodes that are only processing the post-start notify should not report it, and the exit code for this event should be $OCF_SUCCESS on the nodes that are not joining the cluster. Only the node that has actually started and generated this notify is joining the cluster and may fail with this error message.
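
The intended behavior can be sketched in shell, the language OCF resource agents are commonly written in. This is a minimal illustration and not the actual fuel-library script: `handle_notify`, `node_is_starting`, `join_cluster`, and the `JOIN_RESULT` variable are hypothetical names, and it assumes Pacemaker passes the notify details through the `OCF_RESKEY_CRM_meta_notify_*` environment variables.

```shell
#!/bin/sh
# Illustrative sketch only; not the code from the fuel-library OCF script.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Hypothetical stand-in for the real cluster-join logic.
# JOIN_RESULT is an assumption that lets the sketch simulate a failed join.
join_cluster() {
    return "${JOIN_RESULT:-0}"
}

# Return 0 if the given node is in the space-separated list of nodes
# whose start triggered this notification.
node_is_starting() {
    this_node=$1
    for n in $OCF_RESKEY_CRM_meta_notify_start_uname; do
        [ "$n" = "$this_node" ] && return 0
    done
    return 1
}

handle_notify() {
    this_node=$1
    event="${OCF_RESKEY_CRM_meta_notify_type}-${OCF_RESKEY_CRM_meta_notify_operation}"

    # Only post-start is of interest in this sketch.
    [ "$event" = "post-start" ] || return $OCF_SUCCESS

    # Nodes that merely observe another node's start are not joining
    # the cluster, so they must not report a failure.
    if ! node_is_starting "$this_node"; then
        return $OCF_SUCCESS
    fi

    # Only the node that actually started may fail with this message.
    if ! join_cluster; then
        echo "Failed to join the cluster on post-start. Resource is failed" >&2
        return $OCF_NOT_RUNNING
    fi
    return $OCF_SUCCESS
}
```

With the notify variables set for a post-start event on `node-2`, calling `handle_notify node-1` returns `$OCF_SUCCESS` regardless of the join outcome, while `handle_notify node-2` returns `$OCF_NOT_RUNNING` only if the join itself fails.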

Changed in fuel:
milestone: none → 6.1
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/169320

Changed in fuel:
status: New → In Progress
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Bartlomiej Piotrowski (bpiotrowski)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/169320
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=dd34e200da582afd996c276530bb761ebd59dbb0
Submitter: Jenkins
Branch: master

commit dd34e200da582afd996c276530bb761ebd59dbb0
Author: Bogdan Dobrelya <email address hidden>
Date: Tue Mar 31 16:00:47 2015 +0200

    Fix post-start notify exit code for rabbit OCF

    There is an error in OCF logic causing the resource to be reported
    as $OCF_NOT_RUNNING that leads to additional restarts for rabbitmq
    resources in pacemaker.

    The solution is to ensure that only the node which actually has
    started and is joining the cluster may fail with this error code,
    while the other nodes may not.

    Closes-bug: #1438699

    Change-Id: I8d3b6e8f76a6a89608e59a52081c21931e5654fb
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered within the scope of a Maintenance Update. Also, a possible solution, backporting the RabbitMQ OCF script, is covered in detail in the Operations Guide in the official product documentation.
