RabbitMQ OCF applications check should rely on the 'kernel' module running instead of exit code

Bug #1446251 reported by Bogdan Dobrelya on 2015-04-20
This bug affects 1 person
Affects              Status   Importance   Assigned to              Milestone
Fuel for OpenStack            Critical     Alexander Nevenchannyy
5.1.x                         Critical     Bogdan Dobrelya
6.0.x                         Critical     Bogdan Dobrelya

Bug Description

This issue was discovered at the scale lab, when rabbit nodes were running under load.

In get_status() (https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/rabbitmq#L852-L853), the command rabbitmqctl eval 'rabbit_misc:which_applications().' may return an empty list [] while still exiting with code 0.
The current logic relies on the exit code only. It should at least check that the 'kernel' application is running, and otherwise report "not running".

This issue can appear only when the timeout specified for the stop or wait commands has been exceeded. That is common under load, which is why the impact is critical.
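The check described above could look roughly like the following sketch. It is not the code from the actual fix: the helper names apps_include_kernel and check_apps are hypothetical, the return codes are assumed to follow the usual OCF values (0 for success, 7 for "not running", normally provided by ocf-shellfuncs), and the match on '{kernel,' assumes the eval prints the applications as Erlang tuples.

```shell
#!/bin/sh
# Assumed OCF return codes; a real resource agent gets these from ocf-shellfuncs.
: "${OCF_SUCCESS:=0}"
: "${OCF_NOT_RUNNING:=7}"

# Hypothetical helper: succeeds if the eval output lists the 'kernel' application.
# $1: text printed by rabbitmqctl eval 'rabbit_misc:which_applications().'
apps_include_kernel() {
    printf '%s\n' "$1" | grep -qF '{kernel,'
}

# Hypothetical helper: combines the exit code with the content check.
# $1: exit code of the eval command, $2: its output
check_apps() {
    if [ "$1" -ne 0 ]; then
        # The command itself failed.
        return "$OCF_NOT_RUNNING"
    fi
    if apps_include_kernel "$2"; then
        return "$OCF_SUCCESS"
    fi
    # Exit code was 0, but no 'kernel' app (for example an empty list []).
    return "$OCF_NOT_RUNNING"
}

# Usage sketch (needs a RabbitMQ node, so it is left commented out):
#   out=$(rabbitmqctl eval 'rabbit_misc:which_applications().' 2>&1); rc=$?
#   check_apps "$rc" "$out"
```

With this shape, an empty result [] combined with exit code 0 is reported as "not running" instead of as a healthy resource.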

Changed in fuel:
milestone: none → 6.1
status: New → In Progress
assignee: nobody → Bogdan Dobrelya (bogdando)
importance: Undecided → Critical
Dina Belova (dbelova) on 2015-04-21
tags: added: scale
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Alexander Nevenchannyy (anevenchannyy)

Reviewed: https://review.openstack.org/175457
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=bc4a7ec8093db81d8d1de478788825e228890b05
Submitter: Jenkins
Branch: master

commit bc4a7ec8093db81d8d1de478788825e228890b05
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:31:33 2015 +0200

    Fix RabbitMQ apps eval in OCF

    W/o this fix, if there are no apps running and rabbit node is actually
    not functioning, get_status() would still report 0 considering the rabbit
    resource is running. This is an issue as it may lead to the situations
    when the resource reported OK, but in fact, the rabbit node is not a cluster
    member.

    The solution is to not rely only on which_applications() eval exit code
    and test if the kernel app is running. Otherwise consider the pacemaker
    resource is "not running" as well.

    Closes-bug: #1446251

    Change-Id: Ia2fcb18abb3d977c5fcb26bfdeac864b6834f478
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/179750
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=4369fc9c20db94dc641d6ce8e1179a6a57306546
Submitter: Jenkins
Branch: stable/6.0

commit 4369fc9c20db94dc641d6ce8e1179a6a57306546
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:31:33 2015 +0200

    Fix RabbitMQ apps eval in OCF

    W/o this fix, if there are no apps running and rabbit node is actually
    not functioning, get_status() would still report 0 considering the rabbit
    resource is running. This is an issue as it may lead to the situations
    when the resource reported OK, but in fact, the rabbit node is not a cluster
    member.

    The solution is to not rely only on which_applications() eval exit code
    and test if the kernel app is running. Otherwise consider the pacemaker
    resource is "not running" as well.

    Closes-bug: #1446251

    Change-Id: Ia2fcb18abb3d977c5fcb26bfdeac864b6834f478
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit bc4a7ec8093db81d8d1de478788825e228890b05)

Reviewed: https://review.openstack.org/179751
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=0237cfb11a8cd68fdf511eef9d0999cde1324584
Submitter: Jenkins
Branch: stable/5.1

commit 0237cfb11a8cd68fdf511eef9d0999cde1324584
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 17:31:33 2015 +0200

    Fix RabbitMQ apps eval in OCF

    W/o this fix, if there are no apps running and rabbit node is actually
    not functioning, get_status() would still report 0 considering the rabbit
    resource is running. This is an issue as it may lead to the situations
    when the resource reported OK, but in fact, the rabbit node is not a cluster
    member.

    The solution is to not rely only on which_applications() eval exit code
    and test if the kernel app is running. Otherwise consider the pacemaker
    resource is "not running" as well.

    Closes-bug: #1446251

    Change-Id: Ia2fcb18abb3d977c5fcb26bfdeac864b6834f478
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit bc4a7ec8093db81d8d1de478788825e228890b05)
