RabbitMQ OCF may hang on the stop/start actions as it ignores the stop/wait commands exit code

Bug #1446190 reported by Bogdan Dobrelya
This bug affects 1 person
Affects             Status         Importance  Assigned to                Milestone
Fuel for OpenStack  Fix Committed  Critical    Bogdan Dobrelya
  5.1.x             Fix Committed  Critical    Fuel Library (Deprecated)
  6.0.x             Fix Committed  Critical    Fuel Library (Deprecated)

Bug Description

This issue was discovered at the scale lab, when rabbit nodes were running under load.
The issues are:

1) stop_server_process() https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/rabbitmq#L596-L597 ignores the exit code of the "rabbitmqctl stop" command and instead checks the stale rc value left over from the last pidfile check. This is wrong and breaks the logic of the "stop" action.

2) try_to_start_rmq_app() https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/rabbitmq#L740-L744 ignores the exit code of the "rabbitmqctl wait" command and may hang until the resource agent's operation timeout is exceeded, which breaks the logic of the "start" action.
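
To illustrate issue 1, here is a minimal sketch of the stale rc pattern. It is not the agent's actual code: check_pidfile, stop_command, buggy_stop and fixed_stop are hypothetical stand-ins, and stop_command simply simulates a failed "rabbitmqctl stop".

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stand-ins (assumptions, not the agent's real functions).
check_pidfile() { return 0; }   # pretend the pidfile check succeeded
stop_command()  { return 1; }   # pretend "rabbitmqctl stop" failed

# Buggy pattern: rc still holds the result of the pidfile check,
# so the failed stop command goes unnoticed.
buggy_stop() {
    check_pidfile
    rc=$?
    stop_command
    if [ "$rc" -eq 0 ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}

# Fixed pattern: capture the exit code of the stop command itself.
fixed_stop() {
    check_pidfile
    rc=$?
    stop_command
    rc=$?
    if [ "$rc" -eq 0 ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}

if buggy_stop; then echo "buggy_stop: success reported despite the failed stop"; fi
if ! fixed_stop; then echo "fixed_stop: failure reported"; fi
```

The only difference between the two functions is the second rc=$? right after the stop command. Without it the check tests a value that has nothing to do with the command that just ran.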

Here are example logs:
broken stop: http://paste.openstack.org/show/H89Uo8ZdPlMUstlp1Tb5/
broken start: http://paste.openstack.org/show/nHFoeSn21kne22vtBHZS/

These issues appear only when the timeout specified for the stop or wait command is exceeded. That is a common case under load, hence the critical impact.
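
As a rough sketch of how an exceeded timeout can be detected, assuming the commands are wrapped in the GNU coreutils timeout utility (which exits with 124 when the time limit is hit): run_with_timeout below is a hypothetical helper, not a function from the OCF script, and sleep stands in for a slow rabbitmqctl call.

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# run_with_timeout LIMIT COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under a time limit and reports a
# generic failure if the limit was exceeded.
run_with_timeout() {
    limit=$1
    shift
    timeout "$limit" "$@"
    rc=$?
    if [ "$rc" -eq 124 ]; then
        # GNU timeout exits with 124 when the command timed out.
        echo "command timed out after ${limit}s" >&2
        return $OCF_ERR_GENERIC
    fi
    # Otherwise pass through the command's own exit code.
    return "$rc"
}

# A command that outlives its limit is reported as a failure.
if ! run_with_timeout 1 sleep 3; then
    echo "slow command reported as failed"
fi
```

The point is that the caller has to look at the exit code immediately after the timed command. If it is ignored, a timed-out stop or wait is indistinguishable from one that completed.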

Tags: scale
description: updated
Changed in fuel:
milestone: none → 6.1
importance: Undecided → Critical
assignee: nobody → Bogdan Dobrelya (bogdando)
status: New → In Progress
description: updated
summary: - RabbitMQ OCF may hang on the stop action as it ignores the stop command
- exit code
+ RabbitMQ OCF may hang on the stop/start actions as it ignores the
+ stop/wait commands exit code
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/175371

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/175371
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1b553e823effcd14d59f42ec36852339d709e7a4
Submitter: Jenkins
Branch: master

commit 1b553e823effcd14d59f42ec36852339d709e7a4
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 14:14:06 2015 +0200

    Fix RabbitMQ OCF stop/start actions

    W/o this fix, rabbitMQ OCF may hang on the stop or start
    as it ignores the stop/wait commands exit code. This is an
    issue as it prevents rabbit node from joining the cluster.

    The solution is to report generic failure, if corresponding
    commands timed out:
    * Fix stop_server_process() timeout command exit code check
    * Fix try_to_start_rmq_app() timeout command exit code check

    Closes-bug: #1446190

    Change-Id: I8a0ccdde88cbb4c8545427f5bd6bdca6856a4687
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Dina Belova (dbelova)
tags: added: scale
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/5.1)

Fix proposed to branch: stable/5.1
Review: https://review.openstack.org/175836

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/6.0)

Fix proposed to branch: stable/6.0
Review: https://review.openstack.org/175838

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/6.0)

Reviewed: https://review.openstack.org/175838
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=6d63386abe02d19c8f0fc75ac2219d30581249cf
Submitter: Jenkins
Branch: stable/6.0

commit 6d63386abe02d19c8f0fc75ac2219d30581249cf
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 14:14:06 2015 +0200

    Fix RabbitMQ OCF stop/start actions

    W/o this fix, rabbitMQ OCF may hang on the stop or start
    as it ignores the stop/wait commands exit code. This is an
    issue as it prevents rabbit node from joining the cluster.

    The solution is to report generic failure, if corresponding
    commands timed out:
    * Fix stop_server_process() timeout command exit code check.
    * Fix try_to_start_rmq_app() timeout command exit code check

    Closes-bug: #1446190

    Change-Id: I8a0ccdde88cbb4c8545427f5bd6bdca6856a4687
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 1b553e823effcd14d59f42ec36852339d709e7a4)

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/5.1)

Reviewed: https://review.openstack.org/175836
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=8338d59832f87d5ae1b4bbf431fd06807e258a44
Submitter: Jenkins
Branch: stable/5.1

commit 8338d59832f87d5ae1b4bbf431fd06807e258a44
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 14:14:06 2015 +0200

    Fix RabbitMQ OCF stop/start actions

    W/o this fix, rabbitMQ OCF may hang on the stop or start
    as it ignores the stop/wait commands exit code. This is an
    issue as it prevents rabbit node from joining the cluster.

    The solution is to report generic failure, if corresponding
    commands timed out:
    * Fix stop_server_process() timeout command exit code check.
    * Fix try_to_start_rmq_app() timeout command exit code check.

    Closes-bug: #1446190

    Change-Id: I8a0ccdde88cbb4c8545427f5bd6bdca6856a4687
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 1b553e823effcd14d59f42ec36852339d709e7a4)
