RabbitMQ OCF may hang on the stop/start actions as it ignores the stop/wait commands exit code

Bug #1446190 reported by Bogdan Dobrelya on 2015-04-20
This bug affects 1 person
Affects: Fuel for OpenStack | Importance: Critical | Assigned to: Bogdan Dobrelya
Affects: 5.1.x | Importance: Critical | Assigned to: Fuel Library (Deprecated)
Affects: 6.0.x | Importance: Critical | Assigned to: Fuel Library (Deprecated)

Bug Description

This issue was discovered at the scale lab, when rabbit nodes were running under load.
The issues are:

1) stop_server_process() https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/rabbitmq#L596-L597 ignores the exit code of the "rabbitmqctl stop" command and instead checks the stale rc value left over from the last pidfile check. This is wrong and breaks the logic of the "stop" action.

2) try_to_start_rmq_app() https://github.com/stackforge/fuel-library/blob/master/deployment/puppet/cluster/files/ocf/rabbitmq#L740-L744 ignores the exit code of the "rabbitmqctl wait" command and may hang until the resource agent's operation timeout is exceeded, which breaks the logic of the "start" action.
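The stale rc problem from item 1 can be illustrated with a minimal sketch. This is not the code of the resource agent: check_pidfile and stop_command are hypothetical stand-ins for the pidfile check and for the "rabbitmqctl stop" call, and the function names stop_buggy and stop_fixed are assumptions made for the illustration.

```shell
#!/bin/bash
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical stand-ins, not functions of the real agent.
check_pidfile() { return 0; }   # pretend the pidfile check succeeded
stop_command()  { return 1; }   # pretend the stop command failed or timed out

# Buggy pattern: rc is set by the pidfile check and never updated,
# so the result of the stop command is silently ignored.
stop_buggy() {
    local rc=0
    check_pidfile || rc=$?
    stop_command || :           # exit code is never inspected
    if [ "$rc" -eq 0 ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}

# Fixed pattern: capture the exit code of the stop command itself,
# right where the command runs, and decide based on that value.
stop_fixed() {
    local rc=0
    check_pidfile || :
    stop_command || rc=$?
    if [ "$rc" -eq 0 ]; then
        return $OCF_SUCCESS
    fi
    return $OCF_ERR_GENERIC
}
```

With these stand-ins, stop_buggy reports success even though the stop command failed, while stop_fixed reports a generic error.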

Example logs:
broken stop: http://paste.openstack.org/show/H89Uo8ZdPlMUstlp1Tb5/
broken start: http://paste.openstack.org/show/nHFoeSn21kne22vtBHZS/

These issues appear only when the timeout specified for the stop or wait command is exceeded. That is a common case under load, hence the critical impact.
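A minimal sketch of detecting such a timeout, assuming the command is wrapped with the GNU coreutils timeout utility, which exits with status 124 when the time limit is hit. The helper name run_with_timeout is hypothetical and does not come from the resource agent.

```shell
#!/bin/bash
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Hypothetical helper: run a command under a time limit (in seconds)
# and map the outcome to an OCF return code.
run_with_timeout() {
    local limit=$1
    shift
    local rc=0
    # GNU timeout exits with 124 if the command is still running
    # when the limit expires; otherwise it passes the command's
    # own exit code through.
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "command timed out after ${limit}s: $*" >&2
        return $OCF_ERR_GENERIC
    elif [ "$rc" -ne 0 ]; then
        echo "command failed with exit code ${rc}: $*" >&2
        return $OCF_ERR_GENERIC
    fi
    return $OCF_SUCCESS
}
```

For example, "run_with_timeout 1 sleep 5" returns 1 after about one second, and "run_with_timeout 5 true" returns 0. Reporting a generic error to the caller lets the cluster manager react to the failure instead of waiting for the whole operation timeout to expire.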

description: updated
Changed in fuel:
milestone: none → 6.1
importance: Undecided → Critical
assignee: nobody → Bogdan Dobrelya (bogdando)
status: New → In Progress
description: updated
summary: - RabbitMQ OCF may hang on the stop action as it ignores the stop command
- exit code
+ RabbitMQ OCF may hang on the stop/start actions as it ignores the
+ stop/wait commands exit code
description: updated

Reviewed: https://review.openstack.org/175371
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=1b553e823effcd14d59f42ec36852339d709e7a4
Submitter: Jenkins
Branch: master

commit 1b553e823effcd14d59f42ec36852339d709e7a4
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 14:14:06 2015 +0200

    Fix RabbitMQ OCF stop/start actions

    W/o this fix, rabbitMQ OCF may hang on the stop or start
    as it ignores the stop/wait commands exit code. This is an
    issue as it prevents rabbit node from joining the cluster.

    The solution is to report generic failure, if corresponding
    commands timed out:
    * Fix stop_server_process() timeout command exit code check
    * Fix try_to_start_rmq_app() timeout command exit code check

    Closes-bug: #1446190

    Change-Id: I8a0ccdde88cbb4c8545427f5bd6bdca6856a4687
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Changed in fuel:
status: In Progress → Fix Committed
Dina Belova (dbelova) on 2015-04-21
tags: added: scale

Reviewed: https://review.openstack.org/175838
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=6d63386abe02d19c8f0fc75ac2219d30581249cf
Submitter: Jenkins
Branch: stable/6.0

commit 6d63386abe02d19c8f0fc75ac2219d30581249cf
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 14:14:06 2015 +0200

    Fix RabbitMQ OCF stop/start actions

    W/o this fix, rabbitMQ OCF may hang on the stop or start
    as it ignores the stop/wait commands exit code. This is an
    issue as it prevents rabbit node from joining the cluster.

    The solution is to report generic failure, if corresponding
    commands timed out:
    * Fix stop_server_process() timeout command exit code check.
    * Fix try_to_start_rmq_app() timeout command exit code check

    Closes-bug: #1446190

    Change-Id: I8a0ccdde88cbb4c8545427f5bd6bdca6856a4687
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 1b553e823effcd14d59f42ec36852339d709e7a4)

Reviewed: https://review.openstack.org/175836
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=8338d59832f87d5ae1b4bbf431fd06807e258a44
Submitter: Jenkins
Branch: stable/5.1

commit 8338d59832f87d5ae1b4bbf431fd06807e258a44
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Apr 20 14:14:06 2015 +0200

    Fix RabbitMQ OCF stop/start actions

    W/o this fix, rabbitMQ OCF may hang on the stop or start
    as it ignores the stop/wait commands exit code. This is an
    issue as it prevents rabbit node from joining the cluster.

    The solution is to report generic failure, if corresponding
    commands timed out:
    * Fix stop_server_process() timeout command exit code check.
    * Fix try_to_start_rmq_app() timeout command exit code check.

    Closes-bug: #1446190

    Change-Id: I8a0ccdde88cbb4c8545427f5bd6bdca6856a4687
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 1b553e823effcd14d59f42ec36852339d709e7a4)
