Rabbit OCF stop action shall not fail

Bug #1529897 reported by Bogdan Dobrelya
This bug affects 2 people
Affects             Status        Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Released  High        Bogdan Dobrelya
7.0.x               Won't Fix     High        Denis Puchkin
8.0.x               Fix Released  High        Bogdan Dobrelya

Bug Description

Source bug https://bugs.launchpad.net/fuel/+bug/1529861

The following events brought the rabbit pacemaker resource to an unmanaged state, because the stop action failed:

node-5 lrmd.log
2015-12-29T02:05:34.258234+00:00 warning: WARNING: p_rabbitmq-server: kill_rmq_and_remove_pid(): RMQ-runtime (beam) PID=16782 stopped by 'kill -TERM', sorry...
2015-12-29T02:05:44.297766+00:00 info: INFO: p_rabbitmq-server: stop: action end.

node-5 crmd.log
2015-12-29T02:05:44.770620+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_stop_0: unknown error (node=node-5.test.domain.local, call=201, rc=1, cib-update=201, confirmed=true)

There are three issues:
1) The message "stop: action end" is issued before the action actually ends, which is very confusing. There is also a hard-coded sleep 10, put there in the hope that kill -TERM will succeed in time. See https://github.com/openstack/fuel-library/blob/master/files/fuel-ha-utils/ocf/rabbitmq#L1488-L1496

2) get_status unexpectedly reported a generic error, and this was returned as the result of the stop action as well, so the resource became unmanaged.

3) The exit paths of stop_server_process() are messy: it may return ERROR that the caller ignores, and it may exit after rabbitmqctl stop with or without ensuring that no beam process is left.

These must be fixed.
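
As an illustration of issues 1 to 3, here is a minimal sketch of a stop wrapper that does not ignore a failed stop step, reports success only when the status check says "not running", and logs the end of the action only once the result is final. This is not the code of the actual OCF script: the helper names stop_result_from_status and stop_and_report are hypothetical, and only the numeric OCF return codes are standard.

```shell
# Standard OCF return code values.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1
OCF_NOT_RUNNING=7

# Hypothetical helper: a stop may report success only when the status
# check says "not running"; any other status is a generic error.
stop_result_from_status() {
    if [ "$1" -eq "$OCF_NOT_RUNNING" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_ERR_GENERIC"
}

# Hypothetical wrapper: $1 is a stop command, $2 is a status command.
stop_and_report() {
    sar_rc=$OCF_SUCCESS
    if "$1"; then
        sar_status=0
        "$2" || sar_status=$?
        stop_result_from_status "$sar_status" || sar_rc=$?
    else
        # Fail early: an error from the stop step is not ignored.
        sar_rc=$OCF_ERR_GENERIC
    fi
    # Log the end of the action only once the result is final.
    echo "stop: action end (rc=$sar_rc)"
    return "$sar_rc"
}
```

The wrapper takes the two steps as command names so that it can be exercised with stub functions, without a running RabbitMQ node.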

Changed in fuel:
importance: Undecided → Critical
milestone: none → 8.0
assignee: nobody → Bogdan Dobrelya (bogdando)
Changed in fuel:
importance: Critical → High
status: New → In Progress
tags: added: area-library rabbitmq
description: updated
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/262519

tags: added: team-bugfix
Bogdan Dobrelya (bogdando) wrote :

How to test:
# crm resource unmanage p_rabbitmq-server-master
# kill -STOP `cat /var/run/rabbitmq/pid`; rm -f /var/run/rabbitmq/pid
# OCF_ROOT=/usr/lib/ocf /usr/lib/ocf/resource.d/fuel/rabbitmq-server-upstream stop

Once the resource is reported as stopped, no matching beam process must be running anymore.
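
The final check can be scripted. The sketch below assumes that pgrep (from procps) is available on the node; the helper name no_matching_process and the example pattern are hypothetical and are not taken from the OCF script.

```shell
# Hypothetical check: succeed only if no process has a command line
# matching the given pattern (pgrep -f matches the full command line).
no_matching_process() {
    if pgrep -f "$1" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

For example, after the stop command returns, run `no_matching_process 'beam.*rabbit' && echo "no beam left"`. The pattern is only an assumption about what the beam command line looks like on a given node.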

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/262519
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=5da85825d13ca66f4e4104388519a8206a509edf
Submitter: Jenkins
Branch: master

commit 5da85825d13ca66f4e4104388519a8206a509edf
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Dec 30 13:17:42 2015 +0100

    Fix monitor/stop operations for the rabbit OCF resource

    Without this fix, the following situations are possible:
    - beam is running and cannot process signals, but get_status() reports
    it as "not running", while it should be reported as a generic error
    - which_applications() returned an error, yet its output is still
    parsed for the "what" match, which it should not be
    - action stop and proc_stop give up when there is no pidfile and beam
    is running but unresponsive.

    The solution is to make get_status return a generic error and action
    stop use the rabbit process name matching for killing it. These and
    other related fixes are listed below (tl;dr)

    * Fix get_status, action_stop, proc_stop when beam is unresponsive
      (i.e. fails to process signals or does it very slowly)
      - Fix get_status() to catch beam state and output errors
      - Fix action_stop() to force name-based matching when there is no
        pidfile and beam is unresponsive
      - Fix proc_stop to use name-based matching if no pidfile is
        found
      - Fix proc_stop to retry sending the signal when using the
        name-based match as well
    * Fix get_status() unexpectedly reporting a generic error
      instead of "not running"
    * Add reworked proc_stop and proc_kill functions from the
      ocf-fuel-funcs
    * Rework stop_server_process()
      - make it return SUCCESS/ERROR as expected
      - grant "rabbitmqctl stop" a graceful termination window and only
        then ensure the beam process termination and pidfile removal as well
      - return the actual status with get_status()
    * Rework kill_rmq_and_remove_pid()
      - use proc_stop to try to kill by pgrp with -TERM, then -KILL, or
        by the beam process name match, if there is no PID.
      - make it return SUCCESS/ERROR
    * Fix action_stop()
      - fail early by the stop_server_process() results without additional
        rabbitmqctl invocations in the get_status() call
      - rework the hard-coded sleep 10 to use the graceful stop window in
        stop_server_process() instead
      - ensure the rabbit-start-time is removed from the CIB before trying
        to stop the server process
      - issue the "stop: action end" log record before the actual end
    * Add comments, adjust log levels and make the logs more informative

    Upstream PRs
    https://github.com/rabbitmq/rabbitmq-server/pull/523
    https://github.com/rabbitmq/rabbitmq-server/pull/532
    https://github.com/rabbitmq/rabbitmq-server/pull/538
    https://github.com/rabbitmq/rabbitmq-server/pull/540

    Closes-bug: #1529897

    Change-Id: I1c382e3cf004630847b6626fabaecaa0094ee271
    Signed-off-by: Bogdan Dobrelya <email address hidden>
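
The TERM-then-KILL escalation with a bounded grace window, which the commit message describes as replacing the fixed sleep 10, can be sketched roughly as follows. This is an illustration only, not the proc_stop code from ocf-fuel-funcs: the function names wait_for_exit and term_then_kill are hypothetical, the sketch works on a single PID, and the name-based fallback used when there is no pidfile is not shown.

```shell
# Hypothetical helper: wait up to $2 seconds for PID $1 to exit.
# Returns 0 once the process is gone, 1 if it is still there.
wait_for_exit() {
    wfe_pid=$1
    wfe_left=$2
    while [ "$wfe_left" -gt 0 ]; do
        kill -0 "$wfe_pid" 2>/dev/null || return 0
        sleep 1
        wfe_left=$((wfe_left - 1))
    done
    if kill -0 "$wfe_pid" 2>/dev/null; then
        return 1
    fi
    return 0
}

# Hypothetical helper: send TERM to PID $1, allow $2 seconds of grace,
# then escalate to KILL. Returns 0 only if the process is really gone.
term_then_kill() {
    ttk_pid=$1
    ttk_grace=$2
    kill -0 "$ttk_pid" 2>/dev/null || return 0   # nothing to stop
    kill -TERM "$ttk_pid" 2>/dev/null
    if wait_for_exit "$ttk_pid" "$ttk_grace"; then
        return 0
    fi
    kill -KILL "$ttk_pid" 2>/dev/null
    wait_for_exit "$ttk_pid" "$ttk_grace"
}
```

Polling with kill -0 lets the caller return as soon as the process exits instead of always sleeping for the full window, and the return value tells the caller whether the process is actually gone.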

Changed in fuel:
status: In Progress → Fix Committed
Dmitry Pyzhov (dpyzhov)
Changed in fuel:
milestone: 8.0 → 9.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/267413

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (stable/8.0)

Reviewed: https://review.openstack.org/267413
Committed: https://git.openstack.org/cgit/openstack/fuel-library/commit/?id=247d4b2612d97dafa557948fb7c7caa755007b2a
Submitter: Jenkins
Branch: stable/8.0

commit 247d4b2612d97dafa557948fb7c7caa755007b2a
Author: Bogdan Dobrelya <email address hidden>
Date: Wed Dec 30 13:17:42 2015 +0100

    Fix monitor/stop operations for the rabbit OCF resource

    Closes-bug: #1529897

    Change-Id: I1c382e3cf004630847b6626fabaecaa0094ee271
    Signed-off-by: Bogdan Dobrelya <email address hidden>
    (cherry picked from commit 5da85825d13ca66f4e4104388519a8206a509e...


tags: added: on-verification
Andrey Sledzinskiy (asledzinskiy) wrote :

verified on 8.0-506

tags: removed: on-verification
Alexey Galkin (agalkin) wrote :

Verified as fixed in 9.0-242.

According to this paste: http://paste.openstack.org/show/495443, once the resource is reported as stopped, no matching beam process is running anymore.

Changed in fuel:
status: Fix Committed → Fix Released
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 7.0-updates, as this is too big a change to be accepted into the stable branch.
