Rabbitmq OCF script requires additional criteria to be met for Master/Slave statuses

Bug #1396946 reported by Bogdan Dobrelya on 2014-11-27
This bug affects 8 people
Affects             Importance  Assigned to
Fuel for OpenStack  High        Bogdan Dobrelya
5.1.x               High        Bogdan Dobrelya
6.0.x               High        Bogdan Dobrelya
6.1.x               High        Bogdan Dobrelya

Bug Description

The build of http://jenkins-product.srt.mirantis.net:8080/view/6.0/job/6.0.ubuntu.promo_bvt/71/ shows that the OCF script is missing a criterion for Master/Slave readiness. According to the logs, the OCF script reported to Pacemaker that the Master and all Slaves were running, but in reality 'rabbitmqctl list_users' returned an error on one of the slave nodes, so the cluster was not ready and required reassembling (this happened because the start_app and join_cluster commands failed and hung).

Other intermittent issues with rabbitmq clustering are:
* The forget_cluster_node command could take a long time
  (even longer than the time given to the post-stop notify event)
  if the rabbit node is under heavy load.
* It is also possible that all rabbitmq resources remain slaves
  and no master is ever elected (see the duplicate bug/1401956).
* Sometimes join_cluster could take quite a long time. If the timeout is exceeded, the node enters a join-wait-reset loop forever.

In order to fix it, we should:
- additionally check that 'rabbitmqctl list_users' does not return an error on the given node, and only then report the Master or Slave of the multistate clone as running. Otherwise, the script should report the Stopped state.
- wrap rabbitmqctl commands in timeout with the -KILL signal
- use disconnect_node prior to issuing forget_cluster_node
- thoroughly re-examine the OCF script logic and fix it (see the commit message for a related patch below)
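
The readiness check and the timeout wrapper could look roughly like the sketch below. It is not the code from the patch: the helper name rmq_ready and its arguments are made up for illustration. Only 'rabbitmqctl list_users', the coreutils timeout utility and the standard OCF return codes are taken as given.

```shell
#!/bin/sh
# Standard OCF return codes.
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# rmq_ready SECONDS COMMAND [ARGS...]
# Hypothetical helper: run a readiness probe under a hard time limit.
# The probe is killed with SIGKILL if it is still running after SECONDS,
# so a hung rabbitmqctl cannot block the monitor action.
rmq_ready() {
    limit="$1"
    shift
    if timeout -s KILL "$limit" "$@" >/dev/null 2>&1; then
        return "$OCF_SUCCESS"
    fi
    # The probe failed or hung: report the resource as stopped.
    return "$OCF_NOT_RUNNING"
}

# Intended use inside a monitor action (not executed here):
#   rmq_ready 35 rabbitmqctl list_users || exit "$OCF_NOT_RUNNING"
```

Passing the probe as arguments keeps the helper independent of rabbitmqctl, so it can be exercised with harmless commands such as true or sleep.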

Changed in fuel:
importance: Undecided → High
status: New → Triaged
milestone: none → 6.0
Bogdan Dobrelya (bogdando) wrote :
description: updated
summary: Rabbitmq OCF script requires additional criterias to be met for
- Master=running status
+ Master/Slave statuses

I am not sure how we can fix this issue, and I am not sure about its severity.

1) It has happened only once
2) we already have a Puppet check that waits up to 5 minutes for rabbitmq to answer the list_users command
3) it is not easy to introduce a high-level check, because:
  a) a notify command failure will not make Pacemaker reassemble the rabbitmq cluster
  b) if we add it to the monitor command, we can get race conditions and make the cluster building process completely unstable

So I would say that we need to research a better solution than a simple high-level check in the OCF script, and also check whether there is a corresponding rabbitmq bug.

Thus, I am moving this bug to 6.0.1 and further milestones.

Bogdan Dobrelya (bogdando) wrote :

I tested a looped deployment of a 5-node rabbitmq cluster. There were 15 failures with Error: {aborted,{no_exists,[rabbit_user,{internal_user,'_','_','_'}]}} out of 128 runs in total.

summary: - Rabbitmq OCF script requires additional criterias to be met for
+ Rabbitmq OCF script requires additional criteria to be met for
Master/Slave statuses

The bug was reproduced after reverting a snapshot of an HA cluster
Fuel 5.1.1-45(RC1)
Logs attached

Bogdan Dobrelya (bogdando) wrote :

The case can also be identified by the following loop in the log /var/log/remote/$affected_node/rabbitmq-server.log:

2014-12-04T14:23:25.387054+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0
2014-12-04T14:23:55.502922+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
2014-12-04T14:23:55.505885+00:00 info: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
2014-12-04T14:23:55.609653+00:00 info: INFO: p_rabbitmq-server: get_monitor(): master attribute is (null)
2014-12-04T14:23:55.709681+00:00 info: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
2014-12-04T14:23:55.712210+00:00 info: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
2014-12-04T14:23:55.733834+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (0) with node-1 (2739)

It repeats forever, and there are no more messages in the log about a 'healthy' cluster.
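
One way to spot this loop is to count how often the monitor compares a zero uptime with another node. This is only a sketch: the helper name is hypothetical, and the search pattern is copied from the log lines quoted above.

```shell
#!/bin/sh
# count_zero_uptime_checks LOGFILE
# Hypothetical helper: print how many times get_monitor() compared
# an uptime of 0 with another node in the given log file.
count_zero_uptime_checks() {
    # -F treats the pattern as a fixed string, -c prints the match count.
    grep -cF 'get_monitor(): comparing our uptime (0) with ' "$1"
}

# Intended use (not executed here):
#   count_zero_uptime_checks "/var/log/remote/$affected_node/rabbitmq-server.log"
```

A count that keeps growing between runs, with no other progress in the log, would suggest the node is stuck in the loop.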

Bogdan Dobrelya (bogdando) wrote :

As a related fix, we should use the timeout command for all commands issued by the OCF script (except 'stop', which is supposed to run detached).

description: updated

Related fix proposed to branch: master
Review: https://review.openstack.org/141009

description: updated

Reviewed: https://review.openstack.org/140092
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d18ccea5791aae9e9003f62f9e4ba0ddbce684cf
Submitter: Jenkins
Branch: master

commit d18ccea5791aae9e9003f62f9e4ba0ddbce684cf
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 8 17:48:08 2014 +0100

    Wrap OCF rabbitmq commands to timeout -KILL

    Without this patch, rabbitmqctl commands issued by the rabbitmq RA
    could outlast the given timeout for the notify action (60 seconds).
    That may bring the rabbitmq cluster to a state where the join_cluster
    and start_app commands never finish, and neither does the
    cluster reassembly process.

    The solution is:
    * Use a limited interval for rabbitmqctl commands that depends on the
      OCF action in progress. The start and notify actions appear to
      need the most time to finish.
    * Evaluate the interval as (timeout / 6 + 5).
      So, for a notify timeout of 180 sec it would be 180/6+5=35 sec
      to wait before any command issued by the notify action is killed.
    * Adjust timeout values for failure-timeout, start and notify 60->180.
      That is required in order to give the commands enough time
      to complete one after another without being killed by the timeout.
    * Fix the start/shutdown timeout evaluation, which received a negative
      value for small timeouts (timeout=60s would result in a value of -4).

    Related-bug: #1396946

    Change-Id: I33a0390b311646266522ddd0f4e8c75d762afe30
    Signed-off-by: Bogdan Dobrelya <email address hidden>
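
The (timeout / 6 + 5) rule from this commit message can be checked with a few lines of shell arithmetic. The helper name command_timeout is hypothetical, and how the real script obtains the action timeout is not shown here.

```shell
#!/bin/sh
# command_timeout ACTION_TIMEOUT
# Print the per-command limit in seconds for a given OCF action timeout,
# using the (timeout / 6 + 5) rule. Shell arithmetic is integer-only,
# so the division truncates toward zero.
command_timeout() {
    echo $(( $1 / 6 + 5 ))
}

# Intended use (not executed here):
#   timeout -s KILL "$(command_timeout 180)" rabbitmqctl list_users
```

For the 180 second notify timeout this gives 35 seconds, matching the example in the commit message, and for a 60 second timeout it gives 15 seconds.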

Reviewed: https://review.openstack.org/141009
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ed66a433eb2e3dacfc8043afe35357158f964310
Submitter: Jenkins
Branch: master

commit ed66a433eb2e3dacfc8043afe35357158f964310
Author: Vladimir Kuklin <email address hidden>
Date: Mon Dec 8 11:18:07 2014 +0300

    Fixes for rabbit OCF logic

    Without this fix:
    * The forget_cluster_node command could take a long time
      (even longer than the time given to the post-stop notify event)
      if the rabbit node is under heavy load.
    * It is also possible that all rabbitmq resources remain slaves
      and no master is ever elected.
    * Sometimes join_cluster could take quite a long time. If the timeout
      is exceeded, the node enters a join-wait-reset loop forever.

    The solution includes:
    * Forcibly disconnect the node being unjoined from
      every node issuing the forget action. That makes the forget
      action instant and ensures the node is unjoined and
      reset without any issues.
    * Increase the start action timeout even more and increase the minimal
      value for the command timeout as well.
    * Add info messages about the timeouts given for command execution
    * Fix missing OCF_FAILED_MASTER status processing for events and actions
    * Fix promote/demote/notify actions - return OCF_FAILED_MASTER when required
    * Fix reset_mnesia() procedure and logging prefixes for it
    * Replace the detached stop_app call with the timeout wrapper in
      stop_rmq_server_app() and stop_server_process().
      Delete OCF_RESKEY_shutdown_time (unneeded).
    * Fix exit code for start_rmq_server_app()
    * Fix missing return for jjj_join
    * Fix my_host return codes to differ from OCF_ERR_GENERIC
    * Fix missing local declarations of the rc var
    * Fix missing rc checks for stop_rmq_server_app() and some other functions
    * Add some missing ocf_run wrappers
    * Fix shutdown/startup logs redirection

    Related-bug: #1396946

    Change-Id: I951825badb712c5575d469403cf65bf26713aff0
    Signed-off-by: Bogdan Dobrelya <email address hidden>
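
The disconnect-then-forget sequence from the first solution item might be sketched as follows. This is an illustration under assumptions, not the patch itself: the helper name unjoin_node, the RMQ_CTL variable, and reaching erlang:disconnect_node through 'rabbitmqctl eval' are all choices made for this example.

```shell
#!/bin/sh
# Control binary; overridable so the sketch can be dry-run (RMQ_CTL=echo).
RMQ_CTL="${RMQ_CTL:-rabbitmqctl}"

# unjoin_node LIMIT NODENAME
# Hypothetical helper: forcibly disconnect NODENAME from the local Erlang
# node, then ask RabbitMQ to forget it. Each step is killed with SIGKILL
# after LIMIT seconds so a loaded node cannot stall the caller.
unjoin_node() {
    limit="$1"
    node="$2"
    # Assumption: the disconnect is issued through 'rabbitmqctl eval'.
    # It is treated as best effort, so its exit status is ignored.
    timeout -s KILL "$limit" $RMQ_CTL eval "erlang:disconnect_node('${node}')." || :
    # The exit status of the function is that of forget_cluster_node.
    timeout -s KILL "$limit" $RMQ_CTL forget_cluster_node "$node"
}

# Intended use (not executed here):
#   unjoin_node 35 rabbit@node-2
```

Setting RMQ_CTL=echo before the call prints the two commands instead of running them, which is a cheap way to review what would be issued.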

Related fix proposed to branch: stable/6.0
Review: https://review.openstack.org/150432

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/150439

Reviewed: https://review.openstack.org/150438
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=936aeaf3331c73101c27f6a15b08fd6fc04c239a
Submitter: Jenkins
Branch: stable/5.1

commit 936aeaf3331c73101c27f6a15b08fd6fc04c239a
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 8 17:48:08 2014 +0100

    Wrap OCF rabbitmq commands to timeout -KILL

    Related-bug: #1396946

    Change-Id: I33a0390b311646266522ddd0f4e8c75d762afe30
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Reviewed: https://review.openstack.org/150439
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=faf540a7160baefba4b4af03ba0328fa1f71fc98
Submitter: Jenkins
Branch: stable/5.1

commit faf540a7160baefba4b4af03ba0328fa1f71fc98
Author: Vladimir Kuklin <email address hidden>
Date: Mon Dec 8 11:18:07 2014 +0300

    Fixes for rabbit OCF logic

    Related-bug: #1396946

    Change-Id: I951825badb712c5575d469403cf65bf26713aff0
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Reviewed: https://review.openstack.org/150431
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f193d2500773f02d675597091bdaf36d9bbf382b
Submitter: Jenkins
Branch: stable/6.0

commit f193d2500773f02d675597091bdaf36d9bbf382b
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 8 17:48:08 2014 +0100

    Wrap OCF rabbitmq commands to timeout -KILL

    Related-bug: #1396946

    Change-Id: I33a0390b311646266522ddd0f4e8c75d762afe30
    Signed-off-by: Bogdan Dobrelya <email address hidden>

Reviewed: https://review.openstack.org/150432
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=4b4d4c937d0a38a4c2abc10de5e8d735eb8439f2
Submitter: Jenkins
Branch: stable/6.0

commit 4b4d4c937d0a38a4c2abc10de5e8d735eb8439f2
Author: Vladimir Kuklin <email address hidden>
Date: Mon Dec 8 11:18:07 2014 +0300

    Fixes for rabbit OCF logic

    Related-bug: #1396946

    Change-Id: I951825badb712c5575d469403cf65bf26713aff0
    Signed-off-by: Bogdan Dobrelya <email address hidden>
