Bug #1396946 “Rabbitmq OCF script requires additional criteria t...” : Bugs : Fuel for OpenStack

Bogdan Dobrelya (bogdando) on 2014-11-27

Changed in fuel:
importance:	Undecided → High
status:	New → Triaged
milestone:	none → 6.0

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-11-27:

#1

Related https://bugs.launchpad.net/fuel/+bug/1339080

description:	updated
summary:	Rabbitmq OCF script requires additional criterias to be met for - Master=running status + Master/Slave statuses
description:	updated
description:	updated

Revision history for this message

Vladimir Kuklin (vkuklin) wrote on 2014-11-27: Re: Rabbitmq OCF script requires additional criterias to be met for Master/Slave statuses

#2

I am not sure how we can fix this issue, and I am not sure about its severity.

1) It has happened only once.
2) We already have a puppet check which waits for 5 minutes for rabbitmq to answer the list_users command.
3) It is not easy to introduce a high-level check, because:
a) a notify command failure will not make pacemaker reassemble the rabbitmq cluster;
b) if we add it to the monitor command, we can get race conditions and make the cluster building process completely unstable.

So I would say that we need to research a better solution than a simple high-level check in the OCF script, and also check whether there is a corresponding rabbitmq bug.

Thus, I am moving this bug to 6.0.1 and further milestones.
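
For illustration only, the kind of readiness wait mentioned in point 2 could be sketched in shell roughly as follows. This is not the actual puppet check; the helper name wait_until_ok, its arguments and the 10-second retry interval are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helper (not taken from fuel-library): retry a command until
# it succeeds or the time budget is used up.
#   $1 = total seconds to keep trying
#   $2 = seconds to sleep between attempts
#   remaining arguments = the command to run
wait_until_ok() {
    budget="$1"
    interval="$2"
    shift 2
    waited=0
    while ! "$@" >/dev/null 2>&1; do
        waited=$(( waited + interval ))
        if [ "$waited" -gt "$budget" ]; then
            return 1    # gave up: the command never succeeded in time
        fi
        sleep "$interval"
    done
    return 0
}

# Assumed usage, mirroring the 5-minute wait described above:
#   wait_until_ok 300 10 rabbitmqctl list_users
```

Such a loop only tells us that list_users eventually answered; as noted in point 3, it does not by itself make pacemaker reassemble the cluster.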

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-12-01:

#3

I tested a looped deployment of a 5-node rabbitmq cluster. There were 15 failures with Error: {aborted,{no_exists,[rabbit_user,{internal_user,'_','_','_'}]}} out of a total of 128 runs.

Vladimir Kuklin (vkuklin) on 2014-12-03

summary:

- Rabbitmq OCF script requires additional criterias to be met for
+ Rabbitmq OCF script requires additional criteria to be met for
Master/Slave statuses

Revision history for this message

Alexander Kurenyshev (akurenyshev) wrote on 2014-12-03:

#4

fuel-snapshot-2014-12-03_14-19-58.tgz (83.9 MiB, application/x-tar)

The bug was reproduced after reverting a snapshot of an HA cluster.
Fuel 5.1.1-45 (RC1)
Logs attached.

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-12-04:

#5

The case can also be identified by the following loop in the log /var/log/remote/$affected_node/rabbitmq-server.log:

2014-12-04T14:23:25.387054+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0
2014-12-04T14:23:55.502922+00:00 info: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
2014-12-04T14:23:55.505885+00:00 info: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
2014-12-04T14:23:55.609653+00:00 info: INFO: p_rabbitmq-server: get_monitor(): master attribute is (null)
2014-12-04T14:23:55.709681+00:00 info: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
2014-12-04T14:23:55.712210+00:00 info: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
2014-12-04T14:23:55.733834+00:00 info: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (0) with node-1 (2739)

It repeats forever, and there are no more messages in the log about a 'healthy' cluster.
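
A minimal sketch for spotting this loop from the shell, assuming the log line format shown above. The helper name count_zero_uptime_checks is hypothetical and is not part of the OCF script.

```shell
#!/bin/sh
# Hypothetical helper: count monitor iterations in which the local node
# reported an uptime of 0 while comparing itself with a peer.
#   $1 = path to the rabbitmq-server.log of the affected node
count_zero_uptime_checks() {
    # -F treats the pattern as a fixed string, so the parentheses are literal;
    # -c prints the number of matching lines.
    grep -F -c 'get_monitor(): comparing our uptime (0) with' "$1"
}

# Assumed usage:
#   count_zero_uptime_checks "/var/log/remote/$affected_node/rabbitmq-server.log"
```

A count that keeps growing between two runs would be consistent with the stuck state described here, but it is only a hint, not proof.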

Revision history for this message

Bogdan Dobrelya (bogdando) wrote on 2014-12-08:

#6

As a related fix, we should use the timeout command for all commands issued by the OCF script (except 'stop', which is supposed to run detached).
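
A rough sketch of that idea, assuming the timeout utility from GNU coreutils is available. The wrapper name run_with_timeout is hypothetical, and the real patch may be structured differently.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command and kill it with SIGKILL if it is
# still running after the given number of seconds.
#   $1 = limit in seconds
#   remaining arguments = the command to run
run_with_timeout() {
    limit="$1"
    shift
    # -s KILL makes timeout send SIGKILL, which the command cannot trap.
    timeout -s KILL "$limit" "$@"
}

# Assumed usage inside a resource agent action:
#   run_with_timeout 35 rabbitmqctl list_users
```

The wrapper returns the command's own exit status when it finishes in time and a non-zero status when it had to be killed, so callers can treat a hung rabbitmqctl like any other failure.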

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-12-08: Related fix proposed to fuel-library (master)

#7

Related fix proposed to branch: master
Review: https://review.openstack.org/140092

Bogdan Dobrelya (bogdando) on 2014-12-11

description:	updated

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-12-11:

#8

Related fix proposed to branch: master
Review: https://review.openstack.org/141009

Bogdan Dobrelya (bogdando) on 2014-12-16

description:	updated

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-12-24: Related fix merged to fuel-library (master)

#9

Reviewed: https://review.openstack.org/140092
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d18ccea5791aae9e9003f62f9e4ba0ddbce684cf
Submitter: Jenkins
Branch: master

commit d18ccea5791aae9e9003f62f9e4ba0ddbce684cf
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 8 17:48:08 2014 +0100

Wrap OCF rabbitmq commands to timeout -KILL

    W/o this patch, rabbitmqctl commands issued by the rabbitmq RA
    could outlast the given timeout for action notify (60 seconds).
    That may bring the rabbitmq cluster to a state where the join_cluster
    and start_app commands, as well as the cluster reassemble process,
    would never end.

    The solution is:
    * Use a limited interval for rabbitmqctl commands, which depends on the
      OCF action in progress. It looks like the start and notify actions
      require the most time to finish.
    * Evaluate the interval as (timeout / 6 + 5).
      So, for the notify 180 sec timeout it would be 180/6+5=35 sec
      to wait before any command issued by action notify would be killed.
    * Adjust timeout values for failure-timeout, start and notify 60->180.
      That is required in order to provide enough time for the
      commands to complete consecutively w/o being killed by timeout.
    * Fix start/shutdown timeout evaluation, which received a negative value
      for small timeouts (timeout=60s would result in a value of -4).

Related-bug: #1396946

Change-Id: I33a0390b311646266522ddd0f4e8c75d762afe30
Signed-off-by: Bogdan Dobrelya <email address hidden>
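
The interval arithmetic from the commit message above can be reproduced with a short sketch. The function name command_timeout is hypothetical; only the (timeout / 6 + 5) formula comes from the commit message, and integer division is assumed.

```shell
#!/bin/sh
# Hypothetical helper: derive the per-command limit from the timeout of the
# OCF action in progress, using the (timeout / 6 + 5) rule quoted above.
#   $1 = action timeout in seconds
command_timeout() {
    echo $(( $1 / 6 + 5 ))
}

# For the 180-second notify timeout this yields the 35 seconds
# mentioned in the commit message.
command_timeout 180
```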

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2014-12-24:

#10

Reviewed: https://review.openstack.org/141009
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ed66a433eb2e3dacfc8043afe35357158f964310
Submitter: Jenkins
Branch: master

commit ed66a433eb2e3dacfc8043afe35357158f964310
Author: Vladimir Kuklin <email address hidden>
Date: Mon Dec 8 11:18:07 2014 +0300

Fixes for rabbit OCF logic

    W/o this fix,
    * The forget_cluster_node command could take a lot of
      time (even more than the time given to the post-stop notify event)
      if the rabbit node is under heavy load.
    * It is also possible that all
      rabbitmq resources could persist as slaves and there won't be
      any master elected.
    * Sometimes, join_cluster could take quite a long time. If the allotted
      time is exceeded, the node will enter a join-wait-reset loop forever.

    The solution includes:
    * Forcibly disconnect the node being unjoined from
      every node issuing the forget action. That would make the forget
      action instant and would ensure the node will be unjoined and
      reset w/o any issues.
    * Increase the start action timeout even more and increase the minimal
      value for the command timeout as well.
    * Add info messages about timeouts given for command execution
    * Fix missing OCF_FAILED_MASTER status processing for events and actions
    * Fix promote/demote/notify actions - return OCF_FAILED_MASTER when required
    * Fix reset_mnesia() procedure and logging prefixes for it
    * Replace detached stop_app call with timeout wrapper in stop_rmq_server_app()
      and stop_server_process(). Delete OCF_RESKEY_shutdown_time (unneeded).
    * Fix exit code for start_rmq_server_app()
    * Fix missing return for jjj_join
    * Fix my_host return codes to differ from OCF_ERR_GENERIC
    * Fix missing local declarations of the rc var
    * Fix missing rc checks for stop_rmq_server_app() and some other functions
    * Add some missing ocf_run wrappers
    * Fix shutdown/startup logs redirection

Related-bug: #1396946

Change-Id: I951825badb712c5575d469403cf65bf26713aff0
Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-27: Related fix proposed to fuel-library (stable/6.0)

#11

Related fix proposed to branch: stable/6.0
Review: https://review.openstack.org/150431

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-27:

#12

Related fix proposed to branch: stable/6.0
Review: https://review.openstack.org/150432

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-27: Related fix proposed to fuel-library (stable/5.1)

#13

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/150438

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-01-27:

#14

Related fix proposed to branch: stable/5.1
Review: https://review.openstack.org/150439

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-06: Related fix merged to fuel-library (stable/5.1)

#15

Reviewed: https://review.openstack.org/150438
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=936aeaf3331c73101c27f6a15b08fd6fc04c239a
Submitter: Jenkins
Branch: stable/5.1

commit 936aeaf3331c73101c27f6a15b08fd6fc04c239a
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 8 17:48:08 2014 +0100

Wrap OCF rabbitmq commands to timeout -KILL

    W/o this patch, rabbitmqctl commands issued by the rabbitmq RA
    could outlast the given timeout for action notify (60 seconds).
    That may bring the rabbitmq cluster to a state where the join_cluster
    and start_app commands, as well as the cluster reassemble process,
    would never end.

    The solution is:
    * Use a limited interval for rabbitmqctl commands, which depends on the
      OCF action in progress. It looks like the start and notify actions
      require the most time to finish.
    * Evaluate the interval as (timeout / 6 + 5).
      So, for the notify 180 sec timeout it would be 180/6+5=35 sec
      to wait before any command issued by action notify would be killed.
    * Adjust timeout values for failure-timeout, start and notify 60->180.
      That is required in order to provide enough time for the
      commands to complete consecutively w/o being killed by timeout.
    * Fix start/shutdown timeout evaluation, which received a negative value
      for small timeouts (timeout=60s would result in a value of -4).

Related-bug: #1396946

Change-Id: I33a0390b311646266522ddd0f4e8c75d762afe30
Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-06:

#16

Reviewed: https://review.openstack.org/150439
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=faf540a7160baefba4b4af03ba0328fa1f71fc98
Submitter: Jenkins
Branch: stable/5.1

commit faf540a7160baefba4b4af03ba0328fa1f71fc98
Author: Vladimir Kuklin <email address hidden>
Date: Mon Dec 8 11:18:07 2014 +0300

Fixes for rabbit OCF logic

    W/o this fix,
    * The forget_cluster_node command could take a lot of
      time (even more than the time given to the post-stop notify event)
      if the rabbit node is under heavy load.
    * It is also possible that all
      rabbitmq resources could persist as slaves and there won't be
      any master elected.
    * Sometimes, join_cluster could take quite a long time. If the allotted
      time is exceeded, the node will enter a join-wait-reset loop forever.

    The solution includes:
    * Forcibly disconnect the node being unjoined from
      every node issuing the forget action. That would make the forget
      action instant and would ensure the node will be unjoined and
      reset w/o any issues.
    * Increase the start action timeout even more and increase the minimal
      value for the command timeout as well.
    * Add info messages about timeouts given for command execution
    * Fix missing OCF_FAILED_MASTER status processing for events and actions
    * Fix promote/demote/notify actions - return OCF_FAILED_MASTER when required
    * Fix reset_mnesia() procedure and logging prefixes for it
    * Replace detached stop_app call with timeout wrapper in stop_rmq_server_app()
      and stop_server_process(). Delete OCF_RESKEY_shutdown_time (unneeded).
    * Fix exit code for start_rmq_server_app()
    * Fix missing return for jjj_join
    * Fix my_host return codes to differ from OCF_ERR_GENERIC
    * Fix missing local declarations of the rc var
    * Fix missing rc checks for stop_rmq_server_app() and some other functions
    * Add some missing ocf_run wrappers
    * Fix shutdown/startup logs redirection

Related-bug: #1396946

Change-Id: I951825badb712c5575d469403cf65bf26713aff0
Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-06: Related fix merged to fuel-library (stable/6.0)

#17

Reviewed: https://review.openstack.org/150431
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=f193d2500773f02d675597091bdaf36d9bbf382b
Submitter: Jenkins
Branch: stable/6.0

commit f193d2500773f02d675597091bdaf36d9bbf382b
Author: Bogdan Dobrelya <email address hidden>
Date: Mon Dec 8 17:48:08 2014 +0100

Wrap OCF rabbitmq commands to timeout -KILL

    W/o this patch, rabbitmqctl commands issued by the rabbitmq RA
    could outlast the given timeout for action notify (60 seconds).
    That may bring the rabbitmq cluster to a state where the join_cluster
    and start_app commands, as well as the cluster reassemble process,
    would never end.

    The solution is:
    * Use a limited interval for rabbitmqctl commands, which depends on the
      OCF action in progress. It looks like the start and notify actions
      require the most time to finish.
    * Evaluate the interval as (timeout / 6 + 5).
      So, for the notify 180 sec timeout it would be 180/6+5=35 sec
      to wait before any command issued by action notify would be killed.
    * Adjust timeout values for failure-timeout, start and notify 60->180.
      That is required in order to provide enough time for the
      commands to complete consecutively w/o being killed by timeout.
    * Fix start/shutdown timeout evaluation, which received a negative value
      for small timeouts (timeout=60s would result in a value of -4).

Related-bug: #1396946

Change-Id: I33a0390b311646266522ddd0f4e8c75d762afe30
Signed-off-by: Bogdan Dobrelya <email address hidden>

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2015-02-06:

#18

Reviewed: https://review.openstack.org/150432
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=4b4d4c937d0a38a4c2abc10de5e8d735eb8439f2
Submitter: Jenkins
Branch: stable/6.0

commit 4b4d4c937d0a38a4c2abc10de5e8d735eb8439f2
Author: Vladimir Kuklin <email address hidden>
Date: Mon Dec 8 11:18:07 2014 +0300

Fixes for rabbit OCF logic

    W/o this fix,
    * The forget_cluster_node command could take a lot of
      time (even more than the time given to the post-stop notify event)
      if the rabbit node is under heavy load.
    * It is also possible that all
      rabbitmq resources could persist as slaves and there won't be
      any master elected.
    * Sometimes, join_cluster could take quite a long time. If the allotted
      time is exceeded, the node will enter a join-wait-reset loop forever.

    The solution includes:
    * Forcibly disconnect the node being unjoined from
      every node issuing the forget action. That would make the forget
      action instant and would ensure the node will be unjoined and
      reset w/o any issues.
    * Increase the start action timeout even more and increase the minimal
      value for the command timeout as well.
    * Add info messages about timeouts given for command execution
    * Fix missing OCF_FAILED_MASTER status processing for events and actions
    * Fix promote/demote/notify actions - return OCF_FAILED_MASTER when required
    * Fix reset_mnesia() procedure and logging prefixes for it
    * Replace detached stop_app call with timeout wrapper in stop_rmq_server_app()
      and stop_server_process(). Delete OCF_RESKEY_shutdown_time (unneeded).
    * Fix exit code for start_rmq_server_app()
    * Fix missing return for jjj_join
    * Fix my_host return codes to differ from OCF_ERR_GENERIC
    * Fix missing local declarations of the rc var
    * Fix missing rc checks for stop_rmq_server_app() and some other functions
    * Add some missing ocf_run wrappers
    * Fix shutdown/startup logs redirection

Related-bug: #1396946

Change-Id: I951825badb712c5575d469403cf65bf26713aff0
Signed-off-by: Bogdan Dobrelya <email address hidden>

Affects             Status         Importance  Assigned to      Milestone
Fuel for OpenStack  Fix Committed  High        Bogdan Dobrelya  Fuel for OpenStack 6.0-updates
5.1.x               Fix Committed  High        Bogdan Dobrelya  Fuel for OpenStack 5.1.1-updates
6.0.x               Fix Committed  High        Bogdan Dobrelya  Fuel for OpenStack 6.0-updates
6.1.x               Fix Committed  High        Bogdan Dobrelya  Fuel for OpenStack 6.1
