Full RabbitMQ cluster failure when the master of rabbit multi-state resource goes down

Bug #1436812 reported by Alexey Khivin
This bug affects 5 people
Affects              Status         Importance   Assigned to         Milestone
Fuel for OpenStack   Fix Released   High         Bogdan Dobrelya
5.1.x                Won't Fix      High         Denis Meltsaykin
6.0.x                Won't Fix      High         Denis Meltsaykin
6.1.x                Fix Released   High         Bogdan Dobrelya

Bug Description

Multi-node HA environment with 3 controllers

Steps to reproduce

1) Find the master of the rabbitmq multi-state resource (crm status; see the command sketch after these steps)
2) On the other two controllers, run

watch "rabbitmqctl eval 'rabbit_misc:which_applications().' |grep rabbit,"

to see what is happening with the RabbitMQ cluster and the rabbit application. You should see the line

{rabbit,"RabbitMQ","3.3.5"},

3) Turn off the primary controller and see what happens on the other two controllers
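
For reference, step 1 can be done roughly as follows (a minimal sketch, assuming the resource is named p_rabbitmq-server as in Fuel deployments; adjust the name if yours differs):

crm status | grep -A3 p_rabbitmq-server    # the node listed under "Masters" is the rabbitmq master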

Because of the current HA implementation, you will see that rabbitmqctl cannot connect to the node, with an error like:
Error: unable to connect to node 'rabbit@node-10': nodedown
or the line with the rabbit application may disappear.

The normal behaviour, I believe, should be:
No disconnection on the two other controllers and no full cluster failure when the primary controller goes down.

Also, https://bugs.launchpad.net/fuel/+bug/1435250 notes:
3) destroy the master, node-3
Expected: new master election, failover with full downtime - no nodes can process AMQP connections

I think this should be improved.

[root@fuel ~]# fuel --f
DEPRECATION WARNING: file /etc/fuel/client/config.yaml is found and will be used as a source for settings. However, it deprecated and will not be used by default in the ongoing version of python-fuelclient.
api: '1.0'
astute_sha: 4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9
auth_required: true
build_id: 2015-03-23_15-29-20
build_number: '218'
feature_groups:
- mirantis
fuellib_sha: a0265ae47bb2307a6967a3f1dd06fe222c561265
fuelmain_sha: a05ab877af31924585c81081f45305700961458e
nailgun_sha: 7c100f47450ea1a910e19fa09f78d586cb2bc0d3
ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
production: docker
python-fuelclient_sha: 3624051242c83fdbdd1df9a0e466797c06b75043
release: '6.1'
release_versions:
  2014.2-6.1:
    VERSION:
      api: '1.0'
      astute_sha: 4a117a1ca6bdcc34fe4d086959ace1a6d18eeca9
      build_id: 2015-03-23_15-29-20
      build_number: '218'
      feature_groups:
      - mirantis
      fuellib_sha: a0265ae47bb2307a6967a3f1dd06fe222c561265
      fuelmain_sha: a05ab877af31924585c81081f45305700961458e
      nailgun_sha: 7c100f47450ea1a910e19fa09f78d586cb2bc0d3
      ostf_sha: a4cf5f218c6aea98105b10c97a4aed8115c15867
      production: docker
      python-fuelclient_sha: 3624051242c83fdbdd1df9a0e466797c06b75043
      release: '6.1'

Tags: ha rabbitmq
Alexey Khivin (akhivin)
summary: - Entire RabbitMQ cluster downtime when primary controller goes down
+ Full RabbitMQ cluster failure when primary controller goes down
Revision history for this message
Davanum Srinivas (DIMS) (dims-v) wrote : Re: Full RabbitMQ cluster failure when primary controller goes down
Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

It should not be critical, since it does not break the usual deployment. Lowered to 'High'.

Changed in fuel:
importance: Critical → High
status: New → Confirmed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This bug cannot be fixed in 6.1, as it requires a complete redesign of the RabbitMQ management logic for Pacemaker.
Currently, when the master of the multi-state clone goes down, downtime is expected until the cluster finishes reassembling.

Changed in fuel:
milestone: none → 7.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Another point to address along with the redesign to multi-master clones: the elected multi-state resource masters could become the disc nodes, while the multi-state resource slaves should be joined as RAM nodes. The only data-integrity risk here is that when the single disc node fails, the cluster would operate w/o persistent data storage until a new master is elected and the data is replicated. That is why having many masters for the multi-state clone is essential for this improvement.
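
For context, the disc/RAM distinction is made when a node joins the cluster; roughly (a minimal sketch with a hypothetical node name):

rabbitmqctl stop_app
rabbitmqctl join_cluster --ram rabbit@node-1    # join as a RAM node; omit --ram for a disc node
rabbitmqctl start_app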

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Changed in fuel:
status: Confirmed → Won't Fix
no longer affects: fuel/6.1.x
summary: - Full RabbitMQ cluster failure when primary controller goes down
+ Full RabbitMQ cluster failure when the master of rabbit multi-state
+ resource goes down
description: updated
Changed in fuel:
status: Won't Fix → Confirmed
milestone: 6.1 → 5.1.1-updates
milestone: 5.1.1-updates → 7.0
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

Dmitry, please provide info on why this bug was reopened. We applied a lot of fixes related to the blueprint that supersedes this bug. As far as I know, it should be marked as Invalid.

Changed in fuel:
status: Confirmed → Invalid
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Folks, this is a known issue, and fixing it would require a complete redesign of the RabbitMQ multi-state clone resource agent logic. That is why this cannot be fixed as a bug.

Changed in fuel:
status: Invalid → Won't Fix
Revision history for this message
Mike Scherbakov (mihgen) wrote :

We must keep bugs open even if there are associated blueprints. Once the blueprint is implemented, you'd verify that the bug is closed by the implemented functionality.

Changed in fuel:
status: Won't Fix → Confirmed
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :

Checked build #432.
The behavior is still the same as described in https://bugs.launchpad.net/mos/+bug/1455613

Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

I nominated the bug for the 6.1.1 milestone (to be renamed to 6.1-updates) so we can ship the fix via patching if there is any fix/improvement for the OCF scripts.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/184911

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/184911
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=3c356c12ae9da5047ef8a0189467ff8923fc0b14
Submitter: Jenkins
Branch: master

commit 3c356c12ae9da5047ef8a0189467ff8923fc0b14
Author: Vladimir Kuklin <email address hidden>
Date: Fri May 22 02:50:12 2015 +0300

    Check whether beam is started before running start_app

    There is a mistake in the OCF logic, which tries
    to start the rabbitmq app without a running beam
    after a Mnesia reset, getting into a loop
    which constantly fails until it times out

    Change-Id: Id096961e206a083b51978fc5034f99d04715d7ea
    Related-bug: #1436812
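
The idea behind this fix, as a rough sketch (illustrative shell, not the actual OCF code):

    # only attempt start_app when the Erlang VM (beam) is actually running
    if pgrep -f beam >/dev/null 2>&1; then
        rabbitmqctl start_app
    else
        echo "beam is not running; start the Erlang VM before calling start_app" >&2
        exit 1
    fi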

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-library (master)

Fix proposed to branch: master
Review: https://review.openstack.org/185044

Changed in fuel:
assignee: MOS Sustaining (mos-sustaining) → Bogdan Dobrelya (bogdando)
status: Confirmed → In Progress
Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Vladimir Kuklin (vkuklin)
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

It looks like I found the issue in the OCF logic causing this full downtime while a new master election is in progress.
The patch https://review.openstack.org/185044 should fix this w/o multiple masters as well. I will test it and provide feedback.

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

How to test:
1) Shut down the node running the rabbitmq master resource.
2) Check that pcs resource status does not report any masters for the rabbitmq resource (this means failover is in progress).
3) Check the rabbitmqctl list_channels, list_queues output on the rest of the rabbit nodes - the channels and queues lists should be reported w/o issues (concrete commands are sketched below).

I tested the patch and it passed.
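
Roughly, steps 2 and 3 correspond to these commands (a sketch, assuming the Fuel resource name p_rabbitmq-server):

pcs status resources | grep -A3 p_rabbitmq-server
rabbitmqctl list_channels
rabbitmqctl list_queues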

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-library (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/185397

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/185044
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=085fe8c5a2255d4274cdcee5c2a74c15c443c0db
Submitter: Jenkins
Branch: master

commit 085fe8c5a2255d4274cdcee5c2a74c15c443c0db
Author: Bogdan Dobrelya <email address hidden>
Date: Fri May 22 16:40:16 2015 +0200

    Fix rabbit OCF demote/stop/promote actions

    * When a rabbit node goes down, its status remains 'running'
      in the mnesia db for a while, so a few retries (50 sec in total) are
      required in order to kick and forget this node from the cluster.
      This also requires +50 sec for the stop & demote action timeouts.
    * The rabbit master score in the CIB is retained after the current
      master is moved manually. This is wrong, and the score must be reset
      ASAP on post-demote and post-stop as well.
    * The demoted node must be kicked from the cluster by the other nodes
      during post-demote processing.
    * Post-demote should stop the rabbit app at the node being demoted as
      this node should be kicked from the cluster by other nodes.
      Instead, it stops the app at the *other* nodes and brings full
      cluster downtime.
    * The check to join should be only done at the post-start and not at
      the post-promote, otherwise the node being promoted may think it
      is clustered with some node while the join check reports it as
      already clustered with another one.
      (the regression was caused by https://review.openstack.org/184671)
    * Change `hostname` call to `crm_node -n` via $THIS_PCMK_NODE
      everywhere to ensure we are using correct pacemaker node name
    * Handle empty values for OCF_RESKEY_CRM_meta_notify_* by reporting
      the resource as not running. This will rerun resource and restore
      its state, eventually.

    Closes-bug: #1436812
    Closes-bug: #1455761

    Change-Id: Ib01c1731b4f06e6b643a4bca845828f7db507ad3
    Signed-off-by: Bogdan Dobrelya <email address hidden>
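
For illustration, the post-demote/post-stop handling described above boils down to roughly this (a sketch with a hypothetical $demoted_node variable, not the exact OCF code):

    # drop the stale master score for the demoted node from the CIB
    # (inside the OCF agent, crm_master infers the attribute name from the resource instance)
    crm_master -N "$demoted_node" -l reboot -D
    # on the surviving nodes, remove the demoted node from the rabbit cluster
    rabbitmqctl forget_cluster_node "rabbit@$demoted_node"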

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/185397
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=152303c7e741861bbcfd3c3165be3d27107d6b38
Submitter: Jenkins
Branch: master

commit 152303c7e741861bbcfd3c3165be3d27107d6b38
Author: Bogdan Dobrelya <email address hidden>
Date: Mon May 25 15:57:39 2015 +0200

    Add rabbit OCF functions to get pacemaker node names

    W/o this fix, the failover time was longer than expected,
    as rabbit nodes were able to query corosync nodes that had left
    the cluster and also tried to join them in the rabbit cluster,
    ending up being reset and rejoining the alive nodes later.
    1) Add functions:
      a) to get all alive nodes in the partition
      b) to get all nodes
    This fixes the get_monitor behaviour so that it ignores
    attributes for dead nodes, as the crm_node behaviour
    changed with the upgrade of pacemaker. So rabbit nodes will
    never try to join the dead ones.

    2) Fix bash scopes for local variables
    A minor change removing unexpected behavior where a local variable
    impacts the global scope.

    Related-bug: #1436812

    Change-Id: I89b716b4cd007572bb6832365d4424669921f057
    Signed-off-by: Bogdan Dobrelya <email address hidden>
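
The two added helpers presumably wrap crm_node, for example (illustrative function names, not the exact OCF code):

    get_alive_pacemaker_nodes() {
        # nodes in the current (alive) cluster partition
        crm_node -p
    }

    get_all_pacemaker_nodes() {
        # all nodes known to the cluster; crm_node -l prints "id name state"
        crm_node -l | awk '{print $2}'
    }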

Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note, the patch https://review.openstack.org/185397 should NOT be backported to the 5.1.x and 6.0.x milestones, as this change is not needed for the older pacemaker versions.

no longer affects: fuel/7.0.x
tags: added: on-verification
Revision history for this message
Sergey Novikov (snovikov) wrote :

Verified on fuel-6.1-478-2015-05-28_20-55-26.iso.

Steps to verify:
    1. Deploy a cluster with 3 controllers.
    2. Run OSTF.
    3. Shut down the node running the rabbitmq master resource.
    4. Check that pcs resource status does not report any masters for the rabbitmq resource (this means failover is in progress).
    5. Check the rabbitmqctl list_channels, list_queues output on the rest of the rabbit nodes - the channels and queues lists should be reported w/o issues.

tags: removed: on-verification
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Setting this as Won't Fix for 5.1.1-updates and 6.0-updates, as such a complex change cannot be delivered in the scope of a Maintenance Update. Also, the possible solution of backporting the RabbitMQ OCF script is covered in detail by the Operations Guide in the official product documentation.
