An ability to forcefully kick rabbitmq node from the cluster when it dies

Bug #1437348 reported by Bogdan Dobrelya
This bug affects 1 person
Affects              Status         Importance  Assigned to        Milestone
Fuel for OpenStack   Fix Committed  High        Bogdan Dobrelya
  5.1.x              Won't Fix      Undecided   Unassigned
  6.0.x              Won't Fix      Undecided   Unassigned
  6.1.x              Fix Committed  High        Bogdan Dobrelya

Bug Description

When a corosync node dies, the clone instances running on the other corosync nodes do not receive a notification. That introduces an additional time lag for the OCF RA logic to react and initiate the failover procedure, which is to reassemble the RabbitMQ cluster without the failed node.

The solution is to provide a dedicated fencing daemon running on the corosync nodes. This daemon should react to the D-Bus events emitted by corosync-notifyd when corosync nodes leave the cluster; the reaction is to kick the dead RabbitMQ node out of the cluster.
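
A minimal sketch of that idea, assuming Python with dbus-python and GLib, is shown below: subscribe to the system bus, watch for the membership-change signals emitted by corosync-notifyd, and forget the departed RabbitMQ node. The D-Bus interface name and signal payload here are placeholders (verify them with dbus-monitor --system); the actual rabbit-fence daemon shipped in fuel-library may be implemented differently.

    # Sketch of a corosync-driven RabbitMQ fencing daemon (assumptions noted).
    import subprocess

    import dbus
    import dbus.mainloop.glib
    from gi.repository import GLib

    COROSYNC_IFACE = "org.corosync"  # placeholder; check dbus-monitor --system

    def on_corosync_signal(*args):
        # Payload layout is version dependent; assume (nodename, ..., state).
        nodename, state = str(args[0]), str(args[-1])
        if state != "left":
            return
        rabbit_node = "rabbit@" + nodename.split(".")[0]
        # Remove the dead member from the RabbitMQ cluster metadata.
        subprocess.call(["rabbitmqctl", "forget_cluster_node", rabbit_node])

    dbus.mainloop.glib.DBusGMainLoop(set_as_default=True)
    bus = dbus.SystemBus()
    bus.add_signal_receiver(on_corosync_signal, dbus_interface=COROSYNC_IFACE)
    GLib.MainLoop().run()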

Tags: ha rabbitmq
Changed in fuel:
importance: Undecided → High
assignee: nobody → Bogdan Dobrelya (bogdando)
milestone: none → 6.1
status: New → In Progress
Changed in fuel:
status: In Progress → Won't Fix
milestone: 6.1 → 7.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The rabbit-fencing daemon should be included as a fix for the 6.1 release, as it addresses a major issue with RabbitMQ clustering.

Changed in fuel:
status: Won't Fix → In Progress
tags: added: to-be-covered-by-tests
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note for QA on how to test (a verification sketch in Python follows the steps below):
0) deploy any HA environment with 3 controllers;
on any controller node issue "pcs resource unmanage master_p_rabbitmq-server"

Should not kick alive nodes:
1) on the 1st controller, for example node-1, stop the corosync service gracefully
2) on the master node check /var/log/remote/node-*/rabbit-fence.log:
* it should contain info like:
"Got node-1.test.domain.local that left cluster
...
Preparing to fence node rabbit@node-1 from rabbit cluster
... (within 1 minute) ...
Ignoring alive node rabbit@node-1"
3) on the other controllers (not node-1, where corosync was stopped) check rabbitmq cluster_status:
* it should show all 3 rabbit nodes running and listed as cluster members
4) teardown:
* start the stopped corosync service; restart the pacemaker service on the same node
* pcs status should show all 3 nodes online within 1 minute

Should kick the failed rabbit node only once:
5) on the 1st controller, for example node-1, issue rabbitmqctl stop_app and stop
the corosync service gracefully
6) on the master node check /var/log/remote/node-*/rabbit-fence.log:
* one of the controller nodes' logs should contain info like:
"Got node-1.test.domain.local that left cluster
...
Preparing to fence node rabbit@node-1 from rabbit cluster
... (within 1 minute) ...
Disconnecting rabbit@node-1
Forgetting cluster node rabbit@node-1"
7) on the other controllers (not node-1, where corosync was stopped) check rabbitmq cluster_status:
* it should show only 2 rabbit nodes running and listed as cluster members (node-1 should not be listed there)
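
The expected post-conditions of both cases can also be checked with a small script. This is a sketch only; the log path, the node names, and the {running_nodes,[...]} parsing are assumptions based on the steps above and may need adjusting for the actual environment.

    # Post-condition checks for the two test cases above (sketch).
    import re
    import subprocess

    FENCE_LOG = "/var/log/remote/node-2/rabbit-fence.log"  # any surviving node

    def fence_log_mentions(pattern):
        with open(FENCE_LOG) as log:
            return any(re.search(pattern, line) for line in log)

    def running_rabbit_nodes():
        # Assumes the Erlang-term output format of rabbitmqctl 3.x cluster_status.
        out = subprocess.check_output(["rabbitmqctl", "cluster_status"]).decode()
        match = re.search(r"\{running_nodes,\[(.*?)\]\}", out, re.S)
        return re.findall(r"rabbit@[\w.-]+", match.group(1)) if match else []

    # Steps 1-4 ("should not kick alive nodes"):
    #   fence_log_mentions("Ignoring alive node rabbit@node-1") and
    #   len(running_rabbit_nodes()) == 3 are expected to hold.
    # Steps 5-7 ("should kick failed rabbit node only once"):
    #   fence_log_mentions("Forgetting cluster node rabbit@node-1") and
    #   len(running_rabbit_nodes()) == 2 are expected to hold.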

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/108792
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ad9097a1bb55cf035f6afa1d12afd13bd965f2b5
Submitter: Jenkins
Branch: master

commit ad9097a1bb55cf035f6afa1d12afd13bd965f2b5
Author: Vladimir Kuklin <email address hidden>
Date: Tue Jul 22 22:20:27 2014 +0400

    RabbitMQ node resource level fencing

    When Corosync notifies that particular node in its cluster
    is dead, rabbit-fence daemon fences the failed node in
    RabbitMQ cluster as well:

    * It casts disconnect failed_node & forget_cluster_node for
      the rest of the nodes in the RabbitMQ cluster.
    * Does not fence alive nodes with mnesia running.
    * Does not fence already forgotten nodes, which means that only
      the first node that detects a 'dead event' will issue the
      fencing action, while the rest of the cluster nodes will
      ignore it.
    * Requires corosync compiled with --enable-dbus option,
      ensures corosync-notifyd and dbus (messagebus) are running.
    * Contains temporary hacks in the corosync-notifyd init.d script to
      work around the upstream bugs
      https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1437368,
      https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1437359
    * Installs init.d and upstart scripts for the rabbit-fence daemon and
      enables it after the puppet Rabbitmq class is evaluated

    Note: system events may be monitored with dbus-monitor --system
    Note: If the corosync package gets updated with apt-get, the corosync-notifyd
      service will be affected by the mentioned Ubuntu upstream bugs again
      and will not start as a result. Make sure to back up the init script for
      corosync-notifyd prior to issuing the update and restore it once the
      update is done.

    Doc-Impact: ops guide
    Closes-bug: #1437348
    Related blueprint rabbitmq-pacemaker-multimaster-clone
    Change-Id: I691363386efe01421acc317ef6371ce45a0d4d11
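
The decision logic described in the commit message above (skip alive nodes, skip already-forgotten nodes, otherwise disconnect and forget the failed node) could look roughly like the following Python sketch. The rabbitmqctl invocations are an approximation of what such a check might use, not an excerpt from the actual rabbit-fence daemon.

    import subprocess

    def rabbitmqctl(*args):
        # Thin wrapper returning (exit code, combined stdout/stderr).
        proc = subprocess.run(["rabbitmqctl"] + list(args),
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return proc.returncode, proc.stdout.decode()

    def fence(dead_node):  # e.g. "rabbit@node-1"
        # 1) Do not fence alive nodes with mnesia running.
        code, out = rabbitmqctl("-n", dead_node, "eval",
                                "mnesia:system_info(is_running).")
        if code == 0 and "yes" in out:
            return "ignoring alive node %s" % dead_node

        # 2) Do not fence nodes another controller has already forgotten.
        _, status = rabbitmqctl("cluster_status")
        if dead_node not in status:
            return "%s already forgotten" % dead_node

        # 3) Disconnect the failed node and drop it from the cluster.
        rabbitmqctl("eval",
                    'erlang:disconnect_node(list_to_atom("%s")).' % dead_node)
        rabbitmqctl("forget_cluster_node", dead_node)
        return "fenced %s" % dead_node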

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This is a major improvement rather than a bug fix and involves new required packages and a new system daemon for fence actions, hence it should not be backported to other milestones.

tags: added: ha rabbitmq
removed: to-be-covered-by-tests