An ability to forcefully kick rabbitmq node from the cluster when it dies

Bug #1437348 reported by Bogdan Dobrelya
This bug affects 1 person
Affects              Status         Importance  Assigned to        Milestone
Fuel for OpenStack   Fix Committed  High        Bogdan Dobrelya
  5.1.x              Won't Fix      Undecided   Unassigned
  6.0.x              Won't Fix      Undecided   Unassigned
  6.1.x              Fix Committed  High        Bogdan Dobrelya

Bug Description

When a corosync node dies, the clone instances running on the other corosync nodes do not receive a notification. That introduces an additional time lag for the OCF RA logic to react and initiate the failover procedure, which is to reassemble the RabbitMQ cluster without the failed node.

The solution is to provide a dedicated fencing daemon running on the corosync nodes. This daemon should react to the D-Bus events emitted by corosync-notifyd when corosync nodes leave the cluster; the reaction is to kick the dead RabbitMQ node out of the cluster.
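
A minimal sketch of that idea, assuming Python with dbus-python and GLib, is shown below: subscribe to the system bus, watch for the membership-change signals emitted by corosync-notifyd, and forget the departed RabbitMQ node. The D-Bus interface name and signal payload here are placeholders (verify them with dbus-monitor --system); the actual rabbit-fence daemon shipped in fuel-library may be implemented differently.

    # Sketch of a corosync-driven RabbitMQ fencing daemon (assumptions noted).
    import subprocess

    import dbus
    import dbus.mainloop.glib
    from gi.repository import GLib

    COROSYNC_IFACE = "org.corosync"  # placeholder; check dbus-monitor --system

    def on_corosync_signal(*args):
        # Payload layout is version dependent; assume (nodename, ..., state).
        nodename, state = str(args[0]), str(args[-1])
        if state != "left":
            return
        rabbit_node = "rabbit@" + nodename.split(".")[0]
        # Remove the dead member from the RabbitMQ cluster metadata.
        subprocess.call(["rabbitmqctl", "forget_cluster_node", rabbit_node])

    dbus.mainloop.glib.DBusGMainLoop(set_as_default=True)
    bus = dbus.SystemBus()
    bus.add_signal_receiver(on_corosync_signal, dbus_interface=COROSYNC_IFACE)
    GLib.MainLoop().run()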

Tags: ha rabbitmq
Changed in fuel:
importance: Undecided → High
assignee: nobody → Bogdan Dobrelya (bogdando)
milestone: none → 6.1
status: New → In Progress
Changed in fuel:
status: In Progress → Won't Fix
milestone: 6.1 → 7.0
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The rabbit-fencing daemon should be included as a fix for the 6.1 release, as it addresses a major issue with RabbitMQ clustering.

Changed in fuel:
status: Won't Fix → In Progress
tags: added: to-be-covered-by-tests
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

Note for QA on how to test (a verification sketch in Python follows the steps below):
0) deploy any HA environment with 3 controllers;
on any controller node issue "pcs resource unmanage master_p_rabbitmq-server"

Should not kick alive nodes:
1) on the 1st controller, for example node-1, stop the corosync service gracefully
2) on the master node check /var/log/remote/node-*/rabbit-fence.log:
* it should contain info like:
"Got node-1.test.domain.local that left cluster
...
Preparing to fence node rabbit@node-1 from rabbit cluster
... (within 1 minute) ...
Ignoring alive node rabbit@node-1"
3) on the other controllers (not node-1, where corosync was stopped) check rabbitmq cluster_status:
* it should show all 3 rabbit nodes running and listed as cluster members
4) teardown:
* start the stopped corosync service; restart the pacemaker service on the same node
* pcs status should show all 3 nodes online within 1 minute

Should kick the failed rabbit node only once:
5) on the 1st controller, for example node-1, issue rabbitmqctl stop_app and stop
the corosync service gracefully
6) on the master node check /var/log/remote/node-*/rabbit-fence.log:
* one of the controller nodes' logs should contain info like:
"Got node-1.test.domain.local that left cluster
...
Preparing to fence node rabbit@node-1 from rabbit cluster
... (within 1 minute) ...
Disconnecting rabbit@node-1
Forgetting cluster node rabbit@node-1"
7) on the other controllers (not node-1, where corosync was stopped) check rabbitmq cluster_status:
* it should show only 2 rabbit nodes running and listed as cluster members (node-1 should not be listed there)
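
The expected post-conditions of both cases can also be checked with a small script. This is a sketch only; the log path, the node names, and the {running_nodes,[...]} parsing are assumptions based on the steps above and may need adjusting for the actual environment.

    # Post-condition checks for the two test cases above (sketch).
    import re
    import subprocess

    FENCE_LOG = "/var/log/remote/node-2/rabbit-fence.log"  # any surviving node

    def fence_log_mentions(pattern):
        with open(FENCE_LOG) as log:
            return any(re.search(pattern, line) for line in log)

    def running_rabbit_nodes():
        # Assumes the Erlang-term output format of rabbitmqctl 3.x cluster_status.
        out = subprocess.check_output(["rabbitmqctl", "cluster_status"]).decode()
        match = re.search(r"\{running_nodes,\[(.*?)\]\}", out, re.S)
        return re.findall(r"rabbit@[\w.-]+", match.group(1)) if match else []

    # Steps 1-4 ("should not kick alive nodes"):
    #   fence_log_mentions("Ignoring alive node rabbit@node-1") and
    #   len(running_rabbit_nodes()) == 3 are expected to hold.
    # Steps 5-7 ("should kick failed rabbit node only once"):
    #   fence_log_mentions("Forgetting cluster node rabbit@node-1") and
    #   len(running_rabbit_nodes()) == 2 are expected to hold.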

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/108792
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=ad9097a1bb55cf035f6afa1d12afd13bd965f2b5
Submitter: Jenkins
Branch: master

commit ad9097a1bb55cf035f6afa1d12afd13bd965f2b5
Author: Vladimir Kuklin <email address hidden>
Date: Tue Jul 22 22:20:27 2014 +0400

    RabbitMQ node resource level fencing

    When Corosync notifies that particular node in its cluster
    is dead, rabbit-fence daemon fences the failed node in
    RabbitMQ cluster as well:

    * It casts disconnect failed_node & forget_cluster_node for
      the rest of the nodes in the RabbitMQ cluster.
    * Does not fence alive nodes with mnesia running.
    * Does not fence already forgotten nodes, which means that only
      the first node that detects a 'dead event' will issue the
      fencing action, while the rest of the cluster nodes will
      ignore it.
    * Requires corosync compiled with --enable-dbus option,
      ensures corosync-notifyd and dbus (messagebus) are running.
    * Contains temporary hacks in the corosync-notifyd init.d script to
      work around the upstream bugs
      https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1437368,
      https://bugs.launchpad.net/ubuntu/+source/corosync/+bug/1437359
    * Installs init.d and upstart scripts for the rabbit-fence daemon and
      enables it after the puppet Rabbitmq class is evaluated

    Note: system events may be monitored with dbus-monitor --system
    Note: If the corosync package gets updated with apt-get, the corosync-notifyd
      service will be affected by the mentioned Ubuntu upstream bugs again
      and will not start as a result. Make sure to back up the init script for
      corosync-notifyd prior to issuing the update and restore it once the
      update is done.

    Doc-Impact: ops guide
    Closes-bug: #1437348
    Related blueprint rabbitmq-pacemaker-multimaster-clone
    Change-Id: I691363386efe01421acc317ef6371ce45a0d4d11
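
The decision logic described in the commit message above (skip alive nodes, skip already-forgotten nodes, otherwise disconnect and forget the failed node) could look roughly like the following Python sketch. The rabbitmqctl invocations are an approximation of what such a check might use, not an excerpt from the actual rabbit-fence daemon.

    import subprocess

    def rabbitmqctl(*args):
        # Thin wrapper returning (exit code, combined stdout/stderr).
        proc = subprocess.run(["rabbitmqctl"] + list(args),
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        return proc.returncode, proc.stdout.decode()

    def fence(dead_node):  # e.g. "rabbit@node-1"
        # 1) Do not fence alive nodes with mnesia running.
        code, out = rabbitmqctl("-n", dead_node, "eval",
                                "mnesia:system_info(is_running).")
        if code == 0 and "yes" in out:
            return "ignoring alive node %s" % dead_node

        # 2) Do not fence nodes another controller has already forgotten.
        _, status = rabbitmqctl("cluster_status")
        if dead_node not in status:
            return "%s already forgotten" % dead_node

        # 3) Disconnect the failed node and drop it from the cluster.
        rabbitmqctl("eval",
                    'erlang:disconnect_node(list_to_atom("%s")).' % dead_node)
        rabbitmqctl("forget_cluster_node", dead_node)
        return "fenced %s" % dead_node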

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

This is a major improvement rather than a bug fix and involves new required packages and a new system daemon for fence actions, hence it should not be backported to other milestones.

tags: added: ha rabbitmq
removed: to-be-covered-by-tests