Rabbitmq cluster cannot recover one of the slaves

Bug #1354520 reported by Bogdan Dobrelya
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: Medium
Assigned to: Aleksandr Didenko

Bug Description

{"build_id": "2014-08-07_11-20-05", "ostf_sha": "e33390c275e225d648b36997460dc29b1a3c20ae", "build_number": "409", "auth_required": true, "api": "1.0", "nailgun_sha": "67c4f1c18ab0833175f6dc7f0f9c49c3eb722287", "production": "docker", "fuelmain_sha": "7b2e7ef083f239bd47b5c47aecb1f815c009521f", "astute_sha": "b52910642d6de941444901b0f20e95ebbcb2b2e9", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "53633cd9bb149f6c1b9d5ee8321efc85c71cee68"}

CentOS, Nova HA flat (3 controllers, 2 computes)

Steps to reproduce:
1) Check for rabbit master node in corosync
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
 Masters: [ node-3.test.domain.local ]
 Slaves: [ node-2.test.domain.local node-4.test.domain.local ]
2) Log in to the master node (node-3 here) and block corosync traffic
 iptables -I INPUT -p udp --dport 5405 -m state --state NEW,ESTABLISHED,RELATED -j DROP
3) wait 10 min and unblock it
 iptables -D INPUT -p udp --dport 5405 -m state --state NEW,ESTABLISHED,RELATED -j DROP
4) Check rabbitmqctl cluster_status on node-3 - there are no nodes running
 Cluster status of node 'rabbit@node-3' ...
 [{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]}]
 ...done.
And the logs show no problems from the OCF point of view (/var/log/pacemaker.log at node-3):
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): master attribute is (null)
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (0) with node-2.test.domain.local (4132)
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0
There is also another error (see the command sketch after the description):
2014-08-08T17:06:05.609840+01:00 err: ERROR: Removing node 'rabbit@node-4' from cluster ... Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."}
5) recheck rabbit master node in corosync, e.g.
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
 Masters: [ node-2.test.domain.local ]
 Slaves: [ node-3.test.domain.local node-4.test.domain.local ]

But the RabbitMQ cluster is broken, and the OSTF HA test [15 of 15] [failure] 'Check RabbitMQ is available' is also failing.
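For reference, a sketch (not part of the original report) of the commands involved: the corosync master check in step 1 is typically done with pcs or crm_mon, and the forced removal hinted at by the offline_node_no_offline_flag error in step 4 would look roughly like this; node names follow the output above.
 pcs status | grep -A 2 'Master/Slave Set: master_p_rabbitmq-server'   # or: crm_mon -1
 # Forcibly remove a node while the local rabbit app is stopped, as the error message
 # suggests (dangerous - consult the rabbitmqctl manual first):
 rabbitmqctl forget_cluster_node --offline rabbit@node-4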

Tags: ha split-brain
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The other nodes cannot see node-3 running in the cluster:
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-4','rabbit@node-2']},
 {partitions,[]}]
...done.

Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-4']},
 {partitions,[]}]
...done.

But the affected node-3 sees no running nodes at all:
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]}]
...done.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Cannot reproduce.
After 'iptables -D ... <omitted> ...' and waiting about 3-4 minutes, I see the following on the nodes:

[root@node-2 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-4','rabbit@node-2']},
 {partitions,[]}]
...done.

[root@node-2 ~]# pcs status
Cluster name:
Last updated: Mon Aug 11 14:23:49 2014
Last change: Mon Aug 11 14:23:22 2014 via crm_attribute on node-4.test.domain.local
Stack: classic openais (with plugin)

The RabbitMQ HA health check also returns OK.

Current DC: node-2.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
12 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]

<omitted>

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-4.test.domain.local ]
<omitted>

[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-4','rabbit@node-3']},
 {partitions,[]}]
...done.

[root@node-4 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-3','rabbit@node-4']},
 {partitions,[]}]
...done.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

A small typo: the line 'The RabbitMQ HA health check also returns OK.' in my previous comment isn't part of the 'pcs status' output, obviously; it's a separate remark.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This should be another stage of the RabbitMQ fix: we should fix it with proper fencing of nodes. It is also partially a duplicate of https://bugs.launchpad.net/fuel/+bug/1348548, since that fix will stop the resources on the third node and then start them again, which will fix everything.
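For context, a quick sketch (assuming the stock pcs tooling on the controllers) for checking whether fencing is configured at all on such an environment:
 pcs property list --all | grep stonith-enabled   # is STONITH enabled in the cluster properties
 pcs stonith show                                 # list configured fencing devices, if any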

Changed in fuel:
milestone: 5.1 → 6.0
milestone: 6.0 → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This bug does not affect the actual functionality of the cluster: the RabbitMQ cluster is working and only one of the slaves does not have the rabbit app running. In case of network partitioning, Pacemaker will stop the minority of the cluster, and after connectivity is restored it will start RabbitMQ and thus join it back to the cluster. Thus, this bug's priority is not more than Medium unless there are any reproducers showing that the cluster is not working.
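A minimal sketch (assuming the stock pcs/crm_mon tooling on the controllers) for confirming that behaviour: check the no-quorum-policy and take a one-shot look at the Master/Slave set once connectivity is back:
 pcs property list --all | grep no-quorum-policy   # 'stop' means the minority side stops its resources
 crm_mon -1 | grep -A 2 'p_rabbitmq-server'        # one-shot view of the rabbitmq Master/Slave set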

summary: - Rabbitmq cluster cannot recover
+ Rabbitmq cluster cannot recover one of the slaves
Changed in fuel:
milestone: 5.1 → 6.0
importance: High → Medium
Revision history for this message
Kirill Omelchenko (komelchenko) wrote :

Reproduced on
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_5/162/testReport/%28root%29/ha_disconnect_controllers/ha_disconnect_controllers/

        Scenario:
            1. Disconnect eth3 of the first controller
            2. Check pacemaker status
            3. Revert environment
            4. Disconnect eth3 of the second controller
            5. Check pacemaker status
            6. Run OSTF
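
For manual reproduction, steps 1 and 4 above (disconnecting eth3) can be approximated roughly as follows - a sketch only, run as root on the controller, with the interface name taken from the scenario:
 ip link set eth3 down    # simulate the disconnect
 # ...wait for pacemaker/rabbitmq to react, then restore the link:
 ip link set eth3 up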

Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-1']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-2']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.
Warning: Permanently added 'node-4' (RSA) to the list of known hosts.
Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-4']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tried to reproduce on
{
    "api": "1.0",
    "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
    "auth_required": true,
    "build_id": "2014-11-04_21-28-16",
    "build_number": "76",
    "feature_groups": [
        "mirantis"
    ],
    "fuellib_sha": "ba0b3010647dfdd675d88bdfe20dfbed3134f52f",
    "fuelmain_sha": "d498d9153494b412cc75900ab8a1f4e18bc26c13",
    "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
    "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
    "production": "docker",
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "api": "1.0",
                "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
                "build_id": "2014-11-04_21-28-16",
                "build_number": "76",
                "feature_groups": [
                    "mirantis"
                ],
                "fuellib_sha": "ba0b3010647dfdd675d88bdfe20dfbed3134f52f",
                "fuelmain_sha": "d498d9153494b412cc75900ab8a1f4e18bc26c13",
                "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
                "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
                "production": "docker",
                "release": "6.0"
            }
        }
    }
}

Indeed, right after you enable traffic back on the RabbitMQ "master", you get a broken RabbitMQ cluster (node-1 was the master in my case):

[root@node-5 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-4','rabbit@node-5']}]}]
...done.

But you just need to give it some time to recover by itself. In about 5 minutes (or maybe even faster) you will get your RabbitMQ cluster up and running (in my case with a new "master" - node-4):

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-4.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-5.test.domain.local ]

[root@node-5 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-4','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-4','rabbit@node-5']},
 {cluster_name,<<"rabbit@node-4">>},
 {partitions,[]}]
...done.

[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-4','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-4','rabbit@node-5','rabbit@node-1']},
 {cluster_name,<<"rabbit@node-4">>},
 {partitions,[]}]
...done.

The environment passes OSTF tests just fine after the recovery.
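
While waiting for that recovery, a rough way to keep an eye on it (a sketch; the 30-second polling interval is arbitrary):
 watch -n 30 "pcs status | grep -A 2 'Master/Slave'; rabbitmqctl cluster_status"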

Changed in fuel:
status: Confirmed → Invalid