Rabbitmq cluster cannot recover one of the slaves

Bug #1354520 reported by Bogdan Dobrelya
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: Medium
Assigned to: Aleksandr Didenko

Bug Description

{"build_id": "2014-08-07_11-20-05", "ostf_sha": "e33390c275e225d648b36997460dc29b1a3c20ae", "build_number": "409", "auth_required": true, "api": "1.0", "nailgun_sha": "67c4f1c18ab0833175f6dc7f0f9c49c3eb722287", "production": "docker", "fuelmain_sha": "7b2e7ef083f239bd47b5c47aecb1f815c009521f", "astute_sha": "b52910642d6de941444901b0f20e95ebbcb2b2e9", "feature_groups": ["mirantis"], "release": "5.1", "fuellib_sha": "53633cd9bb149f6c1b9d5ee8321efc85c71cee68"}

CentOS, Nova HA flat (3 controllers, 2 computes)

Steps to reproduce:
1) Check for rabbit master node in corosync
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
 Masters: [ node-3.test.domain.local ]
 Slaves: [ node-2.test.domain.local node-4.test.domain.local ]
2) Log in to the master node (node-3 here) and block corosync traffic
 iptables -I INPUT -p udp --dport 5405 -m state --state NEW,ESTABLISHED,RELATED -j DROP
3) wait 10 min and unblock it
 iptables -D INPUT -p udp --dport 5405 -m state --state NEW,ESTABLISHED,RELATED -j DROP
4) Check rabbitmqctl cluster_status on node-3 - there are no nodes running
 Cluster status of node 'rabbit@node-3' ...
 [{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]}]
 ...done.
And the logs show no problems from the OCF point of view (/var/log/pacemaker.log at node-3):
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): get_status() returns 0.
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): also checking if we are master.
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): master attribute is (null)
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): checking if rabbit app is running
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): preparing to update master score for node
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): comparing our uptime (0) with node-2.test.domain.local (4132)
<30>Aug 8 16:08:43 node-3 lrmd: INFO: p_rabbitmq-server: get_monitor(): get_monitor function ready to return 0
There is also another error (see the command sketch after the description):
2014-08-08T17:06:05.609840+01:00 err: ERROR: Removing node 'rabbit@node-4' from cluster ... Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."}
5) recheck rabbit master node in corosync, e.g.
 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
 Masters: [ node-2.test.domain.local ]
 Slaves: [ node-3.test.domain.local node-4.test.domain.local ]

But the RabbitMQ cluster is broken, and the OSTF HA test [15 of 15] [failure] 'Check RabbitMQ is available' is also failing.
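For reference, a sketch (not part of the original report) of the commands involved: the corosync master check in step 1 is typically done with pcs or crm_mon, and the forced removal hinted at by the offline_node_no_offline_flag error in step 4 would look roughly like this; node names follow the output above.
 pcs status | grep -A 2 'Master/Slave Set: master_p_rabbitmq-server'   # or: crm_mon -1
 # Forcibly remove a node while the local rabbit app is stopped, as the error message
 # suggests (dangerous - consult the rabbitmqctl manual first):
 rabbitmqctl forget_cluster_node --offline rabbit@node-4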

Tags: ha split-brain
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

The other nodes cannot see node-3 running in the cluster:
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-4','rabbit@node-2']},
 {partitions,[]}]
...done.

Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-4']},
 {partitions,[]}]
...done.

But the affected node-3 sees no running nodes at all:
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]}]
...done.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

Cannot reproduce.
After 'iptables -D ... <omitted> ...' and waiting about 3-4 minutes, I see the following on the nodes:

[root@node-2 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-3','rabbit@node-4','rabbit@node-2']},
 {partitions,[]}]
...done.

[root@node-2 ~]# pcs status
Cluster name:
Last updated: Mon Aug 11 14:23:49 2014
Last change: Mon Aug 11 14:23:22 2014 via crm_attribute on node-4.test.domain.local
Stack: classic openais (with plugin)

The RabbitMQ HA health check also returns OK.

Current DC: node-2.test.domain.local - partition with quorum
Version: 1.1.10-14.el6_5.3-368c726
3 Nodes configured, 3 expected votes
12 Resources configured

Online: [ node-2.test.domain.local node-3.test.domain.local node-4.test.domain.local ]

<omitted>

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-3.test.domain.local ]
     Slaves: [ node-2.test.domain.local node-4.test.domain.local ]
<omitted>

[root@node-3 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-3' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-4','rabbit@node-3']},
 {partitions,[]}]
...done.

[root@node-4 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-2','rabbit@node-3','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-3','rabbit@node-4']},
 {partitions,[]}]
...done.

Revision history for this message
Stanislaw Bogatkin (sbogatkin) wrote :

A small typo: the line 'The RabbitMQ HA health check also returns OK.' in my previous comment isn't part of the 'pcs status' output, obviously; it's a separate remark.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This should be another stage of the RabbitMQ fix: we should fix it with proper fencing of nodes. It is also partially a duplicate of https://bugs.launchpad.net/fuel/+bug/1348548, since that fix will stop the resources on the third node and then start them again, which will fix everything.
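For context, a quick sketch (assuming the stock pcs tooling on the controllers) for checking whether fencing is configured at all on such an environment:
 pcs property list --all | grep stonith-enabled   # is STONITH enabled in the cluster properties
 pcs stonith show                                 # list configured fencing devices, if any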

Changed in fuel:
milestone: 5.1 → 6.0
milestone: 6.0 → 5.1
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

This bug does not affect the actual functionality of the cluster: the RabbitMQ cluster is working and only one of the slaves does not have the rabbit app running. In case of network partitioning, Pacemaker will stop the minority of the cluster, and after connectivity is restored it will start RabbitMQ and thus join it back to the cluster. Thus, this bug's priority is not more than Medium unless there are any reproducers showing that the cluster is not working.
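A minimal sketch (assuming the stock pcs/crm_mon tooling on the controllers) for confirming that behaviour: check the no-quorum-policy and take a one-shot look at the Master/Slave set once connectivity is back:
 pcs property list --all | grep no-quorum-policy   # 'stop' means the minority side stops its resources
 crm_mon -1 | grep -A 2 'p_rabbitmq-server'        # one-shot view of the rabbitmq Master/Slave set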

summary: - Rabbitmq cluster cannot recover
+ Rabbitmq cluster cannot recover one of the slaves
Changed in fuel:
milestone: 5.1 → 6.0
importance: High → Medium
Revision history for this message
Kirill Omelchenko (komelchenko) wrote :

Reproduced on
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.centos.thread_5/162/testReport/%28root%29/ha_disconnect_controllers/ha_disconnect_controllers/

        Scenario:
            1. Disconnect eth3 of the first controller
            2. Check pacemaker status
            3. Revert environment
            4. Disconnect eth3 of the second controller
            5. Check pacemaker status
            6. Run OSTF
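
For manual reproduction, steps 1 and 4 above (disconnecting eth3) can be approximated roughly as follows - a sketch only, run as root on the controller, with the interface name taken from the scenario:
 ip link set eth3 down    # simulate the disconnect
 # ...wait for pacemaker/rabbitmq to react, then restore the link:
 ip link set eth3 up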

Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-2','rabbit@node-1']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.
Warning: Permanently added 'node-2' (RSA) to the list of known hosts.
Cluster status of node 'rabbit@node-2' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-2']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.
Warning: Permanently added 'node-4' (RSA) to the list of known hosts.
Cluster status of node 'rabbit@node-4' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-2','rabbit@node-4']}]},
 {running_nodes,['rabbit@node-4']},
 {cluster_name,<<"rabbit@node-1">>},
 {partitions,[]}]
...done.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Revision history for this message
Aleksandr Didenko (adidenko) wrote :

Tried to reproduce on
{
    "api": "1.0",
    "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
    "auth_required": true,
    "build_id": "2014-11-04_21-28-16",
    "build_number": "76",
    "feature_groups": [
        "mirantis"
    ],
    "fuellib_sha": "ba0b3010647dfdd675d88bdfe20dfbed3134f52f",
    "fuelmain_sha": "d498d9153494b412cc75900ab8a1f4e18bc26c13",
    "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
    "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
    "production": "docker",
    "release": "6.0",
    "release_versions": {
        "2014.2-6.0": {
            "VERSION": {
                "api": "1.0",
                "astute_sha": "c72dac7b31646fbedbfc56a2a87676c6d5713fcf",
                "build_id": "2014-11-04_21-28-16",
                "build_number": "76",
                "feature_groups": [
                    "mirantis"
                ],
                "fuellib_sha": "ba0b3010647dfdd675d88bdfe20dfbed3134f52f",
                "fuelmain_sha": "d498d9153494b412cc75900ab8a1f4e18bc26c13",
                "nailgun_sha": "35946b1f225c984f11915ba8e985584160f0b129",
                "ostf_sha": "9c6fadca272427bb933bc459e14bb1bad7f614aa",
                "production": "docker",
                "release": "6.0"
            }
        }
    }
}

Indeed, right after you enable traffic back on the RabbitMQ "master", you get a broken RabbitMQ cluster (node-1 was the master in my case):

[root@node-5 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-4','rabbit@node-5']}]}]
...done.

But you just need to give it some time to recover by itself. In about 5 minutes (or maybe even faster) you will get your RabbitMQ cluster up and running (in my case with a new "master" - node-4):

 Master/Slave Set: master_p_rabbitmq-server [p_rabbitmq-server]
     Masters: [ node-4.test.domain.local ]
     Slaves: [ node-1.test.domain.local node-5.test.domain.local ]

[root@node-5 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-5' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-4','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-1','rabbit@node-4','rabbit@node-5']},
 {cluster_name,<<"rabbit@node-4">>},
 {partitions,[]}]
...done.

[root@node-1 ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@node-1' ...
[{nodes,[{disc,['rabbit@node-1','rabbit@node-4','rabbit@node-5']}]},
 {running_nodes,['rabbit@node-4','rabbit@node-5','rabbit@node-1']},
 {cluster_name,<<"rabbit@node-4">>},
 {partitions,[]}]
...done.

The environment passes OSTF tests just fine after the recovery.
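
While waiting for that recovery, a rough way to keep an eye on it (a sketch; the 30-second polling interval is arbitrary):
 watch -n 30 "pcs status | grep -A 2 'Master/Slave'; rabbitmqctl cluster_status"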

Changed in fuel:
status: Confirmed → Invalid