R5.0 micro-services provisioning - RabbitMQ clustering fails on a particular node of the setup.

Bug #1764925 reported by Ritam Gangopadhyay on 2018-04-18
Affects: Juniper Openstack (status tracked in Trunk)
  R5.0  - Fix Committed - High - assigned to alexey-mr
  Trunk - Fix Released  - High - assigned to alexey-mr

Bug Description

Setup Details:-

nodem14, nodem6, nodem7 - contrail controllers and openstack nodes
nodem8, nodem9, nodem10 - compute nodes

Detailed setup is in instances.yaml on nodem14: /root/contrail-ansible-deployer/config/instances.yaml

Error seen:
[root@nodem14 logs]# docker exec -it configdatabase_rabbitmq_1 rabbitmqctl -n contrail@nodem14 cluster_status
Cluster status of node contrail@nodem14
[{nodes,[{disc,[contrail@nodem14]}]},
 {running_nodes,[contrail@nodem14]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodem14,[]}]}]
[root@nodem14 logs]#
[root@nodem14 logs]# docker exec -it configdatabase_rabbitmq_1 rabbitmqctl -n contrail@nodem6 cluster_status
Cluster status of node contrail@nodem6
[{nodes,[{disc,[contrail@nodem6,contrail@nodem7]}]},
 {running_nodes,[contrail@nodem7,contrail@nodem6]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodem7,[]},{contrail@nodem6,[]}]}]
[root@nodem14 logs]#

Nothing much in the logs; on nodem6 we see the following:

=INFO REPORT==== 17-Apr-2018::15:25:07 ===
connection <0.845.0> (10.10.10.6:47144 -> 10.10.10.6:5673): user 'guest' authenticated and granted access to vhost '/'

=INFO REPORT==== 17-Apr-2018::15:25:07 ===
connection <0.604.0> (10.10.10.14:32830 -> 10.10.10.6:5673): user 'guest' authenticated and granted access to vhost '/'

=WARNING REPORT==== 17-Apr-2018::15:25:15 ===
closing AMQP connection <0.604.0> (10.10.10.14:32830 -> 10.10.10.6:5673, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

=WARNING REPORT==== 17-Apr-2018::15:25:15 ===
closing AMQP connection <0.607.0> (10.10.10.14:32832 -> 10.10.10.6:5673, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

=WARNING REPORT==== 17-Apr-2018::16:05:38 ===
closing AMQP connection <0.623.0> (10.10.10.7:50086 -> 10.10.10.6:5673, vhost: '/', user: 'guest'):
client unexpectedly closed TCP connection

Andrey Pavlov (apavlov-e) wrote :

Ritam, can you please attach instances.yaml here?

Sudheendra Rao (sudheendra-k) wrote :

instances.yaml attached.

tags: added: sanityblocker
Ritam Gangopadhyay (ritam) wrote :

Some more info on the bug.

The difference between rabbit clustering in openstack and contrail today is that kolla-ansible generates the clustering info statically from this file:
[root@nodem14 ~]# cat /etc/kolla/rabbitmq/rabbitmq-clusterer.config
[
  {version, 1},
  {nodes, [
      {'rabbit@nodem14', disc}, {'rabbit@nodem7', disc}, {'rabbit@nodem6', disc} ]},
  {gospel,
    {node, 'rabbit@nodem14'}}
].
[root@nodem14 ~]#

and in the openstack rabbit container the cluster information used to bring up the rabbit service looks fine:

[root@nodem14 ~]# cat /var/lib/docker/volumes/rabbitmq/_data/mnesia/rabbit/cluster_nodes.config
{[rabbit@nodem14,rabbit@nodem6,rabbit@nodem7],[rabbit@nodem14,rabbit@nodem6,rabbit@nodem7]}.
[root@nodem14 ~]#

**********************************

The contrail config-db rabbit cluster, in turn, is populated by the autocluster plugin, and the rabbitmq.config for it looks like this:

root@nodem14:/# cat /etc/rabbitmq/rabbitmq.config
[ { rabbit, [
        { loopback_users, [ ] },
        { tcp_listeners, [ 5672 ] },
        { ssl_listeners, [ ] },
        { hipe_compile, false }
] } ].
root@nodem14:/#
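Notably, the file above contains no `autocluster` section at all; the plugin runs on its defaults. For orientation only, the general shape of a rabbitmq-autocluster configuration section is sketched below. The backend choice and keys here are illustrative assumptions about the plugin in general, not the configuration Contrail generates:

```erlang
%% Illustrative only -- general shape of a rabbitmq-autocluster
%% plugin configuration; the backend and keys shown are assumptions,
%% not what Contrail actually renders into rabbitmq.config.
[
  { rabbit, [
        { loopback_users, [ ] },
        { tcp_listeners, [ 5672 ] }
  ] },
  { autocluster, [
        { backend, dns },
        { autocluster_failure, ignore }
  ] }
].
```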

This generates a rabbit cluster conf file which has only 2 nodes in the cluster, so it seems we are missing something in the autocluster plugin configuration.

[root@nodem14 ~]# cat /var/lib/docker/volumes/22d6c4f86eddc06d6d8dc4481ce607487a06d82605340e091b1ef4d23e313e69/_data/mnesia/contrail\@nodem14/cluster_nodes.config
{[contrail@nodem14,contrail@nodem6],[contrail@nodem14,contrail@nodem6]}.
[root@nodem14 ~]#

Andrey Pavlov (apavlov-e) wrote :

Looks like this is an issue with autoclustering:
nodem7 starts when nodem14 is not yet ready to accept connections.

As a workaround, I ran the following in the rabbitmq container on nodem7:

root@nodem7:/# rabbitmqctl -n contrail@nodem7 stop_app
Stopping rabbit application on node contrail@nodem7
root@nodem7:/# rabbitmqctl -n contrail@nodem7 join_cluster contrail@nodem14
Clustering node contrail@nodem7 with contrail@nodem14
root@nodem7:/# rabbitmqctl -n contrail@nodem7 start_app
Starting node contrail@nodem7

and then the cluster looks good:

root@nodem7:/# rabbitmqctl -n contrail@nodem14 cluster_status
Cluster status of node contrail@nodem14
[{nodes,[{disc,[contrail@nodem14,contrail@nodem6,contrail@nodem7]}]},
 {running_nodes,[contrail@nodem7,contrail@nodem6,contrail@nodem14]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodem7,[]},{contrail@nodem6,[]},{contrail@nodem14,[]}]}]
root@nodem7:/# rabbitmqctl -n contrail@nodem7 cluster_status
Cluster status of node contrail@nodem7
[{nodes,[{disc,[contrail@nodem14,contrail@nodem6,contrail@nodem7]}]},
 {running_nodes,[contrail@nodem14,contrail@nodem6,contrail@nodem7]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodem14,[]},{contrail@nodem6,[]},{contrail@nodem7,[]}]}]

Another workaround is to restart the container that is not in the cluster.
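The restart workaround can be scripted. The sketch below is illustrative and not part of the fix: the container name `configdatabase_rabbitmq_1` and the `contrail@<host>` node naming come from the output above, while the helper function names and the expected-node list are assumptions:

```shell
#!/bin/sh
# Illustrative sketch: find which expected nodes are absent from
# `rabbitmqctl cluster_status` output, so their containers can be
# restarted. Names follow the bug report; adapt to your deployment.

# Extract the running_nodes list from cluster_status output on stdin,
# one node per line. cluster_status prints a line such as:
#   {running_nodes,[contrail@nodem7,contrail@nodem6]},
running_nodes() {
    sed -n 's/.*{running_nodes,\[\([^]]*\)\].*/\1/p' | tr ',' '\n'
}

# $1: space-separated expected nodes; stdin: cluster_status output.
# Prints each expected node that is not currently running.
missing_nodes() {
    running=$(running_nodes)
    for node in $1; do
        echo "$running" | grep -qx "$node" || echo "$node"
    done
}

# Hypothetical usage from any controller:
#   docker exec configdatabase_rabbitmq_1 \
#       rabbitmqctl -n contrail@nodem14 cluster_status |
#     missing_nodes "contrail@nodem14 contrail@nodem6 contrail@nodem7"
# then run `docker restart configdatabase_rabbitmq_1` on each host
# whose node was reported missing.
```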

tags: removed: sanityblocker
Vineet Gupta (vineetrf) on 2018-04-20
tags: added: releasenote

Review in progress for https://review.opencontrail.org/42703
Submitter: alexey-mr (<email address hidden>)

Reviewed: https://review.opencontrail.org/42703
Committed: http://github.com/Juniper/contrail-container-builder/commit/c701bfcbced038bbe4b3d7f10535da78f9e7cd27
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit c701bfcbced038bbe4b3d7f10535da78f9e7cd27
Author: alexey-mr <email address hidden>
Date: Wed May 2 17:05:30 2018 +0300

Wait for first rabbitmq node

It is to avoid race-condition
during autoclustering.

Change-Id: I1dac81833dcee836abb8e7f6cbf3d366224d7552
Closes-Bug: #1764925
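Per the commit message, the fix makes nodes wait for the first rabbitmq node before starting, so autoclustering always has a stable node to join. A minimal sketch of that idea follows; the function name, attempt budget, and the `MY_NODE`/`FIRST_NODE` variables are illustrative assumptions, not the actual contrail-container-builder entrypoint code:

```shell
#!/bin/sh
# Illustrative sketch of "wait for first rabbitmq node" to avoid the
# autoclustering race condition; not the actual entrypoint code.

# Retry a command once per second until it succeeds, giving up after
# the given number of attempts. Returns the usual 0/1 status.
wait_for() {
    attempts=$1; shift
    i=0
    while ! "$@"; do
        i=$((i + 1))
        [ "$i" -ge "$attempts" ] && return 1
        sleep 1
    done
    return 0
}

# Hypothetical entrypoint usage: every node except the first waits
# until the first node answers before starting its own rabbit app.
#   if [ "$MY_NODE" != "$FIRST_NODE" ]; then
#       wait_for 60 rabbitmqctl -n "$FIRST_NODE" status >/dev/null 2>&1
#   fi
```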

Review in progress for https://review.opencontrail.org/42736
Submitter: Andrey Pavlov (<email address hidden>)

Reviewed: https://review.opencontrail.org/42736
Committed: http://github.com/Juniper/contrail-container-builder/commit/59b6c6650bc19c7b70bdc9ee6827be277d9f17d1
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 59b6c6650bc19c7b70bdc9ee6827be277d9f17d1
Author: alexey-mr <email address hidden>
Date: Wed May 2 17:05:30 2018 +0300

Wait for first rabbitmq node

It is to avoid race-condition
during autoclustering.

Change-Id: I1dac81833dcee836abb8e7f6cbf3d366224d7552
Closes-Bug: #1764925
(cherry picked from commit c701bfcbced038bbe4b3d7f10535da78f9e7cd27)

vimal (vappachan) wrote :

Verified on master 89

 docker exec -it configdatabase_rabbitmq_1 rabbitmqctl -n contrail@nodem14 cluster_status
Cluster status of node contrail@nodem14
[{nodes,[{disc,[contrail@nodem14,contrail@nodem6,contrail@nodem7]}]},
 {running_nodes,[contrail@nodem7,contrail@nodem6,contrail@nodem14]},
 {cluster_name,<<"<email address hidden>">>},
 {partitions,[]},
 {alarms,[{contrail@nodem7,[]},{contrail@nodem6,[]},{contrail@nodem14,[]}]}]
