[R3.1-36] contrail-collector not able to come up due to "initializing (KafkaPub:10.204.217.33:9092 connection down)"

Bug #1634397 reported by Suresh
This bug affects 1 person
Affects              Status          Importance   Assigned to     Milestone
Juniper Openstack    Status tracked in Trunk
  R3.1               Fix Committed   Critical     Nikhil Bansal
  R3.2               Fix Committed   Critical     Nikhil Bansal
  Trunk              Fix Committed   Critical     Nikhil Bansal

Bug Description

Topo:
=====
nodel11

This node hosts 6 VMs, named nodel11-vm1 to nodel11-vm6.

Description:
============
1. Tried different roledef combinations on vm1, vm2 & vm3; vm4, vm5 & vm6 are used as compute nodes.

2. Saw contrail-collector not coming up with a combination where database is on vm3, collector is on vm1 & vm3, and config is on vm1 & vm2.

env.roledefs = {
    'control' : [host1, host2, host3, ],
    'all' : [host1, host2, host3, host4, host5, host6, ],
    'compute' : [host4, host5, host6, ],
    'database' : [host3, ],
    'webui' : [host2, host3, ],
    'cfgm' : [host1, host2, ],
    'openstack' : [host2, ],
    'collector' : [host1, host3, ],
    'build' : [host_build, ],
}

root@nodel11-vm3:/var/log# !co
contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api initializing (UvePartitions:UVE-Aggregation[Partitions:0] connection down)
contrail-analytics-nodemgr active
contrail-collector initializing (KafkaPub:10.204.217.33:9092 connection down)
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active

== Contrail Supervisor Database ==
supervisor-database: active
contrail-database-nodemgr active
kafka active

3. Issue was not seen in a combination where database was in vm1, vm2 & vm3, controller was in vm1 & vm3, config was in vm2 & vm3.

4. On further debugging with Nikhil, we found that the value of "default.replication.factor" in the /usr/share/kafka/config/server.properties file caused the issue.

5. In the failure case, default.replication.factor was set to 2 while the database was on only one node. default.replication.factor appears to be derived from the zookeeper node list, which in turn follows the config nodes.

6. Setting default.replication.factor to 1 and restarting the database and analytics services resolved the issue.

7. The issue is seen in mainline as well.
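
The workaround in step 6 can be sketched as a small Python helper that rewrites the property in the text of server.properties. This is only an illustration: the function name set_kafka_property is hypothetical and not part of contrail-provisioning, and the database and analytics services still need to be restarted after the file is changed.

```python
def set_kafka_property(text, key, value):
    """Return `text` (the contents of a .properties file) with `key` set to `value`.

    An existing `key=...` line is replaced; if the key is absent, a new
    line is appended. Hypothetical helper, for illustration only.
    """
    out = []
    found = False
    for line in text.splitlines():
        stripped = line.strip()
        key_part, sep, _ = stripped.partition("=")
        # Only touch real assignments, never commented-out lines.
        if sep and not stripped.startswith("#") and key_part.strip() == key:
            out.append("%s=%s" % (key, value))
            found = True
        else:
            out.append(line)
    if not found:
        out.append("%s=%s" % (key, value))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "broker.id=1\ndefault.replication.factor=2\n"
    print(set_kafka_property(sample, "default.replication.factor", 1))
```

Working on the file contents as a string keeps the sketch free of side effects; reading and writing /usr/share/kafka/config/server.properties is left to the caller.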

Tags: analytics
Suresh (suresha)
description: updated
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/24970
Submitter: Nikhil Bansal (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.1

Review in progress for https://review.opencontrail.org/24971
Submitter: Nikhil Bansal (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/24971
Committed: http://github.org/Juniper/contrail-provisioning/commit/58c4d7179dbceaa2b7d3c8fec8d235153d714181
Submitter: Zuul
Branch: R3.1

commit 58c4d7179dbceaa2b7d3c8fec8d235153d714181
Author: Nikhil B <email address hidden>
Date: Tue Oct 18 13:46:21 2016 +0530

kafka replication factor should depend on list of database nodes

Earlier zookeeper node list was same as database node list but it was changed
recently to be same as cfgm nodes. As a result, replication factor was getting
set wrongly if cfgm and database nodes are not same.
Closes-Bug: #1634397

Change-Id: Ib33dd446b18aee5174b7199d87a005bcaa533513
(cherry picked from commit a96ed20857154f478f56868d012c0dafcad4ff9d)
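
The intent of this change can be illustrated with a short Python sketch. It is not the code from the commit: the function name kafka_replication_factor and the cap of 2 are assumptions, the latter inferred from the failure case in the description, where two cfgm nodes produced a factor of 2.

```python
def kafka_replication_factor(database_nodes, max_factor=2):
    """Pick a replication factor that the Kafka brokers can satisfy.

    Kafka runs on the database nodes, so the factor must not exceed
    their count. `max_factor=2` is an assumed cap, not taken from the commit.
    """
    if not database_nodes:
        raise ValueError("at least one database node is required")
    return min(len(database_nodes), max_factor)


# Roles from the failing setup in this report.
roledefs = {
    "database": ["host3"],
    "cfgm": ["host1", "host2"],
}

if __name__ == "__main__":
    # Sizing from the cfgm (zookeeper) list reproduces the bad value ...
    print(kafka_replication_factor(roledefs["cfgm"]))      # 2
    # ... while sizing from the database list gives a value one broker can serve.
    print(kafka_replication_factor(roledefs["database"]))  # 1
```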

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/24970
Committed: http://github.org/Juniper/contrail-provisioning/commit/9127d0b27b2d2574621a379587a322c1d337d3d1
Submitter: Zuul
Branch: master

commit 9127d0b27b2d2574621a379587a322c1d337d3d1
Author: Nikhil B <email address hidden>
Date: Tue Oct 18 13:46:21 2016 +0530

kafka replication factor should depend on list of database nodes

Earlier zookeeper node list was same as database node list but it was changed
recently to be same as cfgm nodes. As a result, replication factor was getting
set wrongly if cfgm and database nodes are not same.
Closes-Bug: #1634397

Change-Id: Ib33dd446b18aee5174b7199d87a005bcaa533513

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/25597
Submitter: Nikhil Bansal (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/25597
Committed: http://github.org/Juniper/contrail-provisioning/commit/5a34d52d32da8219fb2d3c59aa66cdc182cab38c
Submitter: Zuul
Branch: R3.2

commit 5a34d52d32da8219fb2d3c59aa66cdc182cab38c
Author: Nikhil B <email address hidden>
Date: Tue Oct 18 13:46:21 2016 +0530

kafka replication factor should depend on list of database nodes

Earlier zookeeper node list was same as database node list but it was changed
recently to be same as cfgm nodes. As a result, replication factor was getting
set wrongly if cfgm and database nodes are not same.
Closes-Bug: #1634397

Change-Id: Ib33dd446b18aee5174b7199d87a005bcaa533513
(cherry picked from commit 9127d0b27b2d2574621a379587a322c1d337d3d1)
