On an HA setup, on shutting down one of the CFGM nodes, query engine remains in init state

Bug #1400208 reported by Vinod Nair
This bug affects 3 people
Affects            Status         Importance  Assigned to  Milestone
Juniper Openstack  Fix Committed  High        c mishra
R2.0               Fix Committed  High        c mishra

Bug Description

On an HA cluster with 3 Openstack/cfgm nodes, on shutting down one, the query engine remains in init state.

root@cs-scale-3:~# contrail-status | grep query
contrail-query-engine initializing

It looks like, even though the config has 3 Cassandra nodes, it is still trying to connect to only 1.
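To illustrate the behavior one would expect here, the following is a minimal Python sketch of falling through a list of configured nodes until one accepts a connection. It is not the query engine's code: `tcp_probe` and `first_reachable` are hypothetical helper names, and the second and third addresses in the usage comment are made-up placeholders. Only 13.1.0.1 and port 9160 come from this report's logs.

```python
import socket
from typing import Callable, Iterable, Optional, Tuple

Endpoint = Tuple[str, int]


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds within the timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def first_reachable(
    endpoints: Iterable[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint the probe accepts, or None if every probe fails."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None


# Example call (addresses other than 13.1.0.1 are hypothetical):
# first_reachable([("13.1.0.1", 9160), ("13.1.0.2", 9160), ("13.1.0.3", 9160)])
```

With one node down, a client following this pattern would move on to the next configured node instead of retrying the same unreachable address.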

Version : contrail-analytics 2.0-11 11 Ubuntu 12.04.3 Icehouse

root@cs-scale-3:~# tail -f /var/log/contrail/contrail-query-engine.log

2014-12-07 Sun 07:26:17:449.297 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: RAC_Connect status: 0 0x1fb69e0
2014-12-07 Sun 07:26:17:449.321 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: Connected to REDIS...

2014-12-07 Sun 07:26:17:449.347 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: RAC_Connect status: 0 0x1fb7260
2014-12-07 Sun 07:26:17:449.359 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: Connected to REDIS...

2014-12-07 Sun 08:01:11:998.110 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: SANDESH: Send FAILED: 1417968071998094 TCP [SYS_DEBUG]: TcpSessionMessageLog: Session 127.0.0.1:33156::127.0.0.1:8086(-1) < Active session connection complete controller/src/io/tcp_session.cc 182
2014-12-07 Sun 08:08:29:628.355 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: SANDESH: Send FAILED: 1417968509628320 TCP [SYS_DEBUG]: TcpSessionMessageLog: Session 127.0.0.1:51924::127.0.0.1:8086(-1) < Active session connection complete controller/src/io/tcp_session.cc 182
2014-12-07 Sun 08:22:11:949.665 PST cs-scale-3 [Thread 140559956899648, Pid 4810]: SANDESH: Send FAILED: 1417969331949646 TCP [SYS_DEBUG]: TcpSessionMessageLog: Session 127.0.0.1:55481::127.0.0.1:8086(-1) < Active session connection complete controller/src/io/tcp_session.cc 182
^C
root@cs-scale-3:~# tail -f /var/log/contrail/contrail-query-engine
contrail-query-engine.log contrail-query-engine-stdout.log
root@cs-scale-3:~# tail -f /var/log/contrail/contrail-query-engine-stdout.log
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
Thrift: Sun Dec 7 21:23:59 2014 TSocketPool::open failed <Host: 13.1.0.1 Port: 9160>: connect() failed: Connection timed out
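A quick way to confirm that every failed attempt targets the same node is to tally the `TSocketPool::open failed` lines by host and port. The sketch below is a hypothetical helper (`count_failures` is not part of any Contrail tool); the regular expression is written against the line format shown in the log above and may need adjusting for other Thrift versions.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the TSocketPool failure lines as they appear in the stdout log above.
FAILURE_RE = re.compile(
    r"TSocketPool::open failed <Host: (?P<host>\S+) Port: (?P<port>\d+)>"
)


def count_failures(lines: Iterable[str]) -> Counter:
    """Count failed connection attempts per (host, port) pair."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[(match.group("host"), int(match.group("port")))] += 1
    return counts
```

Passing the open file object for /var/log/contrail/contrail-query-engine-stdout.log to `count_failures` yields one counter entry per distinct node; a single entry would support the observation that only one Cassandra node is ever tried.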

Tags: analytics
Vinod Nair (vinodnair)
description: updated
Sunil Bakhru (sbakhru) wrote :

Prabhakar, Please triage.

Changed in juniperopenstack:
assignee: nobody → Prabhakaran Ganesan (gprabhak)
Changed in juniperopenstack:
importance: Undecided → Critical
importance: Critical → High
Sunil Bakhru (sbakhru) wrote :

Raj, This appears to require specialty knowledge on QE. Can you please have someone take a look?

Changed in juniperopenstack:
assignee: Prabhakaran Ganesan (gprabhak) → Raj Reddy (rajreddy)
Raj Reddy (rajreddy)
Changed in juniperopenstack:
assignee: Raj Reddy (rajreddy) → c mishra (cdmishra)
c mishra (cdmishra) wrote :

Can you provide the following information:
1) What is the topology: which systems have analytics, cfg, control, and cassandra enabled?

2) Which system was brought down?

3) Was it done while the HA cluster was in stable state?

c mishra (cdmishra)
Changed in juniperopenstack:
status: New → In Progress
information type: Proprietary → Public
Raj Reddy (rajreddy)
tags: added: analytics
Raj Reddy (rajreddy)
Changed in juniperopenstack:
milestone: none → r2.20-fcs
c mishra (cdmishra)
Changed in juniperopenstack:
milestone: r2.20-fcs → none
OpenContrail Admin (ci-admin-f) wrote : master

Review in progress for https://review.opencontrail.org/9629
Submitter: c mishra (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/9629
Committed: http://github.org/Juniper/contrail-controller/commit/92023f33366f0fa88f34ac2430118bfa88f327ef
Submitter: Zuul
Branch: master

commit 92023f33366f0fa88f34ac2430118bfa88f327ef
Author: Chandan Mishra <email address hidden>
Date: Tue Apr 28 10:52:20 2015 -0700

This commit fixes the issue in the previous commit:
https://review.opencontrail.org/#/c/5763

As part of the previous commit we increased the max # of threads available
as part of the task library.
The previous fix did not take care of the scenario after TaskScheduler class
initialization.

This checkin takes care of the scenario after TaskScheduler instantiation.

Change-Id: I6f5568847ea6af2bdc045c29056b1271e64a1a68
Closes-Bug: 1400208

Changed in juniperopenstack:
status: In Progress → Fix Committed