TimedOutException in collector and query-engine when reading/writing data from/to Cassandra

Bug #1527225 reported by Sundaresan Rajangam
This bug affects 1 person
Affects: Juniper Openstack
Status tracked in: Trunk
Trunk status: Fix Committed
Importance: High
Assigned to: Megh Bhatt

Bug Description

In a 3-node database cluster, a TimedOutException is raised in the collector/query-engine during Cassandra read/write operations.

The following warning message was continuously logged in cassandra/system.log

WARN [MessagingService-Incoming-/40.43.1.7] 2015-12-16 21:16:11,629 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing

Also, the nodetool repair command was running on all the nodes in the cluster and seemed to be stuck.

Tags: analytics
Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

Megh debugged the issue and here is his analysis:

It seems the UnknownColumnFamilyException is most likely caused by a schema mismatch between the Cassandra nodes. Nodes .6 and .7 are somehow not in sync with respect to the schemas of the column families and are not able to get back in sync. I checked that ping between them works fine even with larger packet sizes. We should actually be creating column families at QUORUM or ALL consistency; then these problems might not happen, or would surface sooner, during creation of the tables itself.
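For reference, here is a minimal sketch (not part of Contrail; the helper name is made up) of how the schema disagreement could be checked programmatically by wrapping nodetool describecluster, whose raw output from this cluster is captured below:

import subprocess

def describe_schema_versions(nodetool='nodetool'):
    """Parse the 'Schema versions:' section of `nodetool describecluster`.

    Returns a dict mapping each reported schema version (or the
    'UNREACHABLE' pseudo-version) to its list of node addresses.
    """
    out = subprocess.check_output([nodetool, 'describecluster'],
                                  universal_newlines=True)
    versions = {}
    in_schema_section = False
    for line in out.splitlines():
        line = line.strip()
        if line.startswith('Schema versions:'):
            in_schema_section = True
            continue
        if in_schema_section and ':' in line:
            version, _, hosts = line.partition(':')
            versions[version.strip()] = [
                h.strip() for h in hosts.strip().strip('[]').split(',') if h.strip()]
    return versions

if __name__ == '__main__':
    versions = describe_schema_versions()
    if len(versions) != 1:
        # More than one entry means disagreeing and/or unreachable nodes,
        # which is exactly the state captured in the output below.
        print('Schema not in agreement: %s' % versions)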

There is also another issue: the contrail-cassandra-status script is not able to parse the nodetool status output of Cassandra 2.1; that needs fixing too.

root@nodeh1:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 40.43.1.7 1.23 GB 256 ? d95824d5-571d-44ce-9d97-7e643f53b78e rack1
UN 40.43.1.6 1.39 GB 256 ? 62dcbdfc-0d85-4d99-b84c-9b90bd792aba rack1
UN 40.43.1.5 1.3 GB 256 ? f42d739b-fe87-4645-8aac-c1fec665d77e rack1

Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
root@nodeh1:~# nodetool info
ID : 62dcbdfc-0d85-4d99-b84c-9b90bd792aba
Gossip active : true
Thrift active : true
Native Transport active: true
Load : 1.39 GB
Generation No : 1450164892
Uptime (seconds) : 125495
Heap Memory (MB) : 2076.05 / 8152.00
Off Heap Memory (MB) : 3.00
Data Center : datacenter1
Rack : rack1
Exceptions : 37
Key Cache : entries 18242, size 1.64 MB, capacity 100 MB, 480580 hits, 520230 requests, 0.924 recent hit rate, 14400 save period in seconds
Row Cache : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 requests, NaN recent hit rate, 0 save period in seconds
Counter Cache : entries 0, size 0 bytes, capacity 50 MB, 0 hits, 0 requests, NaN recent hit rate, 7200 save period in seconds
Token : (invoke with -T/--tokens to see all 256 tokens)
root@nodeh1:~# nodetool describecluster
Cluster Information:
Name: Contrail
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
408d512e-3a32-31ef-8eaa-961077d5ce80: [40.43.1.6, 40.43.1.5]

UNREACHABLE: [40.43.1.7]

root@nodeh1:~#

root@nodeh2:~# nodetool describecluster
Cluster Information:
Name: Contrail
Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
Schema versions:
408d512e-3a32-31ef-8eaa-961077d5ce80: [40.43.1.7, 40.43.1.5]

UNREACHABLE: [40.43.1.6]

root@nodeh2:~# nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 40.43.1.7 1.23 GB 256 ? d95824d5-571d-44ce-9d97-7e643f53b78e rack1
UN 40.43.1.6 1.39 GB 256 ? ...


Revision history for this message
Ankit Jain (ankitja) wrote :

2015-12-17 Thu 19:35:32:814.950 IST nodeh1 [Thread 140498530252544, Pid 18910]: Received Disconnect: nodeh1:Control:contrail-dns:0 Session:40.43.1.6:8086::40.43.1.6:50157(23)
2015-12-17 Thu 19:35:32:815.001 IST nodeh1 [Thread 140498530252544, Pid 18910]: nodeh1:Control:contrail-dns:0: DisconnectSession: Session:40.43.1.6:8086::40.43.1.6:50157(23)
2015-12-17 Thu 19:35:32:816.079 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodeh1:Control:contrail-dns:0> : ObjectDns:nodeh1 , DnsState
2015-12-17 Thu 19:35:32:816.180 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodeh1:Control:contrail-dns:0> : ObjectGeneratorInfo:nodeh1:Control:contrail-dns:0 , ModuleClientState
2015-12-17 Thu 19:35:32:816.727 IST nodeh1 [Thread 140498530252544, Pid 18910]: Received Disconnect: nodei4:Compute:contrail-vrouter-agent:0 Session:40.43.1.6:8086::40.40.40.5:33162(32)
2015-12-17 Thu 19:35:32:816.775 IST nodeh1 [Thread 140498530252544, Pid 18910]: nodei4:Compute:contrail-vrouter-agent:0: DisconnectSession: Session:40.43.1.6:8086::40.40.40.5:33162(32)
2015-12-17 Thu 19:35:32:818.518 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectVRouter:nodei4 , ComputeCpuState
2015-12-17 Thu 19:35:32:818.582 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectGeneratorInfo:nodei4:Compute:contrail-vrouter-agent:0 , ModuleClientState
2015-12-17 Thu 19:35:32:818.612 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectVRouter:nodei4 , NodeStatus
2015-12-17 Thu 19:35:32:818.658 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectPhysicalInterfaceTable:vlan4000 , UvePhysicalInterfaceAgent
2015-12-17 Thu 19:35:32:818.684 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectVNTable:__UNKNOWN__ , UveVirtualNetworkAgent
2015-12-17 Thu 19:35:32:818.709 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectVRouter:nodei4 , VrouterAgent
2015-12-17 Thu 19:35:32:818.732 IST nodeh1 [Thread 140498530252544, Pid 18910]: SyncDeleteUVE <nodei4:Compute:contrail-vrouter-agent:0> : ObjectVRouter:nodei4 , VrouterStatsAgent
2015-12-17 Thu 19:35:32:927.820 IST nodeh1 [Thread 140498563827456, Pid 18910]: nodeh2:Config:contrail-discovery:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1456: 10: TimedOutException: Default TException.
2015-12-17 Thu 19:35:34:743.983 IST nodeh1 [Thread 140498568025856, Pid 18910]: nodeh2:Analytics:contrail-alarm-gen:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1456: 16: TimedOutException: Default TException.
2015-12-17 Thu 19:35:34:824.238 IST nodeh1 [Thread 140498530252544, Pid 18910]: nodeh2:Control:contrail-control:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1456: 10: TimedOutException: Default TException.
2015-12-17 Thu 19:35:34:928.737 IST nodeh1 [Thread 14049856382745...


Revision history for this message
Megh Bhatt (meghb) wrote :

With CQL driver changes, we are creating TABLES at QUORUM - https://bugs.launchpad.net/juniperopenstack/+bug/1522930
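For illustration only (the Contrail database client itself is C++, as the thrift_if_impl.cc paths in the logs above show), here is a minimal sketch of what creating tables at QUORUM looks like with the DataStax python-driver; the node IPs are taken from this report, and the keyspace/table names are made up:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Connect to the three database nodes from this report.
cluster = Cluster(['40.43.1.5', '40.43.1.6', '40.43.1.7'])
session = cluster.connect()

# Hypothetical keyspace/table, created at QUORUM so a majority of nodes
# must acknowledge the request.
session.execute(SimpleStatement(
    "CREATE KEYSPACE IF NOT EXISTS analytics_demo WITH replication = "
    "{'class': 'SimpleStrategy', 'replication_factor': 3}",
    consistency_level=ConsistencyLevel.QUORUM))
session.execute(SimpleStatement(
    "CREATE TABLE IF NOT EXISTS analytics_demo.message_table "
    "(key uuid PRIMARY KEY, value text)",
    consistency_level=ConsistencyLevel.QUORUM))

# The driver additionally waits for schema agreement across reachable
# nodes before returning from a schema-altering statement.
cluster.shutdown()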

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/15884
Submitter: Megh Bhatt (<email address hidden>)

Revision history for this message
Megh Bhatt (meghb) wrote :

I have submitted a review for the contrail-cassandra-status parsing failure issue.

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/15884
Committed: http://github.com/Juniper/contrail-controller/commit/ff60d485698e9833605640a948bcacd374248c29
Submitter: Zuul
Branch: master

commit ff60d485698e9833605640a948bcacd374248c29
Author: Megh Bhatt <email address hidden>
Date: Thu Dec 17 16:58:27 2015 -0800

Do not fail nodetool status parsing if expected number of columns
are not present in the output line
Closes-Bug: #1527225

Change-Id: I6143c33d214891bf9567a7fabdd47995b8a83426
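The idea of the fix, sketched below as a hypothetical, simplified Python stand-in for the actual contrail-cassandra-status parsing code: pull the per-node Status/State code and address out of nodetool status, and skip rows that do not have the expected number of columns instead of failing the whole parse.

import subprocess

def get_cassandra_node_states(nodetool='nodetool'):
    """Return {address: status_code} parsed from `nodetool status`.

    Cassandra 2.1 data rows look like:
      UN 40.43.1.7 1.23 GB 256 ? d95824d5-571d-44ce-9d97-7e643f53b78e rack1
    where the 'Owns' column may be just '?'.
    """
    output = subprocess.check_output([nodetool, 'status'],
                                     universal_newlines=True)
    states = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows start with a two-letter Status/State code (UN, DN, UJ, ...).
        if not fields or len(fields[0]) != 2 or fields[0][0] not in 'UD':
            continue
        if len(fields) < 2:
            # Tolerate short or unexpected rows rather than raising.
            continue
        states[fields[1]] = fields[0]
    return states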
