Build 2680: Contrail collector timing out intermittently

Bug #1522703 reported by Ankit Jain
This bug affects 2 people
Affects: Juniper Openstack (status tracked in Trunk)

Series    Status         Importance  Assigned to  Milestone
R2.20     Fix Committed  High        Nipa
R2.21.x   Fix Committed  High        Nipa
R2.22.x   Fix Committed  High        Nipa
Trunk     Fix Committed  High        Nipa

Bug Description

Issue: Contrail collector is timing out because discovery is not receiving heartbeats properly.

This is happening on my setup, which has 3 collectors:

env.roledefs = {

    'all': [host1, host2, host3, host4, host5, host6, host7],
    'cfgm': [host1, host2, host3],
    'webui': [host1],
    'openstack': [host1],
    'control': [host2, host3],
    'collector': [host1, host2, host3],
    'database': [host1, host2, host3],
    'compute': [host4, host5, host6, host7],

}

env.hostnames = {
    'all': ['nodeg32', 'nodeh1', 'nodeh2', 'nodeh6', 'nodei4', 'nodei5', 'nodeh7']

}

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector timeout
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

Discovery server /services page:
Service Type Remote IP Service Id Oper State Admin State In Use Time since last Heartbeat
OpServer 40.43.1.5 nodeg32:OpServer up up 1 0:00:01
OpServer 40.43.1.6 nodeh1:OpServer up up 1 0:00:02
OpServer 40.43.1.7 nodeh2:OpServer up up 1 0:00:01
AlarmGenerator 40.43.1.5 nodeg32:AlarmGenerator up up 0 0:00:05
AlarmGenerator 40.43.1.6 nodeh1:AlarmGenerator up up 0 0:00:03
AlarmGenerator 40.43.1.7 nodeh2:AlarmGenerator up up 0 0:00:03
IfmapServer 40.43.1.5 nodeg32:IfmapServer up up 3 0:00:07
IfmapServer 40.43.1.6 nodeh1:IfmapServer up up 2 0:00:09
IfmapServer 40.43.1.7 nodeh2:IfmapServer up up 3 0:00:03
dns-server 40.43.1.6 nodeh1:dns-server up up 5 0:00:00
dns-server 40.43.1.7 nodeh2:dns-server up up 5 0:00:01
xmpp-server 40.43.1.6 nodeh1:xmpp-server up up 4 0:00:03
xmpp-server 40.43.1.7 nodeh2:xmpp-server up up 4 0:00:01
Collector 40.43.1.5 nodeg32:Collector up up 17 0:00:07
Collector 40.43.1.6 nodeh1:Collector up up 23 0:00:04
Collector 40.43.1.7 nodeh2:Collector up up 30 0:00:04
ApiServer 40.43.1.5 nodeg32:ApiServer up up 1 0:00:07
ApiServer 40.43.1.6 nodeh1:ApiServer up up 1 0:00:04
ApiServer 40.43.1.7 nodeh2:ApiServer up up 1 0:00:03

/var/log/contrail-collector:
2015-12-04 Fri 12:44:27:860.073 IST nodeg32 [Thread 140232816654080, Pid 30179]: nodeg32:Analytics:contrail-analytics-api:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 6: TimedOutException: Default TException.
2015-12-04 Fri 12:44:27:860.284 IST nodeg32 [Thread 140232820852480, Pid 30179]: nodeg32:Global: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 7: TimedOutException: Default TException.
2015-12-04 Fri 12:44:28:533.860 IST nodeg32 [Thread 140232787277568, Pid 30179]: nodeg32:Analytics:contrail-collector:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 16: TimedOutException: Default TException.
2015-12-04 Fri 12:44:30:657.640 IST nodeg32 [Thread 140232816654080, Pid 30179]: nodeg32:Analytics:contrail-collector:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 20: TimedOutException: Default TException.

Tags: discovery
Ankit Jain (ankitja)
Changed in juniperopenstack:
importance: Undecided → Critical
importance: Critical → High
Revision history for this message
Ankit Jain (ankitja) wrote :

Setup :

    'all': [host1, host2, host3, host4, host5, host6, host7],
    'cfgm': [host1, host2, host3],
    'webui': [host1],
    'openstack': [host1],
    'control': [host2, host3],
    'collector': [host1, host2, host3],
    'database': [host1, host2, host3],
    'compute': [host4, host5, host6, host7],

env.hostnames = {
    'all': ['nodeg32', 'nodeh1', 'nodeh2', 'nodeh6', 'nodei4', 'nodei5', 'nodeh7']

}

description: updated
Revision history for this message
Nipa (nipak) wrote :

On the client side, the publish request and its response at times take more than 4 sec. The next publish is sent 5 sec after a publish response is received. This results in publishes arriving at the discovery server end more than 5 sec apart, and hence the service moves to yellow (a rough timing sketch follows below).

1) We will need to send a publish request periodically, irrespective of whether a publish response has been received (if publish is to act as a heartbeat).
2) We need to debug the reason behind the publish response taking so long. Debugging shows this is not a) a Cassandra access issue, or b) a delay between request and response at the discovery end. This points mainly to a delay at the ha-proxy end.
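
A minimal sketch of the timing described above, assuming a simple loop in which the next publish is scheduled only after the previous response (or timeout) arrives; the names (publish_loop, send_publish, PUBLISH_INTERVAL) are illustrative, not the actual contrail discovery-client code:

import time

PUBLISH_INTERVAL = 5   # seconds between publishes, counted from the response
CLIENT_TIMEOUT = 4     # seconds the client waits for a publish response

def publish_loop(send_publish):
    # Illustrative pre-fix behaviour: each iteration waits for the response
    # (up to 4 s through ha-proxy) and only then sleeps 5 s, so successive
    # publishes can arrive (response time + 5) s apart, more than the 5 s
    # the discovery server expects, and the service drops to yellow.
    while True:
        try:
            send_publish(timeout=CLIENT_TIMEOUT)
        except TimeoutError:
            pass
        time.sleep(PUBLISH_INTERVAL)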

Revision history for this message
Nipa (nipak) wrote :

The issue was also seen on a Symantec setup with build R2.21 #15; the load on Cassandra is high, and hence heartbeat updates to Cassandra take longer.

In the above case, the heartbeat response from the discovery server is not received by the client within 4 seconds, so the client times out and starts sending publish after 5 seconds.

The Discovery Client needs to be more resilient to Cassandra loads or misses, so the client should send heartbeats (R2.21) at least 3 times before sending a publish (sketched below).

For later releases, where publish and heartbeat are the same message, the publish should be sent right away after the 4 sec timeout, without waiting.
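
A rough sketch of the R2.21 behaviour proposed above, assuming a heartbeat timer with a publish fallback; the class and method names are assumptions for illustration, not the actual contrail client:

MAX_HB_TIMEOUTS = 3   # tolerate three missed heartbeat responses

class DiscoveryClientSketch:
    def __init__(self, send_heartbeat, send_publish):
        self.send_heartbeat = send_heartbeat
        self.send_publish = send_publish
        self.hb_timeouts = 0

    def on_heartbeat_timer(self):
        try:
            self.send_heartbeat(timeout=4)
            self.hb_timeouts = 0          # response arrived, reset the count
        except TimeoutError:
            self.hb_timeouts += 1
            if self.hb_timeouts >= MAX_HB_TIMEOUTS:
                # Only after three consecutive misses fall back to a full
                # publish, instead of re-publishing on every single miss.
                self.send_publish()
                self.hb_timeouts = 0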

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/16535
Submitter: Nipa Kumar (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16802
Submitter: Nipa Kumar (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16535
Committed: http://github.org/Juniper/contrail-controller/commit/8ea2799a70a28825f69af8e48ed7d28839bcc532
Submitter: Zuul
Branch: R2.21.x

commit 8ea2799a70a28825f69af8e48ed7d28839bcc532
Author: Nipa Kumar <email address hidden>
Date: Tue Jan 26 16:51:16 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

In R2.21.x, a periodic heartbeat from each service is used to declare the aliveness of that service.
In cases where the Cassandra updates of heartbeat messages take longer than 4 secs
(the discovery client timeout), the discovery client needs to be more resilient before
sending a publish again. On heartbeat timeout, the publish send interval was based on the
number of attempts made, which is a varying interval; the send time could exceed 15 secs,
which would result in services being declared DOWN at the Discovery Server end.

Services are declared down if three heartbeats are missed on the discovery server.

The solution is to send a publish message only after 3 heartbeat response timeouts at the
client end.

Change-Id: I36b9862afe170becfe967e49280052965a71a680
Partial-Bug:1522703

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/16802
Committed: http://github.org/Juniper/contrail-controller/commit/35663e91e6de3bac22be6162624ad08894e54700
Submitter: Zuul
Branch: master

commit 35663e91e6de3bac22be6162624ad08894e54700
Author: Nipa Kumar <email address hidden>
Date: Tue Feb 2 16:09:54 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

A periodic (every 5 sec) publish message is used to declare the aliveness and status
of a service. In cases where the discovery client gets a timeout error (due to
no response from the Discovery Server for 4 secs), the client waits for another
publish heartbeat timeout before sending the next publish, which could be more than
9 secs later; subsequent failures result in the service not being updated, so the
service times out and is marked DOWN.

The fix is to be more resilient to timeouts and resend the publish immediately
following a timeout failure.

As the heartbeat is no longer used to declare the liveness of a service, remove the
heartbeat counters. Add additional last-subscribe TTL counters.

Realign traces and logs based on the checksum of the message.

Change-Id: I10cd60bcf5888e92d636e4b9de629bc91a52f0b9
Partial-Bug:1522703
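
For reference, a minimal sketch of the master-branch behaviour this commit describes, under the assumption that a timer callback returns the delay until the next publish; the names (on_publish_timer, send_publish) are illustrative, not the actual code:

def on_publish_timer(send_publish, interval=5, timeout=4):
    # On master, publish doubles as the heartbeat.
    try:
        send_publish(timeout=timeout)
        return interval      # normal case: next publish in 5 s
    except TimeoutError:
        return 0             # resend right away so the discovery server's
                             # record keeps getting refreshed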

Revision history for this message
Nipa (nipak) wrote :

1) There are some automatic deletes using Cassandra TTL
2) There are frequent updates of column value (read, modify and write)
3) There is only one row (and many columns – hundreds)

Any red flags?


Deepinder

From: Megh Bhatt <email address hidden>
Date: Tuesday, February 2, 2016 at 1:59 PM
To: Raj Reddy <email address hidden>, Nipa Kumar <email address hidden>
Cc: Deepinder SETIA <email address hidden>, "Ankit Jain (MVI)" <email address hidden>
Subject: Re: http://nodeg13.englab.juniper.net:5998/services

I’m not sure this is a Cassandra delay; is there a reason for assuming that?

From nodetool cfhistograms DISCOVERY_SERVER discovery, below are the write and read latencies, and as you can see they are nowhere near seconds:

root@nodeg13:~# nodetool cfhistograms DISCOVERY_SERVER discovery
DISCOVERY_SERVER/discovery histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         1.00      17.00          72.00         4768            20
75%         1.00      24.00          124.00        9887            50
95%         1.00      35.00          372.00        105778          372
98%         1.00      35.00          535.00        105778          372
99%         1.00      42.00          535.00        105778          372
Min         0.00      2.00           11.00         3312            11
Max         1.00      51012.00       51012.00      105778          372

I do see tombstones for discovery

----------------
Keyspace: DISCOVERY_SERVER
Read Count: 417438
Read Latency: 0.11146106248113492 ms.
Write Count: 406572
Write Latency: 0.01853075224068554 ms.
Pending Flushes: 0
Table: discovery
SSTable count: 1
Space used (live): 28223
Space used (total): 28223
Space used by snapshots (total): 0
Off heap memory used (total): 52
SSTable Compression Ratio: 0.17103882476390347
Number of keys (estimate): 15
Memtable cell count: 185544
Memtable data size: 211925
Memtable off heap memory used: 0
Memtable switch count: 1
Local read count: 417438
Local read latency: 0.112 ms
Local write count: 406572
Local write latency: 0.019 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 20
Compression metadata off heap memory used: 16
Compacted partition minimum bytes: 3312
Compacted partition maximum bytes: 105778
Compacted partition mean bytes: 19952
Average live cells per slice (last five minutes): 5.755028477645803
Maximum live cells per slice (last five minutes): 40.0
Average tombstones per slice (last five minutes): 2.4407406537272656
Maximum tombstones per slice (last five minutes): 23.0

————————

Are deletes expected? I thought no deletes were expected, and hence no tombstones.

Thanks

Megh

Ankit Jain (ankitja)
information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16902
Submitter: Deepinder Setia (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/19254
Submitter: Deepinder Setia (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19254
Committed: http://github.org/Juniper/contrail-controller/commit/9560ca26ac0172cddacfe824dd40e6d41354032e
Submitter: Zuul
Branch: R2.22.x

commit 9560ca26ac0172cddacfe824dd40e6d41354032e
Author: Nipa Kumar <email address hidden>
Date: Tue Jan 26 16:51:16 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

In R2.21.x, a periodic heartbeat from each service is used to declare the aliveness of that service.
In cases where the Cassandra updates of heartbeat messages take longer than 4 secs
(the discovery client timeout), the discovery client needs to be more resilient before
sending a publish again. On heartbeat timeout, the publish send interval was based on the
number of attempts made, which is a varying interval; the send time could exceed 15 secs,
which would result in services being declared DOWN at the Discovery Server end.

Services are declared down if three heartbeats are missed on the discovery server.

The solution is to send a publish message only after 3 heartbeat response timeouts at the
client end.

Change-Id: I36b9862afe170becfe967e49280052965a71a680
Partial-Bug:1522703
(cherry picked from commit 8ea2799a70a28825f69af8e48ed7d28839bcc532)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/19495
Submitter: Nipa Kumar (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19495
Committed: http://github.org/Juniper/contrail-controller/commit/67133102c8185e897d8173667e05f2365f7f1153
Submitter: Zuul
Branch: R2.20

commit 67133102c8185e897d8173667e05f2365f7f1153
Author: Nipa Kumar <email address hidden>
Date: Tue Jan 26 16:51:16 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

In R2.21.x, a periodic heartbeat from each service is used to declare the aliveness of that service.
In cases where the Cassandra updates of heartbeat messages take longer than 4 secs
(the discovery client timeout), the discovery client needs to be more resilient before
sending a publish again. On heartbeat timeout, the publish send interval was based on the
number of attempts made, which is a varying interval; the send time could exceed 15 secs,
which would result in services being declared DOWN at the Discovery Server end.

Services are declared down if three heartbeats are missed on the discovery server.

The solution is to send a publish message only after 3 heartbeat response timeouts at the
client end.

Change-Id: I36b9862afe170becfe967e49280052965a71a680
Partial-Bug:1522703
(cherry picked from commit 8ea2799a70a28825f69af8e48ed7d28839bcc532)
