Build 2680: Contrail collector timing out intermittently

Bug #1522703 reported by Ankit Jain
This bug affects 2 people
Affects: Juniper Openstack (status tracked in Trunk)

Series    Status         Importance  Assigned to  Milestone
R2.20     Fix Committed  High        Nipa
R2.21.x   Fix Committed  High        Nipa
R2.22.x   Fix Committed  High        Nipa
Trunk     Fix Committed  High        Nipa

Bug Description

Issue: Contrail collector is timing out because discovery is not receiving heartbeats properly.

This is happening on my setup, which has 3 collectors:

env.roledefs = {

    'all': [host1, host2, host3, host4, host5, host6, host7],
    'cfgm': [host1, host2, host3],
    'webui': [host1],
    'openstack': [host1],
    'control': [host2, host3],
    'collector': [host1, host2, host3],
    'database': [host1, host2, host3],
    'compute': [host4, host5, host6, host7],

}

env.hostnames = {
    'all': ['nodeg32', 'nodeh1', 'nodeh2', 'nodeh6', 'nodei4', 'nodei5', 'nodeh7']

}

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector timeout
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

Discovery server /services page:
Service Type Remote IP Service Id Oper State Admin State In Use Time since last Heartbeat
OpServer 40.43.1.5 nodeg32:OpServer up up 1 0:00:01
OpServer 40.43.1.6 nodeh1:OpServer up up 1 0:00:02
OpServer 40.43.1.7 nodeh2:OpServer up up 1 0:00:01
AlarmGenerator 40.43.1.5 nodeg32:AlarmGenerator up up 0 0:00:05
AlarmGenerator 40.43.1.6 nodeh1:AlarmGenerator up up 0 0:00:03
AlarmGenerator 40.43.1.7 nodeh2:AlarmGenerator up up 0 0:00:03
IfmapServer 40.43.1.5 nodeg32:IfmapServer up up 3 0:00:07
IfmapServer 40.43.1.6 nodeh1:IfmapServer up up 2 0:00:09
IfmapServer 40.43.1.7 nodeh2:IfmapServer up up 3 0:00:03
dns-server 40.43.1.6 nodeh1:dns-server up up 5 0:00:00
dns-server 40.43.1.7 nodeh2:dns-server up up 5 0:00:01
xmpp-server 40.43.1.6 nodeh1:xmpp-server up up 4 0:00:03
xmpp-server 40.43.1.7 nodeh2:xmpp-server up up 4 0:00:01
Collector 40.43.1.5 nodeg32:Collector up up 17 0:00:07
Collector 40.43.1.6 nodeh1:Collector up up 23 0:00:04
Collector 40.43.1.7 nodeh2:Collector up up 30 0:00:04
ApiServer 40.43.1.5 nodeg32:ApiServer up up 1 0:00:07
ApiServer 40.43.1.6 nodeh1:ApiServer up up 1 0:00:04
ApiServer 40.43.1.7 nodeh2:ApiServer up up 1 0:00:03

/var/log/contrail-collector:
2015-12-04 Fri 12:44:27:860.073 IST nodeg32 [Thread 140232816654080, Pid 30179]: nodeg32:Analytics:contrail-analytics-api:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 6: TimedOutException: Default TException.
2015-12-04 Fri 12:44:27:860.284 IST nodeg32 [Thread 140232820852480, Pid 30179]: nodeg32:Global: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 7: TimedOutException: Default TException.
2015-12-04 Fri 12:44:28:533.860 IST nodeg32 [Thread 140232787277568, Pid 30179]: nodeg32:Analytics:contrail-collector:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 16: TimedOutException: Default TException.
2015-12-04 Fri 12:44:30:657.640 IST nodeg32 [Thread 140232816654080, Pid 30179]: nodeg32:Analytics:contrail-collector:0: Db_BatchAddColumn:controller/src/database/cassandra/thrift/thrift_if_impl.cc:1455: 20: TimedOutException: Default TException.

Tags: discovery
Ankit Jain (ankitja)
Changed in juniperopenstack:
importance: Undecided → Critical
importance: Critical → High
Revision history for this message
Ankit Jain (ankitja) wrote :

Setup :

    'all': [host1, host2, host3, host4, host5, host6, host7],
    'cfgm': [host1, host2, host3],
    'webui': [host1],
    'openstack': [host1],
    'control': [host2, host3],
    'collector': [host1, host2, host3],
    'database': [host1, host2, host3],
    'compute': [host4, host5, host6, host7],

env.hostnames = {
    'all': ['nodeg32', 'nodeh1', 'nodeh2', 'nodeh6', 'nodei4', 'nodei5', 'nodeh7']

}

description: updated
Revision history for this message
Nipa (nipak) wrote :

On the client side, the publish request and its response at times take more than 4 sec. The next publish is sent 5 sec after a publish response is received. This results in publishes arriving at the discovery server end more than 5 sec apart, and hence the service moves to yellow (a rough timing sketch follows below).

1) We will need to send a publish request periodically, irrespective of whether a publish response has been received (if publish is to act as a heartbeat).
2) We need to debug the reason behind the publish response taking so long. Debugging shows this is not a) a Cassandra access issue, or b) a delay between request and response at the discovery end. This points mainly to a delay at the ha-proxy end.
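
A minimal sketch of the timing described above, assuming a simple loop in which the next publish is scheduled only after the previous response (or timeout) arrives; the names (publish_loop, send_publish, PUBLISH_INTERVAL) are illustrative, not the actual contrail discovery-client code:

import time

PUBLISH_INTERVAL = 5   # seconds between publishes, counted from the response
CLIENT_TIMEOUT = 4     # seconds the client waits for a publish response

def publish_loop(send_publish):
    # Illustrative pre-fix behaviour: each iteration waits for the response
    # (up to 4 s through ha-proxy) and only then sleeps 5 s, so successive
    # publishes can arrive (response time + 5) s apart, more than the 5 s
    # the discovery server expects, and the service drops to yellow.
    while True:
        try:
            send_publish(timeout=CLIENT_TIMEOUT)
        except TimeoutError:
            pass
        time.sleep(PUBLISH_INTERVAL)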

Revision history for this message
Nipa (nipak) wrote :

The issue was also seen on a Symantec setup with build R2.21 #15; the load on Cassandra is high, and hence heartbeat updates to Cassandra take longer.

In the above case, the heartbeat response from the discovery server is not received by the client within 4 seconds, so the client times out and starts sending publish after 5 seconds.

The Discovery Client needs to be more resilient to Cassandra loads or misses, so the client should send heartbeats (R2.21) at least 3 times before sending a publish (sketched below).

For later releases, where publish and heartbeat are the same message, the publish should be sent right away after the 4 sec timeout, without waiting.
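
A rough sketch of the R2.21 behaviour proposed above, assuming a heartbeat timer with a publish fallback; the class and method names are assumptions for illustration, not the actual contrail client:

MAX_HB_TIMEOUTS = 3   # tolerate three missed heartbeat responses

class DiscoveryClientSketch:
    def __init__(self, send_heartbeat, send_publish):
        self.send_heartbeat = send_heartbeat
        self.send_publish = send_publish
        self.hb_timeouts = 0

    def on_heartbeat_timer(self):
        try:
            self.send_heartbeat(timeout=4)
            self.hb_timeouts = 0          # response arrived, reset the count
        except TimeoutError:
            self.hb_timeouts += 1
            if self.hb_timeouts >= MAX_HB_TIMEOUTS:
                # Only after three consecutive misses fall back to a full
                # publish, instead of re-publishing on every single miss.
                self.send_publish()
                self.hb_timeouts = 0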

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.21.x

Review in progress for https://review.opencontrail.org/16535
Submitter: Nipa Kumar (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16802
Submitter: Nipa Kumar (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/16535
Committed: http://github.org/Juniper/contrail-controller/commit/8ea2799a70a28825f69af8e48ed7d28839bcc532
Submitter: Zuul
Branch: R2.21.x

commit 8ea2799a70a28825f69af8e48ed7d28839bcc532
Author: Nipa Kumar <email address hidden>
Date: Tue Jan 26 16:51:16 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

In R2.21.x, a periodic heartbeat from each service is used to declare the aliveness of that service.
In cases where the Cassandra updates of heartbeat messages take longer than 4 secs
(the discovery client timeout), the discovery client needs to be more resilient before
sending a publish again. On heartbeat timeout, the publish send interval was based on the
number of attempts made, which is a varying interval; the send time could exceed 15 secs,
which would result in services being declared DOWN at the Discovery Server end.

Services are declared down if three heartbeats are missed on the discovery server.

The solution is to send a publish message only after 3 heartbeat response timeouts at the
client end.

Change-Id: I36b9862afe170becfe967e49280052965a71a680
Partial-Bug:1522703

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/16802
Committed: http://github.org/Juniper/contrail-controller/commit/35663e91e6de3bac22be6162624ad08894e54700
Submitter: Zuul
Branch: master

commit 35663e91e6de3bac22be6162624ad08894e54700
Author: Nipa Kumar <email address hidden>
Date: Tue Feb 2 16:09:54 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

A periodic (every 5 sec) publish message is used to declare the aliveness and status
of a service. In cases where the discovery client gets a timeout error (due to
no response from the Discovery Server for 4 secs), the client waits for another
publish heartbeat timeout before sending the next publish, which could be more than
9 secs later; subsequent failures result in the service not being updated, so the
service times out and is marked DOWN.

The fix is to be more resilient to timeouts and resend the publish immediately
following a timeout failure.

As the heartbeat is no longer used to declare the liveness of a service, remove the
heartbeat counters. Add additional last-subscribe TTL counters.

Realign traces and logs based on the checksum of the message.

Change-Id: I10cd60bcf5888e92d636e4b9de629bc91a52f0b9
Partial-Bug:1522703
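
For reference, a minimal sketch of the master-branch behaviour this commit describes, under the assumption that a timer callback returns the delay until the next publish; the names (on_publish_timer, send_publish) are illustrative, not the actual code:

def on_publish_timer(send_publish, interval=5, timeout=4):
    # On master, publish doubles as the heartbeat.
    try:
        send_publish(timeout=timeout)
        return interval      # normal case: next publish in 5 s
    except TimeoutError:
        return 0             # resend right away so the discovery server's
                             # record keeps getting refreshed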

Revision history for this message
Nipa (nipak) wrote :

1) There are some automatic deletes using Cassandra TTL
2) There are frequent updates of column value (read, modify and write)
3) There is only one row (and many columns – hundreds)

Any red flags?


Deepinder

From: Megh Bhatt <email address hidden>
Date: Tuesday, February 2, 2016 at 1:59 PM
To: Raj Reddy <email address hidden>, Nipa Kumar <email address hidden>
Cc: Deepinder SETIA <email address hidden>, "Ankit Jain (MVI)" <email address hidden>
Subject: Re: http://nodeg13.englab.juniper.net:5998/services

I’m not sure this is a Cassandra delay; is there a reason for assuming that?

From nodetool cfhistograms DISCOVERY_SERVER discovery, below are the write and read latencies, and as you can see they are nowhere near seconds:

root@nodeg13:~# nodetool cfhistograms DISCOVERY_SERVER discovery
DISCOVERY_SERVER/discovery histograms
Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
                      (micros)       (micros)      (bytes)
50%         1.00      17.00          72.00         4768            20
75%         1.00      24.00          124.00        9887            50
95%         1.00      35.00          372.00        105778          372
98%         1.00      35.00          535.00        105778          372
99%         1.00      42.00          535.00        105778          372
Min         0.00      2.00           11.00         3312            11
Max         1.00      51012.00       51012.00      105778          372

I do see tombstones for discovery

----------------
Keyspace: DISCOVERY_SERVER
Read Count: 417438
Read Latency: 0.11146106248113492 ms.
Write Count: 406572
Write Latency: 0.01853075224068554 ms.
Pending Flushes: 0
Table: discovery
SSTable count: 1
Space used (live): 28223
Space used (total): 28223
Space used by snapshots (total): 0
Off heap memory used (total): 52
SSTable Compression Ratio: 0.17103882476390347
Number of keys (estimate): 15
Memtable cell count: 185544
Memtable data size: 211925
Memtable off heap memory used: 0
Memtable switch count: 1
Local read count: 417438
Local read latency: 0.112 ms
Local write count: 406572
Local write latency: 0.019 ms
Pending flushes: 0
Bloom filter false positives: 0
Bloom filter false ratio: 0.00000
Bloom filter space used: 24
Bloom filter off heap memory used: 16
Index summary off heap memory used: 20
Compression metadata off heap memory used: 16
Compacted partition minimum bytes: 3312
Compacted partition maximum bytes: 105778
Compacted partition mean bytes: 19952
Average live cells per slice (last five minutes): 5.755028477645803
Maximum live cells per slice (last five minutes): 40.0
Average tombstones per slice (last five minutes): 2.4407406537272656
Maximum tombstones per slice (last five minutes): 23.0

————————

Are deletes expected? I thought no deletes were expected, and hence no tombstones.

Thanks

Megh

Ankit Jain (ankitja)
information type: Proprietary → Public
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/16902
Submitter: Deepinder Setia (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22.x

Review in progress for https://review.opencontrail.org/19254
Submitter: Deepinder Setia (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19254
Committed: http://github.org/Juniper/contrail-controller/commit/9560ca26ac0172cddacfe824dd40e6d41354032e
Submitter: Zuul
Branch: R2.22.x

commit 9560ca26ac0172cddacfe824dd40e6d41354032e
Author: Nipa Kumar <email address hidden>
Date: Tue Jan 26 16:51:16 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

In R2.21.x, a periodic heartbeat from each service is used to declare the aliveness of that service.
In cases where the Cassandra updates of heartbeat messages take longer than 4 secs
(the discovery client timeout), the discovery client needs to be more resilient before
sending a publish again. On heartbeat timeout, the publish send interval was based on the
number of attempts made, which is a varying interval; the send time could exceed 15 secs,
which would result in services being declared DOWN at the Discovery Server end.

Services are declared down if three heartbeats are missed on the discovery server.

The solution is to send a publish message only after 3 heartbeat response timeouts at the
client end.

Change-Id: I36b9862afe170becfe967e49280052965a71a680
Partial-Bug:1522703
(cherry picked from commit 8ea2799a70a28825f69af8e48ed7d28839bcc532)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/19495
Submitter: Nipa Kumar (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/19495
Committed: http://github.org/Juniper/contrail-controller/commit/67133102c8185e897d8173667e05f2365f7f1153
Submitter: Zuul
Branch: R2.20

commit 67133102c8185e897d8173667e05f2365f7f1153
Author: Nipa Kumar <email address hidden>
Date: Tue Jan 26 16:51:16 2016 -0800

Discovery Client needs to be more resilient to loads on Discovery Server.

In R2.21.x, a periodic heartbeat from each service is used to declare the aliveness of that service.
In cases where the Cassandra updates of heartbeat messages take longer than 4 secs
(the discovery client timeout), the discovery client needs to be more resilient before
sending a publish again. On heartbeat timeout, the publish send interval was based on the
number of attempts made, which is a varying interval; the send time could exceed 15 secs,
which would result in services being declared DOWN at the Discovery Server end.

Services are declared down if three heartbeats are missed on the discovery server.

The solution is to send a publish message only after 3 heartbeat response timeouts at the
client end.

Change-Id: I36b9862afe170becfe967e49280052965a71a680
Partial-Bug:1522703
(cherry picked from commit 8ea2799a70a28825f69af8e48ed7d28839bcc532)
