2.10-39~havana: Openstack HA Failover tests resulted in Contrail UI Dashboard showing incorrect info.

Bug #1461761 reported by Sandeep Sridhar
This bug affects 1 person
Affects              Status            Importance   Assigned to   Milestone
Juniper Openstack    Status tracked in Trunk
  R2.20              Fix Committed     High         Raj Reddy
  Trunk              Fix Committed     High         Raj Reddy

Bug Description

Build used: 2.10-39~havana on Ubuntu 12.04

The roledefs in testbed.py were as follows:

#Role definition of the hosts.
env.roledefs = {
    'all': [host1, host2, host3, host4, host5],
    'cfgm': [host1, host2, host3],
    'openstack': [host1, host2, host3],
    'control': [host1, host2, host3],
    'compute': [host4, host5],
    'collector': [host1, host2, host3],
    'webui': [host1, host2, host3],
    'database': [host1, host2, host3],
    'build': [host_build],
    #'storage-master': [host1],
    #'storage-compute': [host4, host5, host6, host7, host8, host9, host10],
    # 'vgw': [host4, host5], # Optional, Only to enable VGW. Only compute can support vgw
    # 'tsn': [host10], # Optional, Only to enable TSN. Only compute can support TSN
    # 'toragent': [host10], # Optional, Only to enable Tor Agent. Only compute can support Tor Agent
    # 'backup':[backup_node], # only if the backup_node is defined
}

host1 = 10.204.74.1
host2 = 10.204.74.2
host3 = 10.204.74.3
host4 = 10.204.74.4
host5 = 10.204.74.5

After installation, I could see that the vRouter agents had established XMPP connections with host2 and host3, as shown below:

Host1:
root@budweiser:/opt/contrail/utils# netstat -an | grep :5269
tcp 0 0 0.0.0.0:5269 0.0.0.0:* LISTEN

Host2:
root@corona:~# netstat -an | grep :5269
tcp 0 0 0.0.0.0:5269 0.0.0.0:* LISTEN
tcp 0 0 10.204.74.2:5269 10.204.74.5:35461 ESTABLISHED
tcp 0 0 10.204.74.2:5269 10.204.74.4:41753 ESTABLISHED

Host3:
root@heineken:~# netstat -an | grep :5269
tcp 0 0 0.0.0.0:5269 0.0.0.0:* LISTEN
tcp 0 0 10.204.74.3:5269 10.204.74.5:42164 ESTABLISHED
tcp 0 0 10.204.74.3:5269 10.204.74.4:52994 ESTABLISHED

Host2 and Host3 each had connections from Host4 and Host5; Host1 had none.
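
For repeating this check across failover runs, the netstat output above can be summarized with a small script. This is only a sketch: the function name established_peers is hypothetical, and it assumes the six-column `netstat -an` layout shown above (proto, recv-q, send-q, local, remote, state).

```python
from collections import defaultdict

XMPP_PORT = 5269  # the port grepped for in the report


def established_peers(netstat_output, port=XMPP_PORT):
    """Map each local IP serving `port` to the set of remote IPs connected to it."""
    peers = defaultdict(set)
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto recv-q send-q local remote state
        if len(fields) != 6 or fields[5] != "ESTABLISHED":
            continue
        local_ip, _, local_port = fields[3].rpartition(":")
        remote_ip, _, _ = fields[4].rpartition(":")
        if local_port == str(port):
            peers[local_ip].add(remote_ip)
    return dict(peers)


# Host2 output copied from the report
sample = """\
tcp 0 0 0.0.0.0:5269 0.0.0.0:* LISTEN
tcp 0 0 10.204.74.2:5269 10.204.74.5:35461 ESTABLISHED
tcp 0 0 10.204.74.2:5269 10.204.74.4:41753 ESTABLISHED
"""
print(established_peers(sample))
```

A control node with no agent connections (Host1 before the failover) simply does not appear in the result.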

I disconnected Host2 from the network. Within 2-3 minutes, I could see that Host1 and Host3 had established XMPP connections with Host4 and Host5. This is the expected behavior.

However, the dashboard (Contrail UI) looked unstable, with red dots indicating "system processes unavailable". At times, under Control Nodes, it showed Host2 and Host3 even though Host2 was out of the network. The dashboard did not stabilize at all.

I went ahead and rebooted the nodes to clear the dashboard issue. From a functionality perspective I believe there is no problem, since the vRouter agents correctly established connections with Host1 when Host2 was pulled out of the network. However, the dashboard reports incorrect status. This is easily reproducible in the lab.

Please investigate.

Tags: analytics ui
Changed in juniperopenstack:
importance: Undecided → Medium
tags: added: ui
Sandeep Sridhar (ssandeep) wrote :

Can you please confirm in which version of Contrail this bug will be addressed?

- Sandeep.

Rahul (rahuls) wrote :

When I accessed the system afterwards, the 3 analytics nodes were showing different data for the control node UVEs.

Please comment on whether 2.20 will address similar issues that occur when discovery has problems.

Changed in juniperopenstack:
assignee: nobody → Raj Reddy (rajreddy)
Rahul (rahuls)
Changed in juniperopenstack:
importance: Medium → High
Rahul (rahuls)
tags: added: analytics
Raj Reddy (rajreddy) wrote :

The TCP keepalive mechanism doesn't seem to be working. It looks like the connection doesn't close and switch over until the send buffer gets full. We have to look into why TCP keepalive is not taking effect.

root@vse2100-4:/var/log/contrail# ss -e | grep 5.5.5.13
tcp SYN-SENT 0 1 5.5.5.14:47609 5.5.5.13:11211 timer:(on,304ms,0) uid:111 ino:27471385 sk:ffff88052705c600 <->
tcp ESTAB 0 0 5.5.5.14:35000 5.5.5.13:9160 ino:26239287 sk:ffff88050f418e00 <->
tcp ESTAB 0 0 5.5.5.14:afs3-fileserver 5.5.5.13:53622 ino:720801 sk:ffff88050f496900 <->
tcp ESTAB 0 0 5.5.5.14:9081 5.5.5.13:53323 uid:118 ino:711727 sk:ffff880b4f210e00 <->
tcp ESTAB 0 0 5.5.5.14:9081 5.5.5.13:35284 uid:118 ino:26901943 sk:ffff880b4f214d00 <->
tcp SYN-SENT 0 1 5.5.5.14:34160 5.5.5.13:mysql timer:(on,560ms,0) uid:111 ino:27484000 sk:ffff8805301fb100 <->
tcp ESTAB 0 0 5.5.5.14:9081 5.5.5.13:53694 uid:118 ino:27162139 sk:ffff8805897db100 <->
tcp SYN-SENT 0 1 5.5.5.14:34167 5.5.5.13:mysql timer:(on,892ms,0) uid:111 ino:27484010 sk:ffff8805301faa00 <->
tcp SYN-SENT 0 1 5.5.5.14:42928 5.5.5.13:afs3-fileserver timer:(on,19sec,6) ino:27449446 sk:ffff88051dd23800 <->
tcp ESTAB 0 0 5.5.5.14:40185 5.5.5.13:9160 uid:118 ino:27165580 sk:ffff8805de259c00 <->
tcp ESTAB 0 0 5.5.5.14:afs3-fileserver 5.5.5.13:34976 ino:26751407 sk:ffff880c1c71e900 <->
tcp SYN-SENT 0 1 5.5.5.14:34097 5.5.5.13:mysql timer:(on,1.052ms,1) uid:111 ino:27484482 sk:ffff880b442f5b00 <->
tcp SYN-SENT 0 1 5.5.5.14:50098 5.5.5.13:9100 timer:(on,224ms,0) uid:111 ino:27483990 sk:ffff88058627e900 <->
tcp ESTAB 0 65892 5.5.5.14:55247 5.5.5.13:8086 timer:(on,58sec,14) uid:118 ino:27226323 sk:ffff880530028700 <->
tcp ESTAB 0 0 5.5.5.14:35009 5.5.5.13:9160 ino:26240195 sk:ffff880b44362300 <->
tcp ESTAB 0 8993 5.5.5.14:40041 5.5.5.13:9160 timer:(on,1min,14) uid:118 ino:27170293 sk:ffff880b440d1500 <->
tcp SYN-SENT 0 1 5.5.5.14:35784 5.5.5.13:9697 timer:(on,812ms,0) uid:111 ino:27471393 sk:ffff88052705a300 <->
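
In the listing above, the ESTAB sockets with a non-zero Send-Q and a running timer are the ones still holding unacknowledged data for the unreachable peer. A small filter along these lines can pick them out; it is a sketch that assumes the `ss -e` column order shown above (netid, state, recv-q, send-q, local, peer), and the name stalled_sockets is hypothetical.

```python
def stalled_sockets(ss_output):
    """Return (local, peer, send_q) for ESTAB sockets that still have queued data."""
    stalled = []
    for line in ss_output.splitlines():
        fields = line.split()
        # Assumed columns: netid state recv-q send-q local peer [details...]
        if len(fields) < 6 or fields[1] != "ESTAB":
            continue
        if not fields[3].isdigit():
            continue
        send_q = int(fields[3])
        if send_q > 0:
            stalled.append((fields[4], fields[5], send_q))
    return stalled


# Two lines copied from the listing above
sample = """\
tcp ESTAB 0 0 5.5.5.14:35000 5.5.5.13:9160 ino:26239287 sk:ffff88050f418e00 <->
tcp ESTAB 0 65892 5.5.5.14:55247 5.5.5.13:8086 timer:(on,58sec,14) uid:118 ino:27226323 sk:ffff880530028700 <->
"""
for local, peer, queued in stalled_sockets(sample):
    print(local, "->", peer, "bytes queued:", queued)
```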

information type: Proprietary → Public
tags: added: releasenote
tags: added: quench
Raj Reddy (rajreddy)
tags: removed: quench
OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/12142
Submitter: Raj Reddy (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12164
Submitter: Raj Reddy (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12142
Committed: http://github.org/Juniper/contrail-controller/commit/36746a15c59cf4a7358c9a2549740c259aff4e5e
Submitter: Zuul
Branch: R2.20

commit 36746a15c59cf4a7358c9a2549740c259aff4e5e
Author: Raj Reddy <email address hidden>
Date: Wed Jul 1 15:32:26 2015 -0700

When the socket has data to send, then the TCP_KEEPALIVE timer doesn't get
activated and it takes much longer time before the socket is closed.
TCP_USER_TIMEOUT also needs to be set for the keepalive to function
when there's traffic on the tcp socket. This will cause the socket
to close when the remote end is not reachable.

Change-Id: I90b9ece162babc4ccc23498eec6082d20fa6c461
Partial-Bug: #1461761
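
The commit message above describes the approach: enable keepalive and also set TCP_USER_TIMEOUT, so that a socket with unacknowledged data is closed once the peer has been unreachable for a bounded time. The following is a minimal Linux-only sketch of that idea in Python, not the code from the actual change. The function name and the timer values (5 s idle, 3 s interval, 3 probes) are assumptions chosen for illustration.

```python
import socket

# Linux option number for TCP_USER_TIMEOUT, used only if this Python
# build does not export the constant (assumption: running on Linux).
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)


def enable_dead_peer_detection(sock, idle=5, interval=3, probes=3):
    """Enable keepalive and a matching user timeout on a TCP socket.

    The timer values are illustrative, not those used in Contrail.
    Returns the user timeout that was applied, in milliseconds.
    """
    # Keepalive probes cover the case where the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    # The user timeout covers the case where data is queued but unacknowledged;
    # here it is set to the same total time the keepalive probes would take.
    user_timeout_ms = (idle + interval * probes) * 1000
    sock.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, user_timeout_ms)
    return user_timeout_ms


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print("user timeout (ms):", enable_dead_peer_detection(sock))
sock.close()
```

Matching the user timeout to the total keepalive probe time is one possible design choice; it makes an idle connection and a busy connection give up on a dead peer after roughly the same interval.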

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/12165
Submitter: Raj Reddy (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12164
Committed: http://github.org/Juniper/contrail-controller/commit/5ed88321b7ba37285c855704b0ed2a4be6b28599
Submitter: Zuul
Branch: master

commit 5ed88321b7ba37285c855704b0ed2a4be6b28599
Author: Raj Reddy <email address hidden>
Date: Thu Jul 2 17:28:24 2015 -0700

When the socket has data to send, then the TCP_KEEPALIVE timer doesn't get
activated and it takes much longer time before the socket is closed.
TCP_USER_TIMEOUT also needs to be set for the keepalive to function
when there's traffic on the tcp socket. This will cause the socket
to close when the remote end is not reachable.
Partial-Bug: #1461761

Change-Id: I920f8e54798d27abe570814841749a7cea4ceb91

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/12165
Committed: http://github.org/Juniper/contrail-sandesh/commit/1c07be315cf158264fe89da5ff48940358f4a15a
Submitter: Zuul
Branch: R2.20

commit 1c07be315cf158264fe89da5ff48940358f4a15a
Author: Raj Reddy <email address hidden>
Date: Thu Jul 2 08:56:26 2015 -0700

When the socket has data to send, then the TCP_KEEPALIVE timer doesn't get
activated and it takes much longer time before the socket is closed.
TCP_USER_TIMEOUT also needs to be set for the keepalive to function
when there's traffic on the tcp socket. This will cause the socket
to close when the remote end is not reachable.

Change-Id: I14c45d05992c604691f2a4c1e7ddb4c741f8fc99
Closes-Bug: #1461761

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/12178
Submitter: Raj Reddy (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/12178
Committed: http://github.org/Juniper/contrail-sandesh/commit/40fdaf3e5471129bce5dfe92d49b4d9147688839
Submitter: Zuul
Branch: master

commit 40fdaf3e5471129bce5dfe92d49b4d9147688839
Author: Raj Reddy <email address hidden>
Date: Thu Jul 2 08:56:26 2015 -0700

When the socket has data to send, then the TCP_KEEPALIVE timer doesn't get
activated and it takes much longer time before the socket is closed.
TCP_USER_TIMEOUT also needs to be set for the keepalive to function
when there's traffic on the tcp socket. This will cause the socket
to close when the remote end is not reachable.

Change-Id: I14c45d05992c604691f2a4c1e7ddb4c741f8fc99
Closes-Bug: #1461761

Rahul (rahuls)
tags: removed: releasenote