R2.20 build 86 : Oper state down/Initial Registration for some of the xmpp servers

Bug #1488936 reported by Ankit Jain
This bug affects 2 people
Affects            Status                   Importance  Assigned to  Milestone
Juniper Openstack  Status tracked in Trunk
  R2.20            Fix Committed            High        Nipa
  Trunk            Fix Released             High        Nipa

Bug Description

Below is the discovery output from a freshly provisioned setup running build 86.
The oper state of one of the xmpp servers is not up, as shown below:

Service Type  Remote IP    Service Id           Oper State                 Admin State  In Use  Time since last Heartbeat
OpServer      22.22.22.34  nodea34:OpServer     up                         up           3       0:00:02
OpServer      22.22.22.35  nodea35:OpServer     up                         up           3       0:00:01
OpServer      22.22.22.53  nodec53:OpServer     up                         up           3       0:00:00
OpServer      22.22.22.55  nodec55:OpServer     up                         up           3       0:00:01
IfmapServer   22.22.22.34  nodea34:IfmapServer  up                         up           5       0:00:04
IfmapServer   22.22.22.35  nodea35:IfmapServer  up                         up           4       0:00:01
IfmapServer   22.22.22.53  nodec53:IfmapServer  up                         up           4       0:00:03
IfmapServer   22.22.22.54  nodec54:IfmapServer  up                         up           3       0:00:04
dns-server    22.22.22.34  nodea34:dns-server   up                         up           3       0:00:00
dns-server    22.22.22.35  nodea35:dns-server   up                         up           3       0:00:00
dns-server    22.22.22.53  nodec53:dns-server   up                         up           0       0:00:01
dns-server    22.22.22.54  nodec54:dns-server   up                         up           0       0:00:05
xmpp-server   22.22.22.34  nodea34:xmpp-server  down/Initial Registration  up           0       0:00:03
xmpp-server   22.22.22.35  nodea35:xmpp-server  up                         up           3       0:00:02
xmpp-server   22.22.22.53  nodec53:xmpp-server  up                         up           3       0:00:00
xmpp-server   22.22.22.54  nodec54:xmpp-server  up                         up           0       0:00:00
Collector     22.22.22.34  nodea34:Collector    up                         up           19      0:00:03
Collector     22.22.22.35  nodea35:Collector    up                         up           19      0:00:02
Collector     22.22.22.53  nodec53:Collector    up                         up           18      0:00:00
Collector     22.22.22.55  nodec55:Collector    up                         up           18      0:00:00
ApiServer     22.22.22.34  nodea34:ApiServer    up                         up           3       0:00:04
ApiServer     22.22.22.35  nodea35:ApiServer    up                         up           3       0:00:01
ApiServer     22.22.22.53  nodec53:ApiServer    up                         up           3       0:00:03
ApiServer     22.22.22.54  nodec54:ApiServer    up                         up           3       0:00:04
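
For anyone triaging similar output, a rough way to pick out the entries that are not up is sketched below. It assumes the whitespace-separated layout shown in this report (three leading columns, three trailing columns, and an oper state that may itself contain spaces). The function names are made up for illustration and are not part of any Contrail tool.

```python
def parse_discovery_row(line):
    """Split one row of the services table into named fields."""
    tokens = line.split()
    if len(tokens) < 7:
        raise ValueError("unexpected row: %r" % line)
    service_type, remote_ip, service_id = tokens[:3]
    admin_state, in_use, heartbeat_age = tokens[-3:]
    # The oper state can contain spaces ("down/Initial Registration"),
    # so it is whatever sits between the fixed leading and trailing columns.
    oper_state = " ".join(tokens[3:-3])
    return {
        "service_type": service_type,
        "remote_ip": remote_ip,
        "service_id": service_id,
        "oper_state": oper_state,
        "admin_state": admin_state,
        "in_use": int(in_use),
        "heartbeat_age": heartbeat_age,
    }


def services_not_up(lines):
    """Return parsed rows whose oper state is anything other than 'up'."""
    rows = [parse_discovery_row(line) for line in lines if line.strip()]
    return [row for row in rows if row["oper_state"] != "up"]


# Two rows copied from the table in this report.
SAMPLE = [
    "xmpp-server 22.22.22.34 nodea34:xmpp-server down/Initial Registration up 0 0:00:03",
    "xmpp-server 22.22.22.35 nodea35:xmpp-server up up 3 0:00:02",
]

for row in services_not_up(SAMPLE):
    print(row["service_id"], row["oper_state"])
```

Run against the two sample rows, this reports only nodea34:xmpp-server.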

contrail status is fine.

== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager active
contrail-discovery:0 active
contrail-schema active
contrail-svc-monitor active
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
supervisor-database: active
contrail-database active
contrail-database-nodemgr active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

contrail-control logs:
2015-08-26 Wed 03:12:50:968.339 PDT nodea34 [Thread 140047558211520, Pid 8184]: Starting Bgp Server at port 179
2015-08-26 Wed 03:12:50:971.860 PDT nodea34 [Thread 140047558211520, Pid 8184]: SANDESH: No Client: 1440583970971656 SandeshModuleClientTrace: data= [ name = nodea34:Control:contrail-control:0 client_info= [ status = Idle successful_connections = 0 pid = 8184 http_port = 8083 start_time = 1440583970971553 collector_name = primary = 0.0.0.0:0 secondary = 0.0.0.0:0 rx_socket_stats= [ bytes = 0 calls = 0 average_bytes = 0 blocked_duration = 00:00:00 blocked_count = 0 average_blocked_duration = errors = 0 ] tx_socket_stats= [ bytes = 0 calls = 0 average_bytes = 0 blocked_duration = 00:00:00 blocked_count = 0 average_blocked_duration = errors = 0 ] ] ]
2015-08-26 Wed 03:12:50:972.271 PDT nodea34 [Thread 140047398672128, Pid 8184]: SANDESH: Send FAILED: 1440583970972028 NodeStatusUVE: data= [ name = nodea34 process_status= [ [ [ module_id = contrail-control instance_id = 0 state = Non-Functional connection_infos= [ [ [ type = Discovery name = Collector server_addrs= [ [ (*_iter6) = 22.22.22.35:5998, ] ] status = Initializing description = Subscribe ], [ type = Discovery name = xmpp-server server_addrs= [ [ (*_iter6) = 22.22.22.35:5998, ] ] status = Initializing description = Publish ], ] ] description = Number of connections:2, Expected:5 ], ] ] ]
2015-08-26 Wed 03:12:50:972.326 PDT nodea34 [Thread 140047398672128, Pid 8184]: SANDESH: Send FAILED: 1440583970972091 NodeStatusUVE: data= [ name = nodea34 process_status= [ [ [ module_id = contrail-control instance_id = 0 state = Non-Functional connection_infos= [ [ [ type = Discovery name = Collector server_addrs= [ [ (*_iter6) = 22.22.22.35:5998, ] ] status = Initializing description = Subscribe ], [ type = Discovery name = IfmapServer server_addrs= [ [ (*_iter6) = 22.22.22.35:5998, ] ] status = Initializing description = Subscribe ], [ type = Discovery name = xmpp-server server_addrs= [ [ (*_iter6) = 22.22.22.35:5998, ] ] status = Initializing description = Publish ], ] ] description = Number of connections:3, Expected:5 ], ] ] ]

discovery link : http://nodea35:5998/services

testbed:
host1 = 'root@10.204.216.31'
host2 = 'root@10.204.216.30'
host3 = 'root@10.204.217.93'
host4 = 'root@10.204.217.94'
host5 = 'root@10.204.217.95'
host6 = 'root@10.204.217.96'

env.roledefs = {
    'all': [host1, host2, host3, host4, host5, host6],
    'cfgm': [host1, host2, host3, host4],
    'openstack': [host2],
    'webui': [host3, host1, host2],
    'control': [host1, host2, host3, host4],
    'compute': [host4, host5, host6, host5],
    'vgw': [host4, host5],
    'collector': [host1, host3, host2, host5],
    'database': [host1, host2, host3, host4],
    'build': [host_build],
}

Tags: discovery

Ankit Jain (ankitja) wrote :

Sometimes restarting the service also causes the xmpp-server service oper state to go into down/Initial Registration.

Nipa (nipak) wrote :

This is in preparation for the next phase, where the service will initially be marked DOWN. Once the control-node gets all the information from IfmapServer and has established BGP peer membership, the xmpp-server service is pulled UP. Hence you see the service as DOWN, but it should resolve very soon. Please keep the bug open if the xmpp-server service does not change to UP.
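
The gating described in this comment can be sketched roughly as follows. This is only an illustration under assumptions: the class and field names are hypothetical and are not taken from contrail-control, and the reason string simply mirrors the "down/Initial Registration" value seen in the discovery output.

```python
from dataclasses import dataclass


@dataclass
class ControlNodeReadiness:
    """Hypothetical readiness flags; not taken from contrail-control."""
    ifmap_config_received: bool = False
    bgp_peers_established: bool = False

    def published_oper_state(self):
        """Return (oper_state, oper_state_reason) to publish to discovery."""
        if self.ifmap_config_received and self.bgp_peers_established:
            return ("up", "")
        # Until both conditions hold, the service is advertised as down.
        return ("down", "Initial Registration")


readiness = ControlNodeReadiness()
print(readiness.published_oper_state())   # ('down', 'Initial Registration')

readiness.ifmap_config_received = True
readiness.bgp_peers_established = True
print(readiness.published_oper_state())   # ('up', '')
```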

Ankit Jain (ankitja) wrote :

This happens consistently; I also tried build 87. The xmpp-server services get stuck in the down/Initial Registration state.

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/13313
Submitter: Nipa Kumar (<email address hidden>)

Nischal Sheth (nsheth)
information type: Proprietary → Public

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/13407
Submitter: Karl Klashinsky (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13313
Committed: http://github.org/Juniper/contrail-controller/commit/8d0db481ca65b8b48f1bf80a1a1edaa9c5d8495e
Submitter: Zuul
Branch: R2.20

commit 8d0db481ca65b8b48f1bf80a1a1edaa9c5d8495e
Author: Nipa Kumar <email address hidden>
Date: Tue Aug 25 16:34:45 2015 -0700

Unconditionally send publish with updated oper-state or oper-state-reason

Background: Discovery client periodically sends heartbeat and
publish if there is change is publish information. Write to
cassandra db if not sequenced results in older stale oper-state
and oper-state-reason updated due to last db read during
heartbeat update instead of new published information.
(TODO: Fix Cassandra DB writes)

To overcome above issue, we are sending publish unconditionally
everytime.

Change-Id: Ia0ae8be877210df755613a9332804d97dad48f66
ParTial-Bug:1488936

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.20

Review in progress for https://review.opencontrail.org/13413
Submitter: Nipa Kumar (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13407
Committed: http://github.org/Juniper/contrail-controller/commit/90947f1b39e3d41c7d50963b6b44429dbf9a2528
Submitter: Zuul
Branch: master

commit 90947f1b39e3d41c7d50963b6b44429dbf9a2528
Author: Nipa Kumar <email address hidden>
Date: Tue Aug 25 16:34:45 2015 -0700

Unconditionally send publish with updated oper-state or oper-state-reason

Background: Discovery client periodically sends heartbeat and
publish if there is change is publish information. Write to
cassandra db if not sequenced results in older stale oper-state
and oper-state-reason updated due to last db read during
heartbeat update instead of new published information.
(TODO: Fix Cassandra DB writes)

To overcome above issue, we are sending publish unconditionally
everytime.

Change-Id: Ia0ae8be877210df755613a9332804d97dad48f66
ParTial-Bug:1488936

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13413
Committed: http://github.org/Juniper/contrail-controller/commit/2e69e2353f87bf0404c230f8a7bfdc57a8acbfe1
Submitter: Zuul
Branch: R2.20

commit 2e69e2353f87bf0404c230f8a7bfdc57a8acbfe1
Author: Nipa Kumar <email address hidden>
Date: Fri Aug 28 16:29:45 2015 -0700

For discovery clients using publish re-evaluate options, send only publish
periodically (there is no need to send heart-beat in addition to publish)

Change-Id: Ic4fccfe783bf7595ca5bd4c5b35385def58b819f
Partial-Bug:1488936

OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/13423
Submitter: Nipa Kumar (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/13423
Committed: http://github.org/Juniper/contrail-controller/commit/91127ba49247e52230da33ddbabd958717caed83
Submitter: Zuul
Branch: master

commit 91127ba49247e52230da33ddbabd958717caed83
Author: Nipa Kumar <email address hidden>
Date: Fri Aug 28 16:29:45 2015 -0700

For discovery clients using publish re-evaluate options, send only publish
periodically (there is no need to send heart-beat in addition to publish)

Change-Id: Ic4fccfe783bf7595ca5bd4c5b35385def58b819f
Partial-Bug:1488936

Nipa (nipak) wrote :

The discovery client fixed the issue by sending a "publish" and completely eliminating heartbeats for clients that have a re-evaluate callback registered.

The discovery server still needs to fix the cell update issue: on receiving a heartbeat, it should update only the relevant timestamp cell, because it could be reading stale data.
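
To illustrate the stale-write problem and the proposed server-side fix, here is a minimal simulation. It assumes a plain in-memory dict standing in for the service's row in Cassandra, and the cell names and the `simulate` function are invented for this sketch; it is not the discovery server's actual code.

```python
def simulate(timestamp_only):
    """Interleave one heartbeat with one publish and return the final row.

    timestamp_only=False models the problematic behaviour: the heartbeat
    handler writes back every cell it read earlier.
    timestamp_only=True models the proposed fix: the heartbeat handler
    touches only the timestamp cell.
    """
    row = {
        "oper_state": "down",
        "oper_state_reason": "Initial Registration",
        "heartbeat_ts": 0,
    }

    # 1. The heartbeat handler reads the row (this copy can go stale).
    snapshot = dict(row)

    # 2. A publish with the new oper state lands in between.
    row["oper_state"] = "up"
    row["oper_state_reason"] = ""

    # 3. The heartbeat handler finishes its update.
    if timestamp_only:
        row["heartbeat_ts"] = 100
    else:
        snapshot["heartbeat_ts"] = 100
        row.update(snapshot)  # overwrites the freshly published oper state

    return row


print(simulate(timestamp_only=False)["oper_state"])  # down (stale value wins)
print(simulate(timestamp_only=True)["oper_state"])   # up
```

With the full-row write, the published "up" is lost until the next publish arrives, which matches the reason the client was changed to publish unconditionally.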

OpenContrail Admin (ci-admin-f) wrote : [Review update] R2.22-dev

Review in progress for https://review.opencontrail.org/13927
Submitter: Vinay Vithal Mahuli (<email address hidden>)
