Build 2665: Alarms are not shown when one of the collectors is down

Bug #1513409 reported by Ankit Jain
This bug affects 1 person
Affects: Juniper Openstack (status tracked in Trunk)
Status: Fix Committed
Importance: High
Assigned to: Anish Mehta

Bug Description

The system has 3 collectors. When one of them went down, I stopped seeing the alarms that were being shown earlier.
Similarly, when some of the analytics services are down on one node, no alarms are shown for that either.

http://nodeg32:8081/analytics/alarms
{ }

http://nodeh1:8081/analytics/alarms
{ }

http://nodeh1:8081/analytics/alarms
{ }
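
For context on how the empty responses above were observed: a minimal sketch (node names and the standard analytics API port 8081 are taken from this report; the helper itself is hypothetical, not part of any fix) that polls /analytics/alarms on each analytics node and summarizes what each one returns:

import requests

NODES = ['nodeg32', 'nodeh1', 'nodeh2']

for node in NODES:
    url = 'http://%s:8081/analytics/alarms' % node
    try:
        alarms = requests.get(url, timeout=5).json()
    except (requests.RequestException, ValueError) as exc:
        print('%s: query failed (%s)' % (node, exc))
        continue
    # An empty dict ({}) means the API is reporting no alarms at all.
    print('%s: %d alarm table(s) %s' % (node, len(alarms), sorted(alarms.keys())))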

contrail-alarm-gen logs:

ERROR:contrail-alarm-gen:Part 13 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 13 collectors ['40.43.1.6:6379']
ERROR:contrail-alarm-gen:Stopping part 13 UVEs ['ObjectGeneratorInfo:nodeg32:Config:contrail-api:0', u'ObjectGeneratorInfo:nodeh2:Control:contrail-control:0', 'ObjectGeneratorInfo:nodeh2:Config:contrail-discovery:0', 'ObjectGeneratorInfo:nodeh1:Analytics:contrail-collector:0', u'ObjectVNTable:default-domain:admin:vn1', u'ObjectXmppPeerInfo:nodeh2:40.41.1.5', 'ObjectGeneratorInfo:nodeh1:Database:contrail-database-nodemgr:0', 'ObjectVNTable:__UNKNOWN__', 'ObjectGeneratorInfo:nodeh1:Control:contrail-control-nodemgr:0', 'ObjectGeneratorInfo:nodeh1:Analytics:contrail-analytics-api:0', u'ObjectBgpPeer:default-domain:default-project:ip-fabric:__default__:nodeh2:default-domain:default-project:ip-fabric:__default__:nodeh1', 'ObjectGeneratorInfo:nodeh1:Config:contrail-svc-monitor:0', 'ObjectBgpPeer:default-domain:default-project:ip-fabric:__default__:nodeh1:default-domain:default-project:ip-fabric:__default__:nodeh2', u'ObjectGeneratorInfo:nodeh7:Compute:contrail-vrouter-nodemgr:0', u'ObjectXmppPeerInfo:nodeh2:40.40.10.6', 'ObjectVNTable:default-domain:demo:test1', 'ObjectGeneratorInfo:nodeh2:Control:contrail-dns:0', 'ObjectVNTable:default-domain:default-project:ip-fabric']
ERROR:contrail-alarm-gen:Part 4 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 4 collectors ['40.43.1.6:6379']
ERROR:contrail-alarm-gen:Stopping part 4 UVEs []
ERROR:contrail-alarm-gen:Part 9 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 9 collectors ['40.43.1.6:6379']
ERROR:contrail-alarm-gen:Stopping part 9 UVEs ['ObjectPhysicalInterfaceTable:vlan4001']
ERROR:contrail-alarm-gen:Part 7 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 7 collectors ['40.43.1.6:6379']
ERROR:contrail-alarm-gen:Stopping part 7 UVEs ['ObjectVMITable:default-domain:demo:97563d12-936f-4f88-a2ab-1aa7214ef1a0']
ERROR:contrail-alarm-gen:Part 6 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 6 collectors ['40.43.1.6:6379']
ERROR:contrail-alarm-gen:Stopping part 6 UVEs ['ObjectVMTable:7a1920c7-7fbd-4369-aabd-2af4008601b5']
ERROR:contrail-alarm-gen:Part 12 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 12 collectors ['40.43.1.6:6379']
ERROR:contrail-alarm-gen:Stopping part 12 UVEs ['ObjectGeneratorInfo:nodeg32:Database:contrail-database-nodemgr:0', 'ObjectGeneratorInfo:nodeh1:Control:contrail-dns:0', 'ObjectGeneratorInfo:nodeh1:Control:contrail-control:0', 'ObjectGeneratorInfo:nodei5:Compute:contrail-vrouter-nodemgr:0', u'ObjectXmppPeerInfo:nodeh1:40.40.10.6', 'ObjectGeneratorInfo:nodeh6:Compute:contrail-vrouter-nodemgr:0', 'ObjectGeneratorInfo:nodeg32:Config:contrail-schema:0', 'ObjectVNTable:default-domain:default-project:__link_local__', 'ObjectGeneratorInfo:nodeh2:Config:contrail-config-nodemgr:0', 'ObjectXmppPeerInfo:nodeh1:40.41.1.5', 'ObjectGeneratorInfo:nodeh1:Analytics:contrail-snmp-collector:0', 'ObjectGeneratorInfo:nodeh7:Compute:contrail-vrouter-agent:0', 'ObjectVNTable:default-domain:default-project:default-virtual-network', 'ObjectGeneratorInfo:nodeh1:Config:contrail-discovery:0', u'ObjectXmppPeerInfo:nodeh2:40.40.40.6']
ERROR:contrail-alarm-gen:Part 11 lost collector 40.43.1.6:6379
ERROR:contrail-alarm-gen:Stopping part 11 collectors ['40.43.1.6:6379']
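
As a rough illustration of how one might summarize the "Stopping part N UVEs" lines above, here is a hypothetical helper (not part of any fix; the log file path is passed as an argument) that counts how many UVEs each partition dropped when the collector was lost:

import ast
import re
import sys
from collections import defaultdict

STOP_RE = re.compile(r"Stopping part (\d+) UVEs (\[.*\])")

stopped = defaultdict(int)
with open(sys.argv[1]) as log:   # path to a contrail-alarm-gen log file
    for line in log:
        m = STOP_RE.search(line)
        if m:
            # The UVE list is printed as a Python literal, so it can be parsed back.
            stopped[int(m.group(1))] += len(ast.literal_eval(m.group(2)))

for part in sorted(stopped):
    print('partition %d: %d UVE(s) stopped' % (part, stopped[part]))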

contrail-status output on all 3 collector nodes:

root@nodeg32:~# contrail-status
== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager backup
contrail-discovery:0 active
contrail-schema active
contrail-svc-monitor backup
ifmap active

== Contrail Web UI ==
supervisor-webui: active
contrail-webui active
contrail-webui-middleware active

== Contrail Database ==
contrail-database: active
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

root@nodeh1:~# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen initializing (Collector connection down)
contrail-analytics-api initializing (Collector connection down)
contrail-analytics-nodemgr active
contrail-collector inactive
contrail-query-engine initializing (Collector connection down)
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager backup
contrail-discovery:0 active
contrail-schema backup
contrail-svc-monitor active
ifmap active

== Contrail Database ==
contrail-database: active
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

root@nodeh2:~# contrail-status
== Contrail Control ==
supervisor-control: active
contrail-control active
contrail-control-nodemgr active
contrail-dns active
contrail-named active

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api active
contrail-analytics-nodemgr active
contrail-collector active
contrail-query-engine active
contrail-snmp-collector active
contrail-topology active

== Contrail Config ==
supervisor-config: active
contrail-api:0 active
contrail-config-nodemgr active
contrail-device-manager active
contrail-discovery:0 active
contrail-schema backup
contrail-svc-monitor backup
ifmap active

== Contrail Database ==
contrail-database: active
supervisor-database: active
contrail-database-nodemgr active
kafka active

== Contrail Support Services ==
supervisor-support-service: active
rabbitmq-server active

Setup :

    'all': [host1, host2, host3, host4, host5, host6, host7],
    'cfgm': [host1, host2, host3],
    'webui': [host1],
    'openstack': [host1],
    'control': [host2, host3],
    'collector': [host1, host2, host3],
    'database': [host1, host2, host3],
    'compute': [host4, host5, host6, host7],
    'all': ['nodeg32', 'nodeh1', 'nodeh2', 'nodeh6', 'nodei4', 'nodei5', 'nodeh7']

Tags: analytics
Raj Reddy (rajreddy)
Changed in juniperopenstack:
importance: Undecided → High
Revision history for this message
Anish Mehta (amehta00) wrote :

Ankit, I need more information:

- Which alarms did you expect, and on which UVEs?
- Contents of the analytics-node UVEs (I need to know which partitions live on which analytics node).
- Logs for all 3 contrail-alarm-gen instances.
- Introspect output for all 3 contrail-alarm-gens:
http://10.84.13.40:5995/Snh_PartitionStatusReq?partition=-1
http://10.84.13.40:5995/Snh_UVETableInfoReq?partition=-1
http://10.84.13.40:5995/Snh_UVETableAlarmReq?table=all

Of course, it would be best if you can reproduce the issue and leave the system in the bad state.
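
A small hypothetical sketch for gathering the introspect output requested above from all three alarm-gen nodes (the node names, introspect port 5995, and output filenames are assumptions; the replies are Sandesh XML and are saved verbatim):

import requests

NODES = ['nodeg32', 'nodeh1', 'nodeh2']
PAGES = [
    'Snh_PartitionStatusReq?partition=-1',
    'Snh_UVETableInfoReq?partition=-1',
    'Snh_UVETableAlarmReq?table=all',
]

for node in NODES:
    for page in PAGES:
        url = 'http://%s:5995/%s' % (node, page)
        try:
            body = requests.get(url, timeout=10).text
        except requests.RequestException as exc:
            print('%s: failed (%s)' % (url, exc))
            continue
        fname = '%s_%s.xml' % (node, page.split('?')[0])
        with open(fname, 'w') as out:
            out.write(body)
        print('saved %s (%d bytes)' % (fname, len(body)))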

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/15068
Submitter: Anish Mehta (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/15068
Committed: http://github.com/Juniper/contrail-controller/commit/d408fb9a9fa73da315ebbfc7dae23809b053b26d
Submitter: Zuul
Branch: master

commit d408fb9a9fa73da315ebbfc7dae23809b053b26d
Author: Anish Mehta <email address hidden>
Date: Thu Nov 12 09:55:38 2015 -0800

Analytics-Nodemgr and contrail-alarm-gen will now connect to any collector (as per discovery)
instead of only connecting to the local collector. This helps alarms get raised correctly on collector failure.

The FieldNames table is now populated for stats and flows.
Also, we only write to it once every T2 per name/value combination.

Adding support for redis-HA. When alarmgen reconnects to redis,
restart partitions.

Change-Id: I21485b765c3c49759f20c5b308198141789ec06c
Closes-Bug: 1512539
Closes-Bug: 1512537
Closes-Bug: 1512536
Closes-Bug: 1512532
Closes-Bug: 1513409
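
As a conceptual sketch of the "write to FieldNames only once every T2 per name/value combination" behaviour mentioned in the commit message (the T2 bit-shift value and the in-memory cache here are assumptions for illustration, not the actual collector code):

import time

T2_SHIFT = 23          # assumed bit split between T2 and T1 (not from the patch)
_written = set()       # (name, value, t2) combinations already written

def maybe_write_fieldname(name, value, write_fn):
    """Invoke write_fn(name, value) at most once per T2 bucket per name/value pair."""
    now_usec = int(time.time() * 1000000)
    t2 = now_usec >> T2_SHIFT
    key = (name, value, t2)
    if key in _written:
        return False   # already written in this T2 bucket, skip the duplicate
    _written.add(key)
    write_fn(name, value)
    # A real implementation would also expire entries from older T2 buckets.
    return True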

Revision history for this message
Ankit Jain (ankitja) wrote :

Anish,

Could you please check this?

The setup is in the failed state.

Setup: nodea21, nodeg13, nodeg20

    'cfgm': [host1,host2,host3],
    'webui': [host1],
    'openstack': [host1],
    'control': [host2, host3],
    'collector': [host1, host2, host3],
    'database': [host1, host2, host3],
    'compute': [host2, host3],

For the same scenario, what I see now is:

== Contrail Analytics ==
supervisor-analytics: active
contrail-alarm-gen active
contrail-analytics-api initializing (Collector connection down)
contrail-analytics-nodemgr active
contrail-collector inactive
contrail-query-engine initializing (Collector connection down)
contrail-snmp-collector initializing (Collector connection down)
contrail-topology active

In this case, contrail-analytics-api, contrail-query-engine, and contrail-snmp-collector failed to connect to a remote collector (as the local collector is down).

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

Ankit, this behavior is as per design.
- All the analytics services connect to the local collector service.
- With the commit for the original issue reported in this bug, changes have been made to let alarm-gen and analytics-nodemgr connect to any collector, based on the collector list received from the discovery service, so that alarms can be raised if the collector or any other analytics service goes down.

Please create a separate bug to track the issue you mentioned in comment #15, as it is not related to the original issue reported here.
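
A hedged sketch of the discovery-based collector selection described above (the discovery port 5998, the /subscribe payload, and the response shape are assumptions here, not taken from the actual patch):

import requests

def get_collectors(discovery_ip, client_id='alarmgen-sketch'):
    # Ask the discovery service for Collector publishers (instances=0 is
    # assumed here to mean "return every instance").
    resp = requests.post(
        'http://%s:5998/subscribe' % discovery_ip,
        json={'service': 'Collector', 'instances': 0,
              'client': client_id, 'client-type': 'contrail-alarm-gen'},
        timeout=5)
    resp.raise_for_status()
    # Assumed response shape: {"Collector": [{"ip-address": ..., "port": ...}], "ttl": ...}
    return [(entry['ip-address'], entry['port'])
            for entry in resp.json().get('Collector', [])]

# Example: pick any collector from the list instead of insisting on the local one.
# collectors = get_collectors('nodeg32')
# ip, port = collectors[0] if collectors else (None, None)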

information type: Proprietary → Public
Revision history for this message
Ankit Jain (ankitja) wrote :