OpenStack HA: mainline3020: contrail analytics processes get stuck at initializing (Collector connection down)

Bug #1653870 reported by sundarkh
This bug affects 1 person
Affects            | Status  | Importance | Assigned to | Milestone
Juniper Openstack  | Status tracked in Trunk
  Trunk            | Invalid | High       | sundarkh    |

Bug Description


1) Provisioned a cluster with OpenStack HA as shown below:

server-manager-client display server --select id,cluster_id,roles,ip_address
+---------+----------------+----------------+----------------------------------------------------------------------------+
| id | cluster_id | ip_address | roles |
+---------+----------------+----------------+----------------------------------------------------------------------------+
| nodec28 | cluster5sanity | 10.204.217.13 | [u'config', u'control', u'collector', u'database', u'webui', u'openstack'] |
| nodeg37 | cluster5sanity | 10.204.217.77 | [u'config', u'control', u'collector', u'database', u'webui', u'openstack'] |
| nodec10 | cluster5sanity | 10.204.217.176 | [u'config', u'control', u'collector', u'database', u'webui', u'openstack'] |
| nodei17 | cluster5sanity | 10.204.217.129 | [u'compute'] |
| nodei19 | cluster5sanity | 10.204.217.131 | [u'compute'] |
| nodei20 | cluster5sanity | 10.204.217.132 | [u'compute'] |
+---------+----------------+----------------+----------------------------------------------------------------------------+
root@nodej3:~#

2) After provisioning completes, the contrail analytics processes get stuck at initializing (Collector connection down).

On nodec10, the contrail-analytics processes get stuck in the initializing state. The contrail-analytics-api log reports "Address already in use" right after starting the UVE streamer.

nodec10
== Contrail Control ==
supervisor-control:           active
contrail-control              active
contrail-control-nodemgr      initializing (Collector connection down)
contrail-dns                  active
contrail-named                active

== Contrail Analytics ==
supervisor-analytics:         active
contrail-alarm-gen:0          failed
contrail-analytics-api        initializing (Collector connection down)
contrail-analytics-nodemgr    initializing (Collector connection down)
contrail-collector            active
contrail-query-engine         active
contrail-snmp-collector       initializing (Collector connection down)
contrail-topology             initializing (Collector connection down)

root@nodeg37:~# contrail-status
== Contrail Control ==
supervisor-control:           active
contrail-control              active
contrail-control-nodemgr      initializing (Collector connection down)
contrail-dns                  active
contrail-named                active
== Contrail Analytics ==
supervisor-analytics:         active
contrail-alarm-gen:0          active
contrail-analytics-api        active
contrail-analytics-nodemgr    active
contrail-collector            active
contrail-query-engine         active
contrail-snmp-collector       active
contrail-topology             active

root@nodec28:~# contrail-status
== Contrail Control ==
supervisor-control:           active
contrail-control              active
contrail-control-nodemgr      initializing (Collector connection down)
contrail-dns                  active
contrail-named                active
== Contrail Analytics ==
supervisor-analytics:         active
contrail-alarm-gen:0          active
contrail-analytics-api        active
contrail-analytics-nodemgr    active
contrail-collector            active
contrail-query-engine         active
contrail-snmp-collector       active
contrail-topology             active
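
When comparing the three nodes it helps to reduce each contrail-status dump to just the services that are not active. The snippet below is a minimal sketch of that, assuming the two-column "name  state" layout seen above; the function name non_active_services is made up for illustration and is not part of any Contrail tool.

```python
# Sample copied from the nodec10 output in this report.
SAMPLE = """\
== Contrail Control ==
supervisor-control:           active
contrail-control              active
contrail-control-nodemgr      initializing (Collector connection down)
contrail-dns                  active
contrail-named                active

== Contrail Analytics ==
supervisor-analytics:         active
contrail-alarm-gen:0          failed
contrail-analytics-api        initializing (Collector connection down)
contrail-analytics-nodemgr    initializing (Collector connection down)
contrail-collector            active
contrail-query-engine         active
contrail-snmp-collector       initializing (Collector connection down)
contrail-topology             initializing (Collector connection down)
"""


def non_active_services(status_text):
    """Map service name -> (section, state) for every service not 'active'."""
    problems = {}
    section = None
    for raw in status_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("=="):
            # Section header such as "== Contrail Analytics ==".
            section = line.strip("= ")
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        # Supervisor lines carry a trailing colon; drop it.
        name, state = parts[0].rstrip(":"), parts[1].strip()
        if state != "active":
            problems[name] = (section, state)
    return problems


if __name__ == "__main__":
    for name, (section, state) in sorted(non_active_services(SAMPLE).items()):
        print("%-28s %-20s %s" % (name, section, state))
```

Feeding it only the status lines (without the shell prompt line) keeps the result clean, since a prompt line would otherwise be read as a service entry.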

root@nodec10:~# vi /var/log/contrail/contrail-analytics-api.log

01/03/2017 12:36:38 AM [contrail-analytics-api]: Initializing UVE Cache
01/03/2017 12:36:38 AM [contrail-analytics-api]: updated redis_uve_list {('127.0.0.1', 6379, 0): None}
01/03/2017 12:36:38 AM [contrail-analytics-api]: Cannot write http_port 8090 to /tmp/contrail-analytics-api.2282.http_port
01/03/2017 12:36:38 AM [contrail-analytics-api]: Starting Introspect on HTTP Port 8090
01/03/2017 12:36:38 AM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 process_status = [ << module_id = contrail-analytics-api instance_id = 0 state = Non-Functional connection_infos = [ << type = UvePartitions name = UVE-Aggregation server_addrs = [ ] status = Initializing >>, ] description = UvePartitions:UVE-Aggregation[None] connection down >>, ] >>
01/03/2017 12:36:38 AM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 process_status = [ << module_id = contrail-analytics-api instance_id = 0 state = Non-Functional connection_infos = [ << type = UvePartitions name = UVE-Aggregation server_addrs = [ ] status = Initializing >>, << type = Redis-UVE name = 127.0.0.1:6379 server_addrs = [ ] status = Initializing >>, ] description = UvePartitions:UVE-Aggregation[None], Redis-UVE:127.0.0.1:6379[None] connection down >>, ] >>
01/03/2017 12:36:38 AM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 process_status = [ << module_id = contrail-analytics-api instance_id = 0 state = Non-Functional connection_infos = [ << type = UvePartitions name = UVE-Aggregation server_addrs = [ ] status = Initializing >>, << type = Collector name = server_addrs = [ , ] status = Down description = none to Idle on EvStart >>, << type = Redis-UVE name = 127.0.0.1:6379 server_addrs = [ ] status = Initializing >>, ] description = UvePartitions:UVE-Aggregation[None], Collector, Redis-UVE:127.0.0.1:6379[None] connection down >>, ] >>
01/03/2017 12:36:38 AM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] SandeshModuleClientTrace: data = << name = nodec10:Analytics:contrail-analytics-api:0 client_info = << status = Idle successful_connections = 0 pid = 2423 http_port = 8090 start_time = 1483432598225932 collector_name = collector_ip = >> sm_queue_count = 3 max_sm_queue_count = 3 >>
01/03/2017 12:36:38 AM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 process_status = [ << module_id = contrail-analytics-api instance_id = 0 state = Non-Functional connection_infos = [ << type = UvePartitions name = UVE-Aggregation server_addrs = [ ] status = Initializing >>, << type = Collector name = server_addrs = [ , ] status = Down description = none to Idle on EvStart >>, << type = Redis-UVE name = 127.0.0.1:6379 server_addrs = [ ] status = Up >>, ] description = UvePartitions:UVE-Aggregation[None], Collector connection down >>, ] >>
01/03/2017 12:36:38 AM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 process_status = [ << module_id = contrail-analytics-api instance_id = 0 state = Non-Functional connection_infos = [ << type = UvePartitions name = UVE-Aggregation server_addrs = [ ] status = Up description = Partitions:30 >>, << type = Collector name = server_addrs = [ , ] status = Down description = none to Idle on EvStart >>, << type = Redis-UVE name = 127.0.0.1:6379 server_addrs = [ ] status = Up >>, ] description = Collector connection down >>, ] >>
01/03/2017 12:36:38 AM [contrail-analytics-api]: Exception: get_cql_session Failure ('Unable to connect to any servers', {'127.0.0.1': error(111, "Tried connecting to [('127.0.0.1', 9160)]. Last error: Connection refused")})
01/03/2017 12:36:38 AM [contrail-analytics-api]: Starting UveStreamer
01/03/2017 12:36:38 AM [contrail-analytics-api]: Exception: [Errno 98] Address already in use
01/03/2017 12:36:38 AM [contrail-analytics-api]: stopping everythin
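
The "[Errno 98] Address already in use" exception above means a bind() failed because something already held the port. The log does not say which port was involved, so the check below is only a sketch: it tests whether a given TCP port can be bound at this moment. The helper name port_is_free is hypothetical, and using 8090 (the introspect port from the log) in the usage line is an assumption about where to start looking.

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        # EADDRINUSE is errno 98 on Linux, the error seen in the log.
        if exc.errno == errno.EADDRINUSE:
            return False
        raise
    else:
        return True
    finally:
        sock.close()


if __name__ == "__main__":
    # 8090 is an assumed starting point, taken from the introspect line.
    print("port 8090 free:", port_is_free(8090))
```

Running this on nodec10 before restarting contrail-analytics-api would show whether a leftover process still holds the port.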

Discovery server: 192.168.100.10
Discovery port: 5998
Collector address: []
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: SANDESH: CONNECT TO COLLECTOR: True
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: Cannot write http_port 8104 to /tmp/contrail-analytics-nodemgr.4985.http_port
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: Starting Introspect on HTTP Port 8104
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: Processing event[EvStart] in state[none]
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: Sandesh Client: Event[EvStart] => State[none] -> State[Idle]
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: SANDESH: Logging: LEVEL: [SYS_INFO] -> [SYS_INFO]
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: SANDESH: Logging: FILE: [None] -> [<stdout>]
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: SANDESH: Logging: SYSLOG: [None] -> [LOG_LOCAL0]
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: SANDESH: Trace: PRINT: [None] -> [False]
01/03/2017 12:37:01 AM [contrail-analytics-nodemgr]: SANDESH: Flow Logging: [None] -> [False]
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: send_nodemgr_process_status_base: Sending UVE:NodeStatusUVE(_context='', _scope='', _category='', _send_queue_enabled=True, _seqnum=0, _versionsig=2524127670, _source='nodec10', _instance_id='0', _client=None, _type=6, _hints=1, _http_server=None, _logger=None, _more=False, _node_type='Analytics', data=NodeStatus(status=None, name='nodec10', installed_package_version=None, deleted=None, disk_usage_info=None, build_info=None, running_package_version=None, process_mem_cpu_usage=None, system_cpu_info=None, system_mem_usage=None, process_status=[ProcessStatus(instance_id='0', module_id='contrail-analytics-nodemgr', state='Functional', description='', connection_infos=None)], all_core_file_list=None, system_cpu_usage=None, _table='ObjectCollectorInfo', process_info=None, description=None), _module='contrail-analytics-nodemgr', _level=2147483647, _timestamp=1483432622237417, _client_context='', _connect_to_collector=True, _role=0)
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Processing event[EvSandeshUVESend] in state[Idle]
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 build_info = {"build-info" : [{"build-version" : "4.0.0.0", "build-time" : "2016-12-23 11:45:53.527374", "build-user" : "contrail-builder", "build-hostname" : "ubuntu", "build-id" : "4.0.0.0-3020", "build-number" : "3020"}]} system_cpu_info = << num_socket = 1 num_cpu = 4 num_core_per_socket = 4 num_thread_per_core = 1 >> running_package_version = 4.0.0.0-3020 installed_package_version = 4.0.0.0-3020 >>
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Discarding event[EvSandeshUVESend] in state[Idle]
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Processing event[EvSandeshUVESend] in state[Idle]
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec10 process_status = [ << module_id = contrail-analytics-nodemgr instance_id = 0 state = Functional description = >>, ] >>
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Discarding event[EvSandeshUVESend] in state[Idle]
wokeup and found a line
wokeup and found a line
wokeup and found a line
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Received discovery update [{u'partcount': u'{ "1":[0,1], "2":[1,3], "3":[4,8], "4":[12,8], "5":[20,10]}', u'@publisher-id': u'nodec10', u'pid': u'2419', u'ip-address': u'192.168.100.13', u'redis-gen': u'1', u'port': u'8086'}] for collector service
01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Processing event[EvSandeshUVESend] in state[Idle]
@ @
"/var/log/contrail/contrail-analytics-nodemgr-stderr.log" 462 lines, 68989

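The nodemgr log shows that discovery did return a collector even though the startup banner printed "Collector address: []". To pull the advertised address out of such a line, something like the following may help. It is a sketch that assumes the "Received discovery update [...] for collector service" format shown above; collectors_from_update is an illustrative name, not a Contrail API.

```python
import ast
import re

# Line copied from the contrail-analytics-nodemgr log in this report.
LOG_LINE = (
    "01/03/2017 12:37:02 AM [contrail-analytics-nodemgr]: Received discovery update "
    "[{u'partcount': u'{ \"1\":[0,1], \"2\":[1,3], \"3\":[4,8], \"4\":[12,8], \"5\":[20,10]}', "
    "u'@publisher-id': u'nodec10', u'pid': u'2419', u'ip-address': u'192.168.100.13', "
    "u'redis-gen': u'1', u'port': u'8086'}] for collector service"
)

# Greedy match so the nested brackets inside 'partcount' stay in the payload.
UPDATE_RE = re.compile(r"Received discovery update (\[.*\]) for (\S+) service")


def collectors_from_update(line):
    """Return [(ip, port), ...] advertised in one discovery update log line."""
    match = UPDATE_RE.search(line)
    if match is None:
        return []
    # The payload is a Python-2 style repr of a list of dicts; literal_eval
    # parses it without executing anything.
    entries = ast.literal_eval(match.group(1))
    return [(entry["ip-address"], int(entry["port"])) for entry in entries]


if __name__ == "__main__":
    print(collectors_from_update(LOG_LINE))
```

The resulting address and port can then be checked for reachability from the node whose clients stay in the Idle state.
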
Alarm gen failure
------------------

01/03/2017 05:09:49 AM [contrail-alarm-gen]: Agg unexpected key ObjectConfigNode:nodeg37 from inst:part 0:0
01/03/2017 05:09:49 AM [contrail-alarm-gen]: Agg unexpected rows [OutputRow(key='ObjectConfigNode:nodeg37', typ='NodeStatus', val=None)]
01/03/2017 05:09:49 AM [contrail-alarm-gen]: AlarmGen stopping everything
01/03/2017 05:09:49 AM [contrail-alarm-gen]: Stopped http server

Notes:
------

1) The issue is not seen in a single-node setup.
2) Running the following commands recovers the collector to active state:
   service contrail-collector restart
   service supervisor-analytics restart
3) The issue is seen with clusters provisioned using either Fab or SM (Server Manager).
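
For repeated test runs, the two restart commands from the notes could be wrapped in a small helper. This is only a sketch: the function recover_analytics and its dry_run and runner parameters are hypothetical, and it defaults to a dry run so nothing is restarted by accident.

```python
import subprocess

# The two commands from the notes, in the order given there.
RECOVERY_COMMANDS = [
    ["service", "contrail-collector", "restart"],
    ["service", "supervisor-analytics", "restart"],
]


def recover_analytics(dry_run=True, runner=subprocess.check_call):
    """Run (or, with dry_run=True, just list) the recovery commands."""
    handled = []
    for cmd in RECOVERY_COMMANDS:
        if not dry_run:
            # check_call raises CalledProcessError if a restart fails.
            runner(cmd)
        handled.append(" ".join(cmd))
    return handled


if __name__ == "__main__":
    for line in recover_analytics(dry_run=True):
        print("would run:", line)
```

Calling recover_analytics(dry_run=False) as root on the affected node would perform the restarts; running contrail-status afterwards confirms whether the services came back as active.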

Tags: analytics
Raj Reddy (rajreddy) wrote:

Sundar, are you still seeing this issue? If not, we can close it.
