[4.1.0.0-8] Alarms not getting raised after any of the contrail processes is stopped

Bug #1736102 reported by Ankit Jain
Affects: Juniper Openstack (status tracked in Trunk)

Series   Status       Importance   Assigned to
R4.1     Incomplete   Low          Ankit Jain
R5.0     Incomplete   Medium       Ankit Jain
Trunk    Incomplete   Medium       Ankit Jain

Bug Description

All the test cases failed in one of the sanity setups.

Alarms were not getting raised. When I stopped a process manually, I could see the same issue.
Logging a bug to track this problem. As it was seen on only one setup, it might not be reproducible.

Setup details:
Build : 4.1.0.0-8
CoreLocation : /cs-shared/test_runs/nodec7/jenkins-ubuntu-14-04_mitaka_Multi_Node_Sanity-569
cores : {'10.204.216.65': [], '10.204.216.64': [], '10.204.216.150': [], '10.204.216.153': [], '10.204.217.115': [], '10.204.217.76': [], '10.204.217.114': []}
LogsLocation : http://10.204.216.50/Docs/logs/4.1.0.0-8_jenkins-ubuntu-14-04_mitaka_Multi_Node_Sanity-569_1512230719.61/logs/
Report : http://10.204.216.50/Docs/logs/4.1.0.0-8_jenkins-ubuntu-14-04_mitaka_Multi_Node_Sanity-569_1512230719.61/junit-noframes.html
Topology :
DISTRO : "Ubuntu 14.04.5 LTS"
SKU : mitaka
Config Nodes : [u'nodec7', u'nodec8', u'nodec57']
Control Nodes : [u'nodec7', u'nodec8', u'nodec57']
Compute Nodes : [u'nodei1', u'nodei2', u'nodei3']
Openstack Node : [u'nodec7']
WebUI Node : [u'nodec7', u'nodec8', u'nodec57']
Analytics Nodes : [u'nodec7', u'nodec8', u'nodec57']
Database Nodes : [u'nodec7', u'nodec8', u'nodec57']
Physical Devices : [u'hooper', u"'hooper'"]
LB Nodes : [u'nodeg36']

The following errors were seen in the log files:

contrail-analytics-api.log

12/02/2017 06:47:14 PM [contrail-analytics-api]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec7 process_status = [ << module_id = contrail-analytics-api instance_id = 0 state = Non-Functional connection_infos = [ << type = Redis-UVE name = 192.168.192.7:6381 server_addrs = [ 192.168.192.7:6381, ] status = Initializing >>, << type = Collector name = server_addrs = [ , ] status = Down description = none to Idle on EvStart >>, << type = Zookeeper name = OpServer server_addrs = [ 192.168.192.6:2181, 192.168.192.5:2181, 192.168.192.7:2181, ] status = Initializing description = >>, << type = Redis-UVE name = 192.168.192.6:6381 server_addrs = [ 192.168.192.6:6381, ] status = Initializing >>, << type = ApiServer name = server_addrs = [ 192.168.192.6:8082, 192.168.192.5:8082, 192.168.192.7:8082, ] status = Initializing description = >>, << type = UvePartitions name = UVE-Aggregation server_addrs = [ ] status = Initializing >>, << type = Redis-UVE name = 192.168.192.5:6381 server_addrs = [ 192.168.192.5:6381, ] status = Initializing >>, ] description = Redis-UVE:192.168.192.7:6381[None], Collector, Zookeeper:OpServer[], Redis-UVE:192.168.192.6:6381[None], ApiServer, UvePartitions:UVE-Aggregation[None], Redis-UVE:192.168.192.5:6381[None] connection down >>, ] >>
12/02/2017 06:47:14 PM [contrail-analytics-api]: redis/collector healthcheck failed Error 111 connecting to 192.168.192.7:6381. Connection refused. for RedisInstKey(ip='192.168.192.7', port=6381)
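
The failing healthcheck can be reproduced standalone. A minimal sketch, assuming the redis Python package is installed and the Redis-UVE instances (endpoints taken from the log above) require no auth; a ConnectionError here is the same "Error 111 ... Connection refused" the daemon reports:

# Probe the three Redis-UVE endpoints named in the log above.
import redis

for ip in ('192.168.192.5', '192.168.192.6', '192.168.192.7'):
    try:
        redis.StrictRedis(host=ip, port=6381, socket_timeout=5).ping()
        print('%s:6381 OK' % ip)
    except redis.exceptions.ConnectionError as e:
        print('%s:6381 FAILED: %s' % (ip, e))

contrail-alarm-gen.log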

12/02/2017 06:47:14 PM [contrail-alarm-gen]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec7 process_status = [ << module_id = contrail-alarm-gen instance_id = 0 state = Non-Functional connection_infos = [ << type = Redis-UVE name = 192.168.192.7:6381 server_addrs = [ 192.168.192.7:6381, ] status = Initializing >>, << type = Collector name = server_addrs = [ , ] status = Down description = none to Idle on EvStart >>, << type = Redis-UVE name = 192.168.192.6:6381 server_addrs = [ 192.168.192.6:6381, ] status = Initializing >>, << type = Database name = RabbitMQ server_addrs = [ 192.168.192.6, ] status = Initializing description = >>, << type = Redis-UVE name = 192.168.192.5:6381 server_addrs = [ 192.168.192.5:6381, ] status = Initializing >>, << type = Zookeeper name = AlarmGenerator server_addrs = [ 192.168.192.6:2181, 192.168.192.5:2181, 192.168.192.7:2181, ] status = Initializing description = >>, ] description = Redis-UVE:192.168.192.7:6381[None], Collector, Redis-UVE:192.168.192.6:6381[None], Database:RabbitMQ[], Redis-UVE:192.168.192.5:6381[None], Zookeeper:AlarmGenerator[] connection down >>, ] >>
12/02/2017 06:47:14 PM [contrail-alarm-gen]: SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec7 process_status = [ << module_id = contrail-alarm-gen instance_id = 0 state = Non-Functional connection_infos = [ << type = Redis-UVE name = 192.168.192.7:6381 server_addrs = [ 192.168.192.7:6381, ] status = Initializing >>, << type = Collector name = server_addrs = [ , ] status = Down description = none to Idle on EvStart >>, << type = Redis-UVE name = 192.168.192.6:6381 server_addrs = [ 192.168.192.6:6381, ] status = Initializing >>, << type = Database name = RabbitMQ server_addrs = [ 192.168.192.6, ] status = Down description = >>, << type = Redis-UVE name = 192.168.192.5:6381 server_addrs = [ 192.168.192.5:6381, ] status = Initializing >>, << type = Zookeeper name = AlarmGenerator server_addrs = [ 192.168.192.6:2181, 192.168.192.5:2181, 192.168.192.7:2181, ] status = Initializing description = >>, ] description = Redis-UVE:192.168.192.7:6381[None], Collector, Redis-UVE:192.168.192.6:6381[None], Database:RabbitMQ[], Redis-UVE:192.168.192.5:6381[None], Zookeeper:AlarmGenerator[] connection down >>, ] >>
12/02/2017 06:47:14 PM [contrail-alarm-gen]: SANDESH: [DROP: WrongClientSMState] AlarmgenStatusTrace: data = << name = nodec7 counters = [ << instance = 0 partitions = 0 keys = 0 updates = 0 table_stats = [ ] >>, ] alarmgens = [ nodec7:Analytics:contrail-alarm-gen:0, ] >>
12/02/2017 06:47:14 PM [contrail-alarm-gen]: Exception ConnectionError in uve proc. Arguments:
('Error 111 connecting to 127.0.0.1:6381. Connection refused.',)
12/02/2017 06:47:14 PM [contrail-alarm-gen]: Analytics Discovery cannot publish while down

12/02/2017 08:14:09 PM [contrail-alarm-gen]: Starting part 24 UVEs 0
12/02/2017 08:25:42 PM [kafka.conn]: <BrokerConnection host=192.168.192.5 port=9092> timed out after 40000 ms. Closing connection.
12/02/2017 08:31:25 PM [kafka.conn]: <BrokerConnection host=192.168.192.5 port=9092> timed out after 40000 ms. Closing connection.
12/02/2017 08:32:01 PM [kafka.conn]: <BrokerConnection host=192.168.192.5 port=9092> timed out after 40000 ms. Closing connection.
12/02/2017 08:32:05 PM [kafka.conn]: <BrokerConnection host=192.168.192.5 port=9092> timed out after 40000 ms. Closing connection.
12/02/2017 08:32:05 PM [kafka.conn]: <BrokerConnection host=192.168.192.5 port=9092> timed out after 40000 ms. Closing connection.
12/02/2017 08:35:43 PM [kafka.conn]: <BrokerConnection host=192.168.192.5 port=9092> timed out after 40000 ms. Closing connection.
12/02/2017 08:42:12 PM [kafka.conn]: <BrokerConnection host=192.168.192.6 port=9092>: Error receiving 4-byte payload header - closing socket
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/kafka/conn.py", line 248, in recv
    self._rbuffer.write(self._sock.recv(4 - self._rbuffer.tell()))
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 385, in recv
    return sock.recv(*args)
error: [Errno 104] Connection reset by peer
12/02/2017 08:42:12 PM [kafka.consumer.fetcher]: Fetch to node 1 failed: [Errno 104] Connection reset by peer
12/02/2017 08:42:12 PM [kafka.conn]: <BrokerConnection host=192.168.192.6 port=9092>: Error receiving 4-byte payload header - closing socket
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/kafka/conn.py", line 248, in recv
    self._rbuffer.write(self._sock.recv(4 - self._rbuffer.tell()))
  File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 385, in recv
    return sock.recv(*args)
error: [Errno 104] Connection reset by peer
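
The broker timeouts can be checked standalone as well. A sketch assuming the same kafka-python library that emitted the messages above (broker addresses taken from the log); topics() forces a metadata round trip, so it fails fast if a broker is down or unreachable:

# Probe each Kafka broker from the log by requesting cluster metadata.
from kafka import KafkaConsumer
from kafka.errors import KafkaError

for broker in ('192.168.192.5:9092', '192.168.192.6:9092', '192.168.192.7:9092'):
    try:
        consumer = KafkaConsumer(bootstrap_servers=broker,
                                 request_timeout_ms=40000)
        print('%s reachable, topics: %s' % (broker, sorted(consumer.topics())))
        consumer.close()
    except KafkaError as e:
        print('%s unreachable: %s' % (broker, e))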


http://nodec7:8080/proxy?proxyURL=http://192.168.192.7:8081/analytics/uves/analytics-node/*?cfilt=AlarmgenPartition
{
  "value": [
    {
      "name": "nodec57",
      "value": {
        "AlarmgenPartition": {
          "__T": 1512225809352367,
          "inst_parts": [
            {
              "instance": "0",
              "partitions": ["3", "5", "10", "11", "15", "16", "21", "26", "27", "28", "29"]
            }
          ]
        }
      }
    },
    {
      "name": "nodec8",
      "value": {
        "AlarmgenPartition": {
          "__T": 1512225809352367,
          "inst_parts": [
            {
              "instance": "0",
              "partitions": ["0", "1", "4", "6", "7", "8", "17", "18", "19", "20"]
            }
          ]
        }
      }
    },
    {
      "name": "nodec7",
      "value": {
        "AlarmgenPartition": {
          "__T": 1512233857795934,
          "inst_parts": [
            {
              "instance": "0",
              "partitions": ["2", "9", "12", "13", "14", "22", "23", "24", "25"]
            }
          ]
        }
      }
    }
  ]
}
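
The ownership above covers all 30 partitions (11 + 10 + 9, ids 0..29, no overlaps). A quick coverage check, as a sketch assuming the analytics API on port 8081 serves this UVE as valid JSON and that 30 is the configured partition count; silence means every partition is owned by exactly one instance:

# Verify every alarm-gen partition is owned by exactly one instance.
import json
import urllib2
from collections import defaultdict

URL = ('http://192.168.192.7:8081/analytics/uves/analytics-node/'
       '*?cfilt=AlarmgenPartition')

owners = defaultdict(list)
for node in json.load(urllib2.urlopen(URL))['value']:
    for inst in node['value']['AlarmgenPartition']['inst_parts']:
        for part in inst['partitions']:
            owners[int(part)].append(node['name'])

for part in range(30):
    if len(owners[part]) != 1:
        print('partition %d owned by: %s' % (part, owners[part] or 'nobody'))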

Full logs and JSON file copied at:

/cs-shared/bugs/<bug-id>/

Tags: analytics
Anish Mehta (amehta00)
Changed in juniperopenstack:
assignee: Anish Mehta (amehta00) → Sundaresan Rajangam (srajanga)
Sundaresan Rajangam (srajanga) wrote:

From the logs, it is evident that contrail-alarm-gen was not able to connect to Kafka. Did you check whether there was any network connectivity issue?

Next time you see the issue, please check whether the NodeStatus UVE gets updated properly after you stop the service.
Also, please check whether contrail-alarm-gen has all the alarm config objects:
http://<analytics-ip>:5995/Snh_AlarmConfigRequest?name=
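
Both checks can be scripted. A rough sketch (the analytics IP below is a placeholder; the introspect on port 5995 replies with Sandesh XML, so the raw payloads are just printed for inspection):

# Fetch the NodeStatus UVE and the alarm-gen alarm config listing.
import urllib2

ANALYTICS_IP = '10.204.216.64'  # placeholder, substitute a real node

uve = urllib2.urlopen('http://%s:8081/analytics/uves/analytics-node/'
                      'nodec7?cfilt=NodeStatus' % ANALYTICS_IP).read()
print(uve)  # process_status should change after a service is stopped

cfg = urllib2.urlopen('http://%s:5995/Snh_AlarmConfigRequest?name=' %
                      ANALYTICS_IP).read()
print(cfg)  # should list all the expected alarm config objects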

Changed in juniperopenstack:
status: New → Incomplete
assignee: Sundaresan Rajangam (srajanga) → Ankit Jain (ankitja)
Ankit Jain (ankitja) wrote:

This issue appeared again in R5.0.

contrail-alarm-gen could not connect to Kafka: [kafka.conn]: <BrokerConnection host=10.204.216.105/10.204.216.105 port=9092>: socket disconnected

Also, the NodeStatus UVE was not getting updated after stopping any of the processes.

When checked at http://<analytics-ip>:5995/Snh_AlarmConfigRequest?name=, all the expected configs were found.

After restarting alarm-gen, the system recovered.

Latest log copied here:
 /cs-shared/bugs/1736102/R5.0

Changed in juniperopenstack:
status: Incomplete → New
assignee: Ankit Jain (ankitja) → Sundaresan Rajangam (srajanga)
mkheni (mkheni) wrote:

The issue in 5.0 is reported and tracked in https://bugs.launchpad.net/juniperopenstack/+bug/1742006; it is caused by the removal of polling for process status in nodemgr.

Sundaresan Rajangam (srajanga) wrote:

Ankit,

As Miraj pointed out in #3, the NodeStatus UVE not being updated on stopping a service is due to https://bugs.launchpad.net/juniperopenstack/+bug/1742006.
But restarting the contrail-alarm-gen service wouldn't have fixed that issue.
What do you mean by "After restarting alarm-gen, the system recovered"? Did you see alarms after restarting the contrail-alarm-gen service?

Ankit Jain (ankitja) wrote:

Ok.

The contrail-alarm-gen connection to Kafka was down initially, and alarms were also not getting raised.
After restarting the alarm-gen process, the connection to Kafka was re-established and I could also see the alarms getting generated. I thought the Kafka connection failure might have caused the issue.

For reference, the error seen was: [kafka.conn]: <BrokerConnection host=10.204.216.105/10.204.216.105 port=9092>: socket disconnected

mkheni (mkheni) wrote:

Ankit,

as Sundar mentioned above, alarms not getting generated for a stopped process could be because of
https://bugs.launchpad.net/juniperopenstack/+bug/1742006. The errors you are seeing:

<BrokerConnection host=10.204.216.105/10.204.216.105 port=9092>: socket disconnected
01/12/2018 05:04:50 AM [kafka.client]: Node 2 connection failed -- refreshing metadata
01/12/2018 05:07:27 AM [kafka.client]: Node 0 connection failed -- refreshing metadata
01/12/2018 05:12:48 AM [kafka.client]: Node 0 connection failed -- refreshing metadata

These messages are logged when kafka.client is unable to send a request to Kafka; they are essentially warnings, not errors. We do not believe they are the reason alarms were not generated, since we were able to generate alarms even while these warnings were present. Could you test a build in which #1742006 is fixed and check whether alarms are still not generated?

As for the warnings, they look like a library issue; we are investigating further.
