[5.0 build 50]Alarm not getting generated on stopping kafka container

Bug #1761424 reported by aswani kumar
Affects: Juniper Openstack (status tracked in Trunk)
R5.0: Fix Released, Critical, assigned to Jack Jonnalagadda
Trunk: Fix Committed, Critical, assigned to Jack Jonnalagadda

Bug Description

5.0 build tag ocata-master-50 multinode openstack

I have a kafka cluster running on three nodes: nodec7, nodec8, nodec57.

On stopping the kafka container on nodec7, I am able to see the alarm,
but on stopping the kafka container on the other nodes (nodec8 and nodec57) I am not able to see any alarms.

tags: added: analytics sanityblocker
Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

Please provide the setup or logs. Did you check the NodeStatus UVE for the database node where you stopped the kafka container?

Revision history for this message
aswani kumar (aswanikumar90) wrote :

I didn't have a multinode setup, so I tried on a single node.

I stopped the kafka container
[root@nodeg6 ~]# docker ps -a | grep kafka
b4ab654d9d07 opencontrailnightly/contrail-external-kafka:ocata-master-50 "/docker-entrypoin..." 43 hours ago Exited (143) 3 minutes ago analyticsdatabase_kafka_1

In the database NodeStatus UVE, process_info still shows the process as running:
process_info: [
{
process_name: "kafka",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1522831835000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},

you can use nodeg6
10.204.217.46
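For reference, the process_info payload above can be checked mechanically. A minimal sketch (field names are taken from the UVE output above; `non_running_processes` is a hypothetical helper, not part of any Contrail API):

```python
def non_running_processes(node_status):
    """Return names of processes whose UVE state is not PROCESS_STATE_RUNNING."""
    return [p["process_name"]
            for p in node_status.get("process_info", [])
            if p.get("process_state") != "PROCESS_STATE_RUNNING"]

# Trimmed sample mirroring the NodeStatus output above: kafka is still
# (incorrectly) reported as RUNNING, so nothing is flagged.
sample = {
    "process_info": [
        {"process_name": "kafka", "process_state": "PROCESS_STATE_RUNNING"},
    ]
}
print(non_running_processes(sample))  # → []
```

Run against the JSON returned by /analytics/uves/database-node/&lt;node&gt;?cfilt=NodeStatus:process_info, this would flag the stopped kafka only once nodemgr actually reports PROCESS_STATE_EXITED.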

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

nodemgr doesn't seem to detect the status of kafka container when it goes down

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

checked this env (nodeg6)

cluster is up, kafka is in the running state
- stopped kafka: docker stop ab882abc227c
- checked contrail-status output: kafka is in exited state, inactive; collector is in the initializing state due to a kafka connection error.
- checked introspect port: kafka is in exited state.

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

@Andrey, the issue is not about contrail-status reflecting the correct state of the docker. The issue is that nodemgr doesn't send the correct status of the docker in the NodeStatus UVE and hence alarms are not being raised.

The url below shows the kafka status as RUNNING. This is incorrect.
Nodemgr should have detected that the docker was stopped and updated the same in the NodeStatus UVE.

https://10.204.217.46:8143/proxy?proxyURL=http://5.5.5.234:8081/analytics/uves/database-node/nodeg6?cfilt=NodeStatus:process_info

{
NodeStatus: {
process_info: [
{
process_name: "kafka",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING", <<<<<<<<<<<<<<<<<<< Incorrect
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},
{
process_name: "zookeeper",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},
{
process_name: "contrail-database-nodemgr",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},
{
process_name: "cassandra",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
}
]
}
}

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

I checked it with
[root@nodeg6 ~]# curl -s http://localhost:8103/Snh_SandeshUVECacheReq?x=NodeStatus | xmllint --format -

and I see
            <ProcessInfo>
              <process_name type="string" identifier="1">kafka</process_name>
              <process_state type="string" identifier="2">PROCESS_STATE_EXITED</process_state>
              <start_count type="u32" identifier="3">1</start_count>
              <stop_count type="u32" identifier="4">0</stop_count>
              <exit_count type="u32" identifier="5">0</exit_count>
              <last_start_time type="string" identifier="6">1523364755000000</last_start_time>
              <last_stop_time type="string" identifier="7"/>
              <last_exit_time type="string" identifier="8"/>
              <core_file_list type="list" identifier="9">
                <list type="string" size="0"/>
              </core_file_list>
            </ProcessInfo>
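For anyone repeating this check, the introspect XML can be parsed with Python's stdlib ElementTree instead of eyeballing the xmllint output. A minimal sketch, with the fragment trimmed to the two fields of interest:

```python
import xml.etree.ElementTree as ET

# Trimmed <ProcessInfo> fragment from the introspect output above.
xml_fragment = """
<ProcessInfo>
  <process_name type="string" identifier="1">kafka</process_name>
  <process_state type="string" identifier="2">PROCESS_STATE_EXITED</process_state>
</ProcessInfo>
"""

info = ET.fromstring(xml_fragment)
name = info.findtext("process_name")
state = info.findtext("process_state")
print(name, state)  # kafka PROCESS_STATE_EXITED
```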

So I conclude that nodemgr detects the correct status.

I'm not an expert in UVEs. How can I debug what nodemgr sends?

In your example, what does the URL mean?

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

In my 3-node setup, when I switched off all three kafkas, the WebUI shows only one alarm, from the first node, and it shows the other two kafkas as up.

When I switched kafka back on on the first server, the WebUI shows the same picture.
So I need to know how to debug all this.

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

nodemgr.conf -

[COLLECTOR]
server_list=10.1.26.148:8086 10.1.26.149:8086 10.1.26.150:8086

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

If the kafka cluster is down, then analytics-api and alarm-gen should report their state as non-functional, and /analytics/uves/<uve-type> should not return any data.

summary: - [5.0 build 50]Alarm not gettinng generated on stopping kafka containner
+ [5.0 build 50]Alarm not getting generated on stopping kafka container
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/41920
Submitter: Jack Jonnalagadda (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0

Review in progress for https://review.opencontrail.org/42049
Submitter: Jack Jonnalagadda (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/41920
Committed: http://github.com/Juniper/contrail-analytics/commit/5f8494f35c670bd71d09606ec5a2a9efea39b432
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 5f8494f35c670bd71d09606ec5a2a9efea39b432
Author: Jackjvs <email address hidden>
Date: Sat Apr 14 10:05:10 2018 -0700

After 4.0, alarm is not getting generated when kafka container is
down, because discovery doesn't detect the event in the container
model of deployment.

This fix dectects kafka down event and raises SysExit exception
in alrmgen. Upon detecting kafka health check producer failure
the said exception is raised.

Change-Id: If3922d8dc550d292578731c97d7ce9e5ae3a1631
Closes-Bug: 1761424
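The mechanism the commit describes can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the actual alarmgen code: `check_kafka` and `broken_publish` are hypothetical names, and in the real fix the publish callable would be the Kafka health-check producer.

```python
def check_kafka(publish):
    """One health-check round: try to publish, and on failure raise
    SystemExit so the alarmgen process terminates and gets restarted,
    restoring alarm generation once kafka is reachable again."""
    try:
        publish(b"healthcheck")
    except Exception as exc:
        # Mirrors the fix described in the commit: on producer failure,
        # exit instead of silently continuing without kafka.
        raise SystemExit("kafka health check failed: %s" % exc)

# Hypothetical publisher standing in for a failing Kafka producer.
def broken_publish(msg):
    raise ConnectionError("broker not available")

try:
    check_kafka(broken_publish)
except SystemExit as exc:
    print("alarmgen would exit:", exc)
```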

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/42049
Committed: http://github.com/Juniper/contrail-analytics/commit/2a19b7645950b8932de831e8eea5ee17b6068879
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 2a19b7645950b8932de831e8eea5ee17b6068879
Author: Jackjvs <email address hidden>
Date: Sat Apr 14 10:05:10 2018 -0700

After 4.0, alarm is not getting generated when kafka container is
down, because discovery doesn't detect the event in the container
model of deployment.

This fix dectects kafka down event and raises SysExit exception
in alrmgen. Upon detecting kafka health check producer failure
the said exception is raised.

Change-Id: If3922d8dc550d292578731c97d7ce9e5ae3a1631
Closes-Bug: 1761424

Revision history for this message
aswani kumar (aswanikumar90) wrote :

Seeing this issue again on 5.0 build 40 and above.
I have a kafka cluster running on nodec4, nodec5, nodec6.

On stopping the kafka service on nodec5, the alarm is not getting generated.
On the other nodes it is getting generated.

[root@nodec5 ~]# docker ps -a | grep kafka
5820ef8e8e26 10.204.217.152:5000/contrail-external-kafka:ocata-5.0-40 "/docker-entrypoin..." 4 days ago Exited (137) 3 minutes ago analyticsdatabase_kafka_1

contrail-status

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: inactive
nodemgr: active
zookeeper: active
cassandra: active

NodeStatus on nodec5 shows the kafka process as running:

process_info: [
{
process_name: "kafka",
start_count: 3,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1525764463322191",
stop_count: 0,
last_exit_time: "1525764097238154",
exit_count: 1
},

Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

Do you mean that on stopping kafka service on nodec5 alarm is not getting generated on nodec5, but alarm is getting generated on nodec4 and nodec6?

What is the status of alarmgen on nodec4 and nodec6 - when kafka is stopped?
And what is the status of alarmgen on nodec5?

What is the HA config of kafka and replication factor?

Revision history for this message
aswani kumar (aswanikumar90) wrote :

I mean that on the other nodes, c4 and c6, if I stop kafka I see the respective alarms.

If I stop kafka on c5 I am not seeing any alarms.

After stopping kafka on nodec5, alarm-gen is active.
The above contrail-status output is from nodec5.

It's not an HA setup.

Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

What is the list of kafka brokers that the alarmgen on nodec5 can connect with?

Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

Example output from 5.0, when kafka is stopped:
[root@a1s19 /]# curl -X GET -H "X-Auth-Token: e792b56948eb42f7bea3e863d4e297cb" -H "content-type: application/json" http://10.84.5.19:8081/analytics/uves/virtual-networks | python -m json.tool
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 2 100 2 0 0 606 0 --:--:-- --:--:-- --:--:-- 666
[]
[root@a1s19 /]# contrail-status
Pod Service Original Name State Status
analytics alarm-gen contrail-analytics-alarm-gen running Up About a minute
analytics api contrail-analytics-api running Up 6 hours
analytics collector contrail-analytics-collector running Up 6 hours
analytics nodemgr contrail-nodemgr running Up 6 hours
analytics query-engine contrail-analytics-query-engine running Up 6 hours
config api contrail-controller-config-api running Up 6 hours
config cassandra contrail-external-cassandra running Up 6 hours
config device-manager contrail-controller-config-devicemgr running Up 6 hours
config nodemgr contrail-nodemgr running Up 6 hours
config rabbitmq contrail-external-rabbitmq running Up 6 hours
config schema contrail-controller-config-schema running Up 6 hours
config svc-monitor contrail-controller-config-svcmonitor running Up 6 hours
config zookeeper contrail-external-zookeeper running Up 6 hours
control control contrail-controller-control-control running Up 6 hours
control dns contrail-controller-control-dns running Up 6 hours
control named contrail-controller-control-named running Up 6 hours
control nodemgr contrail-nodemgr running Up 6 hours
database cassandra contrail-external-cassandra running Up 6 hours
database kafka contrail-external-kafka exited Exited (143) 23 minutes ago
database nodemgr contrail-nodemgr running Up 6 hours
database zookeeper contrail-external-zookeeper running Up 6 hours
vrouter agent contrail-vrouter-agent running Up 6 hours
vrouter nodemgr contrail-nodemgr running Up 6 hours
webui job contrail-controller-webui-job running Up 6 hours
webui web contrail-controller-webui-web running Up 6 hours

vrouter kernel module is PRESENT
== Contrail control ==
control: active
nodemgr: timeout
named: active
dns: active

== Contrail database ==
kafka: inactive
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: inactive
query-engine: active
api: initializing (UvePartitions:UVE-Aggregation[Partitions:0] connection down)
alarm-gen: active
nodemgr: active
collector: initializing (KafkaPub:10...


Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

[root@a1s19 /]# curl -X GET -H "X-Auth-Token: e792b56948eb42f7bea3e863d4e297cb" -H "content-type: application/json" http://10.84.5.19:8081/analytics/uves/projects | python -m json.tool
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 2 100 2 0 0 473 0 --:--:-- --:--:-- --:--:-- 666
[] << Kafka is DOWN (no UVEs displayed)

After starting kafka:
database kafka contrail-external-kafka running Up 2 minutes
database nodemgr contrail-nodemgr running Up 6 hours
database zookeeper contrail-external-zookeeper running Up 6 hours
vrouter agent contrail-vrouter-agent running Up 6 hours
vrouter nodemgr contrail-nodemgr running Up 6 hours
webui job contrail-controller-webui-job running Up 6 hours
webui web contrail-controller-webui-web running Up 6 hours

vrouter kernel module is PRESENT
== Contrail control ==
control: active
nodemgr: timeout
named: active
dns: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: inactive
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: inactive

[root@a1s19 /]# curl -X GET -H "X-Auth-Token: e792b56948eb42f7bea3e863d4e297cb" -H "content-type: application/json" http://10.84.5.19:8081/analytics/uves/projects | python -m json.tool
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 137 100 137 0 0 31086 0 --:--:-- --:--:-- --:--:-- 34250
[
    {
        "href": "http://10.84.5.19:8081/analytics/uves/project/default-domain:default-project?flat",
        "name": "default-domain:default-project"
    }
] << Kafka is back up (UVEs displayed again)
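The two curl responses above suggest a simple scripted check: an empty list from /analytics/uves/&lt;uve-type&gt; on a configured cluster is a hint that UVE aggregation (kafka) is down. A minimal sketch (`uve_aggregation_healthy` is a hypothetical helper; note an empty project list can also simply mean no projects exist):

```python
def uve_aggregation_healthy(uve_list):
    """Heuristic from the outputs above: a non-empty UVE list means
    aggregation is producing data; an empty list warrants a kafka check."""
    return len(uve_list) > 0

print(uve_aggregation_healthy([]))  # → False, as when kafka was stopped
print(uve_aggregation_healthy(
    [{"name": "default-domain:default-project"}]))  # → True
```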

Revision history for this message
Ankit Jain (ankitja) wrote :

Reopening the bug as it is seen again in 5.0-104~ocata.

AnalyticsTestSanity.test_db_node_process_status_alarms is failing because of this issue:

After stopping the kafka service, the script expects the alarm, which is not getting generated.

Expecting an alarm of type "default-global-system-config:system-defined-process-status" to be generated after executing "docker stop analyticsdatabase_kafka_1 -t 60" on nodem7.

Looks like the UVE is also not updated: it shows kafka as PROCESS_STATE_RUNNING. Pasting the UVE below:

[root@nodem7 ~]# docker ps -a | grep kafka
1592e3c73167 10.204.217.152:5000/contrail-external-kafka:ocata-5.0-104 "/docker-entrypoin..." 7 hours ago Exited (1) 2 minutes ago analyticsdatabase_kafka_1
[root@nodem7 ~]#

  "NodeStatus": {
    "build_info": "{\"build-info\" : [{\"build-version\" : \"5.0.1\", \"build-time\" : \"2018-06-18 08:57:10.831815\", \"build-user\" : \"zuul\", \"build-hostname\" : \"centos-7-4-builder-juniper-contrail-ci-0000055808\", \"build-id\" : \"5.0-104.el7\", \"build-number\" : \"@contrail\"}]}",
    "installed_package_version": "5.0-104.el7",
    "deleted": false,
    "disk_usage_info": {
      "/dev/mapper/nodem7--vg00-lv_root": {
        "partition_space_available_1k": 332337864,
        "partition_space_used_1k": 14033452,
        "percentage_partition_space_used": 4,
        "partition_type": "ext4"
      }
    },
    "__T": 1529408522111042,
    "running_package_version": "5.0-104.el7",
    "process_mem_cpu_usage": {
      "zookeeper": {
        "mem_res": 574980,
        "cpu_share": 0.01,
        "mem_virt": 577572
      },
      "cassandra": {
        "mem_res": 2721272,
        "cpu_share": 1.86,
        "mem_virt": 5909560
      },
      "contrail-database-nodemgr": {
        "mem_res": 47616,
        "cpu_share": 0.15,
        "mem_virt": 47628
      },
      "kafka": {
        "mem_res": 547848,
        "cpu_share": 0.0,
        "mem_virt": 549752
      }
    },
    "system_cpu_info": {
      "num_cpu": 40,
      "num_core_per_socket": 10,
      "num_thread_per_core": 2,
      "num_socket": 2
    },
    "system_mem_usage": {
      "used": 17315484,
      "cached": 13772224,
      "free": 230580516,
      "node_type": "database-node",
      "total": 263857016,
      "buffers": 2188792
    },
    "process_status": [
      {
        "instance_id": "0",
        "module_id": "contrail-database-nodemgr",
        "state": "Functional",
        "description": null,
        "connection_infos": [
          {
            "server_addrs": [
              "10.204.216.103:8086"
            ],
            "status": "Up",
            "type": "Collector",
            "name": null,
            "description": "ClientInit to Established on EvSandeshCtrlMessageRecv"
          }
        ]
      }
    ],
    "all_core_file_list": [
      "/var/crashes/core.contrail-query-.1.nodem7.1529387523"
    ],
    "system_cpu_usage": {
      "fifteen_min_avg": 2.28,
      "node_type": "database-node",
      "cpu_share": 0.18,
      "five_min_avg": 2.15,
      "one_min_avg": 2.31
    },
    "process_info": [
      {
        "process_name": "kafka",
        "start_count":...
