[5.0 build 50]Alarm not getting generated on stopping kafka container

Bug #1761424 reported by aswani kumar
Affects: Juniper Openstack (status tracked in Trunk)
R5.0: Fix Released, Critical, assigned to Jack Jonnalagadda
Trunk: Fix Committed, Critical, assigned to Jack Jonnalagadda

Bug Description

5.0 build tag ocata-master-50 multinode openstack

I have a kafka cluster running on three nodes: nodec7, nodec8, nodec57.

On stopping the kafka container on nodec7, I am able to see the alarm,
but on stopping the kafka container on the other nodes (nodec8 and nodec57) I am not able to see any alarms.

tags: added: analytics sanityblocker
Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

Please provide the setup or logs. Did you check the NodeStatus UVE for the database node where you stopped the kafka container?

Revision history for this message
aswani kumar (aswanikumar90) wrote :

I didn't have a multinode setup, so I tried on a single node.

I stopped the kafka container
[root@nodeg6 ~]# docker ps -a | grep kafka
b4ab654d9d07 opencontrailnightly/contrail-external-kafka:ocata-master-50 "/docker-entrypoin..." 43 hours ago Exited (143) 3 minutes ago analyticsdatabase_kafka_1

In the database NodeStatus UVE, process_info still shows the process as running:
process_info: [
{
process_name: "kafka",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1522831835000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},

you can use nodeg6
10.204.217.46
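For reference, the process_info payload above can be checked mechanically. A minimal sketch (field names are taken from the UVE output above; `non_running_processes` is a hypothetical helper, not part of any Contrail API):

```python
def non_running_processes(node_status):
    """Return names of processes whose UVE state is not PROCESS_STATE_RUNNING."""
    return [p["process_name"]
            for p in node_status.get("process_info", [])
            if p.get("process_state") != "PROCESS_STATE_RUNNING"]

# Trimmed sample mirroring the NodeStatus output above: kafka is still
# (incorrectly) reported as RUNNING, so nothing is flagged.
sample = {
    "process_info": [
        {"process_name": "kafka", "process_state": "PROCESS_STATE_RUNNING"},
    ]
}
print(non_running_processes(sample))  # → []
```

Run against the JSON returned by /analytics/uves/database-node/&lt;node&gt;?cfilt=NodeStatus:process_info, this would flag the stopped kafka only once nodemgr actually reports PROCESS_STATE_EXITED.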

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

nodemgr doesn't seem to detect the status of kafka container when it goes down

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

checked this env (nodeg6)

cluster is up, kafka is in the running state
- stopped kafka: docker stop ab882abc227c
- checked contrail-status output: kafka is in exited state, inactive; collector is in the initializing state due to a kafka connection error.
- checked introspect port: kafka is in exited state.

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

@Andrey, the issue is not about contrail-status reflecting the correct state of the docker. The issue is that nodemgr doesn't send the correct status of the docker in the NodeStatus UVE and hence alarms are not being raised.

The url below shows the kafka status as RUNNING. This is incorrect.
Nodemgr should have detected that the docker was stopped and updated the same in the NodeStatus UVE.

https://10.204.217.46:8143/proxy?proxyURL=http://5.5.5.234:8081/analytics/uves/database-node/nodeg6?cfilt=NodeStatus:process_info

{
NodeStatus: {
process_info: [
{
process_name: "kafka",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING", <<<<<<<<<<<<<<<<<<< Incorrect
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},
{
process_name: "zookeeper",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},
{
process_name: "contrail-database-nodemgr",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
},
{
process_name: "cassandra",
start_count: 1,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1523364755000000",
stop_count: 0,
last_exit_time: null,
exit_count: 0
}
]
}
}

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

I checked it with
[root@nodeg6 ~]# curl -s http://localhost:8103/Snh_SandeshUVECacheReq?x=NodeStatus | xmllint --format -

and I see
            <ProcessInfo>
              <process_name type="string" identifier="1">kafka</process_name>
              <process_state type="string" identifier="2">PROCESS_STATE_EXITED</process_state>
              <start_count type="u32" identifier="3">1</start_count>
              <stop_count type="u32" identifier="4">0</stop_count>
              <exit_count type="u32" identifier="5">0</exit_count>
              <last_start_time type="string" identifier="6">1523364755000000</last_start_time>
              <last_stop_time type="string" identifier="7"/>
              <last_exit_time type="string" identifier="8"/>
              <core_file_list type="list" identifier="9">
                <list type="string" size="0"/>
              </core_file_list>
            </ProcessInfo>
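For anyone repeating this check, the introspect XML can be parsed with Python's stdlib ElementTree instead of eyeballing the xmllint output. A minimal sketch, with the fragment trimmed to the two fields of interest:

```python
import xml.etree.ElementTree as ET

# Trimmed <ProcessInfo> fragment from the introspect output above.
xml_fragment = """
<ProcessInfo>
  <process_name type="string" identifier="1">kafka</process_name>
  <process_state type="string" identifier="2">PROCESS_STATE_EXITED</process_state>
</ProcessInfo>
"""

info = ET.fromstring(xml_fragment)
name = info.findtext("process_name")
state = info.findtext("process_state")
print(name, state)  # kafka PROCESS_STATE_EXITED
```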

So I conclude that nodemgr detects the correct status.

I'm not an expert in UVEs. How can I debug what nodemgr sends?

In your example, what does the URL mean?

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

In my 3-node setup, when I switched off all three kafkas, the WebUI shows only one alarm, from the first node, and it shows the other two kafkas as up.

When I switched kafka back on on the first server, the WebUI shows the same picture.
So I need to know how to debug all this.

Revision history for this message
Andrey Pavlov (apavlov-e) wrote :

nodemgr.conf -

[COLLECTOR]
server_list=10.1.26.148:8086 10.1.26.149:8086 10.1.26.150:8086

Revision history for this message
Sundaresan Rajangam (srajanga) wrote :

If the kafka cluster is down, then analytics-api and alarm-gen should report their state as non-functional, and /analytics/uves/<uve-type> should not return any data.

summary: - [5.0 build 50]Alarm not gettinng generated on stopping kafka containner
+ [5.0 build 50]Alarm not getting generated on stopping kafka container
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/41920
Submitter: Jack Jonnalagadda (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] R5.0

Review in progress for https://review.opencontrail.org/42049
Submitter: Jack Jonnalagadda (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/41920
Committed: http://github.com/Juniper/contrail-analytics/commit/5f8494f35c670bd71d09606ec5a2a9efea39b432
Submitter: Zuul v3 CI (<email address hidden>)
Branch: master

commit 5f8494f35c670bd71d09606ec5a2a9efea39b432
Author: Jackjvs <email address hidden>
Date: Sat Apr 14 10:05:10 2018 -0700

After 4.0, alarm is not getting generated when kafka container is
down, because discovery doesn't detect the event in the container
model of deployment.

This fix dectects kafka down event and raises SysExit exception
in alrmgen. Upon detecting kafka health check producer failure
the said exception is raised.

Change-Id: If3922d8dc550d292578731c97d7ce9e5ae3a1631
Closes-Bug: 1761424
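The mechanism the commit describes can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the actual alarmgen code: `check_kafka` and `broken_publish` are hypothetical names, and in the real fix the publish callable would be the Kafka health-check producer.

```python
def check_kafka(publish):
    """One health-check round: try to publish, and on failure raise
    SystemExit so the alarmgen process terminates and gets restarted,
    restoring alarm generation once kafka is reachable again."""
    try:
        publish(b"healthcheck")
    except Exception as exc:
        # Mirrors the fix described in the commit: on producer failure,
        # exit instead of silently continuing without kafka.
        raise SystemExit("kafka health check failed: %s" % exc)

# Hypothetical publisher standing in for a failing Kafka producer.
def broken_publish(msg):
    raise ConnectionError("broker not available")

try:
    check_kafka(broken_publish)
except SystemExit as exc:
    print("alarmgen would exit:", exc)
```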

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/42049
Committed: http://github.com/Juniper/contrail-analytics/commit/2a19b7645950b8932de831e8eea5ee17b6068879
Submitter: Zuul v3 CI (<email address hidden>)
Branch: R5.0

commit 2a19b7645950b8932de831e8eea5ee17b6068879
Author: Jackjvs <email address hidden>
Date: Sat Apr 14 10:05:10 2018 -0700

After 4.0, alarm is not getting generated when kafka container is
down, because discovery doesn't detect the event in the container
model of deployment.

This fix dectects kafka down event and raises SysExit exception
in alrmgen. Upon detecting kafka health check producer failure
the said exception is raised.

Change-Id: If3922d8dc550d292578731c97d7ce9e5ae3a1631
Closes-Bug: 1761424

Revision history for this message
aswani kumar (aswanikumar90) wrote :

Seeing this issue again on 5.0 build 40 and above.
I have a kafka cluster running on nodec4, nodec5, nodec6.

On stopping the kafka service on nodec5, the alarm is not getting generated.
On the other nodes it is getting generated.

[root@nodec5 ~]# docker ps -a | grep kafka
5820ef8e8e26 10.204.217.152:5000/contrail-external-kafka:ocata-5.0-40 "/docker-entrypoin..." 4 days ago Exited (137) 3 minutes ago analyticsdatabase_kafka_1

contrail-status

== Contrail control ==
control: active
nodemgr: active
named: active
dns: active

== Contrail analytics ==
snmp-collector: active
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: active

== Contrail config ==
api: active
zookeeper: active
svc-monitor: backup
nodemgr: active
device-manager: backup
cassandra: active
rabbitmq: active
schema: backup

== Contrail webui ==
web: active
job: active

== Contrail database ==
kafka: inactive
nodemgr: active
zookeeper: active
cassandra: active

NodeStatus on nodec5 shows the kafka process as running:

process_info: [
{
process_name: "kafka",
start_count: 3,
process_state: "PROCESS_STATE_RUNNING",
last_stop_time: null,
core_file_list: [ ],
last_start_time: "1525764463322191",
stop_count: 0,
last_exit_time: "1525764097238154",
exit_count: 1
},

Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

Do you mean that on stopping kafka service on nodec5 alarm is not getting generated on nodec5, but alarm is getting generated on nodec4 and nodec6?

What is the status of alarmgen on nodec4 and nodec6 - when kafka is stopped?
And what is the status of alarmgen on nodec5?

What is the HA config of kafka and replication factor?

Revision history for this message
aswani kumar (aswanikumar90) wrote :

I mean that on the other nodes, c4 and c6, if I stop kafka I see the respective alarms.

If I stop kafka on c5 I am not seeing any alarms.

After stopping kafka on nodec5, alarm-gen is active.
The above contrail-status output is from nodec5.

It's not an HA setup.

Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

What is the list of kafka brokers that the alarmgen on nodec5 can connect with?

Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

Example output from 5.0, when kafka is stopped:
[root@a1s19 /]# curl -X GET -H "X-Auth-Token: e792b56948eb42f7bea3e863d4e297cb" -H "content-type: application/json" http://10.84.5.19:8081/analytics/uves/virtual-networks | python -m json.tool
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 2 100 2 0 0 606 0 --:--:-- --:--:-- --:--:-- 666
[]
[root@a1s19 /]# contrail-status
Pod Service Original Name State Status
analytics alarm-gen contrail-analytics-alarm-gen running Up About a minute
analytics api contrail-analytics-api running Up 6 hours
analytics collector contrail-analytics-collector running Up 6 hours
analytics nodemgr contrail-nodemgr running Up 6 hours
analytics query-engine contrail-analytics-query-engine running Up 6 hours
config api contrail-controller-config-api running Up 6 hours
config cassandra contrail-external-cassandra running Up 6 hours
config device-manager contrail-controller-config-devicemgr running Up 6 hours
config nodemgr contrail-nodemgr running Up 6 hours
config rabbitmq contrail-external-rabbitmq running Up 6 hours
config schema contrail-controller-config-schema running Up 6 hours
config svc-monitor contrail-controller-config-svcmonitor running Up 6 hours
config zookeeper contrail-external-zookeeper running Up 6 hours
control control contrail-controller-control-control running Up 6 hours
control dns contrail-controller-control-dns running Up 6 hours
control named contrail-controller-control-named running Up 6 hours
control nodemgr contrail-nodemgr running Up 6 hours
database cassandra contrail-external-cassandra running Up 6 hours
database kafka contrail-external-kafka exited Exited (143) 23 minutes ago
database nodemgr contrail-nodemgr running Up 6 hours
database zookeeper contrail-external-zookeeper running Up 6 hours
vrouter agent contrail-vrouter-agent running Up 6 hours
vrouter nodemgr contrail-nodemgr running Up 6 hours
webui job contrail-controller-webui-job running Up 6 hours
webui web contrail-controller-webui-web running Up 6 hours

vrouter kernel module is PRESENT
== Contrail control ==
control: active
nodemgr: timeout
named: active
dns: active

== Contrail database ==
kafka: inactive
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: inactive
query-engine: active
api: initializing (UvePartitions:UVE-Aggregation[Partitions:0] connection down)
alarm-gen: active
nodemgr: active
collector: initializing (KafkaPub:10...


Revision history for this message
Jack Jonnalagadda (jackjvs) wrote :

[root@a1s19 /]# curl -X GET -H "X-Auth-Token: e792b56948eb42f7bea3e863d4e297cb" -H "content-type: application/json" http://10.84.5.19:8081/analytics/uves/projects | python -m json.tool
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 2 100 2 0 0 473 0 --:--:-- --:--:-- --:--:-- 666
[] << Kafka is DOWN (no UVEs displayed)

After starting kafka:
database kafka contrail-external-kafka running Up 2 minutes
database nodemgr contrail-nodemgr running Up 6 hours
database zookeeper contrail-external-zookeeper running Up 6 hours
vrouter agent contrail-vrouter-agent running Up 6 hours
vrouter nodemgr contrail-nodemgr running Up 6 hours
webui job contrail-controller-webui-job running Up 6 hours
webui web contrail-controller-webui-web running Up 6 hours

vrouter kernel module is PRESENT
== Contrail control ==
control: active
nodemgr: timeout
named: active
dns: active

== Contrail database ==
kafka: active
nodemgr: active
zookeeper: active
cassandra: active

== Contrail analytics ==
snmp-collector: inactive
query-engine: active
api: active
alarm-gen: active
nodemgr: active
collector: active
topology: inactive

[root@a1s19 /]# curl -X GET -H "X-Auth-Token: e792b56948eb42f7bea3e863d4e297cb" -H "content-type: application/json" http://10.84.5.19:8081/analytics/uves/projects | python -m json.tool
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
100 137 100 137 0 0 31086 0 --:--:-- --:--:-- --:--:-- 34250
[
    {
        "href": "http://10.84.5.19:8081/analytics/uves/project/default-domain:default-project?flat",
        "name": "default-domain:default-project"
    }
] << Kafka is back up (UVEs displayed again)
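The two curl responses above suggest a simple scripted check: an empty list from /analytics/uves/&lt;uve-type&gt; on a configured cluster is a hint that UVE aggregation (kafka) is down. A minimal sketch (`uve_aggregation_healthy` is a hypothetical helper; note an empty project list can also simply mean no projects exist):

```python
def uve_aggregation_healthy(uve_list):
    """Heuristic from the outputs above: a non-empty UVE list means
    aggregation is producing data; an empty list warrants a kafka check."""
    return len(uve_list) > 0

print(uve_aggregation_healthy([]))  # → False, as when kafka was stopped
print(uve_aggregation_healthy(
    [{"name": "default-domain:default-project"}]))  # → True
```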

Revision history for this message
Ankit Jain (ankitja) wrote :

Reopening the bug as it is seen again in 5.0-104~ocata.

AnalyticsTestSanity.test_db_node_process_status_alarms is failing because of this issue:

After stopping the kafka service, the script expects the alarm, which is not getting generated.

Expecting an alarm of type "default-global-system-config:system-defined-process-status" to be generated after executing "docker stop analyticsdatabase_kafka_1 -t 60" on nodem7.

Looks like the UVE is also not updated: it shows kafka as PROCESS_STATE_RUNNING. Pasting the UVE below:

[root@nodem7 ~]# docker ps -a | grep kafka
1592e3c73167 10.204.217.152:5000/contrail-external-kafka:ocata-5.0-104 "/docker-entrypoin..." 7 hours ago Exited (1) 2 minutes ago analyticsdatabase_kafka_1
[root@nodem7 ~]#

  "NodeStatus": {
    "build_info": "{\"build-info\" : [{\"build-version\" : \"5.0.1\", \"build-time\" : \"2018-06-18 08:57:10.831815\", \"build-user\" : \"zuul\", \"build-hostname\" : \"centos-7-4-builder-juniper-contrail-ci-0000055808\", \"build-id\" : \"5.0-104.el7\", \"build-number\" : \"@contrail\"}]}",
    "installed_package_version": "5.0-104.el7",
    "deleted": false,
    "disk_usage_info": {
      "/dev/mapper/nodem7--vg00-lv_root": {
        "partition_space_available_1k": 332337864,
        "partition_space_used_1k": 14033452,
        "percentage_partition_space_used": 4,
        "partition_type": "ext4"
      }
    },
    "__T": 1529408522111042,
    "running_package_version": "5.0-104.el7",
    "process_mem_cpu_usage": {
      "zookeeper": {
        "mem_res": 574980,
        "cpu_share": 0.01,
        "mem_virt": 577572
      },
      "cassandra": {
        "mem_res": 2721272,
        "cpu_share": 1.86,
        "mem_virt": 5909560
      },
      "contrail-database-nodemgr": {
        "mem_res": 47616,
        "cpu_share": 0.15,
        "mem_virt": 47628
      },
      "kafka": {
        "mem_res": 547848,
        "cpu_share": 0.0,
        "mem_virt": 549752
      }
    },
    "system_cpu_info": {
      "num_cpu": 40,
      "num_core_per_socket": 10,
      "num_thread_per_core": 2,
      "num_socket": 2
    },
    "system_mem_usage": {
      "used": 17315484,
      "cached": 13772224,
      "free": 230580516,
      "node_type": "database-node",
      "total": 263857016,
      "buffers": 2188792
    },
    "process_status": [
      {
        "instance_id": "0",
        "module_id": "contrail-database-nodemgr",
        "state": "Functional",
        "description": null,
        "connection_infos": [
          {
            "server_addrs": [
              "10.204.216.103:8086"
            ],
            "status": "Up",
            "type": "Collector",
            "name": null,
            "description": "ClientInit to Established on EvSandeshCtrlMessageRecv"
          }
        ]
      }
    ],
    "all_core_file_list": [
      "/var/crashes/core.contrail-query-.1.nodem7.1529387523"
    ],
    "system_cpu_usage": {
      "fifteen_min_avg": 2.28,
      "node_type": "database-node",
      "cpu_share": 0.18,
      "five_min_avg": 2.15,
      "one_min_avg": 2.31
    },
    "process_info": [
      {
        "process_name": "kafka",
        "start_count":...
