R5.0-341:RHOSP13: all alarm tests are failing since alarm is not getting generated

Bug #1802485 reported by alok kumar
Affects             Status  Importance  Assigned to     Milestone
Juniper Openstack (status tracked in Trunk)
  R5.0              New     High        Santosh Gupta
  Trunk             New     High        Santosh Gupta

Bug Description

All alarm sanity cases failed on build 341 in the RHOSP13 setup.

While debugging one particular test case, test_vrouter_process_status_alarms, we see the alarm being generated with the wrong host name.

When the vrouter agent was stopped on overcloud-novacompute-2, the alarm was seen for overcloud-novacompute-1:

{u'vrouter': [{u'name': u'overcloud-novacompute-1', u'value': {u'UVEAlarms': {u'alarms': [{u'severity': 2, u'alarm_rules': {u'or_list': [{u'and_list': [{u'condition': {u'operation': u'>=', u'operand1': u'VrouterStatsAgent.out_bps_ewm.*.sigma', u'variables': [u'VrouterStatsAgent.out_bps_ewm.__key'], u'operand2': {u'json_value': u'2'}}, u'match': [{u'json_operand1_value': u'3.0', u'json_variables': {u'VrouterStatsAgent.out_bps_ewm.__key': u'"enp4s0f1"'}}]}]}, {u'and_list': [{u'condition': {u'operation': u'>=', u'operand1': u'VrouterStatsAgent.in_bps_ewm.*.sigma', u'variables': [u'VrouterStatsAgent.in_bps_ewm.__key'], u'operand2': {u'json_value': u'2'}}, u'match': [{u'json_operand1_value': u'3.0', u'json_variables': {u'VrouterStatsAgent.in_bps_ewm.__key': u'"enp4s0f1"'}}]}]}]}, u'timestamp': 1541756354310490, u'ack': False, u'token': u'eyJ0aW1lc3RhbXAiOiAxNTQxNzU2MzU0MzEwNDkwLCAiaHR0cF9wb3J0IjogNTk5NSwgImhvc3RfaXAiOiAiMTAuMS4wLjI2In0=', u'type': u'default-global-system-config:system-defined-phyif-bandwidth', u'description': u'Physical Bandwidth usage anomaly.'}], u'__T': 1541756354311573}}}]}
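
For reference, the raised alarms can also be pulled straight from the analytics API instead of going through the test framework; a minimal sketch, assuming the analytics API is reachable on port 8081 of one of the contrail controller nodes (the IP below is a placeholder, not taken from this setup):

curl -s 'http://<contrail-controller-ip>:8081/analytics/uves/vrouter/*?cfilt=UVEAlarms' | python -m json.tool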

Tried stopping the agent manually on other computes too, but the alarm is always seen with the name overcloud-novacompute-1.

Still debugging the issue; will update the bug with any further findings.

Setup info:

hypervisor: 10.204.217.133
undercloud: 192.168.122.68

(undercloud) [stack@queensa ~]$ openstack server list
+--------------------------------------+--------------------------------+--------+------------------------+----------------+---------------------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+--------------------------------+--------+------------------------+----------------+---------------------+
| 0127b6d9-0942-4047-9739-29d2cc9bf26f | overcloud-contrailcontroller-0 | ACTIVE | ctlplane=192.168.24.14 | overcloud-full | contrail-controller |
| 86879faf-293b-4be7-ae24-7bd268451be1 | overcloud-contrailcontroller-1 | ACTIVE | ctlplane=192.168.24.6 | overcloud-full | contrail-controller |
| 0e3e0f2e-8ef5-4ab8-8236-f58a0cae929b | overcloud-controller-1 | ACTIVE | ctlplane=192.168.24.23 | overcloud-full | control |
| 99509e79-4264-4e41-94bd-18252cb6428b | overcloud-controller-0 | ACTIVE | ctlplane=192.168.24.20 | overcloud-full | control |
| e847506d-2615-49b8-833e-432de751a5c0 | overcloud-novacompute-2 | ACTIVE | ctlplane=192.168.24.9 | overcloud-full | compute |
| 774b6e04-9704-4f33-b8c5-6c2a9ec0352e | overcloud-novacompute-0 | ACTIVE | ctlplane=192.168.24.15 | overcloud-full | compute |
| c0bda172-7039-47b8-92d0-03ad30be9103 | overcloud-controller-2 | ACTIVE | ctlplane=192.168.24.19 | overcloud-full | control |
| 59fc3a15-66d6-4e48-81ef-00e724ad2aed | overcloud-novacompute-1 | ACTIVE | ctlplane=192.168.24.12 | overcloud-full | compute |
| 79d3c0c0-305a-4e50-9502-ecde64276fd4 | overcloud-contrailcontroller-2 | ACTIVE | ctlplane=192.168.24.8 | overcloud-full | contrail-controller |
+--------------------------------------+--------------------------------+--------+------------------------+----------------+---------------------+

Revision history for this message
alok kumar (kalok) wrote :

Looks like the alarm with a different host name was due to some other reason. Now we don't see any alarm getting generated when the agent is stopped manually, even though the process status UVE is updated with PROCESS_STATE_EXITED as expected; the alarm is simply not generated.

(Pdb) self.ops_inspect[collector_ip].dict_get('analytics/uves/vrouter/overcloud-novacompute-2?flat')
{u'NodeStatus': {u'build_info': u'{"build-info" : [{"build-version" : "5.0.2", "build-time" : "2018-11-07 07:55:22.713293", "build-user" : "zuul", "build-hostname" : "rhel-7-builder-juniper-contrail-ci-0000139243", "build-id" : "5.0-341.el7", "build-number" : "@contrail"}]}', u'installed_package_version': u'5.0-341.el7', u'deleted': False, u'disk_usage_info': {u'/dev/sda2': {u'partition_space_available_1k': 963361544, u'partition_space_used_1k': 13388732, u'percentage_partition_space_used': 1, u'partition_type': u'xfs'}}, u'__T': 1541761913816681, u'running_package_version': u'5.0-341.el7', u'process_mem_cpu_usage': {u'contrail-vrouter-nodemgr': {u'mem_res': 32808, u'cpu_share': 0.41, u'mem_virt': 60456}}, u'system_cpu_usage': {u'fifteen_min_avg': 0.29, u'node_type': u'vrouter', u'cpu_share': 0.02, u'five_min_avg': 0.24, u'one_min_avg': 0.18}, u'system_mem_usage': {u'used': 2806072, u'cached': 4112024, u'free': 190961068, u'node_type': u'vrouter', u'total': 197881792, u'buffers': 2628}, u'process_status': [{u'instance_id': u'0', u'module_id': u'contrail-vrouter-nodemgr', u'state': u'Functional', u'description': None, u'connection_infos': [{u'server_addrs': [u'10.1.0.16:8086'], u'status': u'Up', u'type': u'Collector', u'name': None, u'description': u'ClientInit to Established on EvSandeshCtrlMessageRecv'}]}], u'system_cpu_info': {u'num_cpu': 32, u'num_core_per_socket': 8, u'num_thread_per_core': 2, u'num_socket': 2}, u'process_info': [{u'process_name': u'contrail-vrouter-agent', u'start_count': 9, u'process_state': u'PROCESS_STATE_EXITED', u'last_stop_time': None, u'core_file_list': [], u'last_start_time': u'1541761626829708', u'stop_count': 0, u'last_exit_time': u'1541761652287532', u'exit_count': 9}, {u'process_name': u'contrail-vrouter-nodemgr', u'start_count': 1, u'process_state': u'PROCESS_STATE_RUNNING', u'last_stop_time': None, u'core_file_list': [], u'last_start_time': u'1541684145000000', u'stop_count': 0, u'last_exit_time': None, u'exit_count': 0}]}, u'ContrailConfig': {u'deleted': False, u'__T': 1541743958781663, u'elements': {u'fq_name': u'["default-global-system-config", "overcloud-novacompute-2"]', u'parent_uuid': u'"47fe663e-7f70-404b-adfb-00f579062afe"', u'virtual_router_dpdk_enabled': u'false', u'parent_type': u'"global-system-config"', u'uuid': u'"e5c7c7e9-4b21-4496-aa1c-e83f84b3aa97"', u'perms2': u'{"owner": "cloud-admin", "owner_access": 7, "global_access": 0, "share": []}', u'id_perms': u'{"enable": true, "description": null, "created": "2018-11-08T13:36:36.070703", "creator": null, "uuid": {"uuid_mslong": 16557422359852696726, "uuid_lslong": 12257927645302598295}, "user_visible": true, "last_modified": "2018-11-08T17:32:13.278946", "permissions": {"owner": "admin", "owner_access": 7, "other_access": 7, "group": "admin", "group_access": 7}}', u'display_name': u'"overcl...
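
The same UVE can also be fetched directly with curl, outside the test framework; a minimal sketch, again assuming the analytics API on port 8081 (the IP is a placeholder):

curl -s 'http://<analytics-api-ip>:8081/analytics/uves/vrouter/overcloud-novacompute-2?cfilt=NodeStatus' | python -m json.tool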


Revision history for this message
Biswajit Mandal (bmandal) wrote :

@alok, what was the issue related to hostname?

We see that on overcloud-contrailcontroller-0 the alarm_config was not pushed:
[root@overcloud-contrailcontroller-0 contrail]#curl localhost:5995/Snh_AlarmConfigRequest?name=
<?xml-stylesheet type="text/xsl" href="/universal_parse.xsl"?><AlarmConfigResponse type="sandesh"><alarms type="list" identifier="1"><list type="struct" size="0"></list></alarms></AlarmConfigResponse>
[root@overcloud-contrailcontroller-0 contrail]#
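
To compare the pushed alarm configuration across the three contrail controllers, the same introspect can be queried on each node; a rough sketch run from the undercloud, assuming heat-admin SSH access over the ctlplane IPs listed above (the grep just pulls out the size of the alarm list):

for ip in 192.168.24.14 192.168.24.6 192.168.24.8; do
    echo "== $ip =="
    ssh heat-admin@$ip "curl -s 'localhost:5995/Snh_AlarmConfigRequest?name='" | grep -o 'size="[0-9]*"'
done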

The other two nodes were fine.
There are also connection drops from alarmgen to Kafka:

11/09/2018 10:56:17 PM [kafka.client] [WARNING]: Node 1 connection failed -- refreshing metadata
11/09/2018 10:59:11 PM [kafka.client] [WARNING]: Node 2 connection failed -- refreshing metadata
11/09/2018 10:59:12 PM [kafka.client] [WARNING]: Node 1 connection failed -- refreshing metadata
11/09/2018 11:00:33 PM [kafka.client] [WARNING]: Node 1 connection failed -- refreshing metadata
11/09/2018 11:14:08 PM [kafka.client] [WARNING]: Node 1 connection failed -- refreshing metadata
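
Worth checking whether the brokers are reachable at all from the node running alarmgen; a rough sketch, assuming Kafka runs on the contrail controllers on the default broker port 9092 (the port, the choice of ctlplane IPs and the container name are all assumptions, not taken from this setup):

for ip in 192.168.24.14 192.168.24.6 192.168.24.8; do nc -zv $ip 9092; done
docker ps | grep -i kafka
docker logs --tail 100 <kafka-container-name>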

Revision history for this message
alok kumar (kalok) wrote :

@Biswajit, when the compute2 agent was stopped, the alarm was seen with the name of compute1, even though the agent was up on compute1. In the next rerun of the test case I didn't see this behaviour, and no alarm was generated at all.

Revision history for this message
Santosh Gupta (sangupta) wrote :

@alok Couple of questions on this:
- We have a general concern about why we had this mix-up in names. Assigning the proper hostname / name
  resolution is a basic function. Is this a provisioning issue?
  The fact that this is not seen with ansible points to some issue in provisioning.
- When was the last time this was working fine? That would give us a baseline.
  Did we have some changes in the RHOSP provisioning scripts? We need to know the changes.
  There are some changes on controller/analytics in 5.0.2, so having a baseline would help isolate
  the issue.

Revision history for this message
alok kumar (kalok) wrote :

@Santosh, there is no issue with the hostnames; please check the setup.
As mentioned in comment #3, the alarm was seen with the name of compute1 instead of compute2. Why this was seen even when the agent was running on compute1 (and stopped on compute2) can only be debugged from the logs, as it was not seen again while debugging.

I think right now you can debug the issue from the perspective of the alarm not getting generated at all.

Revision history for this message
alok kumar (kalok) wrote :

Alarm cases had passed on 5.0.2 build 330 in the RHOSP13 setup.

Revision history for this message
alok kumar (kalok) wrote :

As a workaround, restarting alarmgen fixes the issue.
Verified after restarting alarmgen on all 3 controllers: all alarm sanity cases passed.
Verified on a build 349 DPDK setup.
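
For reference, on this containerized deployment the restart boils down to something like the following on each of the three contrail controllers (a sketch; the exact alarmgen container name varies, so confirm it with docker ps first):

docker ps --format '{{.Names}}' | grep -i alarm
docker restart <alarmgen-container-name>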

Revision history for this message
Sudheendra Rao (sudheendra-k) wrote :

Removing the blocker tag as we have a workaround and the bug will be release-noted if not fixed immediately.

tags: removed: blocker sanityblocker