Brief Description
-----------------
After the ceph monitor was removed and stopped, the 200.006 ceph monitor failure alarm was not raised.
Severity
--------
Major
Steps to Reproduce
------------------
Remove the ceph monitor
Stop the ceph monitor process
Check whether the 200.006 ceph monitor failure alarm is raised
TC-name: ceph/test_ceph.py::test_ceph_mon_process_kill[controller-0]
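The steps above can be sketched as follows. The ceph/fm invocations are taken verbatim from the logs in this report; the grep-based check function is a hypothetical helper for illustrating the verification step, not part of the test framework:

```shell
# Reproduction, run on the active controller (commands from the logs below):
#   ceph mon remove controller-0
#   service ceph stop mon.controller-0
#   fm alarm-list --nowrap --uuid

# Hypothetical helper: succeed (exit 0) iff the alarm-list output passed
# as $1 contains the 200.006 ceph monitor failure alarm ID.
has_ceph_mon_alarm() {
  printf '%s\n' "$1" | grep -q '200\.006'
}
```

In this run, piping the captured `fm alarm-list` table into such a check would fail: the table contains only 400.002/400.005 alarms, and 200.006 never appears.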
Expected Behavior
------------------
200.006 alarm raised
Actual Behavior
----------------
200.006 alarm not raised
Reproducibility
---------------
Seen once
System Configuration
--------------------
Multi-node system
Lab-name: WCP_113-121
Branch/Pull Time/Commit
-----------------------
stx master as of 20190705T013000Z
Last Pass
---------
Lab: WCP_113_121
Load: 20190704T013000Z
Timestamp/Logs
--------------
[2019-07-05 17:48:14,138] 301 DEBUG MainThread ssh.send :: Send 'ceph mon remove controller-0'
[2019-07-05 17:48:14,495] 423 DEBUG MainThread ssh.expect :: Output:
removing mon.controller-0 at 192.168.222.3:6789/0, there will be 2 monitors
controller-0:~#
[2019-07-05 17:48:14,599] 301 DEBUG MainThread ssh.send :: Send 'service ceph stop mon.controller-0'
[2019-07-05 17:48:14,933] 423 DEBUG MainThread ssh.expect :: Output:
=== mon.controller-0 ===
Stopping Ceph mon.controller-0 on controller-0...done
[2019-07-05 17:48:19,974] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 17:48:21,340] 423 DEBUG MainThread ssh.expect :: Output:
[sysadmin@controller-1 ~(keystone_admin)]$
[2019-07-05 17:53:11,026] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 17:53:12,774] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| a4ef9cae-0ba1-4a73-81e9-48266d6ac7f9 | 400.005 | Communication failure detected with peer over port ens785f0.166 on host controller-1 | host=controller-1.network=mgmt | major | 2019-07-05T17:50:35.284358 |
| 245c97d6-6688-45f5-a739-d59b16ee9b7a | 400.005 | Communication failure detected with peer over port eno1 on host controller-1 | host=controller-1.network=oam | major | 2019-07-05T17:50:35.162347 |
| de90b87b-2b74-425a-af5a-4594b8f17803 | 400.005 | Communication failure detected with peer over port ens785f0.167 on host controller-1 | host=controller-1.network=cluster-host | major | 2019-07-05T17:50:35.000430 |
| c70e7c57-b0ad-4d35-be98-d7ae047a7484 | 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=oam-services | major | 2019-07-05T17:49:23.468544 |
| 964d0e8e-5697-4261-859d-cbb3c60b6e4c | 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=controller-services | major | 2019-07-05T17:49:23.385467 |
| 4d626ad5-66dd-490a-b185-c9e92c4a8dfe | 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=cloud-services | major | 2019-07-05T17:49:23.303507 |
| 412b21f1-7a26-45a1-9e05-7aa75941626c | 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=vim-services | major | 2019-07-05T17:49:23.221508 |
| b0efcf99-5ee5-4d66-9de1-90d8e6d5aa9d | 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=patching-services | major | 2019-07-05T17:49:23.139569 |
| 5d296e4a-89b3-4091-ac05-873b34e3b034 | 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=directory-services | major | 2019-07-05T17:49:23.057483 |
| 382219b1-a819-4752-abc3-53987e614d50 | 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=web-services | major | 2019-07-05T17:49:22.975577 |
| a07c627d-320d-4593-ad54-9ab990afff82 | 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=storage-services | major | 2019-07-05T17:49:22.894474 |
| 3a2d26f0-ea11-4e0d-840e-6e890c5a2406 | 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=storage-monitoring-services | major | 2019-07-05T17:49:22.852499 |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$
Test Activity
-------------
Sanity
Marking as low priority / not gating, since the issue was seen only once and does not have a large system impact (just an alarm not being raised).
If the issue becomes more reproducible, please let us know.