ceph monitor failure alarm not raised after ceph monitor stopped

Bug #1835571 reported by Peng Peng
Affects: StarlingX
Status: Triaged
Importance: Low
Assigned to: Tingjie Chen

Bug Description

Brief Description
-----------------
After the ceph monitor was removed and stopped, the 200.006 ceph monitor failure alarm was not raised.

Severity
--------
Major

Steps to Reproduce
------------------
Remove the ceph monitor
Stop the ceph monitor service
Check whether the 200.006 ceph monitor failure alarm is raised (see the command sketch below)

TC-name: ceph/test_ceph.py::test_ceph_mon_process_kill[controller-0]
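
For reference, a minimal command sketch of the reproduction, assembled from the log excerpts below; the monitor name (controller-0) and the wait time are lab-specific assumptions and may differ on other systems:

ceph mon remove controller-0
service ceph stop mon.controller-0
# allow fault management a few minutes to detect the failure, then check:
fm alarm-list --nowrap --uuid | grep 200.006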

Expected Behavior
------------------
200.006 alarm raised

Actual Behavior
----------------
200.006 alarm not raised

Reproducibility
---------------
Seen once

System Configuration
--------------------
Multi-node system

Lab-name: WCP_113-121

Branch/Pull Time/Commit
-----------------------
stx master as of 20190705T013000Z

Last Pass
---------
Lab: WCP_113_121
Load: 20190704T013000Z

Timestamp/Logs
--------------
[2019-07-05 17:48:14,138] 301 DEBUG MainThread ssh.send :: Send 'ceph mon remove controller-0'
[2019-07-05 17:48:14,495] 423 DEBUG MainThread ssh.expect :: Output:
removing mon.controller-0 at 192.168.222.3:6789/0, there will be 2 monitors
controller-0:~#

[2019-07-05 17:48:14,599] 301 DEBUG MainThread ssh.send :: Send 'service ceph stop mon.controller-0'
[2019-07-05 17:48:14,933] 423 DEBUG MainThread ssh.expect :: Output:
=== mon.controller-0 ===
Stopping Ceph mon.controller-0 on controller-0...done

[2019-07-05 17:48:19,974] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 17:48:21,340] 423 DEBUG MainThread ssh.expect :: Output:

[sysadmin@controller-1 ~(keystone_admin)]$

[2019-07-05 17:53:11,026] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.222.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-07-05 17:53:12,774] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
| a4ef9cae-0ba1-4a73-81e9-48266d6ac7f9 | 400.005 | Communication failure detected with peer over port ens785f0.166 on host controller-1 | host=controller-1.network=mgmt | major | 2019-07-05T17:50:35.284358 |
| 245c97d6-6688-45f5-a739-d59b16ee9b7a | 400.005 | Communication failure detected with peer over port eno1 on host controller-1 | host=controller-1.network=oam | major | 2019-07-05T17:50:35.162347 |
| de90b87b-2b74-425a-af5a-4594b8f17803 | 400.005 | Communication failure detected with peer over port ens785f0.167 on host controller-1 | host=controller-1.network=cluster-host | major | 2019-07-05T17:50:35.000430 |
| c70e7c57-b0ad-4d35-be98-d7ae047a7484 | 400.002 | Service group oam-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=oam-services | major | 2019-07-05T17:49:23.468544 |
| 964d0e8e-5697-4261-859d-cbb3c60b6e4c | 400.002 | Service group controller-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=controller-services | major | 2019-07-05T17:49:23.385467 |
| 4d626ad5-66dd-490a-b185-c9e92c4a8dfe | 400.002 | Service group cloud-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=cloud-services | major | 2019-07-05T17:49:23.303507 |
| 412b21f1-7a26-45a1-9e05-7aa75941626c | 400.002 | Service group vim-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=vim-services | major | 2019-07-05T17:49:23.221508 |
| b0efcf99-5ee5-4d66-9de1-90d8e6d5aa9d | 400.002 | Service group patching-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=patching-services | major | 2019-07-05T17:49:23.139569 |
| 5d296e4a-89b3-4091-ac05-873b34e3b034 | 400.002 | Service group directory-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=directory-services | major | 2019-07-05T17:49:23.057483 |
| 382219b1-a819-4752-abc3-53987e614d50 | 400.002 | Service group web-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=web-services | major | 2019-07-05T17:49:22.975577 |
| a07c627d-320d-4593-ad54-9ab990afff82 | 400.002 | Service group storage-services loss of redundancy; expected 2 active members but only 1 active member available | service_domain=controller.service_group=storage-services | major | 2019-07-05T17:49:22.894474 |
| 3a2d26f0-ea11-4e0d-840e-6e890c5a2406 | 400.002 | Service group storage-monitoring-services loss of redundancy; expected 1 standby member but no standby members available | service_domain=controller.service_group=storage-monitoring-services | major | 2019-07-05T17:49:22.852499 |
+--------------------------------------+----------+--------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------+----------+----------------------------+
[sysadmin@controller-1 ~(keystone_admin)]$

Test Activity
-------------
Sanity

Ghada Khalil (gkhalil) wrote :

Marking as Low priority / not gating as the issue was seen once and doesn't have a large system impact (just an alarm not getting raised).

If the issue becomes more reproducible, please let us know.

Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
tags: added: stx.storage
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Ghada Khalil (gkhalil) wrote :

Assigning to Cindy as this is ceph-related. Suggest monitoring the bug for a re-occurrence before starting the investigation.

Cindy Xie (xxie1)
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Tingjie Chen (silverhandy)
Peng Peng (ppeng) wrote :

Issue was reproduced on
Lab: IP_5_6
Load: 2019-08-20_20-59-00

[2019-08-21 15:49:50,132] 301 DEBUG MainThread ssh.send :: Send 'cat /var/run/ceph/mon.controller.pid'
[2019-08-21 15:49:50,236] 423 DEBUG MainThread ssh.expect :: Output:
291350

[2019-08-21 15:49:50,836] 301 DEBUG MainThread ssh.send :: Send 'ceph mon remove controller-0'
[2019-08-21 15:49:51,346] 423 DEBUG MainThread ssh.expect :: Output:
mon.controller-0 does not exist or has already been removed

[2019-08-21 15:49:51,450] 301 DEBUG MainThread ssh.send :: Send 'service ceph stop mon.controller-0'
[2019-08-21 15:49:51,676] 423 DEBUG MainThread ssh.expect :: Output:
/etc/init.d/ceph: mon.controller-0 not found (/etc/ceph/ceph.conf defines mon.controller osd.0, /var/lib/ceph defines mon.controller osd.0)
controller-0:~#

[2019-08-21 15:54:49,016] 301 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-08-21 15:54:50,575] 423 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
| 86cf0be8-c66b-41ac-ac96-b29540d26bf4 | 100.114 | NTP address 2600:3c04::f03c is not a valid or a reachable NTP server. | host=controller-1.ntp=2600:3c04::f03c | minor | 2019-08-21T15:16:35.097000 |
| 87d228e5-2241-4571-80f1-7e9ee0e8e6f6 | 100.114 | NTP address 2607:5300:61:c0 is not a valid or a reachable NTP server. | host=controller-0.ntp=2607:5300:61:c0 | minor | 2019-08-21T11:11:17.785826 |
+--------------------------------------+----------+-----------------------------------------------------------------------+---------------------------------------+----------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
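
Note: in this run the init script reports the monitor as mon.controller rather than mon.controller-0 (see the /etc/init.d/ceph message above), so the remove and stop commands did not act on an existing monitor. A rough sketch for confirming the actual monitor name before running the test, assuming a standard ceph CLI and /etc/ceph/ceph.conf layout:

ceph mon dump
grep '\[mon' /etc/ceph/ceph.conf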

Yang Liu (yliu12)
tags: added: stx.retestneeded