200.006 alarm "controller-1 is degraded due to the failure of its 'ceph' process"

Bug #1860363 reported by Peng Peng on 2020-01-20
This bug affects 1 person
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Brief Description
-----------------
On a regular multi-node system, after locking and unlocking a node (compute or controller), alarm 200.006 "controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress." is raised.

Severity
--------
Major

Steps to Reproduce
------------------
lock compute node
unlock compute node
check alarm-list

TC-name: mtc/test_lock_unlock_host.py::test_lock_unlock_host[compute]
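The final check above can be expressed as a small parser over `fm alarm-list` table output. This is a minimal sketch, not part of the actual test framework; the helper name `has_alarm` is illustrative:

```python
def has_alarm(alarm_list_output: str, alarm_id: str = "200.006") -> bool:
    """Return True if the given alarm ID appears as a cell in
    'fm alarm-list' table output (pipe-delimited rows)."""
    for line in alarm_list_output.splitlines():
        # Data rows look like: | UUID | 200.006 | Reason Text | ... |
        cells = [c.strip() for c in line.split("|")]
        if alarm_id in cells:
            return True
    return False
```

Matching on whole cells (rather than substring search over the raw line) avoids false positives if the alarm ID happens to appear inside a reason-text column.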

Expected Behavior
------------------
no 200.006 alarm

Actual Behavior
----------------
200.006 alarm raised

Reproducibility
---------------
Intermittent; reproduced in 2 of 3 sanity runs

System Configuration
--------------------
Multi-node system

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
master as of 2020-01-20_00-10-00

Last Pass
---------
2020-01-16_22-02-50

Timestamp/Logs
--------------
[2020-01-20 15:53:49,593] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-01-20 15:53:50,644] 436 DEBUG MainThread ssh.expect :: Output:

[sysadmin@controller-0 ~(keystone_admin)]$
[2020-01-20 15:54:38,599] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2020-01-20 15:54:39,738] 436 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-01-20 15:54:41,112] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'

[2020-01-20 15:56:38,659] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2020-01-20 16:02:15,135] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2020-01-20 16:02:16,317] 436 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-01-20 16:02:19,034] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-01-20 16:02:20,042] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 13ff9738-fb74-4b62-9820-883c40a9f70a | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2020-01-20T16:00:35.265175 |
| 2a258bf7-d14a-46c4-a6b5-75053f4a850f | 100.114 | NTP address 64:ff9b::a2f3:c2cb is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::a2f3:c2cb | minor | 2020-01-20T16:00:35.223428 |
| 05853c2b-531f-40ee-95af-de9405266948 | 100.114 | NTP address 64:ff9b::3d9:4ff2 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::3d9:4ff2 | minor | 2020-01-20T16:00:35.181396 |
| e308f8d2-6688-4eba-8700-a4a13188f492 | 100.114 | NTP address 64:ff9b::d8e5:442 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::d8e5:442 | minor | 2020-01-20T16:00:35.179159 |
| 0a9c655b-bb56-4e3a-bca1-24c0f88252e4 | 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=e065d077-c129-4eea-b108-c6c20ad8ba85 | warning | 2020-01-20T15:56:19.887072 |
| ac178ecc-bdbd-470c-91ef-cfcd60aa5cc1 | 200.006 | controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress. | host=controller-0.process=ceph | major | 2020-01-20T15:55:51.361058 |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
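The degraded availability shown in the `system host-list` output above can also be detected programmatically. A hedged sketch (the helper name `degraded_hosts` is illustrative, not an existing utility) that parses the pipe-delimited table and returns any host whose availability is not `available`:

```python
def degraded_hosts(host_list_output: str) -> list:
    """Return (hostname, availability) pairs for hosts in
    'system host-list' output that do not report 'available'."""
    hosts = []
    for line in host_list_output.splitlines():
        cells = [c.strip() for c in line.split("|") if c.strip()]
        # Data rows have 6 columns and a numeric id:
        # id | hostname | personality | administrative | operational | availability
        if len(cells) == 6 and cells[0].isdigit():
            hostname, availability = cells[1], cells[5]
            if availability != "available":
                hosts.append((hostname, availability))
    return hosts
```

Requiring a numeric first column skips the header and the `+----+` separator rows without any special-casing.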

Test Activity
-------------
Sanity

Ghada Khalil (gkhalil) wrote:

Marking as stx.4.0 / medium priority; intermittent issue related to ceph that needs further investigation.

description: updated
tags: added: stx.storage
Changed in starlingx:
status: New → Triaged
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium