200.006 alarm "controller-1 is degraded due to the failure of its 'ceph' process"

Bug #1860363 reported by Peng Peng on 2020-01-20
This bug affects 1 person
Affects: StarlingX
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Brief Description
-----------------
On a regular multi-node system, after locking and unlocking a node (compute or controller), alarm 200.006 "controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress." is raised.

Severity
--------
Major

Steps to Reproduce
------------------
lock compute node
unlock compute node
check alarm-list

TC-name: mtc/test_lock_unlock_host.py::test_lock_unlock_host[compute]
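The final check above can be expressed as a small parser over `fm alarm-list` table output. This is a minimal sketch, not part of the actual test framework; the helper name `has_alarm` is illustrative:

```python
def has_alarm(alarm_list_output: str, alarm_id: str = "200.006") -> bool:
    """Return True if the given alarm ID appears as a cell in
    'fm alarm-list' table output (pipe-delimited rows)."""
    for line in alarm_list_output.splitlines():
        # Data rows look like: | UUID | 200.006 | Reason Text | ... |
        cells = [c.strip() for c in line.split("|")]
        if alarm_id in cells:
            return True
    return False
```

Matching on whole cells (rather than substring search over the raw line) avoids false positives if the alarm ID happens to appear inside a reason-text column.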

Expected Behavior
------------------
no 200.006 alarm

Actual Behavior
----------------
200.006 alarm raised

Reproducibility
---------------
Intermittent; reproduced in 2 of 3 sanity runs

System Configuration
--------------------
Multi-node system

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
master as of 2020-01-20_00-10-00

Last Pass
---------
2020-01-16_22-02-50

Timestamp/Logs
--------------
[2020-01-20 15:53:49,593] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-01-20 15:53:50,644] 436 DEBUG MainThread ssh.expect :: Output:

[sysadmin@controller-0 ~(keystone_admin)]$
[2020-01-20 15:54:38,599] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2020-01-20 15:54:39,738] 436 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-01-20 15:54:41,112] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'

[2020-01-20 15:56:38,659] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2020-01-20 16:02:15,135] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2020-01-20 16:02:16,317] 436 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-01-20 16:02:19,034] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-01-20 16:02:20,042] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 13ff9738-fb74-4b62-9820-883c40a9f70a | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2020-01-20T16:00:35.265175 |
| 2a258bf7-d14a-46c4-a6b5-75053f4a850f | 100.114 | NTP address 64:ff9b::a2f3:c2cb is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::a2f3:c2cb | minor | 2020-01-20T16:00:35.223428 |
| 05853c2b-531f-40ee-95af-de9405266948 | 100.114 | NTP address 64:ff9b::3d9:4ff2 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::3d9:4ff2 | minor | 2020-01-20T16:00:35.181396 |
| e308f8d2-6688-4eba-8700-a4a13188f492 | 100.114 | NTP address 64:ff9b::d8e5:442 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::d8e5:442 | minor | 2020-01-20T16:00:35.179159 |
| 0a9c655b-bb56-4e3a-bca1-24c0f88252e4 | 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=e065d077-c129-4eea-b108-c6c20ad8ba85 | warning | 2020-01-20T15:56:19.887072 |
| ac178ecc-bdbd-470c-91ef-cfcd60aa5cc1 | 200.006 | controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress. | host=controller-0.process=ceph | major | 2020-01-20T15:55:51.361058 |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
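The degraded availability shown in the `system host-list` output above can also be detected programmatically. A hedged sketch (the helper name `degraded_hosts` is illustrative, not an existing utility) that parses the pipe-delimited table and returns any host whose availability is not `available`:

```python
def degraded_hosts(host_list_output: str) -> list:
    """Return (hostname, availability) pairs for hosts in
    'system host-list' output that do not report 'available'."""
    hosts = []
    for line in host_list_output.splitlines():
        cells = [c.strip() for c in line.split("|") if c.strip()]
        # Data rows have 6 columns and a numeric id:
        # id | hostname | personality | administrative | operational | availability
        if len(cells) == 6 and cells[0].isdigit():
            hostname, availability = cells[1], cells[5]
            if availability != "available":
                hosts.append((hostname, availability))
    return hosts
```

Requiring a numeric first column skips the header and the `+----+` separator rows without any special-casing.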

Test Activity
-------------
Sanity

Ghada Khalil (gkhalil) wrote:

Marking as stx.4.0 / medium priority; intermittent issue related to ceph that needs further investigation.

description: updated
tags: added: stx.storage
Changed in starlingx:
status: New → Triaged
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium