200.006 alarm "controller-1 is degraded due to the failure of its 'ceph' process"

Bug #1860363 reported by Peng Peng
This bug affects 1 person

Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Elena Taivan

Bug Description

Brief Description
-----------------
On a regular system, after locking and unlocking one node (compute or controller), alarm 200.006 ("controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress.") is raised.

Severity
--------
Major

Steps to Reproduce
------------------
1. Lock a compute node
2. Unlock the compute node
3. Check the alarm list (see the command sketch below)

TC-name: mtc/test_lock_unlock_host.py::test_lock_unlock_host[compute]
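
For manual reproduction, the steps map onto the following CLI sequence. This is a minimal sketch assuming an admin shell on the active controller; compute-0 is taken from the attached logs (substitute any worker node).

    # Load keystone_admin credentials (standard StarlingX location)
    source /etc/platform/openrc

    system host-lock compute-0      # lock the worker node
    system host-list                # wait for compute-0 to report 'locked'
    system host-unlock compute-0    # unlock it again
    system host-list                # wait for unlocked/enabled/available

    # Expected: no 200.006 alarm; actual (intermittent): a controller is
    # degraded due to a failed 'ceph' process
    fm alarm-list --nowrap --uuid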

Expected Behavior
------------------
no 200.006 alarm

Actual Behavior
----------------
200.006 alarm raised

Reproducibility
---------------
Intermittent; reproduced in 2 of 3 sanity runs

System Configuration
--------------------
Multi-node system

Lab-name: WCP_71-75

Branch/Pull Time/Commit
-----------------------
master as of 2020-01-20_00-10-00

Last Pass
---------
2020-01-16_22-02-50

Timestamp/Logs
--------------
[2020-01-20 15:53:49,593] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-01-20 15:53:50,644] 436 DEBUG MainThread ssh.expect :: Output:

[sysadmin@controller-0 ~(keystone_admin)]$
[2020-01-20 15:54:38,599] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2020-01-20 15:54:39,738] 436 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | available |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-01-20 15:54:41,112] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-lock compute-0'

[2020-01-20 15:56:38,659] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-unlock compute-0'

[2020-01-20 16:02:15,135] 314 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne host-list'
[2020-01-20 16:02:16,317] 436 DEBUG MainThread ssh.expect :: Output:
+----+--------------+-------------+----------------+-------------+--------------+
| id | hostname | personality | administrative | operational | availability |
+----+--------------+-------------+----------------+-------------+--------------+
| 1 | controller-0 | controller | unlocked | enabled | degraded |
| 2 | compute-0 | worker | unlocked | enabled | available |
| 3 | compute-1 | worker | unlocked | enabled | available |
| 4 | compute-2 | worker | unlocked | enabled | available |
| 5 | controller-1 | controller | unlocked | enabled | available |
+----+--------------+-------------+----------------+-------------+--------------+

[2020-01-20 16:02:19,034] 314 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://[face::1]:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2020-01-20 16:02:20,042] 436 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
| 13ff9738-fb74-4b62-9820-883c40a9f70a | 100.114 | NTP configuration does not contain any valid or reachable NTP servers. | host=controller-1.ntp | major | 2020-01-20T16:00:35.265175 |
| 2a258bf7-d14a-46c4-a6b5-75053f4a850f | 100.114 | NTP address 64:ff9b::a2f3:c2cb is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::a2f3:c2cb | minor | 2020-01-20T16:00:35.223428 |
| 05853c2b-531f-40ee-95af-de9405266948 | 100.114 | NTP address 64:ff9b::3d9:4ff2 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::3d9:4ff2 | minor | 2020-01-20T16:00:35.181396 |
| e308f8d2-6688-4eba-8700-a4a13188f492 | 100.114 | NTP address 64:ff9b::d8e5:442 is not a valid or a reachable NTP server. | host=controller-1.ntp=64:ff9b::d8e5:442 | minor | 2020-01-20T16:00:35.179159 |
| 0a9c655b-bb56-4e3a-bca1-24c0f88252e4 | 800.001 | Storage Alarm Condition: HEALTH_WARN. Please check 'ceph -s' for more details. | cluster=e065d077-c129-4eea-b108-c6c20ad8ba85 | warning | 2020-01-20T15:56:19.887072 |
| ac178ecc-bdbd-470c-91ef-cfcd60aa5cc1 | 200.006 | controller-0 is degraded due to the failure of its 'ceph' process. Auto recovery of this major process is in progress. | host=controller-0.process=ceph | major | 2020-01-20T15:55:51.361058 |
+--------------------------------------+----------+------------------------------------------------------------------------------------------------------------------------+----------------------------------------------+----------+----------------------------+
[sysadmin@controller-0 ~(keystone_admin)]$
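
To drill into the failure from the listing above, the FM client can display a single alarm by UUID (the UUID below is the 200.006 entry captured in this run), and the companion 800.001 alarm text itself suggests checking ceph health directly:

    # Show full details of the 200.006 process-failure alarm
    fm alarm-show ac178ecc-bdbd-470c-91ef-cfcd60aa5cc1

    # Per the 800.001 alarm text, check overall ceph cluster health
    ceph -s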

Test Activity
-------------
Sanity

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Marking as stx.4.0 / medium priority - intermittent issue related to ceph; needs further investigation

description: updated
tags: added: stx.storage
Changed in starlingx:
status: New → Triaged
tags: added: stx.4.0
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
Peng Peng (ppeng) wrote :
Frank Miller (sensfan22)
Changed in starlingx:
assignee: nobody → Elena Taivan (etaivan)
Revision history for this message
Elena Taivan (etaivan) wrote :

After looking over the logs, this is the same issue as https://bugs.launchpad.net/starlingx/+bug/1856064.

The problem seems to be IPv6-related and is triggered by a swact.
It did not reproduce on a standard IPv4 setup.
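
Consistent with that, the auth URL used throughout the logs above (http://[face::1]:5000/v3) is an IPv6 address. As a rough sketch for confirming a lab's address family, the standard system CLI can show the external OAM configuration; an IPv6 OAM address such as face::1 indicates an IPv6 setup:

    source /etc/platform/openrc
    # Show the external OAM network configuration (IPv4 vs IPv6 addresses)
    system oam-show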

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Updating the status to match the duplicate LP: https://bugs.launchpad.net/starlingx/+bug/1856064
The fix was merged on 2020-03-29.

Changed in starlingx:
status: Triaged → Fix Released
Revision history for this message
Yang Liu (yliu12) wrote :

This issue has not been seen in recent sanity runs on the same system (WCP_71-75).
