Ceph Alarm 800.001 raised then cleared after lock/unlock on subcloud

Bug #1931103 reported by Mihnea Saracin
Affects:      StarlingX
Status:       Fix Released
Importance:   Low
Assigned to:  Mihnea Saracin

Bug Description

Brief Description
-----------------
Ceph Alarm 800.001 raised after lock/unlock on subcloud controller-0. The alarm cleared on its own 37 minutes later.

Alarm ID:     800.001
Reason Text:  Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details.
Entity ID:    cluster=259931e6-a1cc-47d6-895a-2fc5f2745099
Severity:     warning
Time Stamp:   2020-10-06T21:09:59.229876

Severity
--------
Minor: System/Feature is usable but degraded

Steps to Reproduce
------------------
lock/unlock controller
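
For reference, the lock/unlock cycle can be driven with the standard sysinv CLI (the host-unlock command also appears in bash.log below). A minimal sketch, assuming admin credentials are loaded from the usual /etc/platform/openrc location:

# sketch: lock/unlock cycle on an AIO-SX controller
source /etc/platform/openrc
system host-lock controller-0       # lock the controller
system host-unlock controller-0     # unlock; the node reboots and rejoins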

Expected Behavior
------------------
unlock is successful without alarms

Actual Behavior
----------------
unlock completes but with ceph alarm

Reproducibility
---------------
Infrequent

System Configuration
--------------------
AIO-SX subcloud

Branch/Pull Time/Commit
-----------------------
stx 4.0

Timestamp/Logs
--------------

# bash.log

2020-10-06T20:55:05.000 controller-0 -sh: info HISTORY: PID=3985811 UID=42425 system host-unlock controller-0
2020-10-06T20:55:05.000 controller-0 -sh: info HISTORY: PID=3985811 UID=42425 system host-unlock controller-0
2020-10-06T21:02:45.000 controller-0 -sh: info HISTORY: PID=1508207 UID=42425 sudo reboot

# fm-manager.log

fm-manager.log:2765:2020-10-06T21:09:59.230 fmMsgServer.cpp(398): Raising Alarm/Log, (800.001) (cluster=259931e6-a1cc-47d6-895a-2fc5f2745099)
fm-manager.log:2766:2020-10-06T21:09:59.231 fmMsgServer.cpp(421): Alarm created/updated: (800.001) (cluster=259931e6-a1cc-47d6-895a-2fc5f2745099) (1) (54beb906-be7f-4fcc-9372-49849e02fa42)

# After 37 min alarm cleared

fm-manager.log:2931:2020-10-06T21:46:23.403 fmMsgServer.cpp(494): Deleted alarm: (800.001) (cluster=259931e6-a1cc-47d6-895a-2fc5f2745099)
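
The raise/clear window can also be confirmed from the CLI by polling the fault management client instead of grepping fm-manager.log. A hedged sketch; the --query filter follows the fm client's documented key=value usage:

# sketch: poll until the 800.001 storage alarm clears
while fm alarm-list --query alarm_id=800.001 | grep -q 800.001; do
    echo "$(date -Is)  800.001 still raised, waiting..."
    sleep 60
done
echo "800.001 cleared"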

# ceph status

[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id:     259931e6-a1cc-47d6-895a-2fc5f2745099
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive

  services:
    mon: 1 daemons, quorum controller-0
    mgr: controller-0(active)
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   1 pools, 64 pgs
    objects: 0 objects, 0 B
    usage:   225 MiB used, 475 GiB / 476 GiB avail
    pgs:     100.000% pgs unknown
             64 unknown

# ceph health detail

HEALTH_WARN Reduced data availability: 64 pgs inactive
PG_AVAILABILITY Reduced data availability: 64 pgs inactive
pg 1.0 is stuck inactive for 18225.245755, current state unknown, last acting []
pg 1.1 is stuck inactive for 18225.245755, current state unknown, last acting []
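
Since the 800.001 alarm mirrors Ceph cluster health, the same condition can be watched directly with the ceph CLI. A sketch that waits for the PGs to settle; the one-hour timeout is illustrative, chosen because the alarm in this report took roughly 37 minutes to clear:

# sketch: wait for the cluster to return to HEALTH_OK after the unlock
timeout=3600; elapsed=0
until ceph health | grep -q HEALTH_OK; do
    sleep 30; elapsed=$((elapsed + 30))
    if [ "$elapsed" -ge "$timeout" ]; then
        echo "cluster still unhealthy after ${timeout}s"; ceph health detail; break
    fi
done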

Test Activity
-------------
Developer Testing

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.6.0