Ceph Alarm 800.001 raised then cleared after lock/unlock on subcloud

Bug #1931103 reported by Mihnea Saracin
Affects:      StarlingX
Status:       Fix Released
Importance:   Low
Assigned to:  Mihnea Saracin

Bug Description

Brief Description
-----------------
Ceph Alarm 800.001 raised after lock/unlock on subcloud controller-0. The alarm cleared on its own 37 minutes later.

Alarm ID:     800.001
Reason Text:  Storage Alarm Condition: HEALTH_WARN [PGs are degraded/stuck or undersized]. Please check 'ceph -s' for more details.
Entity ID:    cluster=259931e6-a1cc-47d6-895a-2fc5f2745099
Severity:     warning
Time Stamp:   2020-10-06T21:09:59.229876

Severity
--------
Minor: System/Feature is usable but degraded

Steps to Reproduce
------------------
lock/unlock controller
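
For reference, the lock/unlock cycle can be driven with the standard sysinv CLI (the host-unlock command also appears in bash.log below). A minimal sketch, assuming admin credentials are loaded from the usual /etc/platform/openrc location:

# sketch: lock/unlock cycle on an AIO-SX controller
source /etc/platform/openrc
system host-lock controller-0       # lock the controller
system host-unlock controller-0     # unlock; the node reboots and rejoins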

Expected Behavior
------------------
unlock is successful without alarms

Actual Behavior
----------------
unlock completes but with ceph alarm

Reproducibility
---------------
Infrequent

System Configuration
--------------------
AIO-SX subcloud

Branch/Pull Time/Commit
-----------------------
stx 4.0

Timestamp/Logs
--------------

# bash.log

2020-10-06T20:55:05.000 controller-0 -sh: info HISTORY: PID=3985811 UID=42425 system host-unlock controller-0
2020-10-06T20:55:05.000 controller-0 -sh: info HISTORY: PID=3985811 UID=42425 system host-unlock controller-0
2020-10-06T21:02:45.000 controller-0 -sh: info HISTORY: PID=1508207 UID=42425 sudo reboot

# fm-manager.log

fm-manager.log:2765:2020-10-06T21:09:59.230 fmMsgServer.cpp(398): Raising Alarm/Log, (800.001) (cluster=259931e6-a1cc-47d6-895a-2fc5f2745099)
fm-manager.log:2766:2020-10-06T21:09:59.231 fmMsgServer.cpp(421): Alarm created/updated: (800.001) (cluster=259931e6-a1cc-47d6-895a-2fc5f2745099) (1) (54beb906-be7f-4fcc-9372-49849e02fa42)

# After 37 min alarm cleared

fm-manager.log:2931:2020-10-06T21:46:23.403 fmMsgServer.cpp(494): Deleted alarm: (800.001) (cluster=259931e6-a1cc-47d6-895a-2fc5f2745099)
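
The raise/clear window can also be confirmed from the CLI by polling the fault management client instead of grepping fm-manager.log. A hedged sketch; the --query filter follows the fm client's documented key=value usage:

# sketch: poll until the 800.001 storage alarm clears
while fm alarm-list --query alarm_id=800.001 | grep -q 800.001; do
    echo "$(date -Is)  800.001 still raised, waiting..."
    sleep 60
done
echo "800.001 cleared"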

# ceph status

[sysadmin@controller-0 ~(keystone_admin)]$ ceph -s
  cluster:
    id:     259931e6-a1cc-47d6-895a-2fc5f2745099
    health: HEALTH_WARN
            Reduced data availability: 64 pgs inactive

  services:
    mon: 1 daemons, quorum controller-0
    mgr: controller-0(active)
    osd: 1 osds: 1 up, 1 in

  data:
    pools:   1 pools, 64 pgs
    objects: 0 objects, 0 B
    usage:   225 MiB used, 475 GiB / 476 GiB avail
    pgs:     100.000% pgs unknown
             64 unknown

# ceph health detail

HEALTH_WARN Reduced data availability: 64 pgs inactive
PG_AVAILABILITY Reduced data availability: 64 pgs inactive
pg 1.0 is stuck inactive for 18225.245755, current state unknown, last acting []
pg 1.1 is stuck inactive for 18225.245755, current state unknown, last acting []
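
Since the 800.001 alarm mirrors Ceph cluster health, the same condition can be watched directly with the ceph CLI. A sketch that waits for the PGs to settle; the one-hour timeout is illustrative, chosen because the alarm in this report took roughly 37 minutes to clear:

# sketch: wait for the cluster to return to HEALTH_OK after the unlock
timeout=3600; elapsed=0
until ceph health | grep -q HEALTH_OK; do
    sleep 30; elapsed=$((elapsed + 30))
    if [ "$elapsed" -ge "$timeout" ]; then
        echo "cluster still unhealthy after ${timeout}s"; ceph health detail; break
    fi
done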

Test Activity
-------------
Developer Testing

Changed in starlingx:
assignee: nobody → Mihnea Saracin (msaracin)
Ghada Khalil (gkhalil)
tags: added: stx.storage
Changed in starlingx:
importance: Undecided → Low
status: New → Triaged
Changed in starlingx:
status: Triaged → Fix Released
Ghada Khalil (gkhalil)
tags: added: stx.6.0