Ceph health shows HEALTH_OK, no alarms for OSD down

Bug #1841903 reported by Dan Voiculeasa
This bug affects 1 person
Affects   | Status       | Importance | Assigned to    | Milestone
StarlingX | Fix Released | Medium     | Dan Voiculeasa |

Bug Description

Brief Description
-----------------
An alarm should be raised for OSDs that are down, even when Ceph reports HEALTH_OK.

Severity
------------------
Minor

Steps to Reproduce
------------------
Make an OSD go down.
`ceph status` will show HEALTH_WARN.
After some time the OSD is marked out so health becomes HEALTH_OK.
Looking with `ceph osd tree` shows the OSD is still down.

Once health returns to HEALTH_OK, `fm alarm-list` shows no alarm for the down OSDs.
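
The mismatch described in these steps can also be checked by script. The following is a minimal sketch, not part of the original report: it assumes the JSON printed by `ceph osd tree --format json` contains a `nodes` list whose OSD entries carry `type`, `name` and `status` fields. The helper name `down_osds` and the sample data are hypothetical and made up for illustration.

```python
import json


def down_osds(osd_tree_json):
    """Return the names of OSDs reported as down in an OSD tree JSON dump.

    Assumes each OSD entry in "nodes" has "type" == "osd" and a "status"
    field that is either "up" or "down".
    """
    tree = json.loads(osd_tree_json)
    return sorted(
        node["name"]
        for node in tree.get("nodes", [])
        if node.get("type") == "osd" and node.get("status") == "down"
    )


# Made-up sample standing in for real `ceph osd tree --format json` output.
SAMPLE = json.dumps({
    "nodes": [
        {"id": -1, "name": "default", "type": "root"},
        {"id": 0, "name": "osd.0", "type": "osd", "status": "up"},
        {"id": 1, "name": "osd.1", "type": "osd", "status": "down"},
    ]
})

print(down_osds(SAMPLE))
```

A non-empty result while `ceph status` reports HEALTH_OK corresponds to the situation described in this report.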

Expected Behavior
------------------
Display the OSD down alarms.

Actual Behavior
----------------
No alarm regarding down OSDs.

Reproducibility
---------------
100% reproducible

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
starlingx/utilities master ed1ba1665059f19c91a3463cd0366f6c016fd51d

Last Pass
---------
New test

Timestamp/Logs
--------------
not required

Test Activity
-------------
Developer Testing

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Frank Miller (sensfan22) wrote :

Marking stx.3.0 gating to improve alarm reporting.

Changed in starlingx:
status: New → Triaged
importance: Undecided → Medium
assignee: nobody → Dan Voiculeasa (dvoicule)
tags: added: stx.3.0 stx.fault
OpenStack Infra (hudson-openstack) wrote : Fix proposed to utilities (master)

Fix proposed to branch: master
Review: https://review.opendev.org/681246

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on integ (master)

Change abandoned by Dan Voiculeasa (<email address hidden>) on branch: master
Review: https://review.opendev.org/679229
Reason: Project structure change, moved to
https://review.opendev.org/#/c/681246/1

description: updated
OpenStack Infra (hudson-openstack) wrote : Fix merged to utilities (master)

Reviewed: https://review.opendev.org/681246
Committed: https://git.openstack.org/cgit/starlingx/utilities/commit/?id=e32b7906843aa6730bd7531c2a7f40efb474c000
Submitter: Zuul
Branch: master

commit e32b7906843aa6730bd7531c2a7f40efb474c000
Author: Dan Voiculeasa <email address hidden>
Date: Tue Sep 10 08:50:39 2019 -0400

    ceph-manager: raise alarms when OSD is down even if health OK

    If ceph status reports HEALTH_OK then OSD down alarm is not raised.
    Same for OSD out.

    Example scenario:
    The disk might fail which means the osd will be in down state.
    Ceph status shows a health warning, for example HEALTH_WARN.
    An alarm is raised by ceph-manager.
    After some time the disk will be marked out. Ceph health becomes
    HEALTH_OK and alarms are cleared.
    The user might never replace the disk thus the OSD state is still down
    yet no alarm is raised.

    Raise alarms even when status is HEALTH_OK to let the user know that
    the OSDs are still down or out.

    Closes-Bug: 1841903
    Change-Id: I4380183ce0cd2e41fbf12d0f9f20a4328293882c
    Signed-off-by: Dan Voiculeasa <email address hidden>
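
The behaviour change described in the commit message can be illustrated with a small sketch. This is not the ceph-manager code: the function `osd_alarms`, the `up`/`in` dictionary layout and the per-OSD message strings are hypothetical, chosen only to show that the per-OSD checks run regardless of the overall health string.

```python
def osd_alarms(health, osds):
    """Return alarm messages for a cluster health string and OSD states.

    The overall health only controls the cluster-wide health alarm.
    The per-OSD checks always run, so a down or out OSD is reported
    even when health is HEALTH_OK.
    """
    alarms = []
    if health != "HEALTH_OK":
        alarms.append("Storage Alarm Condition: %s" % health)
    for name, state in sorted(osds.items()):
        if not state["up"]:
            alarms.append("%s is down" % name)
        if not state["in"]:
            alarms.append("%s is out" % name)
    return alarms


# Hypothetical state: osd.1 has failed and has been marked out.
osds = {
    "osd.0": {"up": True, "in": True},
    "osd.1": {"up": False, "in": False},
}

print(osd_alarms("HEALTH_OK", osds))
print(osd_alarms("HEALTH_WARN", osds))
```

With the pre-fix behaviour described above, the first call would have produced no alarms because health had already returned to HEALTH_OK.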

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

As per Yang Liu, the stx.retestneeded tag was added so that a regression test case is added for this scenario.

Wendy Mitchell (wmitchellwr) wrote :

2019-11-02_08-39-54
Verified on a 2+2 system: when OSDs are down, these alarms are set and cleared.
set | 800.001 | Storage Alarm Condition: HEALTH_WARN.
set | 800.011 | Loss of replication in replication group group-0: OSDs are down

Wendy Mitchell (wmitchellwr) wrote :

Verified the same alarm behaviour when an OSD is marked down and out (then in).

tags: removed: stx.retestneeded