No way to skip the check for some NVME devices
Bug #2044398 reported by Facundo Ciccioli
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
hw-health-charm | Won't Fix | Undecided | Unassigned |
Bug Description
One of the things the check_nvme.py check does is read the percentage_used metric and alert when it is greater than a configured threshold. The issue is that a single host often has more than one NVME. Once one of them reaches the threshold, the check remains CRITICAL for as long as it takes to resolve the issue with that particular drive. During this period the other NVMEs are effectively shadowed: if one of them reaches the threshold, it goes unnoticed.
Requested: evaluate adding an "ignore-nvme" config property to exclude selected NVMEs from the check, or provide a workaround to remove an NVME device from a system while it is not being used.
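To illustrate the requested behaviour, here is a minimal sketch of a per-device evaluation with an ignore list. It is not the charm's code: the function name `evaluate_nvme_wear`, its parameters, and the device names are hypothetical, and the percentage_used readings are passed in as a plain dict instead of being read from the drives. It only assumes Nagios-style exit codes (0 for OK, 2 for CRITICAL).

```python
from typing import Dict, Iterable, Tuple

OK, CRITICAL = 0, 2  # Nagios-style exit codes (assumed)


def evaluate_nvme_wear(
    percentage_used: Dict[str, int],
    threshold: int,
    ignore: Iterable[str] = (),
) -> Tuple[int, str]:
    """Return (exit_code, message), skipping devices listed in `ignore`.

    Hypothetical helper: `percentage_used` maps a device name to its
    percentage_used reading; `ignore` plays the role of "ignore-nvme".
    """
    ignored = set(ignore)
    # Devices that are not ignored and are strictly above the threshold.
    over = {
        dev: used
        for dev, used in sorted(percentage_used.items())
        if dev not in ignored and used > threshold
    }
    if over:
        details = ", ".join(f"{dev}={used}%" for dev, used in over.items())
        return CRITICAL, f"CRITICAL: percentage_used above {threshold}%: {details}"
    # Report which present devices were skipped, so the silence stays visible.
    skipped = sorted(dev for dev in percentage_used if dev in ignored)
    suffix = f" (ignored: {', '.join(skipped)})" if skipped else ""
    return OK, f"OK: all checked NVMEs at or below {threshold}%{suffix}"


if __name__ == "__main__":
    readings = {"nvme0n1": 97, "nvme1n1": 40}  # made-up example values
    print(evaluate_nvme_wear(readings, 80))
    print(evaluate_nvme_wear(readings, 80, ignore=["nvme0n1"]))
```

With the worn drive ignored, the check returns to OK and would go CRITICAL again only if another drive crossed the threshold, which is the shadowing problem described above. Listing the ignored devices in the OK message is a design choice to keep the exclusion from being forgotten.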
Since hw-health is deprecated, we've worked around this issue in a different way: we went for alertmanager and prometheus alert rules, which allow us to silence alerts very selectively.