NVME check remains green if any nvme command fails

Bug #2044387 reported by Facundo Ciccioli
This bug affects 1 person
Affects: hw-health-charm
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

The NVME check code does this for every NVME device found under /dev:

try:
    output = subprocess.check_output(
        ["sudo", "/usr/sbin/nvme", "smart-log", device]
    )
except subprocess.CalledProcessError as error:
    print("nvme check error: {}".format(error))
    return

So, if the nvme command fails for any reason, the check just returns without raising any Nagios status; OK is therefore assumed and the failure goes unnoticed.
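To illustrate the behaviour described above, here is a minimal sketch of how a failing nvme command could be turned into a non-OK status instead of a silent return. It is not the charm's code: the function names, the `runner` parameter (there only so the sketch can be exercised without hardware), and the choice of UNKNOWN rather than CRITICAL are all assumptions. Only the nvme command line comes from the snippet above; the exit codes are the standard Nagios plugin ones.

```python
import subprocess

# Standard Nagios plugin exit codes.
NAGIOS_OK = 0
NAGIOS_WARNING = 1
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3


def read_smart_log(device, runner=subprocess.check_output):
    """Return (status, message) for one device.

    `runner` is a hypothetical hook, injectable for testing only.
    """
    try:
        output = runner(["sudo", "/usr/sbin/nvme", "smart-log", device])
    except (subprocess.CalledProcessError, OSError) as error:
        # Report the failure instead of returning silently.
        message = "UNKNOWN: nvme smart-log failed for {}: {}".format(device, error)
        return NAGIOS_UNKNOWN, message
    return NAGIOS_OK, output.decode(errors="replace")


def check_devices(devices, runner=subprocess.check_output):
    """Return the worst status seen and the messages of failed devices."""
    worst = NAGIOS_OK
    failures = []
    for device in devices:
        status, message = read_smart_log(device, runner)
        if status != NAGIOS_OK:
            failures.append(message)
        # Only OK and UNKNOWN occur in this sketch, so max() is sufficient.
        worst = max(worst, status)
    return worst, failures
```

A caller would print the failure messages and pass the returned status to sys.exit(), so that Nagios sees a non-zero exit code whenever any device could not be queried.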

This is particularly problematic for this check because it needs special permissions to run, which are granted by the Nvme._render_sudoers() method. Essentially, a sudoers rule is created to allow the execution of the nvme smart-log command for each of the detected NVME devices. The issue is that this detection is performed at install time only, so any drive added later has no matching rule and the command fails for it.
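The drift between the install-time sudoers rule and the drives currently present can be sketched as follows. This is an illustration under stated assumptions, not the body of Nvme._render_sudoers(): the rule layout, the `nagios` user name, and both function names are hypothetical.

```python
NVME_COMMAND = ["/usr/sbin/nvme", "smart-log"]


def render_sudoers(devices, user="nagios"):
    """Render one rule per device (hypothetical layout and user name)."""
    lines = [
        "{} ALL=(root) NOPASSWD: {} {}".format(user, " ".join(NVME_COMMAND), device)
        for device in sorted(devices)
    ]
    return "\n".join(lines) + "\n"


def devices_missing_rule(current_devices, sudoers_text):
    """Return the devices present now that no rendered rule covers."""
    allowed = set()
    for line in sudoers_text.splitlines():
        parts = line.split()
        # A rule line ends with: /usr/sbin/nvme smart-log <device>
        if parts[-3:-1] == NVME_COMMAND:
            allowed.add(parts[-1])
    return sorted(set(current_devices) - allowed)
```

For example, if the rules were rendered when only /dev/nvme0 existed and /dev/nvme1 is added afterwards, devices_missing_rule() returns ["/dev/nvme1"], which is exactly the device for which the sudo call in the check would fail.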

Tags: bseng-1814

Facundo Ciccioli (fandanbango) wrote :

One possible workaround for this issue is to clear the hw-health.installed flag:

juju run -u hw-health/97 -- charms.reactive -p clear_flag hw-health.installed

This will cause all the tools to be re-evaluated, including the re-rendering of the sudoers rule.

Eric Chen (eric-chen)
tags: added: bseng-1814
Facundo Ciccioli (fandanbango) wrote :

Since hw-health is deprecated, we no longer rely on this charm's NVME check (we moved to Prometheus alert rules with Alertmanager, which allow us to silence alerts very selectively).

Changed in charm-hw-health:
status: New → Won't Fix