Check for Megaraid (and other Physical disk/raid) problems

Bug #1796535 reported by Tejeev Patel
This bug affects 1 person
Affects           Status      Importance   Assigned to   Milestone
NRPE Charm        Invalid     Undecided    Unassigned
hw-health-charm   Won't Fix   Wishlist     Unassigned

Bug Description

I'm copying in an engineer's comments from an internal case, as he made good points.

We are currently not monitoring the error counters or logs of MegaRAID or other controllers.

These issues are typically reported to us by customers noticing amber lights in their datacenter, but we'd like to become proactive at responding to these issues.

There are Nagios plugins for MegaRAID checks available online that we could use. The following looks like one of the better ones:

https://exchange.nagios.org/directory/Plugins/Hardware/Storage-Systems/RAID-Controllers/check_megaraid_sas-v2/details

We should be reporting on media/other errors, plus non-optimal logical volumes.

If you're unfamiliar with the MegaRAID commands, I suggest starting with some basics like:

megacli -LdPdInfo -aALL
megacli -PDInfo -PhysDrv [X:Y] -aALL
megacli -AdpEventLog -GetEvents -f errors.log -aALL
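As a rough illustration of the reporting asked for above (media/other errors plus non-optimal logical volumes), the output of the first command could be scraped along these lines. This is only a sketch: the field names ("State", "Slot Number", "Media Error Count", "Other Error Count") and the sample text are assumptions about the usual -LdPdInfo layout, and find_problems is a hypothetical helper, not something from the charm.

```python
# Illustrative excerpt only. The field names are assumed to match the
# usual "-LdPdInfo" layout; verify against real controller output.
SAMPLE = """\
Virtual Drive: 0 (Target Id: 0)
State               : Degraded
Slot Number: 0
Media Error Count: 3
Other Error Count: 0
Slot Number: 1
Media Error Count: 0
Other Error Count: 12
"""

ERROR_COUNTERS = ("Media Error Count", "Other Error Count")


def find_problems(output):
    """Return human-readable problems found in -LdPdInfo style output."""
    problems = []
    drive = "?"  # most recently seen virtual drive number
    slot = "?"   # most recently seen physical slot number
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Virtual Drive":
            drive = value.split()[0] if value else "?"
        elif key == "State" and value != "Optimal":
            problems.append(f"virtual drive {drive} is {value}")
        elif key == "Slot Number":
            slot = value
        elif key in ERROR_COUNTERS and value.isdigit() and int(value) > 0:
            problems.append(f"slot {slot}: {key} = {value}")
    return problems


if __name__ == "__main__":
    for problem in find_problems(SAMPLE):
        print(problem)
```

A Nagios-style check could return CRITICAL when the list is non-empty and OK otherwise.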

We could either add an option to enable MegaRAID checks in the nrpe charm, or auto-detect whether we're running on bare metal and whether lspci lists a MegaRAID/LSI RAID card.
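The lspci half of that auto-detection might look something like the sketch below. It is a heuristic under stated assumptions: it expects the controller to show up with the "RAID bus controller" class and a vendor string containing "MegaRAID" or "LSI", and the function names are made up for illustration. Detecting bare metal versus a VM is not covered here.

```python
import re
import subprocess

# Assumption: MegaRAID cards appear in lspci as "RAID bus controller"
# entries whose description mentions MegaRAID or LSI.
VENDOR_PATTERN = re.compile(r"\b(MegaRAID|LSI)\b", re.IGNORECASE)


def has_megaraid(lspci_output):
    """Return True if any lspci line looks like a MegaRAID/LSI RAID card."""
    for line in lspci_output.splitlines():
        if "RAID bus controller" in line and VENDOR_PATTERN.search(line):
            return True
    return False


def read_lspci():
    """Run lspci and return its text output (requires pciutils)."""
    result = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    print("MegaRAID detected" if has_megaraid(read_lspci()) else "no MegaRAID")
```

Keeping the parsing separate from the subprocess call means the detection logic can be unit tested with canned lspci output.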

I noticed that the latest version of smartmontools prints the following error when you run "smartctl /dev/sda" on a disk behind one of these controllers:

Smartctl open device: /dev/sda failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'

If smartctl prints this note about MegaRAID, we know a controller is present and could enable the checks.

We might also want to bake some S.M.A.R.T. monitoring into the charm for non-MegaRAID disk errors.

Apparently SMART should work through the RAID controllers, but the smartctl -d <raidtype>,N format requires you to know which device number you're addressing, as smartctl -d <raidtype>,0 /dev/sda addresses the same disk as /dev/sde if you keep using N==0. So, to find all disks, iterate through smartctl -d <raidtype>,0..30 and scrape that output for the alert.
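The iteration described above could be sketched as follows. The helper names are hypothetical, and the 0..30 range simply follows the suggestion in this comment. The sketch relies on smartctl's documented exit status bitmask, where bit 0 means the command line did not parse and bit 1 means the device could not be opened; any index that sets neither bit is treated as a present disk.

```python
import subprocess


def smartctl_command(device, raid_type, index):
    """Build the smartctl invocation for disk N behind a RAID controller."""
    return ["smartctl", "-i", "-d", f"{raid_type},{index}", device]


def run_smartctl(command):
    """Run smartctl and return its exit status (a bitmask)."""
    return subprocess.run(command, capture_output=True, text=True).returncode


def discover_disks(device="/dev/sda", raid_type="megaraid", max_index=30,
                   run=run_smartctl):
    """Return the indexes N for which smartctl could open a disk."""
    found = []
    for index in range(max_index + 1):
        status = run(smartctl_command(device, raid_type, index))
        # Bits 0 and 1 set mean "bad command line" or "open failed".
        if not status & 0b11:
            found.append(index)
    return found


if __name__ == "__main__":
    print(discover_disks())
```

The run parameter exists only so the loop can be exercised without real hardware; in a charm it would stay at its default.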

Xav Paice (xavpaice) wrote :

Retargeted to the hw-health charm.

Changed in nrpe-charm:
status: New → Invalid
tags: added: canonical-bootstack

Drew Freiberger (afreiberger) wrote :

MegaRAID checks were added to the charm; however, the MegaRAID software is not detecting all SMART errors on the disks.

The following private pastebin shows smartctl reporting errors on megaraid,3 (fourth logical disk, fifth physical), while no errors or SMART alerts have been logged in the megacli output.

https://pastebin.canonical.com/p/jwfcQyK7d2/

We must add smartctl -l error -d megaraid,<X> /dev/sda to the charm to check the error log.

The check has to run smartctl against every physical disk slot number reported by megacli ldpdinfo.
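A minimal sketch of that loop is below, with several assumptions called out. It takes the N for -d megaraid,N from the "Device Id" field of the ldpdinfo output; the comment above speaks of slot numbers, so which field actually maps to N should be verified on real hardware. It also treats bit 6 (value 64) of smartctl's exit status, documented as "the device error log contains records of errors", as the alert condition, which may not be set for every disk type. All function names are hypothetical.

```python
import re
import subprocess

# Assumption: each physical disk section contains a "Device Id: N" line.
DEVICE_ID = re.compile(r"^Device Id\s*:\s*(\d+)\s*$", re.MULTILINE)

ERROR_LOG_BIT = 0x40  # smartctl exit status bit 6: error log has entries


def device_ids(ldpdinfo_output):
    """Extract the unique device ids listed in ldpdinfo style output."""
    return sorted({int(value) for value in DEVICE_ID.findall(ldpdinfo_output)})


def error_log_command(device_id, device="/dev/sda"):
    """Build the smartctl error log query for one disk on the controller."""
    return ["smartctl", "-l", "error", "-d", f"megaraid,{device_id}", device]


def run_smartctl(command):
    """Run smartctl and return its exit status (a bitmask)."""
    return subprocess.run(command, capture_output=True, text=True).returncode


def disks_with_logged_errors(ldpdinfo_output, run=run_smartctl):
    """Return device ids whose SMART error log reports recorded errors."""
    flagged = []
    for device_id in device_ids(ldpdinfo_output):
        if run(error_log_command(device_id)) & ERROR_LOG_BIT:
            flagged.append(device_id)
    return flagged
```

Parsing the smartctl text output instead of the exit status is another option, at the cost of handling the differing ATA and SAS log formats.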

Changed in hw-health-charm:
status: New → Confirmed

Xav Paice (xavpaice) wrote :

This requires smartctl to be fully integrated into the charm, with a config option for the controller number.

Will mark as wishlist, pending smartctl addition.

Changed in hw-health-charm:
importance: Undecided → Wishlist
Eric Chen (eric-chen)
Changed in charm-hw-health:
status: Confirmed → Won't Fix