Sector read error checks

Bug #1851389 reported by David O Neill on 2019-11-05
This bug affects 1 person
Affects: hw-health-charm
Importance: Wishlist
Assigned to: Unassigned

Bug Description

We need checks for sector read errors. Example smartd output:

Nov 05 13:30:40 dcs1-clp-nod9 smartd[9854]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 75 to 74
Nov 05 13:30:40 dcs1-clp-nod9 smartd[9854]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 25 to 26
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdb [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 76 to 75
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 24 to 25
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdc [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sde [SAT], 1 Currently unreadable (pending) sectors
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdf [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 73
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdf [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 1
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 80 to 82
Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_00] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 1 to 2
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 81 to 83
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 1 to 2
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 75 to 74
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_02] [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 25 to 26
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_04] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 83 to 84
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_05] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_06] [SAT], 1 Currently unreadable (pending) sectors
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 100 to 73
Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 [megaraid_disk_07] [SAT], SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 100 to 1
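As a rough illustration of what such a check might look for, here is a minimal sketch that pulls the "Currently unreadable (pending) sectors" messages out of smartd log lines like the ones above. The regular expression is derived only from the lines quoted in this report, and the names `PENDING_RE` and `pending_sectors` are hypothetical; this is not code from hw-health-charm.

```python
import re

# Pattern based solely on the smartd messages quoted in this report;
# other smartd versions or device types may log differently (assumption).
PENDING_RE = re.compile(
    r"Device: (?P<device>.+?) \[\w+\], "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def pending_sectors(log_lines):
    """Return {device: most recent pending-sector count} seen in the log."""
    result = {}
    for line in log_lines:
        match = PENDING_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so the newest count wins.
            result[match.group("device")] = int(match.group("count"))
    return result


SAMPLE = [
    "Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sdd [SAT], "
    "SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 82 to 83",
    "Nov 05 13:30:41 dcs1-clp-nod9 smartd[9854]: Device: /dev/sde [SAT], "
    "1 Currently unreadable (pending) sectors",
    "Nov 05 13:30:42 dcs1-clp-nod9 smartd[9854]: Device: /dev/bus/0 "
    "[megaraid_disk_06] [SAT], 1 Currently unreadable (pending) sectors",
]

if __name__ == "__main__":
    for device, count in pending_sectors(SAMPLE).items():
        print(f"{device}: {count} pending sector(s)")
```

Attribute-change lines such as Raw_Read_Error_Rate are deliberately ignored here; only the explicit pending-sector messages are counted.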

Andrea Ieri (aieri) on 2020-01-30
affects: nrpe-charm → hw-health-charm
Xiyue Wang (ziyiwang) on 2020-04-02
Changed in charm-hw-health:
importance: Undecided → High
Changed in charm-hw-health:
status: New → Confirmed
assignee: nobody → Peter Sabaini (peter-sabaini)
Peter Sabaini (peter-sabaini) wrote :

I've been looking into how to improve disk monitoring a bit. IMHO, monitoring individual read errors doesn't tell you much about disk health; to make predictions, you'd need to monitor whether the number of read errors or the rate of read errors crosses some threshold. One option I've looked at is a Nagios plugin from Thomas Krenn [0] that queries smartctl. Unfortunately, driving smartctl for disks behind a RAID controller is vendor-specific, and this plugin would require keeping a database of drive specifics up to date.
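To make the threshold idea concrete, here is a minimal sketch of a check that looks at both the absolute value of a cumulative error counter and how fast it grows. It is not taken from the Thomas Krenn plugin or from the charm: the `Sample` type, the function names, and all threshold values are hypothetical, and the return codes simply follow the usual Nagios convention (0 OK, 1 WARNING, 2 CRITICAL).

```python
from dataclasses import dataclass

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


@dataclass
class Sample:
    timestamp: float  # seconds since the epoch
    count: int        # cumulative error counter, e.g. a pending-sector count


def error_rate_per_hour(samples):
    """Counter growth per hour between the first and the last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1].timestamp - samples[0].timestamp
    if elapsed <= 0:
        return 0.0
    return (samples[-1].count - samples[0].count) * 3600.0 / elapsed


def evaluate(samples, warn_count, crit_count, crit_rate_per_hour):
    """Map a series of samples to a Nagios-style status code."""
    if not samples:
        return OK
    latest = samples[-1].count
    rate = error_rate_per_hour(samples)
    # Either a high absolute count or fast growth is treated as critical.
    if latest >= crit_count or rate >= crit_rate_per_hour:
        return CRITICAL
    if latest >= warn_count:
        return WARNING
    return OK


if __name__ == "__main__":
    # Hypothetical history: the counter goes from 0 to 20 in half an hour.
    history = [Sample(0, 0), Sample(1800, 20)]
    print(evaluate(history, warn_count=5, crit_count=50, crit_rate_per_hour=10))
```

A single isolated error stays below both thresholds in this scheme, while a counter that climbs quickly raises an alert even before the absolute limit is reached.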

OTOH, we already have some support for checking drive health:

a) via vendor-specific tools such as megacli, which report drive state

b) via IPMI for drive faults and, at least on some systems, for predictive failures

At this point I wonder what additional smartctl monitoring would buy us. I'm marking this as wishlist since it's a new feature.

[0] https://github.com/thomas-krenn/check_smart_attributes

Changed in charm-hw-health:
importance: High → Wishlist
assignee: Peter Sabaini (peter-sabaini) → nobody