COMMISSION S.M.A.R.T Tests fail unnecessarily on code 64 (past log entries)

Bug #1783889 reported by Evan Sikorski
This bug affects 15 people
Affects: MAAS
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

If a disk has *old* indications of failure in its SMART log, a return code of 64 is provided.

This causes MAAS to mark the node as having failed hardware tests, which is not a valid indicator of whether the disk is actively failing now.

Based off this article https://alexander.kirk.at/2013/02/07/munin-smart-plugin-ignore-error-in-the-past/

We have changed your code as follows:

File changed: `/usr/lib/python3/dist-packages/metadataserver/builtin_scripts/smartctl.py`

Lines Changed:
159 if proc.returncode != 0 and proc.returncode != 4 and proc.returncode != 64:

and

172 return 0 if proc.returncode == 4 or proc.returncode == 64 else proc.returncode

Please review and suggest if this is satisfactory for addition to MAAS or if it needs further development/review first.


Changed in maas:
milestone: none → 2.5.0alpha2
status: New → Triaged
importance: Undecided → Medium
Evan Sikorski (evan.sikorski) wrote :

We made a small improvement to this if you wish to incorporate it.

if (proc.returncode & 187) != 0:

and

return 0 if (proc.returncode & 187) == 0 else proc.returncode

"Those would be smarter ways to do what we were wanting, similar to what the munin guy is doing. Now *ALL* of the return codes 4, 64, and 68 will return 0!!!
Not sure if you can update the bug with those lines as suggestions vs. what I had before.
This will also now always mask those errors, so that when people see the return codes they can ignore the `4` and the `64` because it won't even show in the exit code."

Tyler Gray (tyler.gray) wrote :

So there are 2 thoughts on how to solve this:
(Still referencing lines 159 and 172)

1) Account for all permutations of the bits that will be flagged for the test failures we want to ignore (0, 4, 64, and 68)

In which case the code would need to be:
if (proc.returncode != 0 and proc.returncode != 4 and proc.returncode != 64 and proc.returncode != 68):

and

return (0 if proc.returncode == 4 or proc.returncode == 64 or proc.returncode == 68 else proc.returncode)

2) Ignore those flags entirely by masking out the corresponding bits with a bitwise operation

In which case the code could be:
if (proc.returncode & 187) != 0:

and

return 0 if (proc.returncode & 187) == 0 else proc.returncode

And you might even be able to shorten line 172 to just the below, and remove the if/else completely:
return (proc.returncode & 187)

Not sure how the devs would prefer to handle it, but I wanted to throw out a more complete solution.
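
As a quick check of the two approaches above, the following sketch confirms that comparing against 0, 4, 64 and 68 and masking with 187 accept exactly the same exit codes across the whole 0-255 range. This is not MAAS code; the helper names `passes_explicit` and `passes_masked` are made up for illustration. 187 is 0b10111011, that is, every bit except bit 2 (value 4) and bit 6 (value 64).

```python
# Illustrative only: these helpers are hypothetical and not part of MAAS.
IGNORED_BITS = 4 | 64          # bit 2 and bit 6
MASK = 0xFF & ~IGNORED_BITS    # 187, i.e. 0b10111011


def passes_explicit(returncode):
    """Option 1: list every combination of the ignored bits."""
    return returncode in (0, 4, 64, 68)


def passes_masked(returncode):
    """Option 2: clear the ignored bits and see whether anything is left."""
    return (returncode & MASK) == 0


# Collect every 8-bit exit code on which the two options disagree.
mismatches = [rc for rc in range(256)
              if passes_explicit(rc) != passes_masked(rc)]
```

The two are equivalent today, but the explicit list doubles in length for every additional bit that should be ignored, whereas the mask only changes by one constant.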

Lee Trager (ltrager) wrote :

Thanks for the report. I'm trying to understand what exactly is failing. Could you please post output of the following:

* The output of the MAAS smartctl test
* sudo smartctl --xall <disk with error>; echo $?

Changed in maas:
status: Triaged → Incomplete
Tyler Gray (tyler.gray) wrote :

A few things to note:
1) Scratch the last suggestion of "return (proc.returncode & 187)", that would mask the error even if more than just bits 2 and 6 were flagged from a failure.

2) According to the man page for smartctl, any error present in a log, even past ones, will continue to flag the 6th bit (with the 8 bits marked as 0-7). So even if there are errors that are in the log, but nothing that is actively wrong with the disk, the smartctl error will still flag the 6th bit, causing the exit code to be 64 when translated from binary.

I'll post a failed log for you when I get a chance.
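
To make the bit numbering from point 2 concrete, here is a small sketch that decodes a smartctl exit status into the bits that are set. The descriptions are paraphrased from the EXIT STATUS section of the smartctl man page; the function name `decode_smartctl_status` is hypothetical and only for illustration.

```python
# Paraphrased from the EXIT STATUS section of smartctl(8).
BIT_MEANINGS = {
    0: "command line did not parse",
    1: "device open failed or device is in a low-power mode",
    2: "a SMART or other ATA command failed, or a checksum error occurred",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes found at or below threshold",
    5: "attributes were at or below threshold at some time in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_smartctl_status(returncode):
    """Return (bit, meaning) pairs for every bit set in the exit status."""
    return [(bit, BIT_MEANINGS[bit])
            for bit in range(8)
            if returncode & (1 << bit)]


if __name__ == "__main__":
    for bit, meaning in decode_smartctl_status(64):
        print("bit %d: %s" % (bit, meaning))
```

An exit status of 64 decodes to bit 6 alone, while 72 decodes to bits 3 and 6, which is the kind of case where hiding bit 6 must not hide the real failure.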

Tyler Gray (tyler.gray) wrote :

INFO: Veriying SMART support for the following drive: /dev/sdm
INFO: Running command: sudo -n smartctl --all /dev/sdm

INFO: SMART support is available; continuing...
INFO: Verifying and/or validating SMART tests...
INFO: Running command: sudo -n smartctl --xall /dev/sdm

FAILURE: SMART tests have FAILED for: /dev/sdm
The test exited with return code 64! See the smarctl manpage for information on the return code meaning. For more information on the test failures, review the test output provided below.
---------------------------------------------------

smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-108-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model: SSDSC2BB120G7R
Serial Number: PHDV808402XH150MGN
LU WWN Device Id: 5 5cd2e4 14f1c6a0b
Add. Product Id: DELL(tm)
Firmware Version: N201DL43
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Thu Jul 26 18:55:37 2018 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
ATA Security is: Unavailable
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
     was completed without error.
     Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
     without error or no self-test has ever
     been run.
Total time to complete Offline
data collection: ( 18) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
     No Auto Offline data collection support.
     Suspend Offline collection upon new
     command.
     Offline surface scan supported.
     Self-test supported.
     Conveyance Self-test supported.
     Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
     power-saving mode.
     Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
     General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 60) minutes.
Conveyance self-test routine
recommended polling time: ( 60) minutes.
SCT capabilities: (0x003d) SCT Status supported.
     SCT Error Recovery Control supported.
     SCT Feature Control supported.
     SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate -OSR-- 130 130...

Lee Trager (ltrager) wrote :

Thanks for posting the log. It looks like your drive is experiencing read errors. While these errors are recoverable, they may affect performance. You can still use the machine by using the 'override failed test' machine operation in the UI or over the API.

ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate -OSR-- 130 130 039 - 8597
 13 Read_Soft_Error_Rate -OSRC- 130 130 000 - 8597
201 Unknown_SSD_Attribute PO--CK 100 100 010 - 227633478306

Tyler Gray (tyler.gray) wrote :

Okay, maybe I need more help reading these logs, but I've been trying to study them. From what I can tell, what you posted does not actually appear to show read errors that one would need to be concerned with.

Here are some wikis I've used to help me understand the errors:
https://lime-technology.com/wiki/Understanding_SMART_Reports
https://en.wikipedia.org/wiki/S.M.A.R.T.

According to these wikis, there are a few things to note:
1) The columns VALUE, WORST, and THRESH tend to start at 100 and count down. So if the current value was lower than 039 (currently at 130), then it would signify that there is a problem with the drive.

2) The column FAIL seems to indicate the last operational hour (from attribute 9 Power_On_Hours) that this attribute failed. Right now that column is blank ('-').

3) It mentions that the RAW_VALUE column should basically be ignored. Its meaning is entirely up to the drive manufacturer. These are Intel drives.

The overall result of that section of the test was:
SMART overall-health self-assessment test result: PASSED

So even with those values, smartctl isn't really declaring that the drive is having issues.
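
The reading of the attribute table described above can be sketched in a few lines. This assumes the eight-column layout shown in the log (ID#, ATTRIBUTE_NAME, FLAGS, VALUE, WORST, THRESH, FAIL, RAW_VALUE) and the interpretation that an attribute is only a concern when its normalized VALUE is at or below THRESH; RAW_VALUE is deliberately ignored. The function name is hypothetical.

```python
# The three attribute rows quoted earlier in this report.
ROWS = """\
  1 Raw_Read_Error_Rate -OSR-- 130 130 039 - 8597
 13 Read_Soft_Error_Rate -OSRC- 130 130 000 - 8597
201 Unknown_SSD_Attribute PO--CK 100 100 010 - 227633478306
"""


def attributes_at_or_below_threshold(table):
    """Return the names of attributes whose VALUE is <= THRESH."""
    failing = []
    for line in table.splitlines():
        fields = line.split()
        # Skip headers and anything that is not an attribute row.
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, thresh = int(fields[3]), int(fields[5])
        if value <= thresh:
            failing.append(name)
    return failing


failing = attributes_at_or_below_threshold(ROWS)
```

For the rows above, `failing` comes back empty, which matches the PASSED overall-health result.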

Here's an example from another server of ours where the smartctl results came back clean, with the only difference being that there were no entries in the device's error log:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate -OSR-- 130 130 039 - 8637
 13 Read_Soft_Error_Rate -OSRC- 130 130 000 - 8637
201 Unknown_SSD_Attribute PO--CK 100 100 010 - 103079492898

So this is why, from what I can tell, if a drive simply has an entry in its error log, which flips bit 6 (return code 64 in decimal), that entry will stay in the log permanently, and it could thus be ignored if that's the only bit flagged in the smartctl return code.

Changed in maas:
milestone: 2.5.0alpha2 → 2.5.0beta1
Changed in maas:
milestone: 2.5.0beta1 → 2.5.0beta2
Changed in maas:
milestone: 2.5.0beta2 → 2.5.0rc1
Changed in maas:
milestone: 2.5.0rc1 → 2.5.x
KingJ (kj-kingj) wrote :

I am also experiencing this on a few drives. Their regular smartctl -a output passes with a return code of 0; however, when run with --xall, old errors present in the log cause the test to fail.

In my case, the errors are all related to failed WRITE FPDMA QUEUED commands. These were caused by a faulty backplane, rather than a faulty disk. However, as a result the disk is now persistently marked as failing smartctl tests by MAAS despite smartctl reporting that it has passed every single test after the WRITE FPDMA QUEUED errors as well as a badblocks test.

Personally, it feels wrong to mark the drive as failed in this instance since the fault was caused by other hardware in the past, and the drive has subsequently passed any checks performed against it*.

* Checking the log again, I see that I've not performed an extended test at any point. I wonder whether, if I were to perform an extended test and it passed, smartctl would disregard the old entries in the log and return an RC of 0? smartctl's man page does seem to imply that this is the case for bit 7 - "The device self-test log contains records of errors. [ATA only] Failed self-tests outdated by a newer successful extended self-test are ignored.". However, as RC=64 is bit 6, it may not work. I'll try it and report back here...

I've attached a log showing the output of:

1) smartctl -a
2) echo $?
3) smartctl --xall
4) echo $?

Jan Klare (j-klare) wrote :

Is there still work going on to get this merged? We are currently also encountering this issue with SSDs where the smartctl error log includes old errors from a power outage about 2 years ago, and therefore the return code is 64, as mentioned in the bug description.

Jan Klare (j-klare) wrote :

bump

Paul Tobias (tobias.pal) wrote :

Maybe instead of --xall it would be better to use --health?

For me smartctl --xall exits with code 64:
# smartctl --xall /dev/sdh; echo $?
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-88-generic] (local build)
...snip...
SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 4
...snip...
64

But with --health it returns with success:
# smartctl --health /dev/sdh; echo $?
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-88-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

0

These errors in the logs are past errors. They do not indicate that the drive is failing. For example, an error log like this is generated if `smartctl --test=short --captive` is run, because the kernel detects that the drive hasn't responded for a long time and then resets the drive, with `ataX: hard resetting link` appearing in dmesg. There seems to be no way of clearing these logs, so the drive will report as Failed in MAAS forever if --xall is used.

Paul Tobias (tobias.pal) wrote :

Until then I've fixed it for myself with this patch and restarted maas-regiond and now I don't have failing hardware tests any more:

--- /usr/lib/python3/dist-packages/metadataserver/builtin_scripts/smartctl.py.orig
+++ /usr/lib/python3/dist-packages/metadataserver/builtin_scripts/smartctl.py
@@ -245,7 +245,7 @@
     print('INFO: Verifying SMART data on %s' % device_name)
     try:
         output = run_smartctl(
-            blockdevice, ['--xall'], device, output=True, stderr=STDOUT)
+            blockdevice, ['--health'], device, output=True, stderr=STDOUT)
     except TimeoutExpired:
         print('ERROR: Validating %s timed out!' % device_name)
         raise

Jan Klare (j-klare) wrote :

Just ran into that issue again while installing another MAAS server. It would be great if we could agree on a solution for this.
Cheers,
Jan

Changed in maas:
status: Incomplete → New
no longer affects: maas/2.4
Changed in maas:
milestone: 2.5.x → none
Björn Tillenius (bjornt) wrote :

I'm fine with either solution:

  1) run smartctl --health instead of --xall
  2) detect that the error was in the past, and either
     ignore it, or automatically override testing, so
     that the machine is usable, but with a warning.
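
A rough sketch of what option 2 might look like, assuming that "the error was in the past" is approximated by the smartctl exit status having bit 6 set (error log contains records) and no other bits apart from bit 2, which the patched lines in the bug description also tolerate. The function name and the three result strings are made up for illustration and are not MAAS's API.

```python
# Hypothetical sketch, not MAAS code.
TOLERATED = 4    # bit 2, also tolerated by the patched lines in the description
PAST_LOG = 64    # bit 6: device error log contains records of errors


def classify(returncode):
    """Return 'passed', 'warning' or 'failed' for a smartctl exit status."""
    # Any bit other than bits 2 and 6 is treated as a real failure.
    if returncode & ~(TOLERATED | PAST_LOG) & 0xFF:
        return "failed"
    # Only old log entries: usable, but worth surfacing to the operator.
    if returncode & PAST_LOG:
        return "warning"
    return "passed"


if __name__ == "__main__":
    for rc in (0, 4, 64, 68, 72):
        print(rc, classify(rc))
```

Whether a warning state fits into how MAAS records test results is a separate question; the sketch only shows that the distinction can be made from the exit status alone.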

Changed in maas:
status: New → Triaged
Brent Barr (brentbarr) wrote :

Still an issue in 3.0/stable.

INFO: Verifying SMART support for the following drive: /dev/sdb
INFO: Running command: sudo -n smartctl --all /dev/sdb
INFO: SMART support is available; continuing...
INFO: Verifying SMART data on /dev/sdb
INFO: Running command: sudo -n smartctl --xall /dev/sdb
FAILURE: SMART tests have FAILED for: /dev/sdb
The test exited with return code 64! See the smarctl manpage for information on the return code meaning. For more information on the test failures, review the test output provided below.

Bryan Seitz (seitz-a) wrote :

Confirmed an issue for me as well with 3/stable from SNAP.

INFO: Verifying SMART support for the following drive: /dev/sda
INFO: Running command: sudo -n smartctl --all /dev/sda
INFO: SMART support is available; continuing...
INFO: Verifying SMART data on /dev/sda
INFO: Running command: sudo -n smartctl --xall /dev/sda
FAILURE: SMART tests have FAILED for: /dev/sda
The test exited with return code 64! See the smarctl manpage for information on the return code meaning. For more information on the test failures, review the test output provided below.

Changed in maas:
milestone: none → 3.4.0
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.x
Jan Klare (j-klare) wrote :

Just saw that this was moved again. IMHO it is pretty low-hanging fruit, since there were already a bunch of suggestions on how to fix it, including at least one patch set. I am happy to rebase this, but I would not want to invest the time if nobody cares on the maintainer side.

Changed in maas:
milestone: 3.4.x → 3.5.x
Jan Klare (j-klare) wrote :

Hi Anton,

I just saw that this was moved again from 3.4 to 3.5. What is blocking here and what needs to happen to get this implemented/merged?

Cheers,
Jan
