SMART test triggering false failures on some drives.

Bug #1612220 reported by Jerry Kao
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
Checkbox Provider - Base | Fix Released | High | Jeff Lane |

Bug Description

The SMART test failed once (the SMART self-test log showed either an error or an uncompleted test). Rerunning the SMART test several times gives passing results, but checkbox (disk_smart) still reports a failure.

It seems checkbox looks into the historical log and reports a failure if any entry shows an error or an uncompleted test, even though the most recent tests passed.

 u@u-Inspiron-14-3467:~$ sudo ./disk_smart -b /dev/sda -s 130 -t 530
INFO Starting SMART self-test on /dev/sda
ERROR FAIL: SMART Self-Test appears to have failed for some reason. Run 'sudo smartctl -l selftest /dev/sda' to see the SMART log
u@u-Inspiron-14-3467:~$ sudo smartctl -l selftest /dev/sda
[sudo] password for u:
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-33-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3 -
# 2 Short offline Completed without error 00% 3 -
# 3 Short offline Completed without error 00% 3 -
# 4 Short offline Completed without error 00% 3 -
# 5 Short offline Completed without error 00% 3 -
# 6 Short offline Completed without error 00% 3 -
# 7 Short offline Completed without error 00% 3 -
# 8 Short offline Completed without error 00% 3 -
# 9 Short offline Completed without error 00% 3 -
#10 Short offline Completed without error 00% 2 -
#11 Short offline Completed without error 00% 2 -
#12 Extended offline Completed without error 00% 0 -
#13 Extended offline Self-test routine in progress 90% 0 -
#14 Short offline Completed without error 00% 0 -
#15 Short offline Self-test routine in progress 90% 0 -

In the log above, the 12 most recent tests (#1 through #12) completed without error. Entries #13 and #15 did not complete.
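
For illustration, here is a minimal Python sketch of the behaviour being asked for: parse the rows of the self-test log and judge the drive by entry #1 only. It is not the disk_smart code. The names parse_selftest_log and latest_result are hypothetical, and the regular expression assumes every test description is two words, as in the log quoted above.

```python
import re

# Matches one row of the 'smartctl -l selftest' table as quoted above.
# Assumption: the test description is always two words
# (e.g. "Short offline", "Extended offline").
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>\S+\s+\S+)\s+(?P<status>.+?)\s+"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)\s*$"
)


def parse_selftest_log(text):
    """Return the log rows as dicts, in the order smartctl printed them."""
    entries = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            entry = match.groupdict()
            for key in ("num", "remaining", "hours"):
                entry[key] = int(entry[key])
            entries.append(entry)
    return entries


def latest_result(entries):
    """Judge the drive by entry #1 only (hypothetical policy)."""
    if not entries:
        return "unknown"
    status = entries[0]["status"].lower()
    if status.startswith("completed without error"):
        return "pass"
    if "in progress" in status:
        return "running"
    return "fail"
```

With this policy, the log above would count as a pass, because entry #1 completed without error and the older uncompleted entries are ignored.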

Jerry Kao (jerry.kao)
Changed in plainbox-provider-checkbox:
status: New → Confirmed
importance: Undecided → High
tags: added: ce-qa-concern
Jerry Kao (jerry.kao) wrote :

Ran the disk SMART test with --debug:

http://paste.ubuntu.com/23049074/

Pierre Equoy (pieq) wrote :

Jeff, could you have a look at the script and see whether the modification Jerry is asking for is possible, and whether it would conflict with how the server team uses the script?

Changed in plainbox-provider-checkbox:
assignee: nobody → Jeff Lane (bladernr)
Jeff Lane  (bladernr)
summary: - SMART test look into historical logs
+ SMART test triggering false failures on some drives.
Changed in plainbox-provider-checkbox:
status: Confirmed → Incomplete
Jeff Lane  (bladernr) wrote :

Jerry,

Can you run the attached version of disk_smart like this:

sudo ./disk_smart -d -b /dev/sdX

where sdX is the drive you're seeing this failure on?

I don't think the problem is that it's looking too far into the history; I think the problem is the drive you're using. I've tested this on three drives here: one 2TB SATA II, one 320GB SATA and one 240GB SATA II SSD. All but the 320GB SATA disk pass easily. On the 320GB disk, I found that the following happens:

Test initiates a short test via smartctl (smartctl -t short /dev/sdX)
Test then begins polling for a change in the log file (smartctl -l selftest /dev/sdX)
Test fails because it encounters THIS in the #1 spot and smartctl returns a 128 error code, which indicates that the log shows a test failure:

# 1 Short offline Interrupted (host reset) 90% 17355 -

However, this is just what the drive is returning. The test is actually still running, despite what the firmware is saying, and the SMART test errors out.
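
The 128 above is one bit of smartctl's exit status, which is a bitmask. As a rough Python sketch (decode_exit_status is a hypothetical helper, and the descriptions are paraphrased from the EXIT STATUS section of the smartctl man page), the value can be decoded like this:

```python
# Bit meanings paraphrased from the "EXIT STATUS" section of the
# smartctl man page; the helper name below is hypothetical.
EXIT_BITS = {
    0: "command line did not parse",
    1: "device open failed or device did not return identify data",
    2: "a SMART command failed or a checksum error was found",
    3: "SMART status check returned DISK FAILING",
    4: "prefail attributes at or below threshold",
    5: "some attributes were at or below threshold in the past",
    6: "device error log contains records of errors",
    7: "device self-test log contains records of errors",
}


def decode_exit_status(code):
    """Return the description of every bit set in a smartctl exit status."""
    return [text for bit, text in EXIT_BITS.items() if code & (1 << bit)]
```

An exit status of 128 has only bit 7 set, which says the self-test log contains records of errors. It does not say which entry caused it.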

To manually recreate this, do the following:

sudo smartctl -t short /dev/sdX
while true; do sudo smartctl -l selftest /dev/sdX; sleep 5; done

on the failing system.
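
If an endless loop is inconvenient, a bounded version of the same polling can be sketched in Python. This is only an illustration, not the polling routine in disk_smart. The function names, the 5 second interval and the 120 second limit are assumptions, and the smartctl call needs root just like the commands above.

```python
import subprocess
import time


def read_selftest_log(device):
    """Run 'smartctl -l selftest' and return (exit status, output text)."""
    proc = subprocess.run(
        ["smartctl", "-l", "selftest", device],
        stdout=subprocess.PIPE, universal_newlines=True,
    )
    return proc.returncode, proc.stdout


def poll_selftest_log(device, interval=5, timeout=120,
                      read_log=read_selftest_log, sleep=time.sleep):
    """Collect log snapshots every `interval` seconds until `timeout`."""
    snapshots = []
    elapsed = 0
    while elapsed <= timeout:
        snapshots.append(read_log(device))
        sleep(interval)
        elapsed += interval
    return snapshots
```

Each snapshot keeps the exit status next to the log text, so it is easy to see at which point entry #1 changes and whether the exit status changes with it.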

You should see something like the following UNTIL the test is completed:
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-31-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Interrupted (host reset) 90% 17355 -
# 2 Short offline Completed without error 00% 17355 -
# 3 Short offline Completed without error 00% 17355 -
# 4 Short offline Completed without error 00% 17355 -
# 5 Short offline Completed without error 00% 17355 -
# 6 Short offline Completed without error 00% 17354 -
# 7 Short offline Completed without error 00% 17354 -
# 8 Short offline Completed without error 00% 17354 -
# 9 Short offline Completed without error 00% 17354 -
#10 Short offline Completed without error 00% 17353 -
#11 Short offline Completed without error 00% 17353 -
#12 Short offline Completed without error 00% 17352 -
#13 Short offline Completed without error 00% 17352 -
#14 Short offline Completed without error 00% 17352 -
#15 Short offline Completed without error 00% 17279 -
#16 Short offline Completed: read failure 90% 14832 2238
#17 Short offline Completed: read failure 90% 14832 2238
#18 Short offline Completed without error 00% 4658 -
#19 Short offline Aborted by host 10% 4658 -
#20 Short offline Completed without error 00% 724 -

and once the test IS completed, you should see that #1 h...

Jeff Lane  (bladernr) wrote :

Please send me the output of that version of disk_smart, as well as output from

smartctl -l selftest /dev/sdX

run immediately after you manually start a short test via smartctl.

Jerry Kao (jerry.kao) wrote :

Jeff,

The test run with the disk_smart version you attached in comment #3 did not stop for more than 40 minutes. I stopped it with Ctrl+C. The log is attached.

Jerry Kao (jerry.kao) wrote :

The nonstop issue happens with the original disk_smart as well

Jeff Lane  (bladernr) wrote : Re: [Bug 1612220] Re: SMART test triggering false failures on some drives.

Thanks... I see, I think, what is happening now. Is there any way you
can make that machine accessible via SSH (via Yantok perhaps) so I can
try tweaking the polling routine?

Your original assertion is correct, but it's not that it's looking
at history so much as it's seeing that "In Progress" message and
picking that up. :/ This wouldn't happen on a drive that didn't have
a stuck Extended Offline test run on it (as that one seems to have).

I may be able to recreate that scenario here, but getting access to
that system would be easier, unless you need it for other work and
can't spare it for a couple days.

On Tue, Aug 16, 2016 at 6:02 AM, Jerry Kao <email address hidden> wrote:
> The nonstop issue happens with the original disk_smart as well

Jeff Lane  (bladernr) wrote :

OK, I had it fixed before the "In Progress" log lines disappeared from the failing system. It should not be an issue now. I also added a lot more debug verbosity and tweaked some things to make it work better.

Changed in plainbox-provider-checkbox:
status: Incomplete → In Progress
Jeff Lane  (bladernr)
Changed in plainbox-provider-checkbox:
importance: High → Medium
importance: Medium → High
tags: added: server-cert
tags: added: hwcert-server
removed: server-cert
Pierre Equoy (pieq)
Changed in plainbox-provider-checkbox:
status: In Progress → Fix Committed
Changed in plainbox-provider-checkbox:
milestone: none → 0.34.0
Changed in plainbox-provider-checkbox:
status: Fix Committed → Fix Released