devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

Bug #445852 reported by theluketaylor
518
This bug affects 88 people
Affects                     Status        Importance  Assigned to
EasyPeasy                   Fix Released  High        Unassigned
Linux                       Won't Fix     Medium
    Nominated for 2.6.31 by Юрий Аполлов
libatasmart                 Confirmed     Low
devicekit-disks (Fedora)    New           Undecided   Unassigned
devicekit-disks (Ubuntu)    Invalid       Undecided   Unassigned
    Declined for Dapper by Martin Pitt
    Declined for Hardy by Martin Pitt
    Declined for Intrepid by Martin Pitt
    Declined for Jaunty by Martin Pitt
    Nominated for Maverick by rogmorri
    Karmic                  Fix Released  Critical    Martin Pitt
    Lucid                   Invalid       Undecided   Unassigned
libatasmart (Ubuntu)        Fix Released  High        Martin Pitt
    Declined for Dapper by Martin Pitt
    Declined for Hardy by Martin Pitt
    Declined for Intrepid by Martin Pitt
    Declined for Jaunty by Martin Pitt
    Nominated for Maverick by rogmorri
    Karmic                  Won't Fix     Medium      Unassigned
    Lucid                   Fix Released  High        Martin Pitt

Bug Description

TEMPORARY WORKAROUND FOR THIS PROBLEM IN KARMIC (this is now also in karmic-proposed and needs testing feedback):

1. sudo gedit /lib/udev/rules.d/95-devkit-disks.rules

2. Locate the following lines (about a third of the way into the file; search for "smart"):

# ATA disks driven by libata
KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"

3. Comment out the second line by adding a # in front of it, so that you have:

# ATA disks driven by libata
#KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"

4. Save the file and reboot.
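For anyone applying this on several machines, steps 2 and 3 can be scripted. The Python sketch below is only an illustration, not part of the official fix: it assumes the rule file lives at the path given in step 1 and that the rule to disable is any active line containing devkit-disks-probe-ata-smart. The function names are hypothetical. It keeps a backup copy so the change is easy to undo.

```python
import shutil

# Path taken from step 1 of the workaround (Karmic layout; may differ elsewhere).
RULES_PATH = "/lib/udev/rules.d/95-devkit-disks.rules"
PROBE = "devkit-disks-probe-ata-smart"


def disable_smart_probe(rules_text: str) -> str:
    """Comment out every active udev rule line that runs the SMART prober.

    Lines that are already comments are left alone, so running this twice
    gives the same result as running it once.
    """
    out = []
    for line in rules_text.splitlines(keepends=True):
        if PROBE in line and not line.lstrip().startswith("#"):
            line = "#" + line
        out.append(line)
    return "".join(out)


def apply_to_file(path: str = RULES_PATH) -> None:
    """Hypothetical helper: back up the rule file, then rewrite it (needs root)."""
    shutil.copy2(path, path + ".orig")
    with open(path) as f:
        new_text = disable_smart_probe(f.read())
    with open(path, "w") as f:
        f.write(new_text)
```

Run `apply_to_file()` as root and then reboot, as in step 4; copying the `.orig` file back undoes the change.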

TECHNICAL ANALYSIS: https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202
LUCID STATUS: https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/203
KARMIC SOLUTION: https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/204

BUG DESCRIPTION FOLLOWS:

In the Karmic beta I experience SSD stalls during the boot process. They happen almost every time before xsplash loads, and frequently again between logging in at gdm and the desktop loading. When a stall happens during login, I think it makes GNOME time out while loading panel items, as I get errors about many panel items failing to load. If I log out and back in again when the SSD isn't stalled, the panel items load fine.

When it happens, the following messages appear before xsplash (or in dmesg when it happens after gdm):

ata2: lost interrupt (Status 0x58)
ata2: drained 16384 bytes to clear DRQ.
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: BMDMA stat 0x4
ata2.00: cmd c8/00:40:cb:60:32/00:00:00:00:00/e0 tag 0 dma 32768 in
res 58/00:40:cb:60:32/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
ata2.00: status: { DRDY DRQ }
ata2: soft resetting link
ata2.00: configured for UDMA/66
ata2: EH complete
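The `cmd` and `res` lines are a dump of the ATA task-file registers, and decoding them shows what the drive was asked to do when it failed. The sketch below assumes libata prints the fields in the order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device, and uses the standard ATA status-register and bus-master IDE (SFF-8038i) status bit names; the function names are hypothetical. For the log above it gives command 0xC8 (READ DMA) of 0x40 = 64 sectors, i.e. 64 × 512 = 32768 bytes, which agrees with "dma 32768 in" on the same line. The result status 0x58 includes DRQ (data request), the bit the "drained ... bytes to clear DRQ" line refers to.

```python
# Field order assumed from libata's error report:
# command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
FIELDS = ["command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device"]

# ATA status register bits (first byte of the "res" line).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "DSC",
               0x08: "DRQ", 0x01: "ERR"}

# Bus-master IDE (SFF-8038i) status bits, as in "BMDMA stat 0x24".
BMDMA_BITS = {0x01: "ACTIVE", 0x02: "ERROR", 0x04: "INTR",
              0x20: "DRV0_DMA_CAPABLE", 0x40: "DRV1_DMA_CAPABLE"}


def parse_taskfile(text: str) -> dict:
    """Parse e.g. 'c8/00:40:cb:60:32/00:00:00:00:00/e0' into named bytes."""
    command, low, high, device = text.split("/")
    raw = [command] + low.split(":") + high.split(":") + [device]
    if len(raw) != len(FIELDS):
        raise ValueError("unexpected task-file layout: %r" % text)
    return {name: int(value, 16) for name, value in zip(FIELDS, raw)}


def describe_lba28(tf: dict) -> dict:
    """Interpret a 28-bit LBA command such as 0xC8 (READ DMA)."""
    sectors = tf["nsect"] or 256  # a count of 0 means 256 for 28-bit commands
    lba = (((tf["device"] & 0x0F) << 24) | (tf["lbah"] << 16)
           | (tf["lbam"] << 8) | tf["lbal"])
    drive = "slave (.01)" if tf["device"] & 0x10 else "master (.00)"
    return {"lba": lba, "sectors": sectors, "bytes": sectors * 512,
            "drive": drive}


def flags(value: int, table: dict) -> list:
    """Names of the bits set in value, in table order."""
    return [name for bit, name in table.items() if value & bit]
```

Applied to professordes's log further down (device byte 0xf2), the same decoder reports the slave drive, which matches the "ata2.01" prefix on those lines.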

I did not have this issue in Jaunty with this hardware, and I don't think it has happened once the system is fully loaded. I am running Karmic UNR on an Acer Aspire One netbook.

ProblemType: Bug
AplayDevices:
 **** List of PLAYBACK Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC268 Analog [ALC268 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC268 Analog [ALC268 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: luke 1990 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0x58540000 irq 16'
   Mixer name : 'Realtek ALC268'
   Components : 'HDA:10ec0268,1025015b,00100101'
   Controls : 9
   Simple ctrls : 6
CheckboxSubmission: 12ef539f3788bfbc46bc56b5c28128a6
CheckboxSystem: c69722ecac764861be52925fa50b4dcc
Date: Wed Oct 7 17:54:56 2009
DistroRelease: Ubuntu 9.10
HibernationDevice: RESUME=UUID=8d44b89b-2edb-4c02-a4be-94bd25b65081
MachineType: Acer AOA110
Package: linux-image-2.6.31-12-generic 2.6.31-12.40
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.31-12-generic root=UUID=039a096e-3486-4898-9eeb-44a705f8b7fd ro quiet splash elevator=noop usbcore.autosuspend=1
ProcEnviron:
 LANG=en_CA.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-12.40-generic
RelatedPackageVersions: linux-firmware 1.21
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
Tags: ubuntu-unr
Uname: Linux 2.6.31-12-generic i686
XsessionErrors:
 (gnome-settings-daemon:2006): GLib-CRITICAL **: g_propagate_error: assertion `src != NULL' failed
 (gnome-settings-daemon:2006): GLib-CRITICAL **: g_propagate_error: assertion `src != NULL' failed
 (nautilus:2092): Eel-CRITICAL **: eel_preferences_get_boolean: assertion `preferences_is_initialized ()' failed
 (polkit-gnome-authentication-agent-1:2118): GLib-CRITICAL **: g_once_init_leave: assertion `initialization_value != 0' failed
 (gnome-panel:2048): Gdk-WARNING **: /build/buildd/gtk+2.0-2.18.2/gdk/x11/gdkdrawable-x11.c:952 drawable is not a pixmap or window
dmi.bios.date: 10/06/2008
dmi.bios.vendor: Acer
dmi.bios.version: v0.3309
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.vendor: Acer
dmi.board.version: Base Board Version
dmi.chassis.type: 1
dmi.chassis.vendor: Chassis Manufacturer
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAcer:bvrv0.3309:bd10/06/2008:svnAcer:pnAOA110:pvr1:rvnAcer:rn:rvrBaseBoardVersion:cvnChassisManufacturer:ct1:cvrChassisVersion:
dmi.product.name: AOA110
dmi.product.version: 1
dmi.sys.vendor: Acer

theluketaylor (ekul-taylor) wrote :

I updated to kernel 2.6.31-12 today and the problem seems to have gotten worse. Under the older Karmic kernels it would almost always happen before xsplash and very rarely after gdm. With -12 it seems to happen both before xsplash and after gdm on every boot.

av8r (av8r) wrote :

No change with 2.6.31-13-generic #44 on an EeePC 900A with upgraded RAM and SSD.
For me, it usually freezes between fsck and setting up the resolver, and once again while launching the first session; it doesn't matter whether it's UNR or not (I have both set up).
I added "clocksource=hpet notsc" to the GRUB boot line. It removed the warning message about the TSC clock being unstable, but didn't change anything about the SSD stall. I also removed "quiet splash" from the GRUB boot line.

$ dmesg | grep ata2
[ 1.253633] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15
[ 1.427461] ata2.00: CFA: Patriot Memory 64GB PATA Storage Drive, Ver2.M0G, max UDMA/66
[ 1.427461] ata2.00: 126090720 sectors, multi 1: LBA
[ 1.440008] ata2.00: configured for UDMA/66
[ 40.809047] ata2: lost interrupt (Status 0x58)
[ 40.809047] ata2: drained 2048 bytes to clear DRQ.
[ 40.811862] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 40.815862] ata2.00: BMDMA stat 0x24
[ 40.818548] ata2.00: cmd c8/00:08:32:51:22/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 40.822958] ata2.00: status: { DRDY DRQ }
[ 40.826958] ata2: soft resetting link
[ 41.000008] ata2.00: configured for UDMA/66
[ 41.000008] ata2: EH complete
[ 232.820015] ata2: lost interrupt (Status 0x58)
[ 232.820081] ata2: drained 8192 bytes to clear DRQ.
[ 232.848018] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 232.848018] ata2.00: BMDMA stat 0x24
[ 232.848018] ata2.00: cmd c8/00:20:ba:72:27/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 232.848018] ata2.00: status: { DRDY DRQ }
[ 232.848018] ata2: soft resetting link
[ 233.020008] ata2.00: configured for UDMA/66
[ 233.020008] ata2: EH complete
[ 273.820016] ata2: lost interrupt (Status 0x58)
[ 273.820089] ata2: drained 6144 bytes to clear DRQ.
[ 273.837834] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 273.843269] ata2.00: BMDMA stat 0x24
[ 273.849571] ata2.00: cmd c8/00:18:da:72:27/00:00:00:00:00/e0 tag 0 dma 12288 in
[ 273.860739] ata2.00: status: { DRDY DRQ }
[ 273.866515] ata2: soft resetting link
[ 274.041007] ata2.00: configured for UDMA/66
[ 274.041007] ata2: EH complete
[ 407.436313] ata2.00: ACPI cmd ef/03:44:00:00:00:a0 filtered out
[ 407.436313] ata2.00: ACPI cmd ef/03:0c:00:00:00:a0 filtered out
[ 407.436313] ata2.00: ACPI cmd c6/00:01:00:00:00:a0 succeeded
[ 407.452005] ata2.00: configured for UDMA/66
[ 407.468005] ata2.00: configured for UDMA/66
[ 407.468005] ata2: EH complete
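When comparing boots or kernels (as in the 2.6.31-12 vs -13 reports), it helps to count the lost-interrupt events rather than eyeball the log. A minimal Python sketch, assuming the dmesg line format shown above; `lost_interrupts` and `summary` are hypothetical helpers, not an existing tool, and you would feed them the text printed by `dmesg`:

```python
import re
from collections import defaultdict

# Matches lines such as "[ 40.809047] ata2: lost interrupt (Status 0x58)".
LOST_IRQ = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(ata\d+): lost interrupt \(Status (0x[0-9a-fA-F]+)\)")


def lost_interrupts(dmesg_text: str) -> dict:
    """Map each ATA port to a list of (seconds since boot, status byte)."""
    events = defaultdict(list)
    for match in LOST_IRQ.finditer(dmesg_text):
        stamp, port, status = match.groups()
        events[port].append((float(stamp), int(status, 16)))
    return dict(events)


def summary(dmesg_text: str) -> str:
    """One line per port, e.g. 'ata2: 3 lost interrupts (first at 40.8s)'."""
    lines = []
    for port, evs in sorted(lost_interrupts(dmesg_text).items()):
        lines.append("%s: %d lost interrupts (first at %.1fs)"
                     % (port, len(evs), evs[0][0]))
    return "\n".join(lines)
```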

av8r (av8r)
Changed in linux (Ubuntu):
status: New → Confirmed
redDEADresolve (reddeadresolve) wrote :

I am also getting the same error on My Dell Mini 9.

[38.825065] ata1.00 exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[38.825227] BMDMA Stat 0x24
[38.825318] ata1.00:cmd c8/00:18:8f:89:20/00:00:00:00:/e0 tag 0 dma 1228in
[38.825321] res 58/00:18:8f:89:20/00:00:00:00:/e0 Emask 0x2 (HSM violation)
[38.825598] ata1.00: status {DRDY DRQ}

Occasionally I get sent to the root prompt to manually run an fsck.

Gav Mack (gavinmac) wrote :

I have the same issue as the OP of this bug, on my Aspire One A110 upgraded with a 32 GB SuperTalent SSD. It makes Karmic take almost 3 minutes to start the first time, and only after at least 3 restarts (each faster than the last) do I get a relatively stable desktop.

Johan Van den Neste (jvdneste) wrote :

I have the same configuration as Gav Mack (Aspire One A110 with a SuperTalent SSD 32Gb upgrade). Same problem here.

Johan Van den Neste (jvdneste) wrote :

It is easy to reproduce simply by starting gparted. The error is then produced twice just as before while 'searching /dev/sda partitions'. As a result, the 'searching /dev/sda partitions' activity in gparted takes a long time.

professordes (d-a-johnston-hw) wrote :

A "me too" on an eeePC 901 with an upgraded crucial SSD and an upgrade install of karmic RC

The relevant bit in dmesg is:

[ 35.816124] ata2: lost interrupt (Status 0x58)
[ 35.820096] ata2: drained 2048 bytes to clear DRQ.
[ 35.823180] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 35.826268] ata2.01: BMDMA stat 0x64
[ 35.829292] ata2.01: cmd c8/00:08:f7:41:0d/00:00:00:00:00/f2 tag 0 dma 4096 in
[ 35.829295] res 58/00:08:f7:41:0d/00:00:00:00:00/f2 Emask 0x2 (HSM violation)
[ 35.835971] ata2.01: status: { DRDY DRQ }

The machine is also (mostly) failing to pick up an SDHC card in the reader, which wasn't the case in 9.04

Adam Gianola (adam-gianola) wrote :

Same story here. Dell Mini 9 upgraded with a Super Talent FEM16GFDL 16 GB SSD. I experience this both after the upgrade from 9.04 to 9.10 as well as on a clean install of 9.10.

I can also confirm starting gparted reproduces the dmesg output normally seen after (long) boot up.

Andrew Simpson (andrew-simpson) wrote :

Another 'me too'.

Just upgraded an Acer Aspire One A110 (ZG5) from existing (factory installed) 8 GB SSD to Super Talent 16 GB (FEM16GF13M).

Running the LiveCD (on a USB stick) with 9.10 RC and then opening GParted shows essentially the same messages in dmesg as other reports (and it takes a long time).

Everything else seems fine.

danq989 (danq989) wrote :

Me too!

I have the same configuration as Gav Mack (including upgraded 32GB SSD) with the same results.

Verified on both a 9.04 to 9.10 upgrade and a fresh 9.10 install on a freshly erased and partitioned drive. Same problem on boot and in gparted. Seems to only occur during mounting of the drive (possibly during initial mount and then remount as -rw)

---danq989

Andrew Simpson (andrew-simpson) wrote :

I have linked this bug report to (what looks to be) the same problem at the kernel bug tracker. Not sure I've done the linking correctly ;-)

http://bugzilla.kernel.org/show_bug.cgi?id=14515

Changed in linux:
status: Unknown → Confirmed
Kory (postmako) wrote :

Me too! AAO ZG5 running stock 8GB drive. Jaunty was booting in about 20 seconds and Karmic is taking about 90 seconds. I am attaching parts from dmesg and the most recent copy of bootchart.

...
[ 7.242403] input: SynPS/2 Synaptics TouchPad as /devices/platform/i8042/serio2/input/input8
[ 36.820096] ata2: lost interrupt (Status 0x58)
[ 36.824029] ata2: drained 2048 bytes to clear DRQ.
[ 36.827217] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 36.827227] ata2.00: BMDMA stat 0x4
[ 36.827248] ata2.00: cmd c8/00:08:ef:cc:ce/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 36.827251] res 58/00:08:ef:cc:ce/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 36.827258] ata2.00: status: { DRDY DRQ }
[ 36.827302] ata2: soft resetting link
[ 36.996463] ata2.00: configured for UDMA/66
[ 36.996497] ata2: EH complete
[ 37.001030] Clocksource tsc unstable (delta = -133907975 ns)
[ 37.042527] ath5k 0000:03:00.0: PCI INT A -> GSI 18 (level, low) -> IRQ 18
[ 37.042586] ath5k 0000:03:00.0: setting latency timer to 64
[ 37.042686] ath5k 0000:03:00.0: registered as 'phy0'
...
[ 56.778794] groups: 1 0
[ 335.989086] ata2: lost interrupt (Status 0x58)
[ 335.993064] ata2: drained 2048 bytes to clear DRQ.
[ 335.996576] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 335.996588] ata2.00: BMDMA stat 0x4
[ 335.996609] ata2.00: cmd c8/00:08:4f:1c:0c/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 335.996613] res 58/00:08:4f:1c:0c/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 335.996623] ata2.00: status: { DRDY DRQ }
[ 335.996675] ata2: soft resetting link
[ 336.168420] ata2.00: configured for UDMA/66
[ 336.168453] ata2: EH complete
[ 372.004111] ata2: lost interrupt (Status 0x58)
[ 372.008033] ata2: drained 2048 bytes to clear DRQ.
[ 372.011577] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 372.011588] ata2.00: BMDMA stat 0x4
[ 372.011609] ata2.00: cmd c8/00:08:bf:a6:49/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 372.011613] res 58/00:08:bf:a6:49/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 372.011623] ata2.00: status: { DRDY DRQ }
[ 372.011674] ata2: soft resetting link
[ 372.184426] ata2.00: configured for UDMA/66
[ 372.184458] ata2: EH complete
[ 372.586276] gdu-notificatio[1693]: segfault at c ip 00efd50e sp bfbf6d50 error 4 in libgdu.so.0.0.0[ef4000+1c000]
[ 377.073516] wlan0: authenticate with AP 00:0f:66:b9:59:0f
[ 377.079461] wlan0: authenticated
[ 377.079473] wlan0: associate with AP 00:0f:66:b9:59:0f
[ 377.085704] wlan0: RX AssocResp from 00:0f:66:b9:59:0f (capab=0x11 status=0 aid=5)
[ 377.085716] wlan0: associated
...
And it happens from time to time after login...

Johan Van den Neste (jvdneste) wrote :

I'd like to point out that even though the delays in the boot process are annoying, what is worse is the series of applets crashing when logging in to gnome. Usually I cannot log out again because that applet has crashed. So I switch to tty1 and do a 'sudo service gdm restart'. The next and subsequent logins are fine until the next reboot.

Kory (postmako) wrote : Re: [Bug 445852] Re: SSD stall during boot

Yeah I started to notice that kind of stuff as well. That is why I'm
leaving Jaunty on my wife's machine.

On Sat, Oct 31, 2009 at 8:51 AM, Johan Van den Neste <email address hidden> wrote:

> I'd like to point out that even though the delays in the boot process
> are annoying, what is worse is the series of applets crashing when
> logging in to gnome. Usually I cannot log out again because that applet
> has crashed. So I switch to tty1 and do a 'sudo service gdm restart'.
> The next and subsequent logins are fine until the next reboot.

tags: added: ubuntu
ownyourown (ownyourown) wrote : Re: SSD stall during boot

My Asus Eee PC 900 is also affected by this bug (fresh install of the final Ubuntu 9.10).

Saif Ahmed (saif) wrote :

A me too here

Eee PC 900, fresh install of the final 9.10.

Moreover, if I have any kind of USB flash drive attached, the machine doesn't finish booting at all.

andrey i. mavlyanov (andrey-mavlyanov) wrote :

Moreover, I got this error on a non-SSD drive. See https://bugs.launchpad.net/ubuntu/+source/linux/+bug/473765 for details.

Andrew Simpson (andrew-simpson) wrote :

@andrey i. mavlyanov

Andrey,

I don't think that this is the same bug.

On this line:

Nov 4 08:18:45 aim-laptop kernel: [35132.010175] res 40/00:00:00:4f:c2/00:00:00:00:00/40 Emask 0x4 (timeout)

You are getting a 'timeout', whereas this bug is causing 'HSM Violations'.
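The Emask value is a bit mask, which is why 0x2 and 0x4 point at different failure modes. A small sketch, assuming the bit layout of libata's AC_ERR_* completion flags (0x1 device error, 0x2 HSM violation, 0x4 timeout, and so on); the helper names are hypothetical:

```python
import re

# Assumed bit layout of libata's AC_ERR_* completion error flags.
AC_ERR = {
    0x001: "DEV",       # device reported an error
    0x002: "HSM",       # host state machine violation
    0x004: "TIMEOUT",   # command timed out
    0x008: "MEDIA",
    0x010: "ATA_BUS",
    0x020: "HOST_BUS",
    0x040: "SYSTEM",
    0x080: "INVALID",
    0x100: "OTHER",
}

EMASK_RE = re.compile(r"Emask (0x[0-9a-fA-F]+)")


def decode_emask(emask: int) -> list:
    """Return the names of all error bits set in an Emask value."""
    return [name for bit, name in AC_ERR.items() if emask & bit]


def emask_of(line: str) -> int:
    """Extract the first 'Emask 0x..' value from a kernel log line."""
    match = EMASK_RE.search(line)
    if match is None:
        raise ValueError("no Emask in line")
    return int(match.group(1), 16)
```

Andrey's `res` line decodes to TIMEOUT only, while the lines in this report decode to HSM only.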

Adam Gianola (adam-gianola) wrote :

Interestingly, if I use the Dell Mini 9 Factory SSD with 9.10 this problem goes away.

Andrew Simpson (andrew-simpson) wrote :

Playing with LiveCD (on a USB stick) with an Aspire One with Super Talent 16Gb SSD:

- Normal LiveCD boot shows the problem in dmesg.

- Booting with 'libata.dma=0' in kernel line fixes the problem (by disabling DMA) in dmesg.

- Booting with 'libata.ignore_hpa=0' had no effect.

Since the problem looked to be DMA related, I tried slowing down the transfer with 'libata.force=udma/33'. No effect, though plenty of log messages about UDMA being forced to 33.

The same machine is working fine with Jaunty. Another Aspire One with the standard (factory) 8 Gb SSD is running Karmic without any problem.
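When reporting results like these, it is easy to lose track of which parameters a given boot actually used. The running kernel's command line is in /proc/cmdline; the Python sketch below pulls out the options tried in this thread. The helper names are hypothetical and the parsing is simplified (it does not handle quoted values containing spaces).

```python
# Parameters tried in this thread as workarounds or diagnostics.
WATCHED = ("libata.dma", "libata.ignore_hpa", "libata.force", "irqpoll")


def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {name: value}.

    Bare flags such as 'quiet' map to True. Quoted values containing
    spaces are NOT handled; this is a simplification.
    """
    opts = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        opts[name] = value if sep else True
    return opts


def active_workarounds(cmdline: str) -> dict:
    """Return only the watched parameters present on the command line."""
    opts = parse_cmdline(cmdline)
    return {k: opts[k] for k in WATCHED if k in opts}


def read_current_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel."""
    with open(path) as f:
        return f.read().strip()
```

For example, `active_workarounds(read_current_cmdline())` on a machine booted with `libata.dma=0` would include `{"libata.dma": "0"}`.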

mdyn (tamerlaha-gmail) wrote :

Acer AOA110: same problem.

danq989 (danq989) wrote :

I just verified that this bug is still present for me in the just-released kernel 2.6.31-15.

Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

Linux kernel bug 14515 has nothing to do with this

Changed in linux:
importance: Unknown → Undecided
status: Confirmed → New
Dave V (mindkeep) wrote :

Affects my asus eeepc 900. Please raise to critical before I have to learn to hassle with Gentoo again.

Changed in linux:
importance: Undecided → Unknown
status: New → Unknown
Changed in linux:
status: Unknown → Confirmed
Horácio (horacioh) wrote :

I had exactly the same problem (boot stall) on an Asus Eee PC 900. After 2 weeks of use, I got a GRUB error, "error: biosdisk read error", and the system became completely useless, pending a disk wipe and full reinstall.
Similar situations are reported in https://bugs.launchpad.net/bugs/387272, which is considered a duplicate of this bug.
However, I do not see the GRUB error reported here. Could this be a different bug?

Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

OK, so after running 9.10 and discovering this issue, I booted off a 9.04 USB stick and dd'ed /dev/zero over both sda and sdb. I then installed 9.04 and have no issues.

So the hardware is not faulty.

Andrew Simpson (andrew-simpson) wrote :

A possible workaround from the upstream bug report is to boot with 'irqpoll' in the kernel boot parameters. It's not a good fix (the logs are still full of error messages), but at least the 'stall' is reduced.
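On a fresh Karmic install, which uses GRUB 2, a persistent kernel parameter such as irqpoll is normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by `sudo update-grub`. (Systems upgraded from Jaunty may still use GRUB Legacy and /boot/grub/menu.lst instead.) The Python sketch below only edits that file's text; it assumes the value is double-quoted, and the function name is hypothetical.

```python
import re

# Matches the GRUB 2 default-options line, assuming a double-quoted value.
CMDLINE_RE = re.compile(r'^(GRUB_CMDLINE_LINUX_DEFAULT=")([^"]*)(")', re.MULTILINE)


def add_kernel_param(grub_default: str, param: str) -> str:
    """Return the text of /etc/default/grub with param appended to
    GRUB_CMDLINE_LINUX_DEFAULT (no change if it is already there)."""
    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return match.group(1) + " ".join(params) + match.group(3)

    new_text, count = CMDLINE_RE.subn(repl, grub_default)
    if count == 0:
        raise ValueError("GRUB_CMDLINE_LINUX_DEFAULT not found")
    return new_text
```

Calling it twice with "irqpoll" leaves the file unchanged the second time. As noted above, irqpoll only reduces the stall; the errors still appear in the logs.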

Regrettably, it's probably best to avoid using Karmic on SSD equipped netbooks. Use Jaunty instead, since this bug probably won't be fixed in the near future.

Rick @ rickandpatty.com (rick-rickandpatty) wrote :

I have the same issue as reported above with an eeepc 701 (with 16GB SuperTalent SSD, and also with the original 8GB SSD).

An alternate method of fixing a Karmic-corrupted SSD - at least on the 701 - is to boot with ASUS's rescue DVD and allow it to reinstall the default Xandros installation.

With Karmic installed, I can confirm that kernel option "irqpoll" stops the stall during a Karmic boot, but does anyone know if that stops Karmic from messing up the SSD?

Johan Van den Neste (jvdneste) wrote :

I'm a bit disappointed that the previously mentioned kernel bug was dismissed so quickly. Could it not still be related? There are indeed no *optical* drives to be polled; however, the Acer One has 2 card readers (= pollable removable-media drives). Since the kernel bug report claims that the problem is caused by a removable-media drive choking on the polling commands, could one of the card readers not be the cause?

So I tried 'hal-disable-polling' on the card readers...

One reader, marked as 'storage extension', is /dev/mmcblk0 and is apparently not seen as a removable device (according to the message from hal-disable-polling). The other... doesn't seem to work. I get no response whatsoever when inserting or removing an SD card (which only raises my suspicion). Hence, I don't know its /dev/ name, and don't know what to pass to hal-disable-polling as the --device argument. Maybe the device is not even detected at boot?

Anyone else care to investigate on his/her laptop? (I'd really hate to switch back to 9.04)

Johan Van den Neste (jvdneste) wrote :

I should add that boot-time, gnome login and gparted startup are typically moments where I'd expect polling for media to take place.

Johan Van den Neste (jvdneste) wrote :

I notice changed behaviour when there are sd cards inserted: The stall is longer with 2 cards inserted than with 1 card, and 1 card is worse than no cards, though it does not disappear. How would I disable the card readers entirely? (I see sdhci-pci mentioned, and when there are cards inserted, the drives are detected as mmc0 and mmc1)

Rick @ rickandpatty.com (rick-rickandpatty) wrote :

The eeepc 701 has one SD reader that can be disabled in the BIOS. Disabling it doesn't seem to affect the SSD issues at all on my system.

Andrew Simpson (andrew-simpson) wrote :

@Johan
Interesting comment. I have private doubts that this bug is totally due to hardware 'quality' problems (see the current kernel bug report).

If the hardware were at fault, then firstly, the bug would not be spread over such a range of differing hardware, and secondly, shouldn't Ubuntu 9.04 also be failing in a similar manner?

Kory (postmako) wrote : Re: [Bug 445852] Re: SSD stall during boot

Well, after having this issue for a couple of weeks now, my netbook will no longer boot. I ran a live USB stick and gparted can't even read the partition. As soon as I have time, I'm going back to 9.04, and I suggest everyone else do the same before their drive craps out like mine did. It seems to write out bad sectors or destroy your data as well, because my wireless stopped working and I rebooted hoping it would fix the problem. So all of my netbooks will wait until this issue is resolved before upgrading. Enjoy!

danq989 (danq989) wrote : Re: SSD stall during boot

@Andrew
Couldn't agree more that this is a code regression and not simply a hardware quality issue.

I haven't checked the various changelogs, but I wouldn't be surprised if something in the IDE IRQ handler or the hardware initialization was subjected to optimization (in libdma?). Possibly the SuperTalent drives do react in a non-standard way that was never exposed before.

Hopefully Mr. Heo will take the time to look through the code and see. I'm sure it would help if someone could supply the developers hardware that reliably produces the error. I need my netbook too much to send it away, but maybe someone has an extra SuperTalent drive?

---danq989

Johan Van den Neste (jvdneste) wrote :

@Andrew
Don't get me wrong, I'm also convinced it's a software issue. I know little about the Linux kernel, but I know a lot about concurrent programming, and I know certain bugs only manifest themselves under very specific conditions, usually related to very subtle differences in timing (which may also only happen on different processors, different numbers of cores and so on).

Not that I'm saying it's a concurrency bug.

All I'm saying is that different devices influence the timing of all sorts of events, and that specific combinations of hardware may trigger specific bugs. I was guessing that maybe, just maybe, the combination of the SSD with the card readers triggered this one (since the timing characteristics of SSDs are, after all, quite different from those of regular hard drives).

So I would still like to try and disable the card readers entirely (which I cannot do in the bios). But yes, it's a long shot.

Gav Mack (gavinmac) wrote :

I have noticed that if I power up the AAO without mains power, more often than not it only freezes during the first part of the boot process and not during GDM (where it makes all the applets fail), which is the most stable I've got Karmic working so far. Fsck seems to want to run on every boot now, just before the freeze. It's enough for me to stick with Karmic unless the SSD gets trashed like Kory's, but I'm running ext4 with journaling still enabled; perhaps that's why I've not suffered the corruption problem.

I've posted this issue on the Supertalent forum to see if they can possibly help with maybe a firmware update but I'm not holding my breath! http://www.supertalent.com/home/forum/viewtopic.php?f=17&t=2083 I agree with you both Johan and Dan - this does seem to be a Linux bug - one they seemingly can't be arsed to fix at the moment which is very frustrating. Maybe when they get enough pissed off users like us they'll do something about it :o(

I use my card readers a lot, the left hand in particular for data because I'm dual booting the SSD with Windows 7 so that rules out that option in my case.

David Staples (dcstaples) wrote :

I've been having this problem too on my Acer aoa110 (using UNR 9.10)
After a few days of using karmic and getting frustrated with the ~1:45 boot times (from grub to desktop) along with the panel configuration problems, I tried to use a PPA kernel (comment #20 in http://bugs.launchpad.net/ubuntu/karmic/+source/sreadahead/+bug/432089), which bricked my netbook. After trying unsuccessfully several times to reinstall karmic by using sysresccd for gparted (wasn't trusting karmic at this point), and doing a testdisk I noticed there was a read error at cylinder 257, along with an intermittent read-error at sector 1.
I gave up and did a wipe of the drive (not using dd. wipe. took about six hours).
Read-error went away, partitioning worked fine, karmic finally installed, but no change to boot times.
Gah!
I'm going back to jaunty until 10.04 comes out. Which is a shame, because I like the new UI in karmic, but I've decided not to risk my ssd for eye candy.

David Staples (dcstaples) wrote :

Oh yeah, here's the output of one of the logs.
Btw, sorry about the long meandering story above ;)

ata2: lost interrupt (Status 0x58)
ata2: drained 2048 bytes to clear DRQ.
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
ata2.00: BMDMA stat 0x4
ata2.00: cmd c8/00:08:87:31:01/00:00:00:00:00/e0 tag 0 dma 4096 in
          res 58/00:08:87:31:01/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
ata2.00: status: { DRDY DRQ }
ata2: soft resetting link
ata2.00: configured for UDMA/66
ata2: EH complete

Jim (wilsja) wrote :

Reproduced on a stock eeepc 900 linux version, installing karmic to the 4G ssd. (I did a dist-upgrade from jaunty, which works fine) Also, I have gotten many cases where I needed to overwrite the disk with zeros to make it usable again. (this happens after one or two boots with karmic, I then reinstall to try to fix it)

kernel is 2.6.31-14-generic

dmesg |grep ata2 gives me

[ 1.103371] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0xffa8 irq 15
[ 1.344348] ata2.00: ATA-0: ASUS-PHISON OB SSD, TST2.04L, max UDMA/66
[ 1.344356] ata2.00: 7880544 sectors, multi 0: LBA
[ 1.344436] ata2.01: ATA-0: ASUS-PHISON SSD, TST2.04L, max UDMA/66
[ 1.344442] ata2.01: 31522176 sectors, multi 0: LBA
[ 1.356270] ata2.00: configured for UDMA/66
[ 1.368273] ata2.01: configured for UDMA/66
[ 1.401892] ata2.00: configured for UDMA/66
[ 1.408271] ata2.01: configured for UDMA/66
[ 1.408279] ata2: EH complete
[ 36.816048] ata2: lost interrupt (Status 0x58)
[ 36.820016] ata2: drained 8192 bytes to clear DRQ.
[ 36.834372] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 36.834380] ata2.00: BMDMA stat 0x64
[ 36.834397] ata2.00: cmd c8/00:20:87:1a:35/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 36.834408] ata2.00: status: { DRDY DRQ }
[ 36.834443] ata2: soft resetting link
[ 37.028315] ata2.00: configured for UDMA/66
[ 37.036314] ata2.01: configured for UDMA/66
[ 37.060315] ata2.00: configured for UDMA/66
[ 37.068312] ata2.01: configured for UDMA/66
[ 37.068331] ata2: EH complete
[ 67.816056] ata2: lost interrupt (Status 0x58)
[ 67.820016] ata2: drained 6144 bytes to clear DRQ.
[ 67.829818] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 67.829945] ata2.00: BMDMA stat 0x64
[ 67.830019] ata2.00: cmd c8/00:18:cf:4b:3c/00:00:00:00:00/e0 tag 0 dma 12288 in
[ 67.830275] ata2.00: status: { DRDY DRQ }
[ 67.830379] ata2: soft resetting link
[ 68.024317] ata2.00: configured for UDMA/66
[ 68.032313] ata2.01: configured for UDMA/66
[ 68.056315] ata2.00: configured for UDMA/66
[ 68.064310] ata2.01: configured for UDMA/66
[ 68.064329] ata2: EH complete
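With timestamped dmesg output like this, it is easy to check how far apart the stalls are. The sketch below (hypothetical helper names) pulls the "lost interrupt" lines out of dmesg text. It only matches lines that carry a [seconds] timestamp, so untimestamped excerpts such as David's above are skipped.

In Jim's excerpt the two events are 31.0 seconds apart. Two events are far too few to call that a pattern, but the same script can be run over a longer log.

```python
import re

# Matches e.g. "[ 36.816048] ata2: lost interrupt (Status 0x58)"
LOST_IRQ = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(ata\d+): lost interrupt \(Status (0x[0-9a-fA-F]+)\)"
)


def lost_interrupts(dmesg_text):
    """Return (timestamp, port, status) for each timestamped lost-interrupt line."""
    events = []
    for line in dmesg_text.splitlines():
        m = LOST_IRQ.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), int(m.group(3), 16)))
    return events


def intervals(events):
    """Seconds between consecutive events, rounded to milliseconds."""
    times = [t for t, _, _ in events]
    return [round(later - earlier, 3) for earlier, later in zip(times, times[1:])]
```

Typical use would be feeding it the output of dmesg saved to a file, then comparing intervals(lost_interrupts(text)) across boots.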

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

@Ubuntu Bugs

Can you please have a look at the status of this bug (currently 'undecided')? This bug could do with some input from the Ubuntu devs. Here's why:

1. The bug is occurring on a wide range of netbooks with SSD units. This is a growing target audience for Ubuntu.

2. In the simplest case the bug makes the machine unresponsive and impractical to use.

3. If the user continues with the above state, the machine often gets 'bricked'. Total data loss occurs, and the SSD can only be recovered with low level formatting (Normal rescue tools don't work).

With total data loss and bricked machines confirmed several times over, and no apparent workaround, surely this has to be more than an 'undecided' bug?

I have opened a bug report on the kernel bug list which is getting some high level attention. It would be good if Ubuntu was able to give some support on this.

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

I contacted Leann on the kernel team and asked for some advice, and this is her response. I'm not at home but at UDS, so I can't try this right now; it would be good if someone else could:

"Thanks for the heads up Alan. I'll get this bug on our list for review.
It seems this has been forwarded upstream as well:

http://bugzilla.kernel.org/show_bug.cgi?id=14583

Care to give the latest upstream mainline kernel builds a test:

https://wiki.ubuntu.com/KernelTeam/MainlineBuilds

2.6.31.6 is the latest upstream stable kernel (which we're going to
release as an SRU for karmic):
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.31.6/

The Karmic kernel for SRU with the 2.6.31.6 patches is currently baking
in Stefan's PPA (2.6.31-16-51~pre2):
https://edge.launchpad.net/~stefan-bader-canonical/+archive/karmic/+packages

2.6.32-rc6 is the latest 2.6.32 release candidate
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.32-rc6/

Might not hurt to give both the 2.6.31.6 and 2.6.32-rc6 kernels a test
and confirm the issue remains. Then relay the info to both the upstream
bug and lp bug."

Revision history for this message
Rick @ rickandpatty.com (rick-rickandpatty) wrote :

Here's one test of 2.6.32-rc6 on an ASUS Eee PC 701 with a 16GB SuperTalent SSD. The original Karmic kernel also shows this bug with the stock ASUS 8GB SSD.

$ uname -a
Linux eeepc 2.6.32-020632rc6-generic #020632rc6 SMP Wed Nov 4 10:54:30 UTC 2009 i686 GNU/Linux

Looks like the same thing is happening in 2.6.32-rc6.

[ 8.420522] ata2: drained 2048 bytes to clear DRQ.
[ 8.421826] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 8.421839] ata2.00: BMDMA stat 0x24
[ 8.421850] ata2.00: failed command: READ DMA
[ 8.421872] ata2.00: cmd c8/00:08:86:39:04/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 8.421876] res 58/00:08:86:39:04/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 8.421887] ata2.00: status: { DRDY DRQ }
[ 8.421948] ata2: soft resetting link

Revision history for this message
Trey (trey333) wrote :

I want people, and especially the dev team, to know how serious this problem is. It killed my SSDs, as I reported in a duplicate bug of "unknown" importance. I mean, this is a hardware issue. Nothing could get either of my chips working again: not zeroing them out, not a GParted live USB, not Windows / Partition Magic on a USB. Not even taking out the 16GB chip and putting it in a different computer. This is physical. This is real and expensive to replace. Get Karmic off your SSD drives, and dev team, please change this damn status! Having to find a new solid state drive because of your release is not of "unknown" importance.

Revision history for this message
danq989 (danq989) wrote :

@Trey
Just a point of info for you. After I performed an in-place install of Karmic over my existing Jaunty, I found my AAO-110's SuperTalent-upgrade SSD bricked.

I managed to recover it using HDDERASE version 3.3.

This DOS program uses the "secure erase" function of the drive to completely erase the drive and set all blocks to unused. Unlike a dd of zeros, it only takes a minute or so to complete. Apparently this is a trick that folks have been using on Intel gen1 SSDs to recover performance after block fragmentation has caused a performance drop. Note that the more widely available HDDERASE versions 3.1 and 4.0 did NOT work for me (both threw different exceptions and bombed), so be sure to find and use v3.3. Guide and link here: http://www.iishacks.com/index.php/2009/06/30/how-to-secure-erase-reset-an-intel-solid-state-drive-ssd/

Now this won't fix things if there's enough real damage to the flash due to too many write/erase cycles, but it's worth a shot.

Good luck Trey!
---danq989

Revision history for this message
Rick @ rickandpatty.com (rick-rickandpatty) wrote :

One more piece of information that may or may not be useful to the developers:

On my other laptop, I have Jaunty installed with kernel 2.6.30.9 (from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30.9/ ) and the other Intel-related patches from the Jaunty/Intel graphics performance howto at http://ubuntuforums.org/showthread.php?t=1130582 .

If I install that kernel on my eeepc 701 running Karmic, I get the same stall / HSM violation errors that I do under 2.6.31 or 2.6.32-rc6. (There are other problems with that kernel under karmic, but I mainly wanted to see if it would stop the SSD problems, and it doesn't...)

Revision history for this message
Jim (wilsja) wrote :

Trey, I'm not sure about your specific situation, but for me a regular dd didn't work, whereas a dd with of=/dev/sda bs=1M did.
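For reference, the dd command above (zeros written in 1 MiB blocks) can be sketched in Python. This is DESTRUCTIVE: it overwrites whatever path you pass in. zero_fill is a hypothetical helper. The size detection assumes Linux, where seeking to the end of a block device such as /dev/sda reports its size, just as it does for a regular file. Try it on a scratch file first.

```python
import os

BLOCK_SIZE = 1024 * 1024  # 1 MiB, mirroring "bs=1M" in the dd command


def zero_fill(path):
    """Overwrite every byte of path with zeros, 1 MiB at a time.

    DESTRUCTIVE. Returns the number of bytes written.
    """
    zeros = bytes(BLOCK_SIZE)
    with open(path, "r+b", buffering=0) as f:
        # Assumption (Linux): seeking to the end reports the size of a
        # regular file and also of a block device.
        size = f.seek(0, os.SEEK_END)
        f.seek(0)
        written = 0
        while written < size:
            # Unbuffered writes may be partial; count what was actually written.
            written += f.write(zeros[: min(BLOCK_SIZE, size - written)])
        os.fsync(f.fileno())  # make sure the zeros reach the device
    return written
```

As the thread notes, this only rewrites the data. It is not the drive's own "secure erase" that danq989's HDDERASE uses.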

Just to back up what Rick is saying, I installed 2.6.28-12-netbook-eeepc from array.org, and I still have the errors. Furthermore, I used the jaunty kernel that was still on there: 2.6.28-11-generic, and I still have the errors. If I install jaunty, that same kernel does not give me errors.

I'm not sure what the effects of using an old kernel on a new distribution are (for instance, the touchpad driver doesn't work), so I don't know how to isolate the problem.

Revision history for this message
Skylord (me-skylord) wrote :

Just want to confirm the whole bug on my EeePC 901 4G+16G. A fresh Karmic installation doesn't work at all: the errors mentioned above appear in the logs and the partitioning setup page freezes. And of course everything is fine on Jaunty.

Revision history for this message
Raf (4283534-noduck) wrote :

I also see these log entries on my Acer Aspire One with SUPER TALENT FEM32GF13M. However, I have not yet seen any corruption as a result of this. In fact, sometimes this does not result in a stall during boot. If it does stall the boot, it hangs for about 12 seconds.

Revision history for this message
Jim (wilsja) wrote :

It actually seems that the old kernel gives a slightly different error message, but with a similar effect. With the old kernel, this error only shows up after a dist-upgrade to Karmic, i.e. essentially a Karmic install running the old 2.6.28 kernel.

I am attaching three dmesg outputs.

dmesg.jaunty is before upgrading to karmic

dmesg.oldkern is directly after upgrading to karmic, but using the same kernel as before

dmesg.karmic is using the karmic kernel

Revision history for this message
Jim (wilsja) wrote :

This has a timeout error rather than an HSM violation

Revision history for this message
Jim (wilsja) wrote :
Revision history for this message
lotus49 (lotus-49) wrote :

I have the same computer and SSD (Acer Aspire One and SUPER TALENT FEM32GF13M) as well as the same symptoms as Raf above.

Revision history for this message
sles (slesru) wrote :

This bug is not SSD-specific.
Here is dmesg output from my colleague's Dell 120L notebook, which has an HDD:

[ 6702.000206] ata1: lost interrupt (Status 0x58)
[ 6702.004018] ata1: drained 32768 bytes to clear DRQ.
[ 6702.093725] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 6702.093739] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[ 6702.093740] cdb 1e 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 6702.093742] res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[ 6702.093746] ata1.01: status: { DRDY DRQ }
[ 6702.093859] ata1: soft resetting link
[ 6702.340796] ata1.00: configured for UDMA/100
[ 6702.372409] ata1.01: configured for UDMA/33
[ 6702.391202] ata1: EH complete

Revision history for this message
professordes (d-a-johnston-hw) wrote :

This bug is still present with the 31-15 kernel and all other updates applied on my eeePC 901 (1 Dec). There doesn't seem to be a lot going on at linux-kernel-bugs 14583?

Revision history for this message
Sal Mazzola (salmaz) wrote :

I am having a similar problem with an Acer Aspire 1410. The Ubuntu LiveCD booted perfectly off the USB drive, but booting off the hard drive gives me either HSM violations or timeouts.

Setting libata.force=noncq allowed the machine to finally boot up without any errors.

Revision history for this message
Rick @ rickandpatty.com (rick-rickandpatty) wrote :

The kernel parameter mentioned in #61 has no effect on the stall/HSM violation errors on my ASUS EeePC 701 (booted with the 9.10 Live CD and added the parameter to the kernel options, as I have had to reinstall Jaunty to use the machine daily).

Back to the drawing board, I guess.

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

OK, there's no way this is a hardware issue. I have on my desk two identical Eee 900s. Both have exhibited this issue under Jaunty and both exhibit the issue under Karmic. If I roll them back to karmic and wipe the SSD along the way, the error goes away.

I have now installed 9.10 UNR on one, and have added the 2.6.32 kernel from the links in comment #47.

[ 113.816054] ata2: lost interrupt (Status 0x58)
[ 113.820008] ata2: drained 8192 bytes to clear DRQ.
[ 113.835302] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 113.835310] ata2.00: BMDMA stat 0x64
[ 113.835317] ata2.00: failed command: READ DMA
[ 113.835332] ata2.00: cmd c8/00:20:a7:8f:09/00:00:00:00:00/e0 tag 0 dma 16384 in
[ 113.835335] res 58/00:20:a7:8f:09/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 113.835343] ata2.00: status: { DRDY DRQ }
[ 113.835393] ata2: soft resetting link

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

The upstream bug report has asked whether anyone has tested kernels 2.6.29 or 2.6.30 as this would help narrow down when the bug was (re)introduced to the kernel.

For my own interest, I have noted that the bug generally occurs with newer/faster SSD units. For instance, the Intel SSD originally fitted to the early Acer Aspire One is, well, rather slow, but doesn't show the bug (I have one). However, another identical machine (they were bought as a pair) upgraded with the Super Talent unit does have the problem. Has anyone else noticed a similar pattern?

Alan Cox has suggested that the SSD is responding to a command so fast that the kernel misses seeing the interrupt.
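For reading the "lost interrupt (Status 0x58)" lines, the number is the raw ATA status register. The sketch below decodes it using the standard ATA status bit names (decode_status is a hypothetical helper).

0x58 is DRDY + DSC + DRQ with BSY clear. In other words, the device is no longer busy but is still requesting a data transfer, which is consistent with the "drained N bytes to clear DRQ" messages. libata's "status: { DRDY DRQ }" summary appears to print only a subset of the bits, which would explain why DSC is not listed there.

```python
# Standard ATA status register bits, most significant first.
STATUS_BITS = [
    (0x80, "BSY"),   # busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # seek complete (obsolete in later ATA specs)
    (0x08, "DRQ"),   # data request: device is ready to transfer data
    (0x04, "CORR"),  # corrected data (obsolete)
    (0x02, "IDX"),   # index (obsolete)
    (0x01, "ERR"),   # error
]


def decode_status(value):
    """Names of the ATA status bits set in value, most significant first."""
    return [name for bit, name in STATUS_BITS if value & bit]
```

For example, decode_status(0x58) yields DRDY, DSC and DRQ.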

Revision history for this message
lotus49 (lotus-49) wrote : Re: [Bug 445852] Re: SSD stall during boot

I can confirm that upgrading from the stock SSD with Karmic already
installed to a Super Talent SSD produced this error which had not
previously been present.

Simon

On 7 Dec 2009, at 06:31, Andrew Simpson
<email address hidden> wrote:

> The upstream bug report has asked whether anyone has tested kernels
> 2.6.29 or 2.6.30 as this would help narrow down when the bug was
> (re)introduced to the kernel.
>
> For my own interest, I have noted that the bug generally occurs with
> newer/faster SSD units. For instance the Intel SSD originally
> fitted to
> the early Acer Aspire One is, well, rather slow, but doesn't show the
> bug (I have one). However another same machine (they were brought
> as a
> pair) upgraded with the Super Talent unit does have the problem.
> Anyone
> notice any similarity?
>
> Alan Cox has suggested that the SSD is responding to a command so fast
> that the kernel misses seeing the interrupt.

Revision history for this message
theluketaylor (ekul-taylor) wrote : Re: SSD stall during boot

I can confirm kernel 2.6.30 works fine with my netbook (Acer Aspire One with 8 GB SSD) using a Jaunty userland. I am running 2.6.30-02063009-generic.
I am somewhat reluctant to try a Karmic kernel: once I started getting the errors after installing Karmic, I had to write zeros over the entire SSD to get rid of them in any distro/kernel combination. If it's truly necessary, I will try.

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

Passing libata.force=noncq to the kernel at boot time from the live USB had no effect for me; I still got hangs at the same place. Interestingly, this happens only after X and gdm start. I don't really know why.
Then I found this: booting into single-user mode gave an interesting result. After getting to a root shell, within about a minute I got a message about exception Emask, DRQ, etc. But right after it came another thing that caught my interest: a message saying "Starting init crypto disks... [OK]".
So my guess is that the problem really is there, in crypto disks.

2009/12/7 theluketaylor <email address hidden>

> I can confirm kernel 2.6.30 works fine with my netbook (Acer Aspire One
> with 8 GB SSD) using a Jaunty userland. I am running
> 2.6.30-02063009-generic.
> I am somewhat reluctant to try a karmic kernel since once I started getting
> the errors after installing karmic to get rid of them in any distro/kernel
> combination I had to write all zeros to the SSD. If it's truly necessary I
> will try.

Revision history for this message
Andrew Squire (andrewsquire) wrote : Re: SSD stall during boot

I had this issue on my EeePC 901. I took the following actions and have not *yet* seen the issue reappear:

1. Boot from 9.10 NBR Live CD
2. dd if=/dev/zero of=/dev/sda bs=1M
3. dd if=/dev/zero of=/dev/sdb bs=1M
4. install 9.10 NBR
5. Add the kernel parameter "libata.dma=0" to the GRUB2 config and rebuild the GRUB2 menu

NOTE: When I first got the issue, I tried steps 4 and 5 without doing steps 1, 2 and 3, and it did not fix the issue.
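On Ubuntu's GRUB2 (the default bootloader in 9.10), kernel parameters such as libata.dma=0 are normally added to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. Running "sudo update-grub" afterwards regenerates the menu.

Below is a sketch of that edit as a pure text transformation. add_kernel_param is a hypothetical helper: it only returns the new file contents and writes nothing.

```python
import re


def add_kernel_param(grub_default, param, var="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return grub_default text with param added to the quoted var line.

    Any existing parameter with the same key (e.g. an older libata.dma=...)
    is replaced rather than duplicated. If var is missing, it is appended.
    """
    key = param.split("=", 1)[0]
    line_re = re.compile(r"^(" + re.escape(var) + r'=")(.*)(")\s*$')
    out, found = [], False
    for line in grub_default.splitlines():
        m = line_re.match(line)
        if m:
            found = True
            # Drop any previous setting of the same parameter, then append.
            words = [w for w in m.group(2).split() if w.split("=", 1)[0] != key]
            words.append(param)
            line = m.group(1) + " ".join(words) + m.group(3)
        out.append(line)
    if not found:
        out.append(var + '="' + param + '"')
    return "\n".join(out) + "\n"
```

Replacing an existing libata.dma= entry, rather than appending a second one, keeps the resulting kernel command line unambiguous when switching between libata.dma=0 and libata.dma=3.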

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

And yes, it's in libata!
The SSD only misbehaves when libata.dma >= 4.

http://www.kernel.org/doc/Documentation/kernel-parameters.txt

> libata.dma= [LIBATA] DMA control
> libata.dma=0 Disable all PATA and SATA DMA
> libata.dma=1 PATA and SATA Disk DMA only
> libata.dma=2 ATAPI (CDROM) DMA only
> libata.dma=4 Compact Flash DMA only
> Combinations also work, so libata.dma=3 enables DMA
> for disks and CDROMs, but not CFs.
>
So the SSD is treated as a CF device. No surprise, of course. But with DMA off, haha, the device's speed suffers badly!
During boot (with libata.dma <= 3), the kernel reports that my SSD (AAO-110L, 8 GB SSD-PAMM, Samsung) is in (!) PIO4 transfer mode. It is REALLY slow.

hdparm -tT /dev/sda
Cached reads: 370 MB/sec
Buffered reads: 2.3 MB/sec

And that's just reading; do I even need to talk about writing?
Where should we dig now, given that the cause of the problem seems to be localized?
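libata.dma is a bitmask (1 = disks, 2 = ATAPI, 4 = CF, per the kernel-parameters.txt excerpt quoted above). So for the documented values 0 to 7, "fails only when libata.dma >= 4" is the same as "fails only when the CF DMA bit is set".

The sketch below (hypothetical helper names) checks that equivalence.

```python
# Bit values as quoted from kernel-parameters.txt in the comment above.
LIBATA_DMA_BITS = [(1, "disk"), (2, "atapi"), (4, "cf")]


def dma_classes(value):
    """Device classes for which libata.dma=value leaves DMA enabled."""
    return [name for bit, name in LIBATA_DMA_BITS if value & bit]


def cf_dma_enabled(value):
    """True if the Compact Flash DMA bit (4) is set."""
    return bool(value & 4)


# Over the documented range, ">= 4" and "CF bit set" pick the same values.
assert [v for v in range(8) if cf_dma_enabled(v)] == [v for v in range(8) if v >= 4]
```

For example, dma_classes(3) gives disk and ATAPI DMA only, which is why libata.dma=3 stops the errors but also leaves a CF-classified SSD in PIO mode.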

2009/12/8 Andrew Squire <email address hidden>

> I had this issue on my EeePC 901. I took the following actions and have
> not *yet* seen the issue reappear:
>
> 1. Boot from 9.10 NBR Live CD
> 2. dd if=/dev/zero of=/dev/sda bs=1M
> 3. dd if=/dev/zero of=/dev/sdb bs=1M
> 4. install 9.10 NBR
> 5. Add kernal command "libata.dma=0" in GRUB2 config and rebuild GRUB2 menu
>
> NOTE: When I first got the issue I first tried steps 4. & 5. without
> doing 1. 2. 3. and it did not fix the issue.

Revision history for this message
Gav Mack (gavinmac) wrote : Re: SSD stall during boot

Adding libata.dma=3 to the kernel command line in GRUB seems to have stopped the hang on my AAO so far; the entries in dmesg are gone. It still takes over a minute to boot, but that's something I'll live with for now.

It's early days but premature thanks go out to Andrew Squire in advance!

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

By the way, libata.dma=3 does not make the SSD work in DMA mode!
The developers need to be told about this, because I'm staying on Jaunty until
this stall goes away!

2009/12/8 Gav Mack <email address hidden>

> Setting libata.dma=3 to grub seems to have stopped the hang on my AAO so
> far, the entries in dmesg are gone. Still taking over a minute to boot
> though but that's something I'll live with for now.
>
> It's early days but premature thanks go out to Andrew Squire in advance!

Revision history for this message
Gav Mack (gavinmac) wrote : Re: SSD stall during boot

You're correct, it was early days and premature! Though I could put up with the slow boot time, for anything other than basic web browsing I couldn't handle the drop in disk read performance from 67 MB/s to 2 MB/s, never mind writes. It was worse than my old stock AAO SSD was with Jaunty when playing back video.

Back to the situation in post 67, methinks. Devs, stop dragging your heels and sort this issue out!

Revision history for this message
Andrew Squire (andrewsquire) wrote :

Agree completely, Gav Mack. It seems likely to me to be a kernel regression that should be fixed. Unfortunately, it looks like the associated kernel bug (#14583) is being put down to dodgy hardware.

Setting libata.dma < 4 is just a workaround to keep us up and running in the interim... if you can put up with the negative impact on performance :)

Revision history for this message
Gav Mack (gavinmac) wrote :

Couldn't agree more, Andrew. It's the easy way out to blame this on hardware. They may have had a valid point, but clearly none of us ever had a problem with this until Karmic and the newer kernels. To paraphrase the old saying, "If it looks like a kernel bug and acts like a kernel bug, then it's a kernel bug", and I can't help but think "waffle" and "b*llsh*t" when the blame is laid on the storage devices.

Unless more of us turn up, or we get lucky and a netbook OEM runs into major trouble with the same problem, or some IT hacks get involved and force them to do something about it, I'll be booting into Windows 7 far more than I want to. A shame :(

Now where's my logins for theregister.co.uk and theinquirer.net?

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

A small thing to keep in mind:
1. I boot Karmic and get this bug. Kernel 2.6.31, as far as I recall.
2. I boot Karmic with 2.6.32 and still get this bug.
3. I boot Jaunty and do NOT get this bug. Kernel 2.6.28.
4. I boot Jaunty with 2.6.32 and do NOT get this bug.
So I guess it's NOT a kernel bug; otherwise anyone could easily reproduce it on Jaunty.
I repeat: it's NOT directly a kernel bug, and there's something we are missing.
Something makes the SSD hang when it is in DMA mode. What could it be? If it's not
the kernel, then what? A userspace process that stalls the SSD? Which one, specifically?
Maybe if that process were killed there would be no hangs. Or at least
we could file a bug report in the right place, rather than on Launchpad
or in the kernel bug tracker.
Those are my thoughts. IMHO.
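The reasoning in the list above can be made mechanical: a factor can only explain the outcome if each of its values always goes with the same result. Below is a minimal sketch over just these four observations (the kernel in the first row is the poster's recollection; determines is a hypothetical helper).

Here the userland passes that test and the kernel does not, because 2.6.32 appears with both outcomes. Four data points show a correlation with the userland, not a cause.

```python
# Observations from the list above: (userland, kernel, bug_seen)
OBSERVATIONS = [
    ("karmic", "2.6.31", True),
    ("karmic", "2.6.32", True),
    ("jaunty", "2.6.28", False),
    ("jaunty", "2.6.32", False),
]


def determines(index, observations):
    """True if the factor at index alone predicts the outcome consistently."""
    outcome_for = {}
    for obs in observations:
        value, outcome = obs[index], obs[2]
        # The first time a value is seen, record its outcome; afterwards, any
        # disagreement means this factor cannot explain the result by itself.
        if outcome_for.setdefault(value, outcome) != outcome:
            return False
    return True


print("userland determines bug:", determines(0, OBSERVATIONS))
print("kernel determines bug:", determines(1, OBSERVATIONS))
```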

2009/12/9 Gav Mack <email address hidden>

> Couldn't agree more Andrew. It's the easy way out to blame this on
> hardware, they may have had a valid point but clearly none of us ever
> had a problem with this until Karmic and the newer kernels. To
> paraphrase that old saying "If it looks like a Kernel Bug, acts like a
> Kerrnel Bug then it's a Kernel bug" and I can't help but think "waffle"
> and "b*llsh*t" by laying the blame on the storage devices.
>
> Unless there's more of us or if we get lucky with and a netbook OEM runs
> into major trouble with the same problem or some IT hacks get involved
> to force them to do something about it I'll be booting into Windows 7
> far more than I want to. A shame :(
>
> Now where's my logins for theregister.co.uk and theinquirer.net?

Revision history for this message
Raf (4283534-noduck) wrote : Re: SSD stall during boot

#75 is correct. My karmic install always gives the error (even with 2.6.29-rc3*). But my jaunty never gives the error, even with 2.6.31-16. However, there is one more difference between my karmic and jaunty installs: karmic uses ext4, while jaunty uses ext2 (and for the record all of those are inside LVM).

Could it be something related to ext4? I notice that the bug triggers during or right after filesystem check/mount (mountall). But I have not been able to reproduce this bug by doing filesystem checks/mounting of the karmic-ext4 partition under jaunty.

I have not been able to debug it further. Debugging upstart to see exactly when the bug triggers generates way too much output.

I tried disabling ureadahead, but that didn't make any difference.

*Note that the error message changed between 2.6.29-6 and 2.6.30-rc1, but it still triggers an error message and the same delay.

Revision history for this message
Gav Mack (gavinmac) wrote :

@Raf:

Don't think it's an ext4 issue - I'm very sure that when I first ran Karmic Beta on the USB boot stick it hung on the detection of the SSD before I even got to the partition menu. I tried ext2 first and got corruption pretty quickly, and then set up ext4 before I spotted this bug report on launchpad.

Don't forget also that Andrew Simpson said in post 15 of the Kernel bug list that Mandriva 2010.0 with the same kernel doesn't show this error but Fedora 12 Beta does.

Revision history for this message
theluketaylor (ekul-taylor) wrote :

I can confirm this isn't an ext4 issue; I am using ext4 for my / partition in Jaunty with a 2.6.30 kernel without any trouble.

Revision history for this message
Rick @ rickandpatty.com (rick-rickandpatty) wrote :

I've got to agree with #77 - It isn't an ext4 issue. I installed Karmic on my Eee with ext4 and again with ext2 and got similar errors both times.

I can also confirm that even kernels that cause no problems under Jaunty cause problems under Karmic. So what's different about Karmic's boot process that's trashing our SSDs? Maybe it's something that can be fixed outside the kernel?

Revision history for this message
Stephen O (soglesby1) wrote :

It's not just Karmic, as Gav pointed out. I installed Fedora 12 two days ago, partly because I was curious whether this problem would follow me from Karmic. Sure enough, a panel failed to load and my dmesg output had the same ata errors as it did in Karmic, even though my partitioning scheme was different, since Fedora defaults to LVM. I can also join the chorus in stating that I used ext4 with this drive on Jaunty and had no issues.

Revision history for this message
Raf (4283534-noduck) wrote :

What about the increased parallelism in the new upstart? Could this be the cause of the problems: increased simultaneous disk access.

Doesn't Fedora also use upstart?

Unfortunately it looks like upstart cannot be serialized. (I am talking about the jobs in /etc/init, not /etc/init.d)

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

I'd like to ask somebody to look specifically at upstart: can it be
removed? (I have no information on this right now; I'm still on 9.04.) Can somebody try
removing it?

2009/12/9 Raf <email address hidden>

> What about the increased parallelism in the new upstart? Could this be
> the cause of the problems: increased simultaneous disk access.
>
> Doesn't Fedora also use upstart?
>
> Unfortunately it looks like upstart cannot be serialized. (I am talking
> about the jobs in /etc/init, not /etc/init.d)

Gav Mack (gavinmac)
Changed in linux (Ubuntu):
assignee: nobody → Gav Mack (gavinmac)
assignee: Gav Mack (gavinmac) → nobody
assignee: nobody → Upstart Developers (upstart-devel)
Revision history for this message
Gav Mack (gavinmac) wrote : Re: SSD stall during boot

Notified the upstart devs - do any other distros other than Fedora 12 and Karmic use upstart?

Changed in linux (Ubuntu):
assignee: Upstart Developers (upstart-devel) → nobody
Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Please do not subscribe that team, the team's own description explicitly asks you not to.

This bug clearly is nothing to do with userspace, it should not be possible for userspace to cause problems in this way (that's what the kernel is there for).

It smells like a kernel driver bug to me, especially given the reports of fiddling with DMA. The high number of similar SSDs means it could be a hardware company being creative with the spec, but that's still a kernel driver bug for failing to quirk them properly.

I'm not a kernel developer, but I would recommend the following debugging technique:

 - those affected should supply detailed information not only about their SSD, but also about the I/O controller in their laptop (dmesg, lspci -vvnn, etc.)

 - if one release of Ubuntu is affected more than the other, that suggests a regression

 - first try a mainline kernel build from http://kernel.ubuntu.com/~kernel-ppa/mainline/ of the equivalent release; if that fixes the problem (unlikely, but still possible), then the problem lies in an Ubuntu patch

 - if that does not fix the problem, start working backwards through the kernel releases until you find one that does fix the problem

 - if the first kernel is still affected, try kernels from previous Ubuntu releases

    (one assumes that the kernel from the release where things work fine, installed on karmic, will work)

 - Given a loose idea, narrow it down using the kernel packages you can download from https://launchpad.net/ubuntu/+source/linux/+publishinghistory

Basically, the ideal would be to find one kernel that works, and then the *immediate next kernel* that doesn't work. This would give a limited number of changes that broke it, and start to reveal what the bug might be.
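The procedure Scott describes, finding the last good kernel and then the immediate next bad one, is a bisection. Below is a minimal sketch in C. It assumes the candidate kernels are ordered oldest to newest and that every build after the first bad one is also bad. Under those assumptions a binary search needs only about log2(n) test boots, which matters when a failed test can mean zeroing and reinstalling the SSD. The callback type and function names are hypothetical and are not part of any Ubuntu tool.

```c
#include <stddef.h>

/* Hypothetical callback: returns nonzero if the kernel at position idx in an
 * oldest-to-newest list shows the HSM violation. In practice each call means
 * installing that kernel, booting it, and checking dmesg. */
typedef int (*kernel_is_bad_fn)(size_t idx, void *ctx);

/* Binary search for the first bad kernel among n ordered builds, assuming
 * every build before it is good and every build from it onward is bad.
 * Returns n if none of the builds is bad. Uses about log2(n) test boots. */
size_t first_bad_kernel(size_t n, kernel_is_bad_fn is_bad, void *ctx)
{
    size_t lo = 0, hi = n;  /* invariant: builds < lo are good, builds >= hi are bad */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (is_bad(mid, ctx))
            hi = mid;       /* the first bad build is mid or earlier */
        else
            lo = mid + 1;   /* the first bad build comes after mid */
    }
    return lo;              /* if lo > 0, build lo - 1 is the last good one */
}

/* Dry-run callback for trying the search out: ctx points at the index of
 * a pretend first broken build. */
int simulated_kernel_is_bad(size_t idx, void *ctx)
{
    return idx >= *(const size_t *)ctx;
}
```

For example, with 40 candidate builds the search needs at most 6 test boots.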

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

Thanks for the debugging advice Scott.

The only issue for me is that when I install Karmic and it goes _very_ bad, I end up with a bricked machine (IO errors on the SSD cause me to drop to an initramfs prompt). The only way to test other kernels is to dd zeros over the SSD, reinstall the OS, and then add whatever kernel I want to test. It's a maddeningly time-consuming task.

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

@Scott:
How can that be? The same program (here I mean the kernel version) on the same
hardware behaves differently on different Ubuntu releases. Wouldn't that
be rather strange? If something works in one place and not in another, it's
not because the thing itself has gone bad; the reason is somewhere in the
environment.
Nevertheless, I think your idea is interesting. Can anyone here who is
running 9.10 with an SSD try this approach?

2009/12/10 Alan Pope <email address hidden>

> Thanks for the debugging advice Scott.
>
> The only issue for me is that when I install Karmic and it goes _very_
> bad I end up with a bricked machine (IO errors on SSD causing me to drop
> to initramfs promt). The only way to test other kernels is to dd zeros
> over the SSD and reinstall the OS again, then add in whatever kernel to
> test. It's a maddeningly time consuming task.

Revision history for this message
Raf (4283534-noduck) wrote : Re: SSD stall during boot

I don't think that there is a bug in upstart. More likely the increased parallelism of upstart triggers a bug in the kernel, firmware, or hardware.

If it is a kernel bug, it has been in the kernel for several releases. I have tested with 2.6.29-rc3 (and most versions in between), which is the oldest kernel on http://kernel.ubuntu.com/~kernel-ppa/mainline/ that can still boot my ext4 Karmic partition.

I wonder whether #61 is on to something. However, my SSD is ATA, not SATA (so no NCQ anyway). I will research whether the SSD can do TCQ and, if so, how to disable it. That would suggest a firmware/hardware bug.

Revision history for this message
Rick @ rickandpatty.com (rick-rickandpatty) wrote :

@Raf #87

I've tested Karmic (ext2 filesystem) with the latest 2.6.28 kernel for Jaunty, with similar results as for the Karmic kernels. (Of course, the same kernel works just fine with Jaunty.)

Another problem with the theory that upstart is the cause: as #11 notes, this bug can be triggered in Karmic by running gparted at any time, even when booting from the LiveCD instead of the SSD.

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

OK, reviewing what we do know:

Scott has suggested providing system information, however this already seems to be listed in this bug report and in the upstream bug report.

He has also suggested trying different kernels to isolate when the problem started. This is what we have been doing, HOWEVER comments #75, #76 & #87 have now all shown that something in 'Karmic' is the problem - and not directly the kernel version. Have we been chasing the wrong problem?

We can rule out ext4 and NCQ from comments #79, #87 and #88.

While I would agree with Scott that 'userspace applications' shouldn't affect the kernel, it does appear a 'userspace application' is affecting the kernel.

I can confirm that Mandriva 2010.0 (after several weeks of use & lots of checking) does not have this problem.

My own experience and comment #80 confirms that Fedora 12 does have the problem. I have searched the Fedora Bugzilla and can't see a bug report there. It would be good to file a bug there. One of the Fedora kernel devs has been commenting on the upstream bug report.

What is different about Mandriva 2010.0 compared to Karmic and Fedora 12? Anything of note other than upstart?

Revision history for this message
theluketaylor (ekul-taylor) wrote :

If a userspace application is able to trigger this issue, I would say this is a kernel bug, since that is exactly the sort of thing the kernel is supposed to be managing. No userspace application, regardless of how poorly written, should be able to trigger a drop from UDMA to PIO.

I have tried Jaunty with a 2.6.32 kernel and it works fine. I was going to try installing Karmic and upgrading it to 2.6.32, but I wasn't even able to complete the installer. It failed to mount the ext4 / partition due to numerous HSM faults.

Revision history for this message
Andrew Squire (andrewsquire) wrote :

Just a consideration that might be worth making when we're testing the different distros/kernels. Once I had experienced the problem, I could not reliably get rid of it without doing a dd if=/dev/zero to both disks. In #89 (Andrew Simpson) and #80 (Stephen O), when the problem was exhibited in Fedora 12, had the disks been zeroed in between?

Revision history for this message
theluketaylor (ekul-taylor) wrote :

I tried bootstrapping Karmic from a Jaunty LiveCD onto a zeroed SSD. I installed the 2.6.32 kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/ that I have run from Jaunty without any trouble.
I wanted to see whether I could trigger the bug when Karmic wasn't involved in creating or populating the filesystem. Since the disk has to be zeroed once the bug happens, I started small, just booting into single-user mode. Even this triggered the bug. The only processes that had been spawned were init, upstart, dhcp and the shell I was using.

Revision history for this message
Johan Van den Neste (jvdneste) wrote :

I can't try this right now, but would it be possible to run gparted with lots of debugging output to see exactly what it is doing when the bug is triggered? If this is not possible, maybe we could ask someone who knows the gparted codebase to help us out?

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

Here is an interesting thing: as soon as I launch parted from a live USB, I
immediately get our errors. The same happens with cfdisk. And, which may be
interesting and important, it happens both when I just run parted and when I exit
it, and likewise when I run or exit cfdisk.
BUT!
Once I removed libparted and all packages that depend on it, the
noise went away! No more blinking of the SSD indicator. WOW! I still have DMA,
though. Interesting, I guess. Can anyone confirm my finding?

2009/12/10 Johan Van den Neste <email address hidden>

> I can't try this right now, but would it be possible to run gparted with
> lots of debugging output to see exactly what it is doing when the bug is
> triggered? If this is not possible, maybe we could ask someone who knows
> the gparted codebase to help us out?

Revision history for this message
rogmorri (frontporsche) wrote : Re: SSD stall during boot

It seems that if I do this as root....

  int k = open("/dev/sda", O_WRONLY|O_LARGEFILE);
  ioctl(k,BLKFLSBUF,0);
  close(k);

...then 15~30 seconds later I see this on the console ...

[33458.988232] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[33458.994551] ata2.00: BMDMA stat 0x4
[33459.001646] ata2.00: cmd ca/00:08:ef:00:40/00:00:00:00:00/e0 tag 0 dma 4096 out
[33459.001657] res 58/00:08:ef:00:40/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[33459.014157] ata2.00: status: { DRDY DRQ }

but dmesg shows a bit more....

[33458.988145] ata2: lost interrupt (Status 0x58)
[33458.988232] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[33458.994551] ata2.00: BMDMA stat 0x4
[33459.001646] ata2.00: cmd ca/00:08:ef:00:40/00:00:00:00:00/e0 tag 0 dma 4096 out
[33459.001657] res 58/00:08:ef:00:40/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[33459.014157] ata2.00: status: { DRDY DRQ }
[33459.021056] ata2: soft resetting link
[33459.228490] ata2.00: configured for UDMA/66
[33459.228528] ata2: EH complete

I gleaned those 3 lines of code from doing "strace parted"
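The register dumps in logs like the one above can be decoded so that reports are easier to compare. Below is a small C sketch that does this. The field order it assumes (command/feature:nsect:lbal:lbam:lbah, then the high-order bytes, then the device register) is an assumption about libata's print format, but it is consistent with the logs in this thread:

- `ca` and `c8` are the ATA WRITE DMA and READ DMA opcodes.
- 8 and 0x40 sectors of 512 bytes give exactly the `dma 4096` and `dma 32768` shown.
- Status 0x58 decodes to DRDY plus DRQ. This suggests the drive was still requesting a data transfer when the DMA command should already have finished, which is presumably what libata reports as an HSM violation.

All function and struct names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Taskfile registers as printed after "cmd" (or "res"). Assumed order:
 * command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:
 * hob_lbam:hob_lbah/device. In a "res" field the first two values are the
 * status and error registers instead. */
struct ata_tf_log {
    unsigned command, feature, nsect, lbal, lbam, lbah;
    unsigned hob[5];
    unsigned device;
};

/* Parse e.g. "ca/00:08:ef:00:40/00:00:00:00:00/e0". Returns 0 on success. */
int parse_tf_log(const char *s, struct ata_tf_log *tf)
{
    int n = sscanf(s, "%x/%x:%x:%x:%x:%x/%x:%x:%x:%x:%x/%x",
                   &tf->command, &tf->feature, &tf->nsect,
                   &tf->lbal, &tf->lbam, &tf->lbah,
                   &tf->hob[0], &tf->hob[1], &tf->hob[2], &tf->hob[3], &tf->hob[4],
                   &tf->device);
    return n == 12 ? 0 : -1;
}

/* 28-bit LBA: the low nibble of the device register supplies bits 24-27. */
unsigned long lba28(const struct ata_tf_log *tf)
{
    return ((unsigned long)(tf->device & 0x0f) << 24) |
           ((unsigned long)tf->lbah << 16) | ((unsigned long)tf->lbam << 8) | tf->lbal;
}

/* Bytes moved by a 28-bit READ/WRITE DMA; a sector count of 0 means 256. */
unsigned long dma_bytes(const struct ata_tf_log *tf)
{
    return (tf->nsect ? tf->nsect : 256UL) * 512UL;
}

const char *ata_command_name(unsigned cmd)
{
    switch (cmd) {
    case 0xC8: return "READ DMA";
    case 0xCA: return "WRITE DMA";
    default:   return "other";
    }
}

/* Names of the set ATA status bits, space-separated. Bit 0x10 (DSC) is left
 * out, matching the "status: { DRDY DRQ }" lines printed for 0x58. */
void decode_ata_status(unsigned status, char *buf, size_t len)
{
    static const struct { unsigned bit; const char *name; } bits[] = {
        {0x80, "BSY"}, {0x40, "DRDY"}, {0x20, "DF"}, {0x08, "DRQ"}, {0x01, "ERR"},
    };
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof bits / sizeof bits[0]; i++) {
        if (!(status & bits[i].bit))
            continue;
        if (buf[0] != '\0')
            strncat(buf, " ", len - strlen(buf) - 1);
        strncat(buf, bits[i].name, len - strlen(buf) - 1);
    }
}
```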

Revision history for this message
rogmorri (frontporsche) wrote :

I should have mentioned that this could be related to O_LARGEFILE, which I had to define manually...

  #define O_LARGEFILE 0100000

Revision history for this message
rogmorri (frontporsche) wrote :

Oops, ignore that.
It seems that all I need to cause the error is this...

  int k = open("/dev/sda", O_WRONLY);
  close(k);

BTW, my root file system is mounted on sda1 while I'm doing this.

(If I open with O_RDONLY instead, then there is no error)
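Below is a self-contained version of rogmorri's reproducer, with the headers it needs. One point in it is an assumption rather than something established at this stage of the thread. udev can watch block device nodes with inotify (its `watch` option) and synthesizes a "change" uevent when a node that was opened for writing is closed. That re-runs the disk's udev rules. If so, it would explain why `O_WRONLY` triggers the error while `O_RDONLY` does not. The function name is hypothetical.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open a device node (e.g. "/dev/sda") read-only or write-only and close it
 * again without reading or writing any data, as in the snippet above.
 * Assumption: when udev watches the node, closing it after O_WRONLY makes
 * udev emit a "change" event and re-run the disk's rules; O_RDONLY does not.
 * Returns 0 on success, -1 on failure. */
int open_and_close(const char *path, int for_writing)
{
    int fd = open(path, for_writing ? O_WRONLY : O_RDONLY);
    if (fd < 0) {
        perror(path);
        return -1;
    }
    if (close(fd) < 0) {
        perror("close");
        return -1;
    }
    return 0;
}
```

As root, `open_and_close("/dev/sda", 1)` would be the reproducer and `open_and_close("/dev/sda", 0)` the control. Since this bug has been linked to disk corruption, only try it on a machine you can afford to reinstall.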

Revision history for this message
Raf (4283534-noduck) wrote :

#97 also repeatably triggers the HSM violation for me. But only under Karmic (not Jaunty).

Revision history for this message
Raf (4283534-noduck) wrote :

But be warned, this has led to disk corruption for me! I never had disk corruption before...

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 445852] Re: SSD stall during boot

On Fri, 2009-12-11 at 01:07 +0000, rogmorri wrote:

> Oops, ignore that.
> It seem that this is all I need to cause the error is just this...
>
> int k = open("/dev/sda", O_WRONLY);
> close(k);
>
> BTW, my root file system is mounted on sda1 while I'm doing this.
>
That actually has quite a few side-effects that you might not
realise ;-)

Try this command (as root)

  echo change > /sys/block/sda/uevent

Does that cause the same errors?

Scott
--
Scott James Remnant
<email address hidden>
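Scott's `echo change > /sys/block/sda/uevent` test can also be done from C, for example to script it alongside the open/close reproducer. Writing an action name such as "change" to a device's sysfs `uevent` file asks the kernel to emit that uevent again, so udev re-runs its rules for the disk. A minimal sketch follows; the function name is hypothetical, and the path is the one Scott used.

```c
#include <stdio.h>

/* C equivalent of: echo change > /sys/block/sda/uevent
 * Returns 0 on success, -1 on failure (needs root for a real sysfs path). */
int trigger_change_uevent(const char *uevent_path)
{
    FILE *f = fopen(uevent_path, "w");
    if (!f) {
        perror(uevent_path);
        return -1;
    }
    int ok = fputs("change\n", f) >= 0;
    /* With stdio buffering the actual write happens at fclose, so a
     * rejected write is reported here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

Usage: `trigger_change_uevent("/sys/block/sda/uevent")`, then watch dmesg for 20 to 30 seconds.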

Revision history for this message
Stephen O (soglesby1) wrote : Re: SSD stall during boot

I had not zeroed my drive prior to installing Fedora 12 so I attempted to follow Andrew's suggestion (#91). Unfortunately every time I try dd if=/dev/zero of=/dev/sda (which is definitely my 32GB SSD) I get the following error:
dd: writing to '/dev/sda': Input/output error
676065+0 records in
676064+0 records out
346144768 byte (346 MB) copied, 82.0451 s, 4.2 MB/s

It's always 346MB, never more nor less. Has my drive fallen victim to this bug and been damaged in some way?
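Stephen's numbers can be turned into a location on the disk. dd's default block size is 512 bytes. "676065+0 records in / 676064+0 records out" therefore means record 676065 was read but could not be written. The failed write began at byte 676064 × 512 = 346144768, which is sector (LBA) 676064. Because the count never changes, the error may be tied to a fixed spot on the drive rather than to timing. This arithmetic cannot tell whether the drive or the controller produced the I/O error. The helper names are hypothetical.

```c
#include <stdint.h>

/* dd copies in records of bs bytes (512 by default). If records_out full
 * records were written, the failed write started right after them. */
uint64_t failed_write_offset(uint64_t records_out, uint64_t bs)
{
    return records_out * bs;
}

/* Convert a byte offset to a 512-byte sector number (LBA). */
uint64_t offset_to_lba(uint64_t offset)
{
    return offset / 512;
}
```

With a larger block size such as bs=1M, each record is 1048576 bytes. The same arithmetic then locates the failing 1 MiB chunk rather than the exact sector.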

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: SSD stall during boot

Stephen, try badblocks -wvs /dev/sda to check your drive in read-write
mode (note that -w overwrites all data on the drive)

2009/12/11 Stephen O <email address hidden>

> I had not zeroed my drive prior to installing Fedora 12 so I attempted to
> follow Andrew's suggestion (#91). Unfortunately every time I try dd
> if=/dev/zero of=/dev/sda (which is definitely my 32GB SSD) I get the
> following error:
> dd: writing to '/dev/sda': Input/output error
> 676065+0 records in
> 676064+0 records out
> 346144768 byte (346 MB) copied, 82.0451 s, 4.2 MB/s
>
> It's always 346MB, never more nor less. Has my drive fallen victim to
> this bug and been damaged in some way?

Revision history for this message
Andrew Squire (andrewsquire) wrote : Re: SSD stall during boot

Stephen O - I also had this error unless I set the block size manually. Try bs=1M.

Revision history for this message
rogmorri (frontporsche) wrote :

@Scott,

>> int k = open("/dev/sda", O_WRONLY);
>> close(k);
>That actually has quite a few side-effects that you might not
>realise ;-)

It might be good then to warn people not to run "parted --list" to view partition tables. That seems to open all your devices with O_WRONLY.

Revision history for this message
Raf (4283534-noduck) wrote :

Scott,

Yes, "echo change > /sys/block/sda/uevent" results in the same error. After entering that command, it takes about 20 to 30 seconds until the error shows up.

Any idea why this works fine under Jaunty and not under Karmic?

Raf.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 445852] Re: SSD stall during boot

On Fri, 2009-12-11 at 17:10 +0000, Raf wrote:

> Yes, "echo change > /sys/block/sda/uevent" results in the same error.
> After entering that command, it takes about 20 to 30 seconds until the
> error shows up.
>
Ok, excellent.

So what's happening is that one of the commands being run to probe the
disk is causing the error. Let's figure out which one!

Run the following command:

sudo udevadm test /block/sda 2>&1 | grep "^util_run_program:.*started"

This will output a bunch of program names. First wait the 20-30s, to
see whether you get the error. It's possible that you will not with
this (which is interesting in and of itself, so please let me know if that
happens).

If you do get the error, note down the commands and then we'll want to
run each one in turn to see which one gives the error. (You'll need to
run them all with sudo or as root).

There's probably about 6-8 of them.

Let me know which one(s) cause the error (if any).

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Raf (4283534-noduck) wrote : Re: SSD stall during boot

Scott,

# udevadm test /block/sda 2>&1 | grep "^util_run_program:.*started"
util_run_program: 'ata_id --export /dev/sda' started
util_run_program: 'scsi_id --whitelisted --replace-whitespace -p0x80 -d/dev/sda' started
util_run_program: 'path_id /devices/pci0000:00/0000:00:1f.2/host1/target1:0:0/1:0:0:0/block/sda' started
util_run_program: '/sbin/blkid -o udev -p /dev/sda' started
util_run_program: 'edd_id --export /dev/sda' started
util_run_program: 'devkit-disks-part-id /dev/sda' started
util_run_program: 'devkit-disks-probe-ata-smart /dev/sda' started

This did trigger the HSM violation.

Testing more, I think that devkit-disks-probe-ata-smart is the one triggering the violation (devkit is new in Karmic?). However, it doesn't always show.

I think it might be related to the delay between the different commands. If the delay is too long, the violation doesn't always show. But if the delay is too short, we cannot be sure which command triggered it.

The delay between running devkit-disks-probe-ata-smart and the actual violation being logged is not constant. Maybe devkit-disks-probe-ata-smart sets things up, but the actual violation only triggers after some other activity occurs (e.g. simple disk access).

I will try again to isolate it.

Raf.
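Scott's plan of rerunning each probe on its own can be scripted. Below is a minimal C sketch that extracts the command from one line of output. It assumes the `util_run_program: '...' started` format shown in Raf's output; other udev versions may print differently. A driver loop would read `udevadm test /block/sda 2>&1` line by line with `fgets`. It would then run each extracted command as root, with a long pause between commands (Raf used 120 seconds) so that a delayed error can be attributed to the right probe. The function name is hypothetical.

```c
#include <string.h>

/* Pull the command out of one line of `udevadm test` output of the form
 *   util_run_program: 'devkit-disks-probe-ata-smart /dev/sda' started
 * Returns 1 and fills out on a match; 0 if the line does not match or out
 * is too small. */
int extract_started_program(const char *line, char *out, size_t outlen)
{
    static const char prefix[] = "util_run_program: '";
    static const char suffix[] = "' started";
    const size_t plen = sizeof prefix - 1, slen = sizeof suffix - 1;
    size_t len = strlen(line);

    while (len > 0 && (line[len - 1] == '\n' || line[len - 1] == '\r'))
        len--;                                  /* ignore fgets' trailing newline */
    if (len < plen + slen || strncmp(line, prefix, plen) != 0 ||
        strncmp(line + len - slen, suffix, slen) != 0)
        return 0;

    size_t cmdlen = len - plen - slen;
    if (cmdlen + 1 > outlen)
        return 0;
    memcpy(out, line + plen, cmdlen);
    out[cmdlen] = '\0';
    return 1;
}
```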

Revision history for this message
Raf (4283534-noduck) wrote :

I am doing this (on an otherwise idle UNR):

# sleep 120; logger devkit-disks-probe-ata-smart; /lib/udev/devkit-disks-probe-ata-smart /dev/sda; sleep 120; logger done

And I get this in syslog (repeatably):

Dec 14 11:12:01 unus logger: devkit-disks-probe-ata-smart
Dec 14 11:12:35 unus kernel: [ 7734.000130] ata2: lost interrupt (Status 0x58)
Dec 14 11:12:35 unus kernel: [ 7734.000217] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 14 11:12:35 unus kernel: [ 7734.000232] ata2.00: BMDMA stat 0x4
Dec 14 11:12:35 unus kernel: [ 7734.000264] ata2.00: cmd ca/00:08:08:8b:54/00:00:00:00:00/e3 tag 0 dma 4096 out
Dec 14 11:12:35 unus kernel: [ 7734.000270] res 58/00:08:08:8b:54/00:00:00:00:00/e3 Emask 0x2 (HSM violation)
Dec 14 11:12:35 unus kernel: [ 7734.000284] ata2.00: status: { DRDY DRQ }
Dec 14 11:12:35 unus kernel: [ 7734.000343] ata2: soft resetting link
Dec 14 11:12:35 unus kernel: [ 7734.208576] ata2.00: configured for UDMA/66
Dec 14 11:12:35 unus kernel: [ 7734.208618] ata2: EH complete
Dec 14 11:14:01 unus logger: done

Also note that I stay in UDMA (not PIO like some other posters). Except for the delays, the disk is quite usable.
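Raf's marker technique also yields the delay. The `logger` line at 11:12:01 and the lost interrupt at 11:12:35 are 34 seconds apart. That is slightly longer than the 15 to 30 seconds reported earlier in the thread, and it includes the probe's own runtime, because the marker is logged just before the probe starts. Below is a small C helper for computing such delays from syslog timestamps. It assumes the two lines fall within 24 hours of each other; the function names are hypothetical.

```c
#include <stdio.h>

/* Seconds since midnight for a syslog "HH:MM:SS" timestamp, or -1. */
long syslog_seconds(const char *hhmmss)
{
    int h, m, s;
    if (sscanf(hhmmss, "%d:%d:%d", &h, &m, &s) != 3)
        return -1;
    return h * 3600L + m * 60L + s;
}

/* Seconds from the logger marker to the first kernel error line.
 * Assumes the two lines are less than a day apart. */
long marker_to_error_delay(const char *marker, const char *error)
{
    long a = syslog_seconds(marker), b = syslog_seconds(error);
    if (a < 0 || b < 0)
        return -1;
    return b >= a ? b - a : b - a + 86400;   /* wrapped past midnight */
}
```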

Revision history for this message
Raf (4283534-noduck) wrote :

I disabled /lib/udev/rules.d/95-devkit-disks.rules (using dpkg-divert). And now I can boot without HSM violations. But I believe this is only a workaround.

Note that the devkit-disks-probe-ata-smart tests did again result in filesystem corruption.

Raf.

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 445852] Re: SSD stall during boot

On Mon, 2009-12-14 at 16:27 +0000, Raf wrote:

> I disabled /lib/udev/rules.d/95-devkit-disks.rules (using dpkg-divert).
> And now I can boot without HSM violations. But I believe this is only a
> workaround.
>
> Note that the devkit-disks-probe-ata-smart tests did again result in
> filesystem corruption.
>
From information received here, and information on the kernel bug, I
really think that the cause *is* the SMART commands.

Scott
--
Scott James Remnant
<email address hidden>

affects: linux (Ubuntu) → libatasmart (Ubuntu)
summary: - SSD stall during boot
+ devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential
+ hardware death
Changed in devicekit-disks (Ubuntu):
status: New → Triaged
Changed in libatasmart (Ubuntu):
status: Confirmed → Triaged
importance: Undecided → High
importance: High → Critical
Changed in devicekit-disks (Ubuntu):
importance: Undecided → High
importance: High → Critical
Revision history for this message
Raf (4283534-noduck) wrote :

I don't know if it is helpful to anybody, but I have attached the strace for /lib/udev/devkit-disks-probe-ata-smart /dev/sda. It does two SG_IO ioctls against the device. smartctl -a /dev/sda does not trigger the HSM violation.
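Some context on what such SG_IO ioctls can carry: on a libata disk, SMART commands are commonly sent as a SCSI ATA PASS-THROUGH command through SG_IO. The C sketch below builds an ATA PASS-THROUGH (12) CDB for SMART READ DATA and sends it. The opcode and register values come from the SCSI/ATA Translation and ATA standards. This is an illustration only, not a reconstruction of libatasmart's two requests; the attached strace is the place to see those. Given the findings in this thread, sending SMART commands to an affected SSD may itself trigger the HSM violation. The function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <scsi/sg.h>

/* Build an ATA PASS-THROUGH (12) CDB carrying ATA SMART READ DATA. */
void build_smart_read_data_cdb(uint8_t cdb[12])
{
    memset(cdb, 0, 12);
    cdb[0] = 0xA1;       /* ATA PASS-THROUGH (12) */
    cdb[1] = 4 << 1;     /* protocol 4: PIO Data-In */
    cdb[2] = 0x0E;       /* T_DIR=1 (from device), BYT_BLOK=1, T_LENGTH=2 (sector count) */
    cdb[3] = 0xD0;       /* features: SMART READ DATA */
    cdb[4] = 1;          /* sector count: one 512-byte block */
    cdb[6] = 0x4F;       /* LBA mid: SMART signature */
    cdb[7] = 0xC2;       /* LBA high: SMART signature */
    cdb[9] = 0xB0;       /* ATA command: SMART */
}

/* Send the request with SG_IO on an open device fd; buf receives 512 bytes.
 * Returns the ioctl result. A real tool would also check hdr.status and
 * the sense data, since a 0 return does not prove the ATA command worked. */
int send_smart_read_data(int fd, uint8_t buf[512])
{
    uint8_t cdb[12], sense[32];
    sg_io_hdr_t hdr;

    build_smart_read_data_cdb(cdb);
    memset(&hdr, 0, sizeof hdr);
    hdr.interface_id = 'S';
    hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    hdr.cmd_len = sizeof cdb;
    hdr.cmdp = cdb;
    hdr.dxfer_len = 512;
    hdr.dxferp = buf;
    hdr.mx_sb_len = sizeof sense;
    hdr.sbp = sense;
    hdr.timeout = 5000;  /* milliseconds */
    return ioctl(fd, SG_IO, &hdr);
}
```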

Revision history for this message
Raf (4283534-noduck) wrote :
Revision history for this message
Gav Mack (gavinmac) wrote :

@Raf: Could you explain how to disable /lib/udev/rules.d/95-devkit-disks.rules (using dpkg-divert) so I can apply the workaround - the corruption is finally starting to hit my setup over the past couple of days.

One good thing has come out of a good Windows tech but Linux n00b fumbling about not really knowing what he was doing: I got Scott involved :-)

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

@Gav Mack
This is probably a fairly crude workaround, but it works for me. I just disabled the ata-smart disk probe in the udev rules:

In the file /lib/udev/rules.d/95-devkit-disks.rules look for these lines (Lines 73 & 74):

# ATA disks driven by libata
KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"

Add a '#' in front to make the rule line a comment, like this:

# ATA disks driven by libata
#KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"

Save the file.

To make sure it's reloaded, run these commands:

sudo service udev stop
sudo service udev start

Test with gparted... and notice the difference.
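The manual edit above can also be scripted. This is a minimal sketch that assumes a sed with `-i` support (GNU or BSD). `disable_smart_probe_rule` is a hypothetical name, not something shipped by devicekit-disks.

The helper comments out only uncommented `KERNEL==` lines that call `devkit-disks-probe-ata-smart`, so running it twice is harmless. It also keeps the previous contents as a `.bak` copy. udev only reads files ending in `.rules`, so it ignores the copy. As reported later in this thread, a package update can restore the rules file, so you may need to reapply the edit.

```shell
#!/bin/bash
# Hypothetical helper: disable the libatasmart probe rule in a udev rules file.
disable_smart_probe_rule() {
    local rules_file="$1"
    # Prefix "#" to every uncommented KERNEL== line that runs the probe.
    # Lines that already start with "#" do not match, so the edit is idempotent.
    # -i.bak edits in place and saves the file as it was before this run
    # to "$rules_file.bak".
    sed -i.bak -e '/^[[:space:]]*KERNEL==.*devkit-disks-probe-ata-smart/ s/^/#/' "$rules_file"
}

# Example (as root):
#   disable_smart_probe_rule /lib/udev/rules.d/95-devkit-disks.rules
```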

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

On Mon, 2009-12-14 at 20:26 +0000, Raf wrote:

> I don't know if it is helpful to anybody, but I have attached the strace
> for /lib/udev/devkit-disks-probe-ata-smart /dev/sda. It does two SG_IO
> ioctls against the device. smartctl -a /dev/sda does not trigger the HSM
> violation.
>
Could you provide the equivalent strace for smartctl as well, for
comparison?

Scott

Revision history for this message
rogmorri (frontporsche) wrote :

> #KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"

Nice workaround. This brought my poweron-to-desktop time from 2:35 down to 0:52. :)

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Confirming this behavior on an ASUS EeePC 900 with an upgraded SSD: Patriot Lite SSD, 32GB model PL32GPEPCSSDR

I just ran a command like @Raf described in comment #108. When /dev/sda is unmounted, there is no response except "DKD_ATA_SMART_IS_AVAILABLE=1" in the console, but the next mount takes a long time to complete and any activity on /dev/sda generates the same syslog errors as described there. The drive seems to function normally until the next /lib/udev/devkit-disks-probe-ata-smart command, when it hangs and generates syslog events again.

I have been commenting on my reported Bug 430333 but I think I will declare it to be a duplicate of this one, and send the other Patriot Lite SSD user(s) over here.

Revision history for this message
shadowblast101 (shadowblast101) wrote :

I followed Tommy over, and can confirm his confirmation. I have pretty much the same setup, EEE900, PL32GPEPCSSDR, but with Arch instead of Ubuntu, and the behavior is still there. Mine had a few other quirks too, such as hiding my mouse until I switch to a tty and back.

After applying the workaround, everything seems to be good.

Revision history for this message
Raf (4283534-noduck) wrote :

I did some more tests with devkit-disks-probe-ata-smart. If I boot from USB flash, running devkit-disks-probe-ata-smart on the SSD does not trigger any log entry until I try to write to the disk. If I then write to the disk with dd (to the swap partition, so as not to corrupt any filesystem), dd hangs until the error is generated, which occurs 30 seconds later (probably a timeout in the kernel).

Revision history for this message
Raf (4283534-noduck) wrote :

Output from smartctl -a.

Revision history for this message
Raf (4283534-noduck) wrote :

strace from smartctl -a.

Revision history for this message
In , Scott James Remnant (Canonical) (canonical-scott) wrote :

We have many reports of the libatasmart code causing stalls, HSM Violations and even death of SSDs. Particularly SuperTalent ones, but also those found in my netbooks.

# sleep 120; logger devkit-disks-probe-ata-smart; /lib/udev/devkit-disks-probe-ata-smart /dev/sda; sleep 120; logger done

And I get this in syslog (repeatably):

Dec 14 11:12:01 unus logger: devkit-disks-probe-ata-smart
Dec 14 11:12:35 unus kernel: [ 7734.000130] ata2: lost interrupt (Status 0x58)
Dec 14 11:12:35 unus kernel: [ 7734.000217] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 14 11:12:35 unus kernel: [ 7734.000232] ata2.00: BMDMA stat 0x4
Dec 14 11:12:35 unus kernel: [ 7734.000264] ata2.00: cmd ca/00:08:08:8b:54/00:00:00:00:00/e3 tag 0 dma 4096 out
Dec 14 11:12:35 unus kernel: [ 7734.000270] res 58/00:08:08:8b:54/00:00:00:00:00/e3 Emask 0x2 (HSM violation)
Dec 14 11:12:35 unus kernel: [ 7734.000284] ata2.00: status: { DRDY DRQ }
Dec 14 11:12:35 unus kernel: [ 7734.000343] ata2: soft resetting link
Dec 14 11:12:35 unus kernel: [ 7734.208576] ata2.00: configured for UDMA/66
Dec 14 11:12:35 unus kernel: [ 7734.208618] ata2: EH complete
Dec 14 11:14:01 unus logger: done

The problem has also been confirmed in Fedora 12.
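When repeating this test, the relevant kernel messages are the ones logged between the two `logger` markers. The sketch below counts HSM violation lines inside that window of a syslog file. `count_hsm_violations` is a hypothetical helper name.

The sketch makes two assumptions:
- The markers are exactly the `logger devkit-disks-probe-ata-smart` and `logger done` messages from the command above.
- The log uses the default syslog line format shown in the output above.

If the test was run several times, the sketch adds up the counts from every window.

```shell
#!/bin/bash
# Hypothetical helper: count "HSM violation" kernel messages logged between
# the "logger: devkit-disks-probe-ata-smart" and "logger: done" markers.
# Lines outside any marker window are ignored; multiple windows are summed.
count_hsm_violations() {
    local log_file="$1"
    awk '
        /logger: devkit-disks-probe-ata-smart/ { inside = 1; next }
        /logger: done/                         { inside = 0; next }
        inside && /HSM violation/              { n++ }
        END                                    { print n + 0 }
    ' "$log_file"
}

# Example: count_hsm_violations /var/log/syslog
```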

Revision history for this message
In , Scott James Remnant (Canonical) (canonical-scott) wrote :

Kernel bugzilla bug for the same issue (URL above is the Launchpad bug)

http://bugzilla.kernel.org/show_bug.cgi?id=14583

Revision history for this message
In , Lennart-poettering (lennart-poettering) wrote :

Hmm, lacking access to the hw in question I am not sure what I can do about this.

What surprises me a bit is that this only appeared so very recently. Is this triggered by some interplay with some specific kernel version?

description: updated
Revision history for this message
In , Scott James Remnant (Canonical) (canonical-scott) wrote :

Most distros only switched to using your code recently; previously we've all been using smartmontools and the like which don't cause this problem.

The LP bug has the differencing straces between the two if that's helpful?

Revision history for this message
Gav Mack (gavinmac) wrote :

@Andrew Simpson: Many thanks for the instructions. The workaround has dropped my boot time from always over 2 minutes to 30 seconds, what I was expecting back in late September when I installed the Beta with the Super Talent SSD! Almost 3 months of woe now at an end thank goodness. Recreated my user account from scratch because after further investigation my other half thought it was a good idea to delete the timed out applets including window-picker so I couldn't put them back again!

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

We have seen the corruption survive a basic filesystem initialization, so once your drive has been corrupted you may need to write zeroes to it to eliminate the bad blocks before you can create a clean filesystem again. You can verify the drive using badblocks when it is not mounted. This procedure works for the flash SSDs such as the Patriot Lite.

For example, to write zeroes to /dev/sda:

# dd if=/dev/zero of=/dev/sda bs=1M

To do a read-only test of /dev/sda:

# badblocks -s /dev/sda
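Both commands above operate on the whole device, and writing zeroes destroys everything on it. It is therefore worth confirming first that no partition of the drive is mounted, which usually means booting from another medium. `device_is_mounted` is a hypothetical helper.

Limitations of this sketch:
- It reads `/proc/mounts` by default and matches by name prefix. So `/dev/sda` also matches `/dev/sda1` and, erring on the safe side, `/dev/sdaa`.
- It does not check active swap, which is listed in `/proc/swaps` instead.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit status 0) if DEVICE, or any entry whose
# name starts with DEVICE (e.g. /dev/sda1), is listed in MOUNTS_FILE.
device_is_mounted() {
    local device="$1" mounts_file="${2:-/proc/mounts}"
    # Field 1 is the mounted device; index() == 1 means a prefix match.
    awk -v dev="$device" '
        index($1, dev) == 1 { found = 1 }
        END { exit !found }
    ' "$mounts_file"
}

# Example guard before wiping (as root; this erases the whole drive):
#   if device_is_mounted /dev/sda; then
#       echo "unmount /dev/sda first" >&2
#   else
#       dd if=/dev/zero of=/dev/sda bs=1M
#       badblocks -s /dev/sda
#   fi
```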

Revision history for this message
lotus49 (lotus-49) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

The workaround did the trick for me too. I hadn't suffered any
corruption but I was considering going back to Jaunty and I am pleased
not to have to.

Simon

Revision history for this message
In , Lennart-poettering (lennart-poettering) wrote :

(In reply to comment #3)
> Most distros only switched to using your code recently; previously we've all
> been using smartmontools and the like which don't cause this problem.

Is it actually verified that this doesn't happen with smartmontools? I mean, smartmontools in contrast to libatasmart does not issue commands that early after initialization/hotplug? So, is it verified that the problem is with the way libatasmart issues its commands and not simply due to the context those commands are executed in?

> The LP bug has the differencing straces between the two if that's helpful?

I only see a lot of noise in that bug report; could you point me to the two straces?

Revision history for this message
In , Lennart-poettering (lennart-poettering) wrote :

(In reply to comment #3)
> Most distros only switched to using your code recently;

The simple fact is that rawhide (and the ubuntu betas) had this code for months already, and we got quite a few bug reports, but never something about this issue. This issue only appeared a couple of weeks back, and hence I am wondering if something else changed in that time, because libatasmart didn't.

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

UNFORTUNATELY the workaround is not a good idea for a NEW installation on my netbook. I was able to edit /lib/udev/rules.d/95-devkit-disks.rules before the installer rebooted into the new system, but as soon as the system installed the first set of critical and recommended updates, the filesystem was thoroughly trashed before update-manager had even finished.

Is there a single package I can pin in apt, or can I just remove or somehow deactivate libatasmart itself instead of editing the udev rule?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

On Thu, 2009-12-17 at 15:44 +0000, Tommy Trussell wrote:

> UNFORTUNATELY the workaround is not a good idea for a NEW installation
> on my netbook. I was able to edit /lib/udev/rules.d/95-devkit-
> disks.rules before the installer rebooted into the new system, but as
> soon as the system installed the first set of critical and recommended
> updates, the filesystem was thoroughly trashed before update-manager had
> even finished.
>
Then re-apply the change before rebooting after those updates.

You can use dpkg-divert as described above in the bug comments to ensure
that updates do not affect this file - but then you won't get the proper
fix later and may indeed cause yourself future bugs down the line.

Scott

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

@scott Yes, I think dpkg-divert or some other technique is essential because as I said, I didn't even reboot -- the filesystem was already trashed BEFORE update-manager had FINISHED -- apparently it had already reverted the change and reloaded udev in one of its updates.

Revision history for this message
Cris (cristiano.p) wrote :

While upgrading my Eee PC 900 to Karmic (before I had to recover the SSD and fall back to Jaunty),
I noticed a very marked slowdown of disk operations: the upgrade process took almost 5 hours.

This makes me believe that the disk corruption already happens during the upgrade/install process.

Revision history for this message
In , Andrewnz-simpson (andrewnz-simpson) wrote :

(In reply to comment #5)
> The simple fact is that rawhide (and the ubuntu betas) had this code for months
> already, and we got quite a few bug reports, but never something about this
> issue. This issue only appeared a couple of weeks back, and hence I am
> wondering if something else changed in that time, because libatasmart didn't.
>

The Ubuntu bug report goes back to early October, however the link with libatasmart was only made very recently.

There is no interplay with any specific kernel version: The bug has been confirmed on 2.6.28, 2.6.29, 2.6.30, 2.6.31 and 2.6.32. Both Ubuntu patched versions and mainline kernels are affected.

The bug has been confirmed as libatasmart only; testing has shown that smartmontools does not give the same problem. Early initialisation is a possible issue, though the problem can be readily reproduced at any time.

Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems to be isolating the bug.

I will attach the straces from the Ubuntu Bug Report to this report.

Revision history for this message
In , Andrewnz-simpson (andrewnz-simpson) wrote :

Created an attachment (id=32167)
Trace from libatasmart

Revision history for this message
In , Andrewnz-simpson (andrewnz-simpson) wrote :

Created an attachment (id=32168)
Trace from smartctl

Revision history for this message
Jean-Louis (jean-louis) wrote :

Hi, sorry for my bad English.

I don't have an SSD, but I've looked at the libatasmart code for another bug and I think I can help a little with this.

In the attachment smartctl-output I can see: "ATA Version is: 5"

In this PDF (2.7MB) http://www.t10.org/t13/project/d1321r3-ATA-ATAPI-5.pdf the behavior differs depending on whether the drive implements the "PACKET Command feature set".

In particular, if it is implemented, all the SMART commands used in libatasmart are prohibited.

If the PACKET Command feature set is implemented, the IDENTIFY DEVICE command shall return command aborted, but in libatasmart the return value is lost and "d->identify_valid = FALSE;" is never set.

Try adding "d->identify_valid = FALSE;" to the function disk_identify_device(), after (line 741) "if ((ret = disk_command(d, SK_ATA_COMMAND_IDENTIFY_DEVICE, SK_DIRECTION_IN, cmd, d->identify, &len)) < 0)" and before (line 742) "return ret;",

like this:

        if ((ret = disk_command(d, SK_ATA_COMMAND_IDENTIFY_DEVICE, SK_DIRECTION_IN, cmd, d->identify, &len)) < 0) {
                d->identify_valid = FALSE;
                return ret;
        }

Revision history for this message
In , Lennart-poettering (lennart-poettering) wrote :

(In reply to comment #6)

>
> Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> to be isolating the bug.

I don't think so. That proposed patch is bogus, identify_valid is FALSE unless set to TRUE anyway.

Also, supposedly SMART does work with smartmontools, just not with libatasmart, right? That comment suggests that SMART would not work at all with those SSDs.

Andrew, do you have one of the SSDs affected? Could you step through the code and figure out exactly which command triggers the problem?

Revision history for this message
In , Andrewnz-simpson (andrewnz-simpson) wrote :

(In reply to comment #9)

>
> Andrew, do you have one of the SSDs affected? Could you step through the code
> and figure out exactly which command triggers the problem?
>

You would have to give me very detailed instructions as to how to do it. Programming in C is not one of my skills and I don't have a programming background.

More realistically, there are at least a couple of people on the Ubuntu Bug list that would know how to do this. Is it worth putting out a query?

Revision history for this message
In , Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

In terms of how long this bug has existed: I originally filed a bug on this issue back in June '09, so it existed long before 9.10 was released (October).

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/387272

Revision history for this message
In , Jelot-freedesktop (jelot-freedesktop) wrote :

(In reply to comment #9)
> (In reply to comment #6)
>
> >
> > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > to be isolating the bug.
>
> I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> set to TRUE anyway.

I'm the author of comment #129 in launchpad <https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/445852/comments/129>

I'm quite a beginner with C, but I know that if a variable is not initialized its value is garbage.

I can't find where d->identify_valid is zeroed or set to FALSE. Of course, I could have missed it.

>
> Also, supposedly SMART does work with smartmontools, just not with libatasmart,
> right? That comment suggests that SMART would not work at all with those SSDs.

I don't know the internals of smartmontools... I have only read some pages of this PDF (2.7MB) http://www.t10.org/t13/project/d1321r3-ATA-ATAPI-5.pdf

On page 52 it says:

[quote]
Devices that implement the PACKET Command feature set shall not implement the SMART feature set as described in this subclause.
Devices that implement the PACKET Command feature set and SMART shall implement SMART as defined by the command packet set implemented by the device.
[/quote]

and on page 196 and following it says:

[quote]
SMART feature set.
  − Mandatory when the SMART feature set is implemented.
  − Use prohibited when the PACKET Command feature set is implemented.
[/quote]

I don't know how SMART is implemented for the PACKET command set, and I don't know whether this SSD implements the PACKET command set, but this *could be* the problem (IMHO).

Revision history for this message
mint-one (d-zschokke) wrote :

Hi folks

Applying the patch works... You should apply it first to your install USB stick, and you will notice how fast gparted detects your drives. Then apply the patch after installing Karmic. Never reboot the system without applying the patch! Otherwise you will be writing lots of zeroes to your SSD again. And now the best part: every kernel or grub update (not sure which) will reset this patch! Apply it again after each major update, or your SSD will be lost in space again.

I'm still up after having updated the system successfully. Let's see how long this is going to last. This bug is hell. Priority should be set to "hell".

good luck, dominic (on a eee pc 900)

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Would it be useful to create a "dummy" libatasmart4 package that responds to its calls with something innocuous but doesn't actually probe the SMART status? I see it's not easy to just yank it out because of other packages' dependencies upon it. I would prefer to disable the package in a way that survives ordinary software updates.

I'm not sure how @mint-one was able to avoid filesystem breakage... I wasn't able to reapply the patch in time, though maybe I was just especially un-lucky or un-careful.

P.S.: The Patriot Lite 32GB SSD upgrade on my ASUS 900 seems most susceptible to damage when the root partition or the root + swap partitions completely fill the drive. I don't know what that might mean, except that it's a "bigger target" for breakage. The beta 9.10 NBR installers (prior to October) could not even finish the job without completely trashing the filesystem before grub was installed.

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

@jean-louis: do you have your patched libatasmart4 code in a PPA? I would be pleased to test it.

Revision history for this message
mint-one (d-zschokke) wrote :

For Tommy Trussell and to whom it may concern:

1. Create a bootable usb medium.
2. Apply the patch on the USB medium! (Comment out this line, see top of this bug)
3. Install (boot into live installation, don't install directly)
4. Install karmic
5. Don't reboot
6. Apply the patch on the boot partition (navigate there with nautilus and copy the path (quite a strange one), paste it into a terminal, sudo gedit $path, save)
7. Reboot.
8. Update your system (language support, new kernel, grub, misc updates)
9. Don't reboot.
10. sudo gedit /lib/udev/rules.d/95-devkit-disks.rules
10.1 and comment out the line again... it was reset to the faulty default!
11. Reboot

Here you go. This worked for me. Still working after several reboots, and it's fast on this old and small Eee PC.

Good Luck!

Revision history for this message
Jean-Louis (jean-louis) wrote :

> @jean-louis: do you have your patched libatasmart4 code in a PPA?
> I would be pleased to test it.

No, I don't have a PPA.
I'm new to Launchpad and I have yet to understand its features.

My comment was reported upstream by Andrew Simpson, but Lennart Poettering says that the proposed patch is bogus (https://bugs.freedesktop.org/show_bug.cgi?id=25673#c9).

I'm quite a beginner with C, but I can't find where d->identify_valid is zeroed or set to FALSE. Of course, I could have missed it.

Now I will create an account on freedesktop to ask.

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

@mint-one -- when I did that, my system was corrupted BEFORE step #8 was finished.

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

O.K., I think I have a better workaround for this bug.

The problem is that udev reads the udev rule files into memory and then uses inotify to watch for changes in the file. As soon as the rule file changes, udev is informed and re-reads the file. That means that when apt-get updates the rule file, damage can be done before you get a chance to patch it again.

What I have done is put a dummy file in for devkit-disks-probe-ata-smart and used dpkg-divert to make the system accept the dummy file.

Run the following command:

$ sudo dpkg-divert --add --rename --divert /lib/udev/devkit-disks-probe-ata-smart.bak /lib/udev/devkit-disks-probe-ata-smart

This renames the existing file to devkit-disks-probe-ata-smart.bak and tells dpkg / apt-get to install any new updates to the _changed_ file name.

To see your divert (and others in the system):

$ sudo dpkg-divert --list

Now we create a dummy file:

$ sudo /lib/udev/nano devkit-disks-probe-ata-smart (or some other editor of your choice)

#!/bin/bash
#
exit 0

Save the file.

This dummy file does precisely nothing, but it allows udev to run it...

Make the new dummy file executable:

$ sudo chmod 755 /lib/udev/devkit-disks-probe-ata-smart

That's it.

When the bug gets really fixed, we need to remove the dummy file and divert:

$ sudo rm /lib/udev/devkit-disks-probe-ata-smart
$ sudo dpkg-divert --rename --remove /lib/udev/devkit-disks-probe-ata-smart
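The stub-file step of this workaround can also be done non-interactively. This is a minimal sketch that assumes the `dpkg-divert --add --rename` step has already moved the real binary aside. `install_noop_probe` is a hypothetical name.

Like the hand-written version above, the stub prints nothing and exits with status 0. Because it prints nothing, udev imports no `DKD_ATA_SMART_*` properties from it.

```shell
#!/bin/bash
# Hypothetical helper: write a no-op stand-in for devkit-disks-probe-ata-smart.
# Run it only after the real file has been diverted, e.g.:
#   dpkg-divert --add --rename --divert /lib/udev/devkit-disks-probe-ata-smart.bak \
#       /lib/udev/devkit-disks-probe-ata-smart
install_noop_probe() {
    local target="$1"
    # The stub ignores its arguments (udev passes the device node), prints
    # nothing, and exits successfully.
    printf '%s\n' '#!/bin/sh' '# no-op stand-in for devkit-disks-probe-ata-smart' 'exit 0' > "$target"
    chmod 755 "$target"
}

# Example (as root):
#   install_noop_probe /lib/udev/devkit-disks-probe-ata-smart
```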

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

Carrying on from above:

Here's how to patch and install a system safely from the LiveCD (or live USB) of Ubuntu 9.10.

I booted up the LiveCD and patched the live system as above. That made the live system safe to use. I then installed from the LiveCD (no errors - good).

However instead of immediately rebooting, I patched the SSD from the live system:

$ sudo mkdir /target

(In my case it already existed from the install)

$ sudo mount /dev/sda1 /target

$ sudo chroot /target

You are now in the new (SSD) system as root, but safely running on the patched LiveCD. Follow the steps above, but leave out 'sudo', because you are root. When finished you can leave chroot by:

# exit

------------------

Edit on previous comment:

$ sudo /lib/udev/nano devkit-disks-probe-ata-smart (or some other editor of your choice)

-- should read:

$ sudo nano /lib/udev/devkit-disks-probe-ata-smart (or some other editor of your choice)

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

@Andrew Simpson: Thank you! but be sure to substitute the path to an editor that works ;-) I'm testing this now.

sudo gedit /lib/udev/devkit-disks-probe-ata-smart
or
sudo nano /lib/udev/devkit-disks-probe-ata-smart

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Finished testing, and the new workaround procedure works fine on my ASUS. (I did have some trouble on the first reboot after update-manager finishes, but it looks like a grub issue, probably not related to this bug.)

Revision history for this message
In , Lennart-poettering (lennart-poettering) wrote :

(In reply to comment #12)
> (In reply to comment #9)
> > (In reply to comment #6)
> >
> > >
> > > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > > to be isolating the bug.
> >
> > I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> > set to TRUE anyway.
>
> I'm the author of comment #129 in launchpad
> <https://bugs.launchpad.net/ubuntu/+source/libatasmart/+bug/445852/comments/129>
>
> I'm quite a beginner with c, but I know that if a variable is not initialized
> its value is garbage.
>
> I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I
> could be missed it.

The initial calloc() call for allocating the SkDisk structure does the zero initialization.

Revision history for this message
In , Jelot-freedesktop (jelot-freedesktop) wrote :

(In reply to comment #13)
> (In reply to comment #12)
> > I don't find how d->identify_valid is zeroed or setted FALSE. Obviously, I
> > could be missed it.
>
> The initial calloc() call for allocating the SkDisk structure does the zero
> initialization.
>

Uh... thanks for clarification and sorry for wasting your time.

Revision history for this message
In , 4280829-noduck (4280829-noduck) wrote :

(In reply to comment #9)
> (In reply to comment #6)
>
> >
> > Comment #129 of the Ubuntu Bug Report is well worth reading, because it seems
> > to be isolating the bug.
>
> I don't think so. That proposed patch is bogus, identify_valid is FALSE unless
> set to TRUE anyway.
>
> Also, supposedly SMART does work with smartmontools, just not with libatasmart,
> right? That comment suggests that SMART would not work at all with those SSDs.
>
> Andrew, do you have one of the SSDs affected? Could you step through the code
> and figure out exactly which command triggers the problem?
>

I ran gdb on devkit-disks-probe-ata-smart/libatasmart and found that the HSM violation in the kernel log is triggered by the ioctl issued from disk_smart_read_thresholds(). The call chain is:
- disk_smart_read_thresholds() calls disk_command() with SK_SMART_COMMAND_READ_THRESHOLDS;
- disk_command() calls disk_passthrough_16_command();
- that calls sg_io(), which does the ioctl.

There is one earlier call to disk_command() (with SK_ATA_COMMAND_IDENTIFY_DEVICE), but that does not trigger the HSM violation in the kernel log. So I believe it is the way the SSD reacts to the READ_THRESHOLDS command that throws off the kernel.

Raf.

Revision history for this message
MFV (mfv) wrote :

I can confirm this occurs on a stock Asus EEE 900 (the original Celeron Linux model with 4 GB + 16 GB SSDs; it occurs on both).

Changed in libatasmart:
status: Unknown → Confirmed
Revision history for this message
Vishal Rao (vishalrao) wrote :

I've filed https://bugs.launchpad.net/bugs/502219. I'm not sure whether it's the same bug or just related, and whether the same workaround applies.

Revision history for this message
jslater (jslater) wrote :

Does this bug affect _everyone_ with an eeePC [900]? A "Tier 1" supported netbook platform.

It certainly destroyed the contents of my SSD.

The workaround works, but there is no mention on https://wiki.ubuntu.com/HardwareSupport/Machines/Netbooks or elsewhere of this very serious bug. At the very least warning potential users of this bug might save someone from losing all of their data.

This bug has existed for months and has seriously dented my impression of Ubuntu.

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

@jslater: from earlier tests it seemed it did not affect everyone, or at least not equally. For example, I have an ASUS EeePC 900 (4GB SSD, no built-in webcam, purchased from Target) and I see the problem very clearly on my Patriot Lite upgraded SSD but not on the stock ASUS 4GB SSD. I haven't swapped the 4GB SSD back in since we have discovered the trigger -- it's possible the stock SSD was somewhat affected but didn't trash data as thoroughly or something. When I get back to my office later I might get out the screwdriver set and try it.

I agree that this bug should be noted somewhere on that netbooks wiki page. Please feel free to add it! (Though it would be hard to know exactly which models might be affected. Maybe ALL of them, depending on which SSD is installed.)

I believe some SSDs have been reported that are not installed in netbooks.

If you see a good place to include the warning, please add it.

Revision history for this message
rogmorri (frontporsche) wrote :

The factory-installed SSD on my Acer Aspire One 110-1722, a tier-2 netbook, died a few months ago. In retrospect, I think the issue was this very bug.

(I've since replaced the SSD with a 16 GB aftermarket drive, which also suffered from this problem.)

Wouldn't a problem like this, where there's no easy workaround for avoiding the problem at install time, call for releasing a new 9.10.1 Ubuntu iso?

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

Just for reference I have an EEE 900 which has been running Jaunty 9.04 for a while just fine. I upgraded to 9.10, and before rebooting implemented the change to /lib/udev/rules.d/95-devkit-disks.rules as indicated in the description. This seemed to work well.

Revision history for this message
In , Jelot-freedesktop (jelot-freedesktop) wrote :

In Karmic there is a new stable kernel, 2.6.31-17.54 <https://bugs.launchpad.net/ubuntu/+source/linux/+bug/480144>, based on upstream kernel 2.6.31.6; its changelog <http://kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.31.6> mentions commit 9982364654c186acd48c3070dcf6a76c69e540cc with this description:

[quote]
commit 9982364654c186acd48c3070dcf6a76c69e540cc
Author: Tejun Heo <email address hidden>
Date: Fri Oct 16 13:00:51 2009 +0900

libata: fix internal command failure handling

commit f4b31db92d163df8a639f5a8c8633bdeb6e8432d upstream.

When an internal command fails, it should be failed directly without invoking EH. In the original implemetation, this was accomplished by letting internal command bypass failure handling in ata_qc_complete(). However, later changes added post-successful-completion handling to that code path and the success path is no longer adequate as internal command failure path. One of the visible problems is that internal command failure due to timeout or other freeze conditions would spuriously trigger WARN_ON_ONCE() in the success path.

This patch updates failure path such that internal command failure handling is contained there.
[/quote]

Could this be related to this bug?

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Here is a tested procedure for getting an uncorrupted Karmic system installed onto a netbook (tested on my ASUS Eee PC 900).

Installing Ubuntu Karmic 9.10 UNR on a system with an affected SSD

1) Boot from a live Ubuntu "Karmic" 9.10 USB stick or SD card.

2) If the SSD has been trashed by previous encounters with the bug, it may need to be wiped to eliminate bad blocks. Open a terminal and issue the following command (this assumes the SSD is the whole-disk device "/dev/sda"; you must be certain of the device name on your system because everything on it will be erased):

$ sudo dd if=/dev/zero of=/dev/sda bs=1M

3) After step 2 finishes (it can take a while), launch the "Install Ubuntu-Netbook-Remix 9.10" application (ubiquity) and install Ubuntu to your SSD. (If you have sufficient RAM, choose custom partitioning and create a single "/" (root) partition without any swap space. I recommend using ext3 or the default ext4. Some recommend using ext2; however, in my experience it does not recover from crashes gracefully.)

4) When the installer finishes, a dialog will come up suggesting you can restart now. Don't restart yet! While that dialog is open, the install partition should still be mounted at /target ... HOWEVER if you already closed the dialog, open a Terminal and mount the partition:

$ sudo mount /dev/sda1 /target

5) Now chroot into the target system:

$ sudo chroot /target

6) The terminal is chrooted into the target system as root (no need for sudo). You can now divert the problematic file on the target system:

# dpkg-divert --add --rename --divert /lib/udev/devkit-disks-probe-ata-smart.bak /lib/udev/devkit-disks-probe-ata-smart

7) Now create a file. You will type three lines directly into the file, finishing with a control-D. (If you make a mistake that you can't fix using backspace, close the file with control-D and use nano or vim to edit the file.)

# cat > /lib/udev/devkit-disks-probe-ata-smart
#!/bin/bash
#
exit 0
[type control-D here]

8) Make the new file executable:

# chmod 755 /lib/udev/devkit-disks-probe-ata-smart

9) Exit the chroot and the terminal:

# exit
$ exit

10) Shutdown, remove the USB stick or SD card, and boot into the new system. Install all software updates as needed.

-------------------------

After you know this bug has been fixed AND after the correct updated devicekit-disks package has been installed on your system, you can re-enable it using these commands:

$ sudo rm /lib/udev/devkit-disks-probe-ata-smart
$ sudo dpkg-divert --rename --remove /lib/udev/devkit-disks-probe-ata-smart
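Steps 6 to 8, and the two re-enable commands, could also be wrapped in shell functions so the same change can be repeated after a reinstall. This is a minimal sketch, not the script attached later in this thread: the function names are made up, and `PROBE` assumes the Karmic path used above (on Lucid the probe is named udisks-probe-ata-smart, presumably in the same directory; check before overriding it). The stub it writes contains the same three lines as step 7.

```sh
# Sketch of steps 6-8 and their reverse, meant to be sourced into a root shell.
# Assumption: Karmic probe path; set PROBE beforehand for other releases.
PROBE=${PROBE:-/lib/udev/devkit-disks-probe-ata-smart}

# Steps 7-8: write the do-nothing probe to the given path and make it executable.
write_stub() {
    printf '#!/bin/bash\n#\nexit 0\n' > "$1" && chmod 755 "$1"
}

# Step 6 then 7-8: divert the packaged probe to $PROBE.bak inside the given
# root (default /, or /target from the live session), then install the stub.
disable_probe() {
    root=${1:-/}
    chroot "$root" dpkg-divert --add --rename \
        --divert "$PROBE.bak" "$PROBE" &&
        write_stub "${root%/}$PROBE"
}

# Reverse, once the fixed package is installed: delete the stub and
# move the packaged probe back to its original name.
enable_probe() {
    root=${1:-/}
    rm -f "${root%/}$PROBE" &&
        chroot "$root" dpkg-divert --rename --remove "$PROBE"
}
```

Run as root, for example `disable_probe /target` from the live session after step 4 (in place of steps 5 to 8), or `disable_probe` on an installed system; `enable_probe` undoes it. The divert has to come before the stub is written, because `--rename` moves whatever file sits at the original path.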

--------------------------

Revision history for this message
MFV (mfv) wrote :

What's the update on this? The workaround seems good, and does this issue exist in every current Linux distro? Is a fix actually on its way?

Revision history for this message
Jarige (jarikvh) wrote :

I've noticed similar symptoms of this bug when adding "elevator=noop" to /etc/default/grub on this line: GRUB_CMDLINE_LINUX_DEFAULT="quiet splash elevator=noop".
After removing it from the command line again, it worked 'normally' again. By 'normally' I mean that all programs seem to stop responding when the SSD is in use (either writing or reading). Installing a program makes every other program 'crash', and the GUI (even the mouse) stops responding. This does not happen all the time, though. As I type on my AAO (8GB SSD), I see the SSD LED blinking every now and then, and it affects other programs pretty often.

Revision history for this message
Vishal Rao (vishalrao) wrote :

FYI, my (solved) problem is/was NCQ not SMART as you can see in this comment in another bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/502219/comments/7

If you see messages like "failed command READ FPDMA QUEUED" in dmesg, then that fix might be for you...

Basically, you need to pass "libata.force=noncq" as a Linux kernel boot parameter. That is what I am doing, but I am not sure what to do when people have multiple drives, some of which properly support NCQ...
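Vishal's NCQ symptom can be told apart from the HSM one by counting failed queued commands in the kernel log. A minimal sketch; `count_fpdma_failures` is a hypothetical name, and it assumes messages of the form `ata1.00: failed command: READ FPDMA QUEUED` (the colon after "command" is optional in the pattern because the wording quoted above omits it).

```sh
# Count failed NCQ (FPDMA QUEUED) commands in kernel log text read from stdin.
# Usage: dmesg | count_fpdma_failures
count_fpdma_failures() {
    # grep -c still prints 0 when nothing matches but exits 1; ignore that
    # so the function always succeeds and prints a count.
    grep -cE 'failed command:? (READ|WRITE) FPDMA QUEUED' || true
}
```

A non-zero count points toward the NCQ problem and the `libata.force=noncq` workaround rather than the SMART probe.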

Revision history for this message
Юрий Аполлов (apollovy) wrote :

Launchpad says this bug has a patch, but I cannot see it in the usual place: there are instructions, but no patch. Can anyone write a patch or script to work around this problem?

Changed in linux:
status: Confirmed → Invalid
Revision history for this message
ectropionized (ectropionized-deactivatedaccount) wrote :

After upgrading the kernel to 2.6.31-19 and devicekit-disks to 007-2ubuntu4, and enabling DMA again, I have received no errors after extensive testing. I was one of those plagued by this bug, which was causing havoc on my netbook SSD. Can anyone else confirm a resolution on their end? It's possible that my testing period was simply anomalous, although I doubt it given how consistent the errors were before. Since I am not yet receiving errors, I thought it was worth continuing the discussion.

Revision history for this message
ectropionized (ectropionized-deactivatedaccount) wrote :

Scratch that on the resolution. Although I'm no longer receiving data corruption with DMA enabled, I just received this:

[ 2962.988208] ata2: lost interrupt (Status 0x58)
[ 2962.988297] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2962.988306] ata2.00: BMDMA stat 0x4
[ 2962.988323] ata2.00: cmd ca/00:08:40:e9:7d/00:00:00:00:00/e0 tag 0 dma 4096 out
[ 2962.988326] res 58/00:08:40:e9:7d/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 2962.988333] ata2.00: status: { DRDY DRQ }
[ 2962.988376] ata2: soft resetting link
[ 2963.196543] ata2.00: configured for UDMA/66
[ 2963.196581] ata2: EH complete
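When comparing logs like the one above, it can help to reduce the kernel output to a per-device count of HSM violations. A minimal sketch, assuming the dmesg format shown here: a bracketed timestamp, `ataN` or `ataN.NN` prefixes, and the `(HSM violation)` marker on the `res` line that follows the failing `cmd` line, so each violation is attributed to the device named on the most recent line that starts with an ata prefix. `hsm_summary` is a hypothetical helper name.

```sh
# Print "<device> <count>" for each ATA device with HSM violations in
# kernel log text read from stdin. Usage: dmesg | hsm_summary
hsm_summary() {
    awk '
        { sub(/^\[ *[0-9.]+] */, "") }                 # strip the timestamp
        /^ata[0-9]+(\.[0-9]+)?: / { dev = $1; sub(/:$/, "", dev) }
        /HSM violation/ && dev != "" { n[dev]++ }      # res line follows cmd line
        END { for (d in n) print d, n[d] }
    '
}
```

Running it on the excerpt above would attribute the single violation to `ata2.00`.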

Revision history for this message
Will (will-berriss) wrote :

I have just put a Super Talent 32GB SSD into my AA1 netbook and have installed Ubuntu 9.10 and I have this bug.

I don't want to damage my SSD as it was expensive.

What are my options to avoid Ubuntu damaging my SSD? Do I need to stop using 9.10 and wait for 10.04 or will the workaround above treat the SSD nicely?

Revision history for this message
Gav Mack (gavinmac) wrote :

@Will - Follow the instructions on post 147

Revision history for this message
Will (will-berriss) wrote :

@Gav Mack - Thanks! That looks like quite a change, but I like the idea of not having swap so I may give it a go and reinstall.

Currently all I have done is the stuff in post 1, i.e. this:

# ATA disks driven by libata
#KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"

Is this enough to avoid damage to the SSD or is it only a small step towards reducing SSD wear and tear?

Thanks.

Revision history for this message
Martin Pitt (pitti) wrote :

Bug is in libatasmart; closing devicekit-disks task.

Changed in devicekit-disks (Ubuntu):
status: Triaged → Invalid
importance: Critical → Undecided
Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

@will -- in my experience with my Patriot Lite SSD, the patch in post 1 works fine but your first software update will undo it. And it's not good enough to apply the patch after the software update is finished, because the buggy code starts running immediately and it trashes the drive before the update even finishes.

The more elaborate workaround in comment 147 (which Andrew Simpson developed and I reiterated) tells the package manager to move the buggy code to a different location and apply all future updates to it in the new location (step 6) and creates a dummy executable file that runs but does nothing (step 7 & 8) in the location of the old software.

Revision history for this message
Will (will-berriss) wrote :

@Tommy Trussell - Thank you very much!

I reinstalled with no swap space and will apply #147 next. I just did #1 for the time being and I noticed an update overwrote it, so I reapplied it. Luckily my SSD survived that, at least.

Next time I boot it up, I'll apply #147 right away. What a nightmare!

Thanks again! :)

Revision history for this message
Will (will-berriss) wrote :

I had to read #147 a couple of times, as the wording of step 7) confused me. Anyway, in short I did this to my working system:

6) You can now divert the problematic file on the target system:

# dpkg-divert --add --rename --divert /lib/udev/devkit-disks-probe-ata-smart.bak /lib/udev/devkit-disks-probe-ata-smart

7) Now create a file.

# vi /lib/udev/devkit-disks-probe-ata-smart

and put the following 3 lines in it:

#!/bin/bash
#
exit 0

8) Make the new file executable:

# chmod 755 /lib/udev/devkit-disks-probe-ata-smart

Revision history for this message
In , 4280829-noduck (4280829-noduck) wrote :

(In reply to comment #15)
> I ran gdb on devkit-disks-probe-ata-smart/libatasmart.

Was this information helpful? If not, can you let me know how I can assist in fixing this bug?

Revision history for this message
Raf (4283534-noduck) wrote :

This bug also affects Lucid via the call to udisks-probe-ata-smart in /lib/udev/rules.d/80-udisks.rules (from the package udisks). Commenting out the line

KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="udisks-probe-ata-smart $tempnode"

in /lib/udev/rules.d/80-udisks.rules works around the problem on my Acer.
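Because a package update can restore the rules file (as Tommy Trussell describes earlier in this thread), it may be convenient to script the edit so it can be re-applied after each update. A minimal sketch using GNU sed's in-place mode; the function name is hypothetical, and it assumes the rule keeps the `IMPORT{program}="<probe> $tempnode"` form shown above. Lines that already start with `#` are skipped, so running it twice is harmless, and the unmodified file is kept with a `.orig` suffix (udev only reads files ending in `.rules`, so the backup is not loaded).

```sh
# Comment out the udev rule(s) that import the given SMART probe program.
# Usage (as root):
#   comment_out_probe /lib/udev/rules.d/80-udisks.rules udisks-probe-ata-smart
comment_out_probe() {
    rules=$1 probe=$2
    # On lines that do not start with "#" and that contain
    # IMPORT{program}="<probe> , insert "#" at the start of the line.
    # -i.orig (GNU sed) edits in place and keeps the original as "$rules.orig".
    sed -i.orig "/^[^#].*IMPORT{program}=\"$probe /s/^/#/" "$rules"
}
```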

Is there any plan to include a real fix for this problem in Lucid? The upstream kernel bug was closed, the upstream libatasmart bug hasn't received much attention. How can I help to make sure that this bug is fixed in Lucid?

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

@Raf
I closed the kernel bug report (it was my bug report) since it's not relevant to the kernel.

I've also nominated this bug for Lucid release - whatever that does.

More importantly the upstream maintainer seems to have lost interest in fixing this bug. How does one go about nominating packages for removal from Ubuntu due to lack of response from upstream maintainer?

Revision history for this message
Guy Taylor (thebiggerguy) wrote :

@Andrew Simpson
I have had good response from upstream and think this is a good package to keep within Ubuntu.

@all
Has the particular hardware (SSD or controller) been identified yet? libatasmart has a "quirk" table to blacklist incompatible hardware.

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

@Guy Taylor
The hardware has been generically identified - most Super Talent and Patriot devices and less commonly a few others. The problem seems to be at the SSD rather than the bridge. There are enough people following this bug to enable compilation of a reasonably complete list if asked.

What device information is required for a quirks table? Output from lspci -vv? Or something else?

Revision history for this message
Skylord (me-skylord) wrote :

BTW, this problem isn't limited to specific hardware. For example, I encountered it after updating my standard EeePC 901 SSD firmware to a newer version with better speed (in exchange for reduced disk space). The same goes for the Acer Aspire One....

Revision history for this message
adamski (adam-hasselbalch) wrote :

So.

I have /dev/zeroed both the 4G and the 16G SSDs in my Eee PC after running into this bug. When I discovered what was going on, I had used Karmic for about one hour.

The 4G drive seems to be OK; 'badblocks -s' finishes with no reports.

The 16G drive is dead. Stuffed to the brim with bad sectors, and dmesg shows I/O errors galore. This is both with 9.04 and 8.04 (which was my last known good installation) kernels. 8.04 allowed me to actually make a partition table after 9.04 failed with I/O errors all over the place. 8.04 also allowed me to create a file system. badblocks(1), however, shows that there are still tons of errors on the drive.

I am going to try another /dev/zeroing of the 16G drive with an 8.04 kernel, for good measure, but I am not optimistic, since the HSM violations are gone, and what I see is, as mentioned, what looks like hardware I/O errors.

Mind you, both these drives worked fine prior to installing 9.10, but now, one disk is dead.

I am NOT really happy with Ubuntu right now. Spare Asus SSDs (which of course don't use regular SATA connections) are not a common commodity here, so for all intents and purposes, Karmic has bricked my Eee.

Revision history for this message
Юрий Аполлов (apollovy) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

adamski, if possible, try to reflash (actually change) the firmware, a couple of times. I had the same trouble with Karmic on my Acer Aspire One 8GB SSD. I zeroed it many times and reflashed the firmware ten (or more) times in a row, and now it's running fine with Jaunty.

2010/2/24 adamski <email address hidden>

Revision history for this message
shadowblast101 (shadowblast101) wrote :

Adamski, just thought I'd clarify that this applies to a lot of Linux distributions, not just Ubuntu. I have this bug on my Arch install, and I think someone with Debian has confirmed it as well. Pretty much anything that calls libatasmart will suffer from this, so there's not really any reason to blame Ubuntu specifically.

Revision history for this message
In , 4280829-noduck (4280829-noduck) wrote :

(In reply to comment #17)
> Was this information helpful? If not, can you let me know how I can assist in
> fixing this bug?

For those still following: disabling smart support on the device prevents the second (dangerous) ioctl, and as a result no more HSM violations.

smartctl --smart off /dev/sda

Revision history for this message
MFV (mfv) wrote :

I'd disagree; there's a perfectly good reason to blame Ubuntu: their NBR doesn't work on netbooks. In my book that's an epic FAIL, and it probably spells the end of Linux's chance on netbooks, given that Ubuntu is the highest-profile distro and they just don't appear interested in fixing it.

To those having problems recovering devices, I recommend repartitioning the drive or writing to the raw device. Milax does this well: Format, select device, analyze and purge.

Revision history for this message
GenericAnimeBoy (souletech) wrote :

I'd rather not join in with hysterical finger-pointing, but it's been a month and a half, people. The workaround in #147 (modified in #160 for working systems) could easily have been implemented as a script, packaged up, and pushed out via the software updater by now, and the fix could have been rolled into the Karmic ISOs. You have to realize just how critical this is: the only reason I even became aware of this problem was that when the SSD hung up during boot (as a result of this bug), my gnome-applets failed to load. For every affected user who is on Launchpad following this bug, I would guess there are at least three others who have just ignored the occasional glitches, thinking it's nothing major.

How many drives have been destroyed in the month and a half since the workaround was published? Replacing an SSD in a netbook is an expensive, time consuming, and (if you do it yourself) warranty voiding operation.

Revision history for this message
GenericAnimeBoy (souletech) wrote :

Just a rough implementation of #160 as a script. Probably needs to be prettied up for public consumption.

Revision history for this message
Raf (4283534-noduck) wrote :

I found a possibly easier workaround: after I disabled SMART support on the device, I can safely run devkit-disks-probe-ata-smart (or udisks-probe-ata-smart in Lucid):

sudo smartctl --smart off /dev/sda

You can check whether SMART is disabled with 'sudo smartctl -i /dev/sda'; the output should include (note the last line):

SMART support is: Available - device has SMART capability.
SMART support is: Disabled

Note the following comment in the smartctl manual: "In principle the SMART feature settings are preserved over power-cycling, but it doesn't hurt to be sure." I have not yet rebooted.

Looking at the strace of devkit-disks/udisks-probe-ata-smart, I see that the second (dangerous) ioctl is not executed when smart is disabled.

Revision history for this message
Raf (4283534-noduck) wrote :

I just rebooted, and SMART was enabled again. So this doesn't work through a reboot. Sorry.
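Since the setting does not survive a reboot, anyone relying on it would have to re-apply and re-check it after every boot. The check can be scripted against the `smartctl -i` output quoted above. A minimal sketch; the helper names are hypothetical, and the check only looks for the `SMART support is: Disabled` line shown in that output.

```sh
# Succeed if `smartctl -i` output read from stdin reports SMART as disabled.
smart_is_disabled() {
    grep -q '^SMART support is: *Disabled'
}

# Turn SMART off on a device (e.g. /dev/sda), then confirm it took effect.
# Needs root; per the comment above, the setting is lost on reboot.
disable_smart() {
    smartctl --smart off "$1" > /dev/null && smartctl -i "$1" | smart_is_disabled
}
```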

Revision history for this message
Rick @ rickandpatty.com (rick-rickandpatty) wrote :

@Raf

As you say, that won't survive a reboot, but your workaround in #172 might be a good thing to add between steps 2 (zeroing the drive) and 3 (starting the installer) of the Karmic installation procedure in #147. Simply starting up the partitioner during installation will trigger the bug and trash some SSDs, like the original factory SSD in my Asus Eee 701 8G.

Revision history for this message
Guy Taylor (thebiggerguy) wrote :

Hi all
So far, my research has found that this affects only:

- Intel Z-P230 (model number SSDPAMM0008G1): affected firmware unknown; includes "Ver2.J0H" and "Ver2.I0K"
- Seagate STEC PATA 8GB: affected firmware unknown; includes "D5221-10"
- Unknown vendor, Flash Module: affected firmware unknown; includes "Ver3.P0B"

Could people with the problem please run "sudo hdparm -i /dev/sda" (replacing the sda with the problematic drive) to confirm this or flag any other drives and or identify 'fixed' firmware.
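When posting `hdparm -i` output for this list, the two fields that matter are the model and the firmware revision. A minimal sketch, assuming the usual ` Model=..., FwRev=..., SerialNo=...` line that `hdparm -i` prints; `hdparm_model_fw` is a hypothetical name, and the serial number is deliberately dropped so it need not be posted publicly.

```sh
# Print "MODEL | FWREV" from `hdparm -i` output read from stdin.
# Usage: sudo hdparm -i /dev/sda | hdparm_model_fw
hdparm_model_fw() {
    awk -F', *' '/Model=/ {
        model = ""; fw = ""
        for (i = 1; i <= NF; i++) {
            f = $i
            sub(/^ +/, "", f)                        # drop leading indentation
            if (f ~ /^Model=/) model = substr(f, 7)  # text after "Model="
            if (f ~ /^FwRev=/) fw = substr(f, 7)     # text after "FwRev="
        }
        print model " | " fw
    }'
}
```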

Thank you

Revision history for this message
Guy Taylor (thebiggerguy) wrote :

Sorry all for the formatting. I have attached a text file instead so you can actually read it.

Revision history for this message
GenericAnimeBoy (souletech) wrote :

My hdparm is attached. The SSD in question is the aftermarket 16GB Super Talent SSD everyone's talking about, and it looks like the firmware version already appears on your list.

FWIW, I've only had minor issues with this one: it would occasionally hang during boot and cause several GNOME applets to fail to start. I had an Intel SSDPAMM0008G, the original SSD in this netbook [Acer Aspire One ZG5], which cratered the first time I booted 9.10 from it. I guess I know now why that happened.

Revision history for this message
shadowblast101 (shadowblast101) wrote :

Here's the Patriot 32GB SSD that a couple of us have. I have had this one completely corrupted on both Ubuntu and Arch before applying the patch.

Revision history for this message
MFV (mfv) wrote :

Asus EEE 900 (orig Celeron model)

Phison devices.

Revision history for this message
LarryGrover (lgrover) wrote :

Acer Aspire One with replacement Super Talent 32 GB SSD. I experienced stalls during boot up and HSM error messages in logs, but no data corruption. Drive info attached.

Revision history for this message
Raf (4283534-noduck) wrote :

I have the same Super Talent 32 GB SSD, but it identifies itself simply as 'Flash' (see attached output). I think the easiest way to implement a blacklist will be in the udev rules, so I included the output of 'udevadm info --query=all --path /block/sda'.

@LarryGrover: I have the same device. While debugging this problem (repeated runs of devkit-disks-probe-ata-smart) I did get (non-permanent) disk corruption. fsck placed several files in /lost+found!

Revision history for this message
Samizdata (samizdata) wrote :

I have the SSDPAMM0008G1, FwRev=Ver2.I0K SSD in my Acer Aspire One, but I am including the data for the sake of completeness. I have not had it brick, but I did have problems with timing out and the "lost+found" data corruption. I am currently running with the "quick", non-redirect workaround successfully. I have also attached both sets of data mentioned.

Revision history for this message
ipig (infopiggy) wrote :

Stock Asus EEE 900 (Celeron) / White 12G?

SSDs: Phison / 4GB (Primary) - 8GB (2ndary)

- ~Same hardware as post #179 -

On remix 9.10/32 & regular 9.10/32 - HSM Violation

(Used) Post #147 (upon 3rd install!) fixed issues.

- Had disk utility crash on 3rd/fix install during format. Hoping it's not bad blocks?

Revision history for this message
MFV (mfv) wrote :

The same fix in Lucid requires editing 80-udisks.rules. Search for 'smart' and look for the entry.

Revision history for this message
basily (basily) wrote :

A me too... Acer Aspire One with replacement Super Talent 32 GB SSD, UNR 9.10. I had very slow boot times, but not data corruption. I have successfully implemented the work around in posting #147. Now boot times are around 27 seconds to full desktop and wifi connection.

Revision history for this message
Alain SAURAT (maisondouf) wrote :

I have a similar problem with an EeePC 701 upgraded with two SSDs.

The notebook originally came with a 4GB SSD on the motherboard; Jaunty, Karmic and Lucid initialize it in UDMA66 mode without any problems, and the read speed is around 30Mb/s.

I upgraded this notebook by adding a 32GB PCIe PATA SSD in the expansion connector.
The BIOS correctly recognizes the internal SSD as Secondary Master and the PCIe SSD as Secondary Slave.

Now the kernel spends three 30-second timeouts downgrading the ATA protocol from UDMA66 to PIO4 for the internal 4GB SSD.
The PCIe SSD is used directly in UDMA66 mode.

After booting, the read speed is 3Mb/s on the internal SSD and 40Mb/s on the PCIe SSD.

As I read here, I tried adding "libata.dma=0" to the grub.cfg file. In this case the timeouts disappear, but read speeds are very low on all disks.

To avoid the timeouts and keep a good read speed on /dev/sdb, can I deactivate UDMA mode only on /dev/sda, and how?

PS: I tried WinXP from USB; there are no timeouts during boot and the speeds are the same (internal 4Mb/s, PCIe 40Mb/s).

Revision history for this message
Alain SAURAT (maisondouf) wrote :

Whouauuuhhh !

With the "libata.force=2.00:pio4" option, the kernel initializes the internal SSD directly in PIO mode, so there are no timeouts.

Option found in http://www.kernel.org/doc/Documentation/kernel-parameters.txt
Syntax found in http://docs.blackfin.uclinux.org/kernel/generated/libata/

Read speed stays very low on the internal SSD, but that doesn't matter for me.

Revision history for this message
Richard Ayotte (rich-ayotte) wrote :

Setting libata.force=noncq fixed it for me. Disabling SMART as described in the workaround or forcing pio4 had no effect.

Hardware: Acer Aspire One

Here's what I did.

sudo gedit /etc/default/grub

Change the line that says:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
to
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash libata.force=noncq"

sudo update-grub

reboot.
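The same edit could be scripted so that it is never applied twice. A minimal sketch, assuming the stock `GRUB_CMDLINE_LINUX_DEFAULT="..."` line shown above and GNU sed; `add_kernel_param` is a hypothetical name, and `update-grub` still has to be run afterwards. On machines with several drives, the kernel's kernel-parameters documentation describes `libata.force` values as taking an optional `PORT[.DEVICE]:` prefix (the form used with `2.00:pio4` above), so something like `libata.force=2.00:noncq` should limit the change to one device; that variant is untested here.

```sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults
# file (normally /etc/default/grub) unless it is already there.
# Usage (as root): add_kernel_param /etc/default/grub libata.force=noncq && update-grub
add_kernel_param() {
    file=$1 param=$2
    # Already present: nothing to do. (Dots in $param match any character,
    # which is loose but harmless for this check.)
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=\".*$param" "$file"; then
        return 0
    fi
    # Insert " $param" before the closing quote; keep the original as $file.orig.
    sed -i.orig "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}
```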

Revision history for this message
Vishal Rao (vishalrao) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

Aha! So I've found another user needing the same workaround!

I had mentioned this in this bug and in another one, sent a patch to LKML (which wasn't safe enough to go in), and also blogged about it: http://lahsiv.net/blog/?p=47

On 15 March 2010 02:10, Richard Ayotte <email address hidden> wrote:

--
"Thou shalt not follow the null pointer for at its end madness and chaos lie."

Revision history for this message
Vishal Rao (vishalrao) wrote :
Revision history for this message
Steve (sjc-carpanet) wrote :

I do not have an SSD, but I do have the exact same errors:

[ 3252.000066] ata1: lost interrupt (Status 0x58)
[ 3252.004027] ata1: drained 32768 bytes to clear DRQ.
[ 3252.091644] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 3252.091669] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[ 3252.091672] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 3252.091675] res 58/00:01:00:12:00/00:00:00:00:00/b0 Emask 0x2 (HSM v
[ 3252.091684] ata1.01: status: { DRDY DRQ }
[ 3252.091726] ata1: soft resetting link
[ 3252.332346] ata1.00: configured for UDMA/100
[ 3252.348516] ata1.01: configured for MWDMA2
[ 3252.348839] ata1: EH complete

It looks like the problem might be with the CD-ROM drive?

Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: ST96812A Rev: 3.05
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: MATSHITA Model: UJDA775 DVD/CDRW Rev: 1.00
  Type: CD-ROM ANSI SCSI revision: 05

In any case, I applied the workaround in question, and it definitely happens less often, but still happens to me pretty frequently.

Revision history for this message
Paede (patrick-steiner-gmx) wrote :

@Steve

I also have the same problem on a normal IDE disk. What type of notebook do you have? I also have the same type of CD-ROM drive:

Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: WDC WD2500BEVE-0 Rev: 01.0
  Type: Direct-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 01 Lun: 00
  Vendor: MATSHITA Model: UJ-822Da Rev: 1.02
  Type: CD-ROM ANSI SCSI revision: 05

This error happened to me when I moved my laptop. I don't know if it is caused by the HP Mobile Data Protection System.

Revision history for this message
Raf (4263004-noduck) wrote :

The first beta of Lucid was released and we have not yet found a solution for the HSM violations and corruption. I would really hope that we can find a way to fix this before the final release.

A proper solution would be a patch for libatasmart, but I have not seen any progress.

A first workaround would be to disable the use of udisks-probe-ata-smart in 80-udisks.rules. I haven't found anything using the result of the SMART test (ID_ATA_FEATURE_SET_SMART, ID_ATA_FEATURE_SET_SMART_ENABLED, UDISKS_ATA_SMART_IS_AVAILABLE). And I don't think they are documented.

An alternative workaround would create a blacklist in 80-udisks.rules, so that SMART test is not run on the devices identified above (and possible others).
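The blacklist idea could start from a lookup like the following, which a wrapper or udev helper could consult before running the probe. This is only a sketch: the function name is hypothetical, and the list holds just the two model strings quoted verbatim in this thread (the Intel SSDPAMM0008G1 and the Super Talent drive that reports itself as `Flash`). The Seagate/STEC, Patriot and Phison entries would need their exact model strings from the posted `hdparm -i` output before they could be added. Exact matching keeps the list from catching unrelated drives, though a name as generic as `Flash` may still do so.

```sh
# Hypothetical blacklist: succeed if the given model string (for example the
# Model= field of `hdparm -i`) belongs to a drive reported in this bug.
is_blacklisted_model() {
    case "$1" in
        SSDPAMM0008G1) return 0 ;;  # Intel Z-P230 8GB
        Flash)         return 0 ;;  # Super Talent 32GB, reports only "Flash"
        *)             return 1 ;;
    esac
}
```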

I would like to know if the developers are willing to accept either of these workarounds.

Revision history for this message
Raf (4263004-noduck) wrote :

I should have written:

I would like to know if the *maintainers* are willing to accept either of these workarounds.

Revision history for this message
ipig (infopiggy) wrote :

(Ref Post #183)

This bug gave me hell this weekend.

I had 9.10 running fine all month with the Netbook Remix.

I decided to nuke my netbook and later put 9.10 back on.

Upon installing 9.10, I forgot about the timing described in post #147. I finished the installation and was confronted with HSM violations (again).

After the install I attempted the fix in post #188. That fix did not work. When I tried to undo #188, I found I could no longer write (save) to the disk, so I could not undo the changes.

I decided to do another install, this time not forgetting to apply #147 before rebooting. I was then unable to finish any installation normally (I never got to that point).

Installations would hang at 38% (copying files, IIRC), and I noticed the HD LED would begin flashing at a steady rate (about once every 1/2 second). I tried 3 installs, partitioning (even with slightly different sizes) and formatting. Nothing made a difference.

I also tried the dd command (post #147) in between, but I don't think I did it correctly, as it only took about 7-10 minutes (on 8 GB). I believe I ran it on a partition instead of the whole disk.

Later on I noticed that hitting the power button (at the 38% hang) dropped me to a screen showing that **the machine was in an infinite loop of HSM violation errors** (over and over and over), in sync with the flashing HD light.

It seemed that, regardless of post #147, the bug affects the machine earlier than that. Either that, or the bug had been dragging along all this time.

I also tried using a separate GParted live CD to format and partition, and it made no difference: installations still failed.

Literally a day later I ran the dd command again (post #147), this time correctly, and it took about 1.5 hours (I ran it from an 8.04 live CD).

I had a feeling things would go differently then, and they did. I managed to get 9.10 installed. HOWEVER, I still got a disk utility crash (I believe during formatting; it's difficult to tell when it occurs because all it does is put a small red icon in the taskbar). I believe I applied #147 correctly (hell, I'd done it before).

So I was thinking 'finally', this has been dealt with.

Wrong.

After installation I realized that, for some reason, I was unable to install any packages or updates, or seemingly anything. I could apparently write to the disk OK, but reading/installing packages and updates resulted in input/output errors displayed in the terminal/details.

I battled with this for a while, but then I just gave up. F'it.

I'm willing to put in 4-6 hours, but once it starts pushing beyond that, people just can't be expected to deal with this. I was quite angry by the time I gave up, and I am still a bit disgusted with this. No doubt I've spent 8 or more hours dealing with this problem in some way.

I am not pleased to hear there's no fix in Lucid.

If the difference between getting a patch worked on and not is me packing up my netbook and shipping it off, then I might be willing. I am a fairly loyal Eee PC fan/user: when I think of netbooks, I think 'Eee PC'.

What's bothersome is the supposed Netbook Remix edition. What part of 'netbook' didn't include Eee PCs?

Revision history for this message
ipig (infopiggy) wrote :

Here's a surprise. I installed beta 10.04 LTS yesterday & it seems this problem doesn't exist. Things have generally been pretty good. Hopefully the full release keeps it up!

Revision history for this message
ipig (infopiggy) wrote :

Spoke too soon. On a reboot of the machine it got caught in a bunch of HSM violations. Ugh, I'd just started enjoying 10.04. Same issue :( Not sure how well it's going to start up at the moment.

Revision history for this message
ipig (infopiggy) wrote :

Here's what I decided to do with beta 10.04:

- Deleted partitions via GParted on the 10.04 live CD, applied, did not re-create them, quit

- Ran: sudo dd if=/dev/zero of=/dev/sda bs=1M (post #147)

- Started and finished the 10.04 install from the live CD; left the reboot prompt open

- Applied steps #5 -> * (in post #147)

- Did NOT reboot; kept the terminal open

- While still in /target, deleted the SMART section from 80-udisks.rules (re: #161) *

- Finished; will check in later

* While in 80-udisks.rules, I noticed that the line specified by #161 did not seem to be commented out.

Revision history for this message
ideathproof (glenn-immortal) wrote :

I have just installed 10.04 and applied ipig's suggestion from #198; it's working fine so far.

Martin Pitt (pitti)
Changed in libatasmart (Ubuntu Lucid):
assignee: nobody → Martin Pitt (pitti)
status: Triaged → In Progress
Martin Pitt (pitti)
Changed in devicekit-disks (Ubuntu Karmic):
status: New → Invalid
Revision history for this message
Martin Pitt (pitti) wrote :

I'll disable the probing in karmic for now; it's not really critical for the system to work, it will just disable the warnings that you'll get for potential disk failures from SMART. But that's much better than the current situation.

I will check the smartmontools code to see what they do differently. The problem does not happen on my two computers (one HDD, one SSD), so I'd really appreciate it if someone affected could give me ssh access to such a system (second key on https://launchpad.net/~pitti/+sshkeys).

With that we at least have some more time to figure out a proper fix in libatasmart. I'll start with pursuing the IDENTIFY PACKET DEVICE path as suggested by Jean-Louis, thanks for that!

Changed in devicekit-disks (Ubuntu Karmic):
assignee: nobody → Martin Pitt (pitti)
importance: Undecided → Critical
status: Invalid → In Progress
Martin Pitt (pitti)
Changed in libatasmart (Ubuntu Karmic):
status: New → Triaged
Steve Langasek (vorlon)
Changed in devicekit-disks (Ubuntu Karmic):
status: In Progress → Fix Committed
tags: added: verification-needed
Revision history for this message
Martin Pitt (pitti) wrote :

Accepted devicekit-disks into karmic-proposed, the package will build now and be available in a few hours.

Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Revision history for this message
In , Martin Pitt (pitti) wrote :

I got ssh access to an affected machine and finally tracked this down. I also compared it to the ioctls that smartmontools do.

My raw notes with all the ioctl stracing, bisecting, etc. are in https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202 .

Summary: It seems READ_THRESHOLDS without READ_DATA (or a different ioctl like RETURN_STATUS) causes this problem, the drive "wants" to send more data which is never flushed. Possible explanation: https://bugzilla.kernel.org/show_bug.cgi?id=14583#c25

http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=a223a4f6277a9f006b722b13671d5292dc6339bb fixed this more or less inadvertently, which explains why we don't see that problem on our development releases.

Quite obviously from the commit, sk_disk_open() called sk_disk_smart_read_thresholds(), but not sk_disk_smart_read_data().
udisks-probe-ata-smart and skdump --can-smart just call sk_disk_open() and sk_disk_smart_is_available() (the latter does not do any I/O itself, just tests a flag).
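The difference between the two libatasmart versions can be modelled as the sequence of SMART-related ATA commands that a bare probe sends. The sketch below is a hypothetical Python model, not libatasmart code. The command sequences are taken from the strace notes in comment #202: 0.16 sends IDENTIFY followed by READ THRESHOLDS, while the Lucid version sends only IDENTIFY. `leaves_thresholds_unflushed()` flags the pattern described above, where READ THRESHOLDS is never followed by READ DATA or RETURN STATUS.

```python
# Hypothetical model of what a bare probe (sk_disk_open() followed by
# sk_disk_smart_is_available(), which does no I/O) sends to the drive.
IDENTIFY = "IDENTIFY DEVICE"
READ_THRESHOLDS = "SMART READ THRESHOLDS"
READ_DATA = "SMART READ DATA"
RETURN_STATUS = "SMART RETURN STATUS"


def probe_commands_0_16():
    # libatasmart 0.16: sk_disk_open() eagerly read the thresholds,
    # but not the data.
    return [IDENTIFY, READ_THRESHOLDS]


def probe_commands_after_a223a4():
    # After a223a4: the SMART initialisation runs lazily, so opening
    # the disk only identifies it.
    return [IDENTIFY]


def leaves_thresholds_unflushed(commands):
    """True if a READ_THRESHOLDS is not followed by READ_DATA or
    RETURN_STATUS, the pattern linked to the HSM violations."""
    pending = False
    for cmd in commands:
        if cmd == READ_THRESHOLDS:
            pending = True
        elif cmd in (READ_DATA, RETURN_STATUS):
            pending = False
    return pending


for name, cmds in [("0.16", probe_commands_0_16()),
                   ("after a223a4", probe_commands_after_a223a4())]:
    print(name, cmds, "hazard" if leaves_thresholds_unflushed(cmds) else "ok")
```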

So while a223a4 fixes this for the "common" use cases, there might still be situations where thresholds are read, but not the values. Let's look where init_smart() (the only place reading thresholds) is called:

 * sk_disk_smart_read_data(): OK, does READ_DATA

 * sk_disk_smart_status(): Does SK_SMART_COMMAND_RETURN_STATUS after init_smart(), confirmed to work

 * sk_disk_smart_self_test(): OK, calls sk_disk_smart_read_data()

So right now, all code paths work.

However, a potentially more robust solution might be to make init_smart() call sk_disk_smart_read_data() right after sk_disk_smart_read_thresholds(). This would cause data_is_valid to already be TRUE for self_test() (and thus not change behaviour). sk_disk_smart_read_data() could test the flag to avoid reading the data twice. For sk_disk_smart_status() this would mean an additional unused READ_DATA call, though that might not hurt too much.
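The robust variant suggested above can be sketched as a small hypothetical Python model of an SkDisk that records commands instead of issuing SG_IO ioctls. In this model `init_smart()` issues READ_DATA immediately after READ_THRESHOLDS. A one-shot flag (the hypothetical `data_from_init`, standing in for the data_is_valid test) lets the next `smart_read_data()` skip a duplicate read. The method names only loosely mirror the C functions and are assumptions, not the libatasmart API.

```python
class ModelDisk:
    """Hypothetical stand-in for an SkDisk: records the ATA commands it
    would send instead of issuing real SG_IO ioctls."""

    def __init__(self):
        self.sent = []
        self.smart_initialized = False
        self.data_from_init = False  # hypothetical one-shot flag

    def _read_data(self):
        self.sent.append("READ_DATA")

    def init_smart(self):
        # Proposed change: never leave READ_THRESHOLDS as the last SMART
        # command; read the data right after it.
        if self.smart_initialized:
            return
        self.sent.append("READ_THRESHOLDS")
        self._read_data()
        self.data_from_init = True
        self.smart_initialized = True

    def smart_read_data(self):
        self.init_smart()
        if self.data_from_init:
            # init_smart() just fetched the data; don't read it twice.
            self.data_from_init = False
            return
        self._read_data()

    def smart_status(self):
        self.init_smart()
        self.sent.append("RETURN_STATUS")

    def smart_self_test(self):
        # Like sk_disk_smart_self_test(), read the data first.
        self.smart_read_data()
        self.sent.append("SELF_TEST")


disk = ModelDisk()
disk.smart_status()
print(disk.sent)
```

In this model `smart_status()` sends READ_THRESHOLDS, READ_DATA, RETURN_STATUS, which is the extra unused READ_DATA mentioned above. Because the flag is one-shot, a second `smart_read_data()` call reads fresh data again instead of returning the cached result forever.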

Revision history for this message
Martin Pitt (pitti) wrote :

Alan Pope kindly provided ssh access to his affected machine, and I analyzed this in detail.

I put my raw notes here for having a permanent record. I'll follow up with a more human-readable status in the next comment, so unless you are interested in the technical details, you can safely ignore this long post.

Jean-Louis' theory: check PACKET Command feature
------------------------------------------------

 * Both of my computers can do SMART just fine, but both also succeed with IDENTIFY_PACKET_DEVICE and deliver real data
 * An affected machine responds to SMART commands just fine with current Ubuntu 10.04 beta-1 (and deliver sensible results), so they can do SMART

libatasmart 0.17+git20100219-1git2, udisks 1.0.0 (Ubuntu 10.04)
------------------------------------------------

WORKS: # strace -e ioctl /lib/udev/udisks-probe-ata-smart /dev/sda
ioctl(3, BLKGETSIZE64, 0x9525014) = 0
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[16]=[85, 08, 2e, 00, 00, 00, 01, 00, 00, 00, 00, 00, 00, 00, ec, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=2000, flags=0, data[512]=["J\4\212\36\0\0\20\0\0~\0\2?\0x\0`?\0\0 "...], status=02, masked_status=01, sb[22]=[72, 00, 00, 00, 00, 00, 00, 0e, 09, 0c, 00, 00, 00, ff, 00, 00, 00, 00, 00, 00, 00, 50], host_status=0, driver_status=0x8, resid=0, duration=0, info=0x1}) = 0
UDISKS_ATA_SMART_IS_AVAILABLE=1

WORKS: # skdump /dev/sda

WORKS: # udisks --ata-smart-wakeup --ata-smart-refresh /dev/sda

libatasmart 0.17+git20100219-1git2, dk-disks 007
------------------------------------------------

WORKS: strace -e ioctl ./devkit-disks-probe-ata-smart /dev/sda
ioctl(3, BLKGETSIZE64, 0x9686014) = 0
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[16]=[85, 08, 2e, 00, 00, 00, 01, 00, 00, 00, 00, 00, 00, 00, ec, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=2000, flags=0, data[512]=["J\4\212\36\0\0\20\0\0~\0\2?\0x\0`?\0\0 "...], status=02, masked_status=01, sb[22]=[72, 00, 00, 00, 00, 00, 00, 0e, 09, 0c, 00, 00, 00, ff, 00, 00, 00, 00, 00, 00, 00, 50], host_status=0, driver_status=0x8, resid=0, duration=4, info=0x1}) = 0
DKD_ATA_SMART_IS_AVAILABLE=1

libatasmart 0.16, udisks 1.0.0
------------------------------

FAILS: LD_LIBRARY_PATH=/home/pitti/libatasmart-karmic/lib/ strace -e ioctl /lib/udev/udisks-probe-ata-smart /dev/sda
ioctl(3, BLKGETSIZE64, 0x9fef014) = 0
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[16]=[85, 08, 2e, 00, 00, 00, 01, 00, 00, 00, 00, 00, 00, 00, ec, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=2000, flags=0, data[512]=["J\4\212\36\0\0\20\0\0~\0\2?\0x\0`?\0\0 "...], status=02, masked_status=01, sb[22]=[72, 00, 00, 00, 00, 00, 00, 0e, 09, 0c, 00, 00, 00, ff, 00, 00, 00, 00, 00, 00, 00, 50], host_status=0, driver_status=0x8, resid=0, duration=25768, info=0x1}) = 0
ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[16]=[85, 08, 2e, 00, d1, 00, 01, 00, 00, 00, 4f, 00, c2, 00, b0, 00], mx_sb_len=32, iovec_count=0, dxfer_len=512, timeout=2000, flags=0, data[512]=["\20\0\350\2\0\0\0\0\0\0\0\0\0\0\351\0\0\0\0\0\0\0\0\0\0\0\352\0\0\0\0\0"...], status=02, masked_status=01, sb[22]=[72, 00, 00, 00, 00, 00, 00, 0e, 09, 0c, 00, 00, 00, 00, ...
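The `cmd[16]` arrays in these strace lines are SCSI/ATA Translation (SAT) ATA PASS-THROUGH(16) CDBs with opcode 0x85. In that layout, byte 4 carries the ATA features register, bytes 10 and 12 carry the LBA mid/high registers, and byte 14 carries the ATA command. The hypothetical Python decoder below assumes that layout. It shows that every run sends IDENTIFY DEVICE (0xEC), while the failing libatasmart 0.16 run follows it with SMART (0xB0) feature 0xD1, READ THRESHOLDS.

```python
# Names for the few ATA commands and SMART subcommands relevant here.
ATA_COMMANDS = {0xEC: "IDENTIFY DEVICE", 0xA1: "IDENTIFY PACKET DEVICE"}
SMART_FEATURES = {
    0xD0: "SMART READ DATA",
    0xD1: "SMART READ THRESHOLDS",
    0xDA: "SMART RETURN STATUS",
}


def decode_ata_pt16(cdb_text: str) -> str:
    """Decode the '[85, 08, ...]' bytes from an strace SG_IO line,
    assuming the SAT ATA PASS-THROUGH(16) CDB layout."""
    cdb = [int(b, 16) for b in cdb_text.strip("[]").split(",")]
    if len(cdb) != 16 or cdb[0] != 0x85:
        raise ValueError("not an ATA PASS-THROUGH(16) CDB")
    features, command = cdb[4], cdb[14]
    lba_mid, lba_high = cdb[10], cdb[12]
    if command == 0xB0:
        # SMART commands carry the 0x4F/0xC2 signature in LBA mid/high.
        if (lba_mid, lba_high) != (0x4F, 0xC2):
            return "SMART (bad signature)"
        return SMART_FEATURES.get(features, f"SMART feature 0x{features:02X}")
    return ATA_COMMANDS.get(command, f"ATA command 0x{command:02X}")


for cdb in ("[85, 08, 2e, 00, 00, 00, 01, 00, 00, 00, 00, 00, 00, 00, ec, 00]",
            "[85, 08, 2e, 00, d1, 00, 01, 00, 00, 00, 4f, 00, c2, 00, b0, 00]"):
    print(decode_ata_pt16(cdb))
```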

Revision history for this message
Martin Pitt (pitti) wrote :

So in summary, the problem is fixed in the lucid version of libatasmart. While the code could be a little more robust for future extensions (which I'll discuss in the upstream bug), there are currently no code paths which can lead to the situation that triggers HSM violations.

Changed in libatasmart (Ubuntu Lucid):
status: In Progress → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote :

For Karmic we can backport http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=a223a4f6277a9f006b722b13671d5292dc6339bb to fix this properly. If we do this, we should also apply http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=54f846c2115e7addf5468a9c10ecf9ba844b946e on top, to avoid exporting this as a new symbol.

It just moves some initialization code into a new function and calls this lazily. It does not change any API/ABI. It has been tested a long time in lucid and should be fairly safe.

However, I'd like to keep the current workaround in devicekit-disks in karmic-proposed for now (please test that this properly disables SMART probing). I'd like to hear some more confirmations from affected people here that things indeed work fine with Lucid beta-1 on a variety of hardware platforms before re-enabling smart probing and this patch in karmic again.

Thank you, and sorry for the trouble that this caused!

Changed in libatasmart (Ubuntu Karmic):
assignee: nobody → Martin Pitt (pitti)
importance: Undecided → Medium
description: updated
Revision history for this message
Trey (trey333) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

I have two bricked Eee 900 SSD chips with this problem. I've been booting
her off an SD card. Can the onboard SSDs rise from the dead now?

On 3/26/10, Martin Pitt <email address hidden> wrote:
> For Karmic we can backport
> http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=a223a4f6277a9f006b722b13671d5292dc6339bb
> to fix this properly. If we do this, we should also apply
> http://git.0pointer.de/?p=libatasmart.git;a=commitdiff;h=54f846c2115e7addf5468a9c10ecf9ba844b946e
> on top, to avoid exporting this as a new symbol.
>
> It just moves some initialization code into a new function and calls
> this lazily. It does not change any API/ABI. It has been tested a long
> time in lucid and should be fairly safe.
>
> However, I'd like to keep the current workaround in devicekit-disks in
> karmic-proposed for now (please test that this properly disables SMART
> probing). I'd like to hear some more confirmations from affected people
> here that things indeed work fine with Lucid beta-1 on a variety of
> hardware platforms before re-enabling smart probing and this patch in
> karmic again.
>
> Thank you, and sorry for the trouble that this caused!
>
> ** Changed in: libatasmart (Ubuntu Karmic)
> Importance: Undecided => Medium
>
> ** Changed in: libatasmart (Ubuntu Karmic)
> Assignee: (unassigned) => Martin Pitt (pitti)
>
> ** Description changed:
>
> - TEMPORARY WORK AROUND FOR THIS PROBLEM:
> + TEMPORARY WORK AROUND FOR THIS PROBLEM IN KARMIC: (This is now also in
> + karmic-proposed and needs testing feedback):
>
> 1. sudo gedit /lib/udev/rules.d/95-devkit-disks.rules
>
> 2. locate the following lines (about 1/3 the way into the file; search
> for "smart")
>
> # ATA disks driven by libata
> KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata",
> ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart
> $tempnode"
>
> 3. comment out the second line by adding a # in front, so you should
> have
>
> # ATA disks driven by libata
> #KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata",
> ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart
> $tempnode"
>
> 4. save the file and reboot
>
> + TECHNICAL ANALYSIS:
> https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202
> + LUCID STATUS:
> https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/203
> + KARMIC SOLUTION:
> https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/204
>
> BUG DESCRIPTION FOLLOWS:
>
> In the Karmic beta I experience ssd stalls during the boot process. It
> happens almost everytime before xsplash loads and happens again
> frequently between logging into gdm and the desktop loading. When it
> happens during login I think it is making gnome time out on loading
> panel items as I get errors related to lots of panel items failing to
> load. If I log out and back in again when the ssd isn't stalled the
> panel items load fine.
>
> When it happens the following messages appear before xplash (or in dmesg
> when it happens after gdm):
>
> ata2: lost interrupt (Status 0x58)
> ata2: drained 16384 bytes to cle...

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

@Trey, technically they're not 'bricked'. You can revive them fairly easily. I revived two (including the one Martin logged into) by using dd to copy zeroes over the entire SSD. Once done I did a Jaunty install (this was a few months ago) and upgraded to karmic, but before rebooting to the new upgraded karmic install I did what's recommended in the description (as per comment #145).

It's been running Karmic fine for ages. According to Martin, Lucid does not suffer from this problem, so you could dd zeroes, then install the Lucid beta and be safe.

Revision history for this message
adamski (adam-hasselbalch) wrote :

Alan: My Eee 900 is bricked. I have dd'ed zeroes over the SSD several times, and while I am no longer getting HSM violations (since I am using an 8.04 rescue image), I now get Buffer I/O errors galore on the device. A full dd takes roughly 12 hours (that's the 16 gig drive) due to all these errors, and has no effect on them. I have dd'ed it probably three or four times now, and there's no apparent improvement.

I have given up and purchased a 1008HA with a regular disk on it instead. That works like a charm, though.

In other news: Eee 900 with 2G RAM for sale. :)

Revision history for this message
Alan Pope 🍺🐧🐱 🦄 (popey) wrote :

@adamski - What did you boot from to do the dd?

Revision history for this message
adamski (adam-hasselbalch) wrote :

@Alan: both 8.04 (which was what was on the thing until I foolishly reinstalled it) and 9.04. I also tried a mini rescue distro of some sort, although I don't remember which one.

Also tried a Solaris-thing, as someone mentioned above, but I didn't have the patience at that time (was late, and I'd been at it for a couple hours) to make the USB boot, which took non-trivial effort (i.e. it didn't "just work").

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

Alan Pope [2010-03-26 10:17 -0000]:
> to copy zeroes over the entire SSD

For those less accustomed to the command line:

 * Boot a Jaunty or Lucid Beta-1 desktop CD.
 * Start gparted to find the right drive. It should usually be
   /dev/sda, but it could also be /dev/sdb if you have more than one
   hard disk
 * Open a Terminal, and do

    sudo dd if=/dev/zero of=/dev/sda

   (Replace "sda" with the actual drive, if you have several).

Please note that this IRREVOCABLY ERASES ALL DATA. So please make
double and triple sure that you are not overwriting that other hard
disk, or the USB disk you just attached. To be on the safe side,
disconnect all USB storage before you do this.

Martin

--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)
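Running dd on a partition instead of the whole disk is an easy mistake (ipig describes doing exactly that above), so a quick check before zeroing can help. The hypothetical `is_whole_disk` helper below is a minimal sketch that assumes the usual Linux sysfs layout, in which each block device appears under /sys/class/block/ and only partitions carry a `partition` attribute file. It does not replace double-checking the device in GParted.

```python
import os


def is_whole_disk(name: str, sysfs: str = "/sys") -> bool:
    """Return True if `name` (e.g. 'sda') is a whole block device rather
    than a partition such as 'sda1'.

    Assumes only partitions have a 'partition' attribute under
    /sys/class/block/<name>.
    """
    dev_dir = os.path.join(sysfs, "class", "block", name)
    if not os.path.isdir(dev_dir):
        raise FileNotFoundError(f"no block device named {name!r}")
    return not os.path.exists(os.path.join(dev_dir, "partition"))


if __name__ == "__main__":
    target = "sda"  # replace with the drive identified in GParted
    if is_whole_disk(target):
        print(f"/dev/{target} is a whole disk")
    else:
        print(f"/dev/{target} is a partition; do not zero it by mistake")
```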

Revision history for this message
Guy Taylor (thebiggerguy) wrote :

@Trey
Have you tried using the ATA 'Secure Erase' command (see: https://ata.wiki.kernel.org/index.php/ATA_Secure_Erase)? This tells the disk drive's controller to do the resetting, allowing data that dd cannot reach to be reset. Also, as a byproduct, it will reset the SSD's speed back to the performance you had on day one (see the link above for more info).
Hope this works.

Revision history for this message
Steve Beattie (sbeattie) wrote :

Martin, I can confirm that the version of devicekit-disks in karmic-proposed, 007-2ubuntu6, has the commented out line in /lib/udev/rules.d/95-devkit-disks.rules. After installing the package from proposed, my system continues to boot, mount disks properly, and usb sticks continue to automount onto the desktop. I don't have an SSD drive so I can't confirm that the HSM violations no longer occur (though if someone wants to send me one, I'll happily test :-) ).

Martin Pitt (pitti)
tags: added: verification-done
removed: verification-needed
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package devicekit-disks - 007-2ubuntu6

---------------
devicekit-disks (007-2ubuntu6) karmic-proposed; urgency=low

  * Add 11-disable-smart-probing.patch: Disable ATA SMART probing on ATA
    disks. It causes hardware damage to a lot of SSD disks. This is a
    workaround, until a real fix in libatasmart is found. (LP: #445852)
 -- Martin Pitt <email address hidden> Thu, 25 Mar 2010 18:47:35 +0100

Changed in devicekit-disks (Ubuntu Karmic):
status: Fix Committed → Fix Released
Revision history for this message
Jarige (jarikvh) wrote :

I received a workaround (not a fix!) for this bug today through update-manager, although I wasn't experiencing this bug that badly. It definitely improved boot time :D

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

Confirming the fix in Karmic.

New file arrived through update-manager today.

I removed my existing dpkg-divert, rebooted and tested. No sign of error messages in dmesg. Previously with this machine I would have had error messages. That's good :-)

Revision history for this message
Samizdata (samizdata) wrote :

Confirming fix in Karmic UNR. Received via Update Manager. No errors seen and performance seems good. Manually confirmed presence of the workaround.

Revision history for this message
Samizdata (samizdata) wrote :

Oh, Acer Aspire One with the SSDPAM device.

Revision history for this message
ideathproof (glenn-immortal) wrote :

Confirming on UNR 10.04 (Lucid): after running Update Manager, the workaround is still in place (no need to edit 80-udisks.rules) on an Asus Eee 900 XP 12G version. Was this workaround released for Lucid?

And boot times are the fastest I've seen: a cold boot from the power LED coming on to the desktop takes 35 seconds, and shutdown takes 11 seconds. How can it shut down so fast?

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Also confirming the software update to 9.10 Karmic NBR that came through today. No trouble; no foolin! ;-)

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

OH and to be clear -- I DID "undo" the workaround as described in the last two lines of comment 147 https://bugs.launchpad.net/ubuntu/+bug/445852/comments/147

Revision history for this message
ectropionized (ectropionized-deactivatedaccount) wrote :

Applied fix (via Update-Manager), confirmed - no errors. (Intel 4GB SSD, UNR 9.10). All is well in the land of milk and hardware.

Revision history for this message
Trey (trey333) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

I've got my old bricked Asus Eee 900 at my house for the weekend. Both
SSDs are ostensibly dead. I've loaded Ubuntu 10.04 Beta1 on a USB stick and
can't install it without an Input/Output Error during the format. So
much for the new kernel fixing the problem. Zeroing out and
"unlocking" give the same result. I've even taken the 16GB out and tried to
have it read in a Windows machine. I've tried everything. You guys
keep saying it's not really dead. I assure you that it behaves like it
is every time I poke it with a digital stick. It came from trying to
install the early Karmic release on both SSDs after the first one
failed. It's not physically damaged; it came directly from my
persistence in trying to make this damn thing work, installing and
reinstalling.

I've got the 900 for another day or two. I sold it to a friend with
Ubuntu running off an SD card. I'd like to get it working for him
before I give it back to him tomorrow.

My original post:
http://georgia.ubuntuforums.org/showthread.php?p=8262777

On Sat, Apr 3, 2010 at 12:35 PM, sun2ecliptic <email address hidden> wrote:
> Applied fix (via Update-Manager), confirmed - no errors. (Intel 4GB SSD,
> UNR 9.10).  All is well in the land of milk and hardware.
>
> --
> devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death
> https://bugs.launchpad.net/bugs/445852
> You received this bug notification because you are a direct subscriber
> of a duplicate bug.
>
> Status in ATA S.M.A.R.T. Disk Health Monitoring Library: Confirmed
> Status in The Linux Kernel: Invalid
> Status in “devicekit-disks” package in Ubuntu: Invalid
> Status in “libatasmart” package in Ubuntu: Fix Released
> Status in “devicekit-disks” source package in Lucid: Invalid
> Status in “libatasmart” source package in Lucid: Fix Released
> Status in “devicekit-disks” source package in Karmic: Fix Released
> Status in “libatasmart” source package in Karmic: Triaged
>
> Bug description:
> TEMPORARY WORK AROUND FOR THIS PROBLEM IN KARMIC: (This is now also in karmic-proposed and needs testing feedback):
>
> 1. sudo gedit /lib/udev/rules.d/95-devkit-disks.rules
>
> 2. locate the following lines (about 1/3 the way into the file; search for "smart")
>
> # ATA disks driven by libata
> KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"
>
> 3. comment out the second line by adding a # in front, so you should have
>
> # ATA disks driven by libata
> #KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="devkit-disks-probe-ata-smart $tempnode"
>
> 4. save the file and reboot
>
> TECHNICAL ANALYSIS: https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202
> LUCID STATUS: https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/203
> KARMIC SOLUTION: https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/204
>
> BUG DESCRIPTION FOLLOWS:
>
> In the Karmic beta I experience ssd stalls during the boot process.  It happens almost everytime before xsplash loads and happens again frequently between logging into g...

Read more...

Revision history for this message
Dan Halbert (dhalbert) wrote :

Lucid Lynx Beta1 does not include this fix, because it was assembled before the fix was released. I successfully installed a recent daily netbook build of Lucid (http://cdimage.ubuntu.com/ubuntu-netbook/daily-live/) on a Dell Mini 9 with a stock 4GB STEC SSD. The build I used was created after the fix was released. Though I did not have the reported problem, I did have peculiar, similar symptoms with karmic and after.

The final test for those of you with problem SSDs, it seems to me, is to install from one of the daily-live builds (or wait for Beta2). That will require no patching during the install, and should just work.

Before I installed, I booted from a USB stick and did a secure erase of the SSD (see #211 above), which only took a short time.

Revision history for this message
Trey (trey333) wrote :

Couldn't get the daily build downloaded (5 kB/s in China), but I've got the
latest stable .32 kernel running in Karmic. It still doesn't work. hdparm gives
me input/output errors doing anything, like setting a password. GParted at
least sees one of the drives and its partitions, but ultimately does the
same when I try to format.

Any suggestions?

On Sun, Apr 4, 2010 at 2:04 AM, Dan Halbert <email address hidden> wrote:

> Lucid Lynx Beta1 does not include this fix, because it was assembled
> before the fix was released. I successfully installed a recent daily
> netbook build of Lucid (http://cdimage.ubuntu.com/ubuntu-netbook/daily-live/) on a Dell
> Mini 9 with a stock 4GB STEC SSD. The build I used was
> created after the fix was released. Though I did not have the reported
> problem, I did have peculiar, similar symptoms with karmic and after.
>
> The final test for those of you with problem SSD's, it seems to me, is
> to install from one of the daily-live builds (or wait for Beta2). That will
> require no patching during the install, and should just work.
>
> Before I installed, I booted from a USB stick and did a secure erase of
> the SSD (see #211 above), which only took a short time.

Revision history for this message
J-Pierre Rouits (jprouits) wrote :

Surprisingly, the fix from a few days ago also fixed the random freezing at boot time that I had experienced since I installed Karmic. I have no SSD, just an ATA disk. However, the CD-ROM drive continues to be probed randomly, giving an HSM violation. See the following kernel log:
===============
Apr 5 11:31:03 jpport kernel: [ 1940.181158] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 5 11:31:03 jpport kernel: [ 1940.181187] ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Apr 5 11:31:03 jpport kernel: [ 1940.181191] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 5 11:31:03 jpport kernel: [ 1940.181194] res 00/01:01:01:14:eb/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 5 11:31:03 jpport kernel: [ 1940.181243] ata2: soft resetting link
Apr 5 11:31:03 jpport kernel: [ 1940.360647] ata2.00: configured for MWDMA2
Apr 5 11:31:03 jpport kernel: [ 1940.473234] ata2: EH complete
Apr 5 11:37:48 jpport kernel: [ 2345.180532] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Apr 5 11:37:48 jpport kernel: [ 2345.180563] ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Apr 5 11:37:48 jpport kernel: [ 2345.180567] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 5 11:37:48 jpport kernel: [ 2345.180571] res 00/01:01:01:14:eb/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 5 11:37:48 jpport kernel: [ 2345.180646] ata2: soft resetting link
Apr 5 11:37:48 jpport kernel: [ 2345.360649] ata2.00: configured for MWDMA2
Apr 5 11:37:48 jpport kernel: [ 2345.366253] ata2: EH complete
Apr 5 11:39:31 jpport kernel: [ 2447.968101] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Apr 5 11:39:31 jpport kernel: [ 2447.968112] ata2.00: ST_FIRST: !(DRQ|ERR|DF)
Apr 5 11:39:31 jpport kernel: [ 2447.968139] ata2.00: cmd a0/00:00:00:00:00/00:00:00:00:00/a0 tag 0
Apr 5 11:39:31 jpport kernel: [ 2447.968142] cdb 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 5 11:39:31 jpport kernel: [ 2447.968146] res 00/01:01:01:14:eb/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Apr 5 11:39:31 jpport kernel: [ 2447.968191] ata2: soft resetting link
Apr 5 11:39:31 jpport kernel: [ 2448.148629] ata2.00: configured for MWDMA2
Apr 5 11:39:31 jpport kernel: [ 2448.262396] ata2: EH complete

================

Configuration: HP Compaq nx6125, Ubuntu 9.10, last updated on April 3

Revision history for this message
J-Pierre Rouits (jprouits) wrote :

Unfortunately, this morning the random boot lockup reappeared, and it can only be recovered by hitting the power button, which erases any log! So my previous comment was too optimistic. Maybe this is not the proper place to discuss the random boot lockup, but the random CD-ROM probing with HSM violations is still there.

Revision history for this message
Raf (4263004-noduck) wrote :

I am using current Lucid (udisks 1.0.0+git20100319-0git1 and libatasmart4 0.17+git20100219-1git2) on my Acer Aspire One with Super Talent replacement SSD (FEM32GF13M) and I am still getting HSM violations. I previously tested and was able to confirm that it (sometimes) works without error.

I am hesitant to test more since the corruption was quite bad: broken grub config (easily fixed) and "mount point /dev/shm does not exist" errors when booting (haven't found how to fix this yet).

Revision history for this message
ipig (infopiggy) wrote :

I have not had any problems with 10.04 b1 since post #198.

Call me a wimp, but if it ain't broke I lack the energy to fix it again.

Not sure if I'll do b2 yet, but I'll definitely upgrade to the full version when it comes out.

Revision history for this message
MFV (mfv) wrote :

It does *appear* it's fixed in the latest Lucid.

Can anyone confirm how effective the Karmic/Jaunty fixes were? Lucid is a bag of i915 coredumps at the moment.

Revision history for this message
Gav Mack (gavinmac) wrote :

I installed the Lucid daily build a fortnight ago. It seemed to work fine for a week, but last weekend I started getting the HSM violations, which have finally trashed my partition: fsck ran for a whole day stuck at 70%. I stuck the daily boot USB back in and am currently zeroing the ext4 partition. This time I think I'll set up Andrew Simpson's dpkg-divert from the very start and leave it in place until I'm sure it's not going to come back!

Revision history for this message
Raf (4263004-noduck) wrote :

With the early versions of udisks/libatasmart4 on Lucid I would always get HSM violations. Now with the newer versions (udisks 1.0.0+git20100319-0git1 and libatasmart4 0.17+git20100219-1git2) I only sometimes get HSM violations.

Revision history for this message
Gav Mack (gavinmac) wrote :

@Raf:

Despite months of HSM violations with Karmic I never had to zero the drive; I installed Lucid just by reformatting the partition. This time I only had the occasional HSM violation, not during startup but noticeably during apt-get update in a terminal, yet it was enough to trash the Super Talent SSD after only 2 days, tops. I would rather set the divert up and leave it set until I know this bug is gone forever!

My reinstall of Lucid after the erase and divert set as per post 147 is now running nicely - now I just have to load the apps/repositories back in to get back to where I was!

Revision history for this message
Raf (4263004-noduck) wrote :

@Gav: how are we going to find a fix for this problem if none of us is willing to try the new version once in a while?

Revision history for this message
Jarige (jarikvh) wrote :

I still have the bug with a fully updated Lucid today, but it didn't appear at boot time. It has become less and less frequent over time, but it is not fully gone: I got it twice during this session, both times quite early (85 and 114 seconds in dmesg, if those are seconds).

udisks version: 1.0.1-1
libatasmart version: 0.17+git20100219-1git2

Revision history for this message
MFV (mfv) wrote :

Spoke too soon, my system got hosed just now. See bug 561079, as I didn't have this reference to hand. Managed to get dumps etc. off.

The udev rules were edited to not run the libata stuff, and it still errored.

I really need to use this netbook for casual work on the move and can't mess around any longer, so going to have to try installing something else, Ubuntu is not working out on the EEEpc. Bye for now.

Revision history for this message
shadowblast101 (shadowblast101) wrote :

MFV, I just thought I should point out that this bug is not just within Ubuntu, but affects any Linux distribution that uses the libatasmart package.

Revision history for this message
MFV (mfv) wrote :

I know, re #148. There are other OSes out there ;)

Revision history for this message
Martin Pitt (pitti) wrote :

Reopening for Lucid then, since some machines still seem to be affected. For those who get it on Lucid beta-2, can you please confirm whether applying the workaround in /lib/udev/rules.d/80-udisks.rules works? Also, please do

  sudo strace -vvfs1024 -o /tmp/probe-smart.txt /lib/udev/udisks-probe-ata-smart /dev/sda

and attach /tmp/probe-smart.txt here. It'd be best if someone could give me ssh access to an affected machine, since it works on those I can put my hands on now.

Changed in libatasmart (Ubuntu Lucid):
importance: Critical → High
status: Fix Released → Confirmed
Martin Pitt (pitti)
Changed in libatasmart (Ubuntu Lucid):
status: Confirmed → Incomplete
Revision history for this message
ubuntu-crypto (davexthc) wrote :

Just to be sure: it is safe to remove this patch if you *don't* use SSDs, correct?

Revision history for this message
Raf (4263004-noduck) wrote :

I have not been able to reproduce the HSM violations. I rebooted 20 times, cold booted, booted with battery, replaced battery and booted, tried 2.6.32-19 and 2.6.32-20, all of these seem to work without problem.

Previously (after the fix went in) I sometimes got the HSM violation, but only on boot, I was not able to trigger it by manually running udisks-probe-ata-smart.

I have replaced /lib/udev/udisks-probe-ata-smart with a script that generates a trace, so if it should cause an HSM violation again, it should generate a trace output.

My understanding of this bug is that it is only related to running of udisks-probe-ata-smart, which should only run at boot (unless repartitioning the drive). But some reports seem to indicate failures after boot (e.g. #234).

The workaround did seem to work for me.

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

ubuntu-crypto [2010-04-12 21:52 -0000]:
> just to be sure it is safe to remove this patch if you *don't* use
> SDDs correct?

The SMART probing is not inherently tied to SSDs. It just seems that
many of today's SSDs use a kind of controller which acts up when it's
asked for SMART status.

So, nobody can guarantee that this problem does not affect normal HDDs
as well, but it seems we haven't heard about those yet.

If you re-enable the SMART probing and it works for you (easy to
notice if your startup time suddenly increases by 30 seconds or so),
then it's safe, yes.

Martin

--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)

Revision history for this message
Gav Mack (gavinmac) wrote :

Even though I have the divert set, I've just had an HSM violation shortly after Lucid fully started. A portion of the log from boot is attached, with the error logged at the end.

Revision history for this message
Martin Pitt (pitti) wrote :

Gav Mack [2010-04-13 23:18 -0000]:
> Even though I have the divert set I've just had a HSM Violation shortly
> after Lucid has fully started - portion of the log attached from boot
> with the error logged at the end.
>
> ** Attachment added: "dmesg"
> http://launchpadlibrarian.net/44091487/dmesg

This looks rather different, though (no HSM violation). If you have
the divert set, then it's not due to devkit-disks-probe-ata-smart.
Another potential culprit could be hdparm, please see bug 515023.

Revision history for this message
Raf (4263004-noduck) wrote :

I have not had any more HSM violations, and nobody else has reported any problems. It looks like this problem is fixed.

Revision history for this message
Trey (trey333) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

I still have two bricked SSDs on an EEE 900

On Apr 17, 2010 10:01 AM, "Raf" <email address hidden> wrote:

I have not had anymore HSM violations. And nobody else has reported any
problems. It looks like this problem is fixed.


Revision history for this message
ectropionized (ectropionized-deactivatedaccount) wrote :

So, the problem has reappeared with Lucid. Last night I upgraded to Lucid via update-manager, and after updating all packages and using the system, I received this:

==
[ 89.816125] ata2: lost interrupt (Status 0x58)
[ 89.820090] ata2: drained 2048 bytes to clear DRQ.
[ 89.823689] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 89.823708] ata2.00: BMDMA stat 0x4
[ 89.823726] ata2.00: failed command: READ DMA
[ 89.823765] ata2.00: cmd c8/00:08:00:55:e1/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 89.823774] res 58/00:08:00:55:e1/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 89.823794] ata2.00: status: { DRDY DRQ }
[ 89.823870] ata2: soft resetting link
[ 89.992540] ata2.00: configured for UDMA/66
[ 89.992580] ata2: EH complete
[ 120.816124] ata2: lost interrupt (Status 0x58)
[ 120.820091] ata2: drained 32768 bytes to clear DRQ.
[ 120.934970] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 120.938005] ata2.00: BMDMA stat 0x4
[ 120.940833] ata2.00: failed command: READ DMA
[ 120.943786] ata2.00: cmd c8/00:f8:a8:da:15/00:00:00:00:00/e0 tag 0 dma 126976 in
[ 120.943795] res 58/00:f8:a8:da:15/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 120.950725] ata2.00: status: { DRDY DRQ }
[ 120.954028] ata2: soft resetting link
[ 121.124583] ata2.00: configured for UDMA/66
[ 121.124625] ata2: EH complete
[ 209.787794] __ratelimit: 9 callbacks suppressed
[ 209.787814] apt-get[1574]: segfault at 0 ip 00327d10 sp bfa0e0ec error 4 in libc-2.11.1.so[247000+153000]
==

Kernel: 2.6.32-21-generic (i686)

Revision history for this message
Horácio (horacioh) wrote :

I confirm the bug in Lucid beta2. I originally had this problem on an Asus Eee 900; it was apparently solved by the patch, but after upgrading from Karmic to Lucid beta2 I detected HSM violations again. The difference is that they do not appear during boot but randomly during normal use of the computer (in my case, web browsing).

[ 82.816046] ata2: lost interrupt (Status 0x58)
[ 82.816103] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 82.816111] ata2.00: BMDMA stat 0x64
[ 82.816119] ata2.00: failed command: WRITE DMA
[ 82.816133] ata2.00: cmd ca/00:10:1f:16:00/00:00:00:00:00/e0 tag 0 dma 8192 out
[ 82.816137] res 58/00:10:1f:16:00/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 82.816144] ata2.00: status: { DRDY DRQ }
[ 82.816180] ata2: soft resetting link
[ 83.014771] ata2.00: configured for UDMA/66
[ 83.020331] ata2.01: configured for UDMA/66
[ 83.044327] ata2.00: configured for UDMA/66
[ 83.052277] ata2.01: configured for UDMA/66
[ 83.052295] ata2: EH complete
[ 113.816056] ata2: lost interrupt (Status 0x58)
[ 113.816113] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 113.816121] ata2.01: BMDMA stat 0x64
[ 113.816129] ata2.01: failed command: WRITE DMA
[ 113.816144] ata2.01: cmd ca/00:88:39:08:e2/00:00:00:00:00/f0 tag 0 dma 69632 out
[ 113.816147] res 58/00:88:39:08:e2/00:00:00:00:00/f0 Emask 0x2 (HSM violation)
[ 113.816154] ata2.01: status: { DRDY DRQ }
[ 113.816191] ata2: soft resetting link
[ 114.056305] ata2.00: configured for UDMA/66
[ 114.064298] ata2.01: configured for UDMA/66
[ 114.088283] ata2.00: configured for UDMA/66
[ 114.096279] ata2.01: configured for UDMA/66
[ 114.096296] ata2: EH complete
[ 147.816044] ata2: lost interrupt (Status 0x58)
[ 147.816102] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 147.816109] ata2.01: BMDMA stat 0x64
[ 147.816117] ata2.01: failed command: WRITE DMA
[ 147.816131] ata2.01: cmd ca/00:08:d1:ed:61/00:00:00:00:00/f0 tag 0 dma 4096 out
[ 147.816135] res 58/00:08:d1:ed:61/00:00:00:00:00/f0 Emask 0x2 (HSM violation)
[ 147.816142] ata2.01: status: { DRDY DRQ }
[ 147.816178] ata2: soft resetting link
[ 148.056297] ata2.00: configured for UDMA/66
[ 148.064307] ata2.01: configured for UDMA/66
[ 148.088278] ata2.00: configured for UDMA/66
[ 148.096288] ata2.01: configured for UDMA/66
[ 148.096303] ata2: EH complete
[ 180.816048] ata2: lost interrupt (Status 0x58)
[ 180.816105] ata2.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 180.816233] ata2.01: BMDMA stat 0x64
[ 180.816297] ata2.01: failed command: WRITE DMA
[ 180.816380] ata2.01: cmd ca/00:08:b9:01:6a/00:00:00:00:00/f0 tag 0 dma 4096 out
[ 180.816384] res 58/00:08:b9:01:6a/00:00:00:00:00/f0 Emask 0x2 (HSM violation)
[ 180.816624] ata2.01: status: { DRDY DRQ }
[ 180.816727] ata2: soft resetting link
[ 181.056279] ata2.00: configured for UDMA/66
[ 181.064276] ata2.01: configured for UDMA/66
[ 181.088275] ata2.00: configured for UDMA/66
[ 181.096274] ata2.01: configured for UDMA/66
[ 181.096289] ata2: EH complete
[ 244.738599] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id

Revision history for this message
Jarige (jarikvh) wrote :

I can confirm that I still have this bug having all updates installed in Lucid.

I was told (by Martin Pitt) to execute the following command:
sudo strace -vvfs1024 -o /tmp/probe-smart.txt /lib/udev/udisks-probe-ata-smart /dev/sda

And add /tmp/probe-smart.txt as an attachment. So I did that, hoping it would help... I did this without applying any workaround (except for the one that was auto-released with Karmic, but the problems reappeared in Lucid)

Basically, I don't understand anything of this bug. I don't know how to apply the workaround on an already installed UNR, since the explanation only states booting from a LiveCD, and I don't know whether the workaround has any bad side effects.

How do I apply the workaround on my machine, on an already installed UNR?

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

I'm not convinced this recent problem is related to devkit-disks-probe-ata-smart (the original bug). I have experienced the recent problem once - during an 'apt-get update'.

From what I see it's typified by the kernel giving a READ DMA or WRITE DMA command, to which the drive responds in an unexpected manner (HSM Violation). After a suitable timeout the drive is reset and things continue.

Also, the bug does not occur at boot, but randomly during use. And triggering the SMART probing from 'Disk Utility' has no effect for me.
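The READ DMA / WRITE DMA distinction can be read straight from the logs posted here: in libata's `ataN.NN: cmd XX/...` lines, the first byte is the ATA command opcode (c8 = READ DMA, ca = WRITE DMA, a0 = ATAPI PACKET, matching the "failed command:" lines above). A minimal shell sketch, assuming the log format shown in this thread; `ata_cmd_names` is a hypothetical helper, and its opcode table only covers the commands seen here plus SMART (0xB0), the command family libatasmart uses for health queries:

```sh
#!/bin/sh
# Sketch: decode the opcode in libata "cmd XX/..." error lines so you can
# see which kind of command was in flight when the HSM violation hit.
# The table is deliberately small; anything else prints as "unknown 0xNN".

ata_cmd_names() {
    # Read kernel log lines on stdin; print one command name per "cmd" line.
    sed -n 's/.*ata[0-9.]*: cmd \([0-9a-f][0-9a-f]\)\/.*/\1/p' |
    while read -r op; do
        case "$op" in
            c8) echo "READ DMA" ;;
            ca) echo "WRITE DMA" ;;
            a0) echo "PACKET (ATAPI)" ;;
            b0) echo "SMART" ;;
            *)  echo "unknown 0x$op" ;;
        esac
    done
}
```

For example, `dmesg | ata_cmd_names | sort | uniq -c` gives a per-command count of the errored commands since boot, which makes it easy to tell a SMART-probe problem from plain DMA transfer errors.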

Revision history for this message
Martin Pitt (pitti) wrote :

Given how much trouble this still causes on Lucid, I won't re-enable the SMART prober for Karmic any time soon.

Changed in libatasmart (Ubuntu Karmic):
assignee: Martin Pitt (pitti) → nobody
Revision history for this message
Jarige (jarikvh) wrote :

If there's anything I can do to produce more data for debugging, contact me.
I didn't apply any workaround, since I do not know how to do that.

Jon Ramvi (ramvi)
Changed in easypeasy-project:
status: New → Confirmed
assignee: nobody → Jon Ramvi (ramvi)
importance: Undecided → High
Revision history for this message
ipig (infopiggy) wrote :

My install from post #198 had a lockup and then would only boot to a GRUB read error; not sure what happened.

I decided to give 10.04 Beta2/RC a whirl, repeating the steps in #198, but I couldn't dd the drive without it resulting in a loop of HSM violations.

I then tried dd'ing the drive in (live) 9.04, 8.10 and then 8.04.

Only in 8.04 was I able to dd the drive without a loop of HSM violations.

Side note: I've noticed that when dd'ing the drive under normal circumstances the HD light stays on solid; when it starts pulsing at timed intervals, that means an HSM loop is going on in the background.

Side note: it seems to affect 8.10, 9.04, 9.10 and 10.04 beta1 and beta2/RC, but not 8.04.

The bug surely seems to have been in effect earlier than I thought, in terms of versions and installations.

Anyway, I know the full release is tomorrow, so hopefully everything is good to go (re: #250).

Revision history for this message
Martin Pitt (pitti) wrote : Re: [Bug 445852] Re: devkit-disks-probe-ata-smart causes HSM Violations on SSD, and potential hardware death

ipig [2010-04-28 19:53 -0000]:
> Side Note: seems to effect 8.10/9.04/9.10/10.04 beta1 & beta2/rc) - (but
> not 8.04)

This is definitely unrelated then, since libatasmart was only
introduced in 9.10. Perhaps your problem is more like bug 515023 or
bug 548513?

Jon Ramvi (ramvi)
Changed in easypeasy-project:
status: Confirmed → Triaged
status: Triaged → Fix Released
Revision history for this message
ipig (infopiggy) wrote :

When I boot a live nightly build from Tuesday the 27th, I get an error saying a hard disk has health problems (ATA ASUS-PHISON SSD/TST2.0L4) (Port 2 of PATA Host Adapter) (8.1GB) (/dev/sdb)

SMART Status: Disk Failure is Imminent

ID: 235 / Good Block Rate (Number of available reserved blocks as a percentage of the total number of reserved blocks) Assessment: Failing / Normalized: 1 / Worst: 1 / Threshold: 3 / Value: N/A

I don't really know what the deal is. Maybe the disk really is failing; if so, it has been failing for a while now.

Maybe libatasmart just brings out the worst of it. I'm a little split still.
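The "Failing" assessment for attribute 235 follows from the numbers themselves: the normalized value (1) is at or below the threshold (3). A minimal sketch of that comparison, assuming the conventional SMART threshold rule; it is not copied from libatasmart or Disk Utility, and `smart_assess` and its output labels are hypothetical:

```sh
#!/bin/sh
# Sketch of the conventional SMART threshold rule: an attribute is failing
# now when its normalized value is at or below the vendor threshold, and
# "failed in the past" when only its worst recorded value crossed it.
# (POSIX sh has no "local", so the helper variables are global.)

smart_assess() {
    normalized=$1
    worst=$2
    threshold=$3
    if [ "$normalized" -le "$threshold" ]; then
        echo "Failing"
    elif [ "$worst" -le "$threshold" ]; then
        echo "Failed in the past"
    else
        echo "Good"
    fi
}
```

With the values reported above, `smart_assess 1 1 3` prints `Failing`. Whether a reserved-block attribute at 1 means the flash is really worn out, or just that this controller reports it oddly, is exactly the open question in this comment.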

Revision history for this message
theluketaylor (ekul-taylor) wrote :

I am the original reporter of this issue. I installed Lucid today and I can confirm this issue has NOT been corrected. I had to comment out the SMART portions of /lib/udev/rules.d/80-udisks.rules to avoid 5-10 second I/O stalls and numerous HSM errors (exactly the same symptoms as originally reported)

Revision history for this message
Jim Connor (jim-canada) wrote :

6 months, 22 days ago?

Revision history for this message
theluketaylor (ekul-taylor) wrote :

@Jim Connor

Yes, I reported this 6 months ago. I'm more than a little frustrated it hasn't been fixed yet, especially since in comment 203 I read this:

"So in summary, the problem is fixed in the lucid version of libatasmart. While the code could be a little more robust for future extensions (which I'll discuss in the upstream bug), there are currently no code paths which can lead to the situation that triggers HSM violations."

I assumed, based on that, that when I upgraded from 9.10 to 10.04 I would have no issues. I did a fresh install and it went fine until I rebooted into the new system. Then I got the same errors I reported oh so long ago. The workaround from 9.10 worked, though the SMART udev rules are now located in a different file (80-udisks.rules).
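Because the rule moved from 95-devkit-disks.rules (Karmic) to 80-udisks.rules (Lucid), it is easy to lose track of whether the workaround is still in effect after an upgrade. A rough sketch of a check; `smart_probe_active` is a hypothetical helper that only looks for uncommented lines mentioning `probe-ata-smart`, and it silently skips files that do not exist on the running release:

```sh
#!/bin/sh
# Sketch: report rules files that still run the SMART prober.
# Returns 0 (and prints file:line:text) if any given file has an
# uncommented line mentioning probe-ata-smart; returns 1 otherwise.

smart_probe_active() {
    found=1
    for f in "$@"; do
        # Only one of the two rule file names exists on a given release.
        [ -r "$f" ] || continue
        # First non-blank character must not be '#', i.e. the rule is live.
        if grep -H -n '^[[:space:]]*[^#[:space:]].*probe-ata-smart' "$f"; then
            found=0
        fi
    done
    return "$found"
}
```

For example: `smart_probe_active /lib/udev/rules.d/95-devkit-disks.rules /lib/udev/rules.d/80-udisks.rules && echo "SMART probing is still enabled"`.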

Revision history for this message
Martin Pitt (pitti) wrote :

theluketaylor [2010-04-29 21:24 -0000]:
> Yes, I reported this 6 months ago. I'm more than a little frustrated it
> hasn't been fixed yet, especially since in comment 203 I read this:
>
> "So in summary, the problem is fixed in the lucid version of
> libatasmart. While the code could be a little more robust for future
> extensions (which I'll discuss in the upstream bug), there are currently
> no code paths which can lead to the situation that triggers HSM
> violations."

So far I have only had access to one machine where this happened. I could
reproduce the bug and found the cause (see upstream report). As of
today, nobody has offered me SSH access to a machine which is still
affected, so I'm afraid there's nothing else I can do.

Martin

Revision history for this message
Jarige (jarikvh) wrote :

@Martin Pitt
I'm willing to give you SSH access to my netbook, but you have to tell me how. I do not have any experience with that. I must tell you that I applied the workaround yesterday; I can uncomment the lines if necessary. I'm probably going to be online 5 hours from now, and maybe even longer. I'll receive an e-mail notification if you reply here.

And, of course, if you've got SSH access (which I guess is some kind of terminal access) don't screw it up :P This is a production machine, not a test machine. I work on this netbook every day. My important files are backed up with Dropbox though, so no need to worry about that.

So just tell me what to do :)

Revision history for this message
Martin Pitt (pitti) wrote :

These are the steps for allowing me SSH access:

 * Install openssh-server
 * Create a new user for me (e. g. "pitti"), with admin privileges
 * Log in as that user, write the password in a file "password.txt" in the home directory (so that you do not need to pass it around by mail, but I can get access to it once I'm logged in and need sudo)
 * mkdir ~/.ssh
 * wget -O ~/.ssh/authorized_keys https://launchpad.net/~pitti/+sshkeys
 * Configure your router to forward port 22 (ssh) to your machine, so that it can be reached from outside
 * Tell me (via private mail, IRC, or bug followup) your IP address (visible in the router configuration web page usually).

Revision history for this message
Martin Pitt (pitti) wrote :

Oh, for the record: I will track my changes and revert them, but I'll need to install a few additional packages (thus I need a few MB of download quota), build some test code, and run it as root. Thus I _will_ access your hard drive with those SMART probing commands to reproduce the problem a few times.

I will not need to see anything in other home directories, and the like.

Revision history for this message
Martin Pitt (pitti) wrote :

I debugged this issue on Jarige's machine, and it has a pretty different root cause. Because of that, because this bug has become way too long, and because the fix here resolves the issue for most people, we opened a new report in bug 574462. If you still have the problem, please subscribe to that one instead.

Thank you!

Changed in libatasmart (Ubuntu Lucid):
status: Incomplete → Fix Released
Changed in libatasmart (Ubuntu):
status: Incomplete → Fix Released
Changed in libatasmart (Ubuntu Karmic):
status: Triaged → Won't Fix
Changed in libatasmart:
importance: Unknown → Critical
Revision history for this message
loewe_78 (bergloewe) wrote :

I've got an HP 510 notebook PC. When it was new, it shipped with open-dos, so it has been running Linux since Gutsy Gibbon and, until now, had only a few southbridge issues, which worked out of the box in Hardy or Jackalope.
It has an Intel Celeron M 360 1.4 GHz processor with a 400 MHz front-side bus and an Intel 910GML chipset with an Intel ICH6-M southbridge.
The HD is an IBM/Hitachi 40GB 4200RPM 2MB cache Travelstar HTS421240H9AT00.

When I upgraded to Karmic about a year ago, I had the problem described above, so I reinstalled Jackalope, where the hardware worked without problems.
Now I want to pass on my laptop, as I bought a new one. A test with the desktop CD showed no obvious problems (it's an HD failure that might occur more often when the system is installed on the HD rather than run from CD, ha-ha). This, and the fact that it's easier, is why I tried to install Lucid.

Although I implemented the workaround for Lucid given above, the device produces the following output when running dmesg:

[ 0.271537] ata_piix 0000:00:1f.1: version 2.13
[ 0.271555] alloc irq_desc for 16 on node -1
[ 0.271559] alloc kstat_irqs on node -1
[ 0.271567] ata_piix 0000:00:1f.1: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 0.271628] ata_piix 0000:00:1f.1: setting latency timer to 64
[ 0.277201] isapnp: Scanning for PnP cards...
[ 0.282766] scsi0 : ata_piix
[ 0.282911] scsi1 : ata_piix
[ 0.283654] ata1: PATA max UDMA/100 cmd 0x1f0 ctl 0x3f6 bmdma 0x3580 irq 14
[ 0.283659] ata2: PATA max UDMA/100 cmd 0x170 ctl 0x376 bmdma 0x3588 irq 15
[ 0.284177] Fixed MDIO Bus: probed

[...]

[ 0.489196] ata1.00: ATA-7: HTS421240H9AT00, HACOA70S, max UDMA/100
[ 0.489204] ata1.00: 78140160 sectors, multi 16: LBA48
[ 0.489256] ata1.01: ATAPI: TSSTcorpCDW/DVD TS-L462D, HS02, max MWDMA2
[ 0.548617] ata1.00: configured for UDMA/100
[ 0.556926] ACPI: Battery Slot [C15E] (battery present)
[ 0.580432] ata1.01: configured for MWDMA2

[...]
[ 158.816041] ata1: lost interrupt (Status 0x58)
[ 158.820017] ata1: drained 32768 bytes to clear DRQ.
[ 158.909721] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 158.909728] sr 0:0:1:0: CDB: Test Unit Ready: 00 00 00 00 00 00
[ 158.909744] ata1.01: cmd a0/00:00:00:00:00/00:00:00:00:00/b0 tag 0
[ 158.909746] res 58/00:01:00:00:00/00:00:00:00:00/b0 Emask 0x2 (HSM violation)
[ 158.909750] ata1.01: status: { DRDY DRQ }
[ 158.909787] ata1: soft resetting link
[ 159.128657] ata1.00: configured for UDMA/100
[ 159.160312] ata1.01: configured for MWDMA2
[ 159.179141] ata1: EH complete
[ 1113.785054] atkbd.c: Unknown key released (translated set 2, code 0xe0 on isa0060/serio0).
[ 1113.785060] atkbd.c: Use 'setkeycodes e060 <keycode>' to make it known.
[ 2920.000474] ata1: lost interrupt (Status 0x58)
[ 2920.004015] ata1: drained 32768 bytes to clear DRQ.
[ 2920.093417] ata1.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 2920.093423] ata1.01: ATAPI check failed (ireason=0x1 bytes=8)
[ 2920.093429] sr 0:0:1:0: CDB: Get event status notification: 4a 01 00 00 10 00 00 00 08 00
[ 2920.093448] ata1.01: cmd a0/00:00:00:08:...


Revision history for this message
loewe_78 (bergloewe) wrote :
Revision history for this message
rogmorri (frontporsche) wrote :

On the same Acer Aspire One laptop where I saw this issue last year, I am perhaps seeing it again with ubuntu-10.10-rc-desktop-i386...

Oct 4 00:59:43 ubuntu kernel: [ 602.639266] res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)
Oct 4 00:59:44 ubuntu kernel: [ 604.149075] ata2: soft resetting link
Oct 4 00:59:45 ubuntu kernel: [ 604.321486] ata2.00: configured for UDMA/100
Oct 4 00:59:45 ubuntu kernel: [ 604.321529] ata2: EH complete
Oct 4 00:59:49 ubuntu kernel: [ 609.173903] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
Oct 4 00:59:49 ubuntu kernel: [ 609.173915] ata2.00: BMDMA stat 0x5
Oct 4 00:59:49 ubuntu kernel: [ 609.173926] ata2.00: failed command: WRITE DMA
Oct 4 00:59:49 ubuntu kernel: [ 609.173945] ata2.00: cmd ca/00:00:40:69:c9/00:00:00:00:00/e0 tag 0 dma 131072 out
Oct 4 00:59:49 ubuntu kernel: [ 609.173949] res 00/00:00:00:00:00/00:00:00:00:00/00 Emask 0x2 (HSM violation)

Revision history for this message
Andrew Simpson (andrew-simpson) wrote :

@rogmorri
I don't think this is the same bug. You are getting HSM Violations with WRITE DMA, whereas this bug occurred with READ DMA.

(This is being written on an AA1 with 10.10 also!)

Revision history for this message
rogmorri (frontporsche) wrote :

Thanks, Andrew. Maybe I just have bad hardware, then.

Revision history for this message
Trey (trey333) wrote :

Just to throw it out there: this bug was never "fixed." I had to sell a
bricked EEE 900 for scrap because it killed both SSDs.

On Mon, Oct 4, 2010 at 2:08 PM, rogmorri <email address hidden> wrote:

> Thank, Andrew. Maybe then I just have bad hardware.

Revision history for this message
lotus49 (lotus-49) wrote :

Trey's right, it never was fixed.

Although I have fortunately not suffered from any permanent hardware
problems, the bug resurfaces every now and then. I have worked around it by
editing /lib/udev/rules.d/80-udisks.rules and commenting out this line:

# ATA disks driven by libata
#KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="udisks-probe-ata-smart $tempnode"

This workaround has done the trick but, unfortunately, the file is overwritten
every now and then (as the warning at the beginning of the file says it will be).
At least it is easy to spot when this has happened as my boot times go from
about 15 secs to a couple of minutes.
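One way to keep package upgrades from undoing the edit, assuming the standard udev behaviour that a rules file in /etc/udev/rules.d takes precedence over a same-named file in /lib/udev/rules.d, is to put the edited copy there instead of editing the packaged file. A minimal sketch; `disable_smart_probe` is a hypothetical helper, not something shipped with udisks:

```sh
#!/bin/sh
# Sketch: write a copy of a udev rules file with the SMART prober
# line commented out. Matches both devkit-disks-probe-ata-smart (Karmic)
# and udisks-probe-ata-smart (Lucid). Already-commented lines are left
# alone, so running it twice is harmless.

disable_smart_probe() {
    src=$1   # e.g. /lib/udev/rules.d/80-udisks.rules (packaged copy)
    dst=$2   # e.g. /etc/udev/rules.d/80-udisks.rules (local override)
    # Prefix '#' to any line mentioning probe-ata-smart whose first
    # character is not already '#'.
    sed '/probe-ata-smart/ s/^[^#]/#&/' "$src" > "$dst"
}
```

Run it as root with the two example paths from the comments, then reboot. The trade-off is that the /etc copy no longer picks up any other changes the package makes to 80-udisks.rules, so it is worth deleting once the bug is fixed.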

Simon

On 4 October 2010 07:41, Trey <email address hidden> wrote:

> just to throw it out there - this bug was never "fixed." I had to sell a
> bricked EEE 900 for scrap because it killed both SSD's.

Changed in linux:
status: Invalid → Won't Fix
Changed in libatasmart:
importance: Critical → Unknown
Neil Hooey (nhooey) wrote :

Is anyone going to fix this?

I've disabled SMART on all of my drives in /lib/udev/rules.d/80-udisks.rules, and I've disabled hdparm for them in /lib/udev/rules.d/85-hdparm.rules, but I still get IDENTIFY DEVICE errors on boot, shutdown and reboot.

Does anyone even know what software package is actually responsible for the bug?
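To confirm that no active rule still launches one of these helpers, it helps to look at the rule set udev will actually use, where a file in /etc/udev/rules.d/ shadows the same-named file in /lib/udev/rules.d/. A minimal sketch, assuming only those two directories (newer systems also use /run/udev/rules.d/ and /usr/lib/udev/rules.d/) and a plain text match on `probe-ata-smart` and `hdparm`. `effective_rules` and `active_matches` are hypothetical names, and rules continued with a trailing backslash are not handled.

```python
import re
from pathlib import Path
from typing import Dict, Iterator, List, Tuple

# Highest priority first: a name found in /etc hides the same name in /lib.
RULES_DIRS = [Path("/etc/udev/rules.d"), Path("/lib/udev/rules.d")]

# Helpers discussed in this thread (assumption: a text match is enough).
PATTERN = re.compile(r"probe-ata-smart|hdparm")


def effective_rules(dirs: List[Path] = RULES_DIRS) -> Dict[str, Path]:
    """Map each rules file name to the path udev would read for it."""
    chosen: Dict[str, Path] = {}
    for d in dirs:
        if not d.is_dir():
            continue
        for f in sorted(d.glob("*.rules")):
            chosen.setdefault(f.name, f)  # keep the higher-priority copy
    # udev processes rules files in lexical order of their names.
    return dict(sorted(chosen.items()))


def active_matches(dirs: List[Path] = RULES_DIRS,
                   pattern: re.Pattern = PATTERN) -> Iterator[Tuple[Path, int, str]]:
    """Yield (path, line number, text) for uncommented lines matching pattern."""
    for path in effective_rules(dirs).values():
        text = path.read_text(errors="replace")
        for no, line in enumerate(text.splitlines(), start=1):
            if line.lstrip().startswith("#"):
                continue  # commented-out rule, as in the workaround
            if pattern.search(line):
                yield path, no, line.strip()
```

Iterating over `active_matches()` on an affected machine lists any uncommented rule that still mentions either helper. An empty result suggests the remaining IDENTIFY DEVICE errors are not triggered by these particular udev rules.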

Thomas Wagnwer@thowabu.de (thomas-thowabu) wrote :

I get the same problem with an NVIDIA MCP61 chipset, an OCZ Vertex 2 60GB and an ext4 filesystem.
It's a real pain: sometimes the whole system freezes.
The SSD has lost data.

The problem occurs less frequently with ext2...

And not even once with Windows XP or 7...

The SATA HDD and optical drive work fine.

Disabling hdparm and SMART doesn't help for me, running the latest rc kernels.
I think the problem lies between the SSD firmware and the libata kernel stack
(judging by the difference from Windows, although other drives work fine with libata).

Changed in linux:
importance: Unknown → Medium
Changed in libatasmart:
importance: Unknown → Critical
Neil Hooey (nhooey) wrote :

I just installed Fedora 14, which uses kernel 2.6.35.6-45.fc14.i686, and the "failed command: IDENTIFY DEVICE" and "failed command: FLUSH CACHE" problems went away.

Here's the Fedora bug:
https://bugzilla.redhat.com/show_bug.cgi?id=549981

More details at my StackExchange question:
http://askubuntu.com/questions/16608/how-do-you-fix-failed-command-identify-device-showing-up-in-dmesg

Jon Ramvi (ramvi)
Changed in easypeasy-project:
assignee: Jon Ramvi (ramvi) → nobody
In , Lennart-poettering (lennart-poettering) wrote :

Martin, can you prepare a patch for your requested changes?

In , Bugs-freedesktop-org-n (bugs-freedesktop-org-n) wrote :

you there, martin?

In , Greg Unger (mr-ory) wrote : Re: [Bug 445852]

I'm sorry, you have the wrong email. My name is Greg.
On Oct 31, 2013 9:28 AM, "Bugs-freedesktop-org-n" <email address hidden>
wrote:

> you there, martin?

In , mdyn (tamerlaha-gmail) wrote :

Something's going wrong, guys :)

2013/10/31 Greg Unger <email address hidden>

> I'm sorry, you have the wrong email. My name is Greg.
> On Oct 31, 2013 9:28 AM, "Bugs-freedesktop-org-n" <email address hidden>
> wrote:
>
> > you there, martin?

In , Martin Pitt (pitti) wrote :

I didn't actually see Lennart's comment 20 two years ago, sorry. Downgrading the priority, as the actual bug was fixed two years ago. What's left is some hardening, which I outlined in the last paragraph of comment 19.

Changed in libatasmart:
importance: Critical → Low
Stephan Müller (megandy) wrote :

Hi,

Since the update to 14.04.1 I have been encountering this error on my ThinkPad with an SSD. The system freezes, and after approx. 30 s to 1 min the following messages can be found with dmesg:

ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x50000 action 0x6 frozen
ata1: SError: { PHYRdyChg CommWake }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:00:b0:06:53/00:00:04:00:00/40 tag 0 ncq 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
ata1.00: status: { DRDY }

So the bug seems to have resurfaced. Any ideas?
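The `SAct` and `SErr` fields in this log are bitmasks. A small Python sketch that decodes them, assuming the SError bit layout from the SATA specification and the short labels libata prints for each bit (compare with the `SERR_*` constants in your kernel's include/linux/ata.h); the function names are hypothetical:

```python
from typing import List

# SError bit positions -> short labels as libata prints them in
# "SError: { ... }" (assumed from the SATA SError layout).
SERROR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serror(value: int) -> List[str]:
    """Return the labels of the SError bits set in value, lowest bit first."""
    return [name for bit, name in sorted(SERROR_BITS.items()) if (value >> bit) & 1]


def active_ncq_tags(sact: int) -> List[int]:
    """List the NCQ tags whose SAct bit is set (commands not yet completed)."""
    return [tag for tag in range(32) if (sact >> tag) & 1]


# Values from the log above: "SAct 0x7fffffff SErr 0x50000".
print(decode_serror(0x50000))             # ['PHYRdyChg', 'CommWake']
print(len(active_ncq_tags(0x7fffffff)))   # 31
```

Here the only SError bits set are PHYRdyChg and CommWake, i.e. the PHY reported a link state change, while SAct shows 31 NCQ tags still active when the timeout fired. On its own that does not say whether the drive, the controller or the cabling is at fault.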

Matt (matthewj-coke) wrote :

I have also run into this regression on Ubuntu 16.04.1 LTS.

Josir Cardoso Gomes (josircg) wrote :

I have also experienced this error. The SSD is working fine, but I'm worried about this odd message.

Dmesg:

ata5: drained 512 bytes to clear DRQ
[ 7.955932] ata5.01: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 7.955998] ata5.01: failed command: SMART
[ 7.956059] ata5.01: cmd b0/d5:01:06:4f:c2/00:00:00:00:00/10 tag 0 pio 512 in
                        res 58/00:ff:08:4f:c2/00:00:00:00:00/10 Emask 0x2 (HSM violation)
[ 7.956162] ata5.01: status: { DRDY DRQ }
[ 7.956240] ata5: soft resetting link

Kernel: 4.4.0-58-generic #79-Ubuntu SMP Tue Dec 20 12:12:35 UTC 2016 x86_64

$ dpkg -l | grep libatasmart
ii libatasmart4:amd64 0.19-3

If you need other info, I will be glad to help.
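The `cmd` and `res` fields in these logs dump the ATA taskfile registers. Below is a minimal Python sketch for reading them. It assumes the field order libata uses when printing them: command, feature, count, LBA low/mid/high, the five high-order-byte copies, then the device register. In a `res` dump the first two bytes are the status and error registers. The opcode table covers only the commands seen in this thread, and all names are hypothetical.

```python
from typing import Dict, List

# Byte order of libata's "cmd"/"res" dumps (assumption based on its
# error-handler output): command/feature:count:lbal:lbam:lbah/
# hob_feature:hob_count:hob_lbal:hob_lbam:hob_lbah/device.
# In a "res" dump the first two bytes are the status and error registers.
FIELDS = ["command", "feature", "nsect", "lbal", "lbam", "lbah",
          "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
          "device"]

# Only the opcodes that appear in this thread.
COMMANDS = {0xC8: "READ DMA", 0xB0: "SMART", 0x61: "WRITE FPDMA QUEUED"}
SMART_FEATURES = {0xD5: "SMART READ LOG"}

# Subset of ATA status-register bits; libata's own wording may differ.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def parse_taskfile(dump: str) -> Dict[str, int]:
    """Split a dump such as 'b0/d5:01:06:4f:c2/00:00:00:00:00/10'."""
    values = [int(b, 16) for group in dump.split("/") for b in group.split(":")]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} bytes, got {len(values)}: {dump!r}")
    return dict(zip(FIELDS, values))


def describe_cmd(dump: str) -> str:
    """Name the command and the device (0 = master, 1 = slave) it targeted."""
    tf = parse_taskfile(dump)
    name = COMMANDS.get(tf["command"], f"opcode 0x{tf['command']:02x}")
    if tf["command"] == 0xB0 and tf["feature"] in SMART_FEATURES:
        # SMART: the feature register picks the sub-command; for READ LOG
        # the LBA low register carries the log address.
        name = f"{SMART_FEATURES[tf['feature']]} (log 0x{tf['lbal']:02x})"
    device = 1 if tf["device"] & 0x10 else 0  # DEV bit of the device register
    return f"{name} on device {device}"


def lba28(dump: str) -> int:
    """28-bit LBA of a non-NCQ, non-48-bit command such as READ DMA."""
    tf = parse_taskfile(dump)
    return (tf["device"] & 0x0F) << 24 | tf["lbah"] << 16 | tf["lbam"] << 8 | tf["lbal"]


def transfer_bytes_28(dump: str, sector_size: int = 512) -> int:
    """Bytes requested by a 28-bit command (a count of 0 means 256 sectors)."""
    return (parse_taskfile(dump)["nsect"] or 256) * sector_size


def describe_status(res_dump: str) -> List[str]:
    """Decode the status byte, i.e. the first field of a 'res' dump."""
    status = parse_taskfile(res_dump)["command"]
    return [label for bit, label in STATUS_BITS.items() if status & bit]


print(describe_cmd("b0/d5:01:06:4f:c2/00:00:00:00:00/10"))     # SMART READ LOG (log 0x06) on device 1
print(describe_status("58/00:ff:08:4f:c2/00:00:00:00:00/10"))  # ['DRDY', 'DRQ']
read_dma = "c8/00:40:cb:60:32/00:00:00:00:00/e0"               # READ DMA failure quoted earlier
print(describe_cmd(read_dma), lba28(read_dma), transfer_bytes_28(read_dma))  # READ DMA on device 0 3301579 32768
```

For Josir's log, this reads the failing request as a SMART READ LOG (log address 0x06, one 512-byte sector) sent to the second device on ata5, with DRQ still set in the result status. That is consistent with the HSM violation libata reports. The same status byte, 0x58, appears in the READ DMA failure quoted earlier in this thread.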
