udisks-probe-ata-smart causes HSM violations

Bug #574462 reported by Jarige on 2010-05-03
This bug affects 17 people
Affects          Status    Importance  Assigned to  Milestone
Linux            Invalid   Undecided   Unassigned
linux (Ubuntu)             High        Unassigned
Lucid                      High        Unassigned

Bug Description

This is related to bug 445852; it causes the same effects, but under different circumstances.

During boot on an SSD system, and when logging in and starting something rather I/O-intensive such as Firefox, the system freezes for 30 seconds, and afterwards dmesg shows an error like

  ata2: lost interrupt (Status 0x58)
  ata2: drained 16384 bytes to clear DRQ.
  ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
  ata2.00: BMDMA stat 0x4
  ata2.00: cmd c8/00:40:cb:60:32/00:00:00:00:00/e0 tag 0 dma 32768 in
  res 58/00:40:cb:60:32/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
  ata2.00: status: { DRDY DRQ }
  ata2: soft resetting link
  ata2.00: configured for UDMA/66
  ata2: EH complete
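
For readers unfamiliar with the libata error format, here is a minimal Python sketch (not part of the original report) that decodes the status byte and the "cmd" line from the log above. The bit labels follow the naming in the kernel's linux/ata.h; the helper names decode_status and decode_cmd are made up for illustration, and the LBA arithmetic assumes a 28-bit command such as READ DMA (c8) or WRITE DMA (ca).

```python
import re

# ATA status register bits, labelled as in linux/ata.h.
# Assumption: only the bits relevant to this log are listed.
STATUS_BITS = {
    0x80: "BSY",
    0x40: "DRDY",
    0x20: "DF",
    0x10: "DSC",
    0x08: "DRQ",
    0x01: "ERR",
}

# libata prints: cmd CMD/FEAT:NSECT:LBAL:LBAM:LBAH/<five hob bytes>/DEVICE
CMD_RE = re.compile(
    r"cmd (?P<cmd>[0-9a-f]{2})/(?P<feat>[0-9a-f]{2}):(?P<nsect>[0-9a-f]{2}):"
    r"(?P<lbal>[0-9a-f]{2}):(?P<lbam>[0-9a-f]{2}):(?P<lbah>[0-9a-f]{2})/"
    r"(?:[0-9a-f]{2}:){4}[0-9a-f]{2}/(?P<dev>[0-9a-f]{2})"
)


def decode_status(status):
    """Return the names of the bits set in an ATA status byte."""
    return [name for bit, name in STATUS_BITS.items() if status & bit]


def decode_cmd(line):
    """Parse a libata 'cmd ...' line; return None if it does not match."""
    match = CMD_RE.search(line)
    if match is None:
        return None
    f = {key: int(value, 16) for key, value in match.groupdict().items()}
    # 28-bit LBA: low nibble of the device register holds bits 24..27.
    lba = ((f["dev"] & 0x0F) << 24) | (f["lbah"] << 16) | (f["lbam"] << 8) | f["lbal"]
    # A sector count of 0 means 256 sectors for 28-bit commands.
    return {"command": f["cmd"], "sectors": f["nsect"] or 256, "lba": lba}
```

Applied to the log above, status 0x58 has DRDY and DRQ set (plus the legacy DSC bit, which the kernel's summary line does not list), and the sector count 0x40 times 512 bytes matches the "dma 32768 in" reported by the kernel.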

The main cause of bug 445828 has now been fixed, but there are still some users who get those HSM violations/30 second hangs during boot.

Martin Pitt got ssh access to Jarige's machine which is still affected (he's willing to provide access to other people for debugging).

A lot of different commands were tried to reproduce this at runtime, like

  # for i in `seq 50`; do skdump --can-smart /dev/sda; hdparm -B254 /dev/sda; sleep 0.2; done
  # udevadm trigger --action=change --sysname-match=sda # (also in a loop)
  # (/lib/udev/udisks-probe-ata-smart /dev/sda &); /lib/udev/udisks-probe-ata-smart /dev/sda

and so on, but it seems impossible to reproduce at runtime unfortunately. I also tried those commands while a "grep -r . /usr" was running in the background to induce I/O and disk reading activity.

The interesting thing is that the bug goes away if you either disable /lib/udev/rules.d/85-hdparm.rules, or udisks-probe-ata-smart in /lib/udev/rules.d/80-udisks.rules. So this seems to happen in situations where there is something reading a lot of files from the disk, and hdparm or libatasmart send their ioctls to the drive.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-21-generic 2.6.32-21.32
Regression: No
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.32-21.32-generic 2.6.32.11+drm33.2
Uname: Linux 2.6.32-21-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices:
 **** List of PLAYBACK Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC268 Analog [ALC268 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC268 Analog [ALC268 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: jarik 1395 F.... pulseaudio
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0x78540000 irq 16'
   Mixer name : 'Realtek ALC268'
   Components : 'HDA:10ec0268,1025015b,00100101'
   Controls : 8
   Simple ctrls : 5
Date: Mon May 3 15:21:38 2010
InstallationMedia: Ubuntu-Netbook-Remix 9.10 "Karmic Koala" - Release i386 (20091028.4)
MachineType: Acer AOA110
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-21-generic root=UUID=994b1074-44a7-4871-9553-fe61b94182cf ro quiet splash pciehp.pciehp_force=1 elevator=noop
ProcEnviron:
 LANG=en_US.utf8
 SHELL=/bin/bash
RelatedPackageVersions: linux-firmware 1.34
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
dmi.bios.date: 05/09/2008
dmi.bios.vendor: INSYDE
dmi.bios.version: v0.3109
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: Base Board Product Name
dmi.board.vendor: Intel Corp.
dmi.board.version: Base Board Version
dmi.chassis.type: 1
dmi.chassis.vendor: Chassis Manufacturer
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnINSYDE:bvrv0.3109:bd05/09/2008:svnAcer:pnAOA110:pvr1:rvnIntelCorp.:rnBaseBoardProductName:rvrBaseBoardVersion:cvnChassisManufacturer:ct1:cvrChassisVersion:
dmi.product.name: AOA110
dmi.product.version: 1
dmi.sys.vendor: Acer

Jarige (jarikvh) wrote :
Martin Pitt (pitti) wrote :

At this point I'd appreciate some input from the kernel team on what this message actually means, and what the likely cause could be. It does not really seem specific to either hdparm or libatasmart; all they do is things like https://bugs.launchpad.net/ubuntu/karmic/+source/libatasmart/+bug/445852/comments/202 (i.e. using the normal SCSI ioctls).

One possible workaround might be to not run hdparm on SSD devices. This might only fix the symptom, but it might be an appropriate SRU for lucid.

description: updated
Changed in linux (Ubuntu):
status: New → Confirmed
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
Martin Pitt (pitti) on 2010-05-03
description: updated
Martin Pitt (pitti) wrote :

Hm, so my original workaround idea was to change /lib/udev/rules.d/85-hdparm.rules to

ACTION=="add", SUBSYSTEM=="block", KERNEL=="[sh]d[a-z]", \
        ATTR{queue/rotational}=="1", \
        RUN+="/lib/udev/hdparm"

to suppress the rule on SSDs.

But on Jarige's laptop this doesn't actually work: the attribute is "1" on the SSD (/dev/sda). It does seem to be correct on my Dell Mini, though.
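
To see what the kernel reports for a given disk before relying on a rule like this, a small Python sketch along these lines could be used. It only reads the sysfs attribute mentioned in this thread; the function name is_rotational and the sys_root parameter (there so the function can be tried against a test directory) are assumptions made for illustration.

```python
from pathlib import Path


def is_rotational(device, sys_root="/sys"):
    """Return True if queue/rotational is "1", False if it is anything else,
    or None if the attribute cannot be read (no such device, old kernel)."""
    attr = Path(sys_root) / "block" / device / "queue" / "rotational"
    try:
        return attr.read_text().strip() == "1"
    except OSError:
        return None
```

Calling is_rotational("sda") on an affected machine would show whether the SSD is being reported as rotational, in which case the rule above would still run hdparm on it.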

Martin Pitt (pitti) wrote :

Adding a hdparm task for now, for possible workarounds.

Changed in hdparm (Ubuntu):
assignee: nobody → Martin Pitt (pitti)
Tim Gardner (timg-tpi) wrote :

Martin - a first pass debug effort is to try an upstream kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.34-rc6-lucid. I'll look into what an HSM violation means.

Tim Gardner (timg-tpi) on 2010-05-03
Changed in linux (Ubuntu Lucid):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Dave V (mindkeep) wrote :

Attached a fresh dmesg dump from my eee900. This laptop has been useless for the last 6 months. ssh access can be made available. At this point, I can even send the hardware if need be.

Chris Howson (cdh) wrote :

I confirm this bug on an Aspire One A110 netbook with 16G Super Talent SSD. It is running Lucid Lynx Netbook Remix with the 2.6.32-22-generic kernel. The system doesn't freeze on every boot though it can often be provoked by launching Opera as soon as the desktop appears. The workaround of disabling udisks-probe-ata-smart in /lib/udev/rules.d/80-udisks.rules removes the symptoms.

Andrew Simpson (andrew-simpson) wrote :

Referring to comment #7:
I also have an Aspire One A110 netbook with 16G Super Talent SSD. Running standard Ubuntu (Lucid). Exactly the same hardware, but I haven't had noticeable 'freezing' problems. This machine previously suffered from Bug #445828.

I have had (two, maybe three times in a month of use) 'freezing' during periods of intense disk activity. I have saved the dmesg of one event and will attach it below. I'm not totally sure it's related to this bug, but there is a bit more information in the log and it might help.

Chris Howson (cdh) wrote :

In response to comment #8: it looks like the same bug to me. I hereby attach my dmesg.

Jarige (jarikvh) wrote :

As I understand it, these are the same symptoms, but a different cause. The first cause was already fixed and I still have the same symptoms. So there must be a second cause, which is why this report was opened.

elmimmo (chocolate-camera) wrote :

I have the same symptoms, running standard Lucid on an Acer Aspire One ZG5 (same as A110L I think) with a Super Talent 32GB SSD.

Happening pretty much everytime I boot, after the desktop appears and I launch Google Chrome.

I have an ASUS eeePC 900 (Target) with RAM upgrade and (formerly) a Patriot SSD upgrade as described in Bug 445852. I believe THIS bug bit the SSD hard on the lucid upgrade, and unfortunately I may have finally worn out the SSD with all my experiments the last few months.

Despite several "zeros" (dd if=/dev/zero etc.) I can no longer install lucid OR karmic successfully on the 32gig Patriot SSD. Before it started failing completely I was getting the 30+ second hangs with the drive light on, and I didn't think to look for the HSM violations until too late. Once I saw some in the VT terminal while I was copying /dev/zero to the drive, and I took a photo in case the exact messages are helpful.

I put the 4gig stock SSD back in, and installed Lucid just fine. (But the stock SSD on this system never showed the serious corruption problems like in Bug 445852.) I haven't put lucid through the paces on the 4gig system, so it may show trouble I haven't seen yet.

Is there a place people are collecting wisdom on whether / how to test or recover failed SSDs if they don't respond to the /dev/zero treatment? Also when I go to order a replacement SSD, is there a place that lists models that are not affected by these bugs?

Wes Harper (wes-h) wrote :

Yeah, I have this problem too. I have an Eee PC 900 / Linux with the 4G onboard and 16G socketed drives. I'll skip the dmesg since it's identical to others posted with the Eee PC. I would like to note that I first disabled hdparm and it worked as a workaround until I ran update-grub (grub2 on 10.04 UNE). Now in addition to disabling hdparm I have disabled the smart probe and now there are no more 30 second hangs. I'm not sure if they would return if I just enabled hdparm again, but my guess would be yes. On an earlier installation I reduced the hangs by making sure I set the BIOS to "start" until I finished the installation, then set it to "finished", but I'm not sure if that is related. Before that, when I installed with the BIOS set to "finished", it hung frequently even after boot, but after I changed the BIOS setting the errors would only occur right after getting to the desktop and starting an application. After the initial group of 3 or 4 hangs it would never do it again until reboot. I wonder if grub2 is somehow related to this, or the padding at the beginning of the partitions. This is all on Ubuntu 10.04 UNE.

MFV (mfv) wrote :

@#13, you need to trash all recognisable traces of filesystems and partitions to recover. At present, only older Linux distros can recover, but I didn't have much success with them booting on the EeePC (YMMV of course).

OpenSolaris revived my setup every time, but for those fearing continued write damage this is going to make you wince. The process is:

1. get an img from genunix.org (milax is ideal as it's very small)
2. boot it
3. at a shell, pfexec format
4. choose the disk, it should be fairly clear which is which
5. [A]nalyze
6. [P]urge

Sit back and watch the speed of an SSD!

@MFV not sure what you are saying -- I did completely remove all traces of the previous OS by writing zeroes to the SSD, but after several cycles of this and through several versions of Ubuntu at some point the SSD may have worn out. I will try it again in a few weeks once I contact Patriot and see if they have any recommendations. And of course, the "speed of an SSD" part is almost funny to consider -- these are flash SSDs on the EeePC 900 and they are not very fast even going full tilt.

tags: added: kernel-core kernel-reviewed
Will (will-berriss) wrote :

Got this on an original Acer Aspire One with a new 32GB SSD from SuperTalent.
LOTS of disk activity at boot up - gradually killing my SSD I think, it just did a disk check. :(

Martin Pitt (pitti) on 2010-06-16
Changed in hdparm (Ubuntu):
assignee: Martin Pitt (pitti) → nobody
JP Vossen (jp-jpsdomain) wrote :

Dell Mini9 running a clean Lucid install. Fully up-to-date but no work-arounds attempted. Upgraded RAM and 16G SSD. I see messages like this several times a day:

Jun 17 10:37:02 mini9 kernel: [541557.000145] ata1: lost interrupt (Status 0x58)
Jun 17 10:37:02 mini9 kernel: [541557.004104] ata1: drained 2048 bytes to clear DRQ.
Jun 17 10:37:02 mini9 kernel: [541557.007610] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 17 10:37:02 mini9 kernel: [541557.007626] ata1.00: BMDMA stat 0x24
Jun 17 10:37:02 mini9 kernel: [541557.007640] ata1.00: failed command: READ DMA
Jun 17 10:37:02 mini9 kernel: [541557.007669] ata1.00: cmd c8/00:08:10:72:2b/00:00:00:00:00/e1 tag 0 dma 4096 in
Jun 17 10:37:02 mini9 kernel: [541557.007675] res 58/00:08:10:72:2b/00:00:00:00:00/e1 Emask 0x2 (HSM violation)
Jun 17 10:37:02 mini9 kernel: [541557.007689] ata1.00: status: { DRDY DRQ }
Jun 17 10:37:02 mini9 kernel: [541557.007750] ata1: soft resetting link
Jun 17 10:37:02 mini9 kernel: [541557.176536] ata1.00: configured for UDMA/66

I'm worried about hardware death...

The SSD I'm running is a SUPER TALENT model FEM16GHDL: http://www.newegg.com/Product/Product.aspx?Item=N82E16820609413&Tpk=FEM16GHDL

$ cat /proc/scsi/scsi
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
  Vendor: ATA Model: Flash Module Rev: Ver2
  Type: Direct-Access ANSI SCSI revision: 05

JP Vossen (jp-jpsdomain) wrote :

I should have noted in my comment 18 that I noticed this issue via logcheck, and that I was previously running Ubuntu 9.04 LPIA, with logcheck and did NOT see this problem. I made 2 attempts to upgrade to 9.10 and both failed utterly, presumably due to the terrible LPIA ports. So as noted this is a clean install of 10.04 i386.

This machine sits idle most of the time, except when I'm using it to read mail and surf, and it has a bunch of FF tabs open. So I find the following really interesting. Note the minutes-after-the-hour part. What runs then?

$ fgrep 'ata1: lost interrupt (Status 0x58)' /var/log/syslog /var/log/syslog.1
/var/log/syslog:Jun 18 08:37:02 mini9 kernel: [620757.000141] ata1: lost interrupt (Status 0x58)
/var/log/syslog:Jun 18 08:37:33 mini9 kernel: [620788.000122] ata1: lost interrupt (Status 0x58)
/var/log/syslog:Jun 18 10:37:02 mini9 kernel: [627956.988194] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 02:37:06 mini9 kernel: [512761.000144] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 02:37:37 mini9 kernel: [512792.000142] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 04:37:02 mini9 kernel: [519956.988157] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 06:37:02 mini9 kernel: [527156.989171] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 06:37:33 mini9 kernel: [527188.000141] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 10:37:02 mini9 kernel: [541557.000145] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 14:37:02 mini9 kernel: [555956.988181] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 18:37:02 mini9 kernel: [570356.988163] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 20:37:02 mini9 kernel: [577556.988207] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 17 22:37:02 mini9 kernel: [584756.988199] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 18 00:37:02 mini9 kernel: [591956.989193] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 18 02:37:02 mini9 kernel: [599157.000122] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 18 04:37:02 mini9 kernel: [606357.000144] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 18 04:37:33 mini9 kernel: [606388.000138] ata1: lost interrupt (Status 0x58)
/var/log/syslog.1:Jun 18 06:37:02 mini9 kernel: [613556.989153] ata1: lost interrupt (Status 0x58)

# grep -R '3[567]' /etc/cron* /var/spool/cron/crontabs/
#
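
The minutes-after-the-hour pattern can also be tallied mechanically. The following Python sketch is an illustration only (the function name and the regular expression are assumptions, and it expects the classic "Mon DD HH:MM:SS" syslog prefix seen above); it counts "lost interrupt" lines by the minute of the hour at which they were logged.

```python
import re
from collections import Counter

# Matches the classic syslog timestamp, e.g. "Jun 18 08:37:02 mini9 kernel: ..."
# It also works on fgrep output where a "filename:" prefix precedes the stamp.
STAMP_RE = re.compile(r"[A-Z][a-z]{2} +\d+ (\d{2}):(\d{2}):\d{2} ")


def minutes_of_lost_interrupts(lines, needle="lost interrupt"):
    """Count matching log lines by the minute of the hour they occurred."""
    counts = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = STAMP_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts
```

Fed the fgrep output above, every hit lands on minute 37, which is what makes a periodic job look like a plausible trigger. It does not say which job that is.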

@JP Vossen: I just looked through my system and noticed auth.log contains an hourly entry by pam_unix, (though mine are at :17 after) so it may be that's when pam renews its authentication, which probably causes the disk buffers to get flushed.

Jarige (jarikvh) wrote :

Is there still anyone busy on fixing this bug? It has taken long enough now, this bug should have higher priority as it could wear out the SSD. At least the workaround should be distributed through an update.
I recently received an update which disabled the workaround, so I enabled it again.

JP Vossen (jp-jpsdomain) wrote :

@Jarige in comment 21, what work-around, the one in comment 3?

@Tommy Trussell in comment 20, interesting point. I will "grep '^... .. ..:3[567]:' /var/log/*" and take a good look at the results, when I have some time to spare.

Jarige (jarikvh) wrote :

No, apparently the one in #3 doesn't work as Martin tried it on my machine using SSH.
I'm talking about the workaround in #7:
"The workaround of disabling udisks-probe-ata-smart in /lib/udev/rules.d/80-udisks.rules"

That one I recently reapplied.

Jarige (jarikvh) wrote :

Hmm, after the workaround was suddenly turned off, I reapplied it, and found that /etc/apt/preferences.d had become corrupted. Instead of a directory it had turned into a file, which crashed apt-get. Opening the file showed nothing of interest (an empty file) while it was about 13KB in size.
I had the same problem a few weeks ago (or maybe some months) but instead of preferences.d it was apt.conf.d that turned into a file, and sources.list got corrupted. I remember opening sources.list, and at the end of the file it just stopped without closing. This has been a while ago, but it could well have been at the same time the workaround wasn't active since I did remove the workaround in that time.

Also, #19 indicates that it appears every hour, which could prove that it's apt checking for updates. I'm just speculating though, it could be coincidence. I don't know how many times an hour apt checks for updates, but if it's an hourly schedule, it's worth checking which command triggers the bug. I suppose apt itself is not the problem as it probably just calls some command in a library, right?

I hope to be of assistance this way...

JP Vossen (jp-jpsdomain) wrote :

I tried the one in #3 and it does not seem to have worked either, though I didn't restart anything or reboot so I'm not 100% sure it took effect.

Different idea: I'm hazy on just what hdparm is needed for. What if I just rename it, and create a symlink to true in its place? Will that cause any Bad Things? The machine this is on has its 16G SSD and that's it, though I do rarely insert USB sticks and even more rarely SD-cards, if that matters.

Chris Howson (cdh) wrote :

@Jarige - Do you know which update disabled the workaround ? I have a bunch of updates waiting but am afraid to activate them in case I get the problems you have had.

Jarige (jarikvh) wrote :

@Chris
No, sorry. I don't know which one disabled the workaround.
But if you just install the updates and check afterwards whether the workaround is still in place, you can simply enable the workaround again and no harm will be done. I think that file is only loaded during boot, unless you tell it to load again, so just don't reboot until the workaround is applied.

I don't have a lot of time this week (not until Friday at the least) so after that I hope to do some tests. I want to disable the workaround and check whether apt is causing the symptoms by manually checking for updates using the command line. I don't suppose it'd work, but it's worth trying...

JP Vossen (jp-jpsdomain) wrote :

UPDATE: I added the work-around in comment 3, and it seemed to have no effect. Then I had to reboot for other reasons, and since that reboot I've had the messages only once, as follows, which is a great improvement.

I did *not* mess with /sbin/hdparm as I mentioned I might in comment 25.

Jun 23 22:17:13 mini9 kernel: [167484.988135] ata1: lost interrupt (Status 0x58)
Jun 23 22:17:13 mini9 kernel: [167484.988223] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 23 22:17:13 mini9 kernel: [167484.988239] ata1.00: BMDMA stat 0x24
Jun 23 22:17:13 mini9 kernel: [167484.988253] ata1.00: failed command: WRITE DMA
Jun 23 22:17:13 mini9 kernel: [167484.988282] ata1.00: cmd ca/00:08:50:b2:d6/00:00:00:00:00/e0 tag 0 dma 4096 out
Jun 23 22:17:13 mini9 kernel: [167484.988288] res 58/00:08:50:b2:d6/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
Jun 23 22:17:13 mini9 kernel: [167484.988303] ata1.00: status: { DRDY DRQ }
Jun 23 22:17:13 mini9 kernel: [167484.988363] ata1: soft resetting link
Jun 23 22:17:13 mini9 kernel: [167485.196524] ata1.00: configured for UDMA/66
Jun 23 22:17:44 mini9 kernel: [167516.000138] ata1: lost interrupt (Status 0x58)
Jun 23 22:17:44 mini9 kernel: [167516.000240] ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Jun 23 22:17:44 mini9 kernel: [167516.000257] ata1.00: BMDMA stat 0x24
Jun 23 22:17:44 mini9 kernel: [167516.000271] ata1.00: failed command: WRITE DMA
Jun 23 22:17:44 mini9 kernel: [167516.000300] ata1.00: cmd ca/00:10:98:0f:8b/00:00:00:00:00/e0 tag 0 dma 8192 out
Jun 23 22:17:44 mini9 kernel: [167516.000306] res 58/00:10:98:0f:8b/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
Jun 23 22:17:44 mini9 kernel: [167516.000320] ata1.00: status: { DRDY DRQ }
Jun 23 22:17:44 mini9 kernel: [167516.000381] ata1: soft resetting link
Jun 23 22:17:44 mini9 kernel: [167516.208571] ata1.00: configured for UDMA/66

## Work-around
[jp@mini9:T4:L1:C525:J0:2010-06-24_15:29:18_EDT]
/home/jp$ ll /lib/udev/rules.d/85-hdparm.rules
-rw-r--r-- 1 root root 116 2010-06-20 15:44 /lib/udev/rules.d/85-hdparm.rules

[jp@mini9:T4:L1:C516:J0:2010-06-24_15:25:26_EDT]
/home/jp$ cat /lib/udev/rules.d/85-hdparm.rules
ACTION=="add", SUBSYSTEM=="block", KERNEL=="[sh]d[a-z]", \
 ATTR{queue/rotational}=="1", \
 RUN+="/lib/udev/hdparm"

[jp@mini9:T4:L1:C517:J0:2010-06-24_15:25:45_EDT]
/home/jp$ last reboot
reboot system boot 2.6.32-22-generi Mon Jun 21 23:46 - 15:26 (2+15:39)
reboot system boot 2.6.32-22-generi Mon Jun 21 19:16 - 23:38 (04:21)
reboot system boot 2.6.32-22-generi Fri Jun 11 04:12 - 19:15 (10+15:02)
reboot system boot 2.6.32-21-generi Thu Jun 10 19:33 - 04:08 (08:35)

[jp@mini9:T4:L1:C521:J0:2010-06-24_15:27:19_EDT]
/home/jp$ zfgrep -c 'ata1: lost interrupt (Status 0x58)' /var/log/syslog*
/var/log/syslog:0
/var/log/syslog.1:2
/var/log/syslog.2.gz:0
/var/log/syslog.3.gz:6
/var/log/syslog.4.gz:14
/var/log/syslog.5.gz:12
/var/log/syslog.6.gz:9
/var/log/syslog.7.gz:15

### Note time changed from :37: to :17:, not sure of the significance, if any
[jp@mini9:T4:L1:C522:J0:2010-06-24_15:27:31_EDT]
/home/jp$ zfgrep 'ata1: lost interrupt (Status 0x58)' /var/log/syslog* | head -2...


Jarige (jarikvh) wrote :

@JP Vossen: If you want the symptoms to stop, you should apply the workaround in #7.

I think the time at which the symptoms occur might be related to the time you turned on the computer. After you turned on the computer some program produces the bug, and then produces it every hour.

Chris Howson (cdh) wrote :

@Jarige, re post #27: I applied all the updates, except one for udisks and the workaround is still in place.

JP Vossen (jp-jpsdomain) wrote :

@Jarige, I thought the work-around I applied from comment 3 *was* the "disabling udisks-probe-ata-smart in /lib/udev/rules.d/80-udisks.rules" one in comment 7. I didn't read the file-names carefully enough.

Having said that, '/lib/udev/rules.d/80-udisks.rules' says right at the top "# Do not edit this file, it will be overwritten on updates" while "/lib/udev/rules.d/85-hdparm.rules" does not. And it is not clear to me how, exactly, to disable it per comment 7 anyway.

I'll keep an eye on it, but the simple, presumably non-clobberable comment 3 fix is more-or-less working for me right now.

Chris Howson (cdh) wrote :

@JP Vossen: to apply the workaround you can simply search in the file for "ATA disks driven by libata" and comment out the following line which begins with "KERNEL":

# ATA disks driven by libata
# KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="udisks-probe-ata-smart $tempnode"
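
For anyone who would rather script this edit than do it by hand, here is a minimal Python sketch of the same change. It is an illustration under stated assumptions, not a tested tool: the function name disable_smart_probe is made up, it works on the text of the rules file rather than on the file itself, and it assumes the rule sits on a single line as shown above (no backslash continuations).

```python
def disable_smart_probe(rules_text, marker="udisks-probe-ata-smart"):
    """Return rules_text with every active line mentioning marker commented out.

    Lines that are already comments are left alone, so running this twice
    gives the same result as running it once.
    """
    out = []
    for line in rules_text.splitlines(keepends=True):
        if marker in line and not line.lstrip().startswith("#"):
            line = "# " + line
        out.append(line)
    return "".join(out)
```

One would read /lib/udev/rules.d/80-udisks.rules, pass its contents through this function, and write the result back as root after keeping a backup. Since the file's own header says it is overwritten on updates, the edit has to be rechecked after every udisks update.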

General Note:

The workaround in Comment #3 won't work for many machines. This is because the kernel tries to detect whether the device is an SSD and is meant to set /sys/block/sda/queue/rotational to '0' for SSD or '1' for HDD. However, it often gets it wrong (or gets wrong info from the drive). Many of the SSDs we have are telling the kernel that they are rotational!

Use the workaround in Comment #7 or Comment #32 (both same).

If you are getting symptoms at the same time every hour, have you checked what is in /etc/cron.hourly?

There is nothing in cron.hourly on my machine (and I don't get freezes every hour).

AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
AplayDevices:
 **** List of PLAYBACK Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC662 rev1 Analog [ALC662 rev1 Analog]
   Subdevices: 0/1
   Subdevice #0: subdevice #0
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC662 rev1 Analog [ALC662 rev1 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ptaylor 1245 F.... pulseaudio
 /dev/snd/pcmC0D0p: ptaylor 1245 F...m pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf7eb8000 irq 16'
   Mixer name : 'Realtek ALC662 rev1'
   Components : 'HDA:10ec0662,10438337,00100101'
   Controls : 17
   Simple ctrls : 10
DistroRelease: Ubuntu 10.04
Frequency: Once a day.
HibernationDevice: RESUME=UUID=b937331d-b512-48c0-b65b-e52db4f469d3
InstallationMedia: Ubuntu 10.04 LTS "Lucid Lynx" - Release i386 (20100429)
MachineType: ASUSTeK Computer INC. 900
Package: linux (not installed)
PackageArchitecture: i386
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-23-generic root=UUID=acd9d6d2-46b8-4286-93b8-62573cb294bf ro quiet splash
ProcEnviron:
 LANG=en_GB.utf8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-23.37-generic 2.6.32.15+drm33.5
Regression: No
RelatedPackageVersions: linux-firmware 1.34.1
Reproducible: No
Tags: lucid lucid needs-upstream-testing lucid lucid needs-upstream-testing
Uname: Linux 2.6.32-23-generic i686
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare
dmi.bios.date: 06/10/2008
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 0704
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: 900
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: x.xx
dmi.chassis.asset.tag: 0x00000000
dmi.chassis.type: 10
dmi.chassis.vendor: ASUSTek Computer INC.
dmi.chassis.version: x.x
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr0704:bd06/10/2008:svnASUSTeKComputerINC.:pn900:pvr0704:rvnASUSTeKComputerINC.:rn900:rvrx.xx:cvnASUSTekComputerINC.:ct10:cvrx.x:
dmi.product.name: 900
dmi.product.version: 0704
dmi.sys.vendor: ASUSTeK Computer INC.

tags: added: apport-collected

apport information

Philip Taylor (scraliontis) wrote :

Sorry for all that info apport stuck there.
I did a fresh 10.04 install this morning on an eeepc 900 ssd,
and experienced the above error (HSM violation).

I have attached my kernel log, with said error.
Alan Pope from ubuntu-uk also advised me to open a new bug, so I will be doing that tomorrow.

Andy Whitcroft (apw) on 2010-07-15
Changed in linux (Ubuntu Lucid):
importance: Undecided → High
Tim Gardner (timg-tpi) on 2010-08-27
Changed in linux (Ubuntu Lucid):
assignee: Tim Gardner (timg-tpi) → nobody
Michael Hughes (semi-logic) wrote :

From comment 1 - "The interesting thing is that the bug goes away if you either disable /lib/udev/rules.d/85-hdparm.rules, or udisks-probe-ata-smart in /lib/udev/rules.d/80-udisks.rules."

I have tried both of these work-arounds and they do not fix this issue. I cannot get so far as to even boot into the 10.04 LiveCD desktop (unless it plans to time out after 20-30 minutes--haven't tried).

My system is a Mac Mini (mid 2010) with a Corsair F120 SSD, and I can provide more information if it would be helpful. However, it is identical to most of the information posted here, except that the work-arounds do not help. I also have tried the "Mac Mini Spin": (https://help.ubuntu.com/community/Macmini4-1/Lucid), which works with the regular hard drive, not the SSD I'd like to use.

@Michael Hughes: If you are having such extreme issues I suspect you may not be seeing this exact bug; in fact I have not noticed this problem when booting from a live filesystem (though I suppose it could be trying to use your SSD for swap and hanging up there).

Maybe you should open a separate bug and post the exact messages you are getting...

Fortunately for me, Patriot replaced my flash SSD for my EeePC 900 under warranty (the data-destructive Bug 445852 seems to have worn out the first 32gig drive) but I still see THIS bug, which makes the machine somewhat slow. Fortunately (so far) it does not apparently destroy data, and the workaround described here works fine.

mdyn (tamerlaha-gmail) wrote :

Acer Aspire one 110
same problem
Ubuntu 10.04 but in 9.04 all was fine
Clean install
I can provide ssh access if necessary

Mikael Hjelm (j-m-hjelm) wrote :

I still seem to see this bug even though I have followed the advice in #3 and #7.

[ 40.816105] ata2: lost interrupt (Status 0x58)
[ 40.820035] ata2: drained 2048 bytes to clear DRQ.
[ 40.823235] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[ 40.823245] ata2.00: BMDMA stat 0x4
[ 40.823255] ata2.00: failed command: READ DMA
[ 40.823273] ata2.00: cmd c8/00:08:b7:2e:41/00:00:00:00:00/e0 tag 0 dma 4096 in
[ 40.823277] res 58/00:08:b7:2e:41/00:00:00:00:00/e0 Emask 0x2 (HSM violation)
[ 40.823286] ata2.00: status: { DRDY DRQ }
[ 40.823337] ata2: soft resetting link
[ 40.992479] ata2.00: configured for UDMA/66
[ 40.992499] ata2: EH complete

/lib/udev/rules.d/80-udisks.rules

# ATA disks driven by libata
#KERNEL=="sd*[!0-9]", ATTR{removable}=="0", ENV{ID_BUS}=="ata", ENV{DEVTYPE}=="disk", IMPORT{program}="udisks-probe-ata-smart $tempnode"

/lib/udev/rules.d/85-hdparm.rules

ACTION=="add", SUBSYSTEM=="block", KERNEL=="[sh]d[a-z]", \
        ATTR{queue/rotational}=="1", \
        RUN+="/lib/udev/hdparm"

Any ideas?

@Mikael,
Have you seen https://bugs.launchpad.net/bugs/445852 and in particular entry 68?
I was seeing similar errors even after applying the patches on my Asus 900. After I discovered that I had a lot of bad blocks on sdb I did the following:
Boot from USB or SD into a distro which doesn't exhibit the fault (I used crunchbang), then
1 Copied all data (including hidden files) to USB sticks
2 Zeroed the drive
3 Created and formatted a new partition
4 Copied the data back
5 Modified /etc/fstab for the new UUID of /home
You may have to chown /home; I didn't, as the same username and uid exist on both.

The errors stopped, touch wood no lasting damage has been done.

Mikael Hjelm (j-m-hjelm) wrote :

@Martin
Thanks, https://bugs.launchpad.net/bugs/445852 seems relevant; if I disable DMA on the drive it seems to be OK.
So there are probably some broken blocks somewhere. I will follow your advice and zero the drive, but I will probably just reinstall instead of trying to save the old installation.
Question: shouldn't erroneous blocks be visible with fsck from a live USB stick?

Mikael Hjelm (j-m-hjelm) wrote :

Zeroed the drive and reinstalled; now it runs a lot better. Thanks.
Was my problem caused by internal fragmentation of the drive?

@Mikael,
I believe (but may be wrong) that the problem was blocks being incorrectly marked as bad.
Glad to hear that it's better.

Tim Russell (fargle) wrote :

Just FYI, I just installed a new Western Digital WD20EVDS 2 TB rotational drive and got this exact same behavior - lost interrupts etc. I had created a full-disk EXT4 partition and the corruption was so bad that whenever the machine tried to mount the FS, it caused a kernel oops which killed the whole system. I finally had to dd if=/dev/zero over the start of the drive to kill the partition table so it would stop trying to mount it on boot.

Anyway, disabling udisks-probe-ata-smart via dpkg-divert fixed things perfectly and it works fine now.

This code needs to go! It should never have been added; smartmontools has been around forever with no problems, but this code is killing drives left and right. It's not worth the small benefit of notifying desktop users of drive problems.

Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody

Neil Hooey (nhooey) wrote :

Who is fixing this bug?

Even after disabling S.M.A.R.T. and hdparm, and even zeroing the disk and clean-installing Ubuntu, I still get the "failed command: IDENTIFY DEVICE" errors on my "ATA-8: Mushkin MKNSSDCL60GB-DX, 340A13F0, max UDMA/133" SSD.

What software package is responsible for these errors?

Neil Hooey (nhooey) wrote :

I just installed Fedora 14, which uses kernel 2.6.35.6-45.fc14.i686, and the "failed command: IDENTIFY DEVICE" and "failed command: FLUSH CACHE" problems went away.

Here's the Fedora Bug:
https://bugzilla.redhat.com/show_bug.cgi?id=549981

More details at my StackExchange question:
http://askubuntu.com/questions/16608/how-do-you-fix-failed-command-identify-device-showing-up-in-dmesg
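
To compare kernels or distros as described above, the error lines quoted in this report can be filtered out of the kernel log. A minimal sketch follows; the function name ata_errors is made up for illustration, and the pattern covers only the messages quoted in this thread.

```shell
#!/bin/sh
# Sketch only: ata_errors is a hypothetical helper name.
# Reads a kernel log on stdin and prints the ATA error lines
# of the kinds quoted in this bug report.
ata_errors() {
    grep -E 'lost interrupt|HSM violation|failed command: (IDENTIFY DEVICE|FLUSH CACHE)'
}

# Demo on three lines taken from the bug description;
# only the "lost interrupt" line matches.
printf '%s\n' \
    'ata2: lost interrupt (Status 0x58)' \
    'ata2: soft resetting link' \
    'ata2: EH complete' | ata_errors

# Real use:  dmesg | ata_errors
# Count:     dmesg | ata_errors | wc -l
```

Running the count after a fresh boot on each system gives a rough way to tell whether a given kernel still produces the errors.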

Johan Van den Neste (jvdneste) wrote :

I have an Acer Aspire One with a SuperTalent 32GB SSD upgrade (FEM32GF13M). I was affected by bug 445852, but have since been running 10.10 successfully. A week ago, the errors started popping up again, resulting in the inability to boot (also described elsewhere, "BUG: kernel paging error"). The easiest method of zeroing the drive with dd was using the latest TinyCore, which seems unaffected (kernel 2.6.33, I think). After that, 11.04 installs fine (before zeroing, it was almost impossible even to boot the installer). However, when the installed system has been running a few hours, the drive locks up, the system freezes and I'm back to square one.

It's deeply disappointing.

I'll give it another go and try to disable udisks-probe-ata-smart immediately after installation. Which is better: editing config files or using dpkg-divert?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in hdparm (Ubuntu Lucid):
status: New → Confirmed
Changed in hdparm (Ubuntu):
status: New → Confirmed

Jarige, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete

Jarige (jarikvh) wrote :

I haven't been bothered by this bug for some time now. I don't remember exactly when the symptoms stopped happening, but I think it was after I upgraded my SSD over a year ago. It could also have been a new release that fixed the problem. I still have the old 8GB SSD lying around somewhere, but I don't use it anymore.

I'm sorry for not being that active in this bug report. I haven't had the error and symptoms for months now, the last few releases were alright. And so I kinda forgot about this report.

Anyway, if there are no other people affected by this bug, I guess this report could be marked as solved.

Jarige, this bug report is being closed due to your last comment regarding this being fixed with an update. For future reference you can manage the status of your own bugs by clicking on the current status in the yellow line and then choosing a new status in the revealed drop down box. You can learn more about bug statuses at https://wiki.ubuntu.com/Bugs/Status. Thank you again for taking the time to report this bug and helping to make Ubuntu better. Please submit any future bugs you may find.

no longer affects: hdparm (Ubuntu)
no longer affects: hdparm (Ubuntu Lucid)
Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in linux (Ubuntu Lucid):
status: In Progress → Invalid
Changed in linux:
status: New → Invalid