Hardy: ata errors stop the boot process for 10 minutes

Bug #244363 reported by Marius Gedminas
Affects         Status        Importance  Assigned to  Milestone
Linux           Fix Released  Medium
linux (Ubuntu)  Fix Released  High        Tim Gardner
Hardy           Fix Released  High        Tim Gardner

Bug Description

Binary package hint: linux-image-2.6.24-19-generic

Yesterday my desktop running Ubuntu Hardy (x86_64) started having boot problems: it shows either a blank screen (with a blinking cursor) or the boot splash bouncing back and forth, with the disk silent, and makes no progress until I get bored and reboot.

When I boot in rescue mode, I see the following errors (just after it correctly detects my hard drive on ata1):

[ 28.820922] ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[ 28.824248] ata1.00: ATA-7: SAMSUNG SP2004C, VM100-50, max UDMA7
[ 28.824290] ata1.00: 390721968 sectors, multi 0: LBA48 NCQ (depth 31/32)
[ 28.853832] ata1.00: configured for UDMA/133
[ 31.171096] ata2: classification failed
[ 31.171136] ata2: reset failed (errno=-22), retrying in 8 secs
[ 39.482420] ata2: classification failed
[ 39.482460] ata2: reset failed (errno=-22), retrying in 10 secs
[ 49.472609] ata2: classification failed
[ 49.473157] ata2: reset failed (errno=-22), retrying in 35 secs
[ 83.799319] ata2: limiting SATA link speed to 1.5 Gbps
[ 84.438170] ata2: classfication failed, assuming ATA
[ 84.438218] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
[ 114.409170] ata2.00: qc timeout (cmd 0xec)
[ 114.409219] ata2.00: failed to IDENTIFY (I/O error, err_mask=0x4)
[ 114.409260] ata2: failed to recover some devices, retrying in 5 secs

and then it starts repeating, with longer timeouts. This goes on for 10 minutes, in the middle of which I get an (initramfs) prompt because /dev/disk/by-id/$UUID doesn't exist.
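The scheduled retry delays in the excerpt above can be tallied with a short script. This is only a sketch: the function name `retry_delays` and the regular expression are illustrative assumptions, and the sample contains just the "retrying in N secs" lines quoted in this report.

```python
import re

# Matches kernel log lines such as:
#   [ 31.171136] ata2: reset failed (errno=-22), retrying in 8 secs
RETRY_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(ata\d+): .*retrying in (\d+) secs")


def retry_delays(log_text, port="ata2"):
    """Return a list of (timestamp, delay_seconds) for each retry line of `port`."""
    result = []
    for line in log_text.splitlines():
        m = RETRY_RE.search(line)
        if m and m.group(2) == port:
            result.append((float(m.group(1)), int(m.group(3))))
    return result


# The retry lines from the first round of the log quoted above.
SAMPLE = """\
[ 31.171136] ata2: reset failed (errno=-22), retrying in 8 secs
[ 39.482460] ata2: reset failed (errno=-22), retrying in 10 secs
[ 49.473157] ata2: reset failed (errno=-22), retrying in 35 secs
[ 114.409260] ata2: failed to recover some devices, retrying in 5 secs
"""

delays = retry_delays(SAMPLE)
total = sum(delay for _, delay in delays)
print("scheduled retry delay in first round: %d secs" % total)
```

For this excerpt the scheduled delays add up to 58 seconds. The remainder of the stall presumably comes from the resets and command timeouts themselves, such as the roughly 30-second gap before the "qc timeout" message.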

When the 10 minutes pass, I see

[ 356.102791] ata2: classfication failed, assuming ATA
[ 356.102837] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 310)
[ 356.102997] scsi 0:0:0:0: Direct-Access ATA SAMSUNG SP2004C VM10 PQ: 0 ANSI: 5

then my hard disk becomes visible in /dev, and I can press Ctrl+D to continue booting.

This does not happen if I boot the older 2.6.22-14-generic kernel (version 2.6.22-14.47). (What happens instead is that X becomes unhappy, because I don't have nvidia's binary blob installed for that kernel.)

/var/log/dpkg.log tells me I got 2.6.24-19-generic (version 2.6.24-19.34) on June 26. /var/log/kern.log tells me that I successfully booted this kernel on June 26 (several times), 27 and 28. The first boot problems appeared only yesterday. Since then I've been having boot problems every day. Reinstalling the kernel didn't help.

As far as I can tell, ata1 has my SATA hard disk, while ata2 is supposed to have my DVD drive (which also has problems working in Windows). They're both attached to a JMicron RAID controller (that had problems loading GRUB until I reflashed it with a newer firmware).

$ lspci -nn -s 02:
02:00.0 SATA controller [0106]: JMicron Technologies, Inc. JMB361 AHCI/IDE [197b:2361] (rev 02)
02:00.1 IDE interface [0101]: JMicron Technologies, Inc. JMB361 AHCI/IDE [197b:2361] (rev 02)

$ ls -l /dev/disk/by-path/
pci-0000:00:1a.7-usb-0:1:1.0-scsi-0:0:0:0 -> ../../sda
pci-0000:00:1a.7-usb-0:1:1.0-scsi-0:0:0:1 -> ../../sdb
pci-0000:00:1a.7-usb-0:1:1.0-scsi-0:0:0:2 -> ../../sdc
pci-0000:00:1a.7-usb-0:1:1.0-scsi-0:0:0:3 -> ../../sdd
pci-0000:02:00.0-scsi-0:0:0:0 -> ../../sde
pci-0000:02:00.0-scsi-0:0:0:0-part1 -> ../../sde1
pci-0000:02:00.0-scsi-0:0:0:0-part2 -> ../../sde2
pci-0000:02:00.0-scsi-0:0:0:0-part3 -> ../../sde3
pci-0000:02:00.0-scsi-0:0:0:0-part4 -> ../../sde4
pci-0000:02:00.0-scsi-0:0:0:0-part5 -> ../../sde5
pci-0000:02:00.1-scsi-0:0:0:0 -> ../../scd0

(sda through sdd represent the multi-format card reader attached through USB)

The DVD drive seems to work (i.e. "eject" opens the tray) despite the long pause.

I searched Launchpad, and while I found some similar-looking bugs, none of them mentioned ata classification errors. Not being sure, I'm filing a new bug.

Dimitrios Symeonidis (azimout) wrote :

Connor Imes (ckimes) wrote :

Have you tried booting with kernel options "acpi=off noapic"?
That did the trick for me.

Marius Gedminas (mgedmin) wrote :

Connor: no, I haven't tried kernel options yet.

Dimitrios: I think you're right! I built a kernel with that single patch applied, but unfortunately it wasn't enough:

[ 30.118089] ahci 0000:02:00.0: JMB361 has only one port, port_map 0x3 -> 0x1
[ 30.118094] ahci 0000:02:00.0: nr_ports (2) and implemented port map (0x1) don't match, using nr_ports
[ 30.118096] ahci 0000:02:00.0: forcing PORTS_IMPL to 0x3

and then again

[ 34.106600] ata2: classification failed
[ 34.106604] ata2: reset failed (errno=-22), retrying in 8 secs

Looking at the code I think I see the problem with the fix: the port_map change is dropped a few lines down because it doesn't match the number of ports advertised in cap.nr_ports. Resetting the latter ought to fix it...
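The interaction described here can be modelled in a few lines. This is a simplified Python sketch, not the kernel code and not the attached patch: the function names are made up for illustration, the check is reduced to comparing a bit count with the port count, and the `cap` value 0x1 is an assumed example in which only the low five bits (number of ports minus one, per the AHCI CAP register layout) matter.

```python
def nr_ports(cap):
    """Ports advertised in the low 5 bits of CAP, which stores (count - 1)."""
    return (cap & 0x1f) + 1


def popcount(value):
    """Number of set bits in `value`."""
    return bin(value).count("1")


def cross_check(cap, port_map):
    """Simplified consistency check: if port_map disagrees with nr_ports,
    discard port_map and regenerate it from nr_ports."""
    if popcount(port_map) != nr_ports(cap):
        return (1 << nr_ports(cap)) - 1
    return port_map


def quirk_port_map_only(cap, port_map):
    """Quirk that only masks port_map down to the single real port."""
    return cap, port_map & 0x1


def quirk_port_map_and_cap(cap, port_map):
    """Quirk that also rewrites the port count in cap to match port_map."""
    port_map &= 0x1
    cap = (cap & ~0x1f) | (popcount(port_map) - 1)
    return cap, port_map


cap, port_map = 0x1, 0x3  # assumed: controller claims 2 ports, both implemented

# Masking port_map alone is undone by the check: 0x3 -> 0x1 -> back to 0x3.
print(hex(cross_check(*quirk_port_map_only(cap, port_map))))

# Adjusting the port count as well lets the masked map survive the check.
print(hex(cross_check(*quirk_port_map_and_cap(cap, port_map))))
```

The first call reproduces the sequence in the log above (port_map 0x3 -> 0x1, then forced back to 0x3), while the second keeps port_map at 0x1, so the nonexistent second port would not be probed.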

Marius Gedminas (mgedmin) wrote :

Here's a corrected patch that fixes the problem for me.

Marius Gedminas (mgedmin) wrote :

The upstream fix works fine with the upstream kernel because it has a different port_map consistency check.

Marius Gedminas (mgedmin) wrote :

As far as I can tell, fixing this would involve backporting 837f5f8fb98d4357d49e9631c9ee2815f3c328ca as well as d799e083a80b220f3681d7790f11e77d1704022b

Connor Imes (ckimes) wrote :

So is this a bug for JMicron support in the linux kernel?

Marius Gedminas (mgedmin) wrote :

More like a missing workaround for buggy hardware, but yes.

Connor Imes (ckimes) wrote :

Marius, thank you for taking the time to report this and follow through. I am going to mark this bug with "Fix Released" since you attached a fix and it is available upstream, but is not in the -proposed repository. If you find that this is made available in the Ubuntu repositories at a later time, please let me know and I can mark it with "Fix Committed". I am also marking the importance as High since it has a severe impact on a small number of users.
You clearly know more about dealing with kernel bugs than I do, so please do not hesitate to update me or correct me at any time. Thanks again.

Changed in linux:
importance: Undecided → High
status: New → Fix Released
Marius Gedminas (mgedmin) wrote :

I've only one question: what's the proper procedure for contacting the Ubuntu kernel maintainers and asking them whether this fix will appear in the next Hardy kernel update?

Also, I'm not following the Ubuntu release process very closely, and I don't know at what point the Intrepid kernel stops tracking upstream. Will this fix make it into Intrepid?

Connor Imes (ckimes) wrote :

I think the best way is to subscribe to the Ubuntu Kernel Team mailing list and send an email there. Here is their wiki homepage - https://wiki.ubuntu.com/KernelTeam
Their mailing list is under the "Getting Involved" link. You can also check out their KB and the other links on their page. I'm not sure how closely they work with the core Linux kernel developers, which is probably where the patch needs to be implemented, but the Ubuntu Kernel Team is the next rung up the ladder from here, so you should contact them first.
Here is their LP page as well - https://launchpad.net/~ubuntu-kernel-team

Once you get hold of them and get an answer, please post back here with what you learned. Then we will all know if and when the patch will be available in the repositories.
Thanks.

Changed in linux:
status: Unknown → Fix Released
Tim Gardner (timg-tpi) wrote :

SRU Justification

Impact - Some JMicron controllers stop the boot process.

Patch Description - JMB361 has only one port but reports it has two causing longish probe failure on the second one. Quirk it.

Patch: http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-hardy.git;a=commit;h=74e0ac48290489bceb1a73751c004caf8f8a5671

Test Case: See bug description

Tim Gardner (timg-tpi) wrote :
Changed in linux:
assignee: nobody → timg-tpi
milestone: none → ubuntu-8.04.2
status: Fix Released → Fix Committed
Steve Langasek (vorlon)
Changed in linux:
assignee: nobody → timg-tpi
importance: Undecided → High
milestone: none → ubuntu-8.04.2
status: New → In Progress
milestone: ubuntu-8.04.2 → ubuntu-8.10-beta
Steve Langasek (vorlon) wrote :

Accepted into -proposed; please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in linux:
milestone: ubuntu-8.04.2 → none
status: In Progress → Fix Committed
Marius Gedminas (mgedmin) wrote :

Note to self: apt-get install linux-image-2.6.24-20-generic instead of waiting for a new linux-image-2.6.24-19-generic update to be suggested by update-manager.

(FWIW the 2.6.24-19.36 version from hardy-security doesn't have the fix.)

Marius Gedminas (mgedmin) wrote :

The kernel in hardy-proposed (2.6.24-20.37) does not have the 10 minute pause during boot. Yay, fixed!

(Unfortunately there's no linux-restricted-modules-2.6.24-20-generic yet, so no evil NVidia compiz bling for me.)

Marius Gedminas (mgedmin) wrote :

No linux-ubuntu-modules either, so sound card is gone too. Fun!

Martin Pitt (pitti) wrote :

Marius, lum and lrm are on the way. We won't update linux-meta until everything is in place. Thanks for testing!

helpdeskdan (helpdeskdan-gmail) wrote :

Not solved for me with linux-image-2.6.24-20-generic.

[ 773.444685] usbcore: registered new interface driver hub
[ 773.472482] usbcore: registered new device driver usb
[ 773.500472] USB Universal Host Controller Interface driver v3.0
[ 773.689754] ata1.00: ATA-5: IBM-DJSA-210, JS2OAB0A, max UDMA/66
[ 773.689770] ata1.00: 19640880 sectors, multi 8: LBA
[ 773.732499] ata1.00: configured for UDMA/33
[ 773.817593] FDC 0 is a post-1991 82077
[ 774.000473] ata2.01: failed to IDENTIFY (I/O error, err_mask=0x1)
[ 774.000488] ata2: failed to recover some devices, retrying in 5 secs

(Long, long pause occurs here)

[ 775.842993] Clocksource tsc unstable (delta = -1144855154 ns)
[ 775.846974] Time: acpi_pm clocksource has been installed.
[ 776.065298] ata2.00: ATAPI: SAMSUNG CD-ROM SN-124, q008, max UDMA/33, CDB intr

Perhaps my problem is different?

Marius Gedminas (mgedmin) wrote :

helpdeskdan: if your SATA controller is not JMicron JMB361, then you have a different problem. Check with lspci:

  $ lspci -nn
  ...
  02:00.0 SATA controller [0106]: JMicron Technologies, Inc. JMB361 AHCI/IDE [197b:2361] (rev 02)
  02:00.1 IDE interface [0101]: JMicron Technologies, Inc. JMB361 AHCI/IDE [197b:2361] (rev 02)
  ...
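The same check can be scripted. The sketch below is an illustration only: the helper name `has_jmb361` is made up, and it simply looks for the vendor:device ID 197b:2361 shown in the output above.

```python
import re

# Vendor:device ID of the JMicron JMB361, as shown in the lspci output above.
JMB361_ID = "197b:2361"


def has_jmb361(lspci_output):
    """Return True if any line of `lspci -nn` output lists the JMB361 ID."""
    pattern = re.compile(r"\[%s\]" % re.escape(JMB361_ID), re.IGNORECASE)
    return any(pattern.search(line) for line in lspci_output.splitlines())


# The two lines quoted in this report.
SAMPLE = """\
02:00.0 SATA controller [0106]: JMicron Technologies, Inc. JMB361 AHCI/IDE [197b:2361] (rev 02)
02:00.1 IDE interface [0101]: JMicron Technologies, Inc. JMB361 AHCI/IDE [197b:2361] (rev 02)
"""

print(has_jmb361(SAMPLE))

# On a live system the text could come from the real command, e.g.:
#   import subprocess
#   text = subprocess.run(["lspci", "-nn"], capture_output=True, text=True).stdout
#   print(has_jmb361(text))
```

If this reports False for your machine, the controller is not a JMB361 and the stall most likely has a different cause.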

helpdeskdan (helpdeskdan-gmail) wrote :

My apologies, I thought they might be related. I will search again.

euthymos (euthymos) wrote :

I have a similar problem too: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/260543

Did you try to boot with an old kernel?

Martin Pitt (pitti) wrote :

linux 2.6.24-21 copied to hardy-updates.

Changed in linux:
status: Fix Committed → Fix Released
Colin Watson (cjwatson) wrote :

I've verified that the fix that was backported to Hardy is also present in the Intrepid kernel.

Changed in linux:
status: Fix Committed → Fix Released
Changed in linux:
importance: Unknown → Medium