Disk Read Errors during boot-time caused by probe of invalid partitions
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Linux | Fix Released | Undecided | Unassigned |
linux (Ubuntu) | Fix Released | Undecided | Unassigned |
linux-source-2.6.17 (Debian) | Fix Released | Undecided | Unassigned |
linux-source-2.6.20 (Ubuntu) | Won't Fix | Undecided | Unassigned |
Bug Description
I appear to have stumbled upon a bug in the kernel that can, in certain circumstances, both cause the kernel-boot to get stuck in an endless loop, and possibly damage the IDE drives over time (based on experience).
I am using the Edgy Eft Desktop Live CD, preparing to install onto a PC with an existing Windows system. This probably occurs when booting an installed system too, but I've not got that far yet.
Scenario:
PC with a Promise FastTrak TX2000 SoftRAID controller and 4x 60GB IDE parallel ATA drives configured as RAID 1+0 (Mirror + Stripe) to provide one logical 120GB drive.
The PC already has Windows 2003 Server installed and booting from the RAID 1+0, with 2 NTFS partitions.
I wanted to shrink the 2nd partition to make room to install Ubuntu 6.10 from the Live CD.
See my Ubuntu forums article for a detailed explanation of my experience:
http://
Bug:
When booting Edgy from the CD, the kernel loads the Promise FastTrak controller module "pdc202xx" and then probes each of the connected IDE hard drives (for a partition table?). dmraid is not loaded at this point, so it is not handling the logical drive.
The RAID 1+0 120GB logical drive consists of hde+hdf mirrored to hdg+hdh, with the partition table on hde and hdg.
Large drives use LBA addressing to overcome the CHS limitations of partition tables.
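To illustrate the CHS limitation, here is a minimal sketch that computes how many sectors a cylinder/head/sector geometry can address. The 1024/255/63 figure is the conventional limit for CHS fields in an MBR partition entry, the 512-byte sector size is an assumption, and the function name is made up for this example.

```python
SECTOR_BYTES = 512  # assumed sector size


def chs_capacity_sectors(cylinders, heads, sectors_per_track):
    """Return the number of sectors addressable with the given CHS geometry."""
    return cylinders * heads * sectors_per_track


# Conventional ceiling for the CHS fields of an MBR partition entry.
mbr_chs_limit = chs_capacity_sectors(1024, 255, 63)

# Geometry the kernel prints for these drives in the boot log (CHS=65535/16/63).
reported_geometry = chs_capacity_sectors(65535, 16, 63)

print(mbr_chs_limit, mbr_chs_limit * SECTOR_BYTES)
print(reported_geometry, reported_geometry * SECTOR_BYTES)
```

Both figures are smaller than the 120103200 sectors the kernel reports for hde, which is why the partition entries have to be interpreted through their LBA fields.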
If the probe finds a partition table on any drive, it then tries to seek to the starting sector of each partition (presumably to read its boot-sector system-id byte?), and also tries to seek into the last few sectors of the partition (looking for a superblock?).
On a RAID 0 array where the striping causes the partition table to represent a larger logical drive, the starting and ending sector numbers of some partitions are beyond the end of the physical drive the partition table is written on.
This causes the Disk Read Errors reported here.
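The numbers in the kernel log bear this out. The sketch below compares the sector from the "end_request: I/O error" line with the sector count the kernel reports for hde. The striped size is only a rough assumption (twice one member drive, ignoring any space the controller reserves), and the function name is hypothetical.

```python
PHYSICAL_SECTORS = 120103200  # "hde: 120103200 sectors" in the boot log
FAILED_SECTOR = 238276076     # "end_request: I/O error, dev hde, sector 238276076"

# Rough assumption: a two-drive stripe exposes about twice one member's sectors.
APPROX_STRIPED_SECTORS = 2 * PHYSICAL_SECTORS


def is_beyond_end(sector, total_sectors):
    """True if a zero-based sector number lies past the end of the device."""
    return sector >= total_sectors


print(is_beyond_end(FAILED_SECTOR, PHYSICAL_SECTORS))
print(is_beyond_end(FAILED_SECTOR, APPROX_STRIPED_SECTORS))
```

The failed sector lies past the end of the physical drive but inside the approximate size of the striped logical drive, which is consistent with a partition table that describes the array rather than the single disk.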
The fix would be for the probe to compare the physical number of cylinders reported by the drive (as seen by e.g. fdisk /dev/hde or fdisk /dev/hdg) to the starting/ending sector numbers for the LBA device.
If the entries in the partition table are beyond the end of the physical disk, the probe should handle the situation gracefully (this could potentially be used as a cue for auto-loading dmraid).
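The proposed check can be sketched in user space as follows. This is not kernel code and not the fix that was eventually released; it only illustrates comparing the LBA fields of each primary MBR entry with the physical sector count. All function names are hypothetical, and the partition values in the demo are invented for illustration (only the 120103200 sector count comes from the boot log).

```python
import struct

MBR_SIZE = 512
TABLE_OFFSET = 446  # start of the four 16-byte primary partition entries
ENTRY_SIZE = 16


def parse_mbr_partitions(mbr):
    """Return (number, start_lba, sector_count) for each non-empty primary entry."""
    if len(mbr) < MBR_SIZE or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    found = []
    for i in range(4):
        entry = mbr[TABLE_OFFSET + ENTRY_SIZE * i:TABLE_OFFSET + ENTRY_SIZE * (i + 1)]
        system_id = entry[4]
        # Bytes 8-11: starting LBA, bytes 12-15: sector count (little-endian).
        start_lba, count = struct.unpack_from("<II", entry, 8)
        if system_id != 0 and count != 0:
            found.append((i + 1, start_lba, count))
    return found


def out_of_range_partitions(mbr, disk_sectors):
    """List partitions whose last sector lies beyond the physical disk."""
    return [(n, start, count)
            for n, start, count in parse_mbr_partitions(mbr)
            if start + count > disk_sectors]


def build_test_mbr(partitions):
    """Build a synthetic MBR from (system_id, start_lba, sector_count) tuples."""
    mbr = bytearray(MBR_SIZE)
    for i, (system_id, start, count) in enumerate(partitions):
        struct.pack_into("<B3sB3sII", mbr, TABLE_OFFSET + ENTRY_SIZE * i,
                         0, b"\x00\x00\x00", system_id, b"\x00\x00\x00",
                         start, count)
    mbr[510:512] = b"\x55\xaa"
    return bytes(mbr)


# Invented layout: two NTFS (0x07) partitions sized for a striped logical drive.
demo_mbr = build_test_mbr([(0x07, 63, 100000000), (0x07, 100000063, 138000000)])
suspect = out_of_range_partitions(demo_mbr, 120103200)
print(suspect)
```

A non-empty result would be the cue to skip probing those partitions on the raw disk instead of issuing reads that the drive will reject.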
Once dmraid is loaded "fdisk /dev/mapper/
-------- Short extract of repetitive disk errors - usually there are hundred or thousands ------
PDC202XX: Primary channel reset.
ide2: reset: success
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
end_request: I/O error, dev hde, sector 238276076
printk: 8 messages suppressed.
Buffer I/O error on device hde2, logical block 47279294
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
description: | updated |
Changed in linux-source-2.6.20: | |
assignee: | intuitivenipple → nobody |
status: | Fix Released → Confirmed |
Changed in linux: | |
importance: | Undecided → Unknown |
status: | Fix Released → Unknown |
Changed in linux: | |
status: | Unknown → Confirmed |
Changed in linux (Ubuntu): | |
assignee: | TJ (intuitivenipple) → nobody |
Changed in linux: | |
importance: | Unknown → High |
Changed in linux: | |
importance: | High → Undecided |
status: | Confirmed → New |
status: | New → Fix Released |
I've been slowly working to isolate the source code at the root of this error. The biggest problem I faced was that the sheer number of errors reported at boot time swamped the kernel's log buffer (128KB by default), so I had no information about what was happening in the lead-up to them.
Yesterday I built a new kernel with the kernel log buffer size increased from 128KB to 1MB.
This is with Edgy Eft versions 2.6.17-10-generic and 2.6.17.14.
Using make menuconfig, I changed the Kernel Log Buffer size under Kernel Hacking, altering the log-buffer-shift parameter from 17 to 20. This is a bit-shift value, so the buffer size is 2^X bytes, where X is the shift.
The entry in the kernel configuration file .config is CONFIG_LOG_BUF_SHIFT.
Now finally I have the kernel messages leading up to the bug:
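As a quick check of the arithmetic, this small sketch converts the shift value into a buffer size. The function name is made up for the example; the shift values 17 and 20 are the ones quoted above.

```python
def log_buffer_bytes(shift):
    """Kernel log buffer size in bytes for a given log-buffer shift (2**shift)."""
    return 1 << shift


for shift in (17, 20):
    size = log_buffer_bytes(shift)
    print(f"shift={shift}: {size} bytes ({size // 1024} KiB)")
```

A shift of 17 gives the default 128KB buffer and a shift of 20 gives 1MB, eight times larger.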
[17179576.612000] AMD7441: IDE controller at PCI slot 0000:00:07.1 0xd407, 0xd002 on irq 169
[17179576.612000] AMD7441: chipset revision 4
[17179576.612000] AMD7441: not 100% native mode: will probe irqs later
[17179576.612000] AMD7441: 0000:00:07.1 (rev 04) UDMA100 controller
[17179576.612000] ide0: BM-DMA at 0xd800-0xd807, BIOS settings: hda:DMA, hdb:DMA
[17179576.612000] ide1: BM-DMA at 0xd808-0xd80f, BIOS settings: hdc:DMA, hdd:DMA
[17179576.612000] Probing IDE interface ide0...
[17179576.900000] hda: Maxtor 6Y120L0, ATA DISK drive
[17179577.180000] hdb: Maxtor 6Y060L0, ATA DISK drive
[17179577.236000] ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
[17179577.236000] Probing IDE interface ide1...
[17179577.972000] hdc: PIONEER DVD-RW DVR-109, ATAPI CD/DVD-ROM drive
[17179578.756000] hdd: PIONEER DVD-RW DVR-103, ATAPI CD/DVD-ROM drive
[17179578.812000] ide1 at 0x170-0x177,0x376 on irq 15
[17179578.824000] hda: max request size: 128KiB
[17179578.832000] hda: 240121728 sectors (122942 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
[17179578.836000] hda: cache flushes supported
[17179578.836000] hda: hda1 hda2 hda3 < hda5 hda6 hda7 hda8 >
[17179578.884000] hdb: max request size: 128KiB
[17179578.884000] hdb: 120103200 sectors (61492 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(100)
[17179578.884000] hdb: cache flushes supported
[17179578.884000] hdb: hdb1
[17179578.888000] hdc: ATAPI 40X DVD-ROM DVD-R CD-R/RW drive, 2000kB Cache, UDMA(66)
[17179578.888000] Uniform CD-ROM driver Revision: 3.20
[17179578.912000] hdd: ATAPI 24X DVD-ROM DVD-R CD-R/RW drive, 2000kB Cache, DMA
[17179579.280000] PDC20271: IDE controller at PCI slot 0000:00:08.0
[17179579.280000] ACPI: PCI Interrupt 0000:00:08.0[A] -> GSI 20 (level, low) -> IRQ 169
[17179579.280000] PDC20271: chipset revision 2
[17179579.280000] PDC20271: ROM enabled at 0x88000000
[17179579.280000] PDC20271: 100% native mode on irq 169
[17179579.280000] ide2: BM-DMA at 0xb000-0xb007, BIOS settings: hde:pio, hdf:pio
[17179579.280000] ide3: BM-DMA at 0xb008-0xb00f, BIOS settings: hdg:pio, hdh:pio
[17179579.280000] Probing IDE interface ide2...
[17179579.572000] hde: Maxtor 6Y060L0, ATA DISK drive
[17179579.852000] hdf: Maxtor 6Y060L0, ATA DISK drive
[17179579.908000] ide2 at 0xd400-
[17179579.908000] hde: max request size: 128KiB
[17179579.924000] hde: 120103200 sectors (61492 MB) w/2048KiB Cache, CHS=65535/16/63, UDMA(133)
[17179579.924...