2007-01-02 21:37:47 |
TJ |
bug |
|
|
added bug |
2007-01-03 20:00:56 |
TJ |
description |
I appear to have stumbled upon a bug in the kernel that can, in certain circumstances, both cause the kernel-boot to get stuck in an endless loop, and possibly damage the IDE drives over time (based on experience).
Using Edgy Eft Desktop Live CD, preparing to install to an existing Windows system. This probably occurs during an installed system-boot too, but I've not got that far as yet.
Scenario:
PC with a Promise FastTrak TX2000 SoftRAID controller and 4x 60GB IDE parallel ATA drives configured as RAID 10 (Mirror + Stripe) to provide one logical 120GB drive.
The PC already has Windows 2003 Server installed and booting from the RAID 10, with 2 NTFS partitions.
I wanted to shrink the 2nd partition to make room to install Ubuntu 6.10 from the Live CD.
See my Ubuntu forums article for a detailed explanation of my experience:
http://www.ubuntuforums.org/showthread.php?p=1958918
Bug:
When booting Edgy from the CD, the kernel loads the Promise FastTrak controller module "pdc202xx" and then probes each of the connected IDE hard drives (for a partition table?). dmraid is not loaded at this point, so it is not handling the logical drive.
Large drives use LBA addressing to overcome the CHS limitations of partition tables.
If the probe finds a partition table on any drive, it then tries to seek to the starting sector of each partition (presumably to read its boot-sector system-id byte?), and also tries to seek into the last few sectors of the partition (looking for a superblock?).
On a RAID 0 array where the striping causes the partition table to represent a larger logical drive, the starting and ending sector numbers of some partitions are beyond the end of the physical drive the partition table is written on.
This causes the Disk Read Errors reported here.
The fix would be for the probe to compare the physical number of cylinders reported by the drive (as seen by e.g. fdisk /dev/hde or fdisk /dev/hdg) to the starting/ending sector numbers for the LBA device.
If entries in the partition table lie beyond the end of the physical disk, the probe should handle the situation gracefully (this could potentially be used as a cue to auto-load dmraid).
Once dmraid is loaded "fdisk /dev/mapper/raidarrayname" shows the correct total number of logical sectors.
-------- Short extract of repetitive disk errors - usually there are hundreds or thousands ------
PDC202XX: Primary channel reset.
ide2: reset: success
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
end_request: I/O error, dev hde, sector 238276076
printk: 8 messages suppressed.
Buffer I/O error on device hde2, logical block 47279294
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error } |
I appear to have stumbled upon a bug in the kernel that can, in certain circumstances, both cause the kernel-boot to get stuck in an endless loop, and possibly damage the IDE drives over time (based on experience).
Using Edgy Eft Desktop Live CD, preparing to install to an existing Windows system. This probably occurs during an installed system-boot too, but I've not got that far as yet.
Scenario:
PC with a Promise FastTrak TX2000 SoftRAID controller and 4x 60GB IDE parallel ATA drives configured as RAID 1+0 (Mirror + Stripe) to provide one logical 120GB drive.
The PC already has Windows 2003 Server installed and booting from the RAID 1+0, with 2 NTFS partitions.
I wanted to shrink the 2nd partition to make room to install Ubuntu 6.10 from the Live CD.
See my Ubuntu forums article for a detailed explanation of my experience:
http://www.ubuntuforums.org/showthread.php?p=1958918
Bug:
When booting Edgy from the CD, the kernel loads the Promise FastTrak controller module "pdc202xx" and then probes each of the connected IDE hard drives (for a partition table?). dmraid is not loaded at this point, so it is not handling the logical drive.
The RAID 1+0 120GB logical drive consists of hde+hdf mirrored to hdg+hdh, with the partition table on hde and hdg.
Large drives use LBA addressing to overcome the CHS limitations of partition tables.
If the probe finds a partition table on any drive, it then tries to seek to the starting sector of each partition (presumably to read its boot-sector system-id byte?), and also tries to seek into the last few sectors of the partition (looking for a superblock?).
On a RAID 0 array where the striping causes the partition table to represent a larger logical drive, the starting and ending sector numbers of some partitions are beyond the end of the physical drive the partition table is written on.
This causes the Disk Read Errors reported here.
The fix would be for the probe to compare the physical number of cylinders reported by the drive (as seen by e.g. fdisk /dev/hde or fdisk /dev/hdg) to the starting/ending sector numbers for the LBA device.
If entries in the partition table lie beyond the end of the physical disk, the probe should handle the situation gracefully (this could potentially be used as a cue to auto-load dmraid).
Once dmraid is loaded "fdisk /dev/mapper/raidarrayname" shows the correct total number of logical sectors.
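The check proposed above can be illustrated outside the kernel. The following is only a rough user-space sketch in Python, not the attached patch: it parses the four primary entries of an MBR sector and reports those whose LBA range runs past the physical capacity. The function names, the `make_mbr` helper and the sample numbers are hypothetical, and 512-byte sectors are assumed.

```python
import struct

SECTOR_SIZE = 512   # assumed sector size in bytes
TABLE_OFFSET = 446  # offset of the first partition entry in an MBR
ENTRY_SIZE = 16     # each primary partition entry is 16 bytes


def parse_mbr_partitions(mbr):
    """Return (number, type, start_lba, sector_count) for each used primary entry."""
    if len(mbr) < SECTOR_SIZE or mbr[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    entries = []
    for i in range(4):
        offset = TABLE_OFFSET + ENTRY_SIZE * i
        ptype = mbr[offset + 4]
        # Start LBA and sector count are little-endian 32-bit values at +8 and +12.
        start_lba, count = struct.unpack_from("<II", mbr, offset + 8)
        if ptype != 0 and count != 0:
            entries.append((i + 1, ptype, start_lba, count))
    return entries


def out_of_range_partitions(mbr, capacity_sectors):
    """List partition numbers whose start or end lies beyond the physical disk."""
    bad = []
    for number, _ptype, start, count in parse_mbr_partitions(mbr):
        if start >= capacity_sectors or start + count > capacity_sectors:
            bad.append(number)
    return bad


def make_mbr(parts):
    """Hypothetical helper: build a test MBR from (type, start_lba, count) tuples."""
    mbr = bytearray(SECTOR_SIZE)
    for i, (ptype, start, count) in enumerate(parts):
        struct.pack_into("<B3sB3sII", mbr, TABLE_OFFSET + ENTRY_SIZE * i,
                         0, b"\0\0\0", ptype, b"\0\0\0", start, count)
    mbr[510:512] = b"\x55\xaa"
    return bytes(mbr)


# Illustrative numbers only: a nominal 60 GB drive holding a table written
# for a striped logical drive of roughly twice that size.
physical_sectors = 60_000_000_000 // SECTOR_SIZE
sample = make_mbr([(0x07, 63, 100_000_000), (0x07, 100_000_063, 134_000_000)])
print(out_of_range_partitions(sample, physical_sectors))
```

For scale, the failing sector 238276076 in the log below corresponds to 238276076 x 512 = 121,997,350,912 bytes, about 122 GB, which is well beyond any single 60GB member drive (again assuming 512-byte sectors).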
-------- Short extract of repetitive disk errors - usually there are hundreds or thousands ------
PDC202XX: Primary channel reset.
ide2: reset: success
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
end_request: I/O error, dev hde, sector 238276076
printk: 8 messages suppressed.
Buffer I/O error on device hde2, logical block 47279294
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
hde: task_in_intr: error=0x04 { DriveStatusError }
ide: failed opcode was: unknown
hde: task_in_intr: status=0x59 { DriveReady SeekComplete DataRequest Error } |
|
2007-01-25 17:41:43 |
TJ |
None: statusexplanation |
|
Assigned to more appropriate package |
|
2007-01-31 17:47:40 |
TJ |
bug |
|
|
assigned to linux (upstream) |
2007-01-31 17:52:46 |
TJ |
title |
Disk Read Errors during boot-time probe of physical softRAID drives |
Disk Read Errors during boot-time caused by probe of invalid partitions |
|
2007-01-31 20:24:58 |
TJ |
bug |
|
|
added attachment 'msdos.c.tj.patch' (Patch for fs/partitions/msdos.c) |
2007-01-31 20:29:29 |
TJ |
bug |
|
|
assigned to linux-source-2.6.17 (Debian) |
2007-01-31 21:36:37 |
TJ |
bug |
|
|
added attachment 'msdos.c.tj.2.patch' (Updated patch for fs/partitions/msdos.c) |
2007-01-31 21:38:37 |
TJ |
linux-source-2.6.17: status |
Unconfirmed |
In Progress |
|
2007-01-31 21:38:37 |
TJ |
linux-source-2.6.17: assignee |
|
intuitive-nipple |
|
2007-01-31 21:38:37 |
TJ |
linux-source-2.6.17: statusexplanation |
Assigned to more appropriate package |
Updated status to "In Progress" to reflect the availability of a universal patch for testing. Needs to be tested in systems that don't have this issue to ensure it doesn't cause any regressions. |
|
2007-01-31 23:04:13 |
TJ |
linux: status |
Unconfirmed |
In Progress |
|
2007-02-01 02:00:13 |
TJ |
bug |
|
|
added attachment 'msdos.c.tj.7.patch' (Patch revision 3) |
2007-03-26 16:21:10 |
Tormod Volden |
linux-source-2.6.17: statusexplanation |
Updated status to "In Progress" to reflect the availability of a universal patch for testing. Needs to be tested in systems that don't have this issue to ensure it doesn't cause any regressions. |
|
|
2007-07-25 20:05:21 |
TJ |
linux-source-2.6.20: status |
In Progress |
Fix Released |
|
2007-07-25 20:06:02 |
TJ |
linux: status |
In Progress |
Fix Released |
|
2009-02-15 23:25:40 |
TJ |
linux-source-2.6.20: status |
Fix Released |
Confirmed |
|
2009-02-15 23:25:40 |
TJ |
linux-source-2.6.20: assignee |
intuitivenipple |
|
|
2009-02-18 19:35:01 |
TJ |
linux: status |
Fix Released |
Unknown |
|
2009-02-18 19:35:01 |
TJ |
linux: importance |
Undecided |
Unknown |
|
2009-02-18 19:35:01 |
TJ |
linux: statusexplanation |
Fix applied to Andrew Morton's -mm tree in January 2007 |
|
|
2009-02-18 19:36:10 |
Bug Watch Updater |
linux: status |
Unknown |
Confirmed |
|
2009-02-18 19:37:45 |
TJ |
bug |
|
|
assigned to linux (Ubuntu) |
2009-02-18 19:48:10 |
TJ |
linux: status |
New |
Confirmed |
|
2009-02-18 19:48:10 |
TJ |
linux: assignee |
|
intuitivenipple |
|
2009-02-18 19:48:10 |
TJ |
linux: statusexplanation |
|
Confirmed as still affecting Jaunty by report in bug #329880.
It appears Linus Torvalds rejected my patch when it was pushed from Andrew Morton's -mm tree to mainline in May 2007:
-----------------------------
From: akpm@linux-foundation.org
To: linux@tjworld.net, mm-commits@vger.kernel.org
Subject: - filesystem-disk-errors-at-boot-time-caused-by-probe.patch removed from -mm tree
Date: Tue, 08 May 2007 19:34:23 -0700 (Wed, 03:34 BST)
The patch titled
filesystem: Disk Errors at boot-time caused by probe of partitions
has been removed from the -mm tree. Its filename was
filesystem-disk-errors-at-boot-time-caused-by-probe.patch
This patch was dropped because it was nacked
-----------------------------
From: Linus Torvalds <torvalds@linux-foundation.org>
To: akpm@linux-foundation.org
Cc: linux@tjworld.net, bunk@stusta.de, Jens Axboe <jens.axboe@oracle.com>
Subject: Re: [patch 012/455] filesystem: Disk Errors at boot-time caused by probe of partitions
Date: Tue, 8 May 2007 09:19:32 -0700 (PDT) (17:19 BST)
On Tue, 8 May 2007, akpm@linux-foundation.org wrote:
>
> From: TJ <linux@tjworld.net>
I don't really like these kinds of addresses. Who is TJ? When I google for
that name, I find a lot of hits, but all the links to tjworld.net are
down.
I also think the patch is wrong.
IIRC, we cannot trust the "capacity" data, because not all disks report it
correctly. If we did, we'd just do the check in read_dev_sector() instead.
So I'm dropping this. I might be wrong about the capacity thing, we may
have fixed it (Jens cc'd). But if the capacity is trustworthy, why not
just do the trivial check in read_dev_sector to protect against invalid
extended ones? And in add_partitions()?
Linus
-----------------------------
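The alternative raised in the reply above, a bounds check at the point where a sector is read rather than in the partition-table parser, might look roughly like the Python sketch below. It is not kernel code: `read_sector_checked` and `FakeDisk` are hypothetical names, and the sketch takes the reported capacity at face value, which is exactly the assumption the reply questions.

```python
class FakeDisk:
    """Hypothetical stand-in for a block device that records every read."""

    def __init__(self, capacity_sectors):
        self.capacity_sectors = capacity_sectors
        self.reads = []

    def read_sector(self, lba):
        self.reads.append(lba)
        return bytes(512)  # dummy sector contents


def read_sector_checked(disk, lba):
    """Return the sector data, or None if lba lies outside the reported capacity.

    Refusing the request here means no I/O is issued for an impossible
    sector, so the drive never sees the seek that produces the error loop.
    """
    if lba < 0 or lba >= disk.capacity_sectors:
        return None
    return disk.read_sector(lba)


disk = FakeDisk(capacity_sectors=117_187_500)  # assumed nominal 60 GB drive
print(read_sector_checked(disk, 238276076))    # sector number from the error log
print(disk.reads)
```

A guard at this level would cover every caller, including extended partition chains, whereas a check in the table parser only covers the entries it validates itself.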
|
|
2009-03-28 10:11:50 |
Chucky Ellison |
bug |
|
|
added attachment 'dmesg.txt' (dmesg.txt) |
2009-03-29 00:40:35 |
Chucky Ellison |
bug |
|
|
added attachment 'dmesg.2.6.29.txt' (dmesg.2.6.29.txt) |
2009-03-29 21:35:48 |
Chucky Ellison |
bug |
|
|
added attachment 'proc.partitions.2.6.29.txt' (proc.partitions.2.6.29.txt) |
2009-03-29 21:37:24 |
Chucky Ellison |
bug |
|
|
added attachment 'fdisk-l.2.6.29.txt' (fdisk-l.2.6.29.txt) |
2009-04-28 23:54:26 |
Leann Ogasawara |
linux-source-2.6.20 (Ubuntu): status |
Confirmed |
Won't Fix |
|
2009-07-10 19:38:18 |
kernel-janitor |
tags |
dmraid |
dmraid kj-comment |
|
2011-01-12 21:29:02 |
Jeremy Foshee |
linux (Ubuntu): assignee |
TJ (intuitivenipple) |
|
|
2011-01-19 10:32:17 |
Andy Whitcroft |
linux-source-2.6.17 (Debian): status |
New |
Fix Released |
|
2011-01-19 10:33:05 |
Andy Whitcroft |
linux (Ubuntu): status |
Confirmed |
Fix Released |
|
2011-02-03 17:20:39 |
Bug Watch Updater |
linux: importance |
Unknown |
High |
|
2011-09-16 16:40:48 |
Steve Conklin |
linux: importance |
High |
Undecided |
|
2011-09-16 16:40:48 |
Steve Conklin |
linux: status |
Confirmed |
New |
|
2011-09-16 16:40:48 |
Steve Conklin |
linux: remote watch |
Linux Kernel Bug Tracker #7912 |
|
|
2011-09-16 16:40:59 |
Steve Conklin |
linux: status |
New |
Fix Released |
|