Comment 566 for bug 620074

vesok (vesok-linux-kernel-bugs) wrote:

OK, the fun continues.

Installed the offending hard disk in another system, booted Fedora 14 live and the drive worked OK:
[root@localhost ~]# dd if=/dev/zero of=/dev/sd_ bs=1M count=4000 conv=fdatasync
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 50.0265 s, 83.8 MB/s

(Replaced /dev/sda with /dev/sd_ in case someone decides to copy/paste the command).

Then I booted Knoppix 5.1.1 (from 2007) and saw the fault. CPU usage showed 49.7%wa (dual CPU), and I had to interrupt dd because it was taking far too long. Then I tried again with a smaller file:

root@Knoppix:~# uname -a
Linux Knoppix 2.6.19 #7 SMP PREEMPT Sun Dec 17 22:01:07 CET 2006 i686 GNU/Linux
root@Knoppix:~# dd if=/dev/zero of=/dev/sd_ bs=1M count=40 conv=fdatasync
40+0 records in
40+0 records out
41943040 bytes (42 MB) copied, 20.8245 seconds, 2.0 MB/s

Then I booted Fedora again and saw the fault again:
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.35.6-45.fc14.i686 #1 SMP Mon Oct 18 23:56:17 UTC 2010 i686 i686 i386 GNU/Linux
[root@localhost ~]# dd if=/dev/zero of=/dev/sd_ bs=1M count=40 conv=fdatasync
40+0 records in
40+0 records out
41943040 bytes (42 MB) copied, 20.3055 s, 2.1 MB/s
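
For anyone comparing runs, the rates dd prints can be recomputed from the byte counts and elapsed times above. This is only a minimal sketch: throughput_mb is a hypothetical helper name, and it assumes dd's decimal units (1 MB = 1,000,000 bytes).

```shell
# Hypothetical helper: decimal MB/s from a byte count and elapsed seconds,
# using the same units dd reports (1 MB = 1,000,000 bytes).
throughput_mb() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes / secs / 1000000 }'
}

throughput_mb 4194304000 50.0265   # first Fedora run, drive behaving normally
throughput_mb 41943040 20.8245     # Knoppix run, fault present
throughput_mb 41943040 20.3055     # second Fedora run, fault present
```

The three figures come straight from the transcripts above, so the output should agree with what dd reported in each case.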

@ #548 From Zenith88:
Ignoring the possibility of a hardware fault when the evidence points that way surely brings those who do so a great deal of fruitless debugging and frustration.

@ #550 From D.M.:
I don't think it is the "partition starts at the wrong sector" issue. In the dd commands listed above I was writing to the drive as a whole, without touching partitions at all.
For the sake of it I decided to create a new partition and see what would happen:
[root@localhost ~]# fdisk -H 224 -S 56 /dev/sd_
Device contains neither a valid DOS partition table, nor Sun, SGI or OSF disklabel
Building a new DOS disklabel with disk identifier 0x9b81ad16.
Changes will remain in memory only, until you decide to write them.
After that, of course, the previous content won't be recoverable.

Warning: invalid flag 0x0000 of partition table 4 will be corrected by w(rite)

Command (m for help): n
Command action
   e extended
   p primary partition (1-4)
p
Partition number (1-4, default 1): 1
First sector (2048-2930275054, default 2048):
Using default value 2048
Last sector, +sectors or +size{K,M,G} (2048-2930275054, default 2930275054): +10G

Command (m for help): p

Disk /dev/sda: 1500.3 GB, 1500300828160 bytes
224 heads, 56 sectors/track, 233599 cylinders, total 2930275055 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x9b81ad16

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1            2048    20973567    10485760   83  Linux

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
[root@localhost ~]# mkfs.ext2 -q /dev/sda_
[root@localhost ~]# mount /dev/sda1 /mnt
[root@localhost ~]# dd if=/dev/zero of=/mnt/bigfile bs=1M count=100 conv=fdatasync
100+0 records in
100+0 records out
104857600 bytes (105 MB) copied, 77.3839 s, 1.4 MB/s

I guess the further drop (from about 2 MB/s on the raw device to 1.4 MB/s here) can be attributed to filesystem overhead.
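
On the alignment question, the start sector can be checked with simple arithmetic. The sketch below assumes the "wrong sector" issue refers to drives with 4096-byte physical sectors behind 512-byte logical ones. That is an assumption: fdisk above reports 512/512 for this disk. is_aligned is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if a partition's starting byte offset is a
# multiple of the physical sector size.
#   $1 = start sector
#   $2 = logical sector size in bytes (default 512)
#   $3 = physical sector size in bytes (default 4096, an ASSUMPTION;
#        fdisk reports 512 for the disk in this comment)
is_aligned() {
    start=$1
    logical=${2:-512}
    physical=${3:-4096}
    [ $(( (start * logical) % physical )) -eq 0 ]
}

for sector in 2048 63; do
    if is_aligned "$sector"; then
        echo "start sector $sector: aligned"
    else
        echo "start sector $sector: misaligned"
    fi
done
```

Sector 2048 is the default fdisk picked above; 63 is the traditional DOS-style start sector, included only for comparison.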

The issue you describe with writing out a large batch of dirty pages is a real one, but it is a different problem from the high iowait times.

I have seen high iowait times when the only active application was rtorrent running in seeding mode: no disk writes, but lots of reads scattered all over the disk, with total system memory smaller than the torrent.

Basically, when the drive's performance drops from 80 MB/s to 2 MB/s, the only thing the kernel does is wait for I/O operations to complete. I am not sure there is a solution for this problem at all.
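
For anyone who wants to watch the iowait share without top, here is a rough sketch that derives it from two samples of the aggregate "cpu" line in /proc/stat. iowait_pct is a hypothetical helper name, and the one-second interval is arbitrary.

```shell
# Hypothetical helper: percentage of CPU time spent in iowait between two
# samples of the aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal
# (guest fields, if present, are skipped because they are already counted
# in user/nice).
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total[NR] = 0
            for (i = 2; i <= NF && i <= 9; i++) total[NR] += $i
            iow[NR] = $6
        }
        END {
            dt = total[2] - total[1]
            if (dt > 0) printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt
            else print "0.0"
        }'
}

# Sample the live counters one second apart (Linux only).
if [ -r /proc/stat ]; then
    before=$(grep '^cpu ' /proc/stat)
    sleep 1
    after=$(grep '^cpu ' /proc/stat)
    iowait_pct "$before" "$after"
fi
```

On a dual-CPU box with one CPU stuck waiting on the disk, this should hover around 50, which is consistent with the 49.7%wa seen under Knoppix.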

The disk is still available, so I can run more tests if anyone is interested.