Ubuntu
linux package

Bug #620074
Comment #434

Comment 434 for bug 620074

Revision history for this message

In Linux Kernel Bug Tracker #12309, frankrq2009 (frankrq2009-linux-kernel-bugs) wrote on 2010-03-31:

#434

Gentlemen,
I have suffered the high iowait problem for almost 4 years, and I have been watching the bug report(Bug 12309) on bugzilla.kernel.org for 1 year,
and yesterday I finally managed to get out of this trouble by switching from CentOS 5.4(with kernel 2.6.18) to zenwalk 6.2(with a snapshot kernel 2.6.32.2).
The computer is used to collect signal data from 4 gas turbines in a power plant. The project started from 2004,and we used mandrake 9 and zenwalk, both are 2.4.x kernel,and there was no high iowait problems. Since 2006 we switched to fedore 6(kernel 2.6.18) and then centos 5, and the iowait began to make trouble, the system's response of mouse and keyboard became very slow, new applications needed a long time to be launched. During these years, I always thought the main reason of this was because the computer's hardware was not good enough. But early this month, the plant has upgraded the computer to a new Lenovo server with two Xeon E5504 CPUs(total 8 cores), and 4GB memory,but the iowait is still very very high, the following is the output of "top" command on that machine:

Tasks: 215 total, 1 running, 213 sleeping, 0 stopped, 1 zombie
Cpu0 : 1.0%us, 0.3%sy, 0.0%ni, 65.9%id, 32.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 1.0%us, 3.6%sy, 0.0%ni, 45.0%id, 50.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 1.0%us, 4.0%sy, 0.0%ni, 94.7%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Cpu3 : 1.3%us, 3.3%sy, 0.0%ni, 56.3%id, 38.3%wa, 0.0%hi, 0.7%si, 0.0%st
Cpu4 : 1.3%us, 6.7%sy, 0.0%ni, 0.0%id, 89.7%wa, 0.7%hi, 1.7%si, 0.0%st
Cpu5 : 0.3%us, 3.3%sy, 0.0%ni, 91.7%id, 0.0%wa, 0.7%hi, 4.0%si, 0.0%st
Cpu6 : 10.3%us, 30.2%sy, 0.0%ni, 50.2%id, 2.3%wa, 1.0%hi, 6.0%si, 0.0%st
Cpu7 : 1.3%us, 8.6%sy, 0.0%ni, 83.1%id, 4.0%wa, 1.0%hi, 2.0%si, 0.0%st
Mem: 4078540k total, 3872720k used, 205820k free, 182344k buffers
Swap: 4192956k total, 0k used, 4192956k free, 2815596k cached

PID USER 3841 markv 8573 markv 4067 markv 3548 mysql 27042 markv 15 3810 root 1 root 2 root 3 root 4 root 5 root 6 root 7 root 8 root 9 root 10 root 11 root 12 root 13 root 14 root 15 root 16 root 17 root 18 root [markv@markgt ~]$ PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15 0 72172 12m 8380 S 42.2 0.3 1984:24 lvinf
15 0 60232 12m 8876 S 11.6 0.3 0:17.22 mark
15 0 19056 3224 2336 S 10.6 0.1 759:52.00 dms
21 0 656m 617m 9292 S 9.0 15.5 764:42.05 mysqld
0 69404 12m 8756 S 4.3 0.3 290:36.14 walin
15 0 39772 15m 8224 S 1.3 0.4 3:59.76 Xorg
15 0 2068 620 532 S 0.0 0.0 0:01.19 init
RT -5 0 0 0 S 0.0 0.0 0:00.04 migration/0
34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/0
RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
RT -5 0 0 0 S 0.0 0.0 0:00.02 migration/1
34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
RT -5 0 0 0 S 0.0 0.0 0:00.01 migration/2
34 19 0 0 0 S 0.0 0.0 0:00.01 ksoftirqd/2
RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/3
34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/3
RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/3
RT -5 0 0 0 S 0.0 0.0 0:00.09 migration/4
34 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/4
RT -5 0 0 0 S 0.0 0.0 0:00.00 watchdog/4
RT -5 0 0 0 S 0.0 0.0 0:00.03 migration/5
36 19 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/5

All our application's job is to insert 16 records per second(every record is a fixed 12 bytes in 3 fields) into a mysql database, the storage is a LVM consisted of two 750GB seagate SATA 7200RPM disks. I am sure this high iowait was not caused by other things like network cards or video card, because I have experimented to comment out only the mysql inserting lines from our source code, and the system iowait would drop to 0, GUI would become very responsive.
It also has nothing to do with the io scheduler,because I had tested deadline and noop on CentOS 5.4 and iowait could not be reduced. I also tried to enlarge the /sys/block/sda/queue/nr_requests, and it does not work.
I got information from this bugzilla report that kernel 2.6.32 has fixed this high iowait problem, and I tested the snapshot kernel 2.6.32.2 of zenwalk on my notebook, and found the high iowait is gone, so yesterday I installed the zenwalk 6.2 with the 2.6.32.2 kernel on that server, although the kernel only detected/used one Xeon CPU and 2GB memory, the iowait is very low and the whole system became very fast, only several seconds iowait would reach to 30%-40%, and then dropped back to 0 very soon.
By the way, the io scheduler is cfq. The following is "top" output of it:

Tasks: 157 total, 2 running, 155 sleeping, 0 stopped, 0 zombie
Cpu0 : 12.3%us, 7.8%sy, 0.0%ni, 77.3%id, 0.0%wa, 0.0%hi, 2.6%si, 0.0%st
Cpu1 : 11.3%us, 8.4%sy, 0.0%ni, 76.1%id, 0.0%wa, 0.0%hi, 4.2%si, 0.0%st
Cpu2 : 5.2%us, 7.2%sy, 0.0%ni, 84.0%id, 0.0%wa, 0.3%hi, 3.3%si, 0.0%st
Cpu3 : 8.1%us, 7.7%sy, 0.0%ni, 81.3%id, 0.0%wa, 0.0%hi, 2.9%si, 0.0%st
Mem: 2272368k total, 1153508k used, 1118860k free, 79384k buffers
Swap: 4192956k total, 0k used, 4192956k free, 797568k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3085 mysql 40 0 453m 68m 4864 S 35 3.1 12:25.12 mysqld
3203 markv 40 0 77852 17m 11m S 24 0.8 7:53.75 mark
2684 root 40 0 16440 2896 2144 S 9 0.1 11:03.99 dms
3879 markv 40 0 42256 12m 9336 S 4 0.6 1:43.26 walin
1520 root 40 0 4156 1232 972 S 0 0.1 0:00.06 ntpd
3235 root 40 0 64164 29m 9m S 0 1.3 0:50.78 X
3885 markv 40 0 2452 1180 892 R 0 0.1 0:02.08 top
   1 root 40 0 804 332 292 S 0 0.0 0:00.90 init
   2 root 40 0 0 0 0 S 0 0.0 0:00.00 kthreadd
   3 root RT 0 0 0 0 S 0 0.0 0:00.00 migration/0
   4 root 20 0 0 0 0 S 0 0.0 0:00.05 ksoftirqd/0
        We didn't change the storage, it is still that two seagate disks LVM with 4 years turbine data in them.
I found in kernel 2.6.32.8 the high iowait is back. How do I know
that? When I copy a 700MB avi file from my notebook disk to a 3.5" usb
mobile disk, I found the reading side disk LED start to falsh quickly and immediately, but the writing side disk LED will keep still for a long time(like 25-30 seconds), and then start to flash slowly,and the course is abnormally long and low responsive. The kernel 2.6.32.2 is the only 2.6 kernel(since 2.6.18) on which I found both of the reading and writing side disk LED will start to falsh quickly and immediately.There must be somthing wrong with the write cache behavior which will cause the high iowait, and it has been fixed in 2.6.32.2 and brought back in 2.6.32.8.
So by copying a big file to an usb disk and watching the disk LED, it can be used as a method for the kernel developers to reproduce and observe this bug. I hope this may be helpful.
        I noticed Mr. Morton said this is not the best place to discuss, but the <email address hidden> rejected two of my emails from two different email account. So I mailed and cc-ed the email, and also post it here, to make more people can share my experience.

Best regards,
Frank Ren

Gentlemen,
	I have suffered the high iowait problem for almost 4 years, and I have been watching the bug report(Bug 12309) on bugzilla.kernel.org for 1 year,
and yesterday I finally managed to get out of this trouble by switching from CentOS 5.4(with kernel 2.6.18) to zenwalk 6.2(with a snapshot kernel 2.6.32.2). 
	The computer is used to collect signal data from 4 gas turbines in a power plant. The project started from 2004,and we used mandrake 9 and zenwalk, both are 2.4.x kernel,and there was no high iowait problems. Since 2006 we switched to fedore 6(kernel 2.6.18) and then centos 5, and the iowait began to make trouble, the system's response of mouse and keyboard became very slow, new applications needed a long time to be launched. During these years, I always thought the main reason of this was because the computer's hardware was not good enough. But early this month, the plant has upgraded the computer to a new Lenovo server with two Xeon E5504 CPUs(total 8 cores), and 4GB memory,but the iowait is still very very high, the following is the output of "top" command on that machine:

Tasks: 215 total,   1 running, 213 sleeping,   0 stopped,   1 zombie
Cpu0  :  1.0%us,  0.3%sy,  0.0%ni, 65.9%id, 32.8%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  1.0%us,  3.6%sy,  0.0%ni, 45.0%id, 50.3%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  1.0%us,  4.0%sy,  0.0%ni, 94.7%id,  0.0%wa,  0.0%hi,  0.3%si,  0.0%st
Cpu3  :  1.3%us,  3.3%sy,  0.0%ni, 56.3%id, 38.3%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu4  :  1.3%us,  6.7%sy,  0.0%ni,  0.0%id, 89.7%wa,  0.7%hi,  1.7%si,  0.0%st
Cpu5  :  0.3%us,  3.3%sy,  0.0%ni, 91.7%id,  0.0%wa,  0.7%hi,  4.0%si,  0.0%st
Cpu6  : 10.3%us, 30.2%sy,  0.0%ni, 50.2%id,  2.3%wa,  1.0%hi,  6.0%si,  0.0%st
Cpu7  :  1.3%us,  8.6%sy,  0.0%ni, 83.1%id,  4.0%wa,  1.0%hi,  2.0%si,  0.0%st
Mem:   4078540k total,  3872720k used,   205820k free,   182344k buffers
Swap:  4192956k total,        0k used,  4192956k free,  2815596k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                          
 3841 markv     15   0 72172  12m 8380 S 42.2  0.3   1984:24 lvinf                            
 8573 markv     15   0 60232  12m 8876 S 11.6  0.3   0:17.22 mark                             
 4067 markv     15   0 19056 3224 2336 S 10.6  0.1 759:52.00 dms                              
 3548 mysql     21   0  656m 617m 9292 S  9.0 15.5 764:42.05 mysqld                           
27042 markv     15   0 69404  12m 8756 S  4.3  0.3 290:36.14 walin                            
 3810 root      15   0 39772  15m 8224 S  1.3  0.4   3:59.76 Xorg                             
    1 root      15   0  2068  620  532 S  0.0  0.0   0:01.19 init                             
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:00.04 migration/0                      
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/0                      
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0                       
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:00.02 migration/1                      
    6 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1                      
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1                       
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:00.01 migration/2                      
    9 root      34  19     0    0    0 S  0.0  0.0   0:00.01 ksoftirqd/2                      
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2                       
   11 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 migration/3                      
   12 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3                      
   13 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/3                       
   14 root      RT  -5     0    0    0 S  0.0  0.0   0:00.09 migration/4                      
   15 root      34  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/4                      
   16 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/4                       
   17 root      RT  -5     0    0    0 S  0.0  0.0   0:00.03 migration/5                      
   18 root      36  19     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/5                      
[markv@markgt ~]$

All our application's job is to insert 16 records per second(every record is a fixed 12 bytes in 3 fields) into a mysql database, the storage is a LVM consisted of two 750GB seagate SATA 7200RPM disks. I am sure this high iowait was not caused by other things like network cards or video card, because I have experimented to comment out only the mysql inserting lines from our source code, and the system iowait would drop to 0, GUI would become very responsive.
    It also has nothing to do with the io scheduler,because I had tested deadline and noop on CentOS 5.4 and iowait could not be reduced. I also tried to enlarge the /sys/block/sda/queue/nr_requests, and it does not work.      
	I got information from this bugzilla report that kernel 2.6.32 has fixed this high iowait problem, and I tested the snapshot kernel 2.6.32.2 of zenwalk on my notebook, and found the high iowait is gone, so yesterday I installed the zenwalk 6.2 with the 2.6.32.2 kernel on that server, although the kernel only detected/used one Xeon CPU and 2GB memory, the iowait is very low and the whole system became very fast, only several seconds iowait would reach to 30%-40%, and then dropped back to 0 very soon. 
By the way, the io scheduler is cfq. The following is "top" output of it:

Tasks: 157 total,   2 running, 155 sleeping,   0 stopped,   0 zombie
Cpu0  : 12.3%us,  7.8%sy,  0.0%ni, 77.3%id,  0.0%wa,  0.0%hi,  2.6%si,  0.0%st
Cpu1  : 11.3%us,  8.4%sy,  0.0%ni, 76.1%id,  0.0%wa,  0.0%hi,  4.2%si,  0.0%st
Cpu2  :  5.2%us,  7.2%sy,  0.0%ni, 84.0%id,  0.0%wa,  0.3%hi,  3.3%si,  0.0%st
Cpu3  :  8.1%us,  7.7%sy,  0.0%ni, 81.3%id,  0.0%wa,  0.0%hi,  2.9%si,  0.0%st
Mem:   2272368k total,  1153508k used,  1118860k free,    79384k buffers
Swap:  4192956k total,        0k used,  4192956k free,   797568k cached

PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                        
3085 mysql     40   0  453m  68m 4864 S   35  3.1  12:25.12 mysqld                         
3203 markv     40   0 77852  17m  11m S   24  0.8   7:53.75 mark                           
2684 root      40   0 16440 2896 2144 S    9  0.1  11:03.99 dms                            
3879 markv     40   0 42256  12m 9336 S    4  0.6   1:43.26 walin                          
1520 root      40   0  4156 1232  972 S    0  0.1   0:00.06 ntpd                           
3235 root      40   0 64164  29m   9m S    0  1.3   0:50.78 X                              
3885 markv     40   0  2452 1180  892 R    0  0.1   0:02.08 top                            
   1 root      40   0   804  332  292 S    0  0.0   0:00.90 init                           
   2 root      40   0     0    0    0 S    0  0.0   0:00.00 kthreadd                       
   3 root      RT   0     0    0    0 S    0  0.0   0:00.00 migration/0                    
   4 root      20   0     0    0    0 S    0  0.0   0:00.05 ksoftirqd/0                    
        We didn't change the storage, it is still that two seagate disks LVM with 4 years turbine data in them. 
	I found in kernel 2.6.32.8 the high iowait is back. How do I know
that? When I copy a 700MB avi file from my notebook disk to a 3.5" usb
mobile disk, I found the reading side disk LED start to falsh quickly and immediately, but the writing side disk LED will keep still for a long time(like 25-30 seconds), and then start to flash slowly,and the course is abnormally long and low responsive. The kernel 2.6.32.2 is the only 2.6 kernel(since 2.6.18) on which I found both of the reading and writing side disk LED will start to falsh quickly and immediately.There must be somthing wrong with the write cache behavior which will cause the high iowait, and it has been fixed in 2.6.32.2 and brought back in 2.6.32.8.
	So by copying a big file to an usb disk and watching the disk LED, it can be used as a method for the kernel developers to reproduce and observe this bug. I hope this may be helpful.
        I noticed Mr. Morton said this is not the best place to discuss, but the linux-kernel@vger.kernel.org rejected two of my emails from two different email account. So I mailed and cc-ed the email, and also post it here, to make more people can share my experience.
	
Best regards,
Frank Ren

Ubuntulinux package

Comment 434 for bug 620074

Ubuntu
linux package