Very poor desktop response (high latency) during I/O-load with SATA+NCQ

Bug #343371 reported by Øyvind Stegard on 2009-03-15
This bug affects 12 people
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: linux-image-2.6.27-13-generic

I run Ubuntu on a moderately powerful quad-core x86-64 system and the desktop response is basically crippled whenever something is reading or writing large files as fast as it can (at normal priority).

As an example scenario:
1) Do: $ cat /path/to/LARGE_FILE > /dev/null
This will start a process reading a large file as fast as it can.

2) Everything else becomes completely unusable because of the I/O latency. Opening a new Firefox tab takes seconds, loading a web page takes ages, reading email with Evolution is extremely slow, etc. Basically anything other than the cat process that needs to read or write something to disk is slowed to a crawl (I/O-starved). Forget about starting new programs: they only appear after the process started in 1) has finished dumping the entire file to /dev/null. For example, I launched 'gedit' as soon as I had started 1), and it appeared only after cat exited.

3) As soon as the cat process has finished, everything gets back to normal and the system is fine and snappy again.

4) The throughput itself is good, but the problem is that it completely starves the rest of the system I/O-wise, using only Ubuntu default settings.
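
The scenario in steps 1) to 4) can be turned into a rough measurement. The Python sketch below is an illustration only and not part of the original report: it streams a large file in a background thread (standing in for the cat process) while timing small reads of another file in the foreground. The function names, the 4 KiB read size and the 1 MiB chunk size are assumptions. Files that are already in the page cache are served from memory, so meaningful numbers need uncached files.

```python
import threading
import time


def stream_file(path, stop, chunk_size=1 << 20):
    """Read `path` sequentially until EOF or until `stop` is set (like cat > /dev/null)."""
    with open(path, "rb", buffering=0) as f:
        while not stop.is_set() and f.read(chunk_size):
            pass


def time_small_reads(path, count=20, size=4096):
    """Return the wall-clock latency in seconds of `count` open+read calls on `path`."""
    latencies = []
    for _ in range(count):
        start = time.perf_counter()
        with open(path, "rb", buffering=0) as f:
            f.read(size)
        latencies.append(time.perf_counter() - start)
    return latencies


def probe_under_load(large_path, small_path, count=20):
    """Time small reads of `small_path` while `large_path` is streamed in the background."""
    stop = threading.Event()
    reader = threading.Thread(target=stream_file, args=(large_path, stop))
    reader.start()
    try:
        return time_small_reads(small_path, count)
    finally:
        stop.set()
        reader.join()
```

Comparing max(probe_under_load(...)) with max(time_small_reads(...)) on an idle system gives a crude indication of how much the background reader delays other I/O.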

It shouldn't be like this for a desktop-oriented distro on relatively powerful hardware (quad-core CPU, 64-bit, 8 GB RAM).

Disabling NCQ, by doing:
$ echo 1 > /sys/block/sda/device/queue_depth #(as per http://linux-ata.org/faq.html)

improves the situation quite a bit, but it is still painful, and not as fair as one would like when using "Completely Fair Queuing" I/O-scheduler under read-load. Write load is also painful, i.e. when copying large files from a different drive to the main SATA drive, or copy from drive to itself. Except for the issue at hand, there are no problems with hard drive performance or system stability in general.
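
For checking the setting before and after such an experiment, a minimal Python sketch is shown here. It assumes the same sysfs attribute as the echo command above (/sys/block/<device>/device/queue_depth). The helper names and the sysfs_root parameter are hypothetical; the parameter exists only so the functions can be tried against a scratch directory. Writing the real attribute requires root privileges.

```python
from pathlib import Path


def queue_depth_file(device, sysfs_root="/sys"):
    """Path of the queue_depth attribute for a block device such as 'sda'."""
    return Path(sysfs_root) / "block" / device / "device" / "queue_depth"


def read_queue_depth(device, sysfs_root="/sys"):
    """Return the current queue depth as an int (1 means NCQ is effectively off)."""
    return int(queue_depth_file(device, sysfs_root).read_text().strip())


def write_queue_depth(device, depth, sysfs_root="/sys"):
    """Write a new queue depth; on a real system this needs root privileges."""
    queue_depth_file(device, sysfs_root).write_text(f"{depth}\n")
```

Calling read_queue_depth("sda") before and after the change confirms whether the write actually took effect.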

System summary:
Ubuntu Intrepid x86-64 (using kernel 2.6.27-13)
Disk mount options at defaults, I/O-scheduler is default CFQ.
The main hard drive holds both root partition and /home-partition. All relevant
file systems are EXT3 with Ubuntu default mount options (ordered, relatime).
NCQ is enabled for this particular drive (libata reports depth 31/32 at boot).
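
The NCQ depth that libata reports at boot can be pulled out of a kernel log with a small parser. This Python sketch is illustrative: the sample line is made up to resemble a typical libata message and is not copied from the attached kernel log. My reading is that the first number is the depth the host side will use and the second is what the drive advertises; treat that interpretation as an assumption.

```python
import re

# Matches the "NCQ (depth A/B)" fragment that libata prints for NCQ-capable drives.
NCQ_DEPTH = re.compile(r"NCQ \(depth (\d+)/(\d+)\)")


def parse_ncq_depth(log_line):
    """Return (host_depth, device_depth) or None if the line has no NCQ depth."""
    match = NCQ_DEPTH.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Illustrative line only; it is not taken from the reporter's attached kernel log.
sample = "ata1.00: 976773168 sectors, multi 0: LBA48 NCQ (depth 31/32)"
print(parse_ncq_depth(sample))  # (31, 32)
```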

Hardware:
ASUS P5E WS Pro, Intel X38 Express / Intel ICH9R
Intel(R) Core(TM)2 Quad CPU Q9300 @ 2.50GHz
8 GB RAM
Main SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller (rev 02)
Main hard drive: Western Digital WD5000AACS-00ZUB0, SATA NCQ enabled.

I will attach:
1) Output of command 'lspci -vv', run as root
2) Kernel log from boot

Øyvind Stegard (oyvinst) wrote :

Thank you for reporting this problem. It helps us make Ubuntu better. Would you please follow this link (http://www.howtoforge.com/measuring-linux-latency-with-latencytop-on-ubuntu-8.10-and-debian-lenny) and give us the results? It could be very useful info to help us help you.

Changed in linux:
status: New → Confirmed
Øyvind Stegard (oyvinst) wrote :

Attaching some latencytop output while doing a large read ..

Excellent, thank you.

Øyvind Stegard wrote:
> Attaching some latencytop output while doing a large read ..
>
> ** Attachment added: "latencytop.log"
> http://launchpadlibrarian.net/23950515/latencytop.log
>
>

qwerty (escalantea) wrote :

I had a problem with a SATA disk under heavy I/O ... response time degraded when loading/starting new applications (... until the disk finally froze).

I solved my problem by fine-tuning the "pdflush" parameters...

Monitor the "Dirty" values in " cat /proc/meminfo" ... if the values remain high (... are not cleaned frequently enough) then the problem might be the "pdflush" configuration.
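
One way to watch those values over time is to parse /proc/meminfo programmatically. The following Python sketch is only an example: the helper names are hypothetical, and it relies on the "Dirty" and "Writeback" fields that /proc/meminfo exposes.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to their numeric values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def dirty_and_writeback(text):
    """Return (Dirty, Writeback) in kB from /proc/meminfo content."""
    info = parse_meminfo(text)
    return info.get("Dirty", 0), info.get("Writeback", 0)
```

On a live system the input would be the content of /proc/meminfo, sampled every second or so during the heavy write, to see whether "Dirty" keeps growing.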

If the problem is the "pdflush" then check this link :
https://bugs.launchpad.net/ubuntu/+bug/270794/comments/12

... I hope it helps.

I'll check it out and if that's the problem, I'll mark this bug a
duplicate of the other.

Øyvind Stegard (oyvinst) wrote :

@qwerty: Thanks for the hint, but I think it is not related. I don't have problems with the disk or controller under high load, only with the general latency of the system.

So, this is not a duplicate of bug 270794. The kernel version there is different (2.6.24), and like I stated, I've never had any stability problems with the SATA controller, nor the hard-drive. It never soft-resets itself and generally works very reliably. This is purely a latency problem, not anything going wrong, controller reset, kernel log errors, etc.

Øyvind Stegard (oyvinst) wrote :

Still a problem in Ubuntu Jaunty with all filesystems converted to EXT4. Copying 6-7 GB from one place on a partition to another place on the same partition (i.e. duplicating the data on the same hard drive) results in extreme I/O latencies, and things like starting up Firefox while the copy operation is running are totally impossible. It seems that latency on Linux suffers badly whenever there is heavy writing, and the I/O scheduler is not up to the job of providing an OK desktop experience.

Take a look at bug 381300!
Please boot with elevator=noop and check whether the problem persists...
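
Besides the elevator= boot parameter, the scheduler in use can be checked at runtime in /sys/block/<device>/queue/scheduler, where the active one is shown in square brackets. The Python sketch below parses that format; the function name is hypothetical and the sample string in the usage note is illustrative.

```python
import re


def parse_schedulers(text):
    """Return (active, available) from the content of a queue/scheduler file."""
    available = text.replace("[", " ").replace("]", " ").split()
    match = re.search(r"\[([^\]]+)\]", text)
    active = match.group(1) if match else None
    return active, available
```

For example, parse_schedulers("noop anticipatory deadline [cfq]") reports cfq as active. Writing a scheduler name to the same file as root switches it without a reboot.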

"echo 1 > queue_depth" was not enought in my case.

What do you see in queue_type when you do that?
"simple", "none" or something else?

I resolved it by adding my HD (WDC WD3200BEVT-22ZCT0) to the NCQ blacklist
in drivers/ata/libata-core.c, and now I can read very big files from the disk
without any problem...

After some testing I think 381300 is something
different and you can ignore it...

I have the same issue on my laptop.

Some information:

$ uname --all
Linux stephane-macbook 2.6.28-13-generic #44-Ubuntu SMP Tue Jun 2 07:57:31 UTC 2009 i686 GNU/Linux

Do you need any other information?

Is it an Ubuntu bug or a kernel bug? If the latter, maybe we can find this bug in the kernel Bugzilla?

Regards,
Stephane

Jim Lieb (lieb) wrote :

This bug may be a duplicate of #131094 even though I have not marked it as such (yet). See the comment at https://bugs.launchpad.net/ubuntu/+source/linux/+bug/131094/comments/235 and test your workload against it. Note that this has been a workload sensitive problem. You may have another problem further into the I/O subsystem but eliminating this issue should be the first step, i.e. if we eliminate the I/O from VM thrashing, NCQ etc. become less of an issue. If you have further problems as indicated in earlier comments, we can then pursue potential problems in the SATA layer. Thank you.

Changed in linux (Ubuntu):
assignee: nobody → Jim Lieb (lieb)
status: Confirmed → Incomplete
Øyvind Stegard (oyvinst) wrote :

Hi,

I've been running Ubuntu 9.10 for a while now (currently on kernel 2.6.31-17.54). The problem with high I/O latencies during write-load persists, even after migrating all partitions to the EXT4 file system. The hardware is still the same as when I reported this bug originally.

Again, to reiterate for Karmic, the issue typically appears when I run some process in the background that writes a lot to disk (typical for audio/video work, etc.), and then try to use my desktop for other things while this is going on. Firefox grinds to a halt the first time it tries to do some disk I/O, and I have to wait far too long before it becomes responsive again. This is a 64-bit quad-core system with 8 GB RAM, so there are plenty of resources. The symptoms are exactly the same as they were for Jaunty: all in all the disks and system are configured and running OK with good throughput, but the latency is just horrible at times.

Øyvind Stegard (oyvinst) wrote :

This is probably what I'm seeing: http://bugzilla.kernel.org/show_bug.cgi?id=12309
However, it looks like the most promising patch is only available from 2.6.32+.

Arrg! Same issue here coupled with the slow SATA performance bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/119730

Started happening for me with Hardy Heron about 1.5 years ago. Now running up-to-date Karmic with same issue.

Lenovo T61
1GB DDR2 667
500GB Seagate SATA2 ST9500420AS
2.6.31-16-generic SMP x86_64

As soon as my system swaps, the UI becomes unusable.
Trying other IO schedulers.

Laurent Duchesne (l-urent) wrote :

I can confirm I had the same problem with 2.6.31-18 (ICH7 chipset). After following these instructions to upgrade the kernel to 2.6.32, everything works fine:
http://www.ramoonus.nl/2009/12/03/linux-kernel-2-6-32-installation-guide-for-ubuntu-linux/

The mouse doesn't freeze anymore and apps are still responsive even when under heavy IO load.

Øyvind Stegard (oyvinst) wrote :

Thank you for the tip, I will try kernel 2.6.32.3 from the Ubuntu kernel team's mainline-PPA and report back.

Øyvind Stegard (oyvinst) wrote :

Kernel 2.6.32 helped the situation, definitely. But unfortunately I ran
into some other compatibility problems with both PPA mainline versions
and custom-compiled versions from upstream (missing AppArmor, somewhat
strange/altered fs-mounting behaviour at boot, readahead not working,
extremely slow hibernation). In the end I decided to stick to the
2.6.31-based kernels distributed with Ubuntu Karmic. I guess things will
get better when Lucid comes around..

Brian Murray (brian-murray) wrote :

Looking at the attachments in this bug report, I noticed that "lspci -vv" was flagged as a patch. A patch contains changes to an Ubuntu package that will resolve a bug. Since this was not one, I've unchecked the patch flag for it. In the future, keep in mind the definition of a patch. You can learn more about what qualifies as a patch at https://wiki.ubuntu.com/Bugs/Patches. Thanks!

Øyvind Stegard (oyvinst) wrote :

I'll be adding some test results for Ubuntu Lucid here, as soon as I can find some time.

Daniel Johansson (danjo133) wrote :

This bug affects me (10.10, 64bit) and has done so for a long time (at least 2 releases).

When I have many tabs open in Google Chrome (swapping), compile large projects, or update git repositories, all of which require disk I/O, the system becomes unusable: it may take 15-200 seconds for the mouse pointer to move from one end of the screen to the other, the system clock in the top corner (GNOME) slows to a standstill, all windows are greyed out, etc.

This happens on both my laptops:

laptop 1) Centrino Duo, 4 GB RAM, SSD with a bad controller (low IOPS, low performance), 2.6.32 kernel
laptop 2) Core i5, 3 GB RAM, normal hard drive, 2.6.37 kernel

Both laptops use LUKS to encrypt the whole root partition + swap.

Is there anything I can do to help here? (I'm a developer)

Cheers,
Daniel

Brad Figg (brad-figg) on 2012-02-10
Changed in linux (Ubuntu):
assignee: Jim Lieb (lieb) → nobody