lucid lynx going into disk activity frenzy, stalls for minutes

Bug #582264 reported by Romano Giannetti
This bug affects 9 people
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Expired  Undecided   Unassigned

Bug Description

Sometimes, roughly every 3-4 minutes or whenever I start a new program, under a relatively light load (Chromium browser, Thunderbird, one VirtualBox session), the computer goes into a frenzy of disk activity and grinds to a practical halt for the next minute or so. It becomes very unresponsive and triggers the "this page does not respond" warning from Chrome or OpenOffice... then it resumes.
I tried to track down the culprit without any luck. Reducing vm.swappiness from 60 to 10 did not help. The cumulative atop output for disk usage says:

    NPROCS  SYSCPU  USRCPU   VSIZE   RSIZE  RDDSK  WRDSK  RNET  SNET  MEM  CMD  1/4
         1   1.06s   0.08s    1.0G  836.0M    280    480     0     0  42%  VirtualBox
         8   0.07s   0.25s    1.2G  174.7M    232      0     0     0   9%  chrome
         1   0.03s   0.13s  246.6M  112.0M      0      0     0     0   6%  Xorg
         1   0.00s   0.02s  340.5M  74156K  15072      8     0     0   4%  thunderbird-bi
         1   0.00s   0.00s  93692K   9256K      0      0     0     0   0%  nautilus
         1   0.01s   0.01s  48844K   5616K   1184     24     0     0   0%  gnome-terminal
         1   0.00s   0.00s  41928K   5136K      0      0     0     0   0%  wnck-applet

and free:

(0)pern:~% free
             total       used       free     shared    buffers     cached
Mem:       2027004    1977144      49860          0       1184      50012
-/+ buffers/cache:    1925948     101056
Swap:      3903752     342976    3560776
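
For readers unfamiliar with this output: the "-/+ buffers/cache" row is derived from the "Mem:" row by subtracting buffers and cache from "used" and adding them to "free". A minimal Python sketch of that arithmetic, using the figures above (the function name is only illustrative):

```python
def buffers_cache_row(used, free, buffers, cached):
    """Return (used, free) as shown in the '-/+ buffers/cache' row of free.

    All values are in kB, as printed by free.
    """
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


# Figures from the "Mem:" row in this report.
app_used, effectively_free = buffers_cache_row(
    used=1977144, free=49860, buffers=1184, cached=50012
)
print(app_used, effectively_free)  # 1925948 101056
```

By this measure only about 100 MB of the 2 GB was free or easily reclaimable when the snapshot was taken, which would be consistent with the swapping described here.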

This is a Dell n-series with a Core 2 CPU and 2 GB of RAM, running a 32-bit kernel, with an ATI card using the fglrx module.

Nothing like this ever happened with Karmic under the same load (I know VirtualBox uses half of the memory, but that was the same in Karmic, with no problem at all).

The system is pretty unusable for work.

I will try rebooting into an older kernel, using Firefox, and downgrading VirtualBox to find the cause, but if anyone has a suggestion, it is very welcome.
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: AD198x Analog [AD198x Analog]
   Subdevices: 2/2
   Subdevice #0: subdevice #0
   Subdevice #1: subdevice #1
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: romano 1559 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xfebdc000 irq 16'
   Mixer name : 'Analog Devices AD1984'
   Components : 'HDA:11d41984,10280211,00100400'
   Controls : 30
   Simple ctrls : 18
DistroRelease: Ubuntu 10.04
Frequency: Once a day.
HibernationDevice: RESUME=UUID=75c5b06d-24c3-4a8e-a0f3-cf3b519c35ad
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.

 vboxnet0 no wireless extensions.
MachineType: Dell Inc. OptiPlex 755
NonfreeKernelModules: fglrx
Package: linux (not installed)
ProcCmdLine: root=UUID=fe469c02-94bf-4a0f-a430-32767942a34d ro xforcevesa quiet splash
ProcEnviron:
 LANGUAGE=en_GB:en
 PATH=(custom, user)
 LANG=en_GB.utf8
 SHELL=/usr/bin/zsh
ProcVersionSignature: Ubuntu 2.6.32-22.36-generic 2.6.32.11+drm33.2
Regression: Yes
RelatedPackageVersions: linux-firmware 1.34
Reproducible: No
RfKill:

Tags: lucid regression-release needs-upstream-testing
Uname: Linux 2.6.32-22-generic i686
UserGroups: adm admin audio cdrom dialout dip fax floppy fuse lpadmin netdev plugdev sambashare scanner tape vboxusers video
WifiSyslog: Jun 8 10:24:26 pern kernel: [54110.681761] warning: `VirtualBox' uses 32-bit capabilities (legacy support in use)
WpaSupplicantLog:

dmi.bios.date: 08/04/2008
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A11
dmi.board.name: 0DR845
dmi.board.vendor: Dell Inc.
dmi.chassis.type: 3
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA11:bd08/04/2008:svnDellInc.:pnOptiPlex755:pvr:rvnDellInc.:rn0DR845:rvr:cvnDellInc.:ct3:cvr:
dmi.product.name: OptiPlex 755
dmi.sys.vendor: Dell Inc.

Romano Giannetti (romano-giannetti) wrote :

Hmmm... a very suspicious culprit: Google Chrome.

Testing more...

Romano Giannetti (romano-giannetti) wrote :

No. It happened again without it.

It seems that whenever there is disk activity going on, such as writing a big file, the system practically halts.

Probably kernel-related. Testing more.

Romano Giannetti (romano-giannetti) wrote :

I tried to assign this to the kernel package, with no luck. Searching for packages either says "too much package to list" or suggests a linux-2.6.32 that then fails.

ooze (zoe-gauthier)
affects: ubuntu → linux (Ubuntu)
Romano Giannetti (romano-giannetti) wrote :

OK, I cannot downgrade the kernel because the 2.6.31 headers are not in the repositories... grr. I tried to install linux-headers-2.6.31-20 manually, but it fails with missing dependencies.

Please advise.

Jeremy Foshee (jeremyfoshee) wrote :

Hi Romano,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/daily-live/current/ . If the issue remains, please run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 582264

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Abdó Roig-Maranges (aroig) wrote :

The same happens to me.

When there is intense disk activity (copying files, installing packages, etc.) the whole system becomes extremely slow and unresponsive. This is quite annoying, as it renders the system unusable while copying files or doing an apt-get upgrade. This definitely did not happen in Karmic.

I have tested a mainline kernel build (linux-image-2.6.32-0206321405-generic_2.6.32-0206321405_amd64.deb) and the one from the Maverick liveCD, as suggested above. In both cases things went smoothly, without stalls. I started to copy a large directory, and simultaneously I could open Firefox and browse without any noticeable lag. The problem seems to be in the Ubuntu Lucid kernel, then...

Abdó Roig-Maranges (aroig) wrote :

Ok, I have to correct myself. I've been using the mainline kernel for a while now and the problem is definitely still there.

Romano Giannetti (romano-giannetti) wrote : AlsaDevices.txt

apport information

tags: added: apport-collected
description: updated
Romano Giannetti (romano-giannetti) wrote : AplayDevices.txt

apport information

Romano Giannetti (romano-giannetti) wrote : BootDmesg.txt

apport information

Romano Giannetti (romano-giannetti) wrote : Card0.Amixer.values.txt

apport information

Romano Giannetti (romano-giannetti) wrote : Card0.Codecs.codec.0.txt

apport information

Romano Giannetti (romano-giannetti) wrote : CurrentDmesg.txt

apport information

Romano Giannetti (romano-giannetti) wrote : Lspci.txt

apport information

Romano Giannetti (romano-giannetti) wrote : Lsusb.txt

apport information

Romano Giannetti (romano-giannetti) wrote : PciMultimedia.txt

apport information

Romano Giannetti (romano-giannetti) wrote : ProcCpuinfo.txt

apport information

Romano Giannetti (romano-giannetti) wrote : ProcInterrupts.txt

apport information

Romano Giannetti (romano-giannetti) wrote : ProcModules.txt

apport information

Romano Giannetti (romano-giannetti) wrote : UdevDb.txt

apport information

Romano Giannetti (romano-giannetti) wrote : UdevLog.txt

apport information

Romano Giannetti (romano-giannetti) wrote :

It happened again, and I managed to capture the output of ps augx during the "frenzy" phase, which I will attach. Afterwards the system is still sluggish, so I will have to reboot (or maybe close applications and do a swapoff/swapon; I will try that). I also managed to send the apport information (that was a three-minute task, since the disk is active almost all the time).

Load average peaked at 20 or so.

Romano Giannetti (romano-giannetti) wrote :

I think that the problem is a very bad page release or something like that. Once it has entered the swap frenzy, the system is very difficult to recover. swapoff does not work (it says there is no free memory), and I have to stop VirtualBox, which uses half of the memory, to regain control of the system.
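
swapoff has to move every page currently held in swap back into RAM, so it can only succeed if enough memory is free or reclaimable. A rough Python sketch of that check, assuming the usual /proc/meminfo field names (MemFree, Buffers, Cached, SwapTotal, SwapFree); the helper names are hypothetical and the estimate is deliberately crude:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key.strip()] = int(rest.split()[0])
    return info


def swapoff_could_fit(info):
    """Rough estimate: can the pages now in swap fit back into RAM?"""
    swap_used = info["SwapTotal"] - info["SwapFree"]
    reclaimable = info["MemFree"] + info["Buffers"] + info["Cached"]
    return swap_used <= reclaimable


# Sample built from the `free` figures in the bug description (kB).
sample = """MemFree: 49860 kB
Buffers: 1184 kB
Cached: 50012 kB
SwapTotal: 3903752 kB
SwapFree: 3560776 kB
"""
print(swapoff_could_fit(parse_meminfo(sample)))  # False
```

On a live system the text would come from reading /proc/meminfo. With the figures from the description, about 335 MB sits in swap against roughly 100 MB of free or reclaimable RAM, so a failing swapoff is what one would expect.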

Again, this did not happen at all with Karmic.

If I fill the memory and push the system into swap, it happens again. I use a little program to fill the memory to test it, attached below.

Exiting VirtualBox restores normal operation until memory fills up again; when swapping starts, the load average jumps and the system becomes really sluggish.

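#include <stdio.h>
#include <stdlib.h>

/* Allocate <number_of_MB> megabytes in 1 KiB blocks and touch each block,
   so that the kernel really has to provide the memory. */
int main(int argc, char** argv) {

        char * mem;
        long nblocks;
        long i;

        if (argc!=2) {
                fprintf(stderr, "Usage: %s <number_of_MB>\n", argv[0]);
                exit(1);
        }

        nblocks = atol(argv[1])*1024; /* number of 1 KiB blocks */
        for(i=0;i<nblocks;i++) {
                mem = (char*)malloc(1024);
                if (mem==NULL) { /* out of memory: stop instead of crashing */
                        fprintf(stderr, "%s: malloc failed\n", argv[0]);
                        exit(1);
                }
                mem[1]='A'; /* write to the block so it is actually touched */
                if ((i%1024)==0) {
                        printf("%s:%4ld Mbyte allocated and touched\n",
                               argv[0], i/1024);
                }
        }

        /* To sit here holding the memory, #include <unistd.h> and add:
           while(1) {sleep(10);} */

        return 0; /* the memory is released when the process exits */
}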

Josef Grahn (josef-grahn) wrote :

I can confirm this phenomenon on a Dell Dimension with a Pentium D CPU, 2G RAM, running a 64 bit kernel and nVidia card with official nVidia driver. Swappiness is set to 10.

It appears the system is swapping intensely when it becomes unresponsive, and sometimes freezes for several minutes with only occasional mouse pointer updates. The %iowait of the CPU is typically above 80 during these episodes. It usually happens when running some memory intensive application(s) (e.g. Chrome with many tabs, Eclipse or VirtualBox), so it could very well be caused by an actual shortage of RAM (disappointingly meaning 2 GB is only sufficient for casual desktop use nowadays).

I would however generally have expected the kernel to handle the resource shortage better, and not effectively suspend every running process, as well as mouse and keyboard input, for up to ten minutes. Especially at times when only one of the memory hungry applications is being actively used, as the others should be able to be swapped out completely leaving enough free RAM.

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu development release http://cdimage.ubuntu.com/daily-live/current/ . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired
David Oftedal (rounin) wrote :

This issue still affects me in Maverick Meerkat. uname -a returns the following:
Linux big-iron 2.6.35-23-generic #41-Ubuntu SMP Wed Nov 24 11:55:36 UTC 2010 x86_64 GNU/Linux

The bug can be triggered by periods of relatively high disk activity, but it also seems to generate its own disk activity, and I've no idea what it's using the disk for. dmesg seems to indicate that there was an out of memory error at some point, but I'm not using swap, so it shouldn't involve the disk.

This blog entry - http://billauer.co.il/blog/2010/10/disk-io-scheduler-load-dd-freeze-stall-hang/ - indicates that it could be a kernel bug, and so I'm going to put off investigating it further until I've been able to try that kernel (or have resolved it in some other way).

David Oftedal (rounin) wrote :

Also, there was an earlier problem with these exact same symptoms which was said to be due to a race condition triggered when using a swap file stored on a LUKS-encrypted partition (this is why I no longer use swap). I still encrypt my /home partition with LUKS, so that could be related to the problem.

Changed in linux (Ubuntu):
status: Expired → Incomplete
David Oftedal (rounin) wrote :

There's also a thread on the forums pertaining to what seems to be this and several unrelated problems:
http://ubuntuforums.org/showthread.php?t=1478787

David Oftedal (rounin) wrote :

I've upgraded to kernel 2.6.37-12-generic, and the problem seems to be much harder to trigger, but it was still possible. It seems to be triggered by low-memory conditions:

1) I used the KDE file manager and navigated to a directory containing a lot of images, so that it started to create thumbnails
2) I started Python and had it create an endless list of junk:

# Grow a list forever; memory use rises until the system runs out.
a = []
while True:
    a.append(1)

As memory ran out, the entire system started grinding to a halt, ignoring keyboard and mouse input, dropping network traffic and using the disk for some mysterious purpose as before (as mentioned, I don't use swap, so it's not that).

I was able to get to the console eventually and run "top", and top showed two things:
1) A high "wa" value, which seems to mean that the CPU is waiting for I/O? But that value is at around 13% now, and everything's running smoothly.
2) Python was taking up a great deal of memory.
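
For reference, the "wa" figure in top is the share of CPU time spent idle while waiting for I/O to complete. A minimal sketch of how it can be derived from two samples of the aggregate "cpu" line in /proc/stat, where the fifth numeric field is iowait; the function names and the sample lines are made up for illustration:

```python
def cpu_fields(stat_line):
    """Return the numeric fields of a 'cpu ...' line from /proc/stat."""
    return [int(x) for x in stat_line.split()[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    a, b = cpu_fields(before), cpu_fields(after)
    # Field order: user nice system idle iowait irq softirq steal.
    # Guest time is already counted in user, so later fields are skipped.
    deltas = [y - x for x, y in zip(a, b)][:8]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Invented sample lines, standing in for two reads a few seconds apart.
t0 = "cpu 1000 0 500 8000 500 0 0"
t1 = "cpu 1100 0 550 8050 1300 0 0"
print(iowait_percent(t0, t1))  # 80.0
```

A high value on its own only says that processes are blocked on the disk; it does not show what the disk is being used for.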

Killing Python resolved the problem immediately, and nothing else crashed, unlike with the previous kernel. But that still means that an unprivileged user running Python can stall the entire system by creating a list.
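
One way to repeat such a test without risking the whole machine is to cap the address space of the test process, so that the allocation fails with MemoryError instead of exhausting RAM. A hedged sketch using Python's standard resource module (Unix only); the function name, the chunk size and the cap passed in are arbitrary choices for illustration, and this only contains the test, it does nothing about the underlying kernel behaviour:

```python
import subprocess
import sys

# Code run in a child process, so the limit never affects the caller.
CHILD = r"""
import resource, sys
limit = int(sys.argv[1])
chunk = 16 * 1024 * 1024
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
if hard != resource.RLIM_INFINITY:
    limit = min(limit, hard)
# Lower only the soft limit on address space; the hard limit is kept.
resource.setrlimit(resource.RLIMIT_AS, (limit, hard))
junk = []
try:
    # Safety net: never try to allocate much more than the cap itself.
    for _ in range(limit // chunk + 4):
        junk.append(bytearray(chunk))
    result = "limit not enforced"
except MemoryError:
    result = "MemoryError after %d MiB" % (len(junk) * 16)
del junk
print(result)
"""


def run_capped(limit_bytes):
    """Run the memory hog in a child capped at limit_bytes of address space."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD, str(limit_bytes)],
        capture_output=True, text=True,
    )
    return proc.stdout.strip()
```

For example, run_capped(1024 * 1024 * 1024) runs the hog under a 1 GiB cap and returns a one-line summary. A similar effect is available from the shell with ulimit -v before starting the test program.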

David Oftedal (rounin) wrote :

Definitely still a problem in kernel 2.6.37-12. Today the machine stalled twice as I was rsyncing some files from one partition to another.

The first time it lasted for about an hour, during which it was possible to type in commands, but it literally took an hour to log in and kill a few processes. The console was littered with messages about kworker-this-and-that being blocked for more than 120 seconds.

The second time it lasted for more than two hours before I pressed the reset button.

It's wonderful to have some incentives to spend time away from the computer, of course, but more than three consecutive hours on a busy night is a bit much. As this seems to be a kernel bug, if not several, I'm going to assume it won't get fixed.

David Oftedal (rounin) wrote :

The kernel bug theory seems to hold true, as the problem is much less severe with kernel 2.6.38. The system is now down to stalling for seconds at a time instead of hours at a time. An improvement by a factor of 3600 from one version to the next!

Since the problem appears to be in the process of being solved, I guess we can say crisis averted.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
David Oftedal (rounin) wrote :

The problem's still present in 2.6.38, though the stalls are much shorter. The longest one I've had until the OOM-killer did its job was about 12 minutes.

Stephane Carrez suggests changing the I/O scheduler to the deadline scheduler:

http://blog.vacs.fr/index.php?post/2010/08/28/Solving-Linux-system-lockup-when-intensive-disk-I/O-are-performed

This can be done on a per-disk basis (as root) like so:
echo deadline > /sys/block/sda/queue/scheduler

Or the default can be changed by setting the kernel option elevator to deadline at boot time. So presumably, in grub.cfg, instead of:

linux /vmlinuz-somethingsomething root=somethingsomething

It should say something like:

linux /vmlinuz-somethingsomething root=somethingsomething elevator=deadline

The kernel developers have obviously thought long and hard about what scheduler to choose, but if the deadline scheduler can prevent the system from stalling for up to 12 minutes each time it runs out of RAM, with or without swap, then it's possible that the deadline scheduler would be a better default choice.
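
Before and after such a change it is worth checking which scheduler is actually active. The sysfs file lists the available schedulers with the active one in square brackets; a small Python sketch for reading it (the helper name is hypothetical, and the sample string is only an example of the format):

```python
import re


def active_scheduler(text):
    """Return the scheduler shown in [brackets], or None if there is none."""
    match = re.search(r"\[([^\]]+)\]", text)
    return match.group(1) if match else None


# Example of the file format; on a live system read
# /sys/block/sda/queue/scheduler instead (the device name may differ).
print(active_scheduler("noop deadline [cfq]"))  # cfq
```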

David Oftedal (rounin) wrote :

The bug's still there

Changed in linux (Ubuntu):
status: Expired → Incomplete
David Oftedal (rounin) wrote :

The scheduler change helps a lot (just as the kernel upgrade helped), but the underlying problem is obviously still there: when the system runs out of memory, it starts furiously using the disk, even when it isn't using swap.

Since the original poster submitted the requested information, I'm changing the status back to New.

Changed in linux (Ubuntu):
status: Incomplete → New
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
penalvch (penalvch) wrote :

Romano Giannetti, thank you for reporting this and helping make Ubuntu better. This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

tags: added: regression-release
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired