Intel Core i7 - Timer interrupt freezes, high CPU usage, system becomes sluggish

Bug #665796 reported by Juliano Ravasi on 2010-10-24
This bug affects 9 people
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Since upgrading to Maverick, I'm frequently observing instances of the timer interrupts suddenly stopping being processed (or even produced?) for a few minutes. The system becomes sluggish as a whole, all CPU cores go high on Sys% usage. This continues for 3-5 minutes, then suddenly goes back to normal, just like how it started.

It is certainly related to the timer interrupt: If I continuously 'cat /proc/interrupts' (or just run it under 'watch') during the time the system is affected, interrupt 0 (IR-IO-APIC-edge timer) stays frozen until the problem goes away. No messages appear on dmesg or in /var/log/kern.log.
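
The check described above can be scripted rather than eyeballed under 'watch'. This is only a rough sketch: the function names are made up for illustration, and it assumes the line labelled "0:" is the one to watch, as in this report. On kernels that drive the tick from the local APIC timer, the LOC line may be the one that moves instead, so check that on your own system first.

```python
import time
from itertools import takewhile


def irq0_count(interrupts_text):
    """Return the IRQ 0 count, summed over all CPUs, from /proc/interrupts text."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "0:":
            # Per-CPU counters come first; the controller and device
            # names (e.g. "IR-IO-APIC-edge timer") follow them.
            counts = takewhile(str.isdigit, fields[1:])
            return sum(int(c) for c in counts)
    raise ValueError("no IRQ 0 line found")


def timer_is_frozen(before_text, after_text):
    """True if IRQ 0 did not advance between the two snapshots."""
    return irq0_count(after_text) <= irq0_count(before_text)


def sample_timer(path="/proc/interrupts", interval=2.0):
    """Take two snapshots `interval` seconds apart and compare them."""
    with open(path) as f:
        before = f.read()
    time.sleep(interval)
    with open(path) as f:
        after = f.read()
    return timer_is_frozen(before, after)
```

Calling sample_timer() during a sluggish period and getting True back would match the frozen counter described here.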

The problem always starts at random, with no particular cause, but it seems more frequent when making heavy use of both sound and graphics at the same time. It very commonly happens when I start playing a game under Wine. It also seems to vanish at random after 3-5 minutes have passed, but zapping the X server seems to be a way to restore normal behaviour (this has worked twice so far; I need more tests to confirm).

The system is:
  Processor: Intel Core i7 930 (2.8 GHz, 4 cores × 2 threads)
  Mainboard: Intel DX58SO Extreme
  Memory: 6 GiB
  Storage: 3 × 1 TB Samsung hard disks in a RAID 5 array, LVM on top.
  Video: nVidia GTX 460 (with proprietary drivers)

I don't know what more I can say about this issue, but feel free to ask; I'm willing to provide as much information as possible to have this fixed.

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: linux-image-2.6.35-22-generic 2.6.35-22.35
Regression: Yes
Reproducible: No
ProcVersionSignature: Ubuntu 2.6.35-22.35-generic 2.6.35.4
Uname: Linux 2.6.35-22-generic x86_64
NonfreeKernelModules: nvidia
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: juliano 2781 F.... kmix
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf0000000 irq 22'
   Mixer name : 'Realtek ALC889'
   Components : 'HDA:10ec0889,80860022,00100004'
   Controls : 42
   Simple ctrls : 24
Date: Sun Oct 24 04:25:13 2010
Frequency: I don't know.
HibernationDevice: RESUME=UUID=4f433bf6-748e-432d-b93f-de4d0e693b91
InstallationMedia: Kubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101008)
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.35-22-generic root=/dev/mapper/Nagisa1-Root ro rootdelay=5
ProcEnviron:
 LANGUAGE=
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/zsh
RelatedPackageVersions: linux-firmware 1.38
RfKill:

SourcePackage: linux
UserAsoundrc:
 #pcm.!default plug:hw:0
 #ctl.!default hw:0
dmi.bios.date: 08/30/2010
dmi.bios.vendor: Intel Corp.
dmi.bios.version: SOX5810J.86A.5456.2010.0830.0013
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: DX58SO
dmi.board.vendor: Intel Corporation
dmi.board.version: AAE29331-703
dmi.chassis.type: 2
dmi.modalias: dmi:bvnIntelCorp.:bvrSOX5810J.86A.5456.2010.0830.0013:bd08/30/2010:svn:pn:pvr:rvnIntelCorporation:rnDX58SO:rvrAAE29331-703:cvn:ct2:cvr:

Juliano Ravasi (jravasi) on 2010-10-29
description: updated
Juliano Ravasi (jravasi) wrote :

Two more users reported a very similar problem at Server Fault, with the same CPU (Intel Core i7 930) and the same Ubuntu version (10.10):
http://serverfault.com/questions/194706/ubuntu-10-10-maverick-server-makes-system-locks-up-at-random-intervals-i7-930-1

Juliano Ravasi (jravasi) on 2010-10-29
description: updated
Adam Ziegler (mrbond) wrote :

Redirected here from ServerFault and bug 658649 (thanks JRavasi).

Same CPU/video card, although for me the timing is closer to 1-2 minutes of lockup. Unlike the original report, though, this seems to happen without _any_ apparent cause - even if the system is completely idle, there will be occasional bouts. Swapping to full terminal via Ctrl+Alt+Fx will take some 10-15 seconds, but once in terminal, responsiveness is okay. top/htop updates at this point mimic stuttering seen in the GUI.

Again, it started happening after a fresh install of 10.10. Other suggested culprits include the nVidia drivers and desktop effects, but changing neither of them has solved it.

Juliano Ravasi (jravasi) wrote :

It happens without any apparent cause to me too, even when the system is idle.

It just appears that under normal operation, it happens 2-4 times a day (completely random, no fixed interval between events); but if I start playing Portal, for example, it happens in less than 30 mins (again, at random, no defined interval).

Switching terminals also takes 10-15 seconds for me, and htop from the console also shows the same unresponsiveness. Shaking the mouse vigorously or flood-pinging from another machine alleviates the problem until it goes away (it makes sense: they produce interrupts that wake the process scheduler; this also explains the high system CPU usage: the scheduler never awakes for normal ticks since there is no timer, but it awakes to process hardware interrupts).

I'm seeing similar behavior on a Server install of 10.04.

Adam Ziegler (mrbond) wrote :

Unusually enough, the issue seems to have resolved itself on my end. I haven't experienced any hangups for the past few days; checking on the system periodically, CPU load averages are reasonable for the activity (< 0.50, vs. > 4.0 from before).

The only relevant changes I've made since are disabling KDE's nepomuk/Strigi indexing entirely, and installing the almost daily updates of ALSA kernel modules from the ubuntu-audio-dev PPA, not to mention keeping up-to-date from all official repos. This leads me to believe that the source is either the audio or file I/O subsystems.

Any progress for the others affected?

Juliano Ravasi (jravasi) wrote :

Since updating to mainline kernel 2.6.37-rc1, I'm also not experiencing the issue. But I also found and removed a faulty memory module from my system.

I have this problem on my system. Initially I thought it might be KDE, as I was using Gnome in 10.04, but I later found that the same thing occurs under fluxbox, so it evidently isn't KDE related. http://ubuntuforums.org/showthread.php?p=10032443 For me it happens more often when I have a virtual machine running (but it still occurs when no VM is running).

Juliano Ravasi (jravasi) wrote :

Kalidarn, please help us narrow down the problem. From your linked thread, I see you have nVidia, which matches the problem reported here. What is your CPU? Is it by any chance a Core i7? Do you see the timer interrupt freezing in /proc/interrupts during these events? (Use 'watch -d grep timer /proc/interrupts' to see if the timer interrupt is working.)

Currently I'm using kernel 2.6.37-rc1 from https://wiki.ubuntu.com/KernelMainlineBuilds and I'm yet to see the problem happen again.

Yes, my machine also has an i7 930 in it. I have not yet checked the interrupts while the system is experiencing high system CPU usage, but I will do so the next time I see it happen (I know when it's going to happen because my CPU widget jumps from 5% to 100%). I did install htop and noticed that my cores go crazy when it occurs; before that I was using KDE's system monitor and had noticed the same thing.

One thing I have noted is it seems to happen more when I am running a VM, (but it does happen when I'm not as well).

Also I'm using sysfs.conf method to tell /dev/sda to use the noop scheduler instead of CFQ as I heard that was better for SSDs. I'm using CFQ for my other disks however.

Oh and I have a NVIDIA 470 GTX. I considered using non-proprietary drivers such as nouveau but the current version in the repository doesn't support the Fermi series of cards.

I did run "watch -d grep timer /proc/interrupts" and I'm not sure if it was my imagination but during the laggy period the numbers didn't seem to change as often. What exactly am I looking for?

Adam Ziegler (mrbond) wrote :

I'll throw in that it is still happening for me. Before, I was probably "lucky" enough to hit the gaps, both in observation and usage. Same symptoms, with or without VMs running, all CFQ scheduling. Not sure if HDD layout/config affects this, but maybe it'll help isolate the problem:
1x Corsair F120 120 GB SSD, primary OS drive
4x Seagate "green" 5900 RPM 2 TB HDD, in 2x 2-disk RAID-1 configuration (mdadm)

Juliano Ravasi (jravasi) wrote :

Kalidarn: the "timer" interrupt must ALWAYS increment. It is produced continuously by the hardware, hundreds or thousands of them per second. watch keeps running that command every 2 seconds, so, the number is always changing. The kernel uses this interrupt mainly for process scheduling. If for some reason the interrupt stops, the kernel won't be able to do proper scheduling, and the system becomes sluggish.

There are other sources of hardware interrupts. Every time a hardware device causes an interrupt, the kernel awakens to process that interrupt. If the timer interrupt is not working, then all wakeups are to process hardware events, and this causes monitors to report very high System CPU usage.
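
To put a number on the "very high System CPU usage" part, the aggregate "cpu" line of /proc/stat can be sampled twice and the system share of the elapsed time computed. This is a minimal sketch with hypothetical function names; it counts only the "system" field (not irq or softirq), which is an assumption about what a given monitor reports as Sys%.

```python
def cpu_fields(stat_text):
    """Return the counters from the aggregate 'cpu' line of /proc/stat text."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(x) for x in parts[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def system_share(before_text, after_text):
    """Fraction of CPU time spent in 'system' between two snapshots."""
    before = cpu_fields(before_text)
    after = cpu_fields(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    # Fields: user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time (later fields) is already counted inside user/nice,
    # so only the first eight fields are summed for the total.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return deltas[2] / total
```

Feeding it two reads of /proc/stat taken a few seconds apart should give a value near 1.0 during an episode like the ones described, and a much lower value on a healthy idle system.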

I'm using the CFQ I/O scheduler. It seems that Core i7 and nVidia proprietary drivers are the common characteristics of all people reporting this issue; I'll update the title.

summary: - Timer interrupt freezes, system becomes sluggish
+ Intel Core i7, nVidia proprietary - Timer interrupt freezes, system
+ becomes sluggish

Same problem here, i7/nvidia as well.

The problems started after the xorg 1.9 + nvidia update in the maverick beta, and were still there on a fresh maverick install after release.

This was also discussed in https://bugs.launchpad.net/ubuntu/+source/gnome-settings-daemon/+bug/649809
which also seems to be a problem specific to i7/nvidia/ssd, and which for me also started after the xorg/nvidia update.

dyna (ubuntu-dyna) wrote :

I think I have found a fix for my timer interrupt freeze, but I need more testing to be sure.

After trying several kernel boot options without success (irqfixup, irqpoll, noapic), I ran into this 2-year-old thread: https://bugzilla.kernel.org/show_bug.cgi?id=8300 which suggests clocksource=acpi_pm in the last post.

After booting with this option dmesg noted: Override clocksource acpi_pm is not HRT compatible. Cannot switch while in HRT/NOHZ mode, and cat /sys/devices/system/clocksource/clocksource0/current_clocksource still returned: tsc

But strangely enough it did seem to fix the problem. I ran the system for 2x1.5 hours without problems under heavy I/O and audio/video, which would normally trigger the problem within 15 minutes.
I compared the dmesg outputs, and 2 memory ranges seem to be slightly different.

After digging up some more information about clocksource I changed it to hpet, since my BIOS supports 64-bit HPET. This also seems to work: it has been running for 3+ hours at the moment. It seems to be the best option anyway since my BIOS supports it.

I also tried disabling HPET in the BIOS and booting without a clocksource option, and the problem was still there; I guess that without the option the kernel does not use HPET as the timer by default (yet?).
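
Whether a clocksource= boot option actually took effect (it did not in the acpi_pm case above, where the kernel still reported tsc) can be checked from sysfs. The sketch below uses hypothetical function names; the directory is the one quoted in the comment above, and the sibling file available_clocksource is assumed to be present next to current_clocksource, as it is on the kernels I know of.

```python
from pathlib import Path

# Directory taken from the path quoted in the comment above.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def check_override(requested, current_text, available_text):
    """Compare a requested clocksource= value with what the kernel reports."""
    current = current_text.strip()
    available = available_text.split()
    return {
        "current": current,                   # clocksource in use right now
        "available": available,               # clocksources the kernel offers
        "offered": requested in available,    # could it be selected at all?
        "active": requested == current,       # did the override take effect?
    }


def check_override_from_sysfs(requested, base=CLOCKSOURCE_DIR):
    """Read both sysfs files and run the comparison on their contents."""
    current_text = (base / "current_clocksource").read_text()
    available_text = (base / "available_clocksource").read_text()
    return check_override(requested, current_text, available_text)
```

For example, check_override_from_sysfs("hpet") after booting with clocksource=hpet should report "active": True if the switch really happened.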

Bruno (jambock) wrote :

I have the same issue, but on a different cpu

model name : Intel(R) Xeon(R) CPU W3520 @ 2.67GHz
0f:00.0 VGA compatible controller: nVidia Corporation G86 [Quadro NVS 290] (rev a1)

dyna (ubuntu-dyna) wrote :

Problem still occurs with clocksource boot options, it just doesn't seem to get triggered by my testcase.

With the clocksource option it takes about 5-6 hours before it first occurs with my testcase running; however, after the first occurrence my testcase seems to trigger it again very quickly (in under 5 minutes).

Well, anyway, since my testcase is a real-world case for me, it's some progress. But I guess it's not really useful for anyone else.

The problem also doesn't exist in newer kernels, by the way. But my system has had several crashes with 2.6.36 mainline, 2.6.36 natty sauce, and 2.6.37-rc1 natty sauce (which I tried to get stable for weeks before returning to .35).

That's a shame. I was hoping that updating to 2.6.37 might let me escape this bug until 11.04 comes out and ships with whatever newer kernel isn't affected by it. On the upside, 2.6.37 also has the USB 3.0 suspend fixes I wanted, but if it's too unstable to use then that's not a solution.

Fedora Core 13 and OpenSUSE 11.3 are both using 2.6.34, while Fedora Core 14 (in development) uses 2.6.36 and OpenSUSE Factory uses 2.6.37rc. Nobody except us Ubuntu users seems to be using 2.6.35 :(.

Rob Alf (rpgrca) wrote :

Same happening to me, i7 930. It happens a few minutes after I launch a video, regardless of whether I still have it open or have already closed it. It may last up to 5 minutes. The only thing I notice is that /var/log/messages gets a psmouse.c message whenever the CPU starts spiking:

Nov 22 02:58:31 ansalon kernel: [235256.800265] psmouse.c: Wheel Mouse at isa0060/serio1/input0 lost synchronization, throwing 1 bytes away.

However, I am guessing it is not a symptom but a consequence of the CPU spike issue. This issue is pretty depressing: I must launch the video, pause it, wait until the CPU starts spiking (which may be a few seconds or a few minutes later), wait until it finishes, and only then start watching the video.

Rob Alf (rpgrca) wrote :

Sorry, forgot to add that I am on Nvidia as well: GeForce GTS 250 (512MB) and NVIDIA Driver Version: 260.19.06.

Toni Shocker (shocker) wrote :

I am also experiencing the EXACT same problem with i7 930. My videocard: VGA compatible controller: S3 Inc. 86c375 [ViRGE/DX] or 86c385 [ViRGE/GX] (rev 01) (veeeeeery old crap), I don't need any video, was using that videocard just to have the computer start... random high kernel cpu usages, ~3-5 minutes, "interrupt 0 (IR-IO-APIC-edge timer) stays frozen until the problem goes away. No messages appear on dmesg or in /var/log/kern.log"
There is no sound/video usage, so it might not be directly related to them

Juliano Ravasi (jravasi) wrote :

Confirmed that nVidia driver seems to be unrelated. I tested with the nouveau driver with the same results. It is either something in the 2.6.35 branch, or some Ubuntu patch added to the kernel. I experienced no issues for over a month using 2.6.36 from mainline repo.

summary: - Intel Core i7, nVidia proprietary - Timer interrupt freezes, system
- becomes sluggish
+ Intel Core i7 - Timer interrupt freezes, high CPU usage, system becomes
+ sluggish
Toni Shocker (shocker) wrote :

Any specific version or simply anything 2.6.36.x ?

Juliano Ravasi (jravasi) wrote :

Just 2.6.36. This one, specifically:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.36-maverick/

There are other stable kernels (up to 2.6.36.2, see parent directory), but they are tagged "natty", so I didn't try them.

Adam Ziegler (mrbond) wrote :

For what it's worth, the issue has cleared itself up (again) for me, and the only relevant change I can think of is using the 2.6.35-23 kernel (a relatively recent update over 2.6.35-22).

I must say I haven't had it in a while. Maybe you're right, Adam.

Toni Shocker (shocker) wrote :

I can confirm, upgrading the kernel fixed the problem

clayton craft (craftyguy) wrote :

Upgrading to 2.6.35.22 does not fix the issue for me. I'm running an i7 940, and my configuration is very similar to what is posted above

Adam Ziegler (mrbond) wrote :

I think it was the 2.6.35-23 update that resolved it (whether intentionally or otherwise) - try that one.

Brad Figg (brad-figg) on 2011-04-07
Changed in linux (Ubuntu):
status: New → Confirmed
tags: removed: regression-potential

Juliano Ravasi, thank you for reporting this and helping make Ubuntu better. Maverick reached EOL on April 10, 2012.
Please see this document for currently supported Ubuntu releases:
https://wiki.ubuntu.com/Releases

We were wondering if this is still an issue in a supported release? If so, can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in a supported release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.

Please let us know your results. Thanks in advance.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete