regression in karmic thermal control

Reported by Rolf Leggewie on 2009-09-18
This bug affects 18 people
Affects          Status      Importance   Assigned to   Milestone
Linux            Expired     Medium
linux (Ubuntu)   Won't Fix   Medium       Unassigned

Nominated for Karmic by Cvet

Bug Description

I believe there may be a regression in thermal control which was introduced in karmic a few days ago. Looking at the packages that aptitude updated, I believe the problem may have started after the linux-image package was upgraded from 2.6.31-10.32 to 2.6.31-10.34. But of course, we did see a lot of fairly intrusive changes in a lot of key packages during the last few days as well.

My Thinkpad X24 shut down a couple of times over the last few days without any warning. It's never done that before. I believe the machine may have shut down on its own to prevent damage from overheating. I should say though that the machine wasn't really doing much (but I think Firefox kept it busy nonetheless with the gazillion of tabs I have open all the time ;-)). The machine felt only mildly warm when touched, the fan had been running constantly but at a fairly low speed (judging from the little noise it was making). I added information from the temp sensor to conky and indeed, temperature was going straight for 92°C. At that point, I throttled CPU frequency and the temperature dropped sharply.

To me, it looks like the fan control is not in line with the temperature. 92°C should have the fan going full blast, but I don't think it was going that hard.

$ cat /proc/acpi/thermal_zone/THM0/trip_points
critical (S5): 95 C
passive: 92 C: tc1=5 tc2=2 tsp=600 devices=CPU0
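
The trip points above leave very little room between throttling and shutdown. A minimal sketch of reading them programmatically, assuming the two-line format shown in this report (the function names are made up for illustration, and the /proc/acpi/thermal_zone interface is specific to kernels of this era):

```python
import re

# Sample text copied from the report above. On a 2.6.31 kernel it would be read
# from /proc/acpi/thermal_zone/THM0/trip_points.
SAMPLE = """critical (S5): 95 C
passive: 92 C: tc1=5 tc2=2 tsp=600 devices=CPU0
"""

def parse_trip_points(text):
    """Return {trip_name: temperature_in_C} for each line of a trip_points file."""
    trips = {}
    for line in text.splitlines():
        # Name, optional annotation such as "(S5)", a colon, then the temperature.
        match = re.match(r"\s*(\w+)[^:]*:\s*(\d+)\s*C", line)
        if match:
            trips[match.group(1)] = int(match.group(2))
    return trips

def headroom(trips):
    """Degrees between the passive (throttling) and critical (shutdown) trip points."""
    return trips["critical"] - trips["passive"]

trips = parse_trip_points(SAMPLE)
print(trips["passive"], trips["critical"], headroom(trips))
```

With the values reported here, passive cooling only starts 3 degrees below the critical shutdown point, which may explain why there is no time for throttling to take effect.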

bmarsh (bmarsh-bmarsh) wrote :

I am having similar problems with an Intel 2.8GHz board. At 77 C I am getting waaay too many messages from the kernel, and can't even break in to kill the machine. I have filled up the root directory twice in the last two days. After restarting the machine (and blowing all the dust out of the CPU and other spots) the temp seemed to be settling down to around 37 C. It came down from 46 C, so it was heading in the right direction. Then within 5 minutes, the temp was back up to 77 C. I didn't get a chance to look at the fan RPM but it was ok when I restarted the machine: around 2670 rpm and looking good.

Ramana (ramana-b-v) wrote :

I also feel that my temperatures have gone up recently. My system has a single fan. The temp reported under /proc/acpi/thermal_zone/THRM/temperature is the CPU temperature. My CPU temp is 51, but the issue is the temperature of the nVidia graphics processor (I am having to throttle down the CPU and disable all effects like compiz to keep this at 70C). The fan should be controlled by the maximum of these temperatures.

Cvet (cvet) wrote :

I have a similar problem. CPU temperature reached 96C at one point, which shut down my PC. I checked my logs; no warning message was ever produced. It happened while running Firefox at full blast (Flash playing a movie). I can throttle my CPU to keep it from reaching that temperature. The temperature control seems to be ok: the computer doesn't run hotter than usual (compared to 9.04), it's just that at high temperature the computer should throttle the CPU automatically to prevent it from reaching such temperatures. I am using the 2.6.31-14.48 kernel and I just updated from 9.04 (not sure if that makes a difference). I have never been able to reach such high temperatures with the 2.6.28 kernel.

Rolf Leggewie (r0lf) on 2009-10-31
Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
Brian Derr (bderrly) wrote :

I too have had my / partition fill up due to /var/log/kern.log (currently at 1.72G). I think this deserves higher than "Medium" priority as it is causing laptops to shut off and hard drives to fill up. A non-power user is not going to know what to do when they get a warning that their hard drive is filled up.

This message is repeated from Nov. 1 through now, Nov. 3.
Nov 1 07:47:03 geordi kernel: [74706.611036] CPU0: Temperature/speed normal
Nov 1 07:47:03 geordi kernel: [74706.612140] CPU0: Temperature above threshold, cpu clock throttled (total events = 415560)
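
To get a feel for how quickly these messages pile up, the counter in the throttle message can be divided by the uptime stamp in square brackets. This is only a rough sketch under the assumption that "total events" counts from boot; the helper name is invented for illustration:

```python
import re

# The two lines quoted above, unchanged.
LINES = [
    "Nov 1 07:47:03 geordi kernel: [74706.611036] CPU0: Temperature/speed normal",
    "Nov 1 07:47:03 geordi kernel: [74706.612140] CPU0: Temperature above threshold, "
    "cpu clock throttled (total events = 415560)",
]

# Captures: uptime in seconds, CPU name, cumulative event counter.
PATTERN = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+(CPU\d+): Temperature above threshold.*total events = (\d+)"
)

def throttle_rate(lines):
    """Return (cpu, events, events_per_second) from the last throttle message, or None."""
    result = None
    for line in lines:
        match = PATTERN.search(line)
        if match:
            uptime = float(match.group(1))
            events = int(match.group(3))
            result = (match.group(2), events, events / uptime)
    return result

cpu, events, rate = throttle_rate(LINES)
print("%s: %d events, about %.1f per second of uptime" % (cpu, events, rate))
```

For the excerpt above this comes to roughly 5.6 throttle events per second, and each event appears to log a pair of lines, which fits a kern.log growing past a gigabyte within days.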

Running 'acpi -t' returns nothing. I'm running an upgraded karmic with default kernel.

marcogoni (cogoni) wrote :

On my Thinkpad X24 (P3 1133MHz) Karmic is working perfectly.
No overheating problems up to now.

Cvet (cvet) wrote :

I am not sure, but could this be a case of an unspecified temperature limit? I did some digging and I discovered a program called sensors (it is used by the kernel to monitor temperature). In its config you have the option of setting the CPU max temperature (/etc/sensors.conf; first check your sensor by typing sensors). I know my max CPU temp is 90C, so under my sensor model I would put set temp1_over 88 and set temp1_hyst 85, and set temp2_over 88 and set temp2_hyst 85. It says you have to run sensors -s to check it out. The problem is that I don't know how to set up sensors -s in the boot scripts (to add it to /etc/init.d/sensors and to set a correct order, after modules load). Does anyone have any idea? I will try this setup to see if it helps and will report any results.

Cvet (cvet) wrote :

Setting sensor limits makes no difference for me. I don't have a fan speed sensor (or so Ubuntu says), so maybe that is why. I also believe that Ubuntu does not spin the fan to its full potential when I reach high temperatures. I checked /proc/acpi and I noticed the thermal limit set to T0, which is processor at 100%. As I understand it, that might not be used any more. If it's not, then where are the parameters set for decisions on throttling the CPU?

Cvet (cvet) wrote :

It turns out I wasn't testing the sensor configuration correctly. The default configuration file is sensors3.conf, not sensors.conf (go figure). It also turns out I cannot set limits for my k8temp*; not sure why that is.

sibs (swmpofu) wrote :

Since installing Karmic my fan has been operating at high speed too often for my liking and now I've had a number of unexpected shutdowns whilst using Firefox or transcoding using dvd::rip, particularly when doing both. I got the following errors from the last shutdown after it hung:

Modem Manager caught signal 15, shutting down
avahi-daemon main process terminated with status 255
rsyslog-kmsg main process killed by TERM signal

I've checked for any log in /var/log but I haven't found any useful information (not sure if that's even the right place to check). Please help me resolve this, as I've been experimenting with Linux for a month now and was hoping not to reload Windows if I liked it enough, but I can't have unexpected shutdowns every time I do anything processor intensive.

sibs (swmpofu) wrote :

Had a look at kern.log after reading earlier post found:

Nov 22 22:20:01 sibs-laptop kernel: [ 2.176614] Marking TSC unstable due to TSC halts in idle
Nov 22 22:20:01 sibs-laptop kernel: [ 2.176637] ACPI: CPU0 (power states: C1[C1] C2[C2] C3[C3])
Nov 22 22:20:01 sibs-laptop kernel: [ 2.176673] processor LNXCPU:00: registered as cooling_device0
Nov 22 22:20:01 sibs-laptop kernel: [ 2.176678] ACPI: Processor [CPU0] (supports 8 throttling states)
Nov 22 22:20:01 sibs-laptop kernel: [ 2.177133] ACPI: SSDT 7f7b8300 000C8 (v01 PmRef Cpu1Ist 00003000 INTL 20051117)
Nov 22 22:20:01 sibs-laptop kernel: [ 2.177774] ACPI: SSDT 7f7b8620 00085 (v01 PmRef Cpu1Cst 00003000 INTL 20051117)
Nov 22 22:20:01 sibs-laptop kernel: [ 2.179187] ACPI: CPU1 (power states: C1[C1] C2[C2] C3[C3])
Nov 22 22:20:01 sibs-laptop kernel: [ 2.179218] processor LNXCPU:01: registered as cooling_device1
Nov 22 22:20:01 sibs-laptop kernel: [ 2.179223] ACPI: Processor [CPU1] (supports 8 throttling states)
Nov 22 22:20:01 sibs-laptop kernel: [ 2.185768] ACPI Warning: \_TZ_.THRM._PSL: Return Package type mismatch at index 0 - found Processor, expected Reference 20090521 nspredef-946
Nov 22 22:20:01 sibs-laptop kernel: [ 2.185779] ACPI: Expecting a [Reference] package element, found type C
Nov 22 22:20:01 sibs-laptop kernel: [ 2.185782] ACPI: Invalid passive threshold
Nov 22 22:20:01 sibs-laptop kernel: [ 2.187887] thermal LNXTHERM:01: registered as thermal_zone0
Nov 22 22:20:01 sibs-laptop kernel: [ 2.187899] ACPI: Thermal Zone [THRM] (91 C)
Nov 22 22:20:01 sibs-laptop kernel: [ 2.187975] isapnp: Scanning for PnP cards...
======================

Nov 22 22:25:12 sibs-laptop kernel: [ 329.573161] CPU1: Temperature above threshold, cpu clock throttled (total events = 196)
Nov 22 22:25:12 sibs-laptop kernel: [ 329.574131] CPU1: Temperature/speed normal
Nov 22 22:27:02 sibs-laptop kernel: [ 439.606851] Critical temperature reached (127 C), shut...
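
The stamps in square brackets make it possible to measure how long the machine survived between the last throttle message and the critical shutdown. A small sketch using the last three lines of the excerpt (the function name is hypothetical, and the final line is kept truncated exactly as it appears above):

```python
import re

LOG = """\
Nov 22 22:25:12 sibs-laptop kernel: [ 329.573161] CPU1: Temperature above threshold, cpu clock throttled (total events = 196)
Nov 22 22:25:12 sibs-laptop kernel: [ 329.574131] CPU1: Temperature/speed normal
Nov 22 22:27:02 sibs-laptop kernel: [ 439.606851] Critical temperature reached (127 C), shut...
"""

# Captures the uptime stamp and the rest of the message.
STAMP = re.compile(r"\[\s*(\d+\.\d+)\]\s+(.*)")

def seconds_to_shutdown(log_text):
    """Return (critical_temp_C, seconds since the last throttle message), or None."""
    last_throttle = None
    for line in log_text.splitlines():
        match = STAMP.search(line)
        if not match:
            continue
        stamp, message = float(match.group(1)), match.group(2)
        if "Temperature above threshold" in message:
            last_throttle = stamp
        critical = re.search(r"Critical temperature reached \((\d+) C\)", message)
        if critical and last_throttle is not None:
            return int(critical.group(1)), stamp - last_throttle
    return None

temp, delay = seconds_to_shutdown(LOG)
print("critical at %d C, %.0f s after the last throttle message" % (temp, delay))
```

In this excerpt the critical message follows the last throttle message by about 110 seconds, with no further throttle events logged in between.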

nexus (bugie) wrote :

I am going through the same problem as sibs since I upgraded to karmic. Same symptoms: fan at high speed after a few minutes, critical temperature under full load, e.g. encoding with Avidemux, shutdown. Last night I had this problem although the CPU was throttled down to 800MHz instead of the maximum 2.2GHz!

Strange thing is that sometimes the log says "Critical temperature reached (127 C)" and sometimes "...(105 C)".

This is definitely a problem which came with karmic and not a dust-in-the-fan issue, because it occurred from one day to the next.

Please tell me what information you need to solve this very annoying problem.

Kind regards,
Florian

nexus (bugie) wrote :

Maybe importance should be higher than "Medium" if problems like "microphone does not work" (Bug #443089) are "Critical".

Rolf Leggewie (r0lf) wrote :

don't worry too much about importance level (but I did set the other report to medium as well).

What you guys can do (and I should have done it already) is try with the so-called mainline kernels from http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/ Should the problem occur with that kernel as well, please report it to http://bugzilla.kernel.org/ and add information about that bug report here. For me, the problem does not always occur and thus I have not tested the mainline kernels extensively for this problem. Help from one of you guys is greatly appreciated.

Please take a look at https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies as well.

Hi guys, I would like to contribute a couple of details:

1) Normally both "acpi -V" and "sensors" output a temperature of 40.0 C all the session long, even when the computer is obviously overheating because the fan doesn't spin at all.
2) When due to this overheating the computer automatically shuts down and I immediately start it again, then the fan spins all the time, and "acpi -V" and "sensors" output 75.0 C of temperature, also constant for all the session.

It looks like the temperature is read only once at startup, but it's also strange that it can only have two values, 40 and 75. By the way my "critical" temperature is 110 C.

I have been suffering these critical shutdowns since I updated to Karmic, but I was already suspicious of the fan not working, or not working much, and some overheating in Jaunty. Maybe then the overheating didn't get to shutdown just because I couldn't set extra graphic effects in Jaunty (those jelly windows, you know!! xD) due to another drivers bug.

I think this is a serious issue: how can I recommend Ubuntu/Linux to anyone now if I have to warn them about possible overheating???
Well sorry but I had to say this xD

Joan

nexus (bugie) wrote :

I tried current mainline kernel. Yesterday I encoded with avidemux which resulted in a critical temperature shutdown, too.

I opened a bug report at kernel.org: http://bugzilla.kernel.org/show_bug.cgi?id=14695

I forgot to say that I had also tried the mainline kernel, without any better results, before posting here.

Also, "acpi -V" and "sensors" did output a constant 0.0C (and fan not running) in the last session (before overheating and shutting down again). This was with the standard kernel. Now it's 75 (and fan running) with mainline.

The 0.0C session was the first boot in about the last 10 hours, so the machine was cold but obviously not at 0.0 C. It looks like extremely coarse rounding, as if the 0 value was chosen because the real temperature was at that time closer to 0C than to 40C, which seems possible to me.

So, I propose two hypotheses:

1) The temperature is read only once, at startup
2) The result is rounded slightly too much: there are no possible values other than 0, 40 and 75 C.

Maybe the second issue is rather a hardware issue and it may vary on other computers, but it wouldn't be a problem if the temperature was read periodically, as the fan spins when the "rounded result" is 75. So I think the main problem is: the temperature is only read once at startup.

Is there a thermal control daemon not working or missing in this kernel?
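
The first hypothesis could be checked by sampling the reported temperature over a session and seeing whether it ever changes. A minimal sketch, assuming the file holds a single line such as "temperature:             40 C" (the helper names are invented, and the THRM path is the one mentioned earlier in this report, so it may differ on other machines):

```python
import re
import time

def parse_temperature(text):
    """Extract the integer from a line such as 'temperature:             40 C'."""
    match = re.search(r"(-?\d+)\s*C", text)
    return int(match.group(1)) if match else None

def sample(read, count=5, interval=0.0):
    """Call read() `count` times, `interval` seconds apart, and collect the values."""
    values = []
    for _ in range(count):
        values.append(parse_temperature(read()))
        time.sleep(interval)
    return values

def looks_stuck(values):
    """True if every sample is identical, which would fit hypothesis 1."""
    return len(set(values)) == 1

def read_proc(path="/proc/acpi/thermal_zone/THRM/temperature"):
    """Reader for the procfs file; the zone name is an assumption."""
    with open(path) as handle:
        return handle.read()

# On an affected machine one might run, under load:
#   print(looks_stuck(sample(read_proc, count=60, interval=10)))
```

A run that stays identical for ten minutes under load would support the "read only once" theory, while a run that only ever shows 0, 40 or 75 would point at the coarse rounding instead.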

Rolf Leggewie (r0lf) on 2009-12-17
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux:
status: Unknown → Incomplete
Mark (mark-wege) wrote :

Is there any workaround for this bug yet? My thinkpad R50e is nearly unusable at the moment. It runs fine with Windows still, but using Kubuntu Karmic with all the latest updates it shuts down after some time without any warning, resulting in data loss. Is there any way to turn thermal control off?

nexus (bugie) wrote :

You could set a lower threshold:
http://bugzilla.kernel.org/show_bug.cgi?id=14695#c9

Please also have a look at comment #13.

Regards,
Florian

Mark (mark-wege) wrote :

BTW: I have installed Sidux Linux (which is Debian based too) as a parallel installation. It runs with Kernel 2.6.32 and so far there are no problems. Touching the case, the system is definitely much cooler even after running a full day than when running Karmic. I do not know what the reason for this is, the newer kernel or a different setup. But maybe this information helps to find the bug.

nexus (bugie) wrote :

Unfortunately Sidux uses a modified kernel as far as I know. So there are multiple possibilities for where the bug could be:

a) Vanilla kernel (fixed by a Sidux-specific patch)
b) Ubuntu setup (no kernel bug)

Mark (mark-wege) wrote :

If this could help to rule out one of the possibilities, I could install a pure Debian kernel on Sidux (which should work, since Sidux is Debian apart from the kernel and some additional tools). For that I would need some instructions on which kernel to pick and how to install and remove it again (preferably I would like to install a *.deb package).

Rolf Leggewie (r0lf) wrote :

Mark, just use one of the Ubuntu unpatched mainline kernels that I mentioned previously. If you need further help catch me in IRC or Jabber.

jstump209 (jstump209) wrote :

I was having the random shutdowns and losing my wireless connection too. i just updated my acer aspire 5610 laptop bios to v3.6 and it all seems to be working fine now.

nexus (bugie) wrote :

Hi, who is the manufacturer of acer aspire bios?

This is what I also tried to do, but with less luck. Following some forum threads, I downloaded the latest BIOS updater from the Acer website itself, which contains a program by a company called InsydeBios, to be launched from Windows or from DOS. I launched the Windows executable (from Windows Vista) and now I have no laptop anymore.

My BIOS became completely corrupted, so that it didn't even react to the 'boot' button or anything; it was like a laptop-shaped stone. And I say "didn't" and "was" because I took it all to pieces trying to find the BIOS chip and replace it, but I can't find it! Maybe this is related to the so "innovative" design of that InsydeBios thing, which may not exactly be a BIOS chip, but I don't understand exactly what it is or what to do.

Well, if someone can help with this, it would be great. For now, I'll give two warnings I should have followed:

BE VERY CAREFUL and DON'T FLASH YOUR BIOS FROM WINDOWS.

Salut

Rolf Leggewie (r0lf) wrote :

Joan, sorry to hear about the problems you are having. But everybody, please let's not pollute this ticket with off-topic remarks. This has nothing to do with BIOS updates, earlier versions of the kernel were running fine. This ticket is about tracking down and fixing a regression in the kernel.

jstump209 (jstump209) wrote :

Just an update. As i stated earlier, i updated my bios on my acer laptop and have had NO problems at all for the past 4 days!

Mark (mark-wege) wrote :

With help from Rolf, I have installed the Kubuntu Vanilla Kernel 2.6.33-999 from today on Sidux. It has been running nearly 4 hours (on rather high system load) now, which is a very long time compared to the maximum of 1 hour when I was running Karmic. I am not a technical expert, but as I understand this, it means that the problem is not a kernel bug (or it has been fixed within the last weeks since others tested the Vanilla Kernel on Karmic) and rather a problem of configuration and settings.
What would be the further steps to localise the problem? Maybe someone else who has the problem could first confirm what I have done: install Sidux in parallel and see if the problem occurs there ...
What settings/configs/debugs could be interesting?

Mark (mark-wege) wrote :

Just a short update: The system has now been running for more than a day using Sidux and the Ubuntu vanilla kernel. No unexpected shutdown so far.

Mark (mark-wege) wrote :

I have made some progress with (K)Ubuntu. Unfortunately I do not know what exactly did the trick. First of all I noticed that the CPU temperature drops significantly (15-20 degrees) when I detach the battery. Then I also did some updates and activated Thinkpad fan control as it is described here:
http://www.thinkwiki.org/wiki/How_to_control_fan_speed
I did this after I read in a forum that Thinkpad fan control is disabled in Karmic. I just set it to automatic.
The result: No sudden shutdowns in the last day. But: The fan is on most of the time, even though it appears that the behaviour has changed. The temperature sometimes still goes up pretty high, nearly 90 degrees, which is near the 97 where I experienced the shutdowns. Most of the time it is significantly lower, around 60, which I consider kind of normal. Detaching the battery keeps the temperature below 80 degrees.
That is progress at least, although the fan noise has definitely increased since Jaunty. I hope this will be fixed. Sorry that I cannot tell exactly what did the trick. But maybe others can find out more.

Changed in linux:
status: Incomplete → Confirmed
Changed in linux:
status: Confirmed → Incomplete
Flávio Etrusco (etrusco) wrote :

Any differences if you use "acpi.power_nocheck=1 acpi_osi=linux" in the kernel boot parameters?

nexus (bugie) wrote :

Unfortunately there are no differences with this params.

Mark (mark-wege) wrote :

With the recent kernel-update to 2.6.31-19-generic the problem has resurfaced. When I boot with the previous version I have no shutdowns, only a very busy fan.

jstump209 (jstump209) wrote :

I just got an acer aspire 7540. with bios version 1.7 i was getting kicked off my wireless network constantly. just updated to 1.8 and am not having any problems.

Mark (mark-wege) wrote :

I have found this bug https://bugs.launchpad.net/ubuntu/+bug/457100
and I wonder if it might be related to our bug and point to the reason for the problems with thermal control. I found it because I found these messages
[ 16.120024] Clocksource tsc unstable (delta = -69562152 ns)
2010-02-20 11:20:15 padlock VIA PadLock not detected.
and wanted to find out what they mean. Reading this report, it seems like the clocksource has something to do with ACPI. Unfortunately the descriptions of what happens are vague (logout, sudden crash) and there are no references to thermal control. But it seems similar.

Does anyone else have these messages in the kernel log?

Rolf Leggewie (r0lf) wrote :

Mark, interesting find!

Indeed, I do have some of these messages and I have noticed in conky in the past some funky switching of the CPU frequency always right before this overheating shutdown occurred.

Feb 19 19:40:11 X24 kernel: [ 1.196034] Clocksource tsc unstable (delta = -283962795 ns)

Paul C. Bryan (pbryan) wrote :

From one of my systems:
[97552.000601] Clocksource tsc unstable (delta = -351842000 ns)

Mark (mark-wege) wrote :

I can furthermore add that this message does not appear when running Sidux. Unfortunately I have removed the Ubuntu kernel from Sidux again, so I cannot say whether it also does not appear when the Ubuntu kernel is running in the Sidux environment. I do not know if I will have time to test this again in the next weeks, but maybe someone who knows what exactly this means is able to determine this otherwise.

Rolf Leggewie (r0lf) wrote :

http://kerneltrap.org/node/8306 says this message is harmless

Mark (mark-wege) wrote :

You might be right. I have bad news which may be turned into good news: the bug has found its way into Debian. I now have the same symptoms with Debian sid based on Sidux as with Kubuntu. The bad news is for me, since I do not have any reliable installation anymore.
But maybe this can help to determine the cause.

Since the bug appeared I have tried to boot with older Sidux kernels. All of them were affected now. I am certain that this was not the case before. If this is true, it means that whatever triggers the bug does not come with the kernel itself, but with another package which was updated. I would also assume that the package which causes the bug was at an older version than in Ubuntu Karmic and has recently been updated. Or Debian has accepted a patch from Ubuntu for that package.

I have had a look in /var/cache/apt/archives to see if there are obvious candidates for that. But unfortunately my knowledge is not good enough for that, so maybe it would be better if someone else had a look. Unfortunately I do not know how to generate a list of the recently updated packages. Can somebody tell me how to do this?

Rolf Leggewie (r0lf) wrote :

If you use aptitude for package management (which IMHO is superior to apt-get), you may want to have a look at /var/log/aptitude. I'm sure the other package management tools keep similar logs somewhere.
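
For systems without aptitude, dpkg itself keeps a log in /var/log/dpkg.log, and its "upgrade" lines can be filtered by date. The following is only a sketch: the sample entries and package names are made up for illustration, and the function name is hypothetical.

```python
# Made-up sample in the dpkg.log layout: date, time, action, package, old version, new version.
SAMPLE_LOG = """\
2010-02-19 09:00:01 status installed example-tool 1.0-1
2010-02-20 11:20:15 upgrade example-kernel 1.0-1 1.0-2
2010-02-21 08:05:42 upgrade example-daemon 2.3-1 2.4-1
"""

def recent_upgrades(log_text, since):
    """Return (date, package, old, new) for 'upgrade' lines dated on or after `since`.

    `since` is a YYYY-MM-DD string; ISO dates compare correctly as plain strings.
    """
    found = []
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[2] == "upgrade" and fields[0] >= since:
            found.append((fields[0], fields[3], fields[4], fields[5]))
    return found

for date, package, old, new in recent_upgrades(SAMPLE_LOG, "2010-02-20"):
    print(date, package, old, "->", new)

# On a real system the text would come from the log file instead:
#   with open("/var/log/dpkg.log") as handle:
#       upgrades = recent_upgrades(handle.read(), "2010-02-20")
```

Comparing such a list from the days just before the shutdowns started might narrow down which package update introduced the problem.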

Mikkel Munch Mortensen (3xm) wrote :

I've pushed Ubuntu to a friend, but after upgrading her laptop to Karmic she has been experiencing this too.

The fan is working at full speed and the CPU is getting very hot. The result is that, after a while, when the CPU temperature reaches somewhere above 90 degrees Celsius, the system "freezes" (I can't help thinking that's a bad word for something actually related to overheating). Then, after a forced shutdown (holding down the power button), it needs to cool down for a while before we're able to turn it on again.

As a temporary workaround I've told her to blow into the heatsink hole for a while, to lower the temperature when it's getting close to the critical point.

Removing the battery doesn't seem to make any difference.

It's an Acer Aspire-something. I don't remember the specs, but if I can be of any help, please tell me what you need to know.

Rolf Leggewie (r0lf) wrote :

Mikkel, sorry to hear about your experiences. A couple of suggestions to help you alleviate the worst problems.

a) run a lucid kernel (just download and install)
b) as mentioned in the upstream kernel bug, add thermal.psv=80 to kernel boot parameters (-> grub)
c) lower the CPU frequency (use ondemand or even powersave)
d) if running ondemand, look at top to understand what is using CPU and do "sudo renice 10 -p `pidof $app`" for that app (10 being just an example niceness value)

That should get the computer back to usable. Would be nice to hear about your experiences and your data in http://bugzilla.kernel.org/show_bug.cgi?id=14695
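
For suggestion b), one possible way to add the parameter is sketched below. It assumes a GRUB 2 setup where the kernel command line lives in a GRUB_CMDLINE_LINUX_DEFAULT="..." line in /etc/default/grub; machines upgraded from older releases may still use GRUB legacy and /boot/grub/menu.lst, where this does not apply. The function name is invented for illustration.

```python
def add_kernel_param(line, param):
    """Append `param` inside the quotes of a GRUB_CMDLINE_LINUX_DEFAULT="..." line.

    Any other line is returned unchanged, and a parameter that is already
    present is not added a second time.
    """
    key, sep, value = line.partition("=")
    if key.strip() != "GRUB_CMDLINE_LINUX_DEFAULT" or not sep:
        return line
    params = value.strip().strip('"').split()
    if param not in params:
        params.append(param)
    return '%s="%s"' % (key.strip(), " ".join(params))

before = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"'
print(add_kernel_param(before, "thermal.psv=80"))
```

After editing the real file by hand or by script, "sudo update-grub" has to be run and the machine rebooted before the new passive trip point can take effect.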

Mikkel Munch Mortensen (3xm) wrote :

Rolf, thanks for taking your time.

a) How do I download a Lucid kernel for Karmic? Or do you mean I should try installing an entire Lucid on her laptop? I'm already using Lucid on my own laptop, but don't want to bother her with the possible problems of alpha releases (although it may be better than overheating).

b) I'll try that next time I get close to her computer.

c, d) I already tried both, but that doesn't seem to make any difference (I settled on ondemand). It's a 2GHz CPU, but the heating didn't seem to stop even when idling at 500MHz. I also checked top but no programs seemed to be consuming huge amounts of CPU. Firefox was at the top with about 7-8%, as far as I remember. I'll check up on that. But then again: The temperature was at 90+ and the fan running wild even when the CPU scaled down to 500MHz.

I'll get back with some results as soon as possible and comment in the kernel.org bug.

Rolf Leggewie (r0lf) wrote :

Mikkel, if you don't know how to install only a lucid kernel, maybe you shouldn't be running lucid just yet ;-) Running the development release very often requires more than this skill. Assuming you are running 32-bit, you can install the kernel like "cd /tmp ; wget http://de.archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-image-2.6.32-14-386_2.6.32-14.20_i386.deb ; sudo dpkg -i linux-image-2.6.32-14-386_2.6.32-14.20_i386.deb"

If your computer is overheating even when idling, you may be seeing a different issue than me. Try out one of the live CDs, including those from earlier releases to see if the problems arise there as well. Are you sure nothing is blocking free air flow? Please open an item in the answer tracker and subscribe me to it so as not to clutter this ticket. To me it sounds like you may be experiencing something else.

dgoosens (dgoosens) wrote :

same here...
since the kernel update (2.6.31-19) on my Karmic 64-bit my computer sometimes shuts down

kern.log shows:
Critical temperature reached (102 C), shutting down.

I never encountered this issue prior to the kernel update...
So I wonder when the next kernel will be released...

also, what would be the kind of trouble I might encounter if I use the prior version of the kernel
(2.6.31-17 I believe) ?

dgoosens (dgoosens) wrote :

hi !

just updated my Ubuntu with the latest kernel (2.6.31-20) and it directly appears that the temperature is lower...
Now I am hoping this will remain like this when asking a lot of the processor...

anybody else experiencing this ?

Dr D J Clark (djc-online) wrote :

Updated to 2.6.31-20 on a Lenovo IdeaPad S9e. There were no problems with the previous kernel, but now the fan is running on full power all the time and the battery life is reduced by 30%.

Changed in linux:
status: Incomplete → Expired
Changed in linux:
importance: Unknown → Medium

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Triaged → Won't Fix