Thermald is totally broken, or its default configuration is

Bug #1600599 reported by teo1978 on 2016-07-10
This bug affects 7 people
Affects: thermald (Ubuntu)
Importance: Critical
Assigned to: Unassigned

Bug Description

**WORKAROUND**: shut down the thermald process completely. If your computer has an actual physical cooling fan and it's fully functional, you don't need thermald at all.

I have Ubuntu 15.10 up to date with automatic updates and I never touched thermald configuration.
This is on a laptop, which has an actual physical cooling fan (like most laptops).

EXPECTED BEHAVIOR:
As the CPU temperature increases, the fan should spin faster to keep the temperature from getting too high. ONLY IF the temperature keeps growing even with the fan at or near its full capacity should powerclamp and similar mechanisms trigger, throttling the CPU so that it doesn't burn (or shut down abruptly). This kind of CPU throttling should also come in gradually, as needed. That is, if you inject idle processes, you should inject just the minimum amount that is needed. For example, if the fan at its maximum speed is *almost* enough to keep the temperature below the threshold, but not *quite* enough, injecting just a small amount of idle time into the CPU should be enough to do that extra bit of cooling. You would barely notice it. It would not slow your system down a lot, unless the heating is *way* higher than the fan alone can fight.
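
The ordering described above can be sketched as a small decision function. This is only an illustration of the expected policy, not thermald's actual algorithm: the function name choose_cooling_action, its arguments, and the 95% fan cutoff are all assumptions made up for the example.

```shell
#!/bin/sh
# Hypothetical policy sketch: prefer the fan, throttle only as a last resort.
# Arguments: current temperature and threshold in millidegrees C,
# and the current fan duty in percent of its maximum.
choose_cooling_action() {
    temp_mc=$1
    limit_mc=$2
    fan_pct=$3

    if [ "$temp_mc" -lt "$limit_mc" ]; then
        echo "none"            # below the threshold: nothing to do
    elif [ "$fan_pct" -lt 95 ]; then
        echo "raise_fan"       # the fan still has headroom: use it first
    else
        echo "throttle_cpu"    # the fan is (nearly) maxed out: throttle, minimally
    fi
}
```

For example, `choose_cooling_action 88000 85000 50` prints `raise_fan`, which is the case this report is about: the temperature is over the threshold while the fan is at half speed.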

On a fully functional system (where the fan is enough to prevent the CPU from overheating and/or heavy CPU usage does not go on for a long time), you shouldn't notice any difference after shutting down thermald completely. Only on a system where the fan is not fully functional and/or hugely excessive CPU usage goes on for too long (actually, if the latter alone is enough to make it happen, it means that the fan is undersized) would you notice the difference between having thermald (powerclamp and other CPU throttling mechanisms would kick in and prevent the temperature from becoming critical) and not having it (the temperature would eventually go critical and something bad would happen, such as a sudden shutdown).

OBSERVED BEHAVIOR
When CPU temperature becomes high due to relatively high (not huge) CPU consumption, intel_powerclamp starts to kick in, injecting idle processes and crippling the whole system. The observable result is that the system becomes unresponsive and unusable, yet the physical fan is spinning at roughly HALF of its maximum speed. So, you have a fast quad core machine, with a cooling fan that is perfectly capable of keeping the temperature down while using all the computing power that you require, BUT since powerclamp and things like that kick in too soon, you are limited to using a tiny fraction of the power your machine is capable of.
To put it another way: you can't watch a f***ing youtube video in full screen because the whole system will become unresponsive.
Even after removing and blacklisting the intel_powerclamp and intel_rapl kernel modules, the apparent behavior was practically the same, except that I wouldn't observe the "kidle_inject" processes by running "top". I guess there are other CPU-throttling mechanisms besides powerclamp and rapl.
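
A quick way to check for idle injection without watching top is to count the injection threads in a process listing. This is a minimal sketch: the helper name count_kidle_inject is made up, and the prefix "kidle_inj" is an assumption based on the thread name seen in top above (the exact name may differ between kernel versions).

```shell
#!/bin/sh
# Count idle-injection kernel threads in a list of process names read from stdin.
# Matches names starting with "kidle_inj", e.g. "kidle_inject/0".
count_kidle_inject() {
    # grep -c prints 0 but exits non-zero when nothing matches; ignore that.
    grep -c '^kidle_inj' || true
}
```

Feed it process names, for example `ps -eo comm | count_kidle_inject`. A result of 0 while the system is still being slowed down suggests that some other throttling mechanism is at work.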

So now I have SHUT DOWN THERMALD completely, and my system behaves NORMALLY. The fan, of course, reaches higher speeds. Not even _much_ higher, which means that it needed just a little bit more speed to keep up with the heating. Powerclamp and other cpu throttling mechanisms were kicking in WAY too soon.

It took me quite a long time to figure out that this was the problem. I just assumed that some bug was causing excessive CPU consumption for trivial stuff such as playing video (which is actually true but is not the whole story) and that the CPU consumption actually was causing too much heat for the fan to dissipate, making it necessary for powerclamp to kick in. Also, I thought my fan was probably filled with dust and incapable of doing its job efficiently (which is also true but is not the whole story).

Until I realised that when I was observing unresponsiveness, the fan was not even close to its maximum speed.

CONCLUSION: either thermald does a ridiculously bad job, or its default configuration is ridiculously bad.

NOTE: this issue is **CRITICAL**: this cripples the whole system making it unresponsive when doing moderately heavy work (which the system would otherwise be perfectly capable of handling without overheating).
Most non-geek users don't even know what thermald is and will never find out that they can work around the problem by shutting it down, let alone fix its configuration. So, for most users, this "renders the system temporarily or permanently unusable", which is one of the criteria for critical importance.

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: thermald 1.4.3-5ubuntu2
ProcVersionSignature: Ubuntu 4.2.0-41.48-generic 4.2.8-ckt11
Uname: Linux 4.2.0-41-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 2.19.1-0ubuntu5
Architecture: amd64
CurrentDesktop: Unity
Date: Sun Jul 10 14:03:56 2016
InstallationDate: Installed on 2013-10-11 (1002 days ago)
InstallationMedia: Ubuntu 13.04 "Raring Ringtail" - Release amd64 (20130424)
SourcePackage: thermald
UpgradeStatus: Upgraded to wily on 2016-01-18 (174 days ago)

dino99 (9d9) wrote :

that version is reaching End Of Life in a few days; and that thermald issue has been fixed with the other releases.

https://wiki.ubuntu.com/Releases
https://bugs.launchpad.net/ubuntu/wily/+source/thermald/+bug/1543046

Changed in thermald (Ubuntu):
status: New → Invalid
teo1978 (teo8976) wrote :

That doesn't seem the same issue at all.

teo1978 (teo8976) wrote :

Actually, that is obviously not the same issue, since I disabled rapl before I disabled thermald, and I stopped observing the kernel log spam, but was still observing this issue.

So unless this is a duplicate of another one, this is not invalid, and the fact that EOL is near doesn't make it any more so. Feel free to change the status to "Won't fix" once EOL is actually reached.

Changed in thermald (Ubuntu):
status: Invalid → New
dino99 (9d9) wrote :

Be sure no one will glance at that issue; please read the report's comment posted above, and try to understand.

Changed in thermald (Ubuntu):
status: New → Invalid
teo1978 (teo8976) wrote :

> please read the report's comment posted above, and try to understand.

I wonder if you have read mine, and the issue description in the first place.

This has NOTHING TO DO with 1543046

Colin Ian King (colin-king) wrote :

A few questions:

1. What hardware is this being run on?
2. You mentioned you disabled RAPL? Can you describe what you mean by that and how?

Thanks.

Colin Ian King (colin-king) wrote :

@teo1978, I am supposing that from your actions in the comment #1 that you are no longer using thermald and therefore bug 1543046 is no longer going to be tested by you?

teo1978 (teo8976) wrote :

> 1. What hardware is this being run on?

An Acer Aspire V3-571G which has an Intel Core i7-3632QM

> 2. You mentioned you disabled RAPL? Can you describe what you mean by that and how?

I created a file /etc/modprobe.d/blacklist-power.conf with these lines:
  blacklist intel_powerclamp
  blacklist intel_rapl

and I rebooted. I found that on StackExchange by googling "disable intel rapl".
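
For anyone repeating this, a blacklist entry can be checked with a small helper like the one below. It is only a sketch: the function name is_blacklisted is made up, and it looks only for plain "blacklist <module>" lines such as the two shown above.

```shell
#!/bin/sh
# Return success if MODULE has a "blacklist" line in any *.conf file under DIR.
# DIR defaults to /etc/modprobe.d, where the file above was created.
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    grep -qsE "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" "$dir"/*.conf
}
```

For example, `is_blacklisted intel_rapl && echo yes`. Note that a blacklist entry only stops the module from being loaded automatically at the next boot; whether it is loaded right now is shown by lsmod.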

After that (and before I shut down thermald completely) I stopped experiencing bug 1543046 as per the test mentioned in comment 31 (repeated while observing system unresponsiveness)

> I am supposing that from your actions in the comment #1 that you are no longer using thermald
> and therefore bug 1543046 is no longer going to be tested by you?

Well, for the moment I have only shut it down by "service thermald stop" and I haven't rebooted since. So I may try the fix some day. (To do that, I would first have to reenable the blacklisted kernel modules, because otherwise I don't observe the symptoms of that issue even with the unfixed thermald version.)

Since the pain is gone by just shutting down thermald, it's not high-priority for me to try that.

I had thought that bug 1543046 could be the cause of the general system slowdown because people in the Quality mailing list commented it was the cause of 1593468, but it seems to me that it's clearly the other way round:
1593468 causes higher-than-normal CPU consumption when playing video; this issue (1600599) causes powerclamp and rapl to kick in much sooner than they should, and this triggers (not "causes" but "triggers") issue 1543046. That's the only correlation between the three issues that makes sense to me. (note that they remain 3 separate issues)

I am interested in seeing a run with loglevel=debug. They kick in at 10C below TJMAX, and the kernel can shut you down at any moment, because the temperature can swing 5-10C immediately.
If someone doesn't care about the life of the system or about shutdowns, they can raise this setting with a one-time dbus message:

dbus-send --system --dest=org.freedesktop.thermald /org/freedesktop/thermald org.freedesktop.thermald.SetUserMaxTemperature string:cpu uint32:<your temp in millidegrees C>

dbus-send --system --dest=org.freedesktop.thermald /org/freedesktop/thermald org.freedesktop.thermald.SetUserPassiveTemperature string:cpu uint32:<your temp in millidegrees C>
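
Since the value has to be given in millidegrees, a small wrapper can help avoid passing 85 where 85000 is meant. This is a sketch only: the helper names celsius_to_millideg and thermald_temp_cmd are made up, and the command is printed for review rather than executed. The method names and argument layout are taken from the two commands above.

```shell
#!/bin/sh
# Convert whole degrees Celsius to millidegrees.
celsius_to_millideg() {
    echo $(( $1 * 1000 ))
}

# Print (do not run) the dbus-send command for a given method and temperature.
# METHOD is SetUserMaxTemperature or SetUserPassiveTemperature, as quoted above.
thermald_temp_cmd() {
    method=$1
    millideg=$(celsius_to_millideg "$2")
    echo "dbus-send --system --dest=org.freedesktop.thermald" \
         "/org/freedesktop/thermald org.freedesktop.thermald.$method" \
         "string:cpu uint32:$millideg"
}
```

For example, `thermald_temp_cmd SetUserPassiveTemperature 90` prints the command with uint32:90000. Which temperature is safe for a given CPU is a separate question that this helper does not answer.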

teo1978 (teo8976) wrote :

I'm not sure I understand what you mean by this:

> They kick in at 10C below TJMAX, and the kernel can shut you down at any
> moment, because the temperature can swing 5-10C immediately.
> If someone doesn't care about the life of the system or about shutdowns, they can raise
> this setting with a one-time dbus message:

Are you implying that the behavior I observe is the expected one?
It is not.

With thermald running (with its default configuration): the system slows down becoming completely unusable when the cooling fan has barely reached half its maximum speed.

With thermald not running: my system works perfectly, nothing bad happens, the fan spins noticeably (not a lot) faster but there's still PLENTY of margin before it reaches its maximum speed.

So, cpu throttling is starting WAY too soon, unnecessarily rendering the system unusable, when the physical fan alone is more than enough to keep the temperature down.

It's not that I "don't care about life or the system or shutdown", it's that thermald is limiting CPU when there's not even a remote risk of any of that.
Either thermald is not working or its default configuration is ridiculously wrong.

This is up to you what you want to do with your system. Don't judge system heat with Fan speed. If you want this debugged, provide logs with the log level suggested.

teo1978 (teo8976) wrote :

> This is up to you what you want to do with your system.

Yet the current behavior is objectively wrong. The system is becoming unusable because of the amount of CPU throttling, and this is totally avoidable, so it's taking a suboptimal decision.

> Don't judge system heat with Fan speed.

I'm not judging system heat with Fan speed. I'm judging what I can observe: fan speed and CPU throttling.

If system performance degrades and fan speed never goes anywhere near its full capacity, it means one of two things (theoretically):
a) thermald could use more fan power and reduce or even eliminate the need for CPU throttling. Hence it's making the wrong decision
b) thermald is actually right: using more fan alone would be too risky as the temperature could increase

Now (b) already makes little sense, because if that was the case, then the fan should already be close to its maximum speed (by definition, if it's not at its full capacity there's unused cooling power there). But let's say for the sake of argument that temperature can vary too quickly and the cooling effect of the fan takes time, so that would be risky.

So I run this little experiment: I shut down thermald and keep CPU consumption steadily high as it was; actually, I increase it by playing half a dozen youtube videos at once (when one alone was enough to make the system unusable with thermald running).

If your hypothesis (b) were right, then now I would by definition be in a situation where the temperature cannot be kept safe (otherwise there would be no reason why thermald was throttling the CPU in the first place). Hence two things should happen: first, the fan speed actually should go crazy, because it is now the ONLY cooling device available. Then, at a certain point the system would shut down (or some internal hardware protection, if it exists, would slow down the CPU itself and I'd observe a system slowdown similar to that caused by powerclamp).

NEITHER happens: the fan speed does go up a little bit, but not much. While I cannot deduce temperature from that, I can tell for sure that the temperature is not steadily growing if the fan speed isn't, and that the temperature is not critical if the fan is not at its maximum speed.

I can, and I will, run all the debugs and provide the logs, so you can put numbers to all of this, and that will certainly be valuable information to fix the issue. But the facts are already there to prove that the current behavior is WRONG, you don't need any log for that.

Kwang Moo Yi (kwang-m-yi) wrote :

I too find it worse with thermald on my Macbook pro. With thermald, the frequency clipping is too severe, and fans stay below 50%. I think the fans should still try to increase while cpu frequency clipping is happening by default.

In my case, without thermald, the CPU temp goes about 5--10 deg higher at very high loads, but it still seems fine as the fans quickly kick in, and at worst the CPU frequency clips to prevent extreme temperatures.

tags: added: xenial

To really debug this, we need logs. There may be a number of causes if the temperature is not really high.
There may be wrong power limits configured in the RAPL registers. So it is better to debug this properly.

# systemctl stop thermald
# thermald --no-daemon --loglevel=debug

Same here on a MacBook Pro with thermald 1.5. It took forever to figure out that thermald was the culprit. Unfortunately, thermald does not report its throttling activity to syslog on a default Mint 18.1 install, so there was nothing pointing in its direction. I don't want to stir up the debate about whether thermald's behaviour is right (perhaps there really are people who prefer a slow system over one that has its fans running), but it would at least have been nice if thermald informed users that it is crippling CPU speed. Then I would have simply disabled it immediately.

#!16

Kasper Peeters (kasper-peeters), at least you can provide logs by doing the following and then running whatever workloads you run on the MacBook:

# systemctl stop thermald
# thermald --no-daemon --loglevel=debug

While I have to agree with teo1978 that it did take a long time to find out why my performance was being compromised so heavily due to the lack of information that thermald provides to the log (involving a lot of red herrings and detours through pstate vs cpufreq, read-only frequency scaling properties, etc), I can report that cleaning out your fans with many shots of compressed air can really, really help avoid the problem of getting to the critical temperatures where performance must be seriously constrained.

Before my fan cleanup, "stress-ng --matrix 0" could cause my CPU to be reduced to the scaling_min_freq. After the cleanup, the frequency was between 60% and 80% of the max and better able to be controlled. (Sometimes it never goes back to scaling_max_freq until I reboot. I'm still not sure whose fault that is, though it only seems to happen when intel_pstate is included as a CoolingDevice in thermal-cpu-cdev-order.xml. I will try to reproduce it and get debug logs.)

Since I can't see my fan speeds even after running sensors-detect and loading all modules, I have them set to the max they can go all the time in my BIOS. The fans don't sound different when under high load or low load. stress-ng can send my temperature soaring to 98C without thermald but the fan speed sounds the same. So I can conclude that even with the fan at max, I still need the cooling techniques of thermald sometimes.

But seriously, it would have been so much easier to figure out what was going on if thermald wrote something, anything to the system log to note it was throttling CPU.

teo1978 (teo8976) on 2017-02-08
Changed in thermald (Ubuntu):
status: Invalid → Confirmed
teo1978 (teo8976) wrote :

I wonder why the status had been set to Invalid.

> cleaning out your fans [...] can really, really help avoid the problem
> of getting to the critical temperatures where performance must be
> seriously constrained.

That's not the point.
The point is that, because of this bug, you reach those temperatures even though the fan is capable of avoiding that.

If the CPU gets clamped (be it by reducing its frequency or by injecting idle processes or a combination of both and of whatever other method) AND the fan is not at its maximum speed, then something is wrong.

Proof (in case you don't see the obvious):

If the CPU is being clamped, then either the temperature is critically high, or it isn't. If the temperature were not critically high, then obviously there would be no reason to clamp the CPU in the first place, so that would obviously be wrong. So let's assume the CPU temperature is indeed high. Then either the fan can cool it down by spinning faster, or it can't. If it can, then it should, and then there would be no need to clamp the CPU. If it can't (spin faster) then it means it must be at its maximum speed.

This is under the assumption that you want the CPU to be as fast as possible. Obviously, this should be the default assumption if not explicitly specified otherwise in some system setting that the user can edit (e.g. a "power saving mode" where you may prefer longer battery life over best performance, or, if this was to be invented, a "silent mode", where you may prefer to keep fan noise low even at the expense of performance).

I agree with your reasoning teo, I was just mentioning that it didn't apply in my situation since I know my fans are at the max all the time. But what you're saying is true - thermald does reduce performance before I get near the max temp (as reported by 'sensors' high=80 and crit=98), sometimes by a lot, and sometimes it fails to increase the performance again after the CPU has cooled down (I just reproduced and will attach separately).

Perhaps it's just that the thermald default configuration is too conservative, and is kicking in when temp is within 10C of 'high' but not 'crit' (which seems a better choice). I may try to change the settings mentioned above (what is the difference between SetUserPassiveTemperature and SetUserMaxTemperature?).

started with thermald stopped:

pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 0 [ON]
    pstate::CPU_MIN -> 50% [1850000KHz]
    pstate::CPU_MAX -> 100% [3700000KHz]

start a lot of load, which sends my CPU up to crit temperature sometimes (but doesn't actually shut the machine down, so this number might be reported lower by BIOS):
stress-ng --matrix 0 -t 3m

starting thermald with log:
thermald --no-daemon --loglevel=debug > /root/thermald-debug-reduced.log 2>&1

thermald correctly reduces my performance (though it's by quite a lot) but keeps it very low even when CPU temp has been very much reduced; at this time stress-ng is still running:

mike@ossy /u/s/pstate-frequency> date; ./pstate-frequency -G; sensors
Wed Feb 8 20:50:05 EST 2017
pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 1 [OFF]
    pstate::CPU_MIN -> 43% [1600000KHz]
    pstate::CPU_MAX -> 50% [1850000KHz]
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +99.0°C)
temp2: +29.8°C (crit = +99.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +65.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +64.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +65.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +65.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +65.0°C (high = +80.0°C, crit = +98.0°C)
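
To compare readings like the ones above across runs, the package line can be reduced to a temperature and its distance from the 'high' and 'crit' limits. This is a rough sketch: the helper name package_margin is made up, and it assumes the line layout shown in the output above ("Package id 0:" followed by the temperature, then the high and crit limits).

```shell
#!/bin/sh
# Read `sensors` output on stdin and print the package temperature together
# with its distance to the "high" and "crit" limits, in whole degrees C.
package_margin() {
    awk '/^Package id 0:/ {
        # Blank out everything except digits, dots and spaces, so that the
        # remaining fields are: package id, temperature, high, crit.
        gsub(/[^0-9. ]/, " ")
        printf "temp=%d high_margin=%d crit_margin=%d\n", $2, $3 - $2, $4 - $2
    }'
}
```

Usage: `sensors | package_margin`. With the reading above this reports a package temperature of 65C, which is 15C below 'high' and 33C below 'crit'.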

stress-ng stopped for many minutes but performance still stuck on low:

mike@ossy /u/s/pstate-frequency> date; ./pstate-frequency -G; sensors
Wed Feb 8 20:53:26 EST 2017
pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 1 [OFF]
    pstate::CPU_MIN -> 43% [1600000KHz]
    pstate::CPU_MAX -> 50% [1850000KHz]
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +99.0°C)
temp2: +29.8°C (crit = +99.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +59.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +58.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +59.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +59.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +58.0°C (high = +80.0°C, crit = +98.0°C)

at this point I stopped thermald to upload logs. I could swear I had an issue where the max freq was stuck in read-only mode and I couldn't get it back to 100% except to reboot, but I can't seem to reproduce that issue right now.

Ok, I'm uploading another thermald log that shows it restricting performance even more than before (probably because I was able to get the temperature higher), and then NOT improving performance again once temperatures have gone back down to normal. thermald was started when temperatures were cool and performance was at max (I put 'date' in the log, 8:42:43am). Then I used a combination of stress-ng and some wine programs to get a high temperature, which eventually restricted my performance to minimums (but not quite 0KHz as indicated by pstate-frequency). Once temperature was back to normal the performance stayed at low levels and I stopped thermald. At that point:

mike@ossy /u/s/pstate-frequency> date; ./pstate-frequency -G; sensors
Thu Feb 9 08:54:19 EST 2017
pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 1 [OFF]
    pstate::CPU_MIN -> 0% [0KHz]
    pstate::CPU_MAX -> 0% [0KHz]
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +99.0°C)
temp2: +29.8°C (crit = +99.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +53.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +52.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +52.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +51.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +52.0°C (high = +80.0°C, crit = +98.0°C)


Sorry, just one more thermald debug log to attach, as I was able to reproduce the issue with the read-only sysfs attributes. After stopping thermald in debug mode at the previous session, I was unable to change intel_pstate's max_perf_pct or any/all of my CPU's scaling_max_freq past a certain point.

root@ossy:~# echo 100 | tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
root@ossy:~# grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:50
/sys/devices/system/cpu/intel_pstate/min_perf_pct:50
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/num_pstates:22
/sys/devices/system/cpu/intel_pstate/turbo_pct:19

root@ossy:~# for i in /sys/devices/system/cpu/cpu[0-3]/cpufreq/; do
> cat $i/cpuinfo_max_freq > $i/scaling_max_freq
> done

root@ossy:~# for i in /sys/devices/system/cpu/cpu[0-3]/cpufreq/; do grep . $i/*; done

/sys/devices/system/cpu/cpu0/cpufreq//affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_cur_freq:1799853
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_max_freq:3700000
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_min_freq:1600000
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_transition_latency:4294967295
/sys/devices/system/cpu/cpu0/cpufreq//related_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq//scaling_available_governors:performance powersave
/sys/devices/system/cpu/cpu0/cpufreq//scaling_cur_freq:1799853
/sys/devices/system/cpu/cpu0/cpufreq//scaling_driver:intel_pstate
/sys/devices/system/cpu/cpu0/cpufreq//scaling_governor:performance
/sys/devices/system/cpu/cpu0/cpufreq//scaling_max_freq:1850000
/sys/devices/system/cpu/cpu0/cpufreq//scaling_min_freq:1850000
/sys/devices/system/cpu/cpu0/cpufreq//scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu1/cpufreq//affected_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_cur_freq:1799853
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_max_freq:3700000
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_min_freq:1600000
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_transition_latency:4294967295
/sys/devices/system/cpu/cpu1/cpufreq//related_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq//scaling_available_governors:performance powersave
/sys/devices/system/cpu/cpu1/cpufreq//scaling_cur_freq:1799853
/sys/devices/system/cpu/cpu1/cpufreq//scaling_driver:intel_pstate
/sys/devices/system/cpu/cpu1/cpufreq//scaling_governor:performance
/sys/devices/system/cpu/cpu1/cpufreq//scaling_max_freq:1850000
/sys/devices/system/cpu/cpu1/cpufreq//scaling_min_freq:1850000
/sys/devices/system/cpu/cpu1/cpufreq//scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu2/cpufreq//affected_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_cur_freq:1799853
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_max_freq:3700000
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_min_freq:1600000
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_transition_latency:4294967295
/sys/devices/system/cpu/cpu2/cpufreq//related_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq//scaling_available_governors:performance powersave
/sys/devices/system/cpu/cpu2/cpufreq//scaling_cur_freq:1799853
/sys/devices/system/cpu/cpu2/cpufreq//scaling_driver:intel_pstate
...
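
The session above shows a write that appears to succeed (tee echoes 100) while the attribute still reads back 50. A small helper makes that mismatch explicit. This is only a sketch, and the function name write_and_verify is made up.

```shell
#!/bin/sh
# Write VALUE to FILE, read it back, and report whether the write stuck.
# On sysfs a write can return success and still be clamped, as seen above.
write_and_verify() {
    value=$1
    file=$2
    echo "$value" > "$file" || return 2
    actual=$(cat "$file")
    if [ "$actual" = "$value" ]; then
        echo "ok: $file = $actual"
    else
        echo "clamped: wrote $value, $file reads back $actual"
        return 1
    fi
}
```

Run as root, for example `write_and_verify 100 /sys/devices/system/cpu/intel_pstate/max_perf_pct`. A "clamped" result means that something else is still limiting the value, but it does not say what.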


I see that your load is bringing the temperature up to 97C, which is one degree below your system's critical temperature of 98C. So whatever cooling your device has without thermald is pretty bad, and your device can reboot at any time. Also, your BIOS has locked the good power control method, so it can't be used.

Let me see if I can optimize the algorithm for a system like this running close to critical temperature. I will send you some changes to test if you wish.

I can't always get my load temp up to 97C, sometimes it takes a lot of work. ;) I'll also go back into my BIOS and set it back to some better defaults, I had disabled things like Enhanced SpeedStep while I was troubleshooting these issues. I'm not sure if I saw whatever intel_rapl uses in there but I'll look closely.

If you're running so close to a critical temperature like that, I think you're justified in doing whatever you can to get it back down. That's not what I think the problem is; it's more:

1 - the original bug reporter (and I apologize if I hijacked his bug report!) wanted his fan to be used first and foremost to bring temp down, as opposed to idle injection, and only do freq stepping and then idle injection if you had to (if the fans couldn't handle it); I think he may need to send some thermald debug logs to you for that though

2 - once the CPU is cool thermald was not bringing my performance back up to max speed, and it took me a while to figure out this was thermald's fault; better logging of what is going on in syslog would help here as well as figuring out why it didn't increase my performance again even though CPU was cool

Colin Ian King (colin-king) wrote :

Just my 2 cents worth: I was getting hit with kinject and so forth, and after cleaning the fan and replacing the thermal paste on my aging laptop I saw an amazing improvement:

http://smackerelofopinion.blogspot.co.uk/2016/08/fixing-overheating-lenovo-x230-laptop.html

teo1978 (teo8976) wrote :

Suddenly a stupid question pops into my mind:

Please tell me that fan speed control is NOT based on a static temperature to speed map such as:
  0 - 50 degrees => 300 rpm
 51 - 70 degrees => 600 rpm
 ....
and the like. That would be retarded, and would explain why everything goes to shit as soon as the fan's performance degrades the slightest bit.

That's not the case, right??

teo1978 (teo8976) wrote :

sorry (actually not, Launchpad's fault, not mine): reposting the comment with corrections, as this fucking pathetic bug tracker doesn't let me edit it
------------------------------------------------------------------------------

I suspect there are actually two (or more) issues here.

1) Thermald issues:

1a) As described in the original report, various types of CPU clamping kick in before the fan has an opportunity to do its job alone

1b) Plus, as described in comments 20-23, CPU clamping not stopping after the temperature has gone back to normal.

2) BUT THERE'S SOMETHING ELSE, which seems to be outside thermald, and I wonder if it actually explains issue 1a on its own (but of course not 1b):

Something (in software), even without thermald running, seems to prevent the fan from spinning as fast as needed.

What I base this theory on is:

- I now always stop thermald as soon as I turn my laptop on.

- when I do stuff that consumes a lot of CPU (like watching video -LOL- I know, it shouldn't consume a significant amount of CPU, but apparently there are other bugs), the temperature grows and grows.

* Now before you say it, YES, my fan needs some cleaning. That's not the point.
* If my fan alone was the problem, the temperature would go up as it does AND
* the fan would spin at its maximum speed, as already "proven" in my previous comment.

I don't know how to see the actual temperature, but I know that it goes high (i) by touching the bottom of the laptop and burning myself, and (ii) because it gets to a point (remember: thermald is off) where everything becomes incredibly slow. When it is, top shows that all the CPU power is being used (while doing the same amount of work that was not saturating the CPU before) - now that could be either because some software bug causes more and more CPU power to be used, OR because the total CPU power is less because some extreme (probably hardware) protection mechanism is lowering CPU power to stop it from burning. The confirmation that it's the latter, is that trivial processes doing virtually zero work seem to consume a high percentage of CPU.

Now, all this would be expected and could be attributed to a dirty fan, IF at this point the fan was spinning at its maximum speed. Because, if the fan is incapable of coping with the temperature growth, by definition it should be working at its maximum, that being insufficient.

And the fact is that that is not the case: I get to the above described situation while the fan is WAY below its maximum speed.
Remember: all this with thermald off.

The FINAL PROOF that something is preventing the fan from going as fast as it should is that:
- if I turn the computer off and on, the fan spins superfast while the boot menu is shown. That means that, when the OS with all its paraphernalia is not yet running, the hardware somehow "knows" that it needs the fan to spin a lot faster to cool it down. Then, as the system boots, you can hear the fan go to the minimum (or close), and then speed up again, but much much less.

Also by suspending I can reproduce a similar effect, to a lesser extent:
when I suspend and resume, the fan goes quite a bit faster than it was prior ...


Doug Smythies (dsmythies) wrote :

> I don't know how to see the actual temperature,

Recommend you use turbostat to observe package and core temperatures.
turbostat is included in the linux-tools-common package (I think).

> but I know that it goes
> high (i) by touching the bottom of the laptop and burning myself, and
>(ii) because it gets to a point (remember: thermald is off) where
> everything becomes incredibly slow.

Yes, that is the last level of protection before shutdown.
Clock modulation is turned on at 50%. The intel pstate cpu frequency driver is fundamentally incompatible with clock modulation, so your CPU frequencies will become locked at a very low frequency. You should be able to observe this with turbostat also.
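
As a rough cross-check alongside turbostat, the locked-low frequency described above may also be visible through the kernel's cpufreq sysfs files. The following is only a sketch, assuming the standard cpufreq attributes (scaling_cur_freq and cpuinfo_max_freq, both in kHz) are exposed on the machine; the function name cpu_frequencies is made up for illustration.

```python
from pathlib import Path


def cpu_frequencies(root="/sys/devices/system/cpu"):
    """Return {cpu_name: (current_mhz, max_mhz)} read from cpufreq sysfs.

    The kernel reports both values in kHz; they are converted to MHz here.
    CPUs without a readable cpufreq directory are skipped.
    """
    result = {}
    for cpufreq in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        try:
            cur_khz = int((cpufreq / "scaling_cur_freq").read_text())
            max_khz = int((cpufreq / "cpuinfo_max_freq").read_text())
        except (OSError, ValueError):
            continue  # attribute missing or unreadable on this CPU
        result[cpufreq.parent.name] = (cur_khz / 1000.0, max_khz / 1000.0)
    return result


if __name__ == "__main__":
    for cpu, (cur, top) in cpu_frequencies().items():
        print(f"{cpu}: {cur:.0f} MHz (hardware max {top:.0f} MHz)")
```

If every CPU sits far below its maximum while the machine is busy, that would be consistent with the throttled state described here, though turbostat remains the more complete tool.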

teo, do
apt install lm-sensors
to install the basic sensors package. This includes
sensors-detect
which can be used to make sure all the kernel modules are loaded to detect temperatures and fan speeds. Then you can run the
sensors
command. I'm not able to see my fan speed but perhaps you can see yours, and you should definitely be able to see the temperature. Run it in another terminal window with
watch sensors
to update every 2s. As to why the fan isn't running full speed for you, that's a great question, but at least now hopefully you'll be able to see the temperature and fan speeds.
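
For anyone who wants the raw numbers without the sensors tool, the kernel's hwmon sysfs files can be read directly. This is a minimal sketch, assuming the standard hwmon layout (temp*_input in millidegrees Celsius, fan*_input in RPM); the function name read_hwmon is hypothetical, and a machine whose fan is hidden behind the embedded controller may simply show no fan entries.

```python
from pathlib import Path


def read_hwmon(root="/sys/class/hwmon"):
    """Return a list of (hwmon_dir, chip_name, sensor, value) tuples.

    Temperatures are converted from millidegrees Celsius to degrees Celsius;
    fan speeds are returned in RPM as reported by the kernel.
    """
    readings = []
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        try:
            chip = name_file.read_text().strip()
        except OSError:
            chip = chip_dir.name  # fall back to the directory name
        for sensor in sorted(chip_dir.glob("*_input")):
            kind = sensor.name.split("_")[0]  # e.g. "temp1" or "fan1"
            try:
                raw = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse to be read
            if kind.startswith("temp"):
                readings.append((chip_dir.name, chip, kind, raw / 1000.0))
            elif kind.startswith("fan"):
                readings.append((chip_dir.name, chip, kind, raw))
    return readings


if __name__ == "__main__":
    for hwmon, chip, sensor, value in read_hwmon():
        unit = "C" if sensor.startswith("temp") else "RPM"
        print(f"{hwmon} {chip} {sensor}: {value} {unit}")
```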

Mike,
For issue 2 please enter a new bug, I will explain in that what I see and possible fix to try.

Let's not mix this with fan speed control. thermald can control fan speed if there is a way to control it (namely an ACPI fan or some proprietary control, as on ThinkPads). man thermal-conf.xml has one example of controlling fan speed on a specific ThinkPad model.
I need to get more information about the system, which I will ask to dump for Fan control.

teo1978 (teo8976) wrote :

By the way, when thermald is not running, what controls the fan speed?
And when thermald *is* running, how exactly does thermald interact with [whatever it is]? Does thermald take control over? Or does it somehow "modulate" the other thing that would normally be controlling the fan speed?

The embedded controller controls the fan, and on many systems it will not allow the OS to control fans. So thermald can't control the fan unless the user has manually configured it to (in which case there is some means to control the speed from sysfs).

Mike, for your issue 2 I have uploaded a change. https://github.com/01org/thermal_daemon/commits/master
In systems like yours, which run close to critical, the auto max adjustment needs a better algorithm.

Colin,
Do you have some auto builder which can make a test package for Mike? I added one commit on my branch.

Sri I was able to clone the changes you made and compile and run your new thermald (with your default config files too). (BTW - No matter what I did in my BIOS I couldn't get RAPL to not be locked by the BIOS but apparently this is a relatively common problem.) Very Good news: I was not able to get thermald stuck in the same state as previously, where it would constrain performance dramatically even once temperatures had returned to a normal state. Normally my CPU does not get anywhere near critical temp (in the 90+C range) but I have found some programs to get it there. Once temperatures had returned to 60-69C it took a few seconds but thermald started giving me performance back, and it seemed that within 30s it was max performance again.

I then verified that the Ubuntu-provided thermald (1.5.4-2) had the same terrible behavior, even resorting to idle injection (which got re-enabled for me since I put your default config files in place), and then not removing that idle injection once temperatures were in normal range! (And I waited over a minute too.) This is a huge deal! Not sure if you need to see any debug logs or anything, if you do e-mail me off-ticket, I should really stop hijacking teo's ticket here, I would have opened another ticket if these changes did not work but as it is they have and I also don't think you need to design any other algorithm, this is working fine. Any other adjustments I can make by experimenting at the temperature which thermald kicks in (instead of 80 as reported by my sensors, I could make it 85 or 90).

teo, I think in order to continue with this they need to know if your OS is able to see the fan and fan speeds, you would need to run sensors-detect and sensors to see if you see this, and provide them with thermald debug logs to continue troubleshooting.

Colin Ian King (colin-king) wrote :

Since the report from Mike looks positive, I'll go ahead and apply this patch across the release via an SRU.

Changed in thermald (Ubuntu):
importance: Undecided → Critical

Thanks Mike.
I would like to help close the other issues pointed out here on macbook and others, but I need debug logs like the ones you provided. I also need the output of
grep -r . /sys/class/thermal/*
and, if there are any, the sysfs entries for fan control under /sys/devices/platform or elsewhere.
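
For readers who prefer a structured view over the raw grep output, the same thermal sysfs tree can be summarized with a short script. This is only a sketch, assuming the standard /sys/class/thermal layout (thermal_zone*/type and temp, cooling_device*/type, cur_state and max_state); the helper names are invented for illustration, and the grep output is still what should be attached to the bug.

```python
from pathlib import Path


def read_attr(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def to_celsius(raw):
    """Convert a millidegree-Celsius string to degrees, or None if not a number."""
    try:
        return int(raw) / 1000.0
    except (TypeError, ValueError):
        return None


def thermal_summary(root="/sys/class/thermal"):
    """Collect thermal zones and cooling devices found under root."""
    root = Path(root)
    zones = [
        {
            "zone": zone.name,
            "type": read_attr(zone / "type"),
            "temp_c": to_celsius(read_attr(zone / "temp")),
        }
        for zone in sorted(root.glob("thermal_zone*"))
    ]
    devices = [
        {
            "device": dev.name,
            "type": read_attr(dev / "type"),
            "cur_state": read_attr(dev / "cur_state"),
            "max_state": read_attr(dev / "max_state"),
        }
        for dev in sorted(root.glob("cooling_device*"))
    ]
    return zones, devices


if __name__ == "__main__":
    zones, devices = thermal_summary()
    for z in zones:
        print(f"{z['zone']} ({z['type']}): {z['temp_c']} C")
    for d in devices:
        print(f"{d['device']} ({d['type']}): state {d['cur_state']}/{d['max_state']}")
```

A cooling device of type Fan showing up in this list would suggest the OS has at least some handle on the fan; its absence suggests the embedded controller keeps that control to itself.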

Looks like there is a way to control fan speed in macbooks
https://github.com/dgraziotin/mbpfan

Looks like there is some sysfs path
/sys/devices/platform/applesmc.768/fan

I am travelling and on vacation, so expect delays.

Doug Smythies (dsmythies) wrote :

Readers: I started with Mike when he was originally posting on bug #1188647 and we continued via e-mail before he came here. I do not know thermald at all, but have been running it with a trip point of 55 degrees (because my test server runs quite cool) while following here.

Mike said:
> I can't always get my load temp up to 97C, sometimes it takes a lot of work. ;)

That is not consistent with the turbostat data you sent me, which shows a package temperature of 98 degrees under about 70% load.

Colin said:
> after cleaning the fan and replacing the thermal paste on my
> aging laptop I saw an amazing improvement:

@Mike: Yes, and at the risk of losing a good user test case, I suspect you need to re-do the thermal paste between your processor and its heat sink. You and I have similar vintage processors, with identical TDPs, although mine is an i7 and yours is an i5. Under light load (I matched the package power of mine with your turbostat output) my processor runs 13 degrees cooler than yours. Under higher load (again I matched package power) my processor runs 30 degrees cooler than yours.
If Srinivas does want logs, then please post them here, for the education of others following this (i.e. me).

Doug, I only meant that I have to run stress-ng and other tasks to get it that hot; regular web browsing/emails etc. doesn't do it. But yes, thermal paste is definitely the #2 thing after blowing out the fans. I may do that later but my main issue is solved by this update (though more logging in syslog would also really help people in the future).

Without thermald, stress-ng and another task can get it up to 98C even, though something in hardware is limiting me from getting it up higher and shutting down the machine (and processor frequency does go down slightly when I'm in that range... in fact I just noticed some entries in dmesg about processor speed and cpu clock throttled, but it's not coming from thermald!). I am happy with thermald doing what it needs to keep me from being in the danger zone as long as it gives me back normal performance when I'm not (which the new thermald is definitely doing and the old one was not, I've been doing a lot of checking). But other users may be just as happy with shutting it off if their BIOS and other hardware mechanisms prevent them from shutting down.

I do hope teo and/or someone else w/ a macbook can provide some debug logs or info about their fans (I think mine is just going to be controlled and I'm OK with that). I'll watch but I'm not going to post anymore. ;)

But since you asked for some logs here is one more post with some I thought would be good:
- turbostat_high_load_without_thermald
- turbostat_high_load_with_thermald
- thermald_debug_high_load

The load was created with "stress-ng --matrix 0 -t 1m" and thunderbird, and
"turbostat --debug sleep 10" was executed halfway into the 1m stress-ng run.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thermald - 1.5.4-3

---------------
thermald (1.5.4-3) unstable; urgency=medium

  * upstream fix 7f83ada8133 ("check for recv failure and ensure "
    "buffer is null terminated")
    - fixes potential buffer overrun
  * upstream fix 53154fd496a ("Remove auto adjusted max temp")
    - addresses aggressive over-throttling on some H/W (LP: #1600599)

 -- Colin King <email address hidden> Fri, 10 Feb 2017 15:19:11 +0000

Changed in thermald (Ubuntu):
status: Confirmed → Fix Released

Please don't close this bug without another update from the original reporter. While I originally posted here because I thought our problems were the same or very closely related, this bug does report a slightly different problem. He will need to provide debug logs to help, but I feel bad hijacking his bug, and in the future I will create a new bug instead if I'm not 100% sure my problem is the same one.

teo1978 (teo8976) wrote :

I don't know, I guess I'm experiencing a mix of two issues, one (or a bunch of them) related to thermald, and another one related to something else preventing the fan from spinning as fast as it should, which I should probably report as a separate bug.

I guess it's pretty safe to assume that the part about Thermald is fixed, until proven otherwise. Clearly *something* that was wrong in thermald has been fixed.

Roger Lawhorn (rll-m) wrote :

Everything said by TEO fully expresses my frustration with thermald.
So as not to repeat anything I'd just like to add that if I stop thermald some other process starts it up again.
Alas, I cannot win.

$ lsb_release -a
No LSB modules are available.
Distributor ID: LinuxMint
Description: Linux Mint 18.3 Sylvia
Release: 18.3
Codename: sylvia

$ inxi
CPU~Quad core Intel Core i7-4940MX (-HT-MCP-) speed/max~3553/3301 MHz Kernel~4.15.13-041513-generic x86_64 Up~44 min Mem~3664.4/32151.9MB HDD~8001.6GB(25.8% used) Procs~306 Client~Shell inxi~2.2.35

dah bien-hwa (dahbien-hwa) wrote :

Just had similar episodes as what is described here.
My CPU apparently was close to overheating (though the maximum temperature I afterwards observed was 95°C, with a specified high/max of 100°C). The Ubuntu 18.04 on my laptop (Asus Zenbook UX301LA) consistently became unresponsive after starting high-cpu-usage tasks (compiling something using ninja-build); after stopping the thermald service, the unresponsiveness was gone even when running the high-cpu-usage tasks for a long time.

The unresponsiveness is quite severe: Though I did manage one time to switch to the tty1 and log in there, I could never enter any more commands there (waiting for approximately a minute or so...). I eventually always had to resort to more harsh methods of restarting the laptop.
