Bug #1600599 “Thermald is totally broken, or its default configu...” : Bugs : thermald package : Ubuntu

teo1978 (teo8976) wrote on 2016-07-10:

#1

Dependencies.txt (1.7 KiB, text/plain; charset="utf-8")
JournalErrors.txt (41.3 KiB, text/plain; charset="utf-8")
ProcEnviron.txt (325 bytes, text/plain; charset="utf-8")

description:	updated

dino99 (9d9) wrote on 2016-07-10:

#2

That version reaches End of Life in a few days, and that thermald issue has been fixed in the other releases.

https://wiki.ubuntu.com/Releases
https://bugs.launchpad.net/ubuntu/wily/+source/thermald/+bug/1543046

Changed in thermald (Ubuntu):
status:	New → Invalid

teo1978 (teo8976) wrote on 2016-07-10:

#3

That doesn't seem the same issue at all.

teo1978 (teo8976) wrote on 2016-07-10:

#4

Actually, that is obviously not the same issue, since I disabled rapl before I disabled thermald, and I stopped observing the kernel log spam, but was still observing this issue.

So unless this is a duplicate of another one, this is not invalid, and the fact that EOL is near doesn't make it any more so. Feel free to change the status to "Won't fix" once EOL is actually reached.

Changed in thermald (Ubuntu):
status:	Invalid → New

dino99 (9d9) wrote on 2016-07-10:

#5

Be sure no one will glance at that issue; please read the report's comment posted above, and try to understand.

Changed in thermald (Ubuntu):
status:	New → Invalid

teo1978 (teo8976) wrote on 2016-07-10:

#6

> please read the report's comment posted above, and try to understand.

I wonder if you have read mine, and the issue description in the first place.

This has NOTHING TO DO with 1543046

Colin Ian King (colin-king) wrote on 2016-07-10:

#7

A few questions:

1. What hardware is this being run on?
2. You mentioned you disabled RAPL? Can you describe what you mean by that and how?

Thanks.

Colin Ian King (colin-king) wrote on 2016-07-10:

#8

@teo1978, I am supposing that from your actions in the comment #1 that you are no longer using thermald and therefore bug 1543046 is no longer going to be tested by you?

teo1978 (teo8976) wrote on 2016-07-10:

#9

> 1. What hardware is this being run on?

An Acer Aspire V3-571G which has an Intel Core i7-3632QM

> 2. You mentioned you disabled RAPL? Can you describe what you mean by that and how?

I created a file /etc/modprobe.d/blacklist-power.conf with these lines:
blacklist intel_powerclamp
blacklist intel_rapl

and I rebooted. I found that on StackExchange by googling "disable intel rapl".

After that (and before I shut down thermald completely) I stopped experiencing bug 1543046 as per the test mentioned in comment 31 (repeated while observing system unresponsiveness)

> I am supposing that from your actions in the comment #1 that you are no longer using thermald
> and therefore bug 1543046 is no longer going to be tested by you?

Well, for the moment I have only shut it down by "service thermald stop" and I haven't rebooted since. So I may try the fix some day. (To do that, I would first have to reenable the blacklisted kernel modules, because otherwise I don't observe the symptoms of that issue even with the unfixed thermald version.)

Since the pain is gone by just shutting down thermald, it's not high-priority for me to try that.

I had thought that bug 1543046 could be the cause of the general system slowdown because people in the Quality mailing list commented it was the cause of 1593468, but it seems to me that it's clearly the other way round:
1593468 causes higher-than-normal CPU consumption when playing video; this issue (1600599) causes powerclamp and rapl to kick in much sooner than they should, and this triggers (not "causes" but "triggers") issue 1543046. That's the only correlation between the three issues that makes sense to me. (note that they remain 3 separate issues)

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2016-07-11:

#10

I am interested in seeing a run with loglevel=debug. They are kicked at 10C below TJMAX and you can be shutdown by kernel any moment as the temperature can swing 5-10C immediately.
If someone don't care about life of the system or shutdowns they can increase this setting by setting one time dbus message:

dbus-send --system --dest=org.freedesktop.thermald /org/freedesktop/thermald org.freedesktop.thermald.SetUserMaxTemperature string:cpu uint32:<your temp in millidegrees C>

dbus-send --system --dest=org.freedesktop.thermald /org/freedesktop/thermald org.freedesktop.thermald.SetUserPassiveTemperature string:cpu uint32:<your temp in millidegrees C>
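
The value after uint32: is in millidegrees Celsius, so 95 C is written as 95000. A minimal sketch that does the conversion and prints (but does not send) both commands from this comment. The helper names and the 95 C figure are placeholders for illustration only, not a recommended setting:

```shell
# Convert whole degrees Celsius to millidegrees Celsius.
c_to_millic() {
    echo $(( $1 * 1000 ))
}

# Print, without executing, the two dbus-send commands quoted in this comment
# for a temperature given in whole degrees Celsius.
print_thermald_temp_cmds() {
    millic=$(c_to_millic "$1")
    for method in SetUserMaxTemperature SetUserPassiveTemperature; do
        echo "dbus-send --system --dest=org.freedesktop.thermald /org/freedesktop/thermald org.freedesktop.thermald.$method string:cpu uint32:$millic"
    done
}

# 95 is an arbitrary example value (assumption), not advice.
print_thermald_temp_cmds 95
```

Printing first makes it easy to check the unit before anything is sent to the daemon; the printed lines can then be run by hand as root.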

teo1978 (teo8976) wrote on 2016-07-11:

#11

I'm not sure I understand what you mean by this:

> They are kicked at 10C below TJMAX and you can be shutdown by kernel any
> moment as the temperature can swing 5-10C immediately.
> If someone don't care about life of the system or shutdowns they can increase
> this setting by setting one time dbus message:

Are you implying that the behavior I observe is the expected one?
It is not.

With thermald running (with its default configuration): the system slows down becoming completely unusable when the cooling fan has barely reached half its maximum speed.

With thermald not running: my system works perfectly, nothing bad happens, the fan spins noticeably (not a lot) faster but there's still PLENTY of margin before it reaches its maximum speed.

So, cpu throttling is starting WAY too soon, unnecessarily rendering the system unusable, when the physical fan alone is more than enough to keep the temperature down.

It's not that I "don't care about life of the system or shutdowns"; it's that thermald is limiting the CPU when there's not even a remote risk of any of that.
Either thermald is not working or its default configuration is ridiculously wrong.

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2016-07-11:

#12

This is upto you what you want to do with your system. Don't judge system heat with Fan speed. If you want debug, provide logs with log level suggested.

teo1978 (teo8976) wrote on 2016-07-12:

#13

> This is upto you what you want to do with your system.

Yet the current behavior is objectively wrong. The system is becoming unusable because of the amount of CPU throttling, and this is totally avoidable, so it's taking a suboptimal decision.

> Don't judge system heat with Fan speed.

I'm not judging system heat with Fan speed. I'm judging what I can observe: fan speed and CPU throttling.

If system performance degrades and fan speed never goes anywhere near its full capacity, it means one of two things (theoretically):
a) thermald could use more fan power and reduce or even eliminate the need for CPU throttling. Hence it's making the wrong decision
b) thermald is actually right: using more fan alone would be too risky as the temperature could increase

Now (b) already makes little sense, because if that was the case, then the fan should already be close to its maximum speed (by definition, if it's not at its full capacity there's unused cooling power there). But let's say for the sake of argument that temperature can vary too quickly and the cooling effect of the fan takes time, so that would be risky.

So I run this little experiment: I shut down thermald and keep CPU consumption steadily high as it was; actually, I increase it by playing half a dozen youtube videos at once (when one alone was enough to make the system unusable with thermald running).

If your hypothesis (b) were right, then now I would by definition be in a situation where the temperature cannot be kept safe (otherwise there would be no reason why thermald was throttling the CPU in the first place). Hence two things should happen: first, the fan speed actually should go crazy, because it is now the ONLY cooling device available. Then, at a certain point the system would shut down (or some internal hardware protection, if it exists, would slow down the CPU itself and I'd observe a system slowdown similar to that caused by powerclamp).

NEITHER happens: the fan speed does go up a little bit, but not much. While I cannot deduce temperature from that, I can tell for sure that the temperature is not steadily growing if the fan speed isn't, and that the temperature is not critical if the fan is not at its maximum speed.

I can, and I will, run all the debugs and provide the logs, so you can put numbers to all of this, and that will certainly be valuable information to fix the issue. But the facts are already there to prove that the current behavior is WRONG, you don't need any log for that.

Kwang Moo Yi (kwang-m-yi) wrote on 2017-01-23:

#14

I too find it worse with thermald on my MacBook Pro. With thermald, the frequency clipping is too severe, and the fans stay below 50%. I think that, by default, the fans should still keep speeding up while CPU frequency clipping is happening.

In my case, without thermald, the CPU temperature goes about 5-10 deg higher at very high loads, but it still seems fine as the fans quickly kick in, and at worst the CPU frequency clips to prevent extreme temperatures.

tags:

added: xenial

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-01-23:

#15

To really debug this, we need logs. There may be a number of causes if the temperature is not really high.
There may be wrong power limits configured in the RAPL registers, so it is better to debug this for good.

# systemctl stop thermald
# thermald --no-daemon --loglevel=debug

Kasper Peeters (kasper-peeters) wrote on 2017-02-01:

#16

Confirming this on a MacBook Pro with thermald 1.5. It took forever to figure out that thermald was the culprit. Unfortunately, thermald does not report its throttling activity to syslog on a default Mint 18.1 install, so there was nothing pointing in its direction. I don't want to stir up the debate about whether thermald's behaviour is right (perhaps there really are people who prefer a slow system over one that has its fans running), but at least it would have been nice if thermald informed users that it is crippling CPU speed. Then I would have simply disabled it immediately.

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-02-01:

#17

Re #16:

Kasper Peeters (kasper-peeters), at least you can provide logs by doing the following and then running whatever workloads you normally run on the MacBook:

# systemctl stop thermald
# thermald --no-daemon --loglevel=debug

mike@papersolve.com (mike-papersolve) wrote on 2017-02-08:

#18

While I have to agree with teo1978 that it did take a long time to find out why my performance was being compromised so heavily due to the lack of information that thermald provides to the log (involving a lot of red herrings and detours through pstate vs cpufreq, read-only frequency scaling properties, etc), I can report that cleaning out your fans with many shots of compressed air can really, really help avoid the problem of getting to the critical temperatures where performance must be seriously constrained.

Before my fan cleanup, "stress-ng --matrix 0" could cause my CPU to be reduced to the scaling_min_freq. After the cleanup, the frequency was between 60% and 80% of the max and better able to be controlled. (Sometimes it never goes back to scaling_max_freq until I reboot. I'm still not sure whose fault that is, though it only seems to happen when intel_pstate is included as a CoolingDevice in thermal-cpu-cdev-order.xml. I will try to reproduce it and get debug logs.)

Since I can't see my fan speeds even after running sensors-detect and loading all modules, I have them set to the max they can go all the time in my BIOS. The fans don't sound different when under high load or low load. stress-ng can send my temperature soaring to 98C without thermald but the fan speed sounds the same. So I can conclude that even with the fans at max, I still need the cooling techniques of thermald sometimes.

But seriously, it would have been so much easier to figure out what was going on if thermald wrote something, anything to the system log to note it was throttling CPU.

teo1978 (teo8976) on 2017-02-08

Changed in thermald (Ubuntu):
status:	Invalid → Confirmed

teo1978 (teo8976) wrote on 2017-02-08:

#19

I wonder why the status had been set to Invalid.

> cleaning out your fans [...] can really, really help avoid the problem
> of getting to the critical temperatures where performance must be
> seriously constrained.

That's not the point.
The point is that, because of this bug, you reach those temperatures even though the fan is capable of avoiding that.

If the CPU gets clamped (be it by reducing its frequency or by injecting idle processes or a combination of both and of whatever other method) AND the fan is not at its maximum speed, then something is wrong.

Proof (in case you don't see the obvious):

If the CPU is being clamped, then either the temperature is critically high, or it isn't. If the temperature were not critically high, then obviously there would be no reason to clamp the CPU in the first place, so that would obviously be wrong. So let's assume the CPU temperature is indeed high. Then either the fan can cool it down by spinning faster, or it can't. If it can, then it should, and then there would be no need to clamp the CPU. If it can't (spin faster) then it means it must be at its maximum speed.

This is under the assumption that you want the CPU to be as fast as possible. Obviously, this should be the default assumption if not explicitly specified otherwise in some system setting that the user can edit (e.g. a "power saving mode" where you may prefer longer battery life over best performance, or, if this was to be invented, a "silent mode", where you may prefer to keep fan noise low even at the expense of performance).

mike@papersolve.com (mike-papersolve) wrote on 2017-02-09:

#20

I agree with your reasoning teo, I was just mentioning that it didn't apply in my situation since I know my fans are at the max all the time. But what you're saying is true - thermald does reduce performance before I get near the max temp (as reported by 'sensors' high=80 and crit=98), sometimes by a lot, and sometimes it fails to increase the performance again after the CPU has cooled down (I just reproduced and will attach separately).

Perhaps it's just that the thermald default configuration is too conservative, and is kicking in when the temperature is within 10C of 'high' rather than 'crit' (which seems the better choice). I may try to change the settings mentioned above (what is the difference between SetUserPassiveTemperature and SetUserMaxTemperature?).
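
To put numbers on the two hypotheses: taking the "10C below" offset from comment #10 and the 'high' and 'crit' limits that sensors reports on this machine, the candidate trip points work out as below. This is only the arithmetic; which limit thermald actually keys off is not established anywhere in this thread, and trip_below is a made-up helper name:

```shell
# Candidate trip point: a limit minus a fixed offset, both in whole degrees C.
trip_below() {
    echo $(( $1 - $2 ))
}

offset=10   # "10C below" figure quoted in comment #10
high=80     # 'high' limit reported by sensors on this machine
crit=98     # 'crit' limit reported by sensors on this machine

echo "10C below high: $(trip_below "$high" "$offset") C"
echo "10C below crit: $(trip_below "$crit" "$offset") C"
```

That gives 70 C under the 'high' hypothesis and 88 C under the 'crit' one, so a debug log showing the temperature at which throttling starts should tell the two apart.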

mike@papersolve.com (mike-papersolve) wrote on 2017-02-09:

#21

thermald debug logs which ends up with thermald stuck in a low freq mode when CPU is cool (83.3 KiB, text/plain)

started with thermald stopped:

pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 0 [ON]
    pstate::CPU_MIN -> 50% [1850000KHz]
    pstate::CPU_MAX -> 100% [3700000KHz]

start a lot of load, which sends my CPU up to crit temperature sometimes (but doesn't actually shut the machine down, so this number might be reported lower by BIOS):
stress-ng --matrix 0 -t 3m

starting thermald with log:
thermald --no-daemon --loglevel=debug > /root/thermald-debug-reduced.log 2>&1

thermald correctly reduces my performance (though it's by quite a lot) but keeps it very low even when CPU temp has been very much reduced; at this time stress-ng is still running:

mike@ossy /u/s/pstate-frequency> date; ./pstate-frequency -G; sensors
Wed Feb 8 20:50:05 EST 2017
pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 1 [OFF]
    pstate::CPU_MIN -> 43% [1600000KHz]
    pstate::CPU_MAX -> 50% [1850000KHz]
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +99.0°C)
temp2: +29.8°C (crit = +99.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +65.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +64.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +65.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +65.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +65.0°C (high = +80.0°C, crit = +98.0°C)

stress-ng stopped for many minutes but performance still stuck on low:

mike@ossy /u/s/pstate-frequency> date; ./pstate-frequency -G; sensors
Wed Feb 8 20:53:26 EST 2017
pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 1 [OFF]
    pstate::CPU_MIN -> 43% [1600000KHz]
    pstate::CPU_MAX -> 50% [1850000KHz]
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +99.0°C)
temp2: +29.8°C (crit = +99.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +59.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +58.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +59.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +59.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +58.0°C (high = +80.0°C, crit = +98.0°C)

at this point I stopped thermald to upload logs. I could swear I had an issue where the max freq was stuck in read-only mode and I couldn't get it back to 100% except to reboot, but I can't seem to reproduce that issue right now.

mike@papersolve.com (mike-papersolve) wrote on 2017-02-09:

#22

thermald debug log ending with restricted performance even when temp is cool (147.1 KiB, text/plain)

Ok, I'm uploading another thermald log that shows it restricting performance even more than before (probably because I was able to get the temperature higher), and then NOT improving performance again once temperatures have gone back down to normal. thermald was started when temperatures were cool and performance was at max (I put 'date' in the log, 8:42:43am). Then I used a combination of stress-ng and some wine programs to get a high temperature, which eventually restricted my performance to minimums (but not quite 0KHz as indicated by pstate-frequency). Once temperature was back to normal the performance stayed at low levels and I stopped thermald. At that point:

mike@ossy /u/s/pstate-frequency> date; ./pstate-frequency -G; sensors
Thu Feb 9 08:54:19 EST 2017
pstate-frequency version 3.7.2
    pstate::CPU_DRIVER -> intel_pstate
    pstate::CPU_GOVERNOR -> performance
    pstate::TURBO -> 1 [OFF]
    pstate::CPU_MIN -> 0% [0KHz]
    pstate::CPU_MAX -> 0% [0KHz]
asus-isa-0000
Adapter: ISA adapter
cpu_fan: 0 RPM

acpitz-virtual-0
Adapter: Virtual device
temp1: +27.8°C (crit = +99.0°C)
temp2: +29.8°C (crit = +99.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +53.0°C (high = +80.0°C, crit = +98.0°C)
Core 0: +52.0°C (high = +80.0°C, crit = +98.0°C)
Core 1: +52.0°C (high = +80.0°C, crit = +98.0°C)
Core 2: +51.0°C (high = +80.0°C, crit = +98.0°C)
Core 3: +52.0°C (high = +80.0°C, crit = +98.0°C)

mike@papersolve.com (mike-papersolve) wrote on 2017-02-09:

#23

thermald log where starting it puts me back to max performance (72.0 KiB, text/plain)

Sorry, just one more thermald debug log to attach, as I was able to reproduce the issue with the read-only sysfs attributes. After stopping thermald in debug mode at the end of the previous session, I was unable to raise intel_pstate's max_perf_pct or any of my CPUs' scaling_max_freq past a certain point.

root@ossy:~# echo 100 | tee /sys/devices/system/cpu/intel_pstate/max_perf_pct
100
root@ossy:~# grep . /sys/devices/system/cpu/intel_pstate/*
/sys/devices/system/cpu/intel_pstate/max_perf_pct:50
/sys/devices/system/cpu/intel_pstate/min_perf_pct:50
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/num_pstates:22
/sys/devices/system/cpu/intel_pstate/turbo_pct:19

root@ossy:~# for i in /sys/devices/system/cpu/cpu[0-3]/cpufreq/; do
> cat $i/cpuinfo_max_freq > $i/scaling_max_freq
> done

root@ossy:~# for i in /sys/devices/system/cpu/cpu[0-3]/cpufreq/; do grep . $i/*; done

/sys/devices/system/cpu/cpu0/cpufreq//affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_cur_freq:1799853
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_max_freq:3700000
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_min_freq:1600000
/sys/devices/system/cpu/cpu0/cpufreq//cpuinfo_transition_latency:4294967295
/sys/devices/system/cpu/cpu0/cpufreq//related_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq//scaling_available_governors:performance powersave
/sys/devices/system/cpu/cpu0/cpufreq//scaling_cur_freq:1799853
/sys/devices/system/cpu/cpu0/cpufreq//scaling_driver:intel_pstate
/sys/devices/system/cpu/cpu0/cpufreq//scaling_governor:performance
/sys/devices/system/cpu/cpu0/cpufreq//scaling_max_freq:1850000
/sys/devices/system/cpu/cpu0/cpufreq//scaling_min_freq:1850000
/sys/devices/system/cpu/cpu0/cpufreq//scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu1/cpufreq//affected_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_cur_freq:1799853
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_max_freq:3700000
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_min_freq:1600000
/sys/devices/system/cpu/cpu1/cpufreq//cpuinfo_transition_latency:4294967295
/sys/devices/system/cpu/cpu1/cpufreq//related_cpus:1
/sys/devices/system/cpu/cpu1/cpufreq//scaling_available_governors:performance powersave
/sys/devices/system/cpu/cpu1/cpufreq//scaling_cur_freq:1799853
/sys/devices/system/cpu/cpu1/cpufreq//scaling_driver:intel_pstate
/sys/devices/system/cpu/cpu1/cpufreq//scaling_governor:performance
/sys/devices/system/cpu/cpu1/cpufreq//scaling_max_freq:1850000
/sys/devices/system/cpu/cpu1/cpufreq//scaling_min_freq:1850000
/sys/devices/system/cpu/cpu1/cpufreq//scaling_setspeed:<unsupported>
/sys/devices/system/cpu/cpu2/cpufreq//affected_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_cur_freq:1799853
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_max_freq:3700000
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_min_freq:1600000
/sys/devices/system/cpu/cpu2/cpufreq//cpuinfo_transition_latency:4294967295
/sys/devices/system/cpu/cpu2/cpufreq//related_cpus:2
/sys/devices/system/cpu/cpu2/cpufreq//scaling_available_governors:performance powersave
/sys/devices/system/cpu/cpu2/cpufreq//scaling_cur_freq:1799853
/sys/devices/system/cpu/cpu2/cpufreq//scaling_driver:intel_pstate
...
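
The transcript above shows a write to max_perf_pct that appears to succeed yet does not stick. A small helper, sketched here on the assumption of a POSIX shell running as root, makes that kind of silent rejection visible by reading the value back. write_and_verify is a hypothetical name, not part of any tool mentioned in this thread:

```shell
# Write a value to a file, read it back, and report whether it stuck.
# Returns 0 if the value reads back unchanged, 1 on mismatch, 2 if the write failed.
write_and_verify() {
    path=$1
    want=$2
    echo "$want" > "$path" || { echo "write failed: $path"; return 2; }
    got=$(cat "$path")
    if [ "$got" = "$want" ]; then
        echo "ok: $path = $got"
    else
        echo "mismatch: $path wanted $want, reads back '$got'"
        return 1
    fi
}

# Example use (as root), with the path taken from the transcript above:
# write_and_verify /sys/devices/system/cpu/intel_pstate/max_perf_pct 100
```

A mismatch here would point at something still clamping the limit after thermald has exited, rather than at the write itself failing.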

sorry (actually not, Launchpad's fault, not mine): reposting the comment with corrections, as this fucking pathetic bug tracker doesn't let me edit it
------------------------------------------------------------------------------

I suspect there are actually two (or more) issues here.

1) Thermald issues:

1a) As described in the original report, various types of CPU clamping kick in before the fan has an opportunity to do its job alone

1b) Plus, as described in comments 20-23, CPU clamping not stopping after the temperature has gone back to normal.

2) BUT THERE'S SOMETHING ELSE, which seems to be outside thermald, and I wonder if it actually explains issue 1a on its own (but of course not 1b):

Something (in software), even without thermald running, seems to prevent the fan from spinning as fast as needed.

What I base this theory on is:

- I now always stop thermald as soon as I turn my laptop on.

- when I do stuff that consumes a lot of CPU (like watching video -LOL- I know, it shouldn't consume a significant amount of CPU, but apparently there are other bugs), the temperature grows and grows.

* Now before you say it, YES, my fan needs some cleaning. That's not the point.
* If my fan alone was the problem, the temperature would go up as it does AND
* the fan would spin at its maximum speed, as already "proven" in my previous comment.

I don't know how to see the actual temperature, but I know that it goes high (i) by touching the bottom of the laptop and burning myself, and (ii) because it gets to a point (remember: thermald is off) where everything becomes incredibly slow. When it is, top shows that all the CPU power is being used (while doing the same amount of work that was not saturating the CPU before) - now that could be either because some software bug causes more and more CPU power to be used, OR because the total CPU power is less because some extreme (probably hardware) protection mechanism is lowering CPU power to stop it from burning. The confirmation that it's the latter, is that trivial processes doing virtually zero work seem to consume a high percentage of CPU.

Now, all this would be expected and could be attributed to a dirty fan, IF at this point the fan was spinning at its maximum speed. Because, if the fan is incapable of coping with the temperature growth, by definition it should be working at its maximum, that being insufficient.

And the fact is that that is not the case: I get to the above described situation while the fan is WAY below its maximum speed.
Remember: all this with thermald off.

The FINAL PROOF that something is preventing the fan from going as fast as it should is that:
- if I turn the computer off and on, the fan spins superfast while the boot menu is shown. That means that, when the OS with all its paraphernalia is not yet running, the hardware somehow "knows" that it needs the fan to spin a lot faster to cool it down. Then, as the system boots, you can hear the fan go to the minimum (or close), and then speed up again, but much much less.

Also by suspending I can reproduce a similar effect, to a lesser extent:
when I suspend and resume, the fan goes quite a bit faster than it was prior to suspending, but not quite as fast as when rebooting. Then it gradually but noticeably slows down. It may be expected that it slows down, if it was managing to quickly reduce the temperature, but it is quite clear that it slows down much faster than the temperature goes down. The proof of this last statement is that (i) if I now reboot it will go much faster; (ii) it's obvious that the temperature cannot be going down so fast, and (iii) if I keep doing the same cpu-heavy stuff I was doing, in a matter of seconds the system is slow again (meaning, as described above, that it's so hot it has to reduce CPU performance in whatever way it does it) but the fan speed doesn't go up; indeed it keeps slowing down.

So, my conclusion is that whether or not thermald is running, SOMETHING PREVENTS THE FAN FROM GOING AS FAST AS IT SHOULD.

Note that whether or not the fan is able to do its work at its best, or even sufficiently (i.e. it's dirty) is irrelevant. Whether or not that is the case, if the temperature is increasing, then the fan should be going faster. Which means that if it is not capable of keeping the temperature within the established limits, it should reach its maximum speed.

Doug Smythies (dsmythies) wrote on 2017-02-09:

#30

> I don't know how to see the actual temperature,

Recommend you use turbostat to observe package and core temperatures.
turbostat is included in the linux-tools-common package (I think).

> but I know that it goes
> high (i) by touching the bottom of the laptop and burning myself, and
>(ii) because it gets to a point (remember: thermald is off) where
> everything becomes incredibly slow.

Yes, that is the last level of protection before shutdown.
Clock modulation is turned on at 50%. The intel_pstate CPU frequency driver is fundamentally incompatible with clock modulation, and so your CPU frequencies will become locked at a very low frequency. You should be able to observe this with turbostat also.
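
If turbostat is not at hand, a rough cross-check is possible from sysfs. The sketch below is only an illustration: it assumes an Intel x86 system where the kernel exposes `cpufreq/scaling_cur_freq` and `thermal_throttle/core_throttle_count` under each CPU directory, and the function name `cpu_throttle_report` is made up for this example. A frequency stuck at a very low value together with a growing throttle count would be consistent with the behaviour described above.

```shell
# Print one line per CPU: "<cpu> <current MHz> <core throttle events>".
# The optional argument is the sysfs mount point (default /sys); it exists
# only so the function can be pointed at a test directory.
# Files that are missing or unreadable are reported as 0.
cpu_throttle_report() {
    sysroot="${1:-/sys}"
    for cpu in "$sysroot"/devices/system/cpu/cpu[0-9]*; do
        [ -d "$cpu" ] || continue
        # scaling_cur_freq is reported in kHz
        khz=$(cat "$cpu/cpufreq/scaling_cur_freq" 2>/dev/null || echo 0)
        count=$(cat "$cpu/thermal_throttle/core_throttle_count" 2>/dev/null || echo 0)
        echo "${cpu##*/} $((khz / 1000)) $count"
    done
}
```

Calling `cpu_throttle_report` with no argument reads the live `/sys` tree; running it before and after a heavy load shows whether the throttle counters moved.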

mike@papersolve.com (mike-papersolve) wrote on 2017-02-09:

#31

teo, do
apt install lm-sensors
to install the basic sensors package. This includes
sensors-detect
which can be used to make sure all the kernel modules are loaded to detect temperatures and fan speeds. Then you can run the
sensors
command. I'm not able to see my fan speed but perhaps you can see yours, and you should definitely be able to see the temperature. Run it in another terminal window with
watch sensors
to update every 2s. As to why the fan isn't running full speed for you, that's a great question, but at least now hopefully you'll be able to see the temperature and fan speeds.
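
As a complement to `sensors`, the kernel's thermal zones can also be read directly, with nothing to install. This is a minimal sketch, assuming the usual `/sys/class/thermal/thermal_zone*/temp` files (values in millidegrees Celsius); the function name `show_zone_temps` is hypothetical, and which zones exist depends on the machine.

```shell
# Print "<zone type> <temperature in degrees C>" for each thermal zone.
# The optional argument is the sysfs mount point (default /sys); it exists
# only so the function can be pointed at a test directory.
show_zone_temps() {
    sysroot="${1:-/sys}"
    for zone in "$sysroot"/class/thermal/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        # Some zones refuse reads; skip those instead of failing.
        millideg=$(cat "$zone/temp" 2>/dev/null) || continue
        # The kernel reports millidegrees Celsius, e.g. 45500 for 45.5 C.
        echo "$ztype $((millideg / 1000)).$(( (millideg % 1000) / 100 ))"
    done
}
```

Run with no argument, it reads the live `/sys` tree and prints one line per zone.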

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-02-09:

#32

Mike,
For issue 2 please enter a new bug, I will explain in that what I see and possible fix to try.

Let's not mix this with fan speed control. thermald can control fan speed if there is a way to control it (namely an ACPI fan or some proprietary control, as on ThinkPads). man thermal-conf.xml has one example of controlling fan speed on a specific ThinkPad model.
I need to get more information about the system, which I will ask to dump for Fan control.

teo1978 (teo8976) wrote on 2017-02-09:

#33

By the way, when thermald is not running, what controls the fan speed?
And when thermald *is* running, how exactly does thermald interact with [whatever it is]? Does thermald take control over? Or does it somehow "modulate" the other thing that would normally be controlling the fan speed?

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-02-10:

#34

The embedded controller controls the fan, and on many systems it will not allow the OS to control fans. So thermald can't control the fan unless the user has manually configured it to do so (in which case they have some means to control the speed from sysfs).
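
Whether the OS has any such means can be checked by looking for fan entries in sysfs. The following is a sketch under assumptions, not a definitive test: it looks only at the common hwmon names (`fan*_input` for RPM readings, `pwm1` to `pwm9` for drive level) and at thermal cooling devices whose type is `Fan` (the ACPI fan). The function name `list_fan_interfaces` is invented for this example, and some drivers place their files elsewhere (for instance under `hwmon*/device/`), so an empty result is not proof that no control exists.

```shell
# List fan-related sysfs entries that the OS can see.
# The optional argument is the sysfs mount point (default /sys); it exists
# only so the function can be pointed at a test directory.
list_fan_interfaces() {
    sysroot="${1:-/sys}"
    found=0
    # hwmon: measured speed (fan*_input) and drive level (pwm1..pwm9)
    for f in "$sysroot"/class/hwmon/hwmon*/fan*_input "$sysroot"/class/hwmon/hwmon*/pwm[0-9]; do
        [ -e "$f" ] || continue
        echo "hwmon: $f = $(cat "$f" 2>/dev/null)"
        found=1
    done
    # thermal: cooling devices registered by the ACPI fan driver
    for dev in "$sysroot"/class/thermal/cooling_device*; do
        [ -r "$dev/type" ] || continue
        if [ "$(cat "$dev/type")" = "Fan" ]; then
            echo "acpi: $dev (cur_state $(cat "$dev/cur_state" 2>/dev/null))"
            found=1
        fi
    done
    [ "$found" -eq 1 ] || echo "no fan interface visible to the OS"
}
```

If the function prints only the last message, the fan is most likely handled entirely by the embedded controller, as described above.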

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-02-10:

#35

Mike, for your issue 2 I have uploaded a change. https://github.com/01org/thermal_daemon/commits/master
In systems like yours, which run close to critical, the auto max adjustment needs a better algorithm.

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-02-10:

#36

Colin,
Do you have some auto builder which can make a test package for Mike? I added one commit on my branch.

mike@papersolve.com (mike-papersolve) wrote on 2017-02-10:

#37

Sri I was able to clone the changes you made and compile and run your new thermald (with your default config files too). (BTW - No matter what I did in my BIOS I couldn't get RAPL to not be locked by the BIOS but apparently this is a relatively common problem.) Very Good news: I was not able to get thermald stuck in the same state as previously, where it would constrain performance dramatically even once temperatures had returned to a normal state. Normally my CPU does not get anywhere near critical temp (in the 90+C range) but I have found some programs to get it there. Once temperatures had returned to 60-69C it took a few seconds but thermald started giving me performance back, and it seemed that within 30s it was max performance again.

I then verified that the Ubuntu-provided thermald (1.5.4-2) had the same terrible behavior, even resorting to idle injection (which got re-enabled for me since I put your default config files in place), and then not removing that idle injection once temperatures were in normal range! (And I waited over a minute too.) This is a huge deal! Not sure if you need to see any debug logs or anything, if you do e-mail me off-ticket, I should really stop hijacking teo's ticket here, I would have opened another ticket if these changes did not work but as it is they have and I also don't think you need to design any other algorithm, this is working fine. Any other adjustments I can make by experimenting at the temperature which thermald kicks in (instead of 80 as reported by my sensors, I could make it 85 or 90).

teo, I think in order to continue with this they need to know if your OS is able to see the fan and fan speeds, you would need to run sensors-detect and sensors to see if you see this, and provide them with thermald debug logs to continue troubleshooting.

Colin Ian King (colin-king) wrote on 2017-02-10:

#38

Since the report from Mike looks positive, I'll go ahead and apply this patch across the release via an SRU.

Alberto Salvia Novella (es20490446e) on 2017-02-10

Changed in thermald (Ubuntu):
importance:	Undecided → Critical

Srinivas Pandruvada (srinivas-pandruvada) wrote on 2017-02-10:

#39

Thanks Mike.
I would like to help to close the other issues pointed here on macbook and others. But need debug logs like you provided. Also
grep -r . /sys/class/thermal/*
Also if there are some sysfs entries for fan control /sys/devices/platform or others.

Looks like there is a way to control fan speed in macbooks
https://github.com/dgraziotin/mbpfan

Looks like there is some sysfs path
/sys/devices/platform/applesmc.768/fan
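
To gather both pieces of information in one go for attaching to the bug, something like the following could be used. It is a sketch only: the function name `dump_thermal_sysfs` is made up, and the second loop simply prints whatever platform-device attributes have names starting with `fan`, without assuming which ones a given driver provides.

```shell
# Dump thermal sysfs contents and any platform "fan*" attributes
# as "path:value" lines.
# The optional argument is the sysfs mount point (default /sys); it exists
# only so the function can be pointed at a test directory.
dump_thermal_sysfs() {
    sysroot="${1:-/sys}"
    # Same command as requested above; unreadable files are skipped silently.
    grep -r . "$sysroot"/class/thermal/* 2>/dev/null || true
    # Any readable platform-device attribute whose name starts with "fan".
    for f in "$sysroot"/devices/platform/*/fan*; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        echo "$f:$(cat "$f" 2>/dev/null)"
    done
    return 0
}
```

For example, `dump_thermal_sysfs > thermal_sysfs.txt` writes everything to a file that can be attached alongside the thermald debug log.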

I am travelling and on vacation, so expect delays.

Doug Smythies (dsmythies) wrote on 2017-02-10:

#40

Readers: I started with Mike when he was originally posting on bug #1188647 and we continued via e-mail before he came here. I do not know thermald at all, but have been running it with a trip point of 55 degrees (because my test server runs quite cool) while following here.

Mike said:
> I can't always get my load temp up to 97C, sometimes it takes a lot of work. ;)

That is not consistent with the turbostat data you sent me, which shows a package temperature of 98 degrees under about 70% load.

Colin said:
> after cleaning the fan and replacing the thermal paste on my
> aging laptop I saw an amazing improvement:

@Mike: Yes, and at the risk of losing a good user test case, I suspect you need to re-do the thermal paste between your processor and its heat sink. You and I have similar vintage processors, with identical TDPs, although mine is an i7 and yours is an i5. Under light load (I matched the package power of mine with your turbostat output) my processor runs 13 degrees cooler than yours. Under higher load (again I matched package power) my processor runs 30 degrees cooler than yours.
If Srinivas does want logs, then please post them here, for the education of others following this (i.e. me).

mike@papersolve.com (mike-papersolve) wrote on 2017-02-10:

#41

Doug, I only meant that I have to run stress-ng and other tasks to get it that hot; regular web browsing, e-mail, etc. doesn't do it. But yes, thermal paste is definitely the #2 thing after blowing out the fans. I may do that later, but my main issue is solved by this update (though more logging in syslog would also really help people in the future).

Without thermald, stress-ng and another task can get it up to 98C even, though something in hardware is limiting me from getting it up higher and shutting down the machine (and processor frequency does go down slightly when I'm in that range... in fact I just noticed some entries in dmesg about processor speed and cpu clock throttled, but it's not coming from thermald!). I am happy with thermald doing what it needs to keep me from being in the danger zone as long as it gives me back normal performance when I'm not (which the new thermald is definitely doing and the old one was not, I've been doing a lot of checking). But other users may be just as happy with shutting it off if their BIOS and other hardware mechanisms prevent them from shutting down.

I do hope teo and/or someone else w/ a macbook can provide some debug logs or info about their fans (I think mine is just going to be controlled and I'm OK with that). I'll watch but I'm not going to post anymore. ;)

mike@papersolve.com (mike-papersolve) wrote on 2017-02-10:

#42

logs.zip (5.6 KiB, application/zip)

But since you asked for some logs here is one more post with some I thought would be good:
- turbostat_high_load_without_thermald
- turbostat_high_load_with_thermald
- thermald_debug_high_load

The load was created with "stress-ng --matrix 0 -t 1m" and thunderbird, and
"turbostat --debug sleep 10" was executed halfway into the 1m stress-ng run.

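The "start a load, sample halfway through" procedure can be scripted so that runs are comparable. This is a sketch, not what was actually used to produce the attached logs: the helper name `capture_midway` and its three arguments are invented for this example, and the load and sampling commands are passed in as strings.

```shell
# Run a load command in the background and take one sample halfway through.
#   $1 = load duration in seconds
#   $2 = load command (string, run via sh -c, output discarded)
#   $3 = sampling command (string, run via sh -c, output kept)
capture_midway() {
    duration="$1"
    load_cmd="$2"
    sample_cmd="$3"
    sh -c "$load_cmd" >/dev/null 2>&1 &
    load_pid=$!
    # Wait until the midpoint of the load before sampling.
    sleep $((duration / 2))
    sh -c "$sample_cmd"
    # Let the load finish; its exit status is not of interest here.
    wait "$load_pid" || true
}
```

With the commands quoted above, a run would look like `capture_midway 60 "stress-ng --matrix 0 -t 1m" "turbostat --debug sleep 10" > turbostat_high_load.txt`, executed as root because turbostat needs it.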

Launchpad Janitor (janitor) wrote on 2017-02-11:

#43

This bug was fixed in the package thermald - 1.5.4-3

---------------
thermald (1.5.4-3) unstable; urgency=medium

  * upstream fix 7f83ada8133 ("check for recv failure and ensure "
    "buffer is null terminated")
    - fixes potential buffer overrun
  * upstream fix 53154fd496a ("Remove auto adjusted max temp")
    - addresses aggressive over-throttling on some H/W (LP: #1600599)

-- Colin King <email address hidden> Fri, 10 Feb 2017 15:19:11 +0000

Changed in thermald (Ubuntu):
status:	Confirmed → Fix Released

mike@papersolve.com (mike-papersolve) wrote on 2017-02-13:

#44

Please don't close this bug without another update from the original reporter. While I originally posted here because I thought our problems the same or very closely related, this bug does report a slightly different problem. He will need to provide debug logs to help but I feel bad hijacking his bug and in the future I will create a new bug instead if I'm not 100% sure my problem is the same one.

teo1978 (teo8976) wrote on 2017-02-13:

#45

I don't know, I guess I'm experiencing a mix of two issues, one (or a bunch of them) related to thermald, and another one related to something else preventing the fan from spinning as fast as it should, which I should probably report as a separate bug.

I guess it's pretty safe to assume that the part about Thermald is fixed, until proven otherwise. Clearly *something* that was wrong in thermald has been fixed.

Roger Lawhorn (rll-m) wrote on 2018-06-01:

#46

Everything said by TEO fully expresses my frustration with thermald.
So as not to repeat anything I'd just like to add that if I stop thermald some other process starts it up again.
Alas, I cannot win.

$ lsb_release -a
No LSB modules are available.
Distributor ID: LinuxMint
Description: Linux Mint 18.3 Sylvia
Release: 18.3
Codename: sylvia

$ inxi
CPU~Quad core Intel Core i7-4940MX (-HT-MCP-) speed/max~3553/3301 MHz Kernel~4.15.13-041513-generic x86_64 Up~44 min Mem~3664.4/32151.9MB HDD~8001.6GB(25.8% used) Procs~306 Client~Shell inxi~2.2.35

dah bien-hwa (dahbien-hwa) wrote on 2018-08-03:

#47

Just had similar episodes as what is described here.
My CPU apparently was close to overheating (though the maximum temperature I afterwards observed was 95°C, with a specified high/max of 100°C). The Ubuntu 18.04 on my laptop (Asus Zenbook UX301LA) consistently became unresponsive after starting high-cpu-usage tasks (compiling something using ninja-build); after stopping the thermald service, the unresponsiveness was gone even when running the high-cpu-usage tasks for a long time.

The unresponsiveness is quite severe: Though I did manage one time to switch to the tty1 and log in there, I could never enter any more commands there (waiting for approximately a minute or so...). I eventually always had to resort to more harsh methods of restarting the laptop.

Andy Whitcroft (apw) wrote on 2019-08-15: Please test proposed package

#48

Hello teo1978, or anyone else affected,

Accepted thermald into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.7.0-5ubuntu4 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in thermald (Ubuntu Bionic):
status:	New → Fix Committed
tags:	added: verification-needed verification-needed-bionic

Colin Ian King (colin-king) wrote on 2019-08-15:

#49

The bionic SRU test message occurred because I accidentally uploaded the package with the entire old history. This bug has already been fixed and the verification for bionic can be ignored.

no longer affects:	thermald (Ubuntu Bionic)
tags:	added: verification-done removed: verification-needed verification-needed-bionic
