Significantly lower power and thermal limits on ThinkPad T480s (and probably others) than on Windows

Bug #1763144 reported by Julian Andres Klode on 2018-04-11
This bug affects 18 people
Affects            Status     Importance  Assigned to  Milestone
linux (Ubuntu)     Invalid    Medium      Unassigned
thermald (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

A ThinkPad T480s under Windows has a power limit of 44W, both short and long term, with a thermal maximum of about 93C. Under Linux, the power limits are about 44W (short term) and 15W (long term), and the thermal limit is 80C, causing a significant performance loss.

Looking at MSR and MCHBAR values, we can see that the values are correctly at 44W in the MSR, but the MCHBAR is set to a lower value:

$ sudo rdmsr -a 0x610
42816000dd8160
42816000dd8160
42816000dd8160
42816000dd8160
42816000dd8160
42816000dd8160
42816000dd8160
42816000dd8160
$ sudo /home/jak/Downloads/iotools-1.5/iotools mmio_read64 0xfed159a0
0x0042816000dd8078

Setting the MCHBAR register to the same value as the MSR solves the problem. At some point, though, intel-rapl seems to reduce the overall frequency to 600 MHz.
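To make the difference between the two raw values easier to see, here is a small decoding sketch. It assumes the usual layout of the package power limit register (PL1 in bits 14:0, PL2 in bits 46:32) and a power unit of 1/8 W; the unit is an assumption, since it is really read from MSR 0x606 and was not dumped here. The function name is made up for illustration.

```python
# Assumption: power unit is 1/8 W (the common value of MSR 0x606, bits 3:0).
POWER_UNIT_W = 0.125


def decode_power_limits(raw):
    """Return (PL1, PL2) in watts from a raw 64-bit power limit value."""
    pl1 = (raw & 0x7FFF) * POWER_UNIT_W          # bits 14:0, long-term limit
    pl2 = ((raw >> 32) & 0x7FFF) * POWER_UNIT_W  # bits 46:32, short-term limit
    return pl1, pl2


msr_value = 0x42816000dd8160        # from rdmsr -a 0x610 above
mchbar_value = 0x0042816000dd8078   # from the mmio_read64 call above

print("MSR    PL1/PL2:", decode_power_limits(msr_value))
print("MCHBAR PL1/PL2:", decode_power_limits(mchbar_value))
```

Under those assumptions the MSR decodes to 44W for both limits, while the MCHBAR copy decodes to 15W long term and 44W short term, which matches the 15W behaviour seen under Linux.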

The thermal limit is configured in MSR 0x1a2; rdmsr -f 29:24 -d 0x1a2 returns 20. Setting those bits to 7 raises the limit, resulting in performance comparable to Windows.
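A rough sketch of the bit manipulation involved when changing only bits 29:24 and leaving the rest of the register alone. The sample raw value is hypothetical (it was not dumped in this report) and the helper names are made up; only the bit positions come from the rdmsr call above.

```python
def get_field(value, high, low):
    """Extract bits high:low (inclusive) from an integer."""
    width = high - low + 1
    return (value >> low) & ((1 << width) - 1)


def set_field(value, high, low, new):
    """Return value with bits high:low replaced by new."""
    width = high - low + 1
    if new < 0 or new >> width:
        raise ValueError("new value does not fit into the field")
    mask = ((1 << width) - 1) << low
    return (value & ~mask) | (new << low)


# Hypothetical raw content of MSR 0x1a2: offset 20 in bits 29:24.
sample = 0x14640000

print(get_field(sample, 29, 24))           # current offset
print(hex(set_field(sample, 29, 24, 7)))   # value to write back for offset 7
```

The result of set_field is what one would hand to wrmsr for 0x1a2, after reading the real current value first.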

Most of this analysis is based on the one in

https://www.reddit.com/r/thinkpad/comments/870u0a/t480s_linux_throttling_bug/

This applies to all bionic kernels I have tested so far, including

Ubuntu 4.15.0-13.14-generic 4.15.10
---
ApportVersion: 2.20.9-0ubuntu4
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/pcmC2D0p: jak 5881 F...m pulseaudio
 /dev/snd/controlC2: jak 5881 F.... pulseaudio
 /dev/snd/controlC1: jak 5881 F.... pulseaudio
 /dev/snd/controlC0: jak 5881 F.... pulseaudio
CurrentDesktop: GNOME
DistroRelease: Ubuntu 18.04
InstallationDate: Installed on 2018-03-14 (28 days ago)
InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Alpha amd64 (20180313)
MachineType: LENOVO 20L8S02D00
Package: linux (not installed)
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: root=/dev/mapper/ubuntu--vg-root ro rootflags=subvol=@ quiet splash vt.handoff=1
ProcVersionSignature: Ubuntu 4.15.0-13.14-generic 4.15.10
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-13-generic N/A
 linux-backports-modules-4.15.0-13-generic N/A
 linux-firmware 1.173
Tags: bionic
Uname: Linux 4.15.0-13-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip kvm lpadmin lxd plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 01/22/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: N22ET31W (1.08 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20L8S02D00
dmi.board.vendor: LENOVO
dmi.board.version: Not Defined
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN22ET31W(1.08):bd01/22/2018:svnLENOVO:pn20L8S02D00:pvrThinkPadT480s:rvnLENOVO:rn20L8S02D00:rvrNotDefined:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad T480s
dmi.product.name: 20L8S02D00
dmi.product.version: ThinkPad T480s
dmi.sys.vendor: LENOVO

apport information

tags: added: apport-collected bionic
description: updated

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

An actually useful dmesg log (from journalctl -k)

summary: - Significantly lower power and thermal limits on T480s (and probably
- others) than on Windows
+ Significantly lower power and thermal limits on ThinkPad T480s (and
+ probably others) than on Windows
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.16 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.16

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Julian Andres Klode (juliank) wrote :

Note that the thermal limit seems to be rewritten by the EC at random times.

The person originally discovering the issue wrote a userspace daemon to write the proper stuff in memory from time to time - https://github.com/erpalma/lenovo-throttling-fix - but this should best get fixed in the kernel (unless it's a bug in the BIOS/EC firmware, then it's out of our hands).
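For illustration only, a minimal sketch of what "rewrite the value from time to time" could look like for an MSR. This is not taken from the linked project; the function names, the interval and the round count are made up. It relies on the Linux msr module, which exposes /dev/cpu/N/msr and expects 8-byte reads and writes at the file offset equal to the MSR address (root required). The MCHBAR copy of the power limit lives in memory-mapped I/O and is not covered here.

```python
import os
import struct
import time

MSR_PATH = "/dev/cpu/{cpu}/msr"  # provided by the msr kernel module


def read_msr(addr, cpu=0, path=MSR_PATH):
    """Read one 64-bit MSR via the msr device node."""
    fd = os.open(path.format(cpu=cpu), os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)


def write_msr(addr, value, cpu=0, path=MSR_PATH):
    """Write one 64-bit MSR via the msr device node."""
    fd = os.open(path.format(cpu=cpu), os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), addr)
    finally:
        os.close(fd)


def reapply(addr, wanted, rounds, interval_s=5.0, cpu=0, path=MSR_PATH):
    """Check the MSR `rounds` times and rewrite it whenever it has changed.

    Returns how many times the value had to be rewritten.
    """
    rewrites = 0
    for _ in range(rounds):
        if read_msr(addr, cpu, path) != wanted:
            write_msr(addr, wanted, cpu, path)
            rewrites += 1
        time.sleep(interval_s)
    return rewrites
```

A call such as reapply(0x1a2, wanted_value, rounds=10) would then undo the EC's rewrites for a while; wanted_value has to be derived from the register's real content on the machine.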

Julian Andres Klode (juliank) wrote :

I have seen this problem on all bionic kernels I have tested. The machine has never seen anything older than bionic. I installed it from a daily image at the end of March.

The mainline kernel does not fix the bug.

tags: added: bug-exists-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Julian Andres Klode (juliank) wrote :

Sorry, messed up the tag a bit.

tags: added: kernel-bug-exists-upstream
removed: bug-exists-upstream
AaronMa (mapengyu) wrote :

Hi,
The TDP of the Intel U CPU family is defined as 25 W.
Please refer to:
https://www.intel.com/content/www/us/en/products/processors/core/i5-processors/i5-8250u.html

The CPU is not meant to reach 44W; that is far beyond what Intel designed it for.

To allow optimal operation and long-term reliability of Intel processor-based
systems, the applicable component temperature specification is TjMax, which is defined in MSR 0x1a2, bits [23:16].
TjMax is factory calibrated and is not user configurable.
The TCC Activation Offset is set in the TEMPERATURE_TARGET (0x1A2) MSR, bits [29:24];
the offset value is subtracted from TjMax.
Please try it if you like overclocking.
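As a worked example of the subtraction described above, the following sketch computes the temperature at which throttling starts from a raw TEMPERATURE_TARGET value. The raw values are hypothetical: they assume TjMax is 100C, which is not stated anywhere in this report, and the function name is made up.

```python
def tcc_activation_temp(raw):
    """Return TjMax minus the TCC Activation Offset, in degrees C."""
    tjmax = (raw >> 16) & 0xFF    # bits 23:16
    offset = (raw >> 24) & 0x3F   # bits 29:24
    return tjmax - offset


# Hypothetical register contents, assuming TjMax = 100C (0x64).
with_offset_20 = 0x14640000
with_offset_7 = 0x07640000

print(tcc_activation_temp(with_offset_20))
print(tcc_activation_temp(with_offset_7))
```

With that assumed TjMax, an offset of 20 gives 80C and an offset of 7 gives 93C, which lines up with the limits reported for Linux and Windows in the bug description.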

When the temperature is retrieved from the processor MSR, it is the instantaneous
temperature of the given DTS.
The thermal interrupt in Linux depends on reading the MSR to warn the system of conditions such as:
temperature above threshold
The EC will then make the CPU run at full speed to cool down the system; once the CPU temperature is stable at around 80C, the CPU can work at its defined frequency.

This is designed to let the system work reliably at standard performance regardless of the OS.

Julian Andres Klode (juliank) wrote :

Sure, it's not designed for 44W by Intel. I only measured 27W maximum when configured to 44W. It's just that the laptop is likely configured for an overall thermal limit of 44W because it can house an MX150 with the same cooling system (dual heat pipe). In any case, on Windows it is not bound by 15W long term, but runs higher, averaging about 30W from what I read.

On the thermal point, I think I wrote the same thing. On Linux the offset is 20; on Windows the limit seems higher, as it reaches 90C and more.

AaronMa (mapengyu) wrote :

Quoted from Lenovo's feedback:

"
Due to "DPTF" function will act on Windows system by BIOS/Driver support, the system will turn to cool mode when keep high temperature several minutes.
But Linux system didn't have driver support DPTF function, it only have TDP limit on platform.
So user should have different between Windows and Linux system as normal behavior.
"

And my test result:
1. When stress testing on Win10, the CPU reaches 98C at 3.2 GHz.
But it can't sustain this state for more than several minutes; it is too hot for the CPU. So the CPU drops to 1.1 GHz to cool itself.
The temperature and frequency then go into a loop like 98C/3.2 GHz -> 80C/1.1 GHz -> 98C/3.2 GHz.
2. On Ubuntu 16.04, the CPU stays at 80C/2.7 GHz under stress.

I own a T460P with i7-6700HQ cpu.
The thermal throttling makes my pc lag like hell.
Even on the next day, after a night in standby.

Only a restart can fix it.

MMS-Prodeia (mms-prodeia) wrote :

I think we have a problem here that needs to be solved on multiple fronts.

* Throttling kicks in to keep the dGPU from reaching its fall-off limit, which is 76° (T580, i7-8550u, MX150). The only possible way to stay below this thermal limit is to power down the CPU massively, which leads to =>

* The dGPU isn't throttling itself down enough. In tests (glxsphere) my findings are:
The dGPU runs at about 1600-1700 MHz and does not go down further. I guess this is because the GPU itself has a thermal limit of around 95°, so its own throttling kicks in much later than 76° and the GPU does not help cool the system by slowing down more.
I wasn't able to get a reading of the power consumed by the GPU; nvidia-smi does not offer it (nvidia-396). This leads to =>

* There's an imbalance between the GPU's and the CPU's abilities to adapt thermally. The GPU isn't able to; the CPU is and does, which leads to =>

a) a system running on the brakes to manage heat, in order to stay below a possible GPU fall-off (full GPU power and stable frequency over time, with the CPU going down to only 3-5W)
or
b) a system running at the possible performance using the CPU's full potential, but only being able to use ⅓ of the dGPU's capacity

Conclusion from my perspective:
There's a need to balance :-)

* There's a need to influence the GPU's ability to throttle earlier (=> nvidia)
* Lenovo-fix could be enhanced to provide some kind of dynamic use-case profiles (the easiest way to test is: set the temp in Lenovo-fix to max 75°. As both CPU and dGPU produce heat and 76° is the fall-off, staying below that keeps the system from throttling.)

Julian Andres Klode (juliank) wrote :

This bug was about different power limits for the non-dGPU version compared to Windows. It turned out that the power limits in Windows are broken, and the ones in Linux are correct, so this bug is in fact Invalid.

For the dGPU, I'm not sure what the intention is. Maybe report a bug against the nvidia driver if you think it's wrong?

Changed in linux (Ubuntu):
status: Confirmed → Invalid
MMS-Prodeia (mms-prodeia) wrote :

"It turned out that the power limits in Windows are broken, and the ones in Linux are correct, so this bug is in fact Invalid."
Where do I find that?

luckyrings (d8f2) wrote :

The described CPU thermal throttling behavior was confirmed by Lenovo. The limits are higher on Windows because "DPTF" works through driver support and enables the "desk mode". On Linux, the "desk mode" is never reached due to the lack of sensor information, and the system operates in the "lap mode", which has a much lower trip temperature.

See the forum thread here:

https://forums.lenovo.com/t5/Other-Linux-Discussions/X1C6-T480s-low-cTDP-and-trip-temperature-in-Linux/td-p/4028489

The patches are not published yet for all platforms. Like the T480s, many other models, e.g. the X1C6, are affected too.

Background info by Lenovo:

https://forums.lenovo.com/lnv/attachments/lnv/Special_Interest_Linux/13642/1/Linux%20Thermal%20throttling.pdf

Technical description by a German magazine (in German only): https://www.notebookcheck.com/Lenovo-ThinkPads-haben-mit-CPU-Throttling-unter-Linux-zu-kaempfen-Loesung-in-Arbeit.435573.0.html?fbclid=IwAR0koYya67WfT8SlZmDuNx6F51niWYHzvd4C1MMzZdiyCJLutmC_4p3BJVs

Francois Thirioux (fthx) wrote :

I'm affected too (Ubuntu focal, kernel 5.4).

I found a simple workaround using thermald here:
https://forums.lenovo.com/t5/Other-Linux-Discussions/X1C6-T480s-low-cTDP-and-trip-temperature-in-Linux/m-p/4637873#M14378
with help from here:
https://github.com/intel/thermal_daemon/issues/215
It's just a workaround... and I don't really understand how it raises the limits, but it's better than nothing.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in thermald (Ubuntu):
status: New → Confirmed