Thermal and power monitoring issues: CPU overheats and fan control broken/strange

Bug #1803442 reported by Chris
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

On my ThinkPad P72 with an i7-8850H CPU (Coffee Lake) the thermal behavior is horrible: the machine runs very hot even when idle (you can feel it on the keyboard as well as in the air exhausted by the fan), and while idle the fan speeds up and goes silent again after a short time with no obvious change in load.
(Definition of idle: quite a few browser tabs open in Firefox and Chrome, sitting in the background untouched for a long time, with no user interaction with the machine.)

The reported CPU temperature rises to an average of 60°C during idle with sporadic peaks well over 80°C (e.g. 95°C) for a single CPU.
With light work (browsing the web) the temperature rises further and the peaks become more frequent, which is expected given the higher workload. But it is still far too hot for such a light workload on such a high-end machine.

Chances are high that after a short while the CPU frequency gets hard-locked at 800 MHz, and I have found no way to get out of that state apart from rebooting.
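
One way to confirm the 800 MHz lock is to read the per-CPU frequencies that the kernel exposes through cpufreq in sysfs. The following is only a minimal sketch: it assumes the standard scaling_cur_freq files (values in kHz) are present, and the function names, the 800000 kHz floor and the 10000 kHz tolerance are illustrative choices, not values taken from this report's attachments.

```python
from pathlib import Path


def read_cur_freqs_khz(base="/sys/devices/system/cpu"):
    """Return {cpu_name: current frequency in kHz} from the cpufreq sysfs files."""
    freqs = {}
    for f in sorted(Path(base).glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        # f is <base>/cpuN/cpufreq/scaling_cur_freq, so parent.parent is cpuN
        freqs[f.parent.parent.name] = int(f.read_text().strip())
    return freqs


def all_pinned_at_floor(freqs_khz, floor_khz=800_000, tolerance_khz=10_000):
    """True if every CPU reports a frequency within the tolerance of the floor."""
    return bool(freqs_khz) and all(
        abs(value - floor_khz) <= tolerance_khz for value in freqs_khz.values()
    )


if __name__ == "__main__":
    freqs = read_cur_freqs_khz()
    print(freqs)
    print("pinned at floor:", all_pinned_at_floor(freqs))
```

A single reading at the floor is normal on an idle machine; the lock described above would show up as the check staying true across repeated samples taken while the system is under load.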

A cross-check with Windows 10 (dual boot) showed much better thermal behavior there: temperatures stayed lower, and under high load the fan ran at a moderate level with the CPUs averaging about 3 GHz. That is the behavior I expected from Linux.

While searching for causes and fixes, I have tried so far:
- disabling hardware P-states or even the pstate driver - no improvement (actually I felt it was even slightly worse)
- switching from laptop-mode-tools to tlp (no change)
- activating everything from powertop (no change)
- running powerstat was very interesting:
*********
 $ powerstat
Running for 300.0 seconds (30 samples at 10.0 second intervals).
Power measurements will start in 180 seconds time.

  Time User Nice Sys Idle IO Run Ctxt/s IRQ/s Watts
[...]
-------- ----- ----- ----- ----- ----- ---- ------ ------ ------
 Average 6.9 0.0 2.0 91.0 0.0 1.4 6732.7 3615.9 41.71
 GeoMean 6.8 0.0 2.0 91.0 0.0 1.3 6129.6 3566.7 41.63
  StdDev 1.1 0.0 0.5 1.5 0.1 0.9 4380.3 668.7 2.55
-------- ----- ----- ----- ----- ----- ---- ------ ------ ------
 Minimum 5.3 0.0 1.4 86.9 0.0 1.0 4508.5 2859.7 37.65
 Maximum 10.4 0.0 4.2 93.0 0.4 5.0 28301.5 6420.4 47.85
-------- ----- ----- ----- ----- ----- ---- ------ ------ ------
Summary:
System: 41.71 Watts on average with standard deviation 2.55
*********
but reading from the RAPL domains only (-R) gives a much lower figure:
*********
$ powerstat -R
Running for 60.0 seconds (60 samples at 1.0 second intervals).
Power measurements will start in 0 seconds time.

  Time User Nice Sys Idle IO Run Ctxt/s IRQ/s Watts
[...]
-------- ----- ----- ----- ----- ----- ---- ------ ------ ------
 Average 6.4 0.0 1.7 91.9 0.0 1.7 4966.6 3426.7 5.18
 GeoMean 4.5 0.0 1.1 91.5 0.0 1.5 4158.6 2173.9 4.40
  StdDev 6.1 0.0 1.8 7.8 0.1 1.0 3324.3 3723.3 4.00
-------- ----- ----- ----- ----- ----- ---- ------ ------ ------
 Minimum 1.2 0.0 0.3 71.9 0.0 1.0 1565.0 791.0 2.84
 Maximum 23.1 0.0 6.2 98.4 0.3 6.0 13108.0 15195.0 21.91
-------- ----- ----- ----- ----- ----- ---- ------ ------ ------
Summary:
CPU: 5.18 Watts on average with standard deviation 4.00
Note: power read from RAPL domains: dram, package-0, core, psys.
These readings do not cover all the hardware in this device.
*********
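
To put the two powerstat summaries side by side, the arithmetic can be sketched as below. This is only a rough comparison under stated assumptions: the two runs were not taken at the same time, and the helper name average_watts is made up for this illustration. It simply parses the two "Summary" lines quoted above.

```python
import re

# The two summary lines exactly as quoted in the powerstat output above.
SYSTEM_SUMMARY = "System: 41.71 Watts on average with standard deviation 2.55"
CPU_SUMMARY = "CPU: 5.18 Watts on average with standard deviation 4.00"


def average_watts(summary_line):
    """Extract the average wattage from a powerstat summary line."""
    match = re.search(r":\s*([0-9.]+) Watts on average", summary_line)
    if match is None:
        raise ValueError(f"not a powerstat summary line: {summary_line!r}")
    return float(match.group(1))


system_w = average_watts(SYSTEM_SUMMARY)
cpu_w = average_watts(CPU_SUMMARY)

# Power not accounted for by the RAPL domains (rough, runs were not simultaneous).
other_w = round(system_w - cpu_w, 2)
# Fraction of the whole-system figure that RAPL reports.
rapl_share = cpu_w / system_w

print(f"outside RAPL domains: {other_w} W, RAPL share: {rapl_share:.1%}")
```

If the numbers are comparable at all, roughly 36.5 W of the 41.7 W idle draw (about 88%) is consumed outside the RAPL domains listed in the note, which on this machine would include the discrete GPU.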

ProblemType: Bug
DistroRelease: Ubuntu 18.10
Package: linux-image-4.18.0-11-generic 4.18.0-11.12
ProcVersionSignature: Ubuntu 4.18.0-11.12-generic 4.18.12
Uname: Linux 4.18.0-11-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.10-0ubuntu13.1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: cm 2134 F.... pulseaudio
 /dev/snd/controlC0: cm 2134 F.... pulseaudio
CurrentDesktop: KDE
Date: Wed Nov 14 22:07:52 2018
HibernationDevice:
 #RESUME=UUID=9cd06200-ed0e-4f66-9baa-a3977c6b59e0
 RESUME=/dev/mapper/GROUP-SWAP
InstallationDate: Installed on 2018-10-10 (35 days ago)
InstallationMedia: Kubuntu 18.04.1 LTS "Bionic Beaver" - Release amd64 (20180725)
MachineType: LENOVO 20MBCTO1WW
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.18.0-11-generic root=/dev/mapper/GROUP-ROOT ro quiet splash vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-11-generic N/A
 linux-backports-modules-4.18.0-11-generic N/A
 linux-firmware 1.175
SourcePackage: linux
UpgradeStatus: Upgraded to cosmic on 2018-10-20 (25 days ago)
dmi.bios.date: 10/24/2018
dmi.bios.vendor: LENOVO
dmi.bios.version: N2CET31W (1.14 )
dmi.board.asset.tag: Not Available
dmi.board.name: 20MBCTO1WW
dmi.board.vendor: LENOVO
dmi.board.version: Not Defined
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: None
dmi.modalias: dmi:bvnLENOVO:bvrN2CET31W(1.14):bd10/24/2018:svnLENOVO:pn20MBCTO1WW:pvrThinkPadP72:rvnLENOVO:rn20MBCTO1WW:rvrNotDefined:cvnLENOVO:ct10:cvrNone:
dmi.product.family: ThinkPad P72
dmi.product.name: 20MBCTO1WW
dmi.product.sku: LENOVO_MT_20MB_BU_Think_FM_ThinkPad P72
dmi.product.version: ThinkPad P72
dmi.sys.vendor: LENOVO

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Chris (mail-christianmayer) wrote :

Also interesting: kern.log contains messages like these:
Nov 14 23:18:55 kenobi kernel: [14566.011739] CPU2: Core temperature above threshold, cpu clock throttled (total events = 3475)
Nov 14 23:18:55 kenobi kernel: [14566.011739] CPU8: Core temperature above threshold, cpu clock throttled (total events = 3475)
Nov 14 23:18:55 kenobi kernel: [14566.011741] CPU8: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011743] CPU2: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011839] CPU0: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011840] CPU7: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011841] CPU6: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011842] CPU1: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011843] CPU4: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011844] CPU10: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011845] CPU3: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011845] CPU9: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011847] CPU5: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.011847] CPU11: Package temperature above threshold, cpu clock throttled (total events = 7672)
Nov 14 23:18:55 kenobi kernel: [14566.012760] CPU2: Core temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012761] CPU8: Core temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012762] CPU8: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012762] CPU2: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012763] CPU7: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012764] CPU1: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012764] CPU6: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012765] CPU0: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012766] CPU9: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012782] CPU10: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012783] CPU4: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012783] CPU3: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012814] CPU11: Package temperature/speed normal
Nov 14 23:18:55 kenobi kernel: [14566.012815] CPU5: Package temperature/speed normal
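
Log excerpts like the one above can be summarized with a small script instead of being read by eye. The sketch below assumes the message wording shown in this excerpt; the regular expression and the function name summarize_throttling are illustrative, and other kernel versions may word these messages differently.

```python
import re
from collections import Counter

# Matches the "above threshold" lines in the excerpt; "normal" lines are skipped.
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Core|Package) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<total>\d+)\)"
)


def summarize_throttling(lines):
    """Return (message count per scope, highest 'total events' value per scope)."""
    messages = Counter()
    highest_total = {}
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match is None:
            continue
        scope = match.group("scope")
        messages[scope] += 1
        highest_total[scope] = max(
            highest_total.get(scope, 0), int(match.group("total"))
        )
    return messages, highest_total


sample = [
    "Nov 14 23:18:55 kenobi kernel: [14566.011739] CPU2: Core temperature above "
    "threshold, cpu clock throttled (total events = 3475)",
    "Nov 14 23:18:55 kenobi kernel: [14566.011741] CPU8: Package temperature above "
    "threshold, cpu clock throttled (total events = 7672)",
    "Nov 14 23:18:55 kenobi kernel: [14566.012760] CPU2: Core temperature/speed normal",
]

messages, highest_total = summarize_throttling(sample)
print(dict(messages), highest_total)
```

Feeding the whole of kern.log through the same function (for example via open("/var/log/kern.log")) would show how quickly the counters grow over time.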

Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.20 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Chris (mail-christianmayer) wrote :

The system is new, so I can't say whether it was better with a previous Ubuntu version.

I'll try to run the latest upstream kernel now and report the results later.

Chris (mail-christianmayer) wrote :

I tested with mainline/v4.20-rc3, but it was unusable on my machine: running the Quadro P3200 with the nouveau driver produced a completely broken display (the NVIDIA 390 drivers are fine, though). At least this behavior is identical between 4.18.0-11 and mainline/v4.20-rc3.

So unless there is a way to run mainline/v4.20-rc3 with the Ubuntu NVIDIA package, I can't test.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Brad Figg (brad-figg)
tags: added: ubuntu-certified
tags: added: cscc