Thermald 1.9.1-1ubuntu0.6 keeps Tigerlake GPU frequency on 400 MHz

Bug #1944389 reported by Eugene86
This bug affects 1 person
Affects Status Importance Assigned to Milestone
thermald (Ubuntu)
Won't Fix
Undecided
Unassigned

Bug Description

After updating from 1.9.1-1ubuntu0.4 to 1.9.1-1ubuntu0.6, thermald pins the Tigerlake Iris Xe GPU frequency at 400 MHz once a high temperature is reached. It has become impossible to play video games on the laptop.
System: Ubuntu 20.04.3 LTS
Kernel: 5.10.0-1045-oem
Laptop: Dell XPS 9310, CPU: Intel Core i7-1165G7 (Tigerlake), Integrated GPU Iris Xe.
BIOS 2.1.1 03/25/2021
Display: 2560x1440 144Hz HDMI USB-C connection
Room temperature is 23.5-25.4 degrees
Game: Stalker Clear Sky (Wine/Proton Steam)
Note: The game itself is very old and loads one CPU core to 100% regardless of frequency.

GPU frequency is monitored by intel-gpu-top
GPU frequencies (according to /sys/class/drm/card0/) min/max/boost/efficiency: 100/1300/1300/400
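For anyone reproducing these readings, here is a minimal sketch of how the values can be read from sysfs. It assumes the i915 driver and card0; apart from gt_RP1_freq_mhz, the file names are the usual i915 sysfs names rather than something quoted in this report, and show_gpu_freqs is a hypothetical helper name.

```shell
#!/bin/sh
# Print the i915 frequency limits and the actual frequency for one DRM card.
# The card directory is a parameter so another card can be inspected.
show_gpu_freqs() {
    card_dir=${1:-/sys/class/drm/card0}
    # RP1 is the value the report calls the "efficiency" frequency.
    for name in min max boost RP1 act; do
        file="$card_dir/gt_${name}_freq_mhz"
        if [ -r "$file" ]; then
            printf '%s=%s\n' "$name" "$(cat "$file")"
        else
            printf '%s=n/a\n' "$name"
        fi
    done
}

show_gpu_freqs "$@"
```

Running it under `watch -n 1` while the game is active should show `act` dropping to the RP1 value when the throttling described below sets in.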

Previous behavior (1.9.1-1ubuntu0.4)
After starting the game at first GPU reaches the boost value of 1300 MHz and CPU/package temperatures continuously increase. At this point game renders at ~80FPS.
After some time, when the threshold temperature (~78 degrees) is reached, the GPU frequency decreases to ~660 MHz and FPS to 40-48. Package temperature decreases to 66-68 degrees. It's possible to play for an indefinite amount of time.

New behavior (1.9.1-1ubuntu0.6)
After starting the game at first GPU reaches the boost value of 1300 MHz and CPU/package temperatures continuously increase. At this point game renders at ~80FPS.
But after reaching the threshold temperature (about 81 degrees) the GPU frequency drops to 400 MHz (gt_RP1_freq_mhz -- the "efficiency" frequency of the GPU) and stays there indefinitely. The temperature is held at 70-74 degrees. FPS is about 25-30, so the game is no longer playable. The only way to get the good FPS and frequency back is to minimize the game window, wait some time and open it again.

There is also a workaround -- limit the CPU frequency to 2001 MHz and disable Intel turbo boost. With this approach the package temperature never reaches 80 degrees and it is possible to play the game with the GPU at 500 MHz and 35-40 FPS. Better than nothing.

If it is needed I can perform any additional checks, provide CPU frequencies and so on. Most probably regression happened with 1.9.1-1ubuntu0.5, but I tried only versions 0.4 and 0.6

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

Try the upstream version
https://github.com/intel/thermal_daemon

May be missing some backports.

tags: added: regression-update
Revision history for this message
Eugene86 (eugene86) wrote :

I've checked with the latest git version -- it has the same problem.

New behavior (git thermald, as well as the latest Ubuntu version with backports): thermald keeps the GPU frequency at 400 MHz after reaching 85 degrees, even if the temperature decreases to 74.
After the thermal regime stabilizes, the measurements are as follows:
GPU freq 400MHz
CPU freq 2.4GHz
Package temperature: 74 degrees
FPS is 25-35 -- it's impossible to play the game
intel_gpu_top displays that "Render/3D/0" engine is busy up to 100%, usually 97-99

Old behavior:
thermald starts throttling after reaching 70 degrees but allows the GPU to run at 660 MHz
After the thermal regime stabilizes, the measurements are as follows:
GPU freq 660MHz
CPU freq 1.3GHz
Package temperature: 64 degrees
FPS is 45-53 -- game is playable
intel_gpu_top displays that "Render/3D/0" engine is almost always busy at 100%

So new thermald ignores GPU demand and sacrifices the GPU performance in favor of CPU and at the same time keeps the higher temperature.

I also have found a workaround -- to disable the new behavior one needs to remove the "--adaptive" switch. W/o it the new version works as the old one with that switch.
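One hedged way to apply this workaround persistently is a systemd drop-in that repeats the packaged ExecStart line without the switch. The sketch below only generates the drop-in text; the ExecStart line in the example call is an assumption about the packaged unit and should be replaced with whatever `systemctl cat thermald` actually shows, and make_dropin is a hypothetical helper name.

```shell
#!/bin/sh
# Print a systemd drop-in that reruns the given ExecStart command without --adaptive.
# $1: the current ExecStart line, e.g. from: systemctl cat thermald | grep '^ExecStart='
make_dropin() {
    stripped=$(printf '%s\n' "$1" | sed -e 's/[[:space:]]*--adaptive//')
    # An empty ExecStart= clears the packaged command before the new one is set.
    printf '[Service]\nExecStart=\n%s\n' "$stripped"
}

# Assumed example line; check the real one on your system first.
make_dropin 'ExecStart=/usr/sbin/thermald --systemd --dbus-enable --adaptive'
```

The output could be saved as /etc/systemd/system/thermald.service.d/override.conf, followed by `systemctl daemon-reload` and `systemctl restart thermald`. Note the maintainer's later remark in this thread that the adaptive tables are what keep the system within its OEM-defined thermal limits.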

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

Run the github version

# systemctl disable thermald
# thermald --no-daemon --loglevel=info --adaptive

Attach the log. It is possible that skin temperature is the limit.

Revision history for this message
Eugene86 (eugene86) wrote :

Here is the log
GPU frequency decreased to 400MHz at ~23:50 UTC+3

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

There is a temperature sensor "THP" which has a limit of 53C. This temperature was exceeded, which calls for thermal throttling to limit package power to 15W according to the thermal tables on this system.

Revision history for this message
Eugene86 (eugene86) wrote :

I just checked old and new behavior again.
First of all, it's not clear to me where this value of 53 comes from. According to the data in /sys/class/thermal/thermal_zone5 (THP):
trip_point_0_hyst:4000
trip_point_0_temp:-274000
trip_point_0_type:passive
trip_point_1_hyst:4000
trip_point_1_temp:-274000
trip_point_1_type:passive
trip_point_2_hyst:4000
trip_point_2_temp:80050
trip_point_2_type:critical
trip_point_3_hyst:4000
trip_point_3_temp:75050
trip_point_3_type:hot
trip_point_4_hyst:4000
trip_point_4_temp:65050
trip_point_4_type:passive
and the same values are shown for THP by old thermald (via ThermalMonitor):
THP:
65 Passive
75 Max
80 Critical
58 Polling

For new thermald with --adaptive THP values (via ThermalMonitor) are the following:
52 Passive
46 Polling
and they don't match the data from /sys/class/thermal/thermal_zone5/
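To make the sysfs side of this comparison easier to read, the trip points can be printed in degrees. This is a minimal sketch assuming the standard thermal sysfs layout shown above (values in millidegrees Celsius); show_trip_points is a hypothetical helper name, and -274000 is treated as the kernel's "invalid temperature" marker.

```shell
#!/bin/sh
# List the trip points of one thermal zone in degrees Celsius.
# sysfs reports millidegrees; -274000 (below absolute zero) marks an unset trip.
show_trip_points() {
    zone_dir=${1:-/sys/class/thermal/thermal_zone5}
    for temp_file in "$zone_dir"/trip_point_*_temp; do
        [ -r "$temp_file" ] || continue
        type_file=${temp_file%_temp}_type
        milli=$(cat "$temp_file")
        kind=$(cat "$type_file" 2>/dev/null || echo unknown)
        if [ "$milli" -le -273150 ]; then
            printf '%s unset\n' "$kind"
        else
            awk -v m="$milli" -v k="$kind" 'BEGIN { printf "%s %.2f C\n", k, m / 1000 }'
        fi
    done
}

show_trip_points "$@"
```

With the values listed above this would report the two -274000 entries as unset and the remaining ones as 80.05 C (critical), 75.05 C (hot) and 65.05 C (passive), which matches what the old thermald shows and not the 52 degree value seen with --adaptive.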

I'd be grateful if you could explain what I missed, but I believe you may be right: at least it looks like thermald tries to maintain the 53 degree value.

But regarding my issue I can tell you some measurements:
With old thermald the zone5 (THP) temperature is kept at 53-56 during the load, CPU temperature 60-64, CPU frequency 1.3-1.4 GHz.
With new thermald the zone5 (THP) temperature is also kept at 53-56 during the load, CPU temperature 70-74, CPU frequency 2.4 GHz (up to 2.7), and there is more fan noise.

So it looks like thermald tries to keep the THP temperature within limits by decreasing the GPU frequency, but since the GPU is on the same die as the CPU and shares the cooling fan, thermald does not succeed -- because the CPU heats the GPU.

So my biggest concern and suspicion here is that the new thermald (with --adaptive) sacrifices the GPU for the CPU -- it makes the GPU run at lower frequencies in order to let the CPU run at higher ones. Maybe that is reasonable for some loads (I can hardly imagine which ones) but not for an end-user laptop, and especially not for gaming/3D or 4K accelerated video from Youtube -- when there is high demand for the GPU, throttling it is what hurts performance, latency and the overall end-user experience the most.
Tigerlake laptops have more than enough CPU performance for an end user but a very weak GPU, and even this GPU is throttled under serious load.

I understand that it would be better to have an option for choosing what to prioritize, GPU or CPU, but for the majority of end users (and this Tigerlake is not a tool for those who want to run heavy computing tasks) GPU prioritization would be the more desired and expected behavior.

Revision history for this message
Colin Ian King (colin-king) wrote :

I've backported the latest thermald to older releases to make testing easier:

https://launchpad.net/~colin-king/+archive/ubuntu/thermald-backports

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

Data from /sys/class/thermal are usually stale. They are left over from previous platforms. The actual limits come from the adaptive tables, based on the condition match. This is what the OEM defined for the system.
Also, without adaptive the power limits are insane here.

Thermald doesn't reduce GPU frequency. It only reduced total power. How power is distributed between CPU and GPU is beyond thermald control.

If someone doesn't want adaptive, they can always not use it. But adaptive is important, as it keeps the system under the thermal limits as defined and for the life of the system.

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

I can give you another knob; please try it and see whether things improve while playing this game.

# cd /sys/bus/pci/devices/0000:00:04.0/workload_request
# echo idle > workload_type

or

# echo bursty > workload_type

Revision history for this message
Eugene86 (eugene86) wrote :

Srinivas, thank you for the explanation. But could you also explain what you mean by

> It only reduced total power
Is this 15 W limit (after reaching 53 degrees) for the whole Core i7 chip (CPU+GPU+..) or for the GPU chiplet only?
If this limit is for the whole chip then something is wrong with thermald, because overall power consumption with the new thermald is significantly higher (because of higher temperature / fan RPM) and the CPU frequency is almost two times higher (2.4 vs 1.3 GHz).

Regarding workload_type/workload_request: there is no such file/folder in 0000:00:04.0 or in any other folder under /sys/bus/pci/devices/. Kernel version is 5.10.0-1045-oem.

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

This is just long term package power limit after reaching 53C "/sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio\:0/device/intel-rapl-mmio\:0/constraint_0_power_limit_uw".
Doesn't include off package limit.
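As a rough sketch, the limit named above can be read and converted from microwatts to watts as follows. The default path is the one quoted in this comment and may differ on other systems or kernels; uw_to_watts and show_pkg_limit are hypothetical helper names.

```shell
#!/bin/sh
# Convert a powercap limit from microwatts to watts.
uw_to_watts() {
    awk -v uw="$1" 'BEGIN { printf "%.2f\n", uw / 1000000 }'
}

# Print the long-term (constraint_0) package power limit in watts.
show_pkg_limit() {
    limit_file=${1:-/sys/class/powercap/intel-rapl-mmio/intel-rapl-mmio:0/device/intel-rapl-mmio:0/constraint_0_power_limit_uw}
    if [ -r "$limit_file" ]; then
        printf 'long-term package limit: %s W\n' "$(uw_to_watts "$(cat "$limit_file")")"
    else
        printf 'cannot read %s (root may be required)\n' "$limit_file"
    fi
}

show_pkg_limit "$@"
```

A value of 15000000 in that file corresponds to the 15 W limit mentioned earlier in the thread.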

You need at least a 5.11 kernel for those workload controls. The Tiger Lake control patches were submitted for 5.11, but distros may have backported a few.

This workload type will reduce cpu power and give it to GPU.

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

I have the same XPS 9310 with
5.11.0-34-generic #36-Ubuntu SMP Thu Aug 26 19:22:09 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux

# tree /sys/bus/pci/devices/0000\:00\:04.0
/sys/bus/pci/devices/0000:00:04.0

...
...
├── uevent
├── vendor
└── workload_request
    ├── workload_available_types
    └── workload_type
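A small sketch for checking whether a given kernel exposes this interface before trying to write to it. It assumes the device path shown in the tree above; show_workload_types is a hypothetical helper name.

```shell
#!/bin/sh
# Report the workload hint interface if the running kernel exposes it.
show_workload_types() {
    dir=${1:-/sys/bus/pci/devices/0000:00:04.0/workload_request}
    if [ -d "$dir" ]; then
        printf 'available: %s\n' "$(cat "$dir/workload_available_types")"
        printf 'current: %s\n' "$(cat "$dir/workload_type")"
    else
        printf 'no workload_request under %s (kernel too old?)\n' "${dir%/workload_request}"
    fi
}

show_workload_types "$@"
```

On the reporter's 5.10 OEM kernel this would be expected to take the second branch, since the directory was reported missing there.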

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

In your current 5.10 kernel try this setting:

# for i in {0..7}; do echo balance_power > /sys/devices/system/cpu/cpufreq/policy$i/energy_performance_preference; done

confirm with
for i in {0..7}; do cat /sys/devices/system/cpu/cpufreq/policy$i/energy_performance_preference; done
balance_power
balance_power
balance_power
balance_power
balance_power
balance_power
balance_power
balance_power

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

It will also be great if you can specify how to download and play the game to get to this condition.

Revision history for this message
Eugene86 (eugene86) wrote :

Thank you, Srinivas! I'll try to perform the tests with the proposed configuration on the weekend.
Game: https://store.steampowered.com/app/20510/STALKER_Clear_Sky/ it costs 9.99 EUR but I'll be happy to buy it for you for test purposes.

To run it with Steam on Linux, enable "Steam Play" in the Steam settings (both "For selected titles" and "For all other titles"). I selected Proton 5.0-10 as the compatibility tool, but it also runs fine with other versions.
The game graphics configuration (steamapps/common/STALKER Clear Sky/_appdata_/user.ltx) is attached. I run it on an external 2560x1440 144 Hz display (USB-C=>HDMI cable).

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

Thanks. I can buy, I don't have gaming skills!

You already indicated that limiting cpu frequency or removing turbo boost helps.
I want to see if the energy_performance_preference or workload_request works.
"workload_request" will be the best as this is one setting for all CPUs.

If this works, then when GPU utilization exceeds some threshold we can reduce CPU power by any of the above methods in thermald.

Revision history for this message
Eugene86 (eugene86) wrote :

Checked with kernel 5.13.0-1014-oem
(5.11.0-37-generic is not usable for this laptop as wifi and external display are not recognized)
Neither
# echo idle > workload_type
nor
# echo bursty > workload_type
helps.
Also balance_power is in /sys/devices/system/cpu/cpufreq/policy$i/energy_performance_preference for every CPU by default.

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

balance_power is not the kernel default, so something running on the system is changing the cpufreq parameters. "balance_performance" is the default.

Anyway thermald is setting the package power limit as per thermal tables. There is no power sharing info in the tables.

We have this power sharing gap, which needs to be addressed. I thought it would be as simple as setting these knobs. But since they don't help, we need to try other methods. I will try to set up the game and check.

Revision history for this message
Eugene86 (eugene86) wrote :

I also checked with the GTA IV game -- performance decreased with the new thermald there too. The average FPS in the in-game benchmark dropped from 53 to 42 (I ran only one attempt for each thermald version), and the game became unplayable because the minimum FPS in complex scenes fell below 30.
GTA IV has a different CPU load pattern: while STALKER loads only one CPU core but to 100%, GTA IV looks to be multi-threaded -- it utilizes all cores, but each of them only partially (none was above 50%).

Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

It looks like some game mode daemon or program is changing the parameters for CPU frequency.
For example, the energy performance preference was changed to balance_power.
We need to find out what that program is; maybe some tuning can be done there, as it knows a game is going to be played.

It will also be helpful to collect this data while the condition is triggered.

$ sudo ./turbostat --show CPU,Busy%,Bzy_MHz,TSC_MHz,GFXMHz,GFXAMHz,PkgWatt,CorWatt,GFXWatt -i 1

Revision history for this message
Eugene86 (eugene86) wrote :

I've performed measurements with turbostat (kernel 5.13)
With new thermald ("bad" game performance):
GFXAMHz: 400, PkgWatt: 16.56, CorWatt: 7.34, GFXWatt: 3.88
With old thermald (good performance):
GFXAMHz: 650, PkgWatt: 14.86, CorWatt: 3.81, GFXWatt: 4.45
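To put the two turbostat samples side by side, here is a short worked calculation of how the package power splits between the CPU cores and the GPU. The numbers are the ones reported above; power_share is a hypothetical helper name.

```shell
#!/bin/sh
# Print the share of package power used by the cores and by the GPU.
# args: label PkgWatt CorWatt GFXWatt
power_share() {
    awk -v l="$1" -v p="$2" -v c="$3" -v g="$4" 'BEGIN {
        printf "%s: cores %.1f%%, GPU %.1f%% of package power\n", l, 100 * c / p, 100 * g / p
    }'
}

power_share "new thermald" 16.56 7.34 3.88   # cores 44.3%, GPU 23.4%
power_share "old thermald" 14.86 3.81 4.45   # cores 25.6%, GPU 29.9%
```

With the new thermald the package draws about 1.7 W more, yet the GPU gets less power in absolute terms (3.88 W vs 4.45 W) while the cores get almost twice as much.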

About the game mode daemon: I have gamemode installed but I'm not sure whether it is actually used. As I recall, I installed it just in case after hearing about it (it is useful on AMD CPUs with discrete GPUs for setting the "performance" CPU governor while playing).

It could be a good idea to utilize it, however I have a concern that most of "typical users" who want to play on their laptops don't know about this program.

Revision history for this message
Eugene86 (eugene86) wrote :
Revision history for this message
Srinivas Pandruvada (srinivas-pandruvada) wrote :

The 2W increase in the power limit from the thermal tables results in more power for the CPU, driving it to a higher frequency, which may be saturating the GPU.

It will take some time to come up with some algorithm. May be part of another power sharing daemon.

koba (kobako)
Changed in thermald (Ubuntu):
status: New → Won't Fix