CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

Bug #2026658 reported by Eli
This bug affects 2 people
Affects            Status      Importance  Assigned to  Milestone
linux (Ubuntu)     Incomplete  Undecided   Unassigned
thermald (Ubuntu)  Confirmed   Undecided   Unassigned

Bug Description

I've tried to include as much detail as possible in this bug report. I originally assembled it just after the release of Ubuntu 23.04, and there has been no change since then.

I have had substantial performance problems since updating from Ubuntu 22.10 to 23.04.
The computer in question is the 17 inch Razer Blade laptop from 2022 with an Intel i7-12800H.
Current kernel is 6.2.0-20-generic. (I am now on 6.2.0-24-generic and nothing has changed.)
This issue occurs regardless of whether the OpenRazer (https://openrazer.github.io/) drivers etc. are installed.

Description of problem:
I have discovered what may be two separate bugs involving low-level power management details on the cpu; both involve the cpu entering different types of throttled states and never recovering. These issues appeared immediately after upgrading from Ubuntu 22.10. The computer is a large ~gaming laptop with plenty of thermal headroom; cpu temperatures cannot reach concerning values except when using stress testing tools.

(I don't know how to properly untangle these two issues, so I'm posting them as one. I apologize for the review complexity this causes, but I think posting the information all in one spot is more constructive here.)

High level testing notes:
- This issue occurs with both the intel_pstate driver and the cpufreq driver. (I don't have the same level of detail for cpufreq, but the issue still occurs.)
- I have additionally tested a handful of intel_pstate parameters (and others) via grub kernel command line arguments to no effect. All testing reported here was done with:
  GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau"
  GRUB_CMDLINE_LINUX=""
  (loading nouveau caused problems for me on 22.10; I have not bothered reinvestigating it on 23.04)
- There is a firmware update available from the manufacturer when I boot into Windows; I have not installed it (yet).
- - Update: I installed it. No change.
- Changing the cpu governor setting from "powersave" to "performance" using `cpupower frequency-set -g performance` has no effect. (Note: this action is separate from intel_pstate's power-saver/balanced/performance setting visible with the `powerprofilesctl` utility; it doesn't seem to be a governor bug. See the short sketch after this list for how the two knobs are read.)
- - (There is a tertiary issue where I also see substantial (+50%) performance degradation using the "performance" profile in a test suite I run constantly for my job; that is clearly a problem but it is an unrelated bug that has existed for quite some time.)
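
(Illustrative sketch only, to make the distinction above concrete: the governor and the power profile live in different places; the sysfs path and the `powerprofilesctl get` subcommand are the standard interfaces:)

    # The cpufreq governor is a per-CPU sysfs value; the power-saver/balanced/
    # performance profile comes from power-profiles-daemon via powerprofilesctl.
    import subprocess

    with open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor") as f:
        print("cpufreq governor:", f.read().strip())        # e.g. "powersave"

    profile = subprocess.run(["powerprofilesctl", "get"],
                             capture_output=True, text=True).stdout.strip()
    print("power-profiles-daemon profile:", profile)         # e.g. "balanced"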

Summary and my own conclusions:
These are my takeaways; the ~raw data is in the follow-up section.

Bug 1)
The reported cpu power limits are progressively constrained over time. Once this failure mode starts, performance never recovers.
  - As this situation progresses the observed cpu speeds (I'm using htop) list as 2800 MHz at idle, but the instant any load at all is placed on a cpu core, that core immediately drops to exactly 400 MHz.
  - This situation occurs quite quickly in human terms, frequently within 20 minutes of normal usage after a boot, but it will also occur when the computer is just sitting there unused for a handful of hours.
  - This also occurs when using the cpufreq governor (by including "intel_pstate=disable" in the grub command line args).
  - At boot the default value for short_term_time looks wrong to me. This is the duration of higher thermal targets in seconds; ~0.002 seconds seems extremely short. A normal value would be a handful of seconds.
  - This situation can be remedied by running the following python script. It uses the undervolt package (pip install undervolt==0.3.0) to force particular power limits (the provided values are intentional overkill):
      from undervolt import read_power_limit, set_power_limit, PowerLimit, ADDRESSES
      from pprint import pprint

      limits = read_power_limit(ADDRESSES)
      pprint(vars(limits))  # print current values before setting them

      POWER_LIMITS = PowerLimit()
      POWER_LIMITS.locked = True  # lock means don't allow the values to be reset until a reboot
      POWER_LIMITS.backup_rest = 281474976776192  # afaik this is just a backup-on-failure setting, it has no effect here
      POWER_LIMITS.long_term_enabled = True
      POWER_LIMITS.long_term_power = 160  # values are intentional overkill
      POWER_LIMITS.long_term_time = 2880.0
      POWER_LIMITS.short_term_enabled = True
      POWER_LIMITS.short_term_power = 250
      POWER_LIMITS.short_term_time = 500.0
      set_power_limit(POWER_LIMITS, ADDRESSES)

      limits2 = read_power_limit(ADDRESSES)  # and print the new state
      pprint(vars(limits2))
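      (Note: the undervolt package pokes low-level registers, so a script like this has to run as root, e.g. `sudo python3 fix_power_limits.py`, where the filename is just a placeholder; and with locked = True the limits cannot be changed again until the next reboot, as the comment says.)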

Bug 2)
`powerprofilesctl` has unearthed some bug where the cpu performance enters the degraded state "high-operating-temperature", and never recovers.
  - This appears to happen for no reason. There is a brief cpu temperature spike in the example data below, but it does not hit the listed hardware limit values so I am at a loss for its cause.
  - I ran a cpu stress test (prime95/mprime torture test), it immediately spikes cpu temperature to 100 degrees and throttles the cpu, but doesn't trigger the high temperature degraded state. Go figure.
  - This bug takes quite a while to kick in, uptime in my example below was at over 14 hours.
  - When this situation occurs the maximum cpu speed becomes 2400 MHz across all cpu cores. The cpu power management appears to behave correctly in the 400-2400 MHz range. I believe this means all turbo frequencies are disabled.
  - Running the command `sudo cpupower frequency-set -u 4800000` (or any value above 2400000) does not correct the reported cpu_policy_range; it remains locked at 2400 MHz.
  - The only fix I know is a reboot.

THE DATA:

Bug 1:
This output was gathered by a script that starts running at ~boot and calls the read_power_limit function from the python undervolt package.
long_term_power and short_term_power are values in watts; long_term_time and short_term_time are values in seconds.
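
(For context, a minimal sketch of what such a boot-time logging script might look like; this is illustrative only, it reuses the same undervolt API as the fix script in the summary, and the log path and polling interval are placeholders:)

    # Periodically log the RAPL power limits plus uptime. Needs to run as root,
    # the same requirement as the fix script above.
    import subprocess
    import time
    from datetime import datetime
    from undervolt import read_power_limit, ADDRESSES

    LOG_PATH = "/var/log/power_limit_watch.log"   # placeholder path

    while True:
        limits = read_power_limit(ADDRESSES)      # same call as the fix script
        uptime = subprocess.run(["uptime"], capture_output=True, text=True).stdout.strip()
        with open(LOG_PATH, "a") as log:
            log.write(f"{datetime.now():%Y-%m-%d %H:%M:%S} {uptime}\n")
            log.write(f" long_term_power: {limits.long_term_power}\n")
            log.write(f" long_term_time: {limits.long_term_time}\n")
            log.write(f" short_term_power: {limits.short_term_power}\n")
            log.write(f" short_term_time: {limits.short_term_time}\n\n")
        time.sleep(60)                            # arbitrary polling interval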

2023-05-12 15:14:32 up 0 min, 0 user, load average: 0.39, 0.10, 0.03
(boot, log starts after normal user login)
 long_term_power: 65.0
 long_term_time: 32.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 15:20:29 up 6 min, 2 users, load average: 1.90, 0.86, 0.37
 long_term_power: 20.875 <-- down
 long_term_time: 28.0 <-- down
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 15:20:46 up 6 min, 2 users, load average: 1.63, 0.87, 0.38
 long_term_power: 22.625 <-- hey it went up! I was still using the computer at this point
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 15:46:15 up 32 min, 2 users, load average: 0.66, 0.84, 0.79
(no longer at computer by the time this occurs)
 long_term_power: 20.625 <-- down
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 16:04:46 up 50 min, 3 users, load average: 0.46, 0.70, 0.79
 long_term_power: 16.625 <-- down
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 17:23:07 up 2:08, 3 users, load average: 0.49, 0.61, 0.68
(by the time long_term_power hits 8.625 all cpu cores throttle to 400 MHz under any load. This one was preceded by ~1 second of a single cpu core randomly spiking to 78 degrees; output from `powerprofilesctl` remains normal. At this point long_term_power will never go up again. I have seen one more lowered stage at ~4.3125w.)
 long_term_power: 8.625 <-- way down - I've seen lower, though.
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

(And then after several hours stuck in this mode I returned to the computer and needed to run the script in the bug 1 summary to make it usable again.)

Bug 2:
(Some cleanup of output, script starts at ~boot)
2023-05-11 22:21:15 up 14:15, 2 users, load average: 0.38, 0.42, 0.52

Output from powerprofilesctl:
  | performance:
  | Driver: intel_pstate
  | Degraded: no
  |* balanced:
  | Driver: intel_pstate
  | power-saver:
  | Driver: intel_pstate

some summarized details from the `cpupower` utility:
  | cpu_number: 2
  | cpu_range: 400 MHz - 4.70 GHz
  | cpu_policy_range: 400 MHz and 4.70 GHz.
  | governor: powersave

output from `sensors` (slightly compactified, I don't know what's up with the cpu core numbers):
  | iwlwifi_1-virtual-0 - Adapter: Virtual device - temp1: +49.0°C
  | nvme-pci-0300 - Adapter: PCI adapter - Composite:
  | +40.9°C (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
  | nvme-pci-0200 - Adapter: PCI adapter:
  | Composite: +36.9°C (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
  | Sensor 1: +36.9°C (low = -273.1°C, high = +65261.8°C)
  | Sensor 2: +38.9°C (low = -273.1°C, high = +65261.8°C)
  | coretemp-isa-0000 - Adapter: ISA adapter
  | Package id 0: +77.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 0: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 4: +54.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 8: +77.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 12: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 16: +64.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 20: +45.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 24: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 25: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 26: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 27: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 28: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 29: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 30: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 31: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | acpitz-acpi-0 - Adapter: ACPI interface: temp1: +27.8°C (crit = +105.0°C)

2023-05-11 22:21:17 up 14:15, 2 users, load average: 0.38, 0.42, 0.52 (2 seconds later)

output from `powerprofilesctl`:
  | performance:
  | Driver: intel_pstate
  | Degraded: yes (high-operating-temperature)
  |* balanced:
  | Driver: intel_pstate
  | power-saver:
  | Driver: intel_pstate

some summarized details from the `cpupower` utility:
  | cpu_number: 8
  | cpu_range: 400 MHz - 4.70 GHz
  | cpu_policy_range: 400 MHz and 2.40 GHz.
  | governor: powersave

output from `sensors` (slightly compactified, I don't know what's up with the cpu core numbers):
  | iwlwifi_1-virtual-0 Adapter: Virtual device temp1: +49.0°C
  | nvme-pci-0300 - Adapter: PCI adapter
  | Composite: +40.9°C (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
  | nvme-pci-0200 - Adapter: PCI adapter
  | Composite: +36.9°C (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
  | Sensor 1: +36.9°C (low = -273.1°C, high = +65261.8°C)
  | Sensor 2: +38.9°C (low = -273.1°C, high = +65261.8°C)
  | coretemp-isa-0000 - Adapter: ISA adapter
  | Package id 0: +60.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 0: +53.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 4: +59.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 8: +54.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 12: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 16: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 20: +60.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 24: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 25: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 26: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 27: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 28: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 29: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 30: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 31: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | acpitz-acpi-0 - Adapter: ACPI interface - temp1: +27.8°C (crit = +105.0°C)

Revision history for this message
Eli (biblicabeebli) wrote :
description: updated
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

> I have had substantial performance problems since updating from ubuntu 22.10 to 23.04.
Maybe it's caused by thermald? See if `sudo systemctl stop thermald` can help.

Revision history for this message
Eli (biblicabeebli) wrote :

> Maybe it's caused by thermald? See if `sudo systemctl stop thermald` can help.

I will try this, and I will also reinstall the package.

I am waiting to see if using sane power parameters in my script for bug 1 fixes the bug 2 issue, but that means I need to leave it sitting for 12+ hours so my iteration speed here is very slow.

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

I figured I would do an apt reinstall of thermald first, and that seems to have fixed bug 1. Bug 2 takes many hours to kick in, so I will have to leave it on for a ~day to find out.

Thank you, hopefully this was a false alarm.

Revision history for this message
Eli (biblicabeebli) wrote :

I disabled thermald, I think bug 1 may be resolved now, but bug 2 eventually occurred again overnight.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please attach output of `grep . /sys/devices/system/cpu/intel_pstate/*` when the issue happens?

Revision history for this message
Eli (biblicabeebli) wrote :

I will grep that for you; here is the reference information just after boot when everything works. I have to wait for the bug to kick in (and I woke up to a blinking cursor in the upper left corner, funnn).

reference:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

Interesting, a new ~intermediate variation of bug 1. It's not constraining the thermal envelope quite as much now. Thermald is running; I guess bug 1 is still present but lessened since I reinstalled it? The computer was not in use.

When running `stress -c 1` it places the task on the correct ideal core (the one that clocks up to 4.8 GHz), but it bounces around between 1300 MHz and 500 MHz.
**update: it looks like any multithreaded load gets shunted down to 400 MHz, with occasional spikes on single threaded operation.

First log statement for the power envelope was at 2023-07-19T18:07:15 (roughly at boot) and then long_term_time and long_term_power suddenly step down and stay that way at roughly 3 hours 30 minutes uptime. (Current uptime is about 9 hours 50 minutes.)

2023-07-19T21:39:11
{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 65.0,
 'long_term_time': 32.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

2023-07-19T21:39:13
{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 7.75,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

here is your grep:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

It will take a ~day for bug 2 to trigger, I will update and do the grep when that happens. If I don't need to use this computer I will leave it on without running my script to lock the power envelope at a higher value.

Revision history for this message
Eli (biblicabeebli) wrote :

Ok well this is interesting.
It has been over 24 hours (uptime 1 day 15:48hrs) and bug 2 still hasn't triggered, "Degraded = no".

I probably need this computer today, so I ran `stress -c 1` to try to force the high temperature power state after running the script to reset the power values, but with lock set to False. This went onto the ideal cpu core at 4.8 GHz, package temperature spiked up to ~93 degrees C, and eventually the power parameters dropped to this:

{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 0.125,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 250.0,
 'short_term_time': 512.0}

Bug 2 still had not been triggered and the cpu was throttled to 400 MHz, so I reset the power values again, this time with lock=True. When I ran `stress -c 1` it was on the ideal core but now limited to 4.3 GHz with a cpu package temp of ~67 degrees C.

I ran the grep and got this:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:90
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

Revision history for this message
Eli (biblicabeebli) wrote :

I have updates!
- I set /sys/devices/system/cpu/intel_pstate/max_perf_pct to 100 and confirmed that it restores the 4.7/4.8 peak turbo frequencies.
- I ran `stress -c 1`, cpu package temps went up to ~84 degrees C. No changes on the grep, no changes on the powerprofilesctl degraded state.
- I ran `stress -c 2`, cpu package temps went up to ~90 degrees C. This triggered the powerprofilesctl degraded state.

The grep:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:70
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:1
/sys/devices/system/cpu/intel_pstate/status:active

$ powerprofilesctl
* performance:
    Driver: intel_pstate
    Degraded: yes (high-operating-temperature)
  balanced:
    Driver: intel_pstate
  power-saver:
    Driver: intel_pstate

$ cpupower frequency-info
analyzing CPU 6:
  driver: intel_pstate
  ...
  hardware limits: 400 MHz - 4.80 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 2.40 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  boost state support:
    Supported: yes
    Active: yes

In case it is relevant, this has been with thermald running.
Temperatures are back down around 45 degrees C, which is typical, but as stated in the original report it will never recover on its own.

* * *

I have now set those values back to their originals, e.g.
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/no_turbo:0

I will also note that the 400 MHz to 2.40 GHz range indicated by cpupower reverts to the full range when no_turbo is set back to 0, and the powerprofilesctl degraded state is also directly based on this value. (So I will stop reporting them!)

From this I can at least write a script that sets these variables back to normal and regains normal functionality without rebooting!
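
A minimal sketch of what such a reset script could look like (just direct sysfs writes, which need root; the values are the at-boot ones listed above):

    # Reset the intel_pstate knobs back to the values observed at boot.
    PSTATE = "/sys/devices/system/cpu/intel_pstate"

    def write_sysfs(name, value):
        with open(f"{PSTATE}/{name}", "w") as f:
            f.write(str(value))

    write_sysfs("max_perf_pct", 100)  # restore the full turbo frequency range
    write_sysfs("no_turbo", 0)        # clears the 2.40 GHz policy cap (and the degraded state)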

My next step will be to uninstall thermald entirely, reboot, and report back on whether I'm able to trigger either bug. I'm confused about what I experienced before; the reboot is to clear the thermal envelope lock set by my script.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please see if the bug is present on the latest thermald?

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Eli (biblicabeebli) wrote :

So far I have not had either bug recur with thermald uninstalled; I'm satisfied that thermald was the culprit for bug 1. Running `stress -c 2` (highest possible thermal load) doesn't trigger anything.

The grep on this run is
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:1
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

(I don't know why hwp_dynamic_boost is 1 here; I only know it changed at some point while running stress. I thought this was locked to 0 based on a kernel command line argument for intel_pstate?)

I have installed 2.5.2-1 and will reboot and report back.

Revision history for this message
Eli (biblicabeebli) wrote :

Nope, running stress -c 2 yields...

long_term_power starts at 65, drops to ~10, then jumps around values like 37.625, 30.0, 31.0, 28.0, but then after a ~minute the power values look like this and clearly aren't recovering:

{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 3.0,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

grep looks like this:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

CPU speeds are set to their nominal speeds but immediately drop down to 400 MHz as soon as they take any load.

A more real-world check: a parallel test suite I run frequently, which should take on the order of 3.5 seconds, takes 30. (Running it single threaded is clearly also terrible, but I don't offhand know its normal run time.)

Revision history for this message
Eli (biblicabeebli) wrote :

reference data point: the power values just after a reboot when thermald is not installed are:

{'backup_rest': 65536,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 65.0,
 'long_term_time': 56.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

Revision history for this message
Eli (biblicabeebli) wrote :

Not to distract too much, but I have also uncovered a separate, replicable performance bug that is quite bizarre. tl;dr: though reported clock speeds are consistently higher after running `cpupower frequency-set -g performance`, I see a very reliable on-the-order-of-40%-ish performance loss on a particular compile task, and on that test suite I mentioned. This occurs whether thermald is installed or not, and it also occurs when min_perf_pct is set to 100.

I will look into that more deeply and can create a new bug report for it - but I don't know what to label that one and would appreciate a recommendation.

I feel this issue's name has been proven wrong; it seems clear this is a thermald thing, so I propose we rename it. (If that is an option.) The new performance weirdness is a better candidate for being close to the kernel, since it is literally a pstate driver option - assuming I've understood what the cpupower command does.

Revision history for this message
koba (kobako) wrote :

@Eli, could you provide the thermald logs? thanks
#sudo systemctl stop thermald
#sudo thermald --no-daemon --adaptive --loglevel=debug >> thermald_debug_202307270935

Revision history for this message
Eli (biblicabeebli) wrote :

I'm running day-to-day without thermald installed, so I will need to find time to do this. But yes I can.

koba (kobako)
Changed in thermald (Ubuntu):
status: New → In Progress
assignee: nobody → koba (kobako)
Revision history for this message
Eli (biblicabeebli) wrote :

(Trying to find time to do this. Had some life externalities come up, and my work is currently unavoidably attached to this specific computer and I have some deadlines. 🫠)

Revision history for this message
James Gardner (jadgardner) wrote :

Hi all,

I believe I am also experiencing this issue, pretty much exactly as described.

I'm using a Lambda Tensorbook 2022 (which is a 15.6-inch Razer Blade from 2022 in a new case) with Ubuntu 22.04.3 LTS and the current kernel is 6.2.0-26-generic.

Here is the output of 'cpupower frequency-info'

"""
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 400 MHz - 4.70 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 4.70 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 400 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

"""

And the output of 'sensors'

"""

sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +37.0°C

nvme-pci-0200
Adapter: PCI adapter
Composite: +35.9°C (low = -273.1°C, high = +82.8°C)
                       (crit = +84.8°C)
Sensor 1: +35.9°C (low = -273.1°C, high = +65261.8°C)

BAT0-acpi-0
Adapter: ACPI interface
in0: 17.36 V
curr1: 0.00 A

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +58.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +58.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +52.0°C (high = +100.0°C, crit = +100.0°C)
Core 8: +53.0°C (high = +100.0°C, crit = +100.0°C)
Core 12: +52.0°C (high = +100.0°C, crit = +100.0°C)
Core 16: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 20: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 24: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 25: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 26: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 27: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 28: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 29: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 30: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 31: +50.0°C (high = +100.0°C, crit = +100.0°C)

nvme-pci-0300
Adapter: PCI adapter
Composite: +38.9°C (low = -273.1°C, high = +82.8°C)
                       (crit = +84.8°C)
Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +27.8°C (crit = +105.0°C)

"""

Let me know if there is any other information I can provide.

Revision history for this message
koba (kobako) wrote :

@james,
* dump thermald version
#sudo apt policy thermald

* stop thermald.service and run manually.
#systemctl disable thermald.service
#systemctl stop thermald.service
#reboot
#sudo thermald --no-daemon --adaptive --loglevel=debug > thermaldLog_202308111733
#try to reproduce
#upload log.

Revision history for this message
James Gardner (jadgardner) wrote :

#sudo apt policy thermald

thermald:
  Installed: 2.4.9-1ubuntu0.3
  Candidate: 2.4.9-1ubuntu0.3
  Version table:
 *** 2.4.9-1ubuntu0.3 500
        500 http://gb.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.4.9-1 500
        500 http://gb.archive.ubuntu.com/ubuntu jammy/main amd64 Packages

Disabling and stopping thermald.service, rebooting and then running the command:

#sudo thermald --no-daemon --adaptive --loglevel=debug > thermaldLog_202308111733

currently seems to be resulting in me not being able to reproduce the bug, which in my case is bug 1 in the original post. All CPU cores were dropping to 400 MHz within a few minutes of startup, and currently it's at 2 hours of uptime with no drop. I'll continue to try to reproduce and will upload the log should it occur.

Revision history for this message
James Gardner (jadgardner) wrote :

The issue has occurred again. All CPU cores lock to 400MHz when under any load. I've attached the thermald log.

Revision history for this message
koba (kobako) wrote : Re: [Bug 2026658] Re: CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

@James,
Thanks for the update,
Could you show which kernel you are using?
#uname -a

Is it possible to check with the upstream thermald?
https://github.com/intel/thermal_daemon

Revision history for this message
James Gardner (jadgardner) wrote :

#uname -a:

Linux james-TensorBook-2022 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I have attached the log from running:

#sudo sbin/thermald --no-daemon --adaptive --loglevel=debug > thermaldlog_202308130942

using the thermald binary installed from source.

The same issue occurred, though perhaps slightly different as it was locking to ~900 - 1000MHz rather than 400MHz when under load.

Revision history for this message
Eli (biblicabeebli) wrote :

I have some time (and was reminded by thread updates, thank you for posting Mr. Gardner!) and am running my test. I reinstalled (after a purge uninstall iirc) thermald via apt, then ran `stress -c 2` for a ~minute to bring it to the brink, then `stress -c 1` to push it over the edge. (This seems to reliably cause bug 1.)

$ sudo apt policy thermald
Installed: 2.5.2-1
  Candidate: 2.5.2-1
  Version table:
 *** 2.5.2-1 500
        500 http://us.archive.ubuntu.com/ubuntu lunar/main amd64 Packages
        100 /var/lib/dpkg/status

$ uname -a
Linux TheUssBenterprise 6.2.0-27-generic #28-Ubuntu SMP PREEMPT_DYNAMIC Wed Jul 12 22:39:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Those power values after triggering bug 1 (TL;DR, long_term_power is stuck at 0.125):
{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 0.125,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

Thermald log file is attached.

I will now try to work out how to trigger bug 2.

Revision history for this message
Eli (biblicabeebli) wrote :

(oh, that's the first time I attached something on a thread; I will name my attachments better from now on, that one was just named thermald.log)

Revision history for this message
Eli (biblicabeebli) wrote :

Unfortunately I experienced a bad crash - black screen, blinking underscore-style cursor in the upper-left corner - after running that stress test for several hours, so I don't know if I ever triggered bug 2.

I will run the standard long-form test ("keep it on for several days with thermald running"), with my script that locks the power details so that I can use the computer. I still have a script logging cpupower details, sensors, and powerprofilesctl in the background, so I can and will track down exactly when bug 2 is triggered.

(aaaaaaannnd I just deleted those old log files so I can't check if I succeeded on that most recent test. derp.)

This tends to take 48 hours+.
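
(A rough sketch of the kind of background logger described above; this is not the exact script, and the output path and interval are placeholders:)

    # Append the output of a few diagnostic commands to a log once a minute.
    import subprocess
    import time
    from datetime import datetime

    OUT_PATH = "/tmp/power_debug.log"   # placeholder path
    COMMANDS = [["uptime"], ["powerprofilesctl"],
                ["cpupower", "frequency-info"], ["sensors"]]

    while True:
        with open(OUT_PATH, "a") as log:
            log.write(f"\n===== {datetime.now():%Y-%m-%d %H:%M:%S} =====\n")
            for cmd in COMMANDS:
                out = subprocess.run(cmd, capture_output=True, text=True).stdout
                log.write(f"--- {' '.join(cmd)} ---\n{out}\n")
        time.sleep(60)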

Revision history for this message
James Gardner (jadgardner) wrote :

I have installed and run an older kernel:

#uname -a
Linux james-TensorBook-2022 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

And have been unable to reproduce the bug with thermald 2.4.9-1ubuntu0.3 running.

Revision history for this message
koba (kobako) wrote :

@James, thanks for your information,
could you please also upload the thermald log against 6.0.9-060009-generic

Revision history for this message
James Gardner (jadgardner) wrote :

I've attached the log.

Revision history for this message
koba (kobako) wrote :

@James and @Eli, could you please also list the content of this folder? thanks
#sudo ls /sys/bus/acpi/devices/INTC1041:00/

Revision history for this message
James Gardner (jadgardner) wrote :

#sudo ls /sys/bus/acpi/devices/INTC1041:00/
hid modalias path physical_node power status subsystem uevent uid wakeup

Revision history for this message
koba (kobako) wrote :

@James,
I found this error occurs against both 6.0.9-060009-generic and 6.2.0-27-generic.
This is caused by INTC1041:00/data_vault not existing on your system.
~~~
[1691916182][DEBUG]Unable to open GDDV data vault
[1691916182][INFO]THD engine init failed
[1691916182][INFO]--adaptive option failed on this platform
[1691916182][INFO]Ignoring --adaptive option
~~~

could you please try another policy with thermald against 6.0.9-060009-generic and 6.2.0-27-generic? thanks
#sudo thermald --no-daemon --loglevel=debug > thermaldLog_woAdaptive_$(date "+%Y%m%d%H%M")

Revision history for this message
Eli (biblicabeebli) wrote :

I have been able to get thermald log info for bug 2.
(This was accomplished with locked power details, so the computer remained usable over the ~30 hours of uptime before I saw it had triggered.)

The log file itself is over 50MB, I've zipped it into a 3.8MB file.

Grep output to confirm bug 2:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:70
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:1
/sys/devices/system/cpu/intel_pstate/status:active

To save a lot of bother, from my own system logging script I determined that bug 2 was triggered between (2023-08-14) 21:37:13 and 21:37:14 (US Eastern). Converting that first one to unix timestamps yields 1692063433.
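
(The conversion can be double-checked with the Python standard library; US Eastern is UTC-4 in mid-August:)

    from datetime import datetime
    from zoneinfo import ZoneInfo

    t = datetime(2023, 8, 14, 21, 37, 13, tzinfo=ZoneInfo("America/New_York"))
    print(int(t.timestamp()))   # 1692063433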

So, of these 3 logging events, 2 should be the before/after log statements from thermald - unless I've screwed up my math:

[1692063430][DEBUG]poll exit 0 polls_fd event 0 0
[1692063430][DEBUG] energy 1:524286656:772647335 mj: 7965 mw
[1692063430][DEBUG]read_temperature sensor ID 4
[1692063430][DEBUG]Sensor TCPU :temp 48000
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 103050
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 104550
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 106050
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 107050
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 109050
[1692063430][DEBUG]pref 0 type 0 temp 48000 trip 110050
[1692063430][DEBUG]pref 0 type 2 temp 48000 trip 110050
[1692063430][DEBUG]Passive Trip point applicable
[1692063430][DEBUG]Trip point applicable < 1:110050
[1692063430][DEBUG]cdev size for this trippoint 0
[1692063430][DEBUG]pref 0 type 3 temp 48000 trip 90000
[1692063430][DEBUG]Passive Trip point applicable
[1692063430][DEBUG]Trip point applicable < 2:90000
[1692063430][DEBUG]cdev size for this trippoint 4
[1692063430][DEBUG]cdev at index 13:Processor
[1692063430][DEBUG]>>thd_cdev_set_state temperature 90000:48000 index:13 state:0 :zone:4 trip_id:2 target_state_valid:0 target_value :0 force:0 min_state:0 max_state:0
[1692063430][DEBUG]zone_trip_limits.size() 0
[1692063430][DEBUG]def_max_state:0 temp_max_state:0 curr_max_state:0
[1692063430][DEBUG]thd_cdev_set_13:curr state -1657 max state 0
[1692063430][DEBUG]def_min_state:0 curr_min_state:0
[1692063430][INFO]op->device:Processor -1658
[1692063430][DEBUG]set cdev state index 13 state -1658
[1692063430][INFO]sysfs write failed /sys/class/thermal/cooling_device13/cur_state
[1692063430][INFO]Set : threshold:90000, temperature:48000, cdev:13(Processor), curr_state:-1658, max_state:0
[1692063430][DEBUG]<<thd_cdev_set_state 0

[1692063434][DEBUG]poll exit 0 polls_fd event 0 0
[1692063434][DEBUG] energy 1:524286656:772685798 mj: 9615 mw
[1692063434][DEBUG]read_temperature sensor ID 4
[1692063434][DEBUG]Sensor TCPU :temp 90000
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 103050
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 104550
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 106050
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 107050
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 109050
[1692063434][DEBUG]pref 0 type 0 temp 90000 trip 110050
[16...


Revision history for this message
koba (kobako) wrote :

@Eli, is it possible to try the upstream thermald?
~~~
https://github.com/intel/thermal_daemon
~~~

after compilation is finished, run thermald in the thermal_daemon folder.
~~~
sudo ./thermald --no-daemon --adaptive --loglevel=debug > thermald_adaptive_$(date "+%Y%m%d%H%M")
~~~
W/o adaptive:
~~~
sudo ./thermald --no-daemon --loglevel=debug > thermald_woAdaptive_$(date "+%Y%m%d%H%M")
~~~

Revision history for this message
Eli (biblicabeebli) wrote :

Testing 6.0.9-060009-generic behavior

output of `uname -a`
Linux TheUssBenterprise 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Description of behavior:
- The throttling does occur initially after running stress -c 1 and -c 2. It sticks at 400 MHz briefly after stressing for ~2 minutes, but unlike with 6.2.0-27.28 the system recovers afterwards.
- Looking at power details, long_term_power drops to 0.125, but then very slowly recovers back up towards the default of 65.
- While stressing at high cpu temperatures /sys/devices/system/cpu/intel_pstate/max_perf_pct transiently drops down to 90, but recovers up to 100 almost immediately.
- After letting it recover for a while I attempted to get the cpu to throttle again using stress, but I couldn't make it happen.

thermald log attached (the log updates frequently during the stress test).

---

I will now try to get that version of thermald compiling and test on 6.0.9 and 6.2.x

Revision history for this message
Eli (biblicabeebli) wrote :

Woops, didn't attach log, here it is.

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

Log output with the default cloned branch of the github thermald (I rebooted for a clean test).

$ uname -a
Linux TheUssBenterprise 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

** Command used was sudo ./thermald --no-daemon --loglevel=debug

Same behavior as the prior test. I can trigger it initially by toggling stress with -c 1 and -c 2: power details drop and the cpu clock throttles to 400 MHz, but then the system recovers once the stress test stops.

Next I will test with the normal kernel.

Revision history for this message
Eli (biblicabeebli) wrote :

Test of build of thermald from github.

$ uname -a
Linux TheUssBenterprise 6.2.0-27-generic #28-Ubuntu SMP PREEMPT_DYNAMIC Wed Jul 12 22:39:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ./thermald --no-daemon --loglevel=debug

Behavior:
Toggling between stress -c 1 and stress -c 2, the system throttles down to 400 MHz and never recovers. Power details are stuck at 'long_term_power': 0.125.

Next I will reboot and run with the same test with the adaptive flag passed in to thermald.

Revision history for this message
Eli (biblicabeebli) wrote :

I keep forgetting to attach the logs, sorry.

Revision history for this message
Eli (biblicabeebli) wrote :

Test of build of thermald from github, --adaptive flag enabled, normal linux kernel.

$ uname -a
Linux TheUssBenterprise 6.2.0-27-generic #28-Ubuntu SMP PREEMPT_DYNAMIC Wed Jul 12 22:39:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ./thermald --no-daemon --adaptive --loglevel=debug

Behavior: the usual toggling of stress triggers the bug, the power details become 'long_term_power': 0.125, and the cpu gets stuck at 400 MHz and never recovers.

actually attaching log this time!

Revision history for this message
Eli (biblicabeebli) wrote :

Test of build of thermald from github, --adaptive flag enabled, old linux kernel.

$ uname -a
Linux TheUssBenterprise 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ./thermald --no-daemon --adaptive --loglevel=debug

Behavior: the usual toggling of stress; 'long_term_power' drops to 0.125 and the cpu throttles to 400 MHz, but long_term_power then slowly recovers and eventually jumps back up to the default of 65. (I was able to trigger this throttling twice this time.)

Revision history for this message
James Gardner (jadgardner) wrote :

@koba

Here is the log for the other thermald policy on 6.0.9-060009-generic:

As before using 6.0.9 seems to be fixing the issue for me and I was unable to reproduce the throttling. I'll now run it on 6.2.0-27-generic.

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

I'll describe my process to replicate the transient throttling I see on 6.0.9, and the permanent throttle on 6.2.x:
- Open up whatever you are using to watch clocks and temperature, and two terminals.
- in one terminal run stress -c 1.
- you will see one cpu core spike to your cpu's maximum clock speed and pretty much stay there. For me it's 4.8 GHz.
- in the other terminal run stress -c 1.
- you will now see two cores running slightly under that single core maximum speed. I get a value in the 4.7-4.8 range, which probably means flipping between 4.7 and 4.8 GHz.
- this is the maximum heat output of the cpu; if the fan has not spun up you will see a Package Temp up to 100, and when the fan spins up for me it drops to ~92.
- after running at max clock for 10-20 seconds the cpu will throttle down to its multicore turbo speed. For me it's 4.2-4.3 GHz, and the temperature will drop a solid 15-20 degrees.
- kill one of your stress commands, and you will see the temperature spike back up.
- wait ~10 seconds and then run stress -c 1 again in the terminal you just killed it in. Clocks will stay at max for a bit, then drop, and then kill one of the stress commands.
- Repeat this process of keeping 1 and then 2 cores always at maximum clocks, and you will eventually get thermally throttled down to 400 MHz.
- weirdly it stays at 400 MHz even on 6.0.9 until you stop running both stress commands, even though temps recover to like 45 degrees.

This process reliably triggers bug 1, and very occasionally (I've done it once) can trigger bug 2.

If you use something like my script to get power details (I just call them that, I don't have a better name) you can watch long_term_power fluctuate and then nosedive from 65 to 0.125.

(All temperatures are in Celsius.)
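
(For completeness, a rough sketch of how this manual toggling could be automated; purely illustrative, it assumes `stress` is installed, reuses the undervolt read used elsewhere in this thread, and needs root for that read:)

    # Keep one core pegged, repeatedly add/remove a second stress worker, and
    # watch long_term_power for the nosedive from 65 toward 0.125.
    import subprocess
    import time
    from undervolt import read_power_limit, ADDRESSES

    first = subprocess.Popen(["stress", "-c", "1"])           # one core pegged
    try:
        for _ in range(10):
            second = subprocess.Popen(["stress", "-c", "1"])  # now two cores pegged
            time.sleep(20)                                    # ride the heat spike
            second.terminate()
            second.wait()
            time.sleep(10)                                    # single core spikes back up
            limits = read_power_limit(ADDRESSES)
            print("long_term_power:", limits.long_term_power)
            if limits.long_term_power < 1:                    # stuck near 0.125 W
                break
    finally:
        first.terminate()
        first.wait()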

Revision history for this message
James Gardner (jadgardner) wrote :

@koba

Here is the log for the other thermald policy on 6.2.0-26-generic:

This time the throttling occurred within a couple of hours of normal use.

Revision history for this message
James Gardner (jadgardner) wrote :

@Eli

Interestingly, I cannot cause the throttling on kernel 6.0.9 even following your method for reproducing it.

And using the laptop under sustained heavy CPU and GPU load for many hours also doesn't produce any throttling down to 400MHz for me.

I only appear to be able to reliably see the bug on kernel 6.2.0-26.

Revision history for this message
Eli (biblicabeebli) wrote :

@James well we do have slightly different models, and I can get it within minutes of normal usage after boot. ¯\_(ツ)_/¯

Are you ever getting bug 2?

Revision history for this message
James Gardner (jadgardner) wrote :

Okay, I've just had the same CPU throttling occur when using 6.0.9.

Revision history for this message
Eli (biblicabeebli) wrote :

I have an interesting update:
I went and compiled/installed this tool: https://github.com/phush0/razer-laptop-control-no-dkms

@jadgardner: you will definitely want this, boost mode is at least +15% performance.

It's a CLI tool for poking the Razer hardware bits to set the different power modes across the combination of CPU and GPU. None of these affect the power details variables or intel_pstate values. For simplicity I'm using commands below that leave the GPU untouched. (The tool also lets you control the LEDs and fan; those are irrelevant, and the fan control doesn't work for me.)

All of the below testing was done with that compiled build of thermald running in --adaptive mode. I have attached the log. I doubt a single long and fiddly run makes for a great data source; please let me know if there is a specific combination of settings you would like me to test. (It's 3 MB, I have compressed it. The exact command was `sudo ./thermald --no-daemon --adaptive --loglevel=debug`.)

(Reminder: bug 1 is the 400 MHz drop and lock, bug 2 is intel_pstate/no_turbo getting set to 1. Bug 2 is way harder to trigger.)

1) razer-cli write power ac 4 3 0
Highest "boost" performance mode.
CPU has much higher all-core and multicore speeds; cpu package temp spikes to 100 nearly instantly under even moderate load.
I cannot trigger bug 1 (or bug 2) in this mode; `stress -c 1` pegs a core at 4.8 GHz and it stays there. `stress -c 2` stays at values around ~4775 MHz with periodic drops down to ~4450 MHz, but it jumps right back up after about 1 second. long_term_power is either rock solid at 65 or very briefly drops down and then goes back up.

2) razer-cli write power ac 4 2 0
"High" performance mode.
It looks like this one sets a cpu target temperature around 90 for all-core/multicore loads; frequencies are higher than normal, and temperature does force frequencies down, at least until the fan ramps up a bit.
`stress -c 1` pegged a cpu core for less than a minute at 4.8 GHz, and then intel_pstate/max_perf_pct got set to 90, cpu frequency dropped to 400 MHz, and long_term_power dropped to 0.125 (i.e. bug 1).
Swapping back to level 3 (Boost) mode did not resolve bug.
Setting intel_pstate/max_perf_pct back to 100 does not resolve bug.
Setting long_term_power back to 65 resolves bug.

3) razer-cli write power ac 4 1 0
"Medium" power mode.
The behavior looks like a normal ~aggressive laptop performance behavior.
`stress -c 1` pegs a cpu core at 4.8 GHz, temps spike, and the fan slowly spins up. CPU speed drops down to various levels (2.8 GHz, 4.5 GHz, 4.3 GHz, 4.2 GHz), temperatures drop from the mid 90s to the mid 80s or 70s for a bit, and long_term_power drops to values in the 20-30s, but then resets back to 65 after a few seconds.
I was able to trigger long_term_power down to 0.125 once by toggling between stressing 1 vs 2 cpu cores, but it still reset up to 65 after a few seconds. Otherwise I was not able to trigger bug 1 or 2.
(All and multicore CPU speeds are pretty close to normal, looks like it targets ~75 degrees)

4) razer-cli write power ac 4 0 0
"low" power mode.
All and multicore CPU speeds are pretty close to normal, looks like it targets ~70 degrees, very close to "Medium".
This behavior looks like a lower or fairly passive power mode on ...


Revision history for this message
Eli (biblicabeebli) wrote :

previous post was too large to add an attachment? or something? here it is.
3MB text file, zipped to ~240k.

Revision history for this message
Eli (biblicabeebli) wrote :
Revision history for this message
Eli (biblicabeebli) wrote :
Revision history for this message
Eli (biblicabeebli) wrote :

I posted an issue on the repo of that razer-cli tool, maybe they can help.

(I apologize for my triple post above, but I think I know how they happened so...)

Revision history for this message
koba (kobako) wrote :

Hi, would you please help by trying a vanilla kernel, to check if the issue is still there?
https://drive.google.com/drive/folders/1AFgeX8_USkR9omba8E-D-cJsaDuhzKLW?usp=sharing

Revision history for this message
Eli (biblicabeebli) wrote :

@koba
I've tried to install the vanilla kernel you linked; I'm installing with a `dpkg -i *.deb` command in a folder containing the unpacked download of that folder.

I'm getting the following error. (I've been having issues with the nvidia 535 driver; I don't know exactly what's going on here, and currently I'm on the 525 driver. I can't tell if I can boot into this or if it failed. I'll try to find time to reboot/debug this, it's the middle of my work day.)

/etc/kernel/postinst.d/dkms:
 * dkms: running auto installation service for kernel 6.4.0-060400rc3-generic
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Building module:
Cleaning build area...
unset ARCH; [ ! -h /usr/bin/cc ] && export CC=/usr/bin/gcc; env NV_VERBOSE=1 'make' -j16 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.4.0-060400rc3-generic IGNORE_XEN_PRESENCE=1 IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/6.4.0-060400rc3-generic/build LD=/usr/bin/ld.bfd CONFIG_X86_KERNEL_IBT= modules....(bad exit status: 2)
ERROR (dkms apport): kernel package linux-headers-6.4.0-060400rc3-generic is not supported
Error! Bad return status for module build on kernel: 6.4.0-060400rc3-generic (x86_64)
Consult /var/lib/dkms/nvidia/525.125.06/build/make.log for more information.
dkms autoinstall on 6.4.0-060400rc3-generic/x86_64 failed for nvidia(10)
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
 * dkms: autoinstall for kernel 6.4.0-060400rc3-generic
   ...fail!
run-parts: /etc/kernel/postinst.d/dkms exited with return code 11
dpkg: error processing package linux-image-unsigned-6.4.0-060400rc3-generic (--install):
 installed linux-image-unsigned-6.4.0-060400rc3-generic package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 linux-headers-6.4.0-060400rc3-generic
 linux-image-unsigned-6.4.0-060400rc3-generic

Revision history for this message
koba (kobako) wrote :

@Eli, are you using 23.04/22.04? I built with 22.04 configuration.

Revision history for this message
Eli (biblicabeebli) wrote :

@koba Everything here has been 23.04

Revision history for this message
koba (kobako) wrote :

@Eli, I re-built with 23.04; would you please have a try, thanks
https://drive.google.com/drive/folders/1XmxwqgiUB_vjLRWiIaSpzXc89ilSwItx?usp=sharing

Revision history for this message
koba (kobako) wrote :

@Eli,
could you run this script and upload the log? thanks
~~~
// you can find this in the repo, https://github.com/intel/thermal_daemon
thermal_daemon/test/thermal-debug-dump-ubuntu.sh
~~~

Revision history for this message
Eli (biblicabeebli) wrote :

I have some ~different behavior, but I was still able to achieve bug 1.

Initially I saw some intel_pstate/max_perf_pct getting set to 90, but it would recover quickly.

long_term_power would fluctuate from very low values all the way up to the nominal 65. Previously when triggering the bug it would slowly go mostly down toward 0.125, getting stuck between 0.125 and ~4.0, with the cpu clocked at 400 MHz. Now it will go up, seemingly reset up to 65, and allow the max frequencies.

I was able to get something like bug 1 to happen twice.
long_term_power got stuck at 19.5 (unusual, might have been 18.5 the first time)
long_term_time got stuck at 28.0 (typical, it's usually this or 32.0)

This happened when I toggled between one and two pegged cores to keep the temperature at a maximum. If I killed the stress commands at the right moment while the fan was spun up and long_term_power was in the high teens, and then waited for the fans to spin down, I would get into a situation where I could not sustain maximum clocks long enough to reach the temperatures that trigger thermald to... poke whatever it is that resets long_term_power back up to 65.

I was able to do this twice.
On my third attempt I accidentally got a long_term_power value of 12.5. I waited for the fans to spin down, started stressing again, and it pretty immediately dropped to 0.125 with cpu speeds locked to 400 MHz. I also noticed at the end that intel_pstate/max_perf_pct was at 90.

Finally, after resetting long_term_power to 65 I noticed that I still couldn't get the fan up, because intel_pstate/max_perf_pct at 90 results in a maximum core speed of 4.3 GHz, which is too low to trigger thermald.

(Likely irrelevant: even though I see there is that intel wireless driver deb package (I installed all of the packages) in that 6.2-26 kernel folder you shared with me, my intel wifi card does not work when booted into it.)

This was with the thermald build in adaptive mode; the full command was:
sudo ./thermald --no-daemon --adaptive --loglevel=debug

Revision history for this message
koba (kobako) wrote :

@Eli, may I know if it is a brand-new notebook?

In the logs, there's one entry that limits your cpu.
This comes from the BIOS through ACPI.
~~~
[1694076246][INFO]index 2: type:passive temp:90000 hyst:1000 zone id:6 sensor id:6 control_type:1 cdev size:4
~~~

The cpu temperature is 93000, which exceeds temp:90000, so thermald tries to cool the cpu,
~~~
[1694076254][DEBUG]pref 0 type 3 temp 91000 trip 90000
[1694076254][DEBUG]Passive Trip point applicable
[1694076254][DEBUG]Trip point applicable > 2:90000
[1694076254][DEBUG]cdev size for this trippoint 4
[1694076254][DEBUG]cdev at index 27:rapl_controller
[1694076254][DEBUG]>>thd_cdev_set_state temperature 90000:91000 index:27 state:1 :zone:6 trip_id:2 target_state_valid:0 target_value :0 force:0 min_state:0 max_state:0
~~~

If this is a brand-new notebook, you should contact the vendor first.
If not, we need to investigate further.
thanks
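
(Side note: since the cooling device involved here is the rapl_controller, the same long_term/short_term package limits can usually also be watched through the kernel's powercap sysfs interface rather than via MSR reads; a minimal sketch, assuming the standard intel-rapl package-domain layout:)

    import os

    DOMAIN = "/sys/class/powercap/intel-rapl:0"   # package power domain (path assumed)

    if os.path.isdir(DOMAIN):
        for idx in (0, 1):   # constraint 0 is typically long_term, 1 short_term
            with open(f"{DOMAIN}/constraint_{idx}_name") as f:
                name = f.read().strip()
            with open(f"{DOMAIN}/constraint_{idx}_power_limit_uw") as f:
                watts = int(f.read()) / 1e6       # value is reported in microwatts
            print(f"{name}: {watts} W")
    else:
        print("no intel-rapl powercap domain at", DOMAIN)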

Revision history for this message
Eli (biblicabeebli) wrote :

No, it's not new; it's a 2022 model, and the 12th gen has been replaced by 13th gen. It's definitely a software issue introduced in an update to either thermald or the kernel layer it interacts with directly.

(There is the group that makes the Linux hardware drivers for Razer products - I think I mentioned that the issues occur whether those are present or not - that we could loop in. They might understand the power level interaction here?)

It seems to me like a bug where thermald sets a low power target with whatever knobs it has, but then doesn't understand the system has recovered and it needs to reset. What data is it waiting on? Another value from the bios?

Is it possible the "all good" signal came in/is-processed out-of-order? (is this even interrupt driven or does it periodically read values?

Fwiw I would be interested in diving into thermald myself; it's just not my everyday language. (And I've not done this level of direct hardware work before.)

Revision history for this message
Eli (biblicabeebli) wrote :

Received a response on github from the maintainer, https://github.com/Razer-Linux/razer-laptop-control-no-dkms/issues/37#issuecomment-1808340928

Relevant extract of that comment is:
my opinion is uninstall thermald, as for Razer laptop it does nothing positive, same for tlp. Control on the Razer laptops is through EC, that's why [Razer Laptop Control] exist[s]. Setting various bits in kernel or using other software will just hinder performance or brake everything. On some of the models (if not all) ACPII is broken, because Razer will always target Windows

I've not upgraded to 23.10 yet, but I will do so; in general, uninstalling thermald was the fix here.

I will report back if the issue is resolved in 23.10 when I get around to updating.

Changed in thermald (Ubuntu):
status: In Progress → Confirmed
assignee: koba (kobako) → nobody
Revision history for this message
Bo Chen (bochen87) wrote :

I have the same issue with the Razer Blade 15 2022 model. When waking up from suspend, it locks the CPU at 400 MHz. Uninstalling thermald and installing the openrazer drivers solved it for me.
