CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

Bug #2026658 reported by Eli
This bug affects 2 people
Affects            Status      Importance  Assigned to  Milestone
linux (Ubuntu)     Incomplete  Undecided   Unassigned
thermald (Ubuntu)  Confirmed   Undecided   Unassigned

Bug Description

I've tried to include as much detail as possible in this bug report. I originally assembled it just after the release of Ubuntu 23.04, and there has been no change since then.

I have had substantial performance problems since updating from Ubuntu 22.10 to 23.04.
The computer in question is the 17 inch Razer Blade laptop from 2022 with an Intel i7-12800H.
Current kernel is 6.2.0-20-generic. (I am now on 6.2.0-24-generic and nothing has changed.)
This issue occurs regardless of whether the OpenRazer (https://openrazer.github.io/) drivers etc. are installed.

Description of problem:
I have discovered what may be two separate bugs involving low-level power management details on the cpu; both involve the cpu entering different types of throttled states and never recovering. These issues appeared immediately after upgrading from Ubuntu 22.10. The computer is a large ~gaming laptop with plenty of thermal headroom; cpu temperatures cannot reach concerning values except when using stress testing tools.

(I don't know how to properly untangle these two issues, so I'm posting them as one. I apologize for the review complexity this causes, but I think posting the information all in one spot is more constructive here.)

High level testing notes:
- This issue occurs with both the intel_pstate driver and the cpufreq driver. (I don't have the same level of detail for cpufreq, but the issue still occurs.)
- I have additionally tested a handful of intel_pstate parameters (and others) via grub kernel command line arguments to no effect. All testing reported here was done with:
  GRUB_CMDLINE_LINUX_DEFAULT="modprobe.blacklist=nouveau"
  GRUB_CMDLINE_LINUX=""
  (loading nouveau caused problems for me on 22.10; I have not bothered reinvestigating it on 23.04)
- There is a firmware update available from the manufacturer when I boot into Windows; I have not installed it (yet).
- - Update: I installed it. No change.
- Changing the cpu governor setting from "powersave" to "performance" using `cpupower frequency-set -g performance` has no effect. (Note: this action is separate from intel_pstate's power-saver/balanced/performance setting visible with the `powerprofilesctl` utility; it doesn't seem to be a governor bug. See the short sketch after this list for how the two knobs are read.)
- - (There is a tertiary issue where I also see substantial (+50%) performance degradation using the "performance" profile in a test suite I run constantly for my job; that is clearly a problem but it is an unrelated bug that has existed for quite some time.)
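
(Illustrative sketch only, to make the distinction above concrete: the governor and the power profile live in different places; the sysfs path and the `powerprofilesctl get` subcommand are the standard interfaces:)

    # The cpufreq governor is a per-CPU sysfs value; the power-saver/balanced/
    # performance profile comes from power-profiles-daemon via powerprofilesctl.
    import subprocess

    with open("/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor") as f:
        print("cpufreq governor:", f.read().strip())        # e.g. "powersave"

    profile = subprocess.run(["powerprofilesctl", "get"],
                             capture_output=True, text=True).stdout.strip()
    print("power-profiles-daemon profile:", profile)         # e.g. "balanced"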

Summary and my own conclusions:
These are my takeaways; the ~raw data is in the follow-up section.

Bug 1)
The reported cpu power limits are progressively constrained over time. Once this failure mode starts, performance never recovers.
  - As this situation progresses the observed cpu speeds (I'm using htop) list as 2800 MHz at idle, but the instant any load at all is placed on a cpu core, that core immediately drops to exactly 400 MHz.
  - This situation occurs quite quickly in human terms, frequently within 20 minutes of normal usage after a boot, but it will also occur when the computer is just sitting there unused for a handful of hours.
  - This also occurs when using the cpufreq governor (by including "intel_pstate=disable" in the grub command line args).
  - At boot the default value for short_term_time looks wrong to me. This is the duration of higher thermal targets in seconds; ~0.002 seconds seems extremely short. A normal value would be a handful of seconds.
  - This situation can be remedied by running the following python script. It uses the undervolt package (pip install undervolt==0.3.0) to force particular power limits (the provided values are intentional overkill):
      from undervolt import read_power_limit, set_power_limit, PowerLimit, ADDRESSES
      from pprint import pprint

      limits = read_power_limit(ADDRESSES)
      pprint(vars(limits))  # print current values before setting them

      POWER_LIMITS = PowerLimit()
      POWER_LIMITS.locked = True  # lock means don't allow the values to be reset until a reboot
      POWER_LIMITS.backup_rest = 281474976776192  # afaik this is just a backup-on-failure setting, it has no effect here
      POWER_LIMITS.long_term_enabled = True
      POWER_LIMITS.long_term_power = 160  # values are intentional overkill
      POWER_LIMITS.long_term_time = 2880.0
      POWER_LIMITS.short_term_enabled = True
      POWER_LIMITS.short_term_power = 250
      POWER_LIMITS.short_term_time = 500.0
      set_power_limit(POWER_LIMITS, ADDRESSES)

      limits2 = read_power_limit(ADDRESSES)  # and print the new state
      pprint(vars(limits2))
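      (Note: the undervolt package pokes low-level registers, so a script like this has to run as root, e.g. `sudo python3 fix_power_limits.py`, where the filename is just a placeholder; and with locked = True the limits cannot be changed again until the next reboot, as the comment says.)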

Bug 2)
`powerprofilesctl` has unearthed some bug where the cpu performance enters the degraded state "high-operating-temperature", and never recovers.
  - This appears to happen for no reason. There is a brief cpu temperature spike in the example data below, but it does not hit the listed hardware limit values so I am at a loss for its cause.
  - I ran a cpu stress test (prime95/mprime torture test), it immediately spikes cpu temperature to 100 degrees and throttles the cpu, but doesn't trigger the high temperature degraded state. Go figure.
  - This bug takes quite a while to kick in, uptime in my example below was at over 14 hours.
  - When this situation occurs the maximum cpu speed becomes 2400 MHz across all cpu cores. The cpu power management appears to behave correctly in the 400-2400 MHz range. I believe this means all turbo frequencies are disabled.
  - Running the command `sudo cpupower frequency-set -u 4800000` (or any value above 2400000) does not correct the reported cpu_policy_range; it remains locked at 2400 MHz.
  - The only fix I know is a reboot.

THE DATA:

Bug 1:
This output was gathered by a script that starts running at ~boot and calls the read_power_limit function from the python undervolt package.
long_term_power and short_term_power are values in watts; long_term_time and short_term_time are values in seconds.
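
(For context, a minimal sketch of what such a boot-time logging script might look like; this is illustrative only, it reuses the same undervolt API as the fix script in the summary, and the log path and polling interval are placeholders:)

    # Periodically log the RAPL power limits plus uptime. Needs to run as root,
    # the same requirement as the fix script above.
    import subprocess
    import time
    from datetime import datetime
    from undervolt import read_power_limit, ADDRESSES

    LOG_PATH = "/var/log/power_limit_watch.log"   # placeholder path

    while True:
        limits = read_power_limit(ADDRESSES)      # same call as the fix script
        uptime = subprocess.run(["uptime"], capture_output=True, text=True).stdout.strip()
        with open(LOG_PATH, "a") as log:
            log.write(f"{datetime.now():%Y-%m-%d %H:%M:%S} {uptime}\n")
            log.write(f" long_term_power: {limits.long_term_power}\n")
            log.write(f" long_term_time: {limits.long_term_time}\n")
            log.write(f" short_term_power: {limits.short_term_power}\n")
            log.write(f" short_term_time: {limits.short_term_time}\n\n")
        time.sleep(60)                            # arbitrary polling interval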

2023-05-12 15:14:32 up 0 min, 0 user, load average: 0.39, 0.10, 0.03
(boot, log starts after normal user login)
 long_term_power: 65.0
 long_term_time: 32.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 15:20:29 up 6 min, 2 users, load average: 1.90, 0.86, 0.37
 long_term_power: 20.875 <-- down
 long_term_time: 28.0 <-- down
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 15:20:46 up 6 min, 2 users, load average: 1.63, 0.87, 0.38
 long_term_power: 22.625 <-- hey it went up! I was still using the computer at this point
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 15:46:15 up 32 min, 2 users, load average: 0.66, 0.84, 0.79
(no longer at computer by the time this occurs)
 long_term_power: 20.625 <-- down
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 16:04:46 up 50 min, 3 users, load average: 0.46, 0.70, 0.79
 long_term_power: 16.625 <-- down
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

2023-05-12 17:23:07 up 2:08, 3 users, load average: 0.49, 0.61, 0.68
(by the time long_term_power hits 8.625 all cpu cores throttle to 400 MHz under any load. This one was preceded by ~1 second of a single cpu core randomly spiking to 78 degrees; output from `powerprofilesctl` remains normal. At this point long_term_power will never go up again. I have seen one more lowered stage at ~4.3125w.)
 long_term_power: 8.625 <-- way down - I've seen lower, though.
 long_term_time: 28.0
 short_term_power: 160.0
 short_term_time: 0.00244140625

(And then after several hours stuck in this mode I returned to the computer and needed to run the script in the bug 1 summary to make it usable again.)

Bug 2:
(Some cleanup of output, script starts at ~boot)
2023-05-11 22:21:15 up 14:15, 2 users, load average: 0.38, 0.42, 0.52

Output from powerprofilesctl:
  | performance:
  | Driver: intel_pstate
  | Degraded: no
  |* balanced:
  | Driver: intel_pstate
  | power-saver:
  | Driver: intel_pstate

some summarized details from the `cpupower` utility:
  | cpu_number: 2
  | cpu_range: 400 MHz - 4.70 GHz
  | cpu_policy_range: 400 MHz and 4.70 GHz.
  | governor: powersave

output from `sensors` (slightly compactified, I don't know what's up with the cpu core numbers):
  | iwlwifi_1-virtual-0 - Adapter: Virtual device - temp1: +49.0°C
  | nvme-pci-0300 - Adapter: PCI adapter - Composite:
  | +40.9°C (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
  | nvme-pci-0200 - Adapter: PCI adapter:
  | Composite: +36.9°C (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
  | Sensor 1: +36.9°C (low = -273.1°C, high = +65261.8°C)
  | Sensor 2: +38.9°C (low = -273.1°C, high = +65261.8°C)
  | coretemp-isa-0000 - Adapter: ISA adapter
  | Package id 0: +77.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 0: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 4: +54.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 8: +77.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 12: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 16: +64.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 20: +45.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 24: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 25: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 26: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 27: +52.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 28: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 29: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 30: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 31: +50.0°C (high = +100.0°C, crit = +100.0°C)
  | acpitz-acpi-0 - Adapter: ACPI interface: temp1: +27.8°C (crit = +105.0°C)

2023-05-11 22:21:17 up 14:15, 2 users, load average: 0.38, 0.42, 0.52 (2 seconds later)

output from `powerprofilesctl`:
  | performance:
  | Driver: intel_pstate
  | Degraded: yes (high-operating-temperature)
  |* balanced:
  | Driver: intel_pstate
  | power-saver:
  | Driver: intel_pstate

some summarized details from the `cpupower` utility:
  | cpu_number: 8
  | cpu_range: 400 MHz - 4.70 GHz
  | cpu_policy_range: 400 MHz and 2.40 GHz.
  | governor: powersave

output from `sensors` (slightly compactified, I don't know what's up with the cpu core numbers):
  | iwlwifi_1-virtual-0 Adapter: Virtual device temp1: +49.0°C
  | nvme-pci-0300 - Adapter: PCI adapter
  | Composite: +40.9°C (low = -5.2°C, high = +89.8°C) (crit = +93.8°C)
  | nvme-pci-0200 - Adapter: PCI adapter
  | Composite: +36.9°C (low = -273.1°C, high = +80.8°C) (crit = +84.8°C)
  | Sensor 1: +36.9°C (low = -273.1°C, high = +65261.8°C)
  | Sensor 2: +38.9°C (low = -273.1°C, high = +65261.8°C)
  | coretemp-isa-0000 - Adapter: ISA adapter
  | Package id 0: +60.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 0: +53.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 4: +59.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 8: +54.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 12: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 16: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 20: +60.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 24: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 25: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 26: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 27: +58.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 28: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 29: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 30: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | Core 31: +55.0°C (high = +100.0°C, crit = +100.0°C)
  | acpitz-acpi-0 - Adapter: ACPI interface - temp1: +27.8°C (crit = +105.0°C)

Revision history for this message
Eli (biblicabeebli) wrote :
description: updated
Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

> I have had substantial performance problems since updating from ubuntu 22.10 to 23.04.
Maybe it's caused by thermald? See if `sudo systemctl stop thermald` can help.

Revision history for this message
Eli (biblicabeebli) wrote :

> Maybe it's caused by thermald? See if `sudo systemctl stop thermald` can help.

I will try this, and I will also reinstall the package.

I am waiting to see if using sane power parameters in my script for bug 1 fixes the bug 2 issue, but that means I need to leave it sitting for 12+ hours so my iteration speed here is very slow.

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

I figured I would do an apt reinstall of thermald first, and that seems to have fixed bug 1. Bug 2 takes many hours to kick in, so I will have to leave it on for a ~day to find out.

Thank you, hopefully this was a false alarm.

Revision history for this message
Eli (biblicabeebli) wrote :

I disabled thermald, I think bug 1 may be resolved now, but bug 2 eventually occurred again overnight.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please attach output of `grep . /sys/devices/system/cpu/intel_pstate/*` when the issue happens?

Revision history for this message
Eli (biblicabeebli) wrote :

I will grep that for you; here is the reference information just after boot when everything works. I have to wait for the bug to kick in (and I woke up to a blinking cursor in the upper left corner, funnn).

reference:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

Interesting, a new ~intermediate variation of bug 1. It's not constraining the thermal envelope quite as much now. Thermald is running; I guess bug 1 is still present but lessened since I reinstalled it? The computer was not in use.

When running `stress -c 1` it places the task on the correct ideal core (the one that clocks up to 4.8 GHz), but it bounces around between 1300 MHz and 500 MHz.
**update: it looks like any multithreaded load gets shunted down to 400 MHz, with occasional spikes on single threaded operation.

First log statement for the power envelope was at 2023-07-19T18:07:15 (roughly at boot) and then long_term_time and long_term_power suddenly step down and stay that way at roughly 3 hours 30 minutes uptime. (Current uptime is about 9 hours 50 minutes.)

2023-07-19T21:39:11
{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 65.0,
 'long_term_time': 32.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

2023-07-19T21:39:13
{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 7.75,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

here is your grep:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

It will take a ~day for bug 2 to trigger, I will update and do the grep when that happens. If I don't need to use this computer I will leave it on without running my script to lock the power envelope at a higher value.

Revision history for this message
Eli (biblicabeebli) wrote :

Ok well this is interesting.
It has been over 24 hours (uptime 1 day 15:48hrs) and bug 2 still hasn't triggered, "Degraded = no".

I probably need this computer today, so I ran `stress -c 1` to try to force the high temperature power state after running the script to reset the power values, but with lock set to False. This went onto the ideal cpu core at 4.8 GHz, package temperature spiked up to ~93 degrees C, and eventually the power parameters dropped to this:

{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 0.125,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 250.0,
 'short_term_time': 512.0}

Bug 2 still had not been triggered and the cpu was throttled to 400 MHz, so I reset the power values again, this time with lock=True. When I ran `stress -c 1` it was on the ideal core but now limited to 4.3 GHz with a cpu package temp of ~67 degrees C.

I ran the grep and got this:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:90
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

Revision history for this message
Eli (biblicabeebli) wrote :

I have updates!
- I set /sys/devices/system/cpu/intel_pstate/max_perf_pct to 100 and confirmed that it restores the 4.7/4.8 peak turbo frequencies.
- I ran `stress -c 1`, cpu package temps went up to ~84 degrees C. No changes on the grep, no changes on the powerprofilesctl degraded state.
- I ran `stress -c 2`, cpu package temps went up to ~90 degrees C. This triggered the powerprofilesctl degraded state.

The grep:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:70
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:1
/sys/devices/system/cpu/intel_pstate/status:active

$ powerprofilesctl
* performance:
    Driver: intel_pstate
    Degraded: yes (high-operating-temperature)
  balanced:
    Driver: intel_pstate
  power-saver:
    Driver: intel_pstate

$ cpupower frequency-info
analyzing CPU 6:
  driver: intel_pstate
  ...
  hardware limits: 400 MHz - 4.80 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 2.40 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  boost state support:
    Supported: yes
    Active: yes

In case it is relevant, this has been with thermald running.
Temperatures are back down around 45 degrees C, which is typical, but as stated in the original report it will never recover on its own.

* * *

I have now set those values back to their originals, e.g.
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/no_turbo:0

I will also note that the 400 MHz to 2.40 GHz range indicated by cpupower reverts to the full range when no_turbo is set back to 0, and the powerprofilesctl degraded state is also directly based on this value. (So I will stop reporting them!)

From this I can at least write a script that sets these variables back to normal and regains normal functionality without rebooting!
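
A minimal sketch of what such a reset script could look like (just direct sysfs writes, which need root; the values are the at-boot ones listed above):

    # Reset the intel_pstate knobs back to the values observed at boot.
    PSTATE = "/sys/devices/system/cpu/intel_pstate"

    def write_sysfs(name, value):
        with open(f"{PSTATE}/{name}", "w") as f:
            f.write(str(value))

    write_sysfs("max_perf_pct", 100)  # restore the full turbo frequency range
    write_sysfs("no_turbo", 0)        # clears the 2.40 GHz policy cap (and the degraded state)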

My next step will be to uninstall thermald entirely, reboot, and report back on whether I'm able to trigger either bug. I'm confused about what I experienced before; the reboot is to clear the thermal envelope lock set by my script.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Can you please see if the bug is present on the latest thermald?

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Eli (biblicabeebli) wrote :

So far I have not had either bug recur with thermald uninstalled; I'm satisfied that thermald was the culprit for bug 1. Running `stress -c 2` (highest possible thermal load) doesn't trigger anything.

The grep on this run is
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:1
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

(I don't know why hwp_dynamic_boost is 1 here; I only know it changed at some point while running stress. I thought this was locked to 0 based on a kernel command line argument for intel_pstate?)

I have installed 2.5.2-1 and will reboot and report back.

Revision history for this message
Eli (biblicabeebli) wrote :

Nope, running stress -c 2 yields...

long_term_power starts at 65, drops to ~10, then jumps around values like 37.625, 30.0, 31.0, 28.0, but then after a ~minute the power values look like this and clearly aren't recovering:

{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 3.0,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

grep looks like this:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:100
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:0
/sys/devices/system/cpu/intel_pstate/status:active

CPU speeds are set to their nominal speeds but immediately drop down to 400 MHz as soon as they take any load.

A more real-world check: a parallel test suite I run frequently, which should take on the order of 3.5 seconds, takes 30. (Running it single threaded is clearly also terrible, but I don't offhand know its normal run time.)

Revision history for this message
Eli (biblicabeebli) wrote :

reference data point: the power values just after a reboot when thermald is not installed are:

{'backup_rest': 65536,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 65.0,
 'long_term_time': 56.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

Revision history for this message
Eli (biblicabeebli) wrote :

Not to distract too much, but I have also uncovered a separate, replicable performance bug that is quite bizarre. tl;dr: though reported clock speeds are consistently higher after running `cpupower frequency-set -g performance`, I see a very reliable on-the-order-of-40%-ish performance loss on a particular compile task, and on that test suite I mentioned. This occurs whether thermald is installed or not, and it also occurs when min_perf_pct is set to 100.

I will look into that more deeply and can create a new bug report for it - but I don't know what to label that one and would appreciate a recommendation.

I feel this issue's name has been proven wrong; it seems clear this is a thermald thing, so I propose we rename it. (If that is an option.) The new performance weirdness is a better candidate for being close to the kernel, since it is literally a pstate driver option - assuming I've understood what the cpupower command does.

Revision history for this message
koba (kobako) wrote :

@Eli, could you provide the thermald logs? thanks
#sudo systemctl stop thermald
#sudo thermald --no-daemon --adaptive --loglevel=debug >> thermald_debug_202307270935

Revision history for this message
Eli (biblicabeebli) wrote :

I'm running day-to-day without thermald installed, so I will need to find time to do this. But yes I can.

koba (kobako)
Changed in thermald (Ubuntu):
status: New → In Progress
assignee: nobody → koba (kobako)
Revision history for this message
Eli (biblicabeebli) wrote :

(Trying to find time to do this. Had some life externalities come up, and my work is currently unavoidably attached to this specific computer and I have some deadlines. 🫠)

Revision history for this message
James Gardner (jadgardner) wrote :

Hi all,

I believe I am also experiencing this issue, pretty much exactly as described.

I'm using a Lambda Tensorbook 2022 (which is a 15.6-inch Razer Blade from 2022 in a new case) with Ubuntu 22.04.3 LTS and the current kernel is 6.2.0-26-generic.

Here is the output of 'cpupower frequency-info'

"""
analyzing CPU 0:
  driver: intel_pstate
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: Cannot determine or is not supported.
  hardware limits: 400 MHz - 4.70 GHz
  available cpufreq governors: performance powersave
  current policy: frequency should be within 400 MHz and 4.70 GHz.
                  The governor "powersave" may decide which speed to use
                  within this range.
  current CPU frequency: Unable to call hardware
  current CPU frequency: 400 MHz (asserted by call to kernel)
  boost state support:
    Supported: yes
    Active: yes

"""

And the output of 'sensors'

"""

sensors
iwlwifi_1-virtual-0
Adapter: Virtual device
temp1: +37.0°C

nvme-pci-0200
Adapter: PCI adapter
Composite: +35.9°C (low = -273.1°C, high = +82.8°C)
                       (crit = +84.8°C)
Sensor 1: +35.9°C (low = -273.1°C, high = +65261.8°C)

BAT0-acpi-0
Adapter: ACPI interface
in0: 17.36 V
curr1: 0.00 A

coretemp-isa-0000
Adapter: ISA adapter
Package id 0: +58.0°C (high = +100.0°C, crit = +100.0°C)
Core 0: +58.0°C (high = +100.0°C, crit = +100.0°C)
Core 4: +52.0°C (high = +100.0°C, crit = +100.0°C)
Core 8: +53.0°C (high = +100.0°C, crit = +100.0°C)
Core 12: +52.0°C (high = +100.0°C, crit = +100.0°C)
Core 16: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 20: +51.0°C (high = +100.0°C, crit = +100.0°C)
Core 24: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 25: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 26: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 27: +48.0°C (high = +100.0°C, crit = +100.0°C)
Core 28: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 29: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 30: +50.0°C (high = +100.0°C, crit = +100.0°C)
Core 31: +50.0°C (high = +100.0°C, crit = +100.0°C)

nvme-pci-0300
Adapter: PCI adapter
Composite: +38.9°C (low = -273.1°C, high = +82.8°C)
                       (crit = +84.8°C)
Sensor 1: +38.9°C (low = -273.1°C, high = +65261.8°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1: +27.8°C (crit = +105.0°C)

"""

Let me know if there is any other information I can provide.

Revision history for this message
koba (kobako) wrote :

@james,
* dump thermald version
#sudo apt policy thermald

* stop thermald.service and run manually.
#systemctl disable thermald.service
#systemctl stop thermald.service
#reboot
#sudo thermald --no-daemon --adaptive --loglevel=debug > thermaldLog_202308111733
#try to reproduce
#upload log.

Revision history for this message
James Gardner (jadgardner) wrote :

#sudo apt policy thermald

thermald:
  Installed: 2.4.9-1ubuntu0.3
  Candidate: 2.4.9-1ubuntu0.3
  Version table:
 *** 2.4.9-1ubuntu0.3 500
        500 http://gb.archive.ubuntu.com/ubuntu jammy-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     2.4.9-1 500
        500 http://gb.archive.ubuntu.com/ubuntu jammy/main amd64 Packages

Disabling and stopping thermald.service, rebooting and then running the command:

#sudo thermald --no-daemon --adaptive --loglevel=debug > thermaldLog_202308111733

currently seems to be resulting in me not being able to reproduce the bug, which in my case is bug 1 in the original post. All CPU cores were dropping to 400 MHz within a few minutes of startup, and currently it's at 2 hours of uptime with no drop. I'll continue to try to reproduce and will upload the log should it occur.

Revision history for this message
James Gardner (jadgardner) wrote :

The issue has occurred again. All CPU cores lock to 400MHz when under any load. I've attached the thermald log.

Revision history for this message
koba (kobako) wrote : Re: [Bug 2026658] Re: CPU frequency governor broken after upgrading from 22.10 to 23.04, stuck at 400Mhz on Alder Lake

@James,
Thanks for the update,
Could you show which kernel you are using?
#uname -a

Is it possible to check with the upstream thermald?
https://github.com/intel/thermal_daemon

Revision history for this message
James Gardner (jadgardner) wrote :

#uname -a:

Linux james-TensorBook-2022 6.2.0-26-generic #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Jul 13 16:27:29 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

I have attached the log from running:

#sudo sbin/thermald --no-daemon --adaptive --loglevel=debug > thermaldlog_202308130942

using the thermald binary installed from source.

The same issue occurred, though perhaps slightly different as it was locking to ~900 - 1000MHz rather than 400MHz when under load.

Revision history for this message
Eli (biblicabeebli) wrote :

I have some time (and was reminded by thread updates, thank you for posting Mr. Gardner!) and am running my test. I reinstalled (after a purge uninstall iirc) thermald via apt, then ran `stress -c 2` for a ~minute to bring it to the brink, then `stress -c 1` to push it over the edge. (This seems to reliably cause bug 1.)

$ sudo apt policy thermald
Installed: 2.5.2-1
  Candidate: 2.5.2-1
  Version table:
 *** 2.5.2-1 500
        500 http://us.archive.ubuntu.com/ubuntu lunar/main amd64 Packages
        100 /var/lib/dpkg/status

$ uname -a
Linux TheUssBenterprise 6.2.0-27-generic #28-Ubuntu SMP PREEMPT_DYNAMIC Wed Jul 12 22:39:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Those power values after triggering bug 1 (TL;DR, long_term_power is stuck at 0.125):
{'backup_rest': 281474976776192,
 'locked': False,
 'long_term_enabled': True,
 'long_term_power': 0.125,
 'long_term_time': 28.0,
 'short_term_enabled': True,
 'short_term_power': 160.0,
 'short_term_time': 0.00244140625}

Thermald log file is attached.

I will now try to work out how to trigger bug 2.

Revision history for this message
Eli (biblicabeebli) wrote :

(oh, that's the first time I attached something on a thread; I will name my attachments better from now on, that one was just named thermald.log)

Revision history for this message
Eli (biblicabeebli) wrote :

Unfortunately I experienced a bad crash - black screen, blinking underscore-style cursor in the upper-left corner - after running that stress test for several hours, so I don't know if I ever triggered bug 2.

I will run the standard long-form test ("keep it on for several days with thermald running"), with my script that locks the power details so that I can use the computer. I still have a script logging cpupower details, sensors, and powerprofilesctl in the background, so I can and will track down exactly when bug 2 is triggered.

(aaaaaaannnd I just deleted those old log files so I can't check if I succeeded on that most recent test. derp.)

This tends to take 48 hours+.
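
(A rough sketch of the kind of background logger described above; this is not the exact script, and the output path and interval are placeholders:)

    # Append the output of a few diagnostic commands to a log once a minute.
    import subprocess
    import time
    from datetime import datetime

    OUT_PATH = "/tmp/power_debug.log"   # placeholder path
    COMMANDS = [["uptime"], ["powerprofilesctl"],
                ["cpupower", "frequency-info"], ["sensors"]]

    while True:
        with open(OUT_PATH, "a") as log:
            log.write(f"\n===== {datetime.now():%Y-%m-%d %H:%M:%S} =====\n")
            for cmd in COMMANDS:
                out = subprocess.run(cmd, capture_output=True, text=True).stdout
                log.write(f"--- {' '.join(cmd)} ---\n{out}\n")
        time.sleep(60)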

Revision history for this message
James Gardner (jadgardner) wrote :

I have installed and run an older kernel:

#uname -a
Linux james-TensorBook-2022 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

And have been unable to reproduce the bug with thermald 2.4.9-1ubuntu0.3 running.

Revision history for this message
koba (kobako) wrote :

@James, thanks for your information,
could you please also upload the thermald log against 6.0.9-060009-generic

Revision history for this message
James Gardner (jadgardner) wrote :

I've attached the log.

Revision history for this message
koba (kobako) wrote :

@James and @Eli, could you please also list the content of this folder? thanks
#sudo ls /sys/bus/acpi/devices/INTC1041:00/

Revision history for this message
James Gardner (jadgardner) wrote :

#sudo ls /sys/bus/acpi/devices/INTC1041:00/
hid modalias path physical_node power status subsystem uevent uid wakeup

Revision history for this message
koba (kobako) wrote :

@James,
I found this error occurs against both 6.0.9-060009-generic and 6.2.0-27-generic.
This is caused by INTC1041:00/data_vault not existing on your system.
~~~
[1691916182][DEBUG]Unable to open GDDV data vault
[1691916182][INFO]THD engine init failed
[1691916182][INFO]--adaptive option failed on this platform
[1691916182][INFO]Ignoring --adaptive option
~~~

could you please try another policy with thermald against 6.0.9-060009-generic and 6.2.0-27-generic? thanks
#sudo thermald --no-daemon --loglevel=debug > thermaldLog_woAdaptive_$(date "+%Y%m%d%H%M")

Revision history for this message
Eli (biblicabeebli) wrote :

I have been able to get thermald log info for bug 2.
(This was accomplished with locked power details, so the computer remained usable over the ~30 hours of uptime before I saw it had triggered.)

The log file itself is over 50MB, I've zipped it into a 3.8MB file.

Grep output to confirm bug 2:
/sys/devices/system/cpu/intel_pstate/hwp_dynamic_boost:0
/sys/devices/system/cpu/intel_pstate/max_perf_pct:70
/sys/devices/system/cpu/intel_pstate/min_perf_pct:10
/sys/devices/system/cpu/intel_pstate/no_turbo:1
/sys/devices/system/cpu/intel_pstate/status:active

To save a lot of bother, from my own system logging script I determined that bug 2 was triggered between (2023-08-14) 21:37:13 and 21:37:14 (US Eastern). Converting that first one to unix timestamps yields 1692063433.
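
(The conversion can be double-checked with the Python standard library; US Eastern is UTC-4 in mid-August:)

    from datetime import datetime
    from zoneinfo import ZoneInfo

    t = datetime(2023, 8, 14, 21, 37, 13, tzinfo=ZoneInfo("America/New_York"))
    print(int(t.timestamp()))   # 1692063433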

So, of these 3 logging events, 2 should be the before/after log statements from thermald - unless I've screwed up my math:

[1692063430][DEBUG]poll exit 0 polls_fd event 0 0
[1692063430][DEBUG] energy 1:524286656:772647335 mj: 7965 mw
[1692063430][DEBUG]read_temperature sensor ID 4
[1692063430][DEBUG]Sensor TCPU :temp 48000
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 103050
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 104550
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 106050
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 107050
[1692063430][DEBUG]pref 0 type 4 temp 48000 trip 109050
[1692063430][DEBUG]pref 0 type 0 temp 48000 trip 110050
[1692063430][DEBUG]pref 0 type 2 temp 48000 trip 110050
[1692063430][DEBUG]Passive Trip point applicable
[1692063430][DEBUG]Trip point applicable < 1:110050
[1692063430][DEBUG]cdev size for this trippoint 0
[1692063430][DEBUG]pref 0 type 3 temp 48000 trip 90000
[1692063430][DEBUG]Passive Trip point applicable
[1692063430][DEBUG]Trip point applicable < 2:90000
[1692063430][DEBUG]cdev size for this trippoint 4
[1692063430][DEBUG]cdev at index 13:Processor
[1692063430][DEBUG]>>thd_cdev_set_state temperature 90000:48000 index:13 state:0 :zone:4 trip_id:2 target_state_valid:0 target_value :0 force:0 min_state:0 max_state:0
[1692063430][DEBUG]zone_trip_limits.size() 0
[1692063430][DEBUG]def_max_state:0 temp_max_state:0 curr_max_state:0
[1692063430][DEBUG]thd_cdev_set_13:curr state -1657 max state 0
[1692063430][DEBUG]def_min_state:0 curr_min_state:0
[1692063430][INFO]op->device:Processor -1658
[1692063430][DEBUG]set cdev state index 13 state -1658
[1692063430][INFO]sysfs write failed /sys/class/thermal/cooling_device13/cur_state
[1692063430][INFO]Set : threshold:90000, temperature:48000, cdev:13(Processor), curr_state:-1658, max_state:0
[1692063430][DEBUG]<<thd_cdev_set_state 0

[1692063434][DEBUG]poll exit 0 polls_fd event 0 0
[1692063434][DEBUG] energy 1:524286656:772685798 mj: 9615 mw
[1692063434][DEBUG]read_temperature sensor ID 4
[1692063434][DEBUG]Sensor TCPU :temp 90000
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 103050
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 104550
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 106050
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 107050
[1692063434][DEBUG]pref 0 type 4 temp 90000 trip 109050
[1692063434][DEBUG]pref 0 type 0 temp 90000 trip 110050
[16...


Revision history for this message
koba (kobako) wrote :

@Eli, is it possible to try the upstream thermald?
~~~
https://github.com/intel/thermal_daemon
~~~

after compilation is finished, run thermald in the thermal_daemon folder.
~~~
sudo ./thermald --no-daemon --adaptive --loglevel=debug > thermald_adaptive_$(date "+%Y%m%d%H%M")
~~~
W/o adaptive:
~~~
sudo ./thermald --no-daemon --loglevel=debug > thermald_woAdaptive_$(date "+%Y%m%d%H%M")
~~~

Revision history for this message
Eli (biblicabeebli) wrote :

Testing 6.0.9-060009-generic behavior

output of `uname -a`
Linux TheUssBenterprise 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Description of behavior:
- The throttling does occur initially after running stress -c 1 and -c 2. It sticks at 400 MHz briefly after stressing for ~2 minutes, but unlike with 6.2.0-27.28 the system recovers afterwards.
- Looking at power details, long_term_power drops to 0.125, but then very slowly recovers back up towards the default of 65.
- While stressing at high cpu temperatures /sys/devices/system/cpu/intel_pstate/max_perf_pct transiently drops down to 90, but recovers up to 100 almost immediately.
- After letting it recover for a while I attempted to get the cpu to throttle again using stress, but I couldn't make it happen.

thermald log attached (the log updates frequently during the stress test).

---

I will now try to get that version of thermald compiling and test on 6.0.9 and 6.2.x

Revision history for this message
Eli (biblicabeebli) wrote :

Woops, didn't attach log, here it is.

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

Log output with the default cloned branch of the github thermald (I rebooted for a clean test).

$ uname -a
Linux TheUssBenterprise 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

** Command used was sudo ./thermald --no-daemon --loglevel=debug

Same behavior as the prior test. I can trigger it initially by toggling stress with -c 1 and -c 2: power details drop and the cpu clock throttles to 400 MHz, but then the system recovers once the stress test stops.

Next I will test with the normal kernel.

Revision history for this message
Eli (biblicabeebli) wrote :

Test of build of thermald from github.

$ uname -a
Linux TheUssBenterprise 6.2.0-27-generic #28-Ubuntu SMP PREEMPT_DYNAMIC Wed Jul 12 22:39:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ./thermald --no-daemon --loglevel=debug

Behavior:
Toggling between stress -c 1 and stress -c 2, the system throttles down to 400 MHz and never recovers. Power details are stuck at 'long_term_power': 0.125.

Next I will reboot and run with the same test with the adaptive flag passed in to thermald.

Revision history for this message
Eli (biblicabeebli) wrote :

I keep forgetting to attach the logs, sorry.

Revision history for this message
Eli (biblicabeebli) wrote :

Test of build of thermald from github, --adaptive flag enabled, normal linux kernel.

$ uname -a
Linux TheUssBenterprise 6.2.0-27-generic #28-Ubuntu SMP PREEMPT_DYNAMIC Wed Jul 12 22:39:51 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ./thermald --no-daemon --adaptive --loglevel=debug

Behavior: the usual toggling of stress triggers the bug, the power details become 'long_term_power': 0.125, and the cpu gets stuck at 400 MHz and never recovers.

actually attaching log this time!

Revision history for this message
Eli (biblicabeebli) wrote :

Test of build of thermald from github, --adaptive flag enabled, old linux kernel.

$ uname -a
Linux TheUssBenterprise 6.0.9-060009-generic #202211161102 SMP PREEMPT_DYNAMIC Wed Nov 16 12:14:18 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

$ sudo ./thermald --no-daemon --adaptive --loglevel=debug

Behavior: the usual toggling of stress; 'long_term_power' drops to 0.125 and the cpu throttles to 400 MHz, but long_term_power then slowly recovers and eventually jumps back up to the default of 65. (I was able to trigger this throttling twice this time.)

Revision history for this message
James Gardner (jadgardner) wrote :

@koba

Here is the log for the other thermald policy on 6.0.9-060009-generic:

As before using 6.0.9 seems to be fixing the issue for me and I was unable to reproduce the throttling. I'll now run it on 6.2.0-27-generic.

Revision history for this message
Eli (biblicabeebli) wrote (last edit ):

I'll describe my process to replicate the transient throttling I see on 6.0.9, and the permanent throttle on 6.2.x:
- Open up whatever you are using to watch clocks and temperature, and two terminals.
- in one terminal run stress -c 1.
- you will see one cpu core spike to your cpu's maximum clock speed and pretty much stay there. For me it's 4.8 GHz.
- in the other terminal run stress -c 1.
- you will now see two cores running slightly under that single core maximum speed. I get a value in the 4.7-4.8 range, which probably means flipping between 4.7 and 4.8 GHz.
- this is the maximum heat output of the cpu; if the fan has not spun up you will see a Package Temp up to 100, and when the fan spins up for me it drops to ~92.
- after running at max clock for 10-20 seconds the cpu will throttle down to its multicore turbo speed. For me it's 4.2-4.3 GHz, and the temperature will drop a solid 15-20 degrees.
- kill one of your stress commands, and you will see the temperature spike back up.
- wait ~10 seconds and then run stress -c 1 again in the terminal you just killed it in. Clocks will stay at max for a bit, then drop, and then kill one of the stress commands.
- Repeat this process of keeping 1 and then 2 cores always at maximum clocks, and you will eventually get thermally throttled down to 400 MHz.
- weirdly it stays at 400 MHz even on 6.0.9 until you stop running both stress commands, even though temps recover to like 45 degrees.

This process reliably triggers bug 1, and very occasionally (I've done it once) can trigger bug 2.

If you use something like my script to get power details (I just call them that, I don't have a better name) you can watch long_term_power fluctuate and then nosedive from 65 to 0.125.

(All temperatures are in Celsius.)
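
(For completeness, a rough sketch of how this manual toggling could be automated; purely illustrative, it assumes `stress` is installed, reuses the undervolt read used elsewhere in this thread, and needs root for that read:)

    # Keep one core pegged, repeatedly add/remove a second stress worker, and
    # watch long_term_power for the nosedive from 65 toward 0.125.
    import subprocess
    import time
    from undervolt import read_power_limit, ADDRESSES

    first = subprocess.Popen(["stress", "-c", "1"])           # one core pegged
    try:
        for _ in range(10):
            second = subprocess.Popen(["stress", "-c", "1"])  # now two cores pegged
            time.sleep(20)                                    # ride the heat spike
            second.terminate()
            second.wait()
            time.sleep(10)                                    # single core spikes back up
            limits = read_power_limit(ADDRESSES)
            print("long_term_power:", limits.long_term_power)
            if limits.long_term_power < 1:                    # stuck near 0.125 W
                break
    finally:
        first.terminate()
        first.wait()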

Revision history for this message
James Gardner (jadgardner) wrote :

@koba

Here is the log for the other thermald policy on 6.2.0-26-generic:

This time the throttling occurred within a couple of hours of normal use.

Revision history for this message
James Gardner (jadgardner) wrote :

@Eli

Interestingly, I cannot cause the throttling on kernel 6.0.9 even following your method for reproducing it.

And using the laptop under sustained heavy CPU and GPU load for many hours also doesn't produce any throttling down to 400MHz for me.

I only appear to be able to reliably see the bug on kernel 6.2.0-26.

Revision history for this message
Eli (biblicabeebli) wrote :

@James well we do have slightly different models, and I can get it within minutes of normal usage after boot. ¯\_(ツ)_/¯

Are you ever getting bug 2?

Revision history for this message
James Gardner (jadgardner) wrote :

Okay, I've just had the same CPU throttling occur when using 6.0.9.

Revision history for this message
Eli (biblicabeebli) wrote :

I have an interesting update:
I went and compiled/installed this tool: https://github.com/phush0/razer-laptop-control-no-dkms

@jadgardner: you will definitely want this, boost mode is at least +15% performance.

It's a CLI tool for poking the Razer hardware bits to set the different power modes across the combination of CPU and GPU. None of these affect the power details variables or intel_pstate values. For simplicity I'm using commands below that leave the GPU untouched. (The tool also lets you control the LEDs and fan; those are irrelevant, and the fan control doesn't work for me.)

All of the below testing was done with that compiled build of thermald running in --adaptive mode. I have attached the log. I doubt a single long and fiddly run makes for a great data source; please let me know if there is a specific combination of settings you would like me to test. (It's 3 MB, I have compressed it. The exact command was `sudo ./thermald --no-daemon --adaptive --loglevel=debug`.)

(Reminder: bug 1 is the 400 MHz drop and lock, bug 2 is intel_pstate/no_turbo getting set to 1. Bug 2 is way harder to trigger.)

1) razer-cli write power ac 4 3 0
Highest "boost" performance mode.
CPU has much higher all-core and multicore speeds; cpu package temp spikes to 100 nearly instantly under even moderate load.
I cannot trigger bug 1 (or bug 2) in this mode; `stress -c 1` pegs a core at 4.8 GHz and it stays there. `stress -c 2` stays at values around ~4775 MHz with periodic drops down to ~4450 MHz, but it jumps right back up after about 1 second. long_term_power is either rock solid at 65 or very briefly drops down and then goes back up.

2) razer-cli write power ac 4 2 0
"High" performance mode.
It looks like this one sets a cpu target temperature around 90 for all-core/multicore loads; frequencies are higher than normal, and temperature does force frequencies down, at least until the fan ramps up a bit.
`stress -c 1` pegged a cpu core for less than a minute at 4.8 GHz, and then intel_pstate/max_perf_pct got set to 90, cpu frequency dropped to 400 MHz, and long_term_power dropped to 0.125 (i.e. bug 1).
Swapping back to level 3 (Boost) mode did not resolve bug.
Setting intel_pstate/max_perf_pct back to 100 does not resolve bug.
Setting long_term_power back to 65 resolves bug.

3) razer-cli write power ac 4 1 0
"Medium" power mode.
The behavior looks like a normal ~aggressive laptop performance behavior.
`stress -c 1` pegs a cpu core at 4.8 GHz, temps spike, and the fan slowly spins up. CPU speed drops down to various levels (2.8 GHz, 4.5 GHz, 4.3 GHz, 4.2 GHz), temperatures drop from the mid 90s to the mid 80s or 70s for a bit, and long_term_power drops to values in the 20-30s, but then resets back to 65 after a few seconds.
I was able to trigger long_term_power down to 0.125 once by toggling between stressing 1 vs 2 cpu cores, but it still reset up to 65 after a few seconds. Otherwise I was not able to trigger bug 1 or 2.
(All and multicore CPU speeds are pretty close to normal, looks like it targets ~75 degrees)

4) razer-cli write power ac 4 0 0
"low" power mode.
All and multicore CPU speeds are pretty close to normal, looks like it targets ~70 degrees, very close to "Medium".
This behavior looks like a lower or fairly passive power mode on ...


Revision history for this message
Eli (biblicabeebli) wrote :

previous post was too large to add an attachment? or something? here it is.
3MB text file, zipped to ~240k.

Revision history for this message
Eli (biblicabeebli) wrote :
Revision history for this message
Eli (biblicabeebli) wrote :
Revision history for this message
Eli (biblicabeebli) wrote :

I posted an issue on the repo of that razer-cli tool, maybe they can help.

(I apologize for my triple post above, but I think I know how they happened so...)

Revision history for this message
koba (kobako) wrote :

Hi, would you please help by trying a vanilla kernel, to check if the issue is still there?
https://drive.google.com/drive/folders/1AFgeX8_USkR9omba8E-D-cJsaDuhzKLW?usp=sharing

Revision history for this message
Eli (biblicabeebli) wrote :

@koba
I've tried to install the vanilla kernel you linked; I'm installing with a `dpkg -i *.deb` command in a folder containing the unpacked download of that folder.

I'm getting the following error. (I've been having issues with the nvidia 535 driver; I don't know exactly what's going on here, and currently I'm on the 525 driver. I can't tell if I can boot into this or if it failed. I'll try to find time to reboot/debug this, it's the middle of my work day.)

/etc/kernel/postinst.d/dkms:
 * dkms: running auto installation service for kernel 6.4.0-060400rc3-generic
Error! Could not locate dkms.conf file.
File: /var/lib/dkms/nvidia/535.104.05/source/dkms.conf does not exist.
Sign command: /usr/bin/kmodsign
Signing key: /var/lib/shim-signed/mok/MOK.priv
Public certificate (MOK): /var/lib/shim-signed/mok/MOK.der

Building module:
Cleaning build area...
unset ARCH; [ ! -h /usr/bin/cc ] && export CC=/usr/bin/gcc; env NV_VERBOSE=1 'make' -j16 NV_EXCLUDE_BUILD_MODULES='' KERNEL_UNAME=6.4.0-060400rc3-generic IGNORE_XEN_PRESENCE=1 IGNORE_CC_MISMATCH=1 SYSSRC=/lib/modules/6.4.0-060400rc3-generic/build LD=/usr/bin/ld.bfd CONFIG_X86_KERNEL_IBT= modules....(bad exit status: 2)
ERROR (dkms apport): kernel package linux-headers-6.4.0-060400rc3-generic is not supported
Error! Bad return status for module build on kernel: 6.4.0-060400rc3-generic (x86_64)
Consult /var/lib/dkms/nvidia/525.125.06/build/make.log for more information.
dkms autoinstall on 6.4.0-060400rc3-generic/x86_64 failed for nvidia(10)
Error! One or more modules failed to install during autoinstall.
Refer to previous errors for more information.
 * dkms: autoinstall for kernel 6.4.0-060400rc3-generic
   ...fail!
run-parts: /etc/kernel/postinst.d/dkms exited with return code 11
dpkg: error processing package linux-image-unsigned-6.4.0-060400rc3-generic (--install):
 installed linux-image-unsigned-6.4.0-060400rc3-generic package post-installation script subprocess returned error exit status 1
Errors were encountered while processing:
 linux-headers-6.4.0-060400rc3-generic
 linux-image-unsigned-6.4.0-060400rc3-generic

Revision history for this message
koba (kobako) wrote :

@Eli, are you using 23.04/22.04? I built with 22.04 configuration.

Revision history for this message
Eli (biblicabeebli) wrote :

@koba Everything here has been 23.04

Revision history for this message
koba (kobako) wrote :

@Eli, I re-built with 23.04; would you please have a try, thanks
https://drive.google.com/drive/folders/1XmxwqgiUB_vjLRWiIaSpzXc89ilSwItx?usp=sharing

Revision history for this message
koba (kobako) wrote :

@Eli,
could you run this script and upload the log? thanks
~~~
// you can find this in the repo, https://github.com/intel/thermal_daemon
thermal_daemon/test/thermal-debug-dump-ubuntu.sh
~~~

Revision history for this message
Eli (biblicabeebli) wrote :

I have some ~different behavior, but I was still able to achieve bug 1.

Initially I saw some intel_pstate/max_perf_pct getting set to 90, but it would recover quickly.

long_term_power would fluctuate from very low values all the way up to the nominal 65. Previously when triggering the bug it would slowly go mostly down toward 0.125, getting stuck between 0.125 and ~4.0, with the cpu clocked at 400 MHz. Now it will go up, seemingly reset up to 65, and allow the max frequencies.

I was able to get something like bug 1 to happen twice.
long_term_power got stuck at 19.5 (unusual, might have been 18.5 the first time)
long_term_time got stuck at 28.0 (typical, it's usually this or 32.0)

This happened when I toggled between one and two pegged cores to keep the temperature at a maximum. If I killed the stress commands at the right moment while the fan was spun up and long_term_power was in the high teens, and then waited for the fans to spin down, I would get into a situation where I could not sustain maximum clocks long enough to reach the temperatures that trigger thermald to... poke whatever it is that resets long_term_power back up to 65.

I was able to do this twice.
On my third attempt I accidentally got a long_term_power value of 12.5. I waited for the fans to spin down, started stressing again, and it pretty immediately dropped to 0.125 with cpu speeds locked to 400 MHz. I also noticed at the end that intel_pstate/max_perf_pct was at 90.

Finally, after resetting long_term_power to 65 I noticed that I still couldn't get the fan up, because intel_pstate/max_perf_pct at 90 results in a maximum core speed of 4.3 GHz, which is too low to trigger thermald.

(Likely irrelevant: even though I see there is that intel wireless driver deb package (I installed all of the packages) in that 6.2-26 kernel folder you shared with me, my intel wifi card does not work when booted into it.)

This was with the thermald build in adaptive mode; the full command was:
sudo ./thermald --no-daemon --adaptive --loglevel=debug

Revision history for this message
koba (kobako) wrote :

@Eli, may I know if it is a brand-new notebook?

In the logs, there's one entry that limits your cpu.
This comes from the BIOS through ACPI.
~~~
[1694076246][INFO]index 2: type:passive temp:90000 hyst:1000 zone id:6 sensor id:6 control_type:1 cdev size:4
~~~

The cpu temperature is 93000, which exceeds temp:90000, so thermald tries to cool the cpu,
~~~
[1694076254][DEBUG]pref 0 type 3 temp 91000 trip 90000
[1694076254][DEBUG]Passive Trip point applicable
[1694076254][DEBUG]Trip point applicable > 2:90000
[1694076254][DEBUG]cdev size for this trippoint 4
[1694076254][DEBUG]cdev at index 27:rapl_controller
[1694076254][DEBUG]>>thd_cdev_set_state temperature 90000:91000 index:27 state:1 :zone:6 trip_id:2 target_state_valid:0 target_value :0 force:0 min_state:0 max_state:0
~~~

If this is a brand-new notebook, you should contact the vendor first.
If not, we need to investigate further.
thanks
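
(Side note: since the cooling device involved here is the rapl_controller, the same long_term/short_term package limits can usually also be watched through the kernel's powercap sysfs interface rather than via MSR reads; a minimal sketch, assuming the standard intel-rapl package-domain layout:)

    import os

    DOMAIN = "/sys/class/powercap/intel-rapl:0"   # package power domain (path assumed)

    if os.path.isdir(DOMAIN):
        for idx in (0, 1):   # constraint 0 is typically long_term, 1 short_term
            with open(f"{DOMAIN}/constraint_{idx}_name") as f:
                name = f.read().strip()
            with open(f"{DOMAIN}/constraint_{idx}_power_limit_uw") as f:
                watts = int(f.read()) / 1e6       # value is reported in microwatts
            print(f"{name}: {watts} W")
    else:
        print("no intel-rapl powercap domain at", DOMAIN)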

Revision history for this message
Eli (biblicabeebli) wrote :

No, it's not new; it's a 2022 model, and the 12th gen has been replaced by 13th gen. It's definitely a software issue introduced in an update to either thermald or the kernel layer it interacts with directly.

(There is the group that makes the Linux hardware drivers for Razer products - I think I mentioned that the issues occur whether those are present or not - that we could loop in. They might understand the power level interaction here?)

It seems to me like a bug where thermald sets a low power target with whatever knobs it has, but then doesn't understand the system has recovered and it needs to reset. What data is it waiting on? Another value from the bios?

Is it possible the "all good" signal came in/is-processed out-of-order? (is this even interrupt driven or does it periodically read values?

Fwiw I would be interested in diving into thermald myself; it's just not my everyday language. (And I've not done this level of direct hardware work before.)

Revision history for this message
Eli (biblicabeebli) wrote :

Received a response on github from the maintainer, https://github.com/Razer-Linux/razer-laptop-control-no-dkms/issues/37#issuecomment-1808340928

Relevant extract of that comment is:
my opinion is uninstall thermald, as for Razer laptop it does nothing positive, same for tlp. Control on the Razer laptops is through EC, that's why [Razer Laptop Control] exist[s]. Setting various bits in kernel or using other software will just hinder performance or brake everything. On some of the models (if not all) ACPII is broken, because Razer will always target Windows

I've not upgraded to 23.10 yet, but I will do so; in general, uninstalling thermald was the fix here.

I will report back if the issue is resolved in 23.10 when I get around to updating.

Changed in thermald (Ubuntu):
status: In Progress → Confirmed
assignee: koba (kobako) → nobody
Revision history for this message
Bo Chen (bochen87) wrote :

I have the same issue with the Razer Blade 15 2022 model. When waking up from suspend, it locks the CPU at 400 MHz. Uninstalling thermald and installing the openrazer drivers solved it for me.
