Erratic behavior of CPU frequency control under load

Bug #1797802 reported by Piotr Kołaczkowski
This bug affects 4 people
Affects: thermald (Ubuntu)
Status: Won't Fix
Importance: Medium
Assigned to: Unassigned

Bug Description

I noticed that sometimes my Ubuntu 18.10 system feels sluggish. At other times everything works fine. I don't know what triggers this weird state. Suspend / resume? Maybe.

At first I thought this was maybe just a "perception issue", but when the sluggishness happened again, I fired up i7z and it looks like the CPU frequency almost never got over ~1600 MHz. But my CPU is perfectly capable of going up to 4000 MHz!

I tried to fix it by setting
echo 80 > /sys/devices/system/cpu/intel_pstate/min_perf_pct
echo 80 > /sys/devices/system/cpu/intel_pstate/max_perf_pct

I hoped I would force the CPU into almost max performance state this way. And indeed, once I ran these commands, i7z showed steady 3100+ MHz on all cores.
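
To confirm that the two writes above actually stuck, the same sysfs files can be read back. This is only a minimal sketch: the function name `read_pstate_limits` and its `base` parameter are made up for illustration, and the directory only exists when the intel_pstate driver is active.

```python
from pathlib import Path

# Directory used by the echo commands above; present only with intel_pstate.
PSTATE_DIR = Path("/sys/devices/system/cpu/intel_pstate")


def read_pstate_limits(base=PSTATE_DIR):
    """Return the min/max performance percentages for the files that exist."""
    limits = {}
    for name in ("min_perf_pct", "max_perf_pct"):
        path = Path(base) / name
        if path.is_file():
            # Each file holds a single integer percentage followed by a newline.
            limits[name] = int(path.read_text().strip())
    return limits
```

Calling `read_pstate_limits()` before and after a load would show whether something (for example a daemon) rewrites the limits behind your back.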

Then something weird happened - when I ran a mild load on the CPU, like just starting the IDE, the frequency ** dropped back to about 800-1600 MHz ** during the startup and returned to 3100 MHz after the load was gone (IDE loaded).

The CPU core temperatures shown by sensors / i7z are ok and typically at about 40-50 C when this slowdown happens, so this doesn't look like thermal throttling.

Any ideas?

ProblemType: Bug
DistroRelease: Ubuntu 18.10
Package: linux-image-4.18.0-10-generic 4.18.0-10.11
ProcVersionSignature: Ubuntu 4.18.0-10.11-generic 4.18.12
Uname: Linux 4.18.0-10-generic x86_64
ApportVersion: 2.20.10-0ubuntu13
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: pkolaczk 3281 F.... pulseaudio
 /dev/snd/pcmC0D0c: pkolaczk 3281 F...m pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Sun Oct 14 22:16:50 2018
DistributionChannelDescriptor:
 # This is a distribution channel descriptor
 # For more information see http://wiki.ubuntu.com/DistributionChannelDescriptor
 canonical-oem-somerville-xenial-amd64-20160624-2
HibernationDevice: RESUME=UUID=4836842e-0116-43c4-98b4-7a56427f81f1
InstallationDate: Installed on 2017-04-12 (549 days ago)
InstallationMedia: Ubuntu 16.04 "Xenial" - Build amd64 LIVE Binary 20160624-10:47
MachineType: Dell Inc. Precision 5520
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=pl_PL.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.18.0-10-generic root=UUID=7f76b1f0-8fef-41cc-86a2-99e554cb4d40 ro acpi_rev_override quiet splash iwlwifi.power_save=1 crashkernel=384M-:128M crashkernel=384M-:128M vt.handoff=1
RelatedPackageVersions:
 linux-restricted-modules-4.18.0-10-generic N/A
 linux-backports-modules-4.18.0-10-generic N/A
 linux-firmware 1.175
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/24/2018
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 1.11.0
dmi.board.name: 06X96V
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 10
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr1.11.0:bd07/24/2018:svnDellInc.:pnPrecision5520:pvr:rvnDellInc.:rn06X96V:rvrA00:cvnDellInc.:ct10:cvr:
dmi.product.family: Precision
dmi.product.name: Precision 5520
dmi.product.sku: 07BF
dmi.sys.vendor: Dell Inc.

Revision history for this message
Piotr Kołaczkowski (pkolaczk-u) wrote :
Cristian Aravena Romero (caravena) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.19 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.19-rc7/

Changed in linux (Ubuntu):
status: New → Incomplete
Piotr Kołaczkowski (pkolaczk-u) wrote :

BTW, when the system is in this weird state, CPU fans are running at the lowest speed (but running) and never go to their max speed, even under heavy CPU load like compiling stuff.

Is there a hidden energy saving thing that tries to keep my CPU clock low / voltages low at high load?

Piotr Kołaczkowski (pkolaczk-u) wrote :

Aaaah, I can see plenty of this in syslog:

Oct 14 22:30:59 p5520 kernel: [ 9481.033687] CPU3: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033688] CPU7: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033718] CPU1: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033719] CPU5: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033720] CPU0: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033720] CPU4: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033722] CPU6: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.033722] CPU2: Package temperature above threshold, cpu clock throttled (total events = 5845)
Oct 14 22:30:59 p5520 kernel: [ 9481.034709] CPU3: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034710] CPU0: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034711] CPU4: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034711] CPU7: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034738] CPU2: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034738] CPU6: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034739] CPU1: Package temperature/speed normal
Oct 14 22:30:59 p5520 kernel: [ 9481.034740] CPU5: Package temperature/speed normal

So maybe it THINKS it overheats (although it does NOT) and throttles down...
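
For what it's worth, in the excerpt above the "temperature/speed normal" lines follow the throttle lines by about 1 ms (9481.0337 vs 9481.0347). To see how often this happens over a longer period, the messages can be tallied with a small script. This is only a sketch, assuming the message format stays exactly as in the excerpt; `summarize_throttling` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the kernel message format shown in the syslog excerpt above.
THROTTLE_RE = re.compile(
    r"CPU(\d+): Package temperature above threshold, "
    r"cpu clock throttled \(total events = (\d+)\)"
)


def summarize_throttling(lines):
    """Return (throttle messages per CPU, highest 'total events' value seen)."""
    per_cpu = Counter()
    max_total = 0
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            per_cpu[int(match.group(1))] += 1
            max_total = max(max_total, int(match.group(2)))
    return per_cpu, max_total
```

It can be fed an open log file, e.g. `summarize_throttling(open("/var/log/syslog", errors="replace"))`, assuming the kernel messages end up in that file.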

Piotr Kołaczkowski (pkolaczk-u) wrote :

> Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

It started happening after upgrading from 18.04 to 18.10.

> Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.19 kernel[0].

I can do that, but so far I don't know a reliable way to trigger the system into that state.
For example, I have just done a fresh boot and performance looks ok, and there are no "package temperature above threshold" messages in syslog so far.

Piotr Kołaczkowski (pkolaczk-u) wrote :

Update: I got the "package temperature above threshold" again (still using 4.18.0-10 kernel), but it did not cause visible performance loss and i7z still shows the CPU freq can go up to ~3.8 GHz easily. So maybe these messages are not related.

Piotr Kołaczkowski (pkolaczk-u) wrote :

Is it normal that C-states percentages reported by i7z don't sum up to 100%, but exceed it?
Under load, I frequently get C0% close to 99%, but at the same time Halt is at about 40-60%.
Also, when C0 is >90, the frequency of the cores goes DOWN. This looks completely backwards to me.

Piotr Kołaczkowski (pkolaczk-u) wrote :

The problem also exists in 4.19-rc7, which I just tried.

So overall, I'm not sure now whether this is a "different state" of the system or just its normal behavior.

Fresh after boot, the CPU is willing to use frequencies above 3 GHz. The same happens when I launch my IDE (IntelliJ IDEA) for the first time. Then, when I hit "rebuild project" to compile all the stuff, the frequency goes up to 3+ GHz for a short while, and after just 2-4 seconds it drops back to about 1.3-1.6 GHz and stays at this level for the whole build (several minutes). The temperatures reported by i7z do not go above 54 C on any of the cores, and VCore is kept around 0.69 - 0.71.

What does not look right to me are the very high numbers reported in the Halt (C1) column at the same time as the C0 column reports numbers > 90. Typically the C1 numbers are over 50, and some hit 90+.
This doesn't make much sense to me, because I thought C1 and C0 were distinct (either/or, the core cannot be in both at the same time).

When the system is loaded by background compilation of my project, the CPU is not willing to use higher frequencies even when I add more load to it temporarily.

I compiled a really simple program that runs an infinite loop:

int main() {
  for (;;);
  return 0;
}

Now running this program on an idle system makes the CPU frequency go to about 2.8 GHz. C0 states are reported at less than 5, C1 states for all cores are > 95 and temperatures are < 53.

Interestingly, running this same program while the system is compiling in the background does not increase the frequencies above 1.6 GHz. If it is loaded, it is unwilling to go faster. :D

So to summarize, the CPU is capable of using higher frequencies for a very short span of time, but then the performance quickly degrades under load and stays there.

Piotr Kołaczkowski (pkolaczk-u) wrote :

Huh, another surprise. After my infinite loop had been running for as long as it took me to write the previous post, it suddenly "magically" fixed itself and now one core reports 100% C0 and ~6% C1, temperature 57, voltage 0.94. The frequency did not go above 2.8 GHz, though.

Piotr Kołaczkowski (pkolaczk-u) wrote :

And as soon as I hit "rebuild project" in the IDE, the frequencies go immediately down from 2.8 GHz to about 1.2-1.5 GHz now.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
status: Incomplete → Triaged
Piotr Kołaczkowski (pkolaczk-u) wrote :

More observations:

The problem does not happen if I disable Intel speed-step in BIOS.
In this case, the CPU frequency of all cores goes to 2998 MHz whenever under load, and stays stable regardless of how long the load is applied.

The problem still happens even if intel_pstate=disable is passed to the kernel on boot.
I tried forcing the CPU to its max frequency by using the userspace governor, and it worked only while the system was idle (freq ~3.9 GHz). As soon as multi-core load was applied, the CPU frequency dropped to ~1.6 GHz and went back to ~3.9 GHz only when the load was over. I observed a similar pattern when trying to force the CPU to only 2.8 GHz - it still slowed down to 1.6 GHz under load, even though temperature should not be a problem in this case.
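
To compare what the kernel reports with what i7z shows, the governor and current frequency can be read per core from the cpufreq sysfs files. A minimal sketch, assuming the standard `scaling_governor` and `scaling_cur_freq` files (the latter in kHz) are exposed; `cpufreq_snapshot` is a hypothetical name, and the value reported by the kernel may not match what i7z measures.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def cpufreq_snapshot(root=CPU_ROOT):
    """Map 'cpuN' to (governor name, current frequency in MHz)."""
    snapshot = {}
    for freq_dir in sorted(Path(root).glob("cpu[0-9]*/cpufreq")):
        governor = (freq_dir / "scaling_governor").read_text().strip()
        # scaling_cur_freq is reported in kHz.
        khz = int((freq_dir / "scaling_cur_freq").read_text())
        snapshot[freq_dir.parent.name] = (governor, khz / 1000)
    return snapshot
```

Taking one snapshot while idle and another a few seconds into a multi-core load should show the same drop described above, if the kernel sees it at all.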

Piotr Kołaczkowski (pkolaczk-u) wrote :

I collected some perf data for further analysis.

I ran a simple arithmetic multi-core benchmark.
The benchmark is 100% CPU bound - it adds 2 integers.

perf record java -jar target/benchmarks.jar org.ttnr.pmato.e2.cpumem.ArithmeticsBenchmark.add -wi 0 -i 40 -t 4 -f 1
# JMH version: 1.21
# VM version: JDK 1.8.0_181, Java HotSpot(TM) 64-Bit Server VM, 25.181-b13
# VM invoker: /opt/jdk1.8.0_181/jre/bin/java
# VM options: <none>
# Warmup: <none>
# Measurement: 40 iterations, 1 s each
# Timeout: 10 min per iteration
# Threads: 4 threads, will synchronize iterations
# Benchmark mode: Average time, time/op
# Benchmark: org.ttnr.pmato.e2.cpumem.ArithmeticsBenchmark.add

# Run progress: 0,00% complete, ETA 00:00:40
# Fork: 1 of 1
Iteration 1: 2,246 ±(99.9%) 0,648 ns/op // <------- 3.5+ GHz when started
Iteration 2: 2,181 ±(99.9%) 0,196 ns/op
Iteration 3: 2,262 ±(99.9%) 2,673 ns/op
Iteration 4: 2,296 ±(99.9%) 2,751 ns/op
Iteration 5: 3,004 ±(99.9%) 3,518 ns/op
Iteration 6: 5,468 ±(99.9%) 0,159 ns/op // <------- sudden performance drop
Iteration 7: 6,372 ±(99.9%) 0,620 ns/op // <------- now 1.3-1.5 GHz
Iteration 8: 6,389 ±(99.9%) 4,850 ns/op
Iteration 9: 5,363 ±(99.9%) 0,223 ns/op
Iteration 10: 5,174 ±(99.9%) 0,584 ns/op
Iteration 11: 5,093 ±(99.9%) 0,414 ns/op
Iteration 12: 5,069 ±(99.9%) 0,127 ns/op
Iteration 13: 5,070 ±(99.9%) 0,559 ns/op
Iteration 14: 4,927 ±(99.9%) 0,080 ns/op
Iteration 15: 5,045 ±(99.9%) 0,033 ns/op
Iteration 16: 5,052 ±(99.9%) 0,162 ns/op
Iteration 17: 4,964 ±(99.9%) 0,063 ns/op
Iteration 18: 4,979 ±(99.9%) 0,058 ns/op
Iteration 19: 4,992 ±(99.9%) 0,147 ns/op
Iteration 20: 4,955 ±(99.9%) 0,083 ns/op
Iteration 21: 5,061 ±(99.9%) 0,462 ns/op
Iteration 22: 5,004 ±(99.9%) 0,264 ns/op
Iteration 23: 4,966 ±(99.9%) 0,207 ns/op
Iteration 24: 4,950 ±(99.9%) 0,125 ns/op
Iteration 25: 4,925 ±(99.9%) 0,553 ns/op
Iteration 26: 4,961 ±(99.9%) 0,138 ns/op
Iteration 27: 4,921 ±(99.9%) 0,188 ns/op
Iteration 28: 4,980 ±(99.9%) 0,372 ns/op
Iteration 29: 4,899 ±(99.9%) 0,119 ns/op
Iteration 30: 4,884 ±(99.9%) 0,314 ns/op
Iteration 31: 4,878 ±(99.9%) 0,194 ns/op
Iteration 32: 4,962 ±(99.9%) 0,997 ns/op
Iteration 33: 4,958 ±(99.9%) 0,280 ns/op
Iteration 34: 4,889 ±(99.9%) 0,162 ns/op
Iteration 35: 5,018 ±(99.9%) 0,201 ns/op
Iteration 36: 5,002 ±(99.9%) 0,229 ns/op
Iteration 37: 4,927 ±(99.9%) 0,088 ns/op
Iteration 38: 4,935 ±(99.9%) 0,114 ns/op
Iteration 39: 4,976 ±(99.9%) 0,284 ns/op
Iteration 40: 4,925 ±(99.9%) 0,128 ns/op

Result "org.ttnr.pmato.e2.cpumem.ArithmeticsBenchmark.add":
  4,748 ±(99.9%) 0,541 ns/op [Average]
  (min, avg, max) = (2,181, 4,748, 6,389), stdev = 0,962
  CI (99.9%): [4,207, 5,289] (assumes normal distribution)
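
The size of the drop can be put into a single number by comparing the first few iterations with the steady state at the end. This is only a sketch: the helper names and the choice of "first 4" versus "last 30" iterations are assumptions made for illustration, and the parser accepts the decimal comma that JMH prints under the pl_PL locale.

```python
import re
from statistics import mean

# "Iteration 6: 5,468 ±(99.9%) 0,159 ns/op" -> iteration number and score.
ITER_RE = re.compile(r"Iteration\s+(\d+):\s+(\d+[.,]\d+)")


def parse_iterations(text):
    """Return the per-iteration scores in ns/op, in order of appearance."""
    return [float(m.group(2).replace(",", ".")) for m in ITER_RE.finditer(text)]


def slowdown(scores, head=4, tail=30):
    """Ratio of the mean of the last `tail` scores to the mean of the first `head`."""
    return mean(scores[-tail:]) / mean(scores[:head])
```

Applied to the log above, the ratio should come out a little over 2x, which is roughly in line with the frequency falling from 3.5+ GHz to 1.3-1.5 GHz.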

Piotr Kołaczkowski (pkolaczk-u) wrote :

Perf data collected for the run above.

Piotr Kołaczkowski (pkolaczk-u) wrote :

Temperatures:
Before the performance drop: up to 97 C
After performance drop: 52-55 C
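
To log temperatures alongside the benchmark instead of reading them off a tool by eye, the kernel's thermal zones can be polled. A minimal sketch, assuming the usual `/sys/class/thermal/thermal_zone*/type` and `temp` files (temperature in millidegrees C); `read_thermal_zones` is a hypothetical name, and which zones exist depends on the machine.

```python
from pathlib import Path

THERMAL_ROOT = Path("/sys/class/thermal")


def read_thermal_zones(root=THERMAL_ROOT):
    """Return a list of (zone type, temperature in degrees C)."""
    readings = []
    for zone in sorted(Path(root).glob("thermal_zone*")):
        try:
            zone_type = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text())
        except (OSError, ValueError):
            continue  # skip zones that cannot be read or report no number
        readings.append((zone_type, millideg / 1000))
    return readings
```

Polling this once per second during the run would show how close the package gets to its limit just before the drop.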

summary: - Erratic behavior of intel pstate CPU frequency control
+ Erratic behavior of CPU frequency control under load
Piotr Kołaczkowski (pkolaczk-u) wrote :

Disabling thermald with:

sudo systemctl stop thermald

right after boot, before applying any load, seems to help in getting stable performance.

A thermald bug?

Piotr Kołaczkowski (pkolaczk-u) wrote :

Can you reassign this bug to thermald package?
Or should I create a new one?
Thanks

dimahetman (dimahetman) wrote :

This is of course ridiculous, but I have the inverse problem: I have a big problem with thermal throttling (see topic #1598394).

affects: linux (Ubuntu) → thermald (Ubuntu)
Srinivas Pandruvada (srinivas-pandruvada) wrote :

Try running thermald in a terminal window from the command line:
sudo systemctl stop thermald
sudo thermald --no-daemon --loglevel=info

Then do what triggers this, and attach the output of the above command.

Srinivas Pandruvada (srinivas-pandruvada) wrote :

dimahetman (dimahetman),
please also try as described in #18.

Colin Ian King (colin-king) wrote :

There has been no feedback to comments #18 and #19 for 2 years. Closing this bug. If it still is an issue, please reopen the bug report.

Changed in thermald (Ubuntu):
status: Triaged → Won't Fix