thermald breaks frequency scaling in Xeon® E5-2687W v3 & E5-1650 v3

Bug #1480349 reported by Shahar Or on 2015-07-31
This bug affects 5 people
Affects             Importance   Assigned to
thermald (Ubuntu)   High         Colin Ian King
Trusty              High         Colin Ian King
Vivid               High         Colin Ian King
Wily                High         Colin Ian King
Xenial              High         Colin Ian King

Bug Description

SRU Justification Wily, Vivid, Trusty

CPU scaling on a class of Intel CPUs is not functioning correctly, causing the CPU to be throttled back to the lowest CPU frequency.

[FIX]
Upstream cherry picks, as recommended by Intel
f4e316ef4d8d8c9a558ef5bfa74e25303c46a985 ("Add white list of the cpu ids")
18d1574230c6b9b4e8876c0b6739c074a24205e6 ("Move parser init to thd_engine")
6749427098434ccad81fa8c5f2a3e102fc1644f7 ("Remove wild card for loading")
ba4fe1e7bb77d09530544cda860fba559603ec83 ("Error recovery when sysfs attrib read fails")

Plus 4 changes that allow cleaner, simpler patching of the above 4 fixes, to reduce the risk of breaking thermald with a complex backport:

Remove trailing ':' from THD engine failure message
Remove !! from "No coretemp sysfs found"
Add new option for config file
Support target state

Essentially we now have a white list of valid CPUs to run thermald on, so we can exclude the issues on a wider class of CPUs.

[TEST CASE]
With the buggy thermald, CPU is pegged at the lowest CPU frequency. With the fixed thermald, CPU scaling now works.

[REGRESSION POTENTIAL]
We are allowing thermald now to run on a strict set of CPUs, so we are hoping that the whitelist covers the class that we can legitimately run thermald against.

--------------------------------------------------------------------------

When I boot with this installed frequency scaling is no longer behaving as expected.

It seems that with this installed, the OS loses control of the scaling.

It seems that scaling is performed, though. And very much to the purpose of power saving.

Performance is terrible because of this. And I do mean terrible.

The default governor is powersave.

Setting the governor to performance (yes, on all cores) doesn't seem to change the scaling behavior. The frequency doesn't pass 800MHz.

I would like to use Intel's microcode updates, but I have to have my CPU running at the speed for which it costs so damn much.

Any suggestions?

ProblemType: Bug
DistroRelease: Ubuntu 15.10
Package: intel-microcode 3.20150121.1
ProcVersionSignature: Ubuntu 4.1.0-2.2-generic 4.1.3
Uname: Linux 4.1.0-2-generic x86_64
ApportVersion: 2.18-0ubuntu5
Architecture: amd64
CurrentDesktop: XFCE
Date: Fri Jul 31 17:56:39 2015
InstallationDate: Installed on 2010-10-12 (1753 days ago)
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
SourcePackage: intel-microcode
UpgradeStatus: Upgraded to wily on 2014-11-11 (262 days ago)

Shahar Or (mightyiam) on 2015-07-31
description: updated
Colin Ian King (colin-king) wrote :

I wonder if it is something like thermald being over-zealous.

Can you try:

 sudo systemctl stop thermald

and see if that changes the behaviour?

Shahar Or (mightyiam) wrote :

@colin-king, thank you for giving this attention.

`$ sudo systemctl stop thermald` doesn't seem to change it.

Shahar Or (mightyiam) wrote :

It is also probably important to add:

With the intel-microcode package installed the cpu scales to much, much lower than without.

Without is minimum 1.2GHz and with can be as low as around 400MHz.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in intel-microcode (Ubuntu):
status: New → Confirmed
Julien Barnier (julien-nozav) wrote :

I can confirm the exact same thing here.

Kubuntu 15.04
intel-microcode 3.20150121.1
CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz

When intel-microcode is installed the CPU frequencies are stuck between 400 and 500 MHz; they scale only slightly between these two values. No special messages in dmesg as far as I can tell.

Changed in intel-microcode (Ubuntu):
status: Confirmed → In Progress
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
Shahar Or (mightyiam) wrote :

Thank you for looking into this, Colin.

If I have your attention, I'd like to know whether you have a plan for #911206, if you don't mind sharing there.

Shahar Or (mightyiam) wrote :

Bug #911206, that is.

Len Brown (len-brown) wrote :

please show the output of this command on the failing config:

cat /proc/cpuinfo |grep microcode

If you can disable the update and show the output on the working config,
and also the BIOS version, that would also be helpful.

Also, do you see the same behaviour when you are running intel_pstate
as when you are running acpi-cpufreq+ondemand?

Shahar Or (mightyiam) wrote :

Working (intel-microcode not installed):
➜ ~ git:(master) ✗ cat /proc/cpuinfo| grep microcode
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
microcode : 0x27
➜ ~ git:(master) ✗ sudo dmidecode -s bios-version
[sudo] password for shahar:
1.0a
➜ ~ git:(master) ✗

Regarding intel-pstate vs. acpi-cpufreq+ondemand, I'm guessing these are two different possible configurations.
I'm not sure which I have currently.

Both `lsmod | grep intel_pstate` and `lsmod | grep acpi_cpufreq` do not have any output.

I'll now install intel-microcode, reboot and provide the same information.

Shahar Or (mightyiam) wrote :

With intel-microcode installed (issue present):

➜ ~ git:(master) ✗ cat /proc/cpuinfo| grep microcode
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
microcode : 0x29
➜ ~ git:(master) ✗ lsmod | grep acpi_cpufreq
➜ ~ git:(master) ✗ lsmod | grep intel_pstate
➜ ~ git:(master) ✗ sudo dmidecode -s bios-version
1.0a

Hi,
Can you do this:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver

And paste the driver which is being used.

Shahar Or (mightyiam) wrote :

Without intel-microcode, I have:

➜ ~ git:(master) ✗ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
intel_pstate

Shahar Or (mightyiam) wrote :

With intel-microcode booted into, I have the same scaling driver.

Would you mind doing an experiment and using turbostat --debug to collect some info with both the old microcode and the new microcode?

use a recent version of turbostat :
turbostat --debug

capture output and paste here.

Thanks,
Kristen

Shahar Or (mightyiam) wrote :

@kristen-c-accardi, here are the two turbostat --debug outputs, attached.

Thanks. Your MSR_CORE_PERF_LIMIT_REASONS seems to indicate that the pcode is clipping you after the microcode update. But just to be extra sure it's not a software problem, can you please update the microcode and do this:

# turbostat --msr 0x199

And paste the output.

One last paranoid check is to install the microcode directly from the intel site, just to make sure there's nothing wrong with the packaging.

https://downloadcenter.intel.com/download/24661/Linux-Processor-Microcode-Data-File

Shahar Or (mightyiam) wrote :

@kristen-c-accardi,

With updated microcode:

➜ ~ git:(master) ✗ sudo turbostat --msr 0x199
[sudo] password for shahar:
     CPU Avg_MHz %Busy Bzy_MHz TSC_MHz MSR 0x199
       - 16 3.77 427 3096 0x00000000
       0 30 6.99 426 3101 0x00000c00
      10 1 0.18 535 3101 0x00000c00
       1 55 13.09 422 3101 0x00000c00
      11 3 0.80 432 3101 0x00000c00
       2 35 8.12 425 3101 0x00000c00
      12 1 0.24 502 3101 0x00000c00
       3 24 5.69 426 3100 0x00000c00
      13 0 0.06 629 3100 0x00000c00
       4 28 6.58 424 3094 0x00000c00
      14 1 0.17 513 3093 0x00000c00
       5 30 7.14 426 3093 0x00000c00
      15 1 0.22 501 3093 0x00000c00
       6 22 5.10 427 3093 0x00000c00
      16 3 0.73 451 3093 0x00000c00
       7 30 7.14 423 3093 0x00000c00
      17 3 0.75 431 3093 0x00000c00
       8 27 6.36 423 3093 0x00000c00
      18 1 0.13 531 3093 0x00000c00
       9 24 5.53 428 3093 0x00000c00
      19 2 0.41 473 3093 0x00000c00
     CPU Avg_MHz %Busy Bzy_MHz TSC_MHz MSR 0x199
       - 14 3.38 428 3093 0x00000000
       0 26 6.04 428 3093 0x00000c00
      10 1 0.10 689 3093 0x00000c00
       1 32 7.60 423 3093 0x00000c00
      11 1 0.12 589 3093 0x00000c00
       2 25 5.89 427 3093 0x00000c00
      12 1 0.09 651 3093 0x00000c00
       3 24 5.58 426 3093 0x00000c00
      13 10 2.53 410 3093 0x00000c00
       4 26 6.03 426 3093 0x00000c00
      14 1 0.12 596 3093 0x00000c00
       5 19 4.42 434 3093 0x00000c00
      15 1 0.12 573 3093 0x00000c00
       6 30 6.95 425 3093 0x00000c00
      16 3 0.55 471 3093 0x00000c00
       7 28 6.61 425 3093 0x00000c00
      17 1 0.12 567 3093 0x00000c00
       8 31 7.39 426 3093 0x00000c00
      18 7 1.69 416 3093 0x00000c00
       9 22 5.16 425 3093 0x00000c00
      19 2 0.43 465 3093 0x00000c00
     CPU Avg_MHz %Busy Bzy_MHz TSC_MHz MSR 0x199
       - 16 3.70 426 3093 0x00000000
       0 53 12.48 422 3093 0x00000c00
      10 1 0.08 757 3093 0x00000c00
       1 20 4.75 428 3093 0x00000c00
      11 1 0.08 610 3093 0x00000c00
       2 45 10.79 421 3093 0x00000c00
      12 0 0.08 611 3093 0x00000c00
       3 26 6.21 426 3093 0x00000c00
      13 2 0.53 440 3093 0x00000c00
       4 21 4.88 425 3093 0x00000c00
      14 1 0.11 534 3093 0x00000c00
       5 15 3.49 434 3093 0x00000c00
      15 1 0.17 513 3093 0x00000c00
    ...


Shahar Or (mightyiam) wrote :

Regarding the last paranoid check of installing the microcode directly from Intel's site,

I read the instructions and I would like to ask whether I can skip that and just make sure by md5sum.

Seems reasonable enough.

The driver is requesting 1.2 GHz, which confirms that this is not related to the OS at all.
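
For readers following along, the 1.2 GHz figure can be read from the MSR 0x199 column of the turbostat output above. What follows is only a rough sketch: it assumes the requested ratio sits in bits 15:8 of the register and that the bus clock is 100 MHz. The function names msr_ratio and ratio_to_mhz are made up for this illustration.

```shell
#!/bin/sh
# msr_ratio VALUE: extract bits 15:8 from an MSR 0x199 value.
# ASSUMPTION: bits 15:8 hold the requested P-state ratio.
msr_ratio() {
    echo $(( ($1 >> 8) & 0xff ))
}

# ratio_to_mhz RATIO: convert a ratio to MHz.
# ASSUMPTION: the bus clock is 100 MHz.
ratio_to_mhz() {
    echo $(( $1 * 100 ))
}

value=0x00000c00    # per-CPU value copied from the turbostat output above
ratio=$(msr_ratio "$value")
echo "ratio=$ratio requested=$(ratio_to_mhz "$ratio") MHz"
```

Under those assumptions 0x00000c00 decodes to ratio 12, which is the 1.2 GHz request, while the Bzy_MHz column shows the cores actually running near 430 MHz.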

Shahar Or (mightyiam) wrote :

I couldn't figure out which file to compare with. Perhaps the microcode.dat that is included in the intel-microcode is split up or something?

So, does it look like the microcode update itself introduces an issue?

Colin Ian King (colin-king) wrote :

@Kristen, I've tried to reproduce this with a couple of Xeons that I have access to (E3-xxxx v3) and have not been able to reproduce it yet. Any luck with tracking down any reproducers in your labs?

Shahar Or (mightyiam) on 2015-09-03
summary: - Breaks frequency scaling in Xeon® E5-2687W v3
+ Breaks frequency scaling in Xeon® E5-2687W v3 & E5-1650 v3

@Colin - my understanding is that we have not been able to reproduce on any of our machines either.

summary: - Breaks frequency scaling in Xeon® E5-2687W v3 & E5-1650 v3
+ Intel Microcode Breaks frequency scaling in Xeon® E5-2687W v3 & E5-1650
+ v3

intel-microcode (3.20151106.1) was recently sync'd into Ubuntu Xenial, see: http://packages.ubuntu.com/xenial/intel-microcode

It may be worth checking to see if this resolves the issues for you.

Changed in intel-microcode (Ubuntu):
status: In Progress → Incomplete
Shahar Or (mightyiam) wrote :

Thank you for letting me know of the update, Colin. I've upgraded to Xenial and installed intel-microcode (3.20151106.1) and seemingly no change on this issue.

Philipp Kern (pkern) wrote :

So for E5-1650 v3 specifically:

0x2b works (microcode in trusty). 0x36 works with 3.13, but with neither 3.19 nor 4.3. (Pointing at intel_pstate, I think.)

Changed in intel-microcode (Ubuntu):
status: Incomplete → Confirmed
Doug Smythies (dsmythies) wrote :

As Kristen mentions in post 18, the CORE_PERF_LIMIT_REASONS MSR is saying that the frequency is reduced below the operating system request due to PBM (Power Budget Management) limit. It seems odd that in such a reduced power consumption state the bit is still asserted.

Note that whenever the actual clock is below what has been asked for, the intel_pstate driver will drive down and ask for the lowest pstate, regardless of load. This is fundamental to the current control algorithm. The acpi-cpufreq frequency scaling driver doesn't have this problem, conceivably somewhat masking this issue. (not sure in this case. I.E. I am not sure if the frequency is locked at 37.5% of the minimum pstate regardless, or 37.5% of requested pstate. The performance mode test mentioned in the description would tend to indicate the former. One way to test is to force the use of the acpi-cpufreq driver by disabling the intel_pstate driver).
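
One way to confirm that a boot really has the intel_pstate driver disabled is to check the kernel command line for the intel_pstate=disable parameter. The helper below is a small sketch; the function name pstate_disabled and the sample command line are hypothetical and used only for illustration.

```shell
#!/bin/sh
# pstate_disabled CMDLINE: succeed if CMDLINE contains intel_pstate=disable
# as a separate word.
pstate_disabled() {
    case " $1 " in
        *" intel_pstate=disable "*) return 0 ;;
        *) return 1 ;;
    esac
}

# HYPOTHETICAL sample command line, for illustration only.
sample="ro quiet splash intel_pstate=disable"
if pstate_disabled "$sample"; then
    echo "intel_pstate is disabled on this command line"
else
    echo "intel_pstate is not disabled on this command line"
fi
```

On a live system the argument would be the contents of /proc/cmdline, and /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver should then report a driver other than intel_pstate.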

Myself, I would keep track of both what is being asked for and what the processor is actually giving. Example (on an older i7):

What is being asked for (the system is idle and min freq is 1.6 GHz):
# rdmsr --bitfield 15:8 -d -a 0x199
16
16
16
16
16
16
16
16

What the processor is actually giving (pstate 24 What? (I created a known issue for dramatic effect)):
# rdmsr --bitfield 15:8 -d -a 0x198
24
24
24
24
24
24
24
24
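
The comparison between the two registers can be sketched as a small helper. This is only an illustration: the function name classify_ratio and its labels are made up here, and the inputs are assumed to be the decimal ratios printed by the two rdmsr commands above, one pair per CPU.

```shell
#!/bin/sh
# classify_ratio REQUESTED DELIVERED
# REQUESTED is the ratio read from MSR 0x199 (bits 15:8) and DELIVERED is
# the ratio read from MSR 0x198 (bits 15:8), both for the same CPU.
classify_ratio() {
    if [ "$2" -lt "$1" ]; then
        echo "clipped"          # processor gives less than was asked for
    elif [ "$2" -gt "$1" ]; then
        echo "above-request"    # processor gives more than was asked for
    else
        echo "match"
    fi
}

# Sample values 16 and 24 are taken from the example above.
echo "requested 16, delivered 24: $(classify_ratio 16 24)"
```

A "clipped" result is the case where, as described above, the intel_pstate driver would keep driving its request down.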

Philipp Kern (pkern) wrote :

I could try briefly next week, but from what I recall from my December attempts setting intel_pstate=0 on the kernel's cmdline did *not* help. I saw a different frequency in cpuinfo (1.2 GHz), but the machine was still incredibly slow.

Doug Smythies (dsmythies) wrote :

When using the acpi-cpufreq driver, "/proc/cpuinfo" shows the frequency that is being asked for, not the frequency one actually gets.
One has to use turbostat to know for sure.

For whatever reason, and as mentioned in my previous post, the posted turbostat output is saying that excessive power has caused the slow down. Both the CORE_PERF_LIMIT_REASONS and the IA32_PACKAGE_THERM_STATUS and all of the IA32_THERM_STATUS MSRs are saying this.

Colin Ian King (colin-king) wrote :

For users with this problem on E5-1650 v3, Intel requested the model:stepping info from /proc/cpuinfo, so please paste the /proc/cpuinfo into the bug report. Thanks.

Philipp Kern (pkern) wrote :

processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
stepping : 2
microcode : 0x2b
cpu MHz : 1233.339
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 5
cpu cores : 6
apicid : 11
initial apicid : 11
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc
bugs :
bogomips : 6983.44
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:

Doug Smythies (dsmythies) wrote :

@Colin: Gabor (from the duplicate) and Shahar (from the turbostat postings) both have: family : 6 ; model : 63 ; stepping : 2 also.

XiongZhang (xiong-y-zhang) wrote :

@Philipp & @Doug:
I couldn't reproduce your issue on my E5-2697 v3 platform with the 0x2B and 0x36 microcode in Ubuntu 16.04. The E5-2697 v3 has the same F/M/S as the affected processors, the E5-1650 v3 and E5-2687W v3.

Could you tell me your conclusion: is it a microcode regression or an upstream kernel regression? From comment #27 it seems to be a kernel regression, but the title and some comments point to a microcode regression.
Could you reproduce it with 16.04?

Philipp Kern (pkern) wrote :

0x2b is known good for me, 0x36 known bad.

Doug Smythies (dsmythies) wrote :

@Philipp: The way I read them, your comment #35 and your comment #27 contradict each other.

@Xiong: The way I read all of this stuff, it has not been proven that the issue is in the microcode itself. However, the work done by Shahar in earlier postings, in my opinion, narrows it down to either the microcode itself or some loading issue when it is updated during boot.

I had an idea to use "iucode_tool" to further isolate the issue, by going back and forth between microcodes, but now I see that one can not downgrade the microcode on the fly, it only works for upgrading. So now I am thinking maybe it would be possible to make the same microcode with a newer version number to "trick" the system into loading it. I.E. if upgrading to the fake "newer" microcode during boot also caused issues, then the root issue would be the load procedure.

By the way, I don't have a Xeon processor, I'm just trying to help with this bug report.

From all that I have been reading today, microcode 29 should be O.K., but wasn't for Shahar.
See also: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=776431

On Thu, Jan 21, 2016, at 03:18, Doug Smythies wrote:
> @Philipp: The way I read them, your comment #35 and your comment #27
> contradict each other.
>
> @Xiong: The way I read all of this stuff, it has not been proven that
> the issue is in the microcode itself. However, the work done by Sharar
> in earlier postings, in my opinion, narrows it down to either the
> microcode itself or some loading issue when it is updated during boot.

To me, so far it looks like the issue is caused by a three-way
interaction: microcode - kernel - platform/BIOS.

There are several users of Xeon E5v3 with up-to-date microcode (0x36)
that do not observe the issue. Since not every box with the same
processor, microcode and kernel (but not the same *mainboard*) suffers
the issue, there's a platform component that is required to trigger the
issue. Maybe the firmware is not doing everything it should, or the
Linux driver is not doing everything it should: the microcode update
might not be the root cause.

> I had an idea to use "iucode_tool" to further isolate the issue, by
> going back and forth between microcodes, but now I see that one can not
> downgrade the microcode on the fly, it only works for upgrading. So now

That's untested, unspecified, unsupported territory. Avoid it unless
someone @intel that knows better tells you to do it: just because a
microcode downgrade looks like it worked fine doesn't mean it did, after
all... Especially since the issue in this bug report might well be
related to insufficient new-state sanitization after a microcode update,
so downgrading the microcode might invalidate the testing entirely...

It is much better to start from the microcode that works (in the BIOS),
and test if you can still reproduce the issue with normal microcode
updates (i.e. no downgrading).

> I am thinking maybe it would be possible to make the same microcode with
> a newer version number to "trick" the system into loading it. I.E. if
> upgrading to the fake "newer" microcode during boot also caused issues,
> then the root issue would be the load procedure.

Yes, it is possible. That's the main component of an Intel microcode
downgrade attack, which works pretty much everywhere (and not just in
Linux).

--
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique de Moraes Holschuh <email address hidden>

Although it was tried before:
"
Shahar Or (mightyiam) wrote on 2015-07-31:
`$ sudo systemctl stop thermald` doesn't seem to change it.

"
Can we try deleting the service, to rule out any throttling that might have occurred during boot?
This is a Xeon server; it doesn't need thermald.

Shahar Or (mightyiam) wrote :

Thanks for all of the churn on this. Please tell me if I can help in any way.

Keve Gabbert (keve-a-gabbert) wrote :

FYI - Intel is trying to reproduce and debug this issue in the US.

Shahar Or (mightyiam) wrote :

@Intel, my affected system is up for sale if you want it.

Hi Shahar,
Can you quickly try the steps suggested in #38? You can do "sudo systemctl disable thermald" and then reboot.

Shahar Or (mightyiam) wrote :

@srinivas-pandruvada

If I understand you correctly, the steps are:

1. Install the intel-microcode package
2. `sudo systemctl disable thermald`
3. Reboot
4. See whether frequency scaling is broken

Is this correct?

@Shahar
I thought you already had a system with the upgraded microcode. If it is not too much trouble, please try the steps you outlined in #43 (previous comment).
One thing is still not clear to me:
- For the same kernel version and Ubuntu 15.10, if you upgrade only the microcode (nothing else changed), does it break frequency scaling?

Doug Smythies (dsmythies) wrote :

>> However, the work done by Shahar
>> in earlier postings, in my opinion, narrows it down to either the
>> microcode itself or some loading issue when it is updated during boot.

> Henrique de Moraes Holschuh wrote:
> To me, so far it looks like the issue is caused by a three-way
> interaction: microcode - kernel - platform/BIOS.

I agree, and was not clear. I meant everything related to the actual upgrade. Issues with re-initializing after suspend, and maybe in this case, after microcode upgrade are why I have started to follow this bug report.
And along those lines, and further to my post #28 and #30, I have been wondering if the CPU frequencies could be unstuck by clearing all the logged bits. Perhaps they got set inadvertently during the microcode upgrade, and not cleared as part of post upgrade re-initialization. Even though that wouldn't explain the currently active bits, it might be worth a try (as su).

wrmsr 0x690 0x00
wrmsr 0x6B0 0x00
wrmsr 0x6B1 0x00
wrmsr 0x1B1 0x00
wrmsr -a 0x19c 0x00
wrmsr -a 0x19a 0x00

I may have forgotten one or more registers. The msr-tools package is required and "modprobe msr" (as su) is needed before those commands.

We should look for any platform/BIOS commonality between the sufferers of this issue (if the current Srinivas test doesn't work, that is).

Philipp Kern (pkern) wrote :

FWIW, regarding platform commonality: We saw this after pushing the microcode update on two HP z440 workstations, one in Munich and one in Tokyo. Specifically both were E5-1650 v3.

@Shahar
Also, if #43 solves your issue, can you do the following steps:
cd /sys/class/powercap
grep -r . *
cd /sys/class/thermal
grep -r . *
cd /sys/class/hwmon
grep -r . *

sudo ./thermald --no-daemon --loglevel=debug
<Send output of above command>

I am thinking a combination of factors is causing this issue. First of all, this is a server platform, so there is no need to run thermald (thermald doesn't take care of multiple packages and is not tested on Xeons).

Shahar Or (mightyiam) wrote :

I've:

1. Installed `intel-microcode`
2. Rebooted

I did this to make sure that the problem still exists. And it did. Then I:

3. `sudo systemctl disable thermald`
4. Rebooted

Frequency scaling seems fine. Hooray!

Attached is output of commands requested in #47.

Shahar Or (mightyiam) wrote :

Would testing with 15.10 help or have we figured this out?

Can you also check MSR_IA32_ENERGY_PERF_BIAS (MSR 0x1b0) ?

Shahar Or (mightyiam) wrote :

@hmh, `$ sudo rdmsr 0x1b0` returned `7`.

I wonder if these MSRs are set to the same contents in *all* cores?

Anyway, please look at this:

http://article.gmane.org/gmane.linux.power-management.general/70615/match=msr_ia32_energy_perf_bias

"The assumption that BIOSes never want to have this register being set to
full performance (zero) is wrong.

While wrongly overruling this BIOS setting and set it to from performance
to normal did not hurt that much, because nobody really knew the effects inside
Intel processors.

But with Broadwell-EP processor (E5-2687W v4) the CPU will not enter turbo modes
if this value is not set to performance."

Now, it says "Broadwell-EP E5-2687W v4" in the commit text, but there is no such beast in Intel ARK, so it might well be the Haswell E5-2687W v3. And even if the commit text is indeed correct, it is likely worth a try to check the behavior of MSR_IA32_ENERGY_PERF_BIAS in the new Haswell microcode...

So, maybe you could try to set MSR_IA32_ENERGY_PERF_BIAS to zero on all cores (maybe using the x86_energy_perf_policy utility, it is in the Linux kernel source tree, at "tools/power/x86/x86_energy_perf_policy"), and check if that fixes the issue as well?

Shahar Or (mightyiam) wrote :

@hmh, does this require compiling a kernel? Last time I did that was some 8 years ago, on Debian.

No, I think you can just go to that directory and type make.

You can also use wrmsr (make sure to write to *all* cores).

oh, you *do* need a kernel source tarball, or a fresh clone from git if you want to compile the tool.

Very strange. From the logs, thermald didn't take any action. Even so, the problem seems to be fixed.
Can you let thermald run for few minutes in the same way and collect logs?

Chris J Arges (arges) wrote :

I am also affected on my haswell-ep i7-5820k desktop. After disabling thermald performance is back to normal.

I think it has to do with thermald starting too early, while the driver modules are not yet ready.
@Chris:
Can you attach thermald logs (when started by systemd) and also by disabling and started by command line?

Currently systemd runlevels are [2345]; can we try just [5]?
In the systemd service file, thermald.conf,

change
start on runlevel [2345] and started dbus
to
start on runlevel [5] and started dbus

If thermald is loading too early, then this will delay it until runlevel 5.

Colin Ian King (colin-king) wrote :

I've uploaded a patched version of thermald that includes 4 fixes from Srinivas and also modifies thermald to only start on runlevel 5.

Packages for wily and xenial can be found in ppa:colin-king/thermald-1480349 - please install and test to see if it fixes the issues

sudo add-apt-repository ppa:colin-king/thermald-1480349
sudo apt-get update && sudo apt-get dist-upgrade

Changed in intel-microcode (Ubuntu):
status: Confirmed → In Progress
Chris J Arges (arges) wrote :

I just tested the version of thermald in the PPA and my computer is now functioning as expected.

Colin Ian King (colin-king) wrote :

Hi folks, I'd appreciate some more testing on this before I proceed any further with the fixes. If one can test the packages in my ppa (see comment #60) and let me know:

release (wily or xenial)
passed or failed.

Thanks.

I am still not sure about the root cause. Can anybody help by collecting startup logs (journalctl -b)?
Can we just change the runlevel to 5 and see if the problem is fixed? If there is a race condition in loading, then this should address it.
@Colin, what changes are required to run at runlevel 5? Also, for testing we should start thermald with --loglevel=debug in thermald.service.

I was asked to provide a root cause analysis, as this issue has moved up the chain for attention.

Colin Ian King (colin-king) wrote :

@Srinivas,

I changed the run level to 5 in this updated package, so it should do this automatically when the update is installed.

@Colin
I want to see if there is a race condition, by changing runlevel to 5 without any other change. What do we need to change in systemd service file?

Colin Ian King (colin-king) wrote :

@Srinivas,

systemd uses targets, which serve a similar purpose to runlevels. To change this on a running system, one uses systemctl.

1. To list different target types use:

systemctl list-units --type=target

multi-user.target is equivalent to run level 3
graphical.target is equivalent to run level 5

2. Edit the thermald service config:

/lib/systemd/system/thermald.service

and set WantedBy=graphical.target

run:

sudo systemctl enable thermald.service

and this will setup the symlinks to enable this service on the graphical.target target level
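
The edit in step 2 could be scripted roughly as follows. This is a sketch, not a tested procedure: the function name set_wanted_by is made up, and the unit text in the demo is a hypothetical stand-in written to a scratch file, not the real contents of thermald.service.

```shell
#!/bin/sh
# set_wanted_by FILE TARGET: rewrite every WantedBy= line in FILE so that
# it names TARGET. Writes to a temporary file first, then replaces FILE.
set_wanted_by() {
    sed "s/^WantedBy=.*/WantedBy=$2/" "$1" > "$1.new" && mv "$1.new" "$1"
}

# Demo on a scratch file. HYPOTHETICAL unit text, for illustration only.
unit=$(mktemp)
cat > "$unit" <<'EOF'
[Unit]
Description=Example service

[Service]
ExecStart=/bin/true

[Install]
WantedBy=multi-user.target
EOF

set_wanted_by "$unit" graphical.target
grep '^WantedBy=' "$unit"
rm -f "$unit"
```

On a real system the file would be /lib/systemd/system/thermald.service and the edit needs root. If the service was already enabled under the old target, "sudo systemctl reenable thermald.service" may be needed instead of a plain enable so that the old symlink is dropped; treat that as an assumption to verify.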

description: updated
Colin Ian King (colin-king) wrote :

debdiff attached

Changed in thermald (Ubuntu):
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
milestone: none → xenial-updates
Changed in thermald (Ubuntu Wily):
milestone: none → wily-updates
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thermald - 1.4.3-6

---------------
thermald (1.4.3-6) unstable; urgency=medium

  * Fix frequency scaling issue on specific Intel CPUs (LP: #1480349)
    - roll earlier patches that allow us to cleanly patch the issue
    - Error recovery when sysfs attrib read fails
    - Remove wild card for loading
    - Move parser init to thd_engine
    - Add white list of the cpu ids

 -- Colin King <email address hidden> Tue, 26 Jan 2016 10:15:11 +0000

Changed in thermald (Ubuntu Xenial):
status: New → Fix Released
summary: - Intel Microcode Breaks frequency scaling in Xeon® E5-2687W v3 & E5-1650
- v3
+ thermald breaks frequency scaling in Xeon® E5-2687W v3 & E5-1650 v3
Philipp Kern (pkern) wrote :

What about trusty?

Changed in thermald (Ubuntu Trusty):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → High
status: New → In Progress
Changed in thermald (Ubuntu Wily):
status: New → In Progress
Changed in thermald (Ubuntu Vivid):
assignee: nobody → Colin Ian King (colin-king)
Changed in thermald (Ubuntu Vivid):
importance: Undecided → High
milestone: none → vivid-updates
status: New → In Progress
Changed in thermald (Ubuntu Trusty):
milestone: none → trusty-updates
description: updated
no longer affects: intel-microcode (Ubuntu)
no longer affects: intel-microcode (Ubuntu Trusty)
no longer affects: intel-microcode (Ubuntu Vivid)
no longer affects: intel-microcode (Ubuntu Wily)
no longer affects: intel-microcode (Ubuntu Xenial)

Does this bug affect thermald 1.3 ?

Colin Ian King (colin-king) wrote :

yes

tags: added: trusty vivid
Brian Murray (brian-murray) wrote :

@Colin - looking at the patch 0203-Add-white-list-of-the-cpu-ids.patch it looks like a couple of CPUs were dropped.

+ // Add any tested platform ids in this table
+-static supported_ids_t id_table[] = { { 6, 0x2a }, // Sandybridge
+- { 6, 0x2d }, // Sandybridge
++static supported_ids_t id_table[] = {
++ { 6, 0x2a }, // Sandybridge
+ { 6, 0x3a }, // IvyBridge
+- { 6, 0x3c }, { 6, 0x3e }, { 6, 0x3f }, { 6, 0x45 }, // Haswell ULT */
+- { 6, 0x46 }, // Haswell ULT */
++ { 6, 0x3c }, // Haswell
++ { 6, 0x45 }, // Haswell ULT
++ { 6, 0x46 }, // Haswell ULT
++ { 6, 0x3d }, // Broadwell
++ { 6, 0x37 }, // Valleyview BYT
++ { 6, 0x4c }, // Brasewell
++ { 6, 0x4e }, // skylake
++ { 6, 0x5e }, // skylake
++ { 6, 0x5c }, // Broxton

Specifically, 0x2d, 0x3e, and 0x3f. Was that intended?

Colin Ian King (colin-king) wrote :

Brian, let me double check that with Intel.

0x2d, 0x3e, and 0x3f are Xeon parts, where we don't want thermald to start.
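
To make the effect of the whitelist concrete, here is a rough shell sketch of the lookup. It is not thermald's actual code: the function name is_whitelisted is made up, the model list is copied from the added lines of the patch excerpt above (which may not show the complete table), and matching on family and model alone is an assumption.

```shell
#!/bin/sh
# Family 6 models kept in the patched id_table, per the excerpt above.
# ASSUMPTION: the excerpt may be incomplete.
WHITELIST="0x2a 0x3a 0x3c 0x45 0x46 0x3d 0x37 0x4c 0x4e 0x5e 0x5c"

# is_whitelisted FAMILY MODEL: succeed if the pair is in the table.
# FAMILY and MODEL may be decimal (as in /proc/cpuinfo) or 0x-prefixed hex.
is_whitelisted() {
    family=$(( $1 ))
    model=$(( $2 ))
    [ "$family" -eq 6 ] || return 1
    for id in $WHITELIST; do
        [ $(( $id )) -eq "$model" ] && return 0
    done
    return 1
}

# Model 63 (0x3f) is the value in the /proc/cpuinfo posted earlier in
# this report; model 60 (0x3c) is the Haswell entry in the table.
for m in 63 60; do
    if is_whitelisted 6 "$m"; then
        echo "family 6 model $m: whitelisted"
    else
        echo "family 6 model $m: not whitelisted"
    fi
done
```

With this table, model 63 is rejected and model 60 is accepted, which matches the statement that the Xeon ids were dropped on purpose.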

Hello Shahar, or anyone else affected,

Accepted thermald into wily-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.4.3-5ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in thermald (Ubuntu Wily):
status: In Progress → Fix Committed
tags: added: verification-needed
Timo Aaltonen (tjaalton) wrote :

Hello Shahar, or anyone else affected,

Accepted thermald into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.4.3-5~14.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in thermald (Ubuntu Trusty):
status: In Progress → Fix Committed
Colin Ian King (colin-king) wrote :

If bug reporters could try testing the -proposed packages then we can get the fixes into the archive. Thanks!

Shahar Or (mightyiam) wrote :

I'm running 16.04. I won't install 15.10 for this purpose.

Colin Ian King (colin-king) wrote :

@Philipp Kern, re "What about trusty?": can you test the fixed version that is in -proposed? If it works for you, we can get this released.

Philipp Kern (pkern) wrote :

The updated thermald package in trusty-proposed fixed the issue on my machine. Downgrading it brings it back.

tags: added: verification-done-trusty verification-needed-wily
removed: verification-needed
tags: added: verification-needed
removed: verification-needed-wily
Colin Ian King (colin-king) wrote :

Thanks Philipp.

Anyone care to test Wily? I'd like to get that one released too.

Changed in thermald (Ubuntu Vivid):
status: In Progress → Won't Fix
Colin Ian King (colin-king) wrote :

Ubuntu 15.04 (Vivid Vervet) reaches End of Life on February 4, 2016, so this won't be fixed for Vivid.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thermald - 1.4.3-5~14.04.2

---------------
thermald (1.4.3-5~14.04.2) trusty; urgency=medium

  * Fix frequency scaling issue on specific Intel CPUs (LP: #1480349)
    - roll earlier patches that allow us to cleanly patch the issue
    - Error recovery when sysfs attrib read fails
    - Remove wild card for loading
    - Move parser init to thd_engine
    - Add white list of the cpu ids

 -- Colin King <email address hidden> Thu, 4 Feb 2016 12:31:00 +0000

Changed in thermald (Ubuntu Trusty):
status: Fix Committed → Fix Released

As a part of the Stable Release Updates quality process, a search for Launchpad bug reports using the version of thermald from wily-proposed was performed and bug 1549671 was found. Please investigate this bug report to ensure that a regression will not be created by this SRU. In the event that this is not a regression, remove the "verification-failed" tag from this bug report and add the tag "bot-stop-nagging" to bug 1549671 (not this bug). Thanks!

tags: added: verification-failed
Colin Ian King (colin-king) wrote :

Re: comment #85: I do not believe this is a regression caused by this fix. Given that the same fix landed on Xenial and Trusty, and it's the same code base, I'd like this fix to be released if possible.

tags: removed: verification-failed
tags: removed: vivid
tags: added: verification-needed-wily
removed: verification-needed wily
Colin Ian King (colin-king) wrote :

I've given this a run-through on a bunch of Wily-based Intel systems and can't see any regressions, so I am marking this as verified for Wily.

tags: added: verification-done-wily
removed: verification-needed-wily
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thermald - 1.4.3-5ubuntu1

---------------
thermald (1.4.3-5ubuntu1) wily; urgency=medium

  * Fix frequency scaling issue on specific Intel CPUs (LP: #1480349)
    - roll earlier patches that allow us to cleanly patch the issue
    - Error recovery when sysfs attrib read fails
    - Remove wild card for loading
    - Move parser init to thd_engine
    - Add white list of the cpu ids

 -- Colin King <email address hidden> Tue, 26 Jan 2016 10:15:11 +0000

Changed in thermald (Ubuntu Wily):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for thermald has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Shahar Or (mightyiam) wrote :

Thanks a bunch!

Changed in thermald (Ubuntu):
milestone: xenial-updates → none