HWP and C1E are incompatible - Intel processors

Bug #1917813 reported by Doug Smythies
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
Linux            Confirmed  Medium
linux (Ubuntu)   Confirmed  Undecided   Unassigned

Bug Description

Modern Intel processors (since Skylake) with HWP (Hardware P-state) control enabled and idle state 2 (C1E) enabled can incorrectly drop the CPU frequency, with an extremely slow recovery time.

The fault is not within HWP itself, but within the internal idle detection logic. One difference between OS-driven p-state control and HWP-driven p-state control is that the OS knows the system was not actually idle, but HWP does not. Another difference is the incredibly sluggish recovery with HWP.

The problem only occurs when Idle State 2, C1E, is involved. Not all processors have the C1E idle state. The issue is independent of C1E auto-promotion, which is turned off in general, as far as I know.
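For reference, a minimal sketch of checking whether a given machine has HWP and the C1E idle state (standard cpuinfo flags and sysfs paths; exact locations may vary by kernel version):

# HWP capability flags (hwp, hwp_notify, hwp_act_window, hwp_epp):
grep -m1 -o 'hwp[a-z_]*' /proc/cpuinfo
# intel_pstate operation mode (active, passive, or off):
cat /sys/devices/system/cpu/intel_pstate/status
# list the idle state names; look for C1E:
grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name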

With all idle states enabled the issue is rare. The issue would manifest itself in periodic workflows, and would be extremely difficult to isolate (it took me over half a year).

The purpose of this bug report is to link to the upstream bug report, where readers can find tons of detail. I'll also set it to confirmed, as it has already been verified on 4 different processor models, and I do not want the bot asking me for files that are not required.

Workarounds include (a sketch is given after this list):
. don't use HWP.
. disable idle state 2, C1E.
. change the C1E idle state to use MWAIT 0x03 instead of MWAIT 0x01 (still in test; documentation on the MWAIT least significant nibble is scant).
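A minimal sketch of the first two workarounds, assuming the kernel parameters and sysfs paths documented for intel_pstate, intel_idle, and cpuidle (the third workaround requires the intel_idle.c source change shown later in this report):

# don't use HWP: boot with the kernel parameter:
#   intel_pstate=no_hwp
# disable idle state 2 (C1E) at run time, on all CPUs (root required):
echo 1 | sudo tee /sys/devices/system/cpu/cpu*/cpuidle/state2/disable
# or disable it at boot (bit mask, bit 2 = idle state 2):
#   intel_idle.states_off=4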

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294171
Graph of load sweep up and down at 347 Hertz.

Consider a steady-state periodic single-threaded workflow, with a work/sleep frequency of 347 Hertz and a load somewhere in the ~75% range at the steady-state operating point.
For the intel-cpufreq CPU frequency scaling driver with the powersave governor and HWP disabled, it runs indefinitely without any issues.
For the acpi-cpufreq CPU frequency scaling driver with the ondemand governor, it runs indefinitely without any issues.
For the intel-cpufreq CPU frequency scaling driver with the powersave governor and HWP enabled, it suffers from overruns.
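For reference, a minimal sketch (an assumption based on the kernel's intel_pstate documentation, not part of the original report) of how the three configurations above can be selected:

# intel-cpufreq (intel_pstate passive mode), HWP disabled:
#   boot with: intel_pstate=passive intel_pstate=no_hwp
# acpi-cpufreq:
#   boot with: intel_pstate=disable
# intel-cpufreq, HWP enabled:
#   boot with: intel_pstate=passive
# verify the resulting driver, then set the governor named above, e.g.:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver
echo powersave | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor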

Why?

For unknown reasons, HWP seems to incorrectly decide that the processor is idle and spins the PLL down to a very low frequency. Upon exit from the sleep portion of the periodic workflow, it takes a very long time to recover (on the order of 20 milliseconds; supporting data for that statement will be added in a later posting), resulting in the periodic job not being able to complete its work before the next interval, whereas it normally has plenty of time to do its work. Actually, typical worst-case overruns are around 12 milliseconds, or several work/sleep periods (i.e. it takes a very long time to catch up).

The probability of this occurring is about 3%, but varies significantly. Obviously, the recovery time is also a function of EPP, but mostly this work has been done with the default EPP of 128. I believe this to be a sampling and anti-aliasing issue, but cannot prove it because HWP is a black box. My best GUESS is:

If the periodic load is busy on a jiffy boundary, such that the tick is on,
Then if it is sleeping at the next jiffy boundary, with a pending wake, such that idle state 2 was used,
  Then if the rest of the system was idle, such that HWP decides to spin down the PLL,
    Then it is highly probable that upon that idle state 2 exit the PLL is too slow to ramp up, and the task will overrun as a result,
Else everything will be fine.

For a 1000 Hz kernel the above suggests that a work/sleep frequency of 500 Hz should behave in a binary way, either lots of overruns or none, because the integer ratio of tick rate to work/sleep frequency fixes the phase of the cycle relative to jiffy boundaries. It does.
For a 1000 Hz kernel the above suggests that a work/sleep frequency of 333.333 Hz should likewise behave in a binary way, either lots of overruns or none. It does.
Note: in all cases the sleep time has to be within the window of opportunity.

Now, I can not actually prove whether the idle state 2 part is a cause or a consequence, but the issue never happens with it disabled, albeit at the cost of significant power.

Another way this issue would manifest itself is as a seemingly extraordinary idle exit latency, which would be rather difficult to isolate as the cause.

Processors tested:
Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz (mine)
Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz (not mine)

HWP has been around for years; why am I just reporting this now?

I never owned an HWP-capable processor before. My i7-2600K based test computer was getting a little old, so I built a new test computer. I noticed this issue the same day I first enabled HWP. That was months ago (notice the dates on the graphs that will eventually be added to this), and I tried, repeatedly, to get help from Intel via...


Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Intel: Kristen hasn't been the maintainer for years. Please update the auto-assigned thing.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294173
Graph of an area of concern breaking down.

An experiment was done looking around the area initially found at a 347 hertz work/sleep frequency of the periodic workflow and load.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294175
Graph of overruns from the same experiment as the previous post

There should not be any overruns (sometimes there are 1 or 2 from first-time start-up).

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294177
inverse impulse test - short short sleep exit response

Good and bad inverse impulse response exits all on one graph.

The graph mentions 5 milliseconds a lot. At that time I did not know that the frequency step times are a function of EPP. I have since mapped the entire EPP space, getting:

0 <= EPP <= 1 : unable to measure.
2 <= EPP <= 39 : 2 milliseconds between frequency steps
40 <= EPP <= 55 : 3 milliseconds between frequency steps
56 <= EPP <= 79 : 4 milliseconds between frequency steps
80 <= EPP <= 133 : 5 milliseconds between frequency steps
134 <= EPP <= 143 : 6 milliseconds between frequency steps
144 <= EPP <= 154 : 7 milliseconds between frequency steps
155 <= EPP <= 175 : 8 milliseconds between frequency steps
176 <= EPP <= 255 : 9 milliseconds between frequency steps
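For reference, with the intel_pstate driver in active mode and HWP enabled, EPP can be read and written through sysfs, either as a named preference or as a raw 0-255 value (per the kernel's intel_pstate documentation); a minimal sketch:

# read the current EPP on CPU 0:
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference
# set the raw default EPP of 128 on all CPUs:
echo 128 | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference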

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294179
inverse impulse response - multiple (like 1000) bad exits

By capturing a great many bad exits, one can begin to observe the width of the timing race window (which I already knew from other work, but don't think I wrote herein yet). The next few attachments will drill down into some details of this same data.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294181
inverse impulse response - multiple (like 1000) bad exits - detail A

Just a zoomed-in graph of an area of interest, so I could verify that the window size was the same (close enough) as what I asked for. The important point is that the window is always exactly around the frequency step point.

Now, we already know that the frequency step points are an HWP thing, so this data supports the argument that HWP is doing this stuff on its own.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294185
inverse impulse response - multiple (like 1000) bad exits - detail B

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294187
inverse impulse response - multiple (like 1000) bad exits - detail C

The previous attachment and this one are details B and C, zoomed-in looks at another two spots, again calculating the window width.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294189
inverse impulse response - i5-6200u multi all bad

This is the other computer. There are also detail graphs, if needed.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294201
Just an example of inverse impulse response versus some different EPPs

See also:

https://marc.info/?l=linux-pm&m=159354421400342&w=2

On that old thread, I just added a link back to this report.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

> A 250 hertz kernel was tested, and it did not have this
> issue in this area. Perhaps elsewhere, I didn't look.

Correction: the same thing happens for a 250 Hertz kernel.

Some summary data for the periodic workflow manifestation of the issue: 347 hertz work/sleep frequency, fixed packet of work to do per cycle, 5 minutes per run, kernel 5.10, both 1000 Hz and 250 Hz, teo and menu idle governors, idle state 2 enabled and disabled.

1000 Hz, teo, idle state 2 enabled:
overruns 28399
maximum catch up 13334 uSec
Ave. work percent: 76.767
Power: ~14.5 watts

1000 Hz, menu, idle state 2 enabled:
overruns 835
maximum catch up 10934 uSec
Ave. work percent: 68.106
Power: ~16.3 watts

1000 Hz, teo, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 67.453
Power: ~16.8 watts (+2.3 watts)

1000 Hz, menu, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 67.849
Power: ~16.4 watts (and yes, the 0.1 watt difference is relevant)

250 Hz, teo, idle state 2 enabled:
overruns 193
maximum catch up 10768 uSec
Ave. work percent: 68.618
Power: ~16.1 watts

250 Hz, menu, idle state 2 enabled:
overruns 22
maximum catch up 10818 uSec
Ave. work percent: 68.607
Power: ~16.1 watts

250 Hz, teo, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 68.550
Power: ~16.1 watts

250 Hz, menu, idle state 2 disabled:
overruns 0
maximum catch up 0 uSec
Ave. work percent: 68.586
Power: ~16.1 watts

So, the reason I missed the 250 hertz kernel in my earlier work was that the probability was so much lower. The probability is lower because the operating point is so different between the teo and menu governors and between the 1000 and 250 Hz kernels, i.e. there is much more spin-down margin for the menu case.

The operating point difference between the 250 Hz and 1000 Hz kernels for teo is worth a deeper look.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Additionally, and all other things being equal, the use of idle state 2 is dramatically different between the 1000 Hz (0.66%) and 250 Hz (0.03%) kernels, resulting in differing probabilities of hitting the timing window while in idle state 2.

HWP does not work correctly in these scenarios.
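For reference, per-state idle usage such as the percentages above can be estimated from the cumulative counters that cpuidle exposes in sysfs; a minimal sketch:

# cumulative entry count for idle state 2 on CPU 0:
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/usage
# cumulative residency in idle state 2 on CPU 0, in microseconds:
cat /sys/devices/system/cpu/cpu0/cpuidle/state2/time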

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294275
Graph of load sweep at 200 Hertz for various idle states

> Now, I can not actually prove whether the idle state 2 part
> is a cause or a consequence, but the issue never happens with it
> disabled, albeit at the cost of significant power.

Idle state 2, combined with the timing window, which is much, much larger than previously known, is the cause.

The CPU load is increased to max, then decreased. As a side note, there is a staggering amount of hysteresis and very long time constants involved here.

If one just sits and watches turbostat with the system supposedly in steady state operation, HWP can be observed very gradually (10s of seconds) deciding that it can reduce the CPU frequency, thus saving power. Then it has one of these false frequency drops, HWP struggles to catch up, raising the CPU frequency as it does so, and the cycle repeats.
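For reference, a minimal sketch of watching this with turbostat (the column names below are from recent turbostat versions and may differ):

# once per second: busy percent, average busy-CPU frequency, package power:
sudo turbostat --quiet --interval 1 --show Busy%,Bzy_MHz,PkgWatt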

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294399
step function system response - overview

1514 step function tests were done.
The system response was monitored each time.
For 93% of the tests, the system response was as expected
(do not confuse "as expected" with "ideal" or "best").
For 7% of the tests the system response was not as expected, being much, much too slow and taking far too long thereafter to completely come up to speed.

Note: The y-axis of these graphs is now "gap-time" instead of CPU frequency. This was not done to confuse the reader; rather, the reverse frequency calculation was deliberately omitted. It is preferable to observe the data in units of time, without introducing frequency errors due to ISR and other latency gaps. Approximate CPU frequency conversions have been added.

While I will post about 5 graphs for this experiment, I have hundreds and have done many different EPPs and on and on ...

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294401
step function system response - detail A

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294403
step function system response - detail B

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294405
step function system response - detail B-1

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294407
step function system response - detail B-2

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294469
step function system response - idle state 2 disabled

1552 test runs with idle state 2 disabled, no failures.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294685
a set of tools for an automated test

At this point, I have provided 3 different methods that reveal the same HWP issue. Herein, tools are provided to perform an automated quick test to answer the question "does my processor have this HWP issue?"

The motivation for this automation is to make it easier to test other HWP-capable Intel processors. Until now, the other methods of manifesting the issue have required "tweaking", and have probabilities of occurrence even lower than 0.01%, requiring unbearably long testing times (many hours) in order to acquire enough data to be statistically valid. Typically, this test provides PASS/FAIL results in about 5 minutes.

The test changes idle state enabled/disabled status, requiring root rights to do so. The scale for the fixed workpacket periodic workflow is both arbitrary and different between processors. The test runs in two steps: the first finds the operating point for the test (i.e. it does the "tweaking" automatically); the second does the actual tests, one without idle state 2 and one with only idle state 2 (recall that the issue is linked with the use of idle state 2). Forcing idle state 2 greatly increases the probability of the issue occurring. While this test has been created specifically for the intel_pstate CPU frequency scaling driver with HWP enabled and the powersave governor, it doesn't check for that. Therefore, one way to test the test is to try it with HWP disabled.

Note: the subject test computer must be able to run one CPU at 100% without needing to throttle (power or thermal or any other reason), including with only idle state 2 enabled.
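For reference, the idle state toggling that the test performs can also be done by hand through sysfs; a minimal sketch of forcing only idle state 2 (root required; write 0 instead of 1 to re-enable a state):

# disable every idle state except state 2, on all CPUs:
for s in /sys/devices/system/cpu/cpu*/cpuidle/state*; do
  case "$s" in *state2) v=0 ;; *) v=1 ;; esac
  echo $v | sudo tee "$s/disable" > /dev/null
done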

Results so far: 3 of 3 processors FAIL: i5-9600K, i5-6200U, i7-10610U.

Use this command:

./job-control-periodic 347 6 6 900 10

Legend:
347 hertz work/sleep frequency
6 seconds per iteration run.
6 seconds per test run.
try for approximately 900 uSec average sleep time.
10 test loops at that 6 seconds per test.

The test will take about 5 minutes.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 294687
an example run of the quick test tools

The example contains results for:
HWP disabled: PASS (as expected)
HWP enabled: FAIL (as expected)

But tests were also done with a 250 Hertz kernel, turbo disabled, and the EEO and RHO bits changed... all give FAIL for HWP enabled with idle state 2 forced, and PASS for all other conditions.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Some other results for the quick test:

i5-9600k (Doug): FAIL. (Ubuntu 20.04; kernel any)
i5-6200U (Alin): FAIL. (Debian.)
i7-7700HQ (Gunnar): FAIL (Ubuntu 20.10)
i7-10610U (Russell) : FAIL. (CentOS (RedHat 8), 4.18.0-240.10.1.el8_3.x86_64 #1 SMP).
Another Skylake (Rick): still waiting to hear back.

So, 4 out of 4 so far (and I gave them no guidance at all, on purpose, as to any particular kernel to try).

I have been picking away at this thread (pun intended) for months, and I think it is finally starting to unravel. Somewhere above I said:

> For unknown reasons, HWP seems to incorrectly decide
> that the processor is idle and spins the PLL down to
> a very low frequency.

I now believe it to be something inside the processor, but maybe not part of HWP. I think that non-HWP processors, or ones with it disabled, also misdiagnose that the entire processor is idle. My evidence is neither very thorough nor currently in a presentable form, but this issue only ever occurs a short time after, or immediately after, every core has been idle, with at least one in idle state 2. The huge difference between HWP and OS-driven p-states is that the OS knows the system wasn't actually idle and HWP doesn't. Even though package C1E is disabled, it behaves, perhaps, similarly to it being enabled.

There is some small timing window where this really screws up. Mostly it works fine, and either the CPU frequency doesn't even ramp down at all, or it recovers quickly, within about 120 uSec.

And as far as I know, it exits the idle state O.K., but it takes an incredibly long time for HWP to ramp up the CPU frequency again. Meanwhile, any non-HWP approach doesn't drop the p-state request to minimum, nor re-start any sluggish ramp up.

Now, this issue is rare and would be extremely difficult to diagnose, appearing as occasional glitches, e.g. a frame rate drop in a game, dropped data, or unbelievably long latency if any kind of performance is required. I consider this issue to be of the utmost importance.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295137
An example idle trace capture of the issue

These are very difficult to find.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295139
Just for reference, a good example of some idle trace data

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295155
graph of inverse impulse response measured versus theoretical failure probabilities.

As recently as late yesterday, I was still attempting to refine the gap time definition from comment #1. Through this entire process, I just assumed the processor would require at least 2 samples before deciding the entire system was idle. Why? Because it was beyond my comprehension that it would be based on one instant in time. Well, that was wrong, and it is actually based on one sample only, at the HWP loop time (see attachment #294201), if idle state 2 is involved.

Oh, only idle state 2 was enabled for this. The reason I could not originally refine the gap definition was that I did not yet know enough. I had to force idle state 2 to increase the failure probabilities enough to find these limits without tests that would otherwise have run for days.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295159
Forgot to label my axes on the previous post

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295211
Take one point, 4500 uSec, from the previous graph and add a couple of other configurations

Observe that the recovery time, which does not include the actual idle state exit latency, just the extra time needed to get to an adequate CPU frequency, is on average 87 times slower for HWP versus no-HWP, and 44 times slower than passive/ondemand/no-HWP.

Yes, there are a few interesting spikes on the passive/ondemand/no-HWP graph, but those are things we can debug relatively easily (which I will not do).

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295533
changing the MWAIT definition of C1E fixes the problem

I only changed the one definition relevant to my test computer. The documentation on these bits is rather scant. Other potential fixes include getting rid of idle state 2 (C1E) altogether, or booting with it disabled: "intel_idle.states_off=4".

I observe that Rui fixed the "assigned" field. Thanks, not that it helps, as Srinivas has been aware of this for over half a year.

Revision history for this message
In , srinivas.pandruvada (srinivas.pandruvada-linux-kernel-bugs) wrote :

I tried to reproduce with your scripts on CFL-S systems almost half a year back and did not observe the same. Systems can be configured in different ways, which impacts the HWP algorithm. So it is possible that my lab system is configured differently.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

(In reply to Srinivas Pandruvada from comment #29)
> I tried to reproduce with your scripts on CFL-S systems almost half a year
> back and did not observe the same. Systems can be configured in different
> ways, which impacts the HWP algorithm. So it is possible that my lab system
> is configured differently.

By "CFL-S" I assume you mean "Coffee Lake".

I wish you had reported back to me your findings, as we could have figured out the difference.

Anyway, try the automated quick test I posted in comment 20. Keep in mind that it needs to be HWP enabled, active, powersave, default epp=128. It is on purpose that the tool does not check for this configuration.
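For reference, a minimal sketch of verifying that configuration before running the test (the expected values in the comments are assumptions based on the defaults described above):

cat /sys/devices/system/cpu/intel_pstate/status                        # expect: active
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor              # expect: powersave
cat /sys/devices/system/cpu/cpu0/cpufreq/energy_performance_preference # expect: balance_performance (EPP 128)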

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

(In reply to Doug Smythies from comment #28)
> Created attachment 295533 [details]
> changing the MWAIT definition of C1E fixes the problem

Conversely, I have tried to determine if other idle states can be broken by setting the least significant bit of the MWAIT hint.

I did idle state 3, C3, and could not detect any change in system response.

I did idle state 5, C7s, which already had the least significant bit set, along with bit 1, so I set bit 1 to 0:

  .name = "C7s",
- .desc = "MWAIT 0x33",
- .flags = MWAIT2flg(0x33) | CPUIDLE_FLAG_TLB_FLUSHED,
+ .desc = "MWAIT 0x31",
+ .flags = MWAIT2flg(0x31) | CPUIDLE_FLAG_TLB_FLUSHED,
  .exit_latency = 124,
  .target_residency = 800,
  .enter = &intel_idle,

I could not detect any change in system response.

I am also unable to detect any difference in system response between idle state 1, C1, and idle state 2, C1E, with this change. I do not know if the change merely makes idle state 2 = idle state 1.

Changed in linux (Ubuntu):
status: New → Confirmed
summary: - HWP and C1E are incompatible - Intel prcoessors
+ HWP and C1E are incompatible - Intel processors
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295827
wult statistics for c1,c1e for stock and mwait modified kernels

Attempting to measure exit latency using Artem Bityutskiy's wult tool, tdt method.
Kernel 5.12-rc2, stock and with the MWAIT change from 0x01 to 0x03.
Statistics.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295829
graph of wult test results

graph of wult tdt method results.
graph of wult tdt method results.
If an I210-based NIC can be sourced, it will be tried, if pre-wake needs to be eliminated. I do not know if it is needed or not.

Revision history for this message
In , srinivas.pandruvada (srinivas.pandruvada-linux-kernel-bugs) wrote :

(In reply to Doug Smythies from comment #30)
> (In reply to Srinivas Pandruvada from comment #29)
> > I tried to reproduce with your scripts on CFL-S systems almost half a year
> > back and did not observe the same. Systems can be configured in different
> > ways, which impacts the HWP algorithm. So it is possible that my lab system
> > is configured differently.
>
> By "CFL-S" I assume you mean "Coffee Lake".
Yes, desktop part.

>
> I wish you had reported back to me your findings, as we could have figured
> out the difference.
>
I thought I had responded to you; I will have to search my emails. I had to specially get a system arranged, but it had a 200 MHz higher turbo. You did share your scripts at that time.

These algorithms are tuned on a system, so small variations can have a bigger impact.

Let's see if ChenYu has a system the same as yours.

> Anyway, try the automated quick test I posted in comment 20. Keep in mind
> that it needs to be HWP enabled, active, powersave, default epp=128. It is
> on purpose that the tool does not check for this configuration.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295853
wult statistics for c1,c1e for stock and mwait modified kernels - version 2

Artem advised that I lock the CPU frequencies at some high value, in order to show some difference. Frequencies were locked at 4.6 GHz for this attempt.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 295855
wult graph c1e for stock and mwait modified kernels - version 2

As Artem advised, with locked CPU frequencies.

Other data (kernel 5.12-rc2):

Phoronix dbench 1.0.2 0 client count 1:

stock: 264.8 MB/S
stock, idle state 2 disabled: 311.3 MB/S (+18%)
stock, HWP boost: 417.9 MB/S (+58%)
stock, idle state 2 disabled & HWP boost: 434.3 MB/S (+64%)
stock, performance governor: 420 MB/S (+59%)
stock, performance governor & idle state 2 disabled: 435 MB/S (+64%)

inverse impulse response, 847 uSec gap:
stock: 2302 tests 38 fails, 98.35% pass rate.
+ MWAIT change: 1072 tests, 0 fails, 100% pass rate.

@Srinivas: The whole point of the quick test stuff is that it self-adjusts to the system under test.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

For this:
Intel(R) Core(TM) i5-10600K CPU @ 4.10GHz
The quick test gives indeterminate results.
However, it is also not using any idle state that involves the least significant bit of MWAIT being set.

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C2_ACPI
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3_ACPI

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:ACPI FFH MWAIT 0x0
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:ACPI FFH MWAIT 0x30
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:ACPI FFH MWAIT 0x60

If there is a way to make idle work the way it did on all previous processors, i.e.:

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1
/sys/devices/system/cpu/cpu0/cpuidle/state2/name:C1E
/sys/devices/system/cpu/cpu0/cpuidle/state3/name:C3
/sys/devices/system/cpu/cpu0/cpuidle/state4/name:C6

$ grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/desc
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
/sys/devices/system/cpu/cpu0/cpuidle/state2/desc:MWAIT 0x01
/sys/devices/system/cpu/cpu0/cpuidle/state3/desc:MWAIT 0x10
/sys/devices/system/cpu/cpu0/cpuidle/state4/desc:MWAIT 0x20

I have not been able to figure out how.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

I found this thread:
https://patchwork.<email address hidden>/

And somehow figured out that an i5-10600K is COMETLAKE, and so did the same as that link:

doug@s19:~/temp-k-git/linux$ git diff
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 3273360f30f7..770660d777c4 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1155,6 +1155,7 @@ static const struct x86_cpu_id intel_idle_ids[] __initconst = {
        X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE_L, &idle_cpu_skl),
        X86_MATCH_INTEL_FAM6_MODEL(KABYLAKE, &idle_cpu_skl),
        X86_MATCH_INTEL_FAM6_MODEL(SKYLAKE_X, &idle_cpu_skx),
+ X86_MATCH_INTEL_FAM6_MODEL(COMETLAKE, &idle_cpu_skl),
        X86_MATCH_INTEL_FAM6_MODEL(ICELAKE_X, &idle_cpu_icx),
        X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNL, &idle_cpu_knl),
        X86_MATCH_INTEL_FAM6_MODEL(XEON_PHI_KNM, &idle_cpu_knl),

And got back the original types of idle states.
I did not want to beat up my NVMe drive with dbench, so I installed an old Intel SSD I had lying around:

Phoronix dbench 1.0.2 0 client count 1: (MB/S)
Intel_pstate HWP enabled, active powersave:

Kernel 5.12-rc2 stock:
All idle states enabled: 416.5
Only Idle State 0: 400.1
Only Idle State 1: 294.2
Only idle State 2: 401.6
Only idle State 3: 403.0

Kernel 5.12-rc2 patched as above:
All idle states enabled: 396.8
Only Idle State 0: 400.4
Only Idle State 1: 294.4
Only idle State 2: 245.9
Only idle State 3: 405.3
Only idle State 4: 402.8
quick test: FAIL.

Intel_pstate HWP disabled, active powersave:
Kernel 5.12-rc2 patched as above:
All idle states enabled: 340.0
Only Idle State 0: 399.5
Only Idle State 1: 358.5
Only idle State 2: 353.1
Only idle State 3: 346.9
Only idle State 4: 344.2
quick test: PASS.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

It is true that the quick test should at least check that idle state 2 is indeed C1E.

I ran the inverse impulse response test. Kernel 5.12-rc2. Processor i5-10600K. Inverse gap 842 nSec:

1034 tests 0 fails.

With the patch as per comment 38 above, i.e. with C1E:

1000 tests 16 fails. 98.40% pass 1.60% fail.

I ran just the generic periodic test at 347 hertz and light load, stock kernel, i.e. no C1E:

HWP disabled: active/powersave:
doug@s19:~/freq-scalers$ /home/doug/c/consume 32.0 347 300 1
consume: 32.0 347 300 PID: 1280
 - fixed workpacket method: Elapsed: 300000158 Now: 1617030857155911
Total sleep: 169222343
Overruns: 0 Max ovr: 0
Loops: 104094 Ave. work percent: 43.592582

HWP enabled: active/powersave:
doug@s19:~$ /home/doug/c/consume 32.0 347 300 1
consume: 32.0 347 300 PID: 1293
 - fixed workpacket method: Elapsed: 300000654 Now: 1617031529268276
Total sleep: 171458395
Overruns: 725 Max ovr: 1449
Loops: 104094 Ave. work percent: 42.847326

The above was NOT due to CPU migration:

doug@s19:~$ taskset -c 10 /home/doug/c/consume 32.0 347 3600 1
consume: 32.0 347 3600 PID: 1341
 - fixed workpacket method: Elapsed: 3600002498 Now: 1617036391455519
Total sleep: 2086618739
Overruns: 3189 Max ovr: 1864
Loops: 1249133 Ave. work percent: 42.038409

Conclusion: there is still something very minor going on even without C1E being involved.

Notes:

I think HWPBOOST was, at least partially, programming around the C1E issue.

In addition to the ultimate rejection of the patch from the thread referenced in comment 38, I think other processors should be rolled back to the same state. I have never been able to measure any energy consumption or performance difference for all of those deep idle states on my i5-9600K processor.

Call me dense, but I only figured out yesterday that HWP is called "Speed Shift" in other literature and BIOS.

It does not make sense that we spent so much effort a few years ago to make sure that we did not dwell in shallow idle states for long periods, only to have HWP set the requested p-state to minimum upon its (C1E) use, albeit under some other conditions. By definition the system is NOT actually idle; if it were, we would have asked for a deep idle state.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 305516
an updated set of tools for an automated quick test

Now checks if idle state 2 is C1E, and aborts if not.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 305517
Quick test runs on Kernel 6.7-rc3

Summary:
HWP disabled: PASS
HWP enabled: FAIL

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 305540
CPU frequency recovery time versus inactivity gap time.

Using only Idle state 2; using all except idle state 2; all idle states with and without HWP.

The maximum inactivity gap of ~400 mSec is different from a few years ago, when it didn't have an upper limit.

The C1E-dependent stuff is at the lower end, less than ~60 mSec.

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 305541
A more detailed example at a 250 mSec inactivity gap, HWP and no-HWP

Revision history for this message
In , dsmythies (dsmythies-linux-kernel-bugs) wrote :

Created attachment 305542
All drivers and governors, HWP and no-HWP, execution times.

No disabled idle states.
250 mSec inactivity followed by the exact same work packet for every test.
