HP Proliant Servers suffer from cpufreq initialization failure for some cpu cores

Bug #1447763 reported by Rafael David Tinoco
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

The following was brought to my attention:

The Ubuntu Trusty 3.13 kernel fails to initialize cpufreq for some CPU cores on HP ProLiant servers. In the listing below, cpu17 has no cpufreq directory.

/sys/devices/system/cpu# for cpu in `ls -1d cpu[0-9]*`; do ls -ld $cpu/cpufreq; done
drwxr-xr-x 2 root root 0 Apr 2 16:15 cpu0/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu1/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu10/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu11/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu12/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu13/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu14/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu15/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu16/cpufreq
ls: cannot access cpu17/cpufreq: No such file or directory
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu18/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu19/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu2/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu20/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu21/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu22/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu23/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu24/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu25/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu26/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu27/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu28/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu29/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu3/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu30/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu31/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu4/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu5/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu6/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu7/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu8/cpufreq
drwxr-xr-x 2 root root 0 Apr 2 16:16 cpu9/cpufreq
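
On a machine with many cores, the one missing entry is easy to overlook in a listing like the one above. The following is a minimal sketch, not part of the original report, that prints only the CPUs without a cpufreq directory, in numeric order. The function name missing_cpufreq and its optional sysfs-root argument are illustrative assumptions.

```shell
#!/bin/sh
# Print the CPUs that have no cpufreq entry, sorted by CPU number.
# $1 is an optional override of the sysfs CPU directory (an assumption
# made here so the function can be tried against a copy of /sys).
missing_cpufreq() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for dir in "$cpu_root"/cpu[0-9]*; do
        # Skip the unexpanded pattern when no CPU directory exists.
        [ -d "$dir" ] || continue
        # -e also accepts a symlink to a policy directory.
        [ -e "$dir/cpufreq" ] || echo "${dir##*/}"
    done | sort -t u -k 2,2n
}
```

Calling missing_cpufreq with no argument inspects the running system. For the listing above it would be expected to report cpu17 only.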

The kernel log shows the following messages:

[ 2.465616] pcc-cpufreq: (v1.10.00) driver loaded with frequency limits: 1200 MHz, 2200 MHz
[ 2.474810] cpufreq: __cpufreq_add_dev: ->get() failed

Disabling Collaborative Power Control in the firmware setup mitigates the issue.

Tags: cts
Rafael David Tinoco (rafaeldtinoco) wrote :

I found the following commit:

commit 2ed99e39cb9392312c100d9da591c20641c64d12
Author: Rafael J. Wysocki <email address hidden>
Date: Wed Mar 12 21:49:33 2014 +0100

cpufreq: Skip current frequency initialization for ->setpolicy drivers

After commit da60ce9f2fac (cpufreq: call cpufreq_driver->get() after
calling ->init()) __cpufreq_add_dev() sometimes fails for CPUs handled
by intel_pstate, because that driver may return 0 from its ->get()
callback if it has not run long enough to collect enough samples on the
given CPU. That didn't happen before commit da60ce9f2fac which added
policy->cur initialization to __cpufreq_add_dev() to help reduce code
duplication in other cpufreq drivers.
...

This fix has already been backported to the Ubuntu 3.13 kernel, as of 3.13.0-20.
I will investigate why the same problem still appears on the affected kernel (3.13.0-43).

Changed in linux (Ubuntu):
assignee: nobody → Rafael David Tinoco (inaddy)
status: New → Confirmed
status: Confirmed → In Progress
tags: added: cts
Rafael David Tinoco (rafaeldtinoco) wrote :

Setting "Collaborative Power Control = Enabled" in the firmware selects "pcc-cpufreq" as the "core communication" channel between the cpufreq driver and the firmware.

https://www.kernel.org/doc/Documentation/cpu-freq/pcc-cpufreq.txt

HP uses this interface extensively (HP servers tend to rely heavily on ACPI and on interfaces between the OS and the firmware).

Judging by the comments on this commit:

https://lkml.org/lkml/2014/3/20/799

We can see that cpufreq initialization for a CPU can fail under intel_pstate when the driver's ->get() callback returns 0 during __cpufreq_add_dev().

We might be getting a "0" from:

drivers/cpufreq/pcc-cpufreq.c -> pcc_get_freq()

which is called by this line (from the commit mentioned above):

policy->cur = cpufreq_driver->get(policy->cpu);

during the cpufreq initialization function (__cpufreq_add_dev()).

* This is actually the latest code in the kernel, and it assumes that ->get() can sometimes fail during cpufreq initialization with intel_pstate (and possibly with other cpufreq drivers). When it fails, we get the warning message shown above. A possible workaround is to "disable" and "enable" the affected core by taking it offline and bringing it online again * (the kernel code does not attempt to initialize cpufreq a second time on its own, for example).
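
The offline/online workaround could be tried with a sketch like the one below. It was not tested as part of this report and rests on the assumption that bringing a core back online makes the kernel run cpufreq setup for it again. The function name cycle_cpu and its optional sysfs-root argument are illustrative. It must run as root on a real system.

```shell
#!/bin/sh
# Take one CPU offline, bring it back online, then report whether its
# cpufreq entry exists. $1 is the CPU number; $2 is an optional
# override of the sysfs CPU directory (an assumption for dry runs).
cycle_cpu() {
    cpu="$1"
    cpu_root="${2:-/sys/devices/system/cpu}"
    ctl="$cpu_root/cpu$cpu/online"
    # cpu0 often has no "online" file because it cannot be unplugged.
    if [ ! -w "$ctl" ]; then
        echo "cannot toggle cpu$cpu: $ctl is not writable" >&2
        return 1
    fi
    echo 0 > "$ctl" || return 1
    sleep 1
    echo 1 > "$ctl" || return 1
    if [ -e "$cpu_root/cpu$cpu/cpufreq" ]; then
        echo "cpu$cpu: cpufreq present"
    else
        echo "cpu$cpu: cpufreq still missing"
    fi
}
```

For the core missing in the bug description, the call would be cycle_cpu 17. If the ->get() callback returns 0 again, the directory will still be absent and the same kernel message should reappear.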

It looks like the HP ROM engineering team might want to debug this.

Rafael David Tinoco (rafaeldtinoco) wrote :

After disabling Collaborative Power Control in the firmware, instead of getting:

$ cat ./sys/devices/system/cpu/cpuX/cpufreq/scaling_driver
pcc-cpufreq

we get

$ cat ./sys/devices/system/cpu/cpuX/cpufreq/scaling_driver
acpi-cpufreq

This means that acpi-cpufreq is giving intel_pstate good initialization parameters.
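
To confirm which driver is active on every core, not just one, a summary such as the following sketch can help. The function name scaling_drivers and its optional sysfs-root argument are illustrative assumptions. CPUs without a cpufreq directory are simply not counted.

```shell
#!/bin/sh
# Count how many CPUs use each cpufreq scaling driver.
# $1 is an optional override of the sysfs CPU directory (an assumption
# made here so the function can be pointed at a copy of /sys).
scaling_drivers() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_driver; do
        # Skip the unexpanded pattern and unreadable entries.
        [ -r "$f" ] || continue
        cat "$f"
    done | sort | uniq -c | awk '{ print $2 ": " $1 " CPU(s)" }'
}
```

With Collaborative Power Control enabled, the expectation from this report is a single pcc-cpufreq line whose count is lower than the number of cores. With it disabled, the expectation is a single acpi-cpufreq line covering all cores.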

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Changed in linux (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired