Thermald does not set max CPU after reseting the voltage using RAPL

Bug #1811730 reported by Pablo Catalina on 2019-01-14
20
This bug affects 2 people
Affects Status Importance Assigned to Milestone
thermald (Ubuntu)
High
Colin Ian King
Bionic
Undecided
Unassigned

Bug Description

Hi,

I was using Ubuntu 18.10 thermald package, but I noted that, after few seconds at max CPU usage (max temp), thermald send the signal to RALP to reduce the voltage of the CPU. It set the freq to minimum (800MHz in my case). But when the CPU is idle and temp is lowered (35-40ºC) it did not send the signal to resume normal operation of the CPU.

I compiled the latest version of thermald from git (https://github.com/intel/thermal_daemon) and now everything works fine.

I started to thought that the problem was the hardware or BIOS problem, as I disabled the CPU scaling on the BIOS, but intel_pstate continued doing freq scaling (I think for Turbo mode).

But the real problem was Thermald. The latest version from git works really fine and it automatically disables and enable the Intel Turbo state and balance the freqs and fan control fine.

My hardware is a Lenovo Thinkpad P52 with i7-8850H.

I tested it using Ubuntu kernel and Kernel 4.20 optimized for i7 processor but with Ubuntu default config (except processor family="Core2/newer Xeon", Preemption Model="Preemptible Kernel (Low-Latency Desktop)" and Timer frequency="1000HZ". (Only changed those settings from make oldconfig and make deb-pkg).

ProblemType: Bug
DistroRelease: Ubuntu 18.10
Package: thermald (not installed)
ProcVersionSignature: Ubuntu 4.18.0-13.14-generic 4.18.17
Uname: Linux 4.18.0-13-generic x86_64
NonfreeKernelModules: nvidia_modeset nvidia
ApportVersion: 2.20.10-0ubuntu13.1
Architecture: amd64
CurrentDesktop: ubuntu:GNOME
Date: Mon Jan 14 23:45:01 2019
InstallationDate: Installed on 2018-12-11 (34 days ago)
InstallationMedia: Ubuntu 18.10 "Cosmic Cuttlefish" - Release amd64 (20181017.3)
SourcePackage: thermald
UpgradeStatus: No upgrade log present (probably fresh install)

-----------------------------------------

SRU Justification
==============

[Impact]
 * As described by the original bug reporter, CPU usage of Lenovo P52 is sub-optimal under heavy load.
 * My observation is the machine exhibits a sharp drop of power usage and CPU frequency and takes time to slowly ramp up again (refer to the chart at https://bit.ly/2OJphB8)
 * Fixed by bisecting and backporting fixes from thermald project.

[Test Case]
 * One can stress the CPU load of the machine and collect the CPU frequency and power usage over time to check for any anamoly. The script at https://people.canonical.com/~ypwong/p52_test_cpu.sh can help with this.
 * With the fix, the behaviour should be like this: https://bit.ly/2KA9EXB, a consistent power usage can be maintained.

[Regression Potential]
 * Medium. The fix consists of 7 commits cherry-picked from upstream, these changes will affect any machines using RAPL cooling device. The impact may not be obvious in normal daily usage but will be manifested during heavy load, suggest a longer verification period so that more people can discover any adverse effect.

Pablo Catalina (xkill) wrote :
Pablo Catalina (xkill) wrote :

Seems similar (but probably not the same): https://bugs.launchpad.net/ubuntu/+source/thermald/+bug/1769236

Changed in thermald (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in thermald (Ubuntu):
status: New → Confirmed
Anthony Wong (anthonywong) wrote :

debdiff against thermald 1.7.0-5ubuntu2 in Bionic.

The attachment "debdiff.patch" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Colin Ian King (colin-king) wrote :

I've sponsored this and uploaded it to -proposed.

description: updated
description: updated

Hello Pablo, or anyone else affected,

Accepted thermald into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.7.0-5ubuntu3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in thermald (Ubuntu Bionic):
status: New → Fix Committed
tags: added: verification-needed verification-needed-bionic
Brian Murray (brian-murray) wrote :

Let's use the "block-proposed" tag to indicate that this needs a longer period of time for verification. When you think its baked long enough please remove the tag.

tags: added: block-proposed
Colin Ian King (colin-king) wrote :

Looks like the i386 build failed:

g++ -DHAVE_CONFIG_H -I. -I./src -DTDLOCALEDIR=\"/usr/share/locale\" -DGLIB_SUPPORT -Wdate-time -D_FORTIFY_SOURCE=2 -I/usr/include/dbus-1.0 -I/usr/lib/i386-linux-gnu/dbus-1.0/include -I/usr/include/glib-2.0 -I/usr/lib/i386-linux-gnu/glib-2.0/include -DGLIB_VERSION_MIN_REQUIRED=GLIB_VERSION_2_26 -I/usr/include/libxml2 -DTDRUNDIR=\"/var/run/thermald\" -DTDCONFDIR=\"/etc/thermald\" -I src -Wreorder -Wsign-compare -Wreturn-type -Wunused-but-set-variable -Wformat -Wall -Wclobbered -Wempty-body -Wignored-qualifiers -Wmissing-field-initializers -Wtype-limits -Wuninitialized -Werror -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -c -o src/thermald-thd_zone.o `test -f 'src/thd_zone.cpp' || echo './'`src/thd_zone.cpp
In file included from /usr/include/glib-2.0/glib.h:62:0,
                 from src/thermald.h:62,
                 from src/thd_common.h:27,
                 from src/thd_zone.h:30,
                 from src/thd_zone.cpp:34:
src/thd_zone.cpp: In member function ‘void cthd_zone::sort_and_update_poll_trip()’:
/usr/include/glib-2.0/glib/gmessages.h:345:43: error: format ‘%lu’ expects argument of type ‘long unsigned int’, but argument 4 has type ‘std::vector<cthd_trip_point>::size_type {aka unsigned int}’ [-Werror=format=]
                                __VA_ARGS__)
                                           ^
src/thermald.h:73:24: note: in expansion of macro ‘g_debug’
 #define thd_log_debug g_debug
                        ^~~~~~~
src/thd_zone.cpp:131:2: note: in expansion of macro ‘thd_log_debug’
  thd_log_debug("sort_and_update_poll_trip: trip_points_size =%lu\n",
  ^~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
Makefile:839: recipe for target 'src/thermald-thd_zone.o' failed

@Anthony, I think you needed also to include:

commit bb7631163c8f3f44d0dc83690765cdb799664fd5
Author: Anuj Mittal <email address hidden>
Date: Wed Sep 26 10:34:15 2018 +0800

    Use correct format specifier for size_t

    %zu instead of %lu, otherwise on 32 bit:

    | ../git/src/thd_zone.cpp: In member function 'void cthd_zone::sort_and_update_poll_trip()':
    | ../git/src/thd_zone.cpp:106:16: error: format '%lu' expects argument of type 'long unsigned int', but argument 4 has type 'std::vector<cthd_trip_point>::size_type' {aka 'unsigned int'} [-Werror=format=]
    | thd_log_debug("sort_and_update_poll_trip: trip_points_size =%lu\n",
    | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    | trip_points.size());

    Signed-off-by: Anuj Mittal <email address hidden>

@Brain, can this be removed from -proposed and I'm hoping Antony can provide me a V2 that I can re-upload.

Colin Ian King (colin-king) wrote :

Oops, typos:

@Brian, can this be removed from -proposed and I'm hoping Anthony can provide me a V2 that I can re-upload.

tags: added: verification-failed
removed: verification-needed
Anthony Wong (anthonywong) wrote :

The commit pointed out by Colin can fix the build failure on i386. This is the new debdiff.

Hi,
Is this issue reproducible in the latest thermald 1.9 release? If yes, I want to fix ASAP.

Anthony Wong (anthonywong) wrote :

@Srinivas
It doesn't look too good on thermald 1.9 until dptfxtract is used.

* Result of thermald 1.9 only:
- https://docs.google.com/spreadsheets/d/1sSZe2gqfM8NWqaDyUfXx-Q2iGUkKvrREjJsobHfIZiQ/edit#gid=980670338

* Result of thermald 1.9 + dptfxtract 1.4.2:
- https://docs.google.com/spreadsheets/d/1sSZe2gqfM8NWqaDyUfXx-Q2iGUkKvrREjJsobHfIZiQ/edit#gid=1205697583

Correct.
I think you are using with 5.3 kernel or using workaround option.
Without dptfxtract output or user manually created thermal-conf.xml or
using aurogenerated, this is risky to play with power as the skin will
hit limit.

Thanks,
Srinivas

On Thu, 2019-08-15 at 16:28 +0000, Anthony Wong wrote:
> @Srinivas
> It doesn't look too good on thermald 1.9 until dptfxtract is used.
>
> * Result of thermald 1.9 only:
> -
> https://docs.google.com/spreadsheets/d/1sSZe2gqfM8NWqaDyUfXx-Q2iGUkKvrREjJsobHfIZiQ/edit#gid=980670338
>
> * Result of thermald 1.9 + dptfxtract 1.4.2:
> -
> https://docs.google.com/spreadsheets/d/1sSZe2gqfM8NWqaDyUfXx-Q2iGUkKvrREjJsobHfIZiQ/edit#gid=1205697583
>

Hello Pablo, or anyone else affected,

Accepted thermald into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.7.0-5ubuntu5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

tags: added: verification-needed
removed: verification-failed

That was tested on 4.15, which is the default kernel in ubuntu 18.04.
When tested on 5.3rc4, without dptfxtract seems to be good enough [1],
although after using dptfxtract, the system is more consistent [2].
So do you suggest users to always have dptfxtract installed? It shouldn't
make the system perform worse, am I right?

[1]
https://docs.google.com/spreadsheets/d/1sSZe2gqfM8NWqaDyUfXx-Q2iGUkKvrREjJsobHfIZiQ/edit#gid=1910979586
[2]
https://docs.google.com/spreadsheets/d/1sSZe2gqfM8NWqaDyUfXx-Q2iGUkKvrREjJsobHfIZiQ/edit#gid=907221455

On Fri, Aug 16, 2019 at 12:56 AM Srinivas Pandruvada <
<email address hidden>> wrote:

> Correct.
> I think you are using with 5.3 kernel or using workaround option.
> Without dptfxtract output or user manually created thermal-conf.xml or
> using aurogenerated, this is risky to play with power as the skin will
> hit limit.
>
> Thanks,
> Srinivas
>

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers