pull in latest thermald bug fixes into thermald

Bug #1931565 reported by Colin Ian King
10
This bug affects 1 person
Affects Status Importance Assigned to Milestone
thermald (Ubuntu)
Fix Released
Medium
Colin Ian King
Focal
Fix Released
Undecided
Unassigned
Hirsute
Fix Released
Undecided
Unassigned

Bug Description

== SRU Justification [ HIRSUTE ] ==

The upstream thermald 2.4.6 has been released with some more bug fixes that are pertinent to H/W available in older releases such as Hirsute. These fix a variety of issues found on H/W in the field and such as over-throttling, handling alternate ACPI object names for B0D4, handling trip zones which may have wrong settings and disabling the legacy rapl cdev when rapl-mmio is available.

== The fixes ==

Pull in these fixes:

From 2.4.6:

commit 273a53a11da2a7302ad7a5bc7e3bf04f221ce4e2
Author: Srinivas Pandruvada <email address hidden>
Date: Mon Jun 7 16:47:47 2021 -0700

    Use Adaptive PPCC limits for RAPL MMIO

    Set the correct device name as RAPL-MSR so that RAPL-MMIO can
    also set the correct default power limits.

commit 2dd67300448fa4a2aa8f3e00ee5b604c73a1f7d9
Author: Srinivas Pandruvada <email address hidden>
Date: Mon Jun 7 16:45:14 2021 -0700

    Increase power limit for disabled RAPL-MMIO

    Increase 100W to 200W as some desktop platform already have limit
    more than 100W.

commit 3de1004a49d0d157573bbdc1097b2fbed056879f
Author: Srinivas Pandruvada <email address hidden>
Date: Sun Jun 6 19:19:15 2021 -0700

    Special case for default PSVT

    When there are no adaptive tables and only one default PSVT table
    is present with just one entry with MAX type. Add one additional
    entry as done for non default case.

From 2.4.5:

commit 301a89284e9d74a9a1f8315c1673a548dcacda8c
Author: Srinivas Pandruvada <email address hidden>
Date: Sat May 29 10:50:44 2021 -0700

    Set a very high RAPL MSR PL1 with --adaptive

    After upgrading Dell Latitude 5420, again noticed performance degradation.
    The PPCC power limit for MSR RAPL PL1 is reduced to 15W. Even though we
    disable MSR RAPL with --adaptive option, it is not getting disabled. So
    MSR RAPL limits still playing role.

    To fix that set a very high MSR RAPL PL1 limit so that it never causes
    throttling. All throttling with --adaptive option is done using RAPL-MMIO.

From 2.4.4:

commit ea4491971059259e46daad10ae850d3d530b02f2
Author: Srinivas Pandruvada <email address hidden>
Date: Thu Mar 11 10:06:15 2021 -0800

    Fix error for condition names

    The current code caps the max name as the last condition name,
    which is "Power_Slider". So any condition more than 56 will be
    printing error, with "Power_Slider" as condition name. For example
    for condition = 57:
    Unsupported condition 57 (Power_slider)

    This is confusing during debug, so print "UNKNOWN" for condition
    name 56.

commit 9115f2fb0b296a22a62908f2718ca873af2a452f
Author: Srinivas Pandruvada <email address hidden>
Date: Tue Mar 9 16:36:13 2021 -0800

    This is confusing during debug, so print "UNKNOWN" for condition
    name 56.

commit 9115f2fb0b296a22a62908f2718ca873af2a452f
Author: Srinivas Pandruvada <email address hidden>
Date: Tue Mar 9 16:36:13 2021 -0800

    Check for alternate names for B0D4 device

    B0D4 can be named as TCPU or B0D4. So search for both names
    if failed to find one.

commit 660ee6f1f6351e6c291c0699147231be402c2bb8
Author: Srinivas Pandruvada <email address hidden>
Date: Tue Mar 9 16:33:08 2021 -0800

    Delete all trips from zones before psvt install

    Initially zones has all the trips from sysfs, which may have wrong
    settings. Instead of deleting only for matched psvt zones, delete
    for all zones. In this way only zones which are in PSVT will be
    present.

commit 1ad03424f7f3d339521635f08377b323375b2747
Author: Srinivas Pandruvada <email address hidden>
Date: Tue Mar 9 11:36:27 2021 -0800

    Disable legacy rapl cdev when rapl-mmio is in use

    Explicitly disable legacy rapl based on MSR interface when rapl-mmio
    is in use. This will prevent PL1/PL2 power limit from MSR based rapl,
    which may not be the correct one.

commit 45832e16290fec7353513b1ce03533b73b18f0c6
Author: Colin Ian King <email address hidden>
Date: Thu Mar 18 11:22:38 2021 +0000

    Fix spelling mistakes found using codespell

    There are a handful of spelling mistakes in comments and literal
    strings found by codespell. Fix these.

    Signed-off-by: Colin Ian King <email address hidden>

== Test plan ==

Actually this is problematic as the changes affect different H/W in different ways and testing to touch all these changes requires full CPU exercising to try and trip thermal overrun.

test plan is as follows:

1. install -proposed thermald, run with debug logging enabled
2. run stress-ng --cpu 0 for a few hours on various H/W with air vents plugged to try and trigger thermal overrun and exercise thermald thermal throttling.
3. check the thermald logging to check for state change behaviour on thermal changes.

== Where problems could occur ==

These have been already tested on H/W in the field, for example:
https://bugs.launchpad.net/ubuntu/+source/thermald/+bug/1930422

however, all these fixes can alter the functionality of thermald for a range of platforms so the regressions potential is high. The set of changes in thermald are already in upstream thermald in Ubuntu Impish so these fixes will already have been exercised on the development release.

Issues can occur in:

1. H/W supporting RAPL MMIO
2. Devices with TCPU ACPI object names, this will now behave differently (correctly)
3. Systems with both RAPL cdev support and RAPL MMIO support will behave differently, cdev will now be (correctly) disabled.

Changed in thermald (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Colin Ian King (colin-king)
status: New → In Progress
description: updated
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Please test proposed package

Hello Colin, or anyone else affected,

Accepted thermald into hirsute-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/2.4.3-1ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-hirsute to verification-done-hirsute. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-hirsute. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Changed in thermald (Ubuntu Hirsute):
status: New → Fix Committed
tags: added: verification-needed verification-needed-hirsute
Revision history for this message
Colin Ian King (colin-king) wrote : Re: pull in latest thermald bug fixes into thermald 2.4.3

I've tested this on an older X220 with the air vents taped up while running stress-ng and glxgears for 1 hour:

stress-ng --cpu 1 --matrix 4 --iomix 1 --cache 1 --icache 1 --thermalstat 1 --vmstat 10 -t 1h -v

I enabled the full debug logging and was able to observe thermald throttling the CPU and changing CPU frequency when the CPU reached thermal trip zones. So passive cooling seems to work and the code didn't regress.

I'd like to get the HWE folk to test this on a more modern laptop to sanity check all the features in thermald with newer H/W.

Revision history for this message
Colin Ian King (colin-king) wrote :

Testing:

1. install -proposed thermald

2. stop thermald:

sudo systemctl stop thermald

3. capture stdout + stderr output using script command, run thermald manually to capture logging:

script
sudo thermald --no-daemon --loglevel=debug --ignore-cpuid-check

4. run stress-ng in another terminal for an hour or so:

stress-ng --cpu 1 --matrix 4 --iomix 1 --cache 1 --icache 1 --thermalstat 1 --vmstat 10 -t 1h -v

Note: The thermalstat 1 option in stress-ng will dump CPU thermal zones and CPU frequencies. This can show that thermald is performing passive thermal controlling of the machine.

5. when that completes, hit control-C on the terminal capturing output with the script command

6. attach the typescript to the bug report.

Revision history for this message
You-Sheng Yang (vicamo) wrote :

Tested on Intel ADL-S RVP running 5.13.0-1005-oem kernel.

Revision history for this message
Colin Ian King (colin-king) wrote :

Also tested on a Dell XPS 13 9380, looks good to me.

tags: added: verification-done verification-done-hirsute
removed: verification-needed verification-needed-hirsute
summary: - pull in latest thermald bug fixes into thermald 2.4.3
+ pull in latest thermald bug fixes into thermald
Revision history for this message
Colin Ian King (colin-king) wrote (last edit ):

I've taken the same patches and added these to the focal backport, it's essential the same set of changes. I've just uploaded these so they are waiting to go into -proposed for SRU testing for focal too.

Revision history for this message
Brian Murray (brian-murray) wrote :

Given the regression potential in the description I think we should let this age a bit longer in -proposed. Let's look at releasing this in another 7 days.

Changed in thermald (Ubuntu Focal):
status: New → Fix Committed
tags: added: verification-needed verification-needed-focal
removed: verification-done
Revision history for this message
Brian Murray (brian-murray) wrote : Please test proposed package

Hello Colin, or anyone else affected,

Accepted thermald into focal-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.9.1-1ubuntu0.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-focal to verification-done-focal. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-focal. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

Revision history for this message
Colin Ian King (colin-king) wrote (last edit ):

Tested thermald 1.9.1-1ubuntu0.5 on a Lenovo x220 and a Dell XPS 13 7390 with 2 hour thermal soak tests as documented in comment #3

I enabled the full debug logging and was able to observe thermald throttling the CPU and changing CPU frequency when the CPU reached thermal trip zones. So passive cooling seems to work and the code didn't regress.

For the x220 I taped up the air vents and ran the laptop on a non-conductive base to keep the heat in the machine and was able to observe thermald cycling the CPU speed at just below the thermal trip points even when the laptop was beyond the thermal design characterstics (e.g. fan not able to vent and full CPU loading).

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thermald - 2.4.3-1ubuntu1

---------------
thermald (2.4.3-1ubuntu1) hirsute; urgency=medium

  * Pull in bug fixes between 2.4.3 and 2.4.6 (LP: #1931565)
   [Colin Ian King]
   - Fix spelling mistakes found using codespell
   [Srinivas Pandruvada]
   - Disable legacy rapl cdev when rapl-mmio is in use
     This will prevent PL1/PL2 power limit from MSR based rapl, which
     may not be the correct one.
   - Delete all trips from zones before psvt install
     Initially zones has all the trips from sysfs, which may have wrong
     settings. Instead of deleting only for matched psvt zones, delete
     or all zones. In this way only zones which are in PSVT will be
     present.
   - Check for alternate names for B0D4 device
     B0D4 can be named as TCPU or B0D4. So search for both names
     if failed to find one.
   - Fix error for condition names
     The current code caps the max name as the last condition name,
     which is "Power_Slider". So any condition more than 56 will be
     printing error, with "Power_Slider" as condition name. For example
     for condition = 57: Unsupported condition 57 (Power_slider)
   - Set a very high RAPL MSR PL1 with --adaptive
     After upgrading Dell Latitude 5420, again noticed performance
     degradation.
     The PPCC power limit for MSR RAPL PL1 is reduced to 15W. Even though
     we disable MSR RAPL with --adaptive option, it is not getting
     disabled. So MSR RAPL limits still playing role.
     To fix that set a very high MSR RAPL PL1 limit so that it never
     causes throttling. All throttling with --adaptive option is done
     using RAPL-MMIO.
   - Special case for default PSVT
     When there are no adaptive tables and only one default PSVT table
     is present with just one entry with MAX type. Add one additional
     entry as done for non default case.
   - Increase power limit for disabled RAPL-MMIO
     Increase 100W to 200W as some desktop platform already have limit
     more than 100W.
   - Use Adaptive PPCC limits for RAPL MMIO
     Set the correct device name as RAPL-MSR so that RAPL-MMIO can
     also set the correct default power limits.

 -- Colin King <email address hidden> Thu, 10 Jun 2021 12:52:21 +0100

Changed in thermald (Ubuntu Hirsute):
status: Fix Committed → Fix Released
Revision history for this message
Łukasz Zemczak (sil2100) wrote : Update Released

The verification of the Stable Release Update for thermald has completed successfully and the package is now being released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Changed in thermald (Ubuntu Focal):
status: Fix Committed → Fix Released
tags: added: verification-done verification-done-focal
removed: verification-needed verification-needed-focal
Changed in thermald (Ubuntu):
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.