Fix CPU thermal sensors enumeration

Bug #2054391 reported by Kai-Heng Feng
12
This bug affects 1 person
Affects Status Importance Assigned to Milestone
HWE Next
New
Undecided
Unassigned
thermald (Ubuntu)
Fix Released
Undecided
Unassigned
Jammy
Fix Committed
Medium
Unassigned

Bug Description

[Impact]
Some CPU sensors are not enumerated, this can make thermald deviates from the correct behavior of the CPU TDP.

[Fix]
Traverse all sensors under hwmon sysfs directory to make sure everything is enumerated.

[Test]
Check the output of thermald. Once the fix is in place, thermal zones that are previously omitted now shows up:
[INFO]Zone 1: AMBF, Active:1 Bind:1 Sensor_cnt:1
To do so
0. get a large machine which will have more thermal zones
1. stop the potentially auto-running service
   $ systemctl stop thermald
2. run the daemon in foreground with loglevel to see what is going on.
   On many modern systemd (=the large ones) it might not know the CPUid,
   to bypass that for the test you can ask it to ignore the check
   $ sudo thermald --no-daemon --loglevel=info --ignore-cpuid-check
3. check the output
   On init the system will be probed and that will show something like:

...
 ZONE DUMP BEGIN
[1718954645][INFO]Zone 2: cpu, Active:1 Bind:0 Sensor_cnt:1
...
[1718954645][INFO]Zone 3: cpu, Active:1 Bind:0 Sensor_cnt:1
...
 ZONE DUMP END

In here, on systems with many thermal zones one would before the fix only see a few, and with the fix more zones.

[Where problems could occur]
Since the new logic traverse the whole hwmon sysfs, the startup time can take slightly longer.

[racb] Existing users' systems may have bad or otherwise irrelevant or out of scope sensors that may not have been causing misbehaviour due to being skipped, but after the fix, they would face a regression. I'm not sure that we can realistically identify such cases though, and it seems reasonable to favour correct systems over misbehaving ones.

[racb] Similar to my previous point, we may pick up additional sensor data that we shouldn't do due to inadequate filtering, causing incorrect behaviour but this time it would be a bug in our filtering rather than misbehaving existing systems. In mitigation, I see that the fixed version has been released in Kinetic, so has had some real world testing, and I see no indication upstream or in Launchpad that this was a problem in practice.

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

debdiff for Jammy.

Changed in thermald (Ubuntu):
status: New → Fix Released
Changed in thermald (Ubuntu Jammy):
status: New → Confirmed
importance: Undecided → Medium
tags: added: oem-priority originate-from-2050867 somerville
Revision history for this message
Lucas Kanashiro (lucaskanashiro) wrote :

Thanks for the patch @Kai-Heng! I took a look and in general it looks good, but I'd like to ask you to add some DEP-3 headers [1] to your patch, that will give us (packagers) more context when revisiting this package for whatever reason.

Now a personal opinion, I like when the patch file name is listed in the changelog entry ("* d/p/0016-Fixed-enumeration-of-cpu-thermal-sensors.patch: <what_this_is_fixing>"). It makes a direct link between the change listed there and the file actually fixing it.

Once you address my first comment at least (the second one is not mandatory), please subscribe ~ubuntu-sponsors again (I am unsubscribing it now).

[1] https://dep-team.pages.debian.net/deps/dep3/

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

DEP-3 headers and changelog entry added.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hi,
thanks for your update. I come by looking at the sponsoring queue and will have a look...

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I double checked if we also need to fix others e.g. mantic as often -dev releases in between are forgotten but they need to be fixed as well to avoid upgraders regressing.
But that looks fine

git tag --contains df8ac025d22e51c237c8ebe2aa25bf672ec04edb
v2.5
v2.5.0-pre
v2.5.1
v2.5.2
v2.5.3
v2.5.4
v2.5.5
v2.5.6
v2.5.7

vs

 thermald | 1.9.1-1build1 | focal | source, amd64
 thermald | 1.9.1-1ubuntu0.6 | focal-updates | source, amd64
 thermald | 1.9.1-1ubuntu0.7 | focal-proposed | source, amd64
 thermald | 2.4.9-1 | jammy | source, amd64
 thermald | 2.4.9-1ubuntu0.4 | jammy-updates | source, amd64
 thermald | 2.5.4-2 | mantic | source, amd64
 thermald | 2.5.6-2build2 | noble | source, amd64
 thermald | 2.5.7-3 | oracular | source, amd64

So anything >= mantic is already fine having 2.5.4 there.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The patch headers requested by Lucas are good as well now - thanks!
The changelog mention of the related patch is good too.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I checked upstream git if there are any follow up fixes to the changed code we should apply right away, there are none.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Local test build has been happy as well.

You test description is not wrong, but a bit too incomplete for what an SRU should have.
"Check the output of thermald" could use some "how exactly" and good/bad case examples.

I got a system in testflinger with a modern CPU hoping it would expose this issue to test and try to improve that to be more acceptable.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I had two systems, none had more than two thermal zones.
It was enough to outline the test a bit better, but not to show the good / bad case output.

But it gave me the chance to test and install the build I had which is all fine

$ sudo apt install ./thermald_2.4.9-1ubuntu0.5_amd64.deb
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
Note, selecting 'thermald' instead of './thermald_2.4.9-1ubuntu0.5_amd64.deb'
The following packages will be upgraded:
  thermald
1 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
Need to get 0 B/221 kB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 /home/ubuntu/thermald_2.4.9-1ubuntu0.5_amd64.deb thermald amd64 2.4.9-1ubuntu0.5 [221 kB]
(Reading database ... 106221 files and directories currently installed.)
Preparing to unpack .../thermald_2.4.9-1ubuntu0.5_amd64.deb ...
Unpacking thermald (2.4.9-1ubuntu0.5) over (2.4.9-1ubuntu0.4) ...
Setting up thermald (2.4.9-1ubuntu0.5) ...
Processing triggers for dbus (1.12.20-2ubuntu4.1) ...
Processing triggers for man-db (2.10.2-1) ...
Scanning processes...
Scanning processor microcode...
Scanning linux images...

Running kernel seems to be up-to-date.

The processor microcode seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

description: updated
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This all LGTM, sponsoring to jammy-unapproved

$ dput ubuntu ../thermald_2.4.9-1ubuntu0.5_source.changes
Uploading thermald using ftp to ubuntu (host: upload.ubuntu.com; directory: /ubuntu)
running gitubuntu: Warn if uploading without git-ubuntu Vcs-* entries.
running updatemaintainer: Stop if ubuntu changes are without ubuntu maintainer.
running gpg: check GnuPG signatures before the upload
running suite-mismatch: check the target distribution for common errors
running ppaforppaonly: Stop uploads to the archive with or to ppa without ~ppa suffix.
running nobug: Stop if uploading without any bug reference.
running releasemismatch: Warn about mismatching suffixesg e.g. focal with a XX.YY not being 20.04
running placeholderbug: Stop if using common placeholder numbers as bug reference.
running supported-distribution: check whether the target distribution is currently supported (using distro-info)
{'allowed': ['release', 'proposed', 'backports', 'security'], 'known': ['release', 'proposed', 'updates', 'backports', 'security']}
running check-debs: makes sure the upload contains a binary package
running required-fields: check whether a field is present and non-empty in the changes file
running checksum: verify checksums before uploading
running badauthor: Stop if uploading with root@ or ubuntu@ email adresses.
Uploading thermald_2.4.9-1ubuntu0.5.dsc
Uploading thermald_2.4.9-1ubuntu0.5.debian.tar.xz
Uploading thermald_2.4.9-1ubuntu0.5_source.buildinfo
Uploading thermald_2.4.9-1ubuntu0.5_source.changes

Up to the SRU team to have a second look and then probably to you @kaihengfeng to test it on a system of the right size for verification.

Robie Basak (racb)
description: updated
description: updated
description: updated
Revision history for this message
Robie Basak (racb) wrote :

[indent snipped]

> + std::string temp_dir_path = base_path[i] + entry->d_name
> + + "/";
> + DIR *temp_dir = nullptr;
> + struct dirent *temp_dir_entry = nullptr;
> + int len_temp_dir_entry = 0;
> + int len_input = strlen("_input");
> +
> + if ((temp_dir = opendir(temp_dir_path.c_str())) != NULL) {
> + while ((temp_dir_entry = readdir(temp_dir)) != NULL) {
> + len_temp_dir_entry = strlen(temp_dir_entry->d_name);
> + if ((len_temp_dir_entry >= len_input
> + && !strcmp(
> + temp_dir_entry->d_name
> + + len_temp_dir_entry
> + - len_input, "_input"))
> + && (!strncmp(temp_dir_entry->d_name, "temp",
> + strlen("temp")))) {

FTR, the suffix matching code here is...suboptimal, especially as std::string is being used elsewhere in the same code block. I don't see an error here, but the unnecessary complexity is a risk. I nearly rejected this asking for it to be rewritten, and it took me a disproportionately large amount of SRU review time to understand it as well as the previous suboptimal sensor_mask/mask obtuseness.

Revision history for this message
Robie Basak (racb) wrote :

Unsubscribing sponsors as there is nothing left to sponsor.

Changed in thermald (Ubuntu Jammy):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-jammy
Revision history for this message
Robie Basak (racb) wrote : Please test proposed package

Hello Kai-Heng, or anyone else affected,

Accepted thermald into jammy-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/2.4.9-1ubuntu0.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, what testing has been performed on the package and change the tag from verification-needed-jammy to verification-done-jammy. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-jammy. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance for helping!

N.B. The updated package will be released to -updates after the bug(s) fixed by this package have been verified and the package has been in -proposed for a minimum of 7 days.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.