thermald 1.5-2ubuntu1 failures in ubuntu16.04

Bug #1582982 reported by SunBear on 2016-05-18
This bug affects 1 person
Affects            Status        Importance  Assigned to     Milestone
thermald (Ubuntu)  Fix Released  Low         Colin Ian King
Xenial             Fix Released  Low         Colin Ian King

Bug Description

[SRU][XENIAL] Justification:

thermald is reporting non-critical conditions as errors; these should be downgraded and reported as warnings so as not to alarm users.

[FIX]
Upstream commit: https://github.com/01org/thermal_daemon/commit/8280fd7ec6cff6db6463c8a1b01d2e427e418226

This is already in thermald in Yakkety and working fine.

[Regression Potential]
Minimal. This only changes the status levels of various error messages, so core thermald functionality is not touched at all by this fix.

----------------------------------------------------------------------------

Hi. I am still getting the same failure messages despite using the latest thermald version. Can you tell me how to solve these failures?

May 18 10:08:11 Eliot thermald[1012]: failed to open /dev/acpi_thermal_rel
May 18 10:08:11 Eliot thermald[1012]: failed to open /dev/acpi_thermal_rel
May 18 10:08:11 Eliot thermald[1012]: TRT/ART read failed
May 18 10:08:12 Eliot thermald[1012]: sysfs read failed constraint_0_max_power_uw
May 18 10:08:12 Eliot thermald[1012]: sysfs write failed trip_point_0_temp
May 18 10:08:12 Eliot thermald[1012]: sysfs write failed trip_point_0_temp

$ dpkg -l thermald
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================-=======================-=======================-===============================================================================
ii thermald 1.5-2ubuntu1 amd64 Thermal monitoring and controlling daemon

$ lsb_release -rd
Description: Ubuntu 16.04 LTS
Release: 16.04

$ apt-cache policy thermald
thermald:
  Installed: 1.5-2ubuntu1
  Candidate: 1.5-2ubuntu1
  Version table:
 *** 1.5-2ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1.5-2 500
        500 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages

Colin Ian King (colin-king) wrote :

May 18 10:08:11 Eliot thermald[1012]: failed to open /dev/acpi_thermal_rel
May 18 10:08:11 Eliot thermald[1012]: failed to open /dev/acpi_thermal_rel
May 18 10:08:11 Eliot thermald[1012]: TRT/ART read failed

These messages above occur because the device entries do not exist on your system, so they are just informing you that the ACPI TRT (Thermal Relationship Table) and ART (Active Relationship Table) cannot be read (probably because the corresponding ACPI _TRT and _ART objects do not exist on your machine). I believe these errors should occur just once during the lifetime of thermald and are just informing you that the capabilities do not exist.

May 18 10:08:12 Eliot thermald[1012]: sysfs read failed constraint_0_max_power_uw

The above error shows that the Maximum Power RAPL (Running Average Power Limit) interface in /sys/class/powercap/intel-rapl cannot be read. However, since no failure is reported for constraint_1_max_power_uw, the RAPL max power reading for that constraint has probably worked, so there is no need to worry about that message.
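As a rough way to see which RAPL constraints are readable, the sketch below walks the constraint files of one powercap zone. The default zone path (intel-rapl:0) and the function name show_rapl_max_power are assumptions for illustration; the layout on a given machine may differ.

```shell
# Print each RAPL constraint's max power, or note that it cannot be read.
# $1: powercap zone directory. The default below is an assumption about a
#     typical Intel system (package 0) and may not match every machine.
show_rapl_max_power() {
    zone=${1:-/sys/class/powercap/intel-rapl/intel-rapl:0}
    for f in "$zone"/constraint_*_max_power_uw; do
        [ -e "$f" ] || continue
        if value=$(cat "$f" 2>/dev/null); then
            echo "$(basename "$f"): $value uW"
        else
            echo "$(basename "$f"): read failed"
        fi
    done
}

# Usage: show_rapl_max_power
```

If constraint_0 reports a read failure while constraint_1 prints a value, that would be consistent with the explanation above.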

May 18 10:08:12 Eliot thermald[1012]: sysfs write failed trip_point_0_temp
May 18 10:08:12 Eliot thermald[1012]: sysfs write failed trip_point_0_temp

The above is of interest because the thermal zone trip point temperature setting can't be updated. I guess this message is going to be repeated many times.

From your bug report it is hard to determine whether the messages are being repeated many times or not. Can you run thermald for, say, 1 hour and attach the full log so we can see which messages are spamming the log?
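A quick way to check for repeats before attaching the log is to count how often each distinct thermald message occurs. This is only a sketch: the function name count_thermald_messages is made up here, and the log path in the usage line assumes the default Ubuntu syslog location.

```shell
# Count how often each distinct thermald message occurs in syslog-style input.
# Reads log lines on stdin; prints "<count> <message>", most frequent first.
count_thermald_messages() {
    grep 'thermald\[' |
        sed 's/^.*thermald\[[0-9]*]: //' |
        sort | uniq -c | sort -k1,1nr -k2
}

# Usage (log path assumed to be the Ubuntu default):
#   count_thermald_messages < /var/log/syslog
```

A message with a count that keeps growing over the hour is one that is spamming the log; a count that stays at one or two per boot is startup noise.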

Changed in thermald (Ubuntu):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Colin Ian King (colin-king)

As Colin suggested, the three errors related to _TRT and _ART will appear once during startup. I see many folks confused about this, so I downgraded this error in my tree.

https://github.com/01org/thermal_daemon/commit/8280fd7ec6cff6db6463c8a1b01d2e427e418226

Colin Ian King (colin-king) wrote :

I will pull that fix in on the next SRU cycle for thermald.

SunBear (sunbear-c22) wrote :

@Colin I am using an ASUS Z170M-Plus MB (https://www.asus.com/sg/Motherboards/Z170M-PLUS/specifications/). Its UEFI is equipped to perform active thermal management of the CPU and a few other locations on the MB. Is there a reason why the devices of TRT and ART are not detected? Are these tables required to monitor and control temperatures on my CPU and MB?

I observed that my syslog also has two kernel complaints about ACPI errors:
May 19 08:17:42 Eliot kernel: [ 0.016207] ACPI Error: [\_SB_.PCI0.XHC_.RHUB.HS11] Namespace lookup failure, AE_NOT_FOUND (20150930/dswload-210)
May 19 08:17:42 Eliot kernel: [ 0.021413] ACPI Error: 1 table load failures, 7 successful (20150930/tbxfload-214)
Do you think this is why /dev/acpi_thermal_rel is not created?

I did not observe the "sysfs write failed trip_point_0_temp" message repeating in the syslog.
$ cat /var/log/syslog | grep thermald
May 19 08:17:42 Eliot thermald[1086]: 22 CPUID levels; family:model:stepping 0x6:5e:3 (6:94:3)
May 19 08:17:42 Eliot thermald[1086]: Polling mode is enabled: 4
May 19 08:17:42 Eliot thermald[1086]: failed to open /dev/acpi_thermal_rel
May 19 08:17:42 Eliot thermald[1086]: failed to open /dev/acpi_thermal_rel
May 19 08:17:42 Eliot thermald[1086]: TRT/ART read failed
May 19 08:17:42 Eliot thermald[1086]: sysfs read failed constraint_0_max_power_uw
May 19 08:17:42 Eliot thermald[1086]: sysfs write failed trip_point_0_temp
May 19 08:17:42 Eliot thermald[1086]: sysfs write failed trip_point_0_temp
May 19 09:06:05 Eliot thermald[3898]: Couldn't get lock file 3898
May 19 09:06:05 Eliot thermald[3898]: An instance of thermald is already running, exiting ...

Other questions:
1. How do I verify that thermald is actually correctly managing the thermal state of my CPU and MB?
2. Does thermald override the UEFI thermal management control?

SunBear (sunbear-c22) wrote :

Hi Colin, to update you, I found out that the ACPI failures I mentioned in my earlier post concern the USB3 controller. Thus I think they are unrelated to my thermald issues.

Colin Ian King (colin-king) wrote :

"@Colin I am using an ASUS Z170M-Plus MB (https://www.asus.com/sg/Motherboards/Z170M-PLUS/specifications/). Its UEFI is equipped to perform active thermal management of the CPU and a few other locations on the MB."

I could not find the source of the information you are referring to about "UEFI is equipped to perform active thermal management of the CPU and a few other locations on the MB". Can you point me to that assertion?

"Is there a reason why the devices of TRT and ART are not detected? Are these tables required to monitor and control temperatures on my CPU and MB?"

_TRT and _ART are ACPI control objects, see sections 11.4.19 _TRT (Thermal Relationship Table) and 11.4.3 _ART (Active Cooling Relationship Table) of the ACPI specification, http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf

The kernel driver (drivers/thermal/int340x_thermal/acpi_thermal_rel.c) attempts to evaluate these objects and they fail to evaluate (most probably because they don't exist), hence these interfaces are not exposed in a working state to user space programs such as thermald. It is highly probable that the BIOS vendor just didn't implement these ACPI objects. If they did not, well, that's the way it is. Debugging and fixing the firmware is out of scope for this kind of bug report.

Hi Colin,

It's called Q-Fan. It comes with the ASUS BIOS.
https://www.youtube.com/watch?v=qIBLJ7nG3KM
Pardon me if the terminology I use is not correct.
Essentially, it has a wizard to tune/calibrate the speed (or voltage) of
the cooling fans as a function of temperature at different MB locations and
the CPU. They also supply ASUS Fan Xpert 2 for Windows.
https://www.youtube.com/watch?v=BdxlUEG1IZc. I think thermald's function
is similar to Q-Fan and Fan Xpert 2.

Questions:
1. Given that thermald cannot read _TRT and _ART, am I correct to say
that thermald is not managing the thermal state of my system?
2. Also, does this mean that thermal management in my system is only
performed by the ASUS UEFI/BIOS, with no active involvement from the OS and
Ubuntu?
3. Does this mean thermald is redundant on my system and I can
uninstall it without harming my system?

Appreciate your advice and help to answer these questions.

_TRT and _ART are optional. If they are absent, the algorithm falls back to its default of CPU passive control. The thermald controls are complementary to what the BIOS is doing. If for any reason the BIOS controls or the fan cannot bring the temperature down, thermald will act; this works because it reduces the CPU power consumption.
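To see what the kernel exposes for thermal monitoring on a given machine, one can list the thermal zones and their trip points from sysfs. This is an illustrative sketch only: the function name show_thermal_zones is made up, and which zones and trip points appear depends on the hardware and drivers. Temperatures in these files are in millidegrees Celsius.

```shell
# List each thermal zone with its type, current temperature and trip points.
# $1: base directory (defaults to the kernel's thermal sysfs class).
show_thermal_zones() {
    base=${1:-/sys/class/thermal}
    for zone in "$base"/thermal_zone*; do
        [ -d "$zone" ] || continue
        ztype=$(cat "$zone/type" 2>/dev/null || echo unknown)
        ztemp=$(cat "$zone/temp" 2>/dev/null || echo unreadable)
        echo "$(basename "$zone") type=$ztype temp_millic=$ztemp"
        for trip in "$zone"/trip_point_*_temp; do
            [ -e "$trip" ] || continue
            echo "  $(basename "$trip")=$(cat "$trip" 2>/dev/null || echo unreadable)"
        done
    done
}

# Usage: show_thermal_zones
```

Running it under load and again at idle gives a rough picture of whether temperatures stay below the listed trip points.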

Colin Ian King (colin-king) wrote :

@SunBear,

I've applied upstream commit https://github.com/01org/thermal_daemon/commit/8280fd7ec6cff6db6463c8a1b01d2e427e418226 (as mentioned in comment #2) and prepared thermald for testing.

https://launchpad.net/~colin-king/+archive/ubuntu/thermald-sru-1582982

to install this use:

sudo add-apt-repository ppa:colin-king/thermald-sru-1582982
sudo apt update
sudo apt upgrade

And let me know if that helps

Changed in thermald (Ubuntu):
importance: Medium → Low

SunBear (sunbear-c22) wrote :

I have done the above. However, this caused a major problem: I am unable to
boot my system. It is stuck at the initramfs prompt. I don't know how to
fix this and would really appreciate any help to recover my system and restore it
to its previous working condition.

I require urgent assistance.

SunBear (sunbear-c22) wrote :

@Colin
I have managed to recover my system by following the recovery steps at http://askubuntu.com/questions/137655/boot-drops-to-a-initramfs-prompts-busybox. Phew! :)

Here is my latest thermald boot log:
$ cat /var/log/syslog | grep thermald
May 29 23:55:20 Eliot thermald[1039]: 22 CPUID levels; family:model:stepping 0x6:5e:3 (6:94:3)
May 29 23:55:20 Eliot thermald[1039]: Polling mode is enabled: 4
May 29 23:55:20 Eliot thermald[1039]: sysfs read failed constraint_0_max_power_uw
May 29 23:55:20 Eliot thermald[1039]: sysfs write failed trip_point_0_temp
May 29 23:55:20 Eliot thermald[1039]: sysfs write failed trip_point_0_temp

It shows that your upstream commit downgrading the failures I reported earlier (shown below) was applied successfully. They no longer appear.
May 19 08:17:42 Eliot thermald[1086]: failed to open /dev/acpi_thermal_rel
May 19 08:17:42 Eliot thermald[1086]: failed to open /dev/acpi_thermal_rel
May 19 08:17:42 Eliot thermald[1086]: TRT/ART read failed

Colin Ian King (colin-king) wrote :

@SunBear, thanks for verifying this. I am not sure how installing a minor update to thermald is related to the machine not booting and requiring fixing. I've tested the same update on a couple of machines and did not get any issues, so perhaps your drive issues and the thermald update are unrelated.

description: updated

Hello SunBear, or anyone else affected,

Accepted thermald into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/thermald/1.5-2ubuntu2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us in getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!
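When reporting the version you tested, the installed version can be pulled out of the same apt-cache policy output shown in the bug description. The helper name installed_version is an assumption for illustration, not part of any Ubuntu tooling.

```shell
# Extract the installed version from `apt-cache policy <package>` output on stdin.
installed_version() {
    awk '$1 == "Installed:" { print $2 }'
}

# Usage: apt-cache policy thermald | installed_version
```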

Changed in thermald (Ubuntu Trusty):
assignee: nobody → Colin Ian King (colin-king)
Changed in thermald (Ubuntu):
status: In Progress → Fix Released
Changed in thermald (Ubuntu Trusty):
status: New → In Progress
tags: added: verification-needed
Changed in thermald (Ubuntu Xenial):
status: New → In Progress
no longer affects: thermald (Ubuntu Trusty)
Changed in thermald (Ubuntu Xenial):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → Low

Colin Ian King (colin-king) wrote :

I've tested this on a Lenovo x220 with the air vents covered up to force overheating. After running this for over 15 minutes I don't see any of the non-critical errors appearing, so this looks fixed to me.

tags: added: verification-done
removed: verification-needed

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package thermald - 1.5-2ubuntu2

---------------
thermald (1.5-2ubuntu2) xenial; urgency=medium

  * Downgrade errors to just info messages (LP: #1582982)
    - Thermald is reporting information at too high a logging
      level, so downgrade messages to info level to stop confusing
      users when thermald complains that hardware or ACPI specific
      interfaces are not available

 -- Colin King <email address hidden> Thu, 26 May 2016 17:10:43 +0100

Changed in thermald (Ubuntu Xenial):
status: In Progress → Fix Released

The verification of the Stable Release Update for thermald has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
