Shuts down when supposed to suspend as a reaction to self-caused overheat, session lost

Bug #1491797 reported by Harri K. Hiltunen
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Error:
Kernel foolishly shuts down the computer when it overheats because of the underlying bug "Overheat due to slow fans when on 'auto'"
( https://bugs.launchpad.net/ubuntu/+source/linux/+bug/751689 ).
/var/log/kern.log
W500 kernel: [1448.648529] thermal thermal_zone1: critical temperature reached (100 C), shutting down
(Hardware is ok: when the fan is forced to full speed, no overheating.)
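
To see how often this has happened, the critical-temperature lines can be pulled out of the kernel log. The following is a minimal sketch, assuming the message format shown in the kern.log line above; the function name thermal_events is made up for illustration.

```shell
#!/bin/sh
# thermal_events [FILE...]: print "<zone> <temperature in C>" for every
# critical thermal shutdown message found in the given log files (or stdin).
# Assumes the message format of the kern.log line quoted in this report.
thermal_events() {
    sed -n 's/.*thermal \(thermal_zone[0-9]*\): critical temperature reached (\([0-9]*\) C).*/\1 \2/p' "$@"
}

# Example (reading the log may require root):
#   thermal_events /var/log/kern.log
```

Each output line names the thermal zone and the temperature at which the kernel decided to shut down.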

Consequence:
Shutting down destroys session in Ubuntu, Gnome, and all applications that can't remember their latest conscious state (most applications).

Attempted repair, failed:
Laptop has suspending ability, but I can't find the setting for the kernel to make the computer suspend instead of shutting down.

Repair suggestions:
1. Persistence of session, so that everything would reappear after the restart. (this would also make updating less disruptive)
2. Do not heat the machine like crazy; speed up fans or slow down processes. (fix the slow fan bug affecting Lenovo Thinkpads and fix thermald failing to throttle CPU)
3. Put the computer to suspend when it's too hot.

The problem has remained the same from at least Ubuntu 11.10 through 14.04.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-62-generic 3.13.0-62.102
ProcVersionSignature: Ubuntu 3.13.0-62.102-generic 3.13.11-ckt24
Uname: Linux 3.13.0-62-generic x86_64
ApportVersion: 2.14.1-0ubuntu3.12
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: user 2171 F.... pulseaudio
CurrentDesktop: Unity
Date: Thu Sep 3 13:42:28 2015
HibernationDevice: RESUME=UUID=991e1383-ff5b-46c1-84c4-c904e1d81256
InstallationDate: Installed on 2013-12-29 (612 days ago)
InstallationMedia: Ubuntu 13.10 "Saucy Salamander" - Release amd64 (20131016.1)
MachineType: LENOVO 4063B22
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-62-generic root=UUID=bd426989-b545-41b3-97b8-de9410f27aa6 ro persistent quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-62-generic N/A
 linux-backports-modules-3.13.0-62-generic N/A
 linux-firmware 1.127.15
SourcePackage: linux
UpgradeStatus: Upgraded to trusty on 2014-04-27 (494 days ago)
dmi.bios.date: 12/14/2011
dmi.bios.vendor: LENOVO
dmi.bios.version: 6FET92WW (3.22 )
dmi.board.name: 4063B22
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr6FET92WW(3.22):bd12/14/2011:svnLENOVO:pn4063B22:pvrThinkPadW500:rvnLENOVO:rn4063B22:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 4063B22
dmi.product.version: ThinkPad W500
dmi.sys.vendor: LENOVO

Harri K. Hiltunen (harri-k-hiltunen) wrote :
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Harri K. Hiltunen (harri-k-hiltunen) wrote :

As recommended on irc/Freenode/#ubuntu-kernel:
-installed linux-generic-lts-vivid
-ensured that new kernel is running
-ensured that thermald is running
The problem remains the same: overheating with fan speed staying low (3500 out of 5000rpm) after several minutes at 95C.

description: updated
Joseph Salisbury (jsalisbury) wrote :

Have you tried cleaning the fans and vents to ensure they are free of dust?

Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.2 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.2-unstable/

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Harri K. Hiltunen (harri-k-hiltunen) wrote :

The hardware was disassembled and everything was fine. Fan kept speed well after blowing into it.
I drilled extra holes in the case bottom to aid air flow.

Harri K. Hiltunen (harri-k-hiltunen) wrote :

Installed kernel Linux 4.2.0-040200-generic.
Ensured it was running.
Problem remains the same: overheating with fan speed staying low (up to 3600 out of 5000rpm) after several minutes at 90-96C.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-bug-exists-upstream
Colin Ian King (colin-king) wrote :

I'd like to get a profile of your machine's CPU clock speed, CPU utilization and thermal zone temperatures. Can you install the latest powerstat and run a profile for me? To do so, use:

sudo add-apt-repository ppa:colin-king/white
sudo apt-get update
sudo apt-get install powerstat

then run:

powerstat -Da 1 60

and attach the output to the bug report. Thanks!

Colin Ian King (colin-king) wrote :

This model of ThinkPad has 8 levels of fan control:
  0   = off
  1-2 = 1900 RPM
  3-5 = ~3000 RPM
  6-7 = ~3500 RPM
and a disengaged mode that works at ~5100 RPM

The 6-7 level at ~3500 RPM should be enough to dissipate the heat generated by a loaded CPU. The fact that your fan is running at 3500 RPM indicates that the fan control is correctly enabling the highest fan control level (6-7), and even that is not enough to dump all the heat out of the laptop. Also, thermald should be actively throttling back the CPU as a passive mode control, so this should also help reduce the overheating. The fact that this seems to be occurring across a wide range of kernel versions suggests to me that perhaps this is hardware related; for example, is the thermal paste between the CPU and the thermal pipe/fan unit working correctly?

Harri K. Hiltunen (harri-k-hiltunen) wrote :

running
"powerstat -Da 1 60"
returns
"Device does not have any RAPL domains, cannot power measure power usage."

Harri K. Hiltunen (harri-k-hiltunen) wrote :

I might renew the cooler paste some day.

Why isn't the fan being run in disengaged mode at 5100 RPM by the kernel in a thermal emergency? Surely that would be easier to do than to have every Lenovo W500's cooler pastes renewed?

Elias Aarnio (elias-aarnio) wrote :

I also get "Device does not have any RAPL domains, cannot power measure power usage."

Thinkpad X201, Ubuntu 14.04 LTS.

Colin Ian King (colin-king) wrote :

OK, no RAPL interface, so can you run "powerstat -za 1 480" instead?

The default mode for the fan is to be controlled by the firmware and not the kernel, so the kernel has no direct control by default. The alternative mechanism is to enable the thinkpad_acpi fan control and twiddle the settings either manually or by software control. By default though, one would expect the firmware to be able to control the fan correctly, since that is what the system designers intended to be the default fan control mode.

If you do intend to try twiddling the fan controls manually, I believe the following instructions may work (but I've not tried these myself, so I can't vouch that they are correct):

as root, create a new file /etc/modprobe.d/thinkpad_acpi.conf and add the following line to it:

options thinkpad_acpi fan_control=1

..and reboot.

as root try the following to set the fan to the highest "engaged" fan speed:

echo level 7 > /proc/acpi/ibm/fan

..hopefully that will crank the fan up to ~3500 RPM in engaged mode.

You can enable disengaged mode using:

echo level disengaged > /proc/acpi/ibm/fan

Essentially, "engaged" mode is where the fan speed is locked to a defined fan speed. "disengaged" mode can be used to drive the fan faster (I believe there is no feedback loop between the speed and a firmware control that sets the speed in disengaged mode). For more details, see http://www.thinkwiki.org/wiki/How_to_control_fan_speed
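
The manual steps above could be wrapped in a couple of small shell helpers. This is only a sketch: fan_field and set_fan_level are hypothetical names, the status/speed/level field layout of /proc/acpi/ibm/fan is assumed from the thinkpad_acpi driver documentation, and writing a level requires root and fan_control=1 as described above.

```shell
#!/bin/sh
# fan_field FIELD [FILE]: print the value of FIELD ("status", "speed" or
# "level") as reported by the fan interface. FILE defaults to the real
# procfs node; the "name:<tabs>value" layout is an assumption.
fan_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }' "${2:-/proc/acpi/ibm/fan}"
}

# set_fan_level LEVEL [FILE]: validate LEVEL, then write "level LEVEL".
# Accepts the engaged levels 0-7 plus auto, disengaged and full-speed.
set_fan_level() {
    case "$1" in
        [0-7]|auto|disengaged|full-speed) ;;
        *) echo "invalid fan level: $1" >&2; return 1 ;;
    esac
    echo "level $1" > "${2:-/proc/acpi/ibm/fan}"
}

# Example (as root):
#   set_fan_level disengaged && sleep 10 && fan_field speed
```

Passing a different FILE argument makes it possible to try the helpers against a scratch file before touching the real fan interface.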

Harri K. Hiltunen (harri-k-hiltunen) wrote :

If thermald is supposed to throttle CPU according to temperature, it doesn't work.

Whatever is throttling the CPU seems to be unaware of temperature, because the more I load the machine, the more reliably "cpufreq-info -w" returns "2801000" while temperature is at 90-95 C.

Only when there is less load, does the CPU run at lower frequencies.
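
One way to check this observation is to log temperature and CPU frequency side by side. A minimal sketch, assuming the standard sysfs files for thermal zones (millidegrees Celsius) and cpufreq (kHz); the function name sample and the choice of thermal_zone1 and cpu0 are assumptions for illustration.

```shell
#!/bin/sh
# sample TEMP_FILE FREQ_FILE: print one "<deg C> <MHz>" pair.
# TEMP_FILE holds millidegrees Celsius, FREQ_FILE holds kHz.
sample() {
    temp_mc=$(cat "$1")
    freq_khz=$(cat "$2")
    echo "$((temp_mc / 1000)) $((freq_khz / 1000))"
}

# Example: one reading per second while the machine is under load.
#   while sleep 1; do
#       sample /sys/class/thermal/thermal_zone1/temp \
#              /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
#   done
```

If throttling were reacting to temperature, the second column should drop as the first approaches the passive trip point.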

Colin Ian King (colin-king) wrote :

We can examine the thermald issues later; let's get some idea of what powerstat says and then I can ponder the next best step to take.

Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
status: Confirmed → In Progress
Harri K. Hiltunen (harri-k-hiltunen) wrote :

Attachment from "powerstat -za 1 480".

Colin Ian King (colin-king) wrote :

Attached is a spreadsheet with the above data and some graphs.

The first graph compares normalized data from the ACPI thermal zones (acpiz), CPU freq and CPU utilization. This shows a good correlation between CPU utilization and temperature (as we would expect). CPU frequencies are scaling according to the load and look OK to me.

The second graph is a comparison of ACPI thermal zones and CPU utilization in absolute terms (e.g. degrees C and CPU %). Note that the CPU load is never less than 20% and that at this level the temperature seems to be levelling out at around 70 degrees C, which seems rather high. The CPU is bouncing around 800 MHz - 1.8 GHz at that point, which I guess is to be expected if one has so many context switches and IRQs occurring.

So. Some interesting data:

1. The IRQ rate does seem high.
2. The Context switch rate seems high too.
3. The machine is not that idle.
4. Even at a low-ish load, the system is rather hot. (This makes me wonder if there is something physically wrong between the CPU and the heatpipe/fan unit.)
5. Dropping from fully loaded to partially loaded we see heat being dissipated but not that well, e.g. it drops from nearly 100 C down to 70-65 C, which is surprising and I expected this to go lower.

Hence I think it may be a physical hardware issue with the cooling.

Harri K. Hiltunen (harri-k-hiltunen) wrote :

I can't see how that information could be helpful in fixing the problem. It doesn't matter whether there's a common hardware ailment involved in this particular overheating or not. The software is acting like an idiot and wasting all its chances to remedy the situation:
-fan not running at full speed in disengaged mode in a thermal emergency
-CPU not throttling in a thermal emergency (unless the frequency readings are wrong)
-shutting down when supposed to suspend as a reaction to overheat, unnecessarily destroying session
-destroying session in a shut down/restart cycle (I heard rumours this may be fixed later in Snappy with containers)

Colin Ian King (colin-king) wrote :

Let's take this one point at a time:
* fan not running at full speed in disengaged mode in a thermal emergency
  - as mentioned earlier, the default fan mode on the machine is to run under firmware control, in which case it runs in engaged mode with a feedback loop controller so it never exceeds a top speed of 3500 RPM. This matches the original thermal design by the manufacturer. So either they made a mistake and all machines like yours overheat (and we would see lots of owners with your machine reporting this bug) or this issue is particular to your machine.

* CPU not throttling in a thermal emergency (unless the frequency readings are wrong)
  - that needs investigation as thermald should be doing that (but as I mentioned earlier, I will examine the thermald issues later)

* shutting down when supposed to suspend as a reaction to overheat, unnecessarily destroying session
  - when a critical thermal event occurs one has a very short time window to react. Potentially the silicon may be permanently damaged, so the kernel chooses to power down rather than try to suspend (since this can get stuck and exacerbate the issue). Without the handling of this thermal event, the next step is for the hardware to physically shut itself down which is out of any form of operating system control, so either way, the machine is desperately trying to save itself from breaking.

* destroying session in a shut down/restart cycle (I heard rumours this may be fixed later in Snappy with containers)
  - again, in a rush to save your silicon from becoming irreparably damaged shutdown is the fastest mechanism. Snappy containers will not help.

I'd recommend reading https://en.wikipedia.org/wiki/Thermal_design_power, there is a paragraph that states:
"Most modern processors will cause a therm-trip only upon a catastrophic cooling failure, such as a no longer operational fan or an incorrectly mounted heatsink."

So, the next step will be to see if we can see what thermald is doing.

1. Stop thermald so we can re-enable it with full debug on:

sudo systemctl stop thermald (if you are using systemd)

or

sudo service thermald stop (if you are using upstart)

2. Run thermald for a while from the command line and capture debug output:

sudo thermald --no-daemon --dbus-enable --loglevel=debug | tee thermald.log

..run this say for 5-10 minutes and use your machine, then attach the thermald.log to the bug report

3. Restart thermald:

sudo systemctl start thermald (if you are using systemd)

or

sudo service thermald start (if you are using upstart)

Harri K. Hiltunen (harri-k-hiltunen) wrote :

Attachment from "sudo thermald --no-daemon --dbus-enable --loglevel=debug | tee thermald.log".

Harri K. Hiltunen (harri-k-hiltunen) wrote :

Responses to the points above I disagree with:
* fan not running at full speed in disengaged mode in a thermal emergency
"... so it never exceeds a top speed of 3500 RPM. This matches the original thermal design by the manufacturer."
-When reality acutely requires full speed, it is madness to refer to original designs.

"... we would see lots of owners with your machine reporting this bug..."
-Where are the 100 automated error reports of my previous overheat crashes? I guess nowhere, because Apport doesn't catch them - it is not considered an error to shut down in a self-caused thermal emergency.
I procrastinated filing this bug report for 3 years, because "surely someone will notice all these crashes any day now". Then it took many days of persistent work to find out how to file a kernel bug report using Apport, because there is no such option in the menu; you have to ask Ubuntu support to find out the trick. It is too difficult, so don't expect people to do it. How many users even know how to launch Apport? There is no "Report a bug" in Gnome menu.

* shutting down when supposed to suspend as a reaction to overheat, unnecessarily destroying session
"when a critical thermal event occurs one has a very short time window to react. ... the machine is desperately trying to save itself from breaking."
-Then why not suspend in 1 second, but instead shut down in 8 seconds?

Colin Ian King (colin-king) wrote :

From my understanding, I can see that thermald is detecting the passive trip point temperature being reached at 75 degrees C and then it attempts to move to cpufreq index position 3; so it seems to be trying to do some passive CPU frequency switching.

Sensor hwmon :temp 76000
update_set_point 76000,0,98000
pref 0 type 1 temp 76000 trip 99000
Passive Trip point applicable
Trip point applicable < 0:99000
cdev size for this trippoint 2
cdev at index 0:Processor
Need to switch to next cdev
cdev at index 3:cpufreq
Need to switch to next cdev
pref 0 type 2 temp 76000 trip 95000
Passive Trip point applicable
Trip point applicable < 1:95000
cdev size for this trippoint 2
cdev at index 0:Processor
Need to switch to next cdev
cdev at index 3:cpufreq
Need to switch to next cdev

What is really interesting is that your firmware is configured to do passive (i.e. non-fan) cooling strategies at 75 degrees C. In the earlier analysis we saw that even at 20% CPU load your machine is close to this temperature. So, we can conclude that a slightly busy CPU is already in the thermal danger zone and the OS is being told to start attempting passive cooling strategies.

Key facts we can draw from this:

At 20% utilization your machine is already moving into the zone where the designers of the machine believe that fan control is not sufficient and hence is triggering passive cooling strategies.

That really tells me that your machine is having some cooling issues between the CPU and the fan. This implies that one should check that there is sufficient thermal paste between the CPU and the heat pipe.
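
The trip temperatures that thermald sees can be pulled out of such a debug log. A small sketch, assuming the "pref ... temp ... trip ..." line format shown in the excerpt above (values in millidegrees Celsius); trip_points is a hypothetical helper name.

```shell
#!/bin/sh
# trip_points [FILE...]: list the distinct trip temperatures (deg C) that
# appear on "pref ... trip <millidegrees>" lines of a thermald debug log.
# Reads stdin when no file is given.
trip_points() {
    awk '/^pref .* trip [0-9]+$/ { print $NF / 1000 }' "$@" | sort -n -u
}

# Example:
#   trip_points thermald.log
```

Comparing this list with the sensor readings in the same log shows how close the machine runs to each trip point.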

Harri K. Hiltunen (harri-k-hiltunen) wrote :

The thermald throttling malfunctioning is interesting.

Yes, I believe my laptop is among the several % of portables having cooling issues due to old age and/or on-the-edge design. Like I said, it's a common ailment. Just google "ubuntu laptop overheating" to find out how common. What are we supposed to do to fix that? Approach thermal grease manufacturers about the short life of their products? Solving my problem by overhauling my computer is not fixing the problem; it's a cover-up.

Now how about fixing all the ways the kernel is behaving badly in all those overheating machines?

Things for software to do in an overheating case (from the original bug report):
-use the fan up to its maximum speed (not up to what a designer years ago assumed would probably be enough)
-throttle the CPU (thermald troubleshooting in progress)
-in a heat emergency, suspend in 1 second instead of shutting down wasting 8 seconds and my session

Colin Ian King (colin-king) wrote :

* use the fan up to its maximum speed (not up to what a designer years ago assumed would probably be enough)
  - see comment #13 for some guidelines. As mentioned before, fan control by default is under firmware control. So, one will have to enable the thinkpad fan control manually to adjust fan settings outside of firmware control. Have you tried these yet?

* in a heat emergency, suspend in 1 second instead of shutting down wasting 8 seconds and my session
   No. The system has had a thermal overrun event that explicitly states to the machine "shut down" because the silicon is at risk from thermal meltdown. This is the policy and will always be the policy.

Colin Ian King (colin-king) wrote :

Can you accept the fact that even on a low CPU load, your machine is overheating? This looks like a H/W issue. Software can patch up the issue, but the reality is that I honestly believe that the problem needs to be fixed by examining the physical aspects of malfunctioning heat extraction from your laptop. Working around this problem is akin to saying "my house is on fire, can you fix it by opening a few windows in the house?".

Harri K. Hiltunen (harri-k-hiltunen) wrote :

I accepted it being a common hardware issue many messages up.

Which is easier; fixing the kernel to be non-destructive for these machines, or physically overhauling all overheating laptops running Linux?

The "opening windows in a house fire" parable is inaccurate. What I'm suggesting for the house fire is "automatically open all water taps for cooling and to alert people, maximize ventilation to remove smoke, and text all phones in the approximate GPS location of the house with the warning: detected house fire growing at #address - vacate the building immediately bringing your relatives and valuables outside with you".

Colin Ian King (colin-king) wrote :

It is easier if you first check that your specific machine is not the root issue, then we can consider the wider issue.

Harri K. Hiltunen (harri-k-hiltunen) wrote :

It doesn't matter whether this machine has the common old-age ailment of dried thermal grease or not. It's part of normal ageing, so Linux needs to be able to handle it in a non-destructive way.

Now Linux is being bullheadedly intolerant to common ailments. It's like a doctor letting an old patient suffer from an indefinitely drug-relievable illness and saying "no, you must buy an expensive life-threatening surgery to correct the underlying root cause". (This parable is apt, because in my computer one insert brass nut has popped off the case and is now spinning in its cavity with the screw, so I would have to do something drastic to even reach the thermal grease.)

What kind of an attitude is that to the continual improvement of Linux?
https://en.wikipedia.org/wiki/Continual_improvement_process
Will it ever be smart if you avoid making it smarter? Or am I mistaken about the goal of Linux development?

Even my car from 1995 (Nissan) is smarter than this: in the case of overheating, it can protect itself gracefully in many ways; it doesn't kill the engine demanding immediate cooling system repair - it starts doing every possible cooling action and avoiding various heating actions in the order of least annoyance to the user. Then it informs the user about the heat problem, so that they can help. Then it files an error report about having overheated. Quite a lot better than Linux in 2015.

The concept of "failing safely" should be applied here.
https://en.wikipedia.org/wiki/Fail-safe
Users of these ailing (or just dusty) computers are now suffering from abrupt forced shut downs when they could easily be given lesser evils (noisier fan, slower processing, suspending for a bit every once in a while).

All the opportunities to remedy the situation should be taken, because as nothing always works, a low number of actions taken will more likely fail in someone's computer.

Colin Ian King (colin-king) wrote :

I therefore suggest speaking to Lenovo about this.

Changed in linux (Ubuntu):
status: In Progress → Incomplete
assignee: Colin Ian King (colin-king) → nobody
Colin Ian King (colin-king) wrote :

Or, try fan control as I suggested earlier. See comment #13 for some guidelines.

Harri K. Hiltunen (harri-k-hiltunen) wrote :

Speaking to Lenovo won't automatically fix the issue, because it is way harder to get people to install bios updates than to accept kernel updates.

Fan control disengaged mode works: the fan runs at 4700 RPM, which keeps the temperature down at 84 C after several minutes at full load. Unfortunately, in disengaged mode the fan always runs at 4700 RPM, no matter what the temperature or load.

So, make the system switch between fan control modes according to temperature and its growth rate to narrowly avoid hitting 100 C (it needs to be done on many affected Lenovos).
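
As a rough illustration of what such switching could look like in user space, here is a sketch with simple hysteresis. The thresholds HOT and COOL, the function name next_level, and the thermal_zone1 path are all assumptions chosen for illustration; the sketch ignores the temperature growth rate and is not something the kernel or thermald is known to do.

```shell
#!/bin/sh
# Sketch only: the thresholds are illustrative assumptions, not tested values.
HOT=90    # switch to disengaged at or above this temperature (deg C)
COOL=80   # hand control back to the firmware at or below this (deg C)

# next_level CURRENT_LEVEL TEMP_C: print the fan level to use next.
next_level() {
    if [ "$2" -ge "$HOT" ]; then
        echo disengaged
    elif [ "$2" -le "$COOL" ]; then
        echo auto
    else
        echo "$1"    # between the thresholds: keep the current mode
    fi
}

# Example polling loop (as root, with thinkpad_acpi fan_control=1):
#   level=auto
#   while sleep 5; do
#       t=$(( $(cat /sys/class/thermal/thermal_zone1/temp) / 1000 ))
#       new=$(next_level "$level" "$t")
#       [ "$new" != "$level" ] && echo "level $new" > /proc/acpi/ibm/fan
#       level=$new
#   done
```

The gap between the two thresholds keeps the fan from flapping between modes when the temperature hovers near a single cut-off.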

description: updated
Colin Ian King (colin-king) wrote :

So it clearly appears that the fan in disengaged mode can't keep the CPU under the first thermal trip level of 75 degrees C.

1. In this scenario, the first thermal trip level is basically saying "start using passive cooling strategies to keep CPU cool". This implies throttling back the CPU, for example CPU freq scaling or P-state limiting.

2. This *clearly* indicates that something is broken at the hardware level as I have pointed out numerous times.

Summary:

Even with fan running in disengaged mode the fan cannot get the machine below the passive trip zone level.
Hardware is clearly broken.
Not a software fix issue.
Won't Fix.

Changed in linux (Ubuntu):
status: Incomplete → Won't Fix
Changed in linux (Ubuntu):
importance: High → Wishlist