5.3.0-23-generic causes fans to spin when idle

Bug #1853044 reported by Dean Henrichsmeyer on 2019-11-18
This bug affects 11 people
Affects          Importance  Assigned to
linux (Ubuntu)   High        Colin Ian King
Eoan             High        Seth Forshee
Focal            High        Colin Ian King

Bug Description

SRU Justification

Impact: "drm/i915/gen8+: Add RC6 CTX corruption WA" makes it effectively impossible to enter RC6 for some Intel GPUs when a display server is running. This results in increased energy usage and temperature.

Fix: Upstream has changed too much to make the fixes there applicable, but Chris Wilson has provided a patch for 5.3, here: https://gitlab.freedesktop.org/drm/intel/issues/614#note_366057

Test Case: See test results below. On an affected machine powertop will show increased RC6 usage on an idle desktop after the patch, and other symptoms of excessive power use should also subside.
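
The RC6 check in the test case can also be scripted instead of read from powertop. The following is a minimal sketch, assuming the i915 driver exposes its RC6 residency counter at /sys/class/drm/card0/power/rc6_residency_ms (the path and card index are assumptions and differ between kernels and machines); rc6_percent and sample_rc6 are hypothetical helper names.

```python
import time
from pathlib import Path

# Assumed sysfs location of the i915 RC6 residency counter (milliseconds).
# The card index and path may differ on your kernel or machine.
RC6_PATH = Path("/sys/class/drm/card0/power/rc6_residency_ms")


def rc6_percent(before_ms: int, after_ms: int, interval_s: float) -> float:
    """Return the share of the interval spent in RC6, as a percentage."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return 100.0 * (after_ms - before_ms) / (interval_s * 1000.0)


def sample_rc6(interval_s: float = 10.0) -> float:
    """Read the counter twice, interval_s apart, and return RC6 residency in %."""
    before = int(RC6_PATH.read_text())
    start = time.monotonic()
    time.sleep(interval_s)
    after = int(RC6_PATH.read_text())
    elapsed = time.monotonic() - start
    return rc6_percent(before, after, elapsed)


if __name__ == "__main__":
    if RC6_PATH.exists():
        print(f"RC6 residency: {sample_rc6():.1f}%")
    else:
        print(f"{RC6_PATH} not found; adjust the path for this system")
```

On an idle desktop, a value near zero would match the regression described here, while a high value would match the patched behaviour.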

Regression Potential: The 5.3 patch is pretty straightforward, and only adds running of an existing delayed worker to retire GPU requests. No regressions have been reported in testing so far.

---

After upgrading to 5.3.0-23-generic the fans in my machine don't stop running. They always sound like something is utilizing CPU - even with no applications running after boot.

If I boot back to 5.3.0-19-generic it's fine.

My microcode version is reported as 0xd4 and iucode-tool reports:

iucode-tool: system has processor(s) with signature 0x000506e3
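
For reference, the signature printed by iucode-tool is the raw CPUID signature, which packs family, model and stepping into bit fields. The sketch below decodes it using the standard x86 CPUID layout; decode_signature is a hypothetical helper name and not part of iucode-tool.

```python
def decode_signature(sig: int) -> tuple[int, int, int]:
    """Split an x86 CPUID signature into (family, model, stepping)."""
    stepping = sig & 0xF
    model = (sig >> 4) & 0xF
    family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF
    # The extended model field only applies to families 6 and 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    # The extended family field only applies to family 15.
    if family == 0xF:
        family += ext_family
    return family, model, stepping


family, model, stepping = decode_signature(0x000506E3)
print(f"family {family}, model {model:#x}, stepping {stepping}")
# prints "family 6, model 0x5e, stepping 3"
```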

Let me know if you need anything else.

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-23-generic 5.3.0-23.25
ProcVersionSignature: Ubuntu 5.3.0-23.25-generic 5.3.7
Uname: Linux 5.3.0-23-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu8.2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC2: dean 2898 F.... pulseaudio
 /dev/snd/pcmC2D0p: dean 2898 F...m pulseaudio
 /dev/snd/controlC0: dean 2898 F.... pulseaudio
 /dev/snd/controlC1: dean 2898 F.... pulseaudio
CurrentDesktop: ubuntu:GNOME
Date: Mon Nov 18 13:03:34 2019
HibernationDevice: RESUME=UUID=55a42c82-50bf-4e75-a133-dbd3aa93611b
InstallationDate: Installed on 2018-07-24 (482 days ago)
InstallationMedia: Ubuntu 18.04.1 LTS "Bionic Beaver" - Release amd64 (20180724)
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 i915drmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-5.3.0-23-generic root=/dev/mapper/ubuntu--vg-root ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-23-generic N/A
 linux-backports-modules-5.3.0-23-generic N/A
 linux-firmware 1.183.2
SourcePackage: linux
UpgradeStatus: Upgraded to eoan on 2019-07-19 (121 days ago)
dmi.bios.date: 05/16/2018
dmi.bios.vendor: Intel Corp.
dmi.bios.version: KYSKLi70.86A.0055.2018.0516.1629
dmi.board.name: NUC6i7KYB
dmi.board.vendor: Intel Corporation
dmi.board.version: H90766-406
dmi.chassis.type: 3
dmi.chassis.vendor: Intel Corporation
dmi.chassis.version: 1.0
dmi.modalias: dmi:bvnIntelCorp.:bvrKYSKLi70.86A.0055.2018.0516.1629:bd05/16/2018:svn:pn:pvr:rvnIntelCorporation:rnNUC6i7KYB:rvrH90766-406:cvnIntelCorporation:ct3:cvr1.0:

Dean Henrichsmeyer (dean) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: nobody → Colin Ian King (colin-king)
status: Confirmed → In Progress
Colin Ian King (colin-king) wrote :

Hi Dean,

As a first triaging step, do you mind installing powerstat and running the following command on both the 5.3.0-23-generic and the 5.3.0-19-generic kernels:

powerstat -Ra | tee powerstat-$(uname -r).log

and attaching the log files to the bug report. The command takes about 60 seconds to run.

thanks.

Colin Ian King (colin-king) wrote :

Also, when the fan is running at high speed can you do the following:

sudo apt-get install acpi
acpi -V

and add the output to the bug report

Colin Ian King (colin-king) wrote :

I've found 3 commits that may have contributed to this regression. Can you install the kernel headers, image and module debs from https://kernel.ubuntu.com/~cking/lp-1853044/ and see if this helps fix the issue?

Changed in linux (Ubuntu):
status: In Progress → Incomplete
Dean Henrichsmeyer (dean) wrote :

Attached are the logs from all three kernels. I don't think the test kernel made much difference.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Colin Ian King (colin-king) wrote :

CPU averages:
   5.3.0-19: 2.03W, 99.6% idle, 0.1% in kernel, 93.5% in C10 state, 5.3% in C8 state, 1.66GHz
   5.3.0-23: 13.71W, 99.3% idle, 0.1% in kernel, 92.3% in C10 state, 6.1% in C8 state, 2.05GHz

GPU averages:
   5.3.0-19: 0.10W
   5.3.0-23: 7.19W

ACPI thermal zone:
   5.3.0-19: 38.92 C
   5.3.0-23: 68.65 C

So, not much difference in CPU loading or in C10/C8 states, but the CPU is clocked faster on the -23 kernel and consumes 11.7W more power. The GPU is also consuming far more power on the -23 kernel. The ACPI thermal zone is ~30 degrees hotter, hence the fan activity.

Given that the kernel changes I provided made no difference, this looks like an i915 regression somehow. I'll see what has changed there.
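
The differences quoted in this comment can be recomputed directly from the averages. This is only a small arithmetic sketch; the numbers are copied from the comment above, and delta is a hypothetical helper name.

```python
# Averages quoted in the comment above (from the attached powerstat logs).
cpu_watts = {"5.3.0-19": 2.03, "5.3.0-23": 13.71}
gpu_watts = {"5.3.0-19": 0.10, "5.3.0-23": 7.19}
zone_temp_c = {"5.3.0-19": 38.92, "5.3.0-23": 68.65}


def delta(values: dict, old: str = "5.3.0-19", new: str = "5.3.0-23") -> float:
    """Return the difference new - old, rounded to two decimals."""
    return round(values[new] - values[old], 2)


print(f"CPU power delta: {delta(cpu_watts)} W")
print(f"GPU power delta: {delta(gpu_watts)} W")
print(f"Thermal delta:   {delta(zone_temp_c)} C")
```

This gives 11.68 W for the CPU, 7.09 W for the GPU and 29.73 C for the thermal zone, which matches the rounded figures of 11.7W and ~30 degrees.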

Colin Ian King (colin-king) wrote :

@Dean, just one sanity check, do you have non-integer icon scaling on your desktop?

Dean Henrichsmeyer (dean) wrote :

I do not, I'm scaled at 200% and I've disabled all experimental-features.

Colin Ian King (colin-king) wrote :

Hi Dean,

I've prepared another debug test kernel that has 70+ of the drm patches removed that were introduced between the 5.3.0-19 and 5.3.0-23 kernels. If this stops the fan spinning, it implies the regression was introduced in a drm graphics patch.

Updated revision r2 Debian packages can be found here for testing:

https://kernel.ubuntu.com/~cking/lp-1853044/

Please test and let me know the outcome.

Dean Henrichsmeyer (dean) wrote :

Just to close the loop on this, I used the test kernels and the problem went away.

Kai-Heng Feng (kaihengfeng) wrote :

Seems like LP: #1856653 is the same issue.

The solution from Chris Wilson: https://gitlab.freedesktop.org/drm/intel/issues/614#note_366057

Dean Henrichsmeyer (dean) wrote :

This also affects 5.4.0-9-generic - I see the same behavior on today's focal.

Colin Ian King (colin-king) wrote :

I'll get a kernel sorted out for testing by EOD.

Brad Figg (brad-figg) wrote :

If we think that single patch is a solution, can we get a test kernel with that patch made available for confirmation, and then get it submitted so it can go into an official 20.04 kernel?

Seth Forshee (sforshee) wrote :

Hi Dean -- I built a test kernel with the 5.3 patch from the gitlab bug, please give it a spin to see if it fixes your issue.

https://people.canonical.com/~sforshee/lp1853044/5.3.0-40.32+lp1853044v202002131655/

Thanks!

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Andrea Righi (arighi) wrote :

I also built a 5.4 based test kernel (with the extra drm/i915 patches from https://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug112315):

https://kernel.ubuntu.com/~arighi/LP-1853044/

Dean Henrichsmeyer (dean) wrote :

Thanks Andrea - I'm running the kernel now and all is well so far. I'll update this bug in a few days with my experience.

Francis Ginther (fginther) wrote :

I believe lp:1863489 is a duplicate of this, but the symptom is powertop showing that the i915 device is not dropping into RC6 or lower power state. I was able to reproduce this on my laptop and after installing the above 5.3 test kernel, powertop is now showing RC6 usage > 90% when my desktop is idle.

The 5.3 test kernel Seth generated worked for me. I also tried the 5.4 test kernel Andrea built which also showed RC6 usage > 90% via powertop.

arno (star-gmx) wrote :

Can I expect to get the patched 5.3 kernel via the usual (next) update, or do I need to update manually? I haven't done that before.

Seth Forshee (sforshee) wrote :

On Fri, Feb 21, 2020 at 04:05:08PM -0000, arno wrote:
> Can I expect to get the patched 5.3 kernel via the usual (next) update, or
> do I need to update manually? I haven't done that before.

The update will make its way into the normal updates, though this may
take some time (3-6 weeks typically) due to the amount of testing we
need to do on kernel updates. It won't be in the next update as that one
is already undergoing testing; it will be in the update after that.

The only reason you would need to install anything manually is if you
want to use the test build while waiting for the fix to make its way
into a release kernel update.

Dean Henrichsmeyer (dean) wrote :

The kernel from Andrea has been stable for me (no GPU hangs). The fans seem to come on a little more than they did with the previous kernel from Colin but certainly it's much better than the vanilla 5.4 was.

arno (star-gmx) wrote :

Thanks. I stay at 5.3.0.18 till then. Stupid question. Screen shudders (syncing fails from time to time) after sleep/hibernate. Is this a known issue and solved in newer kernels?

Seth Forshee (sforshee) wrote :

On Sun, Feb 23, 2020 at 08:47:32PM -0000, arno wrote:
> Thanks. I stay at 5.3.0.18 till then. Stupid question. Screen shudders
> (syncing fails from time to time) after sleep/hibernate. Is this a known
> issue and solved in newer kernels?

That's not an issue I've heard about, but we are constantly pulling in
bug fixes from upstream so it's still possible that it's fixed in a
newer kernel.

arno (star-gmx) wrote :

Hm. I'd better wait these 6 weeks and check again then. Who knows, maybe that is one of the reasons RC6 was disabled. I won't file a bug against an outdated version.

Seth Forshee (sforshee) on 2020-02-24
description: updated
Seth Forshee (sforshee) wrote :
Changed in linux (Ubuntu Focal):
status: Incomplete → Fix Committed
Changed in linux (Ubuntu Eoan):
assignee: nobody → Seth Forshee (sforshee)
importance: Undecided → High
status: New → In Progress
Changed in linux (Ubuntu Eoan):
status: In Progress → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-focal' to 'verification-done-focal'. If the problem still exists, change the tag 'verification-needed-focal' to 'verification-failed-focal'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-focal
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.4.0-18.22

---------------
linux (5.4.0-18.22) focal; urgency=medium

  * focal/linux: 5.4.0-18.22 -proposed tracker (LP: #1866488)

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts

  * Add sysfs attribute to show remapped NVMe (LP: #1863621)
    - SAUCE: ata: ahci: Add sysfs attribute to show remapped NVMe device count

  * [20.04 FEAT] Compression improvements in Linux kernel (LP: #1830208)
    - lib/zlib: add s390 hardware support for kernel zlib_deflate
    - s390/boot: rename HEAP_SIZE due to name collision
    - lib/zlib: add s390 hardware support for kernel zlib_inflate
    - s390/boot: add dfltcc= kernel command line parameter
    - lib/zlib: add zlib_deflate_dfltcc_enabled() function
    - btrfs: use larger zlib buffer for s390 hardware compression
    - [Config] Introducing s390x specific kernel config option CONFIG_ZLIB_DFLTCC

  * [UBUNTU 20.04] s390x/pci: increase CONFIG_PCI_NR_FUNCTIONS to 512 in kernel
    config (LP: #1866056)
    - [Config] Increase CONFIG_PCI_NR_FUNCTIONS from 64 to 512 starting with focal
      on s390x

  * CONFIG_IP_MROUTE_MULTIPLE_TABLES is not set (LP: #1865332)
    - [Config] CONFIG_IP_MROUTE_MULTIPLE_TABLES=y

  * Dell XPS 13 9300 Intel 1650S wifi [34f0:1651] fails to load firmware
    (LP: #1865962)
    - iwlwifi: remove IWL_DEVICE_22560/IWL_DEVICE_FAMILY_22560
    - iwlwifi: 22000: fix some indentation
    - iwlwifi: pcie: rx: use rxq queue_size instead of constant
    - iwlwifi: allocate more receive buffers for HE devices
    - iwlwifi: remove some outdated iwl22000 configurations
    - iwlwifi: assume the driver_data is a trans_cfg, but allow full cfg

  * [FOCAL][REGRESSION] Intel Gen 9 brightness cannot be controlled
    (LP: #1861521)
    - Revert "USUNTU: SAUCE: drm/i915: Force DPCD backlight mode on Dell Precision
      4K sku"
    - Revert "UBUNTU: SAUCE: drm/i915: Force DPCD backlight mode on X1 Extreme 2nd
      Gen 4K AMOLED panel"
    - SAUCE: drm/dp: Introduce EDID-based quirks
    - SAUCE: drm/i915: Force DPCD backlight mode on X1 Extreme 2nd Gen 4K AMOLED
      panel
    - SAUCE: drm/i915: Force DPCD backlight mode for some Dell CML 2020 panels

  * [20.04 FEAT] Enable proper kprobes on ftrace support (LP: #1865858)
    - s390/ftrace: save traced function caller
    - s390: support KPROBES_ON_FTRACE

  * alsa/sof: load different firmware on different platforms (LP: #1857409)
    - ASoC: SOF: Intel: hda: use fallback for firmware name
    - ASoC: Intel: acpi-match: split CNL tables in three
    - ASoC: SOF: Intel: Fix CFL and CML FW nocodec binary names.

  * [UBUNTU 20.04] Enable CONFIG_NET_SWITCHDEV in kernel config for s390x
    starting with focal (LP: #1865452)
    - [Config] Enable CONFIG_NET_SWITCHDEV in kernel config for s390x starting
      with focal

  * Focal update: v5.4.24 upstream stable release (LP: #1866333)
    - io_uring: grab ->fs as part of async offload
    - EDAC: skx_common: downgrade message importance on missing PCI device
    - net: dsa: b53: Ensure the default VID is untagged
    - net: fib_rules: Correctly set table field when table number exceeds 8 bit...

Changed in linux (Ubuntu Focal):
status: Fix Committed → Fix Released

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-eoan' to 'verification-done-eoan'. If the problem still exists, change the tag 'verification-needed-eoan' to 'verification-failed-eoan'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-eoan
tags: added: verification-done-eoan
removed: verification-needed-eoan
Michael Haas (mhaas87) wrote :

Hi,

happy to see that this regression is addressed in newer Ubuntu releases!

Will this fix be backported to the 5.3.0 series used in Ubuntu 18.04.4?

I was previously running Ubuntu 18.04; then reinstalled to 18.04.4 and was very surprised to find my time on battery was *cut by half*.

I would advocate to have the fix for all affected & currently supported kernel lines, including the HWE packages. Let me know if I can help with testing.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 5.3.0-46.38

---------------
linux (5.3.0-46.38) eoan; urgency=medium

  * eoan/linux: 5.3.0-43.36 -proposed tracker (LP: #1867301)

  * Fix AMD Stoney Ridge screen flickering under 4K resolution (LP: #1864005)
    - iommu/amd: Disable IOMMU on Stoney Ridge systems

  * Allow BPF tracing under lockdown (LP: #1868626)
    - Revert "UBUNTU: SAUCE: (efi-lockdown) Lock down kprobes"
    - Revert "bpf: Restrict bpf when kernel lockdown is in confidentiality mode"

  * Missing wireless network interface after kernel 5.3.0-43 upgrade with eoan
    (LP: #1868442)
    - iwlwifi: mvm: Do not require PHY_SKU NVM section for 3168 devices

  * Packaging resync (LP: #1786013)
    - [Packaging] resync getabis
    - [Packaging] update helper scripts

  * iSCSI-target: Deleting a LUN hangs in the kernel (LP: #1862682)
    - scsi: Revert "target/core: Inline transport_lun_remove_cmd()"

  * Stop using get_scalar_status command in Dell AIO uart backlight driver
    (LP: #1865402)
    - SAUCE: platform/x86: dell-uart-backlight: add get_display_mode command

  * Eoan update: upstream stable patchset 2020-03-11 (LP: #1867051)
    - Revert "drm/sun4i: dsi: Change the start delay calculation"
    - ovl: fix lseek overflow on 32bit
    - kernel/module: Fix memleak in module_add_modinfo_attrs()
    - media: iguanair: fix endpoint sanity check
    - ocfs2: fix oops when writing cloned file
    - x86/cpu: Update cached HLE state on write to TSX_CTRL_CPUID_CLEAR
    - udf: Allow writing to 'Rewritable' partitions
    - printk: fix exclusive_console replaying
    - iwlwifi: mvm: fix NVM check for 3168 devices
    - sparc32: fix struct ipc64_perm type definition
    - cls_rsvp: fix rsvp_policy
    - gtp: use __GFP_NOWARN to avoid memalloc warning
    - l2tp: Allow duplicate session creation with UDP
    - net: hsr: fix possible NULL deref in hsr_handle_frame()
    - net_sched: fix an OOB access in cls_tcindex
    - net: stmmac: Delete txtimer in suspend()
    - bnxt_en: Fix TC queue mapping.
    - tcp: clear tp->total_retrans in tcp_disconnect()
    - tcp: clear tp->delivered in tcp_disconnect()
    - tcp: clear tp->data_segs{in|out} in tcp_disconnect()
    - tcp: clear tp->segs_{in|out} in tcp_disconnect()
    - rxrpc: Fix use-after-free in rxrpc_put_local()
    - rxrpc: Fix insufficient receive notification generation
    - rxrpc: Fix missing active use pinning of rxrpc_local object
    - rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect
    - media: uvcvideo: Avoid cyclic entity chains due to malformed USB descriptors
    - mfd: dln2: More sanity checking for endpoints
    - ipc/msg.c: consolidate all xxxctl_down() functions
    - tracing: Fix sched switch start/stop refcount racy updates
    - rcu: Avoid data-race in rcu_gp_fqs_check_wake()
    - brcmfmac: Fix memory leak in brcmf_usbdev_qinit
    - usb: typec: tcpci: mask event interrupts when remove driver
    - usb: gadget: legacy: set max_speed to super-speed
    - usb: gadget: f_ncm: Use atomic_t to track in-flight request
    - usb: gadget: f_ecm: Use atomic_t to track in-flight request
    - ALSA: usb-audio: Fix endianess in descriptor validatio...

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Fix Released
Michael Haas (mhaas87) wrote :

I can confirm that this is now fixed for me on Ubuntu 18.04.4 LTS with kernel 5.3.0-46-generic. With this kernel, my Dell Latitude E5470 with an i5-6440HQ now consumes about 4W at idle instead of 9.W as before.

Thanks a lot!

arno (star-gmx) wrote :

As for me (kernel 5.4.0-24-generic #28), I have a bad regression.
After return from suspend state the screen flickers like hell. So it becomes unusable.

Seth Forshee (sforshee) wrote :

@arno can you file a new bug for this please? Thanks!

arno (star-gmx) wrote :

Sure:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1872760

But with one of the previous versions (the one that consumes a lot of power) I didn't have this issue. I hope this will not end up as an either-or decision ;).

Kostadin Stoilov (kmstoilov) wrote :

Unfortunately this bug is back with linux-5.3.0-53.47.

It got reintroduced somewhere between linux-5.3.0-53.47 and 5.3.0-51.44.

System is Dell XPS 9550 with Skylake HD Graphics 530.

Dean Henrichsmeyer (dean) wrote :

Agreed

Colin Ian King (colin-king) wrote :

There are nearly 600 commits between the working and non-working kernels. Rather than work through ~8-9 bisect steps, I've built 5 test kernels that have just the relevant i915 driver commits reverted between the working and broken kernel.
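
The bisect estimate follows from the fact that each bisect step roughly halves the number of candidate commits. A minimal sketch of that arithmetic (bisect_steps is a hypothetical helper name, and 600 stands in for the "nearly 600" figure):

```python
import math


def bisect_steps(num_commits: int) -> int:
    """Worst-case number of builds needed to bisect num_commits candidates."""
    if num_commits < 1:
        raise ValueError("num_commits must be at least 1")
    return math.ceil(math.log2(num_commits))


print(bisect_steps(600))
```

log2(600) is about 9.2, so a full bisect would take around nine to ten build, reboot and test cycles, in line with the rough estimate above. Five targeted revert kernels are a cheaper first attempt.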

I've put the .debs at the following location:

https://kernel.ubuntu.com/~cking/lp-1853044/

Please can you test the kernels in https://kernel.ubuntu.com/~cking/lp-1853044/revert1/ first. Install the kernel, reboot into it, and check it is the correct one using:

uname -r

Please work through the kernels in revert1 through to revert5 one by one (install, reboot, test for a while) and let me know which ones work fine and which ones cause the fan to kick in because of overheating.

Thank you!

Colin Ian King (colin-king) wrote :

ping? Anyone care to test these kernels?

arno (star-gmx) wrote :

Is there a chance that "flicker after reboot" (that seems to be a side effect of the previous fix) is gone too?
Is it enough to save the /boot partition to recover system after tests?

Dean Henrichsmeyer (dean) wrote :

This happens to me on 5.4.0-33-generic (focal). Do you want me to test these 5.3 kernels?
