xserver terminates frequently, journal says drm/i915: Resetting chip after gpu hang

Bug #1724047 reported by Ernst Kloppenburg
This bug affects 2 people
Affects: xorg-server-hwe-16.04 (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Since yesterday's update of xserver-xorg-core-hwe-16.04, the graphical session crashes reproducibly some time into the session. It might be related to using LibreOffice at the time of the crash.

The relevant message from the system journal is:

kernel: [drm] GPU HANG: ecode 9:0:0x85dffffb, in Xorg [948], reason: Hang on render ring, action: reset

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: xserver-xorg-core-hwe-16.04 2:1.19.3-1ubuntu1~16.04.3
ProcVersionSignature: Ubuntu 4.10.0-37.41~16.04.1-generic 4.10.17
Uname: Linux 4.10.0-37-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: amd64
CurrentDesktop: KDE
Date: Mon Oct 16 20:49:58 2017
InstallationDate: Installed on 2014-08-28 (1145 days ago)
InstallationMedia: Kubuntu 14.04 LTS "Trusty Tahr" - Release amd64 (20140416.1)
SourcePackage: xorg-server-hwe-16.04
UpgradeStatus: Upgraded to xenial on 2016-05-20 (514 days ago)

Timo Aaltonen (tjaalton) wrote :

That hang comes from the kernel.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg-server-hwe-16.04 (Ubuntu):
status: New → Confirmed
John Neffenger (jgneff) wrote :

I started to see the same GPU HANG error sometime this year (2020).

I am using the default "modesetting" driver with automatic configuration and no "xorg.conf" file (/etc/X11/xorg.conf). The Xorg log file (/var/log/Xorg.1.log) is attached.

The following error occurred on April 1, 2020, at 18:36:30.

[drm] GPU HANG: ecode 9:0:0x85dffffa, in Xorg [1784],
    reason: Hang on rcs0, action: reset
i915 0000:00:02.0: Resetting rcs0 after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
i915 0000:00:02.0: Resetting chip after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[drm] Reducing the compressed framebuffer size.
    This may lead to less power savings than a non-reduced-size.
    Try to increase stolen memory size if available in BIOS.

The next error occurred on April 4, 2020 at 12:15:13.

[drm] GPU HANG: ecode 9:0:0x85dffffa, in skypeforlinux [17067],
    reason: Hang on rcs0, action: reset
i915 0000:00:02.0: Resetting rcs0 after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
i915 0000:00:02.0: Resetting chip after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
[drm] Reducing the compressed framebuffer size.
    This may lead to less power savings than a non-reduced-size.
    Try to increase stolen memory size if available in BIOS.
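To compare hangs across reports (error code, offending process, engine), the "GPU HANG" lines can be pulled apart mechanically. The following is only a minimal sketch in Python, assuming the message layout shown in the logs in this bug; the names HANG_RE and parse_gpu_hangs are made up for illustration and are not part of any tool.

```python
import re

# Matches the i915 hang report as it appears in this bug, for example:
# "[drm] GPU HANG: ecode 9:0:0x85dffffa, in Xorg [1784], reason: Hang on rcs0, action: reset"
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<process>\S+) \[(?P<pid>\d+)\],"
    r"\s*reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hangs(log_text):
    """Return one dict per GPU HANG report found in log_text."""
    # Collapse all whitespace so reports wrapped across lines still match.
    flat = " ".join(log_text.split())
    hangs = []
    for match in HANG_RE.finditer(flat):
        fields = match.groupdict()
        fields["pid"] = int(fields["pid"])
        hangs.append(fields)
    return hangs
```

Feeding it the contents of /var/log/kern.log (or the output of journalctl -k) gives a list that can be grouped by process name to see which applications were on the GPU when it hung.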

I have now switched to the older "intel" driver to see whether that might solve the problem, using the following "xorg.conf" file:

# Try "intel" instead of "modesetting" driver to fix GPU HANG.
Section "Device"
    Identifier "Intel Graphics"
    Driver "intel"
    # Disables acceleration
    # Option "NoAccel" "True"
    # Disables only 3D acceleration (Direct Rendering Infrastructure)
    # Option "DRI" "False"
EndSection

My system information is listed below (now showing the "intel" graphics driver).

$ uname -a
Linux tower 4.15.0-91-generic #92~16.04.1-Ubuntu SMP
    Fri Feb 28 14:57:22 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 16.04.6 LTS
Release: 16.04
Codename: xenial

$ inxi -S -M -C -G -y 80
System: Host: tower Kernel: 4.15.0-91-generic x86_64 (64 bit)
          Desktop: Unity 7.4.5 Distro: Ubuntu 16.04 xenial
Machine: System: Dell product: Precision Tower 3420
          Mobo: Dell model: 08K0X7 v: A01 Bios: Dell v: 2.13.1 date: 06/14/2019
CPU: Quad core Intel Xeon E3-1225 v5 (-HT-MCP-) cache: 8192 KB
          clock speeds: max: 3300 MHz 1: 917 MHz 2: 1962 MHz 3: 1896 MHz
          4: 1349 MHz
Graphics: Card: Intel HD Graphics P530
          Display Server: X.Org 1.19.6 driver: intel
          Resolution: 2560x1440@59.95hz
          GLX Renderer: Mesa DRI Intel HD Graphics P530 (Skylake GT2)
          GLX Version: 3.0 Mesa 18.0.5

John Neffenger (jgneff) wrote :

I just hit another GPU HANG, this time with the "intel" driver instead of the "modesetting" driver. The kernel log file (/var/log/kern.log) shows:

Apr 6 08:59:36 tower kernel: [44387.837794]
--------------------------------------------
[drm] GPU HANG: ecode 9:0:0x85dffffa, in Xorg [1751],
    reason: Hang on rcs0, action: reset
i915 0000:00:02.0: Resetting rcs0 after gpu hang
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
i915 0000:00:02.0: Resetting chip after gpu hang
asynchronous wait on fence i915:compiz[3304]/1:56366 timed out
[drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5
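A note on the "Failed to reset chip: -5" line: the kernel normally reports errors as negated errno values, so -5 most likely corresponds to EIO (input/output error). A small sketch for decoding such lines follows; the names RESET_FAIL_RE and decode_reset_failure are hypothetical, and the reading of the number as a negated errno is an assumption based on the usual kernel convention.

```python
import errno
import os
import re

# Captures the numeric code at the end of the i915 reset failure message.
RESET_FAIL_RE = re.compile(r"Failed to reset chip: (-?\d+)")


def decode_reset_failure(line):
    """Return (errno name, description) for a reset failure line, or None."""
    match = RESET_FAIL_RE.search(line)
    if match is None:
        return None
    # Assumption: the kernel prints a negated errno, so drop the sign.
    code = abs(int(match.group(1)))
    return errno.errorcode.get(code, "UNKNOWN"), os.strerror(code)
```

The first element of the returned tuple is the symbolic name, and the second is the platform's description of that error.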

Now I will try with the hardware acceleration disabled, still using the "intel" driver, as follows (/etc/X11/xorg.conf):

---------------------------------------------------------------
# Tries "intel" instead of "modesetting" driver to fix GPU HANG
# https://bugs.launchpad.net/ubuntu/+source/xorg-server-hwe-16.04/+bug/1724047
Section "Device"
    Identifier "Intel Graphics"
    Driver "intel"
    # Disables acceleration
    Option "NoAccel" "True"
    # Disables only 3D acceleration (Direct Rendering Infrastructure)
    # Option "DRI" "False"
EndSection
---------------------------------------------------------------

John Neffenger (jgneff) wrote :

Well, that didn't work for very long. Disabling the acceleration of the "intel" driver did not solve the problem for me. I tried three more changes, described below.

(1) Next I tried going back to the "modesetting" driver but with the acceleration disabled:

/etc/X11/xorg.conf

  Section "Device"
      Identifier "Intel Graphics"
      Driver "modesetting"
      Option "AccelMethod" "none"
  EndSection

Eventually, the screen went into some kind of infinite-updating loop among all the windows and was completely unresponsive. The kernel log file (/var/log/kern.log) contained the messages:

  Apr 12 10:00:25 tower kernel: [ 4941.921796] show_signal_msg: 21 callbacks suppressed
  Apr 12 10:00:25 tower kernel: [ 4941.921808] GpuWatchdog[7196]: segfault at 0 ip 000055f89a514727 sp 00007f8ec6d966d0 error 6 in signal-desktop[55f89733a000+53d7000]

I'm not sure whether the Signal desktop snap application triggered the problem, but those were the last messages before I had to power off the system.

(2) Next I removed the "/etc/X11/xorg.conf" file and reverted to the original Ubuntu 16.04 kernel version 4.4 instead of the Hardware Enablement (HWE) version 4.15. That eventually encountered the same infinite-updating loop problem.

(3) So a couple of days ago I switched back to the HWE kernel version 4.15 and updated the Xorg configuration to use UXA instead of the default SNA acceleration method. Several posts on the Internet suggested that it might solve problems with Intel Skylake graphics. (I have Intel HD Graphics P530 on a 2015 "Skylake" Xeon E3-1225v5 processor.)

/etc/X11/xorg.conf

  Section "Device"
      Identifier "Intel Graphics"
      Driver "intel"
      # Acceleration method: Unified Acceleration Architecture
      Option "AccelMethod" "UXA"
  EndSection

So far, so good. It has been only a couple of days, though, and sometimes the problem takes several days to occur.
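For anyone who wants to cycle through the same configurations, the "Device" sections tried in this bug differ only in the driver and one option, so they can be generated rather than edited by hand. This is just a sketch: the function device_section and the keys of VARIANTS are names invented here, and the output still has to be saved as /etc/X11/xorg.conf manually.

```python
def device_section(driver, options=None, identifier="Intel Graphics"):
    """Render an xorg.conf "Device" section as text."""
    lines = [
        'Section "Device"',
        f'    Identifier "{identifier}"',
        f'    Driver "{driver}"',
    ]
    # Each option becomes one: Option "name" "value"
    for name, value in (options or {}).items():
        lines.append(f'    Option "{name}" "{value}"')
    lines.append("EndSection")
    return "\n".join(lines) + "\n"


# The four configurations described in the comments of this bug.
VARIANTS = {
    "intel": device_section("intel"),
    "intel-noaccel": device_section("intel", {"NoAccel": "True"}),
    "modesetting-noaccel": device_section("modesetting", {"AccelMethod": "none"}),
    "intel-uxa": device_section("intel", {"AccelMethod": "UXA"}),
}
```

Printing VARIANTS["intel-uxa"] reproduces the UXA section quoted above, minus the comment line.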

John Neffenger (jgneff) wrote :

That didn't work either. Using the "intel" driver with:

      Option "AccelMethod" "UXA"

still resulted in the following GPU HANG:

  [drm] GPU HANG: ecode 9:0:0x85dffffa, in compiz [3398], reason:
    Hang on rcs0, action: reset
  i915 0000:00:02.0: Resetting rcs0 after gpu hang
  [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
  i915 0000:00:02.0: Resetting chip after gpu hang
  [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
  [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -5

After all these tests, my current guess is that the hardware acceleration in my Web browsers (Firefox and Chromium) is triggering the error in the graphics driver or firmware.

I disabled the hardware acceleration in Firefox and Chromium. I also now avoid running anything that uses Electron, such as the Skype and Signal desktop snap packages, because I don't know how to disable the hardware acceleration in an Electron app.

After about a week, the problem now occurs only if I forget and leave Skype running for an afternoon. I'm hoping this can continue to function as a simple workaround.
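On disabling hardware acceleration in an Electron app: Chromium has a --disable-gpu command-line switch, and Electron apps usually pass command-line switches through to Chromium. The sketch below only appends that switch to a launcher command. The helper name without_gpu and the example command are hypothetical, and whether Skype or Signal honour the switch (especially when started through snap) is an assumption that has not been tested here.

```python
import shlex


def without_gpu(command):
    """Return the argv list for command with --disable-gpu appended once."""
    argv = shlex.split(command)
    # Assumption: the application is Chromium-based and accepts this switch.
    if "--disable-gpu" not in argv:
        argv.append("--disable-gpu")
    return argv


# Example launch (commented out so that importing this file starts nothing):
# import subprocess
# subprocess.Popen(without_gpu("skypeforlinux"))
```

If the application ignores the switch, the hang messages in the kernel log should still name its process, which is a quick way to check whether this made any difference.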

Timo Aaltonen (tjaalton) wrote :

16.04 is EOL. If you can reproduce this on a newer release, please open a new bug. Thanks.

Changed in xorg-server-hwe-16.04 (Ubuntu):
status: Confirmed → Won't Fix
John Neffenger (jgneff) wrote :

> John Neffenger (jgneff) wrote on 2020-04-21:
> After all these tests, my current guess is that the hardware acceleration in my Web browsers (Firefox and Chromium) is triggering the error in the graphics driver or firmware.

Years ago, when I made that change, it did in fact seem to solve the problem. I ran with the browsers' hardware acceleration disabled until around the time I upgraded to the Linux version 5 kernels. After that, the problem seemed to be gone, and I could enable the hardware acceleration in my browsers again.

Now, three years later, I'm on Ubuntu 22.04 LTS with Linux 5.15 LTS and Wayland, and I don't see the problem anymore.
