64-bit XUbuntu 16.04 "Xenial" hybrid graphics (Intel + AMD): AMDGPU crashes / freezes / hangs entire system

Bug #1608042 reported by Yuri Ribeiro Sucupira
This bug affects 10 people
Affects                             Status     Importance  Assigned to  Milestone
linux (Ubuntu)                      Triaged    High        Unassigned
  Xenial                            Triaged    High        Unassigned
xserver-xorg-video-amdgpu (Ubuntu)  Confirmed  Medium      Unassigned
  Xenial                            Confirmed  Undecided   Unassigned

Bug Description

COMPUTER: Dell Inspiron 5548 laptop

CPU: Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz

GRAPHICS: Intel-AMD hybrid:
- CPU-integrated: Intel Corporation Broadwell-U Integrated Graphics (driver: i915)
- GPU: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265] (driver: amdgpu)

OPERATING SYSTEM: 64-bit GNU/Linux XUbuntu 16.04 "Xenial"

KERNEL: 4.4.0-31-generic

PROBLEM: system hangs / freezes very frequently.
I have Xscreensaver installed and noticed that when it is running an OpenGL animation (screensaver) and I move the touchpad pointer, the screen freezes. I can still turn the keyboard LED backlight on and off (so the keyboard keeps working), but the pointer won't move (the touchpad is locked) and I can't switch TTYs with Ctrl-Alt-F[1-6]. The only solution is to power off my laptop (press and hold the power button).

I usually select the "Molecule" screensaver (it uses OpenGL), then I click the "preview" button and wait for about 20 seconds, then I click the touchpad and the computer hangs.

When I used XUbuntu 14.04 "Trusty" with AMD's proprietary fglrx driver, I didn't experience this issue. After upgrading to 16.04 "Xenial" (which doesn't support the fglrx module/driver), the amdgpu module is loaded by default but very frequently hangs the entire system.

Sometimes I'm quick enough to switch to tty1, and then I get to see messages such as "HARD LOCKUP on CPU0" and "HARD LOCKUP on CPU1". However, it's not a hardware problem: I've already run the Dell Hardware Diagnostics directly at boot (it's an EFI utility), and it tested all the hardware components (CPU, GPU, RAM, keyboard, touchpad, hard disk etc.) without detecting any faulty component.

The attached file "amdgpu-bug.txt" is the reason why I'm pretty convinced that the problem is being caused by the amdgpu driver (although it seems to be related to how it interacts with the kernel, thus maybe the problem is kernel-related).

WORKAROUND: Boot from GRUB with the nomodeset parameter. Graphics performance then becomes terribly slow, because rendering falls back to software (see "Renderer: Software" in the Apport output below).

-----
Apport output:

ApportVersion: 2.20.1-0ubuntu2.1
Architecture: amd64
BootLog:

CompizPlugins: No value set for `/apps/compiz-1/general/screen0/options/active_plugins'
CompositorRunning: None
CurrentDesktop: XFCE
DistUpgraded: Fresh install
DistroCodename: xenial
DistroRelease: Ubuntu 16.04
DistroVariant: ubuntu
ExtraDebuggingInterest: Yes
GraphicsCard:
 Intel Corporation Broadwell-U Integrated Graphics [8086:1616] (rev 09) (prog-if 00 [VGA controller])
   Subsystem: Dell Broadwell-U Integrated Graphics [1028:0643]
   Subsystem: Dell Topaz XT [Radeon R7 M260/M265] [1028:0643]
InstallationDate: Installed on 2016-07-29 (1 days ago)
InstallationMedia: Xubuntu 16.04.1 LTS "Xenial Xerus" - Release amd64 (20160719)
MachineType: Dell Inc. Inspiron 5548
Package: xorg 1:7.7+13ubuntu3
PackageArchitecture: amd64
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.4.0-31-generic.efi.signed root=UUID=24cb3fff-5c01-4674-9c59-58d7d776fd70 ro quiet splash nomodeset vt.handoff=7
ProcVersionSignature: Ubuntu 4.4.0-31.50-generic 4.4.13
Renderer: Software
Tags: xenial ubuntu
Uname: Linux 4.4.0-31-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm audio cdrom dialout dip fax floppy lpadmin netdev plugdev sambashare scanner sudo tape users video
_MarkForUpload: True
dmi.bios.date: 10/12/2015
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A06
dmi.board.name: 0YDTG3
dmi.board.vendor: Dell Inc.
dmi.board.version: A02
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: A06
dmi.modalias: dmi:bvnDellInc.:bvrA06:bd10/12/2015:svnDellInc.:pnInspiron5548:pvrA06:rvnDellInc.:rn0YDTG3:rvrA02:cvnDellInc.:ct8:cvrA06:
dmi.product.name: Inspiron 5548
dmi.product.version: A06
dmi.sys.vendor: Dell Inc.
version.compiz: compiz N/A
version.ia32-libs: ia32-libs N/A
version.libdrm2: libdrm2 2.4.67-1ubuntu0.16.04.1
version.libgl1-mesa-dri: libgl1-mesa-dri 11.2.0-1ubuntu2
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 11.2.0-1ubuntu2
version.xserver-xorg-core: xserver-xorg-core 2:1.18.3-1ubuntu2.2
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.10.1-1ubuntu2
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:7.7.0-1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.99.917+git20160325-1ubuntu1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.12-1build2

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :
description: updated
Yuri Ribeiro Sucupira (yuri-sucupira) wrote : Dependencies.txt

apport information

tags: added: apport-collected
description: updated
Yuri Ribeiro Sucupira (yuri-sucupira) wrote : JournalErrors.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : ProcEnviron.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

The attached file shows what appears on my screen just before the computer freezes permanently.

Christopher M. Peñalver (penalvch) wrote :

Yuri Ribeiro Sucupira, thank you for reporting this and helping make Ubuntu better.

While using your WORKAROUND, could you please make sure the package xdiagnose is installed, then run the following command once from a terminal and click the Yes button when asked to attach additional debugging information:
apport-collect -p xorg 1608042

When reporting xorg related bugs in the future, please do so via the above method. You can learn more about this functionality at https://wiki.ubuntu.com/ReportingBugs.

description: updated
affects: xserver-xorg-video-amdgpu (Ubuntu) → xorg (Ubuntu)
Changed in xorg (Ubuntu):
importance: Undecided → Low
status: New → Incomplete
Peter Wu (lekensteyn) wrote :

Yuri, please do not add the whole hybrid-graphic-linux mailing list, it might be a bit noisy.

The error from comment 6 is the same as in https://bugs.freedesktop.org/show_bug.cgi?id=93460, though under different circumstances.

The amdgpu driver is under development by AMD and your hardware is quite new.
Can you try to install a newer kernel from
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7/ and see if the issue is resolved?

description: updated
Yuri Ribeiro Sucupira (yuri-sucupira) wrote : CurrentDmesg.txt

apport information

tags: added: ubuntu
description: updated
Yuri Ribeiro Sucupira (yuri-sucupira) wrote : Dependencies.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : DpkgLog.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : JournalErrors.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : LightdmDisplayLog.gz

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : LightdmLog.gz

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : Lspci.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : Lsusb.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : ProcCpuinfo.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : ProcEnviron.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : ProcInterrupts.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : ProcModules.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : UdevDb.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : XorgLog.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : XorgLogOld.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : Xrandr.txt

apport information

Yuri Ribeiro Sucupira (yuri-sucupira) wrote : xdpyinfo.txt

apport information

tags: added: latest-bios-a06
Christopher M. Peñalver (penalvch) wrote :

Yuri Ribeiro Sucupira, in order to allow additional upstream developers to examine the issue, at your earliest convenience, could you please test the latest upstream kernel available from http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D ? Please keep in mind the following:
1) The kernel to test is the one on the very top line of the page (not the daily folder).
2) The release names are irrelevant.
3) The folder time stamps aren't indicative of when the kernel actually was released upstream.
4) Install instructions are available at https://wiki.ubuntu.com/Kernel/MainlineBuilds .

If testing on your main install would be inconvenient, one may:
1) Install Ubuntu to a different partition and then test this there.
2) Backup, or clone the primary install.

If the latest kernel did not allow you to test for the issue (e.g. you couldn't boot into the OS), please make a comment in your report about this, and continue with the next most recent kernel version until you can test for the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this issue is fixed in the mainline kernel, please add the following tags by clicking on the yellow circle with a black pencil icon, next to the word Tags, located at the bottom of the report description:
kernel-fixed-upstream
kernel-fixed-upstream-X.Y-rcZ

Where X, and Y are the first two numbers of the kernel version, and Z is the release candidate number if it exists.

If the mainline kernel does not fix the issue, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-X.Y-rcZ

Please note, a failure to install the kernel does not fit the criteria of kernel-bug-exists-upstream.
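
The tag naming scheme above can be sketched as a small shell helper. This is only an illustration: `mainline_tags` is a hypothetical function name, not part of Launchpad or apport, and it assumes the version is given in mainline form (for example 4.6-rc7 or 4.7.4).

```shell
#!/bin/sh
# Hypothetical helper (not a Launchpad or apport tool): print the two tags
# for a tested mainline kernel.
#   $1 = "fixed" or "bug-exists"
#   $2 = mainline version, e.g. "4.6-rc7" or "4.7.4" (a leading "v" is allowed)
mainline_tags() {
    result=$1
    version=$2
    case "$result" in
        fixed|bug-exists) ;;
        *) echo "usage: mainline_tags fixed|bug-exists X.Y[-rcZ]" >&2; return 1 ;;
    esac
    # Keep X.Y plus an optional -rcZ suffix; drop a third number (4.7.4 -> 4.7)
    # and anything after the release candidate (4.6-rc7-wily -> 4.6-rc7).
    short=$(printf '%s\n' "$version" |
        sed -E 's/^v?([0-9]+\.[0-9]+)(\.[0-9]+)?(-rc[0-9]+)?.*$/\1\3/')
    printf 'kernel-%s-upstream\n' "$result"
    printf 'kernel-%s-upstream-%s\n' "$result" "$short"
}
```

For example, `mainline_tags fixed 4.6-rc7` prints kernel-fixed-upstream and kernel-fixed-upstream-4.6-rc7, one per line.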

Also, you don't need to apport-collect further unless specifically requested to do so.

Once testing of the latest upstream kernel is complete, please mark this report Status Confirmed. Please let us know your results.

Thank you for your understanding.

Changed in xorg (Ubuntu):
importance: Low → Medium
Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

@Peter Wu: ok, got it. Sorry for any inconvenience (for subscribing hybrid-graphic-linux to this bug). By the way: thanks for the tip. I decided to test some different kernels and see how they behave with amdgpu.

Some 4.4.X and 4.5.X kernels didn't work either, but I'm currently testing kernel 4.6.0-040600rc7-generic (version "v4.6-rc7-wily") with the amdgpu module loaded along with modules amd_iommu_v2, amdkfd, drm, drm_kms_helper, i915, ttm, video etc., and so far my laptop hasn't hung or frozen even once! It's been ~5 hours of testing so far, and I have already performed a lot of the tasks that usually cause the lockups (previewing OpenGL screensavers in Xscreensaver, running 3D games, 3D CAD applications et cetera: all of these would cause a CPU hard lockup or kernel panic), but I didn't experience any issue with kernel v4.6-rc7-wily.

I will also test kernel 4.7 as you suggested. Thank you for your feedback.

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

@Christopher M. Penalver: thanks for the tip about apport-collect. I'll keep that in mind in future bug reports. Hopefully there won't be any. :)

I'm currently testing kernel 4.6.0-040600rc7-generic ("v4.6-rc7-wily") and so far (~5 hours of intensive use running several 3D applications that would easily cause a system freeze / lockup) I haven't experienced any issue. Looks like the bug is gone on kernel v4.6-rc7-wily.

I'll keep testing this kernel a bit more (just to make sure the bug is really gone) and then test kernel 4.7, as suggested by you and Peter Wu (to check that the bug hasn't come back in the upstream kernel). If everything is ok, I understand that the next step will be to fully reverse commit bisect from kernel v4.6-rc7-wily downward (towards kernel 4.4.X) in order to identify the last bad commit, followed immediately by the first good one. Right?

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

@Christopher M. Penalver: I've just read comment #26. Ok, I'll test a little more to confirm whether v4.6-rc7-wily really fixed this bug and, if it did, I'll reverse commit bisect until I find the last "downstream" (< v4.6-rc7-wily) kernel that carried the bad commit.

description: updated
Timo Aaltonen (tjaalton) wrote :

Christopher, you can still collect all the info with apport-collect without switching the package to 'xorg'. Moving back to -amdgpu.

affects: xorg (Ubuntu) → xserver-xorg-video-amdgpu (Ubuntu)
Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

I performed some reverse commit bisecting and tested several 4.X kernels. Here's the result:

The bug is present in all 4.4.X kernels and up to kernel "4.5.7-yakkety" (build 4.5.7-040507-generic), which is the last one carrying the bad commit.

Starting with kernel "4.6-rc1" (build 4.6.0-040600rc1-generic), the bug is gone.

Hence, a "good commit" removed the bad commit in 4.6-rc1.

tags: added: radeon
Christopher M. Peñalver (penalvch) wrote :

Yuri Ribeiro Sucupira, the next step is to fully reverse commit bisect from kernel 4.5.7 to 4.6-rc1 in order to identify the last bad commit, followed immediately by the first good one. Once this good commit has been identified, it may be reviewed for backporting. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection#How_do_I_reverse_bisect_the_upstream_kernel.3F ?

Please note, finding adjacent kernel versions is not fully commit bisecting.

Also, the kernel release names are irrelevant for the purposes of bisecting.

After the fix commit (not kernel version) has been identified, then please mark this report Status Confirmed.

Thank you for your help.

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

Hello.

It took me some days and 26 kernel compilations, but I finally found the first good commit. I thought kernel 4.6-rc1 carried the first good commit, but it turns out it also had the bug: the first good commit is in kernel 4.6-rc4. Here's the output of my last RCB (reverse commit bisect) report:

"e9bef455af8eb0e837e179aab8988ae2649fd8d3 is the first good commit
 commit e9bef455af8eb0e837e179aab8988ae2649fd8d3
 Author: Alex Deucher <email address hidden>
 Date: Mon Apr 25 13:12:18 2016 -0400

    Revert "drm/amdgpu: disable runtime pm on PX laptops without dGPU power control"

    This reverts commit bedf2a65c1aa8fb29ba8527fd00c0f68ec1f55f1.

    See the radeon revert for an extended description.

    Cc: <email address hidden>

 :040000 040000 5d8682184f857b970ced85be0fae2d4c177cad24 fd74c294984fd4c22ed116ec6a91962a70882b91 M drivers"

The attached PDF is a table with the description of all the RCB builds I performed until I reached the first bug-free commit/build. Notice that I had to use "git bisect good" for bad commits and "git bisect bad" for good commits because it's a reverse method. Thus, "good" means bad (has bug) and "bad" means good (is bug-free).
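
Git 2.7 and later can avoid the inverted wording described above: `git bisect start` accepts `--term-old` and `--term-new`, so the two states can be named after what they mean. The sketch below is not what was done in this report. It uses a throwaway repository in place of the kernel tree, and the terms "broken" and "fixed", the function name and the file names are all made up for illustration.

```shell
#!/bin/sh
# Toy demonstration of a reverse bisect with custom terms (needs Git >= 2.7).
# A throwaway repository stands in for the kernel tree: commits 1-5 contain
# the "bug", commit 6 removes it, commits 7-8 stay bug-free.
reverse_bisect_demo() (
    set -e
    repo=$(mktemp -d)
    trap 'rm -rf "$repo"' EXIT
    cd "$repo"
    git init -q
    git config user.name "Bisect Demo"
    git config user.email "demo@example.invalid"
    for i in 1 2 3 4 5 6 7 8; do
        if [ "$i" -lt 6 ]; then echo bug > state; else echo ok > state; fi
        echo "$i" > counter
        git add state counter
        git commit -q -m "commit $i"
    done
    first=$(git rev-list --max-parents=0 HEAD)

    # "broken" replaces "good" (the old state), "fixed" replaces "bad" (the new state).
    git bisect start --term-old=broken --term-new=fixed >/dev/null
    git bisect broken "$first" >/dev/null
    git bisect fixed HEAD >/dev/null

    # The test command exits 0 while the bug is still present (old state)
    # and 1 once it is gone (new state), so Git can drive the search itself.
    git bisect run grep -q bug state >/dev/null

    # refs/bisect/fixed now points at the first commit without the bug.
    git log -1 --format='first fixed commit: %s' refs/bisect/fixed
)
```

Running `reverse_bisect_demo` should print "first fixed commit: commit 6". On the kernel tree the same three bisect commands apply, with the newest known-buggy version marked `broken` and the oldest known bug-free version marked `fixed`; each step then needs a kernel build and a manual test instead of `git bisect run`.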

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

Note: throughout that PDF table, I used "good" meaning "has bug" and "bad" meaning "is bug-free", but where it reads "e9bef455af8eb0e837e179aab8988ae2649fd8d3 is the first good commit" the word "good" really means "good": commit e9bef455af8eb0e837e179aab8988ae2649fd8d3 removed the bug that is present in previous kernel builds. Hence, all kernel builds from commit e9bef455af8eb0e837e179aab8988ae2649fd8d3 onward are free of the bug.

Changed in xserver-xorg-video-amdgpu (Ubuntu):
status: Incomplete → Confirmed
tags: added: kernel-fixed-upstream reverse-bisect-done
removed: amdgpu radeon
Timo Aaltonen (tjaalton) wrote :

The amdgpu driver in xenial kernel is a backport from 4.5, so perhaps all it needs is pulling all commits from stable v4.5.x, since the revert was cc'd to stable@

Changed in linux (Ubuntu Xenial):
assignee: nobody → Robert Hooker (sarvatt)
status: New → Triaged
Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
Timo Aaltonen (tjaalton) wrote :

hm, that revert is already in the xenial kernel since Ubuntu-4.4.0-23.41, so you're seeing something else or it depends on some other commit which isn't there

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

@Timo Aaltonen (tjaalton): after reading your comment, I decided to install kernel 4.4.0-23.41, but it isn't available on the repo anymore, so I installed kernel 4.4.0-24 (if "that revert is already in the xenial kernel since Ubuntu-4.4.0-23.41", then installing kernel 4.4.0-24 shouldn't be a problem). Then I ran some tests and my system "froze". I had to press the power button of my laptop to shut it down (hardware power off).

First I executed "DRI_PRIME=1 glxgears" in the shell: all I got was a black screen. Then I executed "xscreensaver-demo" and previewed the "Molecule" screensaver twice: the first time, nothing happened (as usual; this is normal), but the second time the screen locked up and I had to use the "Ctrl Alt F1" combo to switch to tty1 and restart lightdm (with the command "sudo service lightdm restart"). Then I returned to tty7 (Ctrl Alt F7) and hit the (software) "Power off" button, but the system hung, so I had to press the physical power button of my laptop.

If someone builds some 4.4.X kernel with the "good commit" e9bef455af8eb0e837e179aab8988ae2649fd8d3 (mentioned in comment #33) applied/patched, I can test it and confirm if the bug is still there.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xserver-xorg-video-amdgpu (Ubuntu Xenial):
status: New → Confirmed
Renê Barbosa (renebarbosa) wrote :

I have a Dell Inspiron 5447 (Radeon R7 M265) and just updated my system. It is now using kernel 4.4.0-45-generic and the problem is still happening. My system crashes entirely if I log out, for example.

I know I can fix it by using a newer kernel package from Mainline, but I don't want to use anything that's not in the main repositories.

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

Until the first good commit is applied downstream (to fix the older kernel versions), I recommend using stable kernel 4.7.4, available at http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/

There are newer stable kernel versions available, but I haven't tested them. I can confirm, though, that kernel 4.7.4 solves the "freezing" issue. Anyone running 64-bit Ubuntu who wants to install kernel 4.7.4 must download these files:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/linux-headers-4.7.4-040704_4.7.4-040704.201609150330_all.deb

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/linux-headers-4.7.4-040704-generic_4.7.4-040704.201609150330_amd64.deb

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/linux-image-4.7.4-040704-generic_4.7.4-040704.201609150330_amd64.deb

...then open a shell terminal window, "cd" into the folder where these 3 DEB packages were downloaded to, and then run the command "sudo dpkg -i linux-*.deb" (without the quotation marks) in order to manually install the three DEBs. Then reboot the system.

For a complete list of kernels, visit http://kernel.ubuntu.com/~kernel-ppa/mainline/?C=N;O=D

PS: for 32-bit systems, the 3 DEBs to be downloaded and installed are:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/linux-headers-4.7.4-040704_4.7.4-040704.201609150330_all.deb

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/linux-headers-4.7.4-040704-generic_4.7.4-040704.201609150330_i386.deb

http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4/linux-image-4.7.4-040704-generic_4.7.4-040704.201609150330_i386.deb
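
The download lists above follow one pattern per architecture, which can be sketched as a shell function. `mainline_474_urls` is a hypothetical helper name; the URLs it prints are exactly the ones listed in this comment, and nothing here checks that they are still available.

```shell
#!/bin/sh
# Hypothetical helper: print the three mainline v4.7.4 package URLs listed
# in this comment for one architecture.
#   $1 = Debian architecture name: amd64 or i386
mainline_474_urls() {
    arch=$1
    base="http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.7.4"
    ver="4.7.4-040704"
    build="4.7.4-040704.201609150330"
    case "$arch" in
        amd64|i386) ;;
        *) echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac
    # The first headers package is architecture-independent ("all").
    printf '%s/linux-headers-%s_%s_all.deb\n' "$base" "$ver" "$build"
    printf '%s/linux-headers-%s-generic_%s_%s.deb\n' "$base" "$ver" "$build" "$arch"
    printf '%s/linux-image-%s-generic_%s_%s.deb\n' "$base" "$ver" "$build" "$arch"
}

# Possible use (left commented out because it downloads files and needs root):
#   mainline_474_urls "$(dpkg --print-architecture)" | xargs -n 1 wget
#   sudo dpkg -i linux-*.deb
```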

If your system hangs/freezes to the point that you can't even log in to the GUI, power off your computer, then turn it on again and press the Esc key until you get into the GRUB menu. Then press the E key to edit the selected GRUB entry, go to the first occurrence of "$vt_handoff" and add the "nomodeset" parameter right before it.

Example:
=> Before = quiet splash $vt_handoff
=> After = quiet splash nomodeset $vt_handoff

After adding the "nomodeset" parameter, press the F10 key (or the Ctrl X key combo). Linux will boot with Kernel Mode Setting (KMS) disabled, then you will be able to log into the Graphical User Interface (GUI) and then download and install a stable kernel.
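
After booting this way, one can check that the parameter really took effect by looking at the kernel command line in /proc/cmdline. A minimal sketch, where `has_kernel_param` is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) if the parameter $1 appears as
# a separate word in the kernel command line $2. When $2 is not given, the
# running kernel's command line is read from /proc/cmdline.
has_kernel_param() {
    param=$1
    cmdline=${2-$(cat /proc/cmdline)}
    # Unquoted on purpose: the command line is split on whitespace.
    for word in $cmdline; do
        [ "$word" = "$param" ] && return 0
    done
    return 1
}
```

Called as `has_kernel_param nomodeset && echo "KMS is disabled"`, it inspects the running system; the ProcKernelCmdLine field in the Apport output of the bug description shows what such a command line looks like with the parameter present.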

Sorry for the long comment, which isn't relevant to those investigating the bug. It is just so other users know how to "deal" with the bug while it isn't fixed in the older kernel versions.

Renê Barbosa (renebarbosa) wrote :

I can confirm the bug is still happening with kernel version 4.4.0-47-generic.

Renê Barbosa (renebarbosa) wrote :

FYI: The problem is fixed in linux-image-generic-hwe-16.04-edge too.

Robert Hooker (sarvatt)
Changed in linux (Ubuntu Xenial):
assignee: Robert Hooker (sarvatt) → nobody