amdgpu causes NULL pointer dereference when trying to use Topaz (Radeon R7 M265) GPU

Bug #1579374 reported by Paulo Matias
This bug affects 32 people
Affects: Linux | Status: New | Importance: Undecided | Assigned to: Unassigned
Affects: linux (Ubuntu) | Importance: Low | Assigned to: Unassigned

Bug Description

A Dell Inspiron 5448 / 0YDTG3 laptop (BIOS A06 10/12/2015) that worked well with Ubuntu trusty and Ubuntu wily (using the fglrx driver) has an issue in Xenial: when the amdgpu driver tries to activate the Radeon GPU, a NULL pointer dereference occurs and the calling process hangs. In practice, this causes the system to hang on logoff or shutdown.

The bug can be easily reproduced by running:
DRI_PRIME=1 glxgears

This hangs the glxgears process and produces the stack trace below.
---[cut]---

$ dmesg
[...]
[ 129.188685] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000).
[ 129.429667] [drm] ring test on 0 succeeded in 14 usecs
[ 131.087501] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 1 (-2).
[ 131.087520] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 2 (-2).
[ 131.087533] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 3 (-2).
[ 131.087547] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 4 (-2).
[ 131.087558] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 5 (-2).
[ 131.087571] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 6 (-2).
[ 131.087582] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 7 (-2).
[ 131.087595] [drm:gfx_v8_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: cp failed to lock ring 8 (-2).
[ 131.191024] [drm:sdma_v2_4_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
[ 131.191034] [drm:amdgpu_resume [amdgpu]] *ERROR* resume 5 failed -22
[ 131.191043] [drm:amdgpu_resume_kms [amdgpu]] *ERROR* amdgpu_resume failed (-22).
[ 131.191149] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.191179] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.191203] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201417] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.201444] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.201462] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201465] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.201479] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.201493] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201494] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.201508] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.201523] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201524] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.201535] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.201546] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201548] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.201558] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.201569] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201571] amdgpu 0000:04:00.0: couldn't schedule ib
[ 131.201581] [drm:amdgpu_sched_run_job [amdgpu]] *ERROR* Error scheduling IBs (-22)
[ 131.201592] [drm:amd_sched_main [amdgpu]] *ERROR* Failed to run job!
[ 131.201616] BUG: unable to handle kernel NULL pointer dereference at 0000000000000248
[ 131.201648] IP: [<ffffffffc041ceb2>] amdgpu_vm_grab_id+0x122/0x310 [amdgpu]
[ 131.201684] PGD 0
[ 131.201693] Oops: 0000 [#1] SMP
[ 131.201707] Modules linked in: ctr ccm rfcomm pci_stub vboxpci(OE) vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) bnep binfmt_misc rtsx_usb_ms memstick nls_iso8859_1 dell_wmi sparse_keymap dell_laptop intel_rapl x86_pkg_temp_thermal arc4 intel_powerclamp dcdbas dell_smm_hwmon coretemp dell_led iwlmvm btusb uvcvideo btrtl btbcm btintel mac80211 bluetooth kvm_intel videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core snd_soc_rt5640 v4l2_common videodev snd_hda_codec_realtek snd_hda_codec_hdmi media snd_soc_ssm4567 snd_hda_codec_generic snd_soc_rl6231 kvm snd_soc_core snd_hda_intel hid_multitouch snd_hda_codec iwlwifi irqbypass snd_hda_core snd_compress ac97_bus snd_hwdep snd_pcm_dmaengine cfg80211 snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq joydev input_leds snd_seq_device mei_me
[ 131.202004] serio_raw snd_timer mei snd shpchp lpc_ich elan_i2c soundcore spi_pxa2xx_platform snd_soc_sst_acpi dw_dmac acpi_pad dw_dmac_core dell_rbtn i2c_designware_platform 8250_dw i2c_designware_core mac_hid parport_pc ppdev lp parport autofs4 drbg ansi_cprng algif_skcipher af_alg dm_crypt rtsx_usb_sdmmc usbhid rtsx_usb amdkfd amd_iommu_v2 amdgpu crct10dif_pclmul crc32_pclmul i915 aesni_intel i2c_algo_bit ttm aes_x86_64 lrw gf128mul glue_helper ablk_helper drm_kms_helper cryptd syscopyarea r8169 sysfillrect mii psmouse sysimgblt fb_sys_fops ahci drm libahci wmi sdhci_acpi sdhci video i2c_hid hid fjes
[ 131.202242] CPU: 3 PID: 189 Comm: gfx Tainted: G OE 4.4.0-22-generic #39-Ubuntu
[ 131.202268] Hardware name: Dell Inc. Inspiron 5448/0YDTG3, BIOS A06 10/12/2015
[ 131.202291] task: ffff88024ee544c0 ti: ffff880035088000 task.ti: ffff880035088000
[ 131.202317] RIP: 0010:[<ffffffffc041ceb2>] [<ffffffffc041ceb2>] amdgpu_vm_grab_id+0x122/0x310 [amdgpu]
[ 131.202389] RSP: 0018:ffff88003508bce0 EFLAGS: 00010246
[ 131.202418] RAX: 0000000000000000 RBX: ffff88024f348000 RCX: ffff88024d472000
[ 131.202457] RDX: 000003e8000003e8 RSI: ffff88024f34ad78 RDI: ffff88024b030240
[ 131.202498] RBP: ffff88003508bdb0 R08: ffff88024d472000 R09: 0000000000000000
[ 131.202533] R10: 00000000ffff5b25 R11: 0000000000001061 R12: ffff88024f34ad78
[ 131.202575] R13: ffff88024b030240 R14: ffff88024f348838 R15: 0000000000000001
[ 131.202617] FS: 0000000000000000(0000) GS:ffff88025ecc0000(0000) knlGS:0000000000000000
[ 131.202664] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 131.202698] CR2: 0000000000000248 CR3: 0000000002e0a000 CR4: 00000000003406e0
[ 131.202741] Stack:
[ 131.202754] ffff8802506e8648 0000000000000000 ffff88024d472000 0000000000000000
[ 131.202804] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 131.202855] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 131.202905] Call Trace:
[ 131.202942] [<ffffffffc041ecf0>] amdgpu_ib_schedule+0x90/0x390 [amdgpu]
[ 131.203004] [<ffffffffc045b466>] amdgpu_sched_run_job+0x36/0x140 [amdgpu]
[ 131.203065] [<ffffffffc045ac7f>] amd_sched_main+0x23f/0x400 [amdgpu]
[ 131.203107] [<ffffffff810c3a70>] ? wake_atomic_t_function+0x60/0x60
[ 131.203163] [<ffffffffc045aa40>] ? amd_sched_entity_wakeup+0x70/0x70 [amdgpu]
[ 131.203207] [<ffffffff810a0588>] kthread+0xd8/0xf0
[ 131.203238] [<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
[ 131.203278] [<ffffffff8182568f>] ret_from_fork+0x3f/0x70
[ 131.203311] [<ffffffff810a04b0>] ? kthread_create_on_node+0x1e0/0x1e0
[ 131.203350] Code: c0 44 89 bc 85 48 ff ff ff 41 83 c7 01 44 39 bb 1c 09 00 00 76 4f 49 83 c6 10 4d 8b 6e f0 4d 85 ed 74 66 4c 89 ef e8 fe 2e ff ff <8b> b8 48 02 00 00 48 8b b4 fd 50 ff ff ff 48 85 f6 74 b2 41 8b
[ 131.203558] RIP [<ffffffffc041ceb2>] amdgpu_vm_grab_id+0x122/0x310 [amdgpu]
[ 131.203628] RSP <ffff88003508bce0>
[ 131.203645] CR2: 0000000000000248
[ 131.208801] ---[ end trace 8aa5d3e55e43b8c1 ]---
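
For triage, the two facts that matter most in the trace above are the faulting address and the function named on the IP line. A minimal sketch for pulling them out of a saved log follows. The function name oops_summary is hypothetical, and the patterns assume the oops format shown in this report (kernel 4.4); other kernel versions may print these lines differently.

```shell
#!/bin/sh
# Hypothetical helper: summarize a NULL pointer oops from kernel log text on stdin.
# Assumes the oops format shown in this report (kernel 4.4); other kernels may differ.
oops_summary() {
    # First expression: the address on the "NULL pointer dereference at" line.
    # Second expression: the symbol on the "IP: [<addr>] symbol+offset" line.
    sed -n \
        -e 's/.*NULL pointer dereference at \([0-9a-f][0-9a-f]*\).*/address=\1/p' \
        -e 's/.*IP: \[<[0-9a-f]*>] \([A-Za-z0-9_.]*\)+.*/symbol=\1/p'
}
```

Run as `dmesg | oops_summary`. For the trace above, this should report address 0000000000000248 and symbol amdgpu_vm_grab_id.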

WORKAROUND: Kernel http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.5-wily fixes the crashes and oopses.

WORKAROUND: In /etc/default/grub add amdgpu.runpm=0 to: GRUB_CMDLINE_LINUX_DEFAULT

In /etc/rc.local add the following command:
echo OFF > /sys/kernel/debug/vgaswitcheroo/switch

Create the file /etc/X11/xorg.conf if it doesn't exist, and add the following in order to get the backlight controls working:
Section "Device"
    Identifier "Card0"
    Driver "intel"
    Option "Backlight" "intel_backlight"
EndSection
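
The GRUB step of the workaround above can be sketched as a small script. This is only an illustration: the function name add_kernel_param is hypothetical, and the sketch assumes GRUB_CMDLINE_LINUX_DEFAULT sits on a single double-quoted line and that the parameter contains no slashes.

```shell
#!/bin/sh
# Hypothetical helper: add a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# Usage: add_kernel_param PARAM [FILE]; FILE defaults to /etc/default/grub.
# Assumes the variable is on one line and double-quoted, and PARAM has no "/".
add_kernel_param() {
    param=$1
    file=${2:-/etc/default/grub}
    # Do nothing if the parameter is already present on that line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Keep a backup, then append the parameter just before the closing quote.
    cp "$file" "$file.bak" || return 1
    sed "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file.bak" > "$file"
}
```

Called as root with `add_kernel_param amdgpu.runpm=0`, it edits /etc/default/grub in place and leaves a copy in /etc/default/grub.bak. On Ubuntu the change takes effect after running update-grub and rebooting.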

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.4.0-22-generic 4.4.0-22.39
ProcVersionSignature: Ubuntu 4.4.0-22.39-generic 4.4.8
Uname: Linux 4.4.0-22-generic x86_64
ApportVersion: 2.20.1-0ubuntu2
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: matias 2141 F.... pulseaudio
 /dev/snd/controlC0: matias 2141 F.... pulseaudio
CurrentDesktop: i3
Date: Sat May 7 10:56:56 2016
HibernationDevice: RESUME=UUID=60147c89-0820-48f0-9b56-033547cfbae8
InstallationDate: Installed on 2015-08-12 (268 days ago)
InstallationMedia: Ubuntu 15.04 "Vivid Vervet" - Release amd64 (20150422)
MachineType: Dell Inc. Inspiron 5448
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.4.0-22-generic.efi.signed root=/dev/mapper/ubuntu--vg-root ro quiet splash
RelatedPackageVersions:
 linux-restricted-modules-4.4.0-22-generic N/A
 linux-backports-modules-4.4.0-22-generic N/A
 linux-firmware 1.157
SourcePackage: linux
UpgradeStatus: Upgraded to xenial on 2016-04-26 (11 days ago)
dmi.bios.date: 10/12/2015
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A06
dmi.board.name: 0YDTG3
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: A06
dmi.modalias: dmi:bvnDellInc.:bvrA06:bd10/12/2015:svnDellInc.:pnInspiron5448:pvrA06:rvnDellInc.:rn0YDTG3:rvrA00:cvnDellInc.:ct8:cvrA06:
dmi.product.name: Inspiron 5448
dmi.product.version: A06
dmi.sys.vendor: Dell Inc.

Paulo Matias (paulo-matias) wrote :
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.6 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.6-rc7-wily/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Paulo Matias (paulo-matias) wrote :

Behavior with 4.6.0-040600rc7-generic appears to be even saner than with 4.5.0-050400, although the AMD GPU is still not usable ("DRI_PRIME=1 glxgears" opens a black window).

Only after reporting this bug did I find that 4.5.0-050400 had issues with sound output on the HDMI port (sound played at a lower sampling rate than intended, maybe half). These issues are fixed in 4.6.0-040600rc7-generic.

Also, 4.6.0-040600rc7-generic does not present any errors in dmesg output when trying to use the AMD GPU (again "DRI_PRIME=1 glxgears"):

---[cut]---
[ 251.019823] [drm] PCIE GART of 2048M enabled (table at 0x0000000000040000).
[ 251.258535] [drm] ring test on 0 succeeded in 15 usecs
[ 251.258748] [drm] ring test on 1 succeeded in 14 usecs
[ 251.258787] [drm] ring test on 2 succeeded in 13 usecs
[ 251.258797] [drm] ring test on 3 succeeded in 4 usecs
[ 251.258812] [drm] ring test on 4 succeeded in 2 usecs
[ 251.258818] [drm] ring test on 5 succeeded in 2 usecs
[ 251.258824] [drm] ring test on 6 succeeded in 2 usecs
[ 251.258830] [drm] ring test on 7 succeeded in 2 usecs
[ 251.258838] [drm] ring test on 8 succeeded in 3 usecs
[ 251.258865] [drm] ring test on 9 succeeded in 6 usecs
[ 251.258878] [drm] ring test on 10 succeeded in 6 usecs
---[cut]---

Whereas 4.5.0-050400 used to output the same kind of error message presented by xenial's 4.4 kernel ("*ERROR* amdgpu: cp failed to lock ring N (-2)"), although the NULL pointer dereference was already fixed in 4.5.0-050400.
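
To compare kernels on this point without reading the log by eye, the ring-test lines can be counted. The sketch below is an illustration only: the function name ring_test_counts is hypothetical, and it matches just the message wording that appears in this report.

```shell
#!/bin/sh
# Hypothetical helper: count amdgpu ring-test results in kernel log text on stdin.
# Matches only the message wording that appears in this report.
ring_test_counts() {
    awk '
        /ring test on [0-9]+ succeeded/ { ok++ }
        /failed to lock ring [0-9]+/    { bad++ }
        /ring [0-9]+ test failed/       { bad++ }
        END { printf "succeeded=%d failed=%d\n", ok, bad }
    '
}
```

Run as `dmesg | ring_test_counts`. A non-zero failed count after "DRI_PRIME=1 glxgears" would indicate that the GPU did not resume cleanly on that kernel.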

In short, the NULL pointer dereference is fixed upstream. With the upstream kernel, the system appears to run stably using the Intel GPU as the renderer, whereas xenial's 4.4 kernel would constantly crash. However, the AMD GPU is still not usable even with the upstream kernel, as it renders empty (black) output.

tags: added: kernel-fixed-upstream
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Bruno Fontana da Silva (fontanads) wrote :

I'm not sure if it helps to share it, but a couple of days before you created this thread I posted the same bug, with the crash report from my personal notebook, on AskUbuntu:
http://askubuntu.com/questions/767163/ubuntu-16-kernel-bug-oops-0000-1-smp-related-to-amdgpu/772787

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Christopher M. Peñalver (penalvch) wrote :

Paulo Matias, the next step is to fully reverse commit bisect from kernel 4.4 to 4.6-rc7 in order to identify the last bad commit, followed immediately by the first good one. Once this good commit has been identified, it may be reviewed for backporting. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection#How_do_I_reverse_bisect_the_upstream_kernel.3F ?

Please note, finding adjacent kernel versions is not fully commit bisecting.

After the fix commit (not kernel version) has been identified, then please mark this report Status Confirmed.

Thank you for your understanding.

Helpful bug reporting tips:
https://wiki.ubuntu.com/ReportingBugs

tags: added: kernel-fixed-upstream-4.6-rc7 latest-bios-a06 needs-reverse-bisect
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Paulo Matias (paulo-matias) wrote :

Will a bisect work in this case? I had understood from the release notes that most of vanilla 4.5's amdgpu was backported to Ubuntu's 4.4 kernel.

As I commented in the first message, 4.5.0 does not present the NULL pointer dereference, so it would not really be a bisect from 4.4 to 4.6-rc7, but from 4.4 to 4.5.0. However, would it make sense to carry out this bisect, since the amdgpu driver in vanilla 4.4 is completely different from the one shipped in Ubuntu's 4.4 kernel?

tags: added: kernel-fixed-upstream-4.5.0
removed: kernel-fixed-upstream-4.6-rc7
giannis androulidakis (g-0zek) wrote :

Sorry, I installed the 4.5.7 kernel on Ubuntu Xenial (16.04) and the problem still persists. I also noticed that the battery remaining and charging times are not quite right; I don't know if this has to do with the amdgpu driver.

Yuri Ribeiro Sucupira (yuri-sucupira) wrote :

As I reported here:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-amdgpu/+bug/1608042/comments/31

...my computer is exactly the same (Dell Inspiron 5448 laptop, BIOS A06, hybrid graphics Intel Broadwell - module i915 + AMD Topaz XT R7 M260/M265 - module amdgpu) and after installing 64-bit XUbuntu 16.04 "Xenial" I got the same problem, because of kernel 4.4.

I got rid of the kernel-AMD GPU issues after doing some reverse commit bisecting and identifying that kernel 4.6-rc1 received a "good commit" that fixed the problem.

I'm currently running 4.6-rc7, but I can confirm that from 4.6-rc1 ahead the "NULL pointer dereference" bug's gone. From kernel 4.6-rc1 or newer I'm able to execute "DRI_PRIME=1 glxgears" without experiencing any issues.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Yuri Ribeiro Sucupira (yuri-sucupira) wrote :
Paulo Matias (paulo-matias) wrote :

I think they are the same. The bug manifests itself when using xorg, but the root cause is the kernel null pointer dereference.

Changed in linux (Ubuntu):
status: Expired → Incomplete
Paulo Matias (paulo-matias) wrote :

Maybe we should mark it as confirmed now that Yuri bisected the kernel in the other bug?

However, I still think we should just add a check for whether the pointer is NULL; we already have the stack trace. Maybe I can try to patch it when I get some spare time.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
tags: added: bios-outdated-a07 kernel-fixed-upstream-4.5
removed: kernel-fixed-upstream-4.5.0 latest-bios-a06
andre (acelere) wrote :

This bug is still present for me. Did you guys manage to fix it?
I have since upgraded to 16.10 and no change.

Thanks.

Christopher M. Peñalver (penalvch) wrote :

andre (acelere), it would help immensely if you filed a new report with Ubuntu, using the default repository kernel (not mainline/upstream/3rd party), via a terminal:
ubuntu-bug linux

Please feel free to subscribe me to it.

For more on why this is helpful, please see https://wiki.ubuntu.com/ReportingBugs.

Changed in linux (Ubuntu):
importance: Medium → Low
description: updated
Paulo Matias (paulo-matias) wrote :

Hi Andre,

I have also upgraded to 16.10, hoping for the bug to be fixed, as everything seemed to go smoothly until kernel 4.7. However, I still got kernel panics.

In short, kernel 4.5 didn't have the bug: it was introduced by the backport Canonical made of amdgpu to 4.4. However, the same bug, or a very similar one, was introduced as a regression in 4.8 (even vanilla 4.8).

However, I found a workaround which prevents amdgpu from loading and crashing the system. Since then I have lost interest in writing a patch to fix this: it's too much work to get only the Intel GPU working (the AMD GPU would not work anyway with the open drivers).

But it still would be nice if Ubuntu could ship with a kernel that doesn't crash on hardware that came with Ubuntu OEM (which, at least theoretically, should be tested by Canonical). If someone could go after this and fix the issue, it would be nice for the people who upgrade their Ubuntu OEM not to get an unusable system after reboot.

The workaround I'm currently using is as follows:

/etc/default/grub: Add amdgpu.runpm=0 to GRUB_CMDLINE_LINUX_DEFAULT

/etc/rc.local: Add the following command:

echo OFF > /sys/kernel/debug/vgaswitcheroo/switch

/etc/X11/xorg.conf: Create the file if it doesn't exist, and add the following in order to get the backlight controls working (otherwise it would be misdetected):

Section "Device"
    Identifier "Card0"
    Driver "intel"
    Option "Backlight" "intel_backlight"
EndSection
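
The rc.local step above writes to a debugfs file that exists only when debugfs is mounted and vgaswitcheroo is active, so an unguarded write can fail. A guarded variant is sketched below under those assumptions; the function name switch_off_dgpu is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: power off the discrete GPU through vgaswitcheroo.
# Usage: switch_off_dgpu [PATH]; PATH defaults to the debugfs file named above.
# Needs root and a mounted debugfs when used on the real path.
switch_off_dgpu() {
    switch=${1:-/sys/kernel/debug/vgaswitcheroo/switch}
    # Bail out with a message instead of failing on a missing or read-only file.
    if [ ! -w "$switch" ]; then
        echo "vgaswitcheroo switch not writable: $switch" >&2
        return 1
    fi
    echo OFF > "$switch"
}
```

In rc.local this could be called as `switch_off_dgpu || true`, so that a missing switch file does not abort the rest of the script when it runs under `sh -e`.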

tags: added: needs-upstream-testing yakkety
removed: kernel-fixed-upstream
description: updated
Alexis Lecocq (axtux) wrote :