Bug #1810546 “AMDGPU VMC page fault with Athlon 200GE APU” : Bugs : linux package : Ubuntu

Revision history for this message

In Linux Kernel Bug Tracker #199749, muelladdi (muelladdi-linux-kernel-bugs) wrote on 2018-05-17:

#10

System video freezes randomly during the day while working on the system normally (no 3D).
Logfile shows only:

May 17 11:18:00 ws01 kernel: [11831.268044] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, last signaled seq=2081253, last emitted seq=2081256
May 17 11:18:00 ws01 kernel: [11831.268051] [drm] No hardware hang detected. Did some blocks stall?

If I can assist any further, please let me know.
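
The two sequence numbers in the timeout line above can be compared mechanically. A minimal sketch, assuming the message format stays exactly as quoted and reading "emitted minus signaled" as the number of fences still outstanding (that reading, and the helper name `parse_ring_timeout`, are assumptions, not something the thread states):

```python
import re

# Matches the amdgpu job-timeout message quoted in the comment above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)

def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, pending), or None if the line does not match."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # pending = fences emitted but not yet signaled (assumed interpretation)
    return m.group("ring"), signaled, emitted, emitted - signaled

line = ("May 17 11:18:00 ws01 kernel: [11831.268044] "
        "[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, "
        "last signaled seq=2081253, last emitted seq=2081256")
print(parse_ring_timeout(line))
```

Fed a whole journal, the same function could show whether the timeouts are always on sdma0, which is the case the CPU VM update suggestion later in the thread is aimed at.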

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-12:

#11

Can you boot the kernel with amdgpu.vm_update_mode=3 on the GRUB command line to force CPU VM update mode, and see whether this makes the issue go away?

Andrey
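
After rebooting, it is worth confirming that the parameter actually reached the kernel. A minimal sketch, assuming a Linux system where the boot command line is readable at /proc/cmdline; the helper names and the sample command line in the comment are hypothetical, and quoted values containing spaces are not handled:

```python
def kernel_param(cmdline, name):
    """Return the value of `name` on a kernel command line, or None if absent.

    A bare flag without '=' yields an empty string. If the parameter is
    repeated, the last occurrence is returned (an assumption of this sketch).
    """
    value = None
    for token in cmdline.split():
        key, _sep, val = token.partition("=")
        if key == name:
            value = val
    return value

def read_cmdline(path="/proc/cmdline"):
    """Read the running kernel's command line."""
    with open(path) as f:
        return f.read()

# Hypothetical example of what /proc/cmdline might contain:
#   BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet amdgpu.vm_update_mode=3
# Usage on a live system:
#   print(kernel_param(read_cmdline(), "amdgpu.vm_update_mode"))
```

If the function returns None after a reboot, the GRUB configuration was probably not regenerated or the wrong menu entry was booted.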

In Linux Kernel Bug Tracker #199749, muelladdi (muelladdi-linux-kernel-bugs) wrote on 2018-06-13:

#12

I added the parameter to the GRUB command line and will investigate.

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-16:

#13

Hello, I think I'm experiencing the same problem here. My Ryzen 5 2400G system freezes often, especially under high CPU and disk activity, even after applying the "Typical Current Idle" UEFI workaround. Sometimes I can reboot with SysRq, but most of the time I need a hard reset. The freezes never left a message in the kernel log until today, 2 months after I built the machine.

6月 16 23:29:00 sfc-DESKTOP kernel: [drm] No hardware hang detected. Did some blocks stall?
6月 16 23:29:00 sfc-DESKTOP kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=1217227, last emitted seq=1217229
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x000000011241a000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112414000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112416000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112410000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112412000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x000000011241f000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112425000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112427000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112429000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0:   at page 0x0000000112421000 from 27
6月 16 23:28:50 sfc-DESKTOP kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)

The HDD LED was still flashing after this freeze, whereas in the previous freezes it did not flash at all.

I'm using kernel 4.17.0-041700-generic on Ubuntu 18.04 with mesa 18.1.1-0~b~padoka0. The CPU runs at its stock 3.6GHz, with 16G of DDR4 memory running at 2133MHz and no overclocking.
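
A burst of faults like the one above is easier to read once summarized. A minimal sketch, assuming the "at page ... from ..." and "VMC page fault (...)" line formats shown in the log; the function name and the returned dictionary layout are hypothetical:

```python
import re

PAGE_RE = re.compile(r"at page (0x[0-9a-fA-F]+) from (\d+)")
FAULT_RE = re.compile(
    r"\[(\w+)\] VMC page fault "
    r"\(src_id:(\d+) ring:(\d+) vmid:(\d+) pasid:(\d+)\)"
)

def summarize_faults(lines):
    """Collect faulting page addresses and (hub, ring, vmid, pasid) tuples."""
    pages = []
    sources = set()
    for line in lines:
        m = PAGE_RE.search(line)
        if m:
            pages.append(int(m.group(1), 16))
            continue
        m = FAULT_RE.search(line)
        if m:
            hub, _src_id, ring, vmid, pasid = m.groups()
            sources.add((hub, int(ring), int(vmid), int(pasid)))
    if not pages:
        return None
    return {
        "count": len(pages),
        "lowest": hex(min(pages)),
        "highest": hex(max(pages)),
        "sources": sorted(sources),
    }

sample = [
    "kernel: amdgpu 0000:38:00.0:   at page 0x000000011241a000 from 27",
    "kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)",
    "kernel: amdgpu 0000:38:00.0:   at page 0x0000000112414000 from 27",
    "kernel: amdgpu 0000:38:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)",
]
print(summarize_faults(sample))
```

For the log in this comment, all faults come from the same hub, ring, vmid and pasid, and the addresses fall in one narrow range, which the summary makes visible at a glance.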

In Linux Kernel Bug Tracker #199749, chewi (chewi-linux-kernel-bugs) wrote on 2018-06-16:

#14

Not sure if this is related to bug #199653, which concerns freezing on the 2500U and 2700U. It hasn't received any attention from AMD or other kernel devs, but there is more information there that could potentially be useful. I tried amdgpu.vm_update_mode=3 but that didn't help.

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-16:

#15

Those two bugs are unrelated.

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-16:

#16

(In reply to James Le Cuirot from comment #4)
> Not sure if this is related to bug #199653, which concerns freezing on the
> 2500U and 2700U. It hasn't received any attention from AMD or other kernels
> devs but there is more information that could potentially be useful. I tried
> amdgpu.vm_update_mode=3 but that didn't help.

Are you seeing the sdma0 timeout message when the system freezes, like muelladdi above? I expect amdgpu.vm_update_mode=3 to help only in that case.

In Linux Kernel Bug Tracker #199749, chewi (chewi-linux-kernel-bugs) wrote on 2018-06-16:

#17

(In reply to Andrey Grodzovsky from comment #6)
> Are you seeing sdma0 timeout message when the system freezes like muelladi
> above ? I expect
> amdgpu.vm_update_mode=3 to help only in that case.

I haven't been able to get any information as I have been unable to access the system following these freezes. Judging by the output from the other reporter, they do indeed seem unrelated. Sorry for the noise but some attention on that issue would be hugely appreciated. It's so bad, I've considered selling the laptop.

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-17:

#18

Today I tried the amdgpu.vm_update_mode=3 option, and my computer still froze. This time no log was recorded.

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-17:

#19

I found that I can reproduce the freeze by compiling two Android ROMs at the same time while browsing or doing something else. It freezes within 3 hours at most. I have a 300W PSU, and 2 SSDs + 1 HDD for storage. The Android sources are stored on the HDD. This time I could reboot with SysRq, so it does not seem to be a PSU fault.

In Linux Kernel Bug Tracker #199749, muelladdi (muelladdi-linux-kernel-bugs) wrote on 2018-06-18:

#20

So far I have not had any more freezes during normal, non-3D work.

In Linux Kernel Bug Tracker #199749, michel (michel-linux-kernel-bugs) wrote on 2018-06-18:

#21

Other Raven Ridge users have reported that updating to the current microcode files from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu has fixed stability issues.

Make sure your system BIOS and CPU microcode are up to date as well.

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-21:

#22

I updated to the current microcode files from linux-firmware.git. Today I replicated the same workload, and my computer froze in just 5 minutes! Even SysRq did not work; I had to do a hard reset.

Still, no log was recorded.

I have an ASRock AB350M-Pro4 motherboard with the latest UEFI version, L4.82.

In Linux Kernel Bug Tracker #199749, michel (michel-linux-kernel-bugs) wrote on 2018-06-21:

#23

(In reply to notsyncing from comment #12)
> I updated to the current microcode files at linux-firmware.git.

Did you update the microcode files in the initrd as well?

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-22:

#24

(In reply to Michel Dänzer from comment #13)
> Did you update the microcode files in the initrd as well?

I just copied the files to /lib/firmware. I'll try with update-initramfs.

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-24:

#25

It still froze after 3 hours under two Android source compilations + 2 IntelliJ IDEA instances + 10 Firefox tabs + EVE Online. SysRq did not work and a hard reset was needed. No log was recorded.

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-25:

#26

What kernel version are you using?

We can try to figure out what the last commands in the HW were before you experienced the page fault.

You can clone and install our register analyzer from here: https://cgit.freedesktop.org/amd/umr/

Then launch X with the environment variable GALLIUM_DDEBUG=always to dump all the 3D commands into files in ~/ddebug_dumps/

Run your workload.

After you experience the GPU page fault again, please provide the following outputs:

sudo umr -lb
sudo umr -O verbose,follow_ib -R gfx[.]
sudo umr -O bits -wa
sudo umr -O many,bits -r *.*.mmGRBM_STATUS
sudo umr -O many,bits -r *.*.HEADER_DUMP
sudo umr -O many,bits -r *.*.CP_EOP
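
Because the machine may not stay usable for long after a fault, it can help to capture all of these outputs in one go. A minimal sketch, assuming umr is installed and the script is run as root; the `collect` helper is hypothetical, the command strings are taken from the list above (with a space after `-r` in every case), and whether a given umr build accepts each option is not something this sketch can guarantee:

```python
import shlex
import subprocess

# Commands as requested in the comment above, without the sudo prefix.
UMR_COMMANDS = [
    "umr -lb",
    "umr -O verbose,follow_ib -R gfx[.]",
    "umr -O bits -wa",
    "umr -O many,bits -r *.*.mmGRBM_STATUS",
    "umr -O many,bits -r *.*.HEADER_DUMP",
    "umr -O many,bits -r *.*.CP_EOP",
]

def collect(commands=UMR_COMMANDS, run=subprocess.run):
    """Run each command and return {command: (returncode, combined output)}.

    Commands are split with shlex and executed without a shell, so the
    '*' and '[.]' patterns reach umr unmodified.
    """
    results = {}
    for cmd in commands:
        argv = shlex.split(cmd)
        proc = run(argv, capture_output=True, text=True)
        results[cmd] = (proc.returncode, proc.stdout + proc.stderr)
    return results

# Usage (as root, on the affected machine):
#   for cmd, (rc, out) in collect().items():
#       print(f"$ {cmd}  (exit {rc})\n{out}")
```

The `run` parameter exists only so the function can be exercised without umr present; on a real system the default `subprocess.run` is used.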

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-26:

#27

(In reply to Andrey Grodzovsky from comment #16)
> Then launch your X with ENV variable GALLIUM_DDEBUG=always to dump all the
> 3D commands into files in ~/ddebug_dumps/

Would you mind telling me how to set this variable? I googled and could not find any information. Should I add it to /etc/X11/Xsession or somewhere else? Thanks very much!

I'm on kernel 4.17.2-041702-generic, Ubuntu 18.04, mesa 18.1.1-1ubuntu1~18.04.0~ppa1

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-26:

#28

BTW, I'm using KDE plasma 5.13 with sddm.

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-26:

#29

(In reply to notsyncing from comment #18)
> BTW, I'm using KDE plasma 5.13 with sddm.

You just prepend it to the command that starts your graphics stack. You need to start the graphics stack manually from the command line with the variable in front. E.g. on Ubuntu I would disable the graphics launch on boot from GRUB, then run this from a terminal:
GALLIUM_DDEBUG=always service lightdm start

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-26:

#31

(In reply to Andrey Grodzovsky from comment #20)
> You just prepend this before command to start your graphic stack. You need
> to manually run your graphic stack from command line and add this before.
> E.G. for me on Ubuntu I will disable graphics launch on boot from GRUB. Then
> from terminal I will run
> GALLIUM_DDEBUG=always service lightdm start

I compiled and installed umr and executed "GALLIUM_DDEBUG=always service sddm start" on a root tty, and no "ddebug_dumps" directory was found in /root or in my home directory after desktop started. Is that normal?

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-26:

#32

(In reply to notsyncing from comment #21)
> (In reply to Andrey Grodzovsky from comment #20)
> > You just prepend this before command to start your graphic stack. You need
> > to manually run your graphic stack from command line and add this before.
> > E.G. for me on Ubuntu I will disable graphics launch on boot from GRUB.
> Then
> > from terminal I will run
> > GALLIUM_DDEBUG=always service lightdm start
>
> I compiled and installed umr and executed "GALLIUM_DDEBUG=always service
> sddm start" on a root tty, and no "ddebug_dumps" directory was found in
> /root or in my home directory after desktop started. Is that normal?

I checked this myself: you need to set this variable for the specific graphics application you are running. Can you pinpoint what graphics work is going on when you get these faults?
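
Setting the variable for a single application, rather than for the display manager, can be sketched as follows. This is a minimal illustration, assuming the application is started from a script; the helper names are hypothetical, and only the variable name and the value `always` come from the thread:

```python
import os
import subprocess

def ddebug_env(base=None, mode="always"):
    """Return a copy of the environment with GALLIUM_DDEBUG set."""
    env = dict(os.environ if base is None else base)
    env["GALLIUM_DDEBUG"] = mode
    return env

def run_with_ddebug(argv, **kwargs):
    """Start one application with GALLIUM_DDEBUG set only for that process."""
    return subprocess.run(argv, env=ddebug_env(), **kwargs)

# Usage, with firefox as an example application:
#   run_with_ddebug(["firefox"])
```

This is equivalent to typing `GALLIUM_DDEBUG=always firefox` in a terminal; the script form is mainly useful when several applications need to be tested one at a time.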

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-27:

#33

In fact there is no specific graphics work. I just left an Android source compilation running and went to bed, with only Firefox and Dolphin running besides. The next morning, I found the machine had already frozen. Still no log. The log I posted was just a lucky one.

Would you mind telling me whether umr has some system-wide debugging methods? Or should I set up a netconsole to see if there is anything?

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-27:

#34

(In reply to notsyncing from comment #23)
> In fact there is no specified graphic work. I just put an Android source
> compilation running and go to bed, besides only firefox and dolphin running.
> The next morning, I found the machine has already freezed. Still no log. The
> log I posted is just a lucky one.
>
> Would you mind telling me if the umr had some system-wide debugging methods?
> Or should I make a netconsole to see if there was anything?

UMR is system-wide anyway; it's a memory/registers/HW debugging tool. You can provide the UMR outputs I asked for once the freeze has happened, assuming you still have SSH access (which it seems you don't).

Since the memory faults you experience are clearly due to some graphics rendering activity, maybe you could try to isolate the app which triggers them: repeat what you did, but close both Firefox and Dolphin, and check whether it still happens. If not, try to find which of them was causing it, and then we can run it with Mesa debug flags.

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-29:

#35

Created attachment 277059
Trace process

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-06-29:

#36

Created attachment 277061
Trace process 2

I attached 2 patches which, if applied to your kernel, should tell which process caused the VM_FAULT. Please also boot your kernel from GRUB with the following parameter:
amdgpu.vm_fault_stop=2

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-30:

#37

After 3 days, I managed to reproduce it again with 2 Android compilations and Firefox, using the kernel parameter mem=4096m (I have 16GB of memory). I found that it's easier to reproduce when memory is full.

sfc@sfc-DESKTOP:~$ sudo umr -lb
        raven1.gfx91
        raven1.vcn10
        raven1.dcn10
        raven1.nbio70
        raven1.sdma041
        raven1.hdp40
        raven1.oss40
        raven1.mmhub91
        raven1.mp100
sfc@sfc-DESKTOP:~$ sudo umr -O verbose,follow_ib -R gfx[.]
error: Unknown option [follow_ib]
sfc@sfc-DESKTOP:~$ sudo umr -O bits -wa
No active waves!
sfc@sfc-DESKTOP:~$ sudo umr -O many,bits  -r*.*.mmGRBM_STATUS
[ERROR]: Unknown option <-r*.*.mmGRBM_STATUS>
sfc@sfc-DESKTOP:~$ sudo umr -O many,bits  -r *.*.HEADER_DUMP
gfx91.mmCP_MEC_ME1_HEADER_DUMP => 0xc0000e00
        .HEADER_DUMP[0:31]                                               == 3221229056 (0xc0000e00)
gfx91.mmCP_MEC_ME2_HEADER_DUMP => 0xdef0def0
        .HEADER_DUMP[0:31]                                               == 3740327664 (0xdef0def0)
gfx91.mmCP_ME_HEADER_DUMP => 0xc0004200
        .ME_HEADER_DUMP[0:31]                                            == 3221242368 (0xc0004200)
gfx91.mmCP_PFP_HEADER_DUMP => 0xffff1000
        .PFP_HEADER_DUMP[0:31]                                           == 4294905856 (0xffff1000)
gfx91.mmCP_CE_HEADER_DUMP => 0xffff1000
        .CE_HEADER_DUMP[0:31]                                            == 4294905856 (0xffff1000)
sfc@sfc-DESKTOP:~$ sudo umr -O many,bits  -r *.*.CP_EOP
gfx91.mmCP_EOPQ_WAIT_TIME => 0x0000052c
        .WAIT_TIME[0:9]                                                  ==      300 (0x0000012c)
        .SCALE_COUNT[10:17]                                              ==        1 (0x00000001)
gfx91.mmCP_EOP_DONE_ADDR_LO => 0x00609000
        .ADDR_LO[2:31]                                                   ==  1582080 (0x00182400)
gfx91.mmCP_EOP_DONE_ADDR_HI => 0x000000f5
        .ADDR_HI[0:15]                                                   ==      245 (0x000000f5)
gfx91.mmCP_EOP_DONE_DATA_LO => 0x000008e4
        .DATA_LO[0:31]                                                   ==     2276 (0x000008e4)
gfx91.mmCP_EOP_DONE_DATA_HI => 0x00000000
        .DATA_HI[0:31]                                                   ==        0 (0x00000000)
gfx91.mmCP_EOP_LAST_FENCE_LO => 0x000008e4
        .LAST_FENCE_LO[0:31]                                             ==     2276 (0x000008e4)
gfx91.mmCP_EOP_LAST_FENCE_HI => 0x00000000
        .LAST_FENCE_HI[0:31]                                             ==        0 (0x00000000)
gfx91.mmCP_EOP_DONE_EVENT_CNTL => 0x00038060
        .WBINV_TC_OP[0:6]                                                ==       96 (0x00000060)
        .WBINV_ACTION_ENA[12:17]                                         ==       56 (0x00000038)
        .CACHE_POLICY[25:25]                                             ==        0 (0x00000000)
        .EXECUTE[28:28]                                                  ==        0 (0x00000000)
gfx91.mmCP_EOP_DONE_DATA_CNTL => 0x40010000
        .DST_SEL[16:17]                                                  ==        1 (0x00000001)
        .INT_SEL[24:26]                                                  ==        0 (0x00000000)
        .DATA_SEL[29:31]                                                 ==        2 (0x00000002)
gfx91.mmCP_EOP_DONE_CNTX_ID => 0x00000000
        .CNTX_ID[0:31]                                                   ==        0 (0x00000000)

ddebug_dumps:

---
Command: /usr/lib/firefox/firefox 
Driver vendor: X.Org
Device vendor: AMD
Device name: AMD RAVEN (DRM 3.25.0, 4.17.2-041702-generic, LLVM 6.0.0)

Remainder of driver log:
---

I tried netconsole and could not get it working. I bought a serial converter and it's on the way. When it is delivered, I will try to get the log from the serial port. Then I can try your patches, because currently the logs are not persisted at all. Thanks for your patches.
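
The `-O bits` output in this comment prints each register value followed by its fields as `NAME[lo:hi]`. The field values can be re-derived from the raw register value with a shift and a mask. A minimal sketch, assuming `lo` and `hi` are inclusive bit positions with bit 0 as the least significant bit (this matches the numbers printed above; the helper name is hypothetical):

```python
def field(value, lo, hi):
    """Extract bits lo..hi (inclusive) from a register value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)

# Values taken from the umr output in this comment:
print(field(0x0000052C, 0, 9))     # mmCP_EOPQ_WAIT_TIME.WAIT_TIME
print(field(0x0000052C, 10, 17))   # mmCP_EOPQ_WAIT_TIME.SCALE_COUNT
print(field(0x00038060, 12, 17))   # mmCP_EOP_DONE_EVENT_CNTL.WBINV_ACTION_ENA
```

The three calls yield 300, 1 and 56, the same numbers umr reports for those fields, so the helper can be used to decode raw register dumps taken without the `bits` option.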

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-30:

#38

By the way, these umr commands were executed after a reboot.

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-06-30:

#39

I just upgraded to mesa 18.1.3 and kernel 4.17.3. I ran Firefox with GALLIUM_DDEBUG after a reboot. It produced the following after I opened some tabs and Firefox stopped responding.

---
Gallium debugger active. Logging all calls.
Hang detection timeout is 1000ms.
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000000
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000001
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000002
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000003
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000004
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000005
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000006
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000007
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000008
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000009
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000010
dd: can't create a directory (13)
dd: can't create a directory (13)
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000011
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000012
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000013
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000014
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000015
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000016
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000017
dd: can't create a directory (13)
dd: failed to open /home/sfc/ddebug_dumps/firefox_5057_00000018
GPU hang detected, collecting information...

Draw #   driver  prev BOP  TOP  BOP  dump file
-------------------------------------------------------------
8         YES      NO      NO   NO   dd: can't create a directory (13)
fopen failed

Done.
Sandbox: seccomp sandbox violation: pid 5057, tid 5268, syscall 162, args 140631883336544 7 140633020008624 140632174061559 7 140632344285188.
dd: Aborting the process...
[Parent 4748, Gecko_IOThread] WARNING: pipe error (72): Connection reset by peer: file /build/firefox-m9FtQy/firefox-60.0.2+build1/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 353
[Parent 4748, Gecko_IOThread] WARNING: pipe error (113): Connection reset by peer: file /build/firefox-m9FtQy/firefox-60.0.2+build1/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 353

###!!! [Parent][MessageChannel] Error: (msgtype=0x15007F,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

---
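
The number in "dd: can't create a directory (13)" looks like an errno value. A minimal sketch for looking it up, assuming the number really is the errno of the failed directory creation (the thread does not confirm this) and that the lookup is done on Linux; the helper name is hypothetical:

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)

name, message = describe_errno(13)
print(name, "-", message)
```

On Linux, 13 maps to EACCES, i.e. a permission problem when writing under ~/ddebug_dumps, which would fit the seccomp sandbox message later in the same log. The message text printed by `os.strerror` depends on the system locale.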

The umr commands are:

---
sfc@sfc-DESKTOP:~$ sudo umr -lb
        raven1.gfx91
        raven1.vcn10
        raven1.dcn10
        raven1.nbio70
        raven1.sdma041
        raven1.hdp40
        raven1.oss40
        raven1.mmhub91
        raven1.mp100
sfc@sfc-DESKTOP:~$ sudo umr -O verbose,follow_ib -R gfx[.]
error: Unknown option [follow_ib]
sfc@sfc-DESKTOP:~$ sudo umr -O bits -wa
No active waves!
sfc@sfc-DESKTOP:~$ sudo umr -O many,bits  -r*.*.mmGRBM_STATUS
[ERROR]: Unknown option <-r*.*.mmGRBM_STATUS>
sfc@sfc-DESKTOP:~$ sudo umr -O many,bits  -r *.*.HEADER_DUMP
gfx91.mmCP_MEC_ME1_HEADER_DUMP => 0xc0000e00
        .HEADER_DUMP[0:31]                                               == 3221229056 (0xc0000e00)
gfx91.mmCP_MEC_ME2_HEADER_DUMP => 0xdef0def0
        .HEADER_DUMP[0:31]                                               == 3740327664 (0xdef0def0)
gfx91.mmCP_ME_HEADER_DUMP => 0xc0004200
        .ME_HEADER_DUMP[0:31]                                            == 3221242368 (0xc0004200)
gfx91.mmCP_PFP_HEADER_DUMP => 0xffff1000
        .PFP_HEADER_DUMP[0:31]                                           == 4294905856 (0xffff1000)
gfx91.mmCP_CE_HEADER_DUMP => 0xffff1000
        .CE_HEADER_DUMP[0:31]                                            == 4294905856 (0xffff1000)
sfc@sfc-DESKTOP:~$ sudo umr -O many,bits  -r *.*.CP_EOP
gfx91.mmCP_EOPQ_WAIT_TIME => 0x0000052c
        .WAIT_TIME[0:9]                                                  ==      300 (0x0000012c)
        .SCALE_COUNT[10:17]                                              ==        1 (0x00000001)
gfx91.mmCP_EOP_DONE_ADDR_LO => 0x00609000
        .ADDR_LO[2:31]                                                   ==  1582080 (0x00182400)
gfx91.mmCP_EOP_DONE_ADDR_HI => 0x000000f5
        .ADDR_HI[0:15]                                                   ==      245 (0x000000f5)
gfx91.mmCP_EOP_DONE_DATA_LO => 0x00001d01
        .DATA_LO[0:31]                                                   ==     7425 (0x00001d01)
gfx91.mmCP_EOP_DONE_DATA_HI => 0x00000000
        .DATA_HI[0:31]                                                   ==        0 (0x00000000)
gfx91.mmCP_EOP_LAST_FENCE_LO => 0x00001d01
        .LAST_FENCE_LO[0:31]                                             ==     7425 (0x00001d01)
gfx91.mmCP_EOP_LAST_FENCE_HI => 0x00000000
        .LAST_FENCE_HI[0:31]                                             ==        0 (0x00000000)
gfx91.mmCP_EOP_DONE_EVENT_CNTL => 0x00038060
        .WBINV_TC_OP[0:6]                                                ==       96 (0x00000060)
        .WBINV_ACTION_ENA[12:17]                                         ==       56 (0x00000038)
        .CACHE_POLICY[25:25]                                             ==        0 (0x00000000)
        .EXECUTE[28:28]                                                  ==        0 (0x00000000)
gfx91.mmCP_EOP_DONE_DATA_CNTL => 0x40010000
        .DST_SEL[16:17]                                                  ==        1 (0x00000001)
        .INT_SEL[24:26]                                                  ==        0 (0x00000000)
        .DATA_SEL[29:31]                                                 ==        2 (0x00000002)
gfx91.mmCP_EOP_DONE_CNTX_ID => 0x00000000
        .CNTX_ID[0:31]                                                   ==        0 (0x00000000)
---
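
For readers checking the umr output above by hand: each indented line is a bitfield of the raw register value printed on the line before it. The following is a minimal sketch of that decoding, assuming the `[lo:hi]` ranges are inclusive bit positions with bit 0 as the least significant bit. The helper name `extract_field` is made up for illustration and is not part of umr.

```python
def extract_field(value: int, lo: int, hi: int) -> int:
    """Return bits lo..hi (inclusive, bit 0 = least significant) of value."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


# Raw register values copied from the umr output above.
cp_eopq_wait_time = 0x0000052C
cp_eop_done_addr_lo = 0x00609000

print(extract_field(cp_eopq_wait_time, 0, 9))         # WAIT_TIME   -> 300
print(extract_field(cp_eopq_wait_time, 10, 17))       # SCALE_COUNT -> 1
print(hex(extract_field(cp_eop_done_addr_lo, 2, 31))) # ADDR_LO     -> 0x182400
```

The three printed values match the decoded fields umr reports for the same registers.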

No GPU hang log in dmesg.

Revision history for this message

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-07-01:

#40

Created attachment 277101
Kernel log from serial port when it freezes

Finally got logs from the serial port when it froze. It seems my problem has nothing to do with amdgpu. Maybe I should file a new bug.

Revision history for this message

In Linux Kernel Bug Tracker #199749, chewi (chewi-linux-kernel-bugs) wrote on 2018-07-01:

#41

(In reply to notsyncing from comment #30)
> Finally got logs from the serial port when it froze. It seems my problem has
> nothing to do with amdgpu. Maybe I should file a new bug.

I may be off the mark but that looks more like bug #196683. Have you tried adjusting "Power Supply Idle Control" in the BIOS (if you have it) or using zenstates.py to disable the C6 package state?

Revision history for this message

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-07-01:

#42

I've set that option to "Typical Current Idle" and it still freezes. The logs in 196683 point to RCU, which does not seem to be my case. I suspect it's due to zram. Now I'm trying to reproduce it with zram disabled.

Revision history for this message

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-07-14:

#43

About half a month has passed, and my machine is running fine with zram disabled. No more freezes. It seems my problem is related to zram, not this bug, and the "ring gfx timeout" error never happened again. Thanks to everyone who gave me advice.
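
For anyone wanting to rule zram in or out on their own machine, a minimal sketch follows, assuming zram is used as swap and therefore shows up in /proc/swaps under a /dev/zram* name. The function name `zram_swap_devices` is hypothetical; it only reports what is active and does not disable anything.

```python
from pathlib import Path
from typing import List


def zram_swap_devices(swaps_text: str) -> List[str]:
    """Return the zram devices listed in the contents of /proc/swaps."""
    devices = []
    # The first line of /proc/swaps is a column header, so skip it.
    for line in swaps_text.splitlines()[1:]:
        fields = line.split()
        if fields and fields[0].startswith("/dev/zram"):
            devices.append(fields[0])
    return devices


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    text = swaps.read_text() if swaps.exists() else ""
    print(zram_swap_devices(text) or "no zram swap active")
```

An empty result only means no zram swap device is currently enabled; it says nothing about zram block devices used for other purposes.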

Revision history for this message

In Linux Kernel Bug Tracker #199749, andrey.grodzovsky (andrey.grodzovsky-linux-kernel-bugs) wrote on 2018-08-02:

#44

notsyncing, can you close this ticket then ?

Revision history for this message

In Linux Kernel Bug Tracker #199749, song.fc (song.fc-linux-kernel-bugs) wrote on 2018-08-04:

#45

(In reply to Andrey Grodzovsky from comment #34)
> notsyncing, can you close this ticket then ?

This is not my ticket.

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#1

Example error message with 4.18.0.13.63 kernel

[11569.691337] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691344] amdgpu 0000:0a:00.0:   at page 0x0000000106200000 from 27
[11569.691348] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
[11569.691356] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691359] amdgpu 0000:0a:00.0:   at page 0x0000000106200000 from 27
[11569.691362] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691370] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691373] amdgpu 0000:0a:00.0:   at page 0x0000000106200000 from 27
[11569.691376] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691383] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691386] amdgpu 0000:0a:00.0:   at page 0x0000000106201000 from 27
[11569.691389] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691396] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691399] amdgpu 0000:0a:00.0:   at page 0x0000000106200000 from 27
[11569.691402] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691409] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691412] amdgpu 0000:0a:00.0:   at page 0x0000000106201000 from 27
[11569.691415] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691422] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691425] amdgpu 0000:0a:00.0:   at page 0x0000000106200000 from 27
[11569.691428] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691435] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691438] amdgpu 0000:0a:00.0:   at page 0x0000000106201000 from 27
[11569.691441] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691448] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691451] amdgpu 0000:0a:00.0:   at page 0x0000000106201000 from 27
[11569.691454] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11569.691461] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768)
[11569.691464] amdgpu 0000:0a:00.0:   at page 0x0000000106204000 from 27
[11569.691467] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[11579.766498] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=42143, last emitted seq=42144
[11579.766504] [drm] GPU recovery disabled.
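
Logs like the one above repeat the same few fault addresses many times, so it can help to condense them before comparing reports. Below is a minimal sketch, assuming the two message formats seen in this thread ("at page ..." on 4.18 and "in page starting at address ..." on 4.20). The function name `summarize_faults` and the regular expressions are illustrative, not taken from any amdgpu tooling.

```python
import re
from collections import Counter

# Header line of each fault report; only the numeric IDs are captured.
FAULT_RE = re.compile(
    r"VMC page fault \(src_id:(\d+) ring:(\d+) vmid:(\d+) pasid:(\d+)"
)
# Address line, in either the 4.18 or the 4.20 wording.
PAGE_RE = re.compile(
    r"(?:at page|in page starting at address) (0x[0-9a-fA-F]+) from (\d+)"
)


def summarize_faults(log_text):
    """Count faults per VMID and per faulting page address."""
    vmids = Counter()
    pages = Counter()
    for line in log_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            vmids[int(match.group(3))] += 1
            continue
        match = PAGE_RE.search(line)
        if match:
            pages[int(match.group(1), 16)] += 1
    return vmids, pages


if __name__ == "__main__":
    import sys

    vmids, pages = summarize_faults(sys.stdin.read())
    for vmid, count in sorted(vmids.items()):
        print(f"vmid {vmid}: {count} faults")
    for page, count in sorted(pages.items()):
        print(f"page {page:#018x}: {count} faults")
```

Saved under a name of your choice, it can be fed from the kernel log, for example `dmesg | python3 summarize_faults.py`. Note that the kernel rate-limits these messages ("callbacks suppressed"), so the counts are a lower bound.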

tags:	added: apport-collected bionic
description:	updated

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcCpuinfoMinimal.txt

#2

ProcCpuinfoMinimal.txt (1.4 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcEnviron.txt

#3

ProcEnviron.txt (115 bytes, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#4

Reproduced with Mainline kernel

wget -c https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc4/linux-image-unsigned-4.20.0-042000rc4-generic_4.20.0-042000rc4.201812030528_amd64.deb https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.20-rc4/linux-modules-4.20.0-042000rc4-generic_4.20.0-042000rc4.201812030528_amd64.deb

apt-get install linux*4.20*

New error message

[ 6891.737234] gmc_v9_0_process_interrupt: 14 callbacks suppressed
[ 6891.737240] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737247] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101300000 from 27
[ 6891.737250] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
[ 6891.737259] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737262] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101300000 from 27
[ 6891.737265] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737274] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737277] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101300000 from 27
[ 6891.737279] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737287] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737290] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101301000 from 27
[ 6891.737293] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737301] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737303] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101300000 from 27
[ 6891.737306] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737314] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737317] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101301000 from 27
[ 6891.737319] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737327] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737330] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101300000 from 27
[ 6891.737332] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737340] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737343] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101301000 from 27
[ 6891.737345] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737353] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737356] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101302000 from 27
[ 6891.737358] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6891.737366] amdgpu 0000:0a:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:1 pasid:32768, for process Xorg pid 1741 thread amdgpu_cs:0 pid 1793)
[ 6891.737369] amdgpu 0000:0a:00.0:   in page starting at address 0x0000000101302000 from 27
[ 6891.737372] amdgpu 0000:0a:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[ 6901.825330] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=667244, emitted seq=667245
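
The timeout line that ends each of these logs carries two fence sequence numbers. A minimal sketch for extracting them is shown below, assuming both wordings seen in this thread ("last signaled seq=" on 4.18 and "signaled seq=" on 4.20). Reading the difference as the number of submissions whose fences had not yet signaled is my interpretation, and the name `pending_jobs` is made up for illustration.

```python
import re

TIMEOUT_RE = re.compile(
    r"ring (\w+) timeout, (?:last )?signaled seq=(\d+), (?:last )?emitted seq=(\d+)"
)


def pending_jobs(line):
    """Return (ring name, emitted - signaled) or None if the line does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    ring = match.group(1)
    signaled = int(match.group(2))
    emitted = int(match.group(3))
    return ring, emitted - signaled


line = ("[ 6901.825330] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx "
        "timeout, signaled seq=667244, emitted seq=667245")
print(pending_jobs(line))  # ('gfx', 1)
```

In the logs quoted in this report the gap is between 1 and 3, which suggests the ring stalled on one of the most recent submissions rather than on a long backlog.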

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#5

This appears to be similar to

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1767667
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1804505
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1772081

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#6

Note that the crash happens during MPEG2 and H.264 video playback under MythTV

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#7

Potential upstream kernel issue

https://bugzilla.kernel.org/show_bug.cgi?id=199749

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#8

# lspci -nn | grep -E 'VGA|Display'
0a:00.0 VGA compatible controller [0300]: Advanced Micro Devices, Inc. [AMD/ATI] Vega [Radeon Vega 8 Mobile] [1002:15dd] (rev cb)

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04:

#9

Another potential upstream BZ
- https://bugs.freedesktop.org/show_bug.cgi?id=108098

Bug Watch Updater (bug-watch-updater) on 2019-01-04

Changed in linux:
importance:	Unknown → Medium
status:	Unknown → Confirmed

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: AlsaInfo.txt

#46

AlsaInfo.txt (21.5 KiB, text/plain)

apport information

description:	updated

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: CRDA.txt

#47

CRDA.txt (468 bytes, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: CurrentDmesg.txt

#48

CurrentDmesg.txt (94.6 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: IwConfig.txt

#49

IwConfig.txt (140 bytes, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: Lspci.txt

#50

Lspci.txt (83.7 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: Lsusb.txt

#51

Lsusb.txt (1.3 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcCpuinfo.txt

#52

ProcCpuinfo.txt (5.4 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcCpuinfoMinimal.txt

#53

ProcCpuinfoMinimal.txt (1.4 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcEnviron.txt

#54

ProcEnviron.txt (115 bytes, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcInterrupts.txt

#55

ProcInterrupts.txt (5.0 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: ProcModules.txt

#56

ProcModules.txt (7.3 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: UdevDb.txt

#57

UdevDb.txt (283.9 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-04: WifiSyslog.txt

#58

WifiSyslog.txt (132.1 KiB, text/plain)

apport information

Revision history for this message

Steven Ellis (steven-openmedia) wrote on 2019-01-06:

#59

Switched back to the Stock 18.04 LTS kernel and installed
- amdgpu-pro-18.50-708488-ubuntu-18.04
- https://www.amd.com/en/support/kb/release-notes/rn-rad-lin-18-50-unified

With plain amdgpu support VDPAU doesn't work correctly, so I needed to install the full amdgpu-pro version. So far the system is stable and I haven't had any new amdgpu lockups.

I'd prefer to use the pure amdgpu open-source driver, but even the upstream 4.20 kernel has issues.

Revision history for this message

experimancer (experimancer) wrote on 2019-01-15:

#60

I have it too (random hangs of the whole system, either under load or idle):
- Ryzen 5 2400G
- Ubuntu 18.04 and 18.10
- kernel 4.18

Log: /var/log/syslog (or kern.log) just before the freeze, after which only a hard reset (power off/on) makes the system boot again:

Jan 15 02:34:13 elrond kernel: [  476.185425] gmc_v9_0_process_interrupt: 32 callbacks suppressed
Jan 15 02:34:13 elrond kernel: [  476.185429] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185433] amdgpu 0000:0b:00.0:   at page 0x000000010780a000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185434] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
Jan 15 02:34:13 elrond kernel: [  476.185440] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185441] amdgpu 0000:0b:00.0:   at page 0x000000010780b000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185443] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185448] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185449] amdgpu 0000:0b:00.0:   at page 0x0000000107805000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185450] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185455] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185457] amdgpu 0000:0b:00.0:   at page 0x0000000107806000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185458] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185463] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185464] amdgpu 0000:0b:00.0:   at page 0x0000000107808000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185465] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185470] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185471] amdgpu 0000:0b:00.0:   at page 0x0000000107809000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185473] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185478] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185479] amdgpu 0000:0b:00.0:   at page 0x0000000107803000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185480] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185485] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185486] amdgpu 0000:0b:00.0:   at page 0x0000000107804000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185487] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185492] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185493] amdgpu 0000:0b:00.0:   at page 0x0000000107809000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185494] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:13 elrond kernel: [  476.185500] amdgpu 0000:0b:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32768)
Jan 15 02:34:13 elrond kernel: [  476.185501] amdgpu 0000:0b:00.0:   at page 0x0000000107808000 from 27
Jan 15 02:34:13 elrond kernel: [  476.185502] amdgpu 0000:0b:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
Jan 15 02:34:23 elrond kernel: [  486.421830] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, last signaled seq=41331, last emitted seq=41333
Jan 15 02:34:23 elrond kernel: [  486.421834] [drm] GPU recovery disabled.

$ sudo lshw -c video
  *-display                 
       description: VGA compatible controller
       product: Advanced Micro Devices, Inc. [AMD/ATI]
       vendor: Advanced Micro Devices, Inc. [AMD/ATI]
       physical id: 0
       bus info: pci@0000:0b:00.0
       version: c6
       width: 64 bits
       clock: 33MHz
       capabilities: pm pciexpress msi msix vga_controller bus_master cap_list rom
       configuration: driver=amdgpu latency=0
       resources: irq:88 memory:e0000000-efffffff memory:f0000000-f01fffff ioport:d000(size=256) memory:fe300000-fe37ffff memory:c0000-dfff

$ uname -a
Linux elrond 4.18.20-041820-generic #201812030624 SMP Mon Dec 3 11:25:55 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ glxinfo | grep Mesa
client glx vendor string: Mesa Project and SGI
OpenGL core profile version string: 4.5 (Core Profile) Mesa 18.2.2
OpenGL version string: 4.4 (Compatibility Profile) Mesa 18.2.2
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 18.2.2

$ dmesg | grep drm
[    1.641459] [drm] amdgpu kernel modesetting enabled.
[    1.645616] fb: switching to amdgpudrmfb from EFI VGA
[    1.645835] [drm] initializing kernel modesetting (RAVEN 0x1002:0x15DD 0x1043:0x876B 0xC6).
[    1.645844] [drm] register mmio base: 0xFE300000
[    1.645845] [drm] register mmio size: 524288
[    1.645851] [drm] probing gen 2 caps for device 1022:15db = 700d03/e
[    1.645853] [drm] probing mlw for device 1022:15db = 700d03
[    1.645855] [drm] add ip block number 0 <soc15_common>
[    1.645856] [drm] add ip block number 1 <gmc_v9_0>
[    1.645856] [drm] add ip block number 2 <vega10_ih>
[    1.645857] [drm] add ip block number 3 <psp>
[    1.645857] [drm] add ip block number 4 <powerplay>
[    1.645858] [drm] add ip block number 5 <dm>
[    1.645859] [drm] add ip block number 6 <gfx_v9_0>
[    1.645859] [drm] add ip block number 7 <sdma_v4_0>
[    1.645860] [drm] add ip block number 8 <vcn_v1_0>
[    1.645889] [drm] VCN decode is enabled in VM mode
[    1.645890] [drm] VCN encode is enabled in VM mode
[    1.669303] [drm] BIOS signature incorrect 0 0
[    1.669348] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    1.669358] [drm] Detected VRAM RAM=1024M, BAR=1024M
[    1.669359] [drm] RAM width 128bits DDR4
[    1.669466] [drm] amdgpu: 1024M of VRAM memory ready
[    1.669467] [drm] amdgpu: 3072M of GTT memory ready.
[    1.669473] [drm] GART: num cpu pages 262144, num gpu pages 262144
[    1.669634] [drm] PCIE GART of 1024M enabled (table at 0x000000F400900000).
[    1.670752] [drm] use_doorbell being set to: [true]
[    1.670839] [drm] Found VCN firmware Version: 1.73 Family ID: 18
[    1.670841] [drm] PSP loading VCN firmware
[    1.843030] [drm] Display Core initialized with v3.1.44!
[    1.868553] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    1.868554] [drm] Driver supports precise vblank timestamp query.
[    1.891323] [drm] VCN decode and encode initialized successfully.
[    1.892629] [drm] fb mappable at 0xA1100000
[    1.892630] [drm] vram apper at 0xA0000000
[    1.892630] [drm] size 8294400
[    1.892630] [drm] fb depth is 24
[    1.892631] [drm]    pitch is 7680
[    1.892683] fbcon: amdgpudrmfb (fb0) is primary device
[    1.950133] amdgpu 0000:0b:00.0: fb0: amdgpudrmfb frame buffer device
[    1.966332] [drm] Initialized amdgpu 3.26.0 20150101 for 0000:0b:00.0 on minor 0
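
A few of the numbers in the dmesg output above can be cross-checked against each other with simple arithmetic. The sketch below does that, assuming 4 KiB pages and 4 bytes per pixel; neither is stated in the log, and the constant and function names are made up for this example.

```python
# Values copied from the dmesg output above.
GART_GPU_PAGES = 262144   # "GART: num cpu pages 262144, num gpu pages 262144"
VM_SIZE_GB = 262144       # "vm size is 262144 GB, 4 levels, block size is 9-bit"
VM_LEVELS = 4
VM_BLOCK_BITS = 9
FB_SIZE = 8294400         # "size 8294400"
FB_PITCH = 7680           # "pitch is 7680"

# Assumptions, not stated in the log: 4 KiB pages and 4 bytes per pixel.
PAGE_SIZE = 4096
PAGE_SHIFT = 12
BYTES_PER_PIXEL = 4


def gart_size_mib():
    """GART size in MiB implied by the page count."""
    return GART_GPU_PAGES * PAGE_SIZE // 2**20


def vm_address_bits():
    """Number of address bits needed to cover the reported VM size."""
    return (VM_SIZE_GB * 2**30).bit_length() - 1


def fb_geometry():
    """(width, height) in pixels implied by the framebuffer pitch and size."""
    return FB_PITCH // BYTES_PER_PIXEL, FB_SIZE // FB_PITCH


if __name__ == "__main__":
    print("GART size (MiB):", gart_size_mib())
    print("VM address bits:", vm_address_bits())
    print("levels * block bits + page shift:",
          VM_LEVELS * VM_BLOCK_BITS + PAGE_SHIFT)
    print("framebuffer geometry:", fb_geometry())
```

Under those assumptions the GART page count matches the "PCIE GART of 1024M" line, the VM size corresponds to a 48-bit address space (4 levels of 9 bits plus a 12-bit page offset), and the framebuffer works out to 1920x1080.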

Launchpad Janitor (janitor) wrote on 2019-01-15:

#61

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status:	New → Confirmed

In Linux Kernel Bug Tracker #199749, Luca (zapduke) wrote on 2019-05-18:

#62

This fixes it for me (Raven Ridge 2200G, though). Unfortunately, the version shipped in linux-firmware does not seem to be sufficient, and the problem comes back after every update.

Paul Crawford (psc-sat) wrote on 2019-07-03:

#63

I am seeing the same sort of bug on my machine.
MSI X470 motherboard (just tried the latest firmware, no change)
AMD Ryzen 5 2400G and Ubuntu 18.04
Linux paul-ubuntu 4.18.0-25-generic #26~18.04.1-Ubuntu SMP Thu Jun 27 07:28:31 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Sometimes it happens overnight when the machine is idle with the screen locked, other times while I am using the machine. Oddly enough, during the most recent one the mouse was still moving but the keyboard was unresponsive. A remote reboot also fails unless I use the watchdog with 'nowayout=1' to force a hard reset 60s after things should have gone down, so it seems systemd won't or can't stop the hung graphics system.
Typically the error messages look like:

[ 2567.463753] amdgpu 0000:27:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32770)
[ 2567.463763] amdgpu 0000:27:00.0: at page 0x000000010be01000 from 27
[ 2567.463767] amdgpu 0000:27:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031

Often repeated many, many times.
Judging by this bug report, this has been going on for over a year now and is still not fixed! A friend of mine is also seeing the same on another AMD-based system. All pretty crap, and it makes me seriously regret buying an AMD board.
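
Since these messages are often repeated many times, it can help to tally them rather than read them line by line. Below is a minimal Python sketch that does this. The regular expressions are modelled only on the three sample lines quoted above, so other kernel versions may word the messages differently, and `summarize_faults` is a made-up helper name, not part of any existing tool.

```python
import re
from collections import Counter

# Patterns modelled on the three sample lines quoted above (an assumption:
# other kernel versions may format these messages differently).
FAULT_RE = re.compile(
    r"amdgpu (?P<dev>[0-9a-f:.]+): \[(?P<hub>\w+)\] VMC page fault "
    r"\(src_id:(?P<src_id>\d+) ring:(?P<ring>\d+) "
    r"vmid:(?P<vmid>\d+) pasid:(?P<pasid>\d+)\)"
)
PAGE_RE = re.compile(r"at page (?P<page>0x[0-9a-fA-F]+) from (?P<client>\d+)")
STATUS_RE = re.compile(
    r"VM_L2_PROTECTION_FAULT_STATUS:(?P<status>0x[0-9a-fA-F]+)"
)


def summarize_faults(lines):
    """Tally fault headers, faulting pages and status values from log lines."""
    faults = Counter()    # keyed by (hub, ring, vmid)
    pages = Counter()     # keyed by faulting page address
    statuses = Counter()  # keyed by raw status register value
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults[(m["hub"], int(m["ring"]), int(m["vmid"]))] += 1
            continue
        m = PAGE_RE.search(line)
        if m:
            pages[int(m["page"], 16)] += 1
            continue
        m = STATUS_RE.search(line)
        if m:
            statuses[int(m["status"], 16)] += 1
    return faults, pages, statuses


# The three lines from the comment above, used as sample input.
SAMPLE = """\
[ 2567.463753] amdgpu 0000:27:00.0: [gfxhub] VMC page fault (src_id:0 ring:24 vmid:3 pasid:32770)
[ 2567.463763] amdgpu 0000:27:00.0: at page 0x000000010be01000 from 27
[ 2567.463767] amdgpu 0000:27:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00301031
""".splitlines()

if __name__ == "__main__":
    faults, pages, statuses = summarize_faults(SAMPLE)
    for key, count in faults.most_common():
        print("fault", key, "x", count)
    for page, count in pages.most_common():
        print("page", hex(page), "x", count)
    for status, count in statuses.most_common():
        print("status", hex(status), "x", count)
```

Passing `sys.stdin` instead of `SAMPLE` lets the same function read piped `dmesg` output.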

Brad Figg (brad-figg) on 2019-07-24

tags:	added: cscc

John Rose (johnaaronrose) wrote on 2023-06-28:

#64

I'm getting the same on a new AMD Ryzen 3270U with Ubuntu 22.04.

Affects		Status	Importance	Assigned to	Milestone
	Linux	Confirmed	Medium	linux-kernel-bugs #199749
	linux (Ubuntu)	Confirmed	Undecided	Unassigned
