linux-firmware 1.197 causes kernel to report error "amdgpu: [gfxhub0] retry page fault"

Bug #1928393 reported by Thiago Jung Bauermann
This bug affects 10 people
Affects                  Status        Importance  Assigned to      Milestone
AMD                      Fix Released  Undecided   Unassigned
linux-firmware (Ubuntu)  Fix Released  High        Unassigned
  Focal                  New           Undecided   Juerg Haefliger
  Hirsute                Won't Fix     Undecided   Juerg Haefliger

Bug Description

After upgrading linux-firmware from 1.190.5 to 1.197 (as part of the upgrade from Ubuntu 20.10 to 21.04), I started experiencing frequent and severe GPU instability. When this happens, I see this error in dmesg:

[20061.061069] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg pid 1141 thread Xorg:cs0 pid 1236)
[20061.061103] amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x800000401000 from client 27
[20061.061135] amdgpu 0000:03:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00101031
[20061.061147] amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: TCP (0x8)
[20061.061157] amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
[20061.061167] amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
[20061.061174] amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
[20061.061183] amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
[20061.061189] amdgpu 0000:03:00.0: amdgpu: RW: 0x0
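
The sub-fields printed under VM_L2_PROTECTION_FAULT_STATUS look like bit fields of that single register value. As a minimal sketch, assuming the field positions below (they are an assumption, picked because they reproduce every value in the log above, and are not taken from this report), the decoding can be checked in Python. The name `decode_fault_status` is hypothetical:

```python
# Assumed bit layout of VM_L2_PROTECTION_FAULT_STATUS as (shift, width).
# These positions are an assumption that reproduces the logged values.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CLIENT_ID": (9, 9),
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into its named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00101031)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

With 0x00101031 this gives CLIENT_ID 0x8 and PERMISSION_FAULTS 0x3, which agrees with the "TCP (0x8)" and "PERMISSION_FAULTS: 0x3" lines in the log.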

I'll attach a couple of full dmesgs that I collected.

Many of the times when this happens, the screen and keyboard freeze irreversibly (I tried waiting for more than 30 minutes, but it didn't help). I can still log in via ssh though. When there's no freeze, I can continue using the computer normally, but the laptop fans are always running and the battery depletes fast. There's probably something stuck in a permanent loop either in the kernel or in the GPU.

This bug happens several times a day, rendering the machine so unstable as to be almost unusable. It is a severe regression and I'm aghast that it passed AMD's Quality Assurance.

After downgrading back to linux-firmware 1.190.5, the machine is back to the previous, mostly-reliable state. That is, this bug is gone and I'm just left with the other amdgpu suspend bug I've learned to live with since I bought this computer.

Please revert the amdgpu firmware in this package as soon as possible. This is unbearable.

Relevant information:
Ubuntu version: 21.04
Linux kernel: 5.11.0-17-generic x86_64
CPU model: AMD Ryzen 7 3700U with Radeon Vega Mobile Gfx
GPU: 03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Picasso (rev c1)
Laptop model: Lenovo Ideapad S145

Thiago Jung Bauermann (thiago-bauermann) wrote :

This is the dmesg of an instance where I was able to continue using the laptop despite the GPU bug (in the case of the dmesg I attached previously, I had to ssh in to the machine to turn it off).

Notice that there are two instances of the retry page fault, one of them within 15 minutes of the machine being turned on. There was no suspend/resume event this time. The laptop was turned on the whole time.

Seth Forshee (sforshee) wrote :

Before we revert we should see if newer firmware fixes the issue, and make sure we are only changing the specific firmware files for your hardware.

I think your hardware is the "Picasso" series. Can you try the following? If you are unsure about any of the following steps, let me know and I can provide you with test packages to install instead.

Save all files matching /lib/firmware/amdgpu/picasso* from linux-firmware 1.190.5. Reinstall 1.197, then overwrite the picasso firmware files with the ones you saved. Reboot, and confirm that the issues you see with 1.197 are fixed. If they are not fixed, then there's no need to proceed as we haven't found the correct firmware files which are causing your issues.

Then please download the picasso firmware files from here:

https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu

Use the "plain" link next to each file to download the file. Overwrite the files in /lib/firmware/amdgpu with these files, reboot, and see if you continue to have problems.
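
The save-and-restore step for the picasso* files can be sketched in Python. This is only an illustration of the procedure described above, not an official tool: the helper name `copy_matching` and the backup directory shown in the comments are hypothetical, and writing into /lib/firmware/amdgpu needs root.

```python
import shutil
from pathlib import Path


def copy_matching(src_dir, dst_dir, pattern="picasso*"):
    """Copy every file in src_dir matching pattern into dst_dir.

    Returns the sorted list of file names that were copied.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.glob(pattern)):
        if path.is_file():
            shutil.copy2(path, dst / path.name)  # copy2 also keeps timestamps
            copied.append(path.name)
    return copied


# Hypothetical usage (paths other than /lib/firmware/amdgpu are assumptions):
#   while 1.190.5 is installed:
#       copy_matching("/lib/firmware/amdgpu", "/root/picasso-1.190.5")
#   after reinstalling 1.197, as root:
#       copy_matching("/root/picasso-1.190.5", "/lib/firmware/amdgpu")
```

Keeping the copy restricted to one glob pattern makes it easy to repeat the same test later with a different file prefix.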

Changed in linux-firmware (Ubuntu):
assignee: nobody → Seth Forshee (sforshee)
importance: Undecided → High
status: New → Incomplete
Thiago Jung Bauermann (thiago-bauermann) wrote :

Hello Seth,

Thank you for the quick and detailed response.

On Friday, 14 May 2021, at 13:14:22 -03, Seth Forshee wrote:
> Before we revert we should see if newer firmware fixes the issue, and
> make sure we are only changing the specific firmware files for your
> hardware.

Ok, sounds good.

> I think your hardware is the "Picasso" series. Can you try the
> following? If you are unsure about any of the following steps, let me
> know and I can provide you with test packages to install instead.

Thanks for the offer. It's not a problem to get the files from git and
overwrite them manually.

> Save all files matching /lib/firmware/amdgpu/picasso* from linux-
> firmware 1.190.5. Reinstall 1.197, then overwrite the picasso firmware
> files with the ones you saved. Reboot, and confirm that the issues you
> see with 1.197 are fixed. If they are not fixed, then there's no need to
> proceed as we haven't found the correct firmware files which are causing
> your issues.

Ah, ok. This morning I went ahead and overwrote the whole of
/lib/firmware/amdgpu/ with the files from the latest commit of the upstream
linux-firmware git repo you mention below. It's been only a little over 2 hours,
but my impression is that the latest firmware does solve the problem. If no
retry page fault happens by the end of the day, then I think it's safe to
say that it fixes the issue.

So at the end of the day (or earlier if I get a retry page fault) I'll do
the procedure you mention to use firmware 1.197 and only overwrite the
picasso* files to confirm if those are the ones that need to be changed.

I don't know much about GPUs, but from looking at Wikipedia¹ I think my
model is a "Vega 10". Should I overwrite the vega10* files as well?

¹ https://en.wikipedia.org/wiki/Radeon_RX_Vega_series#Picasso_(2019)_2

> Then please download the picasso firmware files from here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
>
> Use the "plain" link next to each file to download the file. Overwrite
> the files in /lib/firmware/amdgpu with these files, reboot, and see if
> you continue to have problems.
>
> ** Changed in: linux-firmware (Ubuntu)
> Importance: Undecided => High

Thanks!

>
> ** Changed in: linux-firmware (Ubuntu)
> Status: New => Incomplete
>
> ** Changed in: linux-firmware (Ubuntu)
> Assignee: (unassigned) => Seth Forshee (sforshee)

Seth Forshee (sforshee) wrote :

On Fri, May 14, 2021 at 06:01:50PM -0000, Thiago Jung Bauermann wrote:
> Ah, ok. This morning I went ahead and overwrote the whole of
> /lib/firmware/amdgpu/ with the files from the latest commit of the upstream
> linux-firmware git repo you mention below. It's been only a little over 2 hours,
> but my impression is that the latest firmware does solve the problem. If no
> retry page fault happens by the end of the day, then I think it's safe to
> say that it fixes the issue.

Sounds good.

> So at the end of the day (or earlier if I get a retry page fault) I'll do
> the procedure you mention to use firmware 1.197 and only overwrite the
> picasso* files to confirm if those are the ones that need to be changed.
>
> I don't know much about GPUs, but from looking at Wikipedia¹ I think my
> model is a "Vega 10". Should I overwrite the vega10* files as well?
>
> ¹ https://en.wikipedia.org/wiki/Radeon_RX_Vega_series#Picasso_(2019)_2

I'm basing it on the PCI id I see in dmesg, which is 1002:15d8, which
the driver looks to identify as CHIP_RAVEN and flag with
AMD_APU_IS_PICASSO. Based on that it will pick firmware files which
begin with "picasso". It would be informative though to have the output
from "lspci -vvnn" on your machine.

Let's try the picasso files first. If that doesn't work you can try the
vega10 files (without the new picasso files), then if that doesn't work,
try both. Then we'll have an idea about what the minimal set of files is
to fix the problem.

Thiago Jung Bauermann (thiago-bauermann) wrote :

The latest upstream firmware is stable, so I reverted back to 1.197 so that
I could test only the picasso* files.

Before doing that, I decided to run for a while with pristine linux-
firmware 1.197 to double-check that the bug happens quickly.

Unexpectedly, 1.197 is now reliable too! I have been running it for about
half a day (which is more than what was possible before) and it is fine.
The only thing that changed was that flatpak's org.freedesktop.Platform and
org.freedesktop.Platform.GL.default were updated, not sure if yesterday or
the day before.

This is relevant because I use Firefox from flathub.

I'm suspecting that the instability comes from the combination of linux-
firmware 1.197 + a particular version of some userspace component (Mesa I
guess) that was in org.freedesktop.Platform{,.GL.default}.

I'll try reverting the flatpak update to see if I can get back to the
unstable state to confirm the hypothesis.

Thiago Jung Bauermann (thiago-bauermann) wrote :

On Saturday, 15 May 2021, at 11:24:17 -03, Thiago Jung Bauermann
wrote:
> Unexpectedly, 1.197 is now reliable too! I have been running it for about
> half a day (which is more than what was possible before) and it is fine.

After 4 days of stability I just had the retry page fault problem again,
with stock linux-firmware 1.197 and kernel 5.11.0-17-generic.

> The only thing that changed was that flatpak's org.freedesktop.Platform
> and org.freedesktop.Platform.GL.default were updated, not sure if
> yesterday or the day before.
>
> This is relevant because I use Firefox from flathub.
>
> I'm suspecting that the instability comes from the combination of linux-
> firmware 1.197 + a particular version of some userspace component (Mesa I
> guess) that was in org.freedesktop.Platform{,.GL.default}.

So apparently the flatpak update made the problem less likely to happen,
but it still does.

> I'll try reverting the flatpak update to see if I can get back to the
> unstable state to confirm the hypothesis.

Life got in the way and I wasn't able to do this yet.

With it taking 4 days to reproduce the problem with the current stack, I
think it still makes sense to revert the flatpak update.

Unfortunately "flatpak history" seems to be broken on my system:

$ flatpak history
error: appstream2/x86_64 is not application or runtime

I'll try using ostree directly. I don't think I'll be able to do it today,
but hopefully within the next couple of days.

Thiago Jung Bauermann (thiago-bauermann) wrote :

Over the weekend I was finally able to revert back to the previous versions
of the org.freedesktop.Platform and org.freedesktop.Platform.GL.default
flatpak runtimes. It turns out that the `flatpak history` command wasn't
necessary for the rollback.

On Friday, 14 May 2021, at 13:14:22 -03, Seth Forshee wrote:
> Before we revert we should see if newer firmware fixes the issue, and
> make sure we are only changing the specific firmware files for your
> hardware.
>
> I think your hardware is the "Picasso" series. Can you try the
> following? If you are unsure about any of the following steps, let me
> know and I can provide you with test packages to install instead.
>
> Save all files matching /lib/firmware/amdgpu/picasso* from linux-
> firmware 1.190.5. Reinstall 1.197, then overwrite the picasso firmware
> files with the ones you saved. Reboot, and confirm that the issues you
> see with 1.197 are fixed. If they are not fixed, then there's no need to
> proceed as we haven't found the correct firmware files which are causing
> your issues.

I did exactly that, and I was able to run for 4 days without any retry
page fault error. This makes me confident that the 1.190.5 firmware doesn't
have the bug, and also that the amdgpu/picasso* files are the relevant
ones.

> Then please download the picasso firmware files from here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
>
> Use the "plain" link next to each file to download the file. Overwrite
> the files in /lib/firmware/amdgpu with these files, reboot, and see if
> you continue to have problems.

I also did that, using the files from commit 55d964905a2b. More recent
commits in that repo didn't touch the amdgpu directory so they're still the
most recent firmware files for my hardware.

Unfortunately I still saw the retry page fault message in dmesg with it,
and very soon after boot (IIRC it happened while running the sddm login
manager, before I logged in). On the bright side, it didn't have any adverse
effect on my computer and I only noticed hours later because I specifically
grepped for it. So perhaps the latest firmware has a less nasty version of
the bug?
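
Since a fault like this can go unnoticed unless the log is searched for it, here is a minimal Python sketch of such a search. It assumes the dmesg line format shown in this report; the names `FAULT_RE` and `find_page_faults` are hypothetical, and lines in other formats (for example journalctl output without the bracketed timestamp) are simply not matched.

```python
import re

# Matches the first line of a fault report in the dmesg format seen in this bug.
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+(?P<dev>\S+):\s+amdgpu:\s+"
    r"\[(?P<hub>\w+)\]\s+(?:retry )?page fault\s+\(.*?"
    r"for process (?P<proc>\S+) pid (?P<pid>\d+)"
)


def find_page_faults(dmesg_text):
    """Return one dict per page fault header line found in dmesg output."""
    faults = []
    for line in dmesg_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            faults.append({
                "uptime_s": float(m["ts"]),
                "device": m["dev"],
                "hub": m["hub"],
                "process": m["proc"],
                "pid": int(m["pid"]),
            })
    return faults


sample = (
    "[20061.061069] amdgpu 0000:03:00.0: amdgpu: [gfxhub0] retry page fault "
    "(src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg pid 1141 "
    "thread Xorg:cs0 pid 1236)"
)
print(find_page_faults(sample))
```

The uptime value makes it possible to tell how long after boot each fault happened, which is the number that varied so much between the firmware versions tested here.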

And just to double-check the baseline reference, I also ran with pristine
linux-firmware 1.197, the version which made my machine so unstable. I had
a somewhat different experience this time. The bug still happened, but only
after 20h of uptime. And the symptom was "just" a visual glitch while
scrolling inside Firefox, not a complete freeze of the display and
keyboard, as I was experiencing originally. Perhaps if I rebooted and
insisted on using it again I would experience worse effects. But I thought
that was enough to confirm that 1.197 is still bad.

So I'm not sure what to make of all this. I still wasn't able to pinpoint
exactly what triggers the worst manifestation of the bug. But of the three
versions of linux-firmware I used (1.190.5, 1.197 and upstream), 1.197 is
still the one where things are worse so IMHO the picasso files need to come
from one of the other two versions. 1.190.5 is the rock solid one, so I
think it's the safest bet. But per...


Thiago Jung Bauermann (thiago-bauermann) wrote :

On Wednesday, 26 May 2021, at 16:46:14 -03, Thiago Jung Bauermann
wrote:
> But perhaps the upstream version is not too bad?

I take this back. I've been running with the upstream picasso* files since
Wednesday, and I just had two freezes in less than one hour.

linux-firmware 1.190.5 is the only version which contains picasso* files
that are stable.

Timo Aaltonen (tjaalton) wrote :

Maybe folks at AMD can chime in? Picasso FW regressed.

Alex Deucher (alexander-deucher) wrote :

Can you narrow down which specific firmware file causes the problem? We haven't been able to repro this.

I think it may be related to a change in mesa. Specifically mesa commit 820dec3f7c7. For more info see https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866

Mario Limonciello (superm1) wrote :

Here's a PPA build with the mesa fix Alex mentioned backported:
https://launchpad.net/~superm1/+archive/ubuntu/lp1928393

If you can follow the directions to add that PPA and upgrade to that mesa package you can see if that indeed fixes it.

Thiago Jung Bauermann (thiago-bauermann) wrote :

Thanks for your input.

On Tuesday, 8 June 2021, at 10:30:24 -03, Alex Deucher wrote:
> Can you narrow down which specific firmware file causes the problem?

Ok, I will try.

Also, is it possible and/or worthwhile trying to bisect firmware versions from
the linux-firmware repo? How coupled is the firmware with the kernel
driver? E.g., can I try using firmware files from 1 year ago with current
kernel and Mesa?

> We haven't been able to repro this.

One thing that’s a bit “fishy” about my machine is that it doesn’t seem to
have a good clock:

[ 0.211436] TSC synchronization [CPU#0 -> CPU#1]:
[ 0.211436] Measured 3304683447 cycles TSC warp between CPUs, turning off TSC clock.
[ 0.211436] tsc: Marking TSC unstable due to check_tsc_sync_source failed

[ 0.252117] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[ 0.252117] hpet0: 3 comparators, 32-bit 14.318180 MHz counter
[ 0.253970] clocksource: Switched to clocksource hpet

[ 0.580451] Unstable clock detected, switching default tracing clock to "global"
               If you want to keep using the local clock, then add:
                 "trace_clock=local"
               on the kernel command line

Could this bug be related to that?

> I think it may be related to a change in mesa. Specifically mesa commit
> 820dec3f7c7. For more info see
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866

I’ll run with Mario’s build of Mesa with that patch backported.
Thanks, Mario!

> ** Bug watch added: gitlab.freedesktop.org/mesa/mesa/-/issues #4866
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866

Other upstream issues that look similar:

https://gitlab.freedesktop.org/drm/amd/-/issues/1598
https://gitlab.freedesktop.org/drm/amd/-/issues/920

Timo Aaltonen (tjaalton) wrote :

21.04 comes with Mesa 21.0.1 which does not seem to have 820dec3f7c7

Thiago Jung Bauermann (thiago-bauermann) wrote :

I was finally able to spend a bit of time on this. Unfortunately, there’s
not much to report back.

On Tuesday, 8 June 2021, at 15:13:36 -03, Thiago Jung Bauermann
wrote:
> On Tuesday, 8 June 2021, at 10:30:24 -03, Alex Deucher wrote:
> > Can you narrow down which specific firmware file causes the problem?
>
> Ok, I will try.

I don’t think I can narrow down which firmware file causes the problem,
because I don’t have a last known good version. All firmware files that I
tested (Ubuntu versions 1.190.5, 1.197 and latest linux-firmware.git)
immediately trigger the bug when I try the only reliable reproducer I know
(i.e., running flatpak’s com.github.quaternion package).

Since it can take several days for the bug to happen if I just use the
machine normally, it would take weeks to narrow down which of the picasso_*
files is more stable relative to the others. And even then, I wouldn’t be
sure about it.

> > I think it may be related to a change in mesa. Specifically mesa
> > commit
> > 820dec3f7c7. For more info see
> > https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866
>
> I’ll run with Mario’s build of Mesa with that patch backported.
> Thanks, Mario!

I’m running with the Mesa build from Mario’s PPA now. If I don’t see any
issue within two weeks, I think it will be possible to say that the bug is
gone, or at least much harder to hit.

I can’t use my reproducer in this case, because I can’t change the Mesa
version inside the flatpak image.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mesa (Ubuntu):
status: New → Confirmed
Thiago Jung Bauermann (thiago-bauermann) wrote :

On Thursday, 17 June 2021, at 00:45:30 -03, Thiago Jung
Bauermann wrote:
> > > I think it may be related to a change in mesa. Specifically mesa
> > > commit
> > > 820dec3f7c7. For more info see
> > > https://gitlab.freedesktop.org/mesa/mesa/-/issues/4866
> >
> > I’ll run with Mario’s build of Mesa with that patch backported.
> > Thanks, Mario!
>
> I’m running with the Mesa build from Mario’s PPA now. If I don’t see any
> issue within two weeks, I think it will be possible to say that the bug
> is gone, or at least much harder to hit.
>
> I can’t use my reproducer in this case, because I can’t change the Mesa
> version inside the flatpak image.

I just had this bug happen again spontaneously, while running with Mesa
from Mario’s PPA.

So this bug isn’t fixed by the patch mentioned by Alex.

Heinrich Schuchardt (xypron) wrote :

Also affected:

Ubuntu version: 21.04
Linux kernel: 5.11.0-22-generic x86_64
CPU model: AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx
GPU: 05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Raven Ridge [Radeon Vega Series / Radeon Vega Mobile Series] (rev c4)
Laptop model: Lenovo Thinkpad E585

Alex Deucher (alexander-deucher) wrote :
Heinrich Schuchardt (xypron) wrote :

There is an upstream bug report https://bugzilla.kernel.org/show_bug.cgi?id=213391
Comment 9 suggests: "downgrade the firmware."
Comment 15 claims: "20210315 seems to work fine here (on an E595)."

Serj (s3rj1k) wrote :

@alexander-deucher

CPU model: AMD Ryzen 7 2700U with Radeon Vega Mobile Gfx
Kernel: 5.10.49
Firmware:
VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 52, firmware version: 0x000000a4
PFP feature version: 52, firmware version: 0x000000bc
CE feature version: 52, firmware version: 0x0000004f
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 52, firmware version: 0x000001c2
MEC2 feature version: 52, firmware version: 0x000001c2
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x21000055
TA RAS feature version: 0x00000000, firmware version: 0x2100002b
TA XGMI feature version: 0x00000000, firmware version: 0x2100002b
TA HDCP feature version: 0x17000011, firmware version: 0x2100002b
TA DTM feature version: 0x12000003, firmware version: 0x2100002b
SMC feature version: 0, firmware version: 0x00001e49
SDMA0 feature version: 41, firmware version: 0x00000028
VCN feature version: 0, firmware version: 0x0210c005
DMCU feature version: 0, firmware version: 0x00000000
DMCUB feature version: 0, firmware version: 0x00000000
VBIOS version: 113-RAVEN-107

Also affected
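
Listings like the one above are easier to compare between machines once parsed. The following Python sketch assumes only the "NAME feature version: X, firmware version: 0x..." line shape visible in this comment; `parse_firmware_listing` and `diff_listings` are hypothetical helper names, and lines that do not fit that shape (such as the VBIOS line) are skipped.

```python
import re

LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\S+), "
    r"firmware version: (?P<firmware>0x[0-9a-fA-F]+)$"
)


def parse_firmware_listing(text):
    """Map each firmware name to its (feature version, firmware version) pair."""
    versions = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            # base 0 accepts both plain decimal ("52") and hex ("0x17000011")
            versions[m["name"]] = (int(m["feature"], 0), int(m["firmware"], 16))
    return versions


def diff_listings(a, b):
    """Return the names whose versions differ between two parsed listings."""
    return sorted(n for n in a.keys() | b.keys() if a.get(n) != b.get(n))


sample = """\
ME feature version: 52, firmware version: 0x000000a4
TA HDCP feature version: 0x17000011, firmware version: 0x2100002b
VBIOS version: 113-RAVEN-107
"""
parsed = parse_firmware_listing(sample)
print(parsed)
```

Running `diff_listings` on the parsed output from an affected and an unaffected machine would show which blobs actually differ, which might help narrow down the firmware file involved.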

Heinrich Schuchardt (xypron) wrote :

After updating to linux-firmware commit d79c26779d45906 the problems persist on Lenovo Thinkpad E585:

amdgpu 0000:05:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:1 pasid:32769, for process Xorg pid 1336 thread Xorg:cs0 pid 1862)

Thiago Jung Bauermann (thiago-bauermann) wrote :
  • dmesg.log (234.6 KiB, text/x-log)

On Monday, 12 July 2021, at 15:12:19 -03, Alex Deucher wrote:
> Does the latest firmware in the firmware git tree help?
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/log/amdgpu

I updated the picasso* files from commit:

d79c26779d45 amdgpu: update vcn firmware for green sardine for 21.20

And I still see the issue. It took a while to reproduce: I updated the
firmware (and ran `update-initramfs -u -k all` to get it into the
initramfs) on July 12 and had the laptop turned on since then (closing the
lid to put it to sleep), and today I saw the problem again.

The full dmesg is attached.

Thiago Jung Bauermann (thiago-bauermann) wrote :

Hello,

For some reason, in the past week or so this bug has been freezing my
machine every couple of days or so (I’m surprised that AMD wasn’t able
to reproduce the problem yet¹). You can imagine how “pleasant” it makes
using this computer.

Today I got an interesting error in dmesg, perhaps it provides some clue:

[38454.299445] ------------[ cut here ]------------
[38454.299449] refcount_t: underflow; use-after-free.
[38454.299457] WARNING: CPU: 5 PID: 17577 at lib/refcount.c:28 refcount_warn_saturate+0xae/0xf0
[38454.299465] Modules linked in: overlay ccm rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nft_compat nft_counter nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct bridge stp llc nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink cmac algif_hash algif_skcipher af_alg bnep binfmt_misc nls_iso8859_1 snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation soundwire_cadence snd_hda_codec snd_hda_core snd_hwdep soundwire_bus snd_soc_core snd_compress ac97_bus snd_pcm_dmaengine intel_rapl_msr intel_rapl_common joydev snd_pcm edac_mce_amd snd_seq_midi ath10k_pci ath10k_core snd_seq_midi_event kvm_amd snd_rawmidi ath mac80211 kvm uvcvideo snd_seq btusb videobuf2_vmalloc rapl videobuf2_memops videobuf2_v4l2 videobuf2_common btrtl input_leds
[38454.299510] serio_raw btbcm videodev btintel wmi_bmof snd_seq_device efi_pstore bluetooth snd_timer mc cfg80211 k10temp ecdh_generic snd ecc ideapad_laptop ccp libarc4 sparse_keymap soundcore elan_i2c mac_hid sch_fq_codel msr parport_pc ppdev lp parport ip_tables x_tables autofs4 btrfs blake2b_generic xor raid6_pq libcrc32c dm_crypt zstd zram z3fold amdgpu crct10dif_pclmul crc32_pclmul ghash_clmulni_intel iommu_v2 gpu_sched aesni_intel i2c_algo_bit drm_ttm_helper ttm crypto_simd cryptd glue_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm i2c_piix4 nvme xhci_pci i2c_hid xhci_pci_renesas nvme_core wmi video hid
[38454.299550] CPU: 5 PID: 17577 Comm: kworker/u32:18 Not tainted 5.11.0-25-generic #27-Ubuntu
[38454.299552] Hardware name: LENOVO 81V7/LNVNB161216, BIOS BUCN23WW 11/05/2019
[38454.299554] Workqueue: events_unbound async_run_entry_fn
[38454.299559] RIP: 0010:refcount_warn_saturate+0xae/0xf0
[38454.299562] Code: f8 1c 96 01 01 e8 9f f1 62 00 0f 0b 5d c3 80 3d e5 1c 96 01 00 75 91 48 c7 c7 e8 c7 60 b9 c6 05 d5 1c 96 01 01 e8 7f f1 62 00 <0f> 0b 5d c3 80 3d c3 1c 96 01 00 0f 85 6d ff ff ff 48 c7 c7 40 c8
[38454.299564] RSP: 0018:ffffb60383537b58 EFLAGS: 00010282
[38454.299566] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8d4578b58ac8
[38454.299567] RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff8d4578b58ac0
[38454.299568] RBP: ffffb60383537b58 R08: ffffffffb9c73540 R09: ffffb60383537af0
[38454.299569] R10: 000000002d2d2d2d R11: ffffb603835379e8 R12: ffff8d44cf64d000
[38454.299570] R13: 0000000000000000 R14: ffffb6038b8cd000 R15: 0000000000000004
[38454.299571] FS: 0000000000000000(0000) GS:ffff...


Leandro Scott (lsrzj) wrote :

It happened to me too with a recent firmware version, so I downgraded to linux-firmware-20210208.tar.gz and replaced all files in the amdgpu folder. No problems at all up to 27/09/2021.

Michal Przybylowicz (michal-przybylowicz) wrote :

I have similar messages in journalctl:

Package: linux-firmware
Version: 1.197.3

Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:0 vmid:1 pasid:32778, for process vivaldi-bin pid 1673 thread vivaldi-bi:cs0 pid 1699)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800101140000 from client 0x12 (VMC)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00105631
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: VCN0 (0x2b)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:0 vmid:1 pasid:32778, for process vivaldi-bin pid 1673 thread vivaldi-bi:cs0 pid 1699)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800101188000 from client 0x12 (VMC)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00105631
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: VCN0 (0x2b)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:0 vmid:1 pasid:32778, for process vivaldi-bin pid 1673 thread vivaldi-bi:cs0 pid 1699)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: in page starting at address 0x0000800101189000 from client 0x12 (VMC)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00105631
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: Faulty UTCL2 client ID: VCN0 (0x2b)
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MORE_FAULTS: 0x1
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: WALKER_ERROR: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: PERMISSION_FAULTS: 0x3
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: MAPPING_ERROR: 0x0
Aug 29 16:58:44 dagon kernel: amdgpu 0000:03:00.0: amdgpu: RW: 0x0

Thiago Jung Bauermann (thiago-bauermann) wrote :

Hello,

I’d just like to report that I haven’t seen this problem in a while. The
last time I see the “retry page fault” messages in my log was on August 9.

I’ve been using the ‘amdgpu/picasso*‘ files from linux-firmware commit
c46b8c364b82 (“ice: update package file to 1.3.26.0”) so apparently this
particular problem was recently fixed.

Which isn’t to say that I’m having a trouble-free amdgpu experience,
unfortunately. Every week or so my laptop comes back from sleep with the
screen and keyboard frozen (I can still ssh into it), but now the error is:

kernel: [drm:drm_atomic_helper_wait_for_flip_done [drm_kms_helper]] *ERROR*
[CRTC:67:crtc-0] flip_done timed out

But it seems to be a separate problem from the one reported in this
particular launchpad issue. I’ll see if I can find a more appropriate
launchpad issue and post the details there.

Thank you all for your help and support with this issue.

Juerg Haefliger (juergh) wrote :

Hi. I'm picking up this ticket from Seth. Reading through the history, it seems it's still an open issue? My understanding is that upstream 'fixed' this by reverting fw blobs in version 20210818. I can produce a linux-firmware test package for hirsute (21.04) with these reverts if necessary. Just let me know.

Changed in linux-firmware (Ubuntu):
status: Incomplete → Confirmed
Alex Deucher (alexander-deucher) wrote :
Lancillotto (antonio-petricca) wrote :

I have the same issue on:

Dell E5495
AMD Ryzen 7 PRO 2700U w/ Radeon Vega Mobile Gfx
16 GB RAM
Linux Mint 19.3 (Ubuntu 18.04)
Kernel 5.15.2
Linux Firmware 1.173.20

Juerg Haefliger (juergh) wrote :

@antonio-petricca, sorry but 5.15.2 is not a supported Ubuntu kernel and especially not on Bionic with (old) Bionic firmware.

Juerg Haefliger (juergh)
no longer affects: mesa (Ubuntu Hirsute)
Changed in mesa (Ubuntu):
status: Confirmed → Invalid
Changed in linux-firmware (Ubuntu Hirsute):
status: New → Confirmed
Changed in linux-firmware (Ubuntu):
status: Confirmed → Invalid
Changed in linux-firmware (Ubuntu Hirsute):
assignee: nobody → Juerg Haefliger (juergh)
Lancillotto (antonio-petricca) wrote :
Juerg Haefliger (juergh) wrote :

@antonio-petricca, What series? What kernel?

I can produce a hirsute linux-firmware package with the reverted sdma firmware but need someone to verify it on hirsute with the hirsute kernel. Any takers? Or have you all moved on to impish?

Seth Forshee (sforshee)
Changed in linux-firmware (Ubuntu):
assignee: Seth Forshee (sforshee) → nobody
Lancillotto (antonio-petricca) wrote :
Juerg Haefliger (juergh) wrote :

If you want this fixed in Ubuntu I need to know what series are affected. Hirsute goes EOL at the end of the month. Are Impish and/or Jammy working or affected as well?

Changed in linux-firmware (Ubuntu Hirsute):
status: Confirmed → Incomplete
Thiago Jung Bauermann (thiago-bauermann) wrote :

Hello Juerg,

On Thursday, 20 January 2022, at 12:32:48 -03, Juerg Haefliger
wrote:
> If you want this fixed in Ubuntu I need to know what series are
> affected. Hirsute goes EOL at the end of the month. Are Impish and/or
> Jammy working or affected as well?

I upgraded to Impish a while ago.

I haven’t seen “retry page fault” messages in a long while (I don’t think
it’s related to the distro upgrade, but not sure) so I’d say this
particular bug is fixed at least for me (I have a Picasso GPU).

Which is not to say that things are rosy, unfortunately. But the other
issues I see don’t cause any message to appear in dmesg so it’s hard to
search for existing bug reports about them or open a new one.

The following is off-topic for this bug report, but I’ll mention anyway,
hope you’ll bear with me:

One thing I noticed is that things did get rosy when I did two things:

1. Switched from Xorg to Wayland.
2. Switched Firefox to use Wayland as well.

This led me to the conclusion that the bugs that plague my machine are
triggered by something that Firefox does when it uses X (both “natively” or
via XWayland). For some reason, when it uses Wayland it doesn’t trigger
these GPU bugs.

Another thing that might be relevant is that I have tons of tabs open
(probably more than 200) distributed in 27 open windows. Perhaps I’m
stressing some kind of resource limit in the driver or firmware?

Juerg Haefliger (juergh) wrote :

Hirsute is EOL so closing this bug. Please open a new one if the problem still persists with one of the supported series.

Changed in linux-firmware (Ubuntu Hirsute):
status: Incomplete → Won't Fix
Mario Limonciello (superm1) wrote :

Just to correct a few of the targets on this issue.
* The reverts mentioned in #30 need to be pulled into linux-firmware for focal.
* They're already included in jammy.

Changed in amd:
status: New → Fix Released
no longer affects: mesa (Ubuntu)
Changed in linux-firmware (Ubuntu Focal):
assignee: nobody → Juerg Haefliger (juergh)
Changed in linux-firmware (Ubuntu):
status: Invalid → Fix Released